# Preprocesamiento con Python

Para el preprocesamiento de texto utilizaremos las librerías de Python:
- **Numpy**: para el procesamiento de las operaciones en los Dataframes y Series de Pandas
- **Pandas**: para la manipulación de los datos
- **NLTK**: para el procesamiento de texto por medio de las StopWords, Stemming, Lemmatization y POS tag
- **re**: filtrar datos con para expresiones regulares 

## Lectura de datos con Pandas

In [1]:
# Importacion de librerias
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# Permite desplegar el texto completo en Jupyter
pd.set_option('display.max_colwidth', -1)

In [2]:
# Lectura de CSV
df = pd.read_csv("Tweets_pg_prepared.csv")
data = df.tail() # Muestra los datos

In [3]:
data["text"] # Mostrar los datos de la columna "text"

14635    @AmericanAir thank you we got on a different flight to Chicago.                                                                                       
14637    @AmericanAir Please bring American Airlines to #BlackBerry10                                                                                          
14638    @AmericanAir you have my money, you change my flight, and don't answer your phones! Any other suggestions so I can make my commitment??               
14639    @AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Convertir mayúsculas a minúsculas

In [4]:
data_lower = data["text"].str.lower() # Convertir todo el texto de la columna "text" a minusculas
data_lower # Mostrar

14635    @americanair thank you we got on a different flight to chicago.                                                                                       
14637    @americanair please bring american airlines to #blackberry10                                                                                          
14638    @americanair you have my money, you change my flight, and don't answer your phones! any other suggestions so i can make my commitment??               
14639    @americanair we have 8 ppl so we need 2 know how many seats are on the next flight. plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Remover URLs (Regex)

Explicación de Regex
1. **(http|https|ftp)**: Detectar si empieza con alguno de estos protocolos
2. **://**: Seguido de un "://"
3. **[a-zA-Z0-9\\./]**: E inmediatamente empieza una palabra seguido de un punto (.) una o mas veces (de esta manera se incluye el (.com y variantes)

In [5]:
data_noURL = data_lower.str.replace('(http|https|ftp)://[a-zA-Z0-9\\./]+',"")
data_noURL

14635    @americanair thank you we got on a different flight to chicago.                                                                                       
14637    @americanair please bring american airlines to #blackberry10                                                                                          
14638    @americanair you have my money, you change my flight, and don't answer your phones! any other suggestions so i can make my commitment??               
14639    @americanair we have 8 ppl so we need 2 know how many seats are on the next flight. plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Remover referencias (@Usernames)

Explicación regex
1. **@**: Si empieza con arroba (@)
2. **(\w+)**: y le sigue una o más palabras

In [6]:
data_noUser = data_noURL.str.replace('@(\w+)',"")
data_noUser

14635     thank you we got on a different flight to chicago.                                                                                       
14637     please bring american airlines to #blackberry10                                                                                          
14638     you have my money, you change my flight, and don't answer your phones! any other suggestions so i can make my commitment??               
14639     we have 8 ppl so we need 2 know how many seats are on the next flight. plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Remover hashtags

Explicación regex
1. **#**: Si empieza con arroba (#)
2. **(\w+)**: y le sigue una o más palabras

In [7]:
data_noHashtag = data_noUser.str.replace('#(\w+)',"")
data_noHashtag

14635     thank you we got on a different flight to chicago.                                                                                       
14637     please bring american airlines to                                                                                                        
14638     you have my money, you change my flight, and don't answer your phones! any other suggestions so i can make my commitment??               
14639     we have 8 ppl so we need 2 know how many seats are on the next flight. plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Remover emoticones

NOTA: Falta averiguar si algún analizador de sentimientos le sirven los emoticones, o tal vez se puedan traducir los emojis por alguna palabra que exprese su sentimiento.

In [8]:

data_noEmoji = data_noHashtag.str.replace("["
                           u"\U0001F600-\U0001F64F"  # emoticones
                           u"\U0001F300-\U0001F5FF"  # simbolos & pictografos
                           u"\U0001F680-\U0001F6FF"  # simbolos de transporte y mapas
                           u"\U0001F1E0-\U0001F1FF"  # banderas (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", "")
data_noEmoji

14635     thank you we got on a different flight to chicago.                                                                                       
14637     please bring american airlines to                                                                                                        
14638     you have my money, you change my flight, and don't answer your phones! any other suggestions so i can make my commitment??               
14639     we have 8 ppl so we need 2 know how many seats are on the next flight. plz put us on standby for 4 people on the next flight?            
Name: text, dtype: object

## Remover StopWords

Son palabras que no aportan valor al analizar sentimientos, en Inglés serían palabras como "are, you, have, etc"

In [9]:
#import nltk
#nltk.download('stopwords')
#from nltk.corpus import stopwords
stop = stopwords.words("english")
data_noStopwords = data_noEmoji.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data_noStopwords

14635    thank got different flight chicago.                                                                             
14637    please bring american airlines                                                                                  
14638    money, change flight, answer phones! suggestions make commitment??                                              
14639    8 ppl need 2 know many seats next flight. plz put us standby 4 people next flight?                              
Name: text, dtype: object

## Interpretación de Slang (abreviaturas)

### Web Scrapping de los acronimos de Netlingo

In [10]:
# from bs4 import BeautifulSoup
# import requests, json
# resp = requests.get("http://www.netlingo.com/acronyms.php")
# soup = BeautifulSoup(resp.text, "html.parser")
# slangdict = {}
# key = ""
# value = ""
# for div in soup.findAll('div', attrs={'class':'list_box3'}):
#     for li in div.findAll('li'):
#         for a in li.findAll('a'):
#             key = a.text
#         value = li.text.split(key)[1]
#         slangdict[key] = value
# with open('myslang.json','w') as find:
#     json.dump(slangdict, find, indent = 2)

### Leer el archivo de Slang en JSON

In [11]:
slang = pd.read_json("myslang.json", typ = "series")
# slang.to_frame('count') #para convertir a DataFrame
slang

!               I have a comment                             
#FF             Follow Friday                                
(U)             it means arms around you, hug for you        
*$              Starbucks                                    
**//            it means wink wink, nudge nudge              
,!!!!           Talk to the hand                             
/R/             Requesting                                   
02              Your (or my) two cents worth, also seen as m.
10Q             Thank you                                    
1174            Nude club                                    
121             One to one                                   
123             it means I agree                             
1337            Elite -or- leet -or- L337                    
14              it refers to the fourteen words              
142n8ly         Unfortunately                                
143             I love you                                   
1432    

In [12]:
#import re
def translator(user_string):
    user_string = user_string.split(" ")
    j = 0
    slang = pd.read_json("myslang.json", typ = "series")
    for _str in user_string:
        # Removiendo carácteres especiales
        _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
        for k in slang.keys():
        # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
            if _str.upper() == k.upper():
                # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
                user_string[j] = slang[k].lower()
    j = j + 1
    # Retorna la cadena corregida
    return ' '.join(user_string)

In [13]:
data_noSlang = data_noStopwords.apply(lambda x: translator(x))

### Si los slangs hubieran estado en un TXT

In [14]:
#import csv, rea

In [15]:
# def translator(user_string):
#     user_string = user_string.split(" ")
#     j = 0
#     for _str in user_string:
#         # Archivo con las abreviaturas y su traducción
#         fileName = "slang.txt"
#         # Modo de Acceso al archivo (lectura)
#         accessMode = "r"
#         with open(fileName, accessMode) as myCSVfile:
#             # Leer un archivo como un CSV con delimitador como "=", para que la abreviacion sea guardada en row[0] y las frases en row[1]
#             dataFromFile = csv.reader(myCSVfile, delimiter="=")
#             # Removiendo carácteres especiales
#             _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
#             for row in dataFromFile:
#                 # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
#                 if _str.upper() == row[0]:
#                     # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
#                     user_string[j] = row[1].lower()
#             myCSVfile.close()
#         j = j + 1
#     # Retorna la cadena corregida
#     return ' '.join(user_string)

In [16]:
#data_noSlang = data_noStopwords.apply(lambda x: translator(x))
#data_noSlang

## Reducción de carácteres repetidos

Como "haaapppyyyy" a "haappyy"

In [17]:
data_noRepeated = data_noSlang.transform(lambda x: re.sub(r'(.)\1+', r'\1\1', x))
data_noRepeated

14635    thank got different flight chicago.                                                                              
14637    please bring american airlines                                                                                   
14638    money, change flight, answer phones! suggestions make commitment??                                               
14639    for, four ppl need 2 know many seats next flight. plz put us standby 4 people next flight?                       
Name: text, dtype: object

## Stemming (volver a las palabras a su respectiva palabra raiz)

Existen diferentes tipos de Stemmers, para el lenguaje Inglés,  podemos encontrar 2 de las más populares en la librería NLTK

### Porter Stemmer

Es conocido por su simplicidad y velocidad

In [18]:
#from nltk.stem import PorterStemmer

In [19]:
ps = PorterStemmer()
data_PorterStemming = data_noRepeated.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
data_PorterStemming

14635    thank got differ flight chicago.                                                          
14636    locat 20 minut late flight. warn commun 15 minut late flight. that' call shitti custom svc
14637    pleas bring american airlin                                                               
14638    money, chang flight, answer phones! suggest make commitment??                             
14639    for, four ppl need 2 know mani seat next flight. plz put us standbi 4 peopl next flight?  
Name: text, dtype: object

### LancasterStemmer

Es conocido por ser simple, pero tambien en ser muy duro al stemmizar, ya que realiza iteraciones y podría ocurrir una sobre-stemmización

In [20]:
#from nltk.stem import LancasterStemmer

In [21]:
ls = LancasterStemmer()
data_LancasterStemming = data_noRepeated.apply(lambda x: ' '.join([ls.stem(word) for word in x.split()]))
data_LancasterStemming

14635    thank got diff flight chicago.                                                         
14636    loc 20 minut lat flight. warn commun 15 minut lat flight. that's cal shitty custom svc 
14637    pleas bring am airlin                                                                  
14638    money, chang flight, answ phones! suggest mak commitment??                             
14639    for, four ppl nee 2 know many seat next flight. plz put us standby 4 peopl next flight?
Name: text, dtype: object

Sin embargo, ambos stemmers por si solos devuelven la cadena completa como si se tratara de una palabra:

' plu ad commerc experience.. tacky.'

Cuando debería ser:

['plu' 'ad' 'commerc' 'experience' 'tacky']

Para lograr ello realizamos una "Tokenización"

## Tokenización

### Porter Stemmer

In [22]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ps.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noRepeated.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


14635    [thank, got, differ, flight, chicago]                                                           
14636    [locat, 20, minut, late, flight, commun, 15, minut, late, flight, 's, call, shitti, custom, svc]
14637    [pleas, bring, american, airlin]                                                                
14638    [money, flight, phone, make, commit]                                                            
14639    [for, ppl, need, 2, know, mani, seat, next, flight, put, us, standbi, 4, peopl, next, flight]   
Name: text, dtype: object

### Lancaster Stemmer

In [23]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ls.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noRepeated.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


14635    [thank, got, diff, flight, chicago]                                                         
14636    [loc, 20, minut, lat, flight, commun, 15, minut, lat, flight, 's, cal, shitty, custom, svc] 
14637    [pleas, bring, am, airlin]                                                                  
14638    [money, flight, phon, mak, commit]                                                          
14639    [for, ppl, nee, 2, know, many, seat, next, flight, put, us, standby, 4, peopl, next, flight]
Name: text, dtype: object

## Lemmatization (es el Stemming pero con otro proceso)

Se desarrollara el Lemmatization para ver si con este proceso se obtienen mejores resultados. El siguiente código tambien incluye la Tokenización

In [24]:
#Fuente: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
#from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(oracion):
    wordnet_lemmatizer = WordNetLemmatizer()
    punctuations = "?:!.,;$\"\'\´\``\”\“\''"
    resultado = []
    sentence_words = nltk.word_tokenize(oracion)
    for word in sentence_words:
        if word in punctuations:
            sentence_words.remove(word)
            continue
        resultado.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    return resultado

data_lemmatized = data_noRepeated.apply(lambda x: lemmatization(x))
data_lemmatized

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


14635    [thank, get, different, flight, chicago]                                                                        
14636    [location, 20, minutes, late, flight, communication, 15, minutes, late, flight, 's, call, shitty, customer, svc]
14637    [please, bring, american, airlines]                                                                             
14638    [money, flight, phone, make, commitment]                                                                        
14639    [for, ppl, need, 2, know, many, seat, next, flight, put, us, standby, 4, people, next, flight]                  
Name: text, dtype: object

## Part Of Speech Tagging (POS)

Sirve para etiquetar cada palabra en la oración como verbo, sustantivo o pronombre, etc.

In [25]:
#Fuente: https://towardsdatascience.com/basic-data-cleaning-engineering-session-twitter-sentiment-data-95e5bd2869ec
nltk.download('averaged_perceptron_tagger')
data_POS = data_noRepeated.apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
data_POS

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


14635    [(thank, NN), (got, VBD), (different, JJ), (flight, NN), (chicago, NN), (., .)]                                                                                                                                                                              
14637    [(please, NN), (bring, VB), (american, JJ), (airlines, NNS)]                                                                                                                                                                                                 
14638    [(money, NN), (,, ,), (change, NN), (flight, NN), (,, ,), (answer, JJR), (phones, NNS), (!, .), (suggestions, NNS), (make, VBP), (commitment, NN), (?, .), (?, .)]                                                                                           
14639    [(for, IN), (,, ,), (four, CD), (ppl, NN), (need, VBP), (2, CD), (know, VBP), (many, JJ), (seats, NNS), (next, JJ), (flight, NN), (., .), (plz, NN), (put, VBD), (us, PRP), (standby, VB), (4, CD), (peopl

# Feature Extraction

"En este caso, puedes definir una característica por cada palabra, indicando si el documento contiene esa palabra. Para ponerle un número limite de características que el clasificador necesita procesar, se empieza por construir una lista de las 2000 palabras mas frecuentes en el corpus en general"

Fuente: http://www.nltk.org/book/ch06.html

Primero necesitamos hacer una lista de todas las palabras (**Bag of Words**)

Como tengo un objeto de tipo "Series" de pandas, primero necesito convertirlo a una lista, para crear así, una **lista de listas**

In [26]:
l = data_lemmatized.tolist()
# data_lemmatized_prepared = data_lemmatized.apply(lambda x: ' '.join(x))
# data_lemmatized_prepared

Y crear una lista con todas las palabras, iterando la lista de listas y adjuntandolas a una nueva lista unidimensional

In [27]:
all_words = [item for sublist in l for item in sublist]

In [28]:
all_words

['thank',
 'get',
 'different',
 'flight',
 'chicago',
 'location',
 '20',
 'minutes',
 'late',
 'flight',
 'communication',
 '15',
 'minutes',
 'late',
 'flight',
 "'s",
 'call',
 'shitty',
 'customer',
 'svc',
 'please',
 'bring',
 'american',
 'airlines',
 'money',
 'flight',
 'phone',
 'make',
 'commitment',
 'for',
 'ppl',
 'need',
 '2',
 'know',
 'many',
 'seat',
 'next',
 'flight',
 'put',
 'us',
 'standby',
 '4',
 'people',
 'next',
 'flight']

In [29]:
# Definir el feature extractor

# Utilizar FreqDist para encontrar las palabras más utilizadas en todos los documentos
all_words_freq = nltk.FreqDist(all_words)

# Y tomar los primeros 2000
word_features = list(all_words_freq)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

### Bag of Words

In [30]:
word_features

['thank',
 'get',
 'different',
 'flight',
 'chicago',
 'location',
 '20',
 'minutes',
 'late',
 'communication',
 '15',
 "'s",
 'call',
 'shitty',
 'customer',
 'svc',
 'please',
 'bring',
 'american',
 'airlines',
 'money',
 'phone',
 'make',
 'commitment',
 'for',
 'ppl',
 'need',
 '2',
 'know',
 'many',
 'seat',
 'next',
 'put',
 'us',
 'standby',
 '4',
 'people']

## Ejecutando la funcion

In [31]:
document_features(word_features)

{'contains(thank)': True,
 'contains(get)': True,
 'contains(different)': True,
 'contains(flight)': True,
 'contains(chicago)': True,
 'contains(location)': True,
 'contains(20)': True,
 'contains(minutes)': True,
 'contains(late)': True,
 'contains(communication)': True,
 'contains(15)': True,
 "contains('s)": True,
 'contains(call)': True,
 'contains(shitty)': True,
 'contains(customer)': True,
 'contains(svc)': True,
 'contains(please)': True,
 'contains(bring)': True,
 'contains(american)': True,
 'contains(airlines)': True,
 'contains(money)': True,
 'contains(phone)': True,
 'contains(make)': True,
 'contains(commitment)': True,
 'contains(for)': True,
 'contains(ppl)': True,
 'contains(need)': True,
 'contains(2)': True,
 'contains(know)': True,
 'contains(many)': True,
 'contains(seat)': True,
 'contains(next)': True,
 'contains(put)': True,
 'contains(us)': True,
 'contains(standby)': True,
 'contains(4)': True,
 'contains(people)': True}

## Pivoteo

Ahora necesitamos crear una estructura en donde las filas sean los documentos y las columnas cada palabra en ese documento con su respectiva clasificación

### Convertir los valores de Sentiment a 0 y 1

In [32]:
sentiment = data["airline_sentiment"].replace(to_replace=["positive","neutral","negative"], value=[1,0,-1])
sentiment

14635    1
14636   -1
14637    0
14638   -1
14639    0
Name: airline_sentiment, dtype: int64

### Eliminar Columnas inecesarias

De esta manera solo conservamos las columnas que queremos tener

In [33]:
bag_sentiment = pd.DataFrame(dict(data_lemmatized = data_lemmatized, sentiment = sentiment))
bag_sentiment

Unnamed: 0,data_lemmatized,sentiment
14635,"[thank, get, different, flight, chicago]",1
14636,"[location, 20, minutes, late, flight, communication, 15, minutes, late, flight, 's, call, shitty, customer, svc]",-1
14637,"[please, bring, american, airlines]",0
14638,"[money, flight, phone, make, commitment]",-1
14639,"[for, ppl, need, 2, know, many, seat, next, flight, put, us, standby, 4, people, next, flight]",0


In [52]:
bag = bag_sentiment.values.tolist()
feature_sets = [(document_features(d), c) for (d,c) in bag]
train_set, test_set = feature_sets[100:], feature_sets[:100]
#classifier = nltk.NaiveBayesClassifier.train(train_set)

In [54]:
pd.DataFrame(train_set)

Unnamed: 0,0,1
0,"{'contains(thank)': False, 'contains(get)': False, 'contains(different)': False, 'contains(flight)': True, 'contains(chicago)': False, 'contains(location)': False, 'contains(20)': False, 'contains(minutes)': False, 'contains(late)': False, 'contains(communication)': False, 'contains(15)': False, 'contains('s)': False, 'contains(call)': False, 'contains(shitty)': False, 'contains(customer)': False, 'contains(svc)': False, 'contains(please)': False, 'contains(bring)': False, 'contains(american)': False, 'contains(airlines)': False, 'contains(money)': True, 'contains(phone)': True, 'contains(make)': True, 'contains(commitment)': True, 'contains(for)': False, 'contains(ppl)': False, 'contains(need)': False, 'contains(2)': False, 'contains(know)': False, 'contains(many)': False, 'contains(seat)': False, 'contains(next)': False, 'contains(put)': False, 'contains(us)': False, 'contains(standby)': False, 'contains(4)': False, 'contains(people)': False}",-1
1,"{'contains(thank)': False, 'contains(get)': False, 'contains(different)': False, 'contains(flight)': True, 'contains(chicago)': False, 'contains(location)': False, 'contains(20)': False, 'contains(minutes)': False, 'contains(late)': False, 'contains(communication)': False, 'contains(15)': False, 'contains('s)': False, 'contains(call)': False, 'contains(shitty)': False, 'contains(customer)': False, 'contains(svc)': False, 'contains(please)': False, 'contains(bring)': False, 'contains(american)': False, 'contains(airlines)': False, 'contains(money)': False, 'contains(phone)': False, 'contains(make)': False, 'contains(commitment)': False, 'contains(for)': True, 'contains(ppl)': True, 'contains(need)': True, 'contains(2)': True, 'contains(know)': True, 'contains(many)': True, 'contains(seat)': True, 'contains(next)': True, 'contains(put)': True, 'contains(us)': True, 'contains(standby)': True, 'contains(4)': True, 'contains(people)': True}",0
