# Preprocesamiento con Python

Para el preprocesamiento de texto utilizaremos las librerías de Python:
- **Numpy**: para el procesamiento de las operaciones en los Dataframes y Series de Pandas
- **Pandas**: para la manipulación de los datos
- **NLTK**: para el procesamiento de texto por medio de las StopWords, Stemming, Lemmatization y POS tag
- **re**: filtrar datos con para expresiones regulares 

## Lectura de datos con Pandas

In [2]:
# Importacion de librerias
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import emoji
import seaborn as sns
# Permite desplegar el texto completo en Jupyter
pd.set_option('display.max_colwidth', -1)

In [3]:
# Lectura de CSV
data = pd.read_csv("Tweets_pg_prepared.csv")
data.head(5) # Muestra los datos

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,5.70306e+17,neutral,1.0,Can't Tell,0.0,Virgin America,No value,cairdin,No Value,@VirginAmerica What @dhepburn said.,"[0.0, 0.0]",24/02/2015 11:35,No value,Eastern Time (US & Canada)
1,5.70301e+17,positive,0.3486,Can't Tell,0.0,Virgin America,No value,jnardino,No Value,@VirginAmerica plus you've added commercials to the experience... tacky.,"[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
2,5.70301e+17,neutral,0.6837,Can't Tell,0.0,Virgin America,No value,yvonnalynn,No Value,@VirginAmerica I didn't today... Must mean I need to take another trip!,"[0.0, 0.0]",24/02/2015 11:15,Lets Play,Central Time (US & Canada)
3,5.70301e+17,negative,1.0,Bad Flight,0.7033,Virgin America,No value,jnardino,No Value,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse","[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
4,5.70301e+17,negative,1.0,Can't Tell,1.0,Virgin America,No value,jnardino,No Value,@VirginAmerica and it's a really big bad thing about it,"[0.0, 0.0]",24/02/2015 11:14,No value,Pacific Time (US & Canada)


## Remover URLs (Regex)

Explicación Regex
1. **\w+** : Uno o más carácteres alfanumericos
2. **:\/\/** : Un "://"
3. **\S+**: uno o más carácteres que no sean espacios

Explicación otro Regex
1. **(http|https|ftp)**: Detectar si empieza con alguno de estos protocolos
2. **://**: Seguido de un "://"
3. **[a-zA-Z0-9\\./]**: E inmediatamente empieza una palabra seguido de un punto (.) una o mas veces (de esta manera se incluye el (.com y variantes)

In [14]:
data_noURL = data["text"].str.replace('\w+:\/\/\S+',"")
# Otro regex: (http|https|ftp)://[a-zA-Z0-9\\./]+
#data_noURL

## Remover referencias (@Usernames)

Explicación regex
1. **@**: Si empieza con arroba (@)
2. **(\w+)**: y le sigue una o más palabras

In [15]:
data_noUser = data_noURL.str.replace('@(\w+)',"")
#data_noUser

## Remover hashtags

Explicación regex
1. **#**: Si empieza con gato (#)
2. **(\w+)**: y le sigue una o más palabras

In [16]:
data_noHashtag = data_noUser.str.replace('#(\w+)',"")
#data_noHashtag

## Reemplazar Contracciones

In [19]:
diccionario_contracciones = {
        "ain't":"is not",
        "amn't":"am not",
        "aren't":"are not",
        "can't":"cannot",
        "'cause":"because",
        "couldn't":"could not",
        "couldn't've":"could not have",
        "could've":"could have",
        "daren't":"dare not",
        "daresn't":"dare not",
        "dasn't":"dare not",
        "didn't":"did not",
        "doesn't":"does not",
        "don't":"do not",
        "e'er":"ever",
        "em":"them",
        "everyone's":"everyone is",
        "finna":"fixing to",
        "gimme":"give me",
        "gonna":"going to",
        "gon't":"go not",
        "gotta":"got to",
        "hadn't":"had not",
        "hasn't":"has not",
        "haven't":"have not",
        "he'd":"he would",
        "he'll":"he will",
        "he's":"he is",
        "he've":"he have",
        "how'd":"how would",
        "how'll":"how will",
        "how're":"how are",
        "how's":"how is",
        "I'd":"I would",
        "I'll":"I will",
        "I'm":"I am",
        "I'm'a":"I am about to",
        "I'm'o":"I am going to",
        "isn't":"is not",
        "it'd":"it would",
        "it'll":"it will",
        "it's":"it is",
        "I've":"I have",
        "kinda":"kind of",
        "let's":"let us",
        "mayn't":"may not",
        "may've":"may have",
        "mightn't":"might not",
        "might've":"might have",
        "mustn't":"must not",
        "mustn't've":"must not have",
        "must've":"must have",
        "needn't":"need not",
        "ne'er":"never",
        "o'":"of",
        "o'er":"over",
        "ol'":"old",
        "oughtn't":"ought not",
        "shalln't":"shall not",
        "shan't":"shall not",
        "she'd":"she would",
        "she'll":"she will",
        "she's":"she is",
        "shouldn't":"should not",
        "shouldn't've":"should not have",
        "should've":"should have",
        "somebody's":"somebody is",
        "someone's":"someone is",
        "something's":"something is",
        "that'd":"that would",
        "that'll":"that will",
        "that're":"that are",
        "that's":"that is",
        "there'd":"there would",
        "there'll":"there will",
        "there're":"there are",
        "there's":"there is",
        "these're":"these are",
        "they'd":"they would",
        "they'll":"they will",
        "they're":"they are",
        "they've":"they have",
        "this's":"this is",
        "those're":"those are",
        "'tis":"it is",
        "'twas":"it was",
        "wanna":"want to",
        "wasn't":"was not",
        "we'd":"we would",
        "we'd've":"we would have",
        "we'll":"we will",
        "we're":"we are",
        "weren't":"were not",
        "we've":"we have",
        "what'd":"what did",
        "what'll":"what will",
        "what're":"what are",
        "what's":"what is",
        "what've":"what have",
        "when's":"when is",
        "where'd":"where did",
        "where're":"where are",
        "where's":"where is",
        "where've":"where have",
        "which's":"which is",
        "who'd":"who would",
        "who'd've":"who would have",
        "who'll":"who will",
        "who're":"who are",
        "who's":"who is",
        "who've":"who have",
        "why'd":"why did",
        "why're":"why are",
        "why's":"why is",
        "won't":"will not",
        "wouldn't":"would not",
        "would've":"would have",
        "y'all":"you all",
        "you'd":"you would",
        "you'll":"you will",
        "you're":"you are",
        "you've":"you have",
        "Whatcha":"What are you",
        "luv":"love",
        "sux":"sucks",
}

In [20]:
# Creando un conjunto de contracciones
conjunto_contracciones = set(diccionario_contracciones.keys())

In [21]:
def traducir_contracciones(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in conjunto_contracciones:
            #print("Contraccion con ", palabra, " a ", diccionario_contracciones[palabra], " en ", texto)
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = diccionario_contracciones[palabra]
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [23]:
data_noContracciones = data_noHashtag.str.replace("’","'")
data_noContracciones = data_noContracciones.apply(lambda x: traducir_contracciones(x))
#data_noContracciones

## Tratamiento de emoticones y emojis

Originalmente pensaba que algun analizador los podria detectar, pero despues de leer algunos artículos descubri que es mejor interpretarlos (convertirlos a palabras que expresen el sentimiento del emoticon). Esto es clave para medir la polaridad de un mensaje

### Interpretación de emoticones

In [24]:
diccionario_emoticones = {
        ":)":"smiley",
        ":‑)":"smiley",
        ":-]":"smiley",
        ":-3":"smiley",
        ":->":"smiley",
        "8-)":"smiley",
        ":-}":"smiley",
        ":)":"smiley",
        ":]":"smiley",
        ":3":"smiley",
        ":>":"smiley",
        "8)":"smiley",
        ":}":"smiley",
        ":o)":"smiley",
        ":c)":"smiley",
        ":^)":"smiley",
        "=]":"smiley",
        "=)":"smiley",
        ":-))":"smiley",
        ":-D":"smiley",
        "8‑D":"smiley",
        "x‑D":"smiley",
        "X‑D":"smiley",
        ":D":"smiley",
        "8D":"smiley",
        "xD":"smiley",
        "XD":"smiley",
        ":-d":"smiley",
        "8‑d":"smiley",
        "x‑d":"smiley",
        "X‑d":"smiley",
        ":d":"smiley",
        "8d":"smiley",
        "xd":"smiley",
        "Xd":"smiley",
        ":‑(":"sad",
        ":‑c":"sad",
        ":‑<":"sad",
        ":‑[":"sad",
        ":(":"sad",
        ":c":"sad",
        ":<":"sad",
        ":[":"sad",
        ":-||":"sad",
        ">:[":"sad",
        ":{":"sad",
        ":@":"sad",
        ">:(":"sad",
        ":'‑(":"sad",
        ":'(":"sad",
        ":‑P":"playful",
        "X‑P":"playful",
        "x‑p":"playful",
        ":‑p":"playful",
        ":‑Þ":"playful",
        ":‑þ":"playful",
        ":‑b":"playful",
        ":P":"playful",
        "XP":"playful",
        "xp":"playful",
        ":p":"playful",
        ":Þ":"playful",
        ":þ":"playful",
        ":b":"playful",
        ";p":"playful",
        "<3":"love",
}

In [25]:
# Creando un conjunto de emoticones
conjunto_emoticones = set(diccionario_emoticones.keys())

In [26]:
def traducir_emoticones(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in conjunto_emoticones:
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = diccionario_emoticones[palabra]
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [27]:
data_noEmoticones = data_noContracciones.apply(lambda x: traducir_emoticones(x))
#data_noEmoticones

### Codigo para remover emojis

In [15]:
# data_noEmoji = data_noHashtag.str.replace("["
#                            u"\U0001F600-\U0001F64F"  # emojis
#                            u"\U0001F300-\U0001F5FF"  # simbolos & pictografos
#                            u"\U0001F680-\U0001F6FF"  # simbolos de transporte y mapas
#                            u"\U0001F1E0-\U0001F1FF"  # banderas (iOS)
#                            u"\U00002702-\U000027B0"
#                            u"\U000024C2-\U0001F251"
#                            "]+", "")

### Interpretación de emojis

In [28]:
data_noEmojis = data_noEmoticones.apply(lambda x: emoji.demojize(x))
data_noEmojis = data_noEmojis.str.replace(":"," ")
#data_noEmojis

## Remover signos de puntuacion

In [29]:
data_noPunctuation = data_noEmojis.str.replace("[\.\,\!\?\:\;\-\=]", " ")
data_noPunctuation = data_noPunctuation.str.replace(" +"," ") # Reducir los espacios a solo 1
#data_noPunctuation

## Convertir mayúsculas a minúsculas

In [30]:
data_lower = data_noPunctuation.str.lower() # Convertir todo el texto de la columna "text" a minusculas
#data_lower 

## Interpretación de Slang (abreviaturas)

### Web Scrapping de los acronimos de Netlingo

In [19]:
#from bs4 import BeautifulSoup
#import requests, json
#resp = requests.get("http://www.netlingo.com/acronyms.php")
#soup = BeautifulSoup(resp.text, "html.parser")
#slangdict = {}
#key = ""
#value = ""
#for div in soup.findAll('div', attrs={'class':'list_box3'}):
#    for li in div.findAll('li'):
#        for a in li.findAll('a'):
#            key = a.text
#        value = li.text.split(key)[1]
#        slangdict[key.upper()] = value
#with open('myslang.json','w') as find:
#    json.dump(slangdict, find, indent = 2)

### Leer el archivo de Slang en JSON

Descubrí que al tener un diccionario más amplio, afecta negativamente al análisis semántico porque traduce palabras como "TIME" a "Tears In My Ears" cuando en realidad el texto se refiere al "Tiempo". Así que no lo voy a usar, pero dejo el código para futuras referencias.

In [20]:
# slang = pd.read_json("myslang.json", typ = "series")
# # slang.to_frame('count') #para convertir a DataFrame
# slang_df = slang.reset_index()
# slang_np = slang_df["index"].to_numpy()
# slang_list = slang_np.tolist()
# slang_set = set(slang_list)

In [21]:
#import re
# def translator(user_string):
#     user_string = user_string.split(" ")
#     j = 0
#     for _str in user_string:
#         # Removiendo carácteres especiales
#         #_str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
#         _str = _str.upper()
#         # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
#         if _str in slang_set:
#             print("entro en ", user_string, " con: ", _str, " a ", slang[_str].lower())
#             # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
#             user_string[j] = slang[_str].lower()
#         j = j + 1
#     # Retorna la cadena corregida
#     return ' '.join(user_string)

In [22]:
# data_noSlang = data_lower.apply(lambda x: translator(x))
# data_noSlang

### Si los slangs hubieran estado en un TXT

In [31]:
# Lectura de archivo
slang_df = pd.read_csv("slang.txt", sep = "=")
slang_df.columns = ["Slang", "Meaning"]

# Crear conjunto de Slangs
slang_np = slang_df["Slang"].to_numpy()
slang_list = slang_np.tolist()
slang_set = set(slang_list)

# Hacer que la columna "Slang" sean los indices (para busquedas)
slang_df = slang_df.set_index('Slang')
slang = slang_df["Meaning"]

In [32]:
def traducir_slang(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        palabra = palabra.upper()
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in slang_set:
            #print("Slang en ", texto, " con ", palabra, " a ", slang[palabra])
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = slang[palabra].lower()
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [33]:
data_noSlang = data_lower.apply(lambda x: traducir_slang(x))
#data_noSlang

## Reducción de carácteres repetidos

Como "haaapppyyyy" a "haappyy"

In [34]:
data_noRepeated = data_noSlang.transform(lambda x: re.sub(r'(.)\1+', r'\1\1', x))
#data_noRepeated

## Remover StopWords

Son palabras que no aportan valor al analizar sentimientos, en Inglés serían palabras como "are, you, have, etc"

In [35]:
#import nltk
#nltk.download('stopwords')
#from nltk.corpus import stopwords
stop = stopwords.words("english")
stop_set = set(stop)
data_noStopwords = data_noRepeated.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_set)]))
#data_noStopwords

## Stemming (volver a las palabras a su respectiva palabra raiz)

Existen diferentes tipos de Stemmers, para el lenguaje Inglés,  podemos encontrar 2 de las más populares en la librería NLTK

### Porter Stemmer

Es conocido por su simplicidad y velocidad

In [28]:
#from nltk.stem import PorterStemmer

In [36]:
ps = PorterStemmer()
data_PorterStemming = data_noStopwords.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
#data_PorterStemming

### LancasterStemmer

Es conocido por ser simple, pero tambien en ser muy duro al stemmizar, ya que realiza iteraciones y podría ocurrir una sobre-stemmización

In [30]:
#from nltk.stem import LancasterStemmer

In [37]:
ls = LancasterStemmer()
data_LancasterStemming = data_noStopwords.apply(lambda x: ' '.join([ls.stem(word) for word in x.split()]))
#data_LancasterStemming

Sin embargo, ambos stemmers por si solos devuelven la cadena completa como si se tratara de una palabra:

' plu ad commerc experience.. tacky.'

Cuando debería ser:

['plu' 'ad' 'commerc' 'experience' 'tacky']

Para lograr ello realizamos una "Tokenización"

## Tokenización

### Porter Stemmer

In [38]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ps.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noStopwords.apply(lambda x: stemOracion(x))
#porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GIYELI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Lancaster Stemmer

In [39]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ls.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noStopwords.apply(lambda x: stemOracion(x))
#porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GIYELI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Lemmatization (es el Stemming pero con otro proceso)

Se desarrollara el Lemmatization para ver si con este proceso se obtienen mejores resultados. El siguiente código tambien incluye la Tokenización

In [59]:
#Fuente: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
#from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(oracion):
    wordnet_lemmatizer = WordNetLemmatizer()
    punctuations = "?:!.,;$\"\'\´\``\”\“\''"
    resultado = []
    sentence_words = nltk.word_tokenize(oracion)
    for word in sentence_words:
        if word in punctuations:
            sentence_words.remove(word)
            continue
        resultado.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    return " ".join(resultado)

data_lemmatized = data_noStopwords.apply(lambda x: lemmatization(x))
data_lemmatized

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\GIYELI\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0        say                                                                                                        
1        plus add commercials experience tacky                                                                      
2        today must mean need take another trip                                                                     
3        really aggressive blast obnoxious & amp little recourse                                                    
4        really big bad thing                                                                                       
5        seriously would pay flight seat play really bad thing fly va                                               
6        yes nearly every time fly vx worm away smiley                                                              
7        really miss prime opportunity men without hat parody                                                       
8        well didn't…but smiley                                 

## Part Of Speech Tagging (POS)

Sirve para etiquetar cada palabra en la oración como verbo, sustantivo o pronombre, etc.

In [41]:
#Fuente: https://towardsdatascience.com/basic-data-cleaning-engineering-session-twitter-sentiment-data-95e5bd2869ec
nltk.download('averaged_perceptron_tagger')
data_POS = data_noStopwords.apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
#data_POS

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\GIYELI\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Guardar Lemmatization a CSV

### Convertir los valores de Sentiment a 0 y 1

In [53]:
sentiment = data["airline_sentiment"].replace(to_replace=["positive","neutral","negative"], value=[1,0,-1])
#sentiment

### Eliminar Columnas inecesarias

In [60]:
bag_sentiment = pd.DataFrame(dict(data_lemmatized = data_lemmatized, sentiment = sentiment))

### Dataframe a CSV

In [61]:
export_csv = bag_sentiment.to_csv (r'data_lemmatized.csv', index = None, header=True) 