# Preprocesamiento con Python

## Lectura de datos con Pandas

In [1]:
# Importacion de librerias
import pandas as pd
import numpy as np
# Permite desplegar el texto completo
pd.set_option('display.max_colwidth', -1)

In [2]:
# Lectura de CSV
data = pd.read_csv("Tweets_pg_prepared.csv")
data.head() # Muestra los datos

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,5.70306e+17,neutral,1.0,Can't Tell,0.0,Virgin America,No value,cairdin,No Value,@VirginAmerica What @dhepburn said.,"[0.0, 0.0]",24/02/2015 11:35,No value,Eastern Time (US & Canada)
1,5.70301e+17,positive,0.3486,Can't Tell,0.0,Virgin America,No value,jnardino,No Value,@VirginAmerica plus you've added commercials to the experience... tacky.,"[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
2,5.70301e+17,neutral,0.6837,Can't Tell,0.0,Virgin America,No value,yvonnalynn,No Value,@VirginAmerica I didn't today... Must mean I need to take another trip!,"[0.0, 0.0]",24/02/2015 11:15,Lets Play,Central Time (US & Canada)
3,5.70301e+17,negative,1.0,Bad Flight,0.7033,Virgin America,No value,jnardino,No Value,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse","[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
4,5.70301e+17,negative,1.0,Can't Tell,1.0,Virgin America,No value,jnardino,No Value,@VirginAmerica and it's a really big bad thing about it,"[0.0, 0.0]",24/02/2015 11:14,No value,Pacific Time (US & Canada)


In [3]:
data["text"] # Mostrar los datos de la columna "text"

0        @VirginAmerica What @dhepburn said.                                                                                                                   
1        @VirginAmerica plus you've added commercials to the experience... tacky.                                                                              
2        @VirginAmerica I didn't today... Must mean I need to take another trip!                                                                               
3        @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @VirginAmerica and it's a really big bad thing about it                                                                                               
5        @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA              
6        @VirginAmerica yes, nearly ever

## Convertir mayúsculas a minúsculas

In [4]:
data_lower = data["text"].str.lower() # Convertir todo el texto de la columna "text" a minusculas
data_lower # Mostrar

0        @virginamerica what @dhepburn said.                                                                                                                   
1        @virginamerica plus you've added commercials to the experience... tacky.                                                                              
2        @virginamerica i didn't today... must mean i need to take another trip!                                                                               
3        @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @virginamerica and it's a really big bad thing about it                                                                                               
5        @virginamerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying va              
6        @virginamerica yes, nearly ever

## Remover URLs (Regex)

Explicación de Regex
1. **(http|https|ftp)**
    * Detectar si empieza con alguno de estos protocolos
2. **://**  
    * Seguido de un "://"
3. **[a-zA-Z0-9\\./]**
    * E inmediatamente empieza una palabra seguido de un punto (.) una o mas veces (de esta manera se incluye el (.com y variantes)

In [5]:
data_noURL = data_lower.str.replace('(http|https|ftp)://[a-zA-Z0-9\\./]+',"")
data_noURL

0        @virginamerica what @dhepburn said.                                                                                                                   
1        @virginamerica plus you've added commercials to the experience... tacky.                                                                              
2        @virginamerica i didn't today... must mean i need to take another trip!                                                                               
3        @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @virginamerica and it's a really big bad thing about it                                                                                               
5        @virginamerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying va              
6        @virginamerica yes, nearly ever

## Remover referencias (@Usernames)

Explicación regex
1. **@**
    * Si empieza con arroba (@)
2. **(\w+)**
    * y le sigue una o más palabras

In [6]:
data_noUser = data_noURL.str.replace('@(\w+)',"")
data_noUser

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying va                
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover hashtags

Explicación regex
1. **#**
    * Si empieza con arroba (#)
2. **(\w+)**
    * y le sigue una o más palabras

In [7]:
data_noHashtag = data_noUser.str.replace('#(\w+)',"")
data_noHashtag

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying va                
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover emoticones

Falta averiguar si algún analizador de sentimientos le sirven los emoticones, o tal vez se puedan traducir los emojis por alguna palabra que exprese su sentimiento.

In [8]:

data_noEmoji = data_noHashtag.str.replace("["
                           u"\U0001F600-\U0001F64F"  # emoticones
                           u"\U0001F300-\U0001F5FF"  # simbolos & pictografos
                           u"\U0001F680-\U0001F6FF"  # simbolos de transporte y mapas
                           u"\U0001F1E0-\U0001F1FF"  # banderas (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", "")
data_noEmoji

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying va                
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover StopWords ( Palabras que no aportan valor al analizar sentimientos)

In [9]:
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words("english")
data_noStopwords = data_noEmoji.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data_noStopwords

0        said.                                                                                                           
1        plus added commercials experience... tacky.                                                                     
2        today... must mean need take another trip!                                                                      
3        really aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse                           
4        really big bad thing                                                                                            
5        seriously would pay $30 flight seats playing. really bad thing flying va                                        
6        yes, nearly every time fly vx “ear worm” won’t go away :)                                                       
7        really missed prime opportunity men without hats parody, there.                                                 
8        well, didn't…bu

## Interpretación de Slang (abreviaturas)

In [10]:
import csv, re

In [11]:
def translator(user_string):
    user_string = user_string.split(" ")
    j = 0
    for _str in user_string:
        # Archivo con las abreviaturas y su traducción
        fileName = "slang.txt"
        # Modo de Acceso al archivo (lectura)
        accessMode = "r"
        with open(fileName, accessMode) as myCSVfile:
            # Leer un archivo como un CSV con delimitador como "=", para que la abreviacion sea guardada en row[0] y las frases en row[1]
            dataFromFile = csv.reader(myCSVfile, delimiter="=")
            # Removiendo carácteres especiales
            _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
            for row in dataFromFile:
                # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
                if _str.upper() == row[0]:
                    # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
                    user_string[j] = row[1].lower()
            myCSVfile.close()
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(user_string)

In [12]:
data_noSlang = data_noStopwords.apply(lambda x: translator(x))
data_noSlang

0        said.                                                                                                               
1        plus added commercials experience... tacky.                                                                         
2        today... must mean need take another trip!                                                                          
3        really aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse                               
4        really big bad thing                                                                                                
5        seriously would pay $30 flight seats playing. really bad thing flying va                                            
6        yes, nearly every time fly vx “ear worm” won’t go away :)                                                           
7        really missed prime opportunity men without hats parody, there.                                              

## Remover carácteres repetidos

In [13]:
data_noRepeated = data_noSlang.transform(lambda x: re.sub(r'(.)\1+', r'\1\1', x))
data_noRepeated

0        said.                                                                                                               
1        plus added commercials experience.. tacky.                                                                          
2        today.. must mean need take another trip!                                                                           
3        really aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse                               
4        really big bad thing                                                                                                
5        seriously would pay $30 flight seats playing. really bad thing flying va                                            
6        yes, nearly every time fly vx “ear worm” won’t go away :)                                                           
7        really missed prime opportunity men without hats parody, there.                                              

## Stemming (volver a las palabras a su respectiva palabra raiz)

Existen diferentes tipos de Stemmers, para el lenguaje Inglés,  podemos encontrar 2 de las más populares en la librería NLTK

### Porter Stemmer

Es conocido por su simplicidad y velocidad

In [14]:
from nltk.stem import PorterStemmer

In [15]:
ps = PorterStemmer()
data_PorterStemming = data_noRepeated.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
data_PorterStemming

0        said.                                                                                              
1        plu ad commerci experience.. tacky.                                                                
2        today.. must mean need take anoth trip!                                                            
3        realli aggress blast obnoxi "entertainment" guests' face &amp; littl recours                       
4        realli big bad thing                                                                               
5        serious would pay $30 flight seat playing. realli bad thing fli va                                 
6        yes, nearli everi time fli vx “ear worm” won’t go away :)                                          
7        realli miss prime opportun men without hat parody, there.                                          
8        well, didn't…but do! :-d                                                                           
9        amazing, a

### LancasterStemmer

Es conocido por ser simple, pero tambien en ser muy duro al stemmizar, ya que realiza iteraciones y podría ocurrir una sobre-stemmización

In [16]:
from nltk.stem import LancasterStemmer

In [17]:
ls = LancasterStemmer()
data_LancasterStemming = data_noRepeated.apply(lambda x: ' '.join([ls.stem(word) for word in x.split()]))
data_LancasterStemming

0        said.                                                                                          
1        plu ad commerc experience.. tacky.                                                             
2        today.. must mean nee tak anoth trip!                                                          
3        real aggress blast obnoxy "entertainment" guests' fac &amp; littl recours                      
4        real big bad thing                                                                             
5        sery would pay $30 flight seat playing. real bad thing fly va                                  
6        yes, near every tim fly vx “ear worm” won’t go away :)                                         
7        real miss prim opportun men without hat parody, there.                                         
8        well, didn't…but do! :-d                                                                       
9        amazing, ar hour early. good me.              

Sin embargo, ambos stemmers por si solos devuelven la cadena completa como si se tratara de una palabra:

' plu ad commerc experience.. tacky.'

Cuando debería ser:
'plu' 'ad' 'commerc' 'experience' 'tacky'

Para lograr ello realizamos una "Tokenización"

## Tokenización

In [18]:
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(ps.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)


porter_stemmer_tokenized = data_noRepeated.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

0        said .                                                                                             
1        plu ad commerci experience.. tacki .                                                               
2        today.. must mean need take anoth trip !                                                           
3        realli aggress blast obnoxi `` entertain '' guest ' face & amp ; littl recours                     
4        realli big bad thing                                                                               
5        serious would pay $ 30 flight seat play . realli bad thing fli va                                  
6        ye , nearli everi time fli vx “ ear worm ” won ’ t go away : )                                     
7        realli miss prime opportun men without hat parodi , there .                                        
8        well , didn't…but do ! : -d                                                                        
9        amaz , arr

## Lemmatization (es el Stemming pero con otro proceso)

Se desarrollara el Lemmatization para ver si con este proceso se obtienen mejores resultados

In [26]:
# WIP (Work In Progress)
#https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
from nltk.stem import WordNetLemmatizer
ltk.download('wordnet')

def lemmatization(oracion):
    wordnet_lemmatizer = WordNetLemmatizer()
    punctuations = "?:!.,;"
    resultado = []
    sentence_words = nltk.word_tokenize(oracion)
    for word in sentence_words:
        if word in punctuations:
            sentence_words.remove(word)
    for word in sentence_words:
        resultado.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    return " ".join(resultado)

data_lemmatized = data_noRepeated.apply(lambda x: lemmatization(x))
data_lemmatized

0        say                                                                                                        
1        plus add commercials experience.. tacky                                                                    
2        today.. must mean need take another trip                                                                   
3        really aggressive blast obnoxious `` entertainment '' guests ' face & amp little recourse                  
4        really big bad thing                                                                                       
5        seriously would pay $ 30 flight seat play really bad thing fly va                                          
6        yes nearly every time fly vx “ ear worm ” win ’ t go away )                                                
7        really miss prime opportunity men without hat parody there                                                 
8        well didn't…but do : -d                                

## Part Of Speech Tagging (POS)

Sirve para etiquetar cada palabra en la oración como verbo, sustantivo o pronombre, etc.

In [None]:
# WIP (Work In Progress)
#https://towardsdatascience.com/basic-data-cleaning-engineering-session-twitter-sentiment-data-95e5bd2869ec
nltk.download('averaged_perceptron_tagger')
data_POS = data_noRepeated.apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
data_POS

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\GIYELI\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
