# Preprocesamiento con Python

Para el preprocesamiento de texto utilizaremos las librerías de Python:
- **Numpy**: para el procesamiento de las operaciones en los Dataframes y Series de Pandas
- **Pandas**: para la manipulación de los datos
- **NLTK**: para el procesamiento de texto por medio de las StopWords, Stemming, Lemmatization y POS tag
- **re**: filtrar datos con para expresiones regulares 

## Lectura de datos con Pandas

In [1]:
# Importacion de librerias
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# Permite desplegar el texto completo en Jupyter
pd.set_option('display.max_colwidth', -1)

In [2]:
# Lectura de CSV
data = pd.read_csv("Tweets_pg_prepared.csv")
data.head(20) # Muestra los datos

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,5.70306e+17,neutral,1.0,Can't Tell,0.0,Virgin America,No value,cairdin,No Value,@VirginAmerica What @dhepburn said.,"[0.0, 0.0]",24/02/2015 11:35,No value,Eastern Time (US & Canada)
1,5.70301e+17,positive,0.3486,Can't Tell,0.0,Virgin America,No value,jnardino,No Value,@VirginAmerica plus you've added commercials to the experience... tacky.,"[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
2,5.70301e+17,neutral,0.6837,Can't Tell,0.0,Virgin America,No value,yvonnalynn,No Value,@VirginAmerica I didn't today... Must mean I need to take another trip!,"[0.0, 0.0]",24/02/2015 11:15,Lets Play,Central Time (US & Canada)
3,5.70301e+17,negative,1.0,Bad Flight,0.7033,Virgin America,No value,jnardino,No Value,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse","[0.0, 0.0]",24/02/2015 11:15,No value,Pacific Time (US & Canada)
4,5.70301e+17,negative,1.0,Can't Tell,1.0,Virgin America,No value,jnardino,No Value,@VirginAmerica and it's a really big bad thing about it,"[0.0, 0.0]",24/02/2015 11:14,No value,Pacific Time (US & Canada)
5,5.70301e+17,negative,1.0,Can't Tell,0.6842,Virgin America,No value,jnardino,No Value,@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA,"[0.0, 0.0]",24/02/2015 11:14,No value,Pacific Time (US & Canada)
6,5.70301e+17,positive,0.6745,Can't Tell,0.0,Virgin America,No value,cjmcginnis,No Value,"@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)","[0.0, 0.0]",24/02/2015 11:13,San Francisco CA,Pacific Time (US & Canada)
7,5.703e+17,neutral,0.634,Can't Tell,0.0,Virgin America,No value,pilot,No Value,"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP","[0.0, 0.0]",24/02/2015 11:12,Los Angeles,Pacific Time (US & Canada)
8,5.703e+17,positive,0.6559,Can't Tell,0.0,Virgin America,No value,dhepburn,No Value,"@virginamerica Well, I didn't…but NOW I DO! :-D","[0.0, 0.0]",24/02/2015 11:11,San Diego,Pacific Time (US & Canada)
9,5.70295e+17,positive,1.0,Can't Tell,0.0,Virgin America,No value,YupitsTate,No Value,"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me.","[0.0, 0.0]",24/02/2015 10:53,Los Angeles,Eastern Time (US & Canada)


In [3]:
data["text"] # Mostrar los datos de la columna "text"

0        @VirginAmerica What @dhepburn said.                                                                                                                   
1        @VirginAmerica plus you've added commercials to the experience... tacky.                                                                              
2        @VirginAmerica I didn't today... Must mean I need to take another trip!                                                                               
3        @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @VirginAmerica and it's a really big bad thing about it                                                                                               
5        @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA            
6        @VirginAmerica yes, nearly ever

## Remover URLs (Regex)

Explicación Regex
1. **\w+** : Uno o más carácteres alfanumericos
2. **:\/\/** : Un "://"
3. **\S+**: uno o más carácteres que no sean espacios

Explicación otro Regex
1. **(http|https|ftp)**: Detectar si empieza con alguno de estos protocolos
2. **://**: Seguido de un "://"
3. **[a-zA-Z0-9\\./]**: E inmediatamente empieza una palabra seguido de un punto (.) una o mas veces (de esta manera se incluye el (.com y variantes)

In [4]:
data_noURL = data["text"].str.replace('\w+:\/\/\S+',"")
# Otro regex: (http|https|ftp)://[a-zA-Z0-9\\./]+
data_noURL

0        @VirginAmerica What @dhepburn said.                                                                                                                   
1        @VirginAmerica plus you've added commercials to the experience... tacky.                                                                              
2        @VirginAmerica I didn't today... Must mean I need to take another trip!                                                                               
3        @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @VirginAmerica and it's a really big bad thing about it                                                                                               
5        @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA            
6        @VirginAmerica yes, nearly ever

## Remover referencias (@Usernames)

Explicación regex
1. **@**: Si empieza con arroba (@)
2. **(\w+)**: y le sigue una o más palabras

In [5]:
data_noUser = data_noURL.str.replace('@(\w+)',"")
data_noUser

0         What  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         I didn't today... Must mean I need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA              
6         yes, nearly every time I fly VX this “ear worm” won’t go away :)                                      

## Remover hashtags

Explicación regex
1. **#**: Si empieza con gato (#)
2. **(\w+)**: y le sigue una o más palabras

In [6]:
data_noHashtag = data_noUser.str.replace('#(\w+)',"")
data_noHashtag

0         What  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         I didn't today... Must mean I need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA              
6         yes, nearly every time I fly VX this “ear worm” won’t go away :)                                      

## Reemplazar Contracciones

In [7]:
diccionario_contracciones = {
        "ain't":"is not",
        "amn't":"am not",
        "aren't":"are not",
        "can't":"cannot",
        "'cause":"because",
        "couldn't":"could not",
        "couldn't've":"could not have",
        "could've":"could have",
        "daren't":"dare not",
        "daresn't":"dare not",
        "dasn't":"dare not",
        "didn't":"did not",
        "doesn't":"does not",
        "don't":"do not",
        "e'er":"ever",
        "em":"them",
        "everyone's":"everyone is",
        "finna":"fixing to",
        "gimme":"give me",
        "gonna":"going to",
        "gon't":"go not",
        "gotta":"got to",
        "hadn't":"had not",
        "hasn't":"has not",
        "haven't":"have not",
        "he'd":"he would",
        "he'll":"he will",
        "he's":"he is",
        "he've":"he have",
        "how'd":"how would",
        "how'll":"how will",
        "how're":"how are",
        "how's":"how is",
        "I'd":"I would",
        "I'll":"I will",
        "I'm":"I am",
        "I'm'a":"I am about to",
        "I'm'o":"I am going to",
        "isn't":"is not",
        "it'd":"it would",
        "it'll":"it will",
        "it's":"it is",
        "I've":"I have",
        "kinda":"kind of",
        "let's":"let us",
        "mayn't":"may not",
        "may've":"may have",
        "mightn't":"might not",
        "might've":"might have",
        "mustn't":"must not",
        "mustn't've":"must not have",
        "must've":"must have",
        "needn't":"need not",
        "ne'er":"never",
        "o'":"of",
        "o'er":"over",
        "ol'":"old",
        "oughtn't":"ought not",
        "shalln't":"shall not",
        "shan't":"shall not",
        "she'd":"she would",
        "she'll":"she will",
        "she's":"she is",
        "shouldn't":"should not",
        "shouldn't've":"should not have",
        "should've":"should have",
        "somebody's":"somebody is",
        "someone's":"someone is",
        "something's":"something is",
        "that'd":"that would",
        "that'll":"that will",
        "that're":"that are",
        "that's":"that is",
        "there'd":"there would",
        "there'll":"there will",
        "there're":"there are",
        "there's":"there is",
        "these're":"these are",
        "they'd":"they would",
        "they'll":"they will",
        "they're":"they are",
        "they've":"they have",
        "this's":"this is",
        "those're":"those are",
        "'tis":"it is",
        "'twas":"it was",
        "wanna":"want to",
        "wasn't":"was not",
        "we'd":"we would",
        "we'd've":"we would have",
        "we'll":"we will",
        "we're":"we are",
        "weren't":"were not",
        "we've":"we have",
        "what'd":"what did",
        "what'll":"what will",
        "what're":"what are",
        "what's":"what is",
        "what've":"what have",
        "when's":"when is",
        "where'd":"where did",
        "where're":"where are",
        "where's":"where is",
        "where've":"where have",
        "which's":"which is",
        "who'd":"who would",
        "who'd've":"who would have",
        "who'll":"who will",
        "who're":"who are",
        "who's":"who is",
        "who've":"who have",
        "why'd":"why did",
        "why're":"why are",
        "why's":"why is",
        "won't":"will not",
        "wouldn't":"would not",
        "would've":"would have",
        "y'all":"you all",
        "you'd":"you would",
        "you'll":"you will",
        "you're":"you are",
        "you've":"you have",
        "Whatcha":"What are you",
        "luv":"love",
        "sux":"sucks",
}

In [8]:
# Creando un conjunto de contracciones
conjunto_contracciones = set(diccionario_contracciones.keys())

In [9]:
def traducir_contracciones(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in conjunto_contracciones:
            print("Contraccion con ", palabra, " a ", diccionario_contracciones[palabra], " en ", texto)
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = diccionario_contracciones[palabra]
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [10]:
data_noContracciones = data_noHashtag.str.replace("’","'")
data_noContracciones = data_noContracciones.apply(lambda x: traducir_contracciones(x))
data_noContracciones

Contraccion con  you've  a  you have  en  ['', 'plus', "you've", 'added', 'commercials', 'to', 'the', 'experience...', 'tacky.']
Contraccion con  didn't  a  did not  en  ['', 'I', "didn't", 'today...', 'Must', 'mean', 'I', 'need', 'to', 'take', 'another', 'trip!']
Contraccion con  it's  a  it is  en  ['', "it's", 'really', 'aggressive', 'to', 'blast', 'obnoxious', '"entertainment"', 'in', 'your', "guests'", 'faces', '&amp;', 'they', 'have', 'little', 'recourse']
Contraccion con  it's  a  it is  en  ['', 'and', "it's", 'a', 'really', 'big', 'bad', 'thing', 'about', 'it']
Contraccion con  didn't  a  did not  en  ['', 'seriously', 'would', 'pay', '$30', 'a', 'flight', 'for', 'seats', 'that', "didn't", 'have', 'this', "playing.\r\nit's", 'really', 'the', 'only', 'bad', 'thing', 'about', 'flying', 'VA']
Contraccion con  won't  a  will not  en  ['', 'yes,', 'nearly', 'every', 'time', 'I', 'fly', 'VX', 'this', '“ear', 'worm”', "won't", 'go', 'away', ':)']
Contraccion con  haven't  a  have not

Contraccion con  what's  a  what is  en  ['', "what's", 'a', 'good', 'number', 'to', 'call', 'to', 'speak', 'with', 'someone', 'about', 'how', 'you', 'can', 'fix', 'what', 'you', 'did', 'to', '50', 'people', 'and', 'their', 'luggage', 'on', 'Saturday?']
Contraccion con  where's  a  where is  en  ['', "where's", 'my', 'damn', 'bag??']
Contraccion con  it's  a  it is  en  ['', 'diverted', 'and', 'missed', 'our', 'connecting', 'flight.', 'Was', 'just', 'told', 'that', 'my', 'bag', 'is', 'on', "it's", 'way', 'to', 'MSY.', 'If', 'you', 'only', 'had', 'people', 'that', 'cared']
Contraccion con  can't  a  cannot  en  ['', 'Male', 'agnt', 'in', 'LAS', 'threatens', 'Canadian', 'cust', 'when', 'cust', 'takes', 'pic', 'of', 'him', 'at', 'gate', 'after', 'agents', 'announce', "can't", 'help', 'rebook.', '?']
Contraccion con  what's  a  what is  en  ['', 'so', "what's", 'the', 'deal?', 'Do', 'u', 'provide', 'voucher', 'for', 'overnight', 'or', 'am', 'I', 'cozy', 'on', 'the', 'floor', 'at', '', '?',

Contraccion con  can't  a  cannot  en  ['', 'tried', 'to', 'book', 'a', 'flight', 'IAH-MNL', 'departing', '3/31/15', 'returning', '4/17/15', 'you', 'are', 'advertising', '9', 'flights', 'for', '$1051', 'that', "can't", 'be', 'book!']
Contraccion con  couldn't  a  could not  en  ['', 'Flight', '472', 'from', 'ORD', "couldn't", 'let', 'me', 'know', 'about', 'this?', 'Found', 'out', 'via', 'app', 'minutes', 'before', 'landing.', 'Awful', 'flight.', '']
Contraccion con  didn't  a  did not  en  ['', 'A', 'very', 'disappointing', 'experience', '-', 'plane', 'mech.', 'delay', 'and', 'next', 'one', "didn't", 'wait.', 'No', 'sincere', 'apology,', 'just', 'told', 'me', 'to', 'complain', 'online']
Contraccion con  you've  a  you have  en  ['', '', 'with', 'the', 'exception', 'of', 'everything', "you've", 'asked', 'for.', 'Heh']
Contraccion con  I've  a  I have  en  ['', 'worst', 'flights', "I've", 'ever', 'had.', 'ground', 'crew', 'ignored', 'our', 'plane,', 'made', 'me', 'miss', 'flight', 'and',

Contraccion con  won't  a  will not  en  ['', 'God', 'damn', 'fucking', 'crew', "won't", 'be', 'here', 'till', '6:40,', 'so', "you've", 'known', 'for', 'over', 'an', 'hour', 'and', 'a', 'half', 'the', 'flight', 'time', 'was', 'bullshit.']
Contraccion con  you've  a  you have  en  ['', 'God', 'damn', 'fucking', 'crew', 'will not', 'be', 'here', 'till', '6:40,', 'so', "you've", 'known', 'for', 'over', 'an', 'hour', 'and', 'a', 'half', 'the', 'flight', 'time', 'was', 'bullshit.']
Contraccion con  they're  a  they are  en  ['', 'you', 'really', 'do', 'have', 'a', 'culture', 'problem.', 'Everyone', 'I', 'tried', 'to', 'work', 'with', 'blamed', 'someone', 'else', 'or', 'told', 'me', 'how', "they're", 'short', 'staffed']
Contraccion con  where's  a  where is  en  ['', "where's", 'the', 'crew', 'for', 'ua748?']
Contraccion con  I'm  a  I am  en  ['', 'thanks', '...', 'not', 'sure', 'arranged', 'move', 'to', 'the', 'earlier', 'flight', 'but', "I'm", 'at', 'the', 'gate', 'with', 'a', 'seat', 'as

Contraccion con  I've  a  I have  en  ['', '5.5', 'hours', 'Late', 'Flightr', "I've", 'been', 'in', 'transit', 'for', 'a', 'total', 'of', 'twelve', 'hours...please', 'just', 'change', 'the', 'plane', 'on', 'flight', '600', 'this', 'is', 'ridiculous', 'SFO']
Contraccion con  isn't  a  is not  en  ['', 'all', 'good', 'man', 'it', "isn't", 'your', 'fault', 'that', 'plane', 'is', 'having', 'maintenance', 'issues']
Contraccion con  doesn't  a  does not  en  ['', 'this', 'delay', 'of', 'flight', 'UA4636', 'has', 'been', 'painful.', 'I', 'sure', 'hope', 'it', "doesn't", 'cause', 'me', 'or', 'my', 'luggage)', 'to', 'miss', 'my', 'UA82', 'flight', 'to', 'New', 'Delhi!']
Contraccion con  it's  a  it is  en  ['', 'it', 'should', 'be', 'free', 'like', 'other', 'airlines!', 'Again,', "it's", 'not', '1997', 'anymore.']
Contraccion con  I'm  a  I am  en  ['', 'please', 'tell', 'me', "I'm", 'going', 'to', 'make', 'my', 'connecting', 'flight', 'from', "O'hare", 'to', '', '', '🙏', '']
Contraccion con  I

Contraccion con  I'd  a  I would  en  ['', 'What', 'delivery', 'service', 'do', 'you', 'use?', "I'd", 'like', 'to', 'call', 'them', 'myself.']
Contraccion con  can't  a  cannot  en  ['', 'still', 'waiting', '4', 'our', 'bags.', 'Web', 'STILL', "can't", 'tell', 'me', 'its', 'location', '.', 'How', 'come', '', '', 'can', 'tell', 'u', 'at', 'any', 'given', 'minute', 'where', 'it', 'is?']
Contraccion con  haven't  a  have not  en  ['', '"We', 'like', 'hearing', 'from', 'you."', 'So', 'why', "haven't", 'you', 'replied', 'to', 'my', 'tweet', 'and/or', 'email', 'yet?', '']
Contraccion con  I'm  a  I am  en  ['', "I'm", 'not', 'as', 'sure', 'as', 'you', 'are.', '']
Contraccion con  couldn't  a  could not  en  ['', 'how', 'does', 'that', 'help', 'me', 'with', 'my', 'customers', 'that', 'I', "couldn't", 'meet', 'with', '(and', 'subsequently', 'lost)?']
Contraccion con  can't  a  cannot  en  ['', 'question', '-', 'was', 'given', 'food', 'vouchers', 'but', "can't", 'use', 'on', 'plane..how', '', '

Contraccion con  couldn't  a  could not  en  ['', 'why', "couldn't", 'you', 'have', 'changed', 'the', 'tire', 'of', 'my', 'delayed', 'UA1127', 'flight', 'when', 'it', 'arrived', 'instead', 'of', 'waiting', 'until', 'boarding?']
Contraccion con  I'm  a  I am  en  ['', 'I', 'want', 'my', 'bags.', 'There', 'is', 'vital', 'equipment', 'in', 'there.', 'You', 'are', 'royally', 'screwing', 'me.', "I'm", 'cranky', 'and', 'want', 'an', 'update.']
Contraccion con  it's  a  it is  en  ['', 'unless', "it's", 'on', 'you', 'guys,', 'im', 'good.']
Contraccion con  weren't  a  were not  en  ['', 'Where', 'are', 'my', 'bags!!!', '', 'They', "weren't", 'in', 'LAX', 'like', 'your', 'promised.', '', '9', 'out', 'of', '10', 'things', 'today', 'were', 'a', 'mess', 'today', 'because', 'of', 'you.']
Contraccion con  haven't  a  have not  en  ['', 'still', "haven't", 'received', 'a', 'response.', 'Please', 'direct', 'message', 'me', 'for', 'my', 'contact', 'information.']
Contraccion con  I've  a  I have  en  

Contraccion con  isn't  a  is not  en  ['', 'I', 'hope', 'this', "isn't", 'a', 'real', 'life', 'scene', 'out', 'of', 'the', 'movie', "'Flight'", '.....']
Contraccion con  don't  a  do not  en  ['', '', 'And', 'this', 'is', 'why', 'I', 'love', 'flying', 'Southwest.', 'Excellent', 'service,', 'and', 'you', "don't", 'take', 'yourselves', 'too', 'seriously!']
Contraccion con  what's  a  what is  en  ['', '', "what's", 'even', 'better', 'is', 'the', 'price', 'changed', 'in', 'the', '2', 'minutes', 'since', 'I', 'talked', 'to', 'the', 'lady', 'and', 'they', 'still', 'honored', 'the', 'cheap1']
Contraccion con  I've  a  I have  en  ['', "I've", 'had', 'TERRIBLE', 'service', 'in', 'three', 'airports', 'in', '10', 'hours.', 'Glad', 'they', "don't", 'care', 'we', 'kind', 'of', 'need', 'to', 'be', 'home.', '']
Contraccion con  don't  a  do not  en  ['', 'I have', 'had', 'TERRIBLE', 'service', 'in', 'three', 'airports', 'in', '10', 'hours.', 'Glad', 'they', "don't", 'care', 'we', 'kind', 'of', 'ne

Contraccion con  I've  a  I have  en  ['', 'thanks', 'so', 'much', 'just', 'had', 'to', 'make', 'a', 'Cancelled', 'Flightlation!', "I've", 'sent', 'u', 'the', 'info.']
Contraccion con  I've  a  I have  en  ['', 'I', 'never', 'got', 'my', 'flight', 'confirmation.', "I've", 'been', 'on', 'hold', 'for', 'an', 'hour', 'and', 'I', "can't", 'get', 'info', 'online.', 'What', 'am', 'I', 'to', 'do?']
Contraccion con  can't  a  cannot  en  ['', 'I', 'never', 'got', 'my', 'flight', 'confirmation.', 'I have', 'been', 'on', 'hold', 'for', 'an', 'hour', 'and', 'I', "can't", 'get', 'info', 'online.', 'What', 'am', 'I', 'to', 'do?']
Contraccion con  won't  a  will not  en  ['', '😅', 'you', "won't", 'let', 'me', 'change', 'my', 'reservation', 'online', 'so', 'now', "I'm", 'just', 'wasting', 'my', 'time.', '']
Contraccion con  I'm  a  I am  en  ['', '😅', 'you', 'will not', 'let', 'me', 'change', 'my', 'reservation', 'online', 'so', 'now', "I'm", 'just', 'wasting', 'my', 'time.', '']
Contraccion con  I'm

Contraccion con  can't  a  cannot  en  ['', 'Why', "can't", 'I', 'find', 'a', 'cheap', 'flight', 'from', 'DC', 'to', 'St', 'Louis?', 'The', 'prices', 'went', 'up', 'like', 'crazy', 'for', 'April', 'weekends!']
Contraccion con  can't  a  cannot  en  ['', "can't", 'DM', 'you', 'without', 'you', 'following', 'me...']
Contraccion con  we'd  a  we would  en  ['“:', ',', "we'd", 'still', 'be', 'rocking', 'out', 'to', 'this', 'chart-topping', 'hit".', 'What?', '', 'The', "80's", 'are', 'over?']
Contraccion con  y'all  a  you all  en  ['', 'Do', "y'all", 'know', 'when', 'the', 'new', 'routes', 'from', 'HOU', 'to', 'Aruba', '&amp;', 'Puerto', 'Vallarta', 'will', 'be', 'available?']
Contraccion con  can't  a  cannot  en  ['', 'no', 'one', 'has', 'answers...no', 'one', 'can', 'help.', 'There', 'is', 'always', 'a', 'different', 'story', 'to', 'why', 'my', 'and', 'my', "fiancee'", "can't", 'be', 'helped.']
Contraccion con  luv  a  love  en  ['', 'can', 'I', 'get', 'some', 'luv', 'with', 'a', 'fallo

Contraccion con  I'm  a  I am  en  ['', '', 'Great', 'job', 'celebrating', '', 'today', 'at', 'Atlanta', 'Airport.', 'Another', 'reason', "I'm", 'nuts', 'for', 'you!', '']
Contraccion con  you're  a  you are  en  ['', 'your', 'customer', 'service', 'is', 'terrible,', "you're", 'terrible,', 'thought', 'you', 'should', 'know.']
Contraccion con  won't  a  will not  en  ['', 'my', 'pts', 'expired.', 'I', 'made', 'a', 'prchase', '@', 'an', 'online', 'retailer', 'to', 'b', 'told', 'by', 'SW', 'that', 'those', "won't", 'show', 'for', '6-8', 'wks', 'so', 'too', 'L8', '2', 'keepmy', 'pts']
Contraccion con  I'm  a  I am  en  ['', 'TREMENDOUS', 'job.', 'Atlanta', 'Airport', 'saw', 'SW', 'celebrate', 'Mardi', 'Gras.', 'Another', 'reason', "I'm", 'nuts', 'for', 'you', 'guys!', '']
Contraccion con  I've  a  I have  en  ['', 'You', 'officially', 'have', 'the', 'worst', 'customer', 'service', 'of', 'any', 'airline', "I've", 'ever', 'dealt', 'with.', '', '']
Contraccion con  I'm  a  I am  en  ['', "I'm

Contraccion con  I'm  a  I am  en  ['', 'the', 'fact', 'that', '', 'is not', 'trending', 'is', 'how', 'you', 'know', "I'm", 'loyal', ';)']
Contraccion con  haven't  a  have not  en  ['', 'no,', 'I', "haven't", 'done', 'that', 'yet.', 'Is', 'that', 'something', 'I', 'can', 'do', 'online?', 'Thx!']
Contraccion con  I'm  a  I am  en  ['', 'woof', "I'm", 'on', 'the', 'red', 'cArpet', '']
Contraccion con  can't  a  cannot  en  ['', '', 'No', 'wifi', 'on', 'this', 'flight', 'so', 'we', "can't", 'tweet', 'you', 'our', 'Oscar', 'party', 'pics', 'at', '37,000ft.', '', ':-(', 'SEA✈️BOS']
Contraccion con  don't  a  do not  en  ['', 'You', 'just', "don't", 'get', 'it.', '', "It's", 'not', 'about', 'the', 'money,', "It's", 'about', 'PEOPLE!!!!', 'How', 'about', 'a', 'public', 'apology', 'from', 'the', 'president', 'of', 'Jet', 'Blue.']
Contraccion con  what's  a  what is  en  ['', 'hey', 'awesome', 'peeps,', "what's", 'up', 'with', 'flight', '1159', 'from', 'Boston?', 'Delayed', '3hrs?']
Contraccio

Contraccion con  doesn't  a  does not  en  ['', "doesn't", 'help', 'if', 'I', 'have', 'to', 'fly', 'on', 'specific', 'days']
Contraccion con  it's  a  it is  en  ['', "it's", 'noting', 'about', 'me', 'i', 'm', 'perfectly', 'fine', "it's", 'the', 'attitude', 'and', 'dealings', 'with', 'flyers.', 'Stubborn.', 'Demanding.', 'Unwilling', 'to', 'accommodate.']
Contraccion con  it's  a  it is  en  ['', 'it is', 'noting', 'about', 'me', 'i', 'm', 'perfectly', 'fine', "it's", 'the', 'attitude', 'and', 'dealings', 'with', 'flyers.', 'Stubborn.', 'Demanding.', 'Unwilling', 'to', 'accommodate.']
Contraccion con  y'all  a  you all  en  ['', "There's", 'just', 'so', 'many', 'choices', 'for', "y'all", 'south', 'of', 'the', 'border', 'and', 'I', 'know', 'not', 'every', 'airline', 'is', 'equal', '-', 'lowest', 'price', '!=', 'best', 'value.', ';)']
Contraccion con  what's  a  what is  en  ['', "what's", 'up', 'w', 'flt', '4?', 'Brothers', 'fiancé', 'sitting', 'on', 'board', 'for', '30mins', 'w', 'tech

Contraccion con  she's  a  she is  en  ['', 'Thanks!', 'Her', 'flight', 'leaves', 'at', '2', 'but', "she's", 'arriving', 'to', 'the', 'airport', 'early.', 'Wedding', 'is', 'in', 'VT', 'in', 'Sept.', 'Grateful', 'you', 'fly', 'to', 'BTV!!', ':)']
Contraccion con  aren't  a  are not  en  ['', 'is', 'a', 'DM', 'possible', 'if', 'you', "aren't", 'following', 'me?']
Contraccion con  you're  a  you are  en  ['', "you're", 'killing', 'me', 'from', 'the', 'inside']
Contraccion con  can't  a  cannot  en  ['', 'flight', '645', 'to', 'Phoenix', 'deboards', 'passengers', 'going', 'to', 'Salt', 'Lake', 'City', 'because', 'they', "can't", 'resolve', 'a', 'simple', 'bathroom', 'issue.', '']
Contraccion con  isn't  a  is not  en  ['', 'always', 'fun', 'to', 'get', 'screwed', 'out', 'of', 'an', 'earlier', 'flight', '(that', "isn't", 'full)', '', '', '']
Contraccion con  haven't  a  have not  en  ['', 'ok...i', 'know', 'doors', 'close', '~10', 'minutes', 'before', 'takeoff.', 'stop', 'telling', 'me', 'm

Contraccion con  doesn't  a  does not  en  ['', '', 'ok', 'the', 'app', "doesn't", 'seem', 'to', 'be', 'working.', 'just', 'use', 'the', 'mobile', 'site...', '']
Contraccion con  I'm  a  I am  en  ['', "I'm", 'supposed', 'to', 'fly', 'through', 'Dallas.', 'Can', 'you', 'help', 'me', 'get', 'a', 'new', 'itinerary?']
Contraccion con  can't  a  cannot  en  ['', 'I', 'called', 'more', 'than', '25', 'times', 'to', 'redeem', 'mile', 'points', 'and', "can't", 'get', 'through.', '', 'You', 'advertise', 'the', 'miles', 'but', 'make', 'them', 'very', 'hard', 'to', 'use!']
Contraccion con  I've  a  I have  en  ['', 'I', 'need', 'to', 'speak', 'with', 'a', 'live', 'person.', "I've", 'had', 'it', 'with', 'the', 'recordings.', 'I', 'was', 'told', 'on', 'Sat', 'that', "we'd", 'have', 'our', 'luggage', 'by', 'yest.']
Contraccion con  we'd  a  we would  en  ['', 'I', 'need', 'to', 'speak', 'with', 'a', 'live', 'person.', 'I have', 'had', 'it', 'with', 'the', 'recordings.', 'I', 'was', 'told', 'on', 'Sa

Contraccion con  I'm  a  I am  en  ['', 'guys', 'I', 'need', 'help', 'my', 'reservations,', 'tried', 'calling', 'and', "I'm", 'told', 'to', 'call', 'Late', 'Flightr.', 'Please', 'help', 'me', 'here.']
Contraccion con  I'd  a  I would  en  ['', 'this', 'very', 'pregnant', "lady's", 'hoping', '&amp;praying', 'hubbys', 'flight', 'from', 'BWI', 'gets', 'off', 'the', 'ground!', "I'd", 'like', 'him', 'to', 'get', 'here', 'before', 'baby', 'does!']
Contraccion con  can't  a  cannot  en  ['', 'to', 'a', 'booked', 'hotel', 'for', 'the', 'night', 'so', 'I', 'had', 'to', 'find', 'him', 'a', 'room', 'at', 'midnight.', 'Then', 'say', 'his', 'bags', 'will', 'be', 'on', 'the', 'plane,', 'but', "can't", '(2/3)']
Contraccion con  haven't  a  have not  en  ['', "haven't", 'eaten', 'all', 'day', 'either', 'so', 'lemme', 'get', 'that', 'CC#', 'so', 'I', 'can', 'buy', 'a', 'burger', 'across', 'from', 'the', 'terminal.', '', '', '']
Contraccion con  I'm  a  I am  en  ['', 'i', 'have', 'been', 'trying', 'ALL

Contraccion con  won't  a  will not  en  ['', 'been', 'on', 'hold', 'for', 'over', 'an', 'hour,', 'still', 'do not', 'know', 'where', 'my', 'bags', 'are,', "won't", 'refund', 'me', 'my', 'flight?!?!', 'It', "wasn't", 'weather!!']
Contraccion con  wasn't  a  was not  en  ['', 'been', 'on', 'hold', 'for', 'over', 'an', 'hour,', 'still', 'do not', 'know', 'where', 'my', 'bags', 'are,', 'will not', 'refund', 'me', 'my', 'flight?!?!', 'It', "wasn't", 'weather!!']
Contraccion con  I've  a  I have  en  ['', '4', 'flights', 'in', '48hrs', '&amp;', "I've", 'had', 'the', 'same', 'flight', 'attendant', 'for', '3', 'of', 'those', 'flights.', 'Freaky', 'coincidence!', 'Plus', 'side', "she's", 'great.', ':)']
Contraccion con  she's  a  she is  en  ['', '4', 'flights', 'in', '48hrs', '&amp;', 'I have', 'had', 'the', 'same', 'flight', 'attendant', 'for', '3', 'of', 'those', 'flights.', 'Freaky', 'coincidence!', 'Plus', 'side', "she's", 'great.', ':)']
Contraccion con  I'm  a  I am  en  ['', 'can', 'yo

Contraccion con  don't  a  do not  en  ['', 'what', 'I', 'have', 'to', 'say', 'is', 'more', 'than', '140', 'characters!', 'Plus', 'you', "don't", 'follow', 'me']
Contraccion con  we've  a  we have  en  ['', "we've", 'already', 'made', 'other', 'arrangements', 'ourselves.']
Contraccion con  aren't  a  are not  en  ['', 'why', 'would', 'I', 'pay', '$200', 'to', 'reactivate', 'my', 'points', 'that', 'are', 'only', 'useful', 'for', 'certain', 'flights', 'that', "aren't", 'even', 'worth', '$200?']
Contraccion con  it's  a  it is  en  ['', "it's", 'not', 'just', 'frustrating--it', 'was', 'PAID', 'for!', 'how', 'do', 'we', 'get', 'a', 'refund?']
Contraccion con  haven't  a  have not  en  ['.', 'can', 'you', 'connect', 'me', 'to', 'a', 'person', 'without', 'having', 'to', 'wait', '2+', 'hours', 'on', 'hold?', 'I', 'still', "haven't", 'been', 'able', 'to', 'resolve', 'the', 'problem.']
Contraccion con  could've  a  could have  en  ['', 'You', 'are', 'jumping', 'the', 'gun', 'and', 'Cancelled', 

Contraccion con  they're  a  they are  en  ['', 'I', 'understand', "they're", 'busy', '&amp;', 'doing', 'their', 'best.', 'My', 'frustration', 'is', 'an', 'automated', 'call', 'changing', 'my', 'flight', 'w/o', 'allowing', 'me', 'to', 'talk']
Contraccion con  I've  a  I have  en  ['', "I've", 'been', 'trying', 'to', 'get', 'through', 'to', 'reservations', 'since', 'yesterday', 'to', 'make', 'a', 'change', 'to', 'a', 'reservation', 'hold,', 'when', 'will', 'it', 'be', 'up?']
Contraccion con  can't  a  cannot  en  ['', 'I', 'still', "can't", 'get', 'through', 'to', 'change', 'my', 'flight.', 'This', 'is', 'really', 'important', 'plz', 'help!']
Contraccion con  I've  a  I have  en  ['', 'most', 'disgusting', 'chicken', 'entree', "I've", 'ever', 'seen.', '', 'Your', 'standards', 'have', 'seriously', 'nosedived.', '', '']
Contraccion con  can't  a  cannot  en  ['', 'I', 'really', "can't", 'there', 'though', 'to', 'an', 'agent,', 'and', 'I', 'am', 'worried', 'about', 'my', 'reservation.', 'C

Contraccion con  isn't  a  is not  en  ['', 'got', 'call', 'saying', 'flight', 'Cancelled', 'Flightled', 'but', 'checked', 'online', 'and', 'it', "isn't", 'Cancelled', 'Flightled.', 'No', 'where', 'near', 'bad', 'weather.', 'I', 'have', 'job', 'interview']
Contraccion con  you've  a  you have  en  ['', 'not', 'only', 'did', 'you', 'Cancelled', 'Flight', 'our', 'flight', 'from', 'JFK', 'and', 'delay', 'us', 'by', '29', 'hours,', "you've", 'now', 'lost', '2', 'of', 'our', 'bags.', 'Worst', 'airline', 'ever.']
Contraccion con  that's  a  that is  en  ['', 'Your', 'rubbish', 'at', 'Social', 'Media!', 'In', 'the', 'air', 'two', 'hours', 'Late', 'Flight', 'but', "that's", 'better', 'than', 'being', 'Cancelled', 'Flightled!']
Contraccion con  I'm  a  I am  en  ['', 'Thanks', 'guys,', 'you', 'got', 'it.', "I'm", 'heading', 'to', 'Milan', 'on', 'Wednesday,', 'so', 'big', 'week', 'with', 'the', 'AA', 'family', ':)']
Contraccion con  won't  a  will not  en  ['', '', 'Except', 'now', 'there', 'is'

Contraccion con  it's  a  it is  en  ['', 'fair', 'enough.', 'But', 'they', 'could', 'have', 'at', 'least', 'told', 'us.', 'Once', 'again', "it's", 'the', 'lack', 'of', 'communication.', 'Delays', 'happen,', 'just', 'tell', 'me']
Contraccion con  haven't  a  have not  en  ['', 'even', 'with', 'calls', 'you', "haven't", 'been', 'able', 'to', 'help', 'us', 'anyway.', '']
Contraccion con  didn't  a  did not  en  ['', 'Cancelled', 'Flightled', 'my', 'flight,', "didn't", 'notify', 'me,', 'and', 'I', 'can', 'only', 'rebook', 'by', 'phone,', 'but', "can't", 'get', 'through', 'to', 'an', 'agent.', 'Epic.', 'Fail.']
Contraccion con  can't  a  cannot  en  ['', 'Cancelled', 'Flightled', 'my', 'flight,', 'did not', 'notify', 'me,', 'and', 'I', 'can', 'only', 'rebook', 'by', 'phone,', 'but', "can't", 'get', 'through', 'to', 'an', 'agent.', 'Epic.', 'Fail.']
Contraccion con  she's  a  she is  en  ['', 'i', 'was', 'just', 'severely', 'upset', 'by', 'the', 'rude', 'cs', 'rep.', 'I', 'get', "she's", 'p

0         What  said.                                                                                                                              
1         plus you have added commercials to the experience... tacky.                                                                              
2         I did not today... Must mean I need to take another trip!                                                                                
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                         
4         and it is a really big bad thing about it                                                                                                
5         seriously would pay $30 a flight for seats that did not have this playing.\r\nit's really the only bad thing about flying VA             
6         yes, nearly every time I fly VX this “ear worm” will not go away :)                                   

## Tratamiento de emoticones y emojis

Originalmente pensaba que algun analizador los podria detectar, pero despues de leer algunos artículos descubri que es mejor interpretarlos (convertirlos a palabras que expresen el sentimiento del emoticon). Esto es clave para medir la polaridad de un mensaje

### Interpretación de emoticones

In [11]:
diccionario_emoticones = {
        ":)":"smiley",
        ":‑)":"smiley",
        ":-]":"smiley",
        ":-3":"smiley",
        ":->":"smiley",
        "8-)":"smiley",
        ":-}":"smiley",
        ":)":"smiley",
        ":]":"smiley",
        ":3":"smiley",
        ":>":"smiley",
        "8)":"smiley",
        ":}":"smiley",
        ":o)":"smiley",
        ":c)":"smiley",
        ":^)":"smiley",
        "=]":"smiley",
        "=)":"smiley",
        ":-))":"smiley",
        ":-D":"smiley",
        "8‑D":"smiley",
        "x‑D":"smiley",
        "X‑D":"smiley",
        ":D":"smiley",
        "8D":"smiley",
        "xD":"smiley",
        "XD":"smiley",
        ":-d":"smiley",
        "8‑d":"smiley",
        "x‑d":"smiley",
        "X‑d":"smiley",
        ":d":"smiley",
        "8d":"smiley",
        "xd":"smiley",
        "Xd":"smiley",
        ":‑(":"sad",
        ":‑c":"sad",
        ":‑<":"sad",
        ":‑[":"sad",
        ":(":"sad",
        ":c":"sad",
        ":<":"sad",
        ":[":"sad",
        ":-||":"sad",
        ">:[":"sad",
        ":{":"sad",
        ":@":"sad",
        ">:(":"sad",
        ":'‑(":"sad",
        ":'(":"sad",
        ":‑P":"playful",
        "X‑P":"playful",
        "x‑p":"playful",
        ":‑p":"playful",
        ":‑Þ":"playful",
        ":‑þ":"playful",
        ":‑b":"playful",
        ":P":"playful",
        "XP":"playful",
        "xp":"playful",
        ":p":"playful",
        ":Þ":"playful",
        ":þ":"playful",
        ":b":"playful",
        ";p":"playful",
        "<3":"love",
}

In [12]:
# Creando un conjunto de emoticones
conjunto_emoticones = set(diccionario_emoticones.keys())

In [13]:
def traducir_emoticones(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in conjunto_emoticones:
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = diccionario_emoticones[palabra]
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [14]:
data_noEmoticones = data_noContracciones.apply(lambda x: traducir_emoticones(x))
data_noEmoticones

0         What  said.                                                                                                                              
1         plus you have added commercials to the experience... tacky.                                                                              
2         I did not today... Must mean I need to take another trip!                                                                                
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                         
4         and it is a really big bad thing about it                                                                                                
5         seriously would pay $30 a flight for seats that did not have this playing.\r\nit's really the only bad thing about flying VA             
6         yes, nearly every time I fly VX this “ear worm” will not go away smiley                               

### Codigo para remover emojis

In [15]:
# data_noEmoji = data_noHashtag.str.replace("["
#                            u"\U0001F600-\U0001F64F"  # emojis
#                            u"\U0001F300-\U0001F5FF"  # simbolos & pictografos
#                            u"\U0001F680-\U0001F6FF"  # simbolos de transporte y mapas
#                            u"\U0001F1E0-\U0001F1FF"  # banderas (iOS)
#                            u"\U00002702-\U000027B0"
#                            u"\U000024C2-\U0001F251"
#                            "]+", "")

### Interpretación de emojis

In [16]:
import emoji
data_noEmojis = data_noEmoticones.apply(lambda x: emoji.demojize(x))
data_noEmojis = data_noEmojis.str.replace(":"," ")
data_noEmojis

0         What  said.                                                                                                                              
1         plus you have added commercials to the experience... tacky.                                                                              
2         I did not today... Must mean I need to take another trip!                                                                                
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                         
4         and it is a really big bad thing about it                                                                                                
5         seriously would pay $30 a flight for seats that did not have this playing.\r\nit's really the only bad thing about flying VA             
6         yes, nearly every time I fly VX this “ear worm” will not go away smiley                               

## Remover signos de puntuacion

In [17]:
data_noPunctuation = data_noEmojis.str.replace("[\.\,\!\?\:\;\-\=]", " ")
data_noPunctuation = data_noPunctuation.str.replace(" +"," ") # Reducir los espacios a solo 1
data_noPunctuation

0         What said                                                                                                                              
1         plus you have added commercials to the experience tacky                                                                                
2         I did not today Must mean I need to take another trip                                                                                  
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp they have little recourse                        
4         and it is a really big bad thing about it                                                                                              
5         seriously would pay $30 a flight for seats that did not have this playing \r\nit's really the only bad thing about flying VA           
6         yes nearly every time I fly VX this “ear worm” will not go away smiley                                            

## Convertir mayúsculas a minúsculas

In [18]:
data_lower = data_noPunctuation.str.lower() # Convertir todo el texto de la columna "text" a minusculas
data_lower # Mostrar

0         what said                                                                                                                              
1         plus you have added commercials to the experience tacky                                                                                
2         i did not today must mean i need to take another trip                                                                                  
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp they have little recourse                        
4         and it is a really big bad thing about it                                                                                              
5         seriously would pay $30 a flight for seats that did not have this playing \r\nit's really the only bad thing about flying va           
6         yes nearly every time i fly vx this “ear worm” will not go away smiley                                            

## Interpretación de Slang (abreviaturas)

### Web Scrapping de los acronimos de Netlingo

In [19]:
#from bs4 import BeautifulSoup
#import requests, json
#resp = requests.get("http://www.netlingo.com/acronyms.php")
#soup = BeautifulSoup(resp.text, "html.parser")
#slangdict = {}
#key = ""
#value = ""
#for div in soup.findAll('div', attrs={'class':'list_box3'}):
#    for li in div.findAll('li'):
#        for a in li.findAll('a'):
#            key = a.text
#        value = li.text.split(key)[1]
#        slangdict[key.upper()] = value
#with open('myslang.json','w') as find:
#    json.dump(slangdict, find, indent = 2)

### Leer el archivo de Slang en JSON

Descubrí que al tener un diccionario más amplio, afecta negativamente al análisis semántico porque traduce palabras como "TIME" a "Tears In My Ears" cuando en realidad el texto se refiere al "Tiempo". Así que no lo voy a usar, pero dejo el código para futuras referencias.

In [20]:
# slang = pd.read_json("myslang.json", typ = "series")
# # slang.to_frame('count') #para convertir a DataFrame
# slang_df = slang.reset_index()
# slang_np = slang_df["index"].to_numpy()
# slang_list = slang_np.tolist()
# slang_set = set(slang_list)

In [21]:
#import re
# def translator(user_string):
#     user_string = user_string.split(" ")
#     j = 0
#     for _str in user_string:
#         # Removiendo carácteres especiales
#         #_str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
#         _str = _str.upper()
#         # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
#         if _str in slang_set:
#             print("entro en ", user_string, " con: ", _str, " a ", slang[_str].lower())
#             # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
#             user_string[j] = slang[_str].lower()
#         j = j + 1
#     # Retorna la cadena corregida
#     return ' '.join(user_string)

In [22]:
# data_noSlang = data_lower.apply(lambda x: translator(x))
# data_noSlang

### Si los slangs hubieran estado en un TXT

In [23]:
# Lectura de archivo
slang_df = pd.read_csv("slang.txt", sep = "=")
slang_df.columns = ["Slang", "Meaning"]

# Crear conjunto de Slangs
slang_np = slang_df["Slang"].to_numpy()
slang_list = slang_np.tolist()
slang_set = set(slang_list)

# Hacer que la columna "Slang" sean los indices (para busquedas)
slang_df = slang_df.set_index('Slang')
slang = slang_df["Meaning"]

In [24]:
def traducir_slang(texto):
    texto = texto.split(" ")
    j = 0
    for palabra in texto:
        palabra = palabra.upper()
        # Checa si las palabras seleccionadas coinciden con el connjunto de emoticones
        if palabra in slang_set:
            #print("Slang en ", texto, " con ", palabra, " a ", slang[palabra])
            # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
            texto[j] = slang[palabra].lower()
        j = j + 1
    # Retorna la cadena corregida
    return ' '.join(texto)

In [25]:
data_noSlang = data_lower.apply(lambda x: traducir_slang(x))
data_noSlang

0         what said                                                                                                                                  
1         plus you have added commercials to the experience tacky                                                                                    
2         i did not today must mean i need to take another trip                                                                                      
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp they have little recourse                            
4         and it is a really big bad thing about it                                                                                                  
5         seriously would pay $30 a flight for seats that did not have this playing \r\nit's really the only bad thing about flying va               
6         yes nearly every time i fly vx this “ear worm” will not go away smiley                    

## Reducción de carácteres repetidos

Como "haaapppyyyy" a "haappyy"

In [26]:
data_noRepeated = data_noSlang.transform(lambda x: re.sub(r'(.)\1+', r'\1\1', x))
data_noRepeated

0         what said                                                                                                                                  
1         plus you have added commercials to the experience tacky                                                                                    
2         i did not today must mean i need to take another trip                                                                                      
3         it is really aggressive to blast obnoxious "entertainment" in your guests' faces &amp they have little recourse                            
4         and it is a really big bad thing about it                                                                                                  
5         seriously would pay $30 a flight for seats that did not have this playing \r\nit's really the only bad thing about flying va               
6         yes nearly every time i fly vx this “ear worm” will not go away smiley                    

## Remover StopWords

Son palabras que no aportan valor al analizar sentimientos, en Inglés serían palabras como "are, you, have, etc"

In [27]:
#import nltk
#nltk.download('stopwords')
#from nltk.corpus import stopwords
stop = stopwords.words("english")
stop_set = set(stop)
data_noStopwords = data_noRepeated.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_set)]))
data_noStopwords

0        said                                                                                                              
1        plus added commercials experience tacky                                                                           
2        today must mean need take another trip                                                                            
3        really aggressive blast obnoxious "entertainment" guests' faces &amp little recourse                              
4        really big bad thing                                                                                              
5        seriously would pay $30 flight seats playing really bad thing flying va                                           
6        yes nearly every time fly vx “ear worm” go away smiley                                                            
7        really missed prime opportunity men without hats parody                                                           
8       

## Stemming (volver a las palabras a su respectiva palabra raiz)

Existen diferentes tipos de Stemmers, para el lenguaje Inglés,  podemos encontrar 2 de las más populares en la librería NLTK

### Porter Stemmer

Es conocido por su simplicidad y velocidad

In [28]:
#from nltk.stem import PorterStemmer

In [29]:
ps = PorterStemmer()
data_PorterStemming = data_noStopwords.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
data_PorterStemming

0        said                                                                                         
1        plu ad commerci experi tacki                                                                 
2        today must mean need take anoth trip                                                         
3        realli aggress blast obnoxi "entertainment" guests' face &amp littl recours                  
4        realli big bad thing                                                                         
5        serious would pay $30 flight seat play realli bad thing fli va                               
6        ye nearli everi time fli vx “ear worm” go away smiley                                        
7        realli miss prime opportun men without hat parodi                                            
8        well didn't…but smiley                                                                       
9        amaz arriv hour earli good                                      

### LancasterStemmer

Es conocido por ser simple, pero tambien en ser muy duro al stemmizar, ya que realiza iteraciones y podría ocurrir una sobre-stemmización

In [30]:
#from nltk.stem import LancasterStemmer

In [31]:
ls = LancasterStemmer()
data_LancasterStemming = data_noStopwords.apply(lambda x: ' '.join([ls.stem(word) for word in x.split()]))
data_LancasterStemming

0        said                                                                                    
1        plu ad commerc expery tacky                                                             
2        today must mean nee tak anoth trip                                                      
3        real aggress blast obnoxy "entertainment" guests' fac &amp littl recours                
4        real big bad thing                                                                      
5        sery would pay $30 flight seat play real bad thing fly va                               
6        ye near every tim fly vx “ear worm” go away smiley                                      
7        real miss prim opportun men without hat parody                                          
8        wel didn't…but smiley                                                                   
9        amaz ar hour ear good                                                                   
10       know suicid

Sin embargo, ambos stemmers por si solos devuelven la cadena completa como si se tratara de una palabra:

' plu ad commerc experience.. tacky.'

Cuando debería ser:

['plu' 'ad' 'commerc' 'experience' 'tacky']

Para lograr ello realizamos una "Tokenización"

## Tokenización

### Porter Stemmer

In [32]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ps.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noStopwords.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0        [said]                                                                                                        
1        [plu, ad, commerci, experi, tacki]                                                                            
2        [today, must, mean, need, take, anoth, trip]                                                                  
3        [realli, aggress, blast, obnoxi, ``, entertain, '', guest, ', face, &, amp, littl, recours]                   
4        [realli, big, bad, thing]                                                                                     
5        [serious, would, pay, $, 30, flight, seat, play, realli, bad, thing, fli, va]                                 
6        [ye, nearli, everi, time, fli, vx, “, ear, worm, ”, go, away, smiley]                                         
7        [realli, miss, prime, opportun, men, without, hat, parodi]                                                    
8        [well, didn't…but, smiley]     

### Lancaster Stemmer

In [33]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ls.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noStopwords.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0        [said]                                                                                                   
1        [plu, ad, commerc, expery, tacky]                                                                        
2        [today, must, mean, nee, tak, anoth, trip]                                                               
3        [real, aggress, blast, obnoxy, ``, entertain, '', guest, ', fac, &, amp, littl, recours]                 
4        [real, big, bad, thing]                                                                                  
5        [sery, would, pay, $, 30, flight, seat, play, real, bad, thing, fly, va]                                 
6        [ye, near, every, tim, fly, vx, “, ear, worm, ”, go, away, smiley]                                       
7        [real, miss, prim, opportun, men, without, hat, parody]                                                  
8        [wel, didn't…but, smiley]                                              

## Lemmatization (es el Stemming pero con otro proceso)

Se desarrollara el Lemmatization para ver si con este proceso se obtienen mejores resultados. El siguiente código tambien incluye la Tokenización

In [34]:
#Fuente: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
#from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(oracion):
    wordnet_lemmatizer = WordNetLemmatizer()
    punctuations = "?:!.,;$\"\'\´\``\”\“\''"
    resultado = []
    sentence_words = nltk.word_tokenize(oracion)
    for word in sentence_words:
        if word in punctuations:
            sentence_words.remove(word)
            continue
        resultado.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    return resultado

data_lemmatized = data_noStopwords.apply(lambda x: lemmatization(x))
data_lemmatized

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0        [say]                                                                                                                        
1        [plus, add, commercials, experience, tacky]                                                                                  
2        [today, must, mean, need, take, another, trip]                                                                               
3        [really, aggressive, blast, obnoxious, &, amp, little, recourse]                                                             
4        [really, big, bad, thing]                                                                                                    
5        [seriously, would, pay, flight, seat, play, really, bad, thing, fly, va]                                                     
6        [yes, nearly, every, time, fly, vx, worm, away, smiley]                                                                      
7        [really, miss, prime, opportunity, men, withou

## Part Of Speech Tagging (POS)

Sirve para etiquetar cada palabra en la oración como verbo, sustantivo o pronombre, etc.

In [35]:
#Fuente: https://towardsdatascience.com/basic-data-cleaning-engineering-session-twitter-sentiment-data-95e5bd2869ec
nltk.download('averaged_perceptron_tagger')
data_POS = data_noStopwords.apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
data_POS

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


0        [(said, VBD)]                                                                                                                                                                                                                                     
1        [(plus, CC), (added, JJ), (commercials, NNS), (experience, NN), (tacky, NN)]                                                                                                                                                                      
2        [(today, NN), (must, MD), (mean, VB), (need, MD), (take, VB), (another, DT), (trip, NN)]                                                                                                                                                          
3        [(really, RB), (aggressive, JJ), (blast, NN), (obnoxious, JJ), (``, ``), (entertainment, NN), ('', ''), (guests, NNS), (', POS), (faces, VBZ), (&, CC), (amp, JJ), (little, JJ), (recourse, NN)]                                           

# Feature Extraction

"En este caso, puedes definir una característica por cada palabra, indicando si el documento contiene esa palabra. Para ponerle un número limite de características que el clasificador necesita procesar, se empieza por construir una lista de las 2000 palabras mas frecuentes en el corpus en general"

Fuente: http://www.nltk.org/book/ch06.html

Primero necesitamos hacer una lista de todas las palabras (**Bag of Words**)

Como tengo un objeto de tipo "Series" de pandas, primero necesito convertirlo a una lista, para crear así, una **lista de listas**

In [36]:
l = data_lemmatized.tolist()
# data_lemmatized_prepared = data_lemmatized.apply(lambda x: ' '.join(x))
# data_lemmatized_prepared

Y crear una lista con todas las palabras, iterando la lista de listas y adjuntandolas a una nueva lista unidimensional

In [37]:
all_words = [item for sublist in l for item in sublist]

In [38]:
#all_words

In [39]:
# Definir el feature extractor

# Utilizar FreqDist para encontrar las palabras más utilizadas en todos los documentos
all_words_freq = nltk.FreqDist(all_words)

# Y tomar los primeros 2000
word_features = list(all_words_freq)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

### Bag of Words

In [40]:
#word_features

## Ejecutando la funcion

In [41]:
#document_features(word_features)

## Pivoteo

Ahora necesitamos crear una estructura en donde las filas sean los documentos y las columnas cada palabra en ese documento con su respectiva clasificación

### Convertir los valores de Sentiment a 0 y 1

In [42]:
sentiment = data["airline_sentiment"].replace(to_replace=["positive","neutral","negative"], value=[1,0,-1])
#sentiment

### Eliminar Columnas inecesarias

De esta manera solo conservamos las columnas que queremos tener

In [43]:
bag_sentiment = pd.DataFrame(dict(data_lemmatized = data_lemmatized, sentiment = sentiment))
#bag_sentiment

# Conjunto de Entrenamiento y de Prueba

In [44]:
bag = bag_sentiment.values.tolist()
feature_sets = [(document_features(d), c) for (d,c) in bag]
train_set, test_set = feature_sets[10:], feature_sets[:10]

# Entrenamiento

In [45]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluación

In [46]:
print(nltk.classify.accuracy(classifier, test_set))

0.7


In [47]:
classifier.show_most_informative_features(5)

Most Informative Features
      contains(favorite) = True                1 : -1     =     37.6 : 1.0
       contains(helpful) = True                1 : 0      =     35.4 : 1.0
      contains(passbook) = True                1 : -1     =     35.0 : 1.0
     contains(thumbs_up) = True                1 : -1     =     29.4 : 1.0
         contains(daily) = True                0 : -1     =     28.6 : 1.0
