# Preprocesamiento con Python

Para el preprocesamiento de texto utilizaremos las librerías de Python:
- **Numpy**: para el procesamiento de las operaciones en los Dataframes y Series de Pandas
- **Pandas**: para la manipulación de los datos
- **NLTK**: para el procesamiento de texto por medio de las StopWords, Stemming, Lemmatization y POS tag
- **re**: filtrar datos con para expresiones regulares 

## Lectura de datos con Pandas

In [1]:
# Importacion de librerias
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# Permite desplegar el texto completo en Jupyter
pd.set_option('display.max_colwidth', -1)

In [2]:
# Lectura de CSV
data = pd.read_csv("Tweets_pg_prepared.csv")
data.tail() # Muestra los datos

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,text,tweet_coord,tweet_created,tweet_location,user_timezone
14635,5.69588e+17,positive,0.3487,Can't Tell,0.0,American,No value,KristenReenders,No Value,@AmericanAir thank you we got on a different flight to Chicago.,"[0.0, 0.0]",22/02/2015 12:01,No value,No Value
14636,5.69587e+17,negative,1.0,Customer Service Issue,1.0,American,No value,itsropes,No Value,@AmericanAir leaving over 20 minutes Late Flight. No warnings or communication until we were 15 minutes Late Flight. That's called shitty customer svc,"[0.0, 0.0]",22/02/2015 11:59,Texas,No Value
14637,5.69587e+17,neutral,1.0,Can't Tell,0.0,American,No value,sanyabun,No Value,@AmericanAir Please bring American Airlines to #BlackBerry10,"[0.0, 0.0]",22/02/2015 11:59,"Nigeria,lagos",No Value
14638,5.69587e+17,negative,1.0,Customer Service Issue,0.6659,American,No value,SraJackson,No Value,"@AmericanAir you have my money, you change my flight, and don't answer your phones! Any other suggestions so I can make my commitment??","[0.0, 0.0]",22/02/2015 11:59,New Jersey,Eastern Time (US & Canada)
14639,5.69587e+17,neutral,0.6771,Can't Tell,0.0,American,No value,daviddtwu,No Value,@AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?,"[0.0, 0.0]",22/02/2015 11:58,"dallas, TX",No Value


In [3]:
data["text"] # Mostrar los datos de la columna "text"

0        @VirginAmerica What @dhepburn said.                                                                                                                   
1        @VirginAmerica plus you've added commercials to the experience... tacky.                                                                              
2        @VirginAmerica I didn't today... Must mean I need to take another trip!                                                                               
3        @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @VirginAmerica and it's a really big bad thing about it                                                                                               
5        @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA            
6        @VirginAmerica yes, nearly ever

## Convertir mayúsculas a minúsculas

In [4]:
data_lower = data["text"].str.lower() # Convertir todo el texto de la columna "text" a minusculas
data_lower # Mostrar

0        @virginamerica what @dhepburn said.                                                                                                                   
1        @virginamerica plus you've added commercials to the experience... tacky.                                                                              
2        @virginamerica i didn't today... must mean i need to take another trip!                                                                               
3        @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @virginamerica and it's a really big bad thing about it                                                                                               
5        @virginamerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying va            
6        @virginamerica yes, nearly ever

## Remover URLs (Regex)

Explicación de Regex
1. **(http|https|ftp)**: Detectar si empieza con alguno de estos protocolos
2. **://**: Seguido de un "://"
3. **[a-zA-Z0-9\\./]**: E inmediatamente empieza una palabra seguido de un punto (.) una o mas veces (de esta manera se incluye el (.com y variantes)

In [5]:
data_noURL = data_lower.str.replace('(http|https|ftp)://[a-zA-Z0-9\\./]+',"")
data_noURL

0        @virginamerica what @dhepburn said.                                                                                                                   
1        @virginamerica plus you've added commercials to the experience... tacky.                                                                              
2        @virginamerica i didn't today... must mean i need to take another trip!                                                                               
3        @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                        
4        @virginamerica and it's a really big bad thing about it                                                                                               
5        @virginamerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying va            
6        @virginamerica yes, nearly ever

## Remover referencias (@Usernames)

Explicación regex
1. **@**: Si empieza con arroba (@)
2. **(\w+)**: y le sigue una o más palabras

In [6]:
data_noUser = data_noURL.str.replace('@(\w+)',"")
data_noUser

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying va              
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover hashtags

Explicación regex
1. **#**: Si empieza con arroba (#)
2. **(\w+)**: y le sigue una o más palabras

In [7]:
data_noHashtag = data_noUser.str.replace('#(\w+)',"")
data_noHashtag

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying va              
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover emoticones

NOTA: Falta averiguar si algún analizador de sentimientos le sirven los emoticones, o tal vez se puedan traducir los emojis por alguna palabra que exprese su sentimiento.

In [8]:

data_noEmoji = data_noHashtag.str.replace("["
                           u"\U0001F600-\U0001F64F"  # emoticones
                           u"\U0001F300-\U0001F5FF"  # simbolos & pictografos
                           u"\U0001F680-\U0001F6FF"  # simbolos de transporte y mapas
                           u"\U0001F1E0-\U0001F1FF"  # banderas (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", "")
data_noEmoji

0         what  said.                                                                                                                              
1         plus you've added commercials to the experience... tacky.                                                                                
2         i didn't today... must mean i need to take another trip!                                                                                 
3         it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse                          
4         and it's a really big bad thing about it                                                                                                 
5         seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying va              
6         yes, nearly every time i fly vx this “ear worm” won’t go away :)                                      

## Remover StopWords

Son palabras que no aportan valor al analizar sentimientos, en Inglés serían palabras como "are, you, have, etc"

In [9]:
#import nltk
#nltk.download('stopwords')
#from nltk.corpus import stopwords
stop = stopwords.words("english")
data_noStopwords = data_noEmoji.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data_noStopwords

0        said.                                                                                                           
1        plus added commercials experience... tacky.                                                                     
2        today... must mean need take another trip!                                                                      
3        really aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse                           
4        really big bad thing                                                                                            
5        seriously would pay $30 flight seats playing. really bad thing flying va                                        
6        yes, nearly every time fly vx “ear worm” won’t go away :)                                                       
7        really missed prime opportunity men without hats parody, there.                                                 
8        well, didn't…bu

## Interpretación de Slang (abreviaturas)

### Web Scrapping de los acronimos de Netlingo

In [10]:
# from bs4 import BeautifulSoup
# import requests, json
# resp = requests.get("http://www.netlingo.com/acronyms.php")
# soup = BeautifulSoup(resp.text, "html.parser")
# slangdict = {}
# key = ""
# value = ""
# for div in soup.findAll('div', attrs={'class':'list_box3'}):
#     for li in div.findAll('li'):
#         for a in li.findAll('a'):
#             key = a.text
#         value = li.text.split(key)[1]
#         slangdict[key] = value
# with open('myslang.json','w') as find:
#     json.dump(slangdict, find, indent = 2)

### Leer el archivo de Slang en JSON

In [11]:
slang = pd.read_json("myslang.json", typ = "series")
# slang.to_frame('count') #para convertir a DataFrame
slang

!               I have a comment                             
#FF             Follow Friday                                
(U)             it means arms around you, hug for you        
*$              Starbucks                                    
**//            it means wink wink, nudge nudge              
,!!!!           Talk to the hand                             
/R/             Requesting                                   
02              Your (or my) two cents worth, also seen as m.
10Q             Thank you                                    
1174            Nude club                                    
121             One to one                                   
123             it means I agree                             
1337            Elite -or- leet -or- L337                    
14              it refers to the fourteen words              
142n8ly         Unfortunately                                
143             I love you                                   
1432    

In [12]:
#import re
def translator(user_string):
    user_string = user_string.split(" ")
    j = 0
    slang = pd.read_json("myslang.json", typ = "series")
    for _str in user_string:
        # Archivo con las abreviaturas y su traducción
        fileName = "myslang.json"
        # Removiendo carácteres especiales
        _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
        for k in slang.keys():
        # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
            if _str.upper() == k:
                # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
                user_string[j] = slang[k].lower()
    j = j + 1
    # Retorna la cadena corregida
    return ' '.join(user_string)

In [13]:
data_noSlang = data_noStopwords.apply(lambda x: translator(x))
data_noSlang

0        said.                                                                                                            
1        plus added commercials experience... tacky.                                                                      
2        today... must mean need take another trip!                                                                       
3        accelerated mobile pages aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse          
4        really big bad thing                                                                                             
5        seriously would pay $30 flight seats playing. really bad thing flying va                                         
6        tears in my eyes nearly every time fly vx “ear worm” won’t go away :)                                            
7        really missed prime opportunity men without hats parody, there.                                                  
8        well, d

### Si los slangs hubieran estado en un TXT

In [14]:
#import csv, rea

In [15]:
# def translator(user_string):
#     user_string = user_string.split(" ")
#     j = 0
#     for _str in user_string:
#         # Archivo con las abreviaturas y su traducción
#         fileName = "slang.txt"
#         # Modo de Acceso al archivo (lectura)
#         accessMode = "r"
#         with open(fileName, accessMode) as myCSVfile:
#             # Leer un archivo como un CSV con delimitador como "=", para que la abreviacion sea guardada en row[0] y las frases en row[1]
#             dataFromFile = csv.reader(myCSVfile, delimiter="=")
#             # Removiendo carácteres especiales
#             _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
#             for row in dataFromFile:
#                 # Checa si las palabras seleccionadas coinciden con las abreviaturas en el archivo de "slang.txt"
#                 if _str.upper() == row[0]:
#                     # Si encuentra una coincidencia, la reemplaza con su respectiva traducción
#                     user_string[j] = row[1].lower()
#             myCSVfile.close()
#         j = j + 1
#     # Retorna la cadena corregida
#     return ' '.join(user_string)

In [16]:
#data_noSlang = data_noStopwords.apply(lambda x: translator(x))
#data_noSlang

## Reducción de carácteres repetidos

Como "haaapppyyyy" a "haappyy"

In [17]:
data_noRepeated = data_noSlang.transform(lambda x: re.sub(r'(.)\1+', r'\1\1', x))
data_noRepeated

0        said.                                                                                                            
1        plus added commercials experience.. tacky.                                                                       
2        today.. must mean need take another trip!                                                                        
3        accelerated mobile pages aggressive blast obnoxious "entertainment" guests' faces &amp; little recourse          
4        really big bad thing                                                                                             
5        seriously would pay $30 flight seats playing. really bad thing flying va                                         
6        tears in my eyes nearly every time fly vx “ear worm” won’t go away :)                                            
7        really missed prime opportunity men without hats parody, there.                                                  
8        well, d

## Stemming (volver a las palabras a su respectiva palabra raiz)

Existen diferentes tipos de Stemmers, para el lenguaje Inglés,  podemos encontrar 2 de las más populares en la librería NLTK

### Porter Stemmer

Es conocido por su simplicidad y velocidad

In [18]:
#from nltk.stem import PorterStemmer

In [19]:
ps = PorterStemmer()
data_PorterStemming = data_noRepeated.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
data_PorterStemming

0        said.                                                                                                
1        plu ad commerci experience.. tacky.                                                                  
2        today.. must mean need take anoth trip!                                                              
3        acceler mobil page aggress blast obnoxi "entertainment" guests' face &amp; littl recours             
4        realli big bad thing                                                                                 
5        serious would pay $30 flight seat playing. realli bad thing fli va                                   
6        tear in my eye nearli everi time fli vx “ear worm” won’t go away :)                                  
7        realli miss prime opportun men without hat parody, there.                                            
8        well, didn't…but do! :-d                                                                             
9

### LancasterStemmer

Es conocido por ser simple, pero tambien en ser muy duro al stemmizar, ya que realiza iteraciones y podría ocurrir una sobre-stemmización

In [20]:
#from nltk.stem import LancasterStemmer

In [21]:
ls = LancasterStemmer()
data_LancasterStemming = data_noRepeated.apply(lambda x: ' '.join([ls.stem(word) for word in x.split()]))
data_LancasterStemming

0        said.                                                                                          
1        plu ad commerc experience.. tacky.                                                             
2        today.. must mean nee tak anoth trip!                                                          
3        accel mobl pag aggress blast obnoxy "entertainment" guests' fac &amp; littl recours            
4        real big bad thing                                                                             
5        sery would pay $30 flight seat playing. real bad thing fly va                                  
6        tear in my ey near every tim fly vx “ear worm” won’t go away :)                                
7        real miss prim opportun men without hat parody, there.                                         
8        well, didn't…but do! :-d                                                                       
9        amazing, ar hour early. good me.              

Sin embargo, ambos stemmers por si solos devuelven la cadena completa como si se tratara de una palabra:

' plu ad commerc experience.. tacky.'

Cuando debería ser:

['plu' 'ad' 'commerc' 'experience' 'tacky']

Para lograr ello realizamos una "Tokenización"

## Tokenización

### Porter Stemmer

In [23]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ps.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noRepeated.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0        [said]                                                                                                       
1        [plu, ad, commerci, experience.., tacki]                                                                     
2        [today.., must, mean, need, take, anoth, trip]                                                               
3        [acceler, mobil, page, aggress, blast, obnoxi, ``, entertain, '', guest, ', face, &, amp, recours]           
4        [realli, big, bad, thing]                                                                                    
5        [serious, would, pay, $, 30, flight, seat, play, bad, thing, fli, va]                                        
6        [tear, in, my, eye, nearli, everi, time, fli, vx, “, ear, worm, ”, won, ’, t, go, away]                      
7        [realli, miss, prime, opportun, men, without, hat, parodi]                                                   
8        [well, do, -d]                         

### Lancaster Stemmer

In [25]:
nltk.download('punkt')
#from nltk.tokenize import sent_tokenize, word_tokenize
def stemOracion(oracion):
    token_words = word_tokenize(oracion)
    token_words
    stem_sentence = []
    punctuations = "?:!.,;"
    for word in token_words:
        if word in punctuations:
            token_words.remove(word)
            continue
        stem_sentence.append(ls.stem(word))
    return stem_sentence


porter_stemmer_tokenized = data_noRepeated.apply(lambda x: stemOracion(x))
porter_stemmer_tokenized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0        [said]                                                                                               
1        [plu, ad, commerc, experience.., tacky]                                                              
2        [today.., must, mean, nee, tak, anoth, trip]                                                         
3        [accel, mobl, pag, aggress, blast, obnoxy, ``, entertain, '', guest, ', fac, &, amp, recours]        
4        [real, big, bad, thing]                                                                              
5        [sery, would, pay, $, 30, flight, seat, play, bad, thing, fly, va]                                   
6        [tear, in, my, ey, near, every, tim, fly, vx, “, ear, worm, ”, won, ’, t, go, away]                  
7        [real, miss, prim, opportun, men, without, hat, parody]                                              
8        [wel, do, -d]                                                                                        
9

## Lemmatization (es el Stemming pero con otro proceso)

Se desarrollara el Lemmatization para ver si con este proceso se obtienen mejores resultados. El siguiente código tambien incluye la Tokenización

In [63]:
#Fuente: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
#from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(oracion):
    wordnet_lemmatizer = WordNetLemmatizer()
    punctuations = "?:!.,;$\"\'\´\``\”\“\''"
    resultado = []
    sentence_words = nltk.word_tokenize(oracion)
    for word in sentence_words:
        if word in punctuations:
            sentence_words.remove(word)
            continue
        resultado.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    return resultado

data_lemmatized = data_noRepeated.apply(lambda x: lemmatization(x))
data_lemmatized

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0        [say]                                                                                                           
1        [plus, add, commercials, experience.., tacky]                                                                   
2        [today.., must, mean, need, take, another, trip]                                                                
3        [accelerate, mobile, page, aggressive, blast, obnoxious, &, amp, recourse]                                      
4        [really, big, bad, thing]                                                                                       
5        [seriously, would, pay, flight, seat, play, bad, thing, fly, va]                                                
6        [tear, in, my, eye, nearly, every, time, fly, vx, worm, ’, t, go, away]                                         
7        [really, miss, prime, opportunity, men, without, hat, parody]                                                   
8        [well, do, -d] 

## Part Of Speech Tagging (POS)

Sirve para etiquetar cada palabra en la oración como verbo, sustantivo o pronombre, etc.

In [26]:
#Fuente: https://towardsdatascience.com/basic-data-cleaning-engineering-session-twitter-sentiment-data-95e5bd2869ec
nltk.download('averaged_perceptron_tagger')
data_POS = data_noRepeated.apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
data_POS

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\aleja_000\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


0        [(said, VBD), (., .)]                                                                                                                                                                                                                                                                                            
1        [(plus, CC), (added, JJ), (commercials, NNS), (experience.., VBP), (tacky, JJ), (., .)]                                                                                                                                                                                                                          
2        [(today.., NN), (must, MD), (mean, VB), (need, MD), (take, VB), (another, DT), (trip, NN), (!, .)]                                                                                                                                                                                                               
3        [(accelerated, VBN), (mobile, JJ), (pages, NNS

# Feature Extraction

"En este caso, puedes definir una característica por cada palabra, indicando si el documento contiene esa palabra. Para ponerle un número limite de características que el clasificador necesita procesar, se empieza por construir una lista de las 2000 palabras mas frecuentes en el corpus en general"

Fuente: http://www.nltk.org/book/ch06.html

Primero necesitamos hacer una lista de todas las palabras (**Bag of Words**)

Como tengo un objeto de tipo "Series" de pandas, primero necesito convertirlo a una lista, para crear así, una **lista de listas**

In [64]:
l = data_lemmatized.tolist()

Y crear una lista con todas las palabras, iterando la lista de listas y adjuntandolas a una nueva lista unidimensional

In [65]:
all_words = [item for sublist in l for item in sublist]

In [66]:
all_words

['say',
 'plus',
 'add',
 'commercials',
 'experience..',
 'tacky',
 'today..',
 'must',
 'mean',
 'need',
 'take',
 'another',
 'trip',
 'accelerate',
 'mobile',
 'page',
 'aggressive',
 'blast',
 'obnoxious',
 '&',
 'amp',
 'recourse',
 'really',
 'big',
 'bad',
 'thing',
 'seriously',
 'would',
 'pay',
 'flight',
 'seat',
 'play',
 'bad',
 'thing',
 'fly',
 'va',
 'tear',
 'in',
 'my',
 'eye',
 'nearly',
 'every',
 'time',
 'fly',
 'vx',
 'worm',
 '’',
 't',
 'go',
 'away',
 'really',
 'miss',
 'prime',
 'opportunity',
 'men',
 'without',
 'hat',
 'parody',
 'well',
 'do',
 '-d',
 'amaze',
 'hour',
 'early',
 'me',
 'know',
 'suicide',
 'second',
 'lead',
 'cause',
 'death',
 'among',
 'teens',
 '10-24',
 'pretty',
 'graphics',
 'better',
 'minimal',
 'iconography',
 'd',
 'deal',
 'think',
 '2nd',
 'trip',
 '&',
 'amp',
 'go',
 '1st',
 'trip',
 'yet',
 'p',
 'instant',
 'message',
 '-or-',
 'immediate',
 'message',
 'fly',
 'sky',
 'again',
 'take',
 'away',
 'travel',
 'thank',
 '

In [67]:
# Definir el feature extractor

# Utilizar FreqDist para encontrar las palabras más utilizadas en todos los documentos
all_words_freq = nltk.FreqDist(all_words)

# Y tomar los primeros 2000
word_features = list(all_words_freq)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

### Bag of Words

In [68]:
word_features

['say',
 'plus',
 'add',
 'commercials',
 'experience..',
 'tacky',
 'today..',
 'must',
 'mean',
 'need',
 'take',
 'another',
 'trip',
 'accelerate',
 'mobile',
 'page',
 'aggressive',
 'blast',
 'obnoxious',
 '&',
 'amp',
 'recourse',
 'really',
 'big',
 'bad',
 'thing',
 'seriously',
 'would',
 'pay',
 'flight',
 'seat',
 'play',
 'fly',
 'va',
 'tear',
 'in',
 'my',
 'eye',
 'nearly',
 'every',
 'time',
 'vx',
 'worm',
 '’',
 't',
 'go',
 'away',
 'miss',
 'prime',
 'opportunity',
 'men',
 'without',
 'hat',
 'parody',
 'well',
 'do',
 '-d',
 'amaze',
 'hour',
 'early',
 'me',
 'know',
 'suicide',
 'second',
 'lead',
 'cause',
 'death',
 'among',
 'teens',
 '10-24',
 'pretty',
 'graphics',
 'better',
 'minimal',
 'iconography',
 'd',
 'deal',
 'think',
 '2nd',
 '1st',
 'yet',
 'p',
 'instant',
 'message',
 '-or-',
 'immediate',
 'sky',
 'again',
 'travel',
 'thank',
 'sfo-pdx',
 'schedule',
 'still',
 'mia',
 'excite',
 'first',
 'cross',
 'country',
 'lax',
 'mco',
 'i',
 "'ve",


## Pivoteo

Ahora necesitamos crear una estructura en donde las filas sean los documentos y las columnas cada palabra en ese documento con su respectiva clasificación

### Eliminar Columnas inecesarias

De esta manera solo conservamos las columnas que queremos tener

In [78]:
sentiment = data["airline_sentiment"]
sentiment

0        neutral 
1        positive
2        neutral 
3        negative
4        negative
5        negative
6        positive
7        neutral 
8        positive
9        positive
10       neutral 
11       positive
12       positive
13       positive
14       positive
15       negative
16       positive
17       negative
18       positive
19       positive
20       negative
21       positive
22       positive
23       neutral 
24       negative
25       negative
26       negative
27       neutral 
28       negative
29       neutral 
          ...    
14610    negative
14611    neutral 
14612    negative
14613    negative
14614    negative
14615    negative
14616    negative
14617    positive
14618    negative
14619    positive
14620    negative
14621    negative
14622    negative
14623    positive
14624    negative
14625    positive
14626    negative
14627    negative
14628    positive
14629    negative
14630    positive
14631    negative
14632    neutral 
14633    negative
14634    n

In [94]:
bag_sentiment = pd.DataFrame(dict(data_lemmatized = data_lemmatized, sentiment = sentiment)).reset_index()
bag_sentiment

Unnamed: 0,index,data_lemmatized,sentiment
0,0,[say],neutral
1,1,"[plus, add, commercials, experience.., tacky]",positive
2,2,"[today.., must, mean, need, take, another, trip]",neutral
3,3,"[accelerate, mobile, page, aggressive, blast, obnoxious, &, amp, recourse]",negative
4,4,"[really, big, bad, thing]",negative
5,5,"[seriously, would, pay, flight, seat, play, bad, thing, fly, va]",negative
6,6,"[tear, in, my, eye, nearly, every, time, fly, vx, worm, ’, t, go, away]",positive
7,7,"[really, miss, prime, opportunity, men, without, hat, parody]",neutral
8,8,"[well, do, -d]",positive
9,9,"[amaze, hour, early, me]",positive


In [101]:
pivot = bag_sentiment.pivot_table(rows=["index"], cols=["data_lemmatized"], values =["data_lemmatized"]   , aggfunc='count')
pivot

TypeError: pivot_table() got an unexpected keyword argument 'rows'