## Challenge 3: NLTK Text

### 1. Objectives:
    - Practice using the Text object from the NLTK library
 
---
    
### 2. Development:

Let's practice using the `Text` object from the NLTK library. We'll work with the dataset you cleaned in the previous Challenge. It should contain all the reviews recorded in the 'amazon_fine_food_reviews-clean.csv' dataset, already cleaned and ready for analysis.

Your Challenge consists of the following:

1. Apply nltk's `word_tokenize` function to split every review into words (this may take a little while).
2. Merge all the lists you obtained in step 1 into a single list and use it to create an `nltk.Text` object (this may also take a little while, so be patient).
3. Find the concordances of the words 'boy' and 'girl'.
4. Find the words that appear in contexts similar to those of 'boy' and 'girl'.
5. Find the contexts that 'boy' and 'girl' have in common.
6. Create a dispersion plot for the words 'boy' and 'girl'.
7. Generate a new review using the `generate` method.
8. Quantify the lexical richness of your dataset.
9. Follow your curiosity and run some further explorations on your own.
10. Discuss your findings with your classmates and the expert.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the required libraries and the data

In [2]:
import pandas as pd
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Remoto Datasets/Remoto amazon_fine_food_reviews-clean.csv')
df.head()

Unnamed: 0,id,product_id,user_id,profile_name,helpfulness_numerator,helpfulness_denominator,score,time,summary,text
0,258510,B00168V34W,A1672LH9S1XO70,"Lorna J. Loomis ""Canadian Dog Fancier""",13,14,3,1266796800,"Misleading to refer to ""PODS""","This coffee does NOT come in individual ""PODS""..."
1,207915,B000CQID2Y,A42CJC66XO0H7,"Scott Schimmel ""A Butterfly Dreaming""",2,2,5,1279497600,Delicious,I was a little skeptical after looking at the ...
2,522649,B007TJGZ0Y,A16QZBG2UN6Z3X,"Toology ""Toology""",0,0,5,1335830400,One of my favs,Gloia Jeans Butter Toffee is one of my favorit...
3,393368,B000W7PUOW,A3J21CQZG60K35,Hsieh Pei Hsuan,2,2,5,1265673600,Tasty!!,My families and friends love Planters peanuts ...
4,178178,B002FX2IOQ,A1Z7XV6JU0EV8M,"Barbara ""Barbara""",1,6,1,1301788800,"Organic Valley White 1 % Milkfat Lowfat Milk, ...","Organic Valley White 1 % Milkfat Lowfat Milk, ..."


Group the reviews by product. Note that `max()` keeps a single review per product: the one that sorts last in string order.

In [4]:
grouped_by_product = df.groupby('product_id')['text'].max()

grouped_by_product

product_id
0006641040    TITLE: Chicken Soup with Rice<br />AUTHOR: Mau...
7310172001    I buy a big tub of these for my dog about ever...
7310172101    This is a great treat for dogs, but do read th...
B00004CI84    well one of the best you just have to have a c...
B00004CXX9    What happens when you say his name three times...
                                    ...                        
B009M2LUEW    When I found out about this product from Jorge...
B009NTCO4O    I purchased this item for a christmas<br />gif...
B009NY1MC4    This is fantastic! It is more of a syrup than ...
B009QEBGIQ    This is a great rice, tender when cooked and s...
B009RB4GO4    Yes, it does have artificial sweetener.  Yes, ...
Name: text, Length: 8629, dtype: object

Clean the texts

In [15]:
grouped_by_product = grouped_by_product.str.lower()  # Lowercase everything
grouped_by_product = grouped_by_product.str.strip()  # Strip leading and trailing whitespace
grouped_by_product = grouped_by_product.str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation and other special characters
grouped_by_product = grouped_by_product.str.replace(r'\d', '', regex=True)  # Remove digits
grouped_by_product = grouped_by_product.str.replace(r'\n', '', regex=True)  # Remove newlines
grouped_by_product = grouped_by_product.dropna()  # Drop missing reviews

grouped_by_product

product_id
0006641040    title chicken soup with ricebr author maurice ...
7310172001    i buy a big tub of these for my dog about ever...
7310172101    this is a great treat for dogs but do read the...
B00004CI84    well one of the best you just have to have a c...
B00004CXX9    what happens when you say his name three times...
                                    ...                        
B009M2LUEW    when i found out about this product from jorge...
B009NTCO4O    i purchased this item for a christmasbr gift f...
B009NY1MC4    this is fantastic it is more of a syrup than w...
B009QEBGIQ    this is a great rice tender when cooked and st...
B009RB4GO4    yes it does have artificial sweetener  yes you...
Name: text, Length: 8629, dtype: object
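
The `ricebr` and `christmasbr` tokens in the output above are leftovers of `<br />` tags: removing punctuation deletes the angle brackets and the slash but keeps the letters `br`, glued to the preceding word. Here is a minimal sketch of one way to avoid this, assuming HTML tags should be replaced with a space before punctuation is removed. It uses the standard `re` module on a single string, and `clean_review` is a hypothetical helper, not part of the notebook.

```python
import re


def clean_review(raw: str) -> str:
    """Lowercase a review and strip tags, punctuation, digits and extra spaces."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML tags such as <br /> with a space
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d", "", text)        # remove digits
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


sample = "TITLE: Chicken Soup with Rice<br />AUTHOR: Maurice"
print(clean_review(sample))
```

In pandas, the tag-stripping step could be written as `grouped_by_product.str.replace(r'<[^>]+>', ' ', regex=True)`, run before the punctuation step.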

In [20]:
# nltk.download('punkt')  # Uncomment on the first run to fetch the tokenizer models

Split each review into words

In [21]:
tokenized = grouped_by_product.apply(nltk.word_tokenize)

In [22]:
tokenized

product_id
0006641040    [title, chicken, soup, with, ricebr, author, m...
7310172001    [i, buy, a, big, tub, of, these, for, my, dog,...
7310172101    [this, is, a, great, treat, for, dogs, but, do...
B00004CI84    [well, one, of, the, best, you, just, have, to...
B00004CXX9    [what, happens, when, you, say, his, name, thr...
                                    ...                        
B009M2LUEW    [when, i, found, out, about, this, product, fr...
B009NTCO4O    [i, purchased, this, item, for, a, christmasbr...
B009NY1MC4    [this, is, fantastic, it, is, more, of, a, syr...
B009QEBGIQ    [this, is, a, great, rice, tender, when, cooke...
B009RB4GO4    [yes, it, does, have, artificial, sweetener, y...
Name: text, Length: 8629, dtype: object

A single list with all the words

In [23]:
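# Flatten the per-review token lists into one list (much faster than Series.sum(), which concatenates the lists one pair at a time)
all_words = [word for tokens in tokenized for word in tokens]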
text = nltk.Text(all_words)

text

<Text: title chicken soup with ricebr author maurice sendakbr...>

Concordance for 'boy'

In [24]:
text.concordance('boy', lines=20)

Displaying 20 of 54 matches:
h by telling his teammates that every boy at the academy gets a shot at playing
ng and jameson can barely look at the boy without cringing coopersmiths dark di
up its calorie free and carb free and boy can you tell i tried this twice wante
ar on your creme of wheat and oatmeal boy do i pour this stuff on want to make 
 making of one down to the letter and boy oh boy not only was my husband thrill
 of one down to the letter and boy oh boy not only was my husband thrilled i no
nyone knows of one please let me know boy oh boy walden farms is really a hit a
nows of one please let me know boy oh boy walden farms is really a hit and miss
e and more scarce when we adopted our boy nemo he was a few pounds overweight a
me vet after my year old russian blue boy developed chrystals in hir urine and 
ngest sons favorite lunch as my older boy cant have cheese the microwave packet
sugar and water stir and put that bad boy in your fridge or out on the porch fo
ese and kee

Concordance for 'girl'

In [25]:
text.concordance('girl', lines=20)

Displaying 20 of 41 matches:
s to begin with since i was a little girl i have always love this stuff i even
 to go back inside and ask the sales girl for assistance which she very gracio
t it tastes great recommended my chi girl loves these she gets excited when sh
at were made in chinabr br my little girl was having trouble with dry stools s
vegans and vegetarians being a texas girl growing up on mexican food i add cho
lar cookiesbr they taste almost like girl scout thin mint patty cookies they a
dry no complaints on this product my girl has loved these for two years when t
 this is one of the few treats wilhi girl enjoys my yorkies loves these better
 on purpose mine was as a gift to my girl so we ended up cutting and placing i
ut the taste of the food and even my girl who we used to have to push to finis
 the storethe employee a nice korean girl comes into play later and a young co
ut at home i walked up and asked the girl if we could try the last one the dud
corn making sweet love 
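
`concordance` shows every occurrence of a word together with a window of surrounding text. Below is a rough sketch of the idea in plain Python, assuming a window measured in tokens (NLTK's display width is measured in characters). `keyword_in_context` is a hypothetical helper written for illustration, not NLTK's implementation.

```python
def keyword_in_context(tokens, keyword, window=2):
    """Return one line per occurrence of `keyword`, with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "my little girl loves these treats and my boy loves them too".split()
for line in keyword_in_context(tokens, "girl"):
    print(line)
```

Reading the real concordance lines the same way shows that 'boy' is often an interjection ("boy oh boy") while 'girl' usually refers to a person or a pet.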

Words that appear in contexts similar to 'boy'

In [26]:
text.similar('boy')

i it that there this he she so and flavor cat taste dog price but just
what product cats can


Words that appear in contexts similar to 'girl'

In [27]:
text.similar('girl')

dog product flavor food taste tea time cat door wife one price store
dogs bit world money mouth bag order
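
Both `similar` and `common_contexts` rely on the same notion of context: the words immediately before and after an occurrence. Here is a simplified sketch of that idea, assuming a context is just the pair (previous word, next word) and ignoring the first and last tokens. `contexts_by_word` is a hypothetical helper, not NLTK's implementation.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for previous, word, following in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((previous, following))
    return contexts


tokens = "as a little boy i loved it and as a little girl i loved it too".split()
contexts = contexts_by_word(tokens)

# Contexts shared by both words
shared = contexts["boy"] & contexts["girl"]
print(shared)
```

Two words count as similar when they share many such pairs, which is why frequent words like `dog` or `product` rank highly: they occur in a great many contexts.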


Common contexts

In [28]:
text.common_contexts(['boy', 'girl'])

little_i
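
Step 6 of the challenge asks for a dispersion plot, which the cells here do not include. With NLTK the call would be `text.dispersion_plot(['boy', 'girl'])`, which draws one mark per occurrence at its position in the token stream. The sketch below only computes those positions by hand on a toy token list, so the plotted quantity is explicit. `word_offsets` is a hypothetical helper.

```python
def word_offsets(tokens, targets):
    """Return, for each target word, the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


tokens = "my boy and my girl love it but my boy loves it more".split()
offsets = word_offsets(tokens, ["boy", "girl"])
print(offsets)
```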


Total number of distinct words

In [29]:
len(set(text))

24206

Lexical richness

In [30]:
len(set(text)) / len(text)

0.03724515702174147
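
Lexical richness here is the type-token ratio: distinct words divided by total words. A value around 0.037 means that, on average, each distinct word appears about 27 times. Here is a small sketch on a toy sentence, where `lexical_richness` is a hypothetical helper that also guards against an empty list.

```python
def lexical_richness(tokens):
    """Type-token ratio: number of distinct tokens over total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


tokens = "the tea is good and the coffee is good".split()
# 6 distinct words out of 9 tokens
print(round(lexical_richness(tokens), 3))
```

The ratio tends to fall as a corpus grows, so it is most meaningful when comparing texts of similar length.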

In [32]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Collocations (word pairs that frequently appear together)

In [33]:
text.collocations()

highly recommend; peanut butter; grocery store; gluten free; green
tea; year old; olive oil; years ago; much better; dont know; dog food;
ive tried; local grocery; dark chocolate; long time; cat food; even
though; ive ever; ice cream; earl grey
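
`collocations` reports word pairs that occur together more often than chance would suggest. The sketch below is a deliberately simpler stand-in that only counts adjacent pairs. It does not apply the statistical scoring or the stop-word filtering that NLTK uses, and `top_bigrams` is a hypothetical helper.

```python
from collections import Counter


def top_bigrams(tokens, n=3):
    """Return the `n` most frequent pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


tokens = "peanut butter and green tea and peanut butter cookies".split()
print(top_bigrams(tokens, n=1))
```

On a real corpus, raw counts like these are dominated by pairs of function words such as "of the", which is why a scoring step and a stop-word filter matter.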


In [34]:
freq_dist = nltk.FreqDist(all_words)

print(freq_dist)

<FreqDist with 24206 samples and 649910 outcomes>


In [36]:
freq_dist.most_common(20)

[('the', 26540),
 ('i', 21045),
 ('and', 18381),
 ('a', 17640),
 ('to', 14478),
 ('it', 13123),
 ('of', 11432),
 ('is', 10802),
 ('this', 10107),
 ('in', 8199),
 ('for', 7989),
 ('my', 6415),
 ('that', 6098),
 ('but', 5257),
 ('with', 5087),
 ('not', 4986),
 ('have', 4940),
 ('was', 4762),
 ('you', 4722),
 ('are', 4622)]

In [37]:
from nltk.corpus import stopwords

In [38]:
english_stop_words = set(stopwords.words('english'))  # A set makes the membership test below fast

all_words_except_stop_words = [word for word in all_words if word not in english_stop_words]

freq_dist_no_stop_words = nltk.FreqDist(all_words_except_stop_words)

In [39]:
freq_dist_no_stop_words.most_common(20)

[('br', 3850),
 ('like', 3352),
 ('good', 2954),
 ('one', 2351),
 ('great', 2351),
 ('taste', 2316),
 ('product', 2272),
 ('tea', 2249),
 ('flavor', 1911),
 ('food', 1829),
 ('coffee', 1816),
 ('love', 1746),
 ('would', 1736),
 ('get', 1551),
 ('really', 1443),
 ('much', 1386),
 ('dont', 1325),
 ('little', 1270),
 ('time', 1231),
 ('also', 1228)]
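
The top entry, `br`, is not a real word but residue of `<br />` tags that survived the cleaning step, so it can be worth adding such noise tokens to the stop list. Below is a minimal sketch with `collections.Counter`, assuming a tiny hand-written stop list in place of NLTK's English list. `STOP_WORDS`, `NOISE_TOKENS` and `top_words` are hypothetical names.

```python
from collections import Counter

# Tiny stand-in for stopwords.words('english'), for illustration only
STOP_WORDS = {"the", "is", "and", "i", "it", "this"}
# Tokens that are cleaning artifacts rather than words
NOISE_TOKENS = {"br"}


def top_words(tokens, n=3):
    """Count tokens that are neither stop words nor known noise."""
    ignored = STOP_WORDS | NOISE_TOKENS
    return Counter(token for token in tokens if token not in ignored).most_common(n)


tokens = "this tea is good br br and the tea is really good tea".split()
print(top_words(tokens, n=2))
```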