<a href="https://colab.research.google.com/github/JonathanMartignon/DatosMasivosII/blob/main/MiniProyecto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Martiñón Luna Jonathan José
## October 7, 2020
## Datos Masivos II
## Bachelor's Degree in Data Science

----
**Instructions**

- The goal of this mini-project is to identify topics in a set of reviews using singular value decomposition (SVD).

- The dataset is: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

- The column to use is: Review Text
----


In [None]:
import nltk
# NLTK asked for the stopwords corpus to be downloaded.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
# Read the CSV
data = pd.read_csv("/content/Womens.csv")
data.head(5)

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [None]:
# Extract the "documents" we will work with
documentos = list(data["Review Text"])
print(f"We have {len(documentos)} reviews (documents)\n")
documentos[0:3]

We have 23486 reviews (documents)



['Absolutely wonderful - silky and sexy and comfortable',
 'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.',
 'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c']

--- 

In the class example, the news articles belonged to 20 different groups. Here we take the distinct values of `Class Name` as the analogous grouping for the reviews.

---

In [None]:
# Note: unique() also returns NaN, so the count below includes the missing-class entry
print(f"We have {len(data['Class Name'].unique())} distinct classes\n")
data['Class Name'].unique()

We have 21 distinct classes



array(['Intimates', 'Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear',
       'Lounge', 'Sweaters', 'Skirts', 'Fine gauge', 'Sleep', 'Jackets',
       'Swim', 'Trend', 'Jeans', 'Legwear', 'Shorts', 'Layering',
       'Casual bottoms', nan, 'Chemises'], dtype=object)

---

# Preprocessing

---

In [None]:
# Convert our list to a DataFrame
pre_data = pd.DataFrame({'document':documentos})

In [None]:
# Count the null values
nulos = pre_data["document"].isnull().sum()
print(f"We have {nulos} null values")

We have 845 null values


In [None]:
# Drop the null values and reset the index
pre_data.dropna(inplace=True)
pre_data.reset_index(drop=True, inplace=True)

In [None]:
# Verify that no null values remain
nulos = pre_data["document"].isnull().sum()
print(f"We have {nulos} null values")

We have 0 null values


In [None]:
# Clean the dataset
# Replace signs and special characters with spaces (regex=True is needed in pandas >= 2.0, where the default is a literal match)
pre_data['clean_doc'] = pre_data['document'].str.replace("[^a-zA-Z#]", " ", regex=True)
# Remove short words (3 characters or fewer)
pre_data['clean_doc'] = pre_data['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
# Convert the text to lowercase
pre_data['clean_doc'] = pre_data['clean_doc'].apply(lambda x: x.lower())
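To make the effect of the three cleaning steps concrete, here is a minimal sketch that applies the same rules to a single string with the standard `re` module instead of pandas. The helper name `clean_review` and its `min_len` parameter are hypothetical and only serve this illustration.

```python
import re


def clean_review(text: str, min_len: int = 4) -> str:
    """Apply the notebook's three cleaning rules to one string (illustrative helper)."""
    # 1. Replace everything that is not a letter or '#' with a space
    letters_only = re.sub(r"[^a-zA-Z#]", " ", text)
    # 2. Keep only words with at least min_len characters (same as len(w) > 3)
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    # 3. Lowercase the result
    return " ".join(kept).lower()


sample = "Love this dress!  it's sooo pretty."
cleaned = clean_review(sample)
print(cleaned)
```

Note that the apostrophe in "it's" is replaced by a space first, so "it" and "s" are then discarded as short words.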

In [None]:
pre_data.head(3)

| | document | clean_doc |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | absolutely wonderful silky sexy comfortable |
| 1 | Love this dress! it's sooo pretty. i happene... | love this dress sooo pretty happened find stor... |
| 2 | I had such high hopes for this dress and reall... | such high hopes this dress really wanted work ... |


In [None]:
# Remove stop words and tokenize

stop_words = set(stopwords.words('english'))  # Load the stop words (a set gives fast membership tests)

tokenized_doc = pre_data['clean_doc'].apply(lambda x: x.split())  # tokenization
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])  # stop-word removal
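As a small illustration of the filtering step, the sketch below runs the same list comprehension on one cleaned review. The four-word set `toy_stop_words` is a hypothetical stand-in for NLTK's English list, used here only so the example runs without downloading the corpus.

```python
# Hypothetical mini stop-word set; the notebook uses NLTK's full English list.
toy_stop_words = {"this", "that", "with", "have"}

# Tokenize one cleaned review by splitting on whitespace
tokens = "love this dress sooo pretty".split()

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in toy_stop_words]
print(filtered)
```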

In [None]:
tokenized_doc

0        [absolutely, wonderful, silky, sexy, comfortable]
1        [love, dress, sooo, pretty, happened, find, st...
2        [high, hopes, dress, really, wanted, work, ini...
3        [love, love, love, jumpsuit, flirty, fabulous,...
4        [shirt, flattering, adjustable, front, perfect...
                               ...                        
22636    [happy, snag, dress, great, price, easy, slip,...
22637    [reminds, maternity, clothes, soft, stretchy, ...
22638    [well, never, would, worked, glad, able, store...
22639    [bought, dress, wedding, summer, cute, unfortu...
22640    [dress, lovely, platinum, feminine, fits, perf...
Name: clean_doc, Length: 22641, dtype: object

In [None]:
# Join the remaining tokens back into one string per document
detokenized_doc = tokenized_doc.apply(' '.join)

pre_data['clean_doc'] = detokenized_doc

In [None]:
pre_data.head(5)

| | document | clean_doc |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | absolutely wonderful silky sexy comfortable |
| 1 | Love this dress! it's sooo pretty. i happene... | love dress sooo pretty happened find store gla... |
| 2 | I had such high hopes for this dress and reall... | high hopes dress really wanted work initially ... |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | love love love jumpsuit flirty fabulous every ... |
| 4 | This shirt is very flattering to all due to th... | shirt flattering adjustable front perfect leng... |


---
Building the document-term matrix

---

In [None]:
# Use TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', 
                            max_features=1000,  # maximum number of terms
                            max_df=0.5,  # ignore terms that appear in more than 50% of the documents
                            smooth_idf=True)
X = vectorizer.fit_transform(pre_data['clean_doc'])
print("Document-term matrix shape: ", X.shape)  # rows are documents, columns are terms
print(X)

Document-term matrix shape:  (22641, 1000)
  (0, 150)	0.26908313656134575
  (0, 714)	0.5029911966336686
  (0, 741)	0.54868250035246
  (0, 978)	0.4745023713096423
  (0, 1)	0.3852131035619599
  (1, 895)	0.28310425687868823
  (1, 495)	0.3282385862459109
  (1, 894)	0.1735493428036656
  (1, 199)	0.18613776111958308
  (1, 420)	0.23728076471436937
  (1, 458)	0.1406400619673494
  (1, 367)	0.23917659146304146
  (1, 443)	0.15549783676623677
  (1, 76)	0.15054782797842423
  (1, 587)	0.5019192913125554
  (1, 542)	0.17891065485646476
  (1, 551)	0.1380808236247941
  (1, 326)	0.2220438270605945
  (1, 797)	0.16807613348792108
  (1, 347)	0.32610477784015773
  (1, 624)	0.16351508888444327
  (1, 226)	0.11109325804703755
  (1, 473)	0.203073219042927
  (2, 999)	0.18478399575763998
  (2, 713)	0.21047708055845665
  :	:
  (22639, 988)	0.22882316313572645
  (22639, 277)	0.22150294249301758
  (22639, 187)	0.15359419245570932
  (22639, 465)	0.2017099583125508
  (22639, 932)	0.17142127973800325
  (22639, 285)	0.15560

---
Computing the singular value decomposition.

---
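As a brief reminder of what the truncated decomposition computes for the matrix `X` built earlier (the symbols $m$, $n$, $k$ and the factor names are notation introduced here, not taken from the notebook):

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad X \in \mathbb{R}^{m \times n},\quad U_k \in \mathbb{R}^{m \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k^{\top} \in \mathbb{R}^{k \times n}
$$

Here $m = 22641$ documents, $n = 1000$ terms, and $k$ is the number of components kept. In scikit-learn's `TruncatedSVD`, `components_` stores $V_k^{\top}$ (one row per component, one column per term), and `transform(X)` returns the document coordinates $U_k \Sigma_k$.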

In [None]:
# Use the "TruncatedSVD" class

# Create the model
svd_model = TruncatedSVD(n_components=21, 
                         algorithm='randomized', 
                         n_iter=100, 
                         random_state=122)

svd_model.fit(X)  # Fit the model
len(svd_model.components_)  # Show the number of components

21
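To show what keeping only a few components means, here is a minimal NumPy sketch on a small random matrix. It uses the exact `np.linalg.svd` rather than the randomized solver selected in the notebook, and the matrix `A` and the rank `k = 2` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # toy "document-term" matrix: 6 documents, 4 terms

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
components = Vt[:k]             # k x n, plays the role of svd_model.components_
doc_coords = A @ components.T   # m x k, plays the role of svd_model.transform(A)
A_k = doc_coords @ components   # rank-k approximation of A

print(components.shape, doc_coords.shape)
# Projecting onto the top-k right singular vectors gives U_k * s_k
print(np.allclose(doc_coords, U[:, :k] * s[:k]))
# The approximation error is determined by the discarded singular values
print(np.isclose(np.linalg.norm(A - A_k), np.sqrt((s[k:] ** 2).sum())))
```

Each row of `components` assigns one weight per term, which is why the rows can be read as topics.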

---

Extracting the topics

---

In [None]:
# Obtain the topics from the model's components
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out().tolist()
terms[0:10]

['able',
 'absolutely',
 'actual',
 'actually',
 'added',
 'addition',
 'adds',
 'adjustable',
 'adorable',
 'adore']

In [None]:
# Display some of the most important words in each topic
for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
    print("Topic "+str(i)+":")
    for t in sorted_terms:
        print(t[0])

Topic 0:
dress
love
size
great
like
wear
small
Topic 1:
dress
beautiful
slip
dresses
wedding
bust
belt
Topic 2:
love
great
comfortable
jeans
dress
soft
perfect
Topic 3:
size
love
small
great
true
wear
perfect
Topic 4:
shirt
cute
small
dress
super
runs
comfortable
Topic 5:
great
jeans
pants
looks
size
cute
skirt
Topic 6:
sweater
great
small
medium
large
wear
looks
Topic 7:
great
size
beautiful
shirt
color
true
fits
Topic 8:
sweater
size
cute
true
soft
comfortable
super
Topic 9:
cute
love
super
really
runs
like
look
Topic 10:
comfortable
fabric
skirt
soft
nice
wear
flattering
Topic 11:
length
skirt
perfect
petite
short
little
long
Topic 12:
skirt
wear
sweater
fits
waist
jeans
size
Topic 13:
cute
skirt
color
store
beautiful
online
tried
Topic 14:
large
pants
runs
jeans
nice
sweater
beautiful
Topic 15:
comfortable
sweater
flattering
skirt
shirt
soft
petite
Topic 16:
jeans
beautiful
color
small
perfect
looks
ordered
Topic 17:
perfect
colors
pants
quality
look
price
sale
Topic 18:
pants
litt
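The ranking above sorts each component by its signed weight, so only terms with large positive weights are listed; terms with large negative weights never appear. A compact way to express the same ranking with NumPy is sketched below. The function `top_terms` and the toy arrays are hypothetical and exist only for this illustration.

```python
import numpy as np


def top_terms(components, terms, n_top=7):
    """Return the n_top highest-weighted terms of each component (illustrative helper)."""
    terms = np.asarray(terms)
    topics = []
    for weights in components:
        # argsort is ascending, so reverse it to get the largest weights first
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append(terms[top_idx].tolist())
    return topics


# Toy vocabulary and component weights, invented for illustration
toy_terms = ["dress", "jeans", "soft", "small"]
toy_components = np.array([[0.9, 0.1, 0.3, 0.2],
                           [-0.2, 0.8, 0.1, -0.5]])

toy_topics = top_terms(toy_components, toy_terms, n_top=2)
print(toy_topics)
```

In the second toy component, "small" has the largest magnitude after "jeans" but is not listed because its weight is negative. Ranking by `np.abs(weights)` instead would surface it, at the cost of mixing both signs.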

# Question
Does each topic correspond, in that order, to a class?

In [None]:
data['Class Name'].unique()

array(['Intimates', 'Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear',
       'Lounge', 'Sweaters', 'Skirts', 'Fine gauge', 'Sleep', 'Jackets',
       'Swim', 'Trend', 'Jeans', 'Legwear', 'Shorts', 'Layering',
       'Casual bottoms', nan, 'Chemises'], dtype=object)
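One way to explore this question is to cross-tabulate each review's dominant topic against its class. `TruncatedSVD` never sees `Class Name`, and `unique()` lists classes in order of first appearance, so there is no built-in reason for topic `i` to line up with the `i`-th class; a table like the one below would show whether any association exists at all. This is only a sketch on invented data: `doc_topic` stands in for `svd_model.transform(X)`, and picking the dominant topic by absolute score is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic scores (rows: reviews, columns: topics),
# standing in for svd_model.transform(X)
doc_topic = np.array([[0.9, 0.1],
                      [0.8, 0.3],
                      [0.2, 0.7],
                      [0.1, 0.6]])

# Hypothetical class labels, one per review, in the same row order
class_names = pd.Series(["Dresses", "Dresses", "Jeans", "Dresses"], name="Class Name")

# Assumption: a review's dominant topic is the one with the largest absolute score
dominant_topic = pd.Series(np.abs(doc_topic).argmax(axis=1), name="topic")

# Count how many reviews of each class fall under each dominant topic
table = pd.crosstab(class_names, dominant_topic)
print(table)
```

In the notebook itself, the class labels would first have to be restricted to the rows that survived `dropna`, so that they line up with the rows of `X`.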