# HW3

Submit via Slack. Due on Monday, April 13th, 2020, 11:59pm PST. You may work with one other person.

## TF-IDF

You are a product analyst at Amazon, charged with identifying areas for improvement in the Amazon toy product lines, which have recently been suffering from lower reviews.

Using the **`poor_amazon_toy_reviews.txt`** and **`good_amazon_toy_reviews.txt`** datasets, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that **visualizes**:
* the features your analysis showed that customers cited as reasons for a 5 star review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense. What are the limitations of TF-IDF?



Here is a brief summary of the decisions I made:
- I removed stopwords because they do not help me understand the reviews. In addition, because the dataset is large and high-dimensional, removing stopwords saves computation.
- I used lemmatization because I think it is important to preserve the original meaning and context of words when analyzing the reviews.
- For regex cleaning and substitution, I removed newline characters, double backslashes, and forward slashes, because they carry no information and I don't want them to appear in the analysis.
- For custom stopwords, I added several adverbs such as "probably", "actually", and "really", because I believe they carry no real meaning.
- For n-grams, I used n=2.
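
The cleaning decisions above can be sketched on a single review using only the standard library. This is a minimal illustration, not the pipeline used in the cells below: `clean_review`, `bigrams`, and `CUSTOM_STOPWORDS` are hypothetical names, and stripping `<br />` tags is an assumption prompted by the `br br` bigram that shows up in the results further down. Unlike the notebook code, it replaces newlines with a space so that adjacent words are not glued together.

```python
import re

# Hypothetical custom stopwords, mirroring the adverbs listed above.
CUSTOM_STOPWORDS = {"actually", "usually", "really", "probably"}

def clean_review(text):
    """Lowercase a review, drop HTML line breaks and newlines, remove custom stopwords."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)    # assumed HTML residue such as <br />
    text = text.replace("\n", " ")            # a space keeps neighbouring words apart
    tokens = re.findall(r"[a-z0-9']+", text)  # simple alphanumeric tokenizer
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]

def bigrams(tokens):
    """Return adjacent token pairs (n-grams with n=2)."""
    return list(zip(tokens, tokens[1:]))

tokens = clean_review("Really great toy!<br />My son LOVES it\n")
pairs = bigrams(tokens)
```

Standard English stopwords would be removed in the same way as the custom ones, by adding them to the set.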

TF-IDF is simple to use and returns terms that are descriptive of, and relevant to, a specific document. However, it can be computationally intensive, especially when the vocabulary is large. In addition, it ignores word order beyond what the chosen n-grams capture, and therefore cannot capture the meaning of a sentence.
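
To make the word-order limitation concrete, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, which is the default in scikit-learn's `TfidfVectorizer`, but it skips the row normalization that scikit-learn also applies, so the numbers will not match the library's output. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    # Weight = raw term count * idf (no length normalization in this sketch).
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

docs = [["great", "toy"], ["poor", "toy"], ["great", "great", "gift"]]
weights = tfidf(docs)

# Word order is invisible to a bag-of-words model:
same = tfidf([["not", "good"], ["toy"]])[0] == tfidf([["good", "not"], ["toy"]])[0]
```

A term that appears in only one document ("gift") gets a higher weight per occurrence than one that appears in two ("toy"), and the two orderings of "not good" receive identical weights.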

In [1]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

In [2]:
def load_reviews(path):
    """Read one review per line, skipping the first line of the file."""
    with open(path, "r") as f:  # the context manager closes the file when done
        f.readline()  # skip the first line
        # Strip newlines, double backslashes, and forward slashes from each review
        return [review.replace('\n', '').replace('\\\\', '').replace('/', '') for review in f]

good = load_reviews("/Users/yanboyang/Desktop/USC/20Spring/dso-560-nlp-and-text-analytics/week1/good_amazon_toy_reviews.txt")
poor = load_reviews("/Users/yanboyang/Desktop/USC/20Spring/dso-560-nlp-and-text-analytics/week1/poor_amazon_toy_reviews.txt")

In [4]:
# stopwords
from nltk.corpus import stopwords
sw = stopwords.words('english') + ['actually','usually','oh','always','thing','really','probably']

### Good Reviews

In [16]:
vectorizer = TfidfVectorizer(ngram_range=(2,2),
                             token_pattern=r'\b[a-zA-Z0-9]{2,}\b',
                             min_df=5, stop_words=sw)
X_good = vectorizer.fit_transform(good)
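# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
terms = vectorizer.get_feature_names_out()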
tf_idf = pd.DataFrame(X_good.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
good_score = pd.DataFrame(tf_idf, columns=["score"])
#score["word"] = terms
good_score.sort_values(by="score", ascending=False, inplace=True)

In [17]:
good_score.head()

| | score |
|---|---|
| year old | 1270.402445 |
| daughter loves | 857.957543 |
| son loves | 812.171447 |
| great product | 788.75304 |
| kids love | 761.55536 |


In [18]:
good_score.to_csv("good.csv")

In [19]:
from nltk import word_tokenize
import string
lemmatizer = WordNetLemmatizer()
result = []
for t in good:
    words = word_tokenize(t)
    final = []
    for w in words:
        final.append(lemmatizer.lemmatize(w).strip(string.punctuation))
    res = " ".join(final)
    result.append(res)

In [20]:
# Note: this rebinds the name `stopwords`, shadowing the nltk.corpus import above
stopwords = set(sw + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-", " ", ""])
new_documents = []
for review in result:
    new_document = []
    for word in review.split(' '):
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [21]:
#new_documents

In [22]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 25)

[('br', 'br'),
 ('year', 'old'),
 ('daughter', 'love'),
 ('son', 'love'),
 ('well', 'made'),
 ('ca', "n't"),
 ('doe', "n't"),
 ('kid', 'love'),
 ('old', 'love'),
 ('put', 'together'),
 ('3', 'year'),
 ('Great', 'product'),
 ('grandson', 'love'),
 ('good', 'quality'),
 ('2', 'year'),
 ('wa', 'great'),
 ('month', 'old'),
 ('much', 'fun'),
 ('lot', 'fun'),
 ('yr', 'old'),
 ('highly', 'recommend'),
 ('4', 'year'),
 ('absolutely', 'love'),
 ('birthday', 'party'),
 ('look', 'like')]
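
The list above contains `('Great', 'product')` with a capital G, because the stopword filter lowercases words only for the comparison and keeps the original casing. A minimal sketch of raw-frequency bigram counting that lowercases first, so that case variants are counted together, might look like the following. It uses only `collections.Counter`; `bigram_counts` and `toy_docs` are hypothetical names, and this is not a replacement for NLTK's `BigramCollocationFinder`.

```python
from collections import Counter

def bigram_counts(documents):
    """Count adjacent token pairs within each document, ignoring case."""
    counts = Counter()
    for tokens in documents:
        lowered = [t.lower() for t in tokens]
        # zip pairs each token with its successor; pairs never span two documents
        counts.update(zip(lowered, lowered[1:]))
    return counts

toy_docs = [
    ["Great", "product"],
    ["great", "product", "kid", "love"],
    ["kid", "love"],
]
counts = bigram_counts(toy_docs)
```

Here `("great", "product")` is counted twice, once from each casing.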

![alt text](goodviz.png "Title")

### Poor Reviews

In [23]:
vectorizer = TfidfVectorizer(ngram_range=(2,2),
                             token_pattern=r'\b[a-zA-Z0-9]{2,}\b',
                             min_df=5, stop_words=sw)
X_poor = vectorizer.fit_transform(poor)
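# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
terms = vectorizer.get_feature_names_out()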
tf_idf = pd.DataFrame(X_poor.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
#score["word"] = terms
score.sort_values(by="score", ascending=False, inplace=True)

In [24]:
score.head(20)

| | score |
|---|---|
| waste money | 256.075722 |
| year old | 131.394717 |
| br br | 130.154481 |
| poor quality | 124.699732 |
| cheaply made | 104.78314 |
| fell apart | 65.176357 |
| would recommend | 64.475933 |
| first time | 63.293833 |
| looks like | 61.654188 |
| put together | 59.676626 |


In [25]:
score.to_csv("poor.csv")

In [12]:
lemmatizer = WordNetLemmatizer()
result = []
for t in poor:
    words = word_tokenize(t)
    final = []
    for w in words:
        final.append(lemmatizer.lemmatize(w).strip(string.punctuation))
    res = " ".join(final)
    result.append(res)

In [13]:
new_documents = []
for review in result:
    new_document = []
    for word in review.split(' '):
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [14]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 25)

[('br', 'br'),
 ('doe', "n't"),
 ('waste', 'money'),
 ('year', 'old'),
 ("n't", 'work'),
 ("n't", 'even'),
 ('ca', "n't"),
 ('would', "n't"),
 ('look', 'like'),
 ('wa', 'disappointed'),
 ('wa', "n't"),
 ('wo', "n't"),
 ("n't", 'buy'),
 ("n't", 'waste'),
 ('poor', 'quality'),
 ('could', "n't"),
 ('cheaply', 'made'),
 ('put', 'together'),
 ('first', 'time'),
 ("n't", 'get'),
 ('wa', 'excited'),
 ("n't", 'stay'),
 ('thought', 'wa'),
 ('first', 'day'),
 ('stopped', 'working')]

![alt text](poorviz.png "Title")

## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

In [30]:
dataset = pd.read_csv("truncated_catalog.csv")

In [53]:
# credit to stackoverflow
dataset['all_info'] = dataset[dataset.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str).str.lower()),
    axis=1
)

In [54]:
import re
dataset['woman']=dataset['all_info'].str.findall(r'\b(woman|women|girl|girls|female|wife|daughter|daughters|girlfriend|girlfriends|mother|mom|mommy|mommies)\b')

In [55]:
import numpy as np
# Flag a row as women's clothing when at least one keyword matched
dataset['is_womens_clothing']=dataset['woman'].apply(lambda matches: len(matches) > 0)
dataset.drop('woman',axis=1,inplace=True)

In [68]:
dataset['cat']=dataset['all_info'].str.findall(r'(bottom|shoe|handbag|scarf|one piece|one-piece|onepiece|hand bag)')

In [71]:
dataset['category'] = dataset['cat'].apply(lambda x: list(set(x)))
dataset.drop('cat', axis=1, inplace=True)
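
The regex above returns whichever spelling it finds (`one-piece`, `hand bag`, and so on). One possible way to map those variants onto the canonical labels from the assignment list is sketched below. `CATEGORY_PATTERNS` and `find_categories` are hypothetical names, and the word boundaries and plural forms are assumptions that would change which rows match compared with the cell above.

```python
import re

# One pattern per canonical label; each pattern covers common spelling variants.
CATEGORY_PATTERNS = {
    "Bottom": r"\bbottoms?\b",
    "One Piece": r"\bone[\s-]?piece\b",
    "Shoe": r"\bshoes?\b",
    "Handbag": r"\bhand\s?bags?\b",
    "Scarf": r"\b(?:scarf|scarves)\b",
}

def find_categories(text):
    """Return the canonical labels whose pattern occurs in the text."""
    text = text.lower()
    return [label for label, pattern in CATEGORY_PATTERNS.items()
            if re.search(pattern, text)]

example = find_categories("Girl's Ariana One-Piece swimsuit")
```

Applied to a column, this could be used as `dataset['all_info'].apply(find_categories)`, assuming the `all_info` column built earlier.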

In [75]:
dataset['color']=dataset['all_info'].str.findall(r'(beige|black|blue|brown|burgundy|gold|gray|green|multi|navy|neutral|orange|pink|pinks|purple|red|silver|teal|white|yellow)')
dataset['colors'] = dataset['color'].apply(lambda x: list(set(x)))
dataset.drop('color', axis=1, inplace=True)
dataset.head()

| | brand | name | description | brand_category | brand_canonical_url | details | tsv | all_info | is_womens_clothing | category | colors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FILA | Original Fitness Sneakers | Vintage Fitness leather sneakers with logo pri... | TheMensStore/Shoes/Sneakers/LowTop | https://www.saksfifthavenue.com/fila-original-... | Leather/synthetic upper\nLace-up closure\nText... | 'design':12 'fila':1A 'fit':3A,6 'leather':7 '... | original fitness sneakers,vintage fitness leat... | False | [shoe] | [] |
| 1 | CHANEL | HAT | | Unknown | https://www.saksfifthavenue.com/chanel-hat/pro... | WOOL TWEED & FELT | 'chanel':1A 'hat':2A | hat,unknown,https://www.saksfifthavenue.com/ch... | False | [] | [] |
| 2 | Frame | Petit Oval Buckle Belt | A Timeless Leather Belt Crafted From Smooth Co... | Accessories | https://frame-store.com/products/petit-oval-bu... | | 'belt':5A,9 'buckl':4A,21 'cowhid':13 'craft':... | petit oval buckle belt,a timeless leather belt... | False | [] | [multi, gold] |
| 3 | Lilly Pulitzer Kids | Little Gir's & Girl's Ariana One-Piece UPF 50+... | Pretty ruffle sleeves and trim elevate essenti... | JustKids/Girls214/Girls/SwimwearCoverups,JustK... | https://www.saksfifthavenue.com/lilly-pulitzer... | Scoopneck\nAdjustable straps\nFlutter sleeves\... | '50':14A 'allov':28 'ariana':9A 'color':27 'el... | little gir's & girl's ariana one-piece upf 50+... | True | [one-piece] | [] |
| 4 | Kissy Kissy | Baby Girl's Endearing Elephants Pima Cotton Co... | Versatile convertible gown with elephant applique | JustKids/Baby024months/InfantGirls/FootiesRompers | https://www.saksfifthavenue.com/kissy-kissy-ba... | V-neckline\nLong sleeves\nFront snap closure\n... | 'appliqu':17 'babi':3A 'convert':10A,13 'cotto... | baby girl's endearing elephants pima cotton co... | True | [bottom] | [] |
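
Because the color regex has no word boundaries, it can match color names inside longer words, for example `red` inside "featured". The sketch below contrasts the unanchored pattern with a word-boundary version and maps matches to the labels from the assignment list. `LOOSE_RE`, `STRICT_RE`, `LABELS`, and `find_colors` are hypothetical names; the strict version is an assumption about the desired behavior and will also skip compound words such as "goldtone".

```python
import re

COLORS = ["beige", "black", "blue", "brown", "burgundy", "gold", "gray", "green",
          "multi", "navy", "neutral", "orange", "pink", "purple", "red", "silver",
          "teal", "white", "yellow"]

# Unanchored, as in the cell above: matches anywhere, even inside other words.
LOOSE_RE = re.compile("(" + "|".join(COLORS) + ")")
# Anchored: whole words only, with an optional plural "s" (e.g. "pinks").
STRICT_RE = re.compile(r"\b(" + "|".join(COLORS) + r")s?\b")

# Canonical labels from the assignment list; "Pinks" is the one irregular label.
LABELS = {color: color.capitalize() for color in COLORS}
LABELS["pink"] = "Pinks"

def find_colors(text):
    """Return the sorted, de-duplicated canonical color labels found as whole words."""
    matches = STRICT_RE.findall(text.lower())
    return sorted({LABELS[m] for m in matches})

loose = LOOSE_RE.findall("featured goldtone buckle")
strict = find_colors("featured goldtone buckle")
```

On the sample string, the unanchored pattern reports two colors while the anchored one reports none.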
