# HW3

Submit via Slack. Due on Monday, April 13th, 2020, 11:59pm PST. You may work with one other person.

## TF-IDF

You are a product analyst at Amazon, charged with identifying areas for improvement in the Amazon toy product lines, which have recently been suffering from lower reviews.

Using the **`poor_amazon_toy_reviews.txt`** and **`good_amazon_toy_reviews.txt`** datasets, clean and parse the text reviews. Explain the decisions you make:
- Why remove or keep stopwords?
- Stemming or lemmatization?
- What regex cleaning and substitution did you apply?
- Which custom stopwords did you add?
- What `n` did you choose for your n-grams?
- Which words did you collocate together?
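
To make these cleaning decisions concrete, here is a minimal sketch using only the standard library. The names `CUSTOM_STOPWORDS`, `CONTRACTIONS`, `clean_review`, and `ngrams` are hypothetical, and the word lists are placeholders rather than recommendations. It strips `<br />` residue, expands a few contractions before tokenizing, drops custom stopwords, and builds n-grams.

```python
import re

# Hypothetical custom stopwords; a real list would come from inspecting the corpus.
CUSTOM_STOPWORDS = {"the", "a", "it", "is", "my", "really", "actually"}

# Assumed contraction rules, applied in order, so "can't" does not split into "ca" + "n't".
CONTRACTIONS = {r"\bcan't\b": "can not", r"\bwon't\b": "will not", r"n't\b": " not"}

def clean_review(text):
    """Lowercase, strip HTML line breaks, expand contractions, tokenize, drop stopwords."""
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]

def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean_review("My son can't put it down!<br />Really well made.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Note that this sketch deliberately keeps `not`. NLTK's English stopword list includes negations such as `not`, so removing it wholesale can turn "not durable" into "durable". That is worth weighing when you justify your stopword choices.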

Finally, generate a TF-IDF report that **visualizes**:
* the features that, according to your analysis, customers cited as reasons for a 5-star review
* the most common issues, according to your analysis, that generated customer dissatisfaction

Explain to what degree the TF-IDF findings make sense. What are the limitations of TF-IDF?
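
When reasoning about limitations, it can help to see how the weights are computed. The sketch below reimplements, in plain Python, the smoothed formula that scikit-learn documents for `TfidfVectorizer` defaults: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization per document. The function name `tfidf_scores` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so long reviews do not dominate; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

docs = [
    ["great", "toy", "well", "made"],
    ["great", "toy", "broke", "fast"],
    ["great", "gift"],
]
scores = tfidf_scores(docs)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the first toy document, `great` appears in every document and so receives the lowest weight, while the rarer `well` and `made` rank highest. The weights reflect only frequency and rarity: nothing in the formula captures sentiment, negation, or word order beyond the chosen n-gram size.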



In [1]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

In [2]:
good_path = "/Users/yanboyang/Desktop/USC/20Spring/dso-560-nlp-and-text-analytics/week1/good_amazon_toy_reviews.txt"
poor_path = "/Users/yanboyang/Desktop/USC/20Spring/dso-560-nlp-and-text-analytics/week1/poor_amazon_toy_reviews.txt"

def load_reviews(path, limit=None):
    """Read one review per line, skipping the header and stripping newlines, double backslashes, and slashes."""
    with open(path, "r") as f:  # the context manager closes the file handle
        f.readline()  # skip the header line
        reviews = [line.replace('\n', '').replace('\\\\', '').replace('/', '') for line in f]
    return reviews[:limit]

good = load_reviews(good_path, limit=10000)  # only the first 10,000 good reviews
poor = load_reviews(poor_path)

In [3]:
#good

In [4]:
# NLTK English stopwords (imported in the first cell) plus custom filler words
sw = stopwords.words('english') + ['actually','usually','oh','always','thing','really','probably']

In [23]:
import numpy as np

vectorizer = TfidfVectorizer(ngram_range=(2,3),
                             token_pattern=r'\b[a-zA-Z0-9]{2,}\b',
                             max_df=0.5, stop_words=sw)
X_good = vectorizer.fit_transform(good)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
# Sum each n-gram's TF-IDF weight over all reviews without densifying the sparse matrix
tf_idf = np.asarray(X_good.sum(axis=0)).ravel()
score = pd.DataFrame({"score": tf_idf, "word": terms}, index=terms)
score.sort_values(by="score", ascending=False, inplace=True)

In [24]:
score.head()

| | score | term |
|---|---|---|
| year old | 64.871441 | year old |
| daughter loves | 54.891959 | daughter loves |
| great product | 52.868135 | great product |
| kids love | 51.740721 | kids love |
| son loves | 51.696421 | son loves |


In [5]:
from nltk import word_tokenize
import string

lemmatizer = WordNetLemmatizer()
result = []
for t in good:
    words = word_tokenize(t)
    # lemmatize() defaults to the noun POS, so "was" becomes "wa" and "does" becomes "doe"
    final = [lemmatizer.lemmatize(w).strip(string.punctuation) for w in words]
    result.append(" ".join(final))

In [35]:
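# NOTE: rebinding `stopwords` shadows the nltk.corpus.stopwords import from earlier cells
stopwords = set(sw + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-", " ", ""])
new_documents = []
for review in result:
    # The stopword check is case-insensitive, but kept tokens retain their original casing
    new_document = [word for word in review.split(' ')
                    if word.strip().lower() not in stopwords]
    new_documents.append(new_document)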

In [38]:
#new_documents

In [37]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 25)

[('br', 'br'),
 ('year', 'old'),
 ('daughter', 'love'),
 ('son', 'love'),
 ('ca', "n't"),
 ('well', 'made'),
 ('doe', "n't"),
 ('kid', 'love'),
 ('put', 'together'),
 ('Great', 'product'),
 ('old', 'love'),
 ('3', 'year'),
 ('grandson', 'love'),
 ('month', 'old'),
 ('old', 'grandson'),
 ('wa', 'great'),
 ('lot', 'fun'),
 ('4', 'year'),
 ('absolutely', 'love'),
 ('granddaughter', 'love'),
 ('wa', 'perfect'),
 ('much', 'fun'),
 ('2', 'year'),
 ('highly', 'recommend'),
 ('look', 'like')]
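
Ranking by `raw_freq` surfaces pairs that are simply common, such as `('br', 'br')`. One alternative is pointwise mutual information (PMI), which rewards pairs that co-occur more often than their individual frequencies would predict. NLTK exposes this as `measures.pmi`, usually combined with `collocation_finder.apply_freq_filter(...)` to suppress rare pairs. The sketch below shows the idea in plain Python; `bigram_pmi`, `min_count`, and the toy documents are hypothetical, and the probability estimates are simplified relative to NLTK's.

```python
import math
from collections import Counter

def bigram_pmi(documents, min_count=2):
    """Score adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in documents:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

toy_docs = [
    ["son", "love", "toy"],
    ["daughter", "love", "toy"],
    ["highly", "recommend", "toy"],
    ["highly", "recommend", "gift"],
]
pmi = bigram_pmi(toy_docs)
for pair, value in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Both surviving pairs occur twice, so raw frequency cannot separate them. PMI ranks `('highly', 'recommend')` above `('love', 'toy')` because `toy` is common on its own.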

In [39]:
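import numpy as np

# Recent scikit-learn versions require stop_words to be a list, not a set
vectorizer = TfidfVectorizer(ngram_range=(2,3),
                             token_pattern=r'\b[a-zA-Z0-9]{2,}\b',
                             max_df=0.5, stop_words=list(stopwords))
X_good = vectorizer.fit_transform(result)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
# Sum each n-gram's TF-IDF weight over all reviews without densifying the sparse matrix
tf_idf = np.asarray(X_good.sum(axis=0)).ravel()
score = pd.DataFrame({"score": tf_idf, "word": terms}, index=terms)
score.sort_values(by="score", ascending=False, inplace=True)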

In [40]:
score.head(20)

| | score | word |
|---|---|---|
| year old | 69.715294 | year old |
| daughter love | 59.568567 | daughter love |
| son love | 54.062531 | son love |
| great product | 52.949429 | great product |
| grandson love | 49.34997 | grandson love |
| kid love | 45.690448 | kid love |
| granddaughter love | 39.267746 | granddaughter love |
| well made | 29.587302 | well made |
| granddaughter loved | 29.002483 | granddaughter loved |
| excellent product | 28.502335 | excellent product |


## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso-560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`
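
A regex-based starting point for producing these three fields might look like the sketch below. The column names `name` and `description`, the keyword patterns, and the helper `tag_product` are all assumptions for illustration. The real catalog's columns and vocabulary need to be inspected first, and only a subset of the colors is shown.

```python
import re

# Assumed keyword patterns; extend them after inspecting the real catalog text.
CATEGORY_PATTERNS = {
    "Bottom": r"\b(?:pants?|jeans?|skirts?|shorts|leggings?)\b",
    "One Piece": r"\b(?:dress(?:es)?|jumpsuits?|rompers?)\b",
    "Shoe": r"\b(?:shoes?|boots?|sandals?|sneakers?|loafers?)\b",
    "Handbag": r"\b(?:handbags?|totes?|clutch(?:es)?|satchels?)\b",
    "Scarf": r"\b(?:scarf|scarves)\b",
}
COLOR_PATTERNS = {
    "Black": r"\bblack\b",
    "Blue": r"\b(?<!navy )blue\b",  # lookbehind: "navy blue" should count as Navy only
    "Gray": r"\bgr[ae]y\b",         # matches both "gray" and "grey"
    "Navy": r"\bnavy\b",
    "Red": r"\bred\b",
}
# The leading \b keeps "men's" from matching, even though "women" contains "men".
WOMENS_PATTERN = re.compile(r"\b(?:wom[ae]n(?:'?s)?|ladies)\b", re.IGNORECASE)

def tag_product(record):
    """Return a copy of the record with the three additional fields filled in."""
    # "name" and "description" are hypothetical column names.
    text = " ".join(str(record.get(field, "")) for field in ("name", "description"))
    category = next(
        (label for label, pattern in CATEGORY_PATTERNS.items()
         if re.search(pattern, text, re.IGNORECASE)),
        None,  # no category keyword found
    )
    colors = [label for label, pattern in COLOR_PATTERNS.items()
              if re.search(pattern, text, re.IGNORECASE)]
    return {**record,
            "is_womens_clothing": bool(WOMENS_PATTERN.search(text)),
            "product_category": category,
            "colors": colors}

jeans = tag_product({"name": "Women's High-Rise Skinny Jeans",
                     "description": "Stretch denim in navy blue with grey stitching."})
loafers = tag_product({"name": "Men's Leather Loafers",
                       "description": "Classic black loafer."})
print(jeans["is_womens_clothing"], jeans["product_category"], jeans["colors"])
print(loafers["is_womens_clothing"], loafers["product_category"], loafers["colors"])
```

With pandas, the same function could be applied row by row and the results joined back onto the original frame. Taking the first matching category is a simplification: a description that mentions both a dress and boots would need a tie-breaking rule.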