# Predicting review scores 

AI Black Belt - Yellow (June 2019).

---

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

In [3]:
df = pd.read_csv("data/amazon-reviews.csv")
df

Unnamed: 0,Summary,Text,Score
0,No more breast milk!,Let me start off by saying that I am not a tea...,5.0
1,great,I'm one of the few ones that actually like dri...,5.0
2,Not for me,Clear Scalp & Hair Beauty Therapy severely dam...,1.0
3,Yum!,I ordered this wanting to try a hot chocolate ...,4.0
4,Worthy of a subscription?...YES,I wondered if I should to subscribe to the sam...,5.0
5,Wow! Tangy!,Yummy Earth Organic lollipops are tangy and ta...,5.0
6,Superior nutrition at excellent savings,My 14 year old feline - Archie - was diagnosed...,5.0
7,My Dog is Repelled by these,I have six left in the packet. My dog will not...,2.0
8,Funny great movie for kids adults anyone with ...,It's Tim Burton's Beetlejuice! It's a great fi...,5.0
9,"Tangy, spicy, and sweet- oh my!",Kettle Chips Spicy Thai potato chips have the ...,4.0


## Text Feature Extraction with Bag-of-Words

In many tasks, such as classic spam detection, the input data is text.
Free text of variable length is very far from the fixed-length numeric representation that scikit-learn needs for machine learning.
However, there is an easy and effective way to go from text data to a numeric representation: the so-called bag-of-words model, which provides a data structure that is compatible with the machine learning algorithms in scikit-learn.

<img src="figures/day3/bag_of_words.svg" width="100%">


Let's assume that each sample in your dataset is represented as one string, which could be just a sentence, an email, or a whole news article or book. To represent the sample, we first split the string into a list of tokens, which correspond to (somewhat normalized) words. A simple way to do this is to split by whitespace and then lowercase each word.

Then, we build a vocabulary of all tokens (lowercased words) that appear in our whole dataset. This is usually a very large vocabulary.
Finally, for a single sample, we count how often each word in the vocabulary appears in it.
The string is then represented by a vector with one entry per vocabulary word, where each entry is that word's count in the string.

As each sample will only contain very few words, most entries will be zero, leading to a very high-dimensional but sparse representation.

The method is called "bag-of-words," as the order of the words is lost entirely.
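
The steps described above can be sketched by hand on a toy corpus. This is a minimal illustration, not how `CountVectorizer` is implemented: the three example strings and the helper `to_count_vector` are made up for this sketch, and the tokenization is the plain whitespace split described above, whereas `CountVectorizer` by default uses a regular expression that treats punctuation as a separator and drops single-character tokens.

```python
from collections import Counter

# Toy corpus (hypothetical examples, not taken from the review dataset)
docs = ["Great coffee", "Not great", "Coffee coffee coffee"]

# 1. Tokenize: split on whitespace, then lowercase each token
tokenized = [[tok.lower() for tok in doc.split()] for doc in docs]

# 2. Build the vocabulary over the whole corpus
#    (sorted so that the column order is stable)
vocabulary = sorted({tok for doc in tokenized for tok in doc})


# 3. Represent each document as a vector of word counts,
#    one entry per vocabulary word
def to_count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


bow = [to_count_vector(doc, vocabulary) for doc in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each row has one column per vocabulary word, so word order inside a document no longer matters: "Great coffee" and "coffee great" would map to the same vector.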

In [4]:
df = df[~df["Summary"].isnull()]

In [5]:
X_raw = df["Summary"]

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=500)  # keep only the 500 most frequent words
vectorizer.fit(X_raw)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
X_bag_of_words = vectorizer.transform(X_raw).toarray()

In [8]:
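# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
wordCounts = zip(vectorizer.get_feature_names(), np.sum(X_bag_of_words, axis=0))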
pd.DataFrame([{"word":word, "count":count} for word,count in wordCounts]).sort_values("count", ascending=False)

Unnamed: 0,count,word
186,3304,great
182,2393,good
430,2318,the
88,2125,coffee
163,2022,for
20,1700,and
222,1658,it
290,1606,not
280,1370,my
60,1292,but


In [9]:
vectorizer.inverse_transform(X_bag_of_words[:10])

[array(['milk', 'more', 'no'], dtype='<U13'),
 array(['great'], dtype='<U13'),
 array(['for', 'me', 'not'], dtype='<U13'),
 array(['yum'], dtype='<U13'),
 array(['of', 'yes'], dtype='<U13'),
 array(['tangy', 'wow'], dtype='<U13'),
 array(['at', 'excellent'], dtype='<U13'),
 array(['by', 'dog', 'is', 'my', 'these'], dtype='<U13'),
 array(['for', 'great', 'kids', 'movie', 'of', 'with'], dtype='<U13'),
 array(['and', 'my', 'oh', 'spicy', 'sweet', 'tangy'], dtype='<U13')]

## tf-idf Encoding


A useful transformation that is often applied to the bag-of-words encoding is the so-called term-frequency inverse-document-frequency (tf-idf) scaling, which is a non-linear transformation of the word counts.

The tf-idf encoding down-weights words that appear in many documents, since such common words tend to carry less information:

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=500)
tfidf_vectorizer.fit(X_raw)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [11]:
X_tfidf = tfidf_vectorizer.transform(X_raw).toarray()
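
To make the rescaling concrete, here is a hand-rolled sketch of the weighting that corresponds to the settings shown in the `TfidfVectorizer` output above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). With those settings the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The count matrix, the vocabulary, and the helper `tfidf_row` are made up for illustration.

```python
import math

# Toy count matrix (hypothetical): rows are documents, columns are words
vocabulary = ["coffee", "great", "not"]
counts = [
    [1, 1, 0],  # "great coffee"
    [0, 1, 1],  # "not great"
    [3, 0, 0],  # "coffee coffee coffee"
]
n_docs = len(counts)

# Document frequency: in how many documents does each word appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency: rarer words get a larger weight
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by idf, then L2-normalize the row."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm > 0 else 0.0 for v in weighted]


tfidf = [tfidf_row(row) for row in counts]

for word, weight in zip(vocabulary, idf):
    print(f"{word}: idf = {weight:.4f}")
for row in tfidf:
    print([round(v, 4) for v in row])
```

In the second document, "not" ends up with a larger weight than "great" even though both occur once, because "not" appears in fewer documents.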

## Linear regression 


In [12]:
from sklearn.model_selection import train_test_split

X = X_bag_of_words
y = df["Score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [13]:
regressor = Ridge()
regressor.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [14]:
wordCoef = pd.DataFrame({"word":list(vectorizer.get_feature_names()), "coef":regressor.coef_})
#wordCoef["abs"] = wordCoef["coef"].abs()
wordCoef = wordCoef.sort_values("coef", ascending=False)
wordCoef

Unnamed: 0,word,coef
143,fabulous,1.182833
366,rinds,1.057926
17,amazing,0.955278
200,heaven,0.920393
484,wonderful,0.916722
30,awesome,0.907130
146,fantastic,0.894612
238,life,0.881231
107,delish,0.879091
106,delicious,0.860684


In [15]:
y_pred_train = regressor.predict(X_train)
# score() returns the coefficient of determination (R^2) on the held-out test set
regressor.score(X_test, y_test)

0.3544510578512986
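
The number returned by `score` for a regressor is the coefficient of determination R², not an accuracy. The sketch below computes it by hand so the scale is easier to read: 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean score. The helper `r2_manual` and the four example scores are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true review scores and model predictions
y_true = [5.0, 1.0, 4.0, 2.0]
y_pred = [4.0, 2.0, 4.0, 3.0]

print(r2_manual(y_true, y_pred))          # model predictions
print(r2_manual(y_true, y_true))          # perfect predictions
print(r2_manual(y_true, [3.0] * 4))       # always predicting the mean
```

Read this way, a test score of about 0.35 says the model explains roughly a third of the variance in review scores relative to the mean-only baseline.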

<div class="alert alert-success">
    <b>EXERCISE</b>:

Try using the tf-idf encoding (`X_tfidf`) instead of the raw word counts to see if you obtain better results.
</div>

In [16]:
# %load solutions/day3-03-01.py

<div class="alert alert-success">
    <b>EXERCISE</b>:

Predict the score for the test_sentences defined below.
</div>

In [17]:
test_sentences = [
    "Great coffee. yummy",
    "This pork is disgusting",
    "dangerous and nasty",
    "look expired",
]

In [18]:
# %load solutions/day3-03-02.py

<div class="alert alert-success">
    <b>EXERCISE</b>:

Try to use the review text instead of the review summary to see if we can improve the score.
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>:

Try to combine the review text with the review summary to see if we can improve the score further.
</div>