# Classification Primer
## Mikołaj Nawojowski, Wojciech Nowak

In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv("amazon_baby.csv")
pd.options.display.max_rows=10

In [2]:
data

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


### 1) Perform text cleaning

In [3]:
def remove_punctuation(text):
    import string
    return text.translate(str.maketrans('','',string.punctuation))

In [4]:
data['review'] = data['review'].fillna('')
data['review_clean'] = data['review'].apply(remove_punctuation)

data['review_clean']

0         These flannel wipes are OK but in my opinion n...
1         it came early and was not disappointed i love ...
2         Very soft and comfortable and warmer than it l...
3         This is a product well worth the purchase  I h...
4         All of my kids have cried nonstop when I tried...
                                ...                        
183526    Such a great idea very handy to have and look ...
183527    This product rocks  It is a great blend of fun...
183528    This item looks great and cool for my kidsI kn...
183529    I am extremely happy with this product I have ...
183530    I love this product very mush  I have bought m...
Name: review_clean, Length: 183531, dtype: object

### 2) Extract Sentiments
We take into account only positive or negative reviews, as there is scale of 5 stars,
reviews with 3 stars are neutral reviews and should not have an impact on classifier.
Sentiment column will be helpful to easily recognize whether the review is positive or negative.

In [7]:
data = data[data['rating'] != 3]
data['sentiment'] = data['rating'].apply(lambda x : +1 if x > 3 else -1)
data['sentiment']

1         1
2         1
3         1
4         1
5         1
         ..
183526    1
183527    1
183528    1
183529    1
183530    1
Name: sentiment, Length: 166752, dtype: int64

### 3) Split into training and test sets
Random split data set with given probability (parts size) using list of booleans.

In [8]:
msk = np.random.rand(len(data)) < 0.8
train_data = data[msk]
test_data = data[~msk]

[False  True  True ...,  True  True  True]


### 4) Build the word count vector for each review
CountVectorizer class builds the vector of words. Method fit for each word found in given argument assign unique number which will be their representation. It creates vocabulary seen on output. What is important method fit should be called only once on train_data as we want our test_data be real test_data. Method transform creates sparse matrix of each review bases on vocabulary. In this particular example function fit_transform can be splitted into two separated methods but tis builtin function is well optimised what is relevant as building matrix is quite long.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer(token_pattern = r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])

vectorizer.vocabulary_

{'very': 115667,
 'soft': 98136,
 'and': 10552,
 'comfortable': 26726,
 'warmer': 116844,
 'than': 106197,
 'it': 57560,
 'looksfit': 63316,
 'the': 106385,
 'full': 45936,
 'size': 96309,
 'bed': 15980,
 'perfectlywould': 79162,
 'recommend': 87522,
 'to': 109106,
 'anyone': 11109,
 'looking': 63277,
 'for': 44591,
 'this': 107640,
 'type': 112419,
 'of': 73741,
 'quilt': 86017,
 'is': 57259,
 'a': 7367,
 'product': 83788,
 'well': 118111,
 'worth': 120512,
 'purchase': 85088,
 'i': 54314,
 'have': 51029,
 'not': 72639,
 'found': 45003,
 'anything': 11157,
 'else': 38458,
 'like': 62168,
 'positive': 82006,
 'ingenious': 55786,
 'approach': 11577,
 'losing': 63482,
 'binky': 17296,
 'what': 118598,
 'love': 63663,
 'most': 68979,
 'about': 7502,
 'how': 53786,
 'much': 69465,
 'ownership': 77074,
 'my': 70002,
 'daughter': 31567,
 'has': 50911,
 'in': 55025,
 'getting': 47151,
 'rid': 89969,
 'she': 94514,
 'so': 97992,
 'proud': 84589,
 'herself': 52087,
 'loves': 63748,
 'her': 5191

### 5) Train a sentiment classifier with logistic regression
Now we build LogisticRegression classifier which implements predict algorithm. It bases on the previously evaluated word matrix and information whether the review is positive on negative (train_data["sentiment"] column).

In [15]:
logisticRegression = LogisticRegression()
sentiment_model = logisticRegression.fit(train_matrix, y=train_data["sentiment"])

### 6) Prediciting Sentiment

In [16]:
y = []
counter = 0
arrSize = sentiment_model.coef_.size
for i in range(arrSize):
    if sentiment_model.coef_.flat[i] >= 0:
        counter += 1
        y.append(+1)
    else:
        y.append(-1)
print("Number of positive reviews: ", counter)

Number of positive reviews:  87343


In [35]:
# a = sentiment_model.predict(train_matrix)
# a.mean()
# sentiment_model.coef_

array([[ -1.03949845e+00,   2.42904572e-02,   7.28151204e-03, ...,
          1.77386588e-02,   5.24401572e-03,  -1.94325933e-05]])

In [None]:
# 7) Probability Predictions
# P = []
# arrSize = sentiment_model.coef_.size
# for i in range(arr`Size):
#     P.append(1 / (1 + np.exp(-sentiment_model.coef_[0])))
# P.mean()

In [85]:
# 8) Find the most positive (and negative) review

#test_data["probability"] = sentiment_model.predict_proba(train_matrix)

#top_20 = test_data.topk("probability", k=20)
#top_20.print_rows(20)

#bottom_20 = test_data.topk("probability", k=20, reverse = True)
#bottom_20.print_rows(20)

### 8) Compute accuracy of the classifier
We will now evaluate the accuracy of the trained classifier.
Recall that the accuracy is given by accuracy=# correctly classified examples# total examples

In [18]:
# Step 1: Use the sentiment_model to compute class predictions.
predictions = sentiment_model.predict(test_matrix)
# Step 2: Count the number of data points when the predicted class labels match the ground truth labels.
num_correct = 0
j = 0
for i in test_data['sentiment']:
      if predictions[j] == i:
         num_correct = num_correct + 1
# Step 3: Divide the total number of correct predictions by the total number of data points in the dataset.
accuracy = num_correct / len(test_data)

print("Classifier accuracy based on results from test_data: ", accuracy)

Classifier accuracy based on results from test_data:  0.8419358719845996


### 9) Learn another classifier with fewer words

In [20]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

Create vectorizer based on 20 choosen words so that only that words will be taken into account during classyfying.

In [24]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words)

In [25]:
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

In [26]:
vectorizer_word_subset.vocabulary_

{'able': 8,
 'broke': 10,
 'car': 9,
 'disappointed': 14,
 'easy': 2,
 'even': 12,
 'great': 1,
 'less': 11,
 'little': 4,
 'love': 0,
 'loves': 6,
 'money': 17,
 'old': 3,
 'perfect': 5,
 'product': 16,
 'return': 19,
 'waste': 13,
 'well': 7,
 'work': 15,
 'would': 18}

### 10) Train a logistic regression model on a subset of data
Now build a logistic regression classifier with train_matrix_word_subset as features and sentiment

In [33]:
logisticRegression_word_subset = LogisticRegression()
simple_model = logisticRegression_word_subset.fit(train_matrix_word_subset, y=train_data["sentiment"])

# Let us inspect the weights (coefficients) of the simple_model. First, build a table to store (word, coefficient) pairs.
# Sort the data frame by the coefficient value in descending order.
table = sorted(list(zip(vectorizer_word_subset.vocabulary_,simple_model.coef_[0])), key=lambda x:(-x[1],x[0]))
print("The most positive word in our dictionary is: '", table[0][0], "' with rating: ", table[0][1])

The most positive word in our dictionary is: ' loves ' with rating:  1.7278025159


### 11) Comparing models
We will now compare the accuracy of the sentiment_model and the simple_model.
First, compute the classification accuracy of the sentiment_model on the train_data.
Now, compute the classification accuracy of the simple_model on the train_data.
Next, compute the classification accuracy of the simple_model on the test_data.

Something goes wrong as the accuracy of sentiment_model and simple_model is exactly the same what in my opinion is quite odd.

In [34]:
def get_classification_accuracy(model, data, true_labels, matrix):
    # First get the predictions
    predictions = model.predict(matrix)
    
    # Compute the number of correctly classified examples
    num_correct = 0
    j = 0
    for i in true_labels:
          if predictions[j] == i:
             num_correct = num_correct + 1

    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = num_correct / len(data)
    
    return accuracy

print("Sentiment_model clasiffier on train_data:", get_classification_accuracy(sentiment_model, train_data, train_data['sentiment'], train_matrix))
print("Simple_model clasiffier on train_data:\t ", get_classification_accuracy(simple_model, train_data, train_data['sentiment'], train_matrix_word_subset))
print("Simple_model clasiffier on test_data:\t ", get_classification_accuracy(simple_model, test_data, test_data['sentiment'], test_matrix_word_subset))

Sentiment_model clasiffier on train_data: 0.8409210072955523
Simple_model clasiffier on train_data:	  0.8409210072955523
Simple_model clasiffier on test_data:	  0.8419358719845996


### 12 Baseline: Majority class prediction
It is quite common to use the majority class classifier as the a baseline (or reference) model
for comparison with your classifier model. The majority classifier model predicts the majority
class for all data points. At the very least, you should healthily beat the majority class
classifier, otherwise, the model is (usually) pointless.

In [36]:
train_num_positive  = (train_data['sentiment'] == +1).sum()
print("Majority clasiffier for train_data accuracy: ", train_num_positive/len(train_data))

test_num_positive  = (test_data['sentiment'] == +1).sum()
print("Majority clasiffier for test_data accuracy: ", test_num_positive/len(test_data))

Majority clasiffier for train_data accuracy:  0.840921007296
Majority clasiffier for test_data accuracy:  0.841935871985
