<a href="https://colab.research.google.com/github/ShaunakSen/Natural-Language-Processing/blob/master/Real_World_Python_Machine_Learning_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Real-World Python Machine Learning Tutorial

> Based on the YouTube tutorial by Keith Galli: https://www.youtube.com/watch?v=M9Itm95JzL0

---



In [16]:
!pip install rich



In [0]:
import json
import numpy as np
import pandas as pd
from rich import print as r_print
import random

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

In [0]:
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'


class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

class ReviewContainer:
    """
    Wraps a list of reviews and can balance it so that positive and negative
    sentiments are equally represented (neutral reviews are dropped).
    """
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        print (f'Initial length of positive: {len(positive)}, negative: {len(negative)}')

        positive_shrunk = positive[:len(negative)]

        self.reviews = negative + positive_shrunk

        print (f'Final length of positive: {len(positive_shrunk)}, negative: {len(negative)}')



        

In [0]:
file_name = './Books_small_10000.json'

reviews = [] # [review_obj1, review_obj2, ...]

# read the data and append to list
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        # create a Review obj
        review_obj = Review(text=review['reviewText'], score=review['overall'])
        reviews.append(review_obj)

r_print (reviews[0].sentiment)

POSITIVE


In [0]:
print (f'Number of reviews: {len(reviews)}')

Number of reviews: 10000


### Bag of words

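ML models work well with numeric data but not with raw text, so we need a method to convert text into vectors.

A bag-of-words model does this by building a vocabulary from the training text and representing each document as a vector of word counts over that vocabulary.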

![](https://i.ibb.co/V3ycp0n/diag1.png)

In the diagram, the first two sentences are the training set and the last one is the test set.

Note: the test sentence contains words such as 'a' and 'very' that do not appear in the training data.

These words are not part of the learned vocabulary, so we simply ignore them.
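
As a minimal sketch of this idea, the snippet below fits a `CountVectorizer` on two training sentences and then transforms a test sentence. The sentences are hypothetical stand-ins and are not taken from the diagram.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, not the ones shown in the diagram
train_sentences = ["this book is great", "this book was so bad"]
test_sentences = ["was a very great book"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training sentences only
train_vectors = vectorizer.fit_transform(train_sentences)
# transform reuses that vocabulary; words it has never seen get no column
test_vectors = vectorizer.transform(test_sentences)

# Columns follow the alphabetical order of the vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_vectors.toarray())
print(test_vectors.toarray())
```

The test vector has the same number of columns as the training vectors, and 'very' contributes nothing because it was not in the training vocabulary. With the default `token_pattern`, `CountVectorizer` also drops single-character tokens such as 'a' during tokenization.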

### Split the data

In [0]:
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

print (len(training))

print (training[0].text, training[0].sentiment)

6700
Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to 

In [17]:
### split up the reviews evenly
review_container_tr = ReviewContainer(training)
review_container_te = ReviewContainer(test)
review_container_tr.evenly_distribute()
review_container_te.evenly_distribute()


print (f'Length of training data after shrinking is {len(review_container_tr.reviews)}')
print (f'Length of test data after shrinking is {len(review_container_te.reviews)}')

Initial length of positive: 5611, negative: 436
Final length of positive: 436, negative: 436
Initial length of positive: 2767, negative: 208
Final length of positive: 208, negative: 208
Length of training data after shrinking is 872
Length of test data after shrinking is 416
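
`evenly_distribute` keeps the first `len(negative)` positive reviews in their original order. One possible variation, sketched here with a hypothetical helper `downsample_to_balance` that is not part of the notebook, samples the kept items at random and shuffles the result so the classes are not grouped together.

```python
import random


def downsample_to_balance(labels, seed=42):
    """Return shuffled indices that keep the same number of items per label."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    # Every class is cut down to the size of the smallest class
    smallest = min(len(indices) for indices in by_label.values())
    kept = []
    for indices in by_label.values():
        kept.extend(rng.sample(indices, smallest))
    rng.shuffle(kept)
    return kept


# Hypothetical, imbalanced labels
labels = ["POSITIVE"] * 8 + ["NEGATIVE"] * 3
kept = downsample_to_balance(labels)
balanced = [labels[i] for i in kept]
print(balanced.count("POSITIVE"), balanced.count("NEGATIVE"))
```

Unlike `evenly_distribute`, this helper balances every label it sees, so neutral reviews would have to be filtered out beforehand to reproduce the two-class setup used here.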


Now we split the training and test data into X and y (inputs and outputs):

In [18]:
train_X = review_container_tr.get_text()
train_y = review_container_tr.get_sentiment()

test_X = review_container_te.get_text()
test_y = review_container_te.get_sentiment()

print (train_X[0], train_y[0], test_X[0], test_y[0])

It was just one of those books that never went anywhere. I like books that get your attention in the beginning and not drag out until a quarter way through. I decided to give it an early death - delete! NEGATIVE Story is very inaccurate with modern words, phrases and actions.  In the second chapter the author has the bagpipes playing "Amazing Grace" and according to her it is a song as old as time.  As someone who learned to play Amazing Grace on the piano I can state for a fact the song is not old as time. It was not even published until 1779; author has the book set in 1714. 65 years before John Newton wrote and published the songFiona and Juliet speak like they are in the 21 century. Not a young miss in the early 18th century.I have no problem reading about God in books. My problem is when authors take too much leeway and write using modern phrases in historical books.Really, wondering if this author did any 'real' research or just used what she remembered from high school world his

In [19]:
vectorizer = CountVectorizer()
train_X_vectors = vectorizer.fit_transform(train_X)

print (train_X_vectors.shape)

(872, 8906)


In [21]:
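# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print (vectorizer.get_feature_names_out()[4277])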

it


In [20]:
print (train_X_vectors[0])

  (0, 4277)	2
  (0, 8608)	1
  (0, 4409)	1
  (0, 5514)	1
  (0, 5478)	1
  (0, 7984)	1
  (0, 996)	2
  (0, 7925)	2
  (0, 5350)	1
  (0, 8666)	1
  (0, 473)	1
  (0, 4684)	1
  (0, 3374)	1
  (0, 8883)	1
  (0, 634)	1
  (0, 4034)	1
  (0, 7929)	1
  (0, 816)	1
  (0, 423)	1
  (0, 5408)	1
  (0, 2430)	1
  (0, 5589)	1
  (0, 8403)	1
  (0, 6305)	1
  (0, 8627)	1
  (0, 8005)	1
  (0, 2042)	1
  (0, 8052)	1
  (0, 3393)	1
  (0, 416)	1
  (0, 2526)	1
  (0, 2017)	1
  (0, 2081)	1


This is a sparse matrix, so printing a row shows only the positions of its non-zero elements together with their counts.

We can see that in the first review the word at index 4277 occurs twice. This is the word 'it', and it does indeed occur twice in the review ('It' and 'it'; `CountVectorizer` lowercases text by default).
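
The lookup can also go the other way, from a word to its column. The sketch below uses two hypothetical documents and the fitted vectorizer's `vocabulary_` mapping to read one count out of the sparse matrix, then maps the non-zero columns of a row back to words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents
docs = ["it was good and it was short", "not good"]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)  # SciPy sparse matrix

# vocabulary_ maps each word to its column index
it_column = vectorizer.vocabulary_["it"]
# Indexing with [row, column] returns the stored count (0 if absent)
it_count = vectors[0, it_column]

# Map the non-zero columns of the first row back to their words
row = vectors[0]
word_by_column = {idx: word for word, idx in vectorizer.vocabulary_.items()}
counts = {word_by_column[col]: int(val) for col, val in zip(row.indices, row.data)}
print(it_count, counts)
```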


In [22]:
print (train_X[0])

It was just one of those books that never went anywhere. I like books that get your attention in the beginning and not drag out until a quarter way through. I decided to give it an early death - delete!


Now we transform the test data using the same fitted vectorizer

In [24]:
test_X_vectors = vectorizer.transform(test_X)

print (test_X_vectors.shape)

(416, 8906)


Our final data is now `train_X_vectors, train_y` and `test_X_vectors, test_y`, and we can train models on it

### Classification 


#### Linear SVM

In [0]:
clf_svm = svm.SVC(kernel='linear')

In [26]:
clf_svm.fit(train_X_vectors, train_y)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [27]:
test_X[0], test_y[0]

('Story is very inaccurate with modern words, phrases and actions.  In the second chapter the author has the bagpipes playing "Amazing Grace" and according to her it is a song as old as time.  As someone who learned to play Amazing Grace on the piano I can state for a fact the song is not old as time. It was not even published until 1779; author has the book set in 1714. 65 years before John Newton wrote and published the songFiona and Juliet speak like they are in the 21 century. Not a young miss in the early 18th century.I have no problem reading about God in books. My problem is when authors take too much leeway and write using modern phrases in historical books.Really, wondering if this author did any \'real\' research or just used what she remembered from high school world history?Really, how many young ladies will tell someone they just met that they were compromised? How many young ladies are going to travel with out any type of female companion? Juliet is traveling with 3 men. 

In [28]:
clf_svm.predict(test_X_vectors[0])

array(['NEGATIVE'], dtype='<U8')

So the SVM classifies the first test review correctly
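
The same two steps (vectorize with the fitted vectorizer, then call `predict`) apply to any new raw text. Here is a self-contained sketch on a tiny, hypothetical training set; it is only meant to show the flow, not to stand in for the model trained above.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical training set
train_texts = [
    "great book loved it",
    "wonderful story great characters",
    "terrible book hated it",
    "boring story awful characters",
]
train_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
clf = svm.SVC(kernel="linear")
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# New text must go through transform (not fit_transform) before predict
new_texts = ["loved it, wonderful", "awful and boring"]
predictions = clf.predict(vectorizer.transform(new_texts))
print(list(predictions))
```

Calling `fit_transform` on the new text instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.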

#### Decision Tree

In [29]:
clf_DT = DecisionTreeClassifier()
clf_DT.fit(train_X_vectors, train_y)
clf_DT.predict(test_X_vectors[0])

array(['POSITIVE'], dtype='<U8')

The decision tree, in contrast, misclassifies the first test review, whose true label is NEGATIVE

### Evaluation

In [31]:
# Mean Accuracy
print(clf_svm.score(test_X_vectors, test_y))
print(clf_DT.score(test_X_vectors, test_y))

0.7980769230769231
0.6153846153846154


In [34]:
f1_score(y_true=test_y, y_pred=clf_svm.predict(test_X_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.8028169 , 0.79310345])

In [0]:
f1_score(y_true=test_y, y_pred=clf_DT.predict(test_X_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
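
With `average=None`, `f1_score` returns one score per label, in the order given by `labels`. The F1 score of a label is the harmonic mean of its precision and recall. The sketch below computes it by hand on a few hypothetical predictions and compares the result with scikit-learn; `f1_for_label` is an illustrative helper and does not handle the zero-division case.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]


def f1_for_label(y_true, y_pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = ["POSITIVE", "NEGATIVE"]
manual = [f1_for_label(y_true, y_pred, label) for label in labels]
from_sklearn = f1_score(y_true=y_true, y_pred=y_pred, average=None, labels=labels)
print(manual, from_sklearn)
```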