<a href="https://colab.research.google.com/github/ShaunakSen/Natural-Language-Processing/blob/master/Real_World_Python_Machine_Learning_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Real-World Python Machine Learning Tutorial

> Based on the YouTube tutorial by Keith Galli: https://www.youtube.com/watch?v=M9Itm95JzL0

---



In [0]:
!pip install rich

In [0]:
import json
import numpy as np
import pandas as pd
from rich import print as r_print
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

In [0]:
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'


class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

In [11]:
file_name = './Books_small.json'

reviews = [] # [review_obj1, review_obj2, ...]

# read the data and append to list
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        # create a Review obj
        review_obj = Review(text=review['reviewText'], score=review['overall'])
        reviews.append(review_obj)

r_print (reviews[0].sentiment)

POSITIVE


In [12]:
print (f'Number of reviews: {len(reviews)}')

Number of reviews: 1000


### Bag of words

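ML models work really well with numeric data and not so well with raw text, so we need a method to convert text to numeric vectors.

In the bag-of-words approach, each position in the vector corresponds to one word of the training vocabulary, and the value at that position is the number of times the word occurs in the text.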

![](https://i.ibb.co/V3ycp0n/diag1.png)

The first two sentences are the training set and the last one is the test set.

Note: the test sentence contains words like 'a' and 'very' that do not appear in the training data. They are not part of the vocabulary, so we simply ignore them.
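
To make the idea concrete, here is a minimal pure-Python sketch of bag of words. The sentences are hypothetical (they are not the ones in the diagram), and `to_vector` is a helper name made up for this illustration. It only shows the principle; the notebook itself relies on scikit-learn's `CountVectorizer` for the real data.

```python
from collections import Counter

# Hypothetical sentences, chosen only for illustration.
train_sentences = ["this book is great", "this book was so bad"]
test_sentence = "a very great book"

# The vocabulary is built from the training sentences only.
vocabulary = sorted({word for sentence in train_sentences for word in sentence.split()})


def to_vector(sentence):
    """Count how often each vocabulary word occurs; unknown words are skipped."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


train_vectors = [to_vector(sentence) for sentence in train_sentences]
test_vector = to_vector(test_sentence)

print(vocabulary)     # ['bad', 'book', 'great', 'is', 'so', 'this', 'was']
print(train_vectors)  # [[0, 1, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1, 1]]
print(test_vector)    # [0, 1, 1, 0, 0, 0, 0]
```

The words 'a' and 'very' leave no trace in `test_vector` because they have no column in the vocabulary.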

### Split the data

In [18]:
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

print (len(training))

print (training[0].text, training[0].sentiment)

670
Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down. POSITIVE


Now we split the training and test data into X and y (inputs and outputs):

In [20]:
train_X = [train_.text for train_ in training]
train_y = [train_.sentiment for train_ in training]

test_X = [test_.text for test_ in test]
test_y = [test_.sentiment for test_ in test]

print (train_X[0], train_y[0], test_X[0], test_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down. POSITIVE Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them. POSITIVE


In [22]:
vectorizer = CountVectorizer()
train_X_vectors = vectorizer.fit_transform(train_X)

print (train_X_vectors.shape)

(670, 7372)


In [28]:
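# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print (vectorizer.get_feature_names_out()[350])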

and


In [23]:
print (train_X_vectors[0])

  (0, 7086)	1
  (0, 1148)	1
  (0, 350)	2
  (0, 1800)	1
  (0, 6595)	1
  (0, 562)	1
  (0, 3054)	1
  (0, 1558)	1
  (0, 6475)	1
  (0, 6593)	1
  (0, 2895)	1
  (0, 7353)	1
  (0, 539)	1
  (0, 1515)	1
  (0, 5197)	1
  (0, 3545)	1
  (0, 2007)	1


This is a sparse matrix: it stores, and prints, only the positions of the non-zero elements.

For the first review, the word at index 350 occurs twice. This is the word 'and', which does indeed occur twice in the review:


In [26]:
print (train_X_vectors[0].shape, train_X[0])

(1, 7372) Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
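
Checking one index at a time gets tedious, so a sparse row can also be decoded back into words in one go. The sketch below is an illustration on a hypothetical two-document corpus (`docs`, `index_to_word`, and `word_counts` are names made up here); it assumes the matrix is in SciPy's CSR format, which is what `CountVectorizer` returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for train_X.
docs = [
    "a tale that grabs you and holds you",
    "vivid characters and vivid descriptions",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)  # SciPy sparse matrix (CSR)

# Invert the word -> column mapping so column indices can be turned back into words.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# A CSR row keeps its non-zero column indices in .indices and the counts in .data.
row = vectors[1]
word_counts = {
    index_to_word[int(idx)]: int(count)
    for idx, count in zip(row.indices, row.data)
}

print(sorted(word_counts.items()))
# [('and', 1), ('characters', 1), ('descriptions', 1), ('vivid', 2)]
```

The same loop should work on `train_X_vectors[0]` with the notebook's fitted `vectorizer`.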


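Now we transform the test data using the same fitted vectorizer. We call `transform` rather than `fit_transform` so that the vocabulary learned from the training data is reused and the test vectors get the same 7372 columns: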

In [29]:
test_X_vectors = vectorizer.transform(test_X)

print (test_X_vectors.shape)

(330, 7372)


Our final data is now `train_X_vectors, train_y` and `test_X_vectors, test_y`, and we want to build a model on this data.

### Classification 


#### Linear SVM

In [0]:
clf_svm = svm.SVC(kernel='linear')

In [34]:
clf_svm.fit(train_X_vectors, train_y)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [36]:
test_X[0], test_y[0]

("Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them.",
 'POSITIVE')

In [37]:
clf_svm.predict(test_X_vectors[0])

array(['POSITIVE'], dtype='<U8')

So the model predicts the first test review correctly.
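
A single correct prediction says little about the model, so it helps to predict several reviews at once and compare them with the true labels. The following is a minimal, self-contained sketch on hypothetical toy reviews (all `toy_*` names and sentences are invented for this example); it uses the same `CountVectorizer` plus linear `SVC` steps as above.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews; the notebook itself uses train_X / test_X.
toy_train_X = [
    "great book loved it",
    "wonderful story great characters",
    "terrible plot hated it",
    "boring and bad writing",
]
toy_train_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
toy_test_X = ["great story", "boring plot"]
toy_test_y = ["POSITIVE", "NEGATIVE"]

# Fit the vocabulary on the training text only, then reuse it for the test text.
toy_vectorizer = CountVectorizer()
toy_train_vectors = toy_vectorizer.fit_transform(toy_train_X)
toy_test_vectors = toy_vectorizer.transform(toy_test_X)

toy_clf = svm.SVC(kernel='linear')
toy_clf.fit(toy_train_vectors, toy_train_y)

# predict() accepts the whole matrix; score() returns the mean accuracy.
predictions = toy_clf.predict(toy_test_vectors)
accuracy = toy_clf.score(toy_test_vectors, toy_test_y)

print(list(predictions), accuracy)
```

In the notebook, the equivalent calls would be `clf_svm.predict(test_X_vectors)` and `clf_svm.score(test_X_vectors, test_y)`.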