CB.EN.U4CSE19301 - Adheena B

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
import string
import pickle
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
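# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
revs = pd.read_csv('/content/drive/MyDrive/Machine Learning/review3.csv', on_bad_lines='skip')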
revs.shape

(306684, 5)

In [None]:
revs.head()

Unnamed: 0,reviewerID,asin,reviews,rating,sentiment
0,A2QK1U70OJ74P,B000FA64PA,well written interesting to see sideous thro...,3,0
1,A3SZMGJMV0G16C,B000FA64PK,troy denning s novella recovery was originally...,3,0
2,A38Z3Q6DTDIH9J,B000FA64PK,another well written ebook by troy denning bu...,3,0
3,A3SZMGJMV0G16C,B000FA64QO,with ylesia a novella originally published in...,2,0
4,A22CW0ZHY3NJH8,B000FA64QO,the events of ylesia take place during dest...,3,0


In [None]:
revs.isna().sum()

reviewerID    0
asin          0
reviews       0
rating        0
sentiment     0
dtype: int64

In [None]:
revs.asin.value_counts()

B006GWO5WK    335
B0093MU7QS    230
B00BTIDW4S    221
B005C5YZ86    182
B007YJ3JV2    145
             ... 
B005VRZXOA      1
B005U7F1YS      1
B00CNCUL2U      1
B00I1R8WVI      1
B004TU4YD6      1
Name: asin, Length: 57450, dtype: int64

In [None]:
revs.reviewerID.value_counts()

A3A7FF87LEVCQ1    571
A13QTZ8CIMHHG4    403
A2VXSQHJWZAQGY    389
A20R37WRPLUM1D    286
A8MTDB180W1XE     256
                 ... 
ADP7WXXL52ZQM       1
A2OYWI3HZJGIL2      1
A1APCB56AV2LQ7      1
AQ3DINQH0WH46       1
AFMG0Z68FCJ6A       1
Name: reviewerID, Length: 61920, dtype: int64

In [None]:
# English() builds a blank English pipeline (tokenizer only); no tagger, parser, NER or word vectors are loaded
nlp = English()
stopwords = list(STOP_WORDS)
punctuations = string.punctuation

We’ll create a tokenizer() function that takes a sentence, splits it into tokens, lemmatizes and lowercases each token, and removes stop words and punctuation.

In [None]:
def tokenizer(sentence):
    # create documents with linguistic annotations
    mytokens = nlp(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return mytokens

To further clean our text data, we’ll also create a custom transformer that strips leading and trailing spaces and converts text to lowercase. Here, we define a custom predictors class which inherits from TransformerMixin and implements the transform, fit and get_params methods. The cleaning itself is done by a clean_text() helper function, which transform applies to every text.

In [None]:
# Custom Transformer for cleaning the text data
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic function to clean the text 
def clean_text(text):     
    return text.strip().lower()

When we classify text, we end up with text snippets matched with their respective labels. But we can’t feed raw text strings to a machine learning model; we first need to convert the text into a numerical representation.

BoW (bag of words) converts text into a matrix of word occurrences per document. It ignores word order and records only how often each word occurs in each document, producing what is called a BoW matrix or a document-term matrix.
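As a rough illustration of the idea, the sketch below builds a tiny document-term matrix by hand. The two example sentences are made up for illustration, and plain whitespace splitting stands in for a real tokenizer; it is not how CountVectorizer is implemented internally.

```python
from collections import Counter

# Two made-up review snippets (illustrative only)
docs = ["well written and interesting", "well written but too short"]

# Vocabulary: every distinct word across the documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = occurrence count
bow_matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
print(bow_matrix)
```

Each row describes one document and each column one vocabulary word, so words shared by both documents ("well", "written") have a non-zero count in both rows.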

We can generate a BoW matrix for our text data by using scikit-learn's CountVectorizer. In the code below, we’re telling CountVectorizer to use the custom tokenizer function we built as its tokenizer, and defining the n-gram range we want.

N-grams are sequences of adjacent words in a given text, where n is the number of words included in each token. With ngram_range=(1,1), only single words (unigrams) are used.
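A minimal sketch of n-gram extraction, assuming the text is already split into tokens. The helper name ngrams is hypothetical and written only for this illustration; it is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "another well written ebook".split()

unigrams = ngrams(tokens, 1)  # n = 1: single words
bigrams = ngrams(tokens, 2)   # n = 2: pairs of adjacent words

print(unigrams)
print(bigrams)
```

Larger values of n capture more context (for example "well written" as one feature) at the cost of a much larger vocabulary.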

TF-IDF (Term Frequency-Inverse Document Frequency) is a way of representing how important a particular term is to a given document, based on how many times the term appears in that document and how many other documents contain the same term. The higher the TF-IDF, the more important that term is to that document.
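The sketch below works through the textbook form of the score, tf × log(N / df), on three made-up tokenized documents. This is an assumption made for illustration: scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Three made-up, already tokenized documents (illustrative only)
docs = [
    ["good", "book", "good", "tips"],
    ["bad", "book"],
    ["good", "story"],
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF for a term that occurs in at least one document."""
    tf = doc.count(term) / len(doc)          # share of the document taken by the term
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # rarer terms get a larger weight
    return tf * idf

score_tips = tf_idf("tips", docs[0], docs)  # appears in one document only
score_book = tf_idf("book", docs[0], docs)  # appears in two of the three documents

print(score_tips, score_book)
```

Both terms occur once in the first document, but "tips" scores higher because it appears in fewer documents overall.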



In [None]:
vectorizer = CountVectorizer(stop_words = None,tokenizer = tokenizer, ngram_range=(1,1)) 
tfvectorizer = TfidfVectorizer(tokenizer = tokenizer)

In [None]:
X = revs['reviews']
y = revs['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=77)

We’ll create a pipeline with three components: a cleaner, a vectorizer, and a classifier. The cleaner uses our predictors class to clean and preprocess the text. The vectorizer uses the CountVectorizer object to create the bag-of-words matrix for our text. The classifier is a LogisticRegression object that classifies the sentiments.

In [None]:
classifier = LogisticRegression()
LRmodel = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

# Train the Model
LRmodel.fit(X_train,y_train)   
LRpred = LRmodel.predict(X_test)
print(f'Confusion Matrix:\n{confusion_matrix(y_test,LRpred)}')
print(f'\nClassification Report:\n{classification_report(y_test,LRpred)}')
print(f'Accuracy: {accuracy_score(y_test,LRpred)*100}%')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Confusion Matrix:
[[51127  9948]
 [ 9849 51750]]

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84     61075
           1       0.84      0.84      0.84     61599

    accuracy                           0.84    122674
   macro avg       0.84      0.84      0.84    122674
weighted avg       0.84      0.84      0.84    122674

Accuracy: 83.86210606974583%


The confusion matrices in this notebook are laid out as:

[[TN, FP],
 [FN, TP]]
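Given this layout, the figures in the classification report can be checked by hand. The sketch below recomputes accuracy, precision and recall for class 1 from the logistic regression confusion matrix printed above; the variable names are chosen for this illustration.

```python
# Logistic regression confusion matrix from the output above, laid out as [[TN, FP], [FN, TP]]
tn, fp = 51127, 9948
fn, tp = 9849, 51750

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # of the reviews predicted as class 1, share that really are class 1
recall = tp / (tp + fn)        # of the true class 1 reviews, share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
```

The results agree with the report: accuracy is about 83.86%, and precision and recall for class 1 both round to 0.84.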

In [None]:
DTclassifier = DecisionTreeClassifier()
DTmodel = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', DTclassifier)])

# Train the Model
DTmodel.fit(X_train,y_train)   
DTpred = DTmodel.predict(X_test)
print(f'Confusion Matrix:\n{confusion_matrix(y_test,DTpred)}')
print(f'\nClassification Report:\n{classification_report(y_test,DTpred)}')
print(f'Accuracy: {accuracy_score(y_test,DTpred)*100}%')

Confusion Matrix:
[[44843 16232]
 [16708 44891]]

Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.73      0.73     61075
           1       0.73      0.73      0.73     61599

    accuracy                           0.73    122674
   macro avg       0.73      0.73      0.73    122674
weighted avg       0.73      0.73      0.73    122674

Accuracy: 73.1483443924548%


In [None]:
SVCclassifier = LinearSVC()
SVCmodel = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', SVCclassifier)])

# Train the Model
SVCmodel.fit(X_train,y_train)   
SVCpred = SVCmodel.predict(X_test)
print(f'Confusion Matrix:\n{confusion_matrix(y_test,SVCpred)}')
print(f'\nClassification Report:\n{classification_report(y_test,SVCpred)}')
print(f'Accuracy: {accuracy_score(y_test,SVCpred)*100}%')



Confusion Matrix:
[[50049 11026]
 [11294 50305]]

Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.82      0.82     61075
           1       0.82      0.82      0.82     61599

    accuracy                           0.82    122674
   macro avg       0.82      0.82      0.82    122674
weighted avg       0.82      0.82      0.82    122674

Accuracy: 81.80543554461418%


In [None]:
from sklearn.ensemble import RandomForestClassifier
RFclassifier = RandomForestClassifier(n_estimators = 10)
RFmodel = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', RFclassifier)])

# Train the Model
RFmodel.fit(X_train,y_train)   
RFpred = RFmodel.predict(X_test)
print(f'Confusion Matrix:\n{confusion_matrix(y_test,RFpred)}')
print(f'\nClassification Report:\n{classification_report(y_test,RFpred)}')
print(f'Accuracy: {accuracy_score(y_test,RFpred)*100}%')

Confusion Matrix:
[[51454  9621]
 [18512 43087]]

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.84      0.79     61075
           1       0.82      0.70      0.75     61599

    accuracy                           0.77    122674
   macro avg       0.78      0.77      0.77    122674
weighted avg       0.78      0.77      0.77    122674

Accuracy: 77.06686013336159%


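**Observations:**

Because we handled the imbalance in our dataset by undersampling, the precision, recall and f1-score of the two classes are nearly equal for Logistic Regression, Decision Tree and LinearSVC. Random Forest is the exception: its recall is 0.84 for class 0 but only 0.70 for class 1.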

# Conclusions
We use **accuracy** as our evaluation metric because we want to compare the overall performance of the models. Accuracy is a reasonable choice here because we have already dealt with the imbalance in the dataset.

**Logistic Regression** has the highest accuracy (83.9%), ahead of LinearSVC (81.8%), Random Forest (77.1%) and Decision Tree (73.1%). So we choose logistic regression as our final model to make predictions.




In [None]:
# Predict sentiment for every review (this includes the rows the model was trained on)
y_pred_final = LRmodel.predict(X)

In [None]:
# Overwrite the original sentiment labels with the model's predictions
revs['sentiment'] = y_pred_final

In [None]:
revs.sample(5)

Unnamed: 0,reviewerID,asin,reviews,rating,sentiment
271106,A35VNGMUH59VCH,B00G6Q0K66,very informative book full of good tips and h...,5,1
2775,A2PHUP1WN3IZPC,B0033UV8HI,i hate to give a book a low rating based on te...,2,0
44395,AE41TLMIZPAE7,B007TKNCSG,what can i say about this short story that wil...,3,0
185661,A2OJOUTOC3LNZK,B00685NFI0,i had to give this books thumbs up and if i ...,5,1
136920,A2K73LH7X2PFR1,B00I2T7BW6,first off let me say i ve read the dressage c...,3,1


In [None]:
revs.to_csv('/content/drive/MyDrive/Machine Learning/revs_final.csv', index = False)