Importing all the Required Libraries

In [1]:
import numpy as np
import pandas as pd
import string 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
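import nltk
# stopwords.words('english') needs the NLTK corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords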
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

Loading the Dataset Collected

In [2]:
dataframe = pd.read_csv('dataset.csv')
dataframe.head()

Unnamed: 0,category,rating,LABEL,text_
0,Home_and_Kitchen_5,5,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5,CG,Very nice set. Good quality. We have had the s...


Creating a new column named Length for the length of each review

In [3]:
dataframe.dropna(inplace=True)
dataframe['Length'] = dataframe['text_'].apply(len)
# Print the longest review labelled 'OR'
longest_or = dataframe[dataframe['LABEL'] == 'OR'].sort_values(by='Length', ascending=False).iloc[0]
print(longest_or.text_)
dataframe.head()

WEAK ON CURRENT SCIENCE.
After seeing it twice, I agree with much (but not all) of the positive five star reviews. Out of respect for those who READ reviews, I'll not repeat everything that I like about the presentation. I found the goofy oversize earrings, hairdo, and facial hair arrangement of Daniel Vitalis, (described as a "Wild Food Expert") distracting. UGH. Ditto for David Wolfe, who had an extremely goofy wild hairdo. On the other hand, Jon Gabriel, described as an "author and weight loss expert" was nicely groomed and a good presenter. His story of personal transformation of a fellow of over 400 pounds (whew) to becoming a jock of normal weight was inspiring. Christiane Northrup preserves her rank as one of America's cutest doctors. A really nice looking woman! Presentations by Dr. Mercola, Jason Vale, Kris Carr, Alejandro Junger were fine. It was disappointing to have Jamie Oliver (so popular in the UK) give Baby Cow Growth Fluid a pass with unscientific but popular ideas on 

Unnamed: 0,category,rating,LABEL,text_,Length
0,Home_and_Kitchen_5,5,CG,"Love this! Well made, sturdy, and very comfor...",75
1,Home_and_Kitchen_5,5,CG,"love it, a great upgrade from the original. I...",80
2,Home_and_Kitchen_5,5,CG,This pillow saved my back. I love the look and...,67
3,Home_and_Kitchen_5,1,CG,"Missing information on how to use it, but it i...",81
4,Home_and_Kitchen_5,5,CG,Very nice set. Good quality. We have had the s...,85


Removing Punctuation and Stop Words

In [4]:
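STOP_WORDS = set(stopwords.words('english'))  # build once; reloading the list for every word is very slow

def convertmyTxt(rv):
    # Drop punctuation characters, then drop English stop words (case-insensitive).
    # The local name no longer shadows the numpy alias `np`.
    no_punct = ''.join(c for c in rv if c not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in STOP_WORDS]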
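
To see what the cleaning step does to a single review, here is a minimal sketch of the same idea applied to one of the sample reviews. The small `DEMO_STOP_WORDS` set and the `clean_review` helper are stand-ins assumed for illustration, so that the snippet runs without the NLTK corpus; the notebook itself relies on `stopwords.words('english')`.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"this", "and", "very", "i", "it"}

def clean_review(text):
    """Drop punctuation characters, then drop stop words (case-insensitive)."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in DEMO_STOP_WORDS]

sample = "Love this!  Well made, sturdy, and very comfortable.  I love it!Very pretty"
tokens = clean_review(sample)
print(tokens)
```

Because punctuation is deleted rather than replaced with a space, "it!Very" fuses into the single token `itVery`, which is consistent with the `all_text` value shown in the `dataframe.head()` output further down. Replacing punctuation with a space instead would be one way to avoid that, if it matters for the model.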

In [16]:
dataframe['Stop Words'] = dataframe['text_'].apply(convertmyTxt)

In [23]:
def rejoin_words(tokenized_column):
    return " ".join(tokenized_column)

dataframe['all_text'] = dataframe['Stop Words'].apply(rejoin_words)

In [24]:
dataframe.head()

Unnamed: 0,category,rating,LABEL,text_,Length,Stop Words,all_text
0,Home_and_Kitchen_5,5,CG,"Love this! Well made, sturdy, and very comfor...",75,"[Love, Well, made, sturdy, comfortable, love, ...",Love Well made sturdy comfortable love itVery ...
1,Home_and_Kitchen_5,5,CG,"love it, a great upgrade from the original. I...",80,"[love, great, upgrade, original, Ive, mine, co...",love great upgrade original Ive mine couple years
2,Home_and_Kitchen_5,5,CG,This pillow saved my back. I love the look and...,67,"[pillow, saved, back, love, look, feel, pillow]",pillow saved back love look feel pillow
3,Home_and_Kitchen_5,1,CG,"Missing information on how to use it, but it i...",81,"[Missing, information, use, great, product, pr...",Missing information use great product price
4,Home_and_Kitchen_5,5,CG,Very nice set. Good quality. We have had the s...,85,"[nice, set, Good, quality, set, two, months]",nice set Good quality set two months


In [5]:
#dataframe['Stopwords'] = convertmyTxt(dataframe['text_'])

Splitting the Data into Training and Testing Sets (75:25)

In [26]:
x_train, x_test, y_train, y_test = train_test_split(dataframe['all_text'], dataframe['LABEL'], test_size=0.25, random_state=42)
print("X Train \n", x_train, "Y Train\n", y_train, "X Test\n", x_test, "Y Test\n", y_test)

X Train 
 37909    shipped delivered time instructions reason ord...
20816    puppia really helps making unruly easy walk do...
33930    3 year old grand daughters favorite loves litt...
1354     work perfect picking entire bunch fruits veget...
34934    Got 3 yr old nephew loves bugs werent creepy c...
                               ...                        
5452     best quality work fine good quality Nice bag W...
34225    Excellent Easy use Product good better descrip...
24951    happy say part growing group authors published...
3350     bought 20piece set trial along Wallace flatwar...
23921                 Good tiny tots However find hard put
Name: all_text, Length: 30324, dtype: object Y Train
 37909    CG
20816    OR
33930    CG
1354     CG
34934    OR
         ..
5452     CG
34225    OR
24951    CG
3350     OR
23921    CG
Name: LABEL, Length: 30324, dtype: object X Test
 26716    love Georgia Davis love fact writer totally co...
8962     Product fit like glove quality good o
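
The 30324 training rows reported above are the 75% share left after `test_size=0.25` sets aside a quarter of the data. A small sketch on made-up data shows the same arithmetic. The `stratify` argument is an optional addition that the notebook does not use; it is included here only as an assumption about what might be wanted when class balance should be preserved in both parts.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: 8 short "reviews" with balanced labels.
texts = [f"review {i}" for i in range(8)]
labels = ["CG", "OR"] * 4

# test_size=0.25 reserves a quarter of the rows (2 of 8) for testing.
# stratify=labels (an assumption, not used in the notebook) keeps the
# CG/OR ratio the same in the training and testing parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
print(len(x_tr), len(x_te))
print(sorted(y_te))
```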

Creating a Pipeline to Fit the Training Dataset

In [29]:
pip = Pipeline([
    ('bow',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('classifier',RandomForestClassifier())
])
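
The first two pipeline steps turn each review into a TF-IDF weighted bag-of-words vector before the classifier sees it. As a rough illustration on a made-up three-document corpus, the sketch below runs `CountVectorizer` followed by `TfidfTransformer` and checks that the result matches scikit-learn's `TfidfVectorizer`, which combines both steps in one class.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

# Made-up corpus, for illustration only.
docs = ["great pillow love it", "love the price", "great great product"]

# Step 1: raw term counts, one row per document and one column per word.
counts = CountVectorizer().fit_transform(docs)
# Step 2: reweight the counts so words that occur in many documents count less.
two_step = TfidfTransformer().fit_transform(counts)

# Single-step equivalent with default settings.
one_step = TfidfVectorizer().fit_transform(docs)

print(counts.shape)
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate, as the notebook does, makes it possible to inspect or tune the raw counts and the weighting independently.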

Fitting the Pipeline to the 75% Training Dataset

In [30]:
pip.fit(x_train, y_train)
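
The pipeline above is fitted with `RandomForestClassifier` defaults. `GridSearchCV` is imported at the top of the notebook, and one possible use for it is tuning the classifier step. The sketch below is only an assumption about how such a search could look: the toy texts, the `demo_pipe` name, and the two-value grid are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up toy data, only to make the snippet self-contained.
texts = [
    "love it great quality", "works fine good value",
    "nice set good quality", "great upgrade love it",
    "solid offering impressive design", "delivers attractive performance",
    "impressive performance solid design", "attractive offering delivers value",
]
labels = ["OR"] * 4 + ["CG"] * 4

demo_pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"classifier__n_estimators": [10, 50]}

search = GridSearchCV(demo_pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

On the real data, the same pattern would be applied to `x_train` and `y_train`, with a grid chosen to fit the available compute budget.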

Predicting The Label Class for Test Dataset

In [31]:
rfc = pip.predict(x_test)
print(rfc)

['OR' 'CG' 'OR' ... 'CG' 'CG' 'CG']


In [32]:
test = pd.Series([
    "Got the 256GB version at 21k and it is really worth the price. Camera can be slightly better but battery backup and other features are really good and worthy for the price paid.",
    "I ordered F 19 pro two times one delivered in May and another in Aug. Only after receiving the second F19 Pro , I realized that first phone packet did not have the the earbuds headphone. I request that I be supplied the earbuds deficient in the first order. Also while confirming the dispatch, pls mention list of supplied items.",
    "Love this!  Well made, sturdy, and very comfortable.  I love it!Very pretty",
    "The Oppo F19 Pro+ 5G is a solid mid-range offering that delivers impressive performance and an attractive design",
])
pred = pip.predict(test)
print(pred)

['OR' 'OR' 'CG' 'CG']
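
`predict` returns only the winning label. When it matters how confident the forest is, the pipeline's `predict_proba` can be inspected instead. The sketch below trains on a made-up toy set (an assumption for illustration, not the notebook's data) so that it runs on its own; `demo_pipe` is a hypothetical stand-in for `pip`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Made-up toy data, only to make the snippet self-contained.
train_texts = [
    "love it great quality", "works fine good value",
    "solid offering impressive design", "delivers attractive performance",
]
train_labels = ["OR", "OR", "CG", "CG"]

demo_pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=25, random_state=0)),
])
demo_pipe.fit(train_texts, train_labels)

# One row per input text, one column per class, in the order of classes_.
proba = demo_pipe.predict_proba(["good quality great value"])
for label, p in zip(demo_pipe.classes_, proba[0]):
    print(label, round(float(p), 2))
```

Note that the model was trained on `all_text`, where punctuation and stop words had been removed, while the strings passed to `predict` above are raw. Running new reviews through `convertmyTxt` and `rejoin_words` first would keep the inputs consistent with training and may change some predictions.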


Finding The Accuracy and Confusion Matrix of The Model

In [33]:
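print('Accuracy of the model: ', str(np.round(accuracy_score(y_test, rfc) * 100, 2)) + '%')
# Rows are true labels and columns are predicted labels, in the order ['CG', 'OR'].
# Treating CG as the positive class, row 0 holds TP and FN, and row 1 holds FP and TN.
cm = confusion_matrix(y_test, rfc, labels=['CG', 'OR'])
print("TP :", cm[0][0])
print("FN :", cm[0][1])
print("FP :", cm[1][0])
print("TN :", cm[1][1])

Accuracy of the model:  85.2%
TP : 4550
FN : 558
FP : 938
TN : 4062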
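
The reported accuracy can be checked by hand from the four counts. The sketch below arranges them the way `confusion_matrix` returns them (rows are true labels, columns are predicted labels, both in sorted order `['CG', 'OR']`) and derives accuracy plus per-class precision and recall. The variable names are illustrative; `classification_report(y_test, rfc)`, which is imported at the top of the notebook, reports the same per-class figures directly.

```python
# Counts copied from the output above: rows are true labels, columns are
# predicted labels, both in sorted order ['CG', 'OR'].
cm = [[4550, 558],
      [938, 4062]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision of a class: correct predictions of it / all predictions of it.
# Recall of a class: correct predictions of it / all true members of it.
precision_cg = cm[0][0] / (cm[0][0] + cm[1][0])
recall_cg = cm[0][0] / (cm[0][0] + cm[0][1])
precision_or = cm[1][1] / (cm[1][1] + cm[0][1])
recall_or = cm[1][1] / (cm[1][1] + cm[1][0])

print(total, round(accuracy * 100, 2))
print("CG", round(precision_cg, 3), round(recall_cg, 3))
print("OR", round(precision_or, 3), round(recall_or, 3))
```

The total of 10108 matches the 25% test share, and the diagonal sum 8612 divided by 10108 gives the 85.2% shown above.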
