# Baseline Experiment Type 1
## Naive Bayes Theory
"Naive Bayes is a statistical classification technique based on Bayes Theorem. It is one of the simplest supervised learning algorithms. Naive Bayes classifier is the fast, accurate and reliable algorithm. Naive Bayes classifiers have high accuracy and speed on large datasets.

Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of other features. For example, a loan applicant is desirable or not depending on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, these features are still considered independently. This assumption simplifies computation, and that's why it is considered as naive. This assumption is called class conditional independence.


P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or marginal probability of the data.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
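
The four quantities are tied together by Bayes' theorem. As a worked form (the feature symbols $x_1, \dots, x_n$ are notation introduced here for illustration and do not come from the tutorial), the theorem and the factorisation implied by class conditional independence can be written as:

$$
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
$$

$$
P(h \mid x_1, \dots, x_n) \;\propto\; P(h)\prod_{i=1}^{n} P(x_i \mid h)
$$

Since $P(D)$ is the same for every hypothesis, a classifier only needs to compare the numerators across classes.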

How does the Naive Bayes classifier work?"
frequency and likelihood -> prior and posterior probability calculation -> calculation of conditional probability -> multiplication of same class conditional probability
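
The chain of steps above can be sketched on a tiny example. The loan data below is invented purely for illustration (it is not the dataset used later), and the code is a minimal sketch of the idea, not scikit-learn's implementation:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (income, payment history) -> loan decision.
rows = [
    ("high", "good", "approve"),
    ("high", "bad", "approve"),
    ("low", "good", "approve"),
    ("low", "bad", "reject"),
    ("low", "bad", "reject"),
    ("high", "bad", "reject"),
]

# Step 1: frequency tables for the classes and for (feature index, value) per class.
class_counts = Counter(label for *_, label in rows)
feature_counts = defaultdict(Counter)
for *features, label in rows:
    for i, value in enumerate(features):
        feature_counts[label][(i, value)] += 1


def posterior(features):
    """Return normalised P(class | features) under the naive assumption."""
    scores = {}
    for label, n_label in class_counts.items():
        # Step 2: prior P(class).
        score = n_label / len(rows)
        # Steps 3-4: multiply the class-conditional likelihoods P(x_i | class).
        for i, value in enumerate(features):
            score *= feature_counts[label][(i, value)] / n_label
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(D)
    return {label: s / total for label, s in scores.items()}


result = posterior(("high", "bad"))
print(result)
```

For the query `("high", "bad")` the unnormalised scores are 1/9 for `approve` and 1/6 for `reject`, so the sketch favours `reject` with a posterior of 0.6.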

https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn
## Advantages

- It is a simple approach that is also fast and often accurate for prediction.
- Naive Bayes has very low computation cost.
- It can work efficiently on a large dataset.
- It performs well with discrete response variables compared to continuous ones.
- It can be used for multi-class prediction problems.
- It also performs well on text analytics problems.
- When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression.

## Disadvantages
- The assumption of independent features. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.
- If a feature value never occurs together with a particular class in the training data, its estimated conditional probability is zero, which forces the posterior probability of that class to zero regardless of the other features. This problem is known as the Zero Probability/Frequency Problem.
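
A common remedy for the zero-frequency problem is additive (Laplace) smoothing. The helper below is a hypothetical illustration with made-up numbers, sketching the idea under the assumption of a simple count-based estimate:

```python
def likelihood(count, class_total, n_values, alpha=0.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    count       -- times the value was seen together with this class
    class_total -- number of observations (e.g. tokens) in the class
    n_values    -- number of distinct values the feature can take
    alpha       -- smoothing strength; 0 disables smoothing
    """
    return (count + alpha) / (class_total + alpha * n_values)


# A word that never occurred among 100 tokens of a class, vocabulary of 50 words.
unsmoothed = likelihood(0, 100, 50)           # zero: wipes out the whole product
smoothed = likelihood(0, 100, 50, alpha=1.0)  # 1 / 150: small but non-zero
print(unsmoothed, smoothed)
```

scikit-learn's `MultinomialNB` exposes the same idea through its `alpha` parameter, which defaults to 1.0, so the model built below is already smoothed.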

# Python Coding: Amazon Movies & TV using Naive Bayes
Used Dataset: "Small" subset of Amazon product data for Movies and TV
structure: 
- reviewerID
- asin
- reviewerName
- helpful
- reviewText
- overall
- summary
- unixReviewTime
- reviewTime
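
The download stores one JSON object per line inside a gzip archive. The sketch below writes two invented records with the fields listed above into an in-memory gzip buffer and reads them back, so the parsing step can be tried without the full download; all record values are made up:

```python
import gzip
import io
import json

# Two made-up records that follow the field layout listed above.
sample = [
    {"reviewerID": "R1", "asin": "A1", "reviewerName": "alice",
     "helpful": [0, 0], "reviewText": "Great film.", "overall": 5.0,
     "summary": "Loved it", "unixReviewTime": 1400630400,
     "reviewTime": "05 21, 2014"},
    {"reviewerID": "R2", "asin": "A1", "reviewerName": "bob",
     "helpful": [1, 2], "reviewText": "Too long.", "overall": 2.0,
     "summary": "Meh", "unixReviewTime": 1389657600,
     "reviewTime": "01 14, 2014"},
]

# Write them as gzip-compressed JSON Lines into an in-memory buffer.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read them back the same way a downloaded file would be read.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# The classifier only needs the review text and the star rating.
pairs = [(r["reviewText"], r["overall"]) for r in records]
print(pairs)
```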

## Setup
Start by importing the required packages.

In [1]:
import gzip
import json

import numpy as np
import pandas as pd

from sklearn import metrics, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

## Getting the Data

In [3]:
#!wget 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz'

--2020-05-14 08:34:14--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 708988936 (676M) [application/x-gzip]
Saving to: ‘reviews_Movies_and_TV_5.json.gz’


2020-05-14 08:37:53 (3.10 MB/s) - ‘reviews_Movies_and_TV_5.json.gz’ saved [708988936/708988936]



## Exploring the Data

In [3]:
data = []
with gzip.open('../Data/reviews_Movies_and_TV_5.json.gz') as f:
    for line in f:
        # each line holds one review as a JSON object
        data.append(json.loads(line.strip()))

# total length of the list, which equals the total number of movie/tv reviews
print(len(data))

1697533


In [5]:
# convert the list of dicts into a pandas dataframe
df = pd.DataFrame(data)
df.head()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ADZPIG9QOCDG5 | 5019281 | Alice L. Larson "alice-loves-books" | [0, 0] | This is a charming version of the classic Dick... | 4.0 | good version of a classic | 1203984000 | 02 26, 2008 |
| 1 | A35947ZP82G7JH | 5019281 | Amarah Strack | [0, 0] | It was good but not as emotionally moving as t... | 3.0 | Good but not as moving | 1388361600 | 12 30, 2013 |
| 2 | A3UORV8A9D5L2E | 5019281 | Amazon Customer | [0, 0] | Don't get me wrong, Winkler is a wonderful cha... | 3.0 | Winkler's Performance was ok at best! | 1388361600 | 12 30, 2013 |
| 3 | A1VKW06X1O2X7V | 5019281 | Amazon Customer "Softmill" | [0, 0] | Henry Winkler is very good in this twist on th... | 5.0 | It's an enjoyable twist on the classic story | 1202860800 | 02 13, 2008 |
| 4 | A3R27T4HADWFFJ | 5019281 | BABE | [0, 0] | This is one of the best Scrooge movies out.  H... | 4.0 | Best Scrooge yet | 1387670400 | 12 22, 2013 |


In [4]:
target = df['overall']
text = df['reviewText']
print(len(text))

1697533


## Preparing the Data


In [5]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(text, target, test_size=0.3,random_state=109) # 70% training and 30% test

# label encode the target variable; fit on the training labels only and
# reuse the same mapping for the test labels
encoder = preprocessing.LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [6]:
#np.unique(y_train)
np.unique(y_test)

array([0, 1, 2, 3, 4], dtype=int64)
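
The five codes correspond to the star ratings 1.0 to 5.0 in sorted order. The plain-Python sketch below mimics that mapping with two hypothetical helpers (`fit_classes` and `encode` are illustrative names, not scikit-learn functions) and shows why the mapping should be learned once from the training labels:

```python
def fit_classes(labels):
    """Return the sorted unique labels; the position of each is its code."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map each label to its index in `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


train_ratings = [5.0, 1.0, 3.0, 4.0, 2.0, 5.0]
test_ratings = [4.0, 5.0, 5.0]  # made-up split without any 1-3 star reviews

# Mapping learned once from the training labels, then reused.
classes = fit_classes(train_ratings)
shared = encode(test_ratings, classes)

# Refitting on the test labels alone silently renumbers them.
refit = encode(test_ratings, fit_classes(test_ratings))
print(shared, refit)
```

With the shared mapping a 5-star review is always code 4; after refitting on this small test split it becomes code 1, which would no longer line up with what the model was trained on.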

In [7]:
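# Feature Extraction: Bag of Words with TF-IDF
# token_pattern=r'\w{1,}' also keeps single-character tokens
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
# NOTE: fitting on the full corpus lets test-set vocabulary and IDF statistics
# leak into the features; fit on X_train instead for a strict evaluation
tfidf_vect.fit(text)

# transform the training and test data using the fitted tfidf vectorizer
xtrain_tfidf = tfidf_vect.transform(X_train)
xtest_tfidf = tfidf_vect.transform(X_test)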

In [65]:
xtrain_tfidf[0]

<1x86756 sparse matrix of type '<class 'numpy.float64'>'
	with 36 stored elements in Compressed Sparse Row format>
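
To make the weighting concrete, here is a small plain-Python sketch of TF-IDF. It assumes the scheme scikit-learn documents as its default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows); the three documents are invented and the function names are hypothetical:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Same token pattern as above: every run of one or more word characters.
    return re.findall(r"\w{1,}", text.lower())


def tfidf(docs):
    """Return (vocabulary, rows), where each row maps term -> TF-IDF weight."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        # L2-normalise the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return sorted(doc_freq), rows


docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[1])
```

In the second document the rarer word `bad` receives a larger weight than `movie`, which appears in two of the three documents.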

## Building the Model

In [8]:
# Create a multinomial Naive Bayes classifier
nb = MultinomialNB()

## Training the Model 

In [9]:
# Train the model using the training sets
nb.fit(xtrain_tfidf, y_train)

MultinomialNB()
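
What fitting a multinomial model involves can be sketched in plain Python. This is a simplified illustration over raw token counts with invented documents (the notebook itself feeds TF-IDF weights to scikit-learn), and `train` and `predict` are hypothetical helper names:

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Fit class log-priors and smoothed per-class token log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    token_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # Denominator of the smoothed estimate: class tokens + alpha per vocab entry.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_like = {t: math.log((token_counts[label][t] + alpha) / total)
                    for t in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_like)
    return model


def predict(model, doc):
    """Pick the class with the highest log-posterior (up to a constant)."""
    def score(label):
        log_prior, log_like = model[label]
        # Tokens never seen during training are skipped.
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)


docs = [["great", "film"], ["loved", "it"], ["boring", "film"], ["awful", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
model = train(docs, labels)
prediction = predict(model, ["great", "plot", "loved"])
print(prediction)
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long reviews.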

## Evaluation

In [10]:
#Predict the response for test dataset
y_pred = nb.predict(xtest_tfidf)

In [16]:
print(confusion_matrix(y_test, y_pred))

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))
print("F1:", metrics.f1_score(y_test, y_pred, average="macro"))

# with CountVectorizer -> accuracy:  0.5956407336134784
# with TFIDF Vectorizer -> Accuracy: 0.5594527874922856

[[    48      2     22     87  30730]
 [     9      0     30    207  30506]
 [     1      0     61    450  59907]
 [     7      2     53    550 114441]
 [    33     15     67    450 271582]]
Accuracy: 0.5345815496995641
Precision: 0.32049096647049036
F1: 0.14230111562625974
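
The gap between the accuracy and the macro-averaged scores can be traced back to the confusion matrix. The sketch below recomputes the three reported numbers from the matrix printed above, using the standard definitions and treating an undefined ratio as 0:

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [
    [48, 2, 22, 87, 30730],
    [9, 0, 30, 207, 30506],
    [1, 0, 61, 450, 59907],
    [7, 2, 53, 550, 114441],
    [33, 15, 67, 450, 271582],
]

n = len(cm)
total = sum(map(sum, cm))
accuracy = sum(cm[i][i] for i in range(n)) / total

precisions, recalls, f1s = [], [], []
for k in range(n):
    tp = cm[k][k]
    predicted = sum(cm[i][k] for i in range(n))  # column sum
    actual = sum(cm[k])                          # row sum
    p = tp / predicted if predicted else 0.0
    r = tp / actual if actual else 0.0
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r) if p + r else 0.0)

# Macro averaging gives every class the same weight, whatever its size.
macro_precision = sum(precisions) / n
macro_f1 = sum(f1s) / n
print(accuracy, macro_precision, macro_f1)
print([round(r, 4) for r in recalls])
```

Almost every review is assigned to the last class, so its recall is close to 1 while the other four classes are close to 0. The majority class carries the accuracy, whereas the macro F1 is pulled down by the four classes the model barely predicts.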


In [20]:
df_confusion = confusion_matrix(y_test, y_pred)
print(df_confusion)

[[    48      2     22     87  30730]
 [     9      0     30    207  30506]
 [     1      0     61    450  59907]
 [     7      2     53    550 114441]
 [    33     15     67    450 271582]]


In [24]:
# Per-class counts derived from the confusion matrix (one-vs-rest);
# rows are true labels, columns are predicted labels
TP = np.diag(df_confusion)
FP = df_confusion.sum(axis=0) - TP
FN = df_confusion.sum(axis=1) - TP
# confusion_matrix returns a NumPy array, so call .sum() directly (no .values)
TN = df_confusion.sum() - (FP + FN + TP)
print("FP:", FP)
print("FN:", FN)
print("TP:", TP)
print("TN:", TN)

FP: [    50     19    172   1194 235584]
FN: [ 30841  30752  60358 114503    565]
TP: [    48      0     61    550 271582]
TN: [478321 478489 448669 393013   1529]

# Python Coding: Amazon Cell Phones & Accessories using Naive Bayes

In [45]:
!wget 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz'

--2020-05-17 14:17:29--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45409631 (43M) [application/x-gzip]
Saving to: ‘reviews_Cell_Phones_and_Accessories_5.json.gz’


2020-05-17 14:17:38 (5.18 MB/s) - ‘reviews_Cell_Phones_and_Accessories_5.json.gz’ saved [45409631/45409631]



In [73]:
data = []
with gzip.open('reviews_Cell_Phones_and_Accessories_5.json.gz') as f: 
    for l in f: 
        data.append(json.loads(l.strip()))

# total length of the list, which equals the total number of cell phone/accessory reviews
print(len(data))

# first row of the list
print(data[0])

df = pd.DataFrame.from_dict(data)

194439
{'reviewerID': 'A30TL5EWN6DFXT', 'asin': '120401325X', 'reviewerName': 'christina', 'helpful': [0, 0], 'reviewText': "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again", 'overall': 4.0, 'summary': 'Looks Good', 'unixReviewTime': 1400630400, 'reviewTime': '05 21, 2014'}


In [74]:
target = df['overall']
text = df['reviewText']
print(len(text))

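# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(text, target, test_size=0.3, random_state=109) # 70% training and 30% test

# label encode the target variable; fit on the training labels only and
# reuse the same mapping for the test labels
encoder = preprocessing.LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

# xtrain_tfidf still holds the matrix from an earlier run here (it is rebuilt in In [78]);
# the integer values printed below are raw term counts, which points to a
# CountVectorizer run rather than TF-IDF weights
print(xtrain_tfidf[0])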

194439
  (0, 8864)	1
  (0, 9252)	1
  (0, 9782)	1
  (0, 16238)	1
  (0, 23509)	1
  (0, 23761)	1
  (0, 33608)	1
  (0, 34171)	1
  (0, 37033)	1
  (0, 39553)	2
  (0, 40618)	1
  (0, 42632)	2
  (0, 42864)	4
  (0, 43878)	1
  (0, 44431)	1
  (0, 45950)	1
  (0, 46713)	1
  (0, 47295)	1
  (0, 51175)	2
  (0, 52793)	1
  (0, 52881)	1
  (0, 53584)	1
  (0, 57502)	1
  (0, 57869)	1
  (0, 66298)	2
  (0, 68758)	1
  (0, 69619)	1
  (0, 75555)	1
  (0, 76237)	2
  (0, 76305)	2
  (0, 81835)	1
  (0, 84266)	1
  (0, 84324)	1
  (0, 84981)	1
  (0, 85060)	1
  (0, 85716)	1


In [75]:
print(text[0])

They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again
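
How this review is split into tokens depends on the `token_pattern`. The sketch below applies the pattern used in this notebook and, for comparison, the pattern scikit-learn documents as its default, using only the `re` module; it illustrates the tokenisation step and does not reproduce the vectorizer itself:

```python
import re
from collections import Counter

review = ("They look good and stick good! I just don't like the rounded shape "
          "because I was always bumping it and Siri kept popping up and it was "
          "irritating. I just won't buy a product like this again")

custom_pattern = r"\w{1,}"          # pattern used in this notebook
default_pattern = r"(?u)\b\w\w+\b"  # scikit-learn's documented default

# The vectorizer lowercases text by default, so do the same here.
custom_tokens = re.findall(custom_pattern, review.lower())
default_tokens = re.findall(default_pattern, review.lower())

counts = Counter(custom_tokens)
print(counts.most_common(5))

# Tokens kept only by the custom pattern (all single characters).
extra = sorted(set(custom_tokens) - set(default_tokens))
print(extra)
```

The custom pattern keeps one-letter tokens such as `i`, `a`, and the `t` split off from "don't" and "won't". The default pattern drops these, so the two settings produce different vocabularies.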


In [78]:
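# Feature Extraction: Bag of Words with TF-IDF
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
# NOTE: fitting on the full corpus lets test-set vocabulary and IDF statistics
# leak into the features; fit on X_train instead for a strict evaluation
tfidf_vect.fit(text)

# transform the training and test data using the fitted tfidf vectorizer
xtrain_tfidf = tfidf_vect.transform(X_train)
xtest_tfidf = tfidf_vect.transform(X_test)

# create the classifier in this section so the cell does not rely on the
# model object from the previous experiment
nb = MultinomialNB()
nb.fit(xtrain_tfidf, y_train)
y_pred = nb.predict(xtest_tfidf)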

In [79]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))
print("F1:", metrics.f1_score(y_test, y_pred, average="macro"))

# with CountVectorizer -> Accuracy: 0.6152712061989989

Accuracy: 0.5594527874922856
Precision: 0.29114904143475573
F1: 0.1453772850881709


