# Baseline Experiment Type 1
## Naive Bayes Theory
"Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms. The Naive Bayes classifier is fast and reliable, and it often achieves good accuracy even on large datasets.

The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features. For example, whether a loan applicant is desirable depends on their income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still treated as independent. This assumption simplifies computation, which is why the method is called naive. The assumption itself is called class conditional independence.


P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of the data, or the evidence.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
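
These four quantities are tied together by Bayes' theorem. As a compact reference, the theorem and the form it takes under class conditional independence can be written as follows. The feature symbols $x_1, \dots, x_n$ are notation introduced here for illustration, not taken from the tutorial.

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
\qquad \text{and, for } D = (x_1, \dots, x_n): \qquad
P(h \mid x_1, \dots, x_n) \;\propto\; P(h) \prod_{i=1}^{n} P(x_i \mid h)
```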

How does the Naive Bayes classifier work?"
build frequency and likelihood tables -> compute prior probabilities -> compute the conditional probability of each feature given each class -> multiply the conditional probabilities belonging to the same class (together with its prior) to obtain the posterior

https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn
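
The steps above can be sketched on a tiny categorical dataset. This is a minimal illustration, not the tutorial's code: the loan rows, the feature names and the `posterior` helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (income, history, decision)
rows = [
    ("high", "good", "approve"),
    ("high", "bad", "approve"),
    ("low", "good", "approve"),
    ("low", "bad", "reject"),
    ("low", "good", "reject"),
    ("high", "bad", "reject"),
]

# Step 1: frequency tables per class and per (feature position, class)
class_counts = Counter(label for *_, label in rows)
feature_counts = defaultdict(Counter)
for *features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1


def posterior(features):
    """Return normalized P(label | features) under the naive assumption."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(h)
        for i, value in enumerate(features):
            # conditional probability P(x_i | h), multiplied per class
            score *= feature_counts[(i, label)][value] / n_label
        scores[label] = score
    total = sum(scores.values())  # plays the role of the evidence P(D)
    return {label: s / total for label, s in scores.items()}


result = posterior(("high", "good"))
print(result)
```

For the query `("high", "good")` the unnormalized scores are 0.5 · 2/3 · 2/3 for `approve` and 0.5 · 1/3 · 1/3 for `reject`, so the normalized posterior favours `approve` four to one.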
## Advantages

- It is a simple and fast approach that is often accurate enough for prediction.
- Training and prediction have a very low computational cost.
- It works efficiently on large datasets.
- It performs better with a discrete response variable than with a continuous one.
- It can be used for multi-class prediction problems.
- It also performs well on text analytics problems.
- When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression.

## Disadvantages
- The assumption of independent features: in practice, it is almost impossible to obtain a set of predictors that are entirely independent.
- If a feature value never occurs together with a particular class in the training data, its estimated conditional probability is zero, which forces the posterior probability of that class to zero regardless of the other features. In this case the model cannot make a sensible prediction. This is known as the zero probability (or zero frequency) problem.
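
The zero probability problem is usually handled with additive (Laplace) smoothing, which is what the `alpha` parameter of scikit-learn's `MultinomialNB` controls. The sketch below illustrates the idea in plain Python. The word counts, the vocabulary and the `likelihood` helper are hypothetical.

```python
from collections import Counter

# Hypothetical word counts observed in the training reviews of one class
word_counts = Counter({"great": 3, "movie": 2})
vocabulary = ["great", "movie", "boring"]  # "boring" was never seen in this class


def likelihood(word, alpha):
    """P(word | class) with additive smoothing; alpha=0 means no smoothing."""
    total = sum(word_counts.values())
    return (word_counts[word] + alpha) / (total + alpha * len(vocabulary))


unsmoothed = likelihood("boring", alpha=0.0)  # zero wipes out the whole product
smoothed = likelihood("boring", alpha=1.0)    # small but non-zero
print(unsmoothed, smoothed)
```

With `alpha=1.0` every word in the vocabulary gets one pseudo-count, so the smoothed likelihoods still sum to one over the vocabulary.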

# Python Coding: Amazon Movies & TV using Naive Bayes
Dataset used: "small" subset of the Amazon product data for Movies and TV.
Each review record has the following fields:
- reviewerID
- asin
- reviewerName
- helpful
- reviewText
- overall
- summary
- unixReviewTime
- reviewTime

## Setup
Start by importing the required packages.

In [32]:
# Standard library
import gzip
import json
from collections import Counter

# Numerical and data handling
import numpy as np
import pandas as pd
import scipy

# scikit-learn
from sklearn import metrics, model_selection, preprocessing
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder

# imbalanced-learn
from imblearn.datasets import make_imbalance
from imblearn.metrics import classification_report_imbalanced
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NearMiss

#!wget 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz'

In [18]:
data = []
# Each line of the gzipped file holds one JSON-encoded review
with gzip.open('../Data/reviews_Cell_Phones_and_Accessories_5.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

In [19]:
# Convert the list of review dicts into a pandas DataFrame
df = pd.DataFrame(data)

# Features are the review texts; the target is the star rating in `overall`
body = df['reviewText']
target = df['overall']


In [7]:
df.overall.value_counts()

5.0    906608
4.0    382994
3.0    201302
1.0    104219
2.0    102410
Name: overall, dtype: int64

In [20]:
# Split dataset into training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(body, target, test_size=0.3, random_state=109)

# Label-encode the target variable: fit on the training labels, then reuse the same mapping for the test labels
encoder = preprocessing.LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)
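
Because the encoded labels 0 to 4 appear in every later report, it helps to see how they map back to star ratings. `LabelEncoder` assigns indices in sorted order of the distinct values. The plain-Python sketch below mimics that mapping for the five ratings in `overall`. The names `to_index` and `to_rating` are illustrative only.

```python
# Star ratings as they appear in the `overall` column
ratings = [5.0, 4.0, 3.0, 1.0, 2.0]

# Indices follow the sorted order of the distinct values
classes = sorted(set(ratings))
to_index = {rating: i for i, rating in enumerate(classes)}
to_rating = {i: rating for rating, i in to_index.items()}
print(to_index)
```

Under this mapping, class 0 corresponds to one-star reviews and class 4 to five-star reviews.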

In [31]:
# important to see the class representation
print('Training target statistics: {}'.format(Counter(y_train)))
print('Testing target statistics: {}'.format(Counter(y_test)))

Training target statistics: Counter({4: 76035, 3: 27974, 2: 15055, 0: 9285, 1: 7758})
Testing target statistics: Counter({4: 32629, 3: 12019, 2: 6384, 0: 3994, 1: 3306})


In [26]:
pipeline = make_pipeline(NearMiss(version=2),MultinomialNB())

In [None]:
#np.unique(y_train)
np.unique(y_test)

In [28]:
# Feature extraction: bag of words with TF-IDF weighting
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
# Note: fitting on `body` also uses the test reviews; fitting on X_train alone would keep the test set unseen
tfidf_vect.fit(body)

# Transform the training and test data using the fitted TF-IDF vectorizer
xtrain_tfidf = tfidf_vect.transform(X_train)
xtest_tfidf = tfidf_vect.transform(X_test)
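
To make the weighting less opaque, here is a small plain-Python sketch of TF-IDF for a single document. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are `TfidfVectorizer`'s default settings. The three example reviews and the `tfidf` helper are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

docs = ["good phone", "good case good price", "bad phone"]  # hypothetical reviews
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(tokenized[1])
print(weights)
```

In the second review, "good" occurs twice, which outweighs its lower IDF, so it receives the largest weight. "case" and "price" each occur once in a single document and receive equal weights.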


In [30]:
pipeline.fit(xtrain_tfidf, y_train)

Pipeline(steps=[('nearmiss', NearMiss(version=2)),
                ('multinomialnb', MultinomialNB())])

In [36]:
print(classification_report_imbalanced(y_test, pipeline.predict(xtest_tfidf)))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.49      0.45      0.97      0.47      0.66      0.42      3994
          1       0.24      0.19      0.96      0.21      0.43      0.17      3306
          2       0.32      0.20      0.95      0.25      0.44      0.18      6384
          3       0.34      0.46      0.77      0.39      0.60      0.34     12019
          4       0.76      0.73      0.70      0.74      0.72      0.51     32629

avg / total       0.58      0.57      0.77      0.57      0.64      0.42     58332
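
The less familiar columns of this report can be recomputed from the others. `geo` is the geometric mean of recall (`rec`) and specificity (`spe`). `iba` is sketched here as the index balanced accuracy of the squared geometric mean, assuming a dominance weight of 0.1. Treat that weight as an assumption and check it against the imbalanced-learn documentation. With the two-decimal values copied from the report, the sketch reproduces both columns up to rounding.

```python
import math

# Per-class values copied from the report above: (recall, specificity)
reported = {
    0: (0.45, 0.97),
    1: (0.19, 0.96),
    2: (0.20, 0.95),
    3: (0.46, 0.77),
    4: (0.73, 0.70),
}
ALPHA = 0.1  # assumption: weight of the dominance term (recall - specificity)

computed = {}
for label, (rec, spe) in reported.items():
    geo = math.sqrt(rec * spe)                   # geometric mean of recall and specificity
    iba = (1 + ALPHA * (rec - spe)) * rec * spe  # weighted version of geo squared
    computed[label] = (geo, iba)
    print(label, round(geo, 2), round(iba, 2))
```

A low `geo` for classes 1 and 2 shows that high specificity alone does not help: the classifier rarely recognizes two- and three-star reviews.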



In [37]:
y_pred = pipeline.predict(xtest_tfidf)

In [None]:
xtrain_tfidf[0]


In [16]:
from imblearn.under_sampling import RandomUnderSampler
# Resample the training split only; features and labels must have the same number of rows
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(xtrain_tfidf, y_train)
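
Random undersampling itself is a simple operation: group the row indices by class, then keep as many rows per class as the smallest class has. The plain-Python sketch below shows the idea. The `random_undersample` helper and the toy data are hypothetical, and a sampler always needs features and labels that are aligned row by row.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep an equal number of samples per class by discarding rows at random."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(y):
        by_class.setdefault(label, []).append(index)
    n_keep = min(len(indices) for indices in by_class.values())
    kept = sorted(
        i for indices in by_class.values() for i in rng.sample(indices, n_keep)
    )
    return [X[i] for i in kept], [y[i] for i in kept]


X = ["review %d" % i for i in range(10)]  # hypothetical review texts
y = [4] * 6 + [3] * 2 + [0] * 2           # imbalanced labels, aligned with X
X_small, y_small = random_undersample(X, y)
print(Counter(y_small))
```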

In [8]:
# Create a multinomial Naive Bayes classifier
nb = MultinomialNB()

In [16]:
# Train the model using the training sets
nb.fit(xtrain_tfidf, y_train)

MultinomialNB()

## Evaluation

In [17]:
# https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/applications/plot_multi_class_under_sampling.html
# Predict the response for the test dataset
y_pred = nb.predict(xtest_tfidf)

In [38]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))
print("F1:", metrics.f1_score(y_test, y_pred, average="macro"))

Accuracy: 0.5693272989096894
Precision: 0.43032181039084405
F1: 0.41395953969946325


In [18]:
pd.crosstab(y_test, y_pred)
# with CountVectorizer -> accuracy:  0.5956407336134784
# with TFIDF Vectorizer -> Accuracy: 0.5594527874922856

col_0   0   1   2    3       4
row_0
0      48   2  22   87   30730
1       9   0  30  207   30506
2       1   0  61  450   59907
3       7   2  53  550  114441
4      33  15  67  450  271582
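
The crosstab above can be read more easily once it is reduced to per-class precision and recall. The sketch below copies the counts into a nested list (rows are the true classes from `y_test`, columns the predicted classes) and derives accuracy and the macro averages from it. The variable names are illustrative only.

```python
# Confusion matrix copied from the crosstab above: rows = true class, columns = predicted class
cm = [
    [48, 2, 22, 87, 30730],
    [9, 0, 30, 207, 30506],
    [1, 0, 61, 450, 59907],
    [7, 2, 53, 550, 114441],
    [33, 15, 67, 450, 271582],
]
n_classes = len(cm)
total = sum(map(sum, cm))
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

recall, precision = [], []
for k in range(n_classes):
    row_sum = sum(cm[k])                             # all samples whose true class is k
    col_sum = sum(cm[i][k] for i in range(n_classes))  # all samples predicted as k
    recall.append(cm[k][k] / row_sum if row_sum else 0.0)
    precision.append(cm[k][k] / col_sum if col_sum else 0.0)

macro_recall = sum(recall) / n_classes
macro_precision = sum(precision) / n_classes
print(accuracy, macro_precision, macro_recall)
```

Almost every sample falls into the last column, so recall is close to one for class 4 and close to zero for the other classes. This is the typical signature of a classifier dominated by the majority class, and the reason macro-averaged scores are reported alongside accuracy.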


In [25]:
from collections import Counter
# Oversample the minority classes of the training split with SMOTE
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(xtrain_tfidf, y_train)
print("Before SMOTE:", Counter(y_train))
print("After SMOTE:", Counter(y_train_smote))

KeyboardInterrupt: 
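
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified plain-Python illustration of that idea on dense 2-D points, not imbalanced-learn's implementation. The `smote_like` helper and the sample points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples (squared Euclidean distance)
        others = [p for p in minority if p is not x]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x to the neighbour
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical 2-D samples
new_points = smote_like(minority, n_new=5)
print(new_points)
```

The neighbour search is the expensive part of this procedure on large, high-dimensional data such as a TF-IDF matrix.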

In [None]:
print(X_train_smote.shape[0])
nb.fit(X_train_smote, y_train_smote)
y_pred = nb.predict(xtest_tfidf)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))
print("F1:", metrics.f1_score(y_test, y_pred, average="macro"))

In [16]:
# Undersample the training set with NearMiss on bag-of-words counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_test_counts = count_vect.transform(X_test)
nm = NearMiss()
X_res, y_res = nm.fit_resample(X_train_counts, y_train)
print(X_res.shape, y_res.shape)

In [None]:
print('Original dataset shape {}'.format(Counter(y_train)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
# Train on the resampled counts and evaluate on the test set in the same feature space
nb.fit(X_res, y_res)
y_pred = nb.predict(X_test_counts)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))
print("F1:", metrics.f1_score(y_test, y_pred, average="macro"))