### Business Case

ABC Restaurant wants a binary classification model that labels each customer review as positive or negative.

1. We have a **Historical dataset** of 900 observations, each a review labeled positive or negative.
2. We have a **CurrentWeek dataset** of 100 observations, and we have to determine whether each review is positive or negative.

### Data Cleaning Steps

1. Original data
2. Drop special characters
3. Convert to lowercase
4. Drop stop words and apply stemming
5. Build a bag of words
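
The last step is not spelled out in the list, so here is a minimal sketch of what a bag of words is, using only the standard library. The helper name `bag_of_words` and the two sample reviews are illustrative assumptions; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def bag_of_words(cleaned_reviews):
    """Return (vocabulary, count_matrix) for already-cleaned review strings.

    Hypothetical helper, for illustration only.
    """
    # Vocabulary: every distinct token, sorted for a stable column order
    vocabulary = sorted({tok for review in cleaned_reviews for tok in review.split()})
    matrix = []
    for review in cleaned_reviews:
        counts = Counter(review.split())
        # One row per review, one column per vocabulary token
        matrix.append([counts[tok] for tok in vocabulary])
    return vocabulary, matrix

vocab, rows = bag_of_words(["crust not good", "not tasti textur nasti"])
print(vocab)
print(rows)
```

Each review becomes a row of token counts. Word order is discarded, which is why the representation is called a "bag".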

**Stop words**
> Stop words are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.

**Stemming**
> Stemming is a technique that extracts the base form of a word by removing its affixes, much like cutting the branches of a tree down to its stem. For example, the stem of the words eating, eats, and eaten is eat.

>https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-stemming-in-natural-language-processing/
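
To see stemming in action, the sketch below runs NLTK's `PorterStemmer` (the same stemmer the notebook uses later) on a few words. The word list is an illustrative assumption. Porter is rule-based, so irregular forms may not collapse to the same stem as their regular relatives.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Sample words chosen for illustration
words = ["eating", "eats", "loved", "tasty"]
stems = {w: ps.stem(w) for w in words}
print(stems)
```

Stems such as `tasti` are not dictionary words. They only need to be consistent, so that related forms of a word map to the same feature.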

### Naive Bayes

> P(A|B) = P(B|A) * P(A) / P(B)
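
As a worked example of the formula, the sketch below estimates the probability that a review is positive given that it contains a particular word. The counts are made-up numbers for illustration, not taken from the dataset. Here A is "the review is positive" and B is "the review contains the word".

```python
# Hypothetical counts, for illustration only
n_reviews = 100
n_positive = 50            # reviews labeled positive
n_with_word = 20           # reviews containing the word
n_positive_with_word = 16  # positive reviews containing the word

p_a = n_positive / n_reviews                     # P(A)
p_b = n_with_word / n_reviews                    # P(B)
p_b_given_a = n_positive_with_word / n_positive  # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))
```

The result agrees with the direct count (16 of the 20 reviews containing the word are positive). Naive Bayes applies this rule per class and treats the features as conditionally independent given the class.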

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('Historic.tsv',sep='\t')

In [5]:
data.head(5)

|   | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [6]:
data.shape

(900, 2)

### Data Cleaning

In [8]:
import re
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

all_stopwords = stopwords.words('english')
# Keep 'not': negation carries sentiment ("not good" vs "good")
all_stopwords.remove('not')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harikrishnareddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Stores the customer reviews after data cleaning
corpus = []
stop_set = set(all_stopwords)  # build the set once instead of once per word

for raw_review in data['Review']:
    review = re.sub('[^a-zA-Z]', ' ', raw_review)  # replace non-letters with spaces
    review = review.lower()
    review = review.split()
    review = [ps.stem(w) for w in review if w not in stop_set]  # drop stop words, then stem
    review = ' '.join(review)
    corpus.append(review)

In [12]:
corpus[:5]

['wow love place',
 'crust not good',
 'not tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']
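
Because the CurrentWeek reviews must pass through the same cleaning steps, it can help to wrap the cleaning loop in a function. This is a sketch, not the notebook's code: `clean_review` is a hypothetical name, and the small stop-word set in the usage lines stands in for the NLTK list so the example runs without a download.

```python
import re
from nltk.stem.porter import PorterStemmer

def clean_review(text, stemmer, stop_words):
    """Apply the cleaning steps to one raw review and return a cleaned string."""
    text = re.sub('[^a-zA-Z]', ' ', text)  # replace non-letters with spaces
    tokens = text.lower().split()          # lowercase and tokenize
    # Drop stop words, then stem what remains
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Stand-in stop words for illustration; the notebook uses NLTK's English list minus 'not'
demo_stop_words = {"this", "is", "the"}
demo_stemmer = PorterStemmer()
cleaned = [clean_review(r, demo_stemmer, demo_stop_words)
           for r in ["Wow... Loved this place.", "Crust is not good."]]
print(cleaned)
```

Passing the stemmer and stop-word set as arguments keeps the function free of global state, so the same call can be reused on any new batch of reviews.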

### Data Transformation

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep only the 1420 most frequent tokens in the corpus
cv = CountVectorizer(max_features=1420)
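
To make the effect of `max_features` concrete, this sketch fits a `CountVectorizer` on a toy corpus (an assumption for illustration) and keeps only the two most frequent tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only
toy_corpus = ["not good", "not bad", "good food", "not tasti"]

# 'not' appears 3 times and 'good' twice; every other token appears once
toy_cv = CountVectorizer(max_features=2)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

print(sorted(toy_cv.vocabulary_))  # tokens that survived the cut
print(toy_X.shape)                 # (number of reviews, number of kept tokens)
```

Tokens outside the kept vocabulary are ignored when transforming, so a review made up only of rare words becomes an all-zero row.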

In [17]:
X = cv.fit_transform(corpus).toarray()
y = data.iloc[:,-1].values

In [19]:
# Save the fitted vectorizer so new reviews can be transformed the same way
import pickle as pk
with open('sentimetal.pkl', 'wb') as f:
    pk.dump(cv, f)

### Train/Test Split

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Naive Bayes

In [23]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train,y_train)

GaussianNB()

In [24]:
# Export the trained Naive Bayes classifier for later use
import joblib
joblib.dump(classifier,'NBclassifierSentimental')

['NBclassifierSentimental']
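
The saved vectorizer and classifier are what a later session needs in order to label the CurrentWeek reviews. The sketch below shows one way that could look. `predict_reviews` is a hypothetical helper, and the demo trains a throwaway model on three toy reviews in a temporary directory so that the example is self-contained. In practice the paths would be `sentimetal.pkl` and `NBclassifierSentimental`, and the reviews would first go through the same cleaning steps.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

def predict_reviews(cleaned_reviews, vectorizer_path, model_path):
    """Load the saved artifacts and return 0/1 predictions for cleaned reviews."""
    with open(vectorizer_path, 'rb') as f:
        vectorizer = pickle.load(f)
    model = joblib.load(model_path)
    # transform (not fit_transform): reuse the vocabulary learned during training
    features = vectorizer.transform(cleaned_reviews).toarray()
    return model.predict(features)

# Self-contained demo with toy data (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    toy_cv = CountVectorizer()
    toy_X = toy_cv.fit_transform(['wow love place', 'crust not good', 'love food']).toarray()
    toy_model = GaussianNB().fit(toy_X, [1, 0, 1])

    cv_path = os.path.join(tmp, 'vectorizer.pkl')
    model_path = os.path.join(tmp, 'classifier.joblib')
    with open(cv_path, 'wb') as f:
        pickle.dump(toy_cv, f)
    joblib.dump(toy_model, model_path)

    preds = predict_reviews(['love place', 'not good'], cv_path, model_path)

print(preds)
```

The key design point is calling `transform` rather than `fit_transform` on new data, so the feature columns line up with the ones the classifier was trained on.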

### Model Performance

In [26]:
y_pred = classifier.predict(X_test)

In [30]:
from sklearn.metrics import accuracy_score,confusion_matrix

cm = confusion_matrix(y_test,y_pred)
cm

array([[67, 11],
       [38, 64]])

In [31]:
accuracy_score(y_test,y_pred)

0.7277777777777777
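
Accuracy alone hides how the errors are distributed. As a sketch, the counts from the confusion matrix above can be unpacked into precision and recall for the positive class, using scikit-learn's layout (rows are true labels, columns are predicted labels). The variable names are illustrative.

```python
# Confusion matrix reported above: rows = true label, columns = predicted label
cm = [[67, 11],
      [38, 64]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of reviews predicted positive, the share that are truly positive
recall = tp / (tp + fn)     # of truly positive reviews, the share the model found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The model labels 38 of the 102 positive test reviews as negative, so recall on positives (about 0.63) is noticeably lower than precision (about 0.85).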