# Natural Language Processing Project

In this NLP project you will be attempting to classify Yelp Reviews into 1 or 5 star categories based off the text content in the reviews.

We will use the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

Each observation in this dataset is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users. 

All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.

Let's get started!

## Imports
 **Import some libraries . :) **

In [56]:
import seaborn as sns 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk.corpus

## The Data

**Read the yelp.csv file**

In [57]:
df=pd.read_csv('yelp.csv')

**Check the head, info , and describe methods on df. This can help identify different things such as null values, mean, std etc**

In [58]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [59]:
df.describe()

Unnamed: 0,stars,cool,useful,funny
count,10000.0,10000.0,10000.0,10000.0
mean,3.7775,0.8768,1.4093,0.7013
std,1.214636,2.067861,2.336647,1.907942
min,1.0,0.0,0.0,0.0
25%,3.0,0.0,0.0,0.0
50%,4.0,0.0,1.0,0.0
75%,5.0,1.0,2.0,1.0
max,5.0,77.0,76.0,57.0


In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  10000 non-null  object
 1   date         10000 non-null  object
 2   review_id    10000 non-null  object
 3   stars        10000 non-null  int64 
 4   text         10000 non-null  object
 5   type         10000 non-null  object
 6   user_id      10000 non-null  object
 7   cool         10000 non-null  int64 
 8   useful       10000 non-null  int64 
 9   funny        10000 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 781.4+ KB


**Create a new column called "text length" which is the number of words in the text column.**

In [61]:
df['length']=df['text'].apply(len)

**Use groupby to get the mean values of the numerical columns, you should be able to create this dataframe with the operation:**

In [62]:
star=df.groupby('stars').mean()

**Use the corr() method on that groupby dataframe to produce this dataframe. It helps us understand the correlation between
different features**

In [63]:
star.corr()

Unnamed: 0,cool,useful,funny,length
cool,1.0,-0.743329,-0.944939,-0.857664
useful,-0.743329,1.0,0.894506,0.699881
funny,-0.944939,0.894506,1.0,0.843461
length,-0.857664,0.699881,0.843461,1.0


In [64]:
df=df[(df['stars']==5) | (df['stars']==1)]

## NLP Classification Task


In [65]:
#we could pass this in countVectorizer object to remove punctuations 
#and stop words but it can take quite some time. Try it if you want to
#I tried using google colab and it worked.
#imported string library for punctuations
#imported nltk for stopwords
from nltk.corpus import stopwords
import string
def text_process(mess):
    """
    1. Remove punctuations.
    2. remove stop words
    4. return list of clearn words """
    nopunc=[char for char in mess if char not in string.punctuation]
    nopunc=''.join(nopunc)
    return [word for word in nopunc.split() if word.lower not in stopwords.words('english')]

**Create two objects x and y. x will be the 'text' column of df and y will be the 'stars' column of df.(Your features and target/labels)**

In [66]:
x=df['text']
y=df['stars']

## Train Test Split

Let's split our data into training and testing data.

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=101)

## Training a Model

Time to train a model!

**Import MultinomialNB, pipleline and CountVectorizer. Pipleline allows to organize instead of calling them one by one and passing data**

In [79]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('classifier', MultinomialNB()),  # train on Naive Bayes classifier
])

**Now fit pipeline using the training data.**

In [80]:
pipeline.fit(xtrain,ytrain)


Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_process at 0x0000023B322F7A60>)),
                ('classifier', MultinomialNB())])

## Predictions and Evaluations


**Use the predict method off of pipeline to predict labels from xtest.**

In [81]:
predictions = pipeline.predict(xtest)

** Create a confusion matrix and classification report using these predictions and ytest **

In [82]:
from sklearn.metrics import classification_report,confusion_matrix

In [83]:
print(confusion_matrix(predictions,ytest))
print('\n')
print(classification_report(predictions,ytest))

[[142  10]
 [ 86 988]]


              precision    recall  f1-score   support

           1       0.62      0.93      0.75       152
           5       0.99      0.92      0.95      1074

    accuracy                           0.92      1226
   macro avg       0.81      0.93      0.85      1226
weighted avg       0.94      0.92      0.93      1226

