### Text Classification

The objective of this project was to build a predictive model that classifies customer reviews as expressing positive or negative sentiment, using natural language processing.

### Data Set

The dataset comes from https://www.yelp.com/dataset; the business and review JSON files were used here.<br>
Due to computing constraints, only the first 1 million rows of the review JSON file were used to build the model.

In [1]:
import json
import numpy as np
import pandas as pd

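# Both files are in JSON Lines format (one JSON object per line), hence lines=True
business = pd.read_json('./yelp_academic_dataset_business.json', lines=True)
# nrows limits how many lines are read; it is only valid together with lines=True
reviews = pd.read_json('./yelp_academic_dataset_review.json', lines=True, nrows=1000000)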
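If memory is the limiting factor, an alternative to capping the row count is to read the file in chunks and keep only the needed columns from each chunk. The following is a minimal sketch, not what was done in this project: it writes a small made-up JSON Lines file (`toy_review.json`, a hypothetical stand-in for the Yelp review file) and reads it back with the `chunksize` parameter of `pd.read_json`.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real review file (assumption: same JSON Lines layout)
rows = [
    {"review_id": f"r{i}", "stars": (i % 5) + 1, "text": f"review number {i}"}
    for i in range(10)
]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_review.json")
with open(tmp_path, "w") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

# With chunksize set, read_json returns an iterator of DataFrames instead of one DataFrame
chunks = []
for chunk in pd.read_json(tmp_path, lines=True, chunksize=4):
    # Keep only the columns needed later to reduce memory use
    chunks.append(chunk[["text", "stars"]])

toy_reviews = pd.concat(chunks, ignore_index=True)
print(len(chunks), toy_reviews.shape)
```

Each chunk could also be filtered (for example by `business_id`) before it is appended, so the full file never has to sit in memory at once.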


In [2]:
# Check the column headers of the business file; note that it has no review-text column (consumer reviews)
list(business.columns)

['business_id',
 'name',
 'address',
 'city',
 'state',
 'postal_code',
 'latitude',
 'longitude',
 'stars',
 'review_count',
 'is_open',
 'attributes',
 'categories',
 'hours']

In [3]:
# Check the first 2 rows of the business file; note the categories column
business[:2]

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u...","Gastropubs, Food, Beer Gardens, Restaurants, B...","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
1,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Salad, Soup, Sandwiches, Delis, Restaurants, C...","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ..."


In [4]:
business['categories'][:5]

0    Gastropubs, Food, Beer Gardens, Restaurants, B...
1    Salad, Soup, Sandwiches, Delis, Restaurants, C...
2    Antiques, Fashion, Used, Vintage & Consignment...
3                           Beauty & Spas, Hair Salons
4    Gyms, Active Life, Interval Training Gyms, Fit...
Name: categories, dtype: object

In [5]:
# Keep only businesses whose categories mention 'Restaurant' (na=False treats missing categories as non-matches)
restau_data = business[business['categories'].str.contains('Restaurant', na=False)]

In [6]:
#There are 50,793 rows categorized under Restaurant
restau_data.shape

(50793, 14)
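The filter above matches the substring `Restaurant` anywhere in the categories string, so it would also keep any category that merely contains that word. A stricter option is to split the string and match the exact tag. The sketch below compares the two on made-up data; the `Restaurant Supplies` category is hypothetical and used only for illustration.

```python
import pandas as pd

# Made-up businesses; the categories strings are illustrative only
toy_business = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": [
        "Gastropubs, Food, Restaurants",
        "Beauty & Spas, Hair Salons",
        None,                              # missing categories
        "Restaurant Supplies, Shopping",   # hypothetical non-restaurant category
    ],
})

# Substring match, as in the cell above; missing values count as non-matches
substring_mask = toy_business["categories"].str.contains("Restaurant", na=False)

# Exact-tag match: split on commas and look for the tag 'Restaurants'
exact_mask = toy_business["categories"].fillna("").apply(
    lambda cats: "Restaurants" in [c.strip() for c in cats.split(",")]
)

print(substring_mask.tolist())
print(exact_mask.tolist())
```

Which variant is appropriate depends on how the category tags are actually spelled in the data, which is worth checking before switching.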

In [7]:
# Check the column headers and first rows of the reviews data
reviews.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4,3,1,1,Apparently Prides Osteria had a rough summer a...,2014-10-11 03:34:02
1,8bFej1QE5LXp4O05qjGqXA,YoVfDbnISlW0f7abNQACIg,RA4V8pr014UyUbDvI-LW2A,4,1,0,0,This store is pretty good. Not as great as Wal...,2015-07-03 20:38:25
2,NDhkzczKjLshODbqDoNLSg,eC5evKn1TWDyHCyQAwguUw,_sS2LBIGNT5NQb6PD1Vtjw,5,0,0,0,I called WVM on the recommendation of a couple...,2013-05-28 20:38:06
3,T5fAqjjFooT4V0OeZyuk1w,SFQ1jcnGguO0LYWnbbftAA,0AzLzHfOJgL7ROwhdww2ew,2,1,1,1,I've stayed at many Marriott and Renaissance M...,2010-01-08 02:29:15
4,sjm_uUcQVxab_EeLCqsYLg,0kA0PAJ8QFMeveQWHFqz2A,8zehGz9jnxPqXtOc7KaJxA,4,0,0,0,The food is always great here. The service fro...,2011-07-28 18:05:01


In [8]:
# Keep only the reviews whose business_id appears in restau_data, since this project focuses on restaurant reviews

reviews_rest = reviews[reviews['business_id'].isin(restau_data['business_id'])]

In [9]:
reviews_rest.shape

(653719, 9)
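To make the filtering step concrete, here is a small sketch on made-up data. `isin` returns a boolean mask with one entry per review, and indexing with that mask keeps only the matching rows. The frame names and ids are illustrative assumptions.

```python
import pandas as pd

# Made-up ids: only businesses 'a' and 'c' are restaurants
toy_restaurants = pd.DataFrame({"business_id": ["a", "c"]})
toy_reviews = pd.DataFrame({
    "business_id": ["a", "b", "a", "c", "d"],
    "stars": [5, 1, 4, 2, 3],
})

# True where the review belongs to one of the restaurant ids
mask = toy_reviews["business_id"].isin(toy_restaurants["business_id"])
toy_restaurant_reviews = toy_reviews[mask]

print(mask.tolist())
print(toy_restaurant_reviews.shape)
```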

In [10]:
#Saving a clean data frame to a csv file (restaurant reviews only)
reviews_rest.to_csv("restau_reviews.csv",index=False)

In [11]:
df=pd.read_csv("restau_reviews.csv")

In [12]:
# Create the final data frame with only the columns needed for modeling: the review text and the star rating
rest_rev = df[['text', 'stars']]

In [13]:
rest_rev.head()

Unnamed: 0,text,stars
0,Apparently Prides Osteria had a rough summer a...,4
1,I've stayed at many Marriott and Renaissance M...,2
2,The food is always great here. The service fro...,4
3,"This place used to be a cool, chill place. Now...",1
4,"The setting is perfectly adequate, and the foo...",2


In [14]:
# Add a sentiment column: reviews with 3 or more stars are labeled positive, the rest negative.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that pandas
# raises when a column is set directly on a selection of another DataFrame.

rest_rev = rest_rev.assign(sentiment=rest_rev['stars'].apply(lambda star: 'positive' if star >= 3 else 'negative'))
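Selecting columns from a DataFrame and then assigning a new column to the result can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it is to take an explicit `.copy()` of the selection first. The sketch below shows this on made-up data and applies the same labeling rule (3 or more stars counts as positive) with `np.where`, a vectorized alternative to `apply`.

```python
import numpy as np
import pandas as pd

# Made-up reviews, one per star rating
toy = pd.DataFrame({
    "text": ["awful", "bad", "okay", "good", "great"],
    "stars": [1, 2, 3, 4, 5],
    "useful": [0, 1, 0, 2, 3],
})

# An explicit copy makes the subset an independent DataFrame
subset = toy[["text", "stars"]].copy()

# Same rule as above: 3 or more stars is positive, otherwise negative
subset["sentiment"] = np.where(subset["stars"] >= 3, "positive", "negative")

print(subset)
```

Note that treating 3-star reviews as positive is a choice made in this project; a different threshold would shift the class balance.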


In [15]:
# Check that the new column was added by viewing rows 5 to 10

rest_rev[5:11]

Unnamed: 0,text,stars,sentiment
5,Probably one of the better breakfast sandwiche...,5,positive
6,I work in the Pru and this is the most afforda...,5,positive
7,"They NEVER seem to get our \norder correct, se...",1,negative
8,I have been here twice and have had really goo...,4,positive
9,This is a five-star restaurant if ever I have ...,5,positive
10,Quickly stopped in for a UFC fight. I sat down...,4,positive


In [16]:
# Spot-check a second slice of the data (rows 100 to 105)
rest_rev[100:106]

Unnamed: 0,text,stars,sentiment
100,Wow!!! Absolutely the BEST donuts I've EVER ha...,5,positive
101,We ate at this place all three mornings we wer...,4,positive
102,On a lark (which is a bird according to wikipe...,5,positive
103,It was my very first visit the service was aw...,4,positive
104,Great spot. Comfortable little joint smack dab...,5,positive
105,Best theater ever. Great seats great service....,5,positive


### Pre-processing

In [17]:
#Check for null values
rest_rev.isnull().sum()

text         0
stars        0
sentiment    0
dtype: int64

In [18]:
# Check for whitespace-only strings in the sentiment column
blanks = []

for i, t, s, sent in rest_rev.itertuples():
    if sent.isspace():
        blanks.append(i)

In [19]:
blanks

[]

No sentiment value consists only of whitespace.
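The same kind of check can be applied to the review text itself, which is the column the model actually reads. The sketch below is an assumption about how one might do it, shown on made-up data. It uses a vectorized comparison instead of a loop, and it also catches empty strings, which `str.isspace()` does not (it returns `False` for `""`).

```python
import pandas as pd

# Made-up reviews, including blank and whitespace-only texts
toy = pd.DataFrame({
    "text": ["Great tacos", "   ", "", "\n\t", "Okay coffee"],
    "stars": [5, 3, 4, 2, 3],
})

# A text is blank if nothing remains after stripping whitespace
blank_mask = toy["text"].str.strip().eq("")
blank_rows = toy.index[blank_mask].tolist()

print(blank_rows)

# Dropping the blank rows would then look like this
toy_clean = toy[~blank_mask]
print(len(toy_clean))
```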

In [20]:
len(rest_rev)

653719

In [21]:
rest_rev['sentiment'].value_counts()

positive    520285
negative    133434
Name: sentiment, dtype: int64
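The counts above show that the classes are imbalanced, which matters when reading accuracy figures later. As a quick reference point, the sketch below computes the class shares and the accuracy of a trivial model that always predicts the majority class, using only the two counts printed above.

```python
# Class counts copied from the value_counts() output above
counts = {"positive": 520285, "negative": 133434}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"total reviews: {total}")
print(f"share positive: {shares['positive']:.3f}")
print(f"share negative: {shares['negative']:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly four out of five reviews are positive, so any classifier should be judged against a baseline of about 0.80 rather than 0.50.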

In [22]:
# Split the data into train and test sets

from sklearn.model_selection import train_test_split

X = rest_rev['text']
y = rest_rev['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
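With imbalanced classes, a plain random split can leave the train and test sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions the same in both parts. This was not used in the split above; the sketch below only illustrates the option on made-up labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data with an 80/20 class split
texts = [f"review {i}" for i in range(100)]
labels = ["positive"] * 80 + ["negative"] * 20

# stratify=labels preserves the 80/20 ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```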

In [23]:
#Build a pipeline to vectorize the data, then train and fit a model

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

rev_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# Feed the training data through the pipeline
rev_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
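A fitted TF-IDF plus linear SVM pipeline can be inspected to see which words push a review toward each class, since the model learns one weight per vocabulary term. The sketch below trains the same kind of pipeline on four made-up sentences (far too few for a real model) and reads the weights through `named_steps`, `vocabulary_` and `coef_`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data, only to make the sketch runnable
toy_texts = ["great food", "great service", "terrible food", "terrible service"]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(toy_texts, toy_labels)

vocab = toy_clf.named_steps["tfidf"].vocabulary_   # maps term -> column index
weights = toy_clf.named_steps["clf"].coef_[0]      # one weight per term (binary case)
classes = toy_clf.named_steps["clf"].classes_      # sorted labels; positive weights favor classes[1]

for term, idx in sorted(vocab.items(), key=lambda item: weights[item[1]]):
    print(f"{term:>10}  {weights[idx]:+.3f}")
```

In this toy setup, terms with a positive weight pull a review toward `classes[1]` (here `positive`), and terms with a negative weight pull it toward `classes[0]`.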

In [24]:
# Predict sentiment labels for the test set
predictions = rev_clf.predict(X_test)

In [25]:
# Report the confusion matrix (rows: true labels, columns: predicted labels, in the order negative, positive)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(confusion_matrix(y_test, predictions))

[[ 35037   8825]
 [  6799 165067]]


In [26]:
# Print a classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

    negative       0.84      0.80      0.82     43862
    positive       0.95      0.96      0.95    171866

    accuracy                           0.93    215728
   macro avg       0.89      0.88      0.89    215728
weighted avg       0.93      0.93      0.93    215728
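The figures in the classification report can be reproduced by hand from the confusion matrix, which helps clarify what each metric means. The sketch below uses only the four counts printed earlier and plain Python arithmetic.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, both ordered (negative, positive).
cm = [[35037, 8825],
      [6799, 165067]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for k, label in enumerate(labels):
    support = sum(cm[k])                  # reviews whose true label is this class
    predicted = cm[0][k] + cm[1][k]       # reviews predicted as this class
    precision = cm[k][k] / predicted
    recall = cm[k][k] / support
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1, support)

for label, (precision, recall, f1, support) in metrics.items():
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}  support={support}")
print(f"accuracy={accuracy:.4f}")
```

The negative class has the lower recall: about one in five truly negative reviews is predicted as positive, which the overall accuracy alone does not reveal.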



In [27]:
# Print the overall accuracy
print(accuracy_score(y_test, predictions))

0.9275754654008752


### Testing my classifier with random reviews copied from the Yelp website

In [28]:
rev_clf.predict(["This was one of the worst experiences my friend and I have had since COVID began. Servers not masked even though Delta variant strong in West End. Tables were sticky and uncleaned, observed that other tables vacated were not cleaned. Staff totally disregarded those of us who were over 55 and/or had children but were very busy serving everyone else. Very disappointing since this was one of the first restaurants we visited since the beginning of COVID. Food OK. Service TERRIBLE. WON'T BE COMING BACK.  Note that we came on Sat. Aug. 21st @ 2pm with reservations to enjoy Happy Hour and staff seemed put out that we weren't ordering more. Our bill was $140ish not including tax. Hope they read this."])

array(['negative'], dtype=object)

In [29]:
rev_clf.predict(["The hand-pressed hamburger was very disappointing. I would rather have McDonald's any-day. Fries sucked too. The Molten Lava Cake was cool."])

array(['negative'], dtype=object)

In [30]:
rev_clf.predict(["I came with a few friends of my mine to celebrate my 16th birthday and right as we were seated I could tell our waitress was unhappy to be serving a group of minors. She spoke to us in a tone that you would use on 5 year olds. The food was amazing, but overall the service just ruined it for me. Probably won't be coming here again because of this."])

array(['negative'], dtype=object)

In [31]:
rev_clf.predict(["We spent our first night eating at Joey Burrard.  Didn't realize it was kinda a chain, but read that it have decent reviews however the service is left for improvement.  I do have to agree for the most part.  We got seated and took forever for someone to bring us water.  We saw so many staffs walking around.  But it just doesn't feel a lot of them did so with a purpose.  When we did flag someone down, she was quite pleasant and took our orders.  My wife got the half BBQ rack which she complained that the portion was tiny and cold.  We commented this to the waitress and they apologized on behalf of kitchen and offered for the same or something else.  It was good that the manager came by afterwards and asked us for details and was very gracious.  My wife then got curry shrimp over rice as replacement.  That was decent.  Nothing to boast about.  I had the mushroom cheddar burger.  I have to say it was quite delicious and cooked to perfection.  My only complaint is that it surely doesn't look like the same pic as other people had posted in the past.  Please take a look at my photo and be the judge.  My son got the MAc and cheese and gave it so so review.  He didn't even finish...so that's telling.  All in all, they have to definitely improve on.  I will not go back for sure.  Free wifi."])

array(['positive'], dtype=object)

All four predictions above turned out to be correct :)
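Several reviews can be scored in one call, and `decision_function` gives a signed margin for each, which is a rough indication of how far a review sits from the decision boundary. The sketch below uses a small pipeline trained on made-up sentences, so its outputs say nothing about the model above. The margins are not probabilities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data, only to make the sketch runnable
train_texts = [
    "amazing food and friendly staff",
    "great burger, will come back",
    "delicious pasta and great service",
    "loved the cozy atmosphere",
    "terrible service and cold food",
    "worst experience, never again",
    "rude staff and dirty tables",
    "disappointing and overpriced",
]
train_labels = ["positive"] * 4 + ["negative"] * 4

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(train_texts, train_labels)

# Pass a list with several reviews to score them in one call
new_reviews = ["The food was amazing", "Terrible service and dirty tables"]
predicted = toy_clf.predict(new_reviews)
margins = toy_clf.decision_function(new_reviews)

# A margin above zero corresponds to the second class in classes_ (here 'positive')
for text, label, margin in zip(new_reviews, predicted, margins):
    print(f"{label:>8}  margin={margin:+.2f}  {text}")
```

Reviews with margins close to zero, such as mixed reviews that praise the food but criticize the service, are the ones most worth checking by hand.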
