# Natural Language Processing Project

In this NLP project, we will try to classify Yelp reviews into different star ratings based on the text content of the reviews.

We will use the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

Each observation in this dataset is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users. 

All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.

## Import the common libraries

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import display, HTML
%matplotlib inline

## ETL

### Read the yelp_training_set_review.json file

In [13]:
yelp_json = pd.read_json('yelp_training_set_review.json', lines=True)
yelp_json.head()

| | votes | user_id | review_id | stars | date | text | type | business_id |
|---|---|---|---|---|---|---|---|---|
| 0 | {'funny': 0, 'useful': 5, 'cool': 2} | rLtl8ZkDX5vH5nAx9C3q5Q | fWKvX83p0-ka4JS3dc6E5A | 5 | 2011-01-26 | My wife took me here on my birthday for breakf... | review | 9yKzy9PApeiPPOUJEtnvkg |
| 1 | {'funny': 0, 'useful': 0, 'cool': 0} | 0a2KyEL0d3Yb1V6aivbIuQ | IjZ33sJrzXqU-0X6U8NwyA | 5 | 2011-07-27 | I have no idea why some people give bad review... | review | ZRJwVLyzEJq1VAihDhYiow |
| 2 | {'funny': 0, 'useful': 1, 'cool': 0} | 0hT2KtfLiobPvh6cDC8JQg | IESLBzqUCLdSzSqm0eCSxQ | 4 | 2012-06-14 | love the gyro plate. Rice is so good and I als... | review | 6oRAC4uyJCsJl1X0WZpVSA |
| 3 | {'funny': 0, 'useful': 2, 'cool': 1} | uZetl9T0NcROGOyFfughhg | G-WvGaISbqqaMHlNnByodA | 5 | 2010-05-27 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | _1QQZuf4zZOyFCvXc0o6Vg |
| 4 | {'funny': 0, 'useful': 0, 'cool': 0} | vYmM4KTsC8ZfQBg-j5MWkw | 1uJFq2r5QfJG_6ExMRCaGw | 5 | 2012-01-05 | General Manager Scott Petello is a good egg!!!... | review | 6ozycU1RpktNG2-1BroVtw |


### Flatten votes into three additional columns

In [14]:
votes_new = pd.json_normalize(yelp_json['votes'])
yelp_orig = pd.concat([yelp_json,votes_new], axis=1)
yelp_orig.drop(columns='votes', inplace=True)
yelp_orig.head()

| | user_id | review_id | stars | date | text | type | business_id | funny | useful | cool |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rLtl8ZkDX5vH5nAx9C3q5Q | fWKvX83p0-ka4JS3dc6E5A | 5 | 2011-01-26 | My wife took me here on my birthday for breakf... | review | 9yKzy9PApeiPPOUJEtnvkg | 0 | 5 | 2 |
| 1 | 0a2KyEL0d3Yb1V6aivbIuQ | IjZ33sJrzXqU-0X6U8NwyA | 5 | 2011-07-27 | I have no idea why some people give bad review... | review | ZRJwVLyzEJq1VAihDhYiow | 0 | 0 | 0 |
| 2 | 0hT2KtfLiobPvh6cDC8JQg | IESLBzqUCLdSzSqm0eCSxQ | 4 | 2012-06-14 | love the gyro plate. Rice is so good and I als... | review | 6oRAC4uyJCsJl1X0WZpVSA | 0 | 1 | 0 |
| 3 | uZetl9T0NcROGOyFfughhg | G-WvGaISbqqaMHlNnByodA | 5 | 2010-05-27 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | _1QQZuf4zZOyFCvXc0o6Vg | 0 | 2 | 1 |
| 4 | vYmM4KTsC8ZfQBg-j5MWkw | 1uJFq2r5QfJG_6ExMRCaGw | 5 | 2012-01-05 | General Manager Scott Petello is a good egg!!!... | review | 6ozycU1RpktNG2-1BroVtw | 0 | 0 | 0 |
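
As a small illustration of the flattening step, the sketch below applies the same `pd.json_normalize` plus `pd.concat` pattern to a toy frame. The two-row `toy` frame is made up for illustration and is not part of the Yelp data. Note that the pattern relies on the original frame having a default 0..n-1 index, so that the rows of both frames line up.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the nested "votes" column
toy = pd.DataFrame({
    "stars": [5, 4],
    "votes": [{"funny": 0, "useful": 5, "cool": 2},
              {"funny": 1, "useful": 0, "cool": 0}],
})

# json_normalize expands each dict into one column per key
votes_flat = pd.json_normalize(toy["votes"].tolist())

# Join column-wise on the shared default index, without the nested column
toy_flat = pd.concat([toy.drop(columns="votes"), votes_flat], axis=1)
print(toy_flat)
```

If the frame had been filtered or re-indexed beforehand, calling `reset_index(drop=True)` first would probably be needed to keep the rows aligned.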


### Info on the dataset

In [15]:
# info() prints its summary directly and returns None, so call it on its own
yelp_orig.info()
print(yelp_orig.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229907 entries, 0 to 229906
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   user_id      229907 non-null  object        
 1   review_id    229907 non-null  object        
 2   stars        229907 non-null  int64         
 3   date         229907 non-null  datetime64[ns]
 4   text         229907 non-null  object        
 5   type         229907 non-null  object        
 6   business_id  229907 non-null  object        
 7   funny        229907 non-null  int64         
 8   useful       229907 non-null  int64         
 9   cool         229907 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(5)
memory usage: 17.5+ MB
                stars          funny         useful           cool
count  229907.000000  229907.000000  229907.000000  229907.000000
mean        3.766723       0.699030       1.386822       0.868234
std         1.

In [16]:
yelp = yelp_orig.copy()

#### The length of a review is an important feature that gives us more information for assessing the sentiment of the review.
#### Create a new column called "text_length", which is the number of words left in the text column after cleaning.
#### Remove all punctuation and stopwords, which carry little signal for the sentiment analysis.

In [17]:
import string
from nltk.corpus import stopwords

In [18]:
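punct_to_space = {x:' ' for x in string.punctuation}
def text_process(x):
    '''
    Preprocess a text document

    Apply the text manipulations shown below to a document and return either the
    word count or the NLP-ready text, depending on the index used in the return statement.
    '''
    
    # replace punctuation with spaces
    no_punc = x.translate(str.maketrans(punct_to_space)) 
    # remove stopwords (note: stopwords.words('english') is re-evaluated for every word, which is slow)
    no_punc_list = [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]
    # remove words that contain digits or other non-letter characters
    final_text = [word for word in no_punc_list if word.isalpha()]
    # get the number of remaining words
    length = len(final_text)
    # index 0 returns the word count; switch to index 1 to return the cleaned text
    return [length, ' '.join(final_text)][0]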
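
The cleaning steps above (strip punctuation, drop stopwords, keep alphabetic tokens) can be sketched in a more compact form. This is an illustrative variant, not the notebook's function: `SAMPLE_STOPWORDS` is a tiny hand-picked stand-in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus, and `clean_text` and `word_count` are hypothetical helper names. Keeping the stopwords in a set that is built once, instead of looking the list up again for every word, is likely to reduce the run time considerably.

```python
import string

# Assumption: small stand-in list; in the notebook this would be
# set(stopwords.words('english')), built once outside the function.
SAMPLE_STOPWORDS = {"i", "me", "my", "the", "a", "at", "on", "for", "and", "is", "so", "here"}

# Translation table mapping every punctuation character to a space
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def clean_text(doc, stop_words=SAMPLE_STOPWORDS):
    """Return doc without punctuation, stopwords and non-alphabetic tokens."""
    tokens = doc.translate(PUNCT_TABLE).split()
    kept = [t for t in tokens if t.lower() not in stop_words and t.isalpha()]
    return " ".join(kept)

def word_count(doc, stop_words=SAMPLE_STOPWORDS):
    """Return the number of words left after cleaning."""
    return len(clean_text(doc, stop_words).split())

sample = "My wife took me here on my birthday, for breakfast at 9am!"
print(clean_text(sample))
print(word_count(sample))
```

Splitting the work into two small functions avoids having to switch a return index between the `text_length` and `text_transformed` runs.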

In [19]:
from multiprocessing import Pool
from functools import partial

In [20]:
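def parallelize(data, func, num_of_processes=4):
    # split the data into one chunk per worker process
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    # pool.map keeps the chunk order, so concat restores the original row order
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    # apply func element-wise to one chunk
    return data_subset.apply(func)

def parallelize_on_rows(data, func, num_of_processes=4):
    # bind func so that each worker only needs to receive its chunk
    return parallelize(data, partial(run_on_subset, func), num_of_processes)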
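
To illustrate how the helpers above fit together, here is a minimal sketch of the same split, map, and concatenate pattern on a toy Series. It uses `multiprocessing.dummy.Pool`, a thread-backed class with the same interface as `multiprocessing.Pool`, purely so that the snippet runs anywhere without pickling concerns; threads will not give the CPU speed-up that separate processes do. The names `parallel_apply` and `apply_chunk`, and the toy data, are hypothetical.

```python
from functools import partial
from multiprocessing.dummy import Pool  # thread-based stand-in with the Pool API

import pandas as pd

def apply_chunk(func, chunk):
    # Apply func element-wise to one chunk of the Series
    return chunk.apply(func)

def parallel_apply(series, func, n_workers=4):
    # Split by position into roughly equal chunks (index labels are preserved)
    size = -(-len(series) // n_workers)  # ceiling division
    chunks = [series.iloc[i:i + size] for i in range(0, len(series), size)]
    # pool.map returns results in input order, so concat restores the row order
    with Pool(n_workers) as pool:
        parts = pool.map(partial(apply_chunk, func), chunks)
    return pd.concat(parts)

toy = pd.Series(["good food", "bad", "very nice place"])
lengths = parallel_apply(toy, lambda s: len(s.split()), n_workers=2)
print(lengths.tolist())
```

With real processes, the function passed to the pool has to be picklable, which is why the notebook binds a module-level function with `partial` rather than using a lambda.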

### Apply preprocessing function for the text

In [21]:
#%%timeit
#yelp['text_length'] = yelp['text'].apply(func=text_process)
#yelp['text_length'] = yelp['text'].apply(func=text_process,args=[0])
#yelp['text_transformed'] = yelp['text'].apply(func=text_process,args=[1])

In [22]:
%%timeit
yelp['text_length'] = parallelize_on_rows(yelp['text'],text_process)

27min 34s ± 8.78 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
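%%timeit
# For this cell, text_process must return the cleaned text (index 1 of its result list) instead of the word count
yelp['text_transformed'] = parallelize_on_rows(yelp['text'],text_process)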

27min 37s ± 11.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
yelp.to_csv('yelp_training_set_review(with text_length and transformed)-new.csv', encoding='utf-8', index=False)

In [26]:
yelp[['stars', 'useful', 'funny', 'cool', 'text_length', 'text_transformed']].head()

| | stars | useful | funny | cool | text_length | text_transformed |
|---|---|---|---|---|---|---|
| 0 | 5 | 5 | 0 | 2 | 77 | wife took birthday breakfast excellent weather... |
| 1 | 5 | 0 | 0 | 0 | 111 | idea people give bad reviews place goes show p... |
| 2 | 4 | 1 | 0 | 0 | 9 | love gyro plate Rice good also dig candy selec... |
| 3 | 5 | 2 | 0 | 1 | 44 | Rosie Dakota LOVE Chaparral Dog Park convenien... |
| 4 | 5 | 0 | 0 | 0 | 38 | General Manager Scott Petello good egg go deta... |
