<img src="images/generalassembly-open-graph.png" width="240" height="240" align="left"/>

# Pre-processing reviews notebook
**Author: Rodolfo Flores Mendez**
<br> May 2019 | Chicago, IL.

### Table of contents
- [Overview](#ov)
- [Importing libraries](#imp)
- [Feature extraction](#fe)

### Overview<a id="ov"></a>
This notebook pre-processes the "review" dataset to extract features for modelling purposes. Those features are:

    (a) Average rating of "cool", "funny", and "useful" per comment,
    (b) Average stars/rating per comment,
    (c) Total number of positive and negative comments,
    (d) Average polarity and subjectivity of comments,
    (e) Proxy for the age of a business (days since its first comment),
    (f) Time (in days) since the last comment,
    (g) Time (in days) between the first and the last comment.
    
These features are fed into the "Data architecture notebook" along with other features from the "business" dataset. The features were summarized by business_id.
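
As a rough illustration of how features like these can be summarized per business, the sketch below aggregates a small invented table with pandas named aggregation (available from pandas 0.25). The toy data, the `is_positive` column, and the output names `first_review` and `last_review` are assumptions made for this example; the notebook's own aggregation appears in the feature extraction section.

```python
import pandas as pd

# Toy stand-in for the review data; all values are invented for illustration.
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b"],
    "stars": [5, 3, 4],
    "useful": [1, 0, 2],
    "polarity": [0.4, -0.2, 0.1],
    "date": pd.to_datetime(["2018-01-01", "2018-03-01", "2019-02-01"]),
})

# Flag reviews with strictly positive polarity (1) versus the rest (0).
reviews["is_positive"] = (reviews["polarity"] > 0).astype(int)

# Named aggregation yields flat, readable column names in one step.
summary = reviews.groupby("business_id").agg(
    rev_stars=("stars", "mean"),
    useful=("useful", "mean"),
    polarity=("polarity", "mean"),
    positive_comments=("is_positive", "sum"),
    negative_comments=("is_positive", lambda s: len(s) - s.sum()),
    first_review=("date", "min"),
    last_review=("date", "max"),
)
print(summary)
```

With this form there are no two-level column names to flatten afterwards, at the cost of spelling out each output column.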

### Importing libraries<a id="imp"></a>
In this section we outline the initial code needed to run this notebook. If this code returns an error, we recommend verifying that the most recent versions of the libraries listed below are installed. For guidance on installing Python modules, please refer to the __[official documentation](https://docs.python.org/3/installing/)__.

In [1]:
#Import libraries
import numpy as np
import pandas as pd

#Import regex
import regex as re

#NLP libraries
from nltk.corpus import stopwords
from textblob import TextBlob

#Import datetime
import datetime as dt

#Import the scalers (standard and min-max)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
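
If one of the imports above fails, one possible way to check which versions are installed is the standard-library `importlib.metadata` module (Python 3.8 or later). The helper name `installed_version` is hypothetical, and the package list simply mirrors the imports in this notebook under their PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (note: scikit-learn, not sklearn).
for name in ["numpy", "pandas", "regex", "nltk", "textblob", "scikit-learn"]:
    print(name, installed_version(name) or "not installed")
```

Note that the NLTK stop word list is a separate download, obtained with `nltk.download('stopwords')`.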

### Feature extraction <a id="fe"></a>
The following lines of code engineer certain features from the review dataset in order to feed the model.
##### Code

In [2]:
#Define function to clean stop words
stops = set(stopwords.words('english'))
def clean_stops(text):
    #Split on whitespace and keep only the words that are not stop words
    words = text.split()
    words = [w for w in words if w not in stops]
    return " ".join(words)


#Read the review dataframe
df = pd.read_csv('./csv_data/review.csv')

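#Limit the dataframe to Las Vegas only (see readme)
#Read the numpy array of business IDs pertaining only to Las Vegas
target_ids = np.load('ids.npy')

#Extract the business ids as an array
business_id = df['business_id'].values

#Create a boolean mask for reviews whose business is in the target ids
mask = np.isin(business_id,target_ids)

#Overwrite the dataframe with the filtered rows
#(the copy avoids a SettingWithCopyWarning when columns are assigned later)
df = df[mask].copy()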

#Clean the text: replace non-letter characters with spaces, lowercase, then remove stop words
df['text'] = df['text'].apply(lambda x: re.sub(r"[^a-zA-Z]"," ",x).lower().strip())
df['text'] = df['text'].apply(clean_stops)


#Create a column for the sentiment of each post
df['polarity']=df['text'].apply(lambda x: TextBlob(x).sentiment[0])
df['subjectivity']=df['text'].apply(lambda x: TextBlob(x).sentiment[1])


#Create a flag column: 1 for positive comments (polarity > 0), 0 otherwise
df['positive_negative']=df['polarity'].apply(lambda x: 1 if x>0 else 0)

#Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

#Convert cool to float
df['cool'] = df['cool'].astype(float)

#Group the dataframe and summarize by business id
df_sum = df.groupby('business_id').agg({
    'cool': lambda x: np.mean(x),
    'funny': lambda x: np.mean(x),
    'useful': lambda x: np.mean(x),
    'stars': lambda x: np.mean(x),
    'polarity': lambda x:np.mean(x),
    'subjectivity': lambda x: np.mean(x),
    'positive_negative': ['sum',lambda x:len(x)-np.sum(x)],
    'date': ['max','min']
})

#Flatten the two-level column names produced by agg into single strings
df_sum.columns = ["_".join(x).replace("_<lambda>",'') for x in df_sum.columns]

#Rename columns
df_sum.rename(columns = {
    'positive_negative_sum':'positive_comments',
    'positive_negative':'negative_comments',
    'stars':'rev_stars'},
             inplace = True)

#Create columns for age, time since last comment, and time between first and last comment
df_sum['today'] = pd.to_datetime('today')
df_sum['age'] = df_sum['today'] - df_sum['date_min']
df_sum['t_last_c'] = df_sum['today'] - df_sum['date_max']
df_sum['t_comments'] = df_sum['date_max'] - df_sum['date_min']


#Convert the time differences to days (as floats)
df_sum['age'] = df_sum['age'] / np.timedelta64(1, 'D')
df_sum['t_last_c'] = df_sum['t_last_c'] / np.timedelta64(1, 'D')
df_sum['t_comments'] = df_sum['t_comments'] / np.timedelta64(1, 'D')


#Scale all variables: min-max scaling to [0, 1] for the vote and star averages,
#standardization (zero mean, unit variance) for the counts and time features
minMaxFeat = ['cool','funny','useful','rev_stars']
mms = MinMaxScaler()
df_mms = mms.fit_transform(df_sum[minMaxFeat])
df_mms = pd.DataFrame(df_mms,columns = minMaxFeat)
    
ssFeat = ['positive_comments','negative_comments','age','t_last_c','t_comments']

ss = StandardScaler()
df_ss = ss.fit_transform(df_sum[ssFeat])
df_ss = pd.DataFrame(df_ss,columns = ssFeat)

#Merge all into a final dataframe (excludes polarity and subjectivity)
df_fin = pd.concat([df_mms,df_ss,pd.Series(df_sum.index)],axis=1)

#Display the head
df_fin.head()
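
The text-cleaning steps in the cell above can be tried in isolation. The following is a minimal sketch that assumes a tiny, made-up stop list (`STOPS`) in place of NLTK's English list and uses the standard-library `re` module; `clean_text` is a hypothetical helper that combines the regex step with the logic of `clean_stops`.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's full English list.
STOPS = {"the", "was", "and", "a", "but"}


def clean_text(text, stops=STOPS):
    """Replace non-letters with spaces, lowercase, then drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in letters_only.split() if w not in stops)


print(clean_text("The food was GREAT, and cheap!"))  # food great cheap
```

Because apostrophes are also replaced by spaces, a contraction such as "Don't" is split into "don" and "t" before the stop words are filtered.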

In [19]:
#Inspect the shape
df_fin.shape

(29369, 10)
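
The ten columns reported above are the four min-max-scaled features, the five standardized features, and `business_id`. For reference, the two scalings can be sketched directly in pandas; these formulas correspond to the default settings of `MinMaxScaler` (range 0 to 1) and `StandardScaler` (population standard deviation). The toy values below are invented, and this is an illustrative alternative rather than the notebook's method.

```python
import pandas as pd

# Toy aggregated frame indexed by business_id; values are invented.
df_sum = pd.DataFrame(
    {"rev_stars": [2.0, 3.0, 5.0], "age": [100.0, 400.0, 1000.0]},
    index=pd.Index(["a", "b", "c"], name="business_id"),
)

# Min-max scaling: maps the column onto [0, 1].
stars = df_sum["rev_stars"]
rev_stars_scaled = (stars - stars.min()) / (stars.max() - stars.min())

# Standardization: zero mean and unit variance, using the population
# standard deviation (ddof=0).
age = df_sum["age"]
age_scaled = (age - age.mean()) / age.std(ddof=0)

# Both results keep the business_id index, so they align by label.
scaled = pd.concat([rev_stars_scaled, age_scaled], axis=1)
print(scaled)
```

Working on the indexed frame keeps `business_id` attached to every row, so no positional concatenation with the index is needed afterwards.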

In [20]:
#Set business id as index
df_fin = df_fin.set_index('business_id')

#Merge with polarity and subjectivity
df_fin = pd.merge(df_fin,
                  df_sum[['polarity','subjectivity']],
                  how = 'inner',
                  left_index=True,
                  right_index=True)

#Save df as CSV
df_fin.to_csv('./csv_data/reviews_df_vegas.csv')

#Display final df head
df_fin.head()

| business_id | cool | funny | useful | rev_stars | positive_comments | negative_comments | age | t_last_c | t_comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| --9e1ONYQuAa-CB_Rrw7Tw | 0.025137 | 0.022688 | 0.02255 | 0.781457 | 7.353262 | 1.431592 | 2.596106 | -0.612425 | 2.990671 |
| --DdmeR16TRb3LsjG0ejrQ | 0.082353 | 0.10274 | 0.063485 | 0.583333 | -0.257236 | -0.298266 | 0.163534 | 1.568662 | -0.759374 |
| --WsruI0IGEoeRmkErU5Gg | 0.008824 | 0.002568 | 0.010581 | 0.921875 | -0.224003 | -0.198466 | -1.001078 | 0.053315 | -1.045407 |
| --Y7NhBKzLTbNliMUX_wfg | 0.003922 | 0.0 | 0.007054 | 0.972222 | -0.242993 | -0.298266 | -0.723327 | -0.258417 | -0.580256 |
| --e8PjCNhEz32pprnPhCwQ | 0.010247 | 0.015908 | 0.03891 | 0.508065 | -0.176526 | -0.032134 | -0.347135 | -0.523329 | -0.043006 |
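
When the saved CSV is loaded in a later notebook, `business_id` comes back as an ordinary column unless it is named as the index. A minimal sketch of the round trip, assuming an invented two-row frame and an in-memory buffer in place of `./csv_data/reviews_df_vegas.csv`:

```python
import io

import pandas as pd

# Toy frame standing in for df_fin; values are invented.
df_fin = pd.DataFrame(
    {"cool": [0.25, 0.5], "polarity": [0.1, -0.3]},
    index=pd.Index(["id_1", "id_2"], name="business_id"),
)

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
df_fin.to_csv(buffer)
buffer.seek(0)

# index_col restores business_id as the index rather than a regular column.
restored = pd.read_csv(buffer, index_col="business_id")
print(restored)
```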
