In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

# Raktim: a few libraries need to be installed first

# !pip install googletrans==3.1.0a0
# !pip install vaderSentiment
# !pip install nltk
# nltk.download('punkt')
# !pip install statsmodels
# !pip install xgboost

# Dropped the original review column after translating it to English

# Note: 'neg_title', 'neu_title' and 'compound_title' have a Variance Inflation Factor above 4




### 0. Understanding the Business Problem
Uber Inc in the US wants to know:

- the major complaints premium users have about their cab services,
- how these impact service ratings.

As (technical) consultants to Uber, we have to:  
- [a] analyze text reviews of Uber cabs’ US services,  
- [b] determine whether, and which, features of these reviews impact overall ratings,  
- [c] pinpoint possible areas of improvement.

### 1. Pre-processing: 
- Examine the dataset. 
- Identify the columns of interest. 
- Drop special characters, HTML junk, etc. 
- Perform any other preprocessing and text-cleaning activity you think fits this context.
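
As a rough illustration of the cleaning step listed above, the sketch below unescapes HTML entities, strips anything that looks like a tag, drops characters outside a small allow-list, and collapses whitespace. The name `clean_text` and the choice of which characters to keep are assumptions made for this example; they are not part of the original notebook.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                  # anything that looks like an HTML tag
JUNK_RE = re.compile(r"[^A-Za-z0-9\s.,!?'$%-]")  # characters outside an assumed allow-list
SPACE_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Hypothetical helper: strip HTML remnants and special characters."""
    text = html.unescape(text)         # "&amp;" -> "&"
    text = text.replace("’", "'")      # normalise curly apostrophes before filtering
    text = TAG_RE.sub(" ", text)       # remove tags
    text = JUNK_RE.sub(" ", text)      # remove remaining special characters
    return SPACE_RE.sub(" ", text).strip()

cleaned = clean_text("<b>Great</b> ride &amp; fast!!  ")
```

Because the tag pattern would also remove `<U+...>` emoji codes, a helper like this would have to run after any emoji replacement, not before it.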

In [38]:
df = pd.read_csv("https://raw.githubusercontent.com/Kenrich005/Uber_reviews_textanalytics/main/uber_reviews_itune.csv",
                 encoding='cp1252')
df.head()

Unnamed: 0,Author_Name,Title,Author_URL,App_Version,Rating,Review,Date
0,#NEVERUBER,Dishonest and Disgusting,https://itunes.apple.com/us/reviews/id663331949,3.434.10005,1,"For half an hour, we tried EVERY UBER SERVICE ...",29-12-2020 01:14
1,$$Heaven,Free offer,https://itunes.apple.com/us/reviews/id810421958,3.434.10005,2,If I’m not eligible for the offer Stop floodin...,01-01-2021 23:17
2,.Disappointed....,Inaccurate,https://itunes.apple.com/us/reviews/id49598333,3.439.10000,2,Consistently inaccurate Uber Eats ETA and the ...,15-01-2021 23:38
3,.i. andrea,bad,https://itunes.apple.com/us/reviews/id689880334,3.434.10005,1,i had my rides canceled back to back. they the...,08-12-2020 01:01
4,-:deka:-,Double charged me for an order,https://itunes.apple.com/us/reviews/id124963835,3.434.10005,1,Two of the same orders was added by accident. ...,15-12-2020 04:02


Columns of interest:  
1. Title - Brief summary of the review
2. Rating - Label for supervised learning
3. Review - Used to extract the sentiment of the complaint
4. Date - Weekday/weekend and time of day may give better insight into the nature of the review

### Data Cleaning

In [39]:
df1 = df.drop(['Author_Name','Author_URL','App_Version'],axis=1)
df1.head()

Unnamed: 0,Title,Rating,Review,Date
0,Dishonest and Disgusting,1,"For half an hour, we tried EVERY UBER SERVICE ...",29-12-2020 01:14
1,Free offer,2,If I’m not eligible for the offer Stop floodin...,01-01-2021 23:17
2,Inaccurate,2,Consistently inaccurate Uber Eats ETA and the ...,15-01-2021 23:38
3,bad,1,i had my rides canceled back to back. they the...,08-12-2020 01:01
4,Double charged me for an order,1,Two of the same orders was added by accident. ...,15-12-2020 04:02


## Translate non-English reviews to English

In [40]:

from googletrans import Translator

translator = Translator()

try:
    df1['Review'] = df1['Review'].astype(str)
    # Translate each review to English (source language auto-detected) and keep only the text
    df1['Translated_Review'] = df1['Review'].apply(
        lambda text: translator.translate(text, src='auto', dest='en').text)
except AttributeError:
    # Note: if translation fails here, 'Translated_Review' is never created
    pass

df1

Unnamed: 0,Title,Rating,Review,Date,Translated_Review
0,Dishonest and Disgusting,1,"For half an hour, we tried EVERY UBER SERVICE ...",29-12-2020 01:14,"For half an hour, we tried EVERY UBER SERVICE ..."
1,Free offer,2,If I’m not eligible for the offer Stop floodin...,01-01-2021 23:17,If I’m not eligible for the offer Stop floodin...
2,Inaccurate,2,Consistently inaccurate Uber Eats ETA and the ...,15-01-2021 23:38,Consistently inaccurate Uber Eats ETA and the ...
3,bad,1,i had my rides canceled back to back. they the...,08-12-2020 01:01,i had my rides canceled back to back. they the...
4,Double charged me for an order,1,Two of the same orders was added by accident. ...,15-12-2020 04:02,Two of the same orders was added by accident. ...
...,...,...,...,...,...
485,Uber,5,Perdí mi cuenta no la puedo recuperar la use e...,16-01-2021 02:39,I lost my account I can't get it back I used i...
486,Crap crap crap,1,Still the same. I was forced to use it in Colo...,23-12-2020 00:15,Still the same. I was forced to use it in Colo...
487,Sleeping Drivers,1,It is a 30 minute commute from my household to...,16-12-2020 19:10,It is a 30 minute commute from my household to...
488,Bad design re: offer code redemption and issue...,1,Was sent a $30 off UBer Eats. I thought about ...,25-11-2020 23:06,Was sent a $30 off UBer Eats. I thought about ...
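
Because the `try`/`except` in the cell above wraps the whole column, a single failing request would leave `Translated_Review` uncreated. One possible alternative, sketched here under assumptions, is to fall back to the original text row by row. `safe_translate` and `fake_translate` are hypothetical names; `fake_translate` only stands in for a real translator call so that the example runs offline.

```python
from typing import Callable

def safe_translate(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate one review; return the original text if the call fails."""
    try:
        return translate_fn(text)
    except Exception:
        return text

def fake_translate(text: str) -> str:
    # Stand-in for a real translator; fails on one input to simulate an error.
    if text == "boom":
        raise AttributeError("simulated translator failure")
    return text.upper()

reviews = ["hola", "boom", "ok"]
translated = [safe_translate(r, fake_translate) for r in reviews]
```

With the notebook's translator, the callable passed in would presumably wrap `translator.translate(text, src='auto', dest='en').text`, mirroring the call already used in the cell above.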


## Drop the original Review column

In [41]:
df1 = df1.drop(['Review'],axis=1)
df1.head()

Unnamed: 0,Title,Rating,Date,Translated_Review
0,Dishonest and Disgusting,1,29-12-2020 01:14,"For half an hour, we tried EVERY UBER SERVICE ..."
1,Free offer,2,01-01-2021 23:17,If I’m not eligible for the offer Stop floodin...
2,Inaccurate,2,15-01-2021 23:38,Consistently inaccurate Uber Eats ETA and the ...
3,bad,1,08-12-2020 01:01,i had my rides canceled back to back. they the...
4,Double charged me for an order,1,15-12-2020 04:02,Two of the same orders was added by accident. ...


In [42]:
# Load the emoji lookup table (code -> description) and pad codes to match the '<U+000...>' form in the reviews
df_emojis = pd.read_csv("https://raw.githubusercontent.com/Kenrich005/Uber_reviews_textanalytics/main/emoji_description.csv")
df_emojis['Code'] = df_emojis['Code'].str.replace('+', '+000', regex=False)  # literal '+', not a regex
df_emojis.head()


Unnamed: 0,Code,CLDR Short Name
0,<U+0001F600>,grinning face
1,<U+0001F603>,grinning face with big eyes
2,<U+0001F604>,grinning face with smiling eyes
3,<U+0001F601>,beaming face with smiling eyes
4,<U+0001F606>,grinning squinting face


In [43]:
df_emojis

Unnamed: 0,Code,CLDR Short Name
0,<U+0001F600>,grinning face
1,<U+0001F603>,grinning face with big eyes
2,<U+0001F604>,grinning face with smiling eyes
3,<U+0001F601>,beaming face with smiling eyes
4,<U+0001F606>,grinning squinting face
...,...,...
2101,<subdivision-flag>,subdivision-flag
2102,<Code>,CLDR Short Name
2103,<U+0001F3F4 U+000E0067 U+000E0062 U+000E0065 U...,flag: England
2104,<U+0001F3F4 U+000E0067 U+000E0062 U+000E0073 U...,flag: Scotland


In [44]:
# Replacing emoticon with its respective meaning
to_replace = df_emojis.Code.tolist()
replace_with = df_emojis['CLDR Short Name'].tolist()

# using zip() to convert lists to dictionary
res = dict(zip(to_replace, replace_with))

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(str(i), j + " ")
    return text

In [45]:
### Check the following indexes for emoticon replacement:
## [1, 37, 89, 132, 145, 190, 208, 214, 237, 357, 367, 439]
## Indexes 439 and 145 did not get replaced; can't find the emojis to replace
###

In [46]:
df1.Translated_Review = df1.Translated_Review.apply(lambda text: replace_all(text, res))
df1.Translated_Review[1]

'If I’m not eligible for the offer Stop flooding my email with this false information pouting face pouting face pouting face '
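
The `replace_all` loop above scans the text once per dictionary entry (over 2,000 entries here). A possible alternative, shown as a sketch and not as the notebook's method, is to build one regular expression from all codes and substitute in a single pass. `replace_codes` and the two-entry `demo_map` are illustrative names and data.

```python
import re

def replace_codes(text: str, mapping: dict) -> str:
    """Replace every code found in `mapping` with its description plus a space."""
    if not mapping:
        return text
    # Longest codes first so multi-part codes win over shorter overlapping ones.
    codes = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))
    return pattern.sub(lambda m: mapping[m.group(0)] + " ", text)

# Small illustrative mapping in the same '<U+000...>' format as df_emojis.
demo_map = {"<U+0001F621>": "pouting face", "<U+0001F600>": "grinning face"}
demo = replace_codes("false information <U+0001F621><U+0001F621>", demo_map)
```

The trailing space after each description matches the behaviour of `replace_all`, so consecutive emojis stay separated.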

In [47]:
# df1.Review[37]

In [48]:
# replace_all(df.Review[89],res)

In [49]:
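# Drop leftover '<U+...>' codes by truncating each review at the first '<'
# (note: any text after that point is discarded too)
df1.Translated_Review = df1.Translated_Review.str.split('<').str[0]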
df1.shape

(490, 4)

In [50]:
df1.Translated_Review[1]

'If I’m not eligible for the offer Stop flooding my email with this false information pouting face pouting face pouting face '

In [51]:
# Treat empty reviews as missing and drop them (assignment instead of chained inplace calls)
df1['Translated_Review'] = df1['Translated_Review'].replace('', np.nan)
df1 = df1.dropna(subset=['Translated_Review'])
df1.shape

(489, 4)

In [52]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# define unit func to process one doc
from nltk import sent_tokenize, word_tokenize
def vader_unit_func(doc0,column_name):
    sents_list0 = sent_tokenize(doc0)
    vs_doc0 = []
    sent_ind = []
    for i in range(len(sents_list0)):
        vs_sent0 = analyzer.polarity_scores(sents_list0[i])
        vs_doc0.append(vs_sent0)
        sent_ind.append(i)
        
    # obtain output as DF    
    doc0_df = pd.DataFrame(vs_doc0)
    doc0_df.columns = [x+column_name for x in doc0_df.columns]
    doc0_df.insert(0, 'sent_index', sent_ind)  # insert sent index
    doc0_df.insert(doc0_df.shape[1], 'sentence', sents_list0)
    return(doc0_df)

# define wrapper func
def vader_wrap_func(corpus0,column_name):
    
    # use isinstance() to check & convert input to DF
    if isinstance(corpus0, list):
        corpus0 = pd.DataFrame({'text':corpus0})
    
    # define empty DF to concat unit func output to
    vs_df = pd.DataFrame()    
    
    # apply unit-func to each doc & loop over all docs
    for i1 in range(len(corpus0)):
        doc0 = str(corpus0.iloc[i1])
        vs_doc_df = vader_unit_func(doc0,column_name)  # applying unit-func
        vs_doc_df.insert(0, 'doc_index', i1)  # inserting doc index
        vs_df = pd.concat([vs_df, vs_doc_df], axis=0)
        
    return(vs_df)

In [53]:
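# test-drive wrapper func
import nltk
review_sentiment = vader_wrap_func(df1.Translated_Review,'_Translated_Review').groupby('doc_index').sum()
title_sentiment = vader_wrap_func(df1.Title,'_title').groupby('doc_index').sum()
# Caution: doc_index is positional (0..488), while df1 kept its original row labels after the
# dropna above, so concat aligns on mismatched labels and yields 490 rows with NaNs
df1 = pd.concat([df1,review_sentiment,title_sentiment],axis=1)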
print(df1.shape)
df1.head()

(490, 14)


Unnamed: 0,Title,Rating,Date,Translated_Review,sent_index,neg_Translated_Review,neu_Translated_Review,pos_Translated_Review,compound_Translated_Review,sent_index.1,neg_title,neu_title,pos_title,compound_title
0,Dishonest and Disgusting,1.0,29-12-2020 01:14,"For half an hour, we tried EVERY UBER SERVICE ...",3.0,0.0,2.876,0.124,0.1406,0.0,0.877,0.123,0.0,-0.7964
1,Free offer,2.0,01-01-2021 23:17,If I’m not eligible for the offer Stop floodin...,0.0,0.099,0.901,0.0,-0.296,0.0,0.0,0.233,0.767,0.5106
2,Inaccurate,2.0,15-01-2021 23:38,Consistently inaccurate Uber Eats ETA and the ...,0.0,0.179,0.821,0.0,-0.34,0.0,0.0,1.0,0.0,0.0
3,bad,1.0,08-12-2020 01:01,i had my rides canceled back to back. they the...,10.0,1.167,3.592,0.241,-0.1617,0.0,1.0,0.0,0.0,-0.5423
4,Double charged me for an order,1.0,15-12-2020 04:02,Two of the same orders was added by accident. ...,21.0,0.908,5.614,0.478,-0.4906,0.0,0.265,0.735,0.0,-0.2023
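
The shape `(490, 14)` above is one row larger than the 489 reviews that went in. A likely cause is that `pd.concat(..., axis=1)` aligns on index labels: `df1` still carries its original labels with one gap, while the sentiment frames are indexed by position. The toy frames below (made-up values, hypothetical names) sketch the effect and one possible fix, resetting the index before concatenating.

```python
import pandas as pd

# Toy stand-in for df1 after the row labelled 1 has been dropped.
left = pd.DataFrame({"Rating": [1, 2, 5]}, index=[0, 2, 3])
# Toy stand-in for the sentiment scores, indexed by position 0..n-1.
right = pd.DataFrame({"compound": [-0.5, 0.1, 0.9]}, index=[0, 1, 2])

# Aligning on labels gives the union of both indexes: 4 rows, with NaNs.
misaligned = pd.concat([left, right], axis=1)

# Resetting the left index makes labels and positions agree: 3 rows, no NaNs.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
```

In the misaligned case, rows after the gap are also paired with a neighbouring row's scores, which a later `dropna` does not repair.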


In [54]:
# Converting Date into datetime format
df1['Date'] =  pd.to_datetime(df1['Date'], format='%d-%m-%Y %H:%M')
df1.Date.head()

0   2020-12-29 01:14:00
1   2021-01-01 23:17:00
2   2021-01-15 23:38:00
3   2020-12-08 01:01:00
4   2020-12-15 04:02:00
Name: Date, dtype: datetime64[ns]

In [55]:
# Note: Series.between() is inclusive on both ends by default, so hours 8, 12 and 16 are
# flagged in two adjacent bins, and hour 20 counts as 'Eve' rather than 'Night'
df1['Isweekend'] = np.where(df1.Date.dt.dayofweek>4,1,0)
df1['Late_night'] = np.where(df1.Date.dt.hour<4,1,0)
df1['Early_mrng'] = np.where(df1.Date.dt.hour.between(4,8),1,0)
df1['Morning'] = np.where(df1.Date.dt.hour.between(8,12),1,0)
df1['Noon'] = np.where(df1.Date.dt.hour.between(12,16),1,0)
df1['Eve'] = np.where(df1.Date.dt.hour.between(16,20),1,0)
df1['Night'] = np.where(df1.Date.dt.hour>20,1,0)
df1.head()

Unnamed: 0,Title,Rating,Date,Translated_Review,sent_index,neg_Translated_Review,neu_Translated_Review,pos_Translated_Review,compound_Translated_Review,sent_index.1,...,neu_title,pos_title,compound_title,Isweekend,Late_night,Early_mrng,Morning,Noon,Eve,Night
0,Dishonest and Disgusting,1.0,2020-12-29 01:14:00,"For half an hour, we tried EVERY UBER SERVICE ...",3.0,0.0,2.876,0.124,0.1406,0.0,...,0.123,0.0,-0.7964,0,1,0,0,0,0,0
1,Free offer,2.0,2021-01-01 23:17:00,If I’m not eligible for the offer Stop floodin...,0.0,0.099,0.901,0.0,-0.296,0.0,...,0.233,0.767,0.5106,0,0,0,0,0,0,1
2,Inaccurate,2.0,2021-01-15 23:38:00,Consistently inaccurate Uber Eats ETA and the ...,0.0,0.179,0.821,0.0,-0.34,0.0,...,1.0,0.0,0.0,0,0,0,0,0,0,1
3,bad,1.0,2020-12-08 01:01:00,i had my rides canceled back to back. they the...,10.0,1.167,3.592,0.241,-0.1617,0.0,...,0.0,0.0,-0.5423,0,1,0,0,0,0,0
4,Double charged me for an order,1.0,2020-12-15 04:02:00,Two of the same orders was added by accident. ...,21.0,0.908,5.614,0.478,-0.4906,0.0,...,0.735,0.0,-0.2023,0,0,1,0,0,0,0
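
If non-overlapping time-of-day bins are wanted, one option is to use half-open intervals so that every hour falls into exactly one bin. The sketch below uses the notebook's column names as labels; the `DAY_PARTS` table and the `day_part` function are hypothetical, and treating hour 20 as `Night` is an assumption about the intended boundaries.

```python
from datetime import datetime

# Half-open bins [start, end): each hour 0-23 belongs to exactly one label.
DAY_PARTS = [(0, 4, "Late_night"), (4, 8, "Early_mrng"), (8, 12, "Morning"),
             (12, 16, "Noon"), (16, 20, "Eve"), (20, 24, "Night")]

def day_part(hour: int) -> str:
    """Return the single time-of-day label for an hour in 0..23."""
    for start, end, label in DAY_PARTS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour out of range: {hour}")

# Same timestamp format as the Date column in the dataset.
stamp = datetime.strptime("15-12-2020 04:02", "%d-%m-%Y %H:%M")
label = day_part(stamp.hour)
is_weekend = int(stamp.weekday() > 4)   # Saturday=5, Sunday=6
```

Applied to the notebook's data, the labels could be produced with something like `df1.Date.dt.hour.map(day_part)` and then one-hot encoded. Since exhaustive dummies sum to 1, one of them would need to be dropped before fitting a model with an intercept.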


In [56]:
df1=df1.drop(['sent_index','Title','Translated_Review','Date'],axis=1)
df1.head()

Unnamed: 0,Rating,neg_Translated_Review,neu_Translated_Review,pos_Translated_Review,compound_Translated_Review,neg_title,neu_title,pos_title,compound_title,Isweekend,Late_night,Early_mrng,Morning,Noon,Eve,Night
0,1.0,0.0,2.876,0.124,0.1406,0.877,0.123,0.0,-0.7964,0,1,0,0,0,0,0
1,2.0,0.099,0.901,0.0,-0.296,0.0,0.233,0.767,0.5106,0,0,0,0,0,0,1
2,2.0,0.179,0.821,0.0,-0.34,0.0,1.0,0.0,0.0,0,0,0,0,0,0,1
3,1.0,1.167,3.592,0.241,-0.1617,1.0,0.0,0.0,-0.5423,0,1,0,0,0,0,0
4,1.0,0.908,5.614,0.478,-0.4906,0.265,0.735,0.0,-0.2023,0,0,1,0,0,0,0


In [57]:
# Removing null values
df1.dropna(inplace=True)

### Preliminary Regression Model


In [58]:
y = df1.Rating
X = df1.drop('Rating', axis=1)
y.shape, X.shape

((488,), (488, 15))

In [59]:
X.isnull().sum()

neg_Translated_Review         0
neu_Translated_Review         0
pos_Translated_Review         0
compound_Translated_Review    0
neg_title                     0
neu_title                     0
pos_title                     0
compound_title                0
Isweekend                     0
Late_night                    0
Early_mrng                    0
Morning                       0
Noon                          0
Eve                           0
Night                         0
dtype: int64

Convert the DataFrame to a NumPy array to inspect its values.

In [60]:
np.asarray(df1)

array([[1.   , 0.   , 2.876, ..., 0.   , 0.   , 0.   ],
       [2.   , 0.099, 0.901, ..., 0.   , 0.   , 1.   ],
       [2.   , 0.179, 0.821, ..., 0.   , 0.   , 1.   ],
       ...,
       [1.   , 0.672, 4.864, ..., 0.   , 0.   , 0.   ],
       [1.   , 0.334, 5.264, ..., 0.   , 1.   , 0.   ],
       [1.   , 0.18 , 0.775, ..., 0.   , 0.   , 1.   ]])

In [61]:
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(y,X).fit()
model.summary()


Dep. Variable:,Rating,R-squared:,0.029
Model:,OLS,Adj. R-squared:,-0.001
Method:,Least Squares,F-statistic:,0.9533
Date:,"Wed, 13 Jul 2022",Prob (F-statistic):,0.504
Time:,05:32:57,Log-Likelihood:,-690.1
No. Observations:,488,AIC:,1412.0
Df Residuals:,472,BIC:,1479.0
Df Model:,15,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.5813,0.312,5.075,0.000,0.969,2.194
neg_Translated_Review,-0.0017,0.167,-0.010,0.992,-0.330,0.327
neu_Translated_Review,-0.0193,0.020,-0.950,0.342,-0.059,0.021
pos_Translated_Review,0.1073,0.203,0.527,0.598,-0.292,0.507
compound_Translated_Review,0.0298,0.101,0.294,0.769,-0.169,0.229
neg_title,0.0093,0.387,0.024,0.981,-0.751,0.769
neu_title,-0.0329,0.231,-0.142,0.887,-0.488,0.422
pos_title,0.0102,0.413,0.025,0.980,-0.802,0.822
compound_title,0.2769,0.364,0.760,0.447,-0.439,0.993

Omnibus:,205.668,Durbin-Watson:,2.116
Prob(Omnibus):,0.0,Jarque-Bera (JB):,631.699
Skew:,2.077,Prob(JB):,6.74e-138
Kurtosis:,6.717,Cond. No.,52.4


In [62]:
X.drop('const',axis=1,inplace=True)
X.head()

Unnamed: 0,neg_Translated_Review,neu_Translated_Review,pos_Translated_Review,compound_Translated_Review,neg_title,neu_title,pos_title,compound_title,Isweekend,Late_night,Early_mrng,Morning,Noon,Eve,Night
0,0.0,2.876,0.124,0.1406,0.877,0.123,0.0,-0.7964,0,1,0,0,0,0,0
1,0.099,0.901,0.0,-0.296,0.0,0.233,0.767,0.5106,0,0,0,0,0,0,1
2,0.179,0.821,0.0,-0.34,0.0,1.0,0.0,0.0,0,0,0,0,0,0,1
3,1.167,3.592,0.241,-0.1617,1.0,0.0,0.0,-0.5423,0,1,0,0,0,0,0
4,0.908,5.614,0.478,-0.4906,0.265,0.735,0.0,-0.2023,0,0,1,0,0,0,0


In [63]:
# !pip install -U scikit-learn

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((341, 15), (147, 15), (341,), (147,))

### Calculating VIF

In [64]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

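def calc_vif(X):
    # One VIF per column: regress column i on all the other columns of X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    # (the original sort_values() call discarded its result, so it has been removed)
    return vif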

In [65]:
calc_vif(X)

Unnamed: 0,variables,VIF
0,neg_Translated_Review,4.756513
1,neu_Translated_Review,3.560665
2,pos_Translated_Review,4.679321
3,compound_Translated_Review,3.447162
4,neg_title,8.887433
5,neu_title,8.53454
6,pos_title,3.214837
7,compound_title,7.521217
8,Isweekend,1.467544
9,Late_night,3.468579


Generally, a VIF above 4 (equivalently, a tolerance below 0.25) indicates that multicollinearity might exist and further investigation is required.   
When the VIF is higher than 10 (tolerance lower than 0.1), there is significant multicollinearity that needs to be corrected.  
  
In the table above, `neg_title` (8.89), `neu_title` (8.53) and `compound_title` (7.52) are well above 4, and `neg_Translated_Review` (4.76) and `pos_Translated_Review` (4.68) are slightly above it. None of the values shown exceeds 10, so the title sentiment scores warrant further investigation, but the multicollinearity is not severe by the second rule.

Raktim: please note that 'neg_title', 'neu_title' and 'compound_title' are above 4.
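
For reference, VIF and tolerance are two views of the same quantity. In the sketch below, $R_j^2$ denotes the R-squared obtained by regressing predictor $j$ on the remaining predictors; this notation is introduced here for illustration and does not appear elsewhere in the notebook.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \mathrm{tolerance}_j = 1 - R_j^2 = \frac{1}{\mathrm{VIF}_j}
$$

A VIF of 4 therefore corresponds to a tolerance of 0.25, and a VIF of 10 to a tolerance of 0.1. Note also that `calc_vif(X)` is called after the `const` column was dropped, so the auxiliary regressions are fit without an intercept, which can change the reported values.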

In [66]:
# We will save the model performance metrics in a DataFrame

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score
Model = []
RMSE = []
R_sq = []
cv = KFold(5)

#Creating a Function to append the cross validation scores of the algorithms
def input_scores(name, model, x, y):
    Model.append(name)
    RMSE.append(np.sqrt((-1) * cross_val_score(model, x, y, cv=cv, scoring='neg_mean_squared_error').mean()))
    R_sq.append(cross_val_score(model, x, y, cv=cv, scoring='r2').mean())

In [67]:
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor)

names = ['Linear Regression', 'Ridge Regression', 'Lasso Regression',
         'K Neighbors Regressor', 'Decision Tree Regressor', 
         'Random Forest Regressor', 'Gradient Boosting Regressor',
         'Adaboost Regressor','XGBRegressor']

models = [LinearRegression(), Ridge(), Lasso(),
          KNeighborsRegressor(), DecisionTreeRegressor(),
          RandomForestRegressor(), GradientBoostingRegressor(), 
          AdaBoostRegressor(),XGBRegressor()]

#Running all algorithms
for name, model in zip(names, models):
    input_scores(name, model, X_train, y_train)



Reference: https://www.kaggle.com/swatisinghalmav/best-of-8-regression-models-to-predict-strength

In [68]:
evaluation = pd.DataFrame({'Model': Model,'RMSE': RMSE,'R Squared': R_sq})
print("FOLLOWING ARE THE CROSS-VALIDATION SCORES ON THE TRAINING SET: ")
evaluation

FOLLOWING ARE THE CROSS-VALIDATION SCORES ON THE TRAINING SET: 


Unnamed: 0,Model,RMSE,R Squared
0,Linear Regression,1.06001,-0.11361
1,Ridge Regression,1.053015,-0.096455
2,Lasso Regression,1.013998,-0.009029
3,K Neighbors Regressor,1.132657,-0.285712
4,Decision Tree Regressor,1.453309,-1.186936
5,Random Forest Regressor,1.120518,-0.245664
6,Gradient Boosting Regressor,1.150093,-0.330727
7,Adaboost Regressor,1.193427,-0.300639
8,XGBRegressor,1.142377,-0.310352


## Next Steps:
1. Convert non-English reviews to English or use a non-English dictionary (needs to be added to the current code)
2. Scale the emoticon replacement (validation pending)
3. Run sentiment analysis on Title (done)
4. From Date, extract weekend, weekday, morning, afternoon, evening, night (done)
5. Build a preliminary regression model with Rating as the y variable (done)
6. Use OLS (done)
7. Feature engineering: columns based on specific word counts
8. Shiny App
9. Add PCA
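
As a starting point for the word-count features in step 7, the sketch below counts how often tokens beginning with a given keyword occur in a review. The keyword list is purely hypothetical and would need to be chosen from the actual complaints; `keyword_counts` is an illustrative name.

```python
import re
from collections import Counter

KEYWORDS = ["charge", "cancel", "driver", "refund"]   # hypothetical complaint terms

def keyword_counts(text: str, keywords=KEYWORDS) -> dict:
    """Count tokens that start with each keyword, e.g. 'canceled' counts for 'cancel'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {f"n_{kw}": sum(c for tok, c in counts.items() if tok.startswith(kw))
            for kw in keywords}

features = keyword_counts("Driver canceled twice. They charged me and the driver left.")
```

Each review's dictionary could then be expanded into columns, for example with `df1.Translated_Review.apply(keyword_counts).apply(pd.Series)`, before the text column is dropped.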