# **WARNING** 

### We are dealing with raw web data. Some of the information that is retrieved might contain certain explicit content (words, phrases, or references)

# Data Engineering - NLP

## Exercise 1: NLP Tweets

For this exercise, use TfidfVectorizer and any TWO classification models you would like to correctly identify the sentiments of each review, in the Restaurant_Reviews.tsv file, as (Positive, or Negative). 

### Remember:
    1. Split your data into Train and Test sets
    2. Evaluate your model using the metrics of your choice (include a brief interpretation)
    3. Explain which model performed better and why (comparison of results)

In [1]:
import numpy as np
import pandas as pd
import scipy as sc
import sklearn
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sys

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report,confusion_matrix, f1_score, recall_score

import gensim
import gensim.downloader as model_api
word_vectors = model_api.load("glove-wiki-gigaword-50")

[nltk_data] Downloading package stopwords to /Users/Sam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
#Exercise 1
# TfidfVectorizer and any TWO classification models: LogisticRegression, RandomForestClassifier

# More classifications:
# https://cprosenjit.medium.com/10-classification-methods-from-scikit-learn-we-should-know-40c03ab8b077

In [3]:
df = pd.read_table("../data/Restaurant_Reviews.tsv")
df

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


In [39]:
Reviews = [] # empty array

#range of 1000
for i in range(0,1000):
    #
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
    #
    review = review.lower()
    #
    review = review.split()
    #
    #
    ps = PorterStemmer() # Algo to remove suffix words in english
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    # stopwords are english words which don't add meaning to a sentence: they are being removed from the sentence

    review = ' '.join(review)
    Reviews.append(review)

In [44]:
df_clean = pd.DataFrame({'Reviews':Reviews, 'Liked':df['Liked']})
df_clean

Unnamed: 0,Reviews,Liked
0,wow love place,1
1,crust good,0
2,tasti textur nasti,0
3,stop late may bank holiday rick steve recommen...,1
4,select menu great price,1
...,...,...
995,think food flavor textur lack,0
996,appetit instantli gone,0
997,overal impress would go back,0
998,whole experi underwhelm think go ninja sushi n...,0


In [None]:
# Words cleaning: Stop words, punctuation

In [45]:
tf = TfidfVectorizer()   #Create an instance of our TfidfVectorize()

X = (df_clean['Reviews']).copy() 
y = (df_clean['Liked']).copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

tf_X_train = tf.fit_transform(X_train)  #fit_transform on training data
tf_X_test = tf.transform(X_test)

In [50]:

log_model = LogisticRegression(max_iter=1000)

log_model.fit(tf_X_train,y_train)           

preds = log_model.predict(tf_X_test)

print(classification_report(y_test, preds))
acc = log_model.score(tf_X_test,y_test)
rec = recall_score(y_test, preds)
f1_score = 2*(acc*rec)/(acc+rec)
print('Accuracy:',round(acc,2))
print('Recall:',round(rec,2))
print('f1_score:',round(f1_score,2))


              precision    recall  f1-score   support

           0       0.81      0.77      0.79       150
           1       0.78      0.81      0.80       150

    accuracy                           0.79       300
   macro avg       0.79      0.79      0.79       300
weighted avg       0.79      0.79      0.79       300

Accuracy: 0.79
Recall: 0.81
f1_score: 0.8


In [51]:
model_rf = RandomForestClassifier(random_state=2)
model_rf.fit(tf_X_train, y_train)
preds_rf = model_rf.predict(tf_X_test)

print(classification_report(y_test, preds_rf))
acc = model_rf.score(tf_X_test,y_test)
rec = recall_score(y_test, preds_rf)
f1_score = 2*(acc*rec)/(acc+rec)
print('Accuracy:',round(acc,2))
print('Recall:',round(rec,2))
print('f1_score:',round(f1_score,2))


              precision    recall  f1-score   support

           0       0.76      0.83      0.79       150
           1       0.81      0.73      0.77       150

    accuracy                           0.78       300
   macro avg       0.78      0.78      0.78       300
weighted avg       0.78      0.78      0.78       300

Accuracy: 0.78
Recall: 0.73
f1_score: 0.76


In [52]:
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(tf_X_train, y_train)
preds_knn = knn.predict(tf_X_test)

print(classification_report(y_test, preds_knn))
acc = knn.score(tf_X_test,y_test)
rec = recall_score(y_test,preds_knn)
f1_score = 2*(acc*rec)/(acc+rec)
print('Accuracy:',round(acc,2))
print('Recall:',round(rec,2))
print('f1_score:',round(f1_score,2))

              precision    recall  f1-score   support

           0       0.71      0.74      0.72       150
           1       0.73      0.69      0.71       150

    accuracy                           0.72       300
   macro avg       0.72      0.72      0.72       300
weighted avg       0.72      0.72      0.72       300

Accuracy: 0.72
Recall: 0.69
f1_score: 0.7


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [None]:
# INTERPRETATION:
          LogisticREGRESION RANDOMFOREST  KNeighborsCLASSIFIER
Accuracy:       0.79           0.78          0.72
Recall:         0.81           0.73          0.69
f1_score:       0.80           0.76          0.70

   The best model overall is the LOGISTIC REGRESSION with f1_score: 0.80
   Classification RANKING: 1> LogisticREGRESION 2> RANDOMFOREST 3> KNeighborsCLASSIFIER




## Exercise 2: App Review NLP work (Similar to Web Data workshop)

The Apple app store has a `GET` API to get reviews on apps. The URL is:

```
https://itunes.apple.com/{COUNTRY_CODE}/rss/customerreviews/id={APP_ID_HERE}/page={PAGE_NUMBER}/sortby=mostrecent/json
```

Note that you need to provide:

- The country codes (`'us'`, `'gb'`, `'ca'`, `'au'`) - use all four
- The app ID. This can be found in the web page for the app right after `id`.
    - You will need to find the IDs for these apps - Candy Crush, Facebook, Twitter & Tinder
- The "Page Number". The request responds with multiple pages of data, but sends them one at a time. So you can cycle through the data pages for any app on any country. (Be careful, there are limits to the number of pages you can access)

For example, Candy Crush's US webpage is `https://apps.apple.com/us/app/candy-crush-saga/id553834731`, which means that the ID is `553834731`.


Do the following:

1. Using any method you want (pre-trained models, dimensionality reduction, feature engineering, etc.) make the best **regression** model you can to predict the 5 star rating. Rate the accuracy in regression terms (mean squared error) and in classification terms (accuracy score, etc.)
1. Do the same as 1.1, but use a classification model. Are classification models better or worse to predict a 5-point rating scale? Explain in a few paragraphs and justify with metrics.

ps. Feel free to do as much data engineering to boost your model. (ie binary vs multinomial)


In [4]:
# exercise 2
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import re 
import json
import matplotlib.pyplot as plt

In [5]:

app_dic =  {'553834731':'CandyCrush', '547702041':'Tinder', '284882215':'Facebook', '333903271':'Twitter'}

URL_main = 'https://itunes.apple.com/'
uri_0 = {1:'us', 2:'gb', 3:'ca', 4:'au'}
uri_1 = '/rss/customerreviews/id='
uri_2 = {1:'553834731', 2:'547702041', 3:'284882215', 4:'333903271'}
uri_3 = '/page='
uri_4 = '10' #'Range[1-10]' # US limit: 10 pages | gb Limit: 10 pages | CA limit: 10 pages | au limit: 10 pages 
uri_5 = '/sortby=mostrecent/json'


pack_urls = {}
for i,valuei in uri_0.items():
    for j,valuej in uri_2.items():
        print('country: ', valuei)
        print('App: ', app_dic[valuej])
        
        pages = []
        for page_num in range(1,11):
            uri_4 = str(page_num)
            total_url = URL_main + valuei + uri_1 + valuej + uri_3 + uri_4 + uri_5 
            print(total_url)
            pages.append(total_url)
        id_url = valuei + app_dic[valuej]    
        pack_urls[id_url] = pages
        print(pages)
print(pack_urls)        
#print(pd.DataFrame.from_dict(json.loads(requests.get(total_url).content)))



country:  us
App:  CandyCrush
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=1/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=2/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=3/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=4/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=5/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=6/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=7/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=8/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=9/sortby=mostrecent/json
https://itunes.apple.com/us/rss/customerreviews/id=553834731/page=10/sortby=mostrecent/json
['https://itunes.apple.com/us/rss/customerreviews/id=553834

In [6]:
# Function to retrieve json data from the given API url

def get_json_data(mini_url, page):
    if page >0:
        page -=1
        
    get_url = pack_urls[mini_url]
    local_req = requests.get(get_url[page])
    js_data = json.loads(local_req.content)
    return js_data

In [8]:
# Returns a tuple of ( words in title where score is 1 , words in title where score is 2, 
# words in title where score is 3, words in title where score is 4, words in title where score is 5 )

def app_title(json_data):
    
    app_titles_1=[]
    app_titles_2=[]
    app_titles_3=[]
    app_titles_4=[]
    app_titles_5=[]
    
    entries = json_data['feed']['entry']
    for entry in entries:
        
        title = entry['title']['label']
        rating = int(entry['im:rating']['label'])
        
        if rating == 1:
            app_titles_1.append((title.split())[0])
        if rating == 2:
            app_titles_2.append((title.split())[0])
        if rating == 3:
            app_titles_3.append((title.split())[0])
        if rating == 4:
            app_titles_4.append((title.split())[0])
        if rating == 5:
            app_titles_5.append((title.split())[0])

    return app_titles_1, app_titles_2, app_titles_3, app_titles_4, app_titles_5

In [9]:

words_vs_score_5_all_apps = []
words_vs_score_1_all_apps = []

words_vs_score_4_all_apps = []
words_vs_score_3_all_apps = []
words_vs_score_2_all_apps = []

for k in pack_urls.keys():
    print('MiniURL:', k)
    for page in range(1,11):
        data = get_json_data(k,page)
        words_score_1,words_score_2,words_score_3,words_score_4,words_score_5 = app_title(data)

        words_vs_score_1_all_apps.append(' '.join(words_score_1))
        words_vs_score_5_all_apps.append(' '.join(words_score_5))

        words_vs_score_4_all_apps.append(' '.join(words_score_4))
        words_vs_score_3_all_apps.append(' '.join(words_score_3))
        words_vs_score_2_all_apps.append(' '.join(words_score_2))



num_5 = len(words_vs_score_5_all_apps)
num_1 = len(words_vs_score_1_all_apps)

num_4 = len(words_vs_score_4_all_apps)
num_3 = len(words_vs_score_3_all_apps)
num_2 = len(words_vs_score_2_all_apps)


MiniURL: usCandyCrush
MiniURL: usTinder
MiniURL: usFacebook
MiniURL: usTwitter
MiniURL: gbCandyCrush
MiniURL: gbTinder
MiniURL: gbFacebook
MiniURL: gbTwitter
MiniURL: caCandyCrush
MiniURL: caTinder
MiniURL: caFacebook
MiniURL: caTwitter
MiniURL: auCandyCrush
MiniURL: auTinder
MiniURL: auFacebook
MiniURL: auTwitter


In [10]:
words_vs_score_3_all_apps

['Game It music It’s',
 'Too Pésimo Latest Use Extra Servers Hard',
 'Crashes Too Level',
 '2477 Complain Hate',
 'Annoying Window',
 'Irritating Portrait It’s Estrésate',
 'Candy Message Why LANDSCAPE, iPad Impossible Used',
 'Control Who weekly Animations TOO $$$$ Pop',
 'Changed Don’t My Bring Listen Hate',
 'Keep Rewards Only Waiting Problems WHY Bonuses',
 'Please This',
 'Disappointed Single',
 '',
 "Wish It's Perks",
 'Troubles Likes',
 'Mixed Tinder Not',
 'App',
 'Kinda',
 'Hot Blah',
 'Is Tinder',
 'The STILL iPad',
 '“There Avatar Hacked I',
 'iPhone Notifications 🤔there’s',
 '',
 'DARKMODE',
 'Automated Page Notifications Ads',
 'Default No Newest Bug',
 'Bug',
 'Unable Update?',
 'Can’t CAN’T',
 'Russian 🇳🇬🇯🇵🌍🖤Ambassador Review Complain',
 'Freedom',
 'How Add',
 'Missy',
 'Going Improvements It’s',
 'Clustered Please',
 'Eh… TwitterBlue',
 'Error Learning User',
 'Comment Used Banned Kate',
 'He My Fakes',
 'Shame Lives Please Far Changes It’s Change Stop',
 'Game Portrai

In [11]:
# List of positive reviews ( score 5): Flag to 5 
list_of_5  = np.ones(num_5)*5
# List of reviews ( score 4): Flag to 4 
list_of_4  = np.ones(num_4)*4
# List of reviews ( score 3): Flag to 3 
list_of_3  = np.ones(num_3)*3
# List of reviews ( score 2): Flag to 2 
list_of_2  = np.ones(num_2)*2
# List of negative reviews ( score 1): Flag to 0 
list_of_1  = np.ones(num_1)

In [12]:
df_5 = pd.DataFrame({'Title':words_vs_score_5_all_apps, 'Rating':list_of_5})
df_4 = pd.DataFrame({'Title':words_vs_score_4_all_apps, 'Rating':list_of_4})
df_3 = pd.DataFrame({'Title':words_vs_score_3_all_apps, 'Rating':list_of_3})
df_2 = pd.DataFrame({'Title':words_vs_score_2_all_apps, 'Rating':list_of_2})
df_1 = pd.DataFrame({'Title':words_vs_score_1_all_apps, 'Rating':list_of_1})

In [13]:
df_1['Title']=df_1['Title'].convert_dtypes(convert_string=True)
df_2['Title']=df_2['Title'].convert_dtypes(convert_string=True)
df_3['Title']=df_3['Title'].convert_dtypes(convert_string=True)
df_4['Title']=df_4['Title'].convert_dtypes(convert_string=True)
df_5['Title']=df_5['Title'].convert_dtypes(convert_string=True)

df_1['Rating']=df_1['Rating'].astype(int)
df_2['Rating']=df_2['Rating'].astype(int)
df_3['Rating']=df_3['Rating'].astype(int)
df_4['Rating']=df_4['Rating'].astype(int)
df_5['Rating']=df_5['Rating'].astype(int)




In [14]:
def cleanup(df):
    Titles =[]
    for i in range(0,len(df['Title'])):
        review = re.sub('[^a-zA-Z]', ' ', df['Title'][i])
        review = review.lower()
        review = review.split()
        ps = PorterStemmer() # Algo to remove suffix words in english
        review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
        # stopwords are english words which don't add meaning to a sentence: they are being removed from the sentence

        review = ' '.join(review)
        Titles.append(review)
    return Titles
    


In [15]:

clean_titles1 = cleanup(df_1)
clean_titles2 = cleanup(df_2)
clean_titles3 = cleanup(df_3)
clean_titles4 = cleanup(df_4)
clean_titles5 = cleanup(df_5)


In [16]:
df_5 = pd.DataFrame({'Title':clean_titles5, 'Rating':list_of_5})
df_4 = pd.DataFrame({'Title':clean_titles4, 'Rating':list_of_4})
df_3 = pd.DataFrame({'Title':clean_titles3, 'Rating':list_of_3})
df_2 = pd.DataFrame({'Title':clean_titles2, 'Rating':list_of_2})
df_1 = pd.DataFrame({'Title':clean_titles1, 'Rating':list_of_1})

In [17]:
df_15 = pd.concat([df_1,df_2,df_3,df_4,df_5]).reset_index(drop=True)
df_15['Rating'] = df_15['Rating'].astype(int)
df_15

Unnamed: 0,Title,Rating
0,pay new review time goodby greedi sexual candi...,1
1,okay horribl anoth fix percol pleas custom las...,1
2,comparar stress need portrait rip disappoint n...,1
3,freemium freez fun fun bug portrait best rig u...,1
4,screen fact disappoint like goodby fun landsca...,1
...,...,...
795,love fantast review witto well great thank fre...,5
796,twitter elon twitter free love ty independ elo...,5
797,epic free twitter translat love new see happi ...,5
798,brilliant thank new great elon elon thank keep...,5


In [43]:
X = df_15['Title'].copy()
y = df_15['Rating'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [44]:
# Ex 2.1: Dimensionality reduction


In [45]:
# TFidfVectorizer + PCA:


# Apply our TfidfVectorizer to our data toarray() - PCA does not support sparse input
tf = TfidfVectorizer()

tf_X_train = tf.fit_transform(X_train).toarray()
tf_X_test = tf.transform(X_test).toarray()


In [46]:

pca_model = PCA(0.3).fit(tf_X_train)


In [47]:
pca_model.explained_variance_ratio_

array([0.03836161, 0.02843321, 0.0237424 , 0.01920891, 0.01643134,
       0.01143638, 0.01102869, 0.01059866, 0.01012152, 0.00951235,
       0.00924379, 0.00860184, 0.00851069, 0.00842271, 0.00821335,
       0.00811055, 0.00756037, 0.0075112 , 0.00735976, 0.00713611,
       0.00678456, 0.00673224, 0.00666336, 0.00637598, 0.00607511,
       0.00603836, 0.00593661])

In [48]:
pca_train = pca_model.transform(tf_X_train) 
pca_test = pca_model.transform(tf_X_test)


In [49]:
pca_train.shape,pca_test.shape

((560, 27), (240, 27))

In [50]:
import statsmodels.api as sm

pca_train_ = sm.add_constant(pca_train)

model_ols = sm.OLS(y_train, pca_train_).fit()
model_ols.summary()


0,1,2,3
Dep. Variable:,Rating,R-squared:,0.576
Model:,OLS,Adj. R-squared:,0.554
Method:,Least Squares,F-statistic:,26.72
Date:,"Sat, 11 Mar 2023",Prob (F-statistic):,1.66e-81
Time:,18:50:53,Log-Likelihood:,-739.78
No. Observations:,560,AIC:,1536.0
Df Residuals:,532,BIC:,1657.0
Df Model:,27,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.0161,0.039,76.721,0.000,2.939,3.093
x1,-3.1794,0.219,-14.545,0.000,-3.609,-2.750
x2,1.1012,0.254,4.337,0.000,0.602,1.600
x3,2.2506,0.278,8.100,0.000,1.705,2.796
x4,1.1734,0.309,3.798,0.000,0.567,1.780
x5,0.6015,0.334,1.801,0.072,-0.055,1.258
x6,4.4206,0.400,11.042,0.000,3.634,5.207
x7,0.6773,0.408,1.661,0.097,-0.124,1.478
x8,4.4317,0.416,10.656,0.000,3.615,5.249

0,1,2,3
Omnibus:,10.264,Durbin-Watson:,1.97
Prob(Omnibus):,0.006,Jarque-Bera (JB):,7.289
Skew:,0.156,Prob(JB):,0.0261
Kurtosis:,2.536,Cond. No.,14.1


In [51]:
pca_test_ = sm.add_constant(pca_test)
preds = model_ols.predict(pca_test_)

In [52]:
import math

# The range must be digits between 1-5
def evalPred(val):
    diff = val - int(val)
    if diff<0.5:
        ret = math.floor(val)
        if ret <1:
            return 1
        else:
            return ret
    else:
        ret = math.ceil(val)
        if ret < 5:
            return ret
        else:
            return 5
    
# To get integers value rather than floats:
preds = [evalPred(float(x)) for x in preds]


In [53]:
preds

[2,
 4,
 3,
 4,
 3,
 1,
 3,
 1,
 3,
 3,
 2,
 3,
 3,
 3,
 3,
 4,
 2,
 3,
 2,
 3,
 2,
 3,
 1,
 3,
 2,
 3,
 2,
 2,
 3,
 3,
 3,
 3,
 3,
 3,
 6,
 3,
 5,
 3,
 3,
 3,
 3,
 3,
 2,
 4,
 5,
 3,
 3,
 4,
 4,
 3,
 3,
 3,
 1,
 3,
 5,
 1,
 4,
 5,
 1,
 3,
 4,
 3,
 2,
 4,
 3,
 5,
 2,
 1,
 2,
 3,
 4,
 3,
 2,
 4,
 2,
 1,
 3,
 3,
 3,
 4,
 3,
 4,
 4,
 3,
 3,
 2,
 3,
 2,
 3,
 4,
 3,
 1,
 2,
 5,
 4,
 3,
 4,
 3,
 2,
 3,
 2,
 5,
 3,
 2,
 3,
 2,
 3,
 3,
 3,
 3,
 1,
 3,
 4,
 2,
 2,
 4,
 5,
 4,
 3,
 3,
 2,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 4,
 3,
 2,
 3,
 3,
 3,
 4,
 5,
 3,
 3,
 2,
 5,
 2,
 5,
 4,
 2,
 1,
 4,
 3,
 3,
 4,
 4,
 3,
 5,
 2,
 3,
 3,
 3,
 3,
 3,
 2,
 3,
 2,
 1,
 1,
 3,
 2,
 3,
 2,
 3,
 2,
 2,
 3,
 5,
 3,
 3,
 3,
 5,
 4,
 4,
 3,
 1,
 1,
 3,
 3,
 3,
 3,
 5,
 3,
 3,
 1,
 3,
 4,
 2,
 4,
 2,
 1,
 3,
 1,
 3,
 3,
 2,
 3,
 3,
 2,
 1,
 4,
 2,
 4,
 2,
 3,
 2,
 1,
 2,
 3,
 5,
 3,
 3,
 3,
 2,
 1,
 3,
 2,
 5,
 3,
 4,
 3,
 3,
 3,
 3,
 5,
 4,
 1,
 2,
 2,
 3,
 3]

In [54]:

print(classification_report(y_test, preds))



              precision    recall  f1-score   support

           1       0.95      0.39      0.55        54
           2       0.21      0.21      0.21        47
           3       0.27      0.70      0.39        46
           4       0.18      0.15      0.16        40
           5       0.83      0.28      0.42        53
           6       0.00      0.00      0.00         0

    accuracy                           0.35       240
   macro avg       0.41      0.29      0.29       240
weighted avg       0.52      0.35      0.36       240



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [55]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test,preds))

[[21 24  9  0  0  0]
 [ 0 10 33  4  0  0]
 [ 1  8 32  4  1  0]
 [ 0  5 27  6  2  0]
 [ 0  0 17 20 15  1]
 [ 0  0  0  0  0  0]]


In [None]:
# RATING TFidfVectorizer + PCA + OLS:
This approach gives an OLS model with R-Square of 0.576 which is considered good ( close to 0.7).

The PCA is set to the dimensions that gives at 30% of the variance.
It resulted in 28 dimensions with 13 non significant, based on their Pvalue.
The float predicted values are parsed through a function to convert them to the closest integer in the range[0-5]

Having used a regression, we can evaluate what the classification would have been:
    ACCURACY is 35% only, which is considered bad.
    The confusion matrix highlights that the false positives and false negatives are high:
        ACCURACY = (TP + TN) / (TP + TN + FN + FP)
        in our case, FN + FP are elevated resulting in a low accuracy
        
    PRECISION for scores 1 and 5 is very good.
    Scores at 1 the precision is 95%
    Scores at 5 the precision is 83%
    Any score in between has a very low predictible precision < 30%

    RECALL is best for score 3 at 70%. 
    And RECALL  <30% for scores 1 and 5. which is not good.
    
    From the precision we can deduct that the words used to describe scores 1 and 5 are very specific, 
    leading to a precise score evaluation.



In [56]:
# Ex2.1: Same evaluation but with a Pre-trained Model

In [57]:

words = df_15.Title.str.split()
words = pd.DataFrame(words.tolist())
words

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29,30,31,32,33,34,35,36,37,38
0,pay,new,review,time,goodby,greedi,sexual,candi,problemat,ridicul,...,,,,,,,,,,
1,okay,horribl,anoth,fix,percol,pleas,custom,last,bring,imposs,...,,,,,,,,,,
2,comparar,stress,need,portrait,rip,disappoint,new,landscap,new,lee,...,,,,,,,,,,
3,freemium,freez,fun,fun,bug,portrait,best,rig,unhappi,problem,...,,,,,,,,,,
4,screen,fact,disappoint,like,goodby,fun,landscap,dissatisfi,disappoint,candi,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,love,fantast,review,witto,well,great,thank,freedom,support,review,...,,,,,,,,,,
796,twitter,elon,twitter,free,love,ty,independ,elon,given,viva,...,,,,,,,,,,
797,epic,free,twitter,translat,love,new,see,happi,final,elon,...,,,,,,,,,,
798,brilliant,thank,new,great,elon,elon,thank,keep,elon,thank,...,,,,,,,,,,


In [58]:
def soft_get(w):
    try:
        return word_vectors[w] #either get the word or return 0s
    except KeyError:
        return np.zeros(word_vectors.vector_size)

def map_vectors(row):
    try:
        return np.sum(
            row.loc[words.iloc[0].notna()].apply(soft_get)
        ) # take the row and take the columns that are not NaN and get the soft_get and then take the sum of that
    except:
        return np.zeros(word_vectors.vector_size)

df_15_ = pd.DataFrame(words.apply(map_vectors, axis=1).tolist())
df_15_

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,1.148567,2.641470,0.345738,3.755204,1.483803,0.722462,-1.248930,-4.714908,0.656005,2.743226,...,1.297251,-1.239448,-0.785817,2.703488,-0.974348,0.746080,0.128605,2.257964,0.902897,4.039250
1,1.111358,-1.407868,1.312007,-0.376199,-1.363047,-0.309220,-1.456234,2.110356,-1.672305,0.750926,...,-0.466342,0.498086,1.873242,3.317010,1.733953,0.463957,0.347942,1.449861,1.236846,0.669731
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.031350,-0.961055,2.446594,1.308703,3.107920,-2.286531,-2.882983,-0.133644,-5.640782,7.617380,...,1.139771,-0.298996,-4.568761,4.233591,1.752730,1.144976,-1.246760,1.375456,-0.479237,3.784584
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,-1.203702,4.950981,-2.376214,-2.964750,3.249500,-2.156953,-6.076870,-1.246770,3.507370,3.562294,...,1.455592,-0.514980,-2.284661,0.965731,-2.112881,1.776001,-1.946838,2.918013,-1.383255,5.082319
796,-2.696972,6.830843,2.908010,-0.696860,-1.987133,0.072767,-6.936793,-4.136040,-0.162384,7.171484,...,2.866548,2.011067,-9.743384,-0.330607,3.089880,-2.387059,-1.114356,-1.433080,-1.550801,1.350839
797,-0.785490,6.132310,2.422061,-0.269366,-0.058617,-1.458275,-7.611196,-4.495486,-1.620190,4.608180,...,2.013687,-0.854415,-5.945653,0.456034,0.384861,0.223031,1.128566,-0.275822,-0.267749,1.391310
798,-4.388921,6.643965,0.715165,-1.004086,7.136956,-3.555853,-7.298880,0.832374,-4.100427,1.599216,...,1.578750,-1.058248,-8.602822,0.616551,2.760192,2.721389,-4.690031,2.269812,-1.812367,2.150664


In [59]:
import statsmodels.api as sm

X_ = sm.add_constant(df_15_)
y = df_15['Rating']
model_ols = sm.OLS(y , X_).fit()
model_ols.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.38
Model:,OLS,Adj. R-squared:,0.339
Method:,Least Squares,F-statistic:,9.187
Date:,"Sat, 11 Mar 2023",Prob (F-statistic):,3.07e-50
Time:,18:53:25,Log-Likelihood:,-1221.1
No. Observations:,800,AIC:,2544.0
Df Residuals:,749,BIC:,2783.0
Df Model:,50,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.2872,0.045,72.964,0.000,3.199,3.376
0,0.2338,0.172,1.355,0.176,-0.105,0.572
1,0.5588,0.179,3.131,0.002,0.208,0.909
2,0.0259,0.220,0.118,0.906,-0.406,0.458
3,0.0720,0.156,0.461,0.645,-0.235,0.378
4,0.1265,0.210,0.603,0.547,-0.285,0.539
5,-0.1389,0.143,-0.969,0.333,-0.420,0.142
6,-0.3039,0.269,-1.131,0.258,-0.831,0.223
7,0.0156,0.232,0.067,0.946,-0.441,0.472

0,1,2,3
Omnibus:,65.852,Durbin-Watson:,0.281
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.216
Skew:,-0.073,Prob(JB):,1.5e-05
Kurtosis:,2.197,Cond. No.,314.0


In [60]:
preds = model_ols.predict(X_)

In [61]:
preds = [evalPred(float(x)) for x in preds]
preds

[2,
 1,
 3,
 3,
 2,
 1,
 1,
 3,
 3,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 2,
 1,
 2,
 2,
 1,
 1,
 1,
 1,
 1,
 3,
 3,
 1,
 3,
 3,
 3,
 2,
 3,
 2,
 3,
 1,
 1,
 3,
 3,
 3,
 1,
 3,
 1,
 3,
 1,
 1,
 2,
 1,
 2,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 3,
 1,
 1,
 2,
 3,
 1,
 3,
 3,
 2,
 1,
 1,
 1,
 1,
 3,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 2,
 2,
 2,
 1,
 2,
 3,
 1,
 1,
 3,
 3,
 3,
 2,
 3,
 1,
 3,
 3,
 3,
 1,
 2,
 3,
 1,
 2,
 1,
 1,
 2,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 3,
 3,
 1,
 3,
 1,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,


In [62]:
print(classification_report(y, preds))



              precision    recall  f1-score   support

           1       1.00      0.59      0.75       160
           2       0.00      0.00      0.00       160
           3       0.25      1.00      0.39       160
           4       0.00      0.00      0.00       160
           5       1.00      0.09      0.17       160

    accuracy                           0.34       800
   macro avg       0.45      0.34      0.26       800
weighted avg       0.45      0.34      0.26       800



In [63]:
print(confusion_matrix(y,preds))

[[ 95  26  39   0   0]
 [  0   0 160   0   0]
 [  0   0 160   0   0]
 [  0   0 160   0   0]
 [  0   0 133  12  15]]


In [None]:
# RATING Pre-Trained Model + OLS:
This approach gives an OLS model with R-Square of 0.380 which is not considered good  ( Good~ 0.7).
The Pre-trained model Corpus and the context (dimensions) to which this model was made is
probably not specific enough to the type of context we are evaluating here.
Another fact is that a deeper cleaner could improve the scores.

The OLS float predicted values are parsed through a function to convert them 
to the closest integer in the range[0-5]
Having used a regression, we can evaluate what the classification would have been:
    ACCURACY is 34% only, which is considered bad.
    The confusion matrix highlights that the false positives and false negatives are high:
        ACCURACY = (TP + TN) / (TP + TN + FN + FP)
        in our case, FN + FP are elevated resulting in a low accuracy
        
    PRECISION for scores 1 and 5 is very good.
    Scores at 1 the precision is 100%
    Scores at 5 the precision is 100%
    Any score in between has a very low predictible precision < 25%
    
    RECALL is best for score 3 and 1.
    Other scores 2,3,5 RECALL is LOW down to 0.

    From the precision and RECALL we can deduct that the words used to describe scores 1 is very specific, 
    leading to a precise score evaluation.
    A different pre-trained model should be use to evaluate if the context of training impacts the predictions accuracy.

In [183]:
# Ex 2.2

tf = TfidfVectorizer()   #Create an instance of our TfidfVectorize()

X = (df_15['Title']).copy() 
y = (df_15['Rating']).copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

tf_X_train = tf.fit_transform(X_train)  #fit_transform on training data
tf_X_test = tf.transform(X_test)

In [184]:

log_model = LogisticRegression(max_iter=1000)

log_model.fit(tf_X_train,y_train)           

preds = log_model.predict(tf_X_test)

print(classification_report(y_test, preds))


# This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned.
# average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
rec = recall_score(y_test, preds, pos_label='positive', average='micro')
acc = log_model.score(tf_X_test,y_test)
f1_score = 2*(acc*rec)/(acc+rec)
print('Accuracy:',round(acc,2))
print('Recall:',round(rec,2))
print('f1_score:',round(f1_score,2))



              precision    recall  f1-score   support

           1       0.88      0.85      0.87        54
           2       0.26      0.17      0.21        47
           3       0.47      0.35      0.40        46
           4       0.35      0.65      0.45        40
           5       0.62      0.57      0.59        53

    accuracy                           0.53       240
   macro avg       0.52      0.52      0.50       240
weighted avg       0.54      0.53      0.52       240

Accuracy: 0.52
Recall: 0.52
f1_score: 0.52




In [175]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test,preds))

[[46  5  1  0  2]
 [ 5  8 15 13  6]
 [ 1 12 16 14  3]
 [ 0  5  2 26  7]
 [ 0  1  0 22 30]]


In [None]:
# INTERPRETATION TfidfVectorizer + LogisticRegression:

    ACCURACY =53% which is the best among all other approaches we have evaluated in this exercise.
    But still not considered has a good reference.
    F1_score are better for 1 and 5 overall.
    F1_score  for 1 stands out at 87%, highlighting consitency is the words used to describe this score.
    
    Overall the extremes scores 1 and 5 are easier to predict due to the specificity and consistency in the use of specfic
    vocabulary.
    
    The TfidfVectorizer + LogisticRegression perform better than regressions to evaluate scores.