# ![GA Logo](https://camo.githubusercontent.com/6ce15b81c1f06d716d753a61f5db22375fa684da/68747470733a2f2f67612d646173682e73332e616d617a6f6e6177732e636f6d2f70726f64756374696f6e2f6173736574732f6c6f676f2d39663838616536633963333837313639306533333238306663663535376633332e706e67) Project 3: Web APIs & Classification

## Business Objective

XXX is an up-and-coming tech startup that is looking to enter the Virtual Reality (VR) gadget market. To get a feel for the current market, XXX is focusing its attention on two established big players: Oculus and Vive.

As dedicated Data Analysts at XXX, we decided to look into Reddit posts about Oculus and Vive to find out more about user needs and preferences, as comments on the platform usually reflect either very positive or very negative experiences with each product. Using our model, we will then be able to improve the features of our prototype to suit consumer preferences.

We have chosen Reddit over other tech review websites because:
- Easily scrapable API 
- Neutral platform
- International subscriber base
- Gaining popularity and outreach 


### Contents:
- [Data Collection](#Data-Collection)
- [Data Cleaning & EDA](#Data-Cleaning-and-EDA)
- [Preprocessing & Modeling](#Preprocessing-and-Modeling)
- [Evaluation and Conceptual Understanding](#Evaluation-and-Conceptual-Understanding)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

## Data Collection

In [4]:
import requests
import time
import nltk
import pandas as pd
import regex as re
import numpy as np

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [5]:
url_1 = "https://www.reddit.com/r/oculus.json"
url_2 = "https://www.reddit.com/r/Vive.json"

In [6]:
headers = {"User-Agent" : "Jalabulajals" }

In [7]:
res_1 = requests.get(url_1, headers = headers)
res_2 = requests.get(url_2, headers = headers)

In [8]:
print(res_1.status_code)
print(res_2.status_code)

200
200


In [9]:
data_json_1 = res_1.json()
data_json_2 = res_2.json()

In [10]:
data_json_1

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'oculus',
     'selftext': "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**Rift S**|**Quest**|**Go**\n---|---|----|----\n**Display**|LCD (RGB stripe)|OLED (pentile matrix)|LCD (RGB stripe)\n**Resolution&amp;nbsp;(per&amp;nbsp;eye)**|1280 × 1440 @ 80 Hz|1440 × 1600 @ 72 Hz|1280 × 1440 @ 72 Hz\n**Optimal IPD**|61.5 - 65.5&amp;nbsp;mm (software)|56 - 74&amp;nbsp;mm (hardware)|61.5 - 65.5&amp;nbsp;mm (software)\n**Audio**|Integrated speakers, 3.5mm headphone jack |Integrated speakers, 3.5mm headphone jacks|Integrated speakers, 3.5mm headphone jack \n**Controllers**|Thumbsticks, buttons, triggers (left &amp; right pair)|Thumbsticks, buttons, triggers (left &amp; right pair)|Touchpad, buttons, trigger (single ambidextrous)\n**Tracking**|6 DOF (IMU + 5 cameras)|6 DOF (IMU + 4 cameras)|3 DOF (IMU)\n**Finger Tracking**|Capsense|Capsense, optical (2020)|N

In [11]:
data_json_2

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Vive',
     'selftext': "\n\n**This is the daily thread where you can keep the /r/vive community updated with all the things happening in your VR World.**\n\n*   What you have been doing in VR  \n*   VR Games and Experiences - Your thoughts, questions and recommendations   \n*   Unboxings, pickup posts, swag hauls - show us what you've bought or made\n*   Your PC Builds, Battlestations, VR Caves, Furniture optimization, cabling routing and other endevours.   \n*   Your questions about virtual reality hardware and software\n*   Online social meetups and events - find people to join you in multiplayer games.\n*   Stock updates, shipping and customer service issues with VR suppliers\n*   Technical support stories and requests for technical help\n\nThe daily thread can  be a great place for anything you feel like sharing that may not warrant a se

In [12]:
sorted(data_json_1.keys())

['data', 'kind']

In [13]:
sorted(data_json_2.keys())

['data', 'kind']

In [14]:
sorted(data_json_1['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [15]:
sorted(data_json_2['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [16]:
len(data_json_1['data']['children'])

27

In [17]:
len(data_json_2['data']['children'])

26

I will use the following fields as features of interest:
1. subreddit
2. selftext
3. title
4. score
5. num_comments
6. created_utc (It is the number of seconds that have elapsed since 00:00:00 Thursday, 1 January 1970) 
7. subreddit_subscribers
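
As a small illustration of the `created_utc` field, the sketch below converts an epoch value into a timezone-aware datetime using only the standard library. The helper name `epoch_to_local` and the fixed UTC+8 offset are assumptions made for this example; the sample value is the rounded `created_utc` of the first Oculus post shown later in the notebook.

```python
from datetime import datetime, timedelta, timezone

def epoch_to_local(created_utc, utc_offset_hours=8):
    """Convert seconds since 1970-01-01 00:00:00 UTC into a timezone-aware datetime."""
    utc_time = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    # Shift to a fixed-offset timezone (UTC+8 here, an assumption for this example).
    return utc_time.astimezone(timezone(timedelta(hours=utc_offset_hours)))

print(epoch_to_local(1574925000.0).isoformat())
```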

### Creating a function to scrape Reddit's API

In [18]:
def reddit_to_posts(subreddit, n_requests, csv_name):
    """Page through a subreddit's JSON listing and save selected fields to a CSV file."""
    
    posts = []
    headers = {'User-Agent': 'Jalabulajals'}
    after = None
    
    # Fields of interest copied from each post.
    fields = ['subreddit', 'title', 'selftext', 'created_utc',
              'score', 'num_comments', 'subreddit_subscribers']
    
    url = 'https://www.reddit.com/' + str(subreddit) + '/.json'
    
    for i in range(n_requests):
        print(i)
        
        # The 'after' token from the previous response selects the next page.
        params = {} if after is None else {'after': after}
        
        res = requests.get(url, params=params, headers=headers)
        
        if res.status_code == 200:
            data_json = res.json()
            
            # Use a separate loop variable so the request counter i is not shadowed.
            for child in data_json['data']['children']:
                posts.append({field: child['data'][field] for field in fields})
                
            after = data_json['data']['after']
        
        else:
            print(res.status_code)
            break
            
        # Pause between requests to stay polite to the API.
        time.sleep(1)
    
    # Save the collected posts as a .csv file.
    pd.DataFrame(posts).to_csv(csv_name, index=False)
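
The pagination logic can be exercised without touching the network. The sketch below is a hypothetical variant, not the function used in this notebook: `paginate` and `fake_fetch` are illustrative names, and the fake pages stand in for Reddit's listing responses. It assumes that a listing signals its last page with an `after` value of `None`, and stops there rather than requesting the first page again.

```python
def paginate(fetch_page, max_requests):
    """Collect items page by page until the listing runs out or max_requests is reached.

    fetch_page(after) must return a dict shaped like
    {'children': [...], 'after': token_or_None}.
    """
    items = []
    after = None
    for _ in range(max_requests):
        page = fetch_page(after)
        items.extend(page['children'])
        after = page['after']
        if after is None:
            # No further pages: another request would start again from the first page.
            break
    return items

# Hypothetical three-page listing used in place of real HTTP responses.
FAKE_PAGES = {
    None: {'children': ['post_1', 'post_2'], 'after': 't3_a'},
    't3_a': {'children': ['post_3', 'post_4'], 'after': 't3_b'},
    't3_b': {'children': ['post_5'], 'after': None},
}

def fake_fetch(after):
    return FAKE_PAGES[after]

print(paginate(fake_fetch, max_requests=10))
```

Under that assumption, requests made after the last page would fetch the first page again and produce duplicate rows, which then have to be removed with `drop_duplicates`.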

In [19]:
#Retrieve posts for Oculus 
reddit_to_posts(subreddit = 'r/oculus', n_requests = 150, csv_name = 'oculus_reddit_posts.csv')

0
1
2
...
149


In [20]:
reddit_oculus = pd.read_csv('../project_3/oculus_reddit_posts.csv')
reddit_oculus.drop_duplicates(subset='title', keep='first', inplace=True)
len(reddit_oculus)

990

In [21]:
#Retrieve posts for Vive 
reddit_to_posts(subreddit = 'r/Vive', n_requests = 150, csv_name = 'Vive_reddit_posts.csv')

0
1
2
...
149


In [22]:
reddit_vive = pd.read_csv('../project_3/Vive_reddit_posts.csv')
reddit_vive.drop_duplicates(subset='title', keep='first', inplace=True)
len(reddit_vive)

930

### Merging of two Dataframes

In [23]:
reddit_oculus.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers
0,oculus,"Rift S, Quest or Go - Which headset is the rig...",^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**R...,1574925000.0,148,443,196789
1,oculus,[Weekly] What games have you been enjoying thi...,Hello guys! \n\n\n\n \n\n\n\n-\n\n\n\n \n...,1575306000.0,22,108,196789
2,oculus,Left 4 Dead VR is real according to ValveNewsN...,,1575912000.0,970,342,196789
3,oculus,Thumbs Up: Hand Tracking Available on Oculus Q...,,1575914000.0,122,62,196789
4,oculus,Boneworks to release tomorrow(Dec. 10) at 1 PM...,Boneworks will release tomorrow at 10 AM PST/1...,1575906000.0,137,125,196789


In [24]:
reddit_oculus.shape

(990, 7)

In [25]:
reddit_vive.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers
0,Vive,Daily Updates and Casual Conversation - Decemb...,\n\n**This is the daily thread where you can k...,1575895000.0,4,0,129422
1,Vive,DETAILED GUIDE: Half-Life 2 Campaign in VR (vi...,**Introduction:** \n\n* Recently I played thro...,1575926000.0,148,21,129422
2,Vive,Twin Peaks VR: First Official Trailer &amp; Sc...,,1575910000.0,53,10,129422
3,Vive,Highly recommend Onward for anyone who plays P...,Just a headsup because I know a lot of people ...,1575888000.0,93,107,129422
4,Vive,Vive FOV Values: For Unity Shader (s),"Heya brothers and sisters! First time poster, ...",1575926000.0,3,2,129422


In [26]:
reddit_vive.shape

(930, 7)

In [27]:
#Merge the Oculus and Vive DataFrames (DataFrame.append was removed in pandas 2.0)
df = pd.concat([reddit_oculus, reddit_vive], ignore_index=True)

In [28]:
df.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers
0,oculus,"Rift S, Quest or Go - Which headset is the rig...",^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**R...,1574925000.0,148,443,196789
1,oculus,[Weekly] What games have you been enjoying thi...,Hello guys! \n\n\n\n \n\n\n\n-\n\n\n\n \n...,1575306000.0,22,108,196789
2,oculus,Left 4 Dead VR is real according to ValveNewsN...,,1575912000.0,970,342,196789
3,oculus,Thumbs Up: Hand Tracking Available on Oculus Q...,,1575914000.0,122,62,196789
4,oculus,Boneworks to release tomorrow(Dec. 10) at 1 PM...,Boneworks will release tomorrow at 10 AM PST/1...,1575906000.0,137,125,196789


In [29]:
df.shape

(1920, 7)

## Data Cleaning & EDA

#### Removing duplicates and null values

In [30]:
#Removing any duplicates
df.drop_duplicates(subset='title', keep='first', inplace=True)

In [31]:
df.shape

(1909, 7)

In [32]:
#Checking for null values
df.isnull().sum()

subreddit                  0
title                      0
selftext                 455
created_utc                0
score                      0
num_comments               0
subreddit_subscribers      0
dtype: int64

In [33]:
#Fill null values with 'None' for 'selftext'
df['selftext'].fillna('None', inplace=True)

In [34]:
df.isnull().sum()

subreddit                0
title                    0
selftext                 0
created_utc              0
score                    0
num_comments             0
subreddit_subscribers    0
dtype: int64

#### Creating a 'target' column

In [35]:
df['target'] = df.subreddit.map({'oculus':1, 'Vive':0})

In [36]:
df.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers,target
0,oculus,"Rift S, Quest or Go - Which headset is the rig...",^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**R...,1574925000.0,148,443,196789,1
1,oculus,[Weekly] What games have you been enjoying thi...,Hello guys! \n\n\n\n \n\n\n\n-\n\n\n\n \n...,1575306000.0,22,108,196789,1
2,oculus,Left 4 Dead VR is real according to ValveNewsN...,,1575912000.0,970,342,196789,1
3,oculus,Thumbs Up: Hand Tracking Available on Oculus Q...,,1575914000.0,122,62,196789,1
4,oculus,Boneworks to release tomorrow(Dec. 10) at 1 PM...,Boneworks will release tomorrow at 10 AM PST/1...,1575906000.0,137,125,196789,1


In [37]:
df['target'].unique()

array([1, 0])

#### Creating a 'timestamp' column

In [38]:
df['timestamp'] = (pd.to_datetime(df['created_utc'], unit='s').dt.tz_localize('utc').dt.tz_convert('Asia/Hong_Kong'))

In [39]:
df.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers,target,timestamp
0,oculus,"Rift S, Quest or Go - Which headset is the rig...",^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**R...,1574925000.0,148,443,196789,1,2019-11-28 15:11:01+08:00
1,oculus,[Weekly] What games have you been enjoying thi...,Hello guys! \n\n\n\n \n\n\n\n-\n\n\n\n \n...,1575306000.0,22,108,196789,1,2019-12-03 00:58:42+08:00
2,oculus,Left 4 Dead VR is real according to ValveNewsN...,,1575912000.0,970,342,196789,1,2019-12-10 01:24:41+08:00
3,oculus,Thumbs Up: Hand Tracking Available on Oculus Q...,,1575914000.0,122,62,196789,1,2019-12-10 02:00:26+08:00
4,oculus,Boneworks to release tomorrow(Dec. 10) at 1 PM...,Boneworks will release tomorrow at 10 AM PST/1...,1575906000.0,137,125,196789,1,2019-12-09 23:36:06+08:00


In [40]:
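#Combine title and body into one text field. No separator is added, so the last title word
#and the first body word are joined; df['title'] + ' ' + df['selftext'] would keep them apart.
df['alltext'] = df['title'] + df['selftext']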

In [41]:
df.head()

Unnamed: 0,subreddit,title,selftext,created_utc,score,num_comments,subreddit_subscribers,target,timestamp,alltext
0,oculus,"Rift S, Quest or Go - Which headset is the rig...",^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^.|**R...,1574925000.0,148,443,196789,1,2019-11-28 15:11:01+08:00,"Rift S, Quest or Go - Which headset is the rig..."
1,oculus,[Weekly] What games have you been enjoying thi...,Hello guys! \n\n\n\n \n\n\n\n-\n\n\n\n \n...,1575306000.0,22,108,196789,1,2019-12-03 00:58:42+08:00,[Weekly] What games have you been enjoying thi...
2,oculus,Left 4 Dead VR is real according to ValveNewsN...,,1575912000.0,970,342,196789,1,2019-12-10 01:24:41+08:00,Left 4 Dead VR is real according to ValveNewsN...
3,oculus,Thumbs Up: Hand Tracking Available on Oculus Q...,,1575914000.0,122,62,196789,1,2019-12-10 02:00:26+08:00,Thumbs Up: Hand Tracking Available on Oculus Q...
4,oculus,Boneworks to release tomorrow(Dec. 10) at 1 PM...,Boneworks will release tomorrow at 10 AM PST/1...,1575906000.0,137,125,196789,1,2019-12-09 23:36:06+08:00,Boneworks to release tomorrow(Dec. 10) at 1 PM...


In [42]:
def text_to_words(titletext):
    
    # 1. Keep only alphabetic characters.
    letters_only = re.sub("[^a-zA-Z]", " ", titletext)
    
    # 2. Convert to lower case and split into individual words.
    words = letters_only.lower().split()
    
    # 3. Add custom words to the stop words and convert them to a set for faster lookup.
    my_stops = stopwords.words('english')
    my_stops.extend(['none','\n', 'www', 'reddit', 'com', 'comment', 'http'])
    my_stops = set(my_stops)
    
    # 4. Remove stop words.
    meaningful_words = [w for w in words if w not in my_stops]
    
    # 5. Lemmatize the words.
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in meaningful_words]
    
    # 6. Join the words back into one string separated by spaces, and return the result.
    return ' '.join(tokens_lem)

In [43]:
#call the function to clean df['alltext']
clean_alltext=[]

for text in df['alltext']:
    # Convert text to words, then append to clean_alltext.
    clean_alltext.append(text_to_words(text))

In [44]:
clean_alltext[:1]

['rift quest go headset right choice rift quest go display lcd rgb stripe oled pentile matrix lcd rgb stripe resolution amp nbsp per amp nbsp eye hz hz hz optimal ipd amp nbsp mm software amp nbsp mm hardware amp nbsp mm software audio integrated speaker mm headphone jack integrated speaker mm headphone jack integrated speaker mm headphone jack controller thumbsticks button trigger left amp right pair thumbsticks button trigger left amp right pair touchpad button trigger single ambidextrous tracking dof imu camera dof imu camera dof imu finger tracking capsense capsense optical headband halo band rigid strap elastic strap weight g g g internal storage gb gb connectivity displayport usb amp nbsp usb type c wi fi bluetooth micro usb wi fi bluetooth gamepads pc amp nbsp support required link cable beta wireless streaming rd party wireless streaming rd party hardware limitation question ask try answer best']
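
The cleaned text above still contains tokens such as `amp` and `nbsp`, which are leftovers of HTML entities like `&amp;nbsp;` in the raw `selftext`. One possible remedy, sketched below with only the standard library, is to decode the entities before stripping non-letters. The helper names `decode_entities` and `letters_only_words` are hypothetical, and decoding twice is an assumption that handles doubly escaped entities such as `&amp;nbsp;`.

```python
import html
import re

def decode_entities(text, passes=2):
    """Decode HTML entities; two passes turn '&amp;nbsp;' into a non-breaking space."""
    for _ in range(passes):
        text = html.unescape(text)
    return text

def letters_only_words(text):
    """Decode entities, replace every non-letter run with a space, lower-case and split."""
    decoded = decode_entities(text)
    return re.sub(r"[^a-zA-Z]+", " ", decoded).lower().split()

print(letters_only_words("**Resolution&amp;nbsp;(per&amp;nbsp;eye)**|left &amp; right pair"))
```

This step would run before the stop-word removal in `text_to_words`, so the entity fragments never reach the vectorizer.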

## Preprocessing & Modeling

In [45]:
#assign target and result to y and X then carry out train test split
X = clean_alltext
y = df['target']

In [46]:
# Import train_test_split.
from sklearn.model_selection import train_test_split

# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)
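
As a quick check of the split sizes, assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples, the 1,909 posts divide as follows. The helper name `split_sizes` is illustrative.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(1909))
```

The test size of 478 agrees with the `support` total in the classification reports later in the notebook.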

#### Baseline Accuracy 

In [47]:
y.value_counts(normalize=True)

1    0.518596
0    0.481404
Name: target, dtype: float64
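
The baseline is the accuracy obtained by always predicting the majority class. A dependency-free sketch, assuming class counts of 990 Oculus and 919 Vive posts (inferred from the 1,909 rows and the proportions above, not printed in the notebook):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

labels = [1] * 990 + [0] * 919  # assumed counts: 1 = Oculus, 0 = Vive
label, accuracy = baseline_accuracy(labels)
print(label, round(accuracy, 6))
```

Any model worth keeping should beat this figure of roughly 51.9%.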

#### Model : Pipeline (CountVectorizer + LogisticRegression)

In [48]:
from sklearn.pipeline import Pipeline
pipe_cvec = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

In [49]:
pipe_cvec.steps

[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                  lowercase=True, max_df=1.0, max_features=None, min_df=1,
                  ngram_range=(1, 1), preprocessor=None, stop_words=None,
                  strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, vocabulary=None)),
 ('lr',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False))]

In [53]:
pipe_cvec_params = {
    'cvec__max_features': [None, 15, 20],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.5, .6, .7, .8],
    'cvec__ngram_range': [(1,1), (1,2)]
}

gs_cvec = GridSearchCV(pipe_cvec, param_grid=pipe_cvec_params, cv=5, n_jobs=-1)
gs_cvec.fit(X_train, y_train)
best_score_cvec = gs_cvec.best_score_
print(best_score_cvec)
best_params_cvec = gs_cvec.best_params_
best_params_cvec

0.8581411600279525




{'cvec__max_df': 0.5,
 'cvec__max_features': None,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 2)}
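
The selected `ngram_range` of `(1, 2)` means that both single words and pairs of adjacent words become features. The sketch below imitates that idea in plain Python; it is not `CountVectorizer` itself (which also applies its own tokenisation and the `min_df`/`max_df` filters), and the function name `ngram_counts` is illustrative.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams for every n in the inclusive range ngram_range."""
    tokens = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            counts[' '.join(tokens[start:start + n])] += 1
    return counts

print(ngram_counts('rift quest go headset rift quest'))
```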

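The best score is the mean 5-fold cross-validated accuracy: on held-out folds, the model assigns about 85.8% of posts to the correct subreddit (Oculus or Vive).

A high cross-validation score does not by itself indicate overfitting; to check for that, we compare it with the score on the training data.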

In [54]:
score_train_cvec = gs_cvec.score(X_train,y_train)
score_train_cvec

1.0

A training accuracy of 1.0, against a cross-validated accuracy of about 0.858, shows that the CountVectorizer + LogisticRegression model overfits the training data. We will not consider this model further, because its variance is too high for it to generalise well.

In [55]:
score_test_cvec = gs_cvec.score(X_test,y_test)
score_test_cvec

0.8326359832635983

Let's consider using TfidfVectorizer instead of CountVectorizer in our Logistic Regression pipeline model.
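
To make the difference from raw counts concrete, here is a plain-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document, which corresponds to the `smooth_idf=True` and `norm='l2'` defaults; the function name `tfidf` and the three toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

docs = ['rift sensor vr', 'vive base station vr', 'quest vr']
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in every document, such as `vr` here, receive the lowest weight within each document.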

#### Model : Pipeline (TfidfVectorizer + LogisticRegression)

In [56]:
from sklearn.pipeline import Pipeline
pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
]) 

In [57]:
pipe_tvec.steps

[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words=None, strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None)),
 ('lr',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False))]

In [58]:
pipe_tvec_params = {
    'tvec__max_features': [15, 20],
    'tvec__min_df': [1, 2],
    'tvec__max_df': [.5, .6, .7, .8],
    'tvec__ngram_range': [(1,1), (1,2)]
}

gs_tvec = GridSearchCV(pipe_tvec, param_grid=pipe_tvec_params, cv=5, n_jobs=-1)
gs_tvec.fit(X_train, y_train)
best_score_tvec = gs_tvec.best_score_
print(best_score_tvec)
best_params_tvec = gs_tvec.best_params_
best_params_tvec

0.827393431167016




{'tvec__max_df': 0.5,
 'tvec__max_features': 15,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}
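
The size of this search can be worked out from the grid. The sketch below enumerates the combinations with `itertools.product`; the five folds come from `cv=5`, and the final refit on the full training set is not counted.

```python
from itertools import product

param_grid = {
    'tvec__max_features': [15, 20],
    'tvec__min_df': [1, 2],
    'tvec__max_df': [.5, .6, .7, .8],
    'tvec__ngram_range': [(1, 1), (1, 2)],
}

# Every combination of one value per parameter is one candidate model.
combinations = list(product(*param_grid.values()))
n_folds = 5
print(len(combinations), 'combinations,', len(combinations) * n_folds, 'model fits')
```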

The best score implies that the model assigns about 82.7% of posts to the correct subreddit under 5-fold cross-validation.

In [59]:
score_train_tvec = gs_tvec.score(X_train,y_train)
score_train_tvec

0.8378756114605171

In [60]:
score_test_tvec = gs_tvec.score(X_test,y_test)
score_test_tvec

0.8200836820083682

#### Comparing both Pipeline models

In [61]:
#Comparing Best Scores
print(f'Best Score of Model CVEC is {best_score_cvec}')
print(f'Best Score of Model TVEC is {best_score_tvec}')

Best Score of Model CVEC is 0.8581411600279525
Best Score of Model TVEC is 0.827393431167016


In [62]:
#Comparing Best Params
print(best_params_cvec)
print(best_params_tvec)

{'cvec__max_df': 0.5, 'cvec__max_features': None, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 2)}
{'tvec__max_df': 0.5, 'tvec__max_features': 15, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 1)}


In [63]:
#Comparing Train Scores
print(f'Train Score of Model CVEC is {score_train_cvec}')
print(f'Train Score of Model TVEC is {score_train_tvec}')

Train Score of Model CVEC is 1.0
Train Score of Model TVEC is 0.8378756114605171


In [64]:
#Comparing Test Scores
print(f'Test Score of Model CVEC is {score_test_cvec}')
print(f'Test Score of Model TVEC is {score_test_tvec}')

Test Score of Model CVEC is 0.8326359832635983
Test Score of Model TVEC is 0.8200836820083682


Comparing the two Pipeline models, I would choose Pipeline (TfidfVectorizer + LogisticRegression): its training score (0.838) and test score (0.820) are close, whereas its counterpart drops from 1.0 on the training set to 0.833 on the test set.
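
The comparison can be summarised as the gap between training and test accuracy, using the scores printed above. The dictionary layout and the name `generalisation_gap` are illustrative.

```python
scores = {
    'CVEC + LR': {'train': 1.0, 'test': 0.8326359832635983},
    'TVEC + LR': {'train': 0.8378756114605171, 'test': 0.8200836820083682},
}

def generalisation_gap(score):
    """Difference between training and test accuracy; larger values suggest overfitting."""
    return score['train'] - score['test']

for name, score in scores.items():
    print(f'{name}: gap = {generalisation_gap(score):.4f}')
```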

#### Choosing TfidfVectorizer+LogisticRegression Pipeline model

In [65]:
pipe = pipe_tvec

In [66]:
pipe.steps

[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words=None, strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None)),
 ('lr',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False))]

In [67]:
pipe_params = {
    'tvec__max_features': [15, 20],
    'tvec__min_df': [1, 2],
    'tvec__max_df': [.5, .6, .7, .8],
    'tvec__ngram_range': [(1,1), (1,2)]
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
best_score_gs = gs.best_score_
print(best_score_gs)
best_params_gs = gs.best_params_
best_params_gs

0.827393431167016




{'tvec__max_df': 0.5,
 'tvec__max_features': 15,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}

In [68]:
score_pipe = gs.score(X_train,y_train)
score_pipe

0.8378756114605171

In [69]:
score_pipe = gs.score(X_test,y_test)
score_pipe

0.8200836820083682

### TfidfVectorizer and MultinomialNB

In [70]:
from sklearn.naive_bayes import MultinomialNB

In [71]:
pipe2 = Pipeline([('tvec', TfidfVectorizer()),
                 ('nb', MultinomialNB())
                ])

In [72]:
pipe2.steps

[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words=None, strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None)),
 ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]
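
For intuition about what `MultinomialNB` does with `alpha=1.0`, the following plain-Python sketch trains a multinomial Naive Bayes classifier with add-one (Laplace) smoothing on word counts. It is a simplified stand-in rather than scikit-learn's implementation (which here receives TF-IDF weights instead of integer counts); the function names and the four toy posts are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        # Add alpha to every count so unseen words never get zero probability.
        log_lik = {word: math.log((word_counts[label][word] + alpha)
                                  / (total + alpha * len(vocab)))
                   for word in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score; unknown words are ignored."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

docs = ['rift quest oculus', 'oculus rift sensor', 'vive htc base station', 'vive wireless htc']
labels = [1, 1, 0, 0]  # 1 = Oculus, 0 = Vive
model = train_nb(docs, labels)
print(predict_nb(model, 'rift sensor usb'), predict_nb(model, 'htc vive'))
```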

In [73]:
pipe2_params = {
    'tvec__max_features': [15, 20],
    'tvec__min_df': [1, 2],
    'tvec__max_df': [.5, .6, .7, .8],
    'tvec__ngram_range': [(1,1), (1,2)]
}

gs2 = GridSearchCV(pipe2, param_grid=pipe2_params, cv=5, n_jobs=-1)
gs2.fit(X_train, y_train)
best_score_gs2 = gs2.best_score_
print(best_score_gs2)
best_params_gs2 = gs2.best_params_
best_params_gs2

0.827393431167016


{'tvec__max_df': 0.5,
 'tvec__max_features': 20,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}

The best score implies that the model assigns about 82.7% of posts to the correct subreddit under 5-fold cross-validation.

In [74]:
score_pipe2 = gs2.score(X_train,y_train)
score_pipe2

0.8280922431865828

In [75]:
score_pipe2 = gs2.score(X_test,y_test)
score_pipe2

0.7698744769874477

## Evaluation and Conceptual Understanding

In [76]:
from sklearn.metrics import confusion_matrix

In [77]:
def c_matrix(model, X_test):
    # Note: this fits the pipeline as passed in, not the grid-searched best estimator.
    # X_train, y_train and y_test are taken from the notebook's global scope.
    model.fit(X_train, y_train) 
    y_pred = model.predict(X_test)            # calculate predictions
    cm = confusion_matrix(y_test, y_pred)     # rows = actual, columns = predicted; labels sorted as 0 (Vive), 1 (Oculus)
    tn, fp, fn, tp = cm.ravel()               # assigning the elements of the confusion matrix to variables
    
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(f'Accuracy: {round(accuracy,4)}')
    
    return pd.DataFrame(cm, 
                        columns = ['Pred Vive','Pred Oculus'], 
                        index = ['Act Vive', 'Act Oculus'])

#### Model 1 : Pipeline (TfidfVectorizer + LogisticRegression)

In [78]:
c_matrix(pipe, X_test)

Accuracy: 0.8452

Unnamed: 0,Pred Vive,Pred Oculus
Act Vive,196,35
Act Oculus,39,208


#### Model 2 : Pipeline (TfidfVectorizer + MultinomialNB)

In [79]:
c_matrix(pipe2, X_test)

Accuracy: 0.8243

Unnamed: 0,Pred Vive,Pred Oculus
Act Vive,172,59
Act Oculus,25,222


In [80]:
from sklearn.metrics import classification_report

In [81]:
def report(model, X_test):
    model.fit(X_train, y_train) 
    y_pred = model.predict(X_test) 
    print(classification_report(y_test, y_pred))

In [82]:
report(pipe, X_test)

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       231
           1       0.86      0.84      0.85       247

    accuracy                           0.85       478
   macro avg       0.85      0.85      0.85       478
weighted avg       0.85      0.85      0.85       478





In [83]:
report(pipe2, X_test)

              precision    recall  f1-score   support

           0       0.87      0.74      0.80       231
           1       0.79      0.90      0.84       247

    accuracy                           0.82       478
   macro avg       0.83      0.82      0.82       478
weighted avg       0.83      0.82      0.82       478



From the classification reports, Model 1 has the higher f1-score for both classes (0.84 and 0.85, against 0.80 and 0.84 for Model 2) and the higher macro-averaged f1-score (0.85 against 0.82), so we can conclude that LogisticRegression has done better than its counterpart. I am looking at f1-score rather than global accuracy alone because its computation takes both precision and recall into account.

To reiterate, the global accuracy is also higher for Model 1 (0.85) than for Model 2 (0.82).
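
The Model 1 figures for class 1 (Oculus) in the classification report can be reproduced by hand. The counts below are read from the Model 1 confusion matrix, taking Oculus as the positive class: 208 Oculus posts predicted correctly, 39 Oculus posts predicted as Vive, 35 Vive posts predicted as Oculus and 196 Vive posts predicted correctly. The function name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (f1) for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

tp, fn, fp, tn = 208, 39, 35, 196
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```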

### Relationship between Sensitivity and Specificity of our chosen model

In [84]:
# Let's create a dataframe called pred_df that contains:
# 1. The list of true values of our test set.
# 2. The list of predicted probabilities based on our model.

pred_proba = [i[1] for i in pipe.predict_proba(X_test)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})
pred_df

Unnamed: 0,true_values,pred_probs
1435,0,0.133322
111,1,0.741133
1861,0,0.337665
968,1,0.654912
415,1,0.885652
...,...,...
1187,0,0.456874
199,1,0.352226
1371,0,0.485655
1633,0,0.087400


#### Receiver Operating Characteristic (ROC) Curve

In [85]:
import matplotlib.pyplot as plt

# Create figure.
plt.figure(figsize = (10,7))

# Create threshold values.
thresholds = np.linspace(0, 1, 200)

# Define function to calculate sensitivity. (True positive rate.)
def TPR(df, true_col, pred_prob_col, threshold):
    true_positive = df[(df[true_col] == 1) & (df[pred_prob_col] >= threshold)].shape[0]
    false_negative = df[(df[true_col] == 1) & (df[pred_prob_col] < threshold)].shape[0]
    return true_positive / (true_positive + false_negative)
    

# Define function to calculate 1 - specificity. (False positive rate.)
def FPR(df, true_col, pred_prob_col, threshold):
    # Use the same decision rule as TPR: predict positive when probability >= threshold.
    true_negative = df[(df[true_col] == 0) & (df[pred_prob_col] < threshold)].shape[0]
    false_positive = df[(df[true_col] == 0) & (df[pred_prob_col] >= threshold)].shape[0]
    return 1 - (true_negative / (true_negative + false_positive))
    
# Calculate sensitivity & 1-specificity for each threshold between 0 and 1.
tpr_values = [TPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]
fpr_values = [FPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]

# Plot ROC curve.
plt.plot(fpr_values, # False Positive Rate on X-axis
         tpr_values, # True Positive Rate on Y-axis
         label='ROC Curve')

# Plot baseline. (Perfect overlap between the two populations.)
plt.plot(np.linspace(0, 1, 200),
         np.linspace(0, 1, 200),
         label='baseline',
         linestyle='--')

# Label axes.
plt.title('Receiver Operating Characteristic Curve', fontsize=22)
plt.ylabel('True Positive Rate (Sensitivity)', fontsize=18)
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=18)

# Create legend.
plt.legend(fontsize=16);

The ROC curve shows the trade-off between sensitivity and specificity as the classification threshold varies; loosely, it shows how well the model can distinguish between Oculus and Vive posts.

The area under the ROC curve (AUC) appears large in the plot, which suggests that our model is doing a good job of separating the positive class (Oculus) from the negative class (Vive).

Since our model is performing well enough, we are confident in moving forward with it.
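
The AUC can be quantified rather than judged by eye. It equals the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. The dependency-free sketch below applies that definition to the nine rows of `pred_df` visible in the output above; because this is only a small sample of the 478 test rows, the result illustrates the calculation and is not the model's actual AUC. The function name `pairwise_auc` is illustrative.

```python
def pairwise_auc(true_values, pred_probs):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [p for t, p in zip(true_values, pred_probs) if t == 1]
    negatives = [p for t, p in zip(true_values, pred_probs) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# The nine rows shown in the pred_df output (a sample, not the full test set).
true_values = [0, 1, 0, 1, 1, 0, 1, 0, 0]
pred_probs = [0.133322, 0.741133, 0.337665, 0.654912, 0.885652,
              0.456874, 0.352226, 0.485655, 0.087400]
print(pairwise_auc(true_values, pred_probs))
```

On the full test set, `sklearn.metrics.roc_auc_score(y_test, pred_proba)` computes the same quantity.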

In [86]:
step1 = pipe.named_steps['tvec']
step2 = pipe.named_steps['lr']

In [87]:
#get the feature names (on scikit-learn 1.0 and later, use get_feature_names_out() instead)
columns=(step1.get_feature_names())

#form the df by combining the feature names and the coefficients
coef = pd.DataFrame(step2.coef_, columns=columns)
oculus_coef = coef.T.sort_values(by = 0, ascending=False) #column 0 holds the coefficients

oculus_coef.head(10)

Unnamed: 0,0
rift,6.322209
oculus,5.454003
quest,3.190748
usb,1.457703
cv,1.381233
saber,0.991645
got,0.990693
sensor,0.878901
port,0.847701
far,0.822729


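The coefficients are on the log-odds scale: holding the other features fixed, a one-unit increase in the TF-IDF value of 'rift' or 'quest' adds 6.32 or 3.19 respectively to the log-odds that a Reddit post belongs to the 'Oculus' subreddit, which multiplies the odds (not the probability) by exp(6.32) or exp(3.19). This shows that naming a specific VR headset model in a post is a strong signal of which subreddit the post belongs to.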
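
Because logistic regression is linear in the log-odds, exponentiating a coefficient gives the factor by which the odds change for a one-unit increase in that feature. A short sketch using the coefficients printed above (the helper name `odds_multiplier` is illustrative):

```python
import math

def odds_multiplier(coefficient, increase=1.0):
    """Factor applied to the odds of the positive class when the feature rises by `increase`."""
    return math.exp(coefficient * increase)

coefficients = {'rift': 6.322209, 'oculus': 5.454003, 'quest': 3.190748}
for word, coef in coefficients.items():
    print(f'{word}: odds x {odds_multiplier(coef):.1f}')
```

Since TF-IDF values are L2-normalised and lie between 0 and 1, a full one-unit increase is the extreme case; smaller increases scale the exponent accordingly.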

The keywords 'usb', 'port' and 'sensor' also seem to be important features that people are talking about, perhaps because they have existing concerns with the Oculus products they are interested in or have purchased. From this, we can identify user concerns and draw our attention to these features in our XXX VR headset prototype.


In [88]:
vive_coef = coef.T.sort_values(by = 0, ascending=True)

vive_coef.head(10)

Unnamed: 0,0
vive,-7.150494
htc,-2.063093
wireless,-2.056473
index,-1.76588
cosmos,-1.700938
station,-1.64005
pro,-1.619421
vr,-1.427172
revive,-1.397873
base,-1.272114


## Conclusion and Recommendations

The Oculus VR headset seems to be more widely mentioned than the Vive VR headset. In particular, the Oculus Rift seems to be mentioned more often than the Oculus Quest. However, a high number of mentions doesn't automatically translate to high popularity among consumers, as comments on the platform usually reflect either very positive or very negative experiences with each product. Using the results as a guideline, we can attribute importance to these features, giving us more targeted and optimised research.

One future recommendation is to look into other tech review sites in addition to Reddit to get a more holistic view of consumer and expert opinions.