# Problem Statement

Collect posts from 2 subreddits using Reddit's API, then use NLP techniques to train a classifier that predicts which subreddit each post comes from.

Project Overview:
The NLP techniques that will be used are CountVectorizer and TF-IDF (TfidfVectorizer). For each technique, several models will be tested: the Naive Bayes models (Multinomial, Gaussian and Bernoulli) and other classification models such as KNN and Logistic Regression. The KNN and Logistic Regression models will be tuned with a grid search. For every model, the confusion matrix and ROC AUC will be used to evaluate its effectiveness.

The project is not strictly linear; for one NLP technique and one model, the procedure and its associated performance check are:
- Extracting the posts using Reddit API
- EDA to identify target columns
- NLP transformation
- Modelling
- Classification model performance

## *Note: Please run the notebook from the "Run Notebook from here" marker; the cells before it re-scrape Reddit*

## 1) Posts Extraction

In [1]:
import requests
import time
import pandas as pd

In [2]:
url_1 = "https://www.reddit.com/r/Warframe/hot.json"

In [3]:
#Specify a custom User-agent: by default requests sends a generic Python user-agent that many clients share,
#so with many users connecting at the same time Reddit is likely to return response code 429.
headers = {'User-agent' : 'Evan 0.1'}

In [4]:
res_1 = requests.get(url_1, headers=headers)

In [5]:
res_1.status_code

200
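The 429 response mentioned above can also be handled by retrying with a growing delay. The following is a minimal sketch under stated assumptions, not part of the original notebook: `get_with_retry`, `max_tries` and `base_delay` are hypothetical names, and `fetch` stands in for any zero-argument callable such as `lambda: requests.get(url_1, headers=headers)`.

```python
import time


def get_with_retry(fetch, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or the tries run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute (hypothetical helper, for illustration only).
    """
    response = None
    for attempt in range(max_tries):
        response = fetch()
        if response.status_code != 429:
            return response
        # Wait 1s, 2s, 4s, ... before the next attempt; no wait after the last one.
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.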

In [6]:
war_json = res_1.json()

In [7]:
sorted(war_json.keys())

['data', 'kind']

In [8]:
sorted(war_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [9]:
len(war_json['data']['children'])

27

In [10]:
list_id = [i['data']['name'] for i in war_json['data']['children']]
list_id

['t3_asfcn4',
 't3_chp9zc',
 't3_chl4nh',
 't3_chrku6',
 't3_chr960',
 't3_chotwt',
 't3_chn98j',
 't3_chr7ue',
 't3_chqnyj',
 't3_cho3jx',
 't3_chrssn',
 't3_chqah7',
 't3_cho5rm',
 't3_chl296',
 't3_chmm3s',
 't3_chhaxp',
 't3_chtask',
 't3_chqa2u',
 't3_chkwfq',
 't3_chunuo',
 't3_chsar5',
 't3_chuebe',
 't3_chhb42',
 't3_cht1zt',
 't3_chmngh',
 't3_chsrix',
 't3_chl0av']

In [11]:
war_json['data']['after']

't3_chl0av'

In [12]:
param = {'after': war_json['data']['after']}

In [13]:
requests.get(url_1, params=param, headers=headers)

<Response [200]>

In [14]:
#extract the posts data
posts_war = []
after = None
for i in range(40):
    print(i)
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    url_1= "https://www.reddit.com/r/Warframe/hot.json"
    res_1 = requests.get(url_1, params=params, headers=headers)
    if res_1.status_code == 200:
        war_json = res_1.json()
        posts_war.extend(war_json['data']['children'])
        after = war_json['data']['after']
    else:
        print(res_1.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [15]:
len(posts_war)

983

In [16]:
#to check for repetition
len(set(i['data']['name'] for i in posts_war))

983

In [17]:
#The first page returned 27 posts instead of 25 because the first 2 posts are pinned and are not counted in the 25 posts/page limit; drop them
posts_war = posts_war[2:]

In [18]:
len(posts_war)

981
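The extraction loop above can be wrapped in a reusable function so that it does not have to be copied for each subreddit. This is only a sketch: `collect_posts` and `get_page` are hypothetical names, and `get_page(after)` is assumed to return the parsed JSON of one listing page, or `None` on a non-200 response.

```python
import time


def collect_posts(get_page, pages=40, pause=1.0, sleep=time.sleep):
    """Follow the 'after' cursor for at most `pages` requests."""
    posts, after = [], None
    for _ in range(pages):
        page = get_page(after)
        if page is None:          # request failed, stop early
            break
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:         # no cursor returned, so there is no next page
            break
        sleep(pause)
    return posts


# Possible wiring with requests (assumption, left commented so the sketch runs offline):
# def get_page(after):
#     res = requests.get(url_1, params={'after': after} if after else {}, headers=headers)
#     return res.json() if res.status_code == 200 else None
```

Unlike the loop above, this version stops when no cursor comes back; in the original loop a `None` cursor would send the next request without `after` and start again from the first page.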

In [19]:
#Repeat the above steps for the other subreddit: apexlegends
url_2 = "https://www.reddit.com/r/apexlegends/hot.json"

In [20]:
headers = {'User-agent' : 'Bleep blorp bot 0.1'}

In [21]:
res_2 = requests.get(url_2, headers=headers)

In [22]:
res_2.status_code

200

In [23]:
apex_json = res_2.json()

In [24]:
sorted(apex_json.keys())

['data', 'kind']

In [25]:
sorted(apex_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [26]:
len(apex_json['data']['children'])

27

In [27]:
list_id_apex = [i['data']['name'] for i in apex_json['data']['children']]
list_id_apex

['t3_chgblz',
 't3_chhsse',
 't3_cho5h3',
 't3_chlndk',
 't3_chrqq6',
 't3_chojfp',
 't3_chqjc3',
 't3_chlya1',
 't3_chhsje',
 't3_cho85f',
 't3_chp0kn',
 't3_chnzo3',
 't3_chox8p',
 't3_chsfjf',
 't3_chqqq5',
 't3_chdaal',
 't3_chp4cs',
 't3_cho6yy',
 't3_chsncv',
 't3_chrl7c',
 't3_chomyj',
 't3_chq98u',
 't3_chs336',
 't3_chr3ea',
 't3_chslxg',
 't3_chirab',
 't3_chtlkq']

In [28]:
apex_json['data']['after']

't3_chtlkq'

In [29]:
param = {'after': apex_json['data']['after']}

In [30]:
requests.get(url_2, params=param, headers=headers)

<Response [200]>

In [31]:
#example of text in first post for apex
apex_json['data']['children'][0]['data']['selftext']

"&amp;#x200B;\n\nhttps://i.redd.it/qeb183dqacc31.png\n\n# Apex Legends Community Reward Challenge\n\nWelcome to the first Apex Legends art related challenge!\n\nWe at r/ApexLegends are about to finalize a relatively new Reddit feature - [Community Rewards](https://www.reddit.com/r/redesign/comments/c3psbg/community_awards_everything_you_need_to_know/).\n\nYou can read more about community rewards in the link.\n\n&amp;#x200B;\n\n**Whenever you see a popular post, it's very likely that the post has received Reddit Rewards. Silver, Gold &amp; Platinum. Community Rewards are more or less the same, but instead of having silver, gold &amp; platinum, we can make custom rewards with custom icons up to a total of 7 rewards. This is where you talented people come in. In order for us to properly be able to populate the different rewards, we want to put your skills to the test to create the artwork!**\n\n&amp;#x200B;\n\n&amp;#x200B;\n\n|Placing|Price|\n|:-|:-|\n|1st up to 7th|Your icon will be add

In [32]:
#example of title in first post for apex
apex_json['data']['children'][0]['data']['title']

'[r/ApexLegends] Community Reward Challenge'

In [33]:
#example of timestamp in first post for apex
apex_json['data']['children'][0]['data']['created_utc']

1564013703.0

In [34]:
#example of number of comments
apex_json['data']['children'][0]['data']['num_comments']

57

In [35]:
#example of subreddit
apex_json['data']['children'][0]['data']['subreddit']

'apexlegends'

In [36]:
#extract the posts data
posts_apex = []
after = None
for i in range(40):
    print(i)
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    url_2= "https://www.reddit.com/r/apexlegends/hot.json"
    res_2 = requests.get(url_2, params=params, headers=headers)
    if res_2.status_code == 200:
        apex_json = res_2.json()
        posts_apex.extend(apex_json['data']['children'])
        after = apex_json['data']['after']
    else:
        print(res_2.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [37]:
len(posts_apex)

982

In [38]:
#to check for repetition: fewer unique ids than posts means some posts were fetched more than once
len(set(i['data']['name'] for i in posts_apex))

830

In [39]:
posts_apex = posts_apex[2:]

In [40]:
len(posts_apex)

980
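The repetition check for apexlegends shows fewer unique ids (830) than posts (982), and the notebook keeps the repeated posts. If they should be removed, one possible approach is sketched below; `dedupe_posts` is a hypothetical helper and not part of the original notebook.

```python
def dedupe_posts(posts):
    """Keep the first occurrence of each post, keyed on its 'name' id."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Tiny illustrative input in the same shape as the Reddit listing children.
sample = [
    {'data': {'name': 't3_a', 'title': 'first'}},
    {'data': {'name': 't3_b', 'title': 'second'}},
    {'data': {'name': 't3_a', 'title': 'first'}},
]
unique_sample = dedupe_posts(sample)
```

Applied as `posts_apex = dedupe_posts(posts_apex)`, this would change the dataset size and therefore the scores reported further down.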

## 2) EDA

In [41]:
#extract title, subreddit, creation timestamp (UTC), number of comments and text of each post.
data_apex = {
    'title': [i['data']['title'] for i in posts_apex], 
    'subreddit': [i['data']['subreddit'] for i in posts_apex], 
    'time': [i['data']['created_utc'] for i in posts_apex], 
    'comments': [i['data']['num_comments'] for i in posts_apex],
    'text': [i['data']['selftext'] for i in posts_apex]
}

df_apex = pd.DataFrame(data_apex, columns = data_apex.keys())

In [42]:
df_apex.head()

Unnamed: 0,title,subreddit,time,comments,text
0,No one gets my gold armor!,apexlegends,1564064000.0,152,
1,I need more of this on my insta feed,apexlegends,1564049000.0,147,
2,Tried to do Fight Club but they wanted the win,apexlegends,1564080000.0,34,
3,Mirage here to save the day,apexlegends,1564066000.0,41,
4,Not a high kill count or damage count. But man...,apexlegends,1564075000.0,12,


In [43]:
data_war = {
    'title': [i['data']['title'] for i in posts_war], 
    'subreddit': [i['data']['subreddit'] for i in posts_war], 
    'time': [i['data']['created_utc'] for i in posts_war], 
    'comments': [i['data']['num_comments'] for i in posts_war],
    'text': [i['data']['selftext'] for i in posts_war]
}

df_war = pd.DataFrame(data_war, columns = data_war.keys())

In [44]:
df_war.head()

Unnamed: 0,title,subreddit,time,comments,text
0,Warframe Sbubby 2: Electric Boogaloo,Warframe,1564045000.0,247,
1,Umbra(cat) vs Shawzin,Warframe,1564079000.0,20,
2,Remember... Even if your dreams are gone it's ...,Warframe,1564078000.0,8,
3,So I did a thing (better with sound),Warframe,1564067000.0,56,
4,Hey Listen !,Warframe,1564059000.0,9,


In [45]:
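#ignore_index=True gives the combined frame a fresh 0..n-1 index instead of two overlapping ones
df = pd.concat([df_apex, df_war], ignore_index=True)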

In [46]:
df.head()

Unnamed: 0,title,subreddit,time,comments,text
0,No one gets my gold armor!,apexlegends,1564064000.0,152,
1,I need more of this on my insta feed,apexlegends,1564049000.0,147,
2,Tried to do Fight Club but they wanted the win,apexlegends,1564080000.0,34,
3,Mirage here to save the day,apexlegends,1564066000.0,41,
4,Not a high kill count or damage count. But man...,apexlegends,1564075000.0,12,


In [47]:
#target column: 1 for apexlegends posts, 0 for Warframe posts
df['apexlegends'] = (df['subreddit'] == 'apexlegends').astype(int)

In [48]:
df = df.drop(columns='subreddit')

In [49]:
df.head()

Unnamed: 0,title,time,comments,text,apexlegends
0,No one gets my gold armor!,1564064000.0,152,,1
1,I need more of this on my insta feed,1564049000.0,147,,1
2,Tried to do Fight Club but they wanted the win,1564080000.0,34,,1
3,Mirage here to save the day,1564066000.0,41,,1
4,Not a high kill count or damage count. But man...,1564075000.0,12,,1


In [50]:
df.to_csv('./datasets/data.csv')

# **Run Notebook from here**

In [51]:
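#find X and y
#X is the title followed by the post text; y is 1 for apexlegends and 0 for Warframe.
#Note: no separator is added, so the last word of the title fuses with the first word of the text
#(e.g. 'abilitiesi' in the feature names further down).
X = df['title'] + df['text']
y = df['apexlegends']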
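When starting from this point in a fresh session, `df` first has to be reloaded from the CSV saved earlier. The sketch below uses a toy frame and a temporary file rather than `./datasets/data.csv`; the names `toy`, `reloaded`, `X_demo` and `y_demo` are illustrative only. It shows two things to watch for: empty post bodies come back as `NaN`, and a space separator keeps the title and text from fusing.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the saved dataset; the second post has an empty body.
toy = pd.DataFrame({
    'title': ['Mirage here to save the day', 'Hey Listen !'],
    'text': ['some body text', ''],
    'apexlegends': [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    toy.to_csv(path)
    # index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column.
    reloaded = pd.read_csv(path, index_col=0)

# Empty strings are read back as NaN, so fill them before concatenating.
reloaded['text'] = reloaded['text'].fillna('')
X_demo = reloaded['title'] + ' ' + reloaded['text']
y_demo = reloaded['apexlegends']
```

Without the `fillna('')` step, any title-only post would produce a `NaN` entry in `X_demo`, which the vectorizers cannot process.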

## 3) NLP Technique

In [52]:
#split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

The following models will be fitted on the CountVectorizer features:
1) Multinomial NB
2) Bernoulli NB
3) Gaussian NB
4) Logistic Regression + Grid Search
5) KNN + Grid Search

Each classification model will be evaluated with the following:
1) Confusion matrix

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english', token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

In [54]:
#Check out the most common words
X_df = pd.DataFrame(cvec.transform(X_train).todense(),
                       columns=cvec.get_feature_names())
word_counts = pd.DataFrame(X_df).sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

just        502
game        440
like        420
warframe    263
amp         252
know        229
time        224
https       211
don         202
apex        186
ve          184
new         172
play        172
got         158
think       152
good        147
people      146
playing     145
damage      144
use         143
dtype: int64

In [55]:
#Logistic Regression on CountVectorizer features (other vectorizers such as tf-idf are tried further below)
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      LogisticRegression(solver='lbfgs'),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.991156462585034
model score on test data: 0.8533604887983707
Number of features: 7572


In [56]:
# GridSearch was attempted and excluded because there was no change.
# Import the confusion matrix function.
from sklearn.metrics import confusion_matrix

In [57]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[195,  51],
       [ 21, 224]])

In [58]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [59]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 195
False Positives: 51
False Negatives: 21
True Positives: 224
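The four counts above can be turned into rates that are easier to compare across models. The sketch below uses the Logistic Regression counts printed above; `summarize` is a hypothetical helper name, and the positive class is apexlegends as in the target column.

```python
def summarize(tn, fp, fn, tp):
    """Return common rates derived from a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),    # share of predicted apexlegends posts that really are
        'recall': tp / (tp + fn),       # share of real apexlegends posts that were found
        'specificity': tn / (tn + fp),  # share of real Warframe posts that were found
    }


# Counts taken from the Logistic Regression confusion matrix above.
logreg_metrics = summarize(tn=195, fp=51, fn=21, tp=224)
for name, value in logreg_metrics.items():
    print('{}: {:.3f}'.format(name, value))
```

The accuracy computed this way is the same quantity as the test score reported by `accuracy_score` for this model.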


In [60]:
from sklearn.neighbors import KNeighborsClassifier
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      KNeighborsClassifier(),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.7959183673469388
model score on test data: 0.6415478615071283
Number of features: 7572


In [61]:
from sklearn.model_selection import GridSearchCV
knn_params = {
    'n_neighbors': range(1,10, 2),
    'metric' : ['euclidean', 'manhattan']
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(),
    knn_params,
    cv=9,
    verbose=1,
    return_train_score=False
)

knn_gridsearch.fit(X_train_cvec, y_train)

Fitting 9 folds for each of 10 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   36.1s finished


GridSearchCV(cv=9, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': range(1, 10, 2), 'metric': ['euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=1)

In [62]:
knn_gridsearch.best_score_

0.6510204081632653

In [63]:
knn_gridsearch.best_params_

{'metric': 'euclidean', 'n_neighbors': 1}

In [64]:
knn_gridsearch.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
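An alternative to searching on pre-vectorized features is to run the grid search over the whole pipeline, so that the vectorizer is refit inside every cross-validation fold and the best model can be taken directly from `best_estimator_`. The sketch below is an illustration on a made-up toy corpus, not the notebook's data; the parameter prefix `kneighborsclassifier__` follows the step name that `make_pipeline` generates.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 1 = apexlegends, 0 = Warframe.
docs = [
    'wraith squad kills', 'mirage clutch squad',
    'gold armor kills', 'bangalore teammate squad',
    'prime build mission', 'tenno operator mission',
    'wukong prime build', 'captura frame tenno',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(CountVectorizer(), KNeighborsClassifier())
param_grid = {
    'kneighborsclassifier__n_neighbors': [1, 3],
    'kneighborsclassifier__metric': ['euclidean', 'manhattan'],
}

# cv=2 only because the toy corpus is tiny; the notebook uses cv=9.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)

# The refitted best pipeline accepts raw text directly.
best_knn_pipeline = search.best_estimator_
prediction = best_knn_pipeline.predict(['wraith squad kills'])
```

Using `search.best_estimator_` avoids retyping the selected parameters by hand.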

In [65]:
#The text must be count-vectorized before fitting KNN directly; as shown above, a pipeline can do this in one step,
#but here the classifier is configured and fitted manually on the vectorized data.
#Note: the grid search above selected n_neighbors=1, whereas n_neighbors=3 is used here.
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
model = knn.fit(X_train_cvec, y_train)

In [66]:
model.score(X_train_cvec, y_train)

0.7659863945578231

In [67]:
model.score(X_test_cvec, y_test)

0.6028513238289206

In [68]:
y_pred = model.predict(X_test_cvec)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[187,  59],
       [136, 109]])

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 187
False Positives: 59
False Negatives: 136
True Positives: 109


In [None]:
# Import the Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

In [None]:
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      MultinomialNB(),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
#note that X_train has not been passed through CountVectorizer.fit_transform here: the pipeline vectorizes
#the text internally, so the raw X_train and X_test can be used directly.
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9591836734693877
model score on test data: 0.8818737270875764
Number of features: 7572


In [None]:
# Generate a confusion matrix.

cm = confusion_matrix(y_test, y_pred)
cm

array([[215,  31],
       [ 27, 218]])

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 215
False Positives: 31
False Negatives: 27
True Positives: 218


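Use of the BernoulliNB model to classify: BernoulliNB only models whether each word is present or absent in a post (the counts are binarized), not how often it appears. Since the features here are word counts, it is expected to perform worse than MultinomialNB.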

In [None]:
#Import Bernoulli and edit pipeline from above
from sklearn.naive_bayes import BernoulliNB
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BernoulliNB(),
                      )


model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.8006802721088435
model score on test data: 0.7515274949083504
Number of features: 7572


In [None]:
# Generate a confusion matrix.
cm = confusion_matrix(y_test, y_pred)
cm

array([[128, 118],
       [  4, 241]])

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 128
False Positives: 118
False Negatives: 4
True Positives: 241


ROC AUC remains to be added as a second evaluation metric.
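ROC AUC is computed from predicted probabilities rather than from hard class predictions. A minimal sketch of how it could be added with `roc_auc_score`, on a made-up toy corpus rather than the notebook's data (all document strings and variable names here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): 1 = apexlegends, 0 = Warframe.
train_docs = [
    'wraith squad kills', 'mirage clutch squad', 'gold armor kills',
    'prime build mission', 'tenno operator mission', 'wukong prime build',
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ['wraith squad', 'tenno prime', 'kills squad', 'mission tenno']
test_labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# Column 1 of predict_proba holds the probability of the positive class (1).
proba_pos = clf.predict_proba(test_docs)[:, 1]
auc = roc_auc_score(test_labels, proba_pos)
print('ROC AUC: {:.3f}'.format(auc))
```

In the notebook the same two lines would apply to any fitted pipeline, for example `roc_auc_score(y_test, model_count.predict_proba(X_test)[:, 1])`.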

Use of GaussianNB

GaussianNB requires a dense array, so the sparse matrices produced by the CountVectorizer fitted earlier (outside the pipeline) are converted with `.toarray()` before fitting.

In [None]:
#cvec already defined above
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
X_train_cvec = X_train_cvec.toarray()
model_gnb = gnb.fit(X_train_cvec, y_train)
X_test_cvec = X_test_cvec.toarray()
predictions_gnb = model_gnb.predict(X_test_cvec)
print('model score on train data: {}'.format(model_gnb.score(X_train_cvec, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, predictions_gnb)))
print("Number of features:", X_train_cvec.shape[1])

model score on train data: 0.9714285714285714
model score on test data: 0.8472505091649695
Number of features: 7572


In [None]:
# Generate a confusion matrix.
cm = confusion_matrix(y_test, predictions_gnb)
cm

array([[199,  47],
       [ 28, 217]])

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_gnb).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 199
False Positives: 47
False Negatives: 28
True Positives: 217


Repeat the above steps using TF-IDF (TfidfVectorizer) together with the following models:
1) Multinomial
2) Bernoulli
3) Gaussian

Then evaluate similarly for each model with the following classification metrics:
1) Confusion matrix
2) ROC AUC

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english', 
                       sublinear_tf=True,
                       max_df=0.7, 
                       token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X_train_tvec = tvec.fit_transform(X_train)
X_test_tvec = tvec.transform(X_test)
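To make the vectorizer settings above more concrete, the weighting can be reproduced by hand on a toy example. The sketch below assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`) together with `sublinear_tf=True`: term frequency becomes 1 + ln(tf), inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each document vector is scaled to unit length. The documents and the `tfidf` helper are hypothetical, and the `max_df` filter (dropping terms that occur in more than the given share of documents) is not reproduced here.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical).
docs = [
    ['wraith', 'squad', 'squad'],
    ['tenno', 'prime'],
    ['squad', 'prime'],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Sublinear tf times smoothed idf, followed by L2 normalization."""
    weights = {}
    for term, tf in Counter(doc).items():
        sub_tf = 1.0 + math.log(tf)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = sub_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights_doc0 = tfidf(docs[0])
```

With `sublinear_tf`, a word repeated twice counts for less than double a word used once, which dampens the effect of long posts that repeat the same term.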

In [None]:
#Logistic Regression on TfidfVectorizer features (the imports were done above)
model = make_pipeline(TfidfVectorizer(stop_words='english',
                        sublinear_tf=True,
                        max_df=0.5,
                        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      LogisticRegression(),
                      )

model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9850340136054422
model score on test data: 0.8655804480651731
Number of features: 7572




In [None]:
model_tf.named_steps

{'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=0.5, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=True,
         token_pattern='(?u)\\b[a-zA-Z]{2,}\\b', tokenizer=None,
         use_idf=True, vocabulary=None),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False)}

In [None]:
df_log = pd.DataFrame(model_tf.named_steps.logisticregression.coef_)

In [None]:
df_log.columns = model_tf.named_steps.tfidfvectorizer.get_feature_names()

In [None]:
df_log.head()

Unnamed: 0,aaaand,abandoned,abandoning,abc,abilities,abilitiesi,ability,abilityvalkyr,able,abnormally,...,zipline,ziplines,zipped,zipping,zone,zoned,zones,zoom,zopney,zotac
0,-0.140789,-0.144448,-0.101362,0.041176,-0.846892,-0.062824,-0.506748,-0.149584,0.061572,0.144559,...,0.28742,0.10457,0.04879,-0.010316,0.074014,0.082609,0.028599,-0.07277,-0.013251,0.042855


In [None]:
df_log = df_log.reset_index() #Make your index into a column
df_log = pd.melt(df_log, id_vars = ['index']) #Reshape data

In [None]:
df_log = df_log.drop(columns='index').sort_values(by = 'value') #Drop the helper index column, sort by coefficient

In [None]:
#top 20 key words for Warframe: the most negative coefficients push a post towards class 0 (Warframe)
df_log.head(20)

Unnamed: 0,variable,value
7304,warframe,-3.591021
4967,prime,-2.600727
827,build,-1.705613
4113,mission,-1.608339
4474,operator,-1.608161
2584,frame,-1.515252
7507,wukong,-1.366606
819,bug,-1.303548
921,captura,-1.290037
6685,tenno,-1.277216


In [None]:
#top 20 key words for Apex Legends: the most positive coefficients push a post towards class 1 (apexlegends)
df_log.tail(20)

Unnamed: 0,variable,value
6657,teammate,1.115199
543,bangalore,1.133305
2778,gold,1.139909
3685,legends,1.169517
1163,clutch,1.206615
3543,kills,1.251683
6292,squad,1.253062
4102,mirage,1.269949
7340,wattson,1.285051
7488,wraith,1.293023


In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[215,  31],
       [ 35, 210]])

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 215
False Positives: 31
False Negatives: 35
True Positives: 210


In [None]:
from sklearn.ensemble import BaggingClassifier
knn = KNeighborsClassifier()
model = make_pipeline(TfidfVectorizer(stop_words='english',
                        sublinear_tf=True,
                        max_df=0.5,
                        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BaggingClassifier(base_estimator=knn, max_samples=0.5, max_features=0.5)
                      )

model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.7482993197278912
model score on test data: 0.594704684317719
Number of features: 7572


In [None]:
knn_params = {
    'n_neighbors': range(15,25,2),
    'metric' : ['euclidean', 'manhattan']
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(),
    knn_params,
    cv=9,
    verbose=1,
    return_train_score=False
)

knn_gridsearch.fit(X_train_tvec, y_train)

Fitting 9 folds for each of 10 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


In [None]:
knn_gridsearch.best_score_

In [None]:
knn_gridsearch.best_params_

In [None]:
knn_gridsearch.best_estimator_

In [None]:
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=19, p=2,
           weights='uniform')
model = knn.fit(X_train_tvec, y_train)

In [None]:
model.score(X_train_tvec, y_train)

In [None]:
model.score(X_test_tvec, y_test)

In [None]:
y_pred = model.predict(X_test_tvec)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

In [None]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.5,
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      MultinomialNB(),
                      )
model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

In [None]:
model_tf.named_steps

In [None]:
df_multi = pd.DataFrame(model_tf.named_steps.multinomialnb.coef_)

In [None]:
df_multi.columns = model_tf.named_steps.tfidfvectorizer.get_feature_names()

In [None]:
df_multi.head()

In [None]:
df_multi = df_multi.reset_index() #Make your index into a column
df_multi = pd.melt(df_multi, id_vars = ['index']) #Reshape data

In [None]:
df_multi.head()

In [None]:
df_multi = df_multi.drop(columns='index').sort_values(by = 'value') #Drop the helper index column, sort by value

In [None]:
#20 words with the lowest values, i.e. least typical of Apex Legends (used here as a proxy for Warframe key words)
df_multi.head(20)

In [None]:
#20 words with the highest values, i.e. most typical of Apex Legends
df_multi.tail(20)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

In [None]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.7,
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BernoulliNB(),
                      )
model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

In [None]:
X_train_tvec = X_train_tvec.toarray()
model_gnb = gnb.fit(X_train_tvec, y_train)
X_test_tvec = X_test_tvec.toarray()
predictions_gnb = model_gnb.predict(X_test_tvec)
print('model score on train data: {}'.format(model_gnb.score(X_train_tvec, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, predictions_gnb)))
print("Number of features:", X_train_tvec.shape[1])

In [None]:
cm = confusion_matrix(y_test, predictions_gnb)
cm

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_gnb).ravel()

In [None]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

## Conclusion

For this example, a false positive and a false negative are equally costly. A false positive is a post that is predicted to be from Apex Legends but is in fact a Warframe post; a false negative is the reverse. In scenarios where missing a positive case is more serious, such as detecting cancer, pregnancy or fraud, a false positive is preferred over a false negative. One way to think of this: a false negative is a positive occurrence that managed to 'slip' under the radar, whereas a false positive is a negative case (usually the good outcome) flagged as positive (usually the bad outcome, and the one we want to detect). By comparison, this is a relatively straightforward classification case.

A practical use case is filtering news or forum posts by industry and then going further with sentiment analysis. This can be useful for market analysis and, in today's context, for estimating the probability of a recession. Compared with traditional surveys, this is a much more efficient way to collect and analyse information. Regardless of where the information is collected, one drawback is its quality.

## Summary between models

The ideal model in this case is Multinomial Naive Bayes, as the columns of X are integer counts. It gave the highest test score with CountVectorizer (0.882), and its errors are balanced: 31 false positives against 27 false negatives.

Bernoulli Naive Bayes functions best when the columns of X are dummy variables (one-hot encoded). It is not very applicable here, as the lower test score (0.752) shows.

Gaussian Naive Bayes functions best when the columns of X are normally distributed. It is not very applicable here either. It nevertheless gives fairly balanced errors (47 false positives, 28 false negatives), unlike Bernoulli Naive Bayes, where false positives far outnumber false negatives (118 against 4).

Logistic Regression: grid search was not very useful and returned a best regularization value (`C`) of 1.0, which is the default. However, the model score is high (0.853 on the test data with CountVectorizer).

K-Nearest Neighbors (KNN): on its own, the model scored poorly on the train data and even worse on the test data. Grid search brought no improvement for KNN with CountVectorizer. However, grid search vastly improved the score of the KNN model with TfidfVectorizer.
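The comparison above can be checked directly from the confusion matrices recorded for the CountVectorizer models. The sketch below copies those counts and ranks the models by test accuracy; the dictionary and the `accuracy` helper are illustrative names only.

```python
# (tn, fp, fn, tp) copied from the CountVectorizer confusion matrices above.
matrices = {
    'LogisticRegression': (195, 51, 21, 224),
    'KNN (n_neighbors=3)': (187, 59, 136, 109),
    'MultinomialNB': (215, 31, 27, 218),
    'BernoulliNB': (128, 118, 4, 241),
    'GaussianNB': (199, 47, 28, 217),
}


def accuracy(tn, fp, fn, tp):
    """Share of test posts that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


# Best model first.
ranking = sorted(matrices, key=lambda name: accuracy(*matrices[name]), reverse=True)
for name in ranking:
    tn, fp, fn, tp = matrices[name]
    print('{:<22} accuracy={:.3f}  FP={:>3}  FN={:>3}'.format(
        name, accuracy(tn, fp, fn, tp), fp, fn))
```

Listing false positives and false negatives side by side makes the imbalance of BernoulliNB and KNN visible at a glance.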

## Comparison between the 2 vectorizers
Modelling-wise, the train score tended to be slightly lower for TfidfVectorizer than for CountVectorizer, while the test score improved with TfidfVectorizer compared to CountVectorizer.
