# Problem Statement

Collect posts from two subreddits using Reddit's API, then use NLP techniques to train a classifier that predicts which subreddit each post comes from.

Project Overview:
The NLP techniques that will be used are CountVectorizer and TF-IDF (TfidfVectorizer). For each technique, several models will be tested: three Naive Bayes models (Multinomial, Gaussian and Bernoulli) and two other classification models, KNN and Logistic Regression. The KNN and Logistic Regression models will be subjected to a grid search to tune their parameters. For every model, the confusion matrix and ROC AUC will be used to evaluate its effectiveness.

The project is not linear. For one NLP technique, one model and its associated performance, the procedure is:
- Extracting the posts using Reddit API
- EDA to identify target columns
- NLP transformation
- Modelling
- Classification model performance

## *Note: To skip the extraction steps, run the notebook from the cell marked "Run Notebook from here"*

## 1) Posts Extraction

In [1]:
import requests
import time
import pandas as pd

In [2]:
url_1 = "https://www.reddit.com/r/Warframe/hot.json"

In [3]:
# A custom User-agent must be specified. By default, requests sends its own python User-agent, which is
# shared by many users connecting at the same time, so Reddit returns response code 429 (Too Many Requests).
headers = {'User-agent' : 'Evan 0.1'}

In [4]:
res_1 = requests.get(url_1, headers=headers)

In [5]:
res_1.status_code

200

In [6]:
war_json = res_1.json()

In [7]:
sorted(war_json.keys())

['data', 'kind']

In [8]:
sorted(war_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [9]:
len(war_json['data']['children'])

27

In [10]:
list_id = [i['data']['name'] for i in war_json['data']['children']]
list_id

['t3_asfcn4',
 't3_chp9zc',
 't3_chuuvn',
 't3_chuebe',
 't3_chrku6',
 't3_chr7ue',
 't3_chvpw7',
 't3_chr960',
 't3_chl4nh',
 't3_chtask',
 't3_chotwt',
 't3_chqnyj',
 't3_chvpru',
 't3_chn98j',
 't3_chqah7',
 't3_chrssn',
 't3_cho3jx',
 't3_chsar5',
 't3_cho5rm',
 't3_chsrix',
 't3_chsttm',
 't3_chtk6v',
 't3_chmm3s',
 't3_chl296',
 't3_chtek0',
 't3_cht8ii',
 't3_chqa2u']

In [11]:
war_json['data']['after']

't3_chqa2u'

In [12]:
param = {'after': war_json['data']['after']}

In [13]:
requests.get(url_1, params=param, headers=headers)

<Response [200]>

In [14]:
#extract the posts data
posts_war = []
after = None
for i in range(40):
    print(i)
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    url_1= "https://www.reddit.com/r/Warframe/hot.json"
    res_1 = requests.get(url_1, params=params, headers=headers)
    if res_1.status_code == 200:
        war_json = res_1.json()
        posts_war.extend(war_json['data']['children'])
        after = war_json['data']['after']
    else:
        print(res_1.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [15]:
len(posts_war)

982

In [16]:
#to check for repetition
len(set(i['data']['name'] for i in posts_war))

982

In [17]:
# The first 2 posts are pinned and are not counted in the 25-posts-per-page limit, so they are dropped.
posts_war = posts_war[2:]

In [18]:
len(posts_war)

980
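The extraction loop above is repeated almost verbatim for the second subreddit. As a minimal sketch, the paging logic could be factored into one helper. The names `collect_posts`, `fake_fetch` and `FAKE_PAGES` are hypothetical and not part of the original notebook; the fake fetcher stands in for the real HTTP request so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One page of results: (list of post dicts, token for the next page or None).
Page = Tuple[List[Dict], Optional[str]]


def collect_posts(fetch_page: Callable[[Optional[str]], Page],
                  max_pages: int = 40) -> List[Dict]:
    """Follow the 'after' token from page to page and gather all posts."""
    posts: List[Dict] = []
    after: Optional[str] = None
    for _ in range(max_pages):
        children, after = fetch_page(after)
        posts.extend(children)
        if after is None:  # the listing has no further pages
            break
    return posts


# Hypothetical stand-in data so the sketch runs without network access.
FAKE_PAGES = {
    None: ([{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "t3_b"),
    "t3_b": ([{"data": {"name": "t3_c"}}], None),
}


def fake_fetch(after: Optional[str]) -> Page:
    return FAKE_PAGES[after]


demo_posts = collect_posts(fake_fetch)
print([post["data"]["name"] for post in demo_posts])
```

In real use, `fetch_page` would wrap `requests.get(url, params={'after': after}, headers=headers)`, check the status code, sleep between calls, and return `(json['data']['children'], json['data']['after'])`. Unlike the loop above, this version stops as soon as `after` comes back as `None`; the loop above would send an empty `params` again and start over from the first page.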

In [19]:
# Repeat the above steps for the other subreddit: apexlegends
url_2 = "https://www.reddit.com/r/apexlegends/hot.json"

In [20]:
headers = {'User-agent' : 'Bleep blorp bot 0.1'}

In [21]:
res_2 = requests.get(url_2, headers=headers)

In [22]:
res_2.status_code

200

In [23]:
apex_json = res_2.json()

In [24]:
sorted(apex_json.keys())

['data', 'kind']

In [25]:
sorted(apex_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [26]:
len(apex_json['data']['children'])

27

In [27]:
list_id_apex = [i['data']['name'] for i in apex_json['data']['children']]
list_id_apex

['t3_chgblz',
 't3_chwj9f',
 't3_chrqq6',
 't3_chvqrv',
 't3_cho5h3',
 't3_chvekl',
 't3_chxxaj',
 't3_chlndk',
 't3_chqjc3',
 't3_chojfp',
 't3_chsfjf',
 't3_chx093',
 't3_chlya1',
 't3_chp0kn',
 't3_chsncv',
 't3_cho85f',
 't3_chtlkq',
 't3_chnzo3',
 't3_chhsje',
 't3_chox8p',
 't3_chu6t0',
 't3_chqqq5',
 't3_chvc4e',
 't3_chslxg',
 't3_chyq9c',
 't3_chxqdb',
 't3_chs336']

In [28]:
apex_json['data']['after']

't3_chs336'

In [29]:
param = {'after': apex_json['data']['after']}

In [30]:
requests.get(url_2, params=param, headers=headers)

<Response [200]>

In [31]:
#example of text in first post for apex
apex_json['data']['children'][0]['data']['selftext']

"&amp;#x200B;\n\nhttps://i.redd.it/qeb183dqacc31.png\n\n# Apex Legends Community Reward Challenge\n\nWelcome to the first Apex Legends art related challenge!\n\nWe at r/ApexLegends are about to finalize a relatively new Reddit feature - [Community Rewards](https://www.reddit.com/r/redesign/comments/c3psbg/community_awards_everything_you_need_to_know/).\n\nYou can read more about community rewards in the link.\n\n&amp;#x200B;\n\n**Whenever you see a popular post, it's very likely that the post has received Reddit Rewards. Silver, Gold &amp; Platinum. Community Rewards are more or less the same, but instead of having silver, gold &amp; platinum, we can make custom rewards with custom icons up to a total of 7 rewards. This is where you talented people come in. In order for us to properly be able to populate the different rewards, we want to put your skills to the test to create the artwork!**\n\n&amp;#x200B;\n\n&amp;#x200B;\n\n|Placing|Price|\n|:-|:-|\n|1st up to 7th|Your icon will be add

In [32]:
#example of title in first post for apex
apex_json['data']['children'][0]['data']['title']

'[r/ApexLegends] Community Reward Challenge'

In [33]:
#example of timestamp in first post for apex
apex_json['data']['children'][0]['data']['created_utc']

1564013703.0

In [34]:
#example of number of comments
apex_json['data']['children'][0]['data']['num_comments']

61

In [35]:
#example of subreddit
apex_json['data']['children'][0]['data']['subreddit']

'apexlegends'

In [36]:
#extract the posts data
posts_apex = []
after = None
for i in range(40):
    print(i)
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    url_2= "https://www.reddit.com/r/apexlegends/hot.json"
    res_2 = requests.get(url_2, params=params, headers=headers)
    if res_2.status_code == 200:
        apex_json = res_2.json()
        posts_apex.extend(apex_json['data']['children'])
        after = apex_json['data']['after']
    else:
        print(res_2.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [37]:
len(posts_apex)

1001

In [38]:
#to check for repetition
len(set(i['data']['name'] for i in posts_apex))

849
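The check above returns 849 unique names for 1001 collected posts, so some posts were fetched more than once. The cells that follow keep these repeats. As a minimal sketch, repeated posts could be dropped by their `name` before building the DataFrame; `drop_duplicate_posts` and the `sample` list are hypothetical and not part of the original notebook.

```python
def drop_duplicate_posts(posts):
    """Keep the first occurrence of each post, identified by data['name']."""
    seen = set()
    unique = []
    for post in posts:
        name = post["data"]["name"]
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Hypothetical sample with one repeated post.
sample = [
    {"data": {"name": "t3_a"}},
    {"data": {"name": "t3_b"}},
    {"data": {"name": "t3_a"}},
]
deduped = drop_duplicate_posts(sample)
print(len(sample), len(deduped))
```

Keeping the first occurrence preserves the original ordering of the listing, which matters if the pinned posts at the front are sliced off afterwards.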

In [39]:
posts_apex = posts_apex[2:]

In [40]:
len(posts_apex)

999

## 2) EDA

In [41]:
# extract title, subreddit, creation timestamp (UTC), number of comments and text of each post.
data_apex = {
    'title': [i['data']['title'] for i in posts_apex], 
    'subreddit': [i['data']['subreddit'] for i in posts_apex], 
    'time': [i['data']['created_utc'] for i in posts_apex], 
    'comments': [i['data']['num_comments'] for i in posts_apex],
    'text': [i['data']['selftext'] for i in posts_apex]
}

df_apex = pd.DataFrame(data_apex, columns = data_apex.keys())

In [42]:
df_apex.head()

Unnamed: 0,title,subreddit,time,comments,text
0,Tried to do Fight Club but they wanted the win,apexlegends,1564080000.0,215,
1,Anybody else getting empty Apex Packs?,apexlegends,1564100000.0,50,
2,No one gets my gold armor!,apexlegends,1564064000.0,220,
3,Pro strats,apexlegends,1564098000.0,42,
4,"“During development, we use heatmaps to look a...",apexlegends,1564113000.0,23,


In [43]:
data_war = {
    'title': [i['data']['title'] for i in posts_war], 
    'subreddit': [i['data']['subreddit'] for i in posts_war], 
    'time': [i['data']['created_utc'] for i in posts_war], 
    'comments': [i['data']['num_comments'] for i in posts_war],
    'text': [i['data']['selftext'] for i in posts_war]
}

df_war = pd.DataFrame(data_war, columns = data_war.keys())

In [44]:
df_war.head()

Unnamed: 0,title,subreddit,time,comments,text
0,Idea: Watch Prime time &amp; Devstreams inside...,Warframe,1564095000.0,119,
1,Two Volts Walk into a Bar...,Warframe,1564093000.0,62,
2,Umbra(cat) vs Shawzin,Warframe,1564079000.0,32,
3,"meet George Endo, my first Ayatan golem",Warframe,1564078000.0,40,
4,"DE Please make this a Glyph, we will pay anyth...",Warframe,1564100000.0,15,


In [45]:
df = pd.concat([df_apex, df_war])

In [46]:
df.head()

Unnamed: 0,title,subreddit,time,comments,text
0,Tried to do Fight Club but they wanted the win,apexlegends,1564080000.0,215,
1,Anybody else getting empty Apex Packs?,apexlegends,1564100000.0,50,
2,No one gets my gold armor!,apexlegends,1564064000.0,220,
3,Pro strats,apexlegends,1564098000.0,42,
4,"“During development, we use heatmaps to look a...",apexlegends,1564113000.0,23,


In [47]:
# target column: 1 for apexlegends posts, 0 for Warframe posts
df['apexlegends'] = (df['subreddit'] == 'apexlegends').astype(int)

In [48]:
df = df.drop(columns='subreddit')

In [49]:
df.head()

Unnamed: 0,title,time,comments,text,apexlegends
0,Tried to do Fight Club but they wanted the win,1564080000.0,215,,1
1,Anybody else getting empty Apex Packs?,1564100000.0,50,,1
2,No one gets my gold armor!,1564064000.0,220,,1
3,Pro strats,1564098000.0,42,,1
4,"“During development, we use heatmaps to look a...",1564113000.0,23,,1


In [50]:
df.to_csv('./datasets/data.csv')

# **Run Notebook from here**

In [52]:
# X: title concatenated with post text (note: joined without a separator); y: 1 if the post is from apexlegends
X = df['title'] + df['text']
y = df['apexlegends']

## 3) NLP Technique

In [53]:
#split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

The following will be done using CountVectorizer together with the models used:
1) Multinomial NB
2) Bernoulli NB
3) Gaussian NB
4) Logistic Regression + Grid Search
5) KNN + Grid Search

Each classification method will be evaluated based on the following:
1) Confusion matrix

In [54]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english', token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

In [55]:
#Check out the most common words
X_df = pd.DataFrame(cvec.transform(X_train).todense(),
                       columns=cvec.get_feature_names())
word_counts = pd.DataFrame(X_df).sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

just        503
game        402
like        375
warframe    267
amp         217
know        211
don         209
time        198
https       193
ve          183
apex        177
play        177
think       166
new         164
got         160
make        159
damage      151
use         145
really      144
way         144
dtype: int64

In [56]:
# other kinds of vectorizer, such as TF-IDF, can be tried as well
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      LogisticRegression(solver='lbfgs'),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9892183288409704
model score on test data: 0.8464646464646465
Number of features: 7527


In [57]:
# GridSearch was attempted and excluded because there was no change.
# Import the confusion matrix function.
from sklearn.metrics import confusion_matrix

In [58]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[190,  55],
       [ 21, 229]])

In [59]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [60]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 190
False Positives: 55
False Negatives: 21
True Positives: 229
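The four counts printed above can be turned into the usual summary rates. A minimal sketch, using the logistic regression counts from the cell above; the helper name `summarize_confusion` is hypothetical and not part of the original notebook.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive common rates from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,     # share of all posts classified correctly
        "sensitivity": tp / (tp + fn),     # share of Apex Legends posts that were found
        "specificity": tn / (tn + fp),     # share of Warframe posts that were found
        "precision": tp / (tp + fp),       # share of Apex predictions that were right
    }


# Counts taken from the CountVectorizer + logistic regression output above.
metrics = summarize_confusion(tn=190, fp=55, fn=21, tp=229)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way agrees with the test score reported by `accuracy_score` for the same model, and the gap between sensitivity and specificity shows which class the model favours.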


In [61]:
from sklearn.neighbors import KNeighborsClassifier
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      KNeighborsClassifier(),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.8173854447439353
model score on test data: 0.604040404040404
Number of features: 7527


In [62]:
from sklearn.model_selection import GridSearchCV
knn_params = {
    'n_neighbors': range(1,10, 2),
    'metric' : ['euclidean', 'manhattan']
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(),
    knn_params,
    cv=9,
    verbose=1,
    return_train_score=False
)

knn_gridsearch.fit(X_train_cvec, y_train)

Fitting 9 folds for each of 10 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   34.8s finished


GridSearchCV(cv=9, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': range(1, 10, 2), 'metric': ['euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=1)

In [63]:
knn_gridsearch.best_score_

0.6401617250673854

In [64]:
knn_gridsearch.best_params_

{'metric': 'euclidean', 'n_neighbors': 9}

In [65]:
knn_gridsearch.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=9, p=2,
           weights='uniform')
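An alternative to copying the best parameters by hand is to grid-search the whole pipeline, so that the vectorizer is refitted inside each fold and `best_estimator_` can be used directly. The following is a sketch under stated assumptions: the toy corpus is hypothetical, and the small grid and `cv=2` are chosen only so that it runs on eight documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = Warframe, 1 = apexlegends.
texts = [
    "warframe prime build", "tenno operator mission",
    "warframe mission bug", "prime frame build",
    "apex legends squad", "wraith pathfinder zipline",
    "apex squad teammates", "legends wraith clutch",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = make_pipeline(CountVectorizer(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# so KNN parameters are addressed as 'kneighborsclassifier__<param>'.
param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3],
    "kneighborsclassifier__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

# With refit=True (the default), best_estimator_ is already fitted on all data.
best_model = search.best_estimator_
predictions = best_model.predict(["new warframe prime", "apex squad wins"])
print(search.best_params_)
print(predictions)
```

Because the search returns a fitted pipeline, raw text can be passed straight to `predict`, and there is no risk of typing a different `n_neighbors` than the one the search selected.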

In [66]:
# The vectorizer must be fitted before KNN can be fitted. As shown above, a pipeline alone can do both,
# but here the KNN model is fitted manually on the pre-vectorized data with explicitly chosen parameters.
# Note: n_neighbors=3 is used here, whereas the grid search above selected n_neighbors=9.
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
model = knn.fit(X_train_cvec, y_train)

In [67]:
model.score(X_train_cvec, y_train)

0.8915094339622641

In [68]:
model.score(X_test_cvec, y_test)

0.6060606060606061

In [69]:
y_pred = model.predict(X_test_cvec)

In [70]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[ 70, 175],
       [ 20, 230]])

In [71]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [72]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 70
False Positives: 175
False Negatives: 20
True Positives: 230


In [73]:
# Import the Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

In [74]:
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      MultinomialNB(),
                      )

model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
# Note: the raw X_train can be passed here even though it has not been transformed by CountVectorizer,
# because the fitted pipeline applies the vectorizer before the classifier.
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9582210242587601
model score on test data: 0.9090909090909091
Number of features: 7527


In [75]:
# Generate a confusion matrix.

cm = confusion_matrix(y_test, y_pred)
cm

array([[221,  24],
       [ 21, 229]])

In [76]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [77]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 221
False Positives: 24
False Negatives: 21
True Positives: 229


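Next, a BernoulliNB model is used to classify. It treats each word as a binary feature (present or absent in a post); scikit-learn's `BernoulliNB` binarizes its input at a threshold of 0.0 by default, so the word counts are reduced to presence indicators. The question is whether this performs better or worse than MultinomialNB.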

In [78]:
#Import Bernoulli and edit pipeline from above
from sklearn.naive_bayes import BernoulliNB
model = make_pipeline(CountVectorizer(stop_words='english',
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BernoulliNB(),
                      )


model_count = model.fit(X_train, y_train)
y_pred = model_count.predict(X_test)
print('model score on train data: {}'.format(model_count.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.7884097035040432
model score on test data: 0.7333333333333333
Number of features: 7527


In [79]:
# Generate a confusion matrix.
cm = confusion_matrix(y_test, y_pred)
cm

array([[116, 129],
       [  3, 247]])

In [80]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [81]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 116
False Positives: 129
False Negatives: 3
True Positives: 247


To do: add ROC AUC as a second evaluation metric.
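ROC AUC is named as a metric in the overview but is not computed in the cells of this notebook. A minimal sketch of how it could be obtained from a fitted pipeline follows; the toy texts and labels are hypothetical, and `roc_auc_score` is given the predicted probability of the positive class (apexlegends) rather than the hard predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = Warframe, 1 = apexlegends.
train_texts = [
    "warframe prime build mission",
    "tenno operator warframe",
    "apex legends wraith squad",
    "pathfinder zipline apex",
]
train_labels = [0, 0, 1, 1]

test_texts = [
    "new warframe prime",
    "apex squad wins",
    "tenno mission bug",
    "wraith zipline clutch",
]
test_labels = [0, 1, 0, 1]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Column 1 of predict_proba is the probability of class 1 (apexlegends).
proba_apex = clf.predict_proba(test_texts)[:, 1]
auc = roc_auc_score(test_labels, proba_apex)
print("ROC AUC:", auc)
```

Unlike accuracy, this score does not depend on the 0.5 decision threshold: it measures how often an Apex Legends post receives a higher predicted probability than a Warframe post.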

Use of GaussianNB

GaussianNB does not accept sparse input, so the CountVectorizer output is computed outside the pipeline and converted to a dense array before the Gaussian Naive Bayes model is fitted.

In [82]:
#cvec already defined above
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
X_train_cvec = X_train_cvec.toarray()
model_gnb = gnb.fit(X_train_cvec, y_train)
X_test_cvec = X_test_cvec.toarray()
predictions_gnb = model_gnb.predict(X_test_cvec)
print('model score on train data: {}'.format(model_gnb.score(X_train_cvec, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, predictions_gnb)))
print("Number of features:", X_train_cvec.shape[1])  # count columns of the matrix GaussianNB was fitted on

model score on train data: 0.9683288409703504
model score on test data: 0.8707070707070707
Number of features: 7527


In [83]:
# Generate a confusion matrix.
cm = confusion_matrix(y_test, predictions_gnb)
cm

array([[199,  46],
       [ 18, 232]])

In [84]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_gnb).ravel()

In [85]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 199
False Positives: 46
False Negatives: 18
True Positives: 232


Repeat the above steps using TF-IDF (TfidfVectorizer) together with the following models:
1) Logistic Regression
2) KNN (bagged, then with Grid Search)
3) Multinomial NB
4) Bernoulli NB
5) Gaussian NB

Then evaluate similarly for each model with the following classification metrics:
1) Confusion matrix
2) ROC AUC

In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english', 
                       sublinear_tf=True,
                       max_df=0.7, 
                       token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X_train_tvec = tvec.fit_transform(X_train)
X_test_tvec = tvec.transform(X_test)

In [87]:
# TfidfVectorizer + Logistic Regression pipeline
model = make_pipeline(TfidfVectorizer(stop_words='english',
                        sublinear_tf=True,
                        max_df=0.5,
                        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      LogisticRegression(),
                      )

model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9824797843665768
model score on test data: 0.8727272727272727
Number of features: 7527




In [88]:
model_tf.named_steps

{'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=0.5, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=True,
         token_pattern='(?u)\\b[a-zA-Z]{2,}\\b', tokenizer=None,
         use_idf=True, vocabulary=None),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False)}

In [89]:
df_log = pd.DataFrame(model_tf.named_steps.logisticregression.coef_)

In [90]:
df_log.columns = model_tf.named_steps.tfidfvectorizer.get_feature_names()

In [91]:
df_log.head()

Unnamed: 0,aaaand,abandoned,abandoning,abc,abilities,abilitiesi,ability,able,abnormally,aboard,...,zipline,ziplines,zipped,zipping,zone,zoned,zones,zoom,zopney,zotac
0,-0.137653,-0.018376,-0.037548,0.040318,-0.913455,-0.058753,-0.492311,-0.223484,0.143165,-0.142141,...,0.505523,0.163159,0.054529,-0.009177,0.080989,0.092325,-0.019723,-0.068758,-0.011881,0.041804


In [92]:
df_log = df_log.reset_index() #Make your index into a column
df_log = pd.melt(df_log, id_vars = ['index']) #Reshape data

In [93]:
df_log = df_log.drop(columns='index').sort_values(by = 'value') # Drop the helper index column, sort by coefficient

In [94]:
#top 20 key words that are relevant for Warframe
df_log.head(20)

Unnamed: 0,variable,value
7276,warframe,-3.741892
4972,prime,-2.640685
844,build,-1.589209
4498,operator,-1.551822
2614,frame,-1.376048
4129,mission,-1.353938
6659,tenno,-1.295308
835,bug,-1.283752
4131,missions,-1.256867
3532,just,-1.241961


In [95]:
#top 20 key words that are relevant for Apex Legends
df_log.tail(20)

Unnamed: 0,variable,value
989,caustic,1.155718
7448,wraith,1.1703
1181,clutch,1.188559
2799,gold,1.196397
540,bangalore,1.217846
6635,teammates,1.284757
2750,gibby,1.301021
5508,respawn,1.374351
3709,legends,1.374978
1021,challenges,1.377156


In [96]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[213,  32],
       [ 31, 219]])

In [97]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [98]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 213
False Positives: 32
False Negatives: 31
True Positives: 219


In [99]:
from sklearn.ensemble import BaggingClassifier
knn = KNeighborsClassifier()
model = make_pipeline(TfidfVectorizer(stop_words='english',
                        sublinear_tf=True,
                        max_df=0.5,
                        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BaggingClassifier(base_estimator=knn, max_samples=0.5, max_features=0.5)
                      )

model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.6866576819407008
model score on test data: 0.5474747474747474
Number of features: 7527


In [100]:
knn_params = {
    'n_neighbors': range(15,25,2),
    'metric' : ['euclidean', 'manhattan']
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(),
    knn_params,
    cv=9,
    verbose=1,
    return_train_score=False
)

knn_gridsearch.fit(X_train_tvec, y_train)

Fitting 9 folds for each of 10 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   34.7s finished


GridSearchCV(cv=9, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': range(15, 25, 2), 'metric': ['euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=1)

In [101]:
knn_gridsearch.best_score_

0.8234501347708895

In [102]:
knn_gridsearch.best_params_

{'metric': 'euclidean', 'n_neighbors': 17}

In [103]:
knn_gridsearch.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=17, p=2,
           weights='uniform')

In [104]:
# Note: n_neighbors=19 is used here, whereas the grid search above selected n_neighbors=17.
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=19, p=2,
           weights='uniform')
model = knn.fit(X_train_tvec, y_train)

In [105]:
model.score(X_train_tvec, y_train)

0.8598382749326146

In [106]:
model.score(X_test_tvec, y_test)

0.8303030303030303

In [107]:
y_pred = model.predict(X_test_tvec)

In [108]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[222,  23],
       [ 61, 189]])

In [109]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [110]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 222
False Positives: 23
False Negatives: 61
True Positives: 189


In [111]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.5,
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      MultinomialNB(),
                      )
model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.9797843665768194
model score on test data: 0.8848484848484849
Number of features: 7527


In [112]:
model_tf.named_steps

{'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=0.5, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=True,
         token_pattern='(?u)\\b[a-zA-Z]{2,}\\b', tokenizer=None,
         use_idf=True, vocabulary=None),
 'multinomialnb': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)}

In [113]:
df_multi = pd.DataFrame(model_tf.named_steps.multinomialnb.coef_)

In [114]:
df_multi.columns = model_tf.named_steps.tfidfvectorizer.get_feature_names()

In [115]:
df_multi.head()

Unnamed: 0,aaaand,abandoned,abandoning,abc,abilities,abilitiesi,ability,able,abnormally,aboard,...,zipline,ziplines,zipped,zipping,zone,zoned,zones,zoom,zopney,zotac
0,-9.208863,-9.063953,-9.057981,-9.027315,-8.920346,-9.208863,-8.368399,-8.030579,-8.857243,-9.208863,...,-8.205598,-8.774714,-9.062018,-9.208863,-8.451381,-8.971569,-9.037196,-9.208863,-9.208863,-9.068323


In [116]:
df_multi = df_multi.reset_index() #Make your index into a column
df_multi = pd.melt(df_multi, id_vars = ['index']) #Reshape data

In [117]:
df_multi.head()

Unnamed: 0,index,variable,value
0,-9.208863,level_0,0.0
1,-9.208863,aaaand,-9.208863
2,-9.208863,abandoned,-9.063953
3,-9.208863,abandoning,-9.057981
4,-9.208863,abc,-9.027315


In [118]:
df_multi = df_multi.drop(columns='index').sort_values(by = 'value') # Drop the helper index column, sort by value

In [119]:
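# 20 words with the lowest log-probability under the Apex Legends class (coef_ holds log P(word | apexlegends)).
# The repeated minimum value belongs to words that never occur in Apex Legends training posts.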
df_multi.head(20)

Unnamed: 0,variable,value
3156,hugging,-9.208863
5126,qcmopoq,-9.208863
2811,goosebumps,-9.208863
2813,gorgon,-9.208863
5122,pyramid,-9.208863
2815,gospel,-9.208863
7025,underranked,-9.208863
2817,goth,-9.208863
5119,pvp,-9.208863
5117,puzzle,-9.208863


In [120]:
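# 20 words with the highest log-probability under the Apex Legends class.
# These are frequent words in Apex Legends posts, including generic ones such as 'know' and 'think'.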
df_multi.tail(20)

Unnamed: 0,variable,value
3596,know,-7.144298
3570,kills,-7.143765
3709,legends,-7.117982
6258,squad,-7.055298
7308,wattson,-7.05019
6634,teammate,-7.024906
6721,think,-6.998852
3984,match,-6.990577
4664,pathfinder,-6.965961
1905,don,-6.895478


In [121]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[216,  29],
       [ 28, 222]])

In [122]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [123]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 216
False Positives: 29
False Negatives: 28
True Positives: 222


In [124]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.7,
                                      token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'),
                      BernoulliNB(),
                      )
model_tf = model.fit(X_train, y_train)
y_pred = model_tf.predict(X_test)
print('model score on train data: {}'.format(model_tf.score(X_train, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, y_pred)))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

model score on train data: 0.7884097035040432
model score on test data: 0.7333333333333333
Number of features: 7527


In [125]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[116, 129],
       [  3, 247]])

In [126]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [127]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 116
False Positives: 129
False Negatives: 3
True Positives: 247


In [128]:
X_train_tvec = X_train_tvec.toarray()
model_gnb = gnb.fit(X_train_tvec, y_train)
X_test_tvec = X_test_tvec.toarray()
predictions_gnb = model_gnb.predict(X_test_tvec)
print('model score on train data: {}'.format(model_gnb.score(X_train_tvec, y_train)))
print('model score on test data: {}'.format(accuracy_score(y_test, predictions_gnb)))
print("Number of features:", X_train_tvec.shape[1])  # count columns of the matrix GaussianNB was fitted on

model score on train data: 0.9696765498652291
model score on test data: 0.8646464646464647
Number of features: 7527


In [129]:
cm = confusion_matrix(y_test, predictions_gnb)
cm

array([[202,  43],
       [ 24, 226]])

In [130]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_gnb).ravel()

In [131]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 202
False Positives: 43
False Negatives: 24
True Positives: 226


## Conclusion

For this example, the cost of a false positive and a false negative is the same. A false positive is a post that is predicted to be from Apex Legends but is in fact a Warframe post. In scenarios where missing a positive case is more serious, such as detecting cancer, pregnancy or fraud, a false positive is preferred over a false negative. One way to think of this is that a false negative is a positive occurrence that managed to 'slip' under the radar, whereas a false positive is a negative case (usually good) flagged as a positive one (usually bad, and the kind we want to detect). This is a relatively straightforward classification case.

A practical use case is filtering news or forum posts by industry and then going further with sentiment analysis. This can be useful for market analysis and, in today's context, for estimating the probability of a recession. Compared with traditional surveys, this is a much more efficient way to collect and analyse information. Regardless of where the information is collected, one drawback is its quality.

## Summary between models

The ideal model in this case is Multinomial Naive Bayes, as the columns of X are integer counts. It gave the highest test accuracy (0.909 with CountVectorizer), and its false positives (24) and false negatives (21) are fairly close.

Bernoulli Naive Bayes works best when the columns of X are dummy variables (one-hot encoded). It is not very applicable here, as the lower model score shows (0.733 test accuracy with either vectorizer).
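The Bernoulli results above are identical under both vectorizers (same scores and same confusion matrix). That is consistent with how `BernoulliNB` works: with its default `binarize=0.0`, any positive value becomes 1, so counts and TF-IDF weights built on the same vocabulary collapse to the same presence matrix. A minimal sketch on a hypothetical toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus: 0 = Warframe, 1 = apexlegends.
texts = [
    "warframe warframe prime build",
    "tenno operator mission mission",
    "apex apex legends squad",
    "wraith pathfinder zipline zipline",
]
labels = [0, 0, 1, 1]
new_texts = ["warframe build squad", "apex zipline mission"]

cvec = CountVectorizer()
tvec = TfidfVectorizer(sublinear_tf=True)

# Fit one BernoulliNB on raw counts and one on TF-IDF weights.
nb_counts = BernoulliNB().fit(cvec.fit_transform(texts), labels)
nb_tfidf = BernoulliNB().fit(tvec.fit_transform(texts), labels)

proba_counts = nb_counts.predict_proba(cvec.transform(new_texts))
proba_tfidf = nb_tfidf.predict_proba(tvec.transform(new_texts))

# Both models see the same binarized matrix, so their outputs agree.
same = np.allclose(proba_counts, proba_tfidf)
print("Identical probabilities:", same)
```

In other words, switching vectorizers can only change a Bernoulli model when it changes the vocabulary itself, for example through `max_df` or `min_df`.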

Gaussian Naive Bayes works best when the columns of X are normally distributed, so it is not very applicable here either. It nonetheless gives fairly close numbers of false positives and false negatives, unlike Bernoulli Naive Bayes, where false positives far outnumber false negatives (129 against 3).

Logistic Regression: grid search was not very useful and returned a best penalty of 1.0. However, the model score is high (0.846 test accuracy with CountVectorizer and 0.873 with TfidfVectorizer).

K-Nearest Neighbors (KNN): on its own, the model scored poorly on the train data (0.817) and much worse on the test data (0.604) with CountVectorizer. Grid search made almost no difference with CountVectorizer (0.606 test accuracy for the manually fitted model). With TfidfVectorizer, however, the tuned KNN model scored far better (0.830 test accuracy, against 0.547 for the untuned bagged KNN).

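## Comparison between the 2 vectorizers
Modelling wise, the effect of the vectorizer depends on the model. The train score was slightly lower with TfidfVectorizer than with CountVectorizer for Logistic Regression (0.982 against 0.989) and KNN (0.860 against 0.892), but slightly higher for Multinomial NB (0.980 against 0.958). The test score improved with TfidfVectorizer for Logistic Regression (0.846 to 0.873) and KNN (0.606 to 0.830), dropped slightly for Multinomial NB (0.909 to 0.885) and Gaussian NB (0.871 to 0.865), and was unchanged for Bernoulli NB (0.733).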
