# Machine Learning Models (Using `keywords`)

### Table of Contents

* [1 Python Setup](#1)
* [2 Data Visualization](#2)
    * [2.1 Data Cleaning](#2.1)
* [3 Model Fitting](#3)
    * [3.1 K-Nearest-Neighbors Classifier](#3.1)
    * [3.2 Multinomial NB](#3.2)
    * [3.3 RandomForestClassifier](#3.3)
    * [3.4 OneVsRestClassifier](#3.4)
    * [3.5 LinearSVC](#3.5)
    * [3.6 StackingCVClassifier + GridSearchCV](#3.6)
    * [3.7 Decision Tree Classifier + Grid Search CV](#3.7)

## 1 Python Setup <a class="anchor" id="1"></a>

In [1]:
import re
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

## 2 Data Visualization <a class="anchor" id="2"></a>

In [2]:
news = pd.read_csv('final_news_df.csv')
news = news.dropna()

In [3]:
news.head()

| Unnamed: 0 | category | headline | short_description | date | keywords |
|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover |


In [4]:
def remove_sign(text):
    # Replace every character that is not an ASCII letter
    # (punctuation, digits, symbols) with a space.
    return re.sub(r'[^a-zA-Z]', ' ', text)

news['text'] = news['short_description'].apply(remove_sign)

# Build the stop-word set once instead of on every call.
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # Tokenize, lowercase, and keep alphabetic tokens that are not stop words.
    tokens = word_tokenize(text)
    words = [word.lower() for word in tokens if word.isalpha()]
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

news['text'] = news['text'].apply(remove_stop_words)
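
To see what this cleaning step does to a single description, the sketch below applies the same non-letter substitution to one sample sentence and then filters out stop words. It is only an illustration: `SAMPLE_STOP_WORDS` is a small hand-written list standing in for NLTK's English list, and plain whitespace splitting stands in for `word_tokenize`, so that the example runs without any NLTK downloads.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {"the", "which", "a", "in", "it", "s"}

def clean(text):
    # Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, split on whitespace, and drop stop words.
    words = [w for w in letters_only.lower().split() if w not in SAMPLE_STOP_WORDS]
    return " ".join(words)

sample = "The deal, which still needs FCC approval!"
cleaned = clean(sample)
print(cleaned)
```

The punctuation disappears, everything is lowercased, and the filler words are dropped, which is the same shape as the `text` column shown in the next `news.head()` output.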

In [5]:
def remove_dash(text):
    text = text.replace("-", " ")
    return text

news['keywords text'] = news['keywords'].apply(remove_dash)

In [6]:
news.head()

| Unnamed: 0 | category | headline | short_description | date | keywords | text | keywords text |
|---|---|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw | getting crowded investment crowdfunding space ... | investment crowdfunding draw |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons | even though might literally sending wrong mess... | office romances emoticons |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days | deal could coming soon | verizon could buy yahoo next few days |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand | many leaders operate meeting brand directly af... | whats your meeting brand |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover | deal still needs fcc approval would create sec... | charter time warner cable takeover |


### 2.1 Data Cleaning <a class="anchor" id="2.1"></a>

In [7]:
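# Bag-of-words counts over the keyword text, then TF-IDF weighting.
# Note: both are fit on the full dataset before splitting, so the vocabulary
# and IDF weights also reflect rows that end up in the test set.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['keywords text'])
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x)

# Encode the category names as integer labels.
encoder = LabelEncoder()
y = encoder.fit_transform(news['category'])

# 70/30 train/test split, stratified by category.
x_train, x_test, y_train, y_test = \
    train_test_split(x_tfidf, y, test_size=0.3, random_state=123, stratify=y)

# Carve 20% of the training data out as a validation set.
# x_valid is a subset of x_train, so a model fit on the full x_train has
# already seen it; fit on x_train_sub for an independent validation score.
x_train_sub, x_valid, y_train_sub, y_valid = \
    train_test_split(x_train, y_train, test_size=0.2, random_state=123, stratify=y_train)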

print('Train/Valid/Test sizes:', y_train_sub.shape[0], y_valid.shape[0], y_test.shape[0])

Train/Valid/Test sizes: 15689 3923 8406


## 3 Model Fitting <a class="anchor" id="3"></a>

## 3.1 K-Nearest-Neighbors Classifier <a class="anchor" id="3.1"></a>

In [8]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

### 3.1.1 Training Set

In [9]:
print(f'Training Accuracy: {knn_model.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {knn_model.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {knn_model.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 80.51%
Valid set accuracy: 80.30%
Test set accuracy: 65.05%
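
The validation accuracy sits almost exactly on the training accuracy, while the test accuracy is far lower. One likely reason is that the model is fit on `x_train`, and `x_valid` was split out of `x_train`, so the validation rows were seen during fitting. The sketch below illustrates the effect on synthetic data (not the news dataset) with labels that are unrelated to the features. The names `X_demo`, `seen` and `held_out` are made up for this example, and `n_neighbors=1` is used only because it makes the effect exact.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the news dataset): 600 rows, 3 balanced classes.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(600, 5))
y_demo = np.repeat(np.arange(3), 200)

# Same two-stage split as the notebook: train/test, then train_sub/valid.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo)
X_sub, X_val, y_sub, y_val = train_test_split(
    X_tr, y_tr, test_size=0.2, random_state=123, stratify=y_tr)

# With n_neighbors=1, each validation point is its own nearest neighbour
# whenever it is part of the data the model was fit on.
seen = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
held_out = KNeighborsClassifier(n_neighbors=1).fit(X_sub, y_sub)

score_seen = seen.score(X_val, y_val)
score_held_out = held_out.score(X_val, y_val)
print(f"Fit on X_tr (contains validation rows): {score_seen:.2f}")
print(f"Fit on X_sub (validation held out):     {score_held_out:.2f}")
```

Fitting on the `x_train_sub` / `y_train_sub` split that the notebook already creates would give a validation score that is comparable to the test score.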


### 3.1.2 Validation Set

In [10]:
y_pred_knn_valid = knn_model.predict(x_valid)

In [11]:
f1_score(y_valid, y_pred_knn_valid, average='weighted')

0.8030965698349332

In [12]:
precision_score(y_valid, y_pred_knn_valid, average='weighted')

0.8143319538599235

In [13]:
recall_score(y_valid, y_pred_knn_valid, average='weighted')

0.8029569207239358

### 3.1.3 Test Set

In [14]:
y_pred_knn = knn_model.predict(x_test)

In [15]:
f1_score(y_test, y_pred_knn, average='weighted')

0.6506679943311742

In [16]:
precision_score(y_test, y_pred_knn, average='weighted')

0.6671480762517079

In [17]:
recall_score(y_test, y_pred_knn, average='weighted')

0.6504877468474899
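
Every precision, recall and F1 value in this notebook uses `average='weighted'`: the score is computed per class and then averaged with weights proportional to each class's number of true samples. The pure-Python sketch below, on made-up labels, spells that out. The helper name `weighted_scores` is hypothetical, and classes that are never predicted get a precision of 0. It also shows why the weighted recall values above coincide with the printed accuracies.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / count                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                          # share of true samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

precision, recall, f1 = weighted_scores(y_true_demo, y_pred_demo)
accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Because each class's recall is weighted by its own support, the weighted sum reduces to correct predictions divided by total samples, which is accuracy. Weighted precision and F1 carry information beyond accuracy; weighted recall does not.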

-------------------------------------------------------------------------

## 3.2 Multinomial NB <a class="anchor" id="3.2"></a>

### 3.2.1 Training Set

In [18]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB()

In [19]:
print(f'Training Accuracy: {nb.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {nb.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {nb.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 88.20%
Valid set accuracy: 88.10%
Test set accuracy: 76.98%


In [20]:
nb.score(x_train, y_train)

0.8819600244748114

### 3.2.2 Validation Set

In [21]:
nb.score(x_valid, y_valid)

0.8809584501656895

In [22]:
y_pred_nb_valid = nb.predict(x_valid)

In [23]:
f1_score(y_valid, y_pred_nb_valid, average='weighted')

0.8809324287424846

In [24]:
precision_score(y_valid, y_pred_nb_valid, average='weighted')

0.8820583630491844

In [25]:
recall_score(y_valid, y_pred_nb_valid, average='weighted')

0.8809584501656895

### 3.2.3 Test Set

In [26]:
nb.score(x_test, y_test)

0.7698072805139187

In [27]:
y_pred_nb = nb.predict(x_test)

In [28]:
f1_score(y_test, y_pred_nb, average='weighted')

0.7698307579760778

In [29]:
precision_score(y_test, y_pred_nb, average='weighted')

0.770883973032243

In [30]:
recall_score(y_test, y_pred_nb, average='weighted')

0.7698072805139187
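
The TF-IDF features used above are computed once over the whole dataset before it is split. An alternative, sketched here on a tiny made-up corpus, is to bundle the vectorizer, the TF-IDF transformer and the classifier in a `Pipeline`, so that the vocabulary and IDF weights are learned from the training rows only. The texts, labels and step names (`counts`, `tfidf`, `nb`) are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus in the style of the `keywords text` column.
train_texts = [
    "verizon could buy yahoo",
    "charter time warner merger",
    "team wins championship game",
    "player scores winning goal",
]
train_labels = ["BUSINESS", "BUSINESS", "SPORTS", "SPORTS"]
test_texts = ["yahoo merger deal", "championship goal"]

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # token counts
    ("tfidf", TfidfTransformer()),   # TF-IDF weighting
    ("nb", MultinomialNB()),         # classifier
])

# fit() learns vocabulary, IDF weights and class statistics from training text only.
text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(test_texts)
print(list(predictions))

# Words that appear only in the test texts never enter the vocabulary.
vocabulary = text_clf.named_steps["counts"].vocabulary_
print("deal" in vocabulary)
```

With this layout, `train_test_split` would be applied to the raw text column first, and the pipeline would be fit on the training portion.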

------------------------------------------------------------------------------------------------------------------------------------

## 3.3 RandomForestClassifier <a class="anchor" id="3.3"></a>

In [31]:
forest = RandomForestClassifier(n_estimators=100,
                                random_state=1)
forest.fit(x_train, y_train)

RandomForestClassifier(random_state=1)

### 3.3.1 Training Set

In [32]:
print(f'Training Accuracy: {forest.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {forest.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {forest.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 99.56%
Valid set accuracy: 99.75%
Test set accuracy: 73.59%


### 3.3.2 Validation Set

In [33]:
y_pred_forest_valid = forest.predict(x_valid)

In [34]:
f1_score(y_valid, y_pred_forest_valid, average='weighted')

0.9974492048361288

In [35]:
precision_score(y_valid, y_pred_forest_valid, average='weighted')

0.997450653542024

In [36]:
recall_score(y_valid, y_pred_forest_valid, average='weighted')

0.9974509304104002

### 3.3.3 Test Set

In [37]:
y_pred_forest = forest.predict(x_test)

In [38]:
f1_score(y_test, y_pred_forest, average='weighted')

0.7370416119770092

In [39]:
precision_score(y_test, y_pred_forest, average='weighted')

0.7487719566127018

In [40]:
recall_score(y_test, y_pred_forest, average='weighted')

0.7359029264810849

----------------------------------------------------

## 3.4 OneVsRestClassifier <a class="anchor" id="3.4"></a>

In [41]:
ovr = OneVsRestClassifier(MultinomialNB())
y_pred_ovr = ovr.fit(x_train, y_train).predict(x_test)

### 3.4.1 Training Set

In [42]:
print(f'Training Accuracy: {ovr.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {ovr.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {ovr.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 88.93%
Valid set accuracy: 88.81%
Test set accuracy: 77.69%


### 3.4.2 Validation Set

In [43]:
y_pred_ovr_valid = ovr.predict(x_valid)

In [44]:
accuracy_score(y_valid, y_pred_ovr_valid)

0.8880958450165689

In [45]:
precision_score(y_valid, y_pred_ovr_valid, average='weighted')

0.888767963965286

In [46]:
recall_score(y_valid, y_pred_ovr_valid, average='weighted')

0.8880958450165689

In [47]:
f1_score(y_valid, y_pred_ovr_valid, average='weighted')

0.8880091145691023

### 3.4.3 Test Set

In [48]:
accuracy_score(y_test, y_pred_ovr)

0.7769450392576731

In [49]:
f1_score(y_test, y_pred_ovr, average='weighted')

0.7768579896907817

In [50]:
precision_score(y_test, y_pred_ovr, average='weighted')

0.7775355762077373

In [51]:
recall_score(y_test, y_pred_ovr, average='weighted')

0.7769450392576731

---------------------------------------------

## 3.5 LinearSVC <a class="anchor" id="3.5"></a>

In [52]:
clf = LinearSVC(random_state=0, tol=1e-5)
LinearSVC_model = clf.fit(x_train, y_train)

### 3.5.1 Training Set

In [53]:
print(f'Training Accuracy: {LinearSVC_model.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {LinearSVC_model.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {LinearSVC_model.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 96.47%
Valid set accuracy: 96.15%
Test set accuracy: 78.33%


### 3.5.2 Validation Set

In [54]:
y_pred_clf_valid = LinearSVC_model.predict(x_valid)

In [55]:
accuracy_score(y_valid, y_pred_clf_valid)

0.961509049197043

In [56]:
f1_score(y_valid, y_pred_clf_valid, average='weighted')

0.9614777036844301

In [57]:
precision_score(y_valid, y_pred_clf_valid, average='weighted')

0.9617215623182921

In [58]:
recall_score(y_valid, y_pred_clf_valid, average='weighted')

0.961509049197043

### 3.5.3 Test Set

In [59]:
y_pred_clf = LinearSVC_model.predict(x_test)

In [60]:
accuracy_score(y_test, y_pred_clf)

0.7832500594813229

In [61]:
f1_score(y_test, y_pred_clf, average='weighted')

0.7832518750023001

In [62]:
precision_score(y_test, y_pred_clf, average='weighted')

0.7837042491455734

In [63]:
recall_score(y_test, y_pred_clf, average='weighted')

0.7832500594813229

-------------------------------------------

## 3.6 StackingCVClassifier + GridSearchCV <a class="anchor" id="3.6"></a>

In [64]:
forest = RandomForestClassifier(n_estimators=100,
                                random_state=123)

boost = XGBClassifier(random_state=123, verbosity=0, use_label_encoder=False)

metaclassifier = LogisticRegression(random_state=123, max_iter=1000)

sclf = StackingCVClassifier(classifiers=[forest, boost], 
                            meta_classifier=metaclassifier, 
                            random_state=123)

In [65]:
pipe = make_pipeline(sclf)

params =  {
    'stackingcvclassifier__use_probas': [True, False],
    'stackingcvclassifier__drop_proba_col': [None, 'last'],
    'stackingcvclassifier__xgbclassifier__max_depth': [4, 6],
    'stackingcvclassifier__randomforestclassifier__n_estimators': [10]
}


grid = GridSearchCV(estimator=pipe,
                    param_grid=params,
                    cv=2,
                    n_jobs=1,
                    verbose=2)

#with parallel_backend('threading'):
grid.fit(x_train, y_train)

grid.best_score_

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time=  48.9s
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time=  56.2s
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.1min
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.3min
[CV] END stackingcvclassifier__drop_proba_col=No

0.6905975933102182

### 3.6.1 Training Set

In [66]:
print(f"Train Accuracy: {grid.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {grid.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {grid.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 96.27%
Valid Accuracy: 96.69%
Test Accuracy: 74.02%


### 3.6.2 Validation Set

In [67]:
y_pred_grid_valid = grid.predict(x_valid)

In [68]:
f1_score(y_valid, y_pred_grid_valid, average='weighted')

0.9667214971016279

In [69]:
precision_score(y_valid, y_pred_grid_valid, average='weighted')

0.9670114254490513

In [70]:
recall_score(y_valid, y_pred_grid_valid, average='weighted')

0.9668620953352026

### 3.6.3 Test Set

In [71]:
y_pred_grid = grid.predict(x_test)

In [72]:
f1_score(y_test, y_pred_grid, average='weighted')

0.7400521693603924

In [73]:
precision_score(y_test, y_pred_grid, average='weighted')

0.7438740492017507

In [74]:
recall_score(y_test, y_pred_grid, average='weighted')

0.7401855817273376

## 3.7 Decision Tree Classifier + Grid Search CV <a class="anchor" id="3.7"></a>

In [75]:
tree = DecisionTreeClassifier(random_state=123)

param_grid = {
    'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None],
    'criterion': ['gini', 'entropy']
}

gs = GridSearchCV(estimator=tree,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10)

gs.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None]},
             scoring='accuracy')
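
The fitted search object also records which parameter combination won and how each candidate scored across the folds. The sketch below shows how these could be read off, using a small synthetic dataset from `make_classification` and a reduced grid so it runs quickly. `X_demo`, `y_demo` and `search` are illustrative names, not objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=123)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_grid={"max_depth": [2, 5, None], "criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(f"{search.best_score_:.3f}")

# One row per candidate, ordered from best to worst.
results = (
    pd.DataFrame(search.cv_results_)
    [["param_criterion", "param_max_depth", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
```

`best_score_` is averaged over folds that were held out during each fit, so it is usually a more conservative figure than the accuracy on data the final estimator was trained on.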

### 3.7.1 Training Set

In [76]:
print(f"Train Accuracy: {gs.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {gs.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {gs.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 99.56%
Valid Accuracy: 99.62%
Test Accuracy: 67.93%


### 3.7.2 Validation Set

In [77]:
y_pred_gs_valid = gs.predict(x_valid)

In [78]:
f1_score(y_valid, y_pred_gs_valid, average='weighted')

0.9961753981232394

In [79]:
precision_score(y_valid, y_pred_gs_valid, average='weighted')

0.9961806071934198

In [80]:
recall_score(y_valid, y_pred_gs_valid, average='weighted')

0.9961763956156003

### 3.7.3 Test Set

In [81]:
y_pred_gs = gs.predict(x_test)

In [82]:
f1_score(y_test, y_pred_gs, average='weighted')

0.6791098537580611

In [83]:
precision_score(y_test, y_pred_gs, average='weighted')

0.6844797235316151

In [84]:
recall_score(y_test, y_pred_gs, average='weighted')

0.6792767071139663
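
To compare the models side by side, the test-set accuracies printed in sections 3.1 to 3.7 can be gathered in one place. The sketch below re-enters those printed values by hand and ranks them; nothing is recomputed, and the dictionary name `test_accuracy` is just an illustrative choice.

```python
# Test-set accuracy (%) as printed in sections 3.1 to 3.7 of this notebook.
test_accuracy = {
    "K-Nearest-Neighbors": 65.05,
    "Multinomial NB": 76.98,
    "RandomForestClassifier": 73.59,
    "OneVsRestClassifier": 77.69,
    "LinearSVC": 78.33,
    "StackingCVClassifier + GridSearchCV": 74.02,
    "Decision Tree + GridSearchCV": 67.93,
}

# Rank from highest to lowest test accuracy.
ranked = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<38}{acc:6.2f}%")

best_model = ranked[0][0]
```

On these numbers, the linear models on TF-IDF keyword features (LinearSVC and the Naive Bayes variants) generalize better to the test set than the tree-based models and KNN.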