# Machine Learning Models (Using `short_description`)

### Table of Contents

* [1 Python Setup](#1)
* [2 Data Visualization](#2)
    * [2.1 Data Cleaning](#2.1)
* [3 Model Fitting](#3)
    * [3.1 K-Nearest-Neighbors Classifier](#3.1)
    * [3.2 Multinomial NB](#3.2)
    * [3.3 RandomForestClassifier](#3.3)
    * [3.4 OneVsRestClassifier](#3.4)
    * [3.5 LinearSVC](#3.5)
    * [3.6 StackingCVClassifier + GridSearchCV](#3.6)
    * [3.7 Decision Tree Classifier + Grid Search CV](#3.7)

## 1 Python Setup <a class="anchor" id="1"></a>

In [1]:
import re
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

## 2 Data Visualization <a class="anchor" id="2"></a>

In [2]:
random.seed(123)
news = pd.read_csv('final_news_df.csv')
news = news.dropna()

In [3]:
news.head()

| Unnamed: 0 | category | headline | short_description | date | keywords |
|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover |


### 2.1 Data Cleaning <a class="anchor" id="2.1"></a>

In [4]:
def remove_sign(text):
    # Replace common punctuation with spaces.
    pattern = r'[,:!?.\/\|\*\(\)\"]'
    text = re.sub(pattern, ' ', text)
    # Replace every remaining non-letter character (digits, apostrophes, ...) with a space.
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text

news['text'] = news['short_description'].apply(remove_sign)

# Build the stop-word set once instead of on every call.
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    tokens = word_tokenize(text)
    # Lowercase and keep alphabetic tokens only.
    words = [word.lower() for word in tokens if word.isalpha()]
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

news['text'] = news['text'].apply(remove_stop_words)
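
To make the effect of the two cleaning steps concrete, here is a minimal standard-library sketch. It is not the notebook's exact pipeline: `TOY_STOP_WORDS` is a hypothetical, deliberately tiny stand-in for NLTK's English stop-word list, whitespace splitting stands in for `word_tokenize`, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's
# full English list via stopwords.words('english').
TOY_STOP_WORDS = {"it", "s", "in", "the", "a", "is"}

def clean(text, stop_words=TOY_STOP_WORDS):
    # Replace every non-letter character with a space (this also covers the
    # punctuation handled by the notebook's first regex).
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Whitespace tokenization stands in for nltk.word_tokenize here.
    words = [w.lower() for w in text.split()]
    return " ".join(w for w in words if w not in stop_words)

print(clean("It's getting crowded in the investment crowdfunding space."))
# getting crowded investment crowdfunding space
```

Note that the apostrophe in "It's" becomes a space, so the contraction is split into "it" and "s" before stop-word filtering.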

In [5]:
news.head()

| Unnamed: 0 | category | headline | short_description | date | keywords | text |
|---|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw | getting crowded investment crowdfunding space ... |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons | even though might literally sending wrong mess... |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days | deal could coming soon |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand | many leaders operate meeting brand directly af... |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover | deal still needs fcc approval would create sec... |


In [6]:
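vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['text'])
tfidf_transformer = TfidfTransformer()
# Note: counts and IDF weights are fitted on the full corpus, before the
# train/test split, so the features also reflect the held-out documents.
X_tfidf = tfidf_transformer.fit_transform(x)


encoder = LabelEncoder()
y = encoder.fit_transform(news['category'])

x_train, x_test, y_train, y_test = \
    train_test_split(X_tfidf, y, test_size=0.3, random_state=123, stratify=y)

# x_valid is carved out of x_train, so it is held out only for models
# fitted on x_train_sub; a model fitted on x_train has already seen it.
x_train_sub, x_valid, y_train_sub, y_valid = \
    train_test_split(x_train, y_train, test_size=0.2, random_state=123, stratify=y_train)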

print('Train/Valid/Test sizes:', y_train_sub.shape[0], y_valid.shape[0], y_test.shape[0])

Train/Valid/Test sizes: 15689 3923 8406
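
In the cell above, the TF-IDF features are fitted on the whole corpus before splitting, and `x_valid` is a subset of `x_train`. One possible alternative, sketched here on a hypothetical toy corpus, is to split the raw text first and fit the vectorizer only on the portion used for training. The sketch uses `TfidfVectorizer`, which scikit-learn documents as equivalent to `CountVectorizer` followed by `TfidfTransformer`; the corpus, labels, and split sizes are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 8 "business" and 8 "sports" snippets, all unique.
texts = [f"market earnings report item{i}" for i in range(8)] + \
        [f"team match score item{i}" for i in range(8, 16)]
labels = ["BUSINESS"] * 8 + ["SPORTS"] * 8

# 1) Split the raw text first, so no held-out document influences fitting.
txt_train, txt_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels)
# 2) Carve the validation set out of the training portion.
txt_fit, txt_valid, y_fit, y_valid = train_test_split(
    txt_train, y_train, test_size=0.25, random_state=123, stratify=y_train)

# 3) Learn vocabulary and IDF weights from the fitting portion only,
#    then reuse them unchanged for the validation and test text.
vectorizer = TfidfVectorizer()
X_fit = vectorizer.fit_transform(txt_fit)
X_valid = vectorizer.transform(txt_valid)
X_test = vectorizer.transform(txt_test)

print(X_fit.shape[0], X_valid.shape[0], X_test.shape[0])
```

With this layout, a model fitted on `X_fit` has never seen `X_valid` or `X_test`, so both scores are held-out estimates.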


---------------------------------------

## 3 Model Fitting <a class="anchor" id="3"></a>

## 3.1 K-Nearest-Neighbors Classifier <a class="anchor" id="3.1"></a>

In [7]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

### 3.1.1 Training Set

In [8]:
print(f'Training Accuracy: {knn_model.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {knn_model.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {knn_model.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 28.63%
Valid set accuracy: 29.24%
Test set accuracy: 21.20%


### 3.1.2 Validation Set

In [9]:
y_pred_knn_valid = knn_model.predict(x_valid)

In [10]:
f1_score(y_valid, y_pred_knn_valid, average='weighted')

0.23023260698142486

In [11]:
precision_score(y_valid, y_pred_knn_valid, average='weighted')

0.8073422047544284

In [12]:
recall_score(y_valid, y_pred_knn_valid, average='weighted')

0.2923782819270966

### 3.1.3 Test Set

In [13]:
y_pred_knn = knn_model.predict(x_test)

In [14]:
f1_score(y_test, y_pred_knn, average='weighted')

0.12896311335322155

In [15]:
precision_score(y_test, y_pred_knn, average='weighted')

0.7740710088713483

In [16]:
recall_score(y_test, y_pred_knn, average='weighted')

0.21199143468950749

--------------------

## 3.2 Multinomial NB <a class="anchor" id="3.2"></a>

### 3.2.1 Training Set

In [17]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB()

In [18]:
nb.score(x_train, y_train)

0.8635019375892311

In [19]:
print(f'Training Accuracy: {nb.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {nb.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {nb.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 86.35%
Valid set accuracy: 86.57%
Test set accuracy: 72.82%


### 3.2.2 Validation Set

In [20]:
y_pred_nb_valid = nb.predict(x_valid)

In [21]:
nb.score(x_valid, y_valid)

0.8656640326280908

In [22]:
f1_score(y_valid, y_pred_nb_valid, average='weighted')

0.8658070959602286

In [23]:
precision_score(y_valid, y_pred_nb_valid, average='weighted')

0.8676558777448972

In [24]:
recall_score(y_valid, y_pred_nb_valid, average='weighted')

0.8656640326280908

### 3.2.3 Test Set

In [25]:
y_pred_nb = nb.predict(x_test)

In [26]:
nb.score(x_test, y_test)

0.7281703545086843

In [27]:
f1_score(y_test, y_pred_nb, average='weighted')

0.7284042070861073

In [28]:
precision_score(y_test, y_pred_nb, average='weighted')

0.7352731622532759

In [29]:
recall_score(y_test, y_pred_nb, average='weighted')

0.7281703545086843

-------------------

## 3.3 RandomForestClassifier <a class="anchor" id="3.3"></a>

In [30]:
forest = RandomForestClassifier(n_estimators=50,
                                random_state=1)
forest.fit(x_train, y_train)

RandomForestClassifier(n_estimators=50, random_state=1)

### 3.3.1 Training Set

In [31]:
print(f'Training Accuracy: {forest.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {forest.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {forest.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 99.90%
Valid set accuracy: 99.77%
Test set accuracy: 69.51%


### 3.3.2 Validation Set

In [32]:
y_pred_forest_valid = forest.predict(x_valid)

In [33]:
f1_score(y_valid, y_pred_forest_valid, average='weighted')

0.9977087458639238

In [34]:
precision_score(y_valid, y_pred_forest_valid, average='weighted')

0.9977203890295185

In [35]:
recall_score(y_valid, y_pred_forest_valid, average='weighted')

0.9977058373693601

### 3.3.3 Test Set

In [36]:
y_pred_forest = forest.predict(x_test)

In [37]:
f1_score(y_test, y_pred_forest, average='weighted')

0.6965195086302622

In [38]:
precision_score(y_test, y_pred_forest, average='weighted')

0.7056670248429486

In [39]:
recall_score(y_test, y_pred_forest, average='weighted')

0.6950987389959553

-------------------------

## 3.4 OneVsRestClassifier <a class="anchor" id="3.4"></a>

In [40]:
ovr = OneVsRestClassifier(MultinomialNB())
y_pred_ovr = ovr.fit(x_train, y_train).predict(x_test)
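
As a rough illustration of what the one-vs-rest wrapper does, the sketch below fits it on hypothetical toy count data and reproduces its prediction by hand: one binary `MultinomialNB` per class, with the final label taken from the binary model that scores highest. The feature matrix and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count features for three classes (rows = documents).
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 0],
              [1, 2, 1], [0, 1, 3], [1, 0, 2]])
y = np.array([0, 0, 1, 1, 2, 2])

ovr_demo = OneVsRestClassifier(MultinomialNB()).fit(X, y)

# One fitted binary MultinomialNB per class.
print(len(ovr_demo.estimators_))

# Each binary model scores "this class vs. the rest"; MultinomialNB has no
# decision_function, so the positive-class probability is the score.
scores = np.column_stack(
    [est.predict_proba(X)[:, 1] for est in ovr_demo.estimators_])
manual_pred = ovr_demo.classes_[scores.argmax(axis=1)]
print(np.array_equal(manual_pred, ovr_demo.predict(X)))
```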

### 3.4.1 Training Set

In [41]:
print(f'Training Accuracy: {ovr.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {ovr.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {ovr.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 87.28%
Valid set accuracy: 87.56%
Test set accuracy: 73.82%


### 3.4.2 Validation Set

In [42]:
y_pred_ovr_valid = ovr.predict(x_valid)

In [43]:
accuracy_score(y_valid, y_pred_ovr_valid)

0.87560540402753

In [44]:
f1_score(y_valid, y_pred_ovr_valid, average='weighted')

0.8755330195542799

In [45]:
precision_score(y_valid, y_pred_ovr_valid, average='weighted')

0.8769154547889189

In [46]:
recall_score(y_valid, y_pred_ovr_valid, average='weighted')

0.87560540402753

### 3.4.3 Test Set

In [47]:
accuracy_score(y_test, y_pred_ovr)

0.7381632167499406

In [48]:
f1_score(y_test, y_pred_ovr, average='weighted')

0.7379994635984602

In [49]:
precision_score(y_test, y_pred_ovr, average='weighted')

0.7428185495110095

In [50]:
recall_score(y_test, y_pred_ovr, average='weighted')

0.7381632167499406

----------------

## 3.5 LinearSVC <a class="anchor" id="3.5"></a>

In [51]:
clf = LinearSVC(random_state=0, tol=1e-3)

### 3.5.1 Training Set

In [52]:
LinearSVC_model = clf.fit(x_train, y_train)

In [53]:
print(f'Training Accuracy: {LinearSVC_model.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {LinearSVC_model.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {LinearSVC_model.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 97.67%
Valid set accuracy: 97.55%
Test set accuracy: 73.16%


### 3.5.2 Validation Set

In [54]:
y_pred_clf_valid = LinearSVC_model.predict(x_valid)

In [55]:
accuracy_score(y_valid, y_pred_clf_valid)

0.9755289319398419

In [56]:
f1_score(y_valid, y_pred_clf_valid, average='weighted')

0.9754980516802335

In [57]:
precision_score(y_valid, y_pred_clf_valid, average='weighted')

0.9755952615228551

In [58]:
recall_score(y_valid, y_pred_clf_valid, average='weighted')

0.9755289319398419

### 3.5.3 Test Set

In [59]:
y_pred_clf = LinearSVC_model.predict(x_test)

In [60]:
accuracy_score(y_test, y_pred_clf)

0.7316202712348323

In [61]:
f1_score(y_test, y_pred_clf, average='weighted')

0.7310298346722387

In [62]:
precision_score(y_test, y_pred_clf, average='weighted')

0.73085368062498

In [63]:
recall_score(y_test, y_pred_clf, average='weighted')

0.7316202712348323

----------------------------------

## 3.6 StackingCVClassifier + GridSearchCV <a class="anchor" id="3.6"></a>

In [64]:
forest = RandomForestClassifier(n_estimators=100,
                                random_state=123)

boost = XGBClassifier(random_state=123, verbosity=0, use_label_encoder=False)

metaclassifier = LogisticRegression(random_state=123, max_iter=1000)

sclf = StackingCVClassifier(classifiers=[forest, boost], 
                            meta_classifier=metaclassifier, 
                            random_state=123)

In [65]:
pipe = make_pipeline(sclf)

params =  {
    'stackingcvclassifier__use_probas': [True, False],
    'stackingcvclassifier__drop_proba_col': [None, 'last'],
    'stackingcvclassifier__xgbclassifier__max_depth': [4, 6],
    'stackingcvclassifier__randomforestclassifier__n_estimators': [10]
}


grid = GridSearchCV(estimator=pipe,
                    param_grid=params,
                    cv=2,
                    n_jobs=1,
                    verbose=2)

#with parallel_backend('threading'):
grid.fit(x_train, y_train)

grid.best_score_

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time=  54.6s
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time= 1.0min
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.2min
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.3min
[CV] END stackingcvclassifier__drop_proba_col=No

0.6620436467468896
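
The log line "Fitting 2 folds for each of 8 candidates, totalling 16 fits" follows directly from the grid: each candidate is one combination of parameter values, and each candidate is fitted once per fold. A small standard-library sketch of that arithmetic, reusing the grid from the cell above (`cv_folds`, `candidates`, and `total_fits` are illustrative names):

```python
from itertools import product

# Same grid as in the GridSearchCV cell.
params = {
    'stackingcvclassifier__use_probas': [True, False],
    'stackingcvclassifier__drop_proba_col': [None, 'last'],
    'stackingcvclassifier__xgbclassifier__max_depth': [4, 6],
    'stackingcvclassifier__randomforestclassifier__n_estimators': [10],
}
cv_folds = 2

# Every combination of one value per parameter is a candidate.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 8 16
```

The count covers cross-validation fits only; with the default `refit=True`, `GridSearchCV` additionally refits the best candidate on the full training data.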

### 3.6.1 Training Set

In [66]:
print(f"Train Accuracy: {grid.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {grid.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {grid.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 95.05%
Valid Accuracy: 95.44%
Test Accuracy: 70.91%


### 3.6.2 Validation Set

In [67]:
y_pred_grid_valid = grid.predict(x_valid)

In [68]:
f1_score(y_valid, y_pred_grid_valid, average='weighted')

0.9542154091211849

In [69]:
precision_score(y_valid, y_pred_grid_valid, average='weighted')

0.9542769416540927

In [70]:
recall_score(y_valid, y_pred_grid_valid, average='weighted')

0.9543716543461637

### 3.6.3 Test Set

In [71]:
y_pred_grid = grid.predict(x_test)

In [72]:
f1_score(y_test, y_pred_grid, average='weighted')

0.7096300370180499

In [73]:
precision_score(y_test, y_pred_grid, average='weighted')

0.7123087484131599

In [74]:
recall_score(y_test, y_pred_grid, average='weighted')

0.7091363311920057

-----------------------------

## 3.7 Decision Tree Classifier + Grid Search CV <a class="anchor" id="3.7"></a>

In [75]:
tree = DecisionTreeClassifier(random_state=123)

param_grid = {
    'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None],
    'criterion': ['gini', 'entropy']
}

gs = GridSearchCV(estimator=tree,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10)

gs.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None]},
             scoring='accuracy')
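
The grid above has 9 depths and 2 criteria, so 18 candidates, each fitted on 10 folds. The notebook does not print which combination won; after fitting, that information is available in `best_params_`, `best_score_`, and `cv_results_`. A minimal sketch on hypothetical synthetic data (the dataset, the smaller grid, and the name `gs_demo` are assumptions for illustration, not the notebook's run):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the TF-IDF features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)

gs_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_grid={'max_depth': [1, 2, 3, None], 'criterion': ['gini', 'entropy']},
    scoring='accuracy',
    cv=5,
)
gs_demo.fit(X, y)

# Winning combination and its mean cross-validated accuracy.
print(gs_demo.best_params_)
print(round(gs_demo.best_score_, 3))

# One entry per candidate: 4 depths x 2 criteria = 8.
print(len(gs_demo.cv_results_['params']))
```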

### 3.7.1 Training Set

In [76]:
print(f"Train Accuracy: {gs.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {gs.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {gs.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 99.91%
Valid Accuracy: 99.87%
Test Accuracy: 60.08%


### 3.7.2 Validation Set

In [77]:
y_pred_gs_valid = gs.predict(x_valid)

In [78]:
f1_score(y_valid, y_pred_gs_valid, average='weighted')

0.9987260365974077

In [79]:
precision_score(y_valid, y_pred_gs_valid, average='weighted')

0.9987285184887693

In [80]:
recall_score(y_valid, y_pred_gs_valid, average='weighted')

0.9987254652052001

### 3.7.3 Test Set

In [81]:
y_pred_gs = gs.predict(x_test)

In [82]:
f1_score(y_test, y_pred_gs, average='weighted')

0.6006005489446619

In [83]:
precision_score(y_test, y_pred_gs, average='weighted')

0.6014320582146789

In [84]:
recall_score(y_test, y_pred_gs, average='weighted')

0.6007613609326672
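
To compare the models side by side, the accuracies printed in sections 3.1 to 3.7 can be collected and ranked. The sketch below only copies the numbers reported above; the `results` dictionary and its labels are illustrative, and the train-test gap is used as a rough indicator of overfitting.

```python
# Accuracies (in %) copied from the printed outputs in sections 3.1-3.7.
results = {
    'KNeighborsClassifier':     {'train': 28.63, 'test': 21.20},
    'MultinomialNB':            {'train': 86.35, 'test': 72.82},
    'RandomForestClassifier':   {'train': 99.90, 'test': 69.51},
    'OneVsRest(MultinomialNB)': {'train': 87.28, 'test': 73.82},
    'LinearSVC':                {'train': 97.67, 'test': 73.16},
    'StackingCVClassifier':     {'train': 95.05, 'test': 70.91},
    'DecisionTreeClassifier':   {'train': 99.91, 'test': 60.08},
}

# Rank by test accuracy, highest first.
ranked = sorted(results.items(), key=lambda kv: kv[1]['test'], reverse=True)
for name, acc in ranked:
    gap = acc['train'] - acc['test']
    print(f"{name:<26} test={acc['test']:.2f}%  train-test gap={gap:.2f}")
```

On these figures, the one-vs-rest Multinomial NB has the highest test accuracy, while the decision tree shows the largest gap between training and test accuracy.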