# Machine Learning Models (Using `headline`)

### Table of Contents

* [1 Python Setup](#1)
* [2 Data Visualization](#2)
    * [2.1 Data Cleaning](#2.1)
* [3 Model Fitting](#3)
    * [3.1 K-Nearest-Neighbors Classifier](#3.1)
    * [3.2 Multinomial NB](#3.2)
    * [3.3 RandomForestClassifier](#3.3)
    * [3.4 OneVsRestClassifier](#3.4)
    * [3.5 LinearSVC](#3.5)
    * [3.6 StackingCVClassifier + GridSearchCV](#3.6)
    * [3.7 Decision Tree Classifier + Grid Search CV](#3.7)

## 1 Python Setup <a class="anchor" id="1"></a>

In [1]:
import re
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

---------------------------------------------------------------------------------------------------------------------------------------

## 2 Data Visualization <a class="anchor" id="2"></a>

In [2]:
random.seed(123)
news = pd.read_csv('final_news_df.csv')
news = news.dropna()

In [3]:
news.head()  # snapshot of the dataset before cleaning

| Unnamed: 0 | category | headline | short_description | date | keywords |
|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover |


### 2.1 Data Cleaning <a class="anchor" id="2.1"></a>

In [4]:
def remove_sign(text):
    """Replace punctuation and every other non-letter character with a space."""
    pattern = r'[,:!?.\/\|\*\(\)\"]'
    text = re.sub(pattern, ' ', text)
    # This second pass also covers the characters matched by `pattern`.
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text

news['text'] = news['headline'].apply(remove_sign)

# Build the stop-word set once instead of on every call.
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    """Lowercase the text and drop English stop words."""
    tokens = word_tokenize(text)
    words = [word.lower() for word in tokens if word.isalpha()]
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

news['text'] = news['text'].apply(remove_stop_words)

In [5]:
news.head()

| Unnamed: 0 | category | headline | short_description | date | keywords | text |
|---|---|---|---|---|---|---|
| 1 | BUSINESS | Investment Crowdfunding Draws a Crowd | It's getting crowded in the investment crowdfu... | 2015-03-02 | investment-crowdfunding-draw | investment crowdfunding draws crowd |
| 2 | BUSINESS | Office Romances Often Sparked By Emoticon-Lace... | But even though they might literally be sendin... | 2012-08-26 | office-romances-emoticons | office romances often sparked emoticon laced e... |
| 3 | BUSINESS | Verizon Could Buy Yahoo In The Next Few Days | A deal could be coming soon. | 2016-07-22 | verizon-could-buy-yahoo-next-few-days | verizon could buy yahoo next days |
| 4 | BUSINESS | What's Your Meeting Brand? | Many leaders operate as if their meeting brand... | 2014-06-11 | whats-your-meeting-brand | meeting brand |
| 5 | BUSINESS | Feds OK Charter And Time Warner Mega Merger | The deal, which still needs FCC approval, woul... | 2016-04-25 | charter-time-warner-cable-takeover | feds ok charter time warner mega merger |
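
For quick experiments, the cleaning step can be approximated without NLTK. The sketch below is a minimal, dependency-free illustration: `clean_headline` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English list and `word_tokenize`), so results may differ for words outside this set.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"a", "and", "by", "few", "in", "s", "the", "what", "your"}

def clean_headline(text):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = [w.lower() for w in letters_only.split()]
    return " ".join(w for w in words if w not in STOP_WORDS)

cleaned = [
    clean_headline("Verizon Could Buy Yahoo In The Next Few Days"),
    clean_headline("What's Your Meeting Brand?"),
]
print(cleaned)
```

On these two headlines the sketch reproduces the `text` values shown in the table: `verizon could buy yahoo next days` and `meeting brand`.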


In [6]:
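# Note: the vocabulary and IDF weights are fit on every row of `news`, before the
# split below, so X_train_tfidf (despite its name) covers train, validation and test.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['text'])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(x)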

encoder = LabelEncoder()
y = encoder.fit_transform(news['category'])

x_train, x_test, y_train, y_test = \
    train_test_split(X_train_tfidf, y, test_size=0.3, random_state=123, stratify=y)

x_train_sub, x_valid, y_train_sub, y_valid = \
    train_test_split(x_train, y_train, test_size=0.2, random_state=123, stratify=y_train)

print('Train/Valid/Test sizes:', y_train_sub.shape[0], y_valid.shape[0], y_test.shape[0])

Train/Valid/Test sizes: 15689 3923 8406
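
For reference, the weighting that `TfidfTransformer` applies with its default settings (`smooth_idf=True`, `norm='l2'`) can be reproduced by hand. The sketch below is an illustration in plain Python on three hypothetical documents, not the notebook's data; the function name `tfidf_rows` is an assumption made for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical documents for illustration only.
docs = ["verizon buy yahoo", "feds ok merger", "verizon merger"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first document, `buy` and `yahoo` occur in one document each and so receive a larger weight than `verizon`, which occurs in two of the three.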


---------------------------------------------------------------------------------------------------------------------------------------

## 3 Model Fitting <a class="anchor" id="3"></a>

## 3.1 K-Nearest-Neighbors Classifier <a class="anchor" id="3.1"></a>

In [7]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(x_train_sub, y_train_sub)

KNeighborsClassifier(n_neighbors=3)

### 3.1.1 Training Set

In [8]:
# knn_model was fit on x_train_sub; x_train also contains the x_valid rows.
print(f'Training Accuracy: {knn_model.score(x_train, y_train)*100:.2f}%')
print(f'Test set accuracy: {knn_model.score(x_test, y_test)*100:.2f}%')
print(f'Valid set accuracy: {knn_model.score(x_valid, y_valid)*100:.2f}%')

Training Accuracy: 59.26%
Test set accuracy: 30.16%
Valid set accuracy: 30.92%


### 3.1.2 Validation Set

In [9]:
y_pred_knn_valid = knn_model.predict(x_valid)

In [10]:
f1_score(y_valid, y_pred_knn_valid, average='weighted')

0.27606652545707516

In [11]:
precision_score(y_valid, y_pred_knn_valid, average='weighted')

0.6661949061650427

In [12]:
recall_score(y_valid, y_pred_knn_valid, average='weighted')

0.30920214121845524

### 3.1.3 Test Set

In [13]:
y_pred_knn = knn_model.predict(x_test)

In [14]:
precision_score(y_test, y_pred_knn, average='weighted')

0.6823646076280551

In [15]:
recall_score(y_test, y_pred_knn, average='weighted')

0.30157030692362596

In [16]:
f1_score(y_test, y_pred_knn, average='weighted')

0.2661809457802585
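
The gap between weighted precision (about 0.68) and weighted recall (about 0.30) is easier to read once the `average='weighted'` computation is spelled out: each class's score is weighted by its number of true samples. The sketch below is a plain-Python illustration on hypothetical labels, not the notebook's predictions; `weighted_scores` is an assumed helper name. It shows one pattern that yields high precision with low recall, and that weighted recall equals accuracy.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 (0 where a ratio is undefined)."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        precision += count / n * p
        recall += count / n * r
        f1 += count / n * f
    return precision, recall, f1

# Hypothetical labels: most class-0 rows are pushed into class 1.
y_true = [0] * 6 + [1] * 2
y_pred = [1, 1, 1, 1, 0, 0] + [1, 1]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Here the few rows predicted as class 0 are all correct, which keeps weighted precision near 0.83, while only half of all rows are classified correctly, so weighted recall and accuracy are both 0.5.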

---------------------------------------------------------------------------------------------------------------------------------------

## 3.2 Multinomial NB <a class="anchor" id="3.2"></a>

### 3.2.1 Training Set

In [17]:
nb = MultinomialNB()
nb.fit(x_train_sub, y_train_sub)

MultinomialNB()

In [18]:
# nb was fit on x_train_sub; x_train also contains the x_valid rows.
print(f'Training Accuracy: {nb.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {nb.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {nb.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 87.87%
Valid set accuracy: 77.72%
Test set accuracy: 78.46%


### 3.2.2 Validation Set

In [19]:
y_pred_nb_valid = nb.predict(x_valid)

In [20]:
# F1 score
f1_score(y_valid, y_pred_nb_valid, average='weighted')

0.7753715649190562

In [21]:
# Precision
precision_score(y_valid, y_pred_nb_valid, average='weighted')

0.775284397773305

In [22]:
# Recall
recall_score(y_valid, y_pred_nb_valid, average='weighted')

0.7772113178689778

### 3.2.3 Test Set

In [23]:
y_pred_nb = nb.predict(x_test)

In [24]:
# F1 Score
f1_score(y_test, y_pred_nb, average='weighted')

0.7834895298775895

In [25]:
# Precision
precision_score(y_test, y_pred_nb, average='weighted')

0.7835947052109126

In [26]:
# Recall
recall_score(y_test, y_pred_nb, average='weighted')

0.7845586485843445
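
In this notebook the count vectorizer and TF-IDF transformer are fit on the whole corpus before the train/test split. One alternative, sketched below under assumptions, is to wrap them with the classifier in a scikit-learn `Pipeline` and fit only on training text, so the vocabulary and IDF weights never see held-out rows. The toy headlines, labels, and step names are hypothetical and are not taken from `final_news_df.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data for illustration only.
train_text = [
    "verizon could buy yahoo",
    "feds ok charter merger",
    "team wins championship game",
    "coach praises star player",
]
train_labels = ["BUSINESS", "BUSINESS", "SPORTS", "SPORTS"]
test_text = ["yahoo merger talks", "player wins game"]

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # vocabulary learned from training text only
    ("tfidf", TfidfTransformer()),   # IDF weights learned from training text only
    ("nb", MultinomialNB()),
])
text_clf.fit(train_text, train_labels)
predictions = list(text_clf.predict(test_text))
print(predictions)
```

Words in the test text that were never seen during fitting (such as `talks`) are simply ignored by the vectorizer at prediction time.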

---------------------------------------------------------------------------------------------------------------------------------------

## 3.3 RandomForestClassifier <a class="anchor" id="3.3"></a>

In [27]:
forest = RandomForestClassifier(n_estimators=50,
                                random_state=1)
forest.fit(x_train_sub, y_train_sub)

RandomForestClassifier(n_estimators=50, random_state=1)

### 3.3.1 Training Set

In [28]:
# Accuracy
# forest was fit on x_train_sub; x_train also contains the x_valid rows.
print(f'Training Accuracy: {forest.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {forest.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {forest.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 94.99%
Valid set accuracy: 75.45%
Test set accuracy: 75.57%


### 3.3.2 Validation Set

In [29]:
y_pred_forest_valid = forest.predict(x_valid)

In [30]:
# F1 score
f1_score(y_valid, y_pred_forest_valid, average='weighted')

0.7560903724746849

In [31]:
# Precision
precision_score(y_valid, y_pred_forest_valid, average='weighted')

0.7599458906270985

In [32]:
# Recall
recall_score(y_valid, y_pred_forest_valid, average='weighted')

0.7545245985215396

### 3.3.3 Test Set

In [33]:
y_pred_forest = forest.predict(x_test)

In [34]:
# F1 Score
f1_score(y_test, y_pred_forest, average='weighted')

0.7569112347328744

In [35]:
# Precision
precision_score(y_test, y_pred_forest, average='weighted')

0.7621355272203307

In [36]:
# Recall
recall_score(y_test, y_pred_forest, average='weighted')

0.755650725672139

---------------------------------------------------------------------------------------------------------------------------------------

## 3.4 OneVsRestClassifier <a class="anchor" id="3.4"></a>

In [37]:
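ovr = OneVsRestClassifier(MultinomialNB())
# x_train includes the x_valid rows, so validation scores for this model are not held-out estimates.
y_pred_ovr = ovr.fit(x_train, y_train).predict(x_test)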

### 3.4.1 Training Set

In [38]:
print(f'Training Accuracy: {ovr.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {ovr.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {ovr.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 90.37%
Valid set accuracy: 90.82%
Test set accuracy: 79.47%
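
The validation accuracy here sits next to the training accuracy and well above the test accuracy. A likely reason is that `x_valid` was split out of `x_train`, and this model is fit on all of `x_train`, so it has already seen every validation row. The sketch below illustrates the effect with a hypothetical memorizing classifier on made-up data; `Memorizer` and `accuracy` are assumed names and are not part of the notebook.

```python
from collections import Counter

class Memorizer:
    """Toy classifier: recalls labels of rows it has seen, else guesses the majority."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.fallback = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(x, self.fallback) for x in X]

def accuracy(model, X, y):
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

# Hypothetical data: row ids as features, labels cycling through 4 classes.
X = list(range(100))
y = [i % 4 for i in X]

# Mimic the second split: every fifth training row becomes validation.
valid_idx = [i for i in X if i % 5 == 0]
sub_idx = [i for i in X if i % 5 != 0]
x_valid, y_valid = valid_idx, [y[i] for i in valid_idx]
x_sub, y_sub = sub_idx, [y[i] for i in sub_idx]

leaky = accuracy(Memorizer().fit(X, y), x_valid, y_valid)           # fit includes validation rows
held_out = accuracy(Memorizer().fit(x_sub, y_sub), x_valid, y_valid)  # fit excludes them
print(leaky, held_out)
```

Fitting on the rows that exclude the validation set (the notebook's `x_train_sub`) and scoring on `x_valid` gives a validation estimate that is comparable to the test score.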


### 3.4.2 Validation Set

In [39]:
y_pred_ovr_valid = ovr.predict(x_valid)

In [40]:
# Accuracy Verification
accuracy_score(y_valid, y_pred_ovr_valid)

0.9082334947744073

In [41]:
# F1 Score
f1_score(y_valid, y_pred_ovr_valid, average='weighted')

0.9076569062559122

In [42]:
# Precision
precision_score(y_valid, y_pred_ovr_valid, average='weighted')

0.9083366656046938

In [43]:
# Recall
recall_score(y_valid, y_pred_ovr_valid, average='weighted')

0.9082334947744073

### 3.4.3 Test Set

In [44]:
# Accuracy Verification
accuracy_score(y_test, y_pred_ovr)

0.79467047347133

In [45]:
# F1 Score
f1_score(y_test, y_pred_ovr, average='weighted')

0.7935306082617217

In [46]:
# Precision
precision_score(y_test, y_pred_ovr, average='weighted')

0.79339859770077

In [47]:
# Recall
recall_score(y_test, y_pred_ovr, average='weighted')

0.79467047347133

---------------------------------------------------------------------------------------------------------------------------------------

## 3.5 LinearSVC <a class="anchor" id="3.5"></a>

In [48]:
clf = LinearSVC(random_state=0, tol=1e-5)

### 3.5.1 Training Set

In [49]:
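# x_train includes the x_valid rows, so validation scores for this model are not held-out estimates.
LinearSVC_model = clf.fit(x_train, y_train)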

In [50]:
print(f'Training Accuracy: {LinearSVC_model.score(x_train, y_train)*100:.2f}%')
print(f'Valid set accuracy: {LinearSVC_model.score(x_valid, y_valid)*100:.2f}%')
print(f'Test set accuracy: {LinearSVC_model.score(x_test, y_test)*100:.2f}%')

Training Accuracy: 97.69%
Valid set accuracy: 97.45%
Test set accuracy: 80.56%


### 3.5.2 Validation Set

In [51]:
y_pred_clf_valid = LinearSVC_model.predict(x_valid)

In [52]:
# Accuracy Verification
accuracy_score(y_valid, y_pred_clf_valid)

0.974509304104002

In [53]:
# F1 score
f1_score(y_valid, y_pred_clf_valid, average='weighted')

0.9744733952827339

In [54]:
# Precision
precision_score(y_valid, y_pred_clf_valid, average='weighted')

0.9745175041902239

In [55]:
# Recall
recall_score(y_valid, y_pred_clf_valid, average='weighted')

0.974509304104002

### 3.5.3 Test Set

In [56]:
y_pred_clf = LinearSVC_model.predict(x_test)

In [57]:
# Accuracy Verification
accuracy_score(y_test, y_pred_clf)

0.8056150368784202

In [58]:
# F1 Score
f1_score(y_test, y_pred_clf, average='weighted')

0.8058794154841358

In [59]:
# Recall
recall_score(y_test, y_pred_clf, average='weighted')

0.8056150368784202

In [60]:
# Precision
precision_score(y_test, y_pred_clf, average='weighted')

0.8069298033141059

---------------------------------------------------------------------------------------------------------------------------------------

## 3.6 StackingCVClassifier + GridSearchCV <a class="anchor" id="3.6"></a>

In [61]:
forest = RandomForestClassifier(n_estimators=100,
                                random_state=123)

boost = XGBClassifier(random_state=123, verbosity=0, use_label_encoder=False)

metaclassifier = LogisticRegression(random_state=123, max_iter=1000)

sclf = StackingCVClassifier(classifiers=[forest, boost], 
                            meta_classifier=metaclassifier, 
                            random_state=123)

In [62]:
pipe = make_pipeline(sclf)

params =  {
    'stackingcvclassifier__use_probas': [True, False],
    'stackingcvclassifier__drop_proba_col': [None, 'last'],
    'stackingcvclassifier__xgbclassifier__max_depth': [4, 6],
    'stackingcvclassifier__randomforestclassifier__n_estimators': [10]
}


grid = GridSearchCV(estimator=pipe,
                    param_grid=params,
                    cv=2,
                    n_jobs=1,
                    verbose=2)

#with parallel_backend('threading'):
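# x_train includes the x_valid rows, so validation scores for this model are not held-out estimates.
grid.fit(x_train, y_train)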

grid.best_score_

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time=  56.8s
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=4; total time=  49.3s
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.1min
[CV] END stackingcvclassifier__drop_proba_col=None, stackingcvclassifier__randomforestclassifier__n_estimators=10, stackingcvclassifier__use_probas=True, stackingcvclassifier__xgbclassifier__max_depth=6; total time= 1.1min
[CV] END stackingcvclassifier__drop_proba_col=No

0.7272078319396288
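
The log line "Fitting 2 folds for each of 8 candidates, totalling 16 fits" follows directly from the parameter grid: `GridSearchCV` tries every combination of the listed values and fits each one once per fold. The sketch below reproduces that count with `itertools.product`; it only enumerates the combinations and does not train anything.

```python
from itertools import product

params = {
    'stackingcvclassifier__use_probas': [True, False],
    'stackingcvclassifier__drop_proba_col': [None, 'last'],
    'stackingcvclassifier__xgbclassifier__max_depth': [4, 6],
    'stackingcvclassifier__randomforestclassifier__n_estimators': [10],
}
cv = 2  # number of folds, as passed to GridSearchCV above

names = list(params)
# One dict per candidate: the Cartesian product of all value lists.
candidates = [dict(zip(names, values)) for values in product(*params.values())]
n_fits = len(candidates) * cv
print(len(candidates), n_fits)  # 8 16
```

The `best_score_` value reported above is the mean cross-validated score of the best of these candidates, which is why it can differ from the accuracy measured later on a single split.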

### 3.6.1 Training Set

In [63]:
print(f"Train Accuracy: {grid.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {grid.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {grid.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 95.86%
Valid Accuracy: 95.87%
Test Accuracy: 76.71%


### 3.6.2 Validation Set

In [64]:
y_pred_grid_valid = grid.predict(x_valid)

In [65]:
# F1 Score
f1_score(y_valid, y_pred_grid_valid, average='weighted')

0.9583287685888081

In [66]:
# Precision
precision_score(y_valid, y_pred_grid_valid, average='weighted')

0.9590182416290163

In [67]:
# Recall
recall_score(y_valid, y_pred_grid_valid, average='weighted')

0.9587050726484833

### 3.6.3 Test Set

In [68]:
y_pred_grid = grid.predict(x_test)

In [69]:
# F1 Score
f1_score(y_test, y_pred_grid, average='weighted')

0.7692494116003705

In [70]:
# Precision
precision_score(y_test, y_pred_grid, average='weighted')

0.7775897317106159

In [71]:
# Recall
recall_score(y_test, y_pred_grid, average='weighted')

0.7670711396621461

---------------------------------------------------------------------------------------------------------------------------------------

## 3.7 Decision Tree Classifier + Grid Search CV <a class="anchor" id="3.7"></a>

In [72]:
tree = DecisionTreeClassifier(random_state=123)

param_grid = {
    'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None],
    'criterion': ['gini', 'entropy']
}

gs = GridSearchCV(estimator=tree,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10)

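# x_train includes the x_valid rows, so validation scores for this model are not held-out estimates.
gs.fit(x_train, y_train)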

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 10, 15, 20, None]},
             scoring='accuracy')

### 3.7.1 Training Set

In [73]:
print(f"Train Accuracy: {gs.best_estimator_.score(x_train, y_train)*100:0.2f}%")
print(f"Valid Accuracy: {gs.best_estimator_.score(x_valid, y_valid)*100:0.2f}%")
print(f"Test Accuracy: {gs.best_estimator_.score(x_test, y_test)*100:0.2f}%")

Train Accuracy: 99.88%
Valid Accuracy: 99.90%
Test Accuracy: 71.38%


### 3.7.2 Validation Set

In [74]:
y_pred_gs_valid = gs.predict(x_valid)

In [75]:
f1_score(y_valid, y_pred_gs_valid, average='weighted')

0.9989798504643607

In [76]:
precision_score(y_valid, y_pred_gs_valid, average='weighted')

0.9989844650863847

In [77]:
recall_score(y_valid, y_pred_gs_valid, average='weighted')

0.9989803721641601

### 3.7.3 Test Set

In [78]:
y_pred_gs = gs.predict(x_test)

In [79]:
f1_score(y_test, y_pred_gs, average='weighted')

0.7135783246098991

In [80]:
precision_score(y_test, y_pred_gs, average='weighted')

0.7179624782147437

In [81]:
recall_score(y_test, y_pred_gs, average='weighted')

0.7137758743754461