<div style='background-color : orange'>
<a id='TableOfContents'></a>
    <b><u><i><h1 style='text-align : center'>
        Table of Contents
    </h1></i></u></b>
<li><a href='#imports'>Imports</a></li>
<li><a href='#initial'>Initial Setup</a></li>
<li><a href='#dtc'>Decision Tree Classifier</a></li>
<li><a href='#rfc'>Random Forest Classifier</a></li>
<li><a href='#lr'>Logistic Regression</a></li>
<li><a href='#nb'>Naive Bayes (MultinomialNB)</a></li>
<li><a href='#takeaway'>Takeaway</a></li>
<li><a href='#misc'>Miscellaneous</a></li>

Take the work we did in the lessons further:

- What other types of models (i.e. different classification algorithms) could you use?
- How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

<div style='background-color : orange'>
<a id='imports'></a>
    <b><u><i><h1 style='text-align : center'>
        Imports
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [206]:
# Vectorization and dataframing
import numpy as np
import pandas as pd

# Splitting data
from sklearn.model_selection import train_test_split

# Classification Modeling
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.naive_bayes import MultinomialNB as MNB

# Model Metrics
from sklearn.metrics import classification_report, accuracy_score

# NLP related fit/transformers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# .py files
import acquire as a
import prepare as p

In [126]:
# Set default plt style to 'bmh'
import matplotlib.pyplot as plt
plt.style.use('bmh')

<div style='background-color : orange'>
<a id='initial'></a>
    <b><u><i><h1 style='text-align : center'>
        Initial Setup
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>
<li><a href='#initialinitial'>Initial</a></li>
<li><a href='#initialreferencecv'>Reference - Count Vectorizer (cv)</a></li>
<li><a href='#initialreferencetfidf'>Reference - Term Frequency - Inverse Document Frequency (TF-IDF)</a></li>
<li><a href='#initialreferencebag'>Reference - Bag of Ngrams (For either cv or tfidf)</a></li>

<a id='initialinitial'></a>
<h3><b><i>
    Initial
</i></b></h3>
<li><a href='#initial'>Initial Setup Top</a></li>

In [127]:
# Obtain the prepared news dataframe
news_df = p.prepare_news_articles()
news_df.shape

(100, 5)

In [128]:
# Split the prepared news dataframe
train_validate, test = train_test_split(news_df,
                                  random_state=1349,
                                  train_size=0.9,
                                  stratify=news_df.category)
train, validate = train_test_split(train_validate,
                                   random_state=1349,
                                   train_size=0.778,
                                   stratify=train_validate.category)
train.shape, validate.shape, test.shape

((70, 5), (20, 5), (10, 5))
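
The second `train_size=0.778` looks odd until it is read as a fraction of the 90 rows left after the first split: 0.778 × 90 ≈ 70. The sketch below reproduces the 70/20/10 shapes on a hypothetical stand-in frame (the four category names and the placeholder text are assumptions, not the real data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for news_df: 100 rows, 4 balanced categories
toy = pd.DataFrame({
    "category": ["business", "sports", "technology", "entertainment"] * 25,
    "summary_clean": ["placeholder text"] * 100,
})

# First split: 90 rows for train + validate, 10 rows for test
train_validate, test = train_test_split(
    toy, random_state=1349, train_size=0.9, stratify=toy.category)

# Second split: 0.778 of the remaining 90 rows is about 70 rows
train, validate = train_test_split(
    train_validate, random_state=1349, train_size=0.778,
    stratify=train_validate.category)

print(len(train), len(validate), len(test))
```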

---

<a id='initialreferencecv'></a>
<h3><b><i>
    Reference - Count Vectorizer (CV)
</i></b></h3>
<li><a href='#initial'>Initial Setup Top</a></li>

In [129]:
# Bag of Words
cv = CountVectorizer()
bag_of_words = cv.fit_transform(news_df.summary_clean)

In [130]:
# Matrix representing words
bag_of_words.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [131]:
# Gets words
cv.get_feature_names_out()

array(['10', '1011', '10part', ..., 'zardozi', 'zealand', 'zuckerberg'],
      dtype=object)

In [132]:
# Maps each word to its column index in the matrix (not a count)
cv.vocabulary_

{'microsoft': 1107,
 'give': 750,
 'salary': 1483,
 'hike': 816,
 'fulltime': 722,
 'employee': 592,
 'year': 1898,
 'ceo': 349,
 'satya': 1491,
 'nadella': 1152,
 'said': 1481,
 'internal': 891,
 'email': 590,
 'wednesday': 1847,
 'economic': 576,
 'condition': 421,
 'different': 521,
 'across': 102,
 'many': 1068,
 'dimension': 523,
 'well': 1850,
 'increase': 862,
 'certain': 351,
 'hourly': 828,
 'equivalent': 605,
 'role': 1465,
 'wont': 1874,
 'added': 115,
 'wrestler': 1889,
 'vinesh': 1812,
 'phogat': 1277,
 'request': 1430,
 'ratan': 1385,
 'tata': 1680,
 'check': 368,
 'whether': 1858,
 'fund': 723,
 'donates': 548,
 'wrestling': 1890,
 'federation': 666,
 'reach': 1392,
 'athlete': 211,
 'must': 1151,
 'seek': 1508,
 'information': 873,
 'india': 865,
 'wfi': 1854,
 'one': 1214,
 'protesting': 1353,
 'delhi': 500,
 'jantar': 916,
 'mantar': 1065,
 'chief': 373,
 'brij': 307,
 'bhushan': 272,
 'sharan': 1528,
 'singh': 1562,
 'bengalurus': 263,
 'mg': 1103,
 'road': 1458,
 'r

In [133]:
# Count matrix as a dataframe, one column per word
bow = pd.DataFrame(bag_of_words.todense())
bow.columns = cv.get_feature_names_out()
bow.sample()

Unnamed: 0,10,1011,10part,11,118,12,1214,1218,123,12th,...,yogi,york,young,youth,youve,yoy,yuvraj,zardozi,zealand,zuckerberg
85,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [134]:
# Term frequency: each count divided by the total word count of its document
bow.apply(lambda row: row / row.sum(), axis=1).sample()

Unnamed: 0,10,1011,10part,11,118,12,1214,1218,123,12th,...,yogi,york,young,youth,youve,yoy,yuvraj,zardozi,zealand,zuckerberg
93,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

<a id='initialreferencetfidf'></a>
<h3><b><i>
    Reference - Term Frequency - Inverse Document Frequency (TF-IDF)
</i></b></h3>
<li><a href='#initial'>Initial Setup Top</a></li>

In [135]:
# TF-IDF weights as a dataframe, one column per word
tfidf = TfidfVectorizer()
bag_of_words = tfidf.fit_transform(news_df.summary_clean)
pd.DataFrame(bag_of_words.todense(),
            columns = tfidf.get_feature_names_out()).sample()

Unnamed: 0,10,1011,10part,11,118,12,1214,1218,123,12th,...,yogi,york,young,youth,youve,yoy,yuvraj,zardozi,zealand,zuckerberg
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [136]:
# Inverse document frequency of each word
pd.Series(
    dict(
        zip(
            tfidf.get_feature_names_out(), tfidf.idf_)))

10            3.823361
1011          4.921973
10part        4.921973
11            4.516508
118           4.921973
                ...   
yoy           4.921973
yuvraj        4.921973
zardozi       4.921973
zealand       4.921973
zuckerberg    4.921973
Length: 1912, dtype: float64
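
The IDF values above can be checked by hand. With its default `smooth_idf=True`, `TfidfVectorizer` computes `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below assumes `n = 100` (the row count of `news_df`); the document frequencies of 1 and 5 are inferred from the printed values, not read from the data.

```python
import math

def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """IDF as computed by TfidfVectorizer with its default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 100  # rows in news_df

# A word found in a single article, such as 'zuckerberg'
print(round(smooth_idf(n_docs, 1), 6))  # 4.921973

# A word found in five articles matches the value shown for '10'
print(round(smooth_idf(n_docs, 5), 6))  # 3.823361
```

Most of the vocabulary sits at 4.921973 because most words appear in only one summary.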

---

<a id='initialreferencebag'></a>
<h3><b><i>
    Reference - Bag of Ngrams (For either cv or tfidf)
</i></b></h3>
<li><a href='#initial'>Initial Setup Top</a></li>

In [137]:
# Bag of n-grams: bigrams and trigrams only
cv = CountVectorizer(ngram_range=(2, 3))
bag_of_grams = cv.fit_transform(news_df.summary_clean)

In [139]:
pd.DataFrame(bag_of_grams.todense(),
            columns=cv.get_feature_names_out()).sample()

Unnamed: 0,10 around,10 around 3200,10 ball,10 ball added,10 ball impact,10 ball mi,10 billion,10 billion incentive,10 billion india,10 yoy,...,yuvraj singh,yuvraj singh calling,zardozi border,zardozi border vintage,zealand cricketer,zealand cricketer simon,zuckerberg musk,zuckerberg musk tweeted,zuckerberg privatised,zuckerberg privatised government
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0


<div style='background-color : orange'>
<a id='dtc'></a>
    <b><u><i><h1 style='text-align : center'>
        Decision Tree Classifier
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>
<li><a href='#dtccv'>Using CV</a></li>
<li><a href='#dtccvbag'>Using CV Bags</a></li>
<li><a href='#dtctfidf'>Using TF-IDF</a></li>
<li><a href='#dtctfidfbag'>Using TF-IDF Bags</a></li>

<a id='dtccv'></a>
<h3><b><i>
    Using CV
</i></b></h3>
<li><a href='#dtc'>Decision Tree Classifier Top</a></li>

In [144]:
# Define x and y
cv = CountVectorizer()
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [153]:
# Fit the model and report accuracy on each split
tree = DTC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.61
Validate: 0.40
Test: 0.30


In [154]:
# Feature importances
pd.Series(
    dict(
    zip(cv.get_feature_names_out(),
       tree.feature_importances_))).sort_values(ascending=False)

twitter         0.291518
actress         0.266342
film            0.235008
captain         0.207132
personalised    0.000000
                  ...   
exit            0.000000
exindia         0.000000
exhibit         0.000000
exengland       0.000000
zuckerberg      0.000000
Length: 1487, dtype: float64

---

<a id='dtccvbag'></a>
<h3><b><i>
    Using CV Bags
</i></b></h3>
<li><a href='#dtc'>Decision Tree Classifier Top</a></li>

In [195]:
# Define x and y
cv = CountVectorizer(ngram_range=(1, 5))
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [196]:
# Fit the model and report accuracy on each split
tree = DTC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.61
Validate: 0.40
Test: 0.30


---

<a id='dtctfidf'></a>
<h3><b><i>
    Using TF-IDF
</i></b></h3>
<li><a href='#dtc'>Decision Tree Classifier Top</a></li>

In [155]:
# Define x and y
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [156]:
# Fit the model and report accuracy on each split
tree = DTC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.61
Validate: 0.40
Test: 0.30


---

<a id='dtctfidfbag'></a>
<h3><b><i>
    Using TF-IDF Bags
</i></b></h3>
<li><a href='#dtc'>Decision Tree Classifier Top</a></li>

In [193]:
# Define x and y
tfidf = TfidfVectorizer(ngram_range=(1, 5))
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [194]:
# Fit the model and report accuracy on each split
tree = DTC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.61
Validate: 0.40
Test: 0.30


<div style='background-color : orange'>
<a id='rfc'></a>
    <b><u><i><h1 style='text-align : center'>
        Random Forest Classifier
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>
<li><a href='#rfccv'>Using CV</a></li>
<li><a href='#rfccvbag'>Using CV Bags</a></li>
<li><a href='#rfctfidf'>Using TF-IDF</a></li>
<li><a href='#rfctfidfbag'>Using TF-IDF Bags</a></li>

<a id='rfccv'></a>
<h3><b><i>
    Using CV
</i></b></h3>
<li><a href='#rfc'>Random Forest Classifier Top</a></li>

In [157]:
# Define x and y
cv = CountVectorizer()
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [158]:
# Fit the model and report accuracy on each split
tree = RFC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.55
Test: 0.60


In [159]:
# Feature importances
pd.Series(
    dict(
    zip(cv.get_feature_names_out(),
       tree.feature_importances_))).sort_values(ascending=False)

actress       0.029947
film          0.029221
twitter       0.025152
wednesday     0.016734
crore         0.014511
                ...   
former        0.000000
forced        0.000000
football      0.000000
foot          0.000000
zuckerberg    0.000000
Length: 1487, dtype: float64

---

<a id='rfccvbag'></a>
<h3><b><i>
    Using CV Bags
</i></b></h3>
<li><a href='#rfc'>Random Forest Classifier Top</a></li>

In [197]:
# Define x and y
cv = CountVectorizer(ngram_range=(1, 5))
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [198]:
# Fit the model and report accuracy on each split
tree = RFC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.94
Validate: 0.35
Test: 0.40


---

<a id='rfctfidf'></a>
<h3><b><i>
    Using TF-IDF
</i></b></h3>
<li><a href='#rfc'>Random Forest Classifier Top</a></li>

In [160]:
# Define x and y
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [161]:
# Fit the model and report accuracy on each split
tree = RFC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.55
Test: 0.40


---

<a id='rfctfidfbag'></a>
<h3><b><i>
    Using TF-IDF Bags
</i></b></h3>
<li><a href='#rfc'>Random Forest Classifier Top</a></li>

In [199]:
# Define x and y
tfidf = TfidfVectorizer(ngram_range=(1, 5))
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [200]:
# Fit the model and report accuracy on each split
tree = RFC(max_depth=4, random_state=1349)
tree.fit(x_train, y_train)
print(f'Train: {tree.score(x_train, y_train):.2f}')
print(f'Validate: {tree.score(x_validate, y_validate):.2f}')
print(f'Test: {tree.score(x_test, y_test):.2f}')

Train: 0.93
Validate: 0.35
Test: 0.40


<div style='background-color : orange'>
<a id='lr'></a>
    <b><u><i><h1 style='text-align : center'>
        Logistic Regression
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>
<li><a href='#lrcv'>Using CV</a></li>
<li><a href='#lrcvbag'>Using CV Bags</a></li>
<li><a href='#lrtfidf'>Using TF-IDF</a></li>
<li><a href='#lrtfidfbag'>Using TF-IDF Bags</a></li>

<a id='lrcv'></a>
<h3><b><i>
    Using CV
</i></b></h3>
<li><a href='#lr'>Logistic Regression Top</a></li>

In [157]:
# Define x and y
cv = CountVectorizer()
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [164]:
# Fit the model and report accuracy on each split
model = LR(random_state=1349)
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.70
Test: 0.70


---

<a id='lrcvbag'></a>
<h3><b><i>
    Using CV Bags
</i></b></h3>
<li><a href='#lr'>Logistic Regression Top</a></li>

In [201]:
# Define x and y
cv = CountVectorizer(ngram_range=(1, 5))
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [202]:
# Fit the model and report accuracy on each split
model = LR(random_state=1349)
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.55
Test: 0.60


---

<a id='lrtfidf'></a>
<h3><b><i>
    Using TF-IDF
</i></b></h3>
<li><a href='#lr'>Logistic Regression Top</a></li>

In [168]:
# Define x and y
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [169]:
# Fit the model and report accuracy on each split
model = LR(random_state=1349)
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.70
Test: 0.70


---

<a id='lrtfidfbag'></a>
<h3><b><i>
    Using TF-IDF Bags
</i></b></h3>
<li><a href='#lr'>Logistic Regression Top</a></li>

In [203]:
# Define x and y
tfidf = TfidfVectorizer(ngram_range=(1, 5))
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [204]:
# Fit the model and report accuracy on each split
model = LR(random_state=1349)
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.65
Test: 0.70


<div style='background-color : orange'>
<a id='nb'></a>
    <b><u><i><h1 style='text-align : center'>
        Naive Bayes (MultinomialNB)
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>
<li><a href='#nbcv'>Using CV</a></li>
<li><a href='#nbtfidf'>Using TF-IDF</a></li>

<a id='nbcv'></a>
<h3><b><i>
    Using CV
</i></b></h3>
<li><a href='#nb'>Naive Bayes Top</a></li>

In [170]:
# Define x and y
cv = CountVectorizer()
x_train = cv.fit_transform(train.summary_clean)
y_train = train.category
x_validate = cv.transform(validate.summary_clean)
y_validate = validate.category
x_test = cv.transform(test.summary_clean)
y_test = test.category

In [172]:
# Fit the model and report accuracy on each split
model = MNB()
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.70
Test: 0.80
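
With only 10 test articles, each article moves test accuracy by 0.10, so a per-class breakdown is more informative than a single number. The imports already include `classification_report` and `accuracy_score`; the sketch below shows one way they could be used, on a made-up corpus with hypothetical labels rather than the real splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB

# Toy train and test sets (hypothetical documents and labels)
train_docs = ["film actress award", "actress film premiere",
              "captain match win", "captain team match"]
train_labels = ["entertainment", "entertainment", "sports", "sports"]
test_docs = ["actress wins film award", "captain leads team"]
test_labels = ["entertainment", "sports"]

cv = CountVectorizer()
model = MultinomialNB().fit(cv.fit_transform(train_docs), train_labels)

# Words unseen during fitting ("wins", "leads") are ignored by transform
pred = model.predict(cv.transform(test_docs))

acc = accuracy_score(test_labels, pred)
report = classification_report(test_labels, pred, zero_division=0)
print(acc)
print(report)
```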


---

<a id='nbtfidf'></a>
<h3><b><i>
    Using TF-IDF
</i></b></h3>
<li><a href='#nb'>Naive Bayes Top</a></li>

In [173]:
# Define x and y
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(train.summary_clean)
y_train = train.category
x_validate = tfidf.transform(validate.summary_clean)
y_validate = validate.category
x_test = tfidf.transform(test.summary_clean)
y_test = test.category

In [174]:
# Fit the model and report accuracy on each split
model = MNB()
model.fit(x_train, y_train)
print(f'Train: {model.score(x_train, y_train):.2f}')
print(f'Validate: {model.score(x_validate, y_validate):.2f}')
print(f'Test: {model.score(x_test, y_test):.2f}')

Train: 0.96
Validate: 0.70
Test: 0.70


<div style='background-color : orange'>
<a id='takeaway'></a>
    <b><u><i><h1 style='text-align : center'>
        Takeaway
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>

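- Best Model Type
    - Naive Bayes (MultinomialNB)
        - Highest test accuracy (0.80 with CV); its validate accuracy of 0.70 ties Logistic Regression
    - Logistic Regression
        - Close second (0.70 validate and 0.70 test with either CV or TF-IDF)
    - The test set holds only 10 articles, so the 0.80 vs 0.70 gap is a single article
- Best Methodology
    - CV & TF-IDF
        - Perform similarly when only single words are used
        - Hyperparameter (ngram_range=(1, 5)): no change for Decision Tree; lower validate accuracy for Random Forest (0.55 to 0.35) and Logistic Regression (0.70 to 0.55 with CV, 0.70 to 0.65 with TF-IDF)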
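
A comparison like this is easier to read when every vectorizer, n-gram range, and model combination is scored in one loop and collected into a table. The sketch below is one possible layout, run on a made-up corpus with hypothetical labels; the dictionary names and column names are assumptions, and only two of the four model types are included to keep it short.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy train and validate sets (hypothetical documents and labels)
train_docs = ["film actress award", "actress film premiere",
              "captain match win", "captain team match"]
train_y = ["entertainment", "entertainment", "sports", "sports"]
val_docs = ["film award night", "team match today"]
val_y = ["entertainment", "sports"]

vectorizers = {"CV": CountVectorizer, "TF-IDF": TfidfVectorizer}
models = {
    "LogisticRegression": lambda: LogisticRegression(random_state=1349),
    "MultinomialNB": MultinomialNB,
}

rows = []
for vec_name, make_vec in vectorizers.items():
    for ngrams in [(1, 1), (1, 5)]:
        # Fit the vectorizer on train only, then reuse it for validate
        vec = make_vec(ngram_range=ngrams)
        x_train = vec.fit_transform(train_docs)
        x_val = vec.transform(val_docs)
        for model_name, make_model in models.items():
            model = make_model().fit(x_train, train_y)
            rows.append({
                "vectorizer": vec_name,
                "ngram_range": str(ngrams),
                "model": model_name,
                "train": model.score(x_train, train_y),
                "validate": model.score(x_val, val_y),
            })

scores = pd.DataFrame(rows)
print(scores)
```

Swapping in the real `train` and `validate` frames and adding the tree-based models would reproduce the grid of results from the sections above in a single table.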

<div style='background-color : orange'>
<a id='misc'></a>
    <b><u><i><h1 style='text-align : center'>
        Miscellaneous
    </h1></i></u></b>
<li><a href='#TableOfContents'>Table of Contents</a></li>