## Vectorizing and Modeling

### 1. Loading info

- In this notebook, I wanted to test 4 different models: Logistic Regression, KNN, Decision Tree and Random Forest

In [18]:
# First, let's import libraries and modules

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


import warnings
warnings.filterwarnings('ignore')

In [19]:
# Loading dataset

df = pd.read_csv('../data/cleaned_df.csv')
df.head(2)

Unnamed: 0,text,subreddit_Parenting,subreddit_books
0,why do you like james joyce james joyce from m...,0,1
1,we yevgeny zamyatin spoilers just finished thi...,0,1


In [20]:
# Recover stop words
%store -r sw

np.array(sw)[0:5]

array(['i', 'me', 'my', 'myself', 'we'], dtype='<U10')

### 2. Evaluating a Baseline score

- My baseline is 0.5. Words from both subreddits are equally distributed. But, we will see if any of our 4 models could predict words from the book's subreddit and receive a score more than %50

In [21]:
df['subreddit_books'].value_counts(normalize=True)

1    0.504507
0    0.495493
Name: subreddit_books, dtype: float64

### 3. Establishing target, features, train, test and split our model

- My target is to predict words that belong to subreddit books
- My features would be text
- Also,I need to split text in my dataset to train and test samples. By default, the train-test-split would split my model to 4 different parts, I wanted to split it to 5 parts that's why I'm using test_size 0.2


In [12]:
# Defining X,y and train-test-split

X = df.text
y = df.subreddit_books

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                           random_state=42)

### 4. Instantiating the Count Vectorizer

 - In this notebook, I wanted to compare both Count Vectorizer and Tf-Idf Vectorizer to check which one would perform better for my models

In [9]:
# Vectorize words 

cvec = CountVectorizer(strip_accents = 'ascii',
                       stop_words = sw)

#vec fit transform 
vect_cvec =pd.DataFrame(cvec.fit_transform(df.text).todense(), 
                        columns=cvec.get_feature_names())

X_cvec = vect_cvec


X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

# X = StandardScaler().fit_transform(vect_cvec) I used scaling as well in experiments on some models

### 5. Instantiating the Tf-Idf Vectorizer

In [13]:
# TfIdf Vectorizer

tfidf =TfidfVectorizer(strip_accents = 'ascii',
                       stop_words = sw)           

vect_tf = pd.DataFrame(tfidf.fit_transform(df.text).todense(),
                       columns = tfidf.get_feature_names())

Xtf = vect_tf

X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
   
# Xtf_sc = StandardScaler().fit_transform(Xtf)

### 6. Logistic regression

In [17]:
# Instantiating the Logistic Regression with tf-idf vectorizer. Cross val score is performed on entire dataset using KFold

logs = LogisticRegression(n_jobs = -1)

logs.fit(X_train_tf, y_train)


logs_score = cross_val_score(logs, Xtf, y, cv = KFold(n_splits=5,
                                                         shuffle = True,
                                                         random_state = 42) )

print('train:', logs.score(X_train_tf, y_train))
print('test:', logs.score(X_test_tf, y_test))
print('cross val logs:', round(logs_score.mean(), 3), '±', round(2*logs_score.std(), 3))

train: 0.9936249073387695
test: 0.98814463544754
cross val logs: 0.988 ± 0.004


### 7. KNN

In [14]:
# Instantiate Standard Scaler, fit and transfer it on the entire tfidf's text

ss = StandardScaler()

Xtf_sc = StandardScaler().fit_transform(Xtf)

In [8]:
# KNN. I've got all score for KNN before. But, it takes very long time to run all and I've decided to get only cross_val_score for KNN this time.
# This model I'm using only for experimentations

knn = KNeighborsClassifier(n_jobs=-1,
                           p=2, # defines minkowski distance: euclidean
                           weights='distance',
                           n_neighbors=7)

knn_score = cross_val_score(knn, Xtf_sc, y, cv = KFold(n_splits=5,
                                                       shuffle = True,
                                                       random_state = 42))

print('cros val rf:', round(knn_score.mean(), 3), '±' , round(2*knn_score.std(), 3))

cros val rf: 0.539 ± 0.077


### 8. Decision Tree

In [26]:
# Instantiate model with:
# - a maximum depth of 5.
# - at least 7 samples required in order to split an internal node. default 2
# - at least 3 samples in each leaf node. default 1
# - a cost complexity of 0.01. if it increase, we regularize more. default 0
# - random state of 42.

dt = DecisionTreeClassifier(max_depth=5,
                            min_samples_split=7,
                            min_samples_leaf=3,
                            ccp_alpha=.01,
                            random_state=42)

dt.fit(X_train_tf, y_train)

dt_score = cross_val_score(dt, Xtf, y, cv = KFold(n_splits=5,
                                                  shuffle = True,
                                                  random_state = 42))


print(f'Score on training set: {dt.score(X_train_tf, y_train)}')
print(f'Score on testing set: {dt.score(X_test_tf, y_test)}')
print('cros val rf:', round(rf_score.mean(), 3), '±' ,  round(2*rf_score.std(), 3))

Score on training set: 0.9369903632320237
Score on testing set: 0.9270895080023711
cros val rf: 0.983 ± 0.006


### 9. Random Forest

In [16]:
# Random Forest

rf = RandomForestClassifier(n_estimators=10, n_jobs = -1)

rf.fit(X_train_tf, y_train)


rf_score = cross_val_score(rf, Xtf, y, cv = KFold(n_splits=5,
                                                     shuffle = True,
                                                     random_state = 42))

print('train:', rf.score(X_train_tf, y_train))
print('test:',  rf.score(X_test_tf, y_test))
print('cros val rf:', round(rf_score.mean(), 3), '±' , round(2*rf_score.std(), 3))

train: 0.9982209043736101
test: 0.9590989922940131
cros val rf: 0.956 ± 0.01


### 10. Conclusions after initial modeling

After experimenting with different models and combinations, I made next conclusions:

- all models would succesfully predict the subreddit and beat the baseline score.

- TfIdf Vectorizer performed slightly better than Count Vectorizer by %1. But, Count Vectorizer run slightly faster

- KNN with Standard Scaler performed worse than other models. But, probably more custom setting need to be done to this model such as number of neighbors or changing the distance to manhattan for example

- I experimented with the Decision Tree and I'm going to leave this model for now because it has lower scores. It is definitely very interesting model with a lot of great parameters what could be used to improve the model

- Logistic Regression worked well and would be a good candiate for the winning model. I used it both with the Count Vectoriser and Tf-idf Vectorizer and Logs + tf-idf give the best results. The score is pretty high.

- Random Forest is very interesting, it has higher traing score compare to testing and stdiv is 0.01. It has slightly higher variannce and I wanted to use this model in my next notebook to see if I could imprpove the score by changing some parameters.

**Proceed to the next notebook!**