# Assignment 4

A parameter selection procedure is often needed to evaluate Probability of Detection
against Probability of False Alarm (i.e., Pd versus Pf) in order to select a classifier
model and/or a hyperparameter value for a classifier.
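
As a small illustration of the two quantities, the sketch below computes Pd (the true positive rate) and Pf (the false positive rate) from hard predictions. The function name `pd_and_pf` and the toy labels are made up for this example and are not part of the assignment.

```python
def pd_and_pf(y_true, y_pred):
    """Return (Pd, Pf) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # detections
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misses
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections
    return tp / (tp + fn), fp / (fp + tn)


# Toy labels (assumption): 4 positives and 4 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

p_detect, p_false_alarm = pd_and_pf(y_true, y_pred)
print(p_detect, p_false_alarm)  # 0.75 0.25
```

Each (Pf, Pd) pair is one operating point on the ROC plot; a better classifier sits closer to the top-left corner.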
                                                
In this assignment we will produce an ROC plot presenting operating points of various
classifiers and their varying hyperparameters so that we can make a justifiable operating
classifier/parameter selection for the following problem.

The classification of fake news or misinformation is a very important task today. Download the
Fake.csv and True.csv files from the fake news dataset
(https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset). Load the datasets into
your model development framework and examine the features to confirm that the title and text
columns contain text. Set fake as 1 and true as 0. Concatenate the datasets to produce one
dataset of around 44,900 rows. Apply the necessary pre-processing to vectorize the title column
with TF-IDF. (This assigns numerical values to terms based on their frequency in a given
document and throughout a given collection of documents.) Use around 50 features. Make sure to
include a sanity check in the pipeline, and perhaps run your favorite baseline classifier first.

```
df_true['class'] = 0
df_fake['class'] = 1
df = pd.concat([df_fake, df_true])
X = TfidfVectorizer(stop_words='english',
                    max_features=40).fit_transform(df['title'])
```

## 1. [70 pts]

Using three classifiers (decision tree, random forest, and neural network) with
at least 2 different hyperparameter settings for each, generate operating points and plot
them on an ROC plot. In particular, plot mean TPR against mean FPR, where the means are taken
over multiple cross-validation runs. Do not hesitate to use/modify the ROC plot code
in the module notebook if necessary. If you do not see enough variety in Pd-Pf, you
might need to revise the set of classifiers and/or hyperparameters. Do not hesitate to try
hundreds of settings if necessary, since the ROC is just a scatter plot of operating points.
(Some recommended parameters and ranges: depth [3-12], number of features [3-20],
number of estimators [20-100], layer size [1-10], learning rate; and a total of 10-20 operating points.)
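
One possible way to produce such operating points is sketched below. It is not the module notebook's ROC code: the helper names `mean_operating_point` and `plot_operating_points` are hypothetical, and synthetic data from `make_classification` stands in for the TF-IDF matrix so the block runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def mean_operating_point(clf, X, y, n_splits=5, seed=42):
    """Return (mean FPR, mean TPR) of clf over stratified CV folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fprs, tprs = [], []
    for train_idx, test_idx in cv.split(X, y):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        tn, fp, fn, tp = confusion_matrix(
            y[test_idx], model.predict(X[test_idx]), labels=[0, 1]
        ).ravel()
        fprs.append(fp / (fp + tn))  # Pf for this fold
        tprs.append(tp / (tp + fn))  # Pd for this fold
    return float(np.mean(fprs)), float(np.mean(tprs))


def plot_operating_points(points):
    """Scatter the (FPR, TPR) pairs; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, (fpr, tpr) in points.items():
        ax.scatter(fpr, tpr, label=label)
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    ax.set_xlabel("mean FPR (Pf)")
    ax.set_ylabel("mean TPR (Pd)")
    ax.legend()
    return ax


# Synthetic stand-in data (assumption). In the notebook, pass the TF-IDF matrix
# and y.to_numpy() so that positional indexing works on the labels.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)

points = {
    f"decision_tree(max_depth={depth})": mean_operating_point(
        DecisionTreeClassifier(max_depth=depth, random_state=42), X_demo, y_demo
    )
    for depth in (3, 6, 12)
}
for label, (fpr, tpr) in points.items():
    print(f"{label}: FPR={fpr:.3f}, TPR={tpr:.3f}")
```

Calling `plot_operating_points(points)` draws the scatter plot. Adding random forest and neural network settings to the `points` dictionary in the same way gives the 10-20 operating points the prompt asks for.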

In [1]:
# Load the datasets 
import pandas as pd

df_true = pd.read_csv('datasets/True.csv')
df_fake = pd.read_csv('datasets/Fake.csv')

In [2]:
# Inspect the dataset
print('\n\n\nTrue dataset head: \n',df_true.head(n=2))
print('\n\n\nFake dataset head: \n',df_fake.head(n=2))
print(f'\n\n\n{df_true.columns}')
print(f'\n\n\n{df_fake.columns}')

df_true['class'] = 0
df_fake['class'] = 1
# ignore_index=True avoids duplicate row labels in the combined frame
df = pd.concat([df_fake, df_true], ignore_index=True)
print(f'\n\n\nThe columns of the final dataset are: {df.columns}')
print(f'dataframe has {len(df)} samples.')




True dataset head: 
                                                title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   

                 date  
0  December 31, 2017   
1  December 29, 2017   



Fake dataset head: 
                                                title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   

                date  
0  December 31, 2017  
1  December 31, 2017  



Index(['title', 'text', 'subject', 'date'], dtype='object')



Ind

### Split the data into training and testing sets

Per the assignment prompt, run a sanity check: in this case, confirm that we have the expected number of samples, approximately 44,900.

In [3]:
# Transform the titles into a vector
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

MAX_FEAUTRES = 50
X = TfidfVectorizer(stop_words='english', max_features=MAX_FEAUTRES).fit_transform(df['title'])
y = df['class']
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

(44898, 50) (44898,)
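
The sanity check could also be made explicit with a few assertions instead of reading the printed shape. The sketch below is one way to do it; the helper name `sanity_check`, the 1% tolerance, and the toy frame are assumptions for illustration.

```python
import pandas as pd


def sanity_check(df, n_expected=44_898, tolerance=0.01):
    """Raise AssertionError if the combined frame looks wrong; return class counts."""
    assert set(df["class"].unique()) == {0, 1}, "labels must be 0 (true) and 1 (fake)"
    assert df["title"].notna().all(), "some titles are missing"
    assert abs(len(df) - n_expected) <= tolerance * n_expected, (
        f"unexpected row count: {len(df)}"
    )
    return df["class"].value_counts().to_dict()


# Toy frame (assumption) so the example runs without the Kaggle files
toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "class": [1, 1, 0, 0]})
counts = sanity_check(toy, n_expected=4)
print(counts)
```

In the notebook the call would be `sanity_check(df)`; the returned counts also show whether the two classes are roughly balanced.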


### Test GridSearch optimal parameters across several classifiers

In [5]:
from sklearn.ensemble import RandomForestClassifier
# Must run before importing HalvingRandomSearchCV, which is still experimental
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, HalvingRandomSearchCV, train_test_split
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

import numpy as np


clfs = {
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'random_forest': RandomForestClassifier(random_state=42),
    'neural_network': Perceptron(random_state=42)
}
param_dict = {
    'halving_search': {
        'decision_tree': {
            'criterion':['gini', 'entropy', 'log_loss'],
            'max_features': ['sqrt', 'log2', None],
            'splitter': ['best', 'random'],
            'max_depth': [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37],
            'ccp_alpha': np.linspace(0.0, 5, 10),
            'random_state': [None, 42],
            'min_samples_leaf': [.01, .05, .1, .15, .2],  # fractions of the training samples
        },
        'random_forest': {
            'criterion':['gini', 'entropy', 'log_loss'],
            'max_features': ['sqrt', 'log2', None],
            'max_depth': [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37],
            'ccp_alpha': np.linspace(0.0, 5, 10),
            'random_state': [None, 42],
            'min_samples_leaf': [.01, .05, .1, .15, .2],  # fractions of the training samples
        },
        'neural_network': {
            'random_state': [None, 42],
            'penalty': ['l2', 'l1', 'elasticnet', None],
            'eta0': np.linspace(.5,5,10),
            'early_stopping': [True, False],
        }
    },
    'grid_search': {
        'decision_tree': {
            'criterion':['gini', 'entropy', 'log_loss'],
            'max_features': ['sqrt', 'log2', None],
            'splitter': ['best', 'random'],
            'max_depth': [2, 3, 5, 7, 11, 13],
            'random_state': [None, 42],
            'min_samples_leaf': [.01, .05, .1, .15, .2],  # fractions of the training samples
        },
        'random_forest': {
            'criterion':['gini', 'entropy', 'log_loss'],
            'max_features': ['sqrt', 'log2', None],
            'max_depth': [2, 3, 5, 7, 11, 13],
            'random_state': [None, 42],
            'min_samples_leaf': [.01, .05, .1, .15, .2],  # fractions of the training samples
        },
        'neural_network': {
            'random_state': [None, 42],
            'penalty': ['l2', 'l1', 'elasticnet', None],
            'eta0': np.linspace(.5,5,10),
            'early_stopping': [True, False],
        }
    }
}
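
Before launching an exhaustive search it can help to count how many candidates a grid contains. The sketch below uses `ParameterGrid` on a hypothetical grid that mirrors the decision-tree entries above (the name `tree_grid` is made up for this example). It also shows that multiplying a Python list by an integer repeats the list rather than scaling its values, which inflates a grid with duplicate candidates.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid mirroring the decision-tree grid-search entries
tree_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_features": ["sqrt", "log2", None],
    "splitter": ["best", "random"],
    "max_depth": [2, 3, 5, 7, 11, 13],
    "random_state": [None, 42],
    "min_samples_leaf": [0.01, 0.05, 0.1, 0.15, 0.2],
}

n_candidates = len(ParameterGrid(tree_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold with cv=5
print(n_candidates, n_fits)  # 1080 5400

# int * list repeats the list: 50 copies of the same five values
repeated = 50 * tree_grid["min_samples_leaf"]
n_inflated = len(ParameterGrid({**tree_grid, "min_samples_leaf": repeated}))
print(len(repeated), n_inflated)  # 250 54000
```

A count like this makes it easier to decide between an exhaustive grid search and a randomized or halving search.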

In [6]:
bests = {}
for k, clf in clfs.items():
    print(f'\n\n\nrunning grid search for {k}')
    gs = GridSearchCV(clf, param_dict['grid_search'][k], cv=5, scoring='accuracy', n_jobs=8, verbose=1)
    gs.fit(X_train, y_train)
    bests[k + '__GridSearch'] = {
        'best_model': gs.best_estimator_,
        'best_accuracy': gs.best_estimator_.score(X_test, y_test),  # held-out test accuracy
        'best_params': gs.best_params_,
        'best_score': gs.best_score_,  # mean cross-validated accuracy
    }
    print(f'\n\n\nrunning halving random search for {k}')
    hs = HalvingRandomSearchCV(clf, param_dict['halving_search'][k], cv=5, scoring='accuracy', n_jobs=8, verbose=1)
    hs.fit(X_train, y_train)
    bests[k + '__HalvingSearch'] = {
        'best_model': hs.best_estimator_,
        'best_accuracy': hs.best_estimator_.score(X_test, y_test),  # held-out test accuracy
        'best_params': hs.best_params_,
        'best_score': hs.best_score_,  # mean cross-validated accuracy
    }




running grid search for decision_tree


KeyError: 'decision_tree'

In [None]:
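for k_clf, clf_scores in bests.items():
    # Build the label outside the f-string: reusing the same quote type inside
    # an f-string expression is a SyntaxError before Python 3.12.
    label = (str(k_clf)
             .replace('__HalvingSearch', ' with halving random search')
             .replace('__GridSearch', ' with grid search'))
    print(f'For {label} achieved')
    for k, v in clf_scores.items():
        print(f'\t\t{k}: {v}')
    print()
    print()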