# 2. Applied ML

We are going to build a news classifier that assigns each article to one of 20 news categories. Note that the pipeline that you will build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that are frequent in the considered article (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).

2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.



## 2.1 Load dataset and vectorize

We are going to build a news classifier that assigns each article to one of 20 categories. Let's import the dataset and check the different categories:

In [51]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups # Function that downloads the data from the archive
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from pprint import pprint
from time import time
import itertools


from pathlib import Path # to check if file exists
import os # get current directory
import pickle


# imported because of the splitting of the questions:
import numpy as np
import matplotlib.pyplot as plt

In [52]:
# Import dataset
newsgroups = fetch_20newsgroups(subset='all')#, remove = ('headers', 'footers', 'quotes'))

# Show categories list
pprint(list(newsgroups.target_names))
classes = newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


Let's check the shape of the dataset and what each data point looks like:

In [53]:
import pandas as pd

# The raw data is exposed through the data, filenames and target (category) attributes
print('Filenames shape:', newsgroups.filenames.shape)
print('Target shape:', newsgroups.target.shape, '\n')

# Show the attributes (keys) of the object returned by fetch_20newsgroups
print('The categories of newsgroups are: ', sorted(newsgroups.keys()), '\n')
print('description is: ', newsgroups.description)
print('filenames contains the location of the files in the hardware running the code ')

Filenames shape: (18846,)
Target shape: (18846,) 

The categories of newsgroups are:  ['DESCR', 'data', 'description', 'filenames', 'target', 'target_names'] 

description is:  the 20 newsgroups by date dataset
filenames contains the location of the files in the hardware running the code 


Now we vectorize the data:
* a small `max_features` is faster and slightly reduces the risk of overfitting (here we keep all features with `max_features=None`)
* a word that appears in more than a fraction `max_df` of the documents is ignored
* accents are stripped (even though they should not be a problem in English text)
* in the current version the only built-in stop-word list is "english", even though most of these words should already be filtered out by `max_df`

In [75]:
# Vectorize the text dataset
vectorizer = TfidfVectorizer(max_features=None,
                             max_df=0.7,
                             strip_accents='ascii',
                             stop_words='english')
vectors = vectorizer.fit_transform(newsgroups.data)
vectors.shape

(18846, 173438)
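To make the effect of `stop_words` and `max_df` more concrete, here is a minimal sketch on a toy corpus. The documents and the names `toy_docs`, `toy_vectorizer` and `toy_vectors` are hypothetical and only serve as an illustration; they are not part of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the 20 newsgroups data)
toy_docs = [
    "news the rocket launch was delayed",
    "news the hockey game was delayed",
    "news the rocket engine test",
    "news the hockey team wins",
]

# Same kind of settings as above: English stop words are removed and
# any term present in more than 70% of the documents is dropped.
toy_vectorizer = TfidfVectorizer(max_df=0.7, stop_words='english')
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# "the" is removed as a stop word, "news" because it appears in every document
print(sorted(toy_vectorizer.vocabulary_))
print(toy_vectors.shape)
```

With only four documents, `max_df=0.7` removes every term that appears in three or more of them, which is why "news" disappears while "rocket" and "hockey" (two documents each) are kept.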

### Split in training, testing and validation sets

Split the data into training (80%), testing (10%) and validation (10%) sets.

In [76]:
# Create predictors and predicted variable
X = vectors
y = newsgroups.target

# Split into training, testing and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.8) # 80% train
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, random_state=0, train_size=0.5) # 10% test, 10% validation
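The two-step split can be checked on toy data. The sketch below uses hypothetical random arrays (`X_toy`, `y_toy`) instead of the TF-IDF matrix, and only verifies that the resulting sizes follow the 80% / 10% / 10% proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples with 3 features and 20 classes
rng = np.random.RandomState(0)
X_toy = rng.rand(1000, 3)
y_toy = rng.randint(0, 20, size=1000)

# First hold out 20% of the data, then cut the held-out part in half
X_tr, X_rest, y_tr, y_rest = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)
X_te, X_va, y_te, y_va = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)

print(len(y_tr), len(y_te), len(y_va))  # 800 100 100
```

If the class proportions should be preserved in each subset, `train_test_split` also accepts a `stratify` argument; this option is not used in the split above.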



In [77]:
seed = 50

Looking at summary statistics of the maximum depth reached by all the estimators will help us tune this parameter better.

### Grid search

``` python
# A first grid search covered n_estimators from 10 to 100
# and max_depth from 10 to 1584.
# Best parameters: {'max_depth': 1584, 'n_estimators': 100}
```

Both best values lie at the upper bound of their range, so we search larger values instead:

In [79]:
# Parameters of the grid search
n_estim_range = np.round(np.logspace(2,2.5,num=2,dtype=int))
max_depth_range = np.round(np.logspace(3.5,3.5,num=1,dtype=int))
print(n_estim_range)
print(max_depth_range)
n_estim_range = [500]
max_depth_range = [400]
print(n_estim_range)
print(max_depth_range)

[100 316]
[3162]
[500]
[400]


In [80]:
paramgrid = {'n_estimators': n_estim_range,'max_depth': max_depth_range}

# Grid search on the number of estimators and the maximum depth (3-fold cross-validation)
grid_search = GridSearchCV(RandomForestClassifier(random_state=seed), paramgrid, cv = 3)
grid_search.fit(X_train, y_train)
print('Best parameters:',grid_search.best_params_)

KeyboardInterrupt: 

In [None]:
# Evaluation
print('Train score:',grid_search.score(X_train, y_train))
print('Test score:',grid_search.score(X_test, y_test))
print('Validation score:',grid_search.score(X_val, y_val))
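Note that `GridSearchCV` with `cv = 3` selects the parameters by cross-validation on the training set and does not use the validation set. As an alternative, the sketch below tunes the same two parameters directly against a held-out validation set. It is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the grid is deliberately tiny, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Hypothetical synthetic data standing in for the TF-IDF features
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, train_size=0.8, random_state=0)

# Deliberately tiny grid so that the sketch runs in a few seconds
param_grid = ParameterGrid({'n_estimators': [10, 50], 'max_depth': [3, None]})

best_score, best_params = -1.0, None
for params in param_grid:
    model = RandomForestClassifier(random_state=50, **params)
    model.fit(X_tr, y_tr)
    score = model.score(X_va, y_va)  # accuracy on the validation set
    if score > best_score:
        best_score, best_params = score, params

print('Best parameters:', best_params)
print('Validation accuracy:', best_score)
```

With this approach the test set stays untouched until the very end, so the test score remains an unbiased estimate of the performance of the selected model.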

### Saving functions
We use these functions to write the content of a given variable (in our case the fitted grid search) to a binary file, so that the results do not have to be recomputed each time. All the _.pickle_ files are saved in the subdirectory **pickles/**.

In [None]:
DATA_FOLDER = 'pickles/'

# Function to save results in a pickle file
def save_results(var_name, file_name):
    file_path = DATA_FOLDER + file_name + '.pickle'
    # Create the output directory if it does not exist yet
    os.makedirs(DATA_FOLDER, exist_ok=True)
    if Path(file_path).is_file():
        # Never overwrite an existing pickle: fall back to a default file name
        print('WARNING! This filename already existed, so we wrote to "overrided.pickle".\n'
              'PLEASE MANUALLY RENAME THE FILE.')
        file_path = DATA_FOLDER + 'overrided.pickle'
    with open(file_path, 'wb') as file:
        pickle.dump(var_name, file)

# Function to read the pickles file
def read_pickle(file_name):
    file_path = DATA_FOLDER + str(file_name) + '.pickle'
    with open(file_path, "rb") as file:
        out = pickle.load(file)
    return out

# Function to print the scores
def print_scores(grid_search_var):
    print('Train score:',grid_search_var.score(X_train, y_train))
    print('Test score:',grid_search_var.score(X_test, y_test))
    print('Validation score:',grid_search_var.score(X_val, y_val))
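The save and load pattern used above can be tried in isolation. The following sketch performs a pickle round trip in a temporary directory; the dictionary `results` is a hypothetical stand-in for a fitted grid search, since any picklable object behaves the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted grid search: any picklable object works
results = {'max_depth': 400, 'n_estimators': 500}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'example.pickle')
    # Write the object to a binary file
    with open(file_path, 'wb') as file:
        pickle.dump(results, file)
    # Read it back
    with open(file_path, 'rb') as file:
        restored = pickle.load(file)

print(restored == results)  # True
```

One caveat worth keeping in mind: a pickled scikit-learn model is generally only guaranteed to load correctly with the same scikit-learn version that created it, so pickles shared between machines may fail with attribute errors.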

In [None]:
# cwd = os.getcwd()

In [None]:


#What we used up to now to save the results 
save_results(grid_search, '08_max_df_10000_max_features')
print_scores(grid_search)

In [71]:
#[50, 100, 200, 500, 600, 900]

In [72]:
removing_features_GS = read_pickle('removing_features_1')

print('Run with "remove = ("headers", "footers", "quotes")" and with the following set of parameters:\n'+\
      'n_estim_range = [200, 400, 700]\n'+\
      'max_depth_range = [50, 100, 200, 500, 600, 900]\n')

print('Optimal features found: ',removing_features_GS.best_params_)
print('which gave the following results:')
# Evaluation
print_scores(removing_features_GS) 


Run with "remove = ("headers", "footers", "quotes")" and with the following set of parameters:
n_estim_range = [200, 400, 700]
max_depth_range = [50, 100, 200, 500, 600, 900]

Optimal features found:  {'max_depth': 500, 'n_estimators': 400}
which gave the following results:
Train score: 0.0522021756434
Test score: 0.0615384615385
Validation score: 0.0503978779841


**Discussion:** We can see that removing the headers, footers and quotes decreases our prediction accuracy (the scores above are close to the 5% chance level for 20 classes), even if it is in some sense a more truthful classification since only the content is analysed. <br />
**Thus, in the rest of the analysis we do not remove these parts, but we should keep this consideration in mind.**

In [73]:
#nightly_run_1 = read_pickle('gilcompa')
print('Run without removing anything and with the following set of parameters:\n'+\
      'n_estim_range = [ 316  464  681 1000]\n'+\
      'max_depth_range = [ 1000  2154  4641 10000]\n')

print('We found:\n'+\
     'max_depth = 1000\n'+\
     'n_estimators = 1000\n'+\
     'train score = 1.0\n'+\
     'test_score = 0.841909814324\n'+\
     'val_score = 0.835013262599')

#print('Optimal features found: ',nightly_run_1.best_params_)


Run without removing anything and with the following set of parameters:
n_estim_range = [ 316  464  681 1000]
max_depth_range = [ 1000  2154  4641 10000]

We found:
max_depth = 1000
n_estimators = 1000
train score = 1.0
test_score = 0.841909814324
val_score = 0.835013262599


In [28]:
nightly_run_2 = read_pickle('martino')
print('Run with "remove = ("headers", "footers", "quotes")" and with the following set of parameters:\n'+\
      'n_estim_range = [200, 500, 900]\n'+\
      'max_depth_range = [300, 500, 700, 900]\n')
print('Optimal features found: ',nightly_run_2.best_params_)
print('which gave the following results:')
# Evaluation
print_scores(nightly_run_2) 

Run with "remove = ("headers", "footers", "quotes")" and with the following set of parameters:
n_estim_range = [200, 500, 900]
max_depth_range = [300, 500, 700, 900]

Optimal features found:  {'max_depth': 300, 'n_estimators': 900}
which gave the following results:
Train score: 1.0
Test score: 0.841379310345
Validation score: 0.835543766578


In [59]:
# Evaluation
print('Train score:',non_removed_features_2.score(X_train, y_train))
print('Test score:',non_removed_features_2.score(X_test, y_test))
print('Validation score:',non_removed_features_2.score(X_val, y_val))

AttributeError: 'GridSearchCV' object has no attribute 'multimetric_'

## Confusion matrix

We plotted the confusion matrix using the template from the scikit-learn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

In [49]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
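To see what `normalize=True` does, here is a small sketch on hypothetical toy labels (`y_true`, `y_pred`). Each row of the confusion matrix is divided by the number of true samples of that class, so the diagonal then contains the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

# Rows are true labels, columns are predicted labels
cm_toy = confusion_matrix(y_true, y_pred)

# Divide each row by the number of true samples of that class
cm_norm = cm_toy.astype('float') / cm_toy.sum(axis=1)[:, np.newaxis]

print(cm_toy)
print(cm_norm)
```

Here class 0 has four true samples of which two are predicted correctly, so the first diagonal entry of the normalized matrix is 0.5.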

In [50]:


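# Assumption: the confusion matrix is computed on the test set with the
# grid search loaded above (nightly_run_2); swap in another fitted model if needed
cm = confusion_matrix(y_test, nightly_run_2.predict(X_test))

plt.figure(figsize=(18,18))
plot_confusion_matrix(cm, classes, normalize=True, title='Confusion matrix', cmap=plt.cm.Greys);
plt.show()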

## 2.2 Results discussion

In [41]:
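# Assumption: rfc is the best random forest of the grid search loaded above (nightly_run_2)
rfc = nightly_run_2.best_estimator_
print('Number of features:', len(rfc.feature_importances_))
print('Features importances:', rfc.feature_importances_)
# Summary statistics of the depth reached by each tree of the forest
depths = pd.Series([estimator.tree_.max_depth for estimator in rfc.estimators_])
depths.describe()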