# 2. Applied ML

We are going to build a news classifier that assigns each article to one of 20 news categories. Note that the pipeline you build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that occur often in the considered article (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).

2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.



## 2.1 Load dataset and vectorize

We are going to build a news classifier that assigns each article to one of 20 categories. Let's import the dataset and check the different categories:

In [116]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups # Function that downloads the data from the archive
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from pprint import pprint
from time import time
import itertools


from pathlib import Path # to check if file exists
import os # get current directory
import pickle


# imported because of the splitting of the questions:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Import dataset
newsgroups = fetch_20newsgroups(subset='all')#, remove = ('headers', 'footers', 'quotes'))

# Show categories list
pprint(list(newsgroups.target_names))
classes = newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


Let's check the shape of the dataset and what each data point looks like:

In [26]:
import pandas as pd

# The real data lies in the filenames and target attributes (target category)
print('Filenames shape:', newsgroups.filenames.shape)
print('Target shape:', newsgroups.target.shape, '\n')

# Show the attributes (keys) of the dataset object, sorted alphabetically
print('The categories of newsgroups are: ', sorted(newsgroups.keys()), '\n')
print('description is: ', newsgroups.description, '\n')
print('filenames contains the location of the files in the hardware running the code ', '\n')

Filenames shape: (18846,)
Target shape: (18846,) 

The categories of newsgroups are:  ['DESCR', 'data', 'description', 'filenames', 'target', 'target_names'] 

description is:  the 20 newsgroups by date dataset 

filenames contains the location of the files in the hardware running the code  



In [114]:
# Vectorize the text dataset

# max_features: a smaller vocabulary is faster and slightly reduces the risk of overfitting
# max_df: ignore terms that appear in more than 70% of the documents
# strip_accents: remove any accents (even though English text should rarely contain them)
# stop_words: 'english' is the only built-in stop-word list, even though most of these
#             words should already be filtered out by max_df
vectorizer = TfidfVectorizer(max_features=10000,
                             max_df=0.7,
                             strip_accents='ascii',
                             stop_words='english')
vectors = vectorizer.fit_transform(newsgroups.data)
vectors.shape

(18846, 10000)
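
To make the TF-IDF weighting more concrete, here is a minimal sketch on a hypothetical three-sentence corpus (not taken from 20 Newsgroups). It relies on the `vocabulary_` and `idf_` attributes of a fitted `TfidfVectorizer` and checks that a word present in every document receives a lower IDF weight than rarer words. The names `toy_docs`, `toy_vectorizer` and `idf_by_term` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (illustration only)
toy_docs = [
    "the rocket reached orbit",
    "the goalie stopped the puck",
    "the rocket engine failed",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; idf_ holds one weight per column
idf_by_term = {term: toy_vectorizer.idf_[col]
               for term, col in toy_vectorizer.vocabulary_.items()}

print(toy_vectors.shape)  # (number of documents, number of distinct terms)
# "the" is in all 3 documents, "rocket" in 2, "puck" in 1
print(idf_by_term["the"] < idf_by_term["rocket"] < idf_by_term["puck"])
```

The rarer a term is across documents, the larger its IDF, which is why very common words contribute little to the final feature vectors.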

### Split into training, testing and validation sets

Split the data into training (80%), testing (10%) and validation (10%) sets:

In [49]:
# Create predictors and predicted variable
X = vectors
y = newsgroups.target

# Split into training, testing and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.8) # 80% train
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, random_state=0, train_size=0.5) # 10% test, 10% validation
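
As a sanity check on the two-stage split above, the following sketch reproduces the 80/10/10 scheme on hypothetical stand-in data and prints the resulting sizes. The `stratify` argument is an addition that is not used in the notebook: it keeps the class proportions identical in every subset, which may be worth considering with 20 classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, 4 balanced classes
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.repeat(np.arange(4), 250)

# First cut: 80% train, 20% held out, preserving class proportions
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0, stratify=y_demo)

# Second cut: split the held-out 20% in half (10% test, 10% validation)
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, train_size=0.5, random_state=0, stratify=y_hold)

print(len(y_tr), len(y_te), len(y_va))  # sizes of the three subsets
print(np.bincount(y_tr))                # samples per class in the training set
```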



In [51]:
rand_state = 50

# Create and fit the classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=rand_state)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=50, verbose=0, warm_start=False)

In [52]:
# First Evaluation
print('Train score:',rfc.score(X_train, y_train))
print('Test score:',rfc.score(X_test, y_test))
print('Validation score:',rfc.score(X_val, y_val))

Train score: 1.0
Test score: 0.830769230769
Validation score: 0.82599469496


The classifier fits the training set perfectly (score of 1.0) but reaches only about 0.83 on the test and validation sets, which suggests that this first model overfits the training data.

In [53]:
print('Number of features:',rfc.n_features_)
print('Features importances:',rfc.feature_importances_)
depths = pd.Series([estimator.tree_.max_depth for estimator in rfc.estimators_])
depths.describe()

Number of features: 10000
Features importances: [  5.74276830e-04   2.07053299e-04   1.25403332e-04 ...,   1.30413849e-06
   2.50585632e-04   8.65350439e-06]


count    100.00000
mean     305.48000
std       35.92008
min      223.00000
25%      278.50000
50%      304.50000
75%      329.25000
max      385.00000
dtype: float64

Looking at the summary statistics of the maximum depth across all estimators will help us choose a sensible range when tuning this parameter.
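
The raw `feature_importances_` array printed above is hard to interpret without the corresponding terms. One possible way to read it, sketched here on a hypothetical two-class toy corpus, is to invert the vectorizer's `vocabulary_` mapping and list the terms with the highest importance. The names `docs`, `vec`, `forest` and `terms` are illustrative, and the same idea would apply to `vectorizer` and `rfc` from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two classes (0 = space, 1 = hockey)
docs = ["rocket orbit launch", "orbit satellite rocket", "rocket launch moon",
        "puck goalie ice", "goalie puck stick", "ice stick puck"]
labels = [0, 0, 0, 1, 1, 1]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, labels)

# vocabulary_ maps term -> column index; invert it to look terms up by column
terms = np.empty(len(vec.vocabulary_), dtype=object)
for term, col in vec.vocabulary_.items():
    terms[col] = term

# Sort columns by decreasing importance and keep the top 5
top = np.argsort(forest.feature_importances_)[::-1][:5]
for col in top:
    print(f"{terms[col]:<10} {forest.feature_importances_[col]:.3f}")
```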

### Grid search

In a first run, the grid covered `n_estimators` from 10 to 100 and `max_depth` from 10 to 1584, with the following result:

Best parameters: `{'max_depth': 1584, 'n_estimators': 100}`

In [163]:
# Parameters of the grid search
n_estim_range = np.round(np.logspace(3,3.2,num=4,dtype=int))
max_depth_range = np.round(np.logspace(3,5,num=4,dtype=int))
print(n_estim_range)
print(max_depth_range)
# Override the log-spaced ranges with hand-picked values
n_estim_range = [200, 500, 900]
max_depth_range = [300, 500, 700, 900]

[1000 1165 1359 1584]
[  1000   4641  21544 100000]


In [57]:
paramgrid = {'n_estimators': n_estim_range,'max_depth': max_depth_range}

# Grid search on the number of estimators and the max depth (3-fold cross-validation on the training set)
grid_search = GridSearchCV(RandomForestClassifier(random_state=rand_state), paramgrid, cv = 3)
grid_search.fit(X_train, y_train)
print('Best parameters:',grid_search.best_params_)

[ 10 100]
[  10   54  292 1584]
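
`GridSearchCV` selects parameters by cross-validation on the training set. An alternative that follows the exercise statement more literally is to score each parameter combination on the held-out validation set. The sketch below illustrates this with `ParameterGrid` on hypothetical synthetic data; the grid values are deliberately small and are not the ones used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Hypothetical synthetic data standing in for the TF-IDF features
X_syn, y_syn = make_classification(n_samples=400, n_features=20,
                                   n_informative=8, n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, train_size=0.8, random_state=0)

# Small illustrative grid (assumed values)
grid = ParameterGrid({"n_estimators": [10, 50], "max_depth": [3, None]})

best_score, best_params = -1.0, None
for params in grid:
    model = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    score = model.score(X_va, y_va)  # accuracy on the held-out validation set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

With this approach the validation set is used for model selection only, so the test set remains untouched until the final evaluation.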


In [110]:
DATA_FOLDER = 'pickles/'

def save_results(var_name, file_name):
    """Pickle var_name into DATA_FOLDER without overwriting an existing file."""
    file_path = DATA_FOLDER + file_name +'.pickle'
    my_file = Path(file_path)
    # Create the output folder if it does not exist yet
    os.makedirs(DATA_FOLDER, exist_ok=True)
    if my_file.is_file():
        print('WARNING! This filename already existed so we wrote in "overrided.pickle" PLEASE MANUALLY CHANGE THE NAME'+\
             '\n Change the file name with your name')
        with open(DATA_FOLDER + 'overrided.pickle', 'wb') as file:
            pickle.dump(var_name, file)
    else:
        with open(file_path, 'wb') as file:
            pickle.dump(var_name, file)

def read_pickle(file_name):
    """Load and return the object pickled under DATA_FOLDER/file_name.pickle."""
    file_path = DATA_FOLDER + str(file_name) + '.pickle'
    with open(file_path, "rb") as file:
        out = pickle.load(file)
    return out

In [111]:
# cwd = os.getcwd()

In [112]:
save_results(grid_search, 'first_try')

 Change the file name with your name


In [89]:
loaded_search = read_pickle('first_try')
print('Best parameters:',loaded_search.best_params_)

Best parameters: {'max_depth': 1584, 'n_estimators': 100}
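
The helper above falls back to a single shared file when the requested name is taken, so a second collision overwrites that fallback. A possible variation, sketched here with a hypothetical `save_unique` function that is not part of the notebook, appends a numeric suffix until a free name is found. The demo writes to a temporary directory so it leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

def save_unique(obj, folder, stem):
    """Pickle obj to folder/stem.pickle, adding _1, _2, ... if the name is taken."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{stem}.pickle"
    counter = 1
    while path.exists():
        path = folder / f"{stem}_{counter}.pickle"
        counter += 1
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path

with tempfile.TemporaryDirectory() as tmp:
    first = save_unique({"max_depth": 5}, tmp, "demo")
    second = save_unique({"max_depth": 7}, tmp, "demo")
    saved_names = (first.name, second.name)
    # Read the second file back to confirm the round trip
    with open(second, "rb") as fh:
        restored = pickle.load(fh)

print(saved_names, restored)
```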


In [56]:
# Evaluation
print('Train score:',grid_search.score(X_train, y_train))
print('Test score:',grid_search.score(X_test, y_test))
print('Validation score:',grid_search.score(X_val, y_val))

Train score: 1.0
Test score: 0.830769230769
Validation score: 0.82599469496

These scores are identical to those of the first classifier. This is expected: the selected parameters use the same `n_estimators=100` and `random_state`, and `max_depth=1584` is far larger than the deepest tree observed earlier (385), so the depth limit is never reached and the refitted forest is the same model.
