# Week 5 - Model Selection

## Learning Objectives
+ Train-Test Split
+ Preprocessing and Choice of Model
    + Classification using Naive Bayes
    + Using Pipeline in Sklearn
+ Model Validation 
    + Evaluation Metrics
    + Cross Validation
+ Introduction to Hyperparameter tuning
    + Introduction to GridSearchCV
    
For this tutorial, you need the following installed:
```
conda install -c anaconda nltk
```


In this tutorial, let us consider another type of data we have not yet learnt to handle - text data. 

# Dataset - Spam Detection

Given a text document, we want to classify whether it is spam or not (binary classification). We use the SMS Spam dataset available as this [Kaggle dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

The data is available as a CSV file in which the first column is the class label. The "spam" label marks a message categorized as spam, while the "ham" label marks an SMS that is not spam.

Let us first load the dataset.
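As a minimal sketch of the loading step, the snippet below reads a tiny in-memory stand-in for the CSV file with pandas. The column names `label` and `text` and the three toy rows are assumptions made for illustration; the real file may use different column names and may need an explicit `encoding` argument.

```python
import io

import pandas as pd

# Toy stand-in for the CSV file: the first column is the class label,
# the second column is the message. Column names are hypothetical.
toy_csv = io.StringIO(
    "label,text\n"
    "ham,Are we still meeting for lunch today?\n"
    "spam,WINNER! Claim your free prize now\n"
    "ham,I will call you later\n"
)

# With the real dataset you would pass the path of the downloaded file instead.
df = pd.read_csv(toy_csv)

print(df.shape)
print(df["label"].value_counts())
```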

In order to obtain reproducible (i.e. constant) results across multiple program executions, we need to remove all uses of ```random_state=None```, which is the default. The recommended way in sklearn is to declare a ```rng``` variable at the top of the program, and pass it down to any object that accepts a ```random_state``` parameter.
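A minimal sketch of this idea, assuming an arbitrary seed of `0`: the `rng` object is created once and can then be passed as `random_state`. The second part only illustrates why a fixed seed gives repeatable results.

```python
import numpy as np

# Declared once at the top of the program; the seed value 0 is an arbitrary choice.
rng = np.random.RandomState(0)

# Two generators created with the same seed produce the same random stream,
# which is what makes results repeatable across program executions.
first_run = np.random.RandomState(0).permutation(5)
second_run = np.random.RandomState(0).permutation(5)
print(np.array_equal(first_run, second_run))  # True
```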

# Train-Test split

Before we begin modeling, let us first split the data into a train set and a test set. For this, we can use the [```train_test_split```](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) utility available in sklearn.
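The split could look like the sketch below. The toy messages, the 80/20 ratio, and the use of `stratify` are assumptions made for illustration, not necessarily the settings used in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)  # arbitrary seed for reproducibility

# Toy data: 10 placeholder messages with 6 "ham" and 4 "spam" labels.
texts = np.array([f"message {i}" for i in range(10)])
labels = np.array(["ham"] * 6 + ["spam"] * 4)

# Hold out 20% of the data; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=rng
)

print(len(X_train), len(X_test))
```

The test set is put aside at this point and only used for the final evaluation.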

# Data Preprocessing

Let us first get quick descriptive statistics of the data. As the aim of the tutorial is not preprocessing, we will only do a few quick operations and focus mainly on handling text data and learning to train a model.

## Punctuation and Stopword Removal

Stopwords are commonly used words, such as "a", "the", "is", etc. These words do not carry much useful information and hence are generally removed during preprocessing.

The NLTK library has a list of stopwords. We can use this list to filter out the stopwords from our documents. However, we must be careful about applying all these preprocessing steps, and decide which preprocessing to perform based on the data and the task.
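The filtering step can be sketched as follows. To keep the example self-contained, it uses a small hand-written stopword set, which is an assumption; with NLTK installed, the fuller list is available through `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`. The function name `remove_punct_and_stopwords` is hypothetical.

```python
import string

# Small illustrative stopword set (an assumption); NLTK provides a much longer list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}


def remove_punct_and_stopwords(message, stopwords=STOPWORDS):
    """Strip punctuation, split on whitespace, and drop stopwords."""
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word.lower() not in stopwords]


print(remove_punct_and_stopwords("The prize is waiting, claim it now!"))
# ['prize', 'waiting', 'claim', 'it', 'now']
```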

Let us now create two new features: the word length of the message and the processed word length. The processed word length counts only the words that remain in the message after the stopwords are removed.
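A minimal sketch of these two features with pandas is shown below. It assumes that "word length" means the number of whitespace-separated words, and it uses a small illustrative stopword set; the column names `word_count` and `processed_word_count` are hypothetical.

```python
import string

import pandas as pd

# Small illustrative stopword set (an assumption).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}


def clean_tokens(message):
    """Return the words of a message without punctuation and stopwords."""
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w.lower() not in STOPWORDS]


df = pd.DataFrame(
    {"text": ["The prize is waiting, claim it now!", "Call me in the evening"]}
)

# Number of whitespace-separated words in the raw message.
df["word_count"] = df["text"].str.split().str.len()
# Number of words left after punctuation and stopword removal.
df["processed_word_count"] = df["text"].apply(lambda t: len(clean_tokens(t)))

print(df[["word_count", "processed_word_count"]])
```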

The distributions seem to be different, and hence we can use all three features. However, as these features are constructed by the same approach, they will be highly correlated, and it is usually not a great idea to include several highly correlated features. For this tutorial, however, let us continue. As we have done the preprocessing on the train set, we need to generate the same features for the test set.

## CountVectorizer in sklearn

Sklearn includes a submodule dedicated to feature extraction from [images](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.image) and [text](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text). A useful and simple utility in the text submodule is the [```CountVectorizer```](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). It includes text preprocessing (punctuation removal, optional stopword removal, and tokenization), builds a dictionary of features (the vocabulary), and transforms documents to feature vectors. It also has an option to consider n-grams, in case you are interested in more sophisticated analysis. For this tutorial, we will just generate a word count vector based on the constructed vocabulary.
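The sketch below shows the word count vectors on three toy documents (the documents are made up for illustration). The vocabulary is learnt with `fit_transform`, and a new document is then encoded with `transform`, so that words outside the vocabulary are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "call me now", "free free call"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # learn the vocabulary and count words

# Mapping from each vocabulary word to its column index.
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# One row per document, one column per vocabulary word.
print(counts.toarray())

# A new document is encoded with the vocabulary learnt above.
new_counts = vectorizer.transform(["win a prize"])
print(new_counts.toarray())
```

In practice the vectorizer is fit on the train set only and then applied to the test set with `transform`.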

# Classification using Naive Bayes Algorithm

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. If *y* is the class to predict and the *x*s are the features, then Bayes' theorem gives the conditional probability of *y* given the *x*s. Using the conditional independence assumption among the features, the equations simplify and provide us with an estimate of *y*. The different Naive Bayes algorithms mainly differ in the distribution they assume for each feature given *y*.
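Written out for features $x_1, \dots, x_n$ and class $y$ (the symbol $\hat{y}$ for the predicted class is introduced here for illustration), Bayes' theorem states:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

Under the naive conditional independence assumption, the likelihood factorizes over the features:

$$
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$

Since the denominator does not depend on $y$, the predicted class is the one that maximizes the numerator:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$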

In this tutorial, we do not aim to understand a specific classifier or how it works. The aim is to understand how we can experiment with the features and perform predictions using sklearn. Once the implementation is understood, the classifier can be swapped for another one in sklearn according to the problem at hand.

In this tutorial, we will try two different Naive Bayes algorithms available in sklearn. The [User Guide](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) is a useful resource for simple explanations of which variant can be used when.
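As a sketch, the snippet below fits two Naive Bayes variants on word counts. The choice of `MultinomialNB` and `BernoulliNB` is an assumption made for illustration, since both are commonly applied to text; the toy messages and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy training data, invented for illustration.
train_docs = [
    "free prize claim now",
    "win cash prize free",
    "claim your free cash",
    "are we meeting today",
    "call me when you arrive",
    "see you at lunch today",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_new = vectorizer.transform(["free prize call now"])

# Both classifiers share the same fit/predict interface,
# so swapping one for the other is a one-line change.
predictions = {}
for clf in (MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, train_labels)
    predictions[type(clf).__name__] = clf.predict(X_new)[0]

print(predictions)
```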

# Putting it all together - Building a pipeline in sklearn

We have already seen the ```ColumnTransformer``` in sklearn. We also know the naming conventions in sklearn, and have a rough idea of how sklearn makes our life easy when putting the earlier blocks together for experimentation. [```Pipeline```](https://scikit-learn.org/stable/modules/compose.html#pipeline) in sklearn can be used for chaining different estimators together. When we have a fixed sequence of operations, it is usually helpful to chain them in a pipeline.

However, for that we need the operations to be in the form of estimators as well. We can easily achieve this by using the existing sklearn API, or by writing a custom transformer if we have done custom preprocessing.
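A minimal sketch of such a chain, using only existing sklearn estimators: a `CountVectorizer` followed by a `MultinomialNB` classifier. The step names `vect` and `clf`, the choice of classifier, and the toy data are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration.
train_docs = [
    "free prize claim now",
    "win cash prize free",
    "are we meeting today",
    "call me when you arrive",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Each step is a (name, estimator) pair; all steps but the last must be transformers.
text_clf = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("clf", MultinomialNB()),
    ]
)

# fit runs fit_transform on the vectorizer and then fit on the classifier.
text_clf.fit(train_docs, train_labels)

# predict takes raw text: the pipeline vectorizes it before classifying.
predicted = text_clf.predict(["claim your free prize", "see you today"])
print(predicted)
```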

## Writing Custom Transformer

You can implement a transformer from an arbitrary function with [```FunctionTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer). However, if we do not have a single function to wrap as a transformer and want the flexibility to implement our own operations, we can write a transformer using two base classes from sklearn: 
1. [```BaseEstimator```](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html): As seen in the previous tutorial, this class provides the get_params and set_params functions. 
2. [```TransformerMixin```](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html): This class provides the fit_transform function once we define our own fit and transform functions.

In general, it is good to note that all estimators should declare every parameter that can be set at the class level as an explicit keyword argument in their ```__init__```. However, our transformation does not store any parameters, and hence we can skip the ```__init__``` function.
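The sketch below shows what such a transformer could look like. The class name `WordCountTransformer` and its behaviour (one column holding the number of words per message) are hypothetical and only illustrate the pattern; it has no `__init__` because it stores no parameters.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WordCountTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each message to its number of words."""

    def fit(self, X, y=None):
        # Nothing is learnt from the data, so fit simply returns self.
        return self

    def transform(self, X):
        # Return a 2D array with one row per message and a single column.
        return np.array([[len(str(doc).split())] for doc in X])


docs = ["free prize now", "call me"]

# fit_transform is provided by TransformerMixin.
word_counts = WordCountTransformer().fit_transform(docs)
print(word_counts)

# get_params is provided by BaseEstimator; it is empty here (no __init__ arguments).
print(WordCountTransformer().get_params())
```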

## Preprocessing using FeatureUnion 

[```FeatureUnion```](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion) combines several transformer objects into a new transformer that combines their output. A ```FeatureUnion``` takes a list of transformer objects. During fitting, each of these is fit to the data **independently**. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

Do note here that each transformer object is fit to the entire data. If you want to specify different transformers for different columns, you can go back to the [```ColumnTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) covered in the previous tutorial. 
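As a sketch, the union below concatenates the word counts from `CountVectorizer` with one extra column produced by a hypothetical `WordCountTransformer`. The class, the step names `bow` and `n_words`, and the toy documents are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class WordCountTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each message to its number of words."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(str(doc).split())] for doc in X])


# Both transformers receive the same input and are fit independently.
union = FeatureUnion(
    [
        ("bow", CountVectorizer()),
        ("n_words", WordCountTransformer()),
    ]
)

docs = ["free prize now", "call me now"]
features = union.fit_transform(docs)

# 5 vocabulary columns from CountVectorizer plus 1 word-count column.
print(features.shape)
```

Because one of the outputs is sparse, the combined matrix is sparse as well; `features.toarray()` converts it to a dense array for inspection.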

# Hyperparameter Tuning

Hyper-parameters are parameters that are not directly learnt within estimators. To understand this better, consider the Bayesian classifier. We can assume that the data follows a normal distribution, or some other distribution. The estimator learns the parameters of the assumed distribution from the data, but the choice of distribution itself is not learnt within the estimator; it is passed to it by us. That choice is a hyper-parameter of the model. Let us see the hyperparameters for our text classifier pipeline estimator.
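The hyperparameters of a pipeline can be listed with `get_params`, as in the sketch below. The pipeline shown here, with step names `vect` and `clf`, is an assumed stand-in for the text classifier pipeline of this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumed stand-in for the text classifier pipeline.
text_clf = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("clf", MultinomialNB()),
    ]
)

# Parameters of a step are addressed as <step name>__<parameter name>.
param_names = sorted(text_clf.get_params().keys())
for name in param_names:
    print(name)
```

Names such as `vect__ngram_range` or `clf__alpha` are the keys to use later when building a parameter grid.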

## GridSearch CV

The grid search algorithm runs through all combinations of the parameter values supplied in the parameter grid and returns the best combination, based on a scoring metric of your choice (accuracy, f1, etc).

By default, ```GridSearchCV``` uses 5-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses stratified 5-fold. When a splitter such as ```KFold(..., shuffle=True)``` is used, its ```random_state``` parameter defaults to ```None```, meaning that the shuffling will be different every time the splitter is iterated. However, ```GridSearchCV``` will use the same shuffling for each set of parameters validated by a single call to its ```fit``` method. A recommended read on handling randomness in sklearn and getting reproducible results: [link](https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness).

Please note that by default, parameter search uses the ```score``` function of the estimator to evaluate a parameter setting. For classification, this is ```sklearn.metrics.accuracy_score```.

After fitting the models through GridSearchCV we can get the results using the ```cv_results_``` attribute. We can see the best estimator from the grid search using ```best_estimator_``` attribute. 
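The whole search can be sketched as follows. The pipeline, the parameter grid, and the toy data are assumptions made for illustration; because the toy data has only 8 messages, the sketch uses a stratified 2-fold splitter with a fixed seed instead of the default 5-fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration.
docs = [
    "free prize claim now",
    "win cash prize free",
    "claim your free cash",
    "urgent free prize waiting",
    "are we meeting today",
    "call me when you arrive",
    "see you at lunch today",
    "I will call you later",
]
labels = ["spam"] * 4 + ["ham"] * 4

text_clf = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# Fixed seed so that the folds are the same on every run.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

search = GridSearchCV(text_clf, param_grid, cv=cv, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])  # one score per parameter combination

# best_estimator_ is the pipeline refit on all the data with the best parameters.
best_model = search.best_estimator_
print(best_model.predict(["free cash prize"]))
```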


# Practice Exercise (Optional):
1. What happens when we set the ```rng``` variable to a specific integer? How would it affect the cross-validation? Does it affect our code? Read this page for [reference](https://scikit-learn.org/stable/common_pitfalls.html#robustness-of-cross-validation-results).
2. What validation strategy have we used in the code? What is the advantage of having a held-out test set?
3. If you do not set the ```scoring``` parameter in GridSearchCV, what is the scoring criterion being used? 
4. Can you do the grid search on multiple scoring criteria (e.g. accuracy and AUC)? Read this [reference](https://scikit-learn.org/stable/modules/grid_search.html#specifying-multiple-metrics-for-evaluation) and try to change the code to be able to do this.