# Week 5 - Model Selection

## Learning Objectives
+ Supervised and unsupervised learning
+ Train-Test Split
+ Preprocessing and Choice of Model
    + Classification using Naive Bayes
    + Using Pipeline in Sklearn
+ Model Validation 
    + Evaluation Metrics
    
For this tutorial, you need the following installed: ```pandas```, ```numpy```, ```scikit-learn```, ```seaborn```, ```matplotlib``` and ```nltk```.



In [None]:
from google.colab import drive
drive.mount('/content/drive') 

# A supervised learning example (regression)

We work on the ```sklearn``` iris dataset to identify how sepal length varies with sepal width, petal length and petal width. 

+ We split the dataset into train and test datasets.
+ We fit the linear regression model on the train dataset.
+ We use the linear regression model to predict on the test dataset.
+ We calculate the mean squared error to understand goodness of fit.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data= iris.data, columns= iris.feature_names)
target_df = pd.DataFrame(data= iris.target, columns= ['species'])
iris_df = pd.concat([iris_df, target_df], axis= 1)
iris_df.head()

In [None]:
iris_df.describe()

In [None]:
import seaborn as sns
sns.set_theme()

sns.pairplot(iris_df, hue= 'species')

It is good practice to scale the dataset using ```StandardScaler()```, but since data preprocessing is not the focus of this tutorial and all features are measured in centimetres on a similar scale, we skip this step.
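
For reference, scaling could look like the sketch below. It is not part of this tutorial's workflow: the variable names and the integer seed are assumptions made for illustration. The point it shows is that the scaler learns its mean and standard deviation from the training data only and then reuses them on the test data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative split of the four iris measurements (seed chosen arbitrarily)
X_demo = load_iris().data
X_tr, X_te = train_test_split(X_demo, test_size=0.33, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test data

# Each training column now has (approximately) zero mean and unit variance
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```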

We split the dataset using ```train_test_split``` from ```sklearn```.

Whenever randomization is part of a Scikit-learn algorithm, a ```random_state``` parameter may be provided to control the random number generator used. In order to obtain reproducible (i.e. constant) results across multiple program executions, we need to remove all uses of ```random_state=None```, which is the default. The recommended way in sklearn is to declare a ```rng``` variable at the top of the program, and pass it down to any object that accepts a ```random_state``` parameter.
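
A small sketch of what reproducibility means in practice, using toy data that is assumed here purely for illustration. An integer seed gives the same split on every call, whereas a shared ```RandomState``` object is consumed as it is used, so the whole program is reproducible only when it is re-created with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration only
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# An integer seed gives the same split on every call
_, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
_, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
same_with_int = np.array_equal(test_a, test_b)

# A RandomState instance advances its internal state each time it is used ...
rng_demo = np.random.RandomState(0)
_, test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=rng_demo)

# ... but re-creating it with the same seed reproduces the same sequence of results
rng_demo = np.random.RandomState(0)
_, test_d, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=rng_demo)
same_after_reseed = np.array_equal(test_c, test_d)

print(same_with_int, same_after_reseed)
```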

In [None]:
X= iris_df[['petal width (cm)', 'petal length (cm)', 'sepal width (cm)']].values
y= iris_df['sepal length (cm)'].values

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= rng)

We perform a linear regression to fit the train dataset. How do we find the $R^2$?

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

What are the coefficients that give this $R^2$?
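
One way to answer this is to read the fitted model's ```coef_``` and ```intercept_``` attributes. The sketch below is self-contained and, as a simplifying assumption, fits on the whole iris dataset rather than on the train split used in this tutorial, so the numbers will differ slightly from those of ```lr```.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris_demo = load_iris()
# Columns of iris.data: sepal length, sepal width, petal length, petal width
X_all = iris_demo.data[:, 1:]  # predictors: the last three columns
y_all = iris_demo.data[:, 0]   # target: sepal length

lr_demo = LinearRegression().fit(X_all, y_all)

# One coefficient per predictor, in the same order as the columns of X_all
for name, coef in zip(iris_demo.feature_names[1:], lr_demo.coef_):
    print(f"{name}: {coef:.3f}")
print("intercept:", round(lr_demo.intercept_, 3))
```

On the model fitted above, the same attributes are available as ```lr.coef_``` and ```lr.intercept_```.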

We use the fitted linear regression model to perform predictions in the test dataset. We can also calculate metrics using the predictions. 

In [None]:
pred = lr.predict(X_test)  # predict once and keep the result for the metrics below
pred

In [None]:
from sklearn.metrics import mean_squared_error
print('Mean Squared Error:', mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, pred)))

# An unsupervised learning example

We work on the ```sklearn``` iris dataset to identify how petal length and petal width vary with species. We use ```KMeans``` clustering.

In [None]:
X= iris_df[['petal width (cm)', 'petal length (cm)']].values
y= iris_df['species'].values

How do I do a train test split? 

In [None]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3, random_state=rng)
km.fit(X)

In [None]:
iris_labels = km.labels_
iris_df['species_predict'] = iris_labels

In [None]:
import matplotlib.pyplot as plt
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 4]))
sns.scatterplot(data=iris_df, x="petal width (cm)", y="petal length (cm)", hue="species", ax=axs[0])
sns.scatterplot(data=iris_df, x="petal width (cm)", y="petal length (cm)", hue="species_predict", ax=axs[1])
f.tight_layout()

We can find the optimal number of clusters using the elbow method.

In [None]:
inertias = []  # within-cluster sum of squared distances for each k
K = range(1, 15)
for k in K:
    # pass random_state so the curve is reproducible across runs
    km = KMeans(n_clusters=k, random_state=rng)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()

# A supervised learning example (classification) on the Spam Detection dataset

Let's try to handle a new type of data: text data.

Given a text document, we want to classify whether it is spam or not (binary classification). We use the SMS Spam dataset available as this [Kaggle dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

The data is available as a CSV file in which the first column is the class label. The "spam" label marks a message as spam, while the "ham" label marks a message that is not spam.

Let us first load the dataset.

In [None]:
sms = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/IT5006/Week 5/spam.csv', encoding='latin-1')
sms.head()

In [None]:
sms = sms.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
sms = sms.rename(columns = {'v1':'label','v2':'message'})
sms.head()

In [None]:
sms.shape

# Train-Test split

Before we begin our modeling, let us first split the data into train and test sets. For this, we can use the [```train_test_split```](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) utility available in sklearn.

In [None]:
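# No test_size is given, so sklearn's default (25% of the rows) is held out for testing
sms_train, sms_test = train_test_split(sms, random_state=rng)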

In [None]:
print("No of samples in train set: %s"%(len(sms_train)))
print("No of samples in test set: %s"%(len(sms_test)))

# Data Preprocessing

Let us first get quick descriptive statistics of the data. As the aim of the tutorial is not preprocessing, we will keep these operations brief and focus mainly on handling text data and learning to train a model.

In [None]:
sms_train.describe()

In [None]:
sms_train.groupby('label').describe()

In [None]:
sms_train['length'] = sms_train['message'].apply(len)
sms_train.head()

## Punctuation and Stopword Removal

Stopwords are commonly used words, such as "a", "the", "is", etc. These words carry little useful information and hence are generally removed during preprocessing.

The NLTK library has a list of stopwords. We can use this list to filter out the stopwords from our documents. However, we must be careful about applying every preprocessing step, and should decide which ones to perform based on the data and the task.

In [None]:
import nltk
import string
import re

nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')

def wordCount(text, testing=False):
    """Count the words left in text after removing punctuation, digits and stop words."""
    try:
        text = text.lower()  # convert text to lower case
        if testing:
            print(text)
        # character class matching punctuation, digits, and \r, \t, \n
        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n' + ']')
        txt = regex.sub('', text)  # remove punctuation and digits
        if testing:
            print(txt)
        # keep only the words that are not stop words
        words = [w for w in txt.split(' ') if w not in stop]
        if testing:
            print(words)
        return len(words)
    except AttributeError:
        # non-string input (e.g. NaN) has no .lower(); treat it as an empty message
        return 0

In [None]:
wordCount(sms_train['message'].iloc[103], testing=True)

Let us now create two new features: the word length of the message and the processed word length. The processed word length is essentially the number of words in the message once the stopwords are removed.

We can use df.apply to count the number of words in each message.

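In [None]:
sms_train['word_length'] = sms_train['message'].apply(lambda x: len(x.split(' ')))
sms_train['processed_word_length'] = sms_train['message'].apply(wordCount)
sms_train.head()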

As we have done the preprocessing on the train set, we need to do the feature generation similarly for the test set.

In [None]:
sms_test['length'] = sms_test['message'].apply(len)
sms_test['word_length'] = sms_test['message'].apply(lambda x: len(x.split(' ')))
sms_test['processed_word_length'] = sms_test['message'].apply(wordCount)

In [None]:
x_train = sms_train[['length', 'word_length', 'processed_word_length']].to_numpy()
x_test = sms_test[['length', 'word_length', 'processed_word_length']].to_numpy()

We also change the spam/ham labels to numeric.

In [None]:
y_train = [1 if l=="spam" else 0 for l in sms_train['label']]
y_test = [1 if l=="spam" else 0 for l in sms_test['label']]

## CountVectorizer in sklearn

Sklearn includes a submodule dedicated to feature extraction from [images](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.image) and [text](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text). A useful and simple utility in the text submodule is the [```CountVectorizer```](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). It includes text preprocessing (punctuation removal, optional stopword removal and tokenization), builds a dictionary of features (the vocabulary) and transforms documents to feature vectors. It also has an option to consider n-grams, in case you are interested in more sophisticated analysis. For this tutorial, we will just generate a word count vector based on the constructed vocabulary.

![CountVectorizer](https://www.educative.io/api/edpresso/shot/5197621598617600/image/6596233398321152)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(sms_train.message)
print("No of samples in train set: %s"%(len(sms_train)))
X_train_counts.shape

What features are in our dataset?

In [None]:
sms_train.message.iloc[200]  # positional lookup, so it matches row 200 of X_train_counts

In [None]:
X_train_counts.toarray()[200]

In [None]:
X_test_counts = count_vect.transform(sms_test.message)

In [None]:
# What happens to words in the test dataset that were not seen during fitting?
# transform() ignores them, so a word outside the vocabulary gives a row that sums to 0
np.sum(count_vect.transform(["arts"]).toarray())

In [None]:
print(X_train_counts.shape)
print(x_train.shape)
print(X_test_counts.shape)
print(x_test.shape)

Let us now create the dataset by concatenating the word length features and the word count vector. Note that the word count vector already carries information about the number of words, which overlaps with the length features, so we do not expect a large improvement from this approach. We continue with it in this tutorial in order to learn how to combine features that were created in different ways.

In [None]:
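# toarray() returns a plain ndarray; todense() returns np.matrix, which recent sklearn versions reject
trainData = np.hstack((X_train_counts.toarray(), x_train))
testData = np.hstack((X_test_counts.toarray(), x_test))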

# Classification using Naive Bayes Algorithm

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. 

If *y* is the class to predict and the *x*s are the features, then Bayes' theorem gives the conditional probability of *y* given the *x*s. Under the conditional independence assumption among the features, the equations simplify to give us an estimate of *y*. The different Naive Bayes algorithms differ mainly in the distribution they assume for each feature given *y*.
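
Written out, with $x_1, \dots, x_n$ denoting the features (a notation assumed here for illustration), Bayes' theorem states:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Under the naive assumption, the likelihood factorizes over the features, and since the denominator does not depend on $y$ it can be dropped when comparing classes:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The predicted class is then the one that maximizes this product:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$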

In this tutorial, we do not aim to understand a specific classifier or how it works. The aim is to understand how we can experiment with the features and perform predictions using sklearn. Once the implementation is understood, the classifier can be swapped for another one from sklearn according to the problem at hand.

In this tutorial, we will try two different Naive Bayes algorithms available in sklearn. The [User Guide](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) is a useful resource for simple explanations of what can be used.

In [None]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB

clf = MultinomialNB()

In [None]:
clf.fit(x_train, y_train)

In [None]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix, accuracy_score

def evaluate(y_pred, y_test):
    print("###########")

    c = confusion_matrix(y_test, y_pred) #[[TN FP],[FN,TP]]
    tn, fp, fn, tp = c.ravel() # returns a flattened array

    print(c)
    print("###########")
    print("Accuracy:"+str(accuracy_score(y_test, y_pred)))
    sens, spec = tp/(tp+fn), tn/(tn+fp) 
    print("Specificity:{0}, Sensitivity: {1}".format(spec, sens))
    print("Precision:"+str(precision_score(y_test, y_pred)))

In [None]:
y_pred = clf.predict(x_test)

evaluate(y_pred, y_test)
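
As a sanity check on the metrics printed by ```evaluate```, the sketch below computes them by hand on a small set of made-up labels (assumed for illustration only) and confirms that sensitivity is the same quantity that sklearn calls recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = spam, 0 = ham
y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels the confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)  # share of actual spam that was caught
specificity = tn / (tn + fp)  # share of actual ham that was left alone
precision = tp / (tp + fp)    # share of predicted spam that really is spam

print(sensitivity, recall_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(specificity)
```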

In [None]:
clf = ComplementNB()

clf.fit(trainData, y_train)

In [None]:
y_pred = clf.predict(testData)

evaluate(y_pred, y_test)

# Putting it all together - Building a pipeline in sklearn

We have already seen the ```ColumnTransformer``` in sklearn. We also know the naming conventions in sklearn, and have a rough idea of how sklearn makes it easy to put the earlier blocks together for experimentation. [```Pipeline```](https://scikit-learn.org/stable/modules/compose.html#pipeline) in sklearn can be used for chaining different estimators together. When we have a fixed sequence of operations, it is usually helpful to put them all together in a pipeline.

However, for that we need the operations to be in the form of estimators as well. We can easily do so by using the existing sklearn API, or by writing a custom transformer if we have done custom preprocessing.

## Writing Custom Transformer

You can implement a transformer from an arbitrary function with [```FunctionTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer). However, if we do not have a specific function to wrap as a transformer, but want the flexibility to implement our own operations, we can write our transformer using two base classes from sklearn:
1. [```BaseEstimator```](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html): This class provides the ```get_params``` and ```set_params``` functions.
2. [```TransformerMixin```](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html): This class provides the ```fit_transform``` function once we define our own ```fit``` and ```transform``` functions.
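
As a sketch of the first option, the snippet below wraps a plain function in a ```FunctionTransformer```. The function ```message_length``` and the sample messages are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def message_length(messages):
    # Hypothetical helper: one column holding the character count of each message
    return messages.str.len().to_frame(name="length")

# Wrap the function so it exposes fit / transform like any other transformer
length_tf = FunctionTransformer(message_length)

msgs = pd.Series(["Win a prize now", "ok"])
out = length_tf.fit_transform(msgs)
print(out)
```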

In general, it is good to note that all estimators should specify all the parameters that can be set at the class level in their ```__init__``` as explicit keyword arguments. However, our transformation does not store any parameters, and hence we can skip the ```__init__``` function.
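
To illustrate the convention for a transformer that does have a parameter, here is a minimal sketch. ```LongWordCounter``` and its ```min_chars``` parameter are hypothetical and are not used elsewhere in this tutorial.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LongWordCounter(BaseEstimator, TransformerMixin):  # hypothetical example
    def __init__(self, min_chars=3):
        # Store the argument unchanged under the same name,
        # so that get_params and set_params can find it
        self.min_chars = min_chars

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Count the words with at least min_chars characters in each message
        counts = X.apply(
            lambda text: sum(len(w) >= self.min_chars for w in text.split())
        )
        return counts.to_frame(name="long_words")

counter = LongWordCounter(min_chars=4)
print(counter.get_params())
out = counter.fit_transform(pd.Series(["free prize call now", "ok see you"]))
print(out)
```

Because the parameter is declared in ```__init__```, it can later be tuned like any other hyperparameter, for instance through ```set_params```.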

Let's create a transformer that creates the ```length```, ```word_length``` and ```processed_word_length``` features in the dataframe.

In [None]:
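from sklearn.base import BaseEstimator, TransformerMixin

def wordLength(text):
    # Not defined earlier in the notebook; mirrors the word_length feature computed above
    return len(text.split(' '))

class FeatureCreator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn: the features are computed per message
        return self

    def transform(self, X, y=None):
        # X is a pandas Series of messages; build one column per feature
        df = pd.DataFrame()
        df['length'] = X.apply(len)
        df['word_length'] = X.apply(wordLength)
        df['processed_word_length'] = X.apply(wordCount)
        return df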

In [None]:
proc1 = FeatureCreator()
x_train = proc1.fit_transform(sms_train.message)
print(x_train.shape)

In [None]:
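class DenseTransformer(BaseEstimator, TransformerMixin):
    # BaseEstimator adds get_params/set_params, which Pipeline expects from its steps

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns np.matrix, which recent sklearn rejects
        return X.toarray()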

## Preprocessing using FeatureUnion 

[```FeatureUnion```](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion) combines several transformer objects into a new transformer that combines their output. A ```FeatureUnion``` takes a list of transformer objects. During fitting, each of these is fit to the data **independently**. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

Do note here that each transformer object is fit to the entire data. If you want to specify a different transformer for each column, you can go back to the [```ColumnTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) covered in the previous tutorial.
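
The side-by-side concatenation can be seen on a toy corpus, assumed here for illustration. The helper ```char_length``` is hypothetical. Each transformer receives the same list of documents, and the output has one length column followed by one column per vocabulary word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy corpus, assumed for illustration only
docs = ["free prize call now", "call me now", "see you at lunch"]

def char_length(texts):
    # Hypothetical helper: a single column with the character count of each document
    return np.array([[len(t)] for t in texts])

union = FeatureUnion([
    ("length", FunctionTransformer(char_length)),  # 1 column
    ("counts", CountVectorizer()),                 # 1 column per vocabulary word
])

combined = union.fit_transform(docs)
print(combined.shape)
```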

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion

tf_pipeline = Pipeline([("countvec", count_vect), ("to_dense",DenseTransformer())])
feats = FeatureUnion([("lengths", proc1), ("tf", tf_pipeline)])

In [None]:
feats.fit(sms_train.message)
x_train = feats.transform(sms_train.message)
print(x_train.shape)

What should we do with the test dataset?

In [None]:
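# Reuse the transformers fitted on the train set; do not fit again on the test set
x_test = feats.transform(sms_test.message)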

In [None]:
x_train

In [None]:
text_clf = Pipeline([('feats', feats),('clf', clf)])

text_clf.fit(sms_train.message, y_train)

In [None]:
y_pred = text_clf.predict(sms_test.message)
evaluate(y_pred, y_test)