# Supervised Learning

In this lab, we will build a model that predicts whether a horse with colic will survive, based on its past medical conditions. The data comes from the Horse Colic Dataset. The 'outcome' column records what happened to each horse and will serve as the label.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## Data Exploration

In [None]:
df = pd.read_csv('horse.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe() #Only describes numeric columns, not categorical

In [None]:
df.describe(include='object')

## Querying the dataset

Q1. Create any queries which will better help you understand the dataset.

In [None]:
#TODO
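
As a starting point for Q1, here is a minimal sketch of three common exploratory queries. It runs on a tiny stand-in DataFrame rather than on horse.csv; apart from 'outcome', the column names and all values are illustrative assumptions.

```python
import pandas as pd

# Tiny stand-in for the real data; 'age' and 'pulse' are assumed column names.
df_demo = pd.DataFrame({
    "outcome": ["lived", "died", "lived", "euthanized", "lived", "died"],
    "age": ["adult", "adult", "young", "adult", "adult", "young"],
    "pulse": [66.0, 88.0, 40.0, 164.0, 104.0, None],
})

# Share of each outcome among all rows.
outcome_share = df_demo["outcome"].value_counts(normalize=True)

# Average pulse per outcome (mean() skips missing values).
pulse_by_outcome = df_demo.groupby("outcome")["pulse"].mean()

# Cross-tabulate a categorical feature against the label.
age_vs_outcome = pd.crosstab(df_demo["age"], df_demo["outcome"])

print(outcome_share)
print(pulse_by_outcome)
print(age_vs_outcome)
```

The same three calls can be pointed at `df` once you know which columns you want to compare against 'outcome'.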



## Data Visualization

Check out the bar plot below showing the outcome of different horses.

In [None]:
sns.countplot(x='outcome',data=df)

Q2. Make your own visualizations and try to figure out which features are important. Feel free to use the examples from the previous lab, but try to get creative.

In [None]:
# Plot


In [None]:
# Plot


## Fill missing values

Use the **fillna()** function in pandas to fill missing values. Make sure all the features have the correct dtypes. <br/>
**Numeric**: float or int <br/>
**Categorical**: object or int

In [None]:
# Check features have correct dtype


In [None]:
# Calculate number of NaNs in each column


Q3. Fill all NaN values for numeric features with mean.

Q4. Fill in NaN values for categorical features with mode.
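
One possible way to approach Q3 and Q4 is sketched below on a small made-up DataFrame. The column names and values are assumptions for illustration, and splitting columns with `select_dtypes` is just one option; you may prefer to list the columns by hand.

```python
import pandas as pd

# Made-up data with one numeric and one categorical column.
df_demo = pd.DataFrame({
    "pulse": [60.0, None, 90.0],
    "pain": ["mild", None, "mild"],
})

# Number of missing values in each column before filling.
print(df_demo.isnull().sum())

# Split the columns into numeric and non-numeric groups.
numeric_cols = df_demo.select_dtypes(include="number").columns
categorical_cols = df_demo.select_dtypes(exclude="number").columns

# Numeric features: fill NaN with the column mean.
df_demo[numeric_cols] = df_demo[numeric_cols].fillna(df_demo[numeric_cols].mean())

# Categorical features: fill NaN with the mode.
# mode() returns a DataFrame, so take its first row.
df_demo[categorical_cols] = df_demo[categorical_cols].fillna(
    df_demo[categorical_cols].mode().iloc[0]
)

print(df_demo)
```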

## Transformation of Skewed Continuous Features

Continuous features are sometimes distributed so that most values sit near one central value while a non-trivial number of much larger or smaller values remain. Such values may negatively affect the learning algorithm, so it is common to apply a transformation, such as a log transformation, to these features.
Let us take an example:

In [None]:
sns.histplot(df['total_protein'], kde=False)  #histplot replaces the deprecated distplot

We can see that most values lie between 0 and 20, but a large number of data points are greater than 40. We need to transform this feature.

Q5. Carry out a log transformation on total_protein by applying natural log on all values.

In [None]:
#TODO
df['total_protein'] = None  #Replace None with the log-transformed column
# sns.histplot(df['total_protein'], kde=False)
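
The sketch below shows what a natural-log transformation does to a skewed column. The values are made up rather than taken from horse.csv. Note that `np.log` is only defined for strictly positive values; `np.log1p` is a common alternative when zeros may occur.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for 'total_protein'.
protein = pd.Series([6.5, 7.2, 8.0, 54.0, 81.0], name="total_protein")

# Natural log of every value (requires all values > 0).
log_protein = np.log(protein)

# log(1 + x), which stays finite when x is 0.
log1p_protein = np.log1p(protein)

# Compare the spread before and after the transformation.
print(protein.max() - protein.min())
print(log_protein.max() - log_protein.min())
```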

## Feature Selection

Q6. Select features from the dataset based on your analysis.

In [None]:
numerical_features = []
categorical_features = []
X = df[numerical_features+categorical_features]
y = df["outcome"]

## Ordinal and One-hot encoding of categorical attributes

Learning algorithms generally expect numeric values. However, categorical attributes can carry a lot of information for the model, so we incorporate them by encoding them as numbers.

Q7. Encode the categorical variables (use **pd.get_dummies()** for one-hot encoding).

In [None]:
#TODO

#Ordinal (can also use OrdinalEncoder())

#One-hot


X.head()
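
To illustrate the difference between the two encodings, here is a minimal sketch on a made-up frame. The feature names, the category levels and their ordering are assumptions; which of your features are truly ordinal is for you to decide.

```python
import pandas as pd

# Made-up categorical features.
X_demo = pd.DataFrame({
    "pain": ["mild", "severe", "none", "mild"],
    "surgery": ["yes", "no", "yes", "yes"],
})

# Ordinal encoding: the categories have a natural order, so map them to ranks.
pain_order = {"none": 0, "mild": 1, "severe": 2}
X_demo["pain"] = X_demo["pain"].map(pain_order)

# One-hot encoding: no natural order, so create one indicator column per category.
X_demo = pd.get_dummies(X_demo, columns=["surgery"])

print(X_demo)
```

A plain dictionary with `map` keeps the chosen order explicit; `OrdinalEncoder` with its `categories` argument achieves the same result inside an sklearn workflow.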

## Encoding the labels

One more step before we move on is converting the categorical labels in 'outcome' to numbers. This is called label encoding, and is done with the help of LabelEncoder. Check the sklearn documentation example for help.

Q8. Using LabelEncoder, encode 'outcome'

In [None]:
from sklearn.preprocessing import LabelEncoder

le = None  #Instantiate the encoder
y = None   #Fit and transform the labels using labelencoder

y
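
For reference, this is how LabelEncoder behaves on a short list of made-up labels. The label values are illustrative; the encoder assigns integers according to the sorted order of the classes it sees.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the 'outcome' column.
labels = ["lived", "died", "euthanized", "lived"]

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)

# classes_ holds the original labels in sorted order; their positions are the codes.
print(le_demo.classes_)

# Encoded labels.
print(y_demo)

# inverse_transform maps the codes back to the original labels.
print(le_demo.inverse_transform(y_demo))
```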

## Creating a Train and Test Split

Q9. Create a training and validation split of the data. Check out the train_test_split() function from sklearn to do this.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = None, None, None, None  #Replace with a call to train_test_split
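
A minimal sketch of train_test_split on synthetic arrays follows. The 80/20 ratio, the fixed `random_state` and the use of `stratify` are assumptions chosen for illustration, not requirements of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, two balanced classes.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# Hold out 20% for validation; stratify keeps the class ratio the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_va.shape)
```

Stratifying is worth considering when the classes are imbalanced, as a plain random split could leave a rare class under-represented in the validation set.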


## Scaling of numeric attributes

Q10. Scale the numeric attributes using MinMaxScaler, StandardScaler (Z-score normalization) or RobustScaler. Fit the scaler on the training set only, then use that fitted scaler to transform both the training and validation sets!

In [None]:
#TODO
from sklearn.preprocessing import RobustScaler

scaler = None  #Instantiate the scaler 
X_train[numerical_features] = None  #Fit and transform the features using scaler
X_val[numerical_features] = None  #Transform the features using scaler

X_train[numerical_features].head()
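
The sketch below illustrates the fit-on-train, transform-both pattern with RobustScaler on made-up numbers. The column name and values are assumptions. RobustScaler subtracts the median and divides by the interquartile range, both computed from the data it was fitted on.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up training and validation values for a single numeric feature.
train = pd.DataFrame({"pulse": [40.0, 60.0, 80.0, 100.0, 120.0]})
val = pd.DataFrame({"pulse": [80.0, 140.0]})

scaler_demo = RobustScaler()

# Learn the median (80) and interquartile range (100 - 60 = 40) from the training data.
train_scaled = scaler_demo.fit_transform(train)

# Reuse the training statistics for the validation data; do not refit.
val_scaled = scaler_demo.transform(val)

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Refitting the scaler on the validation set would leak information about that set into preprocessing and make the two sets inconsistent.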

## Model Selection and Training

Q11. Select 2 classifiers, instantiate them and train them. A few models are given below:
- DecisionTreeClassifier
- GaussianNB
- RandomForestClassifier
- Support Vector Machine
- Any other classifier, look them up!

In [None]:
# Initialize and train
clf1 = None
clf2 = None

## Cross Validation and Performance Analysis

In [None]:
from sklearn.metrics import accuracy_score

y_pred_1 = None
y_pred_2 = None

acc1 = None
acc2 = None

print("Accuracy score of clf1: {}".format(acc1))
print("Accuracy score of clf2: {}".format(acc2))
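
As a rough illustration of the train-then-score workflow, the sketch below fits two of the listed classifiers on synthetic data from `make_classification`. The data, the choice of classifiers and the `_demo` names are assumptions; the accuracies it prints say nothing about the horse dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                          # train on the training split
    y_pred = model.predict(X_va)                   # predict on the validation split
    scores[name] = accuracy_score(y_va, y_pred)    # fraction of correct predictions
    print("Accuracy score of {}: {}".format(name, scores[name]))
```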

## Hyperparameter Tuning

How do we optimize the classifier in order to produce the best results? We need to tune the model by varying its hyperparameters. We can use GridSearchCV to simplify the whole process.

For GridSearchCV, carry out the following steps (we will only do this for one classifier, so choose one of your previous classifiers):
- Initialize a new classifier object.
- Create a dictionary of the parameters you wish to tune (e.g. parameters = {'param_name':[list of values]}).
- Note: Avoid tuning the max_features parameter of your learner if that parameter is available!
- Use make_scorer to create a scorer object from accuracy_score.
- Set up a grid search on the classifier clf using the 'scorer', and store it in grid_obj.
- Fit the grid search object to the training data (X_train, y_train), and store it in grid_fit.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#TODO
clf = None           #Initialize the classifier object

parameters = None    #Dictionary of parameters

scorer = None        #Initialize the scorer using make_scorer

grid_obj = None      #Initialize a GridSearchCV object with above parameters,scorer and classifier

grid_fit = None      #Fit the gridsearch object with X_train,y_train

best_clf = None      #Get the best estimator. For this, check documentation of GridSearchCV object

unoptimized_predictions = None      #Using the unoptimized classifier, generate predictions
optimized_predictions = None        #Same, but use the best estimator

acc_unop = None       #Calculate accuracy for unoptimized model
acc_op = None         #Calculate accuracy for optimized model

print("Accuracy score on unoptimized model:{}".format(acc_unop))
print("Accuracy score on optimized model:{}".format(acc_op))
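
The steps listed above can be sketched end to end as follows. This uses synthetic data and a DecisionTreeClassifier; the parameter grid, `cv=5` and the `_demo` names are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf_demo = DecisionTreeClassifier(random_state=0)

# Candidate values for each hyperparameter; every combination is tried.
parameters_demo = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]}

# Wrap accuracy_score so GridSearchCV can use it to rank combinations.
scorer_demo = make_scorer(accuracy_score)

grid_obj_demo = GridSearchCV(clf_demo, parameters_demo, scoring=scorer_demo, cv=5)
grid_fit_demo = grid_obj_demo.fit(X_tr, y_tr)

# Best estimator, refitted on the whole training split.
best_clf_demo = grid_fit_demo.best_estimator_

# GridSearchCV works on clones, so the original classifier still needs fitting.
unopt_pred = clf_demo.fit(X_tr, y_tr).predict(X_va)
opt_pred = best_clf_demo.predict(X_va)

acc_unop_demo = accuracy_score(y_va, unopt_pred)
acc_op_demo = accuracy_score(y_va, opt_pred)

print("Best parameters: {}".format(grid_fit_demo.best_params_))
print("Accuracy score on unoptimized model:{}".format(acc_unop_demo))
print("Accuracy score on optimized model:{}".format(acc_op_demo))
```

The tuned model is selected by cross-validated accuracy on the training split, so it is not guaranteed to score higher on the validation split.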

We have learnt some methods to boost our accuracy. Try experimenting with the functions above to improve model performance.