# Binary classification hands-on

In this hands-on you will solve a simple classification problem on one of the "dummy" datasets. The task consists of:

1. Loading and exploring the dataset
2. Transforming the dataset so the features are usable by the given classification algorithms
3. Training a decision tree model
4. Evaluating the model results and its performance
5. (Bonus) Experimenting with different algorithms, performance evaluation methods, etc.

In [None]:
# import the packages data scientists use most often (import additional ones when needed)
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# dataset_load
df = sns.load_dataset("titanic")
df.head(10)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


* pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
* survived: Survival (0 = No; 1 = Yes)
* name: Name
* sex: Sex
* age: Age
* sibsp: Number of siblings/spouses aboard
* parch: Number of parents/children aboard
* ticket: Ticket number
* fare: Passenger fare (British pound)
* cabin: Cabin
* embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
* boat: Lifeboat
* body: Body identification number
* home.dest: Home/destination

Note: this description comes from the original Titanic dataset. The columns name, ticket, cabin, boat, body and home.dest do not appear in the seaborn version shown above.

### Dataset exploration (15 min)

The target variable of the dataset is in the column **survived**. Skim through the dataset and see which features are numeric, which are categorical, which have missing values, etc. The .dtypes attribute of the dataframe can come in handy. Also, find out which columns contain missing values and what percentage of the values in each of those columns is missing.
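
One possible way to approach this check is sketched here on a tiny hand-made frame rather than on the real data. The frame `toy` and its values are assumptions made for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with a few Titanic-like columns (hypothetical values).
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, np.nan, 35.0],
    "sex": ["male", "female", "female", "male"],
    "deck": [None, "C", None, None],
})

# Column types: numeric columns vs. text (categorical) columns.
print(toy.dtypes)

# isna() gives a boolean frame; the column mean is the share of missing values.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same two calls should work unchanged on `df`.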

### Dataset visual exploration

Afterwards, make a plot for each feature to show which of the variables are likely to have predictive power. For numerical features, plot two histograms in one plot (one for target 0 and one for target 1). For categorical features, use a countplot/catplot/barplot or a simple event rate plot (the ratio of survived to not survived for each category).
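
A minimal sketch of both plot types, using plain matplotlib and randomly generated stand-in data (the frame `toy` is hypothetical, so the plots carry no real signal). The event rate is taken here as the mean of the 0/1 target within each category, i.e. the share of survivors.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data (assumption): replace with the real dataframe.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "survived": rng.integers(0, 2, size=200),
    "age": rng.normal(30, 12, size=200).clip(1, 80),
    "sex": rng.choice(["male", "female"], size=200),
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Numerical feature: one histogram per target value, overlaid on shared bins.
bins = np.linspace(0, 80, 17)
for target, group in toy.groupby("survived"):
    ax_num.hist(group["age"], bins=bins, alpha=0.5, label=f"survived = {target}")
ax_num.set_xlabel("age")
ax_num.legend()

# Categorical feature: event rate = mean of the 0/1 target per category.
event_rate = toy.groupby("sex")["survived"].mean()
event_rate.plot.bar(ax=ax_cat)
ax_cat.set_ylabel("share survived")
```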

## Prepare the dataset for the classification task

* Delete unnecessary columns
* Deal with missing values
* Dummify categorical variables (pd.get_dummies())
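
The three steps above could look roughly like this. The frame `toy` is a hand-made stand-in, and the choice of median/mode imputation and `drop_first=True` are assumptions, not the only reasonable options.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in frame (hypothetical values).
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "age": [22.0, 38.0, np.nan, 35.0, 27.0],
    "sex": ["male", "female", "female", "male", "female"],
    "embarked": ["S", "C", None, "S", "Q"],
    "alive": ["no", "yes", "yes", "no", "yes"],  # duplicates the target
})

# 1. Drop columns that duplicate (and therefore leak) the target.
prepared = toy.drop(columns=["alive"])

# 2. Fill missing values: median for numeric, most frequent value for categorical.
prepared["age"] = prepared["age"].fillna(prepared["age"].median())
prepared["embarked"] = prepared["embarked"].fillna(prepared["embarked"].mode()[0])

# 3. One-hot encode the remaining categorical columns.
prepared = pd.get_dummies(prepared, columns=["sex", "embarked"], drop_first=True)
print(prepared.head())
```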

## Split data to test and train set

* Print the ratio of the train set size to the size of the whole dataset
* Print the ratio of survived people in the whole, train and test datasets to check that it is similar in all of them
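
A sketch of the split and both checks on synthetic data. The `test_size=0.25`, `stratify=y` and `random_state` values are assumptions; stratifying is one way to keep the survived ratio similar across the parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and target (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.normal(30, 12, 500), "fare": rng.exponential(30, 500)})
y = pd.Series(rng.random(500) < 0.4, name="survived").astype(int)

# stratify=y keeps the class ratio (almost) identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print("train share of all rows:", len(X_train) / len(X))
for name, part in [("all", y), ("train", y_train), ("test", y_test)]:
    print(f"survived ratio ({name}): {part.mean():.3f}")
```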

In [2]:
from sklearn.model_selection import train_test_split

## Train a decision tree model
Fit a decision tree classifier of a reasonable depth.
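
A minimal sketch on synthetic data. The value `max_depth=3` is only an assumption for what "reasonable" might mean here; a shallow tree stays readable when plotted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the target depends on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth caps how many consecutive splits the tree may make.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print("depth actually grown:", clf.get_depth())
```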

In [1]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

#### Print the train and test accuracy of the model (number of correctly classified samples / number of samples)
You don't need a for loop for this!
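
One loop-free way to compute this is sketched here on synthetic data: comparing predictions with labels gives a boolean array, and its mean is the accuracy. The classifier's own `score()` method returns the same number.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Mean of a boolean array = correctly classified samples / all samples.
train_acc = (clf.predict(X_train) == y_train).mean()
test_acc = (clf.predict(X_test) == y_test).mean()
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```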

### Plot the tree (sklearn.tree.plot_tree() function)
Make the plot pretty and annotated so that it can be explained to business stakeholders.
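
A sketch of an annotated plot on synthetic data. The feature and class names passed in are placeholders (assumptions); with the real data they would come from the training dataframe's columns.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(
    clf,
    feature_names=["age", "fare", "sibsp"],   # placeholder names
    class_names=["died", "survived"],         # order follows clf.classes_ (0, 1)
    filled=True,       # colour nodes by majority class
    rounded=True,
    proportion=True,   # show shares instead of raw sample counts
    fontsize=9,
    ax=ax,
)
ax.set_title("Decision tree (max_depth=3)")
```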

In [None]:
from sklearn import tree


## Plot feature importance
* Create a dataframe consisting of the dataset **column names** and **feature importances** (the feature_importances_ attribute of the fitted tree)
* Plot a bar plot of the features sorted from most important to least important
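
A sketch of both steps on synthetic data in which, by construction, only the column `fare` drives the target. The column names and data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): only "fare" determines the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["age", "fare", "sibsp"])
y = (X["fare"] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Pair column names with importances and sort, most important first.
importances = (
    pd.DataFrame({"feature": X.columns, "importance": clf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)

ax = importances.plot.bar(x="feature", y="importance", legend=False)
ax.set_ylabel("importance")
```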

## Plot ROC curve (train and test to one plot)

(sklearn.metrics.roc_curve)
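
A sketch of both curves in one plot, on synthetic noisy data. It assumes the predicted probability of class 1 is used as the score; the AUC in the legend is an optional extra.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(5, 5))
aucs = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    scores = clf.predict_proba(X_part)[:, 1]     # probability of class 1
    fpr, tpr, _ = roc_curve(y_part, scores)
    aucs[name] = roc_auc_score(y_part, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("false positive rate")
ax.set_ylabel("true positive rate")
ax.legend()
```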

## Bonus

1. What happens when the depth of the tree is large?
2. Plot a cumulative gain chart / plot the train confusion matrix as a heatmap. (http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html) (pd.qcut(), DataFrame.sort_values())
3. Try different algorithms for the task and see whether the result gets better.
4. Add event rates to the categorical plots in data exploration (= survived ratio in the category)
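
For the first bonus question, a small experiment like the following could be a starting point. It uses synthetic noisy data (an assumption), so the exact numbers will differ from the Titanic ones; the thing to compare is the gap between train and test accuracy as depth grows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for depth in [1, 3, 5, None]:          # None = grow until all leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```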

# Regression

Estimate diamond price based on its features

In [None]:
# dataset_load
df_diamonds = sns.load_dataset("diamonds")  # renamed from df_tips: this is the diamonds dataset
df_diamonds.head(10)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


### Explore the dataset visually 

See https://seaborn.pydata.org/tutorial/regression.html and/or plot the distribution for each category when the feature is categorical/binary.

* seaborn.lmplot() for numerical columns
* seaborn.histplot() for categorical columns (seaborn.distplot() is deprecated since seaborn 0.11), or matplotlib .hist()
* Plot the target distribution

### Train a model and evaluate it (https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

1. Binarize the categorical columns
2. Split to train/test
3. Use e.g. a decision tree regressor or linear regression so you don't need to deal with scaling (skipping scaling is bad practice in general: without it, it is hard to compare the importance of individual features)
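
The three steps might be sketched as follows. The frame `toy` is synthetic: it borrows the column names `carat`, `cut` and `price`, but the values and the price formula are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the diamonds data (assumption).
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
cut_bonus = toy["cut"].map({"Fair": 0, "Good": 300, "Ideal": 600})
toy["price"] = 4000 * toy["carat"] + cut_bonus + rng.normal(0, 200, n)

# 1. Binarize the categorical columns (dtype=float keeps the matrix numeric).
X = pd.get_dummies(toy.drop(columns="price"), columns=["cut"], drop_first=True, dtype=float)
y = toy["price"]

# 2. Split to train/test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 3. Fit both candidate models and compare their test R^2.
models = {
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "linear": LinearRegression(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))
```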

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

## Evaluate model for train and test 

* R² (the regressor's .score() method)
* explained_variance_score (https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)
* mean_absolute_error (https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)
* Distribution plot of errors for both test and train (difference between predicted and ground truth values)
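
A sketch that computes the three metrics and overlays the error distributions for train and test. The data is synthetic (an assumption), with a single numeric feature, and errors are taken as predicted minus true values.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): price grows linearly with carat plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 2.0, size=(600, 1))
y = 4000 * X[:, 0] + rng.normal(0, 200, size=600)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)

fig, ax = plt.subplots()
errors = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    errors[name] = pred - y_part                      # predicted minus ground truth
    print(
        f"{name}: R^2={model.score(X_part, y_part):.3f}, "
        f"explained variance={explained_variance_score(y_part, pred):.3f}, "
        f"MAE={mean_absolute_error(y_part, pred):.1f}"
    )
    ax.hist(errors[name], bins=30, alpha=0.5, density=True, label=name)

ax.set_xlabel("prediction error")
ax.legend()
```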

In [None]:
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error