### Lecture 4: Homework

Today we will learn how to choose between ML models based on data type. Your task is to predict **the edibility of a mushroom** from sample descriptions (a binary classification problem).

The **tricky part here is that 95% of the features are of categorical type.**
<br>This is a case where we would **(usually) prefer tree-based algorithms over linear methods**
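
One way to see where this preference comes from is a toy comparison. The sketch below uses made-up data (a single categorical feature in which only the alphabetically middle category belongs to class 1); the category names and sample sizes are assumptions for illustration and are not taken from the mushroom dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one categorical feature, only 'foul' is class 1
odor = np.array(['almond', 'foul', 'none'] * 100)
y = (odor == 'foul').astype(int)

# Integer codes follow alphabetical order: almond=0, foul=1, none=2
x_codes = LabelEncoder().fit_transform(odor).reshape(-1, 1)
# One indicator column per category
x_onehot = OneHotEncoder().fit_transform(odor.reshape(-1, 1))

# A linear model sees the codes as one numeric axis and can only cut it once
acc_lr_codes = LogisticRegression().fit(x_codes, y).score(x_codes, y)
# With indicator columns each category gets its own weight
acc_lr_onehot = LogisticRegression().fit(x_onehot, y).score(x_onehot, y)
# A tree can cut the code axis twice and isolate the middle category
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
acc_dt_codes = tree.fit(x_codes, y).score(x_codes, y)

print(acc_lr_codes, acc_lr_onehot, acc_dt_codes)
```

The three numbers are training accuracies on the toy data only; they illustrate the encoding effect and say nothing about performance on the real dataset.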

Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

More information can be found [here](https://www.kaggle.com/uciml/mushroom-classification/data)

Please submit your answers using the corresponding [google form](https://docs.google.com/forms/d/e/1FAIpQLScmKfUApMlcD81u9UZxM7xG3vJiEJHrPrG-3b0i_jyPEDijgQ/viewform)

In [1]:
# library import
import pandas as pd
import numpy as np
from os.path import join as pjoin
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 100

# preprocessing / validation
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
)
# ML models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# metrics
from sklearn.metrics import classification_report, f1_score

In [2]:
import pandas_profiling as pp

In [3]:
# read data
DATA_DIR = '../data'
df_train = pd.read_csv(pjoin(DATA_DIR, '4-mushrooms-train.csv'), engine='c')
df_test = pd.read_csv(pjoin(DATA_DIR, '4-mushrooms-test.csv'), engine='c')
ytest = pd.read_csv(pjoin(DATA_DIR, '4-mushrooms-y_test.csv'), engine='c')
print(df_train.shape, df_test.shape)

(6499, 23) (1625, 22)


In [4]:
df_train.shape

(6499, 23)

In [5]:
df_test.shape

(1625, 22)

In [6]:
ytest.shape

(1625, 1)

In [8]:
# let's see what data looks like
df_train.head()

Unnamed: 0,target,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,stalk_shape,stalk_root,stalk_surface_above_ring,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,0,convex,scaly,brown,bruises,pungent,free,close,narrow,white,enlarging,equal,smooth,smooth,white,white,partial,white,one,pendant,brown,scattered,urban
1,1,flat,fibrous,gray,bruises,none,free,close,broad,brown,tapering,bulbous,smooth,smooth,white,white,partial,white,one,pendant,brown,several,woods
2,0,flat,smooth,brown,no,none,attached,close,broad,orange,enlarging,missing,smooth,smooth,orange,orange,partial,orange,one,pendant,brown,several,leaves
3,1,convex,fibrous,gray,bruises,none,free,close,broad,brown,tapering,bulbous,smooth,smooth,white,white,partial,white,one,pendant,black,solitary,woods
4,0,knobbed,smooth,brown,no,foul,free,close,narrow,buff,tapering,missing,silky,smooth,pink,pink,partial,white,one,evanescent,white,several,paths


In [9]:
# for convenient calculations, let us merge train with test
df = pd.concat([df_train, df_test], axis=0)
# add column for filtering train/test
df['is_train'] = True
df.loc[df.target.isnull(), 'is_train'] = False
# check shapes
print(df.shape)
# check labels
df.is_train.value_counts()

(8124, 24)


True     6499
False    1625
Name: is_train, dtype: int64

In [10]:
pp.ProfileReport(df)

Dataset info
Number of variables,25
Number of observations,8124
Total Missing (%),0.8%
Total size in memory,1.5 MiB
Average record size in memory,193.0 B

Variable types
Numeric,2
Categorical,21
Boolean,1
Date,0
Text (Unique),0
Rejected,1
Unsupported,0

Variable,Type,Distinct count,Missing (n),Values (count)
bruises,Categorical,2,0,no: 4748; bruises: 3376
cap_color,Categorical,10,0,brown: 2284; gray: 1840; red: 1500; yellow: 1072; white: 1040; buff: 168; pink: 144; cinnamon: 44; purple: 16; green: 16
cap_shape,Categorical,6,0,convex: 3656; flat: 3152; knobbed: 828; bell: 452; sunken: 32; conical: 4
cap_surface,Categorical,4,0,scaly: 3244; smooth: 2556; fibrous: 2320; grooves: 4
gill_attachment,Categorical,2,0,free: 7914; attached: 210
gill_color,Categorical,12,0,buff: 1728; pink: 1492; white: 1202; brown: 1048; gray: 752; chocolate: 732; purple: 492; black: 408; red: 96; yellow: 86
gill_size,Categorical,2,0,broad: 5612; narrow: 2512
gill_spacing,Categorical,2,0,close: 6812; crowded: 1312
habitat,Categorical,7,0,woods: 3148; grasses: 2148; paths: 1144; leaves: 832; urban: 368; meadows: 292; waste: 192
index,Numeric,6499,0,min: 0; median: 2436.5; mean: 2761.5; max: 6498
is_train,Boolean,2,0,True: 6499; (Missing): 1625; mean: 0.79998
odor,Categorical,9,0,none: 3528; foul: 2160; spicy: 576; fishy: 576; anise: 400; almond: 400; pungent: 256; creosote: 192; musty: 36
population,Categorical,6,0,several: 4040; solitary: 1712; scattered: 1248; numerous: 400; abundant: 384; clustered: 340
ring_number,Categorical,3,0,one: 7488; two: 600; none: 36
ring_type,Categorical,5,0,pendant: 3968; evanescent: 2776; large: 1296; flaring: 48; none: 36
spore_print_color,Categorical,9,0,white: 2388; brown: 1968; black: 1872; chocolate: 1632; green: 72; purple: 48; orange: 48; yellow: 48; buff: 48
stalk_color_above_ring,Categorical,9,0,white: 4464; pink: 1872; gray: 576; brown: 448; buff: 432; orange: 192; red: 96; cinnamon: 36; yellow: 8
stalk_color_below_ring,Categorical,9,0,white: 4384; pink: 1872; gray: 576; brown: 512; buff: 432; orange: 192; red: 96; cinnamon: 36; yellow: 24
stalk_root,Categorical,5,0,bulbous: 3776; missing: 2480; equal: 1120; club: 556; rooted: 192
stalk_shape,Categorical,2,0,tapering: 4608; enlarging: 3516
stalk_surface_above_ring,Categorical,4,0,smooth: 5176; silky: 2372; fibrous: 552; scaly: 24
stalk_surface_below_ring,Categorical,4,0,smooth: 4936; silky: 2304; fibrous: 600; scaly: 284
target,Numeric,3,1625,1.0: 3336; 0.0: 3163; mean: 0.51331
veil_color,Categorical,4,0,white: 7924; brown: 96; orange: 96; yellow: 8
veil_type,Rejected,,,constant value: partial

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,veil_type,is_train
0,bruises,brown,convex,scaly,free,white,narrow,close,urban,pungent,scattered,one,pendant,brown,white,white,equal,enlarging,smooth,smooth,0.0,white,partial,True
1,bruises,gray,flat,fibrous,free,brown,broad,close,woods,none,several,one,pendant,brown,white,white,bulbous,tapering,smooth,smooth,1.0,white,partial,True
2,no,brown,flat,smooth,attached,orange,broad,close,leaves,none,several,one,pendant,brown,orange,orange,missing,enlarging,smooth,smooth,0.0,orange,partial,True
3,bruises,gray,convex,fibrous,free,brown,broad,close,woods,none,solitary,one,pendant,black,white,white,bulbous,tapering,smooth,smooth,1.0,white,partial,True
4,no,brown,knobbed,smooth,free,buff,narrow,close,paths,foul,several,one,evanescent,white,pink,pink,missing,tapering,silky,smooth,0.0,white,partial,True


In [11]:
df.shape

(8124, 24)

In [12]:
df.head()

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,veil_type,is_train
0,bruises,brown,convex,scaly,free,white,narrow,close,urban,pungent,scattered,one,pendant,brown,white,white,equal,enlarging,smooth,smooth,0.0,white,partial,True
1,bruises,gray,flat,fibrous,free,brown,broad,close,woods,none,several,one,pendant,brown,white,white,bulbous,tapering,smooth,smooth,1.0,white,partial,True
2,no,brown,flat,smooth,attached,orange,broad,close,leaves,none,several,one,pendant,brown,orange,orange,missing,enlarging,smooth,smooth,0.0,orange,partial,True
3,bruises,gray,convex,fibrous,free,brown,broad,close,woods,none,solitary,one,pendant,black,white,white,bulbous,tapering,smooth,smooth,1.0,white,partial,True
4,no,brown,knobbed,smooth,free,buff,narrow,close,paths,foul,several,one,evanescent,white,pink,pink,missing,tapering,silky,smooth,0.0,white,partial,True


In [13]:
df.iloc[-10:,:]

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,veil_type,is_train
1615,bruises,pink,knobbed,smooth,free,white,broad,close,waste,none,clustered,two,evanescent,white,red,white,missing,enlarging,smooth,smooth,,white,partial,False
1616,bruises,white,convex,scaly,free,white,narrow,close,grasses,pungent,several,one,pendant,black,white,white,equal,enlarging,smooth,smooth,,white,partial,False
1617,no,brown,flat,scaly,free,buff,narrow,close,leaves,fishy,several,one,evanescent,white,pink,white,missing,tapering,smooth,silky,,white,partial,False
1618,no,brown,flat,scaly,free,buff,narrow,close,paths,fishy,several,one,evanescent,white,pink,pink,missing,tapering,silky,smooth,,white,partial,False
1619,no,yellow,convex,fibrous,free,pink,broad,close,woods,foul,several,one,large,chocolate,pink,pink,bulbous,enlarging,silky,silky,,white,partial,False
1620,no,brown,knobbed,smooth,free,buff,narrow,close,woods,fishy,several,one,evanescent,white,white,pink,missing,tapering,smooth,smooth,,white,partial,False
1621,no,white,convex,smooth,free,white,broad,crowded,grasses,none,numerous,two,pendant,white,white,white,missing,enlarging,silky,smooth,,white,partial,False
1622,bruises,yellow,bell,scaly,free,brown,broad,close,grasses,almond,scattered,one,pendant,black,white,white,club,enlarging,smooth,smooth,,white,partial,False
1623,no,gray,flat,fibrous,free,pink,broad,close,grasses,foul,solitary,one,large,chocolate,pink,buff,bulbous,enlarging,silky,silky,,white,partial,False
1624,bruises,yellow,flat,fibrous,free,white,narrow,crowded,woods,anise,several,one,pendant,purple,white,white,bulbous,tapering,smooth,smooth,,white,partial,False


### Task 1. Which feature has the largest number of unique values? (joint dataset)


In [14]:
# count unique values per feature (target and train/test flag excluded)
# ---------------------------------------------------------------
most_diversive = df.drop(['target', 'is_train'], axis=1).nunique().idxmax()
# ---------------------------------------------------------------

print(most_diversive)

gill_color


### Task 2
**As preparation, spend 15-30 minutes on exploratory data analysis (EDA)**: make sure you understand how the features are distributed in train/test, what they look like, and whether they are ordinal, binary, or categorical before moving on
<br>While doing so, please answer the questions below
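
As a starting point for the train/test comparison, one possible sketch is shown here. The helper `compare_train_test` and the toy frame are hypothetical and not part of the notebook; the same call could be applied to the joint `df` with any feature column.

```python
import pandas as pd

# Toy stand-in for the joint dataframe (values are made up)
toy = pd.DataFrame({
    'odor': ['none', 'foul', 'none', 'anise', 'none', 'foul', 'none', 'foul'],
    'is_train': [True, True, True, True, True, True, False, False],
})


def compare_train_test(frame, col, flag='is_train'):
    """Share of each category of `col` among train rows and among test rows."""
    train_share = frame.loc[frame[flag], col].value_counts(normalize=True)
    test_share = frame.loc[~frame[flag], col].value_counts(normalize=True)
    # categories absent from one part get a share of 0
    return pd.DataFrame({'train': train_share, 'test': test_share}).fillna(0.0)


shares = compare_train_test(toy, 'odor')
print(shares)
```

Large differences between the two columns, or categories that appear in only one of them, are worth noting before any encoding is chosen.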

#### 2.1 Are there any features that are obviously redundant to train on? If so, which ones, and why is it better to remove them?

In [15]:
print(df.shape)
# your code/hardcoded list goes here
# ---------------------------------------------------------------
redundant_columns = ['veil_type']
# ---------------------------------------------------------------
# let's drop these columns from the joint dataset
df = df.drop(redundant_columns, axis=1, errors='ignore')
print(df.shape)

(8124, 24)
(8124, 23)


In [16]:
df.head()

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,is_train
0,bruises,brown,convex,scaly,free,white,narrow,close,urban,pungent,scattered,one,pendant,brown,white,white,equal,enlarging,smooth,smooth,0.0,white,True
1,bruises,gray,flat,fibrous,free,brown,broad,close,woods,none,several,one,pendant,brown,white,white,bulbous,tapering,smooth,smooth,1.0,white,True
2,no,brown,flat,smooth,attached,orange,broad,close,leaves,none,several,one,pendant,brown,orange,orange,missing,enlarging,smooth,smooth,0.0,orange,True
3,bruises,gray,convex,fibrous,free,brown,broad,close,woods,none,solitary,one,pendant,black,white,white,bulbous,tapering,smooth,smooth,1.0,white,True
4,no,brown,knobbed,smooth,free,buff,narrow,close,paths,foul,several,one,evanescent,white,pink,pink,missing,tapering,silky,smooth,0.0,white,True


####  2.2 How many features (excluding target variable and train/test indexing columns) are:
- categorical (more than 2 unique values, no explicit ordering)
- ordinal (more than 2 unique values, explicit ordering)
- binary (2 unique values, doesn't matter whether it has ordering or is "yes/no" styled) 
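
The unique-value counts in these definitions can be checked mechanically. The sketch below uses a hypothetical helper, `split_by_cardinality`, on a made-up frame. Whether a multi-valued column is ordinal or categorical still has to be decided by looking at its values.

```python
import pandas as pd


def split_by_cardinality(frame, exclude=()):
    """Return (binary, multi_valued) column-name lists based on unique-value counts."""
    n_unique = frame.drop(columns=list(exclude)).nunique()
    binary = sorted(n_unique[n_unique == 2].index)
    multi_valued = sorted(n_unique[n_unique > 2].index)
    return binary, multi_valued


# Toy frame (values are made up)
toy = pd.DataFrame({
    'bruises': ['no', 'bruises', 'no', 'no'],
    'ring_number': ['one', 'two', 'none', 'one'],
    'odor': ['none', 'foul', 'anise', 'none'],
    'target': [0, 1, 0, 1],
})
binary, multi_valued = split_by_cardinality(toy, exclude=['target'])
print(binary, multi_valued)
```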

In [17]:
# your code goes here
# ---------------------------------------------------------------

ordinal_cols = sorted(['ring_number'])
binary_cols = sorted(['bruises','gill_attachment','gill_size','gill_spacing','stalk_shape'])
categorical_cols = sorted(list(set(df.columns.tolist()) - set(ordinal_cols)-set(binary_cols)-set(['is_train','target'])))
#categorical_cols = sorted(['cap_color','cap_shape','cap_surface','habitat','odor','population','ring_type','spore_print_color'])
# ---------------------------------------------------------------
print('categorical: {}\nordinal: {}\nbinary: {}'.format(
    len(categorical_cols), len(ordinal_cols), len(binary_cols)))

categorical: 15
ordinal: 1
binary: 5


In [18]:
df.head()

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,is_train
0,bruises,brown,convex,scaly,free,white,narrow,close,urban,pungent,scattered,one,pendant,brown,white,white,equal,enlarging,smooth,smooth,0.0,white,True
1,bruises,gray,flat,fibrous,free,brown,broad,close,woods,none,several,one,pendant,brown,white,white,bulbous,tapering,smooth,smooth,1.0,white,True
2,no,brown,flat,smooth,attached,orange,broad,close,leaves,none,several,one,pendant,brown,orange,orange,missing,enlarging,smooth,smooth,0.0,orange,True
3,bruises,gray,convex,fibrous,free,brown,broad,close,woods,none,solitary,one,pendant,black,white,white,bulbous,tapering,smooth,smooth,1.0,white,True
4,no,brown,knobbed,smooth,free,buff,narrow,close,paths,foul,several,one,evanescent,white,pink,pink,missing,tapering,silky,smooth,0.0,white,True


In [19]:
# To be used in training, data must be properly encoded
from collections import defaultdict


# function to encode categorical data (modifies the frames in place)
def __encode_categorical(df_list, cat_cols):
    # one LabelEncoder per column, created on first access
    encoders = defaultdict(LabelEncoder)
    # fit every encoder on the values from all frames,
    # so that train/test share the same codes
    joint = pd.concat(
        [frame[cat_cols] for frame in df_list],
        axis=0
    ).fillna('')
    for col in cat_cols:
        encoders[col].fit(joint[col])
    # replace the string values with integer codes
    for frame in df_list:
        frame[cat_cols] = frame[cat_cols].fillna('').apply(
            lambda x: encoders[x.name].transform(x))


# label encode data (categorical + binary)
__encode_categorical(df_list=[df], cat_cols=categorical_cols+binary_cols)
# make sure you encode the only ordinal column in correct order
df[ordinal_cols[0]] = df[ordinal_cols[0]].map({'none': 0, 'one': 1, 'two': 2})

# define useful feature columns to be used for training
# (union of all columns discussed above)
columns_to_use = ordinal_cols + binary_cols + categorical_cols

In [20]:
df.sample(5)

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,is_train
4184,1,7,4,2,1,2,1,0,6,8,4,1,0,7,7,5,3,1,2,3,0.0,2,True
5390,1,0,4,3,1,2,1,0,3,3,4,1,0,7,7,5,3,1,3,2,0.0,2,True
542,0,0,3,2,1,8,0,0,6,6,5,1,4,0,3,5,0,1,3,3,,2,False
386,0,0,3,0,1,7,0,0,6,6,5,1,4,0,7,3,0,1,3,3,1.0,2,True
1574,0,7,3,2,1,8,0,0,6,6,5,1,4,1,3,7,0,1,3,3,,2,False


In [21]:
df.iloc[-5:,:]

Unnamed: 0,bruises,cap_color,cap_shape,cap_surface,gill_attachment,gill_color,gill_size,gill_spacing,habitat,odor,population,ring_number,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_shape,stalk_surface_above_ring,stalk_surface_below_ring,target,veil_color,is_train
1620,1,0,4,3,1,2,1,0,6,3,4,1,0,7,7,5,3,1,3,3,,2,False
1621,1,8,2,3,1,10,0,1,0,6,2,2,4,7,7,7,3,0,2,3,,2,False
1622,0,9,0,2,1,1,0,0,0,0,3,1,4,0,7,7,1,0,3,3,,2,False
1623,1,3,3,0,1,7,0,0,0,4,5,1,2,3,5,1,0,0,2,2,,2,False
1624,0,9,3,0,1,10,1,1,6,1,4,1,4,6,7,7,0,1,3,3,,2,False


In [22]:
df.shape

(8124, 23)

### Task 3. Prepare cross-validation strategy and perform comparison of 2 baseline models (linear vs tree-based)

### =====================================================
#### Briefly about Validation / Cross-Validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but **would fail to predict anything useful on yet-unseen data. This situation is called overfitting**. 
<br>To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set ```X_test, y_test```. 
<br>Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.

When evaluating different settings (“hyperparameters”) for estimators, **there is still a risk of overfitting on the test set** because the parameters can be tweaked until the estimator performs optimally. 
<br>This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. 
<br>To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
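
A three-way split like the one just described could be sketched as follows. The data is synthetic and the 60/20/20 proportions are an arbitrary assumption, not a recommendation from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # toy features
y = np.array([0, 1] * 50)            # toy balanced labels

# First hold out the final test set (20% of all samples)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Then carve a validation set out of the remainder (25% of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```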

However, by partitioning the available data into three sets, **we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.**

A solution to this problem is a procedure called **cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV**. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data 
<br>(i.e., it is used as a test set to compute a performance measure such as accuracy).
        
<img src="https://hsto.org/files/b1d/706/e6c/b1d706e6c9df49c297b6152878a2d03f.png" style="width:75%">

The performance measure reported by k-fold cross-validation **is then the average of the values computed in the loop**. 
<br>This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
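
The loop described above can be written out by hand. This is a minimal sketch on synthetic data (the dataset, the tree depth, and `k = 5` are assumptions for illustration); it also computes the same average with `cross_val_score` for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(X, y):
    # train on k-1 folds
    model = DecisionTreeClassifier(max_depth=4, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # validate on the remaining fold
    fold_scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

# the reported CV score is the average over the folds
manual_mean = np.mean(fold_scores)
helper_mean = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=42),
    X, y, scoring='f1', cv=kf).mean()
print(manual_mean, helper_mean)
```

Because the splitter and the model both use fixed seeds, the manual loop and the helper evaluate the same folds.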


Some classification problems can **exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples**. 
<br>In such cases it is recommended to use **stratified sampling** as implemented in sklearn's StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold.
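
To illustrate the effect of stratification, the sketch below compares the share of positive samples in each validation fold for plain and stratified splitting. The labels are made up (90 negatives followed by 10 positives) and deliberately left unshuffled to make the contrast visible.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for splitting


def positive_share_per_fold(splitter):
    """Fraction of positive labels in each validation fold."""
    return [y[val_idx].mean() for _, val_idx in splitter.split(X, y)]


plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))
print(plain)
print(stratified)
```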

More details about different cross-validation strategies, implemented in sklearn, can be found [here](http://scikit-learn.org/stable/modules/cross_validation.html)
### =====================================================

Prepare KFold with 5 splits, stratified by target variable, shuffled, with fixed random_state = 42
<br>**Don't forget to filter by column 'is_train'!**
<br>Fit models on subset of features: [columns_to_use]

In [23]:
xtrain = df[df['is_train']][columns_to_use]
ytrain = df[df['is_train']]['target']
xtest =  df[df['is_train']==0][columns_to_use]
y_true = pd.read_csv('../data/4-mushrooms-y_test.csv')

In [24]:
xtrain.shape

(6499, 21)

In [25]:
ytrain.shape

(6499,)

In [26]:
xtest.shape

(1625, 21)

In [27]:
len(columns_to_use)

21

In [28]:
columns_to_use

['ring_number',
 'bruises',
 'gill_attachment',
 'gill_size',
 'gill_spacing',
 'stalk_shape',
 'cap_color',
 'cap_shape',
 'cap_surface',
 'gill_color',
 'habitat',
 'odor',
 'population',
 'ring_type',
 'spore_print_color',
 'stalk_color_above_ring',
 'stalk_color_below_ring',
 'stalk_root',
 'stalk_surface_above_ring',
 'stalk_surface_below_ring',
 'veil_color']

In [29]:
from sklearn.preprocessing import scale

In [30]:
from os import cpu_count

n_jobs = max(cpu_count()-1, 1)
# your code goes here
# ---------------------------------------------------------------
# cross-validation iterator
kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
# xtrain, ytrain, DataFrame-like
#xtrain = None
#ytrain = None
# ---------------------------------------------------------------

# create Decision Tree with default params, max_depth=4, random_state=42
dt = DecisionTreeClassifier(max_depth=4,random_state=42)

# estimate its f1-score with cross-validation (cross_val_score)
# your code goes here
# ---------------------------------------------------------------
scores_dt = cross_val_score(estimator=dt,
                            X=xtrain, 
                            y=ytrain,
                            scoring='f1',
                            cv=kf, # cross-validation strategy
                            n_jobs=n_jobs
                            ).mean()
print('DT scoring: {:.5f}'.format(scores_dt))


# ---------------------------------------------------------------


# create Logistic Regression with default params, random_state=42
lr = LogisticRegression(random_state=42)
    # ...


# estimate its f1-score with cross-validation
# your code goes here
# ---------------------------------------------------------------
scores_lr = cross_val_score(
                            estimator=lr,
                            X=scale(xtrain), # ...
                            y=ytrain, # ...
                            scoring='f1',
                            cv=kf,
                            n_jobs=n_jobs
                        ).mean()

print('LR scoring: {:.5f}'.format(scores_lr))

DT scoring: 0.91764
LR scoring: 0.89116


DT scoring: 0.91764
LR scoring: 0.89116 (scale)

Why is the score of Logistic Regression lower than that of the Decision Tree?
1. Is everything OK with the data format for linear models? (Review the two previous lectures.)
1. If not, what else should you do to make the data suitable for linear models?
1. Why didn't the issue in point 1 affect Decision Tree performance?

### Task 4. Now it's time to do some hyperparam tuning
Perform suitable hyperparam tuning using the cross-validation strategy created above
<br>Main parameters to grid-search over:
- max_depth (1, 2, ..., None)
- min_samples_leaf (1, 2, ...)
- criterion (gini, entropy)
- class_weight (None, balanced)
- max_features (sqrt(features), 50%, 75%, all of them, ...)
- other available params, see the documentation

So use your imagination to fill in the lists above

You should receive **a gain of 0.01 in f1-score or higher**
<br>(current benchmark = +0.0268 gain)

In [32]:
%%time
# your code goes here
# ---------------------------------------------------------------
# create base model (DT, random state = 42)
estimator = DecisionTreeClassifier(random_state=42)
    # ...


# create parameter grid
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
params = {'criterion': ['gini', 'entropy'],
          'max_depth': [ 7, 8, 9, 10, 11, 12],
          'min_samples_leaf': [1, 2, 5, 7],
          'max_features': ['sqrt', 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
          'class_weight': ['balanced', None]
    
}


# create grid search object
gs = GridSearchCV(
    estimator=estimator,  # base model
    param_grid=params,  # params grid to search within
    cv=kf,  # cross-validation strategy
    error_score=np.nan,  # a failed fit scores NaN (with a warning), not a perfect 1
    scoring='f1',  # f1-score
    # thread count, the higher count - the faster
    n_jobs=n_jobs,
    verbose=2,  # messages about performed actions
)

# perform grid search on TRAIN dataset ('is_train' filtering)
gs.fit(
    X=xtrain, # ...
    y=ytrain, # ...
)
# -------------------------------------------------------------
# extract best score on cross-validation
best_score = gs.best_score_
# extract the estimator (DT) with best params on cross-validation
best_dt = gs.best_estimator_
# check gain in f1-score
print('f1-score best: {:.4f}, +{:.4f} better than baseline'.format(
    best_score, (best_score - scores_dt))
)

Fitting 5 folds for each of 1152 candidates, totalling 5760 fits


[Parallel(n_jobs=3)]: Done  77 tasks      | elapsed:    3.8s
[Parallel(n_jobs=3)]: Done 1287 tasks      | elapsed:   16.1s
[Parallel(n_jobs=3)]: Done 3317 tasks      | elapsed:   37.4s


f1-score best: 0.9282, +0.0106 better than baseline
Wall time: 1min


[Parallel(n_jobs=3)]: Done 5760 out of 5760 | elapsed:  1.0min finished


In [33]:
# take a look at the best model, compare with the baseline
best_dt

DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=9,
            max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')

In [34]:
best_dt.predict(xtrain.iloc[1:2,:])

array([1.])

In [35]:
xtrain.iloc[1:2,:]

Unnamed: 0,ring_number,bruises,gill_attachment,gill_size,gill_spacing,stalk_shape,cap_color,cap_shape,cap_surface,gill_color,habitat,odor,population,ring_type,spore_print_color,stalk_color_above_ring,stalk_color_below_ring,stalk_root,stalk_surface_above_ring,stalk_surface_below_ring,veil_color
1,1,0,1,0,0,1,3,3,0,1,6,6,4,4,1,7,7,0,3,3,2


In [36]:
ytrain.iloc[1:2]

1    1.0
Name: target, dtype: float64

In [37]:
best_dt.predict_proba(xtrain.iloc[1:2,:])

array([[0.1174412, 0.8825588]])

In [38]:
# check performance on holdout dataset, unseen before (filter 'is_train' == False)

# your code goes here
# ---------------------------------------------------------------
# appropriate df_test data subset from 'df' dataframe

# fit baseline model 'dt' on xtrain, ytrain (because it's not fitted yet)
dt.fit(xtrain, ytrain)
# ---------------------------------------------------------------

# baseline model

y_pred_baseline = dt.predict(xtest)

print('Base on train:   {:.4f}\nBase on holdout: {:.4f}\ndiff: {:.4f}'.format(
    scores_dt, 
    f1_score(y_true, y_pred_baseline),
    scores_dt - f1_score(y_true, y_pred_baseline)
))

# best model
y_pred_best = best_dt.predict(xtest)

print('\nBest on train:   {:.4f}\nBest on holdout: {:.4f}\ndiff: {:.4f}'.format(
    best_score, 
    f1_score(y_true, y_pred_best),
    best_score - f1_score(y_true, y_pred_best)
))

Base on train:   0.9176
Base on holdout: 0.9050
diff: 0.0126

Best on train:   0.9282
Best on holdout: 0.9205
diff: 0.0077


Now you can see that
<br>**the f1-score is higher and the gap between train (cross-validation) and holdout scores is smaller** <br>for the **best model** than for the **baseline**

**Bonus question**:

Consider two possibilities:
- (a) you have trained **one best** (on cross-validation) Decision Tree
- (b) you randomly choose 25 subsets of 70% of the training data, fit an "overfitted" (max_depth=None) Decision Tree on each of them (each tree performs slightly worse than the tree in (a)), and then average the predictions over all 25 models (overfitted trees)

**Which one of them would most likely give the best results on hold-out dataset? What makes you think that way?**
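
For anyone who wants to try option (b) empirically, here is a minimal sketch of the procedure. It runs on synthetic data, and the variable names, seeds, and the 0.5 decision threshold are assumptions for illustration. It only shows the mechanics and does not answer the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rng = np.random.RandomState(42)
n_models, subset_share = 25, 0.7
subset_size = int(subset_share * len(X_train))

probas = []
for _ in range(n_models):
    # random 70% subset of the training rows, drawn without replacement
    idx = rng.choice(len(X_train), size=subset_size, replace=False)
    # unrestricted depth: each tree is allowed to overfit its subset
    tree = DecisionTreeClassifier(max_depth=None, random_state=rng.randint(10**6))
    tree.fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_hold)[:, 1])

# average the predicted probabilities over all trees, then threshold
avg_proba = np.mean(probas, axis=0)
y_pred = (avg_proba >= 0.5).astype(int)
```

Comparing `f1_score(y_hold, y_pred)` with the score of a single tuned tree on the same hold-out split is one way to check your reasoning.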