## Part 1: Data exploration
Your first task is to download and explore the data. What features are there?
How are they related?
Hand in two lines that describe
* what a frequency is
* what the median frequency means
* what the output label is

In [5]:
%pylab inline
import pandas as pd
import sklearn

Populating the interactive namespace from numpy and matplotlib


In [6]:
df = pd.read_csv('voice.csv')

In [47]:
df

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.000000,0.000000,male
1,0.066009,0.067310,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.250000,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.007990,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.250000,0.201497,0.007812,0.562500,0.554688,0.247119,male
4,0.135120,0.079146,0.124656,0.078720,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.135120,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male
5,0.132786,0.079557,0.119090,0.067958,0.209592,0.141634,1.932562,8.308895,0.963181,0.738307,...,0.132786,0.110132,0.017112,0.253968,0.298222,0.007812,2.726562,2.718750,0.125160,male
6,0.150762,0.074463,0.160106,0.092899,0.205718,0.112819,1.530643,5.987498,0.967573,0.762638,...,0.150762,0.105945,0.026230,0.266667,0.479620,0.007812,5.312500,5.304688,0.123992,male
7,0.160514,0.076767,0.144337,0.110532,0.231962,0.121430,1.397156,4.766611,0.959255,0.719858,...,0.160514,0.093052,0.017758,0.144144,0.301339,0.007812,0.539062,0.531250,0.283937,male
8,0.142239,0.078018,0.138587,0.088206,0.208587,0.120381,1.099746,4.070284,0.970723,0.770992,...,0.142239,0.096729,0.017957,0.250000,0.336476,0.007812,2.164062,2.156250,0.148272,male
9,0.134329,0.080350,0.121451,0.075580,0.201957,0.126377,1.190368,4.787310,0.975246,0.804505,...,0.134329,0.105881,0.019300,0.262295,0.340365,0.015625,4.695312,4.679688,0.089920,male


Frequency describes the number of waves that pass a fixed place in a given amount of time. So if the time it takes for a wave to pass is 1/2 second, the frequency is 2 per second. If it takes 1/100 of an hour, the frequency is 100 per hour. Sound propagates as mechanical vibration waves of pressure and displacement, in air or other substances.
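
The two examples above follow the relationship frequency = 1 / period. A small sketch of that arithmetic (the function name `frequency` is only illustrative):

```python
def frequency(period):
    """Return the number of waves per unit time, given the time one wave takes to pass."""
    return 1.0 / period

per_second = frequency(0.5)    # one wave takes 1/2 second to pass
per_hour = frequency(1 / 100)  # one wave takes 1/100 of an hour to pass
print(per_second, per_hour)
```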

The median is the middle number in an ordered set of data. If there is an even number of observations, the median is the mean of the two central numbers. The median frequency (the `median` column) is therefore the middle frequency of a recording: half of the measured frequencies lie below it and half above.
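
As a quick illustration of the odd and even cases, here is a sketch using Python's standard `statistics.median`. The numbers are made up for the example and are not taken from `voice.csv`.

```python
from statistics import median

# Hypothetical frequency values, deliberately unsorted
odd_count = [0.12, 0.16, 0.03, 0.14, 0.15]
even_count = [0.12, 0.16, 0.03, 0.14]

# Odd number of values: the middle one after sorting
median_odd = median(odd_count)
# Even number of values: the mean of the two central ones after sorting
median_even = median(even_count)

print(median_odd, median_even)
```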

The label describes whether the voice is male or female.



## Part 2: Data preparation
When we train our model we'll use a 10-fold `KFold` 
cross-validator **with** shuffling.
Instantiate a `KFold` class and store it in a meaningful variable.

When that is done, illustrate that you indeed do get 10 
iterations of your data by iterating over the folds and simply
printing *the shape of* the four variables: `x_train`, `y_train`, `x_test` and
`y_test` (see 
the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) for examples on
how to do this).

Hand in
* the instantiation of the k-fold cross-validator
* a loop that prints the *shape* of `x_train`, `y_train`, `x_test` and `y_test`

In [39]:
from sklearn.model_selection import KFold

X = df.loc[:, df.columns != 'label']
y = df["label"]


# Instantiate the k-fold cross-validator
folder = KFold(n_splits=10, shuffle=True)


# Loop that prints the shapes; split() yields row positions, so index with .iloc
for train_index, test_index in folder.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
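
`KFold.split` yields arrays of integer row positions, not column labels, so a pandas `DataFrame` or `Series` should be sliced with `.iloc`. Writing `X[train_index]` on a `DataFrame` is read as a column lookup instead. A minimal sketch on synthetic data follows; the 25-row frame, the column names and the `_demo` variables are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the voice data: 25 rows, 3 features, one label column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((25, 3)), columns=["f1", "f2", "f3"])
demo["label"] = rng.choice(["male", "female"], size=25)

X_demo = demo.drop(columns="label")
y_demo = demo["label"]

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
shapes = []
for train_index, test_index in kfold.split(X_demo):
    # Positional indexing: the indices are row positions
    X_train, X_test = X_demo.iloc[train_index], X_demo.iloc[test_index]
    y_train, y_test = y_demo.iloc[train_index], y_demo.iloc[test_index]
    shapes.append((X_train.shape, y_train.shape, X_test.shape, y_test.shape))
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Every row lands in exactly one test fold, so the train and test row counts of each iteration add up to the size of the data set.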

## Part 3: Model construction
We will use four different classification models for this task:
1. Logistic Regression
2. Support Vector Machine classifier
3. Decision Tree classifier
4. k-Nearest Neighbors classifier

Instantiate the four different classifiers in *four different 
pipelines*.

For now the default parameters are fine.

Hand in 
* the code for constructing the four pipelines
* one line of text per model describing how you think the classifier will perform, given the data type you are working with (voice)


In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

p1 = make_pipeline(LogisticRegression(solver='lbfgs'))
p2 = make_pipeline(SVC(gamma='scale'))
p3 = make_pipeline(DecisionTreeClassifier())
p4 = make_pipeline(KNeighborsClassifier())

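### Logistic Regression
The classifier should perform well because the output is a binary dependent variable, that is, one with two possible values: male and female.

### Support Vector Machine classifier
Expected to perform poorly: the pipeline does no scaling, and the features span very different ranges (compare `kurt` with `meanfreq` in the table above).


### Decision Tree classifier
Expected to perform well: a tree splits on one feature at a time, so differing feature scales do not affect it.


### k-Nearest Neighbors classifier
Expected to perform moderately: it relies on distances between samples, which the unscaled features with the largest ranges will dominate.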


## Part 4: Model validation
Now the time comes to train and validate your model.
This training and testing **should happen for all four models**.
The easiest way to do this is to call the `cross_val_score`
function from `sklearn` once for each of the four models.
The code should look something like this:
```python
pipeline1 = ...
pipeline2 = ...
pipeline3 = ...
pipeline4 = ...
my_kfold_validator = ...
for model in ... :
    score = cross_val_score(model, ...)
    print(score)
```

Hand in
* a list per model (four lists in total) of 10 values each, showing the scores of the 10 folds
* at least one paragraph of text that describes what the 'score' means
* at least one paragraph of text that describes why the scores are different

In [46]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score


for model in [p1, p2, p3, p4]:
    print(cross_val_score(model, xs, ys, cv=folder))



[0.88958991 0.90536278 0.91167192 0.92744479 0.93059937 0.80757098
 0.9022082  0.9022082  0.8164557  0.90506329]
[0.66876972 0.73817035 0.6340694  0.69085174 0.68138801 0.72239748
 0.66246057 0.69085174 0.62974684 0.6835443 ]
[0.97160883 0.96529968 0.96214511 0.96845426 0.96845426 0.96845426
 0.97791798 0.95899054 0.95886076 0.94936709]
[0.71293375 0.72870662 0.74763407 0.7192429  0.7318612  0.69716088
 0.72239748 0.72555205 0.71202532 0.72151899]
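
For a classifier, `cross_val_score` without a `scoring` argument uses the estimator's own `score` method, which is the mean accuracy: the fraction of test-fold samples whose label was predicted correctly. Each of the 10 values comes from a model trained on a different nine tenths of the data and tested on the remaining tenth, which is one reason the values vary from fold to fold. The sketch below checks the first point on synthetic data; `make_classification`, the sample sizes and the `_demo` names are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification problem standing in for the voice data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression()

# Scores as returned by cross_val_score (default scoring)
scores = cross_val_score(model, X_demo, y_demo, cv=kfold)

# The same folds evaluated by hand with accuracy_score
manual = []
for train_index, test_index in kfold.split(X_demo):
    model.fit(X_demo[train_index], y_demo[train_index])
    predictions = model.predict(X_demo[test_index])
    manual.append(accuracy_score(y_demo[test_index], predictions))

print(scores)
print(np.allclose(scores, manual))
```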


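### Logistic Regression
Good: the fold scores range from about 0.81 to 0.93.

### Support Vector Machine classifier
Bad: the fold scores range from about 0.63 to 0.74, the lowest of the four models.


### Decision Tree classifier
Good: the fold scores range from about 0.95 to 0.98, the highest of the four models.


### k-Nearest Neighbors classifier
Meh: the fold scores range from about 0.70 to 0.75.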

## Part 5: Model optimisation: scaling
On the website of the [Gender Recognition by Voice](https://www.kaggle.com/primaryobjects/voicegender) dataset, they say
we can do better. So let's try!

One thing that's very easy to do is to use a 
[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
It's particularly easy, because it fits right into your existing
pipelines. So simply add four (separate!) instances of the
`StandardScaler` to the pipelines, one for each pipeline.

Now repeat the above validation code, where you run the 
`cross_val_score` for *each* of the four pipelines. But this 
time the `StandardScaler` is included in the pipeline.

Hand in
* the code for your new pipelines that includes the `StandardScaler`
* at least one line of text that describes what scaling actually is
* the **mean** of the 10 scores of the four models (this time it's only **one** number per model)
* at least two lines of text describing which model performed well, and whether this aligned with your expectation from part 3

In [11]:
p1 = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
p2 = make_pipeline(StandardScaler(), SVC(gamma='scale'))
p3 = make_pipeline(StandardScaler(), DecisionTreeClassifier())
p4 = make_pipeline(StandardScaler(), KNeighborsClassifier())

In [12]:
for model in [p1, p2, p3, p4]:  # mean score over the 10 folds for each model
    print(cross_val_score(model, xs, ys, cv=folder).mean())

0.9734866030427665
0.9820089446152618
0.9614912350756699
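
Scaling with `StandardScaler` means standardising each feature column: subtract the column mean and divide by the column standard deviation, so that every column ends up with mean 0 and standard deviation 1. A minimal sketch on made-up numbers (the array is illustrative and not taken from `voice.csv`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical feature columns on very different scales
X_demo = np.array([
    [0.1, 200.0],
    [0.2, 600.0],
    [0.3, 10.0],
    [0.4, 30.0],
])

scaled = StandardScaler().fit_transform(X_demo)

# The same transformation written out by hand (population standard deviation)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute on a comparable footing to distance-based models such as the SVM and kNN classifiers.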


In [13]:
from sklearn.model_selection import GridSearchCV

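# Grid search over the regularisation strength C of the logistic regression
p5 = make_pipeline(StandardScaler(),
                   GridSearchCV(LogisticRegression(solver='lbfgs'),
                                param_grid={'C': np.arange(0.0001, 2, 0.05)}, cv=10))
p5.fit(xs, ys)
p5.score(xs, ys)  # accuracy on the same data that was used for fitting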

0.9741161616161617
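
The number above is the accuracy on the data the search was fitted on, and the scaler in that cell is fitted on all rows before the inner cross-validation runs. One possible alternative, sketched here on synthetic data, is to wrap the whole pipeline in `GridSearchCV` and read `best_score_`, the mean cross-validated score of the best setting. The step name `logisticregression` is the lowercase class name that `make_pipeline` assigns; the data set, grid values and `_demo` names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for the voice data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is now refitted inside every cross-validation split
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"logisticregression__C": np.arange(0.05, 2, 0.25)}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=10)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```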

## Part 6: Manual Hyperparameter Tuning

For the fourth classifier in this project -- namely kNN -- conduct a manual search for the value of $k$ (the hyperparameter `n_neighbors`) that yields the highest score.

That means:

  1. choosing a value (positive integer >= 1), 
  2. putting it into the model, 
  3. (re-)training the kNN model, and 
  4. calculating the score. 
  5. Then try 1)-4) all over again. 

Do these steps at least 10 times to find a good value of the hyperparameter.
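
The steps above can be sketched as a loop over candidate values of `n_neighbors`, scoring each with the same 10-fold cross-validation. This is only one way to organise the manual search; the synthetic data, the list of candidate values and the names `results` and `best_k` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for the voice data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for k in [1, 3, 5, 7, 9, 11, 15, 21, 31, 41]:    # step 1: choose a value
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=k))  # step 2: put it into the model
    # steps 3 and 4: train on each fold and compute the mean score
    results[k] = cross_val_score(model, X_demo, y_demo, cv=kfold).mean()
    print(k, round(results[k], 4))

best_k = max(results, key=results.get)
print("best k:", best_k)
```

Fixing `random_state` in the `KFold` keeps the folds identical for every value of `k`, so differences in score come from the hyperparameter rather than from a different shuffle.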