# SVM Challenge
## Identify gender by voice

**Gender recognition by voice - identify a voice as male or female**

The dataset you are going to use comes from a Kaggle competition. It contains acoustic measurements taken on 3,168 voice recordings, and the goal is to classify each recording as male or female.

The dataset is balanced and contains a relatively small number of features: two conditions under which SVMs can give good results.

Apart from choosing the kernel and tuning the hyperparameters, you should spend some time exploring the features and selecting only the ones you think are relevant.
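
As a rough starting point, kernel choice and hyperparameter tuning can be combined in a single cross-validated grid search. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `voice.csv`, and the grid values, the 75/25 split and the variable names are assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for voice.csv (assumption): 20 numeric features, 2 classes
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Hold out a test set so the tuned model is scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardise inside the pipeline
model = make_pipeline(StandardScaler(), SVC())

# One sub-grid per kernel; gamma only applies to the RBF kernel
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation folds leaks into the scaling.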

The dataset is stored in the file `voice.csv`. 

To share their findings and approach, every team should prepare a very short presentation (max 4 minutes) to talk through at the end of the afternoon. 


**The Dataset**

The following acoustic properties of each voice are measured and included within the CSV:

* **meanfreq**: mean frequency (in kHz)
* **sd**: standard deviation of frequency
* **median**: median frequency (in kHz)
* **Q25**: first quartile (in kHz)
* **Q75**: third quartile (in kHz)
* **IQR**: interquartile range (in kHz)
* **skew**: skewness (see note in specprop description)
* **kurt**: kurtosis (see note in specprop description)
* **sp.ent**: spectral entropy
* **sfm**: spectral flatness
* **mode**: mode frequency
* **centroid**: frequency centroid (see specprop)
* **peakf**: peak frequency (frequency with highest energy)
* **meanfun**: average of fundamental frequency measured across acoustic signal
* **minfun**: minimum fundamental frequency measured across acoustic signal
* **maxfun**: maximum fundamental frequency measured across acoustic signal
* **meandom**: average of dominant frequency measured across acoustic signal
* **mindom**: minimum of dominant frequency measured across acoustic signal
* **maxdom**: maximum of dominant frequency measured across acoustic signal
* **dfrange**: range of dominant frequency measured across acoustic signal
* **modindx**: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
* **label**: male or female
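
To make the **modindx** definition above more concrete, here is a minimal sketch that follows the wording of the description literally. The function name `modulation_index`, the example values and the convention of returning 0 for a flat signal are assumptions made for illustration; the tool that produced `voice.csv` may differ in the details.

```python
import numpy as np


def modulation_index(fund_freqs):
    """Accumulated absolute difference between adjacent fundamental-frequency
    measurements, divided by the frequency range (hypothetical helper)."""
    f = np.asarray(fund_freqs, dtype=float)
    freq_range = f.max() - f.min()
    if freq_range == 0:
        # Assumed convention: a perfectly flat signal has no modulation
        return 0.0
    return float(np.abs(np.diff(f)).sum() / freq_range)


# Invented fundamental-frequency measurements, in kHz
example = [0.10, 0.12, 0.11, 0.15]
example_modindx = modulation_index(example)
print(example_modindx)
```

For these invented values, the adjacent differences add up to 0.07 and the range is 0.05, so the index is about 1.4.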

In [None]:
# import relevant libraries

import pandas as pd

from sklearn.svm import SVC
import numpy as np
import random

from matplotlib import pyplot as plt

%matplotlib inline

In [None]:
df = pd.read_csv("voice.csv")

In [None]:
df.head()

Enjoy!

In [None]:
df.shape

In [None]:
df.columns
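
Before modelling, it is worth checking the claim that the classes are balanced and that no values are missing. The sketch below shows the calls on a tiny invented DataFrame called `toy` (an assumption, used so the example runs without the CSV); on the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Tiny invented stand-in for df (assumption); values are not from voice.csv
toy = pd.DataFrame(
    {
        "meanfun": [0.09, 0.11, 0.17, 0.19],
        "label": ["male", "male", "female", "female"],
    }
)

# Absolute and relative class frequencies
class_counts = toy["label"].value_counts()
class_share = toy["label"].value_counts(normalize=True)

# Number of missing values per column
missing = toy.isnull().sum()

print(class_counts)
print(class_share)
print(missing)
```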

In [None]:
# let's plot a grid of pairwise scatter plots, coloured by label

use_colour = {"male": "blue", "female": "red"}
fig = plt.figure(figsize=(14, 12))

features = list(df.columns)
features.remove("label")

# Select 5 features at random to examine. Because the sample is random, re-running this cell
# selects a different set of features and produces different plots.
features_to_examine = random.sample(features, 5)

# we are going to examine the scatter plots of all variables in the above list with all others, this means we will
# have 5x5=25 plots.
nfeat = len(features_to_examine)

counter = 1
for j in range(nfeat):
    for k in range(nfeat):
        # subplot takes 3 arguments: the number of rows, the number of columns and the plot index.
        # For a 5 x 5 grid of subplots, for example, the first two arguments must both be 5.
        # The third argument is the 1-based position of the subplot in the grid, counted row by row,
        # so the variable "counter" is incremented at every step.
        plt.subplot(nfeat, nfeat, counter)
        counter += 1
        plt.scatter(
            df[features_to_examine[j]],
            df[features_to_examine[k]],
            c=[use_colour[i] for i in df["label"]],
        )
        plt.xlabel(features_to_examine[j])
        plt.ylabel(features_to_examine[k])

# adjust the spacing once, after all subplots have been drawn
fig.tight_layout()
plt.show()
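
The scatter plots give a visual impression of which features separate the two classes. As a numeric complement, one simple option is to rank features by the gap between the class means relative to the pooled within-class standard deviation. This is only one possible heuristic, not part of the original challenge: the helper `class_separation` and the synthetic DataFrame `toy` are assumptions for illustration, and the same function could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (assumption): one feature that separates the
# classes and one that is pure noise
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "informative": np.concatenate(
            [rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.0, n)]
        ),
        "noise": rng.normal(0.0, 1.0, 2 * n),
        "label": ["male"] * n + ["female"] * n,
    }
)


def class_separation(frame, label_col="label"):
    """Rank numeric features by |difference of class means| divided by the
    pooled within-class standard deviation (hypothetical helper, two classes)."""
    grouped = frame.groupby(label_col)
    means = grouped.mean()
    stds = grouped.std()
    # Pooled standard deviation per feature (equal class sizes assumed)
    pooled = np.sqrt((stds ** 2).mean())
    gap = (means.iloc[0] - means.iloc[1]).abs()
    return (gap / pooled).sort_values(ascending=False)


scores = class_separation(toy)
print(scores)
```

A high score only says that the class means are far apart on that feature alone; features that are useful in combination can still score low, so treat the ranking as a first filter rather than a final selection.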