# Predicting 'label' (male or female) using k-NN

In this notebook I am going to predict whether a voice is male or female.

In [171]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

# Data set

In [172]:
df = pd.read_csv('voice.csv')
df = df.dropna() #get rid of rows with empty cells
df.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


'label' is the dependent variable, i.e. the target we want to predict.

# Data cleaning

In [173]:
df['label'].value_counts() #Let's have a look at how the two classes of 'label' are distributed

male      1584
female    1584
Name: label, dtype: int64

In [174]:
df.dtypes #Check the variable types before choosing features

meanfreq    float64
sd          float64
median      float64
Q25         float64
Q75         float64
IQR         float64
skew        float64
kurt        float64
sp.ent      float64
sfm         float64
mode        float64
centroid    float64
meanfun     float64
minfun      float64
maxfun      float64
meandom     float64
mindom      float64
maxdom      float64
dfrange     float64
modindx     float64
label        object
dtype: object

In [175]:
df_subset = df[['Q25', 'Q75', 'IQR', 'skew', 'kurt', 'centroid', 'sfm', 'label']]
df_subset.head()

Unnamed: 0,Q25,Q75,IQR,skew,kurt,centroid,sfm,label
0,0.015071,0.090193,0.075122,12.863462,274.402906,0.059781,0.491918,male
1,0.019414,0.092666,0.073252,22.423285,634.613855,0.066009,0.513724,male
2,0.008701,0.131908,0.123207,30.757155,1024.927705,0.077316,0.478905,male
3,0.096582,0.207955,0.111374,1.232831,4.177296,0.151228,0.727232,male
4,0.07872,0.206045,0.127325,1.101174,4.333713,0.13512,0.783568,male


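Encoding the values of the 'label' variable. Predictive models generally work with numeric values like 0 or 1, but 'label' contains the strings 'male' and 'female', which pandas stores with the 'object' data type. LabelEncoder converts them to integers, assigning codes in sorted order of the class names, so 'female' becomes 0 and 'male' becomes 1. The 'astype(str)' call does not do the encoding itself; it only makes sure every value is a string before the encoder sees it.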
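
To check which integer each class receives, the fitted encoder's `classes_` attribute can be inspected. The following is a minimal sketch on a toy list of labels (the list is made up for illustration and is not the real 'label' column):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'label' column (illustrative only)
labels = ['male', 'female', 'female', 'male']

le = LabelEncoder()
encoded = le.fit_transform(labels)  # fit and transform in one step

# classes_ is sorted, so the position of each class is its integer code
mapping = {str(cls): code for code, cls in enumerate(le.classes_)}
print(mapping)           # {'female': 0, 'male': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

`le.inverse_transform` goes the other way, from integer codes back to the original strings.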

In [176]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df_subset['label'].astype(str))
df_subset['label'] = le.transform(df_subset['label'].astype(str))
df_subset.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_subset['label'] = le.transform(df_subset['label'].astype(str))


Unnamed: 0,Q25,Q75,IQR,skew,kurt,centroid,sfm,label
0,0.015071,0.090193,0.075122,12.863462,274.402906,0.059781,0.491918,1
1,0.019414,0.092666,0.073252,22.423285,634.613855,0.066009,0.513724,1
2,0.008701,0.131908,0.123207,30.757155,1024.927705,0.077316,0.478905,1
3,0.096582,0.207955,0.111374,1.232831,4.177296,0.151228,0.727232,1
4,0.07872,0.206045,0.127325,1.101174,4.333713,0.13512,0.783568,1
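
The warning printed above appears because `df_subset` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to change `df` as well. The result here is still correct, but one common way to avoid the warning is to take an explicit copy when creating the subset. A minimal sketch, assuming a tiny stand-in DataFrame instead of voice.csv (all values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the voice data (values are illustrative only)
df = pd.DataFrame({
    'Q25': [0.015, 0.096, 0.079],
    'sfm': [0.49, 0.73, 0.78],
    'mode': [0.0, 0.08, 0.1],
    'label': ['male', 'female', 'male'],
})

# .copy() makes df_subset an independent DataFrame, so assigning to it is unambiguous
df_subset = df[['Q25', 'sfm', 'label']].copy()

le = LabelEncoder()
df_subset['label'] = le.fit_transform(df_subset['label'].astype(str))

print(df_subset['label'].tolist())  # [1, 0, 1]
print(df['label'].tolist())         # ['male', 'female', 'male'] (original is untouched)
```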


# Building the model

In [177]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df_subset[['Q25', 'Q75', 'IQR', 'skew', 'kurt', 'centroid', 'sfm']] #create the X matrix
X = normalize(X) #rescale each row (sample) to unit L2 norm; note that this does not put the columns (features) on the same scale
y = df_subset['label'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
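
A note on scaling: `normalize` works row by row (each sample is divided by its own L2 norm), whereas `StandardScaler` works column by column (each feature gets mean 0 and unit variance). Since k-NN relies on distances between samples, per-feature scaling may be closer to the intent of putting everything on the same scale. The sketch below only illustrates the difference on a made-up matrix; it is not the preprocessing used for the results in this notebook.

```python
import numpy as np
from sklearn.preprocessing import normalize, StandardScaler

# Two features on very different scales, similar in spirit to Q25 and kurt
# (the numbers are invented for illustration)
X_toy = np.array([[0.02, 274.0],
                  [0.10,   4.0],
                  [0.08, 634.0]])

# normalize: every ROW is divided by its own L2 norm
X_rows = normalize(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler: every COLUMN is shifted to mean 0 and scaled to unit variance
X_cols = StandardScaler().fit_transform(X_toy)
col_means = X_cols.mean(axis=0)
col_stds = X_cols.std(axis=0)

print(row_norms)  # all rows now have length 1
print(col_means)  # all columns now have mean ~0
print(col_stds)   # all columns now have standard deviation 1
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split.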

# Model evaluation

Let's use the KNeighborsClassifier class from sklearn.

In [178]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=5) #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the accuracy on the test data

0.8727655099894848

Accuracy is 87.3%. An easy comparison is the best baseline guess: always predict the same class, for example "male". Because the two classes are perfectly balanced, that would give us 1584 / (1584 + 1584) = 50%. So the model is a lot better than the baseline guess. Let's create a confusion matrix to evaluate precision and recall.
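
The baseline arithmetic can be written out as a small sketch. The full-data counts come from the `value_counts()` output above; the test-split counts (457 female, 494 male) are the row sums of the confusion matrix computed further down, so the baseline on the test split itself is slightly above 50%.

```python
# Class counts in the full data set, taken from value_counts()
counts = {'male': 1584, 'female': 1584}
baseline_accuracy = max(counts.values()) / sum(counts.values())

# Class counts in the test split, taken from the confusion matrix row sums
test_counts = {'female': 392 + 65, 'male': 56 + 438}
test_baseline = max(test_counts.values()) / sum(test_counts.values())

model_accuracy = 0.8727655099894848  # knn.score(X_test, y_test) from the cell above

print(f"baseline (all data):   {baseline_accuracy:.1%}")  # 50.0%
print(f"baseline (test split): {test_baseline:.1%}")      # 51.9%
print(f"k-NN:                  {model_accuracy:.1%}")     # 87.3%
```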

In [179]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[392,  65],
       [ 56, 438]])

In [180]:
knn.classes_ #the class labels known to the model, in the order used by the confusion matrix: 0 - female, 1 - male

array([0, 1])

In [181]:
#To make it easier to read, let's turn it into a dataframe and add labels (rows are the true classes, columns the predicted classes).
conf_matrix = pd.DataFrame(cm, index = ['female', 'male'], columns = ['female_predicted', 'male_predicted'])
conf_matrix
conf_matrix

Unnamed: 0,female_predicted,male_predicted
female,392,65
male,56,438
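
Precision and recall can be read off the confusion matrix directly. The sketch below re-enters the matrix from the output above and treats 'male' (encoded as 1) as the positive class; that choice is an assumption made for illustration, and the same calculation works with 'female' as the positive class.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[392,  65],
               [ 56, 438]])

# With 'male' (1) as the positive class, ravel() gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of all voices predicted male, how many are male
recall = tp / (tp + fn)          # of all male voices, how many were found
accuracy = (tp + tn) / cm.sum()  # should match knn.score(X_test, y_test)

print(f"precision: {precision:.3f}")  # 0.871
print(f"recall:    {recall:.3f}")     # 0.887
print(f"accuracy:  {accuracy:.3f}")   # 0.873
```

The same numbers should also come out of `precision_score` and `recall_score` from `sklearn.metrics` when called on `y_test` and `y_test_pred`.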
