In [1]:
NAME = "Artimes Rashidi Torghi"

In [2]:
%matplotlib inline 
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

### Section 2. Gender Recognition by Speech Analysis


**FOR ALL MODELS THAT REQUIRE A RANDOM STATE, SET THE RANDOM STATE TO 0**

**IF A HYPERPARAMETER IS NOT SPECIFIED, LEAVE IT AS DEFAULT**


#### Setting and Data
This dataset was created to identify a voice as male or female, based on acoustic properties of the voice and speech. It consists of 3,168 recorded voice samples collected from male and female speakers. The voice samples are pre-processed by acoustic analysis, with an analyzed frequency range of 0 Hz-280 Hz (human vocal range).

The CSV file contains 20 acoustic properties of each voice, and one outcome variable, “label”, which identifies the gender of the speaker. The detailed information is listed below (you do NOT need to read through the variable description). 

- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female

#### Data Preparation
Use the code below to load data and check the variable names.


In [3]:
import pandas as pd
voice = pd.read_csv('voice.csv') 
voice['label']=voice['label'].astype('category').cat.codes
voice.describe()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
count,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,...,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0,3168.0
mean,0.180907,0.057126,0.185621,0.140456,0.224765,0.084309,3.140168,36.568461,0.895127,0.408216,...,0.180907,0.142807,0.036802,0.258842,0.829211,0.052647,5.047277,4.99463,0.173752,0.5
std,0.029918,0.016652,0.03636,0.04868,0.023639,0.042783,4.240529,134.928661,0.04498,0.177521,...,0.029918,0.032304,0.01922,0.030077,0.525205,0.063299,3.521157,3.520039,0.119454,0.500079
min,0.039363,0.018363,0.010975,0.000229,0.042946,0.014558,0.141735,2.068455,0.738651,0.036876,...,0.039363,0.055565,0.009775,0.103093,0.007812,0.004883,0.007812,0.0,0.0,0.0
25%,0.163662,0.041954,0.169593,0.111087,0.208747,0.04256,1.649569,5.669547,0.861811,0.258041,...,0.163662,0.116998,0.018223,0.253968,0.419828,0.007812,2.070312,2.044922,0.099766,0.0
50%,0.184838,0.059155,0.190032,0.140286,0.225684,0.09428,2.197101,8.318463,0.901767,0.396335,...,0.184838,0.140519,0.04611,0.271186,0.765795,0.023438,4.992188,4.945312,0.139357,0.5
75%,0.199146,0.06702,0.210618,0.175939,0.24366,0.114175,2.931694,13.648905,0.928713,0.533676,...,0.199146,0.169581,0.047904,0.277457,1.177166,0.070312,7.007812,6.992188,0.209183,1.0
max,0.251124,0.115273,0.261224,0.247347,0.273469,0.252225,34.725453,1309.612887,0.981997,0.842936,...,0.251124,0.237636,0.204082,0.279114,2.957682,0.458984,21.867188,21.84375,0.932374,1.0


We would like to use all other variables to predict the gender of the speaker (label). To start, we prepare the data for analysis using code below.

In [4]:
from sklearn.model_selection import train_test_split

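# Note: 0:19 selects the first 19 feature columns (meanfreq ... dfrange) and leaves out modindx (column 19)
X = voice.iloc[:,0:19]
y = voice.iloc[:,20]  # label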

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

**Problem 1: Scaling and Basic Models**

- Use standard scaler to scale the data, so that the data can be applied for **supervised learning models**.
- Based on the data description, choose the proper Naive Bayes model. Report the training and test accuracy. 
    - If we train the same NB model on unscaled data, do you expect a performance change of the model? If yes, briefly explain how the performance will change. If no, provide explanations. [Discussion Only]
- Train a Decision Tree model with maximum depth = 2. Report the training and test accuracy. 
    - Suppose we train the same DT model on the same scaled training set but add a parameter that limits the maximum number of leaf nodes to 6. Do you expect a performance change of the model? If yes, briefly explain how the performance will change. If no, provide explanations. [Discussion Only]

In [5]:
# S1: Preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.mean(X_train_scaled))
print(np.var(X_train_scaled))
# StandardScaler rescales each feature to mean 0 and variance 1 (so the s.d. is also 1)



-1.6683840596883445e-16
1.0
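
The check above averages over the whole matrix; checking a single column is stricter. A minimal sketch using only the standard library, assuming one toy feature column (the values are illustrative, not taken from voice.csv). It mirrors the fit-on-train, transform-both pattern and uses the population standard deviation, as StandardScaler does.

```python
from statistics import fmean, pstdev

# Toy feature column (illustrative values, not from voice.csv).
train_col = [0.12, 0.18, 0.20, 0.25, 0.15]
test_col = [0.10, 0.22]

# "Fit": learn the mean and population standard deviation from training data only.
mu = fmean(train_col)
sigma = pstdev(train_col)

# "Transform": apply the training statistics to both sets.
train_scaled = [(v - mu) / sigma for v in train_col]
test_scaled = [(v - mu) / sigma for v in test_col]

print(fmean(train_scaled), pstdev(train_scaled))
```

The test column is scaled with the training mean and standard deviation, so its own mean and variance are generally not exactly 0 and 1.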


In [6]:
from sklearn.naive_bayes import GaussianNB
g_nb = GaussianNB()
g_nb.fit(X_train_scaled, y_train)
g_nb.score(X_test_scaled, y_test),g_nb.score(X_train_scaled, y_train)

(0.8926767676767676, 0.8943602693602694)

In [None]:
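#No. Gaussian Naive Bayes fits a separate mean and variance for each feature within each class,
#and standardizing is a per-feature linear rescaling that changes every class's likelihood by the
#same factor, so the predicted classes should stay the same. In scikit-learn, tiny differences are
#possible because var_smoothing adds a scale-dependent term to the variances.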
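
Whether standardization changes Gaussian Naive Bayes predictions can be checked by hand. The sketch below is a hand-rolled one-feature Gaussian NB (not scikit-learn's implementation, and without its variance smoothing) on toy values that are assumptions for illustration. It compares predictions on raw and standardized inputs.

```python
import math
from statistics import fmean, pvariance

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def nb_predict(x, groups):
    """Pick the class with the highest log prior plus Gaussian log-likelihood."""
    n_total = sum(len(values) for values in groups.values())
    scores = {}
    for label, values in groups.items():
        prior = math.log(len(values) / n_total)
        scores[label] = prior + gaussian_log_pdf(x, fmean(values), pvariance(values))
    return max(scores, key=scores.get)

# Toy one-feature training data per class (illustrative values).
raw = {0: [0.10, 0.12, 0.14, 0.11], 1: [0.17, 0.19, 0.16, 0.20]}
queries = [0.09, 0.13, 0.15, 0.18]

# Standardize with statistics pooled over all training values.
pooled = [v for values in raw.values() for v in values]
mu, sigma = fmean(pooled), math.sqrt(pvariance(pooled))

def scale(v):
    return (v - mu) / sigma

scaled = {label: [scale(v) for v in values] for label, values in raw.items()}

pred_raw = [nb_predict(q, raw) for q in queries]
pred_scaled = [nb_predict(scale(q), scaled) for q in queries]
print(pred_raw, pred_scaled)
```

Rescaling shifts every class's log-likelihood by the same constant, so the ranking of classes is unchanged.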

In [17]:
# Maximum depth is set to 3 here (the problem statement asks for 2), so the scores below are for depth 3

from sklearn.tree import DecisionTreeClassifier

# A Basic Tree
tree_3 = DecisionTreeClassifier(random_state = 0, max_depth = 3)

tree_3.fit(X_train_scaled, y_train)

DecisionTreeClassifier(max_depth=3, random_state=0)

In [18]:
tree_3.score(X_test_scaled, y_test), tree_3.score(X_train_scaled, y_train)

(0.9558080808080808, 0.9718013468013468)

In [None]:
#It depends on the depth limit. A tree of depth d has at most 2^d leaves, so with max depth = 2
#(at most 4 leaves) a cap of 6 leaf nodes is never reached and performance should not change.
#With the depth of 3 used above (up to 8 leaves), the cap can remove splits, so training accuracy
#should stay the same or drop, while test accuracy could move either way.
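
Whether a cap of six leaf nodes can matter depends on the depth limit, since a binary tree of depth d has at most 2^d leaves. A minimal sketch on synthetic data (an assumption; `make_classification` stands in for the voice features) that counts leaves with and without the cap:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice data (assumption: 19 numeric features).
X_demo, y_demo = make_classification(n_samples=500, n_features=19, random_state=0)

leaf_counts = {}
for depth in (2, 3):
    plain = DecisionTreeClassifier(max_depth=depth, random_state=0)
    capped = DecisionTreeClassifier(max_depth=depth, max_leaf_nodes=6, random_state=0)
    plain.fit(X_demo, y_demo)
    capped.fit(X_demo, y_demo)
    # Store (leaves without cap, leaves with cap) for this depth.
    leaf_counts[depth] = (plain.get_n_leaves(), capped.get_n_leaves())

print(leaf_counts)
```

At depth 2 neither tree can exceed four leaves, so the cap is inactive; at depth 3 the cap may remove up to two leaves.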

**Problem 2: Linear and Kernel SVM**

- Train a linear SVM classifier with C = 1. Report the training and test accuracy. 

- Tune a kernel SVM classifier. Let C be 10^k, where k are integers from -1 to 2, inclusive. Consider two kernel functions: polynomial and gaussian, each with their default hyperparameter specification.
    - What is the optimal C and kernel function?
    - Report the training and test accuracy of the best model. 

- Suppose we train another kernel SVM classifier with C = 1000 (same kernel as the chosen model) and default hyperparameters. Without training the new model, do you expect its training accuracy to be higher or lower than the previous (chosen) kernel SVM model? Explain briefly.


In [19]:
# Linear SVC 
from sklearn.svm import LinearSVC

lr_svm = LinearSVC(random_state = 0,C = 1)
lr_svm.fit(X_train_scaled, y_train)
lr_svm.score(X_test_scaled, y_test), lr_svm.score(X_train_scaled, y_train)

(0.9747474747474747, 0.9718013468013468)

In [30]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# Define Function
base_svm = SVC(random_state = 0, kernel = 'rbf')

#define a list of parameters
param_svc_kernel = {'C':  [ 0.1, 1, 10, 100]  } # C = 10^k for k = -1, 0, 1, 2

#apply grid search
grid_svm = GridSearchCV(base_svm, param_svc_kernel, cv = 5, n_jobs=2)

grid_svm.fit(X_train_scaled, y_train)
grid_svm.best_params_

{'C': 10}

In [31]:
from sklearn.svm import SVC
svc_kernel_basicc = SVC(random_state = 0, kernel = 'rbf',C=10)
svc_kernel_basicc.fit(X_train_scaled, y_train)
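# Both scores must come from the RBF model. The training score recorded below matches the
# polynomial model's and should be re-checked after rerunning this cell.
svc_kernel_basicc.score(X_test_scaled, y_test),svc_kernel_basicc.score(X_train_scaled, y_train)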

(0.9835858585858586, 0.9886363636363636)

In [32]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# Define Function
base_svm = SVC(random_state = 0, kernel = 'poly')

#define a list of parameters
param_svc_kernel = {'C':  [ 0.1, 1, 10, 100]  } # C = 10^k for k = -1, 0, 1, 2

#apply grid search
grid_svm = GridSearchCV(base_svm, param_svc_kernel, cv = 5, n_jobs=2)

grid_svm.fit(X_train_scaled, y_train)
grid_svm.best_params_

{'C': 10}

In [33]:
from sklearn.svm import SVC
svc_kernel_basic = SVC(random_state = 0, kernel = 'poly',C=10)
svc_kernel_basic.fit(X_train_scaled, y_train)
svc_kernel_basic.score(X_test_scaled, y_test),svc_kernel_basic.score(X_train_scaled, y_train)

(0.9747474747474747, 0.9886363636363636)

In [None]:
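#Its training accuracy is expected to be higher, or about the same. A larger C penalizes margin
#violations more heavily, so the model moves toward a hard margin and tolerates fewer points inside
#the street. The cost is a higher risk of capturing noise and overfitting, so test accuracy may not improve.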
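
The two separate searches above can also be run as one search over both kernel and C, which compares all eight combinations on the same cross-validation folds. A minimal sketch on synthetic data (an assumption; in the notebook the call would use `X_train_scaled` and `y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=19, random_state=0)

param_grid = {
    "kernel": ["poly", "rbf"],
    "C": [10.0 ** k for k in range(-1, 3)],  # 0.1, 1, 10, 100
}

# One search over both hyperparameters, with 5-fold cross-validation.
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

`best_params_` then reports the kernel and C together, so no manual comparison between two searches is needed.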

**Problem 3: Ensemble Method I**

For now, let us focus on the models we have trained: Decision Tree, Naïve Bayes, Linear SVM, and Kernel SVM.
- Use voting classifier (with hard voting method) that includes the above-mentioned FOUR models. For kernel SVM, use the optimal model.



In [38]:
# S2: Apply Voting Classifier

from sklearn.ensemble import VotingClassifier

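# define voting classifier
# Note: the problem asks for four models; the linear SVM (lr_svm) is not included here,
# so the scores below are for a three-model ensemble.
voting_clf = VotingClassifier(
    estimators=[('svm', svc_kernel_basicc), ('DT', tree_3), ('naive', g_nb)],
    voting='hard')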


# VotingClassifier(
#     estimators = [ ('lr', log_clf)  , ('svm', svm_clf), ('g_nb', nbg_clf) ], 
#     voting = 'hard')


# train the model
voting_clf.fit(X_train_scaled, y_train)

# Performance Measure
print("Test score for voting classifier is:", voting_clf.score(X_test_scaled, y_test))
print("Train score for voting classifier is:", voting_clf.score(X_train_scaled, y_train))

Test score for voting classifier is: 0.9747474747474747
Train score for voting classifier is: 0.9797979797979798
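
The problem statement asks for four models in the ensemble. A minimal sketch of a four-model hard-voting classifier on synthetic data; the data, the estimator names, and the choice of an RBF kernel with C = 10 for the kernel SVM are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=19, random_state=0)

voting_four = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=2, random_state=0)),
        ("nb", GaussianNB()),
        ("linear_svm", LinearSVC(C=1, random_state=0)),
        ("kernel_svm", SVC(kernel="rbf", C=10, random_state=0)),
    ],
    voting="hard",  # each fitted model casts one vote per sample
)
voting_four.fit(X_demo, y_demo)

train_score = voting_four.score(X_demo, y_demo)
print(train_score)
```

With an even number of voters, 2-2 ties are possible; scikit-learn's documentation describes ties as resolved by ascending class order.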


In [None]:
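#No, it does not. First, an ensemble method does not guarantee better performance: here the voting
#classifier's test accuracy (0.975) is below that of the kernel SVM alone (0.984).
#Second, when a strong model such as the SVM is combined with weaker learners, an unweighted
#majority vote is less likely to help, because the weaker models can outvote it.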

In [None]:
#It would be better than hard voting, but I think it would not help that much.
#It is better than hard voting because it does not treat, for example, the SVM and Naive Bayes as equals.
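
One common alternative to hard voting is soft voting, which averages predicted probabilities instead of counting votes. The toy sketch below uses hypothetical probabilities from three models for a single sample and shows how the two rules can disagree.

```python
from collections import Counter

# Hypothetical P(class = 1) from three models for a single sample.
proba_class1 = [0.55, 0.55, 0.05]

# Hard voting: each model casts one vote for its most likely class.
votes = [1 if p >= 0.5 else 0 for p in proba_class1]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities, then pick the more likely class.
mean_proba = sum(proba_class1) / len(proba_class1)
soft_prediction = 1 if mean_proba >= 0.5 else 0

print(hard_prediction, soft_prediction)
```

Two lukewarm votes for class 1 win under hard voting, while one confident vote for class 0 wins under soft voting. In scikit-learn, soft voting needs `predict_proba`, so `SVC` must be created with `probability=True`, and `LinearSVC` cannot take part as is.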

In [None]:
#RandomForest:

In [42]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=200, random_state=0,max_depth = 2)
rnd_clf.fit(X_train_scaled, y_train)

print(rnd_clf.score(X_test_scaled, y_test))
print(rnd_clf.score(X_train_scaled, y_train))

0.9633838383838383
0.9608585858585859


In [41]:
from sklearn.ensemble import AdaBoostClassifier

# Define base model
naive_dt = DecisionTreeClassifier(max_depth=2)

# AdaBoost
ada_clf = AdaBoostClassifier(
    naive_dt, n_estimators=200,random_state=0)

ada_clf.fit(X_train_scaled, y_train)

# Performance
print(ada_clf.score(X_test_scaled, y_test))
print(ada_clf.score(X_train_scaled, y_train))

0.9835858585858586
1.0


In [None]:
#Random forest builds on bagging with decision trees: the randomness comes from training each tree on
#a different bootstrap sample and considering a random subset of features at each split, and the
#prediction is a majority vote over the trees.
#AdaBoost rests on two ideas: (1) training points are not weighted equally (misclassified points get
#more weight in later rounds), and (2) models are not weighted equally (more accurate ones count more).
#Its prediction is a weighted combination of the predictions of the k individual models.
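
The two AdaBoost ideas (unequal point weights and unequal model weights) can be traced through one boosting round. This sketch uses the classic binary AdaBoost update on a toy setup; it is an illustration under those assumptions, not scikit-learn's exact implementation.

```python
import math

# Five training points with equal initial weights; one weak learner
# misclassifies only the last point (toy setup for illustration).
weights = [0.2] * 5
correct = [True, True, True, True, False]

# Weighted error of the weak learner and its vote weight (alpha).
error = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * math.log((1 - error) / error)

# Down-weight correctly classified points, up-weight mistakes, renormalize.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
weights = [w / total for w in updated]

print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.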

In [None]:
#The data are balanced: about half of the voices are male and half are female.
#With a 50/50 split, 51% accuracy is barely better than random guessing, so a single model at 51% is not good.
#If many weak learners each reach 51% and are independent of each other, an ensemble method is more likely to help.
#But if the accuracy is still 51% after applying the ensemble method, the result is not good.
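
The effect of combining independent 51% learners can be quantified with a binomial sum. A minimal sketch, assuming the models' errors are fully independent (an idealization that rarely holds in practice) and that the number of models is odd so no ties occur:

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that more than half of n independent models are correct (n odd)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

for n in (1, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The accuracy of the majority vote grows with the number of models, but only slowly when each model is barely better than chance.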

**Problem 5: Clustering (10 pts)**

Consider another scenario where we only collect the acoustic information and fail to collect the gender information (e.g., in an online setting, some users prefer not to disclose their gender). In this case, I'll consider using clustering methods to find groups of samples with similar acoustic information.
- Use standard scaler to scale the data, so that the data can be applied for **unsupervised learning models.**
- Train a DBSCAN model with MinPts = 5 and epsilon = 3.

In [44]:
# S1: Preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)

X_scaled = scaler.transform(X)




In [46]:
# S2: Apply DBSCAN
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=3, min_samples=5)

clusters = dbscan.fit_predict(X_scaled)

print(clusters)

np.max(clusters)

# 2 clusters

[-1 -1 -1 ...  0  0  0]


1

In [None]:
# We have 2 clusters (labels 0 and 1); points labeled -1 are noise

In [47]:
# S2: K-Means Clustering
from sklearn.cluster import KMeans

km_3 =  KMeans(n_clusters = 2, random_state = 0)

km_3.fit(X_scaled, y)

# print("Cluster Centers: \n", )  

KMeans(n_clusters=2, random_state=0)

In [48]:
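# Predict the cluster labels on the scaled features the model was fitted on
# (the labels printed below came from predicting on unscaled X and should be re-checked)
cls_predict = km_3.predict(X_scaled)
# cls_predict
print("Predicted Labels:", cls_predict)
# KMeans.score returns the negative inertia (within-cluster sum of squares), not an accuracy
print("Performance Score:", km_3.score(X_scaled, y))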

Predicted Labels: [0 0 0 ... 1 1 0]
Performance Score: -41046.18784171887
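
The large negative "performance score" is not an accuracy. A minimal sketch on synthetic blobs (an assumption; not the voice data) showing that `KMeans.score` equals the negative of the fitted model's `inertia_`, the within-cluster sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two synthetic clusters as a stand-in for the scaled features (assumption).
X_demo, _ = make_blobs(n_samples=200, centers=2, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# score() is the negative within-cluster sum of squared distances.
print(km.score(X_demo), -km.inertia_)
```

Values closer to zero mean tighter clusters, but the number is only comparable between models fitted on the same data.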


In [None]:
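#Yes: DBSCAN finds exactly two clusters (plus noise points labeled -1), which may correspond to male
#and female, although that has to be confirmed by comparing the clusters with the true labels.
#For k-means we specify k = 2 because we expect two groups (male and female), so it always separates
#the data into two clusters.
#If the data contain outliers, DBSCAN can perform better than k-means because it labels them as noise.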
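
Whether two clusters really correspond to male and female can be checked by cross-tabulating cluster labels against the known labels. The toy sketch below uses hypothetical label lists rather than the notebook's `clusters` and `y`.

```python
from collections import Counter

# Hypothetical DBSCAN output and true labels for eight samples.
cluster_labels = [-1, 0, 0, 0, 1, 1, -1, 1]
true_labels = [0, 0, 0, 1, 1, 1, 1, 1]

# Noise points carry the label -1 and do not count as a cluster.
n_clusters = len(set(cluster_labels) - {-1})
n_noise = cluster_labels.count(-1)

# Cross-tabulate cluster membership against the known labels.
crosstab = Counter(zip(cluster_labels, true_labels))

print(n_clusters, n_noise)
for (cluster, label), count in sorted(crosstab.items()):
    print(f"cluster={cluster} label={label} count={count}")
```

If each cluster is dominated by one label, the clustering lines up with gender; a mixed table means the clusters capture some other structure.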

**PCA**

In this question, I conduct dimension reduction on our features using PCA.
- Apply PCA to original features (without scaling) and keep 4 components. 
    - Report: (1) the explained variance ratio of the first four components, and (2) the coefficients to obtain the first component.

- Apply PCA to features after standard scaling (i.e., data after scaling in Question 6) and keep 4 components. 
    - Report: (1) the explained variance ratio of the first four components, and (2) the coefficients to obtain the first component.

- Now compare the two PCA methods, and answer questions below.
    

In [49]:
# S1: Apply PCA
from sklearn.decomposition import PCA

pca_2 = PCA(n_components = 4, whiten = True, random_state = 0)

X2D = pca_2.fit_transform(X)

print("Before transfer, the dimension is:", X.shape[1], "\n",
      "After transfer, the dimension is:", X2D.shape[1])

Before transfer, the dimension is: 19 
 After transfer, the dimension is: 4


In [50]:
# S2: Finding the "explained variance", or the information kept after transfer

pca_2.explained_variance_ratio_

# 

np.sum(pca_2.explained_variance_ratio_)

# keeps more than 99.99% of the total variance

0.9999977243489689

In [51]:
pca_2.components_

array([[-7.00771873e-05,  4.27241447e-05, -6.55942643e-05,
        -1.26314510e-04, -2.60997882e-05,  1.00214722e-04,
         3.06914989e-02,  9.99477060e-01, -4.25065635e-05,
         1.44660905e-04, -2.32694432e-04, -7.00771873e-05,
        -4.65719108e-05, -2.89465815e-05, -1.02112765e-05,
        -1.18093496e-03, -4.84225941e-05, -7.16975128e-03,
        -7.12132868e-03],
       [ 2.79533266e-03, -1.38316650e-03,  2.92352677e-03,
         3.80016113e-03,  1.51786755e-03, -2.28229358e-03,
        -3.56027611e-02,  1.12633067e-02, -3.45238806e-03,
        -1.55022124e-02,  6.10411420e-03,  2.79533266e-03,
         1.55657922e-03,  1.09225993e-03,  2.25184112e-03,
         8.31519275e-02, -1.49915783e-04,  7.03965197e-01,
         7.04115112e-01],
       [-9.03035639e-04,  3.67640770e-03,  6.16935537e-04,
        -1.01844232e-02,  6.10644258e-03,  1.62908658e-02,
        -9.96773616e-01,  3.02550823e-02,  2.10302175e-02,
         4.40278482e-02,  8.97860624e-03, -9.03035639e-04,
    

In [None]:
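#x_new = -7.00771873e-05*x1 + 4.27241447e-05*x2 - 6.55942643e-05*x3 - 1.26314510e-04*x4 + ...
#(features mean-centered; the remaining coefficients are the rest of the first row of components_)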
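
How the rows of `components_` turn features into new coordinates can be verified directly. A minimal sketch on random data (an assumption) that reproduces `PCA.transform` by hand, including the extra rescaling applied when `whiten=True` as in this notebook:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data with five features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

pca = PCA(n_components=2, whiten=True, random_state=0).fit(X_demo)

# Project by hand: center, take the dot product with each component row,
# then divide by the component's standard deviation (because whiten=True).
manual = (X_demo - pca.mean_) @ pca.components_.T / np.sqrt(pca.explained_variance_)

print(np.allclose(manual, pca.transform(X_demo)))
```

Without whitening, the division step is dropped and each new coordinate is just the weighted sum of the centered features.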

In [54]:
# S1: Apply PCA
from sklearn.decomposition import PCA

pca_ = PCA(n_components = 4, whiten = True, random_state = 0)

X2D = pca_.fit_transform(X_scaled)

print("Before transfer, the dimension is:", X.shape[1], "\n",
      "After transfer, the dimension is:", X2D.shape[1])

Before transfer, the dimension is: 19 
 After transfer, the dimension is: 4


In [55]:
# S2: Finding the "explained variance", or the information kept after transfer

pca_.explained_variance_ratio_

# 

np.sum(pca_.explained_variance_ratio_)

# keeps about 77% of the total variance

0.7736780290179156

In [56]:
pca_.components_

array([[-0.31450539,  0.28175124, -0.28007641, -0.30537643, -0.18807944,
         0.24354445,  0.13436827,  0.13587422,  0.2229221 ,  0.27520915,
        -0.24397436, -0.31450539, -0.18922935, -0.15983627, -0.10756097,
        -0.22627007, -0.09237636, -0.22692684, -0.22533772],
       [ 0.03469199, -0.17726021,  0.0145081 ,  0.13551239, -0.16114449,
        -0.24322817,  0.41415506,  0.36312139, -0.39346786, -0.24928609,
        -0.13269506,  0.03469199,  0.12028806, -0.07355431, -0.18455983,
        -0.25464323,  0.24768088, -0.26641181, -0.27095034],
       [ 0.04658176,  0.17586645,  0.09289558, -0.10373272,  0.25745498,
         0.26028364,  0.45837217,  0.49984064, -0.12487963, -0.06039119,
        -0.06453794,  0.04658176, -0.03144965,  0.05373338,  0.34788174,
         0.15319886, -0.30251196,  0.20708172,  0.21258742],
       [-0.24240499, -0.18892872, -0.33688527, -0.06970064, -0.52481814,
        -0.21067488,  0.05044269,  0.02281774, -0.14814286, -0.03216107,
        -0.158

In [None]:
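#x_new = -0.31450539*x1 + 0.28175124*x2 - 0.28007641*x3 - 0.30537643*x4 + ...
#(features standardized; the remaining coefficients are the rest of the first row of components_)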

In [None]:
#Yes. Without scaling, the first component is almost entirely kurt (coefficient 0.9995) and every
#other coefficient is close to zero, because the difference in scale among the features dominates.
#After scaling, the coefficients are spread across the features, with magnitudes of roughly 0.09 to 0.31.
#The signs (+, -) of the first four coefficients are the same in both cases, so the direction of the
#relationship between x_new and those four features is unchanged.

In [None]:
#Yes. Without scaling, the four components explain almost 100% of the variance (0.999998), but only
#because a few large-scale features dominate the total variance, so that ratio is misleading.
#After scaling, the four components explain about 77% (0.7737).

In [None]:
#kurt has by far the largest values (std of about 134.9, max of about 1309.6), so it dominates the
#explained variance ratio of the unscaled PCA; standard scaling fixes this problem
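
The effect of one large-scale feature on PCA can be reproduced with NumPy alone. This sketch uses three independent toy features, one with a much larger spread (an assumption that loosely mimics kurt), and compares the first principal component before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three independent toy features; the first has a far larger spread.
X_demo = rng.normal(size=(500, 3)) * np.array([100.0, 1.0, 1.0])

def first_component(data):
    """Return the top eigenvector of the covariance matrix and its variance share."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    # eigh sorts eigenvalues in ascending order, so the last one is the largest.
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()

vec_raw, ratio_raw = first_component(X_demo)
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
vec_std, ratio_std = first_component(X_std)

print(np.round(vec_raw, 3), round(float(ratio_raw), 4))
print(np.round(vec_std, 3), round(float(ratio_std), 4))
```

On the raw data the first component points almost entirely along the large-scale feature and absorbs nearly all the variance; after standardization no single feature dominates.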