Hello, this is an SVM analysis based on what I could gather from the internet (:p) and on the notebook by [Niraj Verma](https://www.kaggle.com/nirajvermafcb/support-vector-machine-detail-analysis). Please go through this notebook and let me know whether it makes sense. Do evaluate it critically and tell me where I can do better. Thanks!

I have given a few links below for anyone who wants to understand the math behind it.

References:

https://towardsdatascience.com/understanding-support-vector-machine-part-1-lagrange-multipliers-5c24a52ffc5e

https://www.youtube.com/watch?v=ax8LxRZCORU

https://www.youtube.com/watch?v=_PwhiWxHK8o

The contents of the notebook are given below:<br>
- [About this Dataset](#About-this-Dataset)
- [Check the data](#Step-1:-Check-the-data)
- [EDA](#Step-2:-EDA)
- [Preprocessing and Model building](#Step-3:-Preprocessing-and-Model-builiding)


# About this Dataset

### Voice Gender
#### Gender Recognition by Voice and Speech Analysis

This database was created to identify a voice as male or female, based on acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz-280 Hz (human vocal range).

The following acoustic properties of each voice are measured and included within the CSV:

- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female

### Questions
- What other features differ between male and female voices?
- Can we find a difference in resonance between male and female voices?
- Can we identify falsetto from regular voices? (separate data-set likely needed for this)
- Are there other interesting features in the data?

#### Step 1: Check the data

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm


In [None]:
data = pd.read_csv('../input/voicegender/voice.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe([.25,.50,.75,.80,.90])

In [None]:
data.info()

In [None]:
data.isna().sum() #no missing data

#### Step 2: EDA

In [None]:
#univariate
def dist_male(x):
    if x == 'label':
        pass
    else:
        data[x][data['label']=='male'].plot.kde()
        plt.xlabel(x)
        plt.show()

In [None]:
def dist_female(y):
    if y == 'label':
        pass
    else:
        data[y][data['label']=='female'].plot.kde(color='maroon')
        plt.xlabel(y)
        plt.show()

In [None]:
cols = data.columns.drop('label')


for j, i in enumerate(cols):
#     print(j)
    dist_male(i)
    dist_female(i)

In [None]:
data_male = data[data['label']=='male'].drop('label', axis=1)
data_female = data[data['label']=='female'].drop('label', axis=1)

In [None]:
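def box_plt_m(x, ax):
    # Box plot of feature x for the male samples, drawn on the given axes
    sns.boxplot(x=x, data=data_male, ax=ax)
    ax.set_title(f'{x} (male)')


def box_plt_f(x, ax):
    # Box plot of feature x for the female samples, drawn on the given axes
    sns.boxplot(x=x, data=data_female, ax=ax)
    ax.set_title(f'{x} (female)')

In [None]:
for i in cols:
    # One figure per feature: male on the left, female on the right
    fig, (ax_m, ax_f) = plt.subplots(1, 2, figsize=(20, 3))
    box_plt_m(i, ax_m)
    box_plt_f(i, ax_f)
    plt.show()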

<b>Inference:</b>

#### What other features differ between male and female voices?

    - The meanfreq has a different mean for male and female, and there are more outliers in the female data than in the male data
    - The standard deviation (sd) is larger for male than for female
    - The median differs slightly between male and female, with more outliers in female
    - The Q25 has many outliers on the low side for female, while for male it has outliers on both sides, more of them on the low side
    - The Q75 also has a slightly different median
    - The IQR differs significantly between male and female, with the male IQR having outliers on both sides (low and high)
    - The skew has many outliers for both female and male
    - The sp.ent and sfm are almost the same for male and female
    - The mode is also similar but has outliers for female
    - The centroid is similar but has outliers for both male and female
    - The meanfun and minfun are also similar, although their distributions differ
    - The maxfun and meandom are almost the same
    - The mindom varies in distribution, and the male data points have many outliers
    - The maxdom and dfrange are also similar
    - The modindx is the same

#### Can we find a difference in resonance between male and female voices? <br>

There are a number of factors which determine the resonance characteristics of a resonator. Included among them are the following: size, shape, type of opening, composition and thickness of the walls, surface, and combined resonators. The quality of a sound can be appreciably changed by rather small variations in these conditioning factors

Source: Wikipedia

#### Can we identify falsetto from regular voices? (separate data-set likely needed for this)

Yes, a separate dataset will be needed.

In [None]:
#check the correlation between features
#bivariate
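data.drop('label', axis=1).corr()  # drop the string 'label' column; recent pandas versions raise an error on it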

In [None]:
plt.figure(figsize=(20,15))
sns.heatmap(data.drop('label', axis=1).corr(), annot=True, fmt='.2g')

<b>Inference: </b>

#### Are there other interesting features in the data?

- Multicollinearity is high: many pairs of feature variables have large correlation coefficients
- This can affect our models if we use logistic regression or other linear classifiers
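
To make the multicollinearity point concrete, the strongly correlated pairs can be listed directly instead of being read off the heatmap. The sketch below is only an illustration: it uses a small synthetic frame (`demo`, with made-up columns `a`, `b`, `c`) as a stand-in for the voice features, and the 0.5 threshold is the same one used for the masked heatmap in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the voice features (assumption: not the real data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # nearly a copy of 'a'
    'c': rng.normal(size=200),                 # independent of 'a' and 'b'
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
high_pairs = pairs[pairs >= 0.5]
print(high_pairs)
```

On the real data, replacing `demo` with the feature columns of `data` (without `label`) should give the same kind of listing.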
    

In [None]:
# checking for more than .50 and -.50 correlation

In [None]:
plt.figure(figsize=(20,15))
corr = data.drop('label', axis=1).corr()  # compute the correlation matrix once
sns.heatmap(corr, annot=True, fmt='.2g', mask=corr.abs() < .50)  # hide cells with |correlation| below .50

#### Since SVMs are not much affected by multicollinearity, we will move on to model building and cross validation.

Before that, we will also check for class imbalance.

In [None]:
print('The number of male in our output is: ',data[data['label']=='male'].shape[0])
print('The number of female in our output is: ',data[data['label']=='female'].shape[0])

#### Step 3: Preprocessing and Model builiding

First let us encode the 'label' column (object dtype) as integers

In [None]:
y = data.iloc[:, -1]

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
y

Data standardization: since an SVM classifies using distances between points, features on larger scales would dominate, so we need to standardize the data

In [None]:
from sklearn.preprocessing import StandardScaler

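X = data.iloc[:,:-2]  # note: [:-2] drops 'label' and also the column before it ('modindx' in the listing above); [:-1] would keep every feature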
std_scaler = StandardScaler()
std_scaler.fit(X)

X = std_scaler.transform(X)
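
As a rough check on what the scaler does, the sketch below standardizes two synthetic columns on very different scales (made-up numbers, not the voice data) and confirms that each ends up with mean 0 and standard deviation 1. Before scaling, the second column would dominate any Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (assumption: illustrative values only)
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(0.2, 0.03, 500),   # small-scale feature
    rng.normal(30, 100, 500),     # large-scale feature
])

scaled = StandardScaler().fit_transform(raw)

print('std before scaling:', raw.std(axis=0))
print('mean after scaling:', scaled.mean(axis=0).round(6))
print('std after scaling: ', scaled.std(axis=0).round(6))
```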

In [None]:
#splitting into train test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3, random_state=1)

First we shall try on default parameters

Default SVM (RBF)

In [None]:
from sklearn.svm import SVC
from sklearn import metrics

svc = SVC() #default parameters
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print(f'The score for this model {svc.__class__.__name__} is {metrics.accuracy_score(y_test, y_pred)}')

Default SVM  (Linear)

In [None]:
svc = SVC(kernel='linear') #default parameters
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print(f'The score for this model {svc.__class__.__name__} is {metrics.accuracy_score(y_test, y_pred)}')

Default SVM (Polynomial)

In [None]:
svc = SVC(kernel='poly') #default parameters
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print(f'The score for this model {svc.__class__.__name__} is {metrics.accuracy_score(y_test, y_pred)}')

The polynomial kernel did not do as well as the other kernels, but all three have a high accuracy score.


We need to check with <b>K-fold cross validation</b> whether the results stay the same when the data is split into training and test sets several times.

Default SVM (RBF)

In [None]:
from sklearn.model_selection import cross_val_score

svc = SVC()
score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(score)
print('The mean accuracy for the model on 10-fold cross validation is: %.3f' % score.mean())

Default SVM (Linear)

In [None]:
svc = SVC(kernel='linear')
score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(score)
print('The mean accuracy for the model on 10-fold cross validation is: %.3f' % score.mean())

Default SVM (Polynomial)

In [None]:
svc = SVC(kernel='poly')
score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(score)
print('The mean accuracy for the model on 10-fold cross validation is: %.2f' % score.mean())

<b>After K fold cross validation we get accuracy score for RBF and Linear as .97 and polynomial as .94</b>

Cross validation splits the data into training and test sets several times (here cv is 10) and gives us an accuracy score for each split. A single score depends on the data and on how the split happened to fall; averaging over the folds reduces that dependence.
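
In this notebook the scaler is fitted on the full dataset before any splitting, so each test fold has already influenced the scaling. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the voice dataset), is to put the scaler and the SVM in a pipeline so that the scaler is re-fitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler inside the pipeline only ever sees the training part of each fold
model = make_pipeline(StandardScaler(), SVC())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

print('fold accuracies:', scores.round(3))
print('mean accuracy: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean gives a feel for how much the score moves from fold to fold.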

#### Taking different values of C and checking which is performing better

The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.

Reference: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

For a large value of C, I am basically asking for every training point to be correctly classified, and I am less concerned with the width of the margin.

For a small value of C, I am basically asking for the widest margin between the two classes, and I do not mind some misclassification.
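
The trade-off can be seen numerically with a linear kernel, where the margin width is 2 / ||w||. The sketch below is an illustration on two overlapping synthetic blobs (made-up centers and spread, not the voice data): it fits the same model with a small, a medium and a large C and prints the margin width and the number of support vectors for each.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic classes (assumption: illustrative data only)
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0
)

margins = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    width = 2 / np.linalg.norm(clf.coef_)  # margin width of a linear SVM
    margins[C] = width
    print(f'C={C}: margin width={width:.3f}, support vectors={clf.n_support_.sum()}')
```

The smallest C should give the widest margin, and the margin should shrink as C grows.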

#### Putting different C and checking the result for linear model

In [None]:
C_range = list(range(1,26))


acc_score = []

for i in C_range:
    svc = SVC(kernel='linear', C=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())
best = int(np.argmax(acc_score))  # position of the highest mean accuracy
print('The best mean accuracy on 10-fold cross validation for C in 1-25 is {:.4f} at C = {}'.format(acc_score[best], C_range[best]))
    

In [None]:
#plotting a graph

plt.plot(C_range, acc_score)
plt.xticks(np.arange(0,27,2))
plt.xlabel('C values')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
#fine tuning to see which c is the best

In [None]:
C_range = list(np.arange(7,13,.1))

acc_score = []

for i in C_range:
    svc = SVC(kernel='linear', C=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())
best = int(np.argmax(acc_score))  # position of the highest mean accuracy
print('The best mean accuracy on 10-fold cross validation for C in 7-13 is {:.4f} at C = {:.1f}'.format(acc_score[best], C_range[best]))


In [None]:

plt.plot(C_range, acc_score)
plt.xticks(np.arange(7,14,1))
plt.xlabel('C values')
plt.ylabel('Cross-Validated Accuracy')

<b>Inference:</b>
We have a range of C values (7 to 12) that give the same accuracy. Increasing C tells the classifier to put more weight on classifying every training point correctly, so the margin becomes narrower.

We are also testing with the linear kernel, whose accuracy on this dataset changes little with C. Gamma is not used by the linear kernel at all, as the check below shows.

Also the model evaluation is done on accuracy, which is (true positive + true negative) / (true positive + true negative + false positive + false negative)
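
The accuracy formula can be checked by hand against scikit-learn. The labels below are a made-up toy example (not predictions from this notebook), chosen so the four counts are easy to follow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (assumption: hand-written for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_accuracy, accuracy_score(y_true, y_hat))  # 0.75 0.75
```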

In [None]:
#checking Gamma for kernel=linear

In [None]:
gamma_range = [.00001,.0001,.001,.01,.1,1,10,100]


acc_score = []

for i in gamma_range:
    svc = SVC(kernel='linear', gamma=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())
# print('The best mean accuracy for the model on 10 K fold cross validation with a range of C value (0-25) is: {} and index {}'.format(max(acc_score), acc_score.index(max(acc_score))))
acc_score  

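Gamma has no effect on the model when the kernel is linear: the linear kernel is a plain dot product and has no gamma parameter, so scikit-learn ignores the value.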
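
A direct way to confirm this is to compare decision values rather than accuracies. The sketch below uses synthetic data from `make_classification` (an assumption, not the voice dataset) and fits a linear-kernel SVC with three very different gamma values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit the same linear-kernel model with very different gamma values
decisions = [
    SVC(kernel='linear', gamma=g).fit(X_demo, y_demo).decision_function(X_demo)
    for g in (0.001, 1.0, 100.0)
]

same = all(np.allclose(decisions[0], d) for d in decisions[1:])
print('identical decision values across gamma:', same)
```

With `kernel='rbf'` the same comparison would be expected to show differences, because gamma enters the kernel as exp(-gamma * ||x - x'||^2).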

In [None]:
#checking c and Gamma for kernel=rbf

In [None]:
C_range = list(range(1,25))


acc_score = []

for i in tqdm(C_range):
    svc = SVC(kernel='rbf', C=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())
# print('The best mean accuracy for the model on 10 K fold cross validation with a range of C value (0-25) is: {} and index {}'.format(max(acc_score), acc_score.index(max(acc_score))))

acc_score

In [None]:
plt.plot(C_range, acc_score)
plt.xticks(np.arange(1,26,1))
plt.xlabel('C values')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
gamma_range = [.00001,.0001,.001,.01,.1,1,10,100]


acc_score = []

for i in gamma_range:
    svc = SVC(kernel='rbf', gamma=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())
print('The best mean accuracy for the model on 10 K fold cross validation with a range of gamma is: {} and index {}'.format(max(acc_score), acc_score.index(max(acc_score))))
acc_score  

In [None]:
plt.plot(gamma_range, acc_score)
plt.xscale('log')  # gamma spans several orders of magnitude
plt.xlabel('gamma values')
plt.ylabel('Cross-Validated Accuracy')

Accuracy changes with both C and gamma. C between 1 and 2 gives the highest accuracy, and gamma = .01 gives the highest model accuracy.

<b>Taking both gamma and C value together</b>

In [None]:
gamma_range = [.00001,.0001,.001,.01,.1,1,10,100]
C_range = list(range(1,25))


acc_score = []


for j in tqdm(gamma_range):
    for i in C_range:
        svc = SVC(kernel='rbf', C=i, gamma=j)
        score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
        acc_score.append(score.mean())
# print('The best mean accuracy for the model on 10 K fold cross validation with a range of C value (0-25) is: {} and index {}'.format(max(acc_score), acc_score.index(max(acc_score))))

temp = pd.DataFrame(acc_score)

In [None]:
temp['gamma_C'] = [(g, c) for g in gamma_range for c in C_range]  # same order as the nested loops above


In [None]:
plt.plot(temp[0])
# plt.xticks(np.arange(0,27,2))
plt.xlabel('Index of (gamma, C) combination')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
temp.sort_values(by=0,ascending=False)

temp.iloc[75,:]

The highest accuracy of .969 with kernel as rbf is with gamma = 0.01 and C = 4
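
The nested loops above can also be written with scikit-learn's `GridSearchCV`, which runs the same kind of cross-validated search and keeps track of the best combination. This is a minimal sketch on synthetic data with a much smaller grid than the one in this notebook (both are assumptions for illustration), so the parameters it picks say nothing about the voice dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; the notebook's full ranges could be used instead
param_grid = {
    'C': [1, 4, 16],
    'gamma': [0.001, 0.01, 0.1],
}

search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
print('best mean CV accuracy: %.3f' % search.best_score_)
```

`search.cv_results_` holds the mean score for every combination, which plays the role of the `temp` frame above.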

Using the default parameters we did not get a high accuracy for the model with poly kernel. But we can see if there is any change when we have different degrees.

In [None]:
degrees = [2,3,4,5,6]

acc_score = []

for i in degrees:
    svc = SVC(kernel='poly', degree=i)
    score = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(score.mean())

print('The mean accuracy for the model on 10 K fold cross validation is: {}'.format(acc_score))

In [None]:
plt.plot( degrees, acc_score)
plt.xlabel('Degree')
plt.ylabel('Cross-Validated Accuracy')

The accuracy is highest for degree = 3 and falls for higher degrees. Increasing the degree makes the model more complex, which may cause overfitting.

#### The best accuracy was with the rbf kernel model using gamma = 0.01 and C = 4

#### Checking on f1 and roc_auc score

In [None]:
svc = SVC(kernel='rbf', C=4, gamma=.01)
score = cross_val_score(svc, X, y, cv=10, scoring='f1')
score.mean()

In [None]:
svc = SVC(kernel='rbf', C=4, gamma=.01)
score = cross_val_score(svc, X, y, cv=10, scoring='roc_auc')
score.mean()
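
The two cells above refit the same model once per metric. As a possible shortcut, `cross_validate` accepts a list of scorers and evaluates them all on the same folds. The sketch below uses synthetic data (an assumption, not the voice dataset) with the same C and gamma as above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the voice data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

metrics_to_report = ['accuracy', 'f1', 'roc_auc']
results = cross_validate(
    SVC(kernel='rbf', C=4, gamma=0.01),
    X_demo, y_demo, cv=10,
    scoring=metrics_to_report,
)

# Each metric is stored under the key 'test_<metric name>'
for metric in metrics_to_report:
    print(metric, results[f'test_{metric}'].mean())
```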