We have a dataset for identifying a voice as male or female, based on several acoustic properties of the voice. We are going to try support vector machines on this one.

First, let's load the needed modules.

In [1]:
from pandas import read_csv 
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score

Then let's load the dataset and check whether there are any missing values.

In [2]:
data=read_csv("../data/voice.csv")
isMissing=data.isna().any().any()
print(data.head(5))
print(isMissing)

   meanfreq        sd    median       Q25       Q75       IQR       skew  \
0  0.059781  0.064241  0.032027  0.015071  0.090193  0.075122  12.863462   
1  0.066009  0.067310  0.040229  0.019414  0.092666  0.073252  22.423285   
2  0.077316  0.083829  0.036718  0.008701  0.131908  0.123207  30.757155   
3  0.151228  0.072111  0.158011  0.096582  0.207955  0.111374   1.232831   
4  0.135120  0.079146  0.124656  0.078720  0.206045  0.127325   1.101174   

          kurt    sp.ent       sfm  ...  centroid   meanfun    minfun  \
0   274.402906  0.893369  0.491918  ...  0.059781  0.084279  0.015702   
1   634.613855  0.892193  0.513724  ...  0.066009  0.107937  0.015826   
2  1024.927705  0.846389  0.478905  ...  0.077316  0.098706  0.015656   
3     4.177296  0.963322  0.727232  ...  0.151228  0.088965  0.017798   
4     4.333713  0.971955  0.783568  ...  0.135120  0.106398  0.016931   

     maxfun   meandom    mindom    maxdom   dfrange   modindx  label  
0  0.275862  0.007812  0.007812  
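
The check in the cell above collapses the whole frame to a single boolean. When that boolean is `True`, a per-column count is usually more useful. Here is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative, not taken from `voice.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "meanfreq": [0.06, np.nan, 0.15],
    "sd": [0.064, 0.067, 0.072],
    "label": ["male", "male", "female"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
# Single flag, computed the same way as isMissing in the notebook.
any_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(any_missing)
```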

Upon examining the dataset, it appears there is no missing data. However, from the looks of it, we need to standardize the features and encode the labels (male as True/1 and female as False/0).

In [3]:
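# Features: every column except the target
x=data.drop("label",axis=1)
# Boolean target: True for male, False for female
y=(data["label"]=="male")
# Standardize each feature to zero mean and unit variance
scaler=StandardScaler()
X=scaler.fit_transform(x)
print(X,y)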

[[-4.04924806  0.4273553  -4.22490077 ... -1.43142165 -1.41913712
  -1.45477229]
 [-3.84105325  0.6116695  -3.99929342 ... -1.41810716 -1.4058184
  -1.01410294]
 [-3.46306647  1.60384791 -4.09585052 ... -1.42920257 -1.41691733
  -1.06534356]
 ...
 [-1.29877326  2.32272355 -0.05197279 ... -0.5992661  -0.58671739
   0.17588664]
 [-1.2452018   2.012196   -0.01772849 ... -0.41286326 -0.40025537
   1.14916112]
 [-0.51474626  2.14765111 -0.07087873 ... -1.27608595 -1.2637521
   1.47567886]] 0        True
1        True
2        True
3        True
4        True
        ...  
3163    False
3164    False
3165    False
3166    False
3167    False
Name: label, Length: 3168, dtype: bool
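
To see what these two steps do, here is a small sketch on synthetic numbers (the `raw` array and the three-element `labels` array are made up for illustration). After `StandardScaler`, each column should have mean close to 0 and standard deviation close to 1. If integer labels are preferred over booleans, `.astype(int)` is one way to get them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[5.0, 200.0], scale=[0.1, 50.0], size=(100, 2))

scaled = StandardScaler().fit_transform(raw)
col_means = scaled.mean(axis=0)  # approximately 0 per column
col_stds = scaled.std(axis=0)    # approximately 1 per column
print(col_means, col_stds)

# Boolean comparison, then cast to int: male -> 1, female -> 0.
labels = np.array(["male", "female", "male"])
encoded = (labels == "male").astype(int)
print(encoded)
```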


Let's split the data into training and testing sets.

In [4]:
xTrain,xTest,yTrain,yTest=train_test_split(X,y,random_state=42,test_size=0.2)
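
One caveat: the scaler above was fitted on the full dataset before the split, so statistics from the test rows influence the scaling. A common alternative is to split first and wrap the scaler and classifier in a pipeline, so that scaling is learned from the training rows only. Below is a sketch of that variant on synthetic data from `make_classification` (the `X_demo`/`y_demo` names and the `stratify` choice are assumptions for illustration, not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=42
)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```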

Now we will create several SVM models with different hyperparameters and evaluate each one with a confusion matrix and the accuracy score.

In [5]:
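# Note: C is the penalty on margin violations, not a count of allowed errors;
# a larger C tolerates fewer misclassified training points.
models=[
  (SVC(kernel="linear"),"Simple linear svm"),
  (SVC(kernel="linear",C=100),"Linear svm with 100 acceptable errors"),
  (SVC(kernel="linear",C=1000),"Linear svm with 1000 acceptable errors"),
  (SVC(kernel="poly"),"Polynomial svm")
]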

Now let's train and test all of these models.

In [6]:
for (model,description) in models:
  model.fit(xTrain,yTrain)
  pred=model.predict(xTest)
  print(description)
  accuracyScore=accuracy_score(y_pred=pred,y_true=yTest)
  confusionMatrix=confusion_matrix(y_pred=pred,y_true=yTest)
  print("Accuracy score is : "+str(accuracyScore))
  print("Confusion matrix is : "+str(confusionMatrix))


Simple linear svm
Accuracy score is : 0.9763406940063092
Confusion matrix is : [[293   4]
 [ 11 326]]
Linear svm with 100 acceptable errors
Accuracy score is : 0.9763406940063092
Confusion matrix is : [[293   4]
 [ 11 326]]
Linear svm with 1000 acceptable errors
Accuracy score is : 0.9763406940063092
Confusion matrix is : [[293   4]
 [ 11 326]]
Polynomial svm
Accuracy score is : 0.9684542586750788
Confusion matrix is : [[282  15]
 [  5 332]]
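
To read these matrices: scikit-learn puts true labels on the rows and predictions on the columns, in sorted label order, so here row 0 is female (`False`) and row 1 is male (`True`). The short sketch below recomputes accuracy, and also precision and recall for the male class, from the linear model's matrix. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported for the linear SVM above.
cm = np.array([[293, 4],
               [11, 326]])

# Row-major order for a binary problem: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all test samples
precision_male = tp / (tp + fp)   # of voices predicted male, share truly male
recall_male = tp / (tp + fn)      # of truly male voices, share predicted male

print(accuracy, precision_male, recall_male)
```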


Based on the results obtained, we can draw the following inferences:

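1. A linear decision boundary appears to suit this dataset: the linear kernel misclassifies 15 of the 634 test samples (about 97.6% accuracy), while the default degree-3 polynomial kernel misclassifies 20 (about 96.8%). The gap is small and comes from a single train/test split, so it suggests, rather than proves, that the classes are close to linearly separable.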

2. The observations for the **male** and **female** classes appear to be well separated and do not have many outliers near the boundary. This deduction arises from the identical accuracy and confusion matrix obtained for C=1 (the default), C=100, and C=1000: penalizing margin violations more heavily did not change the test results.
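
Both inferences rest on one train/test split. A way to check how stable they are is k-fold cross-validation, which reports a mean and spread of accuracy for each configuration. The sketch below runs on synthetic data from `make_classification`, not on `voice.csv`, so its numbers say nothing about the voice dataset. The configuration names and the 5-fold choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

configs = {
    "linear, C=1": SVC(kernel="linear", C=1),
    "linear, C=100": SVC(kernel="linear", C=100),
    "poly (degree 3)": SVC(kernel="poly"),
}

results = {}
for name, svc in configs.items():
    # Scaling sits inside the pipeline, so each fold is scaled using
    # statistics from its own training part only.
    model = make_pipeline(StandardScaler(), svc)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the differences between configurations are smaller than the fold-to-fold spread, the ranking of kernels or C values should be treated as inconclusive.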