<h1> Goal of the Kernel </h1>

In this notebook, we will apply Logistic Regression to predict gender (male/female) from voice data.

<h1> Import libraries </h1>


In [None]:
#Loading libraries 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from collections import Counter
import time
import datetime as dt
from datetime import datetime
import collections
import os # accessing directory structure
from matplotlib.pyplot import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 10,8
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
# Input data files are available in the "../input/" directory.

<h1> Reading Data </h1>


The dataset for this project is taken from Kaggle: https://www.kaggle.com/primaryobjects/voicegender .

In [None]:
#loading data
data = pd.read_csv('../input/voice.csv')
data.head()

<h1> Data Exploration </h1>

In [None]:
data.shape 

In [None]:
data.info()

In [None]:
data.dtypes

<h1> Visualizing the correlation among the features</h1>


In [None]:
f,ax = plt.subplots(figsize=(18, 18))
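# 'label' holds strings at this point, so leave it out of the correlation matrix
sns.heatmap(data.drop(columns='label').corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)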
plt.show()

In [None]:
data.drop(columns='label').corr()

In [None]:
data.label.value_counts()

In [None]:
sns.countplot(x=data['label'])

In [None]:
data['meanfreq'].value_counts(dropna=False)

In [None]:
data['sd'].value_counts(dropna=False)

In [None]:
data['median'].value_counts(dropna=False)

In [None]:
data['Q25'].value_counts(dropna=False)

In [None]:
data['Q75'].value_counts(dropna=False)

In [None]:
data['IQR'].value_counts(dropna=False)

In [None]:
data['skew'].value_counts(dropna=False)

In [None]:
data['kurt'].value_counts(dropna=False)

In [None]:
data['sp.ent'].value_counts(dropna=False)

In [None]:
data['sfm'].value_counts(dropna=False)

In [None]:
data['mode'].value_counts(dropna=False)

In [None]:
data['centroid'].value_counts(dropna=False)

In [None]:
data['meanfun'].value_counts(dropna=False)

In [None]:
data['minfun'].value_counts(dropna=False)

In [None]:
data['maxfun'].value_counts(dropna=False)

In [None]:
data['meandom'].value_counts(dropna=False)

In [None]:
data['mindom'].value_counts(dropna=False)

In [None]:
data['maxdom'].value_counts(dropna=False)

In [None]:
data['dfrange'].value_counts(dropna=False)

In [None]:
data['modindx'].value_counts(dropna=False)

In [None]:
data['label'].value_counts(dropna=False)

<h4> IS THERE ANY MISSING FEATURE VALUE?</h4>

In [None]:
data.isnull().sum()

In [None]:
# number of rows that exactly repeat an earlier row
data.duplicated().sum()
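As a small illustration of what these two checks report, here is a sketch on a made-up frame (the values are invented for the example and are not taken from voice.csv; only the column names are borrowed):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one repeated row (made-up numbers).
toy = pd.DataFrame({
    "meanfreq": [0.1, 0.2, np.nan, 0.2],
    "sd": [0.05, 0.06, 0.07, 0.06],
})

missing_per_column = toy.isnull().sum()        # NaN count for each column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row
print(missing_per_column.to_dict(), duplicate_rows)
```

Summing the boolean result gives one number per check, which is easier to read than the full True/False column.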

<h1> Visualizations </h1>

In [None]:
sns.lmplot( x="sfm", y="meanfreq", data=data, fit_reg=False, hue='label', legend=False)
plt.show()

In [None]:
sns.lmplot( x="meanfun", y="meanfreq", data=data, fit_reg=False, hue='label', legend=False)
plt.show()

In [None]:
# Let's check the distribution of meanfreq (men vs women)
plt.figure()
# fill=True replaces the deprecated shade=True (seaborn 0.11 or newer)
sns.kdeplot(data['meanfreq'][data['label']=='male'], fill=True)
sns.kdeplot(data['meanfreq'][data['label']=='female'], fill=True)
plt.xlabel('meanfreq value')
plt.show()

In [None]:
plt.figure(figsize=(8,7))
sns.boxplot(x="label", y="dfrange", data=data)
plt.show()

There are many outliers above the top whisker for males. That means quite a few male speakers sound female-like, since they have high frequency ranges.
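To put a number on "many outliers", one option is to count the points above the top whisker. The sketch below assumes the usual boxplot rule (whisker limit at Q3 + 1.5 * IQR, which is seaborn's default); the helper name `count_upper_outliers` and the sample values are made up for illustration:

```python
import numpy as np

def count_upper_outliers(values):
    """Count points above Q3 + 1.5 * IQR, the usual top-whisker limit."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    upper_limit = q3 + 1.5 * (q3 - q1)
    return int((values > upper_limit).sum())

# Made-up numbers, not taken from voice.csv.
toy_dfrange = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
n_outliers = count_upper_outliers(toy_dfrange)
print(n_outliers)
```

On the real data, the same helper could be applied to `dfrange` for each label separately.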

<h1> Discretization </h1>

<h4> TRANSLATE MALE-FEMALE OPTIONS TO COMPUTER'S LANGUAGE </h4>

To train, test, and predict, we have to convert the binary male/female label into 0s and 1s. I choose male as '1' and female as '0'.

In [None]:
data.label=[  1 if i=="male" else 0 for i in data.label]

In [None]:
data.head(5)

Now label is a binary output, and our data is ready for Logistic Regression.
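An alternative way to write the same encoding is an explicit mapping. This is only a sketch on a toy Series (the `labels` values are made up); the possible advantage is that any unexpected value turns into NaN instead of silently becoming 0:

```python
import pandas as pd

# Toy labels standing in for the `label` column.
labels = pd.Series(["male", "female", "female", "male"])

# Explicit mapping with the same convention: male -> 1, female -> 0.
mapping = {"male": 1, "female": 0}
encoded = labels.map(mapping)

# Values missing from the mapping would show up as NaN here.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```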

In [None]:
data.info()

<h1> Determine Values </h1>

First of all, we will determine x and y values for Logistic Regression.

In [None]:
y=data.label.values
x_data=data.drop(["label"],axis=1)

In [None]:
y

In [None]:
x_data.head()

<h1> Normalization </h1>

To get an appropriate model we need to normalize the values in x_data.

In [None]:
# min-max normalization, column by column: (a - min(a)) / (max(a) - min(a))

x = (x_data - x_data.min()) / (x_data.max() - x_data.min())

In [None]:
x.head()
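The formula (a-min(a))/(max(a)-min(a)) is meant to be applied to each column separately. A minimal sketch on a made-up two-column frame (the values are invented, only the column names come from the dataset):

```python
import pandas as pd

# Made-up values; two columns on very different scales.
toy = pd.DataFrame({"meanfreq": [0.0, 0.1, 0.2], "kurt": [2.0, 6.0, 10.0]})

# DataFrame.min() and DataFrame.max() work per column by default.
col_min = toy.min()
col_max = toy.max()
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, every column runs from 0 to 1, so no feature dominates just because of its units.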

<h1> TRAIN TEST SPLIT </h1>

We want to train our model with Logistic Regression. After fitting it, we need separate data to test it, so we will use train_test_split to check the accuracy of our model. 20% of the data will be used for testing, and the rest for training.

In [None]:
# create x_train, y_train, x_test, y_test arrays
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

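# Transpose so that features are rows and samples are columns.
# Note: scikit-learn expects samples as rows, so the model below re-splits the untransposed data.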

X_train=X_train.T
X_test=X_test.T
y_train=y_train.T
y_test=y_test.T

print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)
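Here is a small sketch of the same 80/20 split on synthetic data (the array sizes and labels are assumptions made for the example, not the voice data). It keeps samples as rows, which is the layout scikit-learn estimators expect, and it also passes `stratify`, which the cell above does not use, to keep the class ratio equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 20 features, balanced 0/1 labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 20))
y_toy = np.array([0, 1] * 50)

# 80/20 split; stratify keeps the 50/50 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```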

<h1> Logistic Regression</h1>

In [None]:
# Let's use Logistic Regression:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Producing X and y
X = np.array(data.drop(columns=['label']))
y = np.array(data['label'])

# Dividing the data randomly into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

# Now we'll fit the model on the training data:
model=LogisticRegression()
model.fit(X_train,y_train)

print('Train accuracy :',model.score(X_train,y_train))
print('Test accuracy :',model.score(X_test,y_test))

In [None]:
model.predict(X_test)
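The cell above fits the model on the raw feature values. As a possible variation, the sketch below chains min-max scaling and Logistic Regression in one pipeline, so the scaling is learned from the training part only. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and `max_iter=1000` is an illustrative choice, not a tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary data standing in for the voice features.
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# The scaler is fitted on the training part, then reused on the test part.
clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

y_hat = clf.predict(X_te)
test_acc = accuracy_score(y_te, y_hat)
# Rows are the true class, columns the predicted class.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
print(test_acc)
print(cm)
```

The confusion matrix shows which class the mistakes fall on, something a single accuracy number hides.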

<h1> Support Vector Machine </h1>

In [None]:
from sklearn import svm
from sklearn.metrics import roc_auc_score
svc = svm.SVC(kernel='linear', C=1,gamma='auto').fit(X_train, y_train)
y_pred=svc.predict(X_test)
# roc_auc_score returns the area under the ROC curve, not the accuracy
roc_auc=roc_auc_score(y_test,y_pred)
print('ROC AUC :',roc_auc)
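Note that `roc_auc_score` and accuracy are different metrics. The sketch below, on synthetic data (an assumption, so the numbers will not match the voice data), computes both for a linear SVM. It also shows ROC AUC computed from `decision_function` margins, which is the more common input for that metric than hard 0/1 predictions:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the voice features.
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

svc_toy = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
y_hat = svc_toy.predict(X_te)

acc = accuracy_score(y_te, y_hat)            # share of correct predictions
auc_labels = roc_auc_score(y_te, y_hat)      # AUC from hard 0/1 predictions
auc_scores = roc_auc_score(y_te, svc_toy.decision_function(X_te))  # AUC from margins
print(acc, auc_labels, auc_scores)
```

With hard predictions, the ROC AUC reduces to the average of the two per-class recalls, so it only matches accuracy when the classes are balanced in the test set.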

Finally, the best score I obtained is 0.9170327426311647, which I think is a good result, because I used nothing more complicated than linear models.