<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>


## GENDER RECOGNITION BY VOICE (MACHINE LEARNING - INTERMEDIATE)

**Submitted by NARAYAN V. SHANBHAG**

**GCD Student, INSAID**

**Batch: May 9, 2021**

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
### Company Introduction - KPP Communications

##### Your client for this project is a Telecom Company - KPP Communications

- 1) They are a leading telecom company with 5 million users.
- 2) They want to keep track of the number of male and female users but as the user count increases the task becomes more tedious.
- 3) They want to automate the process of keeping track of male and female users using their voice.
- 4) Their research and development teams are trying to understand the acoustic properties of the voice and speech so that they can use it to enhance the customer experience in their new product.

### Current Scenario
   - 1) Determining a person’s gender as male or female, based upon a sample of their voice, initially seems to be an easy task.
   - 2) Often, the human ear can easily detect the difference between a male or a female voice within the first few spoken words.
   - 3) However, designing a computer program to do this turns out to be a bit trickier.
   - 4) Currently, the company is keeping track by manually entering the data for the user being male or female by listening to their voice which is a tedious task for the employees.

---
<a name = Section2></a>
# **2. Problem Statement**
---
- This section gives a general introduction to the problem, one that most companies confront.

### The company suffers from the following problems:

   - The current process is a manual classification of gender using their voice.
   - This is very tedious and time-consuming as it needs to be repeated every time a new customer joins.

The company has hired you as a data science consultant.
They want to automate the process of predicting the male or female voice using acoustic properties of the voice or speech rather than doing this manual work.

### Your Role
   - You are given a dataset consisting of recorded voice samples, collected from male and female speakers.
   - Your task is to build a classification model using the dataset.
   - Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.

### Project Deliverable
   - Deliverable: Gender prediction using voice.
   - Machine Learning Task: Classification
   - Target Variable: label(Male/Female)
   - Win Condition: N/A (best possible model)

### Evaluation Metric
   - The model evaluation will be based on the F1-Score.
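
The F1-Score is the harmonic mean of precision and recall. As a minimal sketch in plain Python (the helper name `f1_from_labels`, the toy labels and the 1 = male / 0 = female encoding are illustrative assumptions, not part of the project data), it can be computed from true and predicted labels like this:

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Return the F1-Score for the `positive` class (hypothetical helper)."""
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = male, 0 = female (illustrative encoding)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
f1_toy = f1_from_labels(y_true, y_pred)
print(round(f1_toy, 3))
```

Because F1 only rewards predictions that are both precise and complete for the positive class, it is less forgiving than plain accuracy when one kind of error dominates.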

# **3. Installing & Importing Libraries**
---
- This section installs and imports the libraries that will be required.

### **Installing Libraries**

In [None]:
!pip install -q datascience                                         
# Package that is required by pandas profiling
!pip install -q pandas-profiling                                    
# Library to generate basic statistics about data

# To install more libraries insert your code here..


### **Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** to make the libraries in sync.

- Make sure not to execute the cell under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling                          
# Upgrading pandas profiling to the latest version

### **Importing Libraries**

- You can headstart with the basic libraries as imported inside the cell below.

- If you want to import some additional libraries, feel free to do so.

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing package pandas (For Panel Data Analysis)
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing library for scientific calculations
#-------------------------------------------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

# 4. Data  Description

 - This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech.

 - The column label is also present in the dataset which has two values male and female.

 - This is the data that we have to predict for future samples.
 
#### The dataset is divided into two parts: Train and Test sets.

#### Training Set: 
 - The train set contains **2851 rows** and **22 columns** (the **peakf** feature listed in the table below is not present in the .csv file).
 - The last column **label** is the **target variable**.

#### Testing Set:
 - The test set contains **317 rows** and **21 columns**.
 - The test set **doesn’t contain** the **label** column.
 - It needs to be predicted for the test set.
 

**Train Set:**

| Records |Features|Target Variable|
|:--|:--|:--|
|**2851**|**22**|**label**|

**Test Set:**

|Records|Features|Predicted Variable|
|:--|:--|:--|
|**317**|**21**|**label**|
 
|**ID**|**Feature**|**Description**|
|:--|:--|:--|
|01| Id        | Unique identity of the customer|
|02| meanfreq  | Mean frequency (in kHz) for the voice sample|
|03| sd        | Standard deviation of the frequency|
|04| median    | Median frequency (in kHz) for the voice sample|
|05| Q25       | First quartile (in kHz)|
|06| Q75       | Third quartile (in kHz)|
|07| IQR       | Interquartile range (in kHz)|
|08| skew      | Skewness of the voice sample|
|09| kurt      | Kurtosis of the voice sample|
|10| sp.ent    | Spectral entropy|
|11| sfm       | Spectral flatness of the voice sample|
|12| mode      | Mode frequency|
|13| centroid  | Frequency centroid|
|14| peakf     | Peak frequency (the frequency with the highest energy); **not present** in the .csv file|
|15| meanfun   | Average of fundamental frequency measured across the acoustic signal|
|16| minfun    | Minimum fundamental frequency measured across the acoustic signal|
|17| maxfun    | Maximum fundamental frequency measured across the acoustic signal|
|18| meandom   | Average of dominant frequency measured across the acoustic signal|
|19| mindom    | Minimum of dominant frequency measured across the acoustic signal|
|20| maxdom    | Maximum of dominant frequency measured across the acoustic signal|
|21| dfrange   | Range of dominant frequency measured across the acoustic signal|
|22| modindx   | Modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range|
|23| label     | The label for the voice sample (male or female)|

---
<a name = Section4></a>
# **5. Data Acquisition**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or use a direct link to the source (generally preferred in most cases).

- You will be working with a direct source to get a head start on your work.

- Before going further you must have a good idea about the features of the data set:



In [None]:
#from google.colab import files
#uploaded = files.upload()
#import io
#data = pd.read_csv(io.BytesIO(uploaded['Churn_train.csv']))


In [None]:
data = pd.read_csv("D://voice_train.csv", skipinitialspace=True)    # Load the training set from a local path
print('Data Shape:', data.shape)
data.head()
# Per the data description above, the train set should have 2851 RECORDS/ROWS and 22 FEATURES/COLUMNS

 ### **Data Information**
- **In total there are 22 Columns/Features, of which 1 (Id) is NUMERICAL (INTEGER), 1 (label) is CATEGORICAL and the rest are NUMERICAL (FLOAT)**
- **Check whether the NUMERICAL Columns contain 0s and whether those zeros are meaningful**


In [None]:
# Insert your code here
data.info()


### **Data Description**

- **For a quick description of the data, use the `describe` method defined in the pandas library.**
- **It gives the Count, Mean and Standard Deviation of each numerical column along with the 5-Point (5-Number) summary: minimum, first quartile, median, third quartile and maximum.**

In [None]:
data.describe()

In [None]:
data_final = pd.read_csv("D://voice_test.csv",skipinitialspace=True)
print('Data Shape:', data_final.shape)
data_final.head()

In [None]:
data_final.info()

In [None]:
data_final.describe()

---
<a name = Section5></a>
# **6. Data Pre-Profiling**
---

- This section generates a profiling report about the data.

- You need to perform pandas profiling and get some observations out of it...

In [None]:
# Insert your code here...
profile_gender = ProfileReport(df=data)
profile_gender

**6.1 Data Pre-Profiling for TEST SET**

In [None]:
profile1_gender = ProfileReport(df=data_final)
profile1_gender

---
<a name = Section6></a>
# **7. Data Pre-Processing**
---

- This section performs data manipulation on the raw data to prepare it for further processing and analysis.

- To turn raw data into clean, structured data you need to verify and correct its integrity by:
  - **Handling/Checking Duplicate Data for both the TRAIN and TEST Data Sets**

  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos

- **There are actually NO DUPLICATE RECORDS/ROWS in the DATA-SETS**



In [None]:
# Insert your code here...
data[data.duplicated()]

In [None]:
data_final[data_final.duplicated()]

- **Now check ALL the NUMERICAL COLUMNS for MISSING (null) and ZERO values and Replace/Substitute them with appropriate values**

In [None]:
data.isnull().sum()

In [None]:
(data == 0 ).sum(axis = 0)

In [None]:
data_final.isnull().sum()

In [None]:
(data_final == 0 ).sum(axis = 0)

### **Check whether the DataSet is Balanced for the label column**

In [None]:
data.label.value_counts().plot(kind='pie',autopct='%1.1f%%', startangle=90);
data.label.value_counts()

In [None]:
figure = plt.figure(figsize = (20,10))
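# Correlate the numeric columns only; the string-valued label column cannot be correlated
HeatMap = sns.heatmap(data.select_dtypes('number').corr(),annot = True,cmap = 'coolwarm',vmin = -1, vmax = 1,linecolor = 'black',linewidths = 1)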
HeatMap.set_title('Correlation HeatMap', fontdict = {'fontsize':16})

### OBSERVATIONS
- **meanfreq** has HIGH CORRELATION with **mode, median, Q25**
- **skew** has HIGH CORRELATION with **sp.ent**
- **centroid** has HIGH CORRELATION with **median, Q25**
- **meandom** has HIGH CORRELATION with **maxdom, dfrange**
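
Reading highly correlated pairs off a heatmap is error-prone, so they can also be listed programmatically. The sketch below is one possible way to do that; the helper name `high_corr_pairs`, the 0.9 threshold and the tiny synthetic frame are assumptions for illustration and are not taken from the voice data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Tiny synthetic frame (not the voice data) to show the shape of the result
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],   # nearly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # only weakly related to a
})
demo_pairs = high_corr_pairs(demo)
print(demo_pairs)
```

Applied to the training frame, a call such as `high_corr_pairs(data)` would return a Series indexed by feature pairs, which can then be compared against the observations above.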

In [None]:
x = data[['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx']]
x

In [None]:
data

In [None]:
y = data[['label']]
y

### Classification Report

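In [None]:
# NOTE: run these cells only after Section 8 has fitted the models. y_pred_train / y_pred_test are never defined,
# so the Logistic Regression predictions from that section are used (swap in the *_rf names for Random Forest)
print(classification_report(y_train,y_pred_train_lr))

In [None]:
print(classification_report(y_test,y_pred_test_lr))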

### **5.2 Data Cleaning**

- In this section, we will **remove** columns which are **redundant** for the model.

- We will first remove the **label** column, then remove the columns with **standard deviation = 0** from the dataset.

- Standard Deviation = 0 means that **every data point in a column is equal to its mean**.

- In other words, all of the column's values are **identical**; such columns do **not help** us in **prediction**, so we will drop them.

In [None]:
data_value = data.drop('label', axis = 1)

drop_cols=[]

for cols in data_value.columns:
    if data_value[cols].std() == 0:
        drop_cols.append(cols)

print("Number of constant columns to be dropped: ", len(drop_cols))
print(drop_cols)
data_value.drop(drop_cols, axis=1, inplace=True)

data_value.head()

---
# **6. Exploratory Data Analysis**

#### **Question:** What is the distribution of the **target** feature?

In [None]:
# Plot a displot on target variable
data_sns = data
sns.displot(x='label', data=data_sns, kde=True, aspect=3)

# Add some cosmetics
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel(xlabel='Values', size=14)
plt.ylabel(ylabel='Count', size=14)
plt.title(label='Displot on target', size=16)
plt.grid(True)

# Display the plot
plt.show()

# **7. Data Post Processing**

### **7.1 Feature Scaling**

- Now we will **standardize** the columns of the dataframe using `StandardScaler`. 

In [None]:
# Importing StandardScaler function
from sklearn.preprocessing import StandardScaler

# Instantiating a standard scaler object
scaler = StandardScaler()

# Transforming our data
scaled_arr = scaler.fit_transform(x)

# Inputting our transformed data in a dataframe
scaled_frame = pd.DataFrame(data=scaled_arr, columns=x.columns)

# Getting a glimpse of transformed data
scaled_frame.head()
scaled_frame.describe()
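
`StandardScaler` rescales each column to zero mean and unit variance using z = (x - mean) / std, with the population standard deviation. The NumPy sketch below mirrors that arithmetic on toy values; the helper names `fit_standardizer` and `apply_standardizer` are hypothetical and the arrays are not project data. It also illustrates that new samples are scaled with the statistics learned earlier, not with their own.

```python
import numpy as np

def fit_standardizer(X):
    """Return per-column mean and population standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)            # ddof=0, the convention StandardScaler uses
    std[std == 0] = 1.0            # guard against constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with previously fitted statistics."""
    return (X - mean) / std

X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # data used for fitting
X_new = np.array([[2.0, 40.0]])                             # unseen sample

mean, std = fit_standardizer(X_fit)
Z_fit = apply_standardizer(X_fit, mean, std)
Z_new = apply_standardizer(X_new, mean, std)
print(Z_fit.mean(axis=0), Z_fit.std(axis=0))
print(Z_new)
```

With scikit-learn the same separation corresponds to calling `fit_transform` on the data used for fitting and `transform` on anything scored later.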

In [None]:
scaled_frame.head()

In [None]:
x = scaled_frame
x

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
y = y.apply(LabelEncoder().fit_transform)
y

### **7.2 Data Splitting**

- Now we will split our data into train set and test set.

- We will keep **80%** data in the **train** set, and **20%** data in the **test** set.

### Import `train_test_split` and split the data into train and test sets for x and y.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Splitting data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# Display the shape of training and testing data
print('x_train shape: ', x_train.shape)
print('y_train shape: ', y_train.shape)
print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)

# **8. Model Development & Evaluation**

### **8.1 Model Development & Evaluation without PCA**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
import plotly.graph_objs as go                                      # Importing plotly for interactive visualizations
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                    # Importing to scale the features in the dataset
from sklearn.model_selection import train_test_split                # To properly split the dataset into train and test sets
from sklearn.ensemble import RandomForestClassifier                 # To create a random forest classifier model
from sklearn.linear_model import LogisticRegression                 # To create a logistic regression model
from sklearn import metrics                                         # Importing to evaluate the classification models
from sklearn.decomposition import PCA                               # Importing to create an instance of PCA model
#-------------------------------------------------------------------------------------------------------------------------------
from random import randint                                          # Importing to generate random integers
#-------------------------------------------------------------------------------------------------------------------------------
import time                                                         # For time functionality
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

### Importing the evaluation metrics for Classification model - Logistic Regression


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

In [None]:
#clfs = [LogisticRegression(solver='liblinear'), RandomForestClassifier(random_state=42)]

In [None]:
logreg_model =  LogisticRegression(solver='liblinear')

In [None]:
randforest_model = RandomForestClassifier(random_state=42)

In [None]:
logreg_model.fit(x_train, y_train)

In [None]:
randforest_model.fit(x_train, y_train)

In [None]:
y_pred_train_lr = logreg_model.predict(x_train)
y_pred_train_lr

In [None]:
y_pred_test_lr = logreg_model.predict(x_test)
y_pred_test_lr

In [None]:
y_pred_train_rf = randforest_model.predict(x_train)
y_pred_train_rf

In [None]:
y_pred_test_rf = randforest_model.predict(x_test)
y_pred_test_rf

### ACCURACY SCORE and F1-SCORE of Logistic regression Train Set

In [None]:
ACCURACY_SCORE_TRAIN_LR = accuracy_score(y_train,y_pred_train_lr)
F1_SCORE_TRAIN_LR = f1_score(y_train,y_pred_train_lr)
print(ACCURACY_SCORE_TRAIN_LR)
print(F1_SCORE_TRAIN_LR)

### ACCURACY SCORE and F1-SCORE of Logistic regression Test Set

In [None]:
ACCURACY_SCORE_TEST_LR = accuracy_score(y_test,y_pred_test_lr)
F1_SCORE_TEST_LR = f1_score(y_test,y_pred_test_lr)
print(ACCURACY_SCORE_TEST_LR)
print(F1_SCORE_TEST_LR)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_RF = accuracy_score(y_train,y_pred_train_rf)
F1_SCORE_TRAIN_RF = f1_score(y_train,y_pred_train_rf)
print(ACCURACY_SCORE_TRAIN_RF)
print(F1_SCORE_TRAIN_RF)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_RF = accuracy_score(y_test,y_pred_test_rf)
F1_SCORE_TEST_RF = f1_score(y_test,y_pred_test_rf)
print(ACCURACY_SCORE_TEST_RF)
print(F1_SCORE_TEST_RF)

## Analyzing the Test File

### Separate out the Categorical and Numerical Columns in the Test DataSet

In [None]:
data_submission = data_final['Id']
data_submission

In [None]:
data_final_numerical = data_final[['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx']]
data_final_numerical

In [None]:
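# Reuse the scaler already fitted on the training features; refitting here would scale the test data differently from what the models saw
scaled_arr_final = scaler.transform(data_final_numerical)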

# Inputting our transformed data in a dataframe
data_final_model = pd.DataFrame(data=scaled_arr_final, columns=data_final_numerical.columns)

# Getting a glimpse of transformed data
data_final_model.head()
data_final_model.describe()

### Predict the label values using the earlier trained Logistic Regression Model

In [None]:
y_pred_test_final_logistic = logreg_model.predict(data_final_model)
y_pred_test_final_logistic

### Convert the array into a DataFrame

In [None]:
y_pred_test_final_logistic = pd.DataFrame(y_pred_test_final_logistic)

In [None]:
y_pred_test_final_logistic

In [None]:
y_pred_test_final_random_forest = randforest_model.predict(data_final_model)
y_pred_test_final_random_forest

In [None]:
y_pred_test_final_random_forest = pd.DataFrame(y_pred_test_final_random_forest)
y_pred_test_final_random_forest

## Prepare the submission file which should have only two columns viz. the KEY/INDEX column(Id) and TARGET column(label)

In [None]:
submission_file = pd.concat([data_submission,y_pred_test_final_random_forest], axis = 1)

In [None]:
submission_file

### Write the final data to the submission .csv file without HEADER and INDEX

In [None]:
submission_file.to_csv('D://Gender_Prediction_Submission.csv', header=False, index=False)
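
The predictions written above are the integer codes produced by `LabelEncoder`, which numbers classes in sorted order. If the submission is expected to carry the original strings instead (an assumption; the source does not say which form is required), the codes can be mapped back. The helper `decode_labels` and the sample codes below are hypothetical.

```python
# LabelEncoder numbers classes in sorted order, so with the labels
# "female" and "male" the codes are female -> 0 and male -> 1.
classes = sorted(["male", "female"])          # ['female', 'male']

def decode_labels(codes, classes):
    """Map integer codes back to their class names (hypothetical helper)."""
    return [classes[int(code)] for code in codes]

predicted_codes = [1, 0, 0, 1]                # illustrative codes, not real model output
decoded = decode_labels(predicted_codes, classes)
print(decoded)
```

Keeping a reference to the fitted encoder and calling its `inverse_transform` method would achieve the same result without relying on the sorted-order convention.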

### Thank You !!!