<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

**<center><h3>Support Vector Machines Assignment Problem</h3></center>**

---
# **Table of Contents**
---

**1.** [**Problem Statement**](#Section1)<br>
**2.** [**Objective**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Pre-Profiling Report**](#Section51)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Post Data Processing & Feature Selection**](#Section7)<br>
  - **7.1** [**Data Encoding**](#Section71)<br>
  - **7.2** [**Data Standardization**](#Section72)<br>
  - **7.3** [**Data Preparation**](#Section73)<br>

**8.** [**Model Development & Evaluation**](#Section8)<br>
  - **8.1** [**Support Vector Machine Classifier - RBF Kernel**](#Section81)<br>
  - **8.2** [**Support Vector Machine Classifier - Sigmoid Kernel**](#Section82)<br>
  - **8.3** [**Support Vector Machine Classifier - Polynomial Kernel**](#Section83)<br>

**9.** [**Conclusion**](#Section9)<br>



---
<a name = Section1></a>
# **1. Problem Statement**
---

- The **database** was created to **identify** a voice as **male or female**, based upon acoustic properties of the **voice and speech**.

- The dataset consists of **3,168 recorded voice samples**, collected from male and female speakers. 

- The voice samples are **pre-processed by acoustic analysis** in R using the seewave and tuneR packages.

- Additionally, the samples were analyzed over a **frequency range** of **0 Hz-280 Hz** (human vocal range).

---
<a name = Section2></a>
# **2. Objective**
---

- The objective of this assignment is to **predict** the **gender** based on the voice features.

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                   # Package that is required by pandas profiling
!pip install -q pandas-profiling              # Library to generate basic statistics about data

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Make sure not to execute the install cell (3.1) or the upgrade cell (3.2) again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
from sklearn.preprocessing import StandardScaler                    # To scale the data with mean = 0 and std = 1
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib                                              
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import precision_recall_curve                  # For precision and recall metric estimation
from sklearn.metrics import classification_report                   # To generate complete report of evaluation metrics
from sklearn.metrics import plot_confusion_matrix                   # To plot confusion matrix (removed in scikit-learn 1.2; needs an older version)
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To split the data in training and testing part     
from sklearn.svm import SVC                                         # To create model for support vector classifier
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- This corpus was collected from free-for-research sources on the Internet and can be retrieved from the attached <a href = "https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/voice.csv">**link**</a>.

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 3168 | 21 | 1.01 MB| 

|Id|Feature|Description|
|:--|:--|:--|
|01|**meanfreq**|Mean frequency of voice (in kHz).|
|02|**sd**|Standard deviation of frequency of voice.|
|03|**median**|Median frequency of voice (in kHz).|
|04|**Q25**|First quartile value (in kHz).|
|05|**Q75**|Third quartile value (in kHz).|
|06|**IQR**|Interquartile range (in kHz).|
|07|**skew**|Skewness (see note in specprop description).|
|08|**kurt**|Kurtosis (see note in specprop description).|
|09|**sp.ent**|Spectral entropy of voice.|
|10|**sfm**|Spectral flatness of voice.|
|11|**mode**|Mode frequency of voice.|
|12|**centroid**|Frequency centroid (see specprop).|
|13|**meanfun**|Average of fundamental frequency measured across acoustic signal.|
|14|**minfun**|Minimum fundamental frequency measured across acoustic signal.|
|15|**maxfun**|Maximum fundamental frequency measured across acoustic signal.|
|16|**meandom**|Average of dominant frequency measured across acoustic signal.|
|17|**mindom**|Minimum of dominant frequency measured across acoustic signal.|
|18|**maxdom**|Maximum of dominant frequency measured across acoustic signal.|
|19|**dfrange**|Range of dominant frequency measured across acoustic signal.|
|20|**modindx**|Modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range.|
|21|**label**|Target containing male or female.|

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/voice.csv')
print('Data Shape:', data.shape)
data.head()

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get a **description** of the numerical features.

In [None]:
data.describe()

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will get **information about the data** and see some observations.

In [None]:
data.info()

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Pre Profiling Report**

- For **quick analysis** pandas profiling is very handy.

- Generates profile reports from a pandas DataFrame.

- For each column **statistics** are presented in an interactive HTML report.

In [None]:
# profile = ProfileReport(df = data)
# profile.to_file(output_file = 'Pre Profiling Report.html')
# print('Accomplished!')

In [None]:
# from google.colab import files                   # Use only if you are using Google Colab, otherwise remove it
# files.download('Pre Profiling Report.html')      # Use only if you are using Google Colab, otherwise remove it

---
**<h4>Question 1:** Create a function which takes data as input and removes duplicates.</h4>

---

- Additionally, you need to show the amount of data dropped during the operation.

**Performing Operations**

In [None]:
def handle_duplicates(data):

  # Enter your code here...

In [None]:
handle_duplicates(data)
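
One possible shape for such a function is sketched below on a toy frame with made-up values. The name `handle_duplicates_sketch` and the choice to return a new frame (rather than modify the input in place) are assumptions for illustration, not the expected solution.

```python
import pandas as pd

def handle_duplicates_sketch(frame):
    """Drop fully duplicated rows and report how many were removed."""
    rows_before = frame.shape[0]
    # Keep the first occurrence of each duplicated row and rebuild the index
    deduped = frame.drop_duplicates(keep="first").reset_index(drop=True)
    rows_dropped = rows_before - deduped.shape[0]
    share = 100 * rows_dropped / rows_before if rows_before else 0.0
    print(f"Dropped {rows_dropped} of {rows_before} rows ({share:.2f}%)")
    return deduped

# Toy frame (made-up values) in which the first and third rows are identical
toy = pd.DataFrame({"meanfreq": [0.10, 0.20, 0.10],
                    "label": ["male", "female", "male"]})
toy_clean = handle_duplicates_sketch(toy)
```

Because this sketch returns a new frame, the result would have to be assigned back (for example `data = handle_duplicates_sketch(data)`) for later cells to see the de-duplicated data.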

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---

---
**<h4>Question 2:** Show the frequency and proportion of male and female using "label" feature.</h4>

---

In [None]:
def plotLabel():
  
  # Enter your code here...

In [None]:
plotLabel()
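
As a rough illustration, the frequency and proportion of each class can be computed with `value_counts()`. The helper name `label_summary` and the toy labels are hypothetical; the real function would read the "label" column of `data`.

```python
import pandas as pd

def label_summary(labels):
    """Return the count and proportion of each category in a label column."""
    counts = labels.value_counts()                     # absolute frequency
    proportions = labels.value_counts(normalize=True)  # relative frequency
    return pd.DataFrame({"count": counts, "proportion": proportions})

# Toy labels (made-up) standing in for data['label']
toy_labels = pd.Series(["male", "female", "male", "female"], name="label")
summary = label_summary(toy_labels)
print(summary)
```

To show the result graphically, `summary["count"].plot(kind="bar")` or `summary["proportion"].plot(kind="pie", autopct="%.1f%%")` would draw the same numbers.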

<a name = Section7></a>

---
# **7. Post Data Processing**
---

- Now we will **transform** our **data** into a **compatible format** that the machine can understand.

- We can observe that the **label feature** contains categories which we need to **convert** into **numeric values**.

<a name = Section71></a>
## **7.1 Data Encoding**

---
**<h4>Question 3:** Create a function which converts the label feature categories to numeric.</h4>

---

- More specifically, female should be encoded as 0 and male as 1.

In [None]:
def dataEncoding():
  
  # Enter your code here...

In [None]:
# Calling the User Defined Function
dataEncoding()

print('Encoding Success!')
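
A minimal sketch of the mapping described above, assuming an explicit dictionary is acceptable. `LABEL_CODES` and `encode_label` are illustrative names, and unlike the argument-free `dataEncoding()` stub this version takes the frame as a parameter.

```python
import pandas as pd

# Mapping stated in the question: female -> 0, male -> 1
LABEL_CODES = {"female": 0, "male": 1}

def encode_label(frame, column="label"):
    """Return a copy of the frame with the label column mapped to 0/1."""
    encoded = frame.copy()
    encoded[column] = encoded[column].map(LABEL_CODES)
    # Any category missing from the mapping becomes NaN, so fail loudly
    if encoded[column].isna().any():
        raise ValueError(f"Unexpected category found in column '{column}'")
    encoded[column] = encoded[column].astype(int)
    return encoded

# Toy frame (made-up values)
toy = pd.DataFrame({"meanfreq": [0.10, 0.20, 0.30],
                    "label": ["male", "female", "male"]})
encoded_toy = encode_label(toy)
```

An explicit dictionary keeps the 0/1 assignment visible in the code instead of depending on the alphabetical ordering an automatic encoder would use.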

<a name = Section72></a>
## **7.2 Data Standardization**


---
**<h4>Question 4:** Create a function which scales the input features with mean = 0 and standard deviation = 1.</h4>

---

- You need to create an object of StandardScaler() and apply it to all features except the "label" feature.

In [None]:
def dataScaling(data, input_drop = None):

  # Enter your code here...

  return data_scaled

In [None]:
# Calling the User Defined Function
data_scaled = dataScaling(data = data, input_drop = 'label')

print('Scaling Success!')
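
The scaling step could look roughly like the sketch below. The function name `scale_features` and the toy values are assumptions; only `StandardScaler` and the `input_drop` idea come from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_features(frame, input_drop=None):
    """Standardize every column except the one named in input_drop."""
    features = frame.drop(columns=input_drop) if input_drop else frame
    scaler = StandardScaler()               # rescales to mean 0, std 1
    scaled = scaler.fit_transform(features)
    # fit_transform returns a NumPy array, so restore column names and index
    return pd.DataFrame(scaled, columns=features.columns, index=features.index)

# Toy frame (made-up values) with an already encoded label
toy = pd.DataFrame({"meanfreq": [0.1, 0.2, 0.3],
                    "sd": [0.05, 0.06, 0.07],
                    "label": [1, 0, 1]})
toy_scaled = scale_features(toy, input_drop="label")
```

Note that `StandardScaler` uses the population standard deviation, so `toy_scaled.std(ddof=0)` is 1 while the default pandas `std()` is slightly larger. A stricter workflow would fit the scaler on the training split only, to avoid leaking test statistics into training.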

<a name = Section73></a>
## **7.3 Data Preparation**

- Now we will **split** our **data** into **training** and **testing** parts for further development.

---
**<h4>Question 5:** Create a function which splits the data into training and testing parts.</h4>

---

- **Split** the data in an **80:20** ratio, using the **stratify** parameter of train_test_split.

- Make sure to set the **random_state = 42**.

In [None]:
def dataPrep(input = None, target = None, test_size = None):

  # Enter your code here...

  return X_train, X_test, y_train, y_test

In [None]:
# Calling the User Defined Function
X_train, X_test, y_train, y_test = dataPrep(input = data_scaled, target = data['label'], test_size = 0.20)
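
To illustrate what a stratified 80:20 split does, the sketch below runs `train_test_split` on synthetic data. The name `split_data`, the random features, and the balanced toy target are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(features, target, test_size=0.20):
    """Stratified split with a fixed seed, as the question requests."""
    return train_test_split(features, target,
                            test_size=test_size,
                            stratify=target,    # keep class ratio in both parts
                            random_state=42)

# Synthetic stand-in for the scaled features and encoded label
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy_y = pd.Series([0] * 50 + [1] * 50, name="label")

X_tr, X_te, y_tr, y_te = split_data(toy_X, toy_y)
print(X_tr.shape, X_te.shape, y_te.value_counts().to_dict())
```

With `stratify`, the balanced toy target stays balanced in both parts, which is the property the question asks for.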

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will **develop a Support Vector Classifier** and **tune** our **model if required**.

- Then we will **analyze the results** obtained and **make our observations**.

- For **evaluation purposes** we will **focus** on **precision and recall values**.

- **Remember** that **we want generalized results**, i.e. similar results or errors on the testing data as on the training data.

---
**<h4>Question 6:** Create a function that plots precision and recall curve.</h4>

---

- You can use precision_recall_curve() and estimate the average precision and recall over the returned values.

- You need to plot a line plot over the precision and recall values returned by precision_recall_curve().

In [None]:
def plot_precision_recall(y_true, y_pred, train_or_test):

  # Enter your code here...
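
The two hints above can be sketched as follows, assuming the scores passed in are predicted probabilities of the positive class. The name `pr_curve_sketch`, the optional `ax` parameter, and the four toy points are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def pr_curve_sketch(y_true, y_score, title, ax=None):
    """Draw precision against recall and report their plain averages."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 4))
    ax.plot(recall, precision)
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    # Simple means over the returned arrays, as the hint suggests
    ax.set_title(f"{title} (mean P = {precision.mean():.2f}, "
                 f"mean R = {recall.mean():.2f})")
    return precision, recall

# Toy labels and scores (made-up)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall = pr_curve_sketch(y_true, y_score, "Toy data")
```

Accepting an `ax` argument lets the same helper draw into one panel of a larger figure, which is convenient when training and testing curves are shown side by side.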

---
**<h4>Question 7:** Create a function that plots confusion matrix over training and testing data.</h4>

---

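- You can use plot_confusion_matrix with the subplots of the figure to show a side-by-side comparison (the function was removed in scikit-learn 1.2; on newer versions ConfusionMatrixDisplay.from_estimator serves the same purpose).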

In [None]:
def plotConfusion(model = None):

  # Enter your code here...

---
**<h4>Question 8:** Create a function that plots classification report over training and testing data.</h4>

---

- You can use classification_report() to show side by side comparison over training and testing data.

In [None]:
def plotClassificationReport(model = None):

  # Enter your code here...
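
One way to place the two reports next to each other is to request them as dictionaries and join them in a DataFrame. The sketch below assumes predictions are already available; `report_side_by_side` and the toy labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import classification_report

def report_side_by_side(y_train, pred_train, y_test, pred_test):
    """Join training and testing classification reports column-wise."""
    train = pd.DataFrame(
        classification_report(y_train, pred_train, output_dict=True)).T
    test = pd.DataFrame(
        classification_report(y_test, pred_test, output_dict=True)).T
    # The dict keys become the outer level of the column index
    return pd.concat({"train": train, "test": test}, axis=1)

# Toy labels and predictions (made-up): perfect on train, one miss on test
report = report_side_by_side([0, 0, 1, 1], [0, 0, 1, 1],
                             [0, 1, 1, 0], [0, 1, 0, 0])
print(report.round(2))
```

In the resulting table the `accuracy` row repeats the same value under every metric column, a side effect of how `output_dict=True` stores accuracy as a single number.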

---
**<h4>Question 9:** Create a function that plots Precision and Recall Curve over training and testing data.</h4>

---

- **Note:** This function is entirely different from Question 6.
- We are using plot_precision_recall() made in Question 6 over here.
- You can plot the precision and recall over training and testing data side by side with the help of subplots.

In [None]:
def plotPRCurve(model = None):

  # Enter your code here...

<a name = Section81></a>
## **8.1 Support Vector Machine Classifier - RBF Kernel**

---
**<h4>Question 10:** Create a function that fits the SVC() over training data and returns the model.</h4>

---

- Use the probability = True and kernel = 'rbf'.

In [None]:
def modelFit():
  
  # Enter your code here...

  return svc
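
For orientation, fitting an `SVC` with the settings named above might look like this sketch. The function `fit_svc`, its explicit arguments, the `random_state`, and the synthetic data are assumptions; the `modelFit()` stub instead takes no arguments and would read the global training split.

```python
import numpy as np
from sklearn.svm import SVC

def fit_svc(X_train, y_train, kernel="rbf"):
    """Fit a support vector classifier that can also output probabilities."""
    # probability=True enables predict_proba at the cost of extra fitting time
    svc = SVC(kernel=kernel, probability=True, random_state=42)
    svc.fit(X_train, y_train)
    return svc

# Synthetic two-feature problem (made-up data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(80, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_svc = fit_svc(X_toy, y_toy)
toy_proba = toy_svc.predict_proba(X_toy)   # one column per class
print(toy_proba[:3].round(3))
```

Taking the kernel as a parameter means the same helper could be reused with `kernel="sigmoid"` or `kernel="poly"` instead of redefining the function for each kernel.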

- Call the modelFit() method created earlier and plot the confusion matrix using plotConfusion().

In [None]:
# Calling fit over default setting
model = modelFit()

# Calling User Defined Function to plot confusion matrix
plotConfusion(model = model)

- Call the plotClassificationReport() method created earlier over the model.

In [None]:
plotClassificationReport(model = model)

- Call the plotPRCurve() method created earlier over the model.

In [None]:
plotPRCurve(model = model)

<a name = Section82></a>
## **8.2 Support Vector Machine Classifier - Sigmoid Kernel**

---
**<h4>Question 11:** Create a function that fits the SVC() over training data and returns the model.</h4>

---

- Use the probability = True and kernel = 'sigmoid'.

In [None]:
def modelFit():
  
  # Enter your code here...

  return svc

- Call the modelFit() method created earlier and plot the confusion matrix using plotConfusion().

In [None]:
# Calling fit over default setting
model = modelFit()

# Calling User Defined Function to plot confusion matrix
plotConfusion(model = model)

- Call the plotClassificationReport() method created earlier over the model.

In [None]:
plotClassificationReport(model = model)

- Call the plotPRCurve() method created earlier over the model.

In [None]:
plotPRCurve(model = model)

<a name = Section83></a>
## **8.3 Support Vector Machine Classifier - Polynomial Kernel**

---
**<h4>Question 12:** Create a function that fits the SVC() over training data and returns the model.</h4>

---

- Use the probability = True and kernel = 'poly'.

In [None]:
def modelFit():
  
  # Enter your code here...

  return svc

- Call the modelFit() method created earlier and plot the confusion matrix using plotConfusion().

In [None]:
# Calling fit over default setting
model = modelFit()

# Calling User Defined Function to plot confusion matrix
plotConfusion(model = model)

- Call the plotClassificationReport() method created earlier over the model.

In [None]:
plotClassificationReport(model = model)

- Call the plotPRCurve() method created earlier over the model.

In [None]:
plotPRCurve(model = model)

<a name = Section9></a>

---
# **9. Conclusion**
---