<a href="https://colab.research.google.com/github/Pragna235/Capstone_Python_Projects/blob/main/Gender_Voices_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 5: Gender Voices Classification

### Overview

In this project, you are given a dataset containing statistical information about the audio frequencies of male and female voices. Using this information, you have to classify each voice as male or female with the `RandomForestClassifier` algorithm.

The data description is provided below for the dataset.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender_voice_dataset_attributes.png' width=600>

*Dataset credits: https://www.mldata.io/dataset-details/gender_voice/*

Please note that

1. In the `label` column, `0` denotes the male class and `1` denotes the female class.

2. You are not required to know the meaning of all the other attributes in the dataset. You just need to deploy a prediction model with these attributes as the feature variables and the `label` column as the target variable. 

You can get the datasets from the following links:

1. Train dataset 
  
  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-train.csv

2. Test dataset 

  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-test.csv

---

### Project Requirements

1. Create a pandas DataFrame for the train and test datasets.

2. Display the first five rows of both the training and test DataFrames.

3. Display the last five rows of both the training and test DataFrames.

4. Find the number of rows and columns in the train and test DataFrames.

5. Check for the missing values in the train and test DataFrames.

6. Count the number of `male` and `female` classes in the train DataFrame. 

7. Separate the feature variables, i.e., `x_train` and `x_test` from both the DataFrames.

8. Separate the target variable, i.e., `y_train` and `y_test` from both the DataFrames.

9. Apply the `RandomForestClassifier` machine learning model to predict the `male` and `female` classes in the test DataFrame, i.e., `x_test`.

10. Print the confusion matrix and the classification report to evaluate your prediction model. Also, based on the confusion matrix, precision, recall and f1-score values, report whether the prediction model deployed by you is making accurate predictions or not.

---

#### Getting Started

Follow the steps described below to solve the project:

1. Click on the link provided below to open the Colab file for this project.
   
   https://colab.research.google.com/drive/1eP0L1-vjjYnZSHtiEQ1EL0UFQ3NiyA65?usp=sharing

2. Create a duplicate copy of the Colab file. Here are the steps to create the duplicate copy:

    - Click on the **File** menu. A new drop-down list will appear.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/0_file_menu.png' width=500>

    - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab in your web browser.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/1_create_colab_duplicate_copy.png' width=500>

    - After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_CapstoneProject5** format.

3. Now, write your code in the prescribed code cells.

---

#### 1. Create The DataFrames

Create the Pandas DataFrames for both the training and test datasets.

In [None]:
# Create the DataFrames for both the train and test datasets and store them in the 'train_df' and 'test_df' variables respectively.
import pandas as pd
train_df=pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-train.csv")
print(train_df)
test_df=pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-test.csv")
test_df

      label  meanfreq        sd  ...     maxdom    dfrange   modindx
0         1  0.235668  0.035568  ...   9.140625   9.117188  0.087997
1         0  0.190578  0.059717  ...   4.757812   4.734375  0.194908
2         0  0.202016  0.062551  ...   8.460938   8.437500  0.085771
3         1  0.236232  0.041141  ...  10.617188  10.593750  0.103917
4         0  0.189003  0.061455  ...   5.953125   5.929688  0.119684
...     ...       ...       ...  ...        ...        ...       ...
2117      1  0.166738  0.052677  ...   0.214844   0.053711  0.136364
2118      0  0.198718  0.058959  ...   4.968750   4.945312  0.155766
2119      0  0.202333  0.063001  ...   7.429688   7.406250  0.093438
2120      0  0.181799  0.058102  ...   4.359375   4.335938  0.062312
2121      0  0.167732  0.066225  ...   6.023438   6.000000  0.123384

[2122 rows x 21 columns]


Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,1,0.186833,0.027472,0.184325,0.173955,0.204731,0.030777,2.655225,10.565846,0.821812,0.159883,0.180645,0.186833,0.172790,0.023495,0.271186,0.178571,0.007812,0.226562,0.218750,0.108929
1,0,0.188879,0.060316,0.195537,0.138072,0.242975,0.104904,1.497393,5.037085,0.909425,0.374225,0.140386,0.188879,0.133092,0.050847,0.272727,0.855938,0.023438,8.718750,8.695312,0.098712
2,0,0.150705,0.087127,0.174299,0.069666,0.226082,0.156416,2.603951,22.328899,0.969287,0.781729,0.050181,0.150705,0.109992,0.017260,0.266667,1.240954,0.007812,5.562500,5.554688,0.332396
3,1,0.183667,0.040607,0.182534,0.156480,0.207646,0.051166,2.054138,7.483019,0.898138,0.313925,0.177040,0.183667,0.149237,0.018648,0.262295,0.550312,0.007812,3.421875,3.414062,0.166503
4,1,0.205159,0.039543,0.210805,0.186667,0.228908,0.042241,2.099683,7.562209,0.876002,0.271880,0.224885,0.205159,0.154736,0.047105,0.277457,1.578835,0.187500,10.804688,10.617188,0.113644
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1041,0,0.143504,0.073334,0.150426,0.086383,0.197713,0.111330,3.586009,28.346615,0.961517,0.734525,0.059947,0.143504,0.090472,0.017448,0.235294,0.462015,0.007812,2.335938,2.328125,0.203020
1042,0,0.151532,0.058071,0.151257,0.098246,0.195877,0.097632,3.523045,18.652713,0.893303,0.407681,0.094971,0.151532,0.095423,0.056497,0.212766,0.519064,0.087891,0.776367,0.688477,0.650870
1043,1,0.230144,0.033577,0.231368,0.222526,0.246105,0.023579,3.413691,17.897104,0.809968,0.174234,0.229158,0.230144,0.201065,0.047059,0.279070,1.242622,0.234375,8.648438,8.414062,0.120313
1044,0,0.204982,0.061316,0.234010,0.133237,0.254976,0.121739,1.850012,5.706426,0.867338,0.283762,0.249565,0.204982,0.120147,0.018161,0.231884,0.408381,0.148438,0.796875,0.648438,0.312048


---

#### 2. Display First Five Rows

Display the first five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the first 5 rows of the 'train_df' DataFrame.
train_df.head()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,1,0.235668,0.035568,0.240065,0.222366,0.257763,0.035397,2.006333,6.427613,0.828406,0.177183,0.242788,0.235668,0.190731,0.047478,0.27907,1.240513,0.023438,9.140625,9.117188,0.087997
1,0,0.190578,0.059717,0.190747,0.151985,0.242259,0.090273,2.727301,13.710214,0.90115,0.370874,0.186667,0.190578,0.11814,0.047198,0.27907,1.262109,0.023438,4.757812,4.734375,0.194908
2,0,0.202016,0.062551,0.233775,0.133369,0.258403,0.125034,2.132116,7.588438,0.872074,0.283257,0.26295,0.202016,0.121069,0.046967,0.277457,0.898143,0.023438,8.460938,8.4375,0.085771
3,1,0.236232,0.041141,0.244934,0.227932,0.260342,0.03241,2.175879,7.324086,0.831379,0.235322,0.24334,0.236232,0.179134,0.049896,0.274286,1.379012,0.023438,10.617188,10.59375,0.103917
4,0,0.189003,0.061455,0.202376,0.135446,0.234851,0.099406,1.620451,6.374442,0.928735,0.472934,0.21505,0.189003,0.114454,0.047856,0.271186,1.147995,0.023438,5.953125,5.929688,0.119684


In [None]:
# Display the first 5 rows of the 'test_df' DataFrame.
test_df.head()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,1,0.186833,0.027472,0.184325,0.173955,0.204731,0.030777,2.655225,10.565846,0.821812,0.159883,0.180645,0.186833,0.17279,0.023495,0.271186,0.178571,0.007812,0.226562,0.21875,0.108929
1,0,0.188879,0.060316,0.195537,0.138072,0.242975,0.104904,1.497393,5.037085,0.909425,0.374225,0.140386,0.188879,0.133092,0.050847,0.272727,0.855938,0.023438,8.71875,8.695312,0.098712
2,0,0.150705,0.087127,0.174299,0.069666,0.226082,0.156416,2.603951,22.328899,0.969287,0.781729,0.050181,0.150705,0.109992,0.01726,0.266667,1.240954,0.007812,5.5625,5.554688,0.332396
3,1,0.183667,0.040607,0.182534,0.15648,0.207646,0.051166,2.054138,7.483019,0.898138,0.313925,0.17704,0.183667,0.149237,0.018648,0.262295,0.550312,0.007812,3.421875,3.414062,0.166503
4,1,0.205159,0.039543,0.210805,0.186667,0.228908,0.042241,2.099683,7.562209,0.876002,0.27188,0.224885,0.205159,0.154736,0.047105,0.277457,1.578835,0.1875,10.804688,10.617188,0.113644


---

#### 3. Display Last Five Rows

Display the last five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the last 5 rows of the 'train_df' DataFrame.
train_df.tail()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
2117,1,0.166738,0.052677,0.169427,0.16265,0.189223,0.026573,7.550412,76.134526,0.865282,0.427317,0.167465,0.166738,0.152651,0.022727,0.208333,0.174154,0.161133,0.214844,0.053711,0.136364
2118,0,0.198718,0.058959,0.217333,0.143111,0.252,0.108889,1.116666,3.569725,0.917123,0.363369,0.227556,0.198718,0.139322,0.050473,0.27907,0.792092,0.023438,4.96875,4.945312,0.155766
2119,0,0.202333,0.063001,0.221946,0.137544,0.264817,0.127273,2.000371,6.681799,0.873847,0.261759,0.272855,0.202333,0.12361,0.047291,0.269663,1.190168,0.023438,7.429688,7.40625,0.093438
2120,0,0.181799,0.058102,0.192037,0.12367,0.225568,0.101897,1.09166,4.009295,0.925575,0.427947,0.190731,0.181799,0.110586,0.049741,0.274286,0.789062,0.023438,4.359375,4.335938,0.062312
2121,0,0.167732,0.066225,0.171886,0.112598,0.225196,0.112598,0.822981,3.103282,0.95321,0.634648,0.173381,0.167732,0.126107,0.048096,0.27907,0.813616,0.023438,6.023438,6.0,0.123384


In [None]:
# Display the last 5 rows of the 'test_df' DataFrame.
test_df.tail()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
1041,0,0.143504,0.073334,0.150426,0.086383,0.197713,0.11133,3.586009,28.346615,0.961517,0.734525,0.059947,0.143504,0.090472,0.017448,0.235294,0.462015,0.007812,2.335938,2.328125,0.20302
1042,0,0.151532,0.058071,0.151257,0.098246,0.195877,0.097632,3.523045,18.652713,0.893303,0.407681,0.094971,0.151532,0.095423,0.056497,0.212766,0.519064,0.087891,0.776367,0.688477,0.65087
1043,1,0.230144,0.033577,0.231368,0.222526,0.246105,0.023579,3.413691,17.897104,0.809968,0.174234,0.229158,0.230144,0.201065,0.047059,0.27907,1.242622,0.234375,8.648438,8.414062,0.120313
1044,0,0.204982,0.061316,0.23401,0.133237,0.254976,0.121739,1.850012,5.706426,0.867338,0.283762,0.249565,0.204982,0.120147,0.018161,0.231884,0.408381,0.148438,0.796875,0.648438,0.312048
1045,0,0.171713,0.061671,0.189894,0.15266,0.207766,0.055106,2.818282,11.990365,0.920936,0.553174,0.209255,0.171713,0.177866,0.027165,0.271186,1.133854,0.03125,5.789062,5.757812,0.198301


---

#### 4. Rows & Columns In The DataFrames

Find the number of rows and columns in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the number of rows and columns in both the DataFrames.
print(train_df.shape)
test_df.shape

(2122, 21)


(1046, 21)

---

#### 5. Check For Missing Values 

Check whether any column in the `train_df` or `test_df` DataFrame has missing values.

In [None]:
# Check for the missing values in the 'train_df' DataFrame.
# Count every cell that 'isnull()' flags as missing, column by column.
num_missing_values = 0
for column in train_df.columns:
  for item in train_df[column].isnull():
    if item:
      num_missing_values += 1
print("No. of missing values = ", num_missing_values)

No. of missing values =  0


In [None]:
# Check for the missing values in the 'test_df' DataFrame.
# Count every cell that 'isnull()' flags as missing, column by column.
num_missing_values = 0
for column in test_df.columns:
  for item in test_df[column].isnull():
    if item:
      num_missing_values += 1
print("No. of missing values = ", num_missing_values)


No. of missing values =  0
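
The nested loops above can also be expressed with pandas' vectorised methods. The following is a minimal sketch on a hypothetical toy DataFrame (`toy_df` and its values are made up for illustration and are not part of the project data); it shows how `isnull().sum()` gives the count per column and how a second `sum()` gives the total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame standing in for 'train_df' or 'test_df'.
toy_df = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "meanfreq": [0.23, np.nan, 0.20, 0.19],
    "sd": [0.03, 0.05, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Total number of missing values in the whole DataFrame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("No. of missing values = ", total_missing)
```

The per-column counts are useful when the total is not zero, because they point to the columns that need cleaning.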


---

#### 6. Count The `male` & `female` Classes

Find out the number of `male` and `female` values in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the count of the 'male' and 'female' classes in the 'train_df' DataFrame.
train_df["label"].value_counts()

1    1085
0    1037
Name: label, dtype: int64

In [None]:
# Print the count of the 'male' and 'female' classes in the 'test_df' DataFrame.
test_df["label"].value_counts()

0    547
1    499
Name: label, dtype: int64
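
Raw counts are easier to compare across DataFrames of different sizes when they are expressed as proportions. The sketch below uses a hypothetical `labels` Series (made-up values, with `0` for male and `1` for female as in the dataset) to show the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Hypothetical label column: 0 = male, 1 = female.
labels = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="label")

# Absolute number of samples per class, ordered by class label.
counts = labels.value_counts().sort_index()

# Share of each class; the shares add up to 1.
proportions = labels.value_counts(normalize=True).sort_index()

print(counts)
print(proportions)
```

Shares close to 0.5 for each class suggest a roughly balanced dataset, in which case accuracy is a reasonable summary metric.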

---

#### 7. Feature Variables Extraction

Get the feature variables, i.e., `x_train` and `x_test`, from the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of the `x_train` and `x_test` DataFrames.

In [None]:
# Get the feature variables from the 'train_df' DataFrame.
x_train=train_df.iloc[:,1:]

# Display the first 5 rows of the 'x_train' DataFrame.
x_train.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,0.235668,0.035568,0.240065,0.222366,0.257763,0.035397,2.006333,6.427613,0.828406,0.177183,0.242788,0.235668,0.190731,0.047478,0.27907,1.240513,0.023438,9.140625,9.117188,0.087997
1,0.190578,0.059717,0.190747,0.151985,0.242259,0.090273,2.727301,13.710214,0.90115,0.370874,0.186667,0.190578,0.11814,0.047198,0.27907,1.262109,0.023438,4.757812,4.734375,0.194908
2,0.202016,0.062551,0.233775,0.133369,0.258403,0.125034,2.132116,7.588438,0.872074,0.283257,0.26295,0.202016,0.121069,0.046967,0.277457,0.898143,0.023438,8.460938,8.4375,0.085771
3,0.236232,0.041141,0.244934,0.227932,0.260342,0.03241,2.175879,7.324086,0.831379,0.235322,0.24334,0.236232,0.179134,0.049896,0.274286,1.379012,0.023438,10.617188,10.59375,0.103917
4,0.189003,0.061455,0.202376,0.135446,0.234851,0.099406,1.620451,6.374442,0.928735,0.472934,0.21505,0.189003,0.114454,0.047856,0.271186,1.147995,0.023438,5.953125,5.929688,0.119684


In [None]:
# Get the feature variables from the 'test_df' DataFrame.
x_test=test_df.iloc[:,1:]
# Display the first 5 rows of the 'x_test' DataFrame.
x_test.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,0.186833,0.027472,0.184325,0.173955,0.204731,0.030777,2.655225,10.565846,0.821812,0.159883,0.180645,0.186833,0.17279,0.023495,0.271186,0.178571,0.007812,0.226562,0.21875,0.108929
1,0.188879,0.060316,0.195537,0.138072,0.242975,0.104904,1.497393,5.037085,0.909425,0.374225,0.140386,0.188879,0.133092,0.050847,0.272727,0.855938,0.023438,8.71875,8.695312,0.098712
2,0.150705,0.087127,0.174299,0.069666,0.226082,0.156416,2.603951,22.328899,0.969287,0.781729,0.050181,0.150705,0.109992,0.01726,0.266667,1.240954,0.007812,5.5625,5.554688,0.332396
3,0.183667,0.040607,0.182534,0.15648,0.207646,0.051166,2.054138,7.483019,0.898138,0.313925,0.17704,0.183667,0.149237,0.018648,0.262295,0.550312,0.007812,3.421875,3.414062,0.166503
4,0.205159,0.039543,0.210805,0.186667,0.228908,0.042241,2.099683,7.562209,0.876002,0.27188,0.224885,0.205159,0.154736,0.047105,0.277457,1.578835,0.1875,10.804688,10.617188,0.113644


---

#### 8. Target Variable Extraction

Get the target variables, i.e., `y_train` and `y_test`, from the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of the `y_train` and `y_test` pandas Series.

In [None]:
# Get the target variable from the 'train_df' DataFrame.
y_train=train_df.iloc[:,0]
# Display the first 5 rows of the 'y_train' Pandas series.
y_train.head()

0    1
1    0
2    0
3    1
4    0
Name: label, dtype: int64

In [None]:
# Get the target variable from the 'test_df' DataFrame.
y_test=test_df.iloc[:,0]

# Display the first 5 rows of the 'y_test' Pandas series.
y_test.head()

0    1
1    0
2    0
3    1
4    1
Name: label, dtype: int64
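
The cells in steps 7 and 8 select columns by position with `iloc`, which relies on `label` being the first column. As an alternative sketch, the same split can be made by column name; the toy DataFrame below is hypothetical and only mimics the layout of the project data.

```python
import pandas as pd

# Hypothetical toy DataFrame with 'label' as the first column.
toy_df = pd.DataFrame({
    "label": [1, 0, 1],
    "meanfreq": [0.23, 0.19, 0.20],
    "sd": [0.03, 0.05, 0.06],
})

# Feature variables: every column except 'label'.
features = toy_df.drop(columns="label")

# Target variable: the 'label' column only.
target = toy_df["label"]

print(features.columns.tolist())
print(target.tolist())
```

Selecting by name keeps working if the column order of the CSV file changes, whereas the positional version would silently pick the wrong column.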

--- 

#### 9. Building A Random Forest Classifier Model

In [None]:
# Build a Random Forest Classifier model.
# Import the 'RandomForestClassifier' class.
from sklearn.ensemble import RandomForestClassifier
# Import the 'confusion_matrix' and 'classification_report' functions.
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
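# Train a random forest with 50 trees, using all available CPU cores.
rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)
rf_clf.fit(x_train, y_train)
# Print the accuracy on the training data itself.
print(rf_clf.score(x_train, y_train))
# Predict the target variable based on the feature variables of the test DataFrame.
y_predicted = rf_clf.predict(x_test)
y_predicted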

1.0


array([1, 0, 0, ..., 1, 0, 1])

In [None]:
y_predicted=pd.Series(y_predicted)

y_predicted.value_counts()

0    540
1    506
dtype: int64
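
The score of `1.0` printed above is measured on the training data, so it says little about unseen voices. The sketch below is not the project's model: it uses synthetic data from scikit-learn's `make_classification` (an assumption made only for illustration) to show how fixing `random_state` makes a forest reproducible and how training accuracy can be compared with accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 samples with 20 numeric features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Two forests with the same 'random_state' are built identically.
clf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
clf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
same_predictions = bool((clf_a.predict(X_test) == clf_b.predict(X_test)).all())

# Accuracy on the training data versus accuracy on held-out data.
train_accuracy = clf_a.score(X_train, y_train)
test_accuracy = clf_a.score(X_test, y_test)

print(same_predictions)
print(train_accuracy, test_accuracy)
```

Without `random_state`, rerunning the notebook can give slightly different predictions and metrics each time.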

---

#### 10. Confusion Matrix & Classification Report

Print the confusion matrix and classification report to evaluate the model. Interpret and report the results obtained from the confusion matrix and the classification report.

In [None]:
# Print the confusion matrix to see the number of TN, FN, TP and FP values.
confusion_matrix(y_test,y_predicted)

array([[533,  14],
       [  7, 492]])

In [None]:
# Print the precision, recall and f1-score values for both the male and female classes.
print(classification_report(y_test,y_predicted))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98       547
           1       0.97      0.99      0.98       499

    accuracy                           0.98      1046
   macro avg       0.98      0.98      0.98      1046
weighted avg       0.98      0.98      0.98      1046
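
The values in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, f1-score and accuracy in plain Python from the matrix printed earlier; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes.
cm = [[533, 14],
      [7, 492]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    # Samples predicted as 'cls' that belong to another class.
    fp = sum(row[cls] for row in matrix) - tp
    # Samples of 'cls' that were predicted as another class.
    fn = sum(matrix[cls]) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy: correct predictions (the diagonal) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for cls in (0, 1):
    precision, recall, f1 = per_class_metrics(cm, cls)
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2))
print("accuracy", round(accuracy, 2))
```

For class `0`, for example, precision is 533 / (533 + 7) and recall is 533 / (533 + 14), which round to the 0.99 and 0.97 shown in the report.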



**Write your interpretation of the results here.**

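- Interpretation 1: Of the 1046 test samples, 1025 are classified correctly (533 male and 492 female), which gives an overall accuracy of about 0.98.

- Interpretation 2: 14 male voices are misclassified as female and 7 female voices are misclassified as male, so errors are rare for both classes and slightly more frequent for male voices.

- Interpretation 3: The precision, recall and f1-score values lie between 0.97 and 0.99 for both classes, so the model does not favour one class over the other.

- Interpretation 4: The accuracy is 1.0 on the training data and about 0.98 on the test data. The gap is small, so the model generalises well and is making accurate predictions.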


---

### Submitting the Project

Follow the steps described below to submit the project.

1. After finishing the project, click on the **Share** button in the top-right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>


3. The link to the duplicate copy of the notebook (named **YYYY-MM-DD_StudentName_CapstoneProject5**) will be copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named **YYYY-MM-DD_StudentName_CapstoneProject5** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>


---