# Project 5: Gender Voices Classification

### Overview

In this project, you are given a dataset containing statistical measurements of the audio frequencies of male and female voices. Using these measurements, you have to predict the gender of each voice with the `RandomForestClassifier` algorithm.

The data description is provided below for the dataset.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender_voice_dataset_attributes.png' width=600>

*Dataset credits: https://www.mldata.io/dataset-details/gender_voice/*

Please note that

1. In the `label` column, `0` denotes the male class and `1` denotes the female class.

2. You are not required to know the meaning of all the other attributes in the dataset. You just need to deploy a prediction model with these attributes as the feature variables and the `label` column as the target variable.

You can get the datasets from the following links:

1. Train dataset
  
  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-train.csv

2. Test dataset

  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-test.csv

---

### Project Requirements

1. Create a pandas DataFrame for the train and test datasets.

2. Display the first five rows of both the training and test DataFrames.

3. Display the last five rows of both the training and test DataFrames.

4. Find the number of rows and columns in the train and test DataFrames.

5. Check for the missing values in the train and test DataFrames.

6. Count the number of `male` and `female` classes in the train and test DataFrames.

7. Separate the feature variables, i.e., `x_train` and `x_test` from both the DataFrames.

8. Separate the target variable, i.e., `y_train` and `y_test` from both the DataFrames.

9. Apply the `RandomForestClassifier` machine learning model to predict the `male` and `female` classes in the test DataFrame, i.e., `x_test`.

10. Print the confusion matrix and the classification report to evaluate your prediction model. Also, based on the confusion matrix, precision, recall and f1-score values, report whether the prediction model deployed by you is making accurate predictions or not.

---

#### 1. Create The DataFrames

Create the Pandas DataFrames for both the training and test datasets.

In [None]:
# Create the DataFrames for both the train and test datasets and store them in the 'train_df' and 'test_df' variables
import pandas as pd
train_df = pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-train.csv")
test_df = pd.read_csv("https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/project-4/gender-voice-test.csv")

---

#### 2. Display First Five Rows

Display the first five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the first 5 rows of the 'train_df' DataFrame.
train_df.head()


In [None]:
# Display the first 5 rows of the 'test_df' DataFrame.
test_df.head()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,1,0.186833,0.027472,0.184325,0.173955,0.204731,0.030777,2.655225,10.565846,0.821812,0.159883,0.180645,0.186833,0.17279,0.023495,0.271186,0.178571,0.007812,0.226562,0.21875,0.108929
1,0,0.188879,0.060316,0.195537,0.138072,0.242975,0.104904,1.497393,5.037085,0.909425,0.374225,0.140386,0.188879,0.133092,0.050847,0.272727,0.855938,0.023438,8.71875,8.695312,0.098712
2,0,0.150705,0.087127,0.174299,0.069666,0.226082,0.156416,2.603951,22.328899,0.969287,0.781729,0.050181,0.150705,0.109992,0.01726,0.266667,1.240954,0.007812,5.5625,5.554688,0.332396
3,1,0.183667,0.040607,0.182534,0.15648,0.207646,0.051166,2.054138,7.483019,0.898138,0.313925,0.17704,0.183667,0.149237,0.018648,0.262295,0.550312,0.007812,3.421875,3.414062,0.166503
4,1,0.205159,0.039543,0.210805,0.186667,0.228908,0.042241,2.099683,7.562209,0.876002,0.27188,0.224885,0.205159,0.154736,0.047105,0.277457,1.578835,0.1875,10.804688,10.617188,0.113644


---

#### 3. Display Last Five Rows

Display the last five rows of both the `train_df` and `test_df` DataFrames.

In [None]:
# Display the last 5 rows of the 'train_df' DataFrame.
train_df.tail()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
2117,1,0.166738,0.052677,0.169427,0.16265,0.189223,0.026573,7.550412,76.134526,0.865282,0.427317,0.167465,0.166738,0.152651,0.022727,0.208333,0.174154,0.161133,0.214844,0.053711,0.136364
2118,0,0.198718,0.058959,0.217333,0.143111,0.252,0.108889,1.116666,3.569725,0.917123,0.363369,0.227556,0.198718,0.139322,0.050473,0.27907,0.792092,0.023438,4.96875,4.945312,0.155766
2119,0,0.202333,0.063001,0.221946,0.137544,0.264817,0.127273,2.000371,6.681799,0.873847,0.261759,0.272855,0.202333,0.12361,0.047291,0.269663,1.190168,0.023438,7.429688,7.40625,0.093438
2120,0,0.181799,0.058102,0.192037,0.12367,0.225568,0.101897,1.09166,4.009295,0.925575,0.427947,0.190731,0.181799,0.110586,0.049741,0.274286,0.789062,0.023438,4.359375,4.335938,0.062312
2121,0,0.167732,0.066225,0.171886,0.112598,0.225196,0.112598,0.822981,3.103282,0.95321,0.634648,0.173381,0.167732,0.126107,0.048096,0.27907,0.813616,0.023438,6.023438,6.0,0.123384


In [None]:
# Display the last 5 rows of the 'test_df' DataFrame.
test_df.tail()

Unnamed: 0,label,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
1041,0,0.143504,0.073334,0.150426,0.086383,0.197713,0.11133,3.586009,28.346615,0.961517,0.734525,0.059947,0.143504,0.090472,0.017448,0.235294,0.462015,0.007812,2.335938,2.328125,0.20302
1042,0,0.151532,0.058071,0.151257,0.098246,0.195877,0.097632,3.523045,18.652713,0.893303,0.407681,0.094971,0.151532,0.095423,0.056497,0.212766,0.519064,0.087891,0.776367,0.688477,0.65087
1043,1,0.230144,0.033577,0.231368,0.222526,0.246105,0.023579,3.413691,17.897104,0.809968,0.174234,0.229158,0.230144,0.201065,0.047059,0.27907,1.242622,0.234375,8.648438,8.414062,0.120313
1044,0,0.204982,0.061316,0.23401,0.133237,0.254976,0.121739,1.850012,5.706426,0.867338,0.283762,0.249565,0.204982,0.120147,0.018161,0.231884,0.408381,0.148438,0.796875,0.648438,0.312048
1045,0,0.171713,0.061671,0.189894,0.15266,0.207766,0.055106,2.818282,11.990365,0.920936,0.553174,0.209255,0.171713,0.177866,0.027165,0.271186,1.133854,0.03125,5.789062,5.757812,0.198301


---

#### 4. Rows & Columns In Both DataFrames

Find the number of rows and columns in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the number of rows and columns in both the DataFrames.
print(train_df.shape)
print(test_df.shape)

(2122, 21)
(1046, 21)

---

#### 5. Check For Missing Values

Check whether any of the columns in both the `train_df` and `test_df` DataFrames has any missing value.

In [None]:
# Check for the missing values in the 'train_df' DataFrame.
train_df.isnull().sum()

label       0
meanfreq    0
sd          0
median      0
Q25         0
Q75         0
IQR         0
skew        0
kurt        0
sp.ent      0
sfm         0
mode        0
centroid    0
meanfun     0
minfun      0
maxfun      0
meandom     0
mindom      0
maxdom      0
dfrange     0
modindx     0
dtype: int64

In [None]:
# Check for the missing values in the 'test_df' DataFrame.
test_df.isnull().sum()

label       0
meanfreq    0
sd          0
median      0
Q25         0
Q75         0
IQR         0
skew        0
kurt        0
sp.ent      0
sfm         0
mode        0
centroid    0
meanfun     0
minfun      0
maxfun      0
meandom     0
mindom      0
maxdom      0
dfrange     0
modindx     0
dtype: int64
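
Both outputs above show zero missing values in every column. For contrast, here is a minimal sketch of what `isnull().sum()` reports when gaps are present. The tiny `toy_df` DataFrame and its values are made up for illustration and are not part of the voice dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the voice dataset) with two missing entries.
toy_df = pd.DataFrame({
    "label": [1, 0, 0, 1],
    "meanfreq": [0.18, np.nan, 0.15, 0.20],
    "sd": [0.03, 0.06, np.nan, 0.04],
})

# Count the missing values in each column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# A single True/False answer for the whole DataFrame.
has_missing = bool(toy_df.isnull().values.any())
print(has_missing)
```

A non-zero count in any column would mean those rows need to be dropped or filled before training the model.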

---

#### 6. Count The `male` & `female` Classes

Find out the number of `male` and `female` values in both the `train_df` and `test_df` DataFrames.

In [None]:
# Print the count of the 'male' and 'female' classes in the 'train_df' DataFrame.
train_df['label'].value_counts()

label,count
1,1085
0,1037


In [None]:
# Print the count of the 'male' and 'female' classes in the 'test_df' DataFrame.
test_df['label'].value_counts()

label,count
0,547
1,499
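
The counts above suggest that the two classes are roughly balanced in both DataFrames. A quick way to confirm this is to convert the counts into proportions. The sketch below hard-codes the counts printed above, and the helper name `class_proportions` is only an illustrative choice. On the real data, `train_df['label'].value_counts(normalize=True)` gives the same proportions directly.

```python
def class_proportions(counts):
    """Convert a {label: count} mapping into {label: proportion}."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts copied from the value_counts() outputs above (0 = male, 1 = female).
train_counts = {1: 1085, 0: 1037}
test_counts = {0: 547, 1: 499}

train_props = class_proportions(train_counts)
test_props = class_proportions(test_counts)

# Print the share of each class in each dataset.
for name, props in [("train", train_props), ("test", test_props)]:
    for label, share in sorted(props.items()):
        print(f"{name} label {label}: {share:.3f}")
```

Because neither class dominates, accuracy is a reasonable first metric for this problem.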


---

#### 7. Feature Variables Extraction

Get the feature variables, i.e., `x_train` and `x_test` from both the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of `x_train` and `x_test` DataFrames.

In [None]:
# Get the feature variables from the 'train_df' DataFrame.
x_train = train_df.iloc[:, 1:]
# Display the first 5 rows of the 'x_train' DataFrame.
x_train.head()


In [None]:
# Get the feature variables from the 'test_df' DataFrame.
x_test = test_df.iloc[:,1:]
# Display the first 5 rows of the 'x_test' DataFrame.
x_test.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,mode,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx
0,0.186833,0.027472,0.184325,0.173955,0.204731,0.030777,2.655225,10.565846,0.821812,0.159883,0.180645,0.186833,0.17279,0.023495,0.271186,0.178571,0.007812,0.226562,0.21875,0.108929
1,0.188879,0.060316,0.195537,0.138072,0.242975,0.104904,1.497393,5.037085,0.909425,0.374225,0.140386,0.188879,0.133092,0.050847,0.272727,0.855938,0.023438,8.71875,8.695312,0.098712
2,0.150705,0.087127,0.174299,0.069666,0.226082,0.156416,2.603951,22.328899,0.969287,0.781729,0.050181,0.150705,0.109992,0.01726,0.266667,1.240954,0.007812,5.5625,5.554688,0.332396
3,0.183667,0.040607,0.182534,0.15648,0.207646,0.051166,2.054138,7.483019,0.898138,0.313925,0.17704,0.183667,0.149237,0.018648,0.262295,0.550312,0.007812,3.421875,3.414062,0.166503
4,0.205159,0.039543,0.210805,0.186667,0.228908,0.042241,2.099683,7.562209,0.876002,0.27188,0.224885,0.205159,0.154736,0.047105,0.277457,1.578835,0.1875,10.804688,10.617188,0.113644
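
The positional slicing used in this section relies on `label` being the first column (position 0). A slightly more defensive alternative, sketched below on a made-up three-column DataFrame, selects columns by name so that the split still works if the column order changes. The `toy_df` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column layout mirrors the voice dataset.
toy_df = pd.DataFrame({
    "label": [1, 0, 1],
    "meanfreq": [0.19, 0.15, 0.21],
    "sd": [0.03, 0.09, 0.04],
})

# Features: every column except the target.
x_toy = toy_df.drop(columns="label")
# Target: the 'label' column as a pandas Series.
y_toy = toy_df["label"]

print(list(x_toy.columns))
print(y_toy.tolist())

# For this layout, positional slicing gives the same result.
assert x_toy.equals(toy_df.iloc[:, 1:])
```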


---

#### 8. Target Variable Extraction

Get the target variables, i.e., `y_train` and `y_test` from both the `train_df` and `test_df` DataFrames respectively. Then, display the first 5 rows of `y_train` and `y_test` Pandas series.

In [None]:
# Get the target variable from the 'train_df' DataFrame.
y_train = train_df.iloc[:, 0]
# Display the first 5 rows of the 'y_train' Pandas series.
y_train.head()


In [None]:
# Get the target variable from the 'test_df' DataFrame.
y_test = test_df.iloc[:,0]
# Display the first 5 rows of the 'y_test' Pandas series.
y_test.head()

Unnamed: 0,label
0,1
1,0
2,0
3,1
4,1


---

#### 9. Building A Random Forest Classifier Model

In [None]:
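# Build a Random Forest Classifier model.
# Import the 'RandomForestClassifier' class.
from sklearn.ensemble import RandomForestClassifier
# Import the 'confusion_matrix' and 'classification_report' functions.
from sklearn.metrics import confusion_matrix, classification_report

# Separate the feature and target variables of the train DataFrame.
x_train = train_df.iloc[:, 1:]
y_train = train_df.iloc[:, 0]

# Create the model with 50 trees, using all available CPU cores.
rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)

# Train the model on the train dataset.
rf_clf.fit(x_train, y_train)

# Separate the feature and target variables of the test DataFrame.
x_test = test_df.iloc[:, 1:]
y_test = test_df.iloc[:, 0]

# Calculate the accuracy of the model on the test dataset.
rf_clf.score(x_test, y_test)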

0.9818355640535373
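
For a classifier, `score` returns the mean accuracy on the given data, so the value above means that roughly 98% of the test voices were labelled correctly. Because a random forest is randomised, the exact figure can vary slightly between runs unless `random_state` is fixed. The sketch below illustrates both points on synthetic data from `make_classification`. The data, the split and the seed value 42 are assumptions made for illustration and are not part of this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 features, 2 classes (not the voice dataset).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fixing random_state makes repeated runs build the same forest.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)

# score() is the mean accuracy of predict() against the true labels.
score = clf.score(X_te, y_te)
manual_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(score, manual_accuracy)

# A second forest with the same seed reproduces the same predictions.
clf_again = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
same_predictions = (clf.predict(X_te) == clf_again.predict(X_te)).all()
print(same_predictions)
```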

In [None]:
# Predict the target variable based on the feature variables of the test DataFrame.
y_predicted = rf_clf.predict(x_test)
# Display the first predicted label.
y_predicted[0]


1

---

#### 10. Confusion Matrix & Classification Report

Print the confusion matrix and classification report to evaluate the model. Interpret and report the results obtained from the confusion matrix and the classification report.

In [None]:
# Print the confusion matrix to see the number of TN, FN, TP and FP values.
confusion_matrix(y_test,y_predicted)

In [None]:
# Print the precision, recall and f1-score values for both the male and female classes.
print(classification_report(y_test,y_predicted))
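
In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so for the labels `0` (male) and `1` (female) the layout is `[[TN, FP], [FN, TP]]` when `1` is treated as the positive class. The sketch below shows how the precision, recall and f1-score in the classification report follow from those four numbers. The counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not this model's results.

```python
def binary_metrics(matrix):
    """Return precision, recall, F1 (for class 1) and accuracy from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)   # of the voices predicted female, how many are female
    recall = tp / (tp + fn)      # of the actual female voices, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts, NOT the output of the model above.
example_matrix = [[90, 10],
                  [5, 95]]

precision, recall, f1, accuracy = binary_metrics(example_matrix)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Precision, recall and f1-score values close to 1 for both classes, together with small off-diagonal counts in the confusion matrix, indicate that the model is making accurate predictions.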

---