# **GENDER CLASSIFICATION**
**By : Garry Ariel**

This notebook contains the steps of processing given information, build and train the model, and use it to predict gender. The model used here are Logistic Regression (LR) and Neural Network (NN).

First thing to do is import some necessary packages and read the data in.

In [1]:
# Import packages
import numpy as np
import pandas as pd

# Read the data
data_df = pd.read_csv("/kaggle/input/gender-classification/Transformed Data Set - Sheet1.csv")

# Take a look at some data examples
data_df.head(10)

Unnamed: 0,Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
0,Cool,Rock,Vodka,7UP/Sprite,F
1,Neutral,Hip hop,Vodka,Coca Cola/Pepsi,F
2,Warm,Rock,Wine,Coca Cola/Pepsi,F
3,Warm,Folk/Traditional,Whiskey,Fanta,F
4,Cool,Rock,Vodka,Coca Cola/Pepsi,F
5,Warm,Jazz/Blues,Doesn't drink,Fanta,F
6,Cool,Pop,Beer,Coca Cola/Pepsi,F
7,Warm,Pop,Whiskey,Fanta,F
8,Warm,Rock,Other,7UP/Sprite,F
9,Neutral,Pop,Wine,Coca Cola/Pepsi,F


Take a look some statistics about the data using following syntax.

In [2]:
# Describe the data
data_df.describe()

Unnamed: 0,Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
count,66,66,66,66,66
unique,3,7,6,4,2
top,Cool,Rock,Doesn't drink,Coca Cola/Pepsi,F
freq,37,19,14,32,33


## **Preprocessing the Data**  
Next, we will do some pre-processing to the data, such as turn categorical variables into one-hot-encoding form.

In [3]:
# Turn male into 1 and female 0
data_df['Gender'].replace(to_replace = 'F', value = 0, inplace = True)
data_df['Gender'].replace(to_replace = 'M', value = 1, inplace = True)

In [4]:
# Create one hot encoding
fav_color_df = pd.get_dummies(data_df[["Favorite Color"]], prefix = "color")
fav_music_df = pd.get_dummies(data_df[["Favorite Music Genre"]], prefix = "music")
fav_beverage_df = pd.get_dummies(data_df[["Favorite Beverage"]], prefix = "beverage")
fav_drink_df = pd.get_dummies(data_df[["Favorite Soft Drink"]], prefix = "drink")

In [5]:
# Merging one hot encoding and create new dataframe
transformed_df = pd.merge(fav_color_df, fav_music_df, left_index = True, right_index = True)
transformed_df = pd.merge(transformed_df, fav_beverage_df, left_index = True, right_index = True)
transformed_df = pd.merge(transformed_df, fav_drink_df, left_index = True, right_index = True)

# Take a look at some data examples
transformed_df.head(10)

Unnamed: 0,color_Cool,color_Neutral,color_Warm,music_Electronic,music_Folk/Traditional,music_Hip hop,music_Jazz/Blues,music_Pop,music_R&B and soul,music_Rock,beverage_Beer,beverage_Doesn't drink,beverage_Other,beverage_Vodka,beverage_Whiskey,beverage_Wine,drink_7UP/Sprite,drink_Coca Cola/Pepsi,drink_Fanta,drink_Other
0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0
3,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
4,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0
5,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0
6,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0
7,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0
8,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0
9,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0


## **Feature Selection**
Next, we will select some features which will be used to feed the model later. We specified 3 ways to select the features.

1. Choose the feature manually.
We can experiment about which features give higher accuracy.

In [6]:
# Choose feature (Manual)
feature = [
    "music_Electronic",
    "music_Hip hop",
    "music_Jazz/Blues",
    "music_Pop",
    "music_R&B and soul",
    "beverage_Vodka",
    "drink_Other"
]

2. Choose all features without filter it.

In [7]:
# Choose all feature
# feature = []
# for col in transformed_df.columns:
#     feature.append(col)

3. Choose features based on its correlation to gender variable. Specify a threshold, such that every features which have correlation to gender variable greater than threshold will be chosen as a feature.

In [8]:
# Choose feature (By rule)
# feature = []
# analyze_df = pd.merge(transformed_df, data_df["Gender"], left_index = True, right_index = True)
# for index, row in analyze_df.corr().iterrows():
#     if abs(row["Gender"]) > 0.08 and index != "Gender":
#         feature.append(index)

## **Preparing Data**
In the following step, we will format the data so that the data can be feed into the model. We will also split the data into train and test dataset with the comparison of 4:1.

In [9]:
# Import packages related to training model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef, accuracy_score

# Turn into numpy array
X = np.asarray(transformed_df[feature])
y = np.asarray(data_df['Gender'])

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [10]:
# Preprocess train data
header = []
for col in transformed_df[feature].columns:
    header.append(col)
header = np.array(header)

x_df = pd.DataFrame(
    X_train,
    columns = header
)

y_df = pd.DataFrame(
    y_train,
    columns = ["gender"]
)

train_df = pd.merge(x_df, y_df, left_index = True, right_index = True)

# Look at the correlation
corr_df = train_df.corr()
corr_df.head(len(feature))

Unnamed: 0,music_Electronic,music_Hip hop,music_Jazz/Blues,music_Pop,music_R&B and soul,beverage_Vodka,drink_Other,gender
music_Electronic,1.0,-0.106383,-0.080705,-0.188311,-0.106383,0.041723,-0.117797,0.195698
music_Hip hop,-0.106383,1.0,-0.080705,-0.188311,-0.106383,0.041723,0.086384,0.195698
music_Jazz/Blues,-0.080705,-0.080705,1.0,-0.142857,-0.080705,0.123091,-0.089363,-0.082479
music_Pop,-0.188311,-0.188311,-0.142857,1.0,-0.188311,-0.246183,0.208514,-0.310881
music_R&B and soul,-0.106383,-0.106383,-0.080705,-0.188311,1.0,-0.139077,-0.117797,0.065233
beverage_Vodka,0.041723,0.041723,0.123091,-0.246183,-0.139077,1.0,-0.153998,0.1066
drink_Other,-0.117797,0.086384,-0.089363,0.208514,-0.117797,-0.153998,1.0,0.120386


## **Build and Train Logistic Regression Model**

We will just simply feed the model with the data using any default parameters. After trained, we use the model to predict the gender, and evaluate the accuracy. To experiment with the accuracy, we can change the features we used in previous steps.

In [11]:
# Create logistic regression
LR = LogisticRegression().fit(X_train, y_train)

In [12]:
# Predict result
y_predict = LR.predict(X_test)

# Evaluate the accuracy
score = accuracy_score(y_predict, y_test)

# Print result
print("The accuracy is " + str(score))

The accuracy is 0.7142857142857143


## **Build and Train Neural Network**

The model we used here are as the following.
1. Fully connected layer with 128 neurons using ReLU activation function.
2. Dropout layer with probability 0.4.
3. Fully connected layer with 2 neurons (as output) using softmax activation function.

In [13]:
# Using NN model
import tensorflow as tf
from tensorflow import keras

# Create callback
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs = {}):
        if ((logs.get('val_accuracy') > 0.72 and logs.get('val_loss') <= 0.5931) or logs.get('val_accuracy') >= 0.9):
            self.model.stop_training = True
            print("Stop here")
callback = myCallback()

# Build model
tf.random.set_seed(42)
model = keras.Sequential([
    keras.layers.Dense(128, activation = 'relu', input_shape = [len(feature)]),
    keras.layers.Dropout(0.4),
    keras.layers.Dense(2, activation = 'softmax')
])

# Compile model
model.compile(
    loss = 'binary_crossentropy',
    optimizer = keras.optimizers.Adam(0.001),
    metrics = ['accuracy']
)

# Fit the model
model.fit(
    X_train, y_train,
    epochs = 200,
    batch_size = 1,
    verbose = 1,
    validation_split = 0.2,
    callbacks = [callback]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7ff0f82e0e90>

Then we used the trained NN model to predict the gender and evaluate the accuracy. As before, we can experiment with the accuracy by change the features we used, or changing some parameters.

In [14]:
# Predict result (If the last layer using softmax)
y_predict = model.predict(X_test)
result = []
for index in range(len(y_predict)):
  each_result = np.argmax(y_predict[index])
  result.append(each_result)

# Formatting
result = np.array(result)
    
# Evaluate the accuracy
score = accuracy_score(result, y_test)

# Print result
print("The accuracy is " + str(score))

The accuracy is 0.8571428571428571
