This dataset is probably best suited to unsupervised ML techniques, but I was curious whether any of its attributes can help predict whether a customer is male or female. The attributes in question are Age, Income and Score. It's a small dataset (both in number of features and number of customers), but I think this notebook gives a useful introduction to applying ML algorithms such as Random Forest, SVM and KNN.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')

## **Data Preprocessing**

In [None]:
columns = ['CustomerID', 'Gender', 'Age', 'Income', 'Score']
df = pd.read_csv('../input/Mall_Customers.csv', index_col='CustomerID', names=columns, header=0)
df = df[['Age', 'Income', 'Score', 'Gender']] # Putting Gender (target variable) at the end
df.head()

In [None]:
df.shape

We have data on 200 customers. This is not very much. 

In [None]:
df.describe()

Age ranges between 18-70, income ranges between 15K-137K, and score ranges between 1-99.

In [None]:
# % share of gender in dataset
df.Gender.value_counts(normalize=True)
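
The class shares also set the bar a classifier has to clear: always predicting the most frequent gender already reaches an accuracy equal to that class's share. A minimal sketch, assuming a made-up label series (`gender` and its 60/40 split are hypothetical and not read from the dataset):

```python
import pandas as pd

# Hypothetical labels standing in for df.Gender (not the real data)
gender = pd.Series(['Female'] * 6 + ['Male'] * 4, name='Gender')

shares = gender.value_counts(normalize=True)
majority_class = shares.idxmax()   # most frequent label
baseline_accuracy = shares.max()   # accuracy of always predicting that label

print(majority_class, baseline_accuracy)
```

On the real data, the same two lines applied to `df.Gender` would give the accuracy that the models further down need to beat.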

## **Exploratory Data Analysis**

Let's look into the data. We start off by isolating the attributes and looking at differences between the genders.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(9,4))

female = df[df.Gender == 'Female']
male = df[df.Gender == 'Male']

sns.histplot(female.Age, bins=12, kde=True, stat='density', ax=ax1)
sns.histplot(male.Age, bins=12, kde=True, stat='density', ax=ax2)

ax1.set_title('Age distr among females')
ax2.set_title('Age distr among males')

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(9,4))

female = df[df.Gender == 'Female']
male = df[df.Gender == 'Male']

sns.histplot(female.Income, bins=12, kde=True, stat='density', ax=ax1)
sns.histplot(male.Income, bins=12, kde=True, stat='density', ax=ax2)

ax1.set_title('Income distr among females')
ax2.set_title('Income distr among males')

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(9,4))

female = df[df.Gender == 'Female']
male = df[df.Gender == 'Male']

sns.histplot(female.Score, bins=12, kde=True, stat='density', ax=ax1)
sns.histplot(male.Score, bins=12, kde=True, stat='density', ax=ax2)

ax1.set_title('Score distr among females')
ax2.set_title('Score distr among males')

Interestingly, females' scores are clustered relatively symmetrically around the midpoint. Males' scores, however, peak at the very bottom, the very top, and the middle.
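
The visual impression can be checked numerically with a per-gender summary of Score. A minimal sketch on made-up data (`toy` is a hypothetical stand-in for `df` with randomly drawn scores, so the printed numbers say nothing about the real customers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for df: 50 customers per gender, random scores in 1..99
toy = pd.DataFrame({
    'Gender': ['Female'] * 50 + ['Male'] * 50,
    'Score': rng.integers(1, 100, size=100),
})

# One row per gender, one column per statistic
summary = toy.groupby('Gender')['Score'].agg(['mean', 'median', 'std'])
print(summary)
```

A similar mean and median with a larger standard deviation for one group would be consistent with scores piling up at both extremes.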

In [None]:
# Map Gender to 1 for female and 0 for male

mapping = {'Female': 1, 'Male': 0}
df['Gender'] = df['Gender'].map(mapping)
df.head()

In [None]:
# Comparing pairwise correlations between variables
sns.pairplot(df[['Age', 'Income', 'Score']])

No significant correlations between the variables.
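
The pair plot only shows the scatter; the correlation coefficients themselves can be read off with `DataFrame.corr()`. A small sketch on randomly generated columns (`toy` is hypothetical and only mimics the value ranges mentioned earlier, so its coefficients are not those of the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns drawn independently within the ranges seen in df.describe()
toy = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Income': rng.integers(15, 138, size=200),
    'Score': rng.integers(1, 100, size=200),
})

corr = toy.corr()  # pairwise Pearson correlation by default
print(corr.round(2))
```

Note that Pearson correlation only measures linear association, so a coefficient near zero does not rule out a non-linear pattern.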

In [None]:
sns.lmplot(x='Score', y='Income', hue='Gender', data=df, fit_reg=False)

We see a pattern, although this is not a linear one. Interestingly there seems to be a cluster where most people who have income in the 40k-60k range have a score between 40-60. Anyone with income above or below this range seems to be at the extremes of the Score range - either very high or very low.

## **Machine Learning**

Can we predict the gender of a customer (target variable) based on attributes such as age, income and score? We try training 3 different algorithms for this task - KNN, Random Forest, and SVM.

### **Feature scaling - standardizing the data**

In [None]:
# Standardize the features to zero mean and unit variance so they share a common scale

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df.drop('Gender', axis=1))

# Transforming the data
scaled_features = scaler.transform(df.drop('Gender', axis=1))
scaled_features

In [None]:
# Use the scaled features to create a scaled dataframe
# This gives us a standardized version of our data

df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1], index=df.index)
df_feat.head()
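
What `StandardScaler` does to each column is subtract the column mean and divide by the column's (population) standard deviation. A short sketch verifying this on a tiny made-up matrix (`X_toy` is hypothetical; its three columns merely play the role of Age, Income and Score):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with three columns
X_toy = np.array([[19., 15., 39.],
                  [21., 15., 81.],
                  [35., 60., 50.],
                  [64., 120., 10.]])

scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand: z = (x - mean) / std, with std computed using ddof=0
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```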

### **Train test split**

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV

X = df_feat
y = df['Gender']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
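
One thing to be aware of: the scaler was fitted on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A possible alternative, sketched here on synthetic data, is to split first and let a pipeline fit the scaler on the training rows only. `X_raw`, `y_toy` and the `stratify` argument are assumptions for illustration and are not part of the original workflow:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(101)

# Hypothetical unscaled features and binary labels (no real signal)
X_raw = rng.normal(loc=[40, 60, 50], scale=[14, 26, 26], size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.3, random_state=101, stratify=y_toy)

# The scaler inside the pipeline is fitted on X_tr only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

With only three features and 200 rows the difference is likely small, but the pattern carries over unchanged to larger problems.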

### **KNN**

In [None]:
# Training and Predictions

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5) # k=5
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
pred

In [None]:
# Evaluating the algorithm

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print('Accuracy Score: ' + str(accuracy_score(y_test, pred)))

The algorithm predicts no better than a random draw would. Let's find the k with the lowest error rate by iterating over candidate values.

In [None]:
error_rate = []

for i in range(1, 40): # Checking every k value from 1 to 39
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
    
error_rate

The error rate is what we want to minimize, so we want to know the k that gives the smallest error rate. Let's create a visual representation to make life easier. 

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='grey', marker='o', markerfacecolor='red')
plt.title('Error rate vs K value')
plt.xlabel('K value')
plt.ylabel('Mean error rate')

0.35 is a very high error rate, but it is the best we're able to find. Let's now run the model again with k=17 instead of k=5.

In [None]:
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print ('Accuracy Score: ' + str(accuracy_score(y_test, y_pred)))

We were able to classify a couple more points correctly, but in general an accuracy score of 0.65 is not good. It looks like we'd need more data (more features or a larger dataset) to build a more robust model.
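
Because k=17 was chosen by looking at the test set, the 0.65 is likely to be somewhat optimistic. A hedged sketch of an alternative is to pick k by cross-validation on the training data only and touch the test set once at the end. The arrays `X_tr` and `y_tr` below are synthetic stand-ins, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(101)

# Hypothetical training set: 140 rows, 3 standardized features, random labels
X_tr = rng.normal(size=(140, 3))
y_tr = rng.integers(0, 2, size=140)

# 5-fold cross-validation over the same k range as the loop above
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': list(range(1, 40))},
                      scoring='accuracy',
                      cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_['n_neighbors']
print(best_k, search.best_score_)
```

`search.best_estimator_` is then refitted on the full training set and can be scored on the held-out test rows.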

### **Random Forest**

Let's try the Random Forest algorithm instead. We have already scaled the data and split it into train and test sets.

In [None]:
# Training the algorithm

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=101)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

In [None]:
# Evaluating the algorithm

print (confusion_matrix(y_test, y_pred))
print (classification_report(y_test, y_pred))
print ('Accuracy Score: ' + str(accuracy_score(y_test, y_pred)))

Ouch, this ain't good. Nothing better than a random draw. Let's instead use grid search to find the best hyperparameter values. Hyperparameter tuning is the process of selecting the values for a model's hyperparameters that maximize the accuracy of the model.

In [None]:
# Grid search

grid_param = {  
    'n_estimators': [50, 80, 100, 120],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_depth': [10,30,50],
    'max_features': ['sqrt', 'log2'],  # 'auto' is no longer accepted by recent scikit-learn releases
    'min_samples_split': [3,9,20],
    'min_samples_leaf': [1, 2, 4]
    }

gs = GridSearchCV(estimator=forest,  
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)

gs.fit(X_train, y_train)
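
Grid search is exhaustive, so its cost is the product of the number of values per hyperparameter times the number of folds. A quick back-of-the-envelope sketch for the grid above (the `options` dictionary just restates how many values each entry of `grid_param` lists):

```python
from math import prod

# Number of candidate values per hyperparameter in grid_param
options = {
    'n_estimators': 4,
    'criterion': 2,
    'bootstrap': 2,
    'max_depth': 3,
    'max_features': 2,
    'min_samples_split': 3,
    'min_samples_leaf': 3,
}
cv_folds = 5

n_candidates = prod(options.values())  # distinct hyperparameter combinations
n_fits = n_candidates * cv_folds       # forests trained during the search

print(n_candidates, n_fits)  # 864 4320
```

On 140 training rows this is still quick, but the count grows multiplicatively with every value added to the grid.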

In [None]:
print(gs.best_params_)

Let's train the algorithm again, using the information from the grid search.

In [None]:
# Training the tuned algorithm

forest_tuned = RandomForestClassifier(n_estimators=100,
                                      criterion= 'gini',
                                      bootstrap= False,
                                      max_depth= 10,
                                      max_features= 'sqrt',
                                      min_samples_split= 20,
                                      min_samples_leaf= 1,
                                      random_state=101)
forest_tuned.fit(X_train, y_train)
y_pred = forest_tuned.predict(X_test)

In [None]:
# Evaluating the tuned algorithm

print (confusion_matrix(y_test, y_pred))
print (classification_report(y_test, y_pred))
print ('Accuracy Score: ' + str(accuracy_score(y_test, y_pred)))

Slightly better than previously, but still not noticeably different from a random draw.

### **SVM**

In [None]:
# Training the algorithm

from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

In [None]:
# Evaluating the algorithm

print (confusion_matrix(y_test, y_pred))
print (classification_report(y_test, y_pred))
print ('Accuracy Score: ' + str(accuracy_score(y_test, y_pred)))

Not great. Let's search for the best parameters using grid search.

In [None]:
# Grid search

# "C" controls the cost of misclassification on the training data.
# A large C gives low bias and high variance, because misclassified training points are penalized heavily.

# Small "gamma" means a Gaussian kernel with a large variance, so each training point influences a wide region.
# Large gamma leads to low bias and high variance in the model (risk of overfitting).

param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001]
}

gs = GridSearchCV(SVC(), param_grid, verbose=3)
gs.fit(X_train, y_train)

In [None]:
print(gs.best_params_)

Let's train the algorithm again, using the information from the grid search.

In [None]:
# Training the tuned algorithm

svm_tuned = SVC(C = 10, gamma = 0.1)
svm_tuned.fit(X_train, y_train)
y_pred = svm_tuned.predict(X_test)

In [None]:
# Evaluating the algorithm

print (confusion_matrix(y_test, y_pred))
print (classification_report(y_test, y_pred))
print ('Accuracy Score: ' + str(accuracy_score(y_test, y_pred)))

Slightly better than previously, but the model still has poor predictive ability.
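
The effect of C and gamma on bias and variance can be made visible by comparing training and test accuracy at the two corners of the grid. This is a sketch on synthetic data in which the labels are deliberately unrelated to the features (`X_tr`, `y_tr`, `X_te`, `y_te` are hypothetical), so any accuracy gained on the training rows is pure memorization:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(101)

# Hypothetical standardized features with random labels (no signal to learn)
X_tr, y_tr = rng.normal(size=(140, 3)), rng.integers(0, 2, size=140)
X_te, y_te = rng.normal(size=(60, 3)), rng.integers(0, 2, size=60)

results = {}
for C, gamma in [(0.1, 0.0001), (1000, 100)]:
    svm = SVC(C=C, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this setting
    results[(C, gamma)] = (svm.score(X_tr, y_tr), svm.score(X_te, y_te))

for setting, (train_acc, test_acc) in results.items():
    print(setting, round(train_acc, 2), round(test_acc, 2))
```

A large gap between training and test accuracy for the large-C, large-gamma setting is the high-variance behaviour described in the comments; the small-C, small-gamma setting tends to fall back to predicting a single class.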

## **Conclusion**

None of KNN, Random Forest or SVM is very useful for predicting the gender of a customer from the features Age, Income and Score. This suggests that these features carry little information about gender. That doesn't come as a huge surprise, as the EDA already showed little to suggest any major differences between the two genders in these variables. However, the sample used is very small (n=200), and having more data might have given us higher accuracy scores.

A high error rate on the training data as well as the test data indicates that the model is underfitting and has high bias: it is not sufficiently complex to represent the relationship between y and the input features. To combat this we could try increasing the number of input features.
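
One way to check whether underfitting is really the issue is to compare accuracy on the training rows with accuracy on the test rows. A minimal sketch, assuming synthetic data with random labels in place of the notebook's `X_train` and `X_test` (all array names below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(101)

# Hypothetical data: features carry no information about the labels
X_tr, y_tr = rng.normal(size=(140, 3)), rng.integers(0, 2, size=140)
X_te, y_te = rng.normal(size=(60, 3)), rng.integers(0, 2, size=60)

forest = RandomForestClassifier(n_estimators=100, random_state=101)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)  # how well the model reproduces what it has seen
test_acc = forest.score(X_te, y_te)   # how well it generalizes

print(round(train_acc, 2), round(test_acc, 2))
```

Low accuracy on both sets points to high bias, whereas high training accuracy combined with near-chance test accuracy points to a model that memorizes features with little predictive signal.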