In this notebook I use the 'adult income' dataset, which contains information that helps determine people's salary. The main objective is to predict the "high salary" individuals (more than 50K), who are in the minority compared to the total population.

# Step 1. Import the packages and dataset

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import classification_report_imbalanced, geometric_mean_score
from sklearn.metrics import f1_score
from sklearn.svm import SVC

data_train = pd.read_csv('/kaggle/input/income-adult/adult_data.csv')
data_test = pd.read_csv('/kaggle/input/income-adult/adult_test.csv')

# Step 2. Interpret the data

In [None]:
data_train.head()

In [None]:
data_test.head()

Note that in the test set the values of the 'salary' column end with a '.'.

In [None]:
data_train.info()

In [None]:
data_test.info()

Note that every column name except 'age' starts with a space ' '. We are glad to see that `info()` reports no missing values.
"fnlwgt" stands for "final weight", i.e. the number of people who are in the same situation (age, race, education background, etc.). It seems useless here, since the other columns already carry the same information.
The same goes for the "education" column, which is redundant with "education-num".
Finally, we are not in a racist society (at least I hope so, haha): we would like to make predictions about the population regardless of their origin, so we can get rid of the "race" and "native-country" columns.
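
The leading space in the column names comes from the "comma + space" layout of the CSV files. As a minimal sketch (the toy CSV below is made up, and the rest of this notebook keeps the names with their leading space), pandas can drop that space at load time with the `skipinitialspace` parameter of `read_csv`:

```python
import io
import pandas as pd

# Toy CSV mimicking the "comma + space" layout of the adult files (values are made up).
raw = "age, workclass, salary\n39, State-gov, <=50K\n50, Private, >50K\n"

# skipinitialspace=True drops the blank that follows each delimiter,
# in the header as well as in the values.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(df.columns))        # column names without the leading space
print(df["salary"].tolist())   # values without the leading space
```

If this option were used when loading the real files, names such as `' salary'` and values such as `' <=50K'` in the following cells would have to be written without the leading space.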

# Step 3. Data cleaning and processing

In [None]:
data_train.drop([" fnlwgt", " education", " race", " native-country"], axis = 1, inplace= True)
data_test.drop([" fnlwgt", " education", " race", " native-country"], axis = 1, inplace= True)

In [None]:
# Data viz of 'salary' column

plt.figure(figsize=(15,8))
sns.countplot(x=data_train[' salary'])
plt.title("Salary class counts")

This confirms that high salaries are in the minority.


In [None]:
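# Replace the values in the "salary" column by 0 and 1.
# Assigning the result back avoids calling replace(inplace=True) on a column selection.
data_train[' salary'] = data_train[' salary'].replace({" <=50K": 0, " >50K": 1})
data_test[' salary'] = data_test[' salary'].replace({" <=50K.": 0, " >50K.": 1})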

In [None]:
# split the data_train to y_train and X_train.
y_train=data_train[' salary']
X_train=data_train.drop(' salary', axis=1)

# split the data_test to y_test and X_test.
y_test=data_test[' salary']
X_test=data_test.drop(' salary', axis=1)

In [None]:
y_train.value_counts(normalize = True)

In [None]:
# Standardize the numerical variables in data_train.

num_train = X_train.dtypes[X_train.dtypes != 'object'].index

# Assign the scaled array directly, so the result does not depend on index alignment.
X_train[num_train] = StandardScaler().fit_transform(X_train[num_train])

# Transform each categorical variable into indicator variables.
X_train = pd.get_dummies(X_train)

In [None]:
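# Standardize the numerical variables in data_test.

num_test = X_test.dtypes[X_test.dtypes != 'object'].index

# Fit the scaler on the unscaled training columns (still available in data_train)
# and apply it to the test set, so that both sets share the same scaling.
scaler = StandardScaler().fit(data_train[num_test])
X_test[num_test] = scaler.transform(X_test[num_test])

# Transform each categorical variable in data_test into indicator variables,
# then align the columns with X_train (indicators absent from the test set are filled with 0).
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)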
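
A common precaution when standardizing is to learn the mean and standard deviation on the training data only, and to reuse them for the test data, so that both sets are expressed on the same scale. The sketch below illustrates the idea in plain Python on made-up ages. It uses the population standard deviation, as `StandardScaler` does. The variable names are hypothetical.

```python
from statistics import fmean, pstdev

# Made-up ages, standing in for one numerical column.
train_age = [25, 35, 45, 55]
test_age = [30, 60]

# Learn the mean and the (population) standard deviation on the training values only.
mu = fmean(train_age)
sigma = pstdev(train_age)

def standardize(values):
    """Scale values with the statistics learned on the training set."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_age)
test_scaled = standardize(test_age)

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test values do not need to, since they are only mapped onto the training scale.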

# Step 4. Create and evaluate an SVM classification model

In [None]:
svm = SVC(gamma = 'scale')
svm.fit(X_train, y_train)

print('Score on the test set', svm.score(X_test, y_test))

In [None]:
y_pred = svm.predict(X_test)

print(pd.crosstab(y_test, y_pred, colnames= ['Predictions']))

In [None]:
print(classification_report_imbalanced(y_test, y_pred))

The recall and f1-score for class 1 are not too bad, but we can do better.
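
To make the `rec`, `spe`, `f1` and `geo` columns of the report more concrete, here is a minimal sketch that recomputes them by hand for class 1 from a confusion matrix. The labels are made up and do not come from the dataset.

```python
from math import sqrt

# Made-up labels: 1 = ">50K" (minority), 0 = "<=50K" (majority).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Confusion-matrix counts, taking class 1 as the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))

recall = tp / (tp + fn)        # share of true high salaries that are found
precision = tp / (tp + fp)     # share of predicted high salaries that are correct
specificity = tn / (tn + fp)   # share of true low salaries that are found
f1 = 2 * precision * recall / (precision + recall)
gmean = sqrt(recall * specificity)  # geometric mean of recall and specificity

accuracy = (tp + tn) / len(y_true)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f} gmean={gmean:.2f}")
```

In this toy case the accuracy is 0.70 while the recall on class 1 is only 0.50. This is why accuracy alone is a poor guide on imbalanced data.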

To go further, I use **undersampling** methods, which reduce the number of observations of the majority class in order to reach a satisfactory minority class / majority class ratio. What's more, this method makes training faster, since the training set is smaller.

In [None]:
# Apply the RandomUnderSampler method.
rUs = RandomUnderSampler()
X_ru, y_ru = rUs.fit_resample(X_train, y_train)

print("Class counts in the undersampled sample:", dict(pd.Series(y_ru).value_counts()))
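
Conceptually, random undersampling with the default settings keeps every minority row and draws, without replacement, the same number of majority rows. The sketch below illustrates that idea in plain Python on made-up labels. It is not the imblearn implementation, and the names are hypothetical.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up labels: 8 majority rows (0) and 3 minority rows (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
rows = list(range(len(labels)))  # row indices stand in for the features

minority = [i for i in rows if labels[i] == 1]
majority = [i for i in rows if labels[i] == 0]

# Keep every minority row and draw, without replacement,
# the same number of majority rows.
kept = sorted(minority + random.sample(majority, k=len(minority)))

counts = Counter(labels[i] for i in kept)
print(counts)
```

Both classes end up with three rows each. The five discarded majority rows are the price paid for the balance.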

In [None]:
svm = SVC(gamma='scale')
svm.fit(X_ru, y_ru)

y_pred = svm.predict(X_test)
print(pd.crosstab(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))

I also try with **`probability=True`**, which makes `predict_proba` available so that a custom decision threshold can be used. However, it may slow down training.

In [None]:
svm = SVC(probability= True, gamma ='scale') 
svm.fit(X_ru, y_ru)                         

threshold = 0.5 # give a try with 0.4, 0.6, ...

probs = svm.predict_proba(X_test)
pred_class =  (probs[:,1] >= threshold).astype('int')

print(pd.crosstab(y_test, pred_class))
print(classification_report_imbalanced(y_test, pred_class))
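
To see why the threshold is worth tuning, here is a minimal sketch on made-up probabilities (not produced by the model above). Lowering the threshold labels more rows as class 1, which tends to raise recall at the expense of precision. The function name is hypothetical.

```python
# Made-up probabilities of class 1 for eight test rows, with their true labels.
proba_1 = [0.10, 0.35, 0.45, 0.55, 0.62, 0.30, 0.48, 0.90]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

def recall_precision(threshold):
    """Recall and precision of class 1 when predicting 1 for proba >= threshold."""
    pred = [int(p >= threshold) for p in proba_1]
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

for th in (0.4, 0.5, 0.6):
    r, p = recall_precision(th)
    print(f"threshold={th}: recall={r:.2f}, precision={p:.2f}")
```

On this toy data, moving the threshold from 0.6 down to 0.4 raises recall from 0.50 to 0.75 while precision drops from 1.00 to 0.60.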

One more method: **BalancedRandomForestClassifier**

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier

bclf = BalancedRandomForestClassifier()
bclf.fit(X_train, y_train) 
y_pred = bclf.predict(X_test)
print(pd.crosstab(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))

# Conclusion

All the later models perform better than the first one, which was trained on the initial dataset.

Classification on unbalanced data is a classification problem in which the training sample shows a strong disparity between the classes to be predicted.

It is important to remember that the greater the imbalance between the classes, the less successful classical models will be at predicting the minority class. In many cases, real data is affected by an imbalance problem, so it will be necessary to use, or even combine, some of the methods presented in this notebook.

Thank you for reading.