Data Science - EA9: Classification

In [28]:
# imports for all tasks
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score


# Load the training and test datasets
train_data = pd.read_csv('./data/aug_train.csv')
test_data = pd.read_csv('./data/aug_test.csv')

# Display first few rows of the datasets
print(train_data.head())
print(test_data.head())

   city_development_index  gender      relevent_experience  \
0                   0.624    Male   No relevent experience   
1                   0.926    Male  Has relevent experience   
2                   0.920    Male  Has relevent experience   
3                   0.624    Male   No relevent experience   
4                   0.920  Female  Has relevent experience   

  enrolled_university education_level major_discipline experience  \
0       no_enrollment     High School              NaN          5   
1       no_enrollment        Graduate             STEM        >20   
2       no_enrollment        Graduate             STEM        >20   
3    Full time course     High School              NaN          1   
4       no_enrollment         Masters             STEM        >20   

    company_type last_new_job  training_hours  target  
0            NaN        never              21       0  
1            NaN           >4              12       0  
2  Public Sector           >4              2

Task 1 - Data clean, imputation

In [29]:
# function to clean the 'experience' and 'last_new_job' columns
# first make sure that NaN values are replaced with '0'
# for experience: map '>20' to 21 and '<1' to 1, then make the column numeric
# for last new job: map '>4' to 5 and 'never' to 0, then make the column numeric
def input_replace_missing_experience_last_new_job(df):
    # impute missing values (assign the result back; fillna(..., inplace=True) on a
    # selected column works on a temporary object under pandas copy-on-write)
    df['experience'] = df['experience'].fillna('0')
    df['last_new_job'] = df['last_new_job'].fillna('0')

    # replace the string categories and convert to integers
    df['experience'] = df['experience'].replace('>20', '21').replace('<1', '1').astype(int)
    df['last_new_job'] = df['last_new_job'].replace('>4', '5').replace('never', '0').astype(int)
    
    return df

# function to impute missing values with the mode for categorical columns and the median for numerical columns
def impute_missing_values(df):
    for column in df.columns:
        if df[column].dtype == 'object':
            # categorical column: fill with the most frequent value
            df[column] = df[column].fillna(df[column].mode()[0])
        else:
            # numerical column: fill with the median
            df[column] = df[column].fillna(df[column].median())
    return df

# clean the train and test data with the input_replace_missing_experience_last_new_job function
train_data = input_replace_missing_experience_last_new_job(train_data)
test_data = input_replace_missing_experience_last_new_job(test_data)

# impute the remaining missing values in the train and test data
train_data = impute_missing_values(train_data)
test_data = impute_missing_values(test_data)
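
Note that the cell above fills the test set with the test set's own modes and medians. A common alternative is to learn the fill values on the training data only and reuse them for the test data. The following is a minimal sketch of that idea on toy data; the helper name fit_fill_values and the toy frames are hypothetical and not part of this notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fit_fill_values(train_df):
    """Learn one fill value per column from the training frame only (hypothetical helper)."""
    fill_values = {}
    for column in train_df.columns:
        if is_numeric_dtype(train_df[column]):
            # numerical column: median of the training values
            fill_values[column] = train_df[column].median()
        else:
            # categorical column: most frequent training value
            fill_values[column] = train_df[column].mode()[0]
    return fill_values


# Toy data standing in for the real train/test frames (assumption for illustration).
toy_train = pd.DataFrame({
    "gender": ["Male", "Male", "Female", None],
    "training_hours": [10.0, 20.0, None, 60.0],
})
toy_test = pd.DataFrame({
    "gender": [None, "Female"],
    "training_hours": [None, 5.0],
})

fill_values = fit_fill_values(toy_train)

# The same training-derived values are applied to both frames.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(fill_values)
print(toy_test)
```

With this approach the test rows never influence the values used to fill them, which keeps the test set closer to truly unseen data.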

Task 2 - Classification

In [30]:
# define categorical columns because they need to be encoded before feeding them to the model
categorical_columns = ['gender', 'relevent_experience', 'enrolled_university', 'education_level', 'major_discipline', 'company_type']

# initialize the OneHotEncoder for categorical columns
encoder = OneHotEncoder(drop='first', sparse_output=False)

# fit and transform the training data
encoded_train = encoder.fit_transform(train_data[categorical_columns])

# transform the test data
encoded_test = encoder.transform(test_data[categorical_columns])

# convert encoded features to pandas dataframe
encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(categorical_columns))
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(categorical_columns))

# drop original categorical columns and concatenate the encoded dataframe
train_data = train_data.drop(categorical_columns, axis=1)
test_data = test_data.drop(categorical_columns, axis=1)

train_data = pd.concat([train_data, encoded_train_df], axis=1)
test_data = pd.concat([test_data, encoded_test_df], axis=1)

# define features and target variable
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']

# I decided to go with a random forest classifier, so train that model
classification_model = RandomForestClassifier(random_state=42)
classification_model.fit(X_train, y_train)

# predictions on the training set
train_preds = classification_model.predict(X_train)

# generate metrics
train_conf_matrix = confusion_matrix(y_train, train_preds)
train_accuracy = accuracy_score(y_train, train_preds)
train_precision = precision_score(y_train, train_preds)
train_recall = recall_score(y_train, train_preds)
train_f1 = f1_score(y_train, train_preds)

print("Training Set Evaluation")
print("Confusion Matrix:")
print(train_conf_matrix)
print(f"Accuracy: {train_accuracy:.2f}")
print(f"Precision: {train_precision:.2f}")
print(f"Recall: {train_recall:.2f}")
print(f"F1-score: {train_f1:.2f}")

# define features for the test set
X_test = test_data.drop('target', axis=1)
y_test = test_data['target']

# Predictions on the test set
test_preds = classification_model.predict(X_test)

# Calculate metrics
test_conf_matrix = confusion_matrix(y_test, test_preds)
test_accuracy = accuracy_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_f1 = f1_score(y_test, test_preds)

print("\nTest Set Evaluation")
print("Confusion Matrix:")
print(test_conf_matrix)
print(f"Accuracy: {test_accuracy:.2f}")
print(f"Precision: {test_precision:.2f}")
print(f"Recall: {test_recall:.2f}")
print(f"F1-score: {test_f1:.2f}")

Training Set Evaluation
Confusion Matrix:
[[1565    0]
 [   2  533]]
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1-score: 1.00

Test Set Evaluation
Confusion Matrix:
[[74  4]
 [16  6]]
Accuracy: 0.80
Precision: 0.60
Recall: 0.27
F1-score: 0.37


In [31]:
# compare training and test set results
print("\nComparison of Training and Test Set Results")
print(f"{'Metric':<15} {'Training':<10} {'Test':<10}")
print(f"{'Accuracy':<15} {train_accuracy:<10.2f} {test_accuracy:<10.2f}")
print(f"{'Precision':<15} {train_precision:<10.2f} {test_precision:<10.2f}")
print(f"{'Recall':<15} {train_recall:<10.2f} {test_recall:<10.2f}")
print(f"{'F1-score':<15} {train_f1:<10.2f} {test_f1:<10.2f}")


Comparison of Training and Test Set Results
Metric          Training   Test
Accuracy        1.00       0.80
Precision       1.00       0.60
Recall          1.00       0.27
F1-score        1.00       0.37


The comparison of metrics between the training and test sets points to overfitting. Every metric is at or near 1.00 on the training set, while precision, recall, and F1-score drop sharply on the test set. This suggests that the model is memorizing the training data rather than generalizing to new data.

Accuracy: The model reaches an accuracy of 1.00 on the training set (2 errors in 2,100 samples) but drops to 0.80 on the test set. Since 78 of the 100 test samples belong to class 0, always predicting class 0 would already score 0.78, so the test accuracy is only slightly better than a majority-class baseline.
Precision: The precision drops from 1.00 to 0.60: of the 10 test samples predicted as positive, 4 are false positives.
Recall: The recall drops from 1.00 to 0.27: the model finds only 6 of the 22 actual positives in the test set and misses the other 16.
F1-score: The F1-score decreases from 1.00 to 0.37, which confirms that the balance between precision and recall is much worse on the test set.
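
The test-set metrics can be recomputed by hand from the confusion matrix printed earlier. The sketch below uses plain Python and assumes scikit-learn's layout for a binary confusion matrix (rows are actual classes, columns are predicted classes); the majority-class baseline is an extra reference value, not something the notebook computes.

```python
# Test-set confusion matrix from the output above:
# [[74  4]
#  [16  6]]
tn, fp = 74, 4   # actual class 0: predicted 0, predicted 1
fn, tp = 16, 6   # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total        # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.3f}")
```

Comparing the accuracy against the baseline shows why accuracy alone is a weak indicator on imbalanced data and why recall and F1-score are more informative here.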

These problems could be addressed by the performance improvements in the next section (Extra points).

Extra points - think about which methods can increase the performance

1. Better handling of the imbalanced data
Resampling is possible, with oversampling, undersampling, or a combination of both. The class weights of the random forest classifier could also be adjusted so that errors on the minority class count more.
2. Features
Creating new features or transforming existing ones could give the model more informative inputs.
3. Advanced algorithms
Boosted algorithms such as XGBoost could be used, or even neural networks if a large dataset is available.
4. Hyperparameter tuning
The model's hyperparameters could be optimized with techniques like grid search, random search, or Bayesian optimization.
5. Model ensembling
Multiple models could be combined to get better performance.
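
As an illustration of points 1 and 4, the following sketch combines class weighting with a small grid search. It is only a rough example under stated assumptions: the data is synthetic (make_classification with a roughly 75/25 class split), and the parameter grid and the F1 scoring choice are illustrative, not tuned for the dataset in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (assumption; not the aug_train.csv data).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency.
base_model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# Small illustrative grid; limiting depth and leaf size also curbs overfitting.
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

# Optimize F1 on the positive class instead of accuracy.
search = GridSearchCV(base_model, param_grid, scoring="f1", cv=3)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("cross-validated F1:", round(search.best_score_, 3))
print("held-out F1:", round(search.score(X_te, y_te), 3))
```

Scoring on F1 rather than accuracy steers the search toward settings that find more of the minority class, which is where the model above is weakest.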

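Other measures might help as well: cross-validation would give a more reliable performance estimate than a single test set of 100 samples, and regularization (for a random forest, for example limiting the tree depth or raising the minimum number of samples per leaf) could reduce the overfitting.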
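
A possible way to check both ideas is to compare training and validation scores across stratified folds for an unconstrained and a constrained forest. This is a minimal sketch on synthetic data; the specific limits (max_depth=4, min_samples_leaf=5) and the dataset are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption; not the notebook's data).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "unconstrained": RandomForestClassifier(n_estimators=50, random_state=0),
    "constrained": RandomForestClassifier(
        n_estimators=50, max_depth=4, min_samples_leaf=5, random_state=0
    ),
}

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=["accuracy", "f1"], return_train_score=True
    )
    results[name] = {
        "train_f1": scores["train_f1"].mean(),
        "val_f1": scores["test_f1"].mean(),
    }
    print(
        f"{name:<14} train F1={results[name]['train_f1']:.2f} "
        f"validation F1={results[name]['val_f1']:.2f}"
    )
```

A large gap between the training and validation F1 of a model is the same overfitting signal seen in the comparison table above, but averaged over several splits instead of one.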