## Import Libraries

In [2]:
# preprocessing/data manipulation
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# classifiers
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

## Read CSVs

In [3]:
test_data = pd.read_csv('/kaggle/input/mini-kaggle-project2-dataset2/test.csv')

train_data = pd.read_csv('/kaggle/input/mini-kaggle-project2-dataset2/train.csv')

## Splitting and Pre-processing Dataset

Here, we pre-process the dataset by first converting the values in every categorical (object-typed) column to integers via a mapping dictionary. We then split the dataset with stratification on the target variable, and apply a standard scaler. Missing values are imputed as well, using the median of each column so as to limit the influence of outliers and skewed distributions.

In [4]:
# Create a mapping dictionary for all categorical columns
mapping_dict = {}

for col in train_data.columns: 
    if train_data[col].dtype == 'object': 
        mapping = {label: idx for idx, label in enumerate(np.unique(train_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    train_data[col] = train_data[col].map(mapping)
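
To illustrate how the mapping dictionary behaves, here is a minimal sketch on two toy frames. The column name `workclass` and its values are hypothetical stand-ins, not taken from the dataset. It shows that a label which never appeared when the mapping was built is turned into NaN by `.map`, which is worth keeping in mind whenever a mapping is applied to a second dataset.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for a train and a test set (hypothetical values)
toy_train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
toy_test = pd.DataFrame({'workclass': ['State-gov', 'Never-worked']})

# Build the label -> integer mapping from the training column only
toy_mapping = {label: idx for idx, label in enumerate(np.unique(toy_train['workclass']))}

# Apply the same mapping to both frames
toy_train_codes = toy_train['workclass'].map(toy_mapping)
toy_test_codes = toy_test['workclass'].map(toy_mapping)

print(toy_train_codes.tolist())
print(toy_test_codes.tolist())  # the unseen label 'Never-worked' becomes NaN
```

Any NaN produced this way would be filled later by the median imputer, so unseen labels do not break the rest of the workflow.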

In [5]:
# Training Data
X = train_data.drop(columns=['income'])
y = train_data['income']

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

In [6]:
# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

In [7]:
# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)
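
As an alternative way to organize the same two steps, the sketch below chains the imputer and the scaler in a scikit-learn `Pipeline`. It is only an illustration on a small made-up matrix, and it assumes imputing before scaling, which is the reverse of the order used in the cells above. The names `prep`, `demo_train`, and `demo_test` are introduced here for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic matrix with one missing value and one outlier (made-up numbers)
demo_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [100.0, 20.0]])
demo_test = np.array([[np.nan, 25.0]])

# Assumption: impute first, then scale (the notebook scales first)
prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Statistics are learned from the training rows only, then reused on the test rows
demo_train_prepared = prep.fit_transform(demo_train)
demo_test_prepared = prep.transform(demo_test)

# Per-column medians learned from the training rows
print(prep.named_steps['impute'].statistics_)
```

The median of the first column is unaffected by the outlier value 100.0, which is the motivation for choosing the median strategy.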

## Employing Classification Methods on Training Dataset

After pre-processing is complete, we run the training dataset through each classification method. The accuracy score of each is printed so that the methods can be compared.

## Perceptron

By default, scikit-learn's Perceptron applies no regularization (penalty=None). Its accuracy score ranks the lowest of all the methods tried, but it could possibly improve if a penalty term were added to its loss function, so that the weight updates are regularized.

In [8]:
perceptron = Perceptron()

perceptron.fit(X_train_imputed, y_train)

percep_pred = perceptron.predict(X_test_imputed)

acc = accuracy_score(y_test, percep_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.684
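
The penalty idea mentioned for the perceptron can be tried through the `penalty` and `alpha` parameters of scikit-learn's `Perceptron`. The sketch below runs on synthetic data from `make_classification`, not on the project dataset, so its scores say nothing about the result above; the variable names are introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the project dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Compare no penalty with L2 and L1 penalties
percep_scores = {}
for penalty in (None, 'l2', 'l1'):
    clf = Perceptron(penalty=penalty, alpha=0.0001, random_state=1)
    clf.fit(Xd_train, yd_train)
    percep_scores[penalty] = accuracy_score(yd_test, clf.predict(Xd_test))

for penalty, score in percep_scores.items():
    print(f'penalty={penalty}: {score:.3f}')
```

Whether a penalty helps depends on the data, so the same loop would need to be run on the real features before drawing a conclusion.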


## Logistic Regression

Logistic regression is commonly regularized with an L1 or L2 penalty. L2 was applied here, with C=1.0, adding a penalty on the weights to avoid overfitting across all of the encoded categorical features. Its accuracy score is a marked improvement over the perceptron, yet not quite high enough for it to be chosen as the main method.

In [9]:
lr = LogisticRegression(penalty='l2', C=1.0)

lr.fit(X_train_imputed, y_train)

lr_pred = lr.predict(X_test_imputed)

acc = accuracy_score(y_test, lr_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.825
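
To make the effect of the L2 penalty more concrete, the following sketch fits logistic regression with several values of `C` (the inverse of the regularization strength) on synthetic data and reports the norm of the learned weights. The data and the chosen `C` values are assumptions for illustration; the model relies on scikit-learn's default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the project dataset)
X_lr, y_lr = make_classification(n_samples=500, n_features=10, random_state=1)

# Smaller C means a stronger L2 penalty and therefore smaller weights
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_lr, y_lr)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f'C={C}: weight norm = {norm:.3f}')
```

The weight norm shrinks as `C` decreases, which is the mechanism by which the penalty limits overfitting.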


## SVM 

SVM generates the second-highest accuracy score of all the classifier methods used. This can most likely be attributed to its ability to handle high-dimensional feature spaces (the dataset has a good number of features) and its robustness to overfitting. It is only slightly less accurate than our chosen classifier method.

In [10]:
svc = SVC()

svc.fit(X_train_imputed, y_train)
svc_pred = svc.predict(X_test_imputed)

acc = accuracy_score(y_test, svc_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.847


## Decision Tree

While the resulting accuracy score is adequate, it ranks among the lower scores overall. This could be due to the tendency of an unconstrained decision tree to overfit the training data.

In [11]:
tree = DecisionTreeClassifier()

tree.fit(X_train_imputed, y_train)
tree_pred = tree.predict(X_test_imputed)

acc = accuracy_score(y_test, tree_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.811
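
One way to check the overfitting explanation is to compare training and test accuracy for an unconstrained tree and a depth-limited one. The sketch below does this on synthetic data; the depth of 3 is an arbitrary value chosen for illustration, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the project dataset)
X_t, y_t = make_classification(n_samples=500, n_features=10, random_state=1)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_t, y_t, test_size=0.2, random_state=1, stratify=y_t
)

# Record (train accuracy, test accuracy) for each setting
tree_scores = {}
for label, depth in (('unconstrained', None), ('max_depth=3', 3)):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(Xt_train, yt_train)
    tree_scores[label] = (
        accuracy_score(yt_train, model.predict(Xt_train)),
        accuracy_score(yt_test, model.predict(Xt_test)),
    )

for label, (train_acc, test_acc) in tree_scores.items():
    print(f'{label}: train={train_acc:.3f}, test={test_acc:.3f}')
```

A large gap between training and test accuracy for the unconstrained tree would support the overfitting explanation.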


## K-Nearest Neighbors

This was another classifier method with an adequate accuracy score, but not one high enough to be considered for the final classifier method. One possible reason is its sensitivity to irrelevant features.


In [12]:
knn = KNeighborsClassifier()

knn.fit(X_train_imputed, y_train)
knn_pred = knn.predict(X_test_imputed)

acc = accuracy_score(y_test, knn_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.822
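
The sensitivity to irrelevant features can be probed with a small experiment: score KNN on a set of informative features, then again after appending pure-noise columns. The sketch below uses synthetic data, and the number of noise columns (50) is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Five informative features and no redundant ones (synthetic stand-in data)
X_inf, y_inf = make_classification(
    n_samples=500, n_features=5, n_informative=5, n_redundant=0, random_state=1
)

# Append 50 columns of Gaussian noise that carry no information about the label
rng = np.random.default_rng(1)
X_noisy = np.hstack([X_inf, rng.normal(size=(500, 50))])

knn_demo = KNeighborsClassifier()
acc_informative = cross_val_score(knn_demo, X_inf, y_inf, cv=5).mean()
acc_noisy = cross_val_score(knn_demo, X_noisy, y_inf, cv=5).mean()

print(f'Informative features only: {acc_informative:.3f}')
print(f'With noise features added: {acc_noisy:.3f}')
```

Because KNN relies on distances computed over all columns, uninformative columns can dilute the signal from the informative ones.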


## Random Forest

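**This classification method was chosen**, as it had the highest accuracy score of all the methods compared. Its accuracy could be attributed to averaging many decision trees, which reduces the overfitting of any single tree, and to its ability to maintain accuracy even when a large proportion of the dataset's features are irrelevant. Since missing values were already imputed during pre-processing, the model's own handling of missing values is unlikely to explain the difference.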

In [13]:
# Chose Random Forest because it had the highest accuracy score 

rf = RandomForestClassifier()

rf.fit(X_train_imputed, y_train)
rf_pred = rf.predict(X_test_imputed)

acc = accuracy_score(y_test, rf_pred)
print(f'Accuracy: {acc:.3f}')

Accuracy: 0.860
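
The comparison above rests on a single train-test split. A less split-dependent comparison can be sketched with k-fold cross-validation, as below. This runs on synthetic data with default hyperparameters, so the ranking it prints is not a statement about the project dataset; the dictionary names are introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the project dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=1)

candidates = {
    'Perceptron': Perceptron(random_state=1),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=1),
}

# Mean accuracy over 5 folds for each classifier
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')
```

Reporting the standard deviation across folds helps judge whether a small gap between two classifiers is meaningful.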


## Preparing Classifier Method for Test Dataset

Here, we pre-process the full training dataset and the testing dataset so that the Random Forest algorithm can be fitted on the former and used to predict on the latter. The code remains mostly the same as earlier; the main differences are that the categorical data in the test set is mapped as well, and that the predicted income level is written to a new column.

In [14]:
# Reuse the mapping dictionary built from the training data in In [4], so that
# every category receives the same integer code in both datasets.
# train_data was already encoded in In [4] and needs no further mapping.
for col, mapping in mapping_dict.items():
    # Skip columns, such as the target, that are not present in the test set
    if col in test_data.columns:
        # Labels unseen during training map to NaN and are imputed below
        test_data[col] = test_data[col].map(mapping)
    
# Training Data
X_train = train_data.drop(columns=['income'])
y_train = train_data['income']

# Test Data
X_test = test_data

# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)

## Evaluating Testing Dataset and Generating CSV File

Once the training and testing datasets have been pre-processed, the testing dataset is evaluated with our chosen classifier method, and the results are written to a csv file.

In [15]:
rf2 = RandomForestClassifier()

rf2.fit(X_train_imputed, y_train)

rf2_pred = rf2.predict(X_test_imputed)
result = pd.DataFrame({'id': test_data.id, 'income': rf2_pred})
result.to_csv('submission.csv', index = False)


## Final Visualization

Once the csv file has been generated, the DataFrame and the count of each predicted income class are printed for a quick inspection before final submission.

In [16]:
submission = pd.read_csv('/kaggle/working/submission.csv')
print(submission)

value_counts = submission['income'].value_counts()
print(value_counts)


         id  income
0       392       0
1      1900       0
2     24507       0
3     32817       1
4     47893       0
...     ...     ...
9764  13000       0
9765  43012       0
9766  34782       0
9767  23538       0
9768  23097       0

[9769 rows x 2 columns]
income
0    7836
1    1933
Name: count, dtype: int64
