## Import Libraries

In [2]:
# preprocessing/data manipulation
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, classification_report, f1_score
import pandas as pd


# classifiers
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

## Read CSVs

In [3]:
test_data = pd.read_csv('/kaggle/input/mini-kaggle-project3/test.csv')

train_data = pd.read_csv('/kaggle/input/mini-kaggle-project3/train.csv')

## Splitting and Pre-processing Dataset

Here, we begin pre-processing the dataset. 

We first check the DataFrame for any NA values. We then convert the values in all categorical (object) columns to numeric codes via a mapping dictionary. Next, we split the dataset with stratification on the target variable and apply a standard scaler. Missing values are imputed with the median, which is less sensitive to outliers and skewed distributions than the mean. Finally, to reduce dimensionality, we apply PCA to the training and testing data.

In [4]:
# Check for NAs in entire DataFrame
print(train_data.isnull().values.any())

# Check for NAs in the columns
print(train_data.isnull().any())

# Check for NAs in the rows
print(train_data.isnull().any(axis=1))

# Check for null values in DataFrame
na_ct = train_data.isnull().values.flatten().sum() 

# Count number of False values
non_na_ct = train_data.size - na_ct

print("Number of True Values (NAs):", na_ct)
print("Number of False values (Non-NAs):", non_na_ct)

False
Elevation                             False
Aspect                                False
Slope                                 False
Horizontal_Distance_To_Hydrology      False
Vertical_Distance_To_Hydrology        False
Horizontal_Distance_To_Roadways       False
Hillshade_9am                         False
Hillshade_Noon                        False
Hillshade_3pm                         False
Horizontal_Distance_To_Fire_Points    False
Wilderness_Area1                      False
Wilderness_Area2                      False
Wilderness_Area3                      False
Wilderness_Area4                      False
Soil_Type1                            False
Soil_Type2                            False
Soil_Type3                            False
Soil_Type4                            False
Soil_Type5                            False
Soil_Type6                            False
Soil_Type7                            False
Soil_Type8                            False
Soil_Type9                

In [5]:
# Create a mapping dictionary for all categorical columns
mapping_dict = {}

for col in train_data.columns: 
    if train_data[col].dtype == 'object': 
        mapping = {label: idx for idx, label in enumerate(np.unique(train_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    train_data[col] = train_data[col].map(mapping)
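
As a minimal sketch of how such a mapping dictionary behaves, the toy example below builds the mapping from one DataFrame and applies it to another. The `colour` column and its values are made up for illustration and are not part of the project data. A category that was not present when the mapping was built comes out as `NaN`, which is worth keeping in mind when the same mapping is reused on new data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these are not columns from the project dataset
toy_train = pd.DataFrame({"colour": ["b", "a", "c", "a"]})
toy_new = pd.DataFrame({"colour": ["a", "d"]})

# Build the mapping from the sorted unique values of the training column
colour_mapping = {label: idx for idx, label in enumerate(np.unique(toy_train["colour"]))}

# Apply the same mapping to both DataFrames
toy_train["colour_code"] = toy_train["colour"].map(colour_mapping)
toy_new["colour_code"] = toy_new["colour"].map(colour_mapping)

print(colour_mapping)
print(toy_new)  # the unseen category "d" has no code and becomes NaN
```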

In [6]:
# Training Data 
X = train_data.drop(columns=['label'])
y = train_data['label']

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

In [7]:
# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

In [8]:
# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)

In [9]:
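# Further pre-processing data by applying PCA
# PCA is fitted on the training split only; the held-out split reuses the same projection
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_imputed)
X_test_pca = pca.transform(X_test_imputed)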
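
The three pre-processing steps can also be chained into a single scikit-learn pipeline, which makes it harder to fit a step on the wrong split by accident. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the project files, the names `X_demo`, `y_demo`, and `preprocess` are hypothetical, and it imputes before scaling, which is a common ordering but differs from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=1
)

# Stratified split, as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Median imputation, standard scaling, then PCA down to 10 components
preprocess = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=10),
)

# Fit every step on the training split, then reuse the fitted steps on the held-out split
X_tr_pca = preprocess.fit_transform(X_tr)
X_te_pca = preprocess.transform(X_te)

print(X_tr_pca.shape, X_te_pca.shape)
```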

## Employing Classification Methods on Training Dataset

After pre-processing is complete, we then begin running the training dataset through each classification method. 

We first introduce each classifier and then run cross-validation to check how robust each model is. Cross-validation also gives a more reliable basis for tuning hyperparameters than a single train-test split. For the purpose of this project, 5 folds are used for every model; for SVM, the folds are applied inside a randomized hyperparameter search rather than through cross_val_score.

The F1 scores of each model are printed for comparison, and the mean F1 score across the 5 folds is used as the final metric for choosing the best model.
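
To make the scoring procedure concrete, here is a small sketch of 5-fold cross-validation with the weighted F1 metric. It runs on synthetic data (an assumption, so the numbers it prints say nothing about the project dataset), and the explicit `StratifiedKFold` object is shown only for clarity: when an integer such as `cv=5` is passed with a classifier, scikit-learn already uses stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1
)

# Five stratified folds: each fold keeps roughly the same class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One weighted F1 score per fold
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X_demo, y_demo, cv=cv, scoring="f1_weighted"
)

print("Per-fold F1 scores:", scores)
print("Mean F1 score:", scores.mean())
```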

## Perceptron

The Perceptron is a useful introduction to classifiers, but as a simple linear model it may not handle the complexity of larger datasets, especially ones with imbalanced classes. As a result, its mean F1 score is the lowest of all the classifiers tested.

In [None]:
# Introduce model
percep_clf = Perceptron()

# Cross-validation 
cv_percep = cross_val_score(percep_clf, X_train_pca, y_train, cv=5, scoring='f1_weighted')

# Print F1 scores
print("Cross-validation F1 scores:", cv_percep)
print("Mean F1 score:", cv_percep.mean())

## Logistic Regression

Like the Perceptron, the Logistic Regression model struggles when a dataset becomes too large or too complex, leading to suboptimal scores. While logistic regression *is* more robust than the perceptron, the resulting F1 score does not show much improvement overall.

In [None]:
# Introduce model
lr_clf = LogisticRegression(max_iter=2000, penalty='l2', C=1.0)

# Cross-validation 
cv_lr = cross_val_score(lr_clf, X_train_pca, y_train, cv=5, scoring='f1_weighted')

# Print F1 scores
print("Cross-validation F1 scores:", cv_lr)
print("Mean F1 score:", cv_lr.mean())

## SVM 

When attempting to use SVM, the computation took too long to produce any output. Adjusting n_components, the parameter range, or the number of iterations did not make much of a difference. This is consistent with how kernel SVMs scale: the fit time of scikit-learn's SVC grows at least quadratically with the number of samples, so it can become impractical on a dataset of this size.

On a personal note, I felt it necessary to include the code to at least record the attempt and remark on the output (or lack thereof). If improvements are to be considered, the class imbalance could perhaps be addressed, specifically through upsampling or downsampling.

In [None]:
# Create a separate PCA in order to reduce the number of features to 2
# (separate names are used so the 10-component X_train_pca used by the other models is not overwritten)
pca_svm = PCA(n_components=2)
X_train_pca_svm = pca_svm.fit_transform(X_train_imputed)

# Create a pipeline for SVM and Standard Scaling
pipe_svc = make_pipeline(
    StandardScaler(),
    SVC(random_state=1),
)

# Define parameter distributions for random search
param_range = [0.01, 1.0, 100.0]

param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

# Creating Randomized Search, setting estimators, parameters, and maximum CPU usage
rs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid,
                        scoring='f1_weighted', refit=True, n_iter=5, cv=5,
                        random_state=1, n_jobs=-1)
rs.fit(X_train_pca_svm, y_train)

# Get the best parameters and best F1 score
best_params = rs.best_params_
best_score = rs.best_score_

# Print F1 scores
print("Best Parameters:", best_params)
print("Best F1 Score:", best_score)
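
The downsampling idea mentioned above could look roughly like the sketch below, which cuts every class down to the size of the smallest one. This is an assumption about how one might proceed, not something that was run on the project data: the toy DataFrame, its `feature` column, and the class sizes are all made up. Downsampling also shrinks the training set, which would shorten the SVM fit, at the cost of discarding data.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced toy data: 70 / 20 / 10 rows per class
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "label": ["a"] * 70 + ["b"] * 20 + ["c"] * 10,
})

# Size of the smallest class
smallest = toy["label"].value_counts().min()

# Randomly keep `smallest` rows from every class
balanced = toy.groupby("label").sample(n=smallest, random_state=1)

print(balanced["label"].value_counts())
```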

## Decision Tree

Decision Tree fared better in terms of F1 score. This may be due to its ability to model complex, non-linear relationships in large datasets, more so than logistic regression. Additionally, it naturally performs feature selection by choosing the most informative feature at each split. Unfortunately, while this *is* a considerable improvement over logistic regression, it neither reaches the required F1 score nor is it the highest F1 score overall.

In [10]:
# Introduce model
dt_clf = DecisionTreeClassifier()

# Cross-validation
cv_dt = cross_val_score(dt_clf, X_train_pca, y_train, cv=5, scoring='f1_weighted')

# Print F1 scores
print("Cross-validation F1 scores:", cv_dt)
print("Mean F1 score:", cv_dt.mean())

Cross-validation F1 scores: [0.85730482 0.85545796 0.8553183  0.85894307 0.8565576 ]
Mean F1 score: 0.8567163510468265


## K-Nearest Neighbors

As a nonparametric classifier, KNN makes no assumptions about the underlying data distribution. With this advantage, KNN outperforms both logistic regression and perceptron. Trading higher computational cost for flexibility, KNN handles this large and complex dataset well, achieving a score that is higher than the decision tree's and high enough to surpass the required F1 score. However, it does not have the highest score overall.

In [12]:
# Introduce model
knn_clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Cross-validation
cv_knn = cross_val_score(knn_clf, X_train_pca, y_train, cv=5, scoring='f1_weighted')

# Print F1 scores
print("Cross-validation F1 scores:", cv_knn)
print("Mean F1 score:", cv_knn.mean())

Cross-validation F1 scores: [0.88836871 0.88876536 0.88836229 0.88969826 0.88812142]
Mean F1 score: 0.8886632090693791


## Random Forest

**This classification method was chosen**, as it had the highest overall F1 score among the classifiers tested. While a single decision tree may not produce an adequate F1 score, combining multiple decision trees allows for a significant improvement in predictions. Because the predictions of many trees are aggregated, the variance of the model is reduced, which increases the overall F1 score.

For these reasons, Random Forest is selected as the final classifier.

In [13]:
# This classifier was chosen due to having the highest F1 score overall

# Introduce model, setting additional parameters for n number of trees and maximum CPU usage 
rf_clf = RandomForestClassifier(n_estimators=50, n_jobs=-1)

# Cross-validation
cv_rf = cross_val_score(rf_clf, X_train_pca, y_train, cv=5, scoring='f1_weighted')

# Print F1 scores
print("Cross-validation F1 scores:", cv_rf)
print("Mean F1 score:", cv_rf.mean())

Cross-validation F1 scores: [0.91565308 0.91559906 0.91386446 0.91460716 0.91414672]
Mean F1 score: 0.9147740973880507
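
The comparison between a single tree and an ensemble of trees can be reproduced in miniature with the sketch below. It uses synthetic data (an assumption), so its scores are unrelated to the results above; it only shows how the two models can be scored side by side, with the standard deviation across folds as a rough indication of how stable each model is.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=8, n_classes=3, random_state=1
)

models = {
    "single tree": DecisionTreeClassifier(random_state=1),
    "forest (50 trees)": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Score both models with the same 5-fold weighted F1 setup
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1_weighted")
    print(f"{name}: mean={results[name].mean():.3f}, std={results[name].std():.3f}")
```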


## Preparing Classifier Method for Test Dataset

Here, we pre-process the full training dataset and the testing dataset so that they can be passed to the Random Forest algorithm. The code remains mostly the same as earlier, with the biggest differences being the mapping of the categorical data in the test set and the storing of the predicted labels in a new column.

In [None]:
# Create a mapping dictionary for all categorical columns
mapping_dict = {}

# Creating mapping dictionary for training data
for col in train_data.columns: 
    if train_data[col].dtype == 'object': 
        mapping = {label: idx for idx, label in enumerate(np.unique(train_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    train_data[col] = train_data[col].map(mapping)

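# Mapping categorical columns in the testing data
# The training mapping is reused where one exists so that both datasets share the same codes;
# a mapping is built from the test values only for columns that have no training mapping
for col in test_data.columns:
    if test_data[col].dtype == 'object':
        if col not in mapping_dict:
            mapping_dict[col] = {label: idx for idx, label in enumerate(np.unique(test_data[col]))}
        test_data[col] = test_data[col].map(mapping_dict[col])
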
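# Training Data
X_train = train_data.drop(columns=['label'])
y_train = train_data['label']

# Test Data: keep the same feature columns, in the same order, as the training data
# (any extra test-only column is left out of the features but stays available in test_data)
X_test = test_data[X_train.columns]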

# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)

# Further Pre-Processing Data w/ PCA
# Fit PCA on the training data only, then apply the same projection to the test data
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_imputed)
X_test_pca = pca.transform(X_test_imputed)
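
A projection learned on the training data should be reused for the test data; otherwise the two sets end up in different coordinate systems and the classifier sees test features it was never trained on. The sketch below illustrates the difference between `transform` and a second fit on random toy arrays (an assumption; the names `train_demo`, `test_demo`, and `pca_demo` are hypothetical).

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy arrays standing in for the training and testing features
rng = np.random.default_rng(1)
train_demo = rng.normal(size=(200, 5))
test_demo = rng.normal(loc=0.5, size=(50, 5))

# Learn the projection from the training data
pca_demo = PCA(n_components=2)
train_proj = pca_demo.fit_transform(train_demo)
components_from_train = pca_demo.components_.copy()

# transform() only applies the learned projection; the fitted components stay the same
test_proj = pca_demo.transform(test_demo)
print("Components unchanged after transform:",
      np.allclose(components_from_train, pca_demo.components_))

# Fitting a new PCA on the test data learns its own components instead
refit = PCA(n_components=2).fit(test_demo)
print("Refit components match training components:",
      np.allclose(components_from_train, refit.components_))
```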

## Evaluating Testing Dataset and Generating CSV File

Once the training and testing datasets have been pre-processed, the Random Forest classifier is fitted on the training data and used to predict labels for the testing dataset. The predictions are then written to a CSV file.

In [None]:
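# Introduce the final model (default parameters)
rf2 = RandomForestClassifier()

# Fit on the full pre-processed training data
rf2.fit(X_train_pca, y_train)

# Predict labels for the test data and write them to the submission file
rf2_pred = rf2.predict(X_test_pca)
result = pd.DataFrame({'id': test_data.id, 'label': rf2_pred})
result.to_csv('submission.csv', index=False)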

## Final Visualization

Once the CSV file has been generated, the submission DataFrame is printed and its label counts are inspected as a final check before submission.

In [None]:
submission = pd.read_csv('/kaggle/working/submission.csv')
print(submission)

value_counts = submission['label'].value_counts()
print(value_counts)