# CS 7641 Machine Learning - Assignment 1

## 1. Abstract

This assignment compares several supervised learning techniques and their performance on two data sets.

## 2. Introduction

Five learning algorithms will be tested and compared:
- Decision trees with some form of pruning
- Neural networks
- Boosting
- Support Vector Machines
- k-nearest neighbors

I use the implementations from the `sklearn` package to run the learning process and compare only the results. Although I do not implement the algorithms myself, I briefly explain which functions I chose, how they work, and how I set their parameters.

## 3. Dataset

For this project, I selected the following datasets:
- [Check Loan Eligibility](https://www.kaggle.com/datasets/mukeshmanral/check-loan-eligibility)
- [Student Performance (Math only)](https://www.kaggle.com/datasets/whenamancodes/student-performance)

The loan eligibility data set is a set of processed data for modeling the eligibility check, based on features such as gender, education, income, loan amount, and credit history. It consists of 12 columns, of which 10 are integer (binary) columns and 2 are decimal columns. It has a clear target (`y`) column named `Loan_Status`. I evaluated it from an ethical perspective in another OMSCS course (AI Ethics), and it is also an interesting data set for examining historical biases. I therefore use it here to see whether different algorithms induce different biases towards different groups.

The student performance data set was drawn from student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected using school reports and questionnaires. I selected only the data for the Math subject. The data was modeled under binary and five-level classification tasks as well as regression tasks. An interesting point about this data set is that it includes three potential `y` columns, G1, G2, and G3, corresponding to the 1st, 2nd, and 3rd period grades. However, G2 depends on G1, and G3 depends on both G1 and G2, which could make the data useful for Bayesian analysis in the future. I chose it for this course because it is versatile and easy to use with multiple models.

## 4. Method

The implementation of each method is introduced briefly in the following sections. Many details are skipped because they are available in the `sklearn` [documentation](https://scikit-learn.org/stable/).

### 4.1 Decision trees with some form of pruning

I used `sklearn.tree.DecisionTreeClassifier`, which provides both the training step and a cost-complexity pruning helper (`cost_complexity_pruning_path`); the pruning strength itself is set through the `ccp_alpha` parameter. By default, it uses [Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) as the criterion to measure the quality of a split (this is not the same quantity as the Gini coefficient used in economics). The default `splitter='best'` strategy picks, at each node, the split that gives the largest reduction in impurity.
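To make the default split criterion concrete, the following is a minimal sketch of Gini impurity, $1 - \sum_k p_k^2$, where $p_k$ is the fraction of samples in a node that belong to class $k$. The helper names `gini_impurity` and `weighted_split_impurity` are my own illustrations and are not part of `sklearn`.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_split_impurity(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# A perfectly mixed binary node has impurity 0.5; a pure node has impurity 0.
mixed = gini_impurity([0, 0, 1, 1])
pure = gini_impurity([1, 1, 1])
# A split that separates the two classes completely leaves no impurity.
clean_split = weighted_split_impurity([0, 0], [1, 1])
```

A split is preferred when its weighted child impurity is lower than the impurity of the parent node.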

Detailed code in file clf_tree.py.


- Tree with 209 nodes, alpha 0: train score 1.0, test score 0.7402597402597403
- Tree with 11 nodes, alpha 0.005: train score 0.808695652173913, test score 0.8311688311688312
- Tree with 21 nodes, alpha 0: train score 1.0, test score 0.9494949494949495
- Tree with 5 nodes, alpha 0.012: train score 0.9662162162162162, test score 0.9696969696969697
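
As a rough illustration of how an unpruned tree can be compared with pruned ones, here is a minimal sketch. It is not the contents of `clf_tree.py`: it assumes synthetic data from `make_classification` in place of the real data sets, and the split sizes and `random_state` values are arbitrary choices of mine, so it will not reproduce the scores reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, not the loan or student data sets.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Effective alphas at which the fully grown tree loses one or more subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    results.append(
        (alpha, clf.tree_.node_count, clf.score(X_train, y_train), clf.score(X_test, y_test))
    )

# Picking alpha by test score is for illustration only; a validation split
# or cross-validation would avoid tuning on the test set.
best_alpha, best_nodes, best_train, best_test = max(results, key=lambda r: r[3])
print(f"Current tree has {best_nodes} nodes, with alpha: {best_alpha:.4f}")
print("Train score", best_train)
print("Test score", best_test)
```

Larger values of `ccp_alpha` remove more of the tree, which typically lowers the train score while the test score may improve up to a point.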




### 4.2 Neural networks

I used `sklearn.neural_network.MLPClassifier`. By default, it uses a single hidden layer of 100 units (`hidden_layer_sizes=(100,)`), which is what I use here. I set the `activation` parameter to `"logistic"`, which is the logistic sigmoid function we discussed in class.

Detailed code in file neural.py.

- Train score 0.6934782608695652, test score 0.7272727272727273
- Train score 0.956081081081081, test score 0.9292929292929293
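
The configuration described in this section can be sketched roughly as follows. This is not the contents of `neural.py`: the synthetic data, the `StandardScaler` step, `max_iter=1000`, and the `random_state` values are assumptions I added so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not the loan or student data sets.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One hidden layer of 100 units with the logistic sigmoid activation.
# Scaling and the larger max_iter are my additions to help convergence.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(100,),
        activation="logistic",
        max_iter=1000,
        random_state=0,
    ),
)
mlp.fit(X_train, y_train)

mlp_train_score = mlp.score(X_train, y_train)
mlp_test_score = mlp.score(X_test, y_test)
print("Train score", mlp_train_score)
print("Test score", mlp_test_score)
```

Neural networks are sensitive to feature scale, which is why the sketch standardizes the inputs before fitting.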


### 4.3 Boosting with Bagging

I used `sklearn.ensemble.BaggingClassifier` as the ensemble method on top of the decision tree model. Strictly speaking, bagging is a different ensemble technique from boosting: it trains the base models independently on random subsets and combines their votes, whereas boosting trains them sequentially on reweighted data. I set `max_samples=0.3` and `max_features=0.8`, so each base model is trained on a random 30% of the samples and 80% of the features.

Detailed code in file bagged.py.

- Train score 0.808695652173913, test score 0.8246753246753247
- Train score 0.9662162162162162, test score 0.9696969696969697
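
A minimal sketch of this bagging setup is shown here. It is not the contents of `bagged.py`: the synthetic data, `n_estimators=10`, and the `random_state` values are assumptions of mine; only `max_samples=0.3` and `max_features=0.8` come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, not the loan or student data sets.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Each of the 10 trees sees a random 30% of the samples and 80% of the features.
# The base estimator is passed positionally so the call works across sklearn versions.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=10,
    max_samples=0.3,
    max_features=0.8,
    random_state=0,
)
bag.fit(X_train, y_train)

bag_train_score = bag.score(X_train, y_train)
bag_test_score = bag.score(X_test, y_test)
print("Train score", bag_train_score)
print("Test score", bag_test_score)
```

Because every tree is fitted on a different subsample, the averaged prediction tends to have lower variance than a single fully grown tree.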



### 4.4 Support Vector Machines

I used `sklearn.svm.SVC` as my SVM implementation. I set `gamma='scale'`, which derives the kernel coefficient from the data as `1 / (n_features * X.var())`; the other parameters are kept at their default values. For the kernel function, I test both `'rbf'` and `'sigmoid'`.

Detailed code in file svm.py.

- Train score 0.808695652173913, test score 0.8246753246753247
- Train score 0.9662162162162162, test score 0.9696969696969697
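
The kernel comparison described in this section could look roughly like the sketch here. It is not the contents of `svm.py`: the synthetic data, the `StandardScaler` step, and the `random_state` values are assumptions I added for a self-contained example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in data, not the loan or student data sets.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

svm_scores = {}
for kernel in ("rbf", "sigmoid"):
    # gamma="scale" sets the kernel coefficient to 1 / (n_features * X.var()).
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
    model.fit(X_train, y_train)
    svm_scores[kernel] = (model.score(X_train, y_train), model.score(X_test, y_test))

for kernel, (train_score, test_score) in svm_scores.items():
    print(f"{kernel}: train score {train_score}, test score {test_score}")
```

Keeping everything except `kernel` fixed makes any difference in scores attributable to the kernel choice.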



### 4.5 k-Nearest Neighbors

I used `sklearn.neighbors.KNeighborsClassifier` as my k-nearest neighbors implementation.

Detailed code in file knn.py.

- Train score 0.808695652173913, test score 0.8246753246753247
- Train score 0.9662162162162162, test score 0.9696969696969697
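
Since the text does not state which `k` was used, the sketch here simply tries a few values, including `sklearn`'s default of `n_neighbors=5`. It is not the contents of `knn.py`: the synthetic data, the candidate values of `k`, the `StandardScaler` step, and the `random_state` values are all assumptions of mine.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not the loan or student data sets.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

knn_scores = {}
for k in (1, 5, 15):
    # Distances depend on feature scale, so the inputs are standardized first.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    knn_scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_score, test_score) in knn_scores.items():
    print(f"k={k}: train score {train_score}, test score {test_score}")
```

With `k=1` every training point is its own nearest neighbor, so the train score says little about generalization; the test score is the more informative number.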



## 5. Discussion

- How I conducted my experiments
- Discussion of my baseline learning/validation curves and my learning curves with tuned parameters
- Wall-clock time and what I observed
- Summary of what I observed from my experiments
- References