# Machine Learning Projects Notebook

This repository contains three machine learning projects built for AI Model Share competitions as part of Advanced Projects in Machine Learning, a Quantitative Methods in the Social Sciences course at Columbia University.

## Project 1 - World Happiness Classification, Tabular Data

[project notebook](https://github.com/AmeliaSys/Machine-Learning-Projects/blob/main/World%20Happiness%20Classification/Assignment_1.ipynb)

This project used tabular data to classify each country's happiness level into one of five categories (very high, high, average, low, very low). The dataset consisted of 7 numerical features and dummy variables for the region and sub-region of each country.

Using the Random Forest SFM (SelectFromModel) method, the feature importance of each variable was extracted to determine the most important features in the dataset. The numerical features have higher feature importance than the region and sub-region dummies. GDP per capita has the highest feature importance, followed by healthy life expectancy.
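As a rough illustration of the feature-importance step, the sketch below fits a Random Forest inside scikit-learn's `SelectFromModel`, assuming that is what "SFM" refers to. The data is synthetic and the column names, the `threshold="mean"` setting, and the forest size are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: the column names are illustrative, not the real dataset.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "gdp_per_capita": rng.normal(size=n),
    "healthy_life_expectancy": rng.normal(size=n),
    "region_dummy": rng.integers(0, 2, size=n),
})

# Build a 5-class label that depends mostly on the first column.
score = 3.0 * X["gdp_per_capita"] + 1.0 * X["healthy_life_expectancy"]
y = pd.qcut(score, q=5, labels=False)

# Fit the forest through SelectFromModel; features at or above the mean
# importance are kept (the threshold is an assumption for this sketch).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest, threshold="mean").fit(X, y)

importances = pd.Series(
    selector.estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])

print(importances)
print("Selected features:", selected)
```

Sorting the importances makes it easy to compare numerical features against the dummy columns at a glance.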

The models experimented with were:
1. Support Vector Machines
2. Random Forest Classifier
3. Logistic Regression
4. Gradient Boosting Trees

Logistic Regression, Random Forest, and Support Vector Machine models tuned with GridSearchCV performed very similarly under stratified k-fold cross-validation, each with an accuracy of about 0.89. A basic SVC model generalized best to the test data, with an f1-score of 0.499.
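The tuning and evaluation setup described above could look roughly like the following sketch, which runs `GridSearchCV` over an SVC with a `StratifiedKFold` splitter and then scores a held-out split. The synthetic data, the parameter grid, the scaling step, and the macro-averaged f1-score are all assumptions made for illustration; the notebook's actual settings may differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 5-class data standing in for the happiness dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}  # hypothetical grid

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

cv_accuracy = search.best_score_
test_f1 = f1_score(y_test, search.predict(X_test), average="macro")

print("Best params:", search.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, test macro-F1: {test_f1:.3f}")
```

Reporting the cross-validated score next to a held-out score helps show how far apart the two can be.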

The top models on the leaderboard reached an f1-score of 0.64 and used Sequential approaches, a model type which I did not experiment with. A Random Forest model on the leaderboard also reached an f1-score of 0.58, which is substantially higher than the results of my models.

## Project 2 - X-ray Classification, Image Data

[project notebook](https://github.com/AmeliaSys/Machine-Learning-Projects/blob/main/Xray%20Classification/Assignment_2.ipynb)

This project uses chest X-ray images of healthy patients, patients with COVID-19, and patients with pneumonia to build classification models. The dataset consists of 4032 images split evenly across the three categories (1344 in each).

The types of models experimented with were:
1. Simple Convolutional Neural Networks (CNNs)
2. CNN with Fire Modules
3. VGG16 Transfer Learning
4. CNN with Conv1D layers
5. CNN with L1 and L2 Regularization

We also experimented with data augmentation, varying the rotation, zoom, width-shift, and height-shift parameters.
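To make the four augmentation parameters concrete, here is a minimal sketch that applies a random rotation, shift, and zoom to a single grayscale image with `scipy.ndimage`. This is a stand-in for illustration only: the notebook's own augmentation tooling is not shown here, and the helper names `augment` and `center_fit`, along with the ranges used, are hypothetical.

```python
import numpy as np
from scipy import ndimage


def center_fit(image, shape):
    """Center-crop or edge-pad a 2-D image so it has exactly `shape`."""
    out = image
    for axis, target in enumerate(shape):
        size = out.shape[axis]
        if size > target:
            start = (size - target) // 2
            out = np.take(out, np.arange(start, start + target), axis=axis)
        elif size < target:
            before = (target - size) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, target - size - before)
            out = np.pad(out, pad, mode="edge")
    return out


def augment(image, rng, max_rotation=10.0, max_shift=0.1, max_zoom=0.1):
    """Randomly rotate, shift, and zoom a 2-D image, keeping its shape."""
    height, width = image.shape

    # Rotation in degrees, without enlarging the canvas.
    angle = rng.uniform(-max_rotation, max_rotation)
    out = ndimage.rotate(image, angle, reshape=False, mode="nearest")

    # Height and width shifts as fractions of the image size.
    dy = rng.uniform(-max_shift, max_shift) * height
    dx = rng.uniform(-max_shift, max_shift) * width
    out = ndimage.shift(out, shift=(dy, dx), mode="nearest")

    # Zoom in or out, then crop or pad back to the original size.
    factor = 1.0 + rng.uniform(-max_zoom, max_zoom)
    out = ndimage.zoom(out, factor, mode="nearest")
    return center_fit(out, (height, width))


rng = np.random.default_rng(0)
image = rng.random((64, 64))  # random pixels standing in for an X-ray
augmented = augment(image, rng)
print(image.shape, augmented.shape)
```

Each call draws new random parameters, so the same source image yields a different variant every time it is augmented.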

The top-performing model was a four-layer CNN with Conv1D layers, L1 and L2 regularization, ReLU activation, and SGD optimization. My models placed around 100th on the leaderboard.

## Project 3 - IMDB Sentiment Analysis, Text Data

[project notebook](https://github.com/AmeliaSys/Machine-Learning-Projects/blob/main/IMDB%20Sentiment%20Analysis/Assignment_3.ipynb)

The third project uses text data from 6900 IMDB movie reviews to classify each review's sentiment as positive or negative.

The models built experiment with the following techniques and modeling approaches:
1. LSTM cells
2. Embedding layers
3. Conv1D layers
4. Transfer learning with GloVe embeddings
5. Bidirectional LSTM
6. Maximum length of text used
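As a small illustration of the last item, the maximum text length is typically enforced by cutting long token sequences and padding short ones to a fixed size before they reach the embedding layer. The helper `pad_or_truncate`, the padding id of 0, and the choice to pad and cut at the end are assumptions for this sketch, not details from the notebook.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return exactly max_len ids: cut long sequences, right-pad short ones."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


# Two tokenized reviews of different lengths (the ids are made up).
reviews = [[12, 7, 93, 4, 51], [8, 2]]
batch = [pad_or_truncate(review, max_len=4) for review in reviews]
print(batch)  # [[12, 7, 93, 4], [8, 2, 0, 0]]
```

Keras offers the same operation as `pad_sequences`, whose defaults pad and truncate at the start of the sequence rather than the end. A larger maximum length keeps more of each review at the cost of longer training time.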

A more complex approach of 3 LSTM and embedding layers with dropout produced the strongest results, with an f1-score of 0.81 on the test data. My top-performing model is ranked #17 on the leaderboard. The top-performing models reached an f1-score of 0.83 with even more complex architectures.