# Machine Learning Project

1. The main goal of this project is to solve a wine classification problem using machine learning algorithms.
We will try to predict wine quality from its features, using the following supervised learning algorithms:
- Decision Trees
- Multilayer Perceptron
- K-Nearest Neighbors

For this purpose, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository. The dataset consists of 4898 instances of white wine and 1599 instances of red wine, each described by 11 features and 1 target variable. The target variable is the wine quality, a score between 0 and 10, where a higher score means better quality. The features are:

* fixed acidity
* volatile acidity
* citric acid
* residual sugar
* chlorides
* free sulfur dioxide
* total sulfur dioxide
* density
* pH
* sulphates
* alcohol
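As a minimal sketch of how one of the wine files might be loaded and split into features and target: the function name `load_wine` is hypothetical, and the sketch assumes the CSV layout distributed by UCI (semicolon-separated, with a header row and a `quality` column). A tiny in-memory sample with illustrative values stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
TARGET = "quality"


def load_wine(source):
    """Read one wine CSV (path or file-like object) and return (X, y)."""
    df = pd.read_csv(source, sep=";")
    missing = set(FEATURES + [TARGET]) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[FEATURES], df[TARGET]


# Illustrative two-row sample in the assumed layout (not the real dataset).
sample = io.StringIO(
    ";".join(FEATURES + [TARGET]) + "\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)
X, y = load_wine(sample)
print(X.shape, y.tolist())
```

With the real data, `load_wine` would be called once per file (red and white), passing the local path of the downloaded CSV instead of the in-memory sample.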

2. The project is divided into 3 parts:
* Data preprocessing
* Classification
* Evaluation

3. The project is written in Python 3.11.5. The following libraries are used:
* numpy
* pandas
* matplotlib
* seaborn
* sklearn
* copy
* os 
    
The third-party libraries can be installed using pip; `copy` and `os` are part of the Python standard library. The project folder contains a requirements.txt file. To install the dependencies, run the following command in the project folder: `pip install -r requirements.txt`


# Data Understanding

The dataset was explored in the VisualExplore.ipynb notebook. It covers basic statistics, correlation analysis, quality distributions, box plots for outlier detection, and Recursive Feature Elimination to identify key features. All the code and results are presented there.

# Data Preparation

To prepare the data, we will:
* Randomly remove 10%, 20%, and 30% of the feature values of each dataset and explore two different strategies for handling missing values;
* Experiment with data normalization and data discretization methods, applying these steps to the original, unchanged dataset.
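The random-removal step could be sketched as follows. The README does not name the two missing-value strategies, so dropping incomplete rows and mean imputation are used here purely as assumptions; the helper `remove_random_values` and the synthetic three-column frame are hypothetical stand-ins for the notebook code and the wine features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def remove_random_values(X, fraction, seed=0):
    """Return a copy of X with `fraction` of its cells replaced by NaN."""
    rng = np.random.default_rng(seed)
    n_remove = int(round(fraction * X.size))
    # Pick distinct cell positions in the flattened table.
    flat_idx = rng.choice(X.size, size=n_remove, replace=False)
    mask = np.zeros(X.size, dtype=bool)
    mask[flat_idx] = True
    return X.mask(mask.reshape(X.shape))


# Synthetic stand-in for the wine features (assumption, not the real data).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["alcohol", "pH", "density"])

for fraction in (0.1, 0.2, 0.3):
    X_missing = remove_random_values(X, fraction)

    # Assumed strategy A: drop every row that has at least one missing value.
    X_dropped = X_missing.dropna()

    # Assumed strategy B: replace each missing value with its column mean.
    imputer = SimpleImputer(strategy="mean")
    X_imputed = pd.DataFrame(
        imputer.fit_transform(X_missing), columns=X.columns, index=X.index
    )

    print(f"{fraction:.0%} removed: {len(X_dropped)} complete rows left, "
          f"{int(X_imputed.isna().sum().sum())} NaNs after imputation")
```

Dropping rows quickly shrinks the dataset as the removal rate grows, while imputation keeps every row at the cost of replacing real values with estimates.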

These steps are implemented in separate notebooks to keep the code clear. The notebooks are named as follows:
* RandomRemove.ipynb
* DataNormDisc.ipynb
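For illustration, normalization and discretization might look like the sketch below. The concrete methods are assumptions (min-max scaling, standardization, and equal-width binning with scikit-learn), since the README does not say which ones DataNormDisc.ipynb uses, and the random matrix is a stand-in for the wine features.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Synthetic stand-in for two numeric wine features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 2))

# Normalization: rescale every column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit variance per column.
X_standard = StandardScaler().fit_transform(X)

# Discretization: split each column into 4 equal-width bins, coded 0..3.
discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_binned = discretizer.fit_transform(X)

print("min-max range:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("standardized mean:", X_standard.mean(axis=0).round(6))
print("bin codes:", np.unique(X_binned))
```

Distance-based and gradient-based models such as K-Nearest Neighbors and the Multilayer Perceptron are usually sensitive to feature scale, whereas decision trees are largely unaffected by it.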


# Modeling

For modeling, we will use the following algorithms:
* Decision Trees
* Multilayer Perceptron
* K-Nearest Neighbors

Each algorithm is implemented in a separate notebook to keep the code clear. The notebooks are named as follows:
* DecisionTrees.ipynb
* MLP.ipynb
* KNN.ipynb
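A minimal sketch of how the three algorithms could be compared with scikit-learn is shown below. The hyperparameters, the 5-fold cross-validation, and the synthetic 11-feature dataset are all assumptions made for illustration; the actual settings are in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features, like the wine data (assumption).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=0
)

# Illustrative hyperparameters, not the ones tuned in the notebooks.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Multilayer Perceptron": MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=500, random_state=0
    ),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```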



# Evaluation

## Dataset evaluation
The most important property of the dataset is the imbalance of the target variable. Although the quality scale runs from 0 to 10, the dataset only contains scores from 3 to 9. Most of the values are 5 and 6, heavily outnumbering the other scores.
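The imbalance can be inspected with a few lines of pandas. The labels below are hypothetical values chosen only to make the snippet runnable; with the real data they would come from the `quality` column.

```python
import pandas as pd

# Hypothetical quality labels (not taken from the dataset).
quality = pd.Series([5, 6, 5, 6, 6, 5, 7, 5, 6, 4, 6, 5, 8, 6, 5, 3], name="quality")

# Count every score on the full 0-10 scale, including scores that never occur.
counts = quality.value_counts().reindex(range(0, 11), fill_value=0)
share = (counts / counts.sum()).round(3)

# Ratio between the largest and the smallest class that is actually present.
present = counts[counts > 0]
imbalance_ratio = present.max() / present.min()

print(pd.DataFrame({"count": counts, "share": share}))
print(f"largest / smallest present class: {imbalance_ratio:.1f}")
```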

## Decision Trees
The model looked very promising after SMOTE oversampling, reaching high accuracy.
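To illustrate what SMOTE oversampling does, the sketch below is a simplified NumPy re-implementation of its interpolation idea, not a library implementation and not necessarily the code used in the notebooks. The function name `smote_like_oversample`, the parameter `k`, and the two synthetic classes are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each new row: pick a base point, one of its k neighbours,
    # and a random position on the segment between the two.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Synthetic two-class example: 90 majority rows, 10 minority rows (assumption).
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 0.5, size=(10, 2))

X_new = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))
print(np.bincount(y_balanced))
```

Oversampling is usually applied to the training split only, so that synthetic points do not leak into the test set and inflate the measured accuracy.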

## Multilayer Perceptron
The model performs quite well, particularly after data normalization and class balancing through SMOTE data augmentation, reaching accuracies of 87% on the red wine dataset and 78% on the white wine dataset. A comprehensive overview of all experiments and conclusions can be found in the MLP.ipynb notebook.


## K-Nearest Neighbors


