# **OPEN-ARC**
---

### Project 4: Red Wine Quality Classification Model:
**Challenge:** Create an AI model capable of classifying the quality of red wine.


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | Accuracy | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | GradientBoostingClassifier  | Kaggle   | ✗  | Red Wine Quality | 72.8%    | [Notebook](https://github.com/Infinitode/OPEN-ARC/Project-4-RWQC/project-4-rwqc.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | Red Wine Quality | Score    | [Notebook](https://github.com) |

---

### Model: Decision Tree Classifier:
This model uses **Grid Search** to tune the decision tree's hyperparameters during training. Grid search takes a predefined `grid` of candidate values for each parameter, trains and scores the model on every combination using cross-validation, and keeps the combination with the highest mean cross-validated accuracy.
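
To make the search space concrete, the sketch below expands a parameter grid into its individual candidates using only the standard library. It assumes the same three parameters as the decision tree grid used in this notebook; `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from itertools import product

# Same candidate values as the decision tree grid in this notebook
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

def expand_grid(grid):
    """Return one dict per combination of parameter values (hypothetical helper)."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)

# Each candidate is trained once per cross-validation fold (cv=5 in this notebook)
n_fits = len(candidates) * 5

print(f"{len(candidates)} candidates, {n_fits} cross-validation fits")
```

With 24 candidates and `cv=5`, the search fits 120 models during cross-validation. By default `GridSearchCV` then refits the best candidate on the full training set, which is the model exposed as `best_estimator_`.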

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

X = data.drop('quality', axis=1)
y = data['quality']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numerical features to [0, 1]. The scaler is fitted on the training
# set only, so no information from the test set leaks into preprocessing.
# (Tree-based models are largely insensitive to feature scaling.)
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns
scaler = MinMaxScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Tune the decision tree model using GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='accuracy')
dt_grid_search.fit(X_train, y_train)

best_params = dt_grid_search.best_params_
best_dt_model = dt_grid_search.best_estimator_

# Evaluate the best decision tree model on the test set
y_pred = best_dt_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"Decision Tree Test Accuracy: {test_acc}")
```

```
Decision Tree Test Accuracy: 0.584375
```


A test accuracy of 58.4% is modest, but it is good enough to serve as a baseline for this project.
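
Accuracy is simply the fraction of test samples whose predicted label matches the true label. As a quick sanity check, the arithmetic below recovers the underlying counts, assuming the 1,599-row dataset and the `test_size=0.2` split used above (scikit-learn rounds the test set size up).

```python
import math

n_samples = 1599              # rows in winequality-red.csv
test_fraction = 0.2           # test_size passed to train_test_split
reported_accuracy = 0.584375  # value printed by the notebook

# train_test_split takes the ceiling of n_samples * test_size for the test set
n_test = math.ceil(n_samples * test_fraction)

# accuracy = correct / total, so correct = accuracy * total
n_correct = round(reported_accuracy * n_test)

print(f"{n_correct} of {n_test} test wines classified correctly")
print(f"accuracy = {n_correct / n_test}")
```

In other words, the tuned decision tree labels 187 of the 320 held-out wines correctly.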

### Model: Gradient Boosting Classifier:
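This code preprocesses the data (removes outliers with an interquartile-range filter and min-max scales the features), runs `GridSearchCV` on a `GradientBoostingClassifier`, and defines a helper function you can use to try out the model interactively. Note that every list in the parameter grid below holds a single value, so the search evaluates just one combination with 5-fold cross-validation.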
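
The two preprocessing steps can be illustrated on a single toy column without any third-party libraries. The values below are made up for illustration, and `iqr_filter`, `fit_min_max`, and `min_max_transform` are hypothetical helpers that mirror, per column, what the IQR filter and `MinMaxScaler` do; they are not the notebook's actual code.

```python
from statistics import quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    # method="inclusive" interpolates linearly, like pandas' default quantile
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def fit_min_max(train_values):
    """Learn the scaling range from the training values only."""
    return min(train_values), max(train_values)

def min_max_transform(values, lo, hi):
    """Map lo to 0 and hi to 1; values outside [lo, hi] are not clipped."""
    return [(v - lo) / (hi - lo) for v in values]

column = [1, 2, 3, 4, 5, 6, 7, 100]   # made-up data with one extreme value
cleaned = iqr_filter(column)          # drops 100

train, test = cleaned[:5], cleaned[5:]
lo, hi = fit_min_max(train)

print("cleaned:", cleaned)
print("train scaled:", min_max_transform(train, lo, hi))
print("test scaled:", min_max_transform(test, lo, hi))
```

Because the minimum and maximum come from the training split only, test values beyond that range map outside [0, 1], as 6 and 7 do here. That is the expected behaviour when the scaler is fitted on training data and then applied to unseen data.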

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

def remove_outliers(df):
    """Drop rows where any column lies more than 1.5 * IQR outside its quartiles."""
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

# Remove outliers from the dataset
# (the filter is applied to every column, including the 'quality' target)
data_cleaned = remove_outliers(data)

print(f"Original dataset shape: {data.shape}")
print(f"Cleaned dataset shape: {data_cleaned.shape}")

X = data_cleaned.drop('quality', axis=1)
y = data_cleaned['quality']

# Identify numerical columns for scaling
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = MinMaxScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Create the model
gb_model = GradientBoostingClassifier(random_state=42)

# Define the parameter grid (a single combination; add values to widen the search)
param_grid = {
    'n_estimators': [200],
    'learning_rate': [0.01],
    'max_depth': [15],
    'min_samples_split': [2],
    'min_samples_leaf': [8]
}

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the model on the test set
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"Gradient Boosting Test Accuracy: {test_acc}")
print(f"Best Parameters: {best_params}")

# Typical value ranges, shown as a hint before each prompt
FEATURE_RANGES = {
    "fixed acidity": "4.6 to 15.9",
    "volatile acidity": "0.12 to 1.58",
    "citric acid": "0.00 to 1.00",
    "residual sugar": "0.9 to 15.5",
    "chlorides": "0.01 to 0.61",
    "free sulfur dioxide": "1 to 72",
    "total sulfur dioxide": "6 to 289",
    "density": "0.9900 to 1. (Compared to the density of water)",
    "pH": "2.74 to 4.01",
    "sulphates": "0.33 to 2",
    "alcohol": "8.4 to 14.9 (% alcohol)",
}

# Function to take user input and predict the quality of red wine
def predict_red_wine_quality():
    input_data = {}
    for feature in X.columns:
        if feature in FEATURE_RANGES:
            print(f"Typically ranges from {FEATURE_RANGES[feature]}")
        value = input(f"Enter the value for '{feature}': ")

        input_data[feature] = [float(value) if feature in numerical_cols else value]

    input_df = pd.DataFrame(input_data)

    # Scale the input data with the scaler fitted on the training set
    input_df[numerical_cols] = scaler.transform(input_df[numerical_cols])

    # Make prediction
    prediction = best_model.predict(input_df)
    print(f"Predicted quality of wine: {prediction[0]}")
```

```
Original dataset shape: (1599, 12)
Cleaned dataset shape: (1179, 12)
Gradient Boosting Test Accuracy: 0.7288135593220338
Best Parameters: {'learning_rate': 0.01, 'max_depth': 15, 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 200}
```


Even with outliers removed, the model reaches a test accuracy of only about 73%, which is still short of ideal. You can try different model architectures or preprocessing steps to achieve better results.
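
It also helps to quantify what the outlier filter cost in data. The arithmetic below is a rough sketch that uses only the shapes and accuracy printed above, and assumes the same `test_size=0.2` split for both models.

```python
import math

rows_before, rows_after = 1599, 1179    # dataset shapes printed above
reported_accuracy = 0.7288135593220338  # gradient boosting test accuracy

# Rows discarded by the IQR filter
removed = rows_before - rows_after
removed_share = removed / rows_before

# train_test_split rounds the test set size up
n_test_full = math.ceil(rows_before * 0.2)     # test rows without filtering
n_test_cleaned = math.ceil(rows_after * 0.2)   # test rows after filtering
n_correct = round(reported_accuracy * n_test_cleaned)

print(f"Rows removed: {removed} ({removed_share:.1%})")
print(f"Test rows: {n_test_full} unfiltered vs {n_test_cleaned} filtered")
print(f"Correct predictions: {n_correct} of {n_test_cleaned}")
```

The filter discards roughly a quarter of the dataset, and the two models are scored on different test sets (320 rows versus 236), so their accuracies are not strictly comparable.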

```python
# Call the prediction function to test out the model
predict_red_wine_quality()

# Sample from the dataset:
# 7.5, 0.52, 0.16, 1.9, 0.085, 12, 35, 0.9968, 3.38, 0.62, 9.5 => 7
```

```
Typically ranges from 4.6 to 15.9
Enter the value for 'fixed acidity':  7.5
Typically ranges from 0.12 to 1.58
Enter the value for 'volatile acidity':  0.52
Typically ranges from 0.00 to 1.00
Enter the value for 'citric acid':  0.16
Typically ranges from 0.9 to 15.5
Enter the value for 'residual sugar':  1.9
Typically ranges from 0.01 to 0.61
Enter the value for 'chlorides':  0.085
Typically ranges from 1 to 72
Enter the value for 'free sulfur dioxide':  12
Typically ranges from 6 to 289
Enter the value for 'total sulfur dioxide':  35
Typically ranges from 0.9900 to 1. (Compared to the density of water)
Enter the value for 'density':  0.9968
Typically ranges from 2.74 to 4.01
Enter the value for 'pH':  3.38
Typically ranges from 0.33 to 2
Enter the value for 'sulphates':  0.62
Typically ranges from 8.4 to 14.9 (% alcohol)
Enter the value for 'alcohol':  9.5
Predicted quality of wine: 5
```


### The End:
This is the end of this project notebook. Make sure to experiment and contribute to help improve the model and its implementation. You can browse more of our free, open-source projects on our GitHub repository: [https://github.com/Infinitode/OPEN-ARC](https://github.com/Infinitode/OPEN-ARC). If you like this project, star the repo and contribute your own implementation, or help others in the community.

~ Infinitode