# Wine Classification Model in Databricks

## Overview

This notebook demonstrates the process of building a wine classification model using various machine learning algorithms. The dataset used is the Wine dataset from the UCI Machine Learning Repository. The following classifiers are explored:

- Random Forest Classifier
- Support Vector Classifier (SVC)
- K-Nearest Neighbors Classifier (KNN)

The notebook covers data preprocessing, model training, evaluation, and saving/loading the model for future predictions.

## 1. Importing Libraries

We begin by importing the necessary libraries for data manipulation, model training, and evaluation.

## 2. Loading the Dataset

The dataset is loaded directly from the UCI repository. The dataset contains 13 features and a target variable representing the class of wine.

## 3. Data Exploration

Basic exploration of the dataset is performed using `info()` and `describe()` methods to understand the structure and summary statistics of the data.

## 4. Data Preprocessing

### 4.1 Splitting Labels from Features

The target variable `y` is separated from the features.

### 4.2 Handling Missing Values

Missing values in the features are replaced with the mean of the respective columns.

### 4.3 Data Standardization

Features are standardized using `StandardScaler` to have a mean of 0 and a standard deviation of 1.

## 5. Splitting Data into Train and Test Sets

The dataset is split into training (70%) and testing (30%) sets.

## 6. Model Training

Three different classifiers are trained on the training data.

### 6.1 Random Forest Classifier

### 6.2 Support Vector Classifier

### 6.3 K-Nearest Neighbors Classifier

## 7. Model Evaluation

The accuracy and classification report for each model are computed and displayed.

## 8. Saving and Loading the Best Model

The best-performing model (in this case, the Random Forest Classifier) is saved to disk using `joblib` and then reloaded to make predictions on new data.

### 8.1 Saving the Model

### 8.2 Loading the Model

## 9. Making Predictions with the Trained Model

The loaded model is used to predict the class of new wine samples.

Import necessary Python libraries

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier



Load sample data

In [0]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
df = pd.read_csv(url, delimiter=',', header=None)
columns_labels = [
    "y",
    "Alcohol",
    "Malic acid",
    "Ash",
    "Alcalinity of ash",
    "Magnesium",
    "Total phenols",
    "Flavanoids",
    "Nonflavanoid phenols",
    "Proanthocyanins",
    "Color intensity",
    "Hue",
    "OD280/OD315 of diluted wines",
    "Proline"
]
df.columns = columns_labels

In [0]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   y                             178 non-null    int64  
 1   Alcohol                       178 non-null    float64
 2   Malic acid                    178 non-null    float64
 3   Ash                           178 non-null    float64
 4   Alcalinity of ash             178 non-null    float64
 5   Magnesium                     178 non-null    int64  
 6   Total phenols                 178 non-null    float64
 7   Flavanoids                    178 non-null    float64
 8   Nonflavanoid phenols          178 non-null    float64
 9   Proanthocyanins               178 non-null    float64
 10  Color intensity               178 non-null    float64
 11  Hue                           178 non-null    float64
 12  OD280/OD315 of diluted wines  178 non-null    float64
 13  Proli

Unnamed: 0,y,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


Split labels and features, and clean and normalize features

In [0]:
# Split labels from features
Y = df["y"]
X = df[columns_labels[1:]]
# Replace NaN values with the mean of each column
X = X.apply(lambda x: x.fillna(x.mean()), axis=0)
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit and transform the data
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

Split labels and features and get train and test data

In [0]:
# Get train and test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [0]:
y_train

Out[11]: 138    1.373864
104    0.079960
78     0.079960
36    -1.213944
93     0.079960
         ...   
71     0.079960
106    0.079960
14    -1.213944
92     0.079960
102    0.079960
Name: y, Length: 124, dtype: float64

Train RandomForestClassifier model

In [0]:
RFC_model = RandomForestClassifier(n_estimators=100, random_state=42)
RFC_model.fit(X_train, y_train)

Out[6]: RandomForestClassifier(random_state=42)

Train SVC model

In [0]:
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train, y_train)

Out[7]: SVC(random_state=42)

Train KNeighborsClassifier model

In [0]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

Out[8]: KNeighborsClassifier()

Test RandomForestClassifier model

In [0]:
rfc_y_pred = RFC_model.predict(X_test)
rfc_accuracy = accuracy_score(y_test, rfc_y_pred)

Test SVC Model

In [0]:
svm_y_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_y_pred)

Test KNeighborsClassifier Model

In [0]:
knn_y_pred = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_y_pred)

In [0]:
print(f"RFC Accuracy: {rfc_accuracy * 100:.2f}%")
print(classification_report(y_test, rfc_y_pred))
print(f"SVM Accuracy: {svm_accuracy * 100:.2f}%")
print("SVM Classification Report:")
print(classification_report(y_test, svm_y_pred))
print(f"KNN Accuracy: {knn_accuracy * 100:.2f}%")
print("KNN Classification Report:")
print(classification_report(y_test, knn_y_pred))

RFC Accuracy: 100.00%
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        21
           3       1.00      1.00      1.00        14

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

SVM Accuracy: 98.15%
SVM Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        19
           2       0.95      1.00      0.98        21
           3       1.00      0.93      0.96        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

KNN Accuracy: 96.30%
KNN Classification Report:
              precision    recall  f1-score   support

           1       0.95      1.00      0.97        19
           

Save model

In [0]:
import joblib
joblib.dump(RFC_model, "/wine_quality_model.pkl")

Out[15]: ['/wine_quality_model.pkl']

Load Model

In [0]:
loaded_model = joblib.load("/wine_quality_model.pkl")

Test loaded model

In [0]:
# Load new samples to classify
columns_labels = [
    "y",
    "Alcohol",
    "Malic acid",
    "Ash",
    "Alcalinity of ash",
    "Magnesium",
    "Total phenols",
    "Flavanoids",
    "Nonflavanoid phenols",
    "Proanthocyanins",
    "Color intensity",
    "Hue",
    "OD280/OD315 of diluted wines",
    "Proline"
]
new_samples = [
    [13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285],
    [12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520]
]
new_df = pd.DataFrame(new_samples, columns=columns_labels[1:])
new_df

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285
1,12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520


In [0]:
y_pred = loaded_model.predict(new_df)
print("Prediction by the best model:\n", y_pred)

Prediction by the best model:
 [1 1]


### Conclusión

En este proyecto, se construyó y evaluó un modelo de clasificación de vinos utilizando tres algoritmos de machine learning: Random Forest, Support Vector Classifier (SVM) y K-Nearest Neighbors (KNN). El objetivo era identificar cuál de estos modelos era más efectivo para clasificar las muestras de vino en base a 13 características químicas.

#### Evaluación de Modelos

1. **Random Forest Classifier (RFC):**  
   - **Precisión:** 100.00%
   - **Reporte de Clasificación:** El modelo logró una precisión, recall y f1-score perfectos (1.00) para todas las clases. Esto sugiere que el modelo Random Forest fue capaz de capturar la complejidad de los datos y clasificar correctamente todas las muestras en el conjunto de prueba.

2. **Support Vector Classifier (SVM):**  
   - **Precisión:** 98.15%
   - **Reporte de Clasificación:** El modelo SVM también mostró un rendimiento notable con una precisión del 98.15%. Sin embargo, tuvo una ligera disminución en el recall para la clase 3, lo que indica que este modelo no fue tan efectivo como el Random Forest en clasificar correctamente todas las muestras.

3. **K-Nearest Neighbors (KNN):**  
   - **Precisión:** 96.30%
   - **Reporte de Clasificación:** El modelo KNN obtuvo una precisión del 96.30%, siendo el menos preciso de los tres. Aunque su desempeño es bueno, presentó una menor efectividad en la clasificación de las clases 2 y 3 en comparación con los otros modelos.

#### Selección del Mejor Modelo

El **Random Forest Classifier** fue seleccionado como el mejor modelo debido a su precisión perfecta del 100% y su capacidad para clasificar correctamente todas las muestras en el conjunto de prueba. Este resultado indica que el modelo es altamente confiable para esta tarea específica de clasificación de vinos.

#### Conclusión General

El proceso de desarrollo y evaluación de los modelos ha demostrado la importancia de probar diferentes algoritmos para encontrar el más adecuado para un conjunto de datos específico. En este caso, el modelo Random Forest se destacó por su alta precisión y capacidad para manejar la complejidad del dataset, lo que lo convierte en la mejor opción para el problema de clasificación de vinos. Además, la capacidad del modelo para predecir correctamente nuevas muestras subraya su utilidad en aplicaciones prácticas relacionadas con la clasificación de vinos en función de sus características químicas.

Este proyecto no solo resalta la eficacia de los métodos de machine learning en tareas de clasificación, sino que también subraya la importancia del proceso iterativo de modelado y evaluación para obtener resultados óptimos.