## Wine Quality Prediction Using Logistic Regression

### **Introduction**

In this project, we aim to build a logistic regression model to predict whether a wine is "good" or "bad" based on its chemical properties. The goal is to apply machine learning techniques to classify wine quality and understand the factors influencing it.

### **Outline**

1. **Data Loading and Exploration**
2. **Data Preprocessing**
3. **Model Training**
4. **Model Evaluation**
5. **Results and Interpretation**
6. **Conclusion**


In [1]:
import numpy as np  # NumPy for numerical operations
import pandas as pd  # Pandas for data manipulation and analysis
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import seaborn as sns  # Seaborn for advanced data visualization
from sklearn.preprocessing import StandardScaler  # StandardScaler for feature scaling
from sklearn.linear_model import LogisticRegression  # LogisticRegression for our machine learning model
from sklearn.model_selection import train_test_split  # train_test_split for splitting data into training and testing sets
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Metrics for evaluating the performance of the model

In [2]:
# Load the dataset
df = pd.read_csv('Wine Quality Dataset.csv')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | bad |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | bad |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | bad |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | good |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | bad |


In [3]:
df.info()  # Provides summary of the DataFrame, including data types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   object 
dtypes: float64(11), object(1)
memory usage: 150.0+ KB


In [4]:
df.describe()  # Provides summary statistics for numerical columns in the dataset

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 |


In [5]:
# Convert the 'quality' column to numerical values: 0 for 'bad', 1 for 'good'
df['quality'] = df['quality'].map({'bad': 0, 'good': 1})
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |


In [6]:
# Divide the dataset into predictors (X) and target variable (y)
X = df.drop(['quality'], axis=1)
y = df['quality'] 

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scales the features to have mean 0 and variance 1

In [8]:
# Split the scaled data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=33)
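
The scaling and splitting cells above fit `StandardScaler` on all 1,599 rows before the split, so the test rows contribute to the mean and variance used for scaling. One common alternative, sketched here, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The data below is a synthetic stand-in generated with `make_classification` (an assumption, so the snippet runs on its own), the names `X_demo`, `y_demo`, `pipe`, and `demo_accuracy` are illustrative, and `stratify=y_demo` is an optional extra that the notebook does not use. Scores on the real wine data may differ from the figures reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 1599 rows, 11 numeric features).
X_demo, y_demo = make_classification(n_samples=1599, n_features=11, random_state=33)

# Split the raw (unscaled) features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=33, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then fits the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# When scoring, the test rows are scaled with the training mean and variance.
demo_accuracy = pipe.score(X_te, y_te)
print(f"Accuracy on the synthetic hold-out set: {demo_accuracy:.2f}")
```

To try this on the wine data, the unscaled `X` and `y` from the earlier cell would take the place of `X_demo` and `y_demo`.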

In [9]:
# Initialize the Logistic Regression model
reg_model = LogisticRegression()

# Fit the model to the training data
reg_model.fit(x_train, y_train)  # Trains the model using the training features and target variable

In [10]:
y_pred = reg_model.predict(x_test)  # Predicts the target variable for the test features

In [11]:
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
comparison_df  # Shows a side-by-side comparison of actual and predicted values

| | Actual | Predicted |
|---|---|---|
| 246 | 0 | 0 |
| 1538 | 0 | 1 |
| 1386 | 0 | 0 |
| 539 | 0 | 1 |
| 1466 | 1 | 1 |
| ... | ... | ... |
| 989 | 1 | 1 |
| 780 | 1 | 0 |
| 715 | 1 | 0 |
| 356 | 0 | 1 |


In [12]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.74


## Model Accuracy
Accuracy: 0.74

## Explanation:

Definition: Accuracy measures the percentage of correct predictions made by the model out of all predictions.

Result: An accuracy of 0.74 means the model correctly labelled about 74% of the 320 test wines as "good" or "bad."

Implication: This is a moderate result. Roughly one wine in four is still misclassified, so further analysis, hyperparameter tuning, or a different model could improve it.
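
An accuracy figure is easier to judge next to a simple baseline. The sketch below compares the model with a classifier that always predicts the more frequent class, using counts reported elsewhere in this notebook: the class support from the classification report (150 "bad", 170 "good") and the diagonal of the confusion matrix (108 and 130 correct predictions). The variable names are illustrative.

```python
# Counts read from the notebook's reported outputs (test set of 320 wines).
n_bad, n_good = 150, 170   # class support from the classification report
n_correct = 108 + 130      # diagonal of the confusion matrix

n_total = n_bad + n_good
model_accuracy = n_correct / n_total

# Baseline: always predict the more frequent class.
baseline_accuracy = max(n_bad, n_good) / n_total

print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

On these counts the model scores about 0.74 against a baseline of about 0.53, so the chemical features add a clear but not overwhelming amount of predictive signal.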

In [13]:
print(classification_report(y_test, y_pred))  # Displays precision, recall, F1-score, and support for each class

              precision    recall  f1-score   support

           0       0.73      0.72      0.72       150
           1       0.76      0.76      0.76       170

    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320



## Key Points:

Precision: The model is more precise in predicting "good" wine (76%) compared to "bad" wine (73%).

Recall: The model detects "good" wines somewhat more reliably (recall 0.76) than "bad" wines (recall 0.72).

F1-Score: Performance is fairly balanced across the two classes, with a slightly higher score for "good" wine (0.76) than for "bad" wine (0.72).

Accuracy: Overall, the model correctly classifies 74% of the test instances.
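
The figures in the report can be reproduced by hand from the four confusion-matrix counts printed in this notebook (108, 42, 40, 130). The helper `prf` below is a hypothetical name introduced only for this illustration; it applies the standard definitions of precision, recall, and F1.

```python
# Confusion-matrix counts as printed by the notebook (rows: actual, columns: predicted).
tn, fp, fn, tp = 108, 42, 40, 130


def prf(true_hits, false_alarms, misses):
    """Return precision, recall and F1 for one class."""
    precision = true_hits / (true_hits + false_alarms)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "good" (label 1) is the positive class; for "bad" (label 0) the roles flip.
good = prf(tp, fp, fn)
bad = prf(tn, fn, fp)

for name, (p, r, f1) in [("bad", bad), ("good", good)]:
    print(f"{name:>4}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Rounded to two decimals, the results match the `classification_report` output: 0.73, 0.72, 0.72 for "bad" and 0.76, 0.76, 0.76 for "good".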


In [14]:
cm = confusion_matrix(y_test, y_pred)
print(cm)  # Displays the confusion matrix with counts of true and false predictions

[[108  42]
 [ 40 130]]


## Key Points:

True Negatives (TN): 108 (correctly predicted "bad" wines)

False Positives (FP): 42 ("bad" wines incorrectly predicted as "good")

False Negatives (FN): 40 ("good" wines incorrectly predicted as "bad")

True Positives (TP): 130 (correctly predicted "good" wines)

## Interpretation:

Correct predictions (130 true positives and 108 true negatives) outnumber errors by roughly three to one, so the model identifies both classes reasonably well.

The errors are nearly symmetric: 42 "bad" wines were labelled "good" and 40 "good" wines were labelled "bad." The model shows no strong bias toward either class, and both error types leave room for improvement.
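
A raw 2x2 array is easy to misread, so it can help to label the rows and columns explicitly. The sketch below rebuilds the printed matrix by hand (the names `cm_demo` and `labelled` are illustrative) and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, ordered by label (0 = "bad", 1 = "good").

```python
import numpy as np
import pandas as pd

# Matrix as printed by the notebook: rows are actual classes, columns are
# predicted classes, ordered by label (0 = bad, 1 = good).
cm_demo = np.array([[108, 42],
                    [40, 130]])

labelled = pd.DataFrame(
    cm_demo,
    index=["actual bad", "actual good"],
    columns=["predicted bad", "predicted good"],
)
print(labelled)

# For a 2x2 matrix, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()

false_positive_rate = fp / (fp + tn)  # share of bad wines labelled good
false_negative_rate = fn / (fn + tp)  # share of good wines labelled bad
print(f"False positive rate: {false_positive_rate:.2f}")
print(f"False negative rate: {false_negative_rate:.2f}")
```

In the notebook itself, `cm` from the cell above could be passed to `pd.DataFrame` in place of the hand-typed array.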