
## **Wine Quality Classification**

**Data Loading and Inspection**


In [14]:
import pandas as pd

# Load the dataset
data = pd.read_csv('winequality-red.csv', sep=';')

# Display first 5 rows to check the data
print(data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

**Simple Train-Test Split & Logistic Regression Training**

**What was done:**

Converted the wine quality scores into a binary classification problem:

1 = Good quality (quality ≥ 7)

0 = Not good (quality < 7)

Split the dataset into training (80%) and testing (20%) sets, stratifying on the binary label so that both sets keep the same class proportions.

Trained a Logistic Regression model on the training data.

Evaluated performance on the test data using accuracy and a classification report.

**Type of training:**

Supervised learning, evaluated with a single hold-out (train-test) split.

**Algorithm:**
Logistic Regression (linear classification model).
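To illustrate what stratification does in the split described above, here is a minimal sketch on synthetic labels. The arrays `X_demo` and `y_demo` and the 14% positive rate are assumptions chosen for illustration; they are not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the binarized labels: 1000 samples, 14% positives.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 140 + [0] * 860)
X_demo = rng.normal(size=(1000, 3))

# Same call pattern as the notebook: 80/20 split, stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=..., the positive rate should be (nearly) identical everywhere.
print("Positive rate overall:", y_demo.mean())
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance, which makes the per-class metrics harder to compare across runs.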



In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Define features and target
X = data.drop('quality', axis=1)
y = data['quality']

# For simplicity, convert quality scores into binary classes: good (>=7) and not good (<7)
y_binary = (y >= 7).astype(int)

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42, stratify=y_binary)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.89375

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.98      0.94       277
           1       0.74      0.33      0.45        43

    accuracy                           0.89       320
   macro avg       0.82      0.65      0.70       320
weighted avg       0.88      0.89      0.88       320
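The report above pairs a high accuracy with a low recall for class 1. The sketch below shows how both figures can arise from the same predictions. The four confusion counts are inferred from the rounded report values and supports; they are an assumption, not notebook output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Counts inferred from the rounded report (assumption, not printed by the notebook):
# 277 true class-0 samples and 43 true class-1 samples.
tn, fp, fn, tp = 272, 5, 29, 14

# Rebuild label vectors that reproduce exactly those counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

acc = accuracy_score(y_true, y_hat)
rec_pos = recall_score(y_true, y_hat)               # recall for class 1
rec_neg = recall_score(y_true, y_hat, pos_label=0)  # recall for class 0

print("Accuracy:", acc)
print("Recall (class 1):", rec_pos)
print("Recall (class 0):", rec_neg)
```

Because class 0 makes up most of the test set, its high recall dominates the accuracy figure even though most class 1 samples are missed.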



**Train-Validation-Test Split & Logistic Regression**

**What was done:**

Split dataset into three parts:

60% training

20% validation

20% testing

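The split is done in two steps: first the test set (20% of the data) is separated, then the remaining 80% is split into 75% training and 25% validation, which corresponds to 60% and 20% of the full dataset.

Trained Logistic Regression on the training set.

Evaluated on the validation set, which is the set intended for tuning or monitoring the model before the final evaluation. No hyperparameters were actually tuned in this run.

Evaluated the same model once on the held-out test set.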

**Type of training:**

Supervised learning with train-validation-test split for model tuning and unbiased final evaluation.

**Algorithm:**
Logistic Regression.
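The 60/20/20 proportions described above follow from chaining the two `test_size` values. A small arithmetic check, using exact fractions (the variable names are illustrative only):

```python
from fractions import Fraction

test_frac = Fraction(1, 5)     # first split: test_size=0.2
val_of_rest = Fraction(1, 4)   # second split: test_size=0.25 of the remainder

# Fractions of the full dataset after both splits.
val_frac = (1 - test_frac) * val_of_rest
train_frac = (1 - test_frac) * (1 - val_of_rest)

print(train_frac, val_frac, test_frac)  # 3/5 1/5 1/5
```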

In [16]:
# First, split into 80% (train + val) and 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary)

# Then, split train_val into 75% train and 25% val (which gives 60% train, 20% val of total)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train on training set
model.fit(X_train, y_train)

# Evaluate on validation set
y_val_pred = model.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("\nValidation Classification Report:\n", classification_report(y_val, y_val_pred))

# After (optional) tuning, evaluate on test set
y_test_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nTest Classification Report:\n", classification_report(y_test, y_test_pred))

Validation Accuracy: 0.859375

Validation Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.96      0.92       276
           1       0.47      0.20      0.29        44

    accuracy                           0.86       320
   macro avg       0.68      0.58      0.60       320
weighted avg       0.83      0.86      0.83       320

Test Accuracy: 0.8875

Test Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.98      0.94       277
           1       0.71      0.28      0.40        43

    accuracy                           0.89       320
   macro avg       0.80      0.63      0.67       320
weighted avg       0.87      0.89      0.87       320



**Stratified K-Fold Cross-Validation with Logistic Regression**

**What was done:**

Used Stratified K-Fold Cross-Validation (5 folds):

Dataset split into 5 folds, each preserving the overall class proportions.

Model trained on 4 folds and evaluated on the remaining fold.

Repeated so that each fold serves as the evaluation fold exactly once.

Calculated accuracy for each fold.

Reported mean and standard deviation of accuracies.

**Type of training:**

Supervised learning with cross-validation, which gives a more stable performance estimate than a single train-test split.

**Algorithm:**
Logistic Regression.
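The fold-by-fold procedure described above can be written out as an explicit loop. This is a sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the wine dataset); it compares the manual loop with the `cross_val_score` helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced stand-in data (roughly 85% / 15%).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

model = LogisticRegression(max_iter=1000)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], y_hat))
manual_scores = np.array(manual_scores)

# The helper should produce the same per-fold accuracies, since the
# splitter is seeded and the model is refit from scratch on each fold.
helper_scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring="accuracy")

print("Manual loop:", manual_scores)
print("cross_val_score:", helper_scores)
print("Mean / std:", manual_scores.mean(), manual_scores.std())
```

Writing the loop by hand is mainly useful when per-fold predictions or additional metrics are needed; for a single score per fold, `cross_val_score` is the shorter route.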

In [18]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Define stratified K-Fold (to maintain class balance)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation with accuracy as scoring metric
cv_scores = cross_val_score(model, X, y_binary, cv=kf, scoring='accuracy')

print("Cross-validation accuracies for each fold:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
print("Standard deviation of CV accuracy:", cv_scores.std())

Cross-validation accuracies for each fold: [0.875      0.88125    0.9        0.85625    0.89341693]
Mean CV accuracy: 0.8811833855799373
Standard deviation of CV accuracy: 0.015255405252639943
