## FINAL ASSIGNMENT
#### Dataset: Red Wine Quality

##### The dataset covers the red variant of "Vinho Verde" wine. It contains 1599 data points, where the features are physicochemical properties and the target is quality, an integer score from 0 to 10. Your task is to classify whether a given wine is good based on its physicochemical properties.

##### (i) Create a new column in the dataset with binary values (i.e., 0 or 1) indicating whether the wine is of good quality. You can categorise wines with quality >= 7 as good quality. Drop the original ‘quality’ column.

##### (ii) Perform the data pre-processing steps that you feel are important for the given dataset.

##### (iii) Apply the following classification algorithms to the given dataset (you may use the scikit-learn library unless ‘from scratch’ is specified):

 ##### Logistic Regression
 ##### K-Nearest Neighbors
 ##### Decision Trees Classifier
 ##### Random Forest Classifier
 ##### Logistic Regression from Scratch 

##### (iv) Evaluate all your models based on the accuracy score and F1 score obtained on the test dataset.



In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load the dataset
df = pd.read_csv("winequality-red.csv")

# (i) Create a new column for good quality
df['good_quality'] = (df['quality'] >= 7).astype(int)

# Drop the original 'quality' column
df.drop('quality', axis=1, inplace=True)

# (ii) Data pre-processing steps
# Split the data into features (X) and target variable (y)
X = df.drop('good_quality', axis=1)
y = df['good_quality']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features

# Create a StandardScaler object
scaler = StandardScaler() 
# Fit the scaler on the training data and simultaneously transform the features
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data using the previously fitted scaler (ensures same scaling parameters as training data)
X_test_scaled = scaler.transform(X_test)

# (iii) Apply classification algorithms

# Logistic Regression

# Create a Logistic Regression model instance
logistic_model = LogisticRegression()
# Train the Logistic Regression model on the scaled training data
logistic_model.fit(X_train_scaled, y_train)
# Use the trained Logistic Regression model to make predictions on the scaled test data
logistic_preds = logistic_model.predict(X_test_scaled)

# K-Nearest Neighbors

# Create an instance of the K-Nearest Neighbors (KNN) classifier model
knn_model = KNeighborsClassifier()
# Train the KNN model using the scaled training data and corresponding target values
knn_model.fit(X_train_scaled, y_train)
# Use the trained KNN model to make predictions on the scaled test data
knn_preds = knn_model.predict(X_test_scaled)

# Decision Trees Classifier

# Create an instance of the Decision Tree classifier model
# (no random_state is set, so results can differ slightly between runs)
tree_model = DecisionTreeClassifier()
# Train the Decision Tree model using the scaled training data and corresponding target values
tree_model.fit(X_train_scaled, y_train)
# Use the trained Decision Tree model to make predictions on the scaled test data
tree_preds = tree_model.predict(X_test_scaled)

# Random Forest Classifier

# Create an instance of the Random Forest classifier model
# (no random_state is set, so results can differ slightly between runs)
rf_model = RandomForestClassifier()
# Train the Random Forest model using the scaled training data and corresponding target values
rf_model.fit(X_train_scaled, y_train)
# Use the trained Random Forest model to make predictions on the scaled test data
rf_preds = rf_model.predict(X_test_scaled)

# Define a custom Logistic Regression class from scratch
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.0001, num_iterations=1500):
        # Initialize the logistic regression model with default or user-defined learning rate and number of iterations
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        # Initialize weights and bias to None
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        # Sigmoid activation function used to squash values between 0 and 1
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # Get the number of samples (m) and features (n) from the input data X
        m, n = X.shape
        # Initialize weights to zeros and bias to zero
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.num_iterations):
            # Calculate the predicted values
            y_pred = self.sigmoid(np.dot(X, self.weights) + self.bias)

            # Calculate the gradients
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)

            # Update weights and bias using gradient descent
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        # Round the predicted probabilities (threshold at 0.5) to get class labels, returned as floats (0.0 or 1.0)
        return np.round(self.sigmoid(np.dot(X, self.weights) + self.bias))

# Train the Logistic Regression model from scratch

# Create an instance of the Logistic Regression model from scratch
lr_scratch = LogisticRegressionScratch(learning_rate=0.0001, num_iterations=1500)
# Train the Logistic Regression model from scratch using the scaled training data
lr_scratch.fit(X_train_scaled, y_train)
# Use the trained Logistic Regression model from scratch to make predictions on the scaled test data
lr_scratch_preds = lr_scratch.predict(X_test_scaled)

# (iv) Evaluate models
# Define a function to print evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    # Calculate accuracy using accuracy_score
    accuracy = accuracy_score(y_true, y_pred)
    # Calculate F1 score using f1_score
    f1 = f1_score(y_true, y_pred)
    # Print model evaluation metrics
    print(f"{model_name}:\nAccuracy: {accuracy:.4f}\nF1 Score: {f1:.4f}\n")

# Evaluate and print performance metrics for Logistic Regression model
evaluate_model(y_test, logistic_preds, "Logistic Regression")
# Evaluate and print performance metrics for K-Nearest Neighbors model
evaluate_model(y_test, knn_preds, "K-Nearest Neighbors")
# Evaluate and print performance metrics for Decision Trees Classifier model
evaluate_model(y_test, tree_preds, "Decision Trees Classifier")
# Evaluate and print performance metrics for Random Forest Classifier model
evaluate_model(y_test, rf_preds, "Random Forest Classifier")
# Evaluate and print performance metrics for Logistic Regression from scratch model
evaluate_model(y_test, lr_scratch_preds, "Logistic Regression from scratch")

Logistic Regression:
Accuracy: 0.8646
F1 Score: 0.3810

K-Nearest Neighbors:
Accuracy: 0.8708
F1 Score: 0.4918

Decision Trees Classifier:
Accuracy: 0.8667
F1 Score: 0.5676

Random Forest Classifier:
Accuracy: 0.9000
F1 Score: 0.6066

Logistic Regression from scratch:
Accuracy: 0.8542
F1 Score: 0.5070
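
The from-scratch model above runs 1500 gradient-descent steps with a learning rate of 0.0001 and never checks whether training has converged. Below is a minimal sketch of one way to check. It assumes a small synthetic dataset (not the wine data) and a hypothetical helper, `fit_with_loss_history`, that applies the same update rule and records the log-loss at every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy; clip probabilities to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_with_loss_history(X, y, learning_rate, num_iterations):
    # Same update rule as LogisticRegressionScratch.fit, plus a loss record
    m, n = X.shape
    weights, bias = np.zeros(n), 0.0
    history = []
    for _ in range(num_iterations):
        p = sigmoid(X @ weights + bias)
        history.append(log_loss(y, p))
        weights -= learning_rate * (X.T @ (p - y)) / m
        bias -= learning_rate * np.sum(p - y) / m
    return weights, bias, history

# Synthetic, illustrative data only (assumption: features are already standardized)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 1.0).astype(float)

_, _, slow = fit_with_loss_history(X_demo, y_demo, learning_rate=0.0001, num_iterations=1500)
_, _, fast = fit_with_loss_history(X_demo, y_demo, learning_rate=0.1, num_iterations=1500)
print(f"lr=0.0001: loss {slow[0]:.4f} -> {slow[-1]:.4f}")
print(f"lr=0.1:    loss {fast[0]:.4f} -> {fast[-1]:.4f}")
```

If the loss is still falling steeply at the last iteration, the model has probably not converged, and a larger learning rate or more iterations may be worth trying.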



In [5]:
!pip install xgboost

Collecting xgboost
  Obtaining dependency information for xgboost from https://files.pythonhosted.org/packages/24/ec/ad387100fa3cc2b9b81af0829b5ecfe75ec5bb19dd7c19d4fea06fb81802/xgboost-2.0.3-py3-none-win_amd64.whl.metadata
  Downloading xgboost-2.0.3-py3-none-win_amd64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-win_amd64.whl (99.8 MB)

In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Load the dataset
df = pd.read_csv("winequality-red.csv")

# (i) Create a new column for good quality
df['good_quality'] = (df['quality'] >= 7).astype(int)

# Drop the original 'quality' column
df.drop('quality', axis=1, inplace=True)

# (ii) Data pre-processing steps
# Split the data into features (X) and target variable (y)
X = df.drop('good_quality', axis=1)
y = df['good_quality']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# (iii) Apply classification algorithms

# Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(X_train_scaled, y_train)
logistic_preds = logistic_model.predict(X_test_scaled)

# K-Nearest Neighbors
knn_model = KNeighborsClassifier()
knn_model.fit(X_train_scaled, y_train)
knn_preds = knn_model.predict(X_test_scaled)

# Decision Trees Classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train_scaled, y_train)
tree_preds = tree_model.predict(X_test_scaled)

# Random Forest Classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train_scaled, y_train)
rf_preds = rf_model.predict(X_test_scaled)

# Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(X_train_scaled, y_train)
svm_preds = svm_model.predict(X_test_scaled)

# XGBoost Classifier
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train_scaled, y_train)
xgb_classifier_preds = xgb_classifier.predict(X_test_scaled)

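# Elastic Net Regression
# ElasticNet is a regressor: it predicts continuous values, which are thresholded at 0.5 to get class labels
elastic_net = ElasticNet()
elastic_net.fit(X_train_scaled, y_train)
elastic_net_preds = elastic_net.predict(X_test_scaled)
elastic_net_class_preds = (elastic_net_preds > 0.5).astype(int)

# XGBoost Regressor
# Also a regressor; thresholding (rather than rounding) keeps labels in {0, 1} even if a prediction falls outside [0, 1]
xgb_regressor = XGBRegressor()
xgb_regressor.fit(X_train_scaled, y_train)
xgb_regressor_preds = xgb_regressor.predict(X_test_scaled)
xgb_regressor_class_preds = (xgb_regressor_preds > 0.5).astype(int)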

# Define a custom Logistic Regression class from scratch
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.0001, num_iterations=1500):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.num_iterations):
            y_pred = self.sigmoid(np.dot(X, self.weights) + self.bias)
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        return np.round(self.sigmoid(np.dot(X, self.weights) + self.bias))

# Train the Logistic Regression model from scratch
lr_scratch = LogisticRegressionScratch(learning_rate=0.0001, num_iterations=1500)
lr_scratch.fit(X_train_scaled, y_train)
lr_scratch_preds = lr_scratch.predict(X_test_scaled)

# (iv) Evaluate models
def evaluate_model(y_true, y_pred, model_name):
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{model_name}:\nAccuracy: {accuracy:.4f}\nF1 Score: {f1:.4f}\n")

# Evaluation of models
evaluate_model(y_test, logistic_preds, "Logistic Regression")
evaluate_model(y_test, knn_preds, "K-Nearest Neighbors")
evaluate_model(y_test, tree_preds, "Decision Trees Classifier")
evaluate_model(y_test, rf_preds, "Random Forest Classifier")
evaluate_model(y_test, svm_preds, "Support Vector Machine")
evaluate_model(y_test, xgb_classifier_preds, "XGBoost Classifier")
evaluate_model(y_test, elastic_net_class_preds, "Elastic Net Regression")
evaluate_model(y_test, xgb_regressor_class_preds, "XGBoost Regressor")
evaluate_model(y_test, lr_scratch_preds, "Logistic Regression from scratch")

# For Elastic Net and XGBoost Regressor, also calculate and print RMSE
def evaluate_regressor(y_true, y_pred, model_name):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{model_name}:\nRMSE: {rmse:.4f}\n")

evaluate_regressor(y_test, elastic_net_preds, "Elastic Net Regression")
evaluate_regressor(y_test, xgb_regressor_preds, "XGBoost Regressor")

Logistic Regression:
Accuracy: 0.8646
F1 Score: 0.3810

K-Nearest Neighbors:
Accuracy: 0.8708
F1 Score: 0.4918

Decision Trees Classifier:
Accuracy: 0.8646
F1 Score: 0.5695

Random Forest Classifier:
Accuracy: 0.8958
F1 Score: 0.5763

Support Vector Machine:
Accuracy: 0.8812
F1 Score: 0.4242

XGBoost Classifier:
Accuracy: 0.8875
F1 Score: 0.5714

Elastic Net Regression:
Accuracy: 0.8604
F1 Score: 0.0000

XGBoost Regressor:
Accuracy: 0.8771
F1 Score: 0.5496

Logistic Regression from scratch:
Accuracy: 0.8542
F1 Score: 0.5070

Elastic Net Regression:
RMSE: 0.3466

XGBoost Regressor:
RMSE: 0.2950
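
The Elastic Net row above pairs an accuracy of 0.8604 with an F1 score of 0.0000. That is the pattern a model produces when it never predicts the positive class on an imbalanced test set. The sketch below reproduces the effect with made-up labels (not the wine data); the helper names `accuracy` and `f1_binary` are hypothetical and written in plain Python for illustration.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred):
    # F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when the denominator is zero
    return 2 * tp / denom if denom else 0.0

# Illustrative labels: 6 "not good" wines and 1 "good" wine
y_true = [0, 0, 0, 0, 0, 0, 1]
all_zero = [0] * len(y_true)  # a model that always predicts "not good"

print(f"Accuracy: {accuracy(y_true, all_zero):.4f}")
print(f"F1 Score: {f1_binary(y_true, all_zero):.4f}")
```

Accuracy alone rewards this behaviour, so the F1 column is likely the more informative one when comparing the models above.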


