<h1 align="center">Introduction to Machine Learning - 25737-2</h1>
<h4 align="center">Dr. R. Amiri</h4>
<h4 align="center">Sharif University of Technology, Spring 2024</h4>


**<font color='red'>Plagiarism is strongly prohibited!</font>**


**Student Name**: Sara Rezanezhad

**Student ID**: 99101643





# Logistic Regression

**Task:** Implement your own Logistic Regression model, and test it on the given dataset of Logistic_question.csv!

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class MyLogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super(MyLogisticRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

class LogisticRegressionModel:
    def __init__(self, input_dim):
        # Use the GPU when one is available; otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = MyLogisticRegression(input_dim).to(self.device)
        self.criterion = nn.BCELoss()
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

    def fit(self, X_train, y_train, epochs=100):
        # X_train and y_train are expected to be NumPy arrays
        features = torch.from_numpy(X_train).float().to(self.device)
        targets = torch.from_numpy(y_train).float().view(-1, 1).to(self.device)
        train_loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

        # Mini-batch SGD on the binary cross-entropy loss
        for epoch in range(epochs):
            for inputs, labels in train_loader:
                self.optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                self.optimizer.step()

    def predict(self, X_test):
        # Threshold the predicted probabilities at 0.5 to obtain class labels
        with torch.no_grad():
            outputs = self.model(torch.from_numpy(X_test).float().to(self.device))
            predictions = (outputs.cpu().numpy() > 0.5).astype(int)
        return predictions


**Task:** Test your model on the given dataset. You must split your data into train and test, with a 0.2 split, then normalize your data using X_train data. Finally, report 4 different evaluation metrics of the model on the test set. (You might want to first make the Target column binary!)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = pd.read_csv('Logistic_question.csv')

# Drop rows with missing values
data.dropna(inplace=True)

# Threshold used to convert the continuous target into a binary one
threshold = 0.5

# Values at or above the threshold become 1, all others become 0
data['Target'] = (data['Target'] >= threshold).astype(int)

# Separate features and target
X = data.drop(columns='Target')
y = data['Target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.94
Confusion Matrix:
[[ 5  5]
 [ 0 70]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67        10
           1       0.93      1.00      0.97        70

    accuracy                           0.94        80
   macro avg       0.97      0.75      0.82        80
weighted avg       0.94      0.94      0.93        80



**Question:** What is each of the evaluation metrics you used? For each one, mention situations in which it conveys more information about the model's performance in specific tasks.

**Your answer:**
The code reports accuracy, the confusion matrix, and the classification report, which in turn contains precision, recall, and F1-score for each class.

1. **Accuracy**: Accuracy is the proportion of correctly classified instances out of all instances. It is most informative when the classes are roughly balanced. On imbalanced data it can be misleading: here the test set contains 70 positives out of 80 samples, so a model that always predicts class 1 would already reach an accuracy of 0.875.

2. **Confusion Matrix**: The confusion matrix breaks the predictions down into counts of true positives, true negatives, false positives, and false negatives. It is particularly useful in tasks where different types of errors have different costs. For example, in medical diagnosis, false negatives (missing a positive case) may be more critical than false positives. In the results above, all 5 errors are class-0 samples predicted as class 1.

3. **Classification Report**: The classification report lists precision, recall, F1-score, and support for each class. Precision is the proportion of predicted positives that are truly positive, recall is the proportion of actual positives that are predicted as positive, and the F1-score is the harmonic mean of precision and recall. The report is most helpful for understanding the trade-off between precision and recall, especially on imbalanced datasets where one class is much more prevalent than the others.
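
As a quick sanity check, the values in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below is only illustrative: it hard-codes the reported matrix and treats class 1 as the positive class, which is an assumption about how the labels are read.

```python
# Confusion matrix as printed by the notebook:
# rows are true labels, columns are predicted labels
conf_matrix = [[5, 5],
               [0, 70]]

tn, fp = conf_matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted positives, how many are correct
recall = tp / (tp + fn)      # of all actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values agree with the class-1 row of the report (precision 0.93, recall 1.00, F1 0.97) and with the overall accuracy of 0.94.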



**Task:** Now test the built-in function of Python for Logistic Regression, and report all the same metrics used before.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.94
Confusion Matrix:
[[ 5  5]
 [ 0 70]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67        10
           1       0.93      1.00      0.97        70

    accuracy                           0.94        80
   macro avg       0.97      0.75      0.82        80
weighted avg       0.94      0.94      0.93        80



**Question:** Compare your function with the built-in function in terms of performance and parameters. Briefly explain what the parameters of the built-in function are and how they affect the model's performance.

**Your answer:**
Comparing the custom logistic regression model with scikit-learn's built-in `LogisticRegression` involves two aspects: performance and parameters.

Performance:
1. **Accuracy**: Both models should reach similar accuracy if the custom one is implemented correctly, since both fit a linear decision boundary on the same features. Small differences are still expected, because the custom model is trained with mini-batch SGD and no regularization, whereas the built-in model applies L2 regularization by default.

2. **Confusion Matrix and Classification Report**: The confusion matrix and classification report should likewise be consistent between the two models. They show where the models agree or differ in terms of true positives, true negatives, false positives, and false negatives, as well as precision, recall, and F1-score.

Parameters:
The built-in Logistic Regression function in scikit-learn allows for customization through various parameters. Some of the key parameters and their effects on model performance are:

1. **penalty**: Determines the regularization term used in the model. Regularization helps prevent overfitting by penalizing large coefficients. The 'l1' penalty uses L1 regularization (lasso), which can drive some coefficients to exactly zero, while the 'l2' penalty (the default) uses L2 regularization (ridge). Not every solver supports every penalty.

2. **C**: Inverse of regularization strength. A smaller value of C indicates stronger regularization, which can help prevent overfitting. Tuning the value of C can impact the model's ability to generalize to unseen data.

3. **solver**: Specifies the optimization algorithm used to fit the model. Different solvers suit different problems. For example, 'liblinear' is a good choice for small datasets, 'sag' and 'saga' are faster on large ones, and 'lbfgs' is the default.

4. **class_weight**: Allows for handling imbalanced datasets by assigning different weights to classes. This parameter can be useful when one class dominates the dataset.
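
To make the effect of these parameters concrete, here is a minimal sketch that varies `C` while keeping the solver and class weighting fixed. It uses synthetic data from `make_classification` as a stand-in (not the assignment's dataset), and the specific values of `C`, `max_iter`, and the class proportions are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 85% positives); an assumption for illustration
X, y = make_classification(n_samples=400, n_features=7,
                           weights=[0.15, 0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

results = {}
for C in (0.01, 1.0, 100.0):
    # The default penalty is L2; a smaller C means stronger regularization
    clf = LogisticRegression(C=C, solver="lbfgs",
                             class_weight="balanced", max_iter=1000)
    clf.fit(X_train_scaled, y_train)
    coef_norm = float((clf.coef_ ** 2).sum() ** 0.5)
    results[C] = (clf.score(X_test_scaled, y_test), coef_norm)
    print(f"C={C}: test accuracy={results[C][0]:.2f}, coefficient norm={coef_norm:.3f}")
```

The coefficient norm shrinks as `C` decreases, which is the regularization effect described above. Whether that helps test accuracy depends on the data, so in practice `C` is usually tuned by cross-validation.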


# Multinomial Logistic Regression

**Task:** Implement your own Multinomial Logistic Regression model. Your model must be able to handle any number of labels!

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class MyMultinomialLogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super(MyMultinomialLogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.linear(x)

    def loss_function(self, outputs, targets):
        loss = F.cross_entropy(outputs, targets)
        return loss

    def fit(self, X_train, y_train, epochs, lr):
        optimizer = optim.SGD(self.parameters(), lr=lr)

        for epoch in range(epochs):
            optimizer.zero_grad()
            outputs = self.forward(X_train)
            loss = self.loss_function(outputs, y_train)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f'Epoch {epoch}, Loss: {loss.item()}')

    def predict(self, X_test):
        outputs = self.forward(X_test)
        _, predicted = torch.max(outputs, 1)
        return predicted

**Task:** Test your model on the given dataset. Do the same as the previous part, but here you might want to first make the Target column quantized into $i$ levels. Change $i$ from 2 to 10.
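
One way to quantize a continuous target into $i$ levels is quantile binning with `pd.qcut`, which assigns integer codes from 0 to $i-1$ so that each level holds roughly the same number of samples. The sketch below uses a small made-up series rather than the real Target column, so the data is purely hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous target: 12 evenly spaced values in [0, 1]
y = pd.Series(np.linspace(0.0, 1.0, 12))

bin_counts = {}
for i in (2, 3, 4):
    # labels=False returns integer bin codes 0 .. i-1 instead of intervals
    levels = pd.qcut(y, i, labels=False)
    bin_counts[i] = levels.value_counts().sort_index().tolist()
    print(f"i={i}: samples per level = {bin_counts[i]}")
```

Because the bins are defined by quantiles, the classes stay balanced for every $i$. Equal-width binning with `pd.cut` would be an alternative, but it can leave some levels nearly empty when the target is skewed.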

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Load the dataset
data = pd.read_csv('Logistic_question.csv')

# Preprocessing
X = data.drop('Target', axis=1)
y = data['Target']

# Quantize the target column into i levels (change i from 2 to 10)
for i in range(2, 11):
    y_quantized = pd.qcut(y, i, labels=False)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y_quantized, test_size=0.2, random_state=42)

    # Standardize the features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

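    # Convert data to PyTorch tensors (y_test too, so it can be compared with the predictions)
    X_train = torch.tensor(X_train, dtype=torch.float32)
    y_train = torch.tensor(y_train.values, dtype=torch.long)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_test = torch.tensor(y_test.values, dtype=torch.long)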

    # Define and train the model
    input_size = X.shape[1]
    num_classes = len(np.unique(y_quantized))
    model = MyMultinomialLogisticRegression(input_size, num_classes)

    epochs = 100
    lr = 0.01
    model.fit(X_train, y_train, epochs, lr)

    # Evaluate the model
    with torch.no_grad():
        predicted = model.predict(X_test)
        accuracy = (predicted == y_test).sum().item() / len(y_test)
        print(f'Quantization levels: {i}, Test Accuracy: {accuracy}')

Epoch 0, Loss: 0.8812591433525085
Epoch 10, Loss: 0.7094274759292603
Epoch 20, Loss: 0.6008730530738831
Epoch 30, Loss: 0.5311353802680969
Epoch 40, Loss: 0.48425325751304626
Epoch 50, Loss: 0.45115169882774353
Epoch 60, Loss: 0.42674511671066284
Epoch 70, Loss: 0.408090204000473
Epoch 80, Loss: 0.39340636134147644
Epoch 90, Loss: 0.3815666735172272
Quantization levels: 2, Test Accuracy: 0.0
Epoch 0, Loss: 1.4809350967407227
Epoch 10, Loss: 1.2940505743026733
Epoch 20, Loss: 1.155160665512085
Epoch 30, Loss: 1.0545694828033447
Epoch 40, Loss: 0.98167484998703
Epoch 50, Loss: 0.9279400706291199
Epoch 60, Loss: 0.8873610496520996
Epoch 70, Loss: 0.8559190034866333
Epoch 80, Loss: 0.8309462666511536
Epoch 90, Loss: 0.8106552958488464
Quantization levels: 3, Test Accuracy: 0.0
Epoch 0, Loss: 1.432816505432129
Epoch 10, Loss: 1.365410566329956
Epoch 20, Loss: 1.3107023239135742
Epoch 30, Loss: 1.2659567594528198
Epoch 40, Loss: 1.2289838790893555
Epoch 50, Loss: 1.1980741024017334
Epoch 60,

**Question:** Report for which $i$ your model performs best. Describe and analyze the results! You could use visualizations or any other method!

**Your answer:**

# Going a little further!

First, we download the Adult income dataset from Kaggle. To do this, create an account on the website and create an API token. A file named kaggle.json will be downloaded to your device. Then use the following code:

In [None]:
from google.colab import files
files.upload()  # Use this to select the kaggle.json file from your computer
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Then use this code to automatically download the dataset into Colab.

In [None]:
!kaggle datasets download -d wenruliu/adult-income-dataset
!unzip /content/adult-income-dataset.zip

**Task:** Determine the number of null entries!

In [None]:
# Your code goes here!


**Question:** Many widely used datasets contain a lot of null entries. Propose 5 methods for dealing with this problem. Briefly explain how you decide which one to use in this problem.

**Your answer:**

**Task:** Handle null entries using your best method.

In [None]:
# Your code goes here!


**Task:** Convert categorical features to numerical values. Split the dataset with 80-20 portion. Normalize all the data using X_train. Use the built-in Logistic Regression function and GridSearchCV to train your model, and report the parameters, train and test accuracy of the best model.

In [None]:
# Your code goes here!


**Task:** To try a different route, split X_train into $i$ parts, and train $i$ separate models on these parts. Now propose and implement 3 different *ensemble methods* to derive the global model's prediction for X_test using the results (not necessarily predictions!) of the $i$ models. First, set $i=10$ to find the method with the best test accuracy (the answer is not general!). You must use your own Logistic Regression model. (You might want to modify it a little bit for this part!)

In [None]:
# Your code goes here!


**Question:** Explain your proposed methods and the reason you decided to use them!

**Your answer:**

**Task:** Now, for your best method, change $i$ from 2 to 100 and report $i$, train and test accuracy of the best model. Also, plot test and train accuracy for $2\leq i\leq100$.

In [None]:
# Your code goes here!


**Question:** Analyze the results.

**Your Answer:**