-------
# **Assignment 2: Classification Models**
## *Shaikh Mariyam Harunor Rashid - A20MJ4010*
-------

##**Table of Contents**
| No. |  | Content       |
|---------|----|---------------|
| 1.|   | Importing libraries  |
| 2.|   | Dataset - Stars|
|   |2.1| *Loading the Dataset*|
|   |2.2| *Processing the Dataset*|
|   |2.3| *Model Training and Evaluation*|
|   |2.4| *Model Training and Evaluation*|
| 3.|   | Dataset - Mushrooms  |
|   |3.1| *Loading the Dataset*|
|   |3.2| *Processing the Dataset*|
|   |3.3| *Model Training and Evaluation*|
|   |3.4| *Model Training and Evaluation*|

<br>

----

# 1. Importing libraries

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

The 'star_data' and'mushroom_data' datasets are used to apply different models in the Python code that is given. The script includes all of the important preprocessing procedures for data, including coding categorical features, resolving missing values, normalising numerical attributes, and assessing model performance. It makes use of popular classifiers such as Gradient Boosting, Support Vector Machines (SVM), Decision Trees, Random Forests, K-Nearest Neighbours (KNN), and Logistic Regression. In addition, the code evaluates model accuracy and generates complete classification reports that give a thorough picture of each model's performance on the designated datasets.

# 2. Dataset - Stars

## Loading the Dataset

In [4]:
# Load the dataset
star_data = pd.read_csv('/content/Star3642_balanced.csv')
star_data.head()
imputer = SimpleImputer(strategy='mean')

To give an overview of the dataset, the code loads the 'Star3642_balanced.csv' dataset using pandas and shows the first few rows. The 'mean' technique is then used to initialise a SimpleImputer, meaning that any missing values in the dataset will be replaced with the mean values of the corresponding columns. In order to ensure consistency and completeness in the dataset and to get it ready for additional analysis and model training, this step is essential.

## Processing the Dataset

In [5]:
label_encoders = {}
for column in star_data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    star_data[column] = label_encoders[column].fit_transform(star_data[column])

# Splitting the data into features and target
X = star_data.drop('TargetClass', axis=1)
y = star_data['TargetClass']

This code block turns the categorical variables in the 'star_data' dataset into numerical representations by using Label Encoding. After that, the encoded data is divided into the target variable (y) and features (X), readying the dataset for training and assessing machine learning models.

In [6]:
# Normalizing numerical variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

To make sure that the features are on a similar scale, this part of the code uses StandardScaler to normalise the numerical variables. Train_test_split is then used to split the dataset into training and testing sets. 30% of the data are in the testing set (X_test and y_test), while the remaining 70% are in the training set (X_train and y_train). To further prepare the data for model training and assessment, the imputer is also used to handle any remaining missing values.

## Model Training and Evaluation

In [7]:
# Dictionary of models for training and evaluation
models = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier()
}

# Dictionary to store accuracy of each model
model_accuracies = {}

# Training and evaluating each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    model_accuracies[name] = accuracy
    #print(model_accuracies)
    #print(report)
    print(f"\n{'='*40}\nModel: {name}\nAccuracy: {accuracy:.4f}\n{'-'*40}\nClassification Report:\n{report}")
    print('='*40)



Model: LogisticRegression
Accuracy: 0.8856
----------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.89      0.88       540
           1       0.89      0.89      0.89       553

    accuracy                           0.89      1093
   macro avg       0.89      0.89      0.89      1093
weighted avg       0.89      0.89      0.89      1093


Model: DecisionTreeClassifier
Accuracy: 0.8966
----------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.90      0.90       540
           1       0.90      0.89      0.90       553

    accuracy                           0.90      1093
   macro avg       0.90      0.90      0.90      1093
weighted avg       0.90      0.90      0.90      1093


Model: RandomForestClassifier
Accuracy: 0.9103
----------------------------------------
Classification Report:
              

This part creates a dictionary called models that contains several machine learning classifiers, including K-Neighbors, Gradient Boosting, Support Vector Classifier (SVC), Decision Tree, Random Forest, and Logistic Regression. After that, the algorithm goes over each model iteratively, trains it on the X_train and y_train training sets, and assesses its performance on the X_test and y_test testing sets. Every model's accuracy is recorded in the model_accuracies dictionary, and comprehensive reports on categorization are generated, offering valuable perspectives into the predictive powers of the models.

## Summary and Conclusion

- The models exhibit competitive performance, with the Gradient Boosting Classifier outperforming others with the highest accuracy of 91.13%.
- All models showcase balanced precision, recall, and F1-score, indicating their effectiveness across both classes.
- The Decision Tree, Random Forest, and Gradient Boosting models stand out as strong candidates for this classification task.
- Further hyperparameter tuning and optimization could potentially enhance the models' performance.


# 3. Dataset - Mushrooms

## Loading the Dataset

In [8]:
mushroom_data = pd.read_csv('/content/mushrooms.csv')
mushroom_data.head()
imputer = SimpleImputer(strategy='mean')

This section of the code begins by loading the 'mushrooms.csv' dataset using pandas (pd.read_csv). The head() function is then used to display the first few rows of the dataset, offering a preliminary look at its structure. Following this, a SimpleImputer is initialized with the strategy of replacing missing values with the mean, preparing the data for subsequent preprocessing steps.

## Processing the Dataset

In [9]:
label_encoders = {}
for column in mushroom_data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    mushroom_data[column] = label_encoders[column].fit_transform(mushroom_data[column])

# Splitting the data into features and target
X = mushroom_data.drop('class', axis=1)
y = mushroom_data['class']

In this section, Label Encoding is used to encode categorical variables in the 'mushroom_data' dataset. Instances of the LabelEncoder for every category column are kept in a dictionary called label_encoders. The code converts categorical values into numerical representations by iterating over object-type columns.

The dataset is then divided into the target variable (y) and characteristics (X). The target variable (y) is given the values from the 'class' column, and features (X) are derived by removing the 'class' column. The data is made ready for machine learning model training and assessment by this encoding and splitting process.

In [10]:
# Normalizing numerical variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In this part of the code, StandardScaler is used to normalise the numerical variables in the dataset (X). By ensuring that the features are on a consistent scale, this step helps to improve the effectiveness of model training.

The dataset is then divided using train_test_split into training and testing sets (X_train, X_test, y_train, y_test). 30% of the data is from the testing set, and 70% is from the training set. Furthermore, the previously initialised imputer is used to handle missing values in the training and testing sets. In order to prepare the data for machine learning model training and subsequent evaluation, several preprocessing processes are essential.

## Model Training and Evaluation

In [11]:
# Dictionary of models for training and evaluation
models2 = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier()
}

# Dictionary to store accuracy of each model
model_accuracies2 = {}

# Training and evaluating each model
for name, model in models2.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy2 = accuracy_score(y_test, y_pred)
    report2 = classification_report(y_test, y_pred)
    with np.errstate(divide='ignore', invalid='ignore'):
        report = classification_report(y_test, y_pred, zero_division=1)
    model_accuracies2[name] = accuracy2
    print(f"\n{'='*40}\nModel: {name}\nAccuracy: {accuracy:.4f}\n{'-'*40}\nClassification Report:\n{report}")
    print('='*40)


Model: LogisticRegression
Accuracy: 0.9113
----------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1257
           1       0.95      0.95      0.95      1181

    accuracy                           0.95      2438
   macro avg       0.95      0.95      0.95      2438
weighted avg       0.95      0.95      0.95      2438


Model: DecisionTreeClassifier
Accuracy: 0.9113
----------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1257
           1       1.00      1.00      1.00      1181

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438


Model: RandomForestClassifier
Accuracy: 0.9113
----------------------------------------
Classification Report:
              

A new set of machine learning models (models2) is introduced in this section of the code for evaluation and training on an alternative dataset. Among the models are the following: K-Neighbors, Gradient Boosting, Support Vector Classifier (SVC), Random Forest, Decision Tree, and Logistic Regression.

The algorithm predicts on the testing set (X_test), fits each model to the training set (X_train and y_train), and then iterates through each model, assessing its performance. Each model's accuracy is recorded in the model_accuracies2 dictionary, and comprehensive reports on classification are generated that offer valuable information about each model's predictive power on the second dataset. The code also addresses any division mistakes in the classification report.

## Summary and Conclusion
- All models exhibit remarkable accuracy and performance on the mushroom dataset, achieving perfect metrics for both classes.
- The Decision Tree, Random Forest, SVC, K-Neighbors, and Gradient Boosting models showcase identical accuracy, precision, recall, and F1-score, indicating potential overfitting or an issue with the dataset.
- Further investigation into dataset characteristics, potential class imbalances, or the need for additional features is recommended to address the observed high performance and potential overfitting.
