# Feature Engineering - Regularisation: Classifying Raisins with Hyperparameter Tuning 

This project focuses on building a classification model to distinguish between different types of raisins using machine learning techniques. The data includes key features that describe the physical and geometric properties of raisins, enabling the model to learn patterns and make predictions.

## Project Goals

- **Data Exploration**:  
  Analyze and visualize the dataset to understand feature distributions and correlations.

- **Model Training**:  
  Train and evaluate classification models using two approaches:  
  - **Grid Search with a Decision Tree Classifier**:  
    Exhaustively search for the best hyperparameters to optimize the Decision Tree model.  
  - **Random Search with Logistic Regression**:  
    Use a randomized approach to efficiently tune hyperparameters for the Logistic Regression model.

- **Hyperparameter Tuning**:  
  Systematically tune hyperparameters to enhance model performance using both Grid Search and Random Search techniques.
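
The difference between the two search strategies can be sketched without scikit-learn. The snippet below is a minimal illustration, not the library's implementation: it enumerates the decision-tree grid used later in this notebook with `itertools.product`, then draws eight logistic-regression candidates with Python's `random` module. The variable names are hypothetical, and the drawn values of `C` will not match the ones `RandomizedSearchCV` produces.

```python
import itertools
import random

# Grid search: every combination of the listed values becomes a candidate.
grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 3, 4]}
grid_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Random search: a fixed number of candidates is drawn from distributions.
rng = random.Random(19)  # seeded so the draws are repeatable
random_candidates = [
    {"penalty": rng.choice(["l1", "l2"]), "C": rng.uniform(0, 100)}
    for _ in range(8)
]

print(len(grid_candidates), "grid candidates")
print(len(random_candidates), "random candidates")
```

The grid grows multiplicatively with every added hyperparameter, while the random search budget stays at whatever number of draws is requested.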

## Dataset Overview

The dataset consists of 900 samples with seven physical measurements per raisin: **area**, **major axis length**, **minor axis length**, **eccentricity**, **convex area**, **extent**, and **perimeter**. These features are the inputs for classifying each raisin into one of two classes, encoded as 0 and 1.

### Dataset Information
- [Original dataset post in Kaggle](https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset)  
- [Dataset author: Murat Koklu](https://www.muratkoklu.com/datasets/)
- `Raisin_Dataset.csv` was provided by Codecademy.com



### 1. Explore the Dataset

In [2]:
# 1. Setup
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

rs = 19
raisins = pd.read_csv('Raisin_Dataset.csv')
raisins.head()

    Area  MajorAxisLength  MinorAxisLength  Eccentricity  ConvexArea    Extent  Perimeter  Class
0  87524       442.246011       253.291155      0.819738       90546  0.758651    1184.04      0
1  75166       406.690687       243.032436      0.801805       78789   0.68413   1121.786      0
2  90856       442.267048       266.328318      0.798354       93717  0.637613   1208.575      0
3  45928       286.540559       208.760042      0.684989       47336  0.699599    844.162      0
4  79408        352.19077       290.827533      0.564011       81463  0.792772   1073.251      0


In [3]:
# 2. Create predictor and target variables, X and y
X = raisins.drop('Class', axis=1)
y = raisins['Class']

num_unique_classes = y.nunique()
print("Number of unique classes:", num_unique_classes)

Number of unique classes: 2


In [4]:
# 3. Examine the dataset

print(f"Total number of features: {X.shape[1]}")
print(f"Total number of samples: {X.shape[0]}")
print(f"Samples belonging to class '1': {y.value_counts()[1]}")

Total number of features: 7
Total number of samples: 900
Samples belonging to class '1': 450


In [5]:
# 4. Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)

### 2. Grid Search with Decision Tree Classifier

In [6]:
# 5. Create a Decision Tree model
tree = DecisionTreeClassifier(random_state=rs)

In [7]:
# 6. Dictionary of parameters for GridSearchCV
parameters_tree = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 3, 4]
}

In [8]:
# 7. Create a GridSearchCV model
gscv = GridSearchCV(estimator=tree,
                    param_grid=parameters_tree)

# Fit the GridSearchCV model to the training data
gscv.fit(X_train, y_train)
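
As a rough check on how much work this fit involves, the arithmetic can be written out. This is a sketch under stated assumptions: `GridSearchCV` is called here without a `cv` argument, and recent scikit-learn versions default to five folds and refit the best candidate on the full training set. The variable names are hypothetical.

```python
import math

n_samples = 900   # rows in the dataset, as reported above
test_size = 0.2   # fraction passed to train_test_split
n_test = math.ceil(n_samples * test_size)  # assumed: the test share is rounded up
n_train = n_samples - n_test

n_candidates = 3 * 3  # max_depth values x min_samples_split values
n_folds = 5           # assumed default for cv in recent scikit-learn versions
n_cv_fits = n_candidates * n_folds
n_total_fits = n_cv_fits + 1  # plus one refit of the best candidate (refit=True)

print(n_train, n_test, n_cv_fits, n_total_fits)
```

Under these assumptions each fold holds out 144 of the 720 training samples for validation.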


In [9]:
# Model and hyperparameters obtained by GridSearchCV
best_gscv_tree = gscv.best_estimator_
print(f"Best Estimator Hyperparameters:{best_gscv_tree}")

# Best score obtained
print(f"Best Cross-Validation Accuracy:{gscv.best_score_}")

# Accuracy of the final model on the test data
test_score = best_gscv_tree.score(X_test, y_test)
print(f"Test Data Accuracy:{test_score}")


Best Estimator Hyperparameters:DecisionTreeClassifier(max_depth=3, random_state=19)
Best Cross-Validation Accuracy:0.8541666666666667
Test Data Accuracy:0.8555555555555555
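
The two accuracies are easier to compare once converted back to sample counts. The sketch below assumes a 720/180 train/test split (80/20 of 900 samples) and equal-sized cross-validation folds, so that the mean fold accuracy equals the pooled accuracy. The variable names are hypothetical.

```python
n_train, n_test = 720, 180          # assumed sizes from the 80/20 split
cv_accuracy = 0.8541666666666667    # reported best cross-validation accuracy
test_accuracy = 0.8555555555555555  # reported test accuracy

# Convert each accuracy into a count of correctly classified samples
cv_correct = round(cv_accuracy * n_train)
test_correct = round(test_accuracy * n_test)
gap = test_accuracy - cv_accuracy

print(cv_correct, "of", n_train, "correct in cross-validation")
print(test_correct, "of", n_test, "correct on the test set")
print("gap:", round(gap, 4))
```

A gap this small suggests the cross-validation estimate was a fair preview of test performance for the depth-3 tree.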


In [10]:
# Extract the mean test scores and hyperparameters
mean_test_scores = gscv.cv_results_['mean_test_score']
params = gscv.cv_results_['params']

# Convert to DataFrames
scores_df = pd.DataFrame(mean_test_scores, columns=['Mean Test Score'])
params_df = pd.DataFrame(params)

# Concatenate the two DataFrames
results_df = pd.concat([params_df, scores_df], axis=1)

# Print the resulting DataFrame
print(results_df)


   max_depth  min_samples_split  Mean Test Score
0          3                  2         0.854167
1          3                  3         0.854167
2          3                  4         0.854167
3          5                  2         0.851389
4          5                  3         0.851389
5          5                  4         0.851389
6          7                  2         0.819444
7          7                  3         0.820833
8          7                  4         0.820833
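
Grouping the table by `max_depth` makes the pattern easier to see. The following sketch copies the scores from the table above into a plain list (the names `rows` and `scores_by_depth` are hypothetical) and reports the mean and the spread within each depth.

```python
from collections import defaultdict
from statistics import mean

# (max_depth, min_samples_split, mean test score), copied from the table above
rows = [
    (3, 2, 0.854167), (3, 3, 0.854167), (3, 4, 0.854167),
    (5, 2, 0.851389), (5, 3, 0.851389), (5, 4, 0.851389),
    (7, 2, 0.819444), (7, 3, 0.820833), (7, 4, 0.820833),
]

scores_by_depth = defaultdict(list)
for max_depth, _, score in rows:
    scores_by_depth[max_depth].append(score)

# Mean score and within-depth spread (max minus min) for each max_depth
for max_depth, scores in sorted(scores_by_depth.items()):
    print(max_depth, round(mean(scores), 6), round(max(scores) - min(scores), 6))
```

In this grid the score depends almost entirely on `max_depth`, and `min_samples_split` makes no difference at depths 3 and 5. The three depth-3 candidates are tied, which is consistent with the printed best estimator showing only `max_depth=3` and leaving `min_samples_split` at its default of 2.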


### 3. Random Search with Logistic Regression

In [11]:
# Logistic regression model definition
lr = LogisticRegression(solver='liblinear', max_iter=1000, random_state=rs)

In [12]:
# Definition of distributions to choose hyperparameters from
from scipy.stats import uniform

distributions_lr = {
    'penalty': ['l1', 'l2'],  # Regularization type, sampled uniformly from this list
    'C': uniform(0, 100)      # Inverse regularization strength, uniform on [0, 100]
}
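
A uniform distribution on [0, 100] rarely proposes small values of `C`. The sketch below contrasts it with a log-uniform draw, a common alternative that this notebook does not use. It relies only on Python's `random` module, so the draws will not match SciPy's; `scipy.stats.loguniform` offers a comparable distribution. The bounds 0.01 and 100 are an assumption chosen for illustration.

```python
import random

rng = random.Random(19)  # seed mirrors rs; draws will not match scipy's

# Uniform draw on [0, 100], analogous to scipy.stats.uniform(0, 100)
uniform_draws = [rng.uniform(0, 100) for _ in range(1000)]

# Log-uniform draw on [0.01, 100]: uniform in the exponent,
# so each order of magnitude is equally likely
log_uniform_draws = [10 ** rng.uniform(-2, 2) for _ in range(1000)]

share_uniform_below_1 = sum(c < 1 for c in uniform_draws) / len(uniform_draws)
share_log_below_1 = sum(c < 1 for c in log_uniform_draws) / len(log_uniform_draws)

print("share of C < 1, uniform:    ", share_uniform_below_1)
print("share of C < 1, log-uniform:", share_log_below_1)
```

Only about one in a hundred uniform draws falls below 1, so strong regularization is barely explored, while the log-uniform draw spends roughly half its samples there.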

In [13]:
# Creation of a RandomizedSearchCV model
clf = RandomizedSearchCV(
    estimator=lr,                 # Logistic regression model
    param_distributions=distributions_lr,  # Parameter distributions
    n_iter=8,                     # Number of random draws
    random_state=rs               # Ensures reproducibility
)
# Self-reflection: why is random_state needed for RandomizedSearchCV but not for GridSearchCV?
# RandomizedSearchCV draws hyperparameter candidates at random, so random_state is needed
# to make the draws reproducible. GridSearchCV evaluates every combination in the grid,
# so there is nothing random in its choice of candidates.

# Fit the random search model
clf.fit(X_train, y_train)
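
The role of the seed can be illustrated with a small stand-in for the sampling step. This is a sketch only: `draw_candidates` is a hypothetical helper built on Python's `random` module and does not reproduce scikit-learn's sampler.

```python
import random

def draw_candidates(seed, n_iter=8):
    """Draw n_iter hypothetical (penalty, C) candidates from a seeded generator."""
    rng = random.Random(seed)
    return [(rng.choice(["l1", "l2"]), rng.uniform(0, 100)) for _ in range(n_iter)]

# The same seed reproduces the same candidates; a different seed does not.
same_seed_matches = draw_candidates(19) == draw_candidates(19)
different_seed_matches = draw_candidates(19) == draw_candidates(20)

print(same_seed_matches, different_seed_matches)
```

Without a fixed seed, each run would evaluate a different set of candidates and could report a different best estimator.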

In [14]:
# Best estimator and best score
print(f"Best Estimator from Random Search: {clf.best_estimator_}, Penalty: {clf.best_estimator_.get_params()['penalty']}")
print("Best Cross-Validation Accuracy:", clf.best_score_)

# Summary table of the results from RandomizedSearchCV
# Extract the results from random search
mean_test_scores = clf.cv_results_['mean_test_score']
params = clf.cv_results_['params']

# Convert to DataFrames
scores_df = pd.DataFrame(mean_test_scores, columns=['Mean Test Score'])
params_df = pd.DataFrame(params)

# Concatenate the two DataFrames
results_df = pd.concat([params_df, scores_df], axis=1)

# Print the summary table
print("\nSummary of Random Search Results:")
print(results_df)


Best Estimator from Random Search: LogisticRegression(C=67.19770812804666, max_iter=1000, random_state=19,
                   solver='liblinear'), Penalty: l2
Best Cross-Validation Accuracy: 0.8708333333333333

Summary of Random Search Results:
           C penalty  Mean Test Score
0   9.753360      l2         0.869444
1  41.274294      l2         0.869444
2  13.813169      l2         0.869444
3  67.563267      l1         0.868056
4  67.197708      l2         0.870833
5   0.814826      l1         0.868056
6  63.566073      l2         0.869444
7  84.901482      l2         0.869444
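
To judge how much the sampled hyperparameters matter, the spread of the scores can be expressed in samples. The sketch below copies the scores from the summary table and assumes 720 training rows split into equal-sized folds. The variable names are hypothetical.

```python
# Mean test scores copied from the summary table above
scores = [0.869444, 0.869444, 0.869444, 0.868056,
          0.870833, 0.868056, 0.869444, 0.869444]
n_train = 720  # assumed: training rows after the 80/20 split of 900 samples

# Difference between the best and worst candidate, as a fraction and in samples
spread = max(scores) - min(scores)
spread_in_samples = round(spread * n_train)

print("spread in accuracy:", round(spread, 6))
print("spread in samples: ", spread_in_samples)
```

Under these assumptions the best and worst candidates differ by about two correctly classified samples, which suggests that neither `C` nor the penalty type has much influence over the sampled range.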
