# **ASSIGNMENT 4: TECHCRUSH AI BOOTCAMP**
BY:
*OKUSANYA, SIMILOLUWA ATINUKE*


# **Project Overview**
This project applies **Logistic Regression** to classify raisin varieties (`Kecimen` and `Besni`) based on their morphological features.  
The dataset is from the **UCI Machine Learning Repository** and contains 7 numerical features extracted from raisin images.  

The goal is to build a **predictive model** that achieves at least **81% accuracy** on the test set.  


## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## 2. Loading Raisin Variety Classification Dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
file_path = '/content/drive/My Drive/Colab Notebooks/Raisin_Dataset.xlsx'

In [4]:
df = pd.read_excel(file_path)
print("\nFirst 5 Rows:\n",df.head())


First 5 Rows:
     Area  MajorAxisLength  MinorAxisLength  Eccentricity  ConvexArea  \
0  87524       442.246011       253.291155      0.819738       90546   
1  75166       406.690687       243.032436      0.801805       78789   
2  90856       442.267048       266.328318      0.798354       93717   
3  45928       286.540559       208.760042      0.684989       47336   
4  79408       352.190770       290.827533      0.564011       81463   

     Extent  Perimeter    Class  
0  0.758651   1184.040  Kecimen  
1  0.684130   1121.786  Kecimen  
2  0.637613   1208.575  Kecimen  
3  0.699599    844.162  Kecimen  
4  0.792772   1073.251  Kecimen  


In [5]:
print("Dataset Shape:", df.shape)

Dataset Shape: (900, 8)


## 3. Data Cleaning

In [6]:
# Checking for missing values
print("\nMissing Values:\n", df.isnull().sum())


Missing Values:
 Area               0
MajorAxisLength    0
MinorAxisLength    0
Eccentricity       0
ConvexArea         0
Extent             0
Perimeter          0
Class              0
dtype: int64


In [7]:
# Checking for duplicates
duplicates = df.duplicated().sum()
print("\nNumber of duplicate rows:", duplicates)


Number of duplicate rows: 0


## 4. Features & Target

In [8]:
X = df.drop("Class", axis=1)
y = df["Class"]

In [9]:
# Encoding target labels (LabelEncoder sorts classes alphabetically: Besni=0, Kecimen=1)
le = LabelEncoder()
y = le.fit_transform(y)
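
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `Besni` appears before `Kecimen` in the classification report. The snippet below is a minimal sketch of that behaviour on a made-up list of labels; `labels`, `le_demo`, and `mapping` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the dataset's "Class" column (illustrative only)
labels = ["Kecimen", "Besni", "Kecimen", "Besni"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is stored in sorted order, so index 0 is "Besni" and index 1 is "Kecimen"
mapping = {str(name): int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

Calling `le.inverse_transform` on predicted codes recovers the original variety names.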

## 5. Train-Test Split (80:20)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
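
The `stratify=y` argument keeps the class proportions the same in both splits. The sketch below illustrates this on placeholder data, assuming a balanced 450/450 class layout (consistent with the support of 90 per class in the test-set report); `X_demo` and `y_demo` are hypothetical stand-ins, not the raisin features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 900 rows, two balanced classes (assumed layout, not the real features)
y_demo = np.array([0] * 450 + [1] * 450)
X_demo = np.arange(900).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Count of each class in the training and test splits
train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print("Train:", train_counts, "Test:", test_counts)
```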

## 6. Feature Scaling

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
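
The scaler is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training. As a rough sketch of what `StandardScaler` computes, the example below reproduces the transform by hand on made-up numbers; `X_train_demo`, `X_test_demo`, and `scaler_demo` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (values are made up for illustration)
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_demo = np.array([[5.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Manual equivalent: z = (x - training mean) / training std (population std, ddof=0)
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
manual_test = (X_test_demo - mu) / sigma

sklearn_test = scaler_demo.transform(X_test_demo)
print(manual_test, sklearn_test)
```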

## 7. Logistic Regression Model

In [12]:
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

## 8. Prediction & Evaluation

In [13]:
y_pred = model.predict(X_test_scaled)

test_acc = accuracy_score(y_test, y_pred)
print("\n✅ Test Accuracy:", round(test_acc*100, 2), "%")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


✅ Test Accuracy: 88.89 %

Classification Report:
               precision    recall  f1-score   support

       Besni       0.94      0.83      0.88        90
     Kecimen       0.85      0.94      0.89        90

    accuracy                           0.89       180
   macro avg       0.89      0.89      0.89       180
weighted avg       0.89      0.89      0.89       180


Confusion Matrix:
 [[75 15]
 [ 5 85]]
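
The figures in the classification report can be recomputed directly from the confusion matrix. The sketch below does this with NumPy, taking rows as the true class and columns as the predicted class in the order `[Besni, Kecimen]`.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class,
# in the order [Besni, Kecimen]
cm = np.array([[75, 15],
               [5, 85]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column totals)

print("Accuracy:", round(float(accuracy), 4))
print("Recall [Besni, Kecimen]:", np.round(recall, 2))
print("Precision [Besni, Kecimen]:", np.round(precision, 2))
```

Read this way, 15 of the 90 Besni raisins were classified as Kecimen, while only 5 of the 90 Kecimen raisins were classified as Besni.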


## 9. Cross-Validation (5-Fold)

In [14]:
# Note: `scaler` was fitted on the training split only, so every fold reuses those statistics
cv_scores = cross_val_score(model, scaler.transform(X), y, cv=5, scoring="accuracy")
print("\nCross-validation scores:", cv_scores)
print("Mean CV Accuracy:", round(cv_scores.mean()*100, 2), "%")


Cross-validation scores: [0.89444444 0.85555556 0.86111111 0.86111111 0.84444444]
Mean CV Accuracy: 86.33 %
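
An alternative way to run the cross-validation is to wrap the scaler and the model in a pipeline, so the scaler is refitted on the training portion of each fold. The sketch below shows the pattern on synthetic data generated with `make_classification`; it is not the raisin dataset, and its scores are not comparable to the results above. The names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the raisin dataset): 900 rows, 7 features
X_demo, y_demo = make_classification(
    n_samples=900, n_features=7, n_informative=5, n_redundant=2, random_state=42
)

# The pipeline refits StandardScaler inside each fold before fitting the classifier
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold scores:", pipe_scores)
print("Mean accuracy:", pipe_scores.mean())
```

In the notebook, the same pattern would amount to passing the unscaled `X` and the encoded `y` to `cross_val_score` together with such a pipeline.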


## 10. Conclusion

The Logistic Regression model performed well on the Raisin Variety Classification task. After preprocessing (checking for missing values and duplicates, label encoding, and feature scaling), the model achieved a test accuracy of 88.89%, which exceeds the target of 81%.

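Furthermore, 5-fold cross-validation gave a mean accuracy of 86.33%, with fold scores ranging from 84.44% to 89.44%. This is slightly below the single test-split result but still above the target, which suggests the model generalizes consistently across different subsets of the data.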

These results demonstrate that even a relatively simple linear model like Logistic Regression can effectively separate the two raisin varieties (Kecimen and Besni) based on their morphological features. This highlights the potential of traditional machine learning algorithms in agricultural product classification tasks, where interpretability and efficiency are often as important as predictive accuracy.