# Predicting Raisin Variety with Decision Trees and Logistic Regression

## Introduction

This project predicts raisin variety with machine learning models. The dataset contains 900 samples, each described by seven shape features (such as area, axis lengths, eccentricity, and perimeter) and labeled with one of two classes: 'Kecimen' or 'Besni'. Two classifiers are compared, a Decision Tree Classifier and Logistic Regression, with hyperparameters tuned by Grid Search and Randomized Search, respectively.

Data source: [Raisin Dataset](https://www.muratkoklu.com/datasets/) by [Murat Koklu](https://www.kaggle.com/muratkokludataset)

## Methods and Objectives

This project follows a systematic approach to build and evaluate machine learning models for raisin variety prediction:

1. **Data Loading and Preprocessing**: The necessary libraries are imported, and the dataset is loaded. The 'Class' column, representing the raisin variety, is recoded into binary values (0 for 'Kecimen' and 1 for 'Besni') for classification purposes.

2. **Exploratory Data Analysis (EDA)**: An initial exploration of the dataset is conducted, including printing the dataset's head and summarizing key statistics. This step aids in understanding the dataset's structure and class distribution.

3. **Data Splitting**: The dataset is divided into training and testing sets using the `train_test_split` function to assess model performance on unseen data.

4. **Grid Search with Decision Tree Classifier**: A Decision Tree Classifier is tuned with Grid Search over the `min_samples_split` and `max_depth` hyperparameters. The best estimator, its cross-validation score, and its accuracy on the test data are printed.

5. **Randomized Search with Logistic Regression**: A Logistic Regression model is tuned with Randomized Search. The `penalty` type is sampled from the list `['l1', 'l2']`, and `C` is drawn from a uniform distribution defined with `scipy.stats.uniform`. The best cross-validation score and the model's accuracy on the test data are printed.

6. **Results Summary**: The results of both tuning runs are summarized, including the best hyperparameters, the best cross-validation scores (computed on the training data), and accuracy on the test data. This summary shows how each model performs and how much the choice of hyperparameters matters.

By following this approach, the project aims to identify the most suitable machine learning model and hyperparameters for accurately predicting raisin variety. It also assesses the effectiveness of different search strategies (Grid Search and Randomized Search) for hyperparameter tuning.

The sections below walk through the code for each step.

------ 

## Importing Libraries

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import uniform

## Load Data

In [13]:
# Load the data from the Excel (.xlsx) file
raisins = pd.read_excel('dataset/Raisin_Dataset.xlsx')

# Recode class to binary
raisins['Class'] = raisins['Class'].replace({'Kecimen': 0, 'Besni': 1})
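
A label that is missing from the recoding dictionary is easy to overlook. The following is a minimal sketch, using a small made-up Series (`labels_demo`, a hypothetical stand-in for the real `Class` column). It relies on `Series.map`, which yields NaN for unmapped labels, to check that every label was recoded:

```python
import pandas as pd

# Hypothetical labels standing in for the 'Class' column
labels_demo = pd.Series(['Kecimen', 'Besni', 'Besni', 'Kecimen'])
class_codes = {'Kecimen': 0, 'Besni': 1}

# map() returns NaN for any label that is not a key of the dictionary
encoded_demo = labels_demo.map(class_codes)

# Fail early if an unexpected label appeared
assert encoded_demo.notna().all(), "found labels outside class_codes"
print(encoded_demo.tolist())
```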

## Explore the Dataset

In [14]:
# Print head of raisins
raisins.head()

    Area  MajorAxisLength  MinorAxisLength  Eccentricity  ConvexArea    Extent  Perimeter  Class
0  87524       442.246011       253.291155      0.819738       90546  0.758651    1184.04      0
1  75166       406.690687       243.032436      0.801805       78789   0.68413   1121.786      0
2  90856       442.267048       266.328318      0.798354       93717  0.637613   1208.575      0
3  45928       286.540559       208.760042      0.684989       47336  0.699599    844.162      0
4  79408        352.19077       290.827533      0.564011       81463  0.792772   1073.251      0


In [10]:
# Create predictor and target variables, X and y
X = raisins.drop('Class', axis=1)
y = raisins['Class']

In [11]:
# Examine the dataset
print("Number of features:", X.shape[1])
print("Total number of samples:", len(y))
print("Samples belonging to class '1':", y.sum())

Number of features: 7
Total number of samples: 900
Samples belonging to class '1': 450
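
Because the two classes are perfectly balanced (450 of 900 samples each), always predicting a single class would reach an accuracy of 0.5. That is the baseline the models need to beat. A short check of this arithmetic, using the counts printed above:

```python
# Counts taken from the output above
n_samples = 900
n_class_1 = 450

# Accuracy of always predicting the more frequent class
majority_baseline = max(n_class_1, n_samples - n_class_1) / n_samples
print("Majority-class baseline accuracy:", majority_baseline)
```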


In [12]:
# Split the dataset into training and testing sets (default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)
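
`train_test_split` is called without `test_size`, so scikit-learn's default of a 25% test set applies. The sketch below uses placeholder arrays with the same shape as the raisin data (`X_demo` and `y_demo` are hypothetical; the real values are not needed to see the split sizes). It also passes the optional `stratify` argument, which the original call does not use, to keep the class ratio nearly identical in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the raisin features (900 x 7)
X_demo = np.zeros((900, 7))
y_demo = np.array([0] * 450 + [1] * 450)

# Default test_size is 0.25; stratify=y_demo is an optional addition
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=19, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print("Class-1 share in train/test:", y_tr.mean(), y_te.mean())
```

With 900 samples this gives 675 training rows and 225 test rows.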

## Grid Search with Decision Tree Classifier

In [15]:
# Create a Decision Tree model
tree = DecisionTreeClassifier()

# Dictionary of parameters for GridSearchCV
parameters = {'min_samples_split': [2,3,4], 'max_depth': [3,5,7]}

# Create a GridSearchCV model
grid = GridSearchCV(tree, parameters)

# Fit the GridSearchCV model to the training data
grid.fit(X_train, y_train)
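
`GridSearchCV` evaluates every combination in the grid. With three values per hyperparameter that is 3 × 3 = 9 candidates. Since no `cv` argument is passed, scikit-learn's default 5-fold cross-validation applies, giving 45 model fits plus one final refit of the best candidate on the full training set. A small sketch that enumerates the candidates with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

parameters = {'min_samples_split': [2, 3, 4], 'max_depth': [3, 5, 7]}

# ParameterGrid expands the dictionary into every combination
candidates = list(ParameterGrid(parameters))
n_folds = 5  # scikit-learn's default when cv is not specified

print("Candidates:", len(candidates))
print("Cross-validation fits:", len(candidates) * n_folds)
print("First candidate:", candidates[0])
```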

In [21]:
# Print the model and hyperparameters obtained by GridSearchCV
print(grid.best_estimator_)

# Print the best mean cross-validated score (computed on the training folds)
print("Best score on train data based on grid search: ", round(grid.best_score_, 3))
# Print the accuracy of the refit best model on the test data
print("Accuracy on test data based on grid search: ", round(grid.score(X_test, y_test), 3))

DecisionTreeClassifier(max_depth=5, min_samples_split=3)
Best score on train data based on grid search:  0.868
Accuracy on test data based on grid search:  0.813


In [22]:
# Print a table summarizing the results of GridSearchCV
df = pd.concat([pd.DataFrame(grid.cv_results_['params']), pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['Score'])], axis=1)
print(df)

   max_depth  min_samples_split     Score
0          3                  2  0.859259
1          3                  3  0.860741
2          3                  4  0.860741
3          5                  2  0.863704
4          5                  3  0.868148
5          5                  4  0.860741
6          7                  2  0.850370
7          7                  3  0.848889
8          7                  4  0.835556
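
The nine rows are easier to compare as a grid. The sketch below rebuilds the table above by hand (the scores are copied from the printed output) and pivots it so that each `max_depth` is a row and each `min_samples_split` a column:

```python
import pandas as pd

# Scores copied from the GridSearchCV table above
scores = pd.DataFrame({
    'max_depth':         [3, 3, 3, 5, 5, 5, 7, 7, 7],
    'min_samples_split': [2, 3, 4, 2, 3, 4, 2, 3, 4],
    'Score': [0.859259, 0.860741, 0.860741,
              0.863704, 0.868148, 0.860741,
              0.850370, 0.848889, 0.835556],
})

# Rows: max_depth, columns: min_samples_split
score_grid = scores.pivot(index='max_depth', columns='min_samples_split', values='Score')
print(score_grid.round(3))

# Mean score per depth
print(score_grid.mean(axis=1).round(3))
```

Read this way, `max_depth=7` scores lowest for every value of `min_samples_split`, which is consistent with deeper trees overfitting the training folds. The differences are small, however, and the tree was fit without a fixed `random_state`, so they may not be stable across runs.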


## Randomized Search with Logistic Regression

In [25]:
# Define the logistic regression model (the liblinear solver supports both l1 and l2 penalties)
lr = LogisticRegression(solver='liblinear', max_iter=1000)

# Define distributions to choose hyperparameters from
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# Create a RandomizedSearchCV model
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

# Fit the random search model
clf.fit(X_train, y_train)

# Print the best mean cross-validated score (computed on the training folds)
print("Best score on train data based on randomized search: ", round(clf.best_score_, 3))
# Print the accuracy of the refit best model on the test data
print("Accuracy on test data based on randomized search: ", round(clf.score(X_test, y_test), 3))

Best score on train data based on randomized search:  0.876
Accuracy on test data based on randomized search:  0.88
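
`RandomizedSearchCV` does not enumerate a grid; it draws `n_iter` candidates from the given distributions. List-valued entries such as `penalty` are sampled uniformly from the list, while `uniform(loc=0, scale=100)` draws `C` from the interval [0, 100]. The sketch below uses `ParameterSampler` to illustrate this kind of draw. The `random_state=0` argument is an assumption added for repeatability, so the sampled values differ from those in the run above:

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# Draw 8 candidates, mirroring n_iter=8
samples = list(ParameterSampler(distributions, n_iter=8, random_state=0))
for candidate in samples:
    print(candidate['penalty'], round(candidate['C'], 3))
```

A uniform distribution on [0, 100] rarely produces values below 1, so strongly regularized models are seldom tried. `scipy.stats.loguniform` is a common alternative when `C` should be explored across several orders of magnitude.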


In [26]:
# Print a table summarizing the results of RandomizedSearchCV
df = pd.concat([pd.DataFrame(clf.cv_results_['params']), pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['Accuracy'])], axis=1)
print(df.sort_values('Accuracy', ascending=False))

           C penalty  Accuracy
5   0.072707      l2  0.875556
2  86.940062      l2  0.875556
7  21.982322      l2  0.875556
0  89.140474      l1  0.874074
1  92.191806      l2  0.874074
3   9.099659      l2  0.874074
4  85.493831      l1  0.874074
6  79.254236      l1  0.874074
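
To put the two searches side by side, the sketch below collects the figures reported above into one table. Nothing is recomputed; the values are copied from the printed outputs, and the column names are illustrative:

```python
import pandas as pd

# Figures copied from the outputs above
summary = pd.DataFrame(
    {
        'Search': ['GridSearchCV', 'RandomizedSearchCV'],
        'Best CV score': [0.868, 0.876],
        'Test accuracy': [0.813, 0.880],
    },
    index=['Decision Tree', 'Logistic Regression'],
)

# Gap between cross-validated and held-out performance
summary['CV minus test'] = (summary['Best CV score'] - summary['Test accuracy']).round(3)
print(summary)
```

On this split, logistic regression reaches the higher test accuracy (0.880 versus 0.813), and the decision tree shows the larger gap between its cross-validated score and its test accuracy. Neither search fixes `random_state`, so rerunning the notebook may shift these numbers slightly.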
