# <p><center style="font-family:newtimeroman;font-size:180%;">Detection of Pistachio Types Using Classifier Models </center></p>
### Table of contents:

* [Introduction](#1)
* [Import Libraries](#2)
* [Import Dataset](#3)
* [Visualization](#4)
* [Preprocessing And Feature Engineering](#5)
* [Using GridSearch to find optimal hyperparameters](#6)
* [Train And Evaluate Classifier Models](#7)
* [Choosing The Best Model](#8)

<a id="1"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Introduction</p>

<html>
<body>
  <h1>Detection of Pistachio Types Using Classifier Models</h1>
  <p>Did you know that pistachios come in various types? In Iran alone, botanists have identified over a dozen pistachio varieties, although only a smaller number are commercially recognized and available. Because each variety has its own distinct characteristics, being able to tell them apart is of real practical value. For instance, some pistachios are better suited to confectionery, while others, which are crunchier in texture, are mainly eaten as snacks.</p>
  <p>Now, imagine a company or a wholesale store that buys pistachios in bulk. It wants to confirm that the type it received matches what it ordered and that no fraud has taken place. But how can pistachios be told apart by appearance? If we give this task to a specialist (a pistachio expert!), they examine each pistachio and judge how closely its shape resembles that of known varieties. They also compare features such as length, size, and even the degree of shell opening against known pistachio types. After weighing these features, the expert announces a final determination.</p>
</body>
</html>

<html>
<head>
<style>
body {
  font-family: Arial, sans-serif;
  font-size: 16px;
  line-height: 1.6;
}

h1 {
  text-align: center;
  margin-bottom: 30px;
}

table {
  margin: 0 auto;
  border-collapse: collapse;
  width: 100%;
}

th, td {
  padding: 10px;
  text-align: left;
  border-bottom: 1px solid #ddd;
}

th {
  background-color: #f2f2f2;
}

.description {
  text-align: left;
}
</style>
</head>
<body>
<h1>DataSet</h1>
<p>The dataset contains 1,718 samples of two pistachio varieties, Kirmizi and Siirt, both grown in Turkey. High-resolution images were captured for both types and then processed with image processing techniques. Feature extraction yielded 16 attributes for each sample. Some features, such as length and perimeter, were measured directly from the image, while others were derived from these primary measurements. To keep things simple, the formulas linking the derived features to the primary ones are not detailed here. Nevertheless, some of these features are illustrated in the accompanying image. The features of this dataset are listed below:</p>
<table>
<tr>
<th>Description</th>
<th>Column</th>
</tr>
<tr>
<td class="description">The area of the pistachio region detected in the image</td>
<td><code>AREA</code></td>
</tr>
<tr>
<td class="description">The perimeter (total length) of the pistachio region boundary</td>
<td><code>PERIMETER</code></td>
</tr>
<tr>
<td class="description">The length of the major axis of the ellipse that best fits the pistachio region</td>
<td><code>MAJOR_AXIS</code></td>
</tr>
<tr>
<td class="description">The length of the minor axis of the ellipse that best fits the pistachio region</td>
<td><code>MINOR_AXIS</code></td>
</tr>
<tr>
<td class="description">The eccentricity of the ellipse that describes the elongation of the pistachio region</td>
<td><code>ECCENTRICITY</code></td>
</tr>
<tr>
<td class="description">The equivalent diameter of the pistachio region</td>
<td><code>EQDIASQ</code></td>
</tr>
<tr>
<td class="description">The ratio of the pistachio region's area to the area of its convex hull</td>
<td><code>SOLIDITY</code></td>
</tr>
<tr>
<td class="description">The area of the convex hull of the pistachio region</td>
<td><code>CONVEX_AREA</code></td>
</tr>
<tr>
<td class="description">The ratio of the area of the pistachio region to the area of the bounding box that encloses it</td>
<td><code>EXTENT</code></td>
</tr>
<tr>
<td class="description">The ratio of the major axis length to the minor axis length, indicating the shape of the pistachio region</td>
<td><code>ASPECT_RATIO</code></td>
</tr>
<tr>
<td class="description">A measure of how closely the pistachio region resembles a perfect circle (circularity)</td>
<td><code>ROUNDNESS</code></td>
</tr>
<tr>
<td class="description">A measure of how compact the shape of the pistachio region is</td>
<td><code>COMPACTNESS</code></td>
</tr>
<tr>
<td class="description">Shape factor 1, a geometric descriptor calculated from the area and perimeter of the pistachio region</td>
<td><code>SHAPEFACTOR_1</code></td>
</tr>
<tr>
<td class="description">Shape factor 2, another geometric descriptor calculated from the area and perimeter of the pistachio region</td>
<td><code>SHAPEFACTOR_2</code></td>
</tr>
<tr>
<td class="description">Shape factor 3, a geometric descriptor derived from the area and perimeter of the pistachio region</td>
<td><code>SHAPEFACTOR_3</code></td>
</tr>
<tr>
<td class="description">Shape factor 4, another geometric descriptor derived from the area and perimeter of the pistachio region</td>
<td><code>SHAPEFACTOR_4</code></td>
</tr>
<tr>
<td class="description">The class or category of the pistachio</td>
<td><code><b>Class</b></code></td>
</tr>
</table>
</body>
</html>
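
Several columns in the table are ratios or simple functions of the primary measurements. The sketch below shows how a few of them could be computed, assuming common textbook definitions. The dataset's exact formulas are not given in this notebook and may differ, and the function name and the `EQUIVALENT_DIAMETER` key are hypothetical, chosen only for illustration.

```python
import math


def derived_shape_features(area, perimeter, major_axis, minor_axis, convex_area):
    """Compute a few derived shape descriptors from primary measurements.

    Textbook definitions, assumed for illustration only; the dataset's own
    formulas may differ.
    """
    return {
        # Major axis length divided by minor axis length
        "ASPECT_RATIO": major_axis / minor_axis,
        # Eccentricity of the fitted ellipse (0 for a circle)
        "ECCENTRICITY": math.sqrt(1.0 - (minor_axis / major_axis) ** 2),
        # Diameter of a circle with the same area as the region
        "EQUIVALENT_DIAMETER": math.sqrt(4.0 * area / math.pi),
        # Region area divided by convex hull area
        "SOLIDITY": area / convex_area,
        # Equals 1 for a perfect circle, smaller for irregular shapes
        "ROUNDNESS": 4.0 * math.pi * area / perimeter ** 2,
    }


# Sanity check on a perfect circle of radius 10 (made-up values, not dataset rows)
radius = 10.0
circle = derived_shape_features(
    area=math.pi * radius ** 2,
    perimeter=2.0 * math.pi * radius,
    major_axis=2.0 * radius,
    minor_axis=2.0 * radius,
    convex_area=math.pi * radius ** 2,
)
print(circle)
```

For a circle, the aspect ratio, solidity, and roundness should all come out as 1 and the eccentricity as 0, which makes it a convenient check on the formulas.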

<a id="2"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Libraries </p>

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, roc_curve, classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score,
)

import warnings
warnings.filterwarnings('ignore')

<a id="3"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Dataset </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Import dataset
train_data = pd.read_csv("/kaggle/input/pistachio-types-detection/pistachio.csv")
train_data.head()

In [None]:
# More details about dataset
train_data.info()

<a id="4"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Visualization </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Feature-Class Relationship Box Plots

class RelationshipPlotter:
    
    def __init__(self, data):
        self.data = data
        # Create an 8x2 subplot grid (one axis per feature, 16 in total)
        self.fig, self.axes = plt.subplots(nrows=8, ncols=2, figsize=(12, 24))
    
    def plot_box(self, row, col, x, y):
        # Select the appropriate subplot
        ax = self.axes[row, col]  
        sns.boxplot(data=self.data, x=x, y=y, ax=ax)
        ax.set_title(f"{x.capitalize()} vs. {y.capitalize()}")

    def show_plots(self):
        plt.tight_layout() 
        plt.show()

columns_of_box = train_data.columns[:-1]

plotter = RelationshipPlotter(data=train_data)

# Plot the box plots
for i, col in enumerate(columns_of_box):
    plotter.plot_box(i // 2, i % 2, x=col, y='Class')

plotter.show_plots()

In [None]:
# Class Label Mapping for Pistachio Types
def map_class_labels(data, mapping):
    data['Class'] = data['Class'].map(mapping)
    return data

mapping_class = {
    'Kirmizi_Pistachio': 0,
    'Siit_Pistachio': 1
}

train_data = map_class_labels(train_data, mapping_class)

In [None]:
# Calculate the correlation matrix
corr_matrix = train_data.corr()
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, ax=ax)
ax.set_title('Correlation Matrix')
plt.show()

<a id="5"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Preprocessing And Feature Engineering </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
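# Splitting Data into Training and Validation Sets
Y = train_data['Class']
X = train_data.drop(columns=['Class'])
# random_state fixes the shuffle so the split (and the scores below) are reproducible
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, shuffle=True, random_state=42)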

In [None]:
# Scaling the dataset using StandardScaler
scaler = StandardScaler()

# Fit on the training split only, then apply the same transform to the validation split
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_val = scaler.transform(X_val)
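
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` does: it learns each column's mean and standard deviation from the training split and reuses those same statistics on the validation split. The arrays below are randomly generated stand-ins, not the pistachio data, and all `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 3 numeric features)
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
demo_val = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training split only
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Manual z-score of the validation split using the training statistics
manual_val = (demo_val - train_mean) / train_std

# The same transformation through scikit-learn
demo_scaler = StandardScaler().fit(demo_train)
sklearn_val = demo_scaler.transform(demo_val)

print(np.allclose(manual_val, sklearn_val))
```

Because the validation split is scaled with training statistics, its columns will generally not have exactly zero mean and unit variance, and that is expected.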

<a id="6"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Using GridSearch to find optimal hyperparameters </p>
<a class="btn" href="#home">Table of Contents</a>

<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <h1>GridSearchCV for Hyperparameter Tuning</h1>
  <p>
    Grid search is a hyperparameter tuning technique: it looks for the best combination of hyperparameters for a given model. Hyperparameters are settings that are not learned from the data but are fixed before training. scikit-learn's GridSearchCV exhaustively tries every combination in a specified grid of hyperparameter values and scores each one with cross-validation to determine the best-performing set.
  </p>
</body>
</html>
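
As a rough illustration of the idea, the sketch below performs the same search by hand: it loops over every combination in a small grid, scores each one with 5-fold cross-validation, and keeps the best. It then runs `GridSearchCV` on the same grid for comparison. The data is synthetic (`make_classification`), and the grid and all `demo_*` names are assumptions chosen for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately tiny grid: 2 x 2 = 4 combinations
demo_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}

# Manual grid search: score every combination with 5-fold cross-validation
manual_results = []
for params in ParameterGrid(demo_grid):
    model = DecisionTreeClassifier(random_state=0, **params)
    mean_score = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    manual_results.append((mean_score, params))

best_manual_score, best_manual_params = max(manual_results, key=lambda item: item[0])

# The same search delegated to GridSearchCV
search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid, cv=5)
search.fit(X_demo, y_demo)

print("Manual best:", best_manual_params, round(best_manual_score, 4))
print("GridSearchCV best:", search.best_params_, round(search.best_score_, 4))
```

The number of model fits grows as the product of the grid sizes times the number of folds, which is why larger grids such as the XGBoost one can take a long time to run.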

In [None]:
# Define hyperparameter grids
DecisionTree_hyperparameters = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 3, 4],
    'criterion': ['gini', 'entropy']
}

RandomForest_hyperparameters = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3],
    'criterion': ['gini', 'entropy']
}

XGBoost_hyperparameters = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2]
}

SVM_hyperparameters = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4]
}

KNN_hyperparameters = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

LogisticRegression_hyperparameters = {
    'C': [0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

# Perform grid search for each model
models = {
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression()
}

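# Map each model name to its grid explicitly (avoids looking variables up with eval)
hyperparameter_grids = {
    'DecisionTree': DecisionTree_hyperparameters,
    'RandomForest': RandomForest_hyperparameters,
    'XGBoost': XGBoost_hyperparameters,
    'SVM': SVM_hyperparameters,
    'KNN': KNN_hyperparameters,
    'LogisticRegression': LogisticRegression_hyperparameters
}

for model_name, model in models.items():
    grid_search = GridSearchCV(model, hyperparameter_grids[model_name], cv=5)
    grid_search.fit(scaled_x_train, y_train)
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print(f"{model_name} Best Parameters: {best_params}")
    print(f"{model_name} Best Score: {best_score}")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++")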

<a id="7"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Train and Evaluate Models  </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Classifier Evaluation Metrics Plotter
class ClassifierEvaluationPlot:
    def __init__(self, classifiers):
        self.classifiers = classifiers
        self.classifier_names = []
        self.accuracies = []
        self.precisions = []
        self.recalls = []
        self.f1_scores = []

    def evaluate_classifiers(self, X_train, y_train, X_test, y_test):
        # Fit each classifier on the training data and score it on the held-out data
        for classifier in self.classifiers:
            classifier_name = type(classifier).__name__
            self.classifier_names.append(classifier_name)

            classifier.fit(X_train, y_train)
            y_pred = classifier.predict(X_test)

            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)

            self.accuracies.append(accuracy)
            self.precisions.append(precision)
            self.recalls.append(recall)
            self.f1_scores.append(f1)

    def plot_evaluation_metrics(self):
        x = np.arange(len(self.classifier_names))
        width = 0.2
        sns.set_style('darkgrid')
        fig, ax = plt.subplots(figsize=(15, 10))
        rects1 = ax.bar(x - 1.5 * width, self.accuracies, width, label='Accuracy')
        rects2 = ax.bar(x - 0.5 * width, self.precisions, width, label='Precision')
        rects3 = ax.bar(x + 0.5 * width, self.recalls, width, label='Recall')
        rects4 = ax.bar(x + 1.5 * width, self.f1_scores, width, label='F1 Score')

        ax.set_ylabel('Score')
        ax.set_title('Evaluation Metrics for Classifiers')
        ax.set_xticks(x)
        ax.set_xticklabels(self.classifier_names, rotation=45, ha="right")
        ax.legend(loc='lower right')

        def autolabel(rects):
            for rect in rects:
                height = rect.get_height()
                ax.annotate(f'{height:.2f}', xy=(rect.get_x() + rect.get_width() / 2, height), xytext=(0, 3),
                            textcoords="offset points", ha='center', va='bottom')

        autolabel(rects1)
        autolabel(rects2)
        autolabel(rects3)
        autolabel(rects4)

        plt.tight_layout()
        plt.show()

In [None]:
# Evaluation of Default Classifiers and Plotting Metrics
classifiers_with_default_values = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    XGBClassifier(),
    SVC(),
    KNeighborsClassifier(),
    LogisticRegression()
]

evaluation_plot = ClassifierEvaluationPlot(classifiers_with_default_values)
evaluation_plot.evaluate_classifiers(scaled_x_train, y_train, scaled_x_val, y_val)
evaluation_plot.plot_evaluation_metrics()

In [None]:
# Evaluation of Classifiers with Best Hyperparameters and Metrics Plotting
classifiers_with_best_hyperparameters = [
    DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_leaf=2, min_samples_split=4, random_state=42),
    RandomForestClassifier(criterion='entropy', max_depth=10, min_samples_leaf=2, min_samples_split=6, n_estimators=200, random_state=42),
    XGBClassifier(colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=7, n_estimators=300, subsample=1.0, random_state=42),
    SVC(C=1.0, degree=2, gamma='scale', kernel='rbf', random_state=42),
    KNeighborsClassifier(metric='euclidean', n_neighbors=9, weights='distance'),
    LogisticRegression(C=10.0, penalty='l1', solver='liblinear', random_state=42)
]

evaluation_plot = ClassifierEvaluationPlot(classifiers_with_best_hyperparameters)
evaluation_plot.evaluate_classifiers(scaled_x_train, y_train, scaled_x_val, y_val)
evaluation_plot.plot_evaluation_metrics()

In [None]:
# Receiver Operating Characteristic (ROC) Curve for Classifiers
fig, axes = plt.subplots(nrows=2, figsize=(10, 12))

for i, cl in enumerate([classifiers_with_default_values, classifiers_with_best_hyperparameters]):
    ax = axes[i]
    cl_name = "classifiers_with_default_values" if i == 0 else "classifiers_with_best_hyperparameters"

    for classifier in cl:
        classifier.fit(scaled_x_train, y_train)

        try:
            # Try using predict_proba
            y_scores = classifier.predict_proba(scaled_x_val)[:, 1]
        except AttributeError:
            # If predict_proba is not available, use decision_function
            y_scores = classifier.decision_function(scaled_x_val)

        # Compute the AUC from the continuous scores so it matches the plotted curve
        roc_auc = roc_auc_score(y_val, y_scores)
        fpr, tpr, thresholds = roc_curve(y_val, y_scores)
        ax.plot(fpr, tpr, label=f'{type(classifier).__name__} (area={roc_auc:.2f})')

    ax.plot([0, 1], [0, 1], 'r--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'Receiver Operating Characteristic Curve - {cl_name}')
    ax.legend(loc='lower right')

plt.tight_layout()
plt.show()
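
The area shown in a ROC legend should be computed from continuous scores (probabilities or decision-function values), not from hard 0/1 predictions. The small sketch below, using made-up labels and scores rather than the pistachio results, shows that the two can differ noticeably. All `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and classifier scores (not from the pistachio data)
demo_true = np.array([0, 0, 0, 1, 1, 1])
demo_scores = np.array([0.1, 0.2, 0.7, 0.4, 0.8, 0.9])

# Hard predictions obtained by thresholding the scores at 0.5
demo_labels = (demo_scores >= 0.5).astype(int)

# AUC from continuous scores: uses the full ranking of the samples
auc_from_scores = roc_auc_score(demo_true, demo_scores)

# AUC from hard labels: only a single operating point is available
auc_from_labels = roc_auc_score(demo_true, demo_labels)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}")
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the score-based AUC is 8/9, while the label-based value collapses to the average of the true positive rate and the true negative rate at the 0.5 threshold, which is 2/3.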

<a id="8"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Choosing The Best Model  </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Final Model (XGBClassifier)
# Training the Best Model
Final_Model = XGBClassifier(colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=7, n_estimators=300, subsample=1.0, random_state=42)
Final_Model.fit(scaled_x_train, y_train)
y_pred = Final_Model.predict(scaled_x_val)

# Classification report and confusion matrix
print(classification_report(y_val, y_pred))
# Rows are actual classes and columns are predicted classes, both ordered 0 then 1
cm = confusion_matrix(y_val, y_pred)
print('Confusion Matrix : \n', cm)
# Class 1 is treated as the positive class, matching f1_score's default
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
print('Sensitivity : ', sensitivity)
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Specificity : ', specificity)
print('f1 score:', f1_score(y_val, y_pred))

# Visualize confusion matrix with seaborn heatmap
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'], columns=['Predict Negative:0', 'Predict Positive:1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()
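
Since the layout of scikit-learn's confusion matrix is easy to misread, here is a small sketch with made-up labels (not the model's actual predictions). Rows correspond to the actual class and columns to the predicted class, so with labels ordered `[0, 1]` the flattened matrix reads TN, FP, FN, TP. Class 1 is assumed to be the positive class, and the `demo_*` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three actual negatives (0) and four actual positives (1)
demo_true = [0, 0, 0, 1, 1, 1, 1]
demo_pred = [0, 0, 1, 1, 1, 1, 0]

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred, labels=[0, 1]).ravel()

# Sensitivity (recall of the positive class): share of actual positives that were found
demo_sensitivity = tp / (tp + fn)

# Specificity: share of actual negatives that were correctly rejected
demo_specificity = tn / (tn + fp)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Sensitivity={demo_sensitivity:.3f}, Specificity={demo_specificity:.3f}")
```

Unpacking the four cells by name, as done here, tends to be less error-prone than indexing the matrix by position.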

<a id="9"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Thank you for taking the time to review my notebook. If you have any questions or criticisms, please kindly let me know in the comments section.  </p>
