# Deep Learning Major Task
## CNN Leaf Classification

<ol>
  <li><a href="#overview">Overview</a></li>
  <li><a href="#part1">Part I: Data Preparation</a>
    <ol>
      <li><a href="#lookatdata">Describe the Data</a></li>
      <li><a href="#clean-the-data">Clean the Data</a></li>
      <li><a href="#check-for-missing-values-and-duplicates">Check for Missing Values and Duplicates</a></li>
      <li><a href="#visualize-the-data">Visualize the Data</a></li>
      <li><a href="#draw-images">Draw Images</a></li>
      <li><a href="#correlation-analysis">Correlation Analysis</a></li>
      <li><a href="#divide-the-data">Divide the Data</a></li>
      <li><a href="#standardize-the-data">Standardize the Data</a></li>
      <li><a href="#encode-the-labels">Encode the Labels</a></li>
    </ol>
  </li>
  <li><a href="#training-a-neural-network">Part II: Training a Neural Network (CNN)</a>
    <ol>
      <li><a href="#implement-a-cnn-model">Implement a CNN Model</a></li>
      <li><a href="#write-training-function">Write Training Function</a></li>
      <li><a href="#explore-hyperparameter-settings">Explore Hyperparameter Settings</a></li>
      <li><a href="#tensorboard-monitoring">TensorBoard Monitoring</a></li>
      <li><a href="#evaluation-function">Evaluation Function</a></li>
    </ol>
  </li>
</ol>

<h3>Description</h3>
<a id="description"></a>

### First, let's write our imports

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


# Part I: Data Preparation
<a id="part1"></a>

<h2>Taking a look at the data and describing it</h2>
<a id="lookatdata"></a>

### Training dataset

In [None]:
# Load the training set
train_df = pd.read_csv('data_files/train.csv')  # forward slashes work on every OS

print("#-----> First 5 rows of the training set:\n")
train_df.head(5)

In [None]:
print("-----> training set description:")
train_df.describe()

In [None]:
print("-----> training set information")
train_df.info()

In [None]:
print("-----> training set value types")
train_df.dtypes

### Testing dataset

In [None]:
# Load the testing set
test_df = pd.read_csv('data_files/test.csv')  # forward slashes work on every OS

print("#-----> First 5 rows of the testing set:")
test_df.head(5)

In [None]:
print("-----> testing set description:")
test_df.describe()

In [None]:
print("-----> testing set information")
test_df.info()

In [None]:
print("-----> testing set value types")
test_df.dtypes

<h2>Cleaning the data</h2>

### Checking the data for missing values or duplicates and carrying out proper correction methods

In [None]:
# Check for missing values
print("Missing values:\n", train_df.isnull().sum(), "\n")

# Check for duplicates
print("Duplicate values:\n", train_df.duplicated().sum())


### ----> Looks like we don't have any missing or duplicate values
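
Had the checks above reported problems, one common way to correct them is sketched below. The toy DataFrame, the `clean` helper, and the choice of median imputation are assumptions for illustration only; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill numeric gaps with the column median."""
    out = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

# Hypothetical toy data: one duplicated row and one missing value
toy = pd.DataFrame({
    "margin1": [0.1, 0.1, np.nan, 0.4],
    "species": ["Acer", "Acer", "Quercus", "Quercus"],
})
cleaned = clean(toy)
print(cleaned)
```

Median imputation is only one option; dropping incomplete rows with `dropna()` may be preferable when very few rows are affected.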

Before we continue, let's set up our data by dropping the id and species columns from the features and using species as the target.

In [None]:
# Exclude 'id' and 'species' columns
X_features = train_df.drop(['id', 'species'], axis=1)
y_target = train_df['species']

## Visualizing the data

In [None]:
# Feature Distributions
features = train_df.iloc[:, 2:]  # Assuming features start from column 2
plt.figure(figsize=(24, 16))
# The 3x3 grid holds 9 plots, so plot only the first 9 features
# (asking for a 10th subplot would raise a ValueError)
for i, feature in enumerate(features.columns[:9], 1):
    plt.subplot(3, 3, i)
    sns.histplot(train_df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

In [None]:
# # Visualization 3: Pairwise Feature Scatter Plots
# sns.pairplot(train_df.sample(500), hue='species', diag_kind='kde')
# plt.suptitle('Pairwise Scatter Plots for Features', y=1.02)
# plt.show()

In [None]:
# Dimensionality Reduction Visualization (using PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(features)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=train_df['species'])
plt.title('PCA Visualization')
plt.show()

### Let's display some leaf images from the training set

In [None]:
from PIL import Image
import os

image_dir = os.path.join('data_files', 'images')  # portable; avoids invalid '\d' and '\i' escapes
image_ids = train_df['id'].head(5).tolist() 

plt.figure(figsize=(15, 8))

for i, image_id in enumerate(image_ids, 1):
    image_path = os.path.join(image_dir, f"{image_id}.jpg")
    image = Image.open(image_path).convert('RGB')

    plt.subplot(1, 5, i)
    plt.imshow(image)
    plt.title(f"Image {i}")
    plt.axis('off')

plt.show()


<h2>Correlation Analysis</h2>

We will calculate the correlation matrix for the feature columns from margin1 through texture64<br>
and display it as a heatmap.


In [None]:
# Extracting columns related to 'margin' and 'texture'
margin_texture_columns = train_df.loc[:, 'margin1':'texture64']

# Calculate correlation matrix for 'margin' and 'texture' features
correlation_matrix_margin_texture = margin_texture_columns.corr()

# Display heatmap for the correlation matrix
plt.figure(figsize=(20, 12))
sns.heatmap(correlation_matrix_margin_texture, cmap='coolwarm')
plt.title('Correlation Matrix for Margin and Texture Features')
plt.show()
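
A heatmap this large is hard to read cell by cell, so it can help to list the most strongly correlated feature pairs as well. The sketch below does that on toy data; the `top_correlated_pairs` helper and the toy columns `a`, `b`, and `c` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy features: b is an exact multiple of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

top = top_correlated_pairs(toy.corr(), n=1)
print(top)
```

On the real data, the same helper could be called on the correlation matrix computed above.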


### Deciding which split method to use

#### We have two methods for splitting:
<ol>
<li>train_test_split</li>
<li>StratifiedShuffleSplit (sss)</li>
</ol>

<b>train_test_split:</b><br>
Usage: Commonly used for general train-test splitting, especially when the class distribution is not a significant concern.<br>
How it works: Randomly shuffles and splits the data into training and test sets.<br>
Advantage: Simplicity and ease of use. Suitable for well-balanced datasets.<br>

<b>StratifiedShuffleSplit:</b><br>
Usage: Typically used when you want to ensure that the distribution of classes in both the training and validation sets is representative of the overall distribution in the dataset.<br>
How it works: StratifiedShuffleSplit maintains the class distribution when creating random splits. It shuffles the data and then creates splits, ensuring that each split has a similar class distribution.<br>
Advantage: Useful when dealing with imbalanced datasets where certain classes have significantly fewer samples than others.<br>

If the dataset has a <b>balanced</b> class distribution and you just need a simple split, train_test_split is often sufficient and easier to use.<br>

If the dataset has <b>imbalanced</b> classes and you want to ensure that the class distribution is maintained in both the training and validation sets, then StratifiedShuffleSplit is a good choice.<br>
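
The difference between the two methods can be sketched on a small, deliberately imbalanced toy dataset. The arrays `X` and `y` below are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Plain random split: the minority share of the test set can drift
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified split: class proportions are preserved in the test set
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
y_test_strat = y[test_idx]

print("random split, minority count:    ", int((y_test_random == 1).sum()))
print("stratified split, minority count:", int((y_test_strat == 1).sum()))
```

With 10% minority samples and a 20-sample test set, the stratified split keeps exactly 2 minority samples, while the random split gives no such guarantee. Passing `stratify=y` to `train_test_split` is a shorter way to get a single stratified split.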

To decide which approach is better for our dataset, we can check the distribution of the 'species' column.

In [None]:
plt.figure(figsize=(14, 6))
sns.countplot(x='species', data=train_df)
plt.title('Distribution of Leaf Classes')
plt.xticks(rotation=90, fontsize=8)
plt.show()

-----> Since all the bars are the same height, the dataset is balanced and we can use the regular train_test_split method.
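
Reading bar heights off a plot with many classes is error-prone, so the same conclusion can also be checked numerically. The sketch below uses a hypothetical `species` Series as a stand-in for `train_df['species']`.

```python
import pandas as pd

# Hypothetical stand-in for train_df['species']
species = pd.Series(["Acer", "Acer", "Quercus", "Quercus", "Tilia", "Tilia"])

counts = species.value_counts()
is_balanced = counts.nunique() == 1  # True when every class has the same count

print(counts)
print("balanced:", is_balanced)
```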

<h2>Train/Test split</h2>
Divide the data into training and testing sets, using approximately 80% of the data for training.

In [None]:
# test_size = 0.2 meaning that the training set will be 0.8 (80%)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2, random_state=42)

<h2>Data Standardization</h2>

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
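
As a rough illustration of what the cell above computes: `StandardScaler` learns each column's mean and standard deviation from the training set and applies `z = (x - mean) / std` to both sets, so no test-set statistics leak into training. The toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train and X_test
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 40.0]])

scaler_toy = StandardScaler().fit(X_train_toy)

# Same result by hand, using training-set statistics only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (X_test_toy - mean) / std

print(scaler_toy.transform(X_test_toy))
print(manual)
```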

<h2>Label Encoding</h2>

In [None]:
label_encoder = LabelEncoder()

y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
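
For reference, a small sketch of how `LabelEncoder` maps class names to integer ids and back. The species names below are hypothetical stand-ins for `y_train`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical species names standing in for y_train
names = ["Quercus", "Acer", "Tilia", "Acer"]

encoder_toy = LabelEncoder()
codes = encoder_toy.fit_transform(names)
decoded = encoder_toy.inverse_transform(codes)

print(codes)                 # integer class ids
print(encoder_toy.classes_)  # class names in sorted order
print(decoded)               # back to the original names
```

Note that `transform` raises a `ValueError` for a label it did not see during `fit`, so every species in the test set must also appear in the training set.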

# Image Preprocessing

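In [None]:
import cv2
import numpy as np

# Read the image (OpenCV loads it in BGR channel order)
img = cv2.imread('data_files/images/1.jpg')
color = (0, 0, 0)
# Pad 90 black pixels on the left and on the right
result2b = cv2.copyMakeBorder(img, 0, 0, 90, 90, cv2.BORDER_CONSTANT, value=color)

plt.figure(figsize=(24, 16))

# Original image; convert BGR to RGB because matplotlib expects RGB
plt.subplot(2, 1, 1)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

# Padded image. cv2.imshow is not used here: it opens a native window
# and can hang a notebook kernel when cv2.waitKey is never called
plt.subplot(2, 1, 2)
plt.imshow(cv2.cvtColor(result2b, cv2.COLOR_BGR2RGB))
plt.show()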



In [None]:
# import zipfile
# with zipfile.ZipFile('/data_files/images/leaf-classification/images.zip') as z_img:
#     z_img.extractall()
import glob

image_list = []
for filename in glob.glob('data_files/images/*.jpg'):  # assuming jpg
    img = cv2.imread(filename)
    if img is None:  # skip files OpenCV cannot read
        continue

    # Height and width of the image
    height, width = img.shape[:2]
    diff = abs(width - height)
    # Split the padding across both sides so the result is square
    pad_a, pad_b = diff // 2, diff - diff // 2
    color = (0, 0, 0)
    if width < height:
        # Taller than wide: pad left and right
        result2b = cv2.copyMakeBorder(img, 0, 0, pad_a, pad_b, cv2.BORDER_CONSTANT, value=color)
    elif height < width:
        # Wider than tall: pad top and bottom
        result2b = cv2.copyMakeBorder(img, pad_a, pad_b, 0, 0, cv2.BORDER_CONSTANT, value=color)
    else:
        # Already square: keep as is
        result2b = img.copy()

    image_list.append(result2b)

plt.figure(figsize=(24, 16))
for i in range(min(25, len(image_list))):
    plt.subplot(5, 5, i + 1)
    # Convert BGR to RGB because matplotlib expects RGB
    plt.imshow(cv2.cvtColor(image_list[i], cv2.COLOR_BGR2RGB))
    plt.axis('off')
plt.show()
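
The square-padding step can also be isolated in a small helper that depends only on NumPy, which makes it easy to check on synthetic arrays. The `pad_to_square` function below is a hypothetical sketch, not part of the original notebook; it pads the shorter side with a constant value and splits the padding as evenly as possible.

```python
import numpy as np

def pad_to_square(img: np.ndarray, value: int = 0) -> np.ndarray:
    """Pad the shorter side of an H x W (x C) image so that H == W."""
    height, width = img.shape[:2]
    diff = abs(height - width)
    before, after = diff // 2, diff - diff // 2
    if height < width:
        pad = [(before, after), (0, 0)]   # add rows above and below
    else:
        pad = [(0, 0), (before, after)]   # add columns left and right
    pad += [(0, 0)] * (img.ndim - 2)      # never pad the channel axis
    return np.pad(img, pad, mode="constant", constant_values=value)

wide = np.ones((3, 8, 3), dtype=np.uint8)  # wider than tall, 3 channels
tall = np.ones((7, 4), dtype=np.uint8)     # taller than wide, grayscale

print(pad_to_square(wide).shape)
print(pad_to_square(tall).shape)
```

Splitting the padding in two keeps the leaf centered; padding the full difference on each side would leave the image non-square.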

<h1>Part II: Training the Neural Network</h1>