# Data Exploration

This notebook explores a dataset of leaf images and their corresponding annotations. The goal is to visualize the data, understand the distribution of leaf types, and flag potential issues such as missing annotation values.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the seaborn theme (set_theme is the preferred name; sns.set is an alias for it)
sns.set_theme(style='whitegrid')

In [2]:
# Load the annotations
annotations_path = '../data/annotations/annotations.csv'
annotations = pd.read_csv(annotations_path)
annotations.head()
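
The later cells read the `image_name` and `leaf_type` columns, so it can help to confirm both exist right after loading. The helper below is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are illustrative names rather than part of this project, and the toy frame only stands in for the real annotations.

```python
import pandas as pd

# Columns that the later cells in this notebook read from the annotations.
REQUIRED_COLUMNS = ("image_name", "leaf_type")


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for the real annotations (file name and label are made up).
toy = pd.DataFrame({"image_name": ["a.jpg"], "label": ["oak"]})
print(missing_columns(toy))
```

On the real data, an empty list from `missing_columns(annotations)` would mean the plotting and sampling cells can find the columns they expect.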

In [3]:
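# Visualize the distribution of leaf types, most frequent first
type_order = annotations['leaf_type'].value_counts().index
plt.figure(figsize=(12, 6))
sns.countplot(data=annotations, x='leaf_type', order=type_order)
plt.title('Distribution of Leaf Types')
# Right-align rotated labels so each one ends under its own bar
plt.xticks(rotation=45, ha='right')
plt.xlabel('Leaf Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()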
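
A bar chart shows imbalance visually, but a table of counts and shares is easier to compare across runs. The following is a rough sketch under the assumption that the label column is `leaf_type`, as in the plot above; `class_balance` and `imbalance_ratio` are hypothetical helper names, and the toy labels are made up.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "leaf_type") -> pd.DataFrame:
    """Per-class row count and share of the total, most frequent class first."""
    counts = df[label_col].value_counts()  # NaN labels are excluded by default
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


def imbalance_ratio(df: pd.DataFrame, label_col: str = "leaf_type") -> float:
    """Largest class count divided by the smallest class count."""
    counts = df[label_col].value_counts()
    return float(counts.max() / counts.min())


# Toy labels standing in for the real annotations.
toy = pd.DataFrame({"leaf_type": ["maple", "maple", "maple", "oak"]})
summary = class_balance(toy)
print(summary)
print(imbalance_ratio(toy))
```

A ratio far above 1 suggests that some leaf types are underrepresented, which is worth knowing before choosing an augmentation or sampling strategy.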

In [4]:
# Check for missing values in the annotations
missing_values = annotations.isnull().sum()
missing_values[missing_values > 0]
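
Null cells are only one kind of problem. An annotation can also point at an image file that is not on disk, or the same file name can appear more than once. The sketch below checks both, assuming the file names live in an `image_name` column as in the rest of this notebook; the helper names are illustrative, and the demo uses a temporary directory rather than the real `../data/raw` folder.

```python
import os
import tempfile

import pandas as pd


def rows_with_missing_files(df, image_dir, name_col="image_name"):
    """Return the rows of df whose image file is not present in image_dir."""
    exists = df[name_col].map(
        lambda name: os.path.isfile(os.path.join(image_dir, str(name)))
    ).astype(bool)
    return df[~exists]


def duplicated_image_rows(df, name_col="image_name"):
    """Return every row whose file name occurs more than once."""
    return df[df.duplicated(subset=name_col, keep=False)]


# Demo on made-up data: only a.jpg exists on disk, and it is annotated twice.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "a.jpg"), "wb").close()
    toy = pd.DataFrame(
        {
            "image_name": ["a.jpg", "b.jpg", "a.jpg"],
            "leaf_type": ["oak", "maple", "oak"],
        }
    )
    missing = rows_with_missing_files(toy, tmp_dir)
    duplicates = duplicated_image_rows(toy)

print(missing)
print(duplicates)
```

On the real data, the equivalent call would be `rows_with_missing_files(annotations, '../data/raw')`.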

In [5]:
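# Visualize some sample images from the dataset
# Take at most 5 rows so the cell also works on very small annotation files
sample_images = annotations.sample(min(5, len(annotations)))

plt.figure(figsize=(15, 4))
# iterrows() yields (index, row) pairs; the index is not needed here
for i, (_, row) in enumerate(sample_images.iterrows()):
    plt.subplot(1, len(sample_images), i + 1)
    img_path = os.path.join('../data/raw', row['image_name'])
    img = plt.imread(img_path)
    plt.imshow(img)
    plt.title(row['leaf_type'])
    plt.axis('off')
plt.tight_layout()
plt.show()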
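
A uniform random sample may leave rare leaf types out of the preview entirely. One possible alternative is to draw a fixed number of rows per leaf type, as sketched below. `sample_per_class` is a hypothetical helper, the `seed` argument is an added assumption for reproducibility, and the toy data is made up.

```python
import pandas as pd


def sample_per_class(df, label_col="leaf_type", per_class=1, seed=0):
    """Pick up to per_class random rows for each distinct label."""
    # Shuffle once, then keep the first rows of each group; groups smaller
    # than per_class contribute all of their rows. Rows with a NaN label
    # are dropped by groupby.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(label_col).head(per_class)


# Toy annotations standing in for the real ones.
toy = pd.DataFrame(
    {
        "image_name": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "leaf_type": ["maple", "maple", "maple", "oak"],
    }
)
picked = sample_per_class(toy, per_class=2)
print(picked)
```

The result has the same columns as the input, so it could be passed to the plotting loop in place of `sample_images`.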

## Conclusion

In this notebook, we explored the dataset by visualizing the distribution of leaf types and checking for missing values. We also displayed some sample images to get a better understanding of the data. This exploration will help inform the next steps in data augmentation and model training.