In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ayyanarkadalkani/paddy-seeds-quality-classification-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/ayyanarkadalkani/paddy-seeds-quality-classification-dataset?dataset_version_number=1...


100%|██████████| 183M/183M [00:01<00:00, 113MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/ayyanarkadalkani/paddy-seeds-quality-classification-dataset/versions/1


## Load and Inspect Dataset

### Subtask:
Load the dataset from the provided path and inspect its structure, including the types of features and the target variable (pure/impure). This step will help us understand if the data is tabular, image-based, or another format.


**Reasoning**:
The first instruction is to list the contents of the provided path to identify the data files and their format. I will use `os.listdir` to achieve this.



In [None]:
import os

# List the contents of the downloaded dataset directory
print(os.listdir(path))

['Paddy Seeds Classification Dataset']


**Reasoning**:
The previous output showed that the downloaded dataset is within a subfolder named 'Paddy Seeds Classification Dataset'. I need to explore the contents of this subfolder to identify the actual data files and determine if they are tabular or image-based, as per the subtask's instruction to inspect the dataset structure.



In [None]:
import os

dataset_subfolder_path = os.path.join(path, 'Paddy Seeds Classification Dataset')
print(os.listdir(dataset_subfolder_path))

['train', 'test', 'validation']


**Reasoning**:
The previous output showed 'train', 'test', and 'validation' subfolders. To further inspect the dataset structure and confirm if it's an image-based dataset, I need to list the contents of one of these subfolders, specifically the 'train' directory.



In [None]:
import os

train_path = os.path.join(dataset_subfolder_path, 'train')
print(os.listdir(train_path))

['pure', 'impure']


**Reasoning**:
The previous output showed 'pure' and 'impure' subfolders within the 'train' directory, strongly suggesting an image-based dataset with these as class labels. To confirm this and inspect the actual data files, I will list the contents of the 'pure' subfolder to see if it contains image files.



In [None]:
import os

pure_class_path = os.path.join(train_path, 'pure')
pure_files = os.listdir(pure_class_path)  # list the directory once and reuse the result
print(pure_files[:5])  # Print first 5 files to avoid long output
print(f"Total files in 'pure' class: {len(pure_files)}")

['20200531_161748.jpg', '20200531_154930.jpg', '20200531_155431.jpg', '20200601_125937.jpg', '20200531_160523.jpg']
Total files in 'pure' class: 567
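
The directory-by-directory inspection above can be condensed into one helper. The following is a minimal sketch, assuming the `root/split/class/file` layout found in the previous cells. The helper name `count_images` and the `IMAGE_EXTENSIONS` set are illustrative choices, not part of the dataset or of any library, and the demo runs on a throwaway directory tree rather than on the real data.

```python
import os
import tempfile
from collections import Counter

# Assumed set of image suffixes; extend it if the dataset uses other formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(root):
    """Return {(split, class_name): image_count} for a root/split/class/file layout."""
    counts = Counter()
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            for file_name in os.listdir(class_dir):
                suffix = os.path.splitext(file_name)[1].lower()
                if suffix in IMAGE_EXTENSIONS:
                    counts[(split, class_name)] += 1
    return dict(counts)


# Demo on a temporary directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_root:
    layout = [("train", "pure", 3), ("train", "impure", 2), ("test", "pure", 1)]
    for split, class_name, n_files in layout:
        class_dir = os.path.join(demo_root, split, class_name)
        os.makedirs(class_dir)
        for i in range(n_files):
            open(os.path.join(class_dir, f"img_{i}.jpg"), "w").close()
    demo_counts = count_images(demo_root)

print(demo_counts)
```

On the real dataset the call would be `count_images(dataset_subfolder_path)`, which should reveal the per-class balance in each split.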


**Reasoning**:
The previous steps have confirmed that the dataset is image-based, with 'train', 'test', and 'validation' directories, each containing 'pure' and 'impure' subfolders for classification. Now, I need to load these images into a format suitable for machine learning. I will use `tensorflow.keras.preprocessing.image.ImageDataGenerator` to load the images, which will also handle resizing and batching.



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define image dimensions and batch size
IMG_HEIGHT = 128
IMG_WIDTH = 128
BATCH_SIZE = 32

# Create an ImageDataGenerator for training with augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# Create ImageDataGenerator for validation and test (only rescaling)
validation_test_datagen = ImageDataGenerator(rescale=1./255)

# Load training data
train_generator = train_datagen.flow_from_directory(
    os.path.join(dataset_subfolder_path, 'train'),
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='binary'
)

# Load validation data
validation_generator = validation_test_datagen.flow_from_directory(
    os.path.join(dataset_subfolder_path, 'validation'),
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='binary'
)

# Load test data
test_generator = validation_test_datagen.flow_from_directory(
    os.path.join(dataset_subfolder_path, 'test'),
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='binary',
    shuffle=False # Keep data in order for evaluation
)

print("Image data generators created successfully.")
print(f"Training classes: {train_generator.class_indices}")
print(f"Validation classes: {validation_generator.class_indices}")
print(f"Test classes: {test_generator.class_indices}")


Found 1214 images belonging to 2 classes.
Found 118 images belonging to 2 classes.
Found 231 images belonging to 2 classes.
Image data generators created successfully.
Training classes: {'impure': 0, 'pure': 1}
Validation classes: {'impure': 0, 'pure': 1}
Test classes: {'impure': 0, 'pure': 1}
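
The image counts reported above, together with `BATCH_SIZE = 32`, determine how many batches each generator yields per pass. The arithmetic below is a small sketch of that calculation; the 38 training batches it produces match the `38/38` progress counter in the training log later in this notebook.

```python
import math

BATCH_SIZE = 32
# Image counts taken from the "Found N images" lines above.
split_sizes = {"train": 1214, "validation": 118, "test": 231}

# A generator yields ceil(n / batch_size) batches per pass; the last one may be smaller.
steps = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
last_batch = {name: n - (steps[name] - 1) * BATCH_SIZE for name, n in split_sizes.items()}

for name in split_sizes:
    print(f"{name}: {steps[name]} batches, last batch holds {last_batch[name]} images")
```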


## Train Classification Model

### Subtask:
Select and train an appropriate classification model on the prepared training data.


**Reasoning**:
I need to define a CNN model, compile it, and then train it using the image data generators that were created in the previous steps. This will involve importing Keras layers, specifying the model architecture, and setting up the training process.
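
Before training, it can help to know how large the network defined in this step is. The sketch below recomputes the layer output shapes and parameter counts by hand for that architecture (three 3x3 convolutions with 32, 64 and 128 filters, each followed by 2x2 max pooling, then `Dense(512)` and `Dense(1)` on a 128x128x3 input). It assumes the Keras defaults of unpadded ("valid") convolutions with stride 1 and pooling that rounds down; the helper name `cnn_summary` is hypothetical.

```python
def cnn_summary(height=128, width=128, channels=3,
                filters=(32, 64, 128), kernel=3, pool=2, dense_units=512):
    """Return a list of (layer, output_shape, param_count) rows for the CNN."""
    rows = []
    h, w, c = height, width, channels
    for f in filters:
        # Each filter has kernel*kernel*c weights plus one bias.
        params = (kernel * kernel * c + 1) * f
        # "valid" convolution shrinks each spatial side by kernel - 1.
        h, w = h - kernel + 1, w - kernel + 1
        rows.append((f"Conv2D({f})", (h, w, f), params))
        # Max pooling divides each side by the pool size, rounding down.
        h, w = h // pool, w // pool
        rows.append(("MaxPooling2D", (h, w, f), 0))
        c = f
    flat = h * w * c
    rows.append(("Flatten", (flat,), 0))
    rows.append((f"Dense({dense_units})", (dense_units,), (flat + 1) * dense_units))
    rows.append(("Dropout", (dense_units,), 0))
    rows.append(("Dense(1)", (1,), dense_units + 1))
    return rows


rows = cnn_summary()
total_params = sum(params for _, _, params in rows)
for layer, shape, params in rows:
    print(f"{layer:<14} {str(shape):<16} {params:>12,}")
print(f"Total parameters: {total_params:,}")
```

Almost all of the roughly 12.9 million parameters sit in the first dense layer, which is large relative to a training set of 1,214 images.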



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Define the CNN model architecture
model = Sequential([
    Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)),  # explicit Input layer instead of input_shape on Conv2D
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid') # 'sigmoid' for binary classification
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model summary
model.summary()

# Define callbacks for training
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_paddy_seed_model.keras', save_best_only=True, monitor='val_accuracy', mode='max')

# Train the model
EPOCHS = 20 # You can adjust this number
history = model.fit(
    train_generator,
    epochs=EPOCHS,
    validation_data=validation_generator,
    callbacks=[early_stopping, model_checkpoint]
)

print("Model training complete.")

Epoch 1/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 53s 1s/step - accuracy: 0.5879 - loss: 0.7413 - val_accuracy: 0.5000 - val_loss: 1.5081
Epoch 2/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 81s 1s/step - accuracy: 0.8938 - loss: 0.2843 - val_accuracy: 0.5593 - val_loss: 1.7015
Epoch 3/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 50s 1s/step - accuracy: 0.9388 - loss: 0.2086 - val_accuracy: 0.5593 - val_loss: 1.4438
Epoch 4/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 50s 1s/step - accuracy: 0.9258 - loss: 0.2252 - val_accuracy: 0.5508 - val_loss: 2.1748
Epoch 5/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 50s 1s/step - accuracy: 0.9418 - loss: 0.1990 - val_accuracy: 0.8220 - val_loss: 0.5106
Epoch 6/20
38/38 ━━━━━━━━━━━━━━━━━━━━ 49s 1s/step - accuracy: 0.8845 - loss: 0.3311 - val_accuracy: 0.6356 - val_loss: 1.7440
Epoch 7/20
38/38 ━━━━━━━━━━

**Reasoning**:
The previous cell's output is cut off during epoch 7, which indicates that training was interrupted before completion. I will rerun the training cell, which rebuilds the model from scratch, so that training runs to the end.
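
The six completed epochs in the log also show how the two callbacks would have behaved so far. The sketch below replays the logged `val_loss` and `val_accuracy` values through a simplified version of the patience rule (stop after `patience` consecutive epochs without a new best) and of a save-on-strict-improvement rule. Both helpers are hypothetical stand-ins written for illustration, not the Keras implementations.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stopped_epoch); stopped_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


def checkpoint_epochs(val_accuracies):
    """Return the epochs at which a save-best-only checkpoint would be written."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # assumes only a strict improvement triggers a save
            best = acc
            saved.append(epoch)
    return saved


# Values copied from the six completed epochs in the log above.
logged_val_loss = [1.5081, 1.7015, 1.4438, 2.1748, 0.5106, 1.7440]
logged_val_acc = [0.5000, 0.5593, 0.5593, 0.5508, 0.8220, 0.6356]

best_epoch, stopped_epoch = simulate_early_stopping(logged_val_loss, patience=5)
saved_epochs = checkpoint_epochs(logged_val_acc)
print(f"Best val_loss at epoch {best_epoch}; early stop triggered at: {stopped_epoch}")
print(f"Checkpoint would be written at epochs: {saved_epochs}")
```

Under these assumptions, epoch 5 is the best epoch on both metrics, and with `patience=5` early stopping would not yet have triggered by epoch 6.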



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Define the CNN model architecture
model = Sequential([
    Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)),  # explicit Input layer instead of input_shape on Conv2D
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid') # 'sigmoid' for binary classification
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model summary
model.summary()

# Define callbacks for training
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_paddy_seed_model.keras', save_best_only=True, monitor='val_accuracy', mode='max')

# Train the model
EPOCHS = 20 # You can adjust this number
history = model.fit(
    train_generator,
    epochs=EPOCHS,
    validation_data=validation_generator,
    callbacks=[early_stopping, model_checkpoint]
)

print("Model training complete.")

**Reasoning**:
The model training is complete. Now I need to evaluate the trained model's performance on the test dataset to assess its accuracy and generalization capabilities. This involves using the `evaluate` method with the `test_generator`.



In [None]:
print("Evaluating model on test data...")
loss, accuracy = model.evaluate(test_generator)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
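
Accuracy alone does not show which class the model confuses. As a possible follow-up, the sketch below turns sigmoid outputs into class labels and tallies a confusion matrix, treating `pure` (index 1 in `class_indices`) as the positive class. The labels and probabilities here are made-up toy values for illustration, not results from this model, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives for binary labels (1 = pure, 0 = impure)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy inputs (hypothetical): true labels and predicted probabilities of "pure".
toy_labels = [1, 1, 0, 0, 1, 0]
toy_probs = [0.91, 0.40, 0.20, 0.65, 0.77, 0.05]

counts = confusion_counts(toy_labels, toy_probs)
toy_accuracy = (counts["tp"] + counts["tn"]) / len(toy_labels)
toy_precision = counts["tp"] / (counts["tp"] + counts["fp"])
toy_recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(counts)
print(f"accuracy={toy_accuracy:.3f} precision={toy_precision:.3f} recall={toy_recall:.3f}")
```

With the notebook's objects, the inputs would presumably come from `test_generator.classes` and `model.predict(test_generator).ravel()`; because the test generator was created with `shuffle=False`, the two sequences should line up.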