<a href="https://colab.research.google.com/github/DurgaPittala/Dhana/blob/main/task3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# --- 1. Configuration ---
DATADIR = "dataset/train"  # Path to your extracted Kaggle data
CATEGORIES = ["cat", "dog"]
IMG_SIZE = 64  # SVMs struggle with high resolution; 64x64 is a good balance

def create_training_data():
    training_data = []
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)  # 0 for cat, 1 for dog

        for img in os.listdir(path):
            # cv2.imread returns None (it does not raise) when a file cannot be decoded
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
            if img_array is None:
                continue  # skip corrupt or non-image files explicitly
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            training_data.append([new_array.flatten(), class_num])
    return training_data

# --- 2. Data Preparation ---
data = create_training_data()
X = np.array([i[0] for i in data])
y = np.array([i[1] for i in data])

# Split into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Feature Scaling ---
# Critical for SVM because it relies on distance calculations
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- 4. Training the SVM ---
print("Training SVM... This may take a few minutes depending on dataset size.")
model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X_train, y_train)

# --- 5. Predictions and Evaluation ---
y_pred = model.predict(X_test)

print("\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=CATEGORIES))

FileNotFoundError: [Errno 2] No such file or directory: 'dataset/train/cat'

# Task
Unzip `archive (1).zip`, inspect the extracted directory structure to find the `train` folder containing `cat` and `dog` images, then update `DATADIR` to the correct path, and finally, run the provided code to train and evaluate the SVM model.

## Unzip Dataset

### Subtask:
Unzip the `archive (1).zip` file to extract the dataset. The extracted files will likely be in `/content/dataset` or `/content/archive` depending on the zip file structure.


**Reasoning**:
The subtask is to unzip the `archive (1).zip` file. I will use Python's `zipfile` module to extract its contents into the `/content/dataset` directory as instructed.



In [7]:
import zipfile
import os

zip_path = '/content/archive (1).zip'
extract_path = '/content/dataset'

os.makedirs(extract_path, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print(f"Extracted '{zip_path}' to '{extract_path}'.")

Extracted '/content/archive (1).zip' to '/content/dataset'.
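
The folder layout can also be read from the archive's own index, without walking every extracted file. The following is a minimal sketch using only the standard-library `zipfile` module; `archive_dirs` and its `depth` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import PurePosixPath

def archive_dirs(zip_source, depth=3):
    """Return the sorted folder paths (up to `depth` levels deep) found inside a zip."""
    dirs = set()
    with zipfile.ZipFile(zip_source, "r") as zf:
        for name in zf.namelist():
            # Zip entry names use forward slashes on every platform
            parts = PurePosixPath(name).parts
            # Drop the file name unless the entry is itself a directory entry
            folder_parts = parts if name.endswith("/") else parts[:-1]
            for i in range(1, min(len(folder_parts), depth) + 1):
                dirs.add("/".join(folder_parts[:i]))
    return sorted(dirs)
```

Calling `archive_dirs('/content/archive (1).zip')` would return the folder names exactly as they are spelled in the archive, including their capitalization.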


**Reasoning**:
The archive has been extracted. To find the correct `train` folder path, I need to inspect the directory structure of the extracted contents. Listing the directory recursively will help visualize the structure.



In [8]:
import os

print(f"Directory structure of {extract_path}:")
for root, dirs, files in os.walk(extract_path):
    level = root.replace(extract_path, '').count(os.sep)
    indent = ' ' * 4 * (level)
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 4 * (level + 1)
    for f in files:
        if f.endswith(('.jpg', '.jpeg', '.png')):
            print(f'{subindent}{f}')


Directory structure of /content/dataset:
dataset/
    Cat vs Dog/
        validation/
            Dog/
                ec-conf-0.2932733-t-1658094070788.jpg
                ec-conf-0.50390625-t-1638406234410.jpg
                de-conf-0.88796717-t-1647786531272.jpg
                de-conf-0.421875-t-1646841077680.jpg
                es-conf-0.59375-t-1647537738640.jpg
                de-conf-0.4984515-t-1649584443411.jpg
                de-conf-0.78348607-t-1650271396382.jpg
                dk-conf-0.234375-t-1644393536554.jpg
                de-conf-0.13718252-t-1649090061240.jpg
                English-co-conf-0.44921875-t-1636292119452.jpg
                db8937ed-847c-4d90-8406-b85710bab56a.jpg
                de-conf-0.58812-t-1650106878963.jpg
                ec-conf-0.88371176-t-1648477366737.jpg
                de-conf-0.083465576-t-1655539885717.jpg
                de-conf-0.453125-t-1637247903117.jpg
                ec-conf-0.50390625-t-1642712320295.jpg
                ecd0
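
Printing every file name makes the output long enough to be truncated before all folders appear. A more compact alternative, sketched below with only the standard library, is to count the images in each folder. `count_images_per_dir` and `IMAGE_EXTS` are hypothetical names introduced for illustration.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def count_images_per_dir(root_dir):
    """Map each folder (relative to root_dir) to the number of image files directly inside it."""
    counts = {}
    for current, _dirs, files in os.walk(root_dir):
        # Compare extensions case-insensitively so ".JPG" is counted too
        n = sum(1 for f in files if f.lower().endswith(IMAGE_EXTS))
        if n:
            counts[os.path.relpath(current, root_dir)] = n
    return counts
```

Iterating over `sorted(count_images_per_dir(extract_path).items())` gives one line per folder, which shows the exact folder names and the class balance at a glance.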

**Reasoning**:
Based on the directory listing, the correct path for the training data is `/content/dataset/Cat vs Dog/train`. I need to update the `DATADIR` variable in the initial code cell to this new path and re-run the cell to ensure the model training proceeds correctly.



In [9]:
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# --- 1. Configuration ---
DATADIR = "/content/dataset/Cat vs Dog/train"  # Updated path to your extracted Kaggle data
CATEGORIES = ["cat", "dog"]
IMG_SIZE = 64  # SVMs struggle with high resolution; 64x64 is a good balance

def create_training_data():
    training_data = []
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)  # 0 for cat, 1 for dog

        for img in os.listdir(path):
            # cv2.imread returns None (it does not raise) when a file cannot be decoded
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
            if img_array is None:
                continue  # skip corrupt or non-image files explicitly
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            training_data.append([new_array.flatten(), class_num])
    return training_data

# --- 2. Data Preparation ---
data = create_training_data()
X = np.array([i[0] for i in data])
y = np.array([i[1] for i in data])

# Split into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Feature Scaling ---
# Critical for SVM because it relies on distance calculations
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- 4. Training the SVM ---
print("Training SVM... This may take a few minutes depending on dataset size.")
model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X_train, y_train)

# --- 5. Predictions and Evaluation ---
y_pred = model.predict(X_test)

print("\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=CATEGORIES))

FileNotFoundError: [Errno 2] No such file or directory: '/content/dataset/Cat vs Dog/train/cat'

# Task
Correct the `CATEGORIES` list from `['cat', 'dog']` to `['Cat', 'Dog']` to match the actual folder names, then run the code to load data, prepare it, train the SVM model, and evaluate its performance.

## Correct Category Names

### Subtask:
Update the `CATEGORIES` list from `['cat', 'dog']` to `['Cat', 'Dog']` to match the actual folder names within the training directory.


**Reasoning**:
The subtask requires updating the `CATEGORIES` list. I will change `CATEGORIES` to `['Cat', 'Dog']` in the training cell so that it matches the folder names observed in the directory structure, and then re-run the entire cell.



In [10]:
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# --- 1. Configuration ---
DATADIR = "/content/dataset/Cat vs Dog/train"  # Updated path to your extracted Kaggle data
CATEGORIES = ["Cat", "Dog"] # Updated to match folder names
IMG_SIZE = 64  # SVMs struggle with high resolution; 64x64 is a good balance

def create_training_data():
    training_data = []
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)  # 0 for Cat, 1 for Dog

        for img in os.listdir(path):
            # cv2.imread returns None (it does not raise) when a file cannot be decoded
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
            if img_array is None:
                continue  # skip corrupt or non-image files explicitly
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            training_data.append([new_array.flatten(), class_num])
    return training_data

# --- 2. Data Preparation ---
data = create_training_data()
X = np.array([i[0] for i in data])
y = np.array([i[1] for i in data])

# Split into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Feature Scaling ---
# Critical for SVM because it relies on distance calculations
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- 4. Training the SVM ---
print("Training SVM... This may take a few minutes depending on dataset size.")
model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X_train, y_train)

# --- 5. Predictions and Evaluation ---
y_pred = model.predict(X_test)

print("\n--- Model Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=CATEGORIES))

Training SVM... This may take a few minutes depending on dataset size.

--- Model Evaluation ---
Accuracy: 57.95%

Classification Report:
              precision    recall  f1-score   support

         Cat       0.59      0.60      0.60       228
         Dog       0.56      0.56      0.56       212

    accuracy                           0.58       440
   macro avg       0.58      0.58      0.58       440
weighted avg       0.58      0.58      0.58       440
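
The report gives per-class ratios only. Raw counts of each (true, predicted) pair, together with the positions of the wrong predictions, are often easier to act on. The sketch below uses only the standard library and works on lists or NumPy arrays of integer labels. `confusion_counts` and `misclassified_indices` are hypothetical helpers, and the toy labels are made up for illustration; they are not this notebook's predictions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, class_names):
    """Count (true class, predicted class) pairs, keyed by class name."""
    pairs = Counter((int(t), int(p)) for t, p in zip(y_true, y_pred))
    return {
        (class_names[t], class_names[p]): pairs.get((t, p), 0)
        for t in range(len(class_names))
        for p in range(len(class_names))
    }

def misclassified_indices(y_true, y_pred):
    """Return the positions where the prediction differs from the true label."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Toy labels for illustration only (0 = Cat, 1 = Dog)
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 0, 1]
print(confusion_counts(toy_true, toy_pred, ["Cat", "Dog"]))
print(misclassified_indices(toy_true, toy_pred))
```

In the notebook, the same functions could be called with `y_test`, `y_pred`, and `CATEGORIES`. The `confusion_matrix` function that the training cell imports from scikit-learn but never calls provides the same counts as an array.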



## Final Task

### Subtask:
Confirm that the `FileNotFoundError` has been resolved and the model has successfully trained and evaluated.


## Summary:

### Q&A
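The `FileNotFoundError` was resolved in two steps: first by pointing `DATADIR` at the extracted folder `/content/dataset/Cat vs Dog/train`, then by correcting the `CATEGORIES` list from `['cat', 'dog']` to `['Cat', 'Dog']` to match the actual folder names. After both changes, the model trained and was evaluated successfully.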
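
The second error came from a capitalization mismatch between the class names in the code and the folder names on disk. One way to avoid this kind of failure is to resolve the class names against the folders that actually exist. This is a minimal sketch under that assumption; `resolve_categories` is a hypothetical helper, not something the notebook defines.

```python
import os

def resolve_categories(data_dir, wanted):
    """Match the wanted class names to real subfolder names, ignoring case."""
    on_disk = {
        name.lower(): name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    }
    missing = [w for w in wanted if w.lower() not in on_disk]
    if missing:
        # Fail early with the folder names that do exist, to make the fix obvious
        raise FileNotFoundError(
            f"No folder for {missing} in {data_dir!r}; found {sorted(on_disk.values())}"
        )
    return [on_disk[w.lower()] for w in wanted]
```

With this helper, `CATEGORIES = resolve_categories(DATADIR, ["cat", "dog"])` would yield the folder names as spelled on disk, and a missing class would be reported along with the folders that were found.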

### Data Analysis Key Findings
*   The SVM model achieved an overall accuracy of 57.95%.
*   For the 'Cat' category, the model demonstrated a precision of 0.59, a recall of 0.60, and an f1-score of 0.60.
*   For the 'Dog' category, the model showed a precision of 0.56, a recall of 0.56, and an f1-score of 0.56.
*   The macro average and weighted average f1-scores for the model were both 0.58.
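
To put these figures in context, the support counts from the classification report give the accuracy of a trivial classifier that always predicts the more frequent class. The short calculation below uses only numbers taken from the report; the variable names are illustrative.

```python
# Per-class support and accuracy taken from the classification report
support = {"Cat": 228, "Dog": 212}
reported_accuracy = 0.5795

total = sum(support.values())
# Accuracy of a classifier that always predicts the most frequent class
majority_baseline = max(support.values()) / total
# Number of correct predictions implied by the reported accuracy
implied_correct = round(reported_accuracy * total)
# Gap between the model and the trivial baseline, in percentage points
margin_points = (reported_accuracy - majority_baseline) * 100

print(f"Test samples:         {total}")
print(f"Majority baseline:    {majority_baseline:.2%}")
print(f"Implied correct:      {implied_correct} of {total}")
print(f"Margin over baseline: {margin_points:.2f} points")
```

The baseline works out to about 51.8%, so the model's 57.95% is roughly six percentage points above always guessing "Cat".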

### Insights or Next Steps
*   The current model's accuracy (57.95%) is only a few points above the 51.8% that always predicting the majority class would achieve on this test split (228 of 440 samples are cats), which is low for image classification. Better feature extraction than raw pixel flattening, or a different model family such as Convolutional Neural Networks, would likely be needed for a substantial improvement.
*   Given the low accuracy, a closer error analysis of the misclassified images could reveal common failure patterns and guide future model enhancements.
