# SVM Dataset Loader and Evaluator

This notebook loads datasets from three sources (sklearn, TensorFlow, and Kaggle),
trains an SVM model, and evaluates its performance. Only the Titanic data reaches the
training step, because Step 4 reassigns `X` and `y`. The datasets loaded are:

- **Iris Dataset** (from sklearn)
- **MNIST Handwritten Digits Dataset** (from TensorFlow)
- **Titanic Dataset** (from Kaggle)

The notebook follows these steps:

1. Load datasets from different sources
2. Display 10 samples from each dataset
3. Preprocess and split the data
4. Train an SVM model
5. Make predictions
6. Evaluate model performance

## Requirements:
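- Ensure the Kaggle API is configured with your credentials (an API token in `~/.kaggle/kaggle.json`) so the dataset can be downloaded.
- Required libraries: `numpy`, `pandas`, `sklearn` (installed as `scikit-learn`), `tensorflow`, `kaggle`.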

Let's get started! 🚀

## Step 1: Import Necessary Libraries

We import the required libraries for data handling, machine learning, and deep learning:

- `numpy` and `pandas` for numerical operations and data handling.
- `sklearn` for loading datasets, splitting data, training an SVM model, and evaluating performance.
- `tensorflow` for loading the MNIST dataset.
- `kaggle` API to fetch datasets from Kaggle.


In [7]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import tensorflow as tf
from kaggle.api.kaggle_api_extended import KaggleApi


## Step 2: Load Dataset from sklearn (Iris Dataset)

The Iris dataset contains sepal and petal measurements for 150 flowers from three iris species.
- `datasets.load_iris()` loads the dataset.
- `X` contains the features (sepal length, sepal width, petal length, petal width).
- `y` contains the target labels (flower species: 0, 1, or 2).
- We display 10 samples to understand the dataset structure.


In [8]:
# Step 2: Load Dataset from sklearn
iris = datasets.load_iris()
X, y = iris.data, iris.target
print("Iris Dataset Sample:")
print(pd.DataFrame(X).head(10))


Iris Dataset Sample:
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
5  5.4  3.9  1.7  0.4
6  4.6  3.4  1.4  0.3
7  5.0  3.4  1.5  0.2
8  4.4  2.9  1.4  0.2
9  4.9  3.1  1.5  0.1
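
The printed columns above are labeled only by position (0 to 3). As a minimal sketch, the same data can be shown with the feature names and species labels that `load_iris()` already provides. The names `iris_df` and `species` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Use the feature names shipped with the dataset as column headers.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to their species names.
# "species" is a column name chosen for this sketch.
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df.head(10))
```

Labeled columns make it easier to check that each feature means what the description says before training on it.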


## Step 3: Load Dataset from TensorFlow (MNIST Dataset)

The MNIST dataset consists of 28x28 grayscale images of handwritten digits (0-9).
- `tf.keras.datasets.mnist.load_data()` loads the dataset.
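- `X_train` and `X_test` contain the image data (60,000 training and 10,000 test images).
- `y_train` and `y_test` contain the corresponding labels (digits).
- Since SVM requires a flat feature representation, we reshape each 28x28 image into a 1D array of 784 pixel values.
- We display the first 10 flattened rows to check the dataset structure. The columns shown are border pixels, so they are all 0.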


In [9]:
# Step 3: Load Dataset from TensorFlow
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten the images for SVM
X_train = X_train.reshape((X_train.shape[0], -1))
X_test = X_test.reshape((X_test.shape[0], -1))
print("MNIST Dataset Sample:")
print(pd.DataFrame(X_train).head(10))


MNIST Dataset Sample:
   0    1    2    3    4    5    6    7    8    9    ...  774  775  776  777  \
0    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
1    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
2    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
3    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
4    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
5    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
6    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
7    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
8    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
9    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   

   778  779  780  781  782  783  
0    0    0    0    0    0    0  
1    0    0    0    0    0   
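
To make the flattening step concrete without downloading MNIST, here is a small sketch on a synthetic batch of images. The random array `images` is a stand-in assumption, not real MNIST data. Scaling to the range 0 to 1 is an optional extra that the original notebook does not perform.

```python
import numpy as np

# Synthetic stand-in for MNIST: 5 images of 28x28 uint8 pixels (assumption).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image row by row into a vector of 28 * 28 = 784 values.
flat = images.reshape((images.shape[0], -1))

# Optional: scale pixel values from [0, 255] to [0, 1].
scaled = flat.astype(np.float32) / 255.0

print(flat.shape)
print(scaled.min(), scaled.max())
```

Because the reshape is row-major, flattened column 28 is the first pixel of the image's second row.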

## Step 4: Load Dataset from Kaggle (Titanic Dataset)

The Titanic dataset contains information about passengers, used to predict survival.
- `KaggleApi()` is used to authenticate and download the dataset.
- The dataset is read using `pd.read_csv()`.
- We drop rows that contain missing values (`dropna()`) to clean the data.
- `X` contains passenger attributes (age, fare, class, etc.), and `y` contains survival labels (0 or 1).
- We display 10 samples to analyze the dataset.


In [13]:
# Step 4: Load Dataset from Kaggle
api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Download a dataset from Kaggle (example: Titanic dataset)
api.dataset_download_files('heptapod/titanic', path='./', unzip=True)

# Load the dataset into a pandas DataFrame (same directory as the download path above)
titanic = pd.read_csv('./train_and_test2.csv')
# Preprocess the dataset as needed
titanic = titanic.dropna()  # Drop rows that contain any missing value
# Note: the label column is spelled '2urvived' in this CSV
X = titanic.drop('2urvived', axis=1)
y = titanic['2urvived']
print("Titanic Dataset Sample:")
print(X.head(10))


Dataset URL: https://www.kaggle.com/datasets/heptapod/titanic
Titanic Dataset Sample:
   Passengerid   Age     Fare  Sex  sibsp  zero  zero.1  zero.2  zero.3  \
0            1  22.0   7.2500    0      1     0       0       0       0   
1            2  38.0  71.2833    1      1     0       0       0       0   
2            3  26.0   7.9250    1      0     0       0       0       0   
3            4  35.0  53.1000    1      1     0       0       0       0   
4            5  35.0   8.0500    0      0     0       0       0       0   
5            6  28.0   8.4583    0      0     0       0       0       0   
6            7  54.0  51.8625    0      0     0       0       0       0   
7            8   2.0  21.0750    0      3     0       0       0       0   
8            9  27.0  11.1333    1      0     0       0       0       0   
9           10  14.0  30.0708    1      1     0       0       0       0   

   zero.4  ...  zero.11  zero.12  zero.13  zero.14  Pclass  zero.15  zero.16  \
0       
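
The cleaning and feature/label split can be illustrated on a tiny hand-made table. This is a sketch only: the DataFrame `toy` and its column names (`Age`, `Fare`, `Survived`) are hypothetical and do not come from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the Kaggle dataset).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, np.nan],
    "Survived": [0, 1, 1, 1],
})

# Count missing values per column before deciding how to handle them.
print(toy.isna().sum())

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()

# Separate the features from the label column.
X_toy = cleaned.drop("Survived", axis=1)
y_toy = cleaned["Survived"]

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Checking `isna().sum()` first shows how many rows `dropna()` will discard, which matters when many rows have a gap in only one column.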

## Step 5: Train-Test Split

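Splitting the dataset into training and testing sets:
- `train_test_split()` randomly divides the data into 70% training and 30% testing; `random_state=42` makes the split reproducible.
- This ensures the model is trained on one portion of the data and evaluated on another.
- `X` and `y` here are the Titanic features and labels from Step 4, and the resulting `X_train`, `X_test`, `y_train`, and `y_test` replace the MNIST arrays of the same names from Step 3.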


In [14]:
# Step 5: Train-Test Split (applied to the Titanic X and y from Step 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
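
When classes are imbalanced, a purely random split can shift the class ratio between the two sets. The sketch below shows the `stratify` option of `train_test_split`, which keeps the ratio the same. The arrays `X_demo` and `y_demo` are synthetic assumptions, and the original notebook does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70 of class 0 and 30 of class 1 (assumption).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the 70/30 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Share of class 1 in the test set:", y_te.mean())
```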


## Step 6: Train SVM Model

We use a Support Vector Machine (SVM) classifier:
- `SVC(kernel='linear')` initializes a linear SVM model.
- `fit(X_train, y_train)` trains the model on the training dataset.


In [15]:
# Step 6: Train SVM Model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
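
SVMs are sensitive to feature scale, so a common variation is to standardize the features before fitting. The following sketch applies that idea to the Iris data rather than Titanic so that it runs without a Kaggle download. The pipeline and the variable names (`scaled_svm`, `Xi_train`, and so on) are illustrative additions, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = datasets.load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(Xi_train, yi_train)

iris_accuracy = scaled_svm.score(Xi_test, yi_test)
print(f"Iris test accuracy: {iris_accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics.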


## Step 7: Make Predictions

We use the trained SVM model to predict outcomes:
- `predict(X_test)` predicts labels for the test dataset.
- These predictions are stored in `y_pred`.


In [16]:
# Step 7: Make Predictions
y_pred = svm_model.predict(X_test)


## Step 8: Evaluate the Model

We evaluate the model's performance using multiple metrics:
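- `accuracy_score(y_test, y_pred)`: The fraction of all test samples that are classified correctly.
- `precision_score(y_test, y_pred, average='weighted')`: For each class, the fraction of samples predicted as that class that truly belong to it, averaged over classes weighted by class size.
- `recall_score(y_test, y_pred, average='weighted')`: For each class, the fraction of its true samples that are predicted correctly, averaged the same way. With weighted averaging this equals the accuracy.
- `confusion_matrix(y_test, y_pred)`: Counts each combination of true and predicted class; rows are true labels and columns are predicted labels.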
- The results are printed for analysis.


In [17]:
# Step 8: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print('Confusion Matrix:')
print(conf_matrix)


Accuracy: 0.82
Precision: 0.82
Recall: 0.82
Confusion Matrix:
[[262  15]
 [ 54  62]]
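
The reported metrics can be recomputed by hand from the confusion matrix printed above. This sketch assumes sklearn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[262, 15],
               [54, 62]])

total = cm.sum()
support = cm.sum(axis=1)      # number of true samples per class
predicted = cm.sum(axis=0)    # number of predictions per class
true_pos = np.diag(cm)        # correctly classified samples per class

accuracy = true_pos.sum() / total
per_class_precision = true_pos / predicted
per_class_recall = true_pos / support

# Weighted averages use each class's share of the true samples as its weight.
weights = support / total
weighted_precision = (weights * per_class_precision).sum()
weighted_recall = (weights * per_class_recall).sum()

print(f"Accuracy: {accuracy:.2f}")
print(f"Weighted precision: {weighted_precision:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
print("Per-class recall:", per_class_recall.round(2))
```

The weighted averages hide a gap between the classes: under the stated convention, recall for class 1 is noticeably lower than for class 0, which is why the per-class values are printed as well.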
