## Multiclass Classification Using SVM
In this assignment, you will perform multiclass classification using SVMs. You will use the SVM implementation from the ``sklearn`` library. Some of the important classes and functions you may use include the following:

**SVC from sklearn.svm**

**classification_report from sklearn.metrics**

**confusion_matrix from sklearn.metrics**

You can consult the SVC documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

## Data Set Information:

The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number.

The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions including multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant importance with the onset of an era characterised by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this sample database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach.

One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.

The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel. The number is a code for the following classes:

| Number | Class |
|---|---|
| 1 | red soil |
| 2 | cotton crop |
| 3 | grey soil |
| 4 | damp grey soil |
| 5 | soil with vegetation stubble |
| 6 | mixture class (all types present) |
| 7 | very damp grey soil |

NB. There are no examples with class 6 in this dataset.

In each line of data, the four spectral values for the top-left pixel are given first, followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on, with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20. If you like, you can use only these four attributes and ignore the others. This avoids the problem that arises when a 3x3 neighbourhood straddles a boundary.
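The attribute layout described above can be sketched with a stand-in row. The array below is hypothetical: it holds the attribute numbers 1 to 36 rather than real spectral values, purely to show which positions belong to the central pixel.

```python
import numpy as np

# Hypothetical stand-in for one data row: 36 values ordered pixel by pixel
# (9 pixels x 4 spectral bands). The values are just attribute numbers 1..36.
row = np.arange(1, 37)

# Attributes 17-20 (1-based) correspond to columns 16..19 (0-based).
central_pixel = row[16:20]
print(central_pixel)  # [17 18 19 20]

# Equivalent view: reshape to (pixel, band) and take the 5th pixel (index 4),
# which is the centre of the 3x3 neighbourhood in row-major order.
pixels = row.reshape(9, 4)
assert (pixels[4] == central_pixel).all()
```

For a pandas DataFrame of features with the same column order, the corresponding selection would be `X.iloc[:, 16:20]`.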


https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)


## 1. Import the data and split it into train/validate/test

Import the file '186_satimage.csv' and split it into train/validate/test sets with ratio ``70/15/15``.

**note: use random_state=777 wherever needed**

**note: the first 36 columns are features and the last column is the class label**


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset from the CSV file
data = pd.read_csv('186_satimage.csv')

# Extract features (first 36 columns) and labels (last column)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split the data into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=777)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=777)

# Print the shapes of the datasets to verify the split
print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Test set shape:", X_test.shape, y_test.shape)


Training set shape: (4500, 36) (4500,)
Validation set shape: (964, 36) (964,)
Test set shape: (965, 36) (965,)
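The two-stage split above is purely random, so class proportions may differ slightly between the three subsets. A minimal sketch of a stratified variant follows, assuming stratification is desired (the assignment does not ask for it). The arrays `X_demo` and `y_demo` are synthetic stand-ins of the same size as the dataset, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6429 rows, 36 features, labels drawn from the
# six classes that occur in the dataset (values are made up).
rng = np.random.default_rng(777)
X_demo = rng.normal(size=(6429, 36))
y_demo = rng.choice([1, 2, 3, 4, 5, 7], size=6429)

# Stage 1: hold out 30% of the rows; stratify keeps class proportions similar.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=777, stratify=y_demo)

# Stage 2: split the held-out 30% in half, giving 15% validation and 15% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=777, stratify=y_tmp)

print(len(X_tr), len(X_va), len(X_te))
```

Because 0.5 of the remaining 30% is 15% of the whole, the subset sizes follow the same 70/15/15 ratio as the unstratified split.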


## 2. Check the details of the features and labels

Check some basic statistics of the features and the label

In [2]:
# For features
feature_statistics = X.describe()

# For the label
label_statistics = y.describe()

print("Feature Statistics:")
print(feature_statistics)

print("\nLabel Statistics:")
print(label_statistics)


Feature Statistics:
          0.117596     1.241362     1.184036     0.815302    -0.158561  \
count  6429.000000  6429.000000  6429.000000  6429.000000  6429.000000   
mean     -0.000864    -0.000941    -0.000787    -0.000399    -0.000838   
std       0.999726     0.999902     0.999934     1.000222     0.999772   
min      -2.234329    -2.473310    -2.780894    -2.624275    -2.223275   
25%      -0.690878    -0.550421    -0.858503    -0.719279    -0.674739   
50%      -0.102897     0.148811     0.102692    -0.084280    -0.084821   
75%       0.779075     0.848043     0.823588     0.497802     0.800057   
max       2.543020     2.333912     2.445605     3.778629     2.569812   

          1.256483     1.193546     0.818486    -0.141965     0.879481  ...  \
count  6429.000000  6429.000000  6429.000000  6429.000000  6429.000000  ...   
mean     -0.000982    -0.000797    -0.000460    -0.000745    -0.000916  ...   
std       0.999913     0.999935     1.000245     0.999883     0.999921  ... 
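The column names in the output above (`0.117596`, `1.241362`, ...) look like standardized feature values rather than labels, which suggests the first data row may have been read as a header. A minimal sketch of how to check this, assuming the CSV has no header line; the three-row CSV text below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row CSV without a header line (values are made up).
csv_text = "0.1,1.2,3\n-0.5,0.3,1\n0.7,-1.1,7\n"

# Default behaviour: the first line is consumed as column names.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every line as data and numbers the columns 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_default.shape)          # (2, 3)
print(without_header.shape)        # (3, 3)
print(list(with_default.columns))  # ['0.1', '1.2', '3']

# For a categorical label, per-class counts are usually more informative
# than describe(); the label is taken to be the last column.
label_counts = without_header.iloc[:, -1].value_counts().sort_index()
print(label_counts)
```

If the real file does lack a header, passing `header=None` to `pd.read_csv` would recover the missing row, and the row counts reported in this notebook would each change accordingly.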

## 3. Train an SVM classifier and fine-tune the hyperparameters on the validation set

After all the fine tuning, report the best results on the validation set.

In [3]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Initialize the SVM classifier
svm_classifier = SVC()

# Define a parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 10],           # Regularization parameter
    'kernel': ['linear', 'rbf'],  # Kernel type
    'gamma': ['scale', 'auto'],   # Kernel coefficient (ignored by the linear kernel)
}

# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(svm_classifier, param_grid, cv=5, n_jobs=-1)

# Fit the model on the training set
grid_search.fit(X_train, y_train)

# Report the best hyperparameters and the mean cross-validation score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
best_estimator = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Score:", best_score)

# Evaluate the best model on the validation set
y_val_pred = best_estimator.predict(X_val)
classification_rep = classification_report(y_val, y_val_pred)

print("\nClassification Report on Validation Set:\n", classification_rep)


Best Hyperparameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best Cross-Validation Score: 0.9071111111111112

Classification Report on Validation Set:
               precision    recall  f1-score   support

           1       0.99      0.97      0.98       238
           2       0.96      0.97      0.97       103
           3       0.89      0.95      0.92       194
           4       0.74      0.64      0.69        94
           5       0.88      0.84      0.86       108
           7       0.85      0.88      0.86       227

    accuracy                           0.90       964
   macro avg       0.88      0.87      0.88       964
weighted avg       0.90      0.90      0.90       964
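The grid search above selects hyperparameters by 5-fold cross-validation on the training set and only afterwards evaluates on the validation set. If the intent is to tune directly against the held-out validation set, one option is scikit-learn's `PredefinedSplit`. The sketch below illustrates that alternative on synthetic data (generated with `make_classification`, not the satellite dataset); it is not the approach used in the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with three classes (not the satellite dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=777)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=777)

# -1 marks rows that are always used for training;
# 0 marks the rows of the single validation fold.
test_fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# refit=False: only score the candidates, do not refit on train + validation.
search = GridSearchCV(SVC(), param_grid, cv=split, refit=False)
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))

print("Best parameters on the validation fold:", search.best_params_)
print("Validation accuracy:", search.best_score_)
```

With a single predefined fold, `best_score_` is the accuracy on the validation rows themselves, so it plays the role of "best result on the validation set" directly.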



## 4. Do the final test on the test set

Do the final scoring on the test set. Report different measures and show the confusion matrix. Record your observations.

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Use the best model found during hyperparameter tuning
best_model = best_estimator

# Make predictions on the test set
y_test_pred = best_model.predict(X_test)

# Calculate various performance metrics
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_test_pred)

# Print the performance metrics and confusion matrix
print("Test Set Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

print("\nConfusion Matrix:")
print(conf_matrix)


Test Set Metrics:
Accuracy: 0.9202072538860103
Precision: 0.9182070119708354
Recall: 0.9202072538860103
F1-Score: 0.9172320589040983

Confusion Matrix:
[[218   0   1   0   2   0]
 [  0 115   0   0   1   0]
 [  0   0 173   3   0   0]
 [  3   0  24  60   2  12]
 [  0   1   1   1  98   4]
 [  0   1   2  14   5 224]]
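To support the observations, per-class recall and precision can be read off the printed confusion matrix. The sketch below copies the matrix from the output above and assumes its rows and columns follow the sorted label order 1, 2, 3, 4, 5, 7 (the order `confusion_matrix` uses when all six labels occur in the test set).

```python
import numpy as np

# Assumed label order for rows (true class) and columns (predicted class).
labels = [1, 2, 3, 4, 5, 7]

# Confusion matrix copied from the test-set output above.
conf = np.array([
    [218,   0,   1,   0,   2,   0],
    [  0, 115,   0,   0,   1,   0],
    [  0,   0, 173,   3,   0,   0],
    [  3,   0,  24,  60,   2,  12],
    [  0,   1,   1,   1,  98,   4],
    [  0,   1,   2,  14,   5, 224],
])

recall = conf.diagonal() / conf.sum(axis=1)     # correct / all true samples of the class
precision = conf.diagonal() / conf.sum(axis=0)  # correct / all predictions of the class
accuracy = conf.trace() / conf.sum()

for label, r, p in zip(labels, recall, precision):
    print(f"class {label}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Under that label ordering, class 4 (damp grey soil) is the weakest: only 60 of its 101 test samples are classified correctly, with 24 predicted as class 3 (grey soil) and 12 as class 7 (very damp grey soil).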
