### Short Coding Project: Support Vector Machines (SVM)

#### Project Overview

In this project, you will apply Support Vector Machines (SVM) to classify whether culverts need repair based on various environmental and physical attributes using the Augmented Culvert Dataset. You will preprocess the data, handle categorical variables, perform feature scaling, build and evaluate an SVM model, and explore advanced topics such as hyperparameter tuning.

- Replace each `# YOUR CODE HERE` placeholder with your own code.
- **Do not change** the variable names.

### Load the Dataset

Start by loading the Augmented Culvert Dataset and examining its structure.

In [None]:
# Import necessary libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = 'https://raw.githubusercontent.com/CyConProject/Lab/main/Datasets/Augmented%20Culvert%20Dataset.csv'
data = pd.read_csv(url)

# Display the first few rows of the dataset
data.head()

### Data Preprocessing

**Handle missing values and encode categorical variables**

- For the `'Flooding_Frequency'` column, replace missing values with `'None'`.
- Convert the `'Cul_rating'` column to a binary target variable where ratings of 0 or 1 are mapped to `0` (needs repair), and ratings of 2, 3, or 4 are mapped to `1` (satisfactory to good condition).
- Encode categorical variables using label encoding. The columns to encode are:
  - `'cul_matl'`
  - `'cul_type'`
  - `'Soil_Drainage_Class'`
  - `'Soil_Surface_Texture'`
  - `'Flooding_Frequency'`

In [None]:
# Fill missing values in 'Flooding_Frequency' with 'None'
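# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify `data`
data['Flooding_Frequency'] = data['Flooding_Frequency'].fillna('None')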

# Convert 'Cul_rating' to binary target variable
def convert_rating(rating):
    if rating in [0, 1]:
        return 0
    else:
        return 1

data['Cul_rating'] = data['Cul_rating'].apply(convert_rating)

# Encode categorical variables using label encoding
from sklearn.preprocessing import LabelEncoder

categorical_columns = ['cul_matl', 'cul_type', 'Soil_Drainage_Class', 'Soil_Surface_Texture', 'Flooding_Frequency']
label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Display the first few rows of the updated dataset
data.head()
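
To see what label encoding does to a column, the following minimal sketch encodes a toy column and recovers the mapping. The column name `material` and its category values are hypothetical and are not taken from the culvert dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with hypothetical category names (not from the culvert dataset)
toy = pd.DataFrame({"material": ["steel", "concrete", "plastic", "concrete"]})

encoder = LabelEncoder()
toy["material_code"] = encoder.fit_transform(toy["material"])

# classes_ is sorted; the position of each class is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
recovered = encoder.inverse_transform(toy["material_code"])
print(list(recovered))
```

The same lookup should work for any fitted encoder stored in `label_encoders`, for example `label_encoders['cul_matl'].classes_`. The codes follow the sorted order of the labels, so their numeric order carries no physical meaning.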

### Question 1: Visualize the Target Variable Distribution

Visualize the distribution of the target variable `'Cul_rating'` to understand the class balance in the dataset.

1. Use `value_counts()` to find the distribution. Plot a bar chart showing the number of instances in each class of `'Cul_rating'`.
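
As a hedged illustration of `value_counts()`, the sketch below counts classes in a small, made-up label series (the values are hypothetical and unrelated to `'Cul_rating'`). Passing `normalize=True` returns class shares instead of counts, which makes an imbalance easier to read.

```python
import pandas as pd

# Hypothetical binary labels standing in for a target column
labels = pd.Series([1, 1, 0, 1, 1, 0, 1, 1], name="label")

counts = labels.value_counts()                # absolute counts, most frequent first
shares = labels.value_counts(normalize=True)  # the same counts as fractions of the total

print(counts)
print(shares)
```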

In [None]:
# Plot a bar chart of 'Cul_rating' distribution
class_counts = # YOUR CODE HERE
plt.figure(figsize=(6, 4))
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.xlabel('Cul_rating')
plt.ylabel('Number of Instances')
plt.title('Distribution of Cul_rating')
plt.show()

### Question 2: Feature Correlation Analysis

Understanding the correlation between features can help in feature selection and data understanding.

1. **Compute the correlation matrix** of the dataset.
2. **Visualize the correlation matrix** using a heatmap.

**Hint**: Use the `.corr()` method and `seaborn`'s `heatmap` function.
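
The sketch below shows what `.corr()` returns on a toy frame, assuming made-up columns `x`, `x_scaled`, and `noise` generated with NumPy. It is only meant to show how to read the matrix: the diagonal is 1, the matrix is symmetric, and linearly related columns have a coefficient close to 1 or -1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical columns: one exact linear function of x, one independent of x
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 3.0 * x + 1.0,
    "noise": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation is the default method
print(corr.round(2))
```

Pearson correlation only measures linear association, so a coefficient near zero does not rule out a nonlinear relationship.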

In [None]:
# Compute correlation matrix
corr_matrix = # YOUR CODE HERE

# Visualize correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.show()


### Question 3: Train and Evaluate the SVM Classifier

1. **Split the dataset** into features (`X`) and target (`y`), and then into training and testing sets using an 80-20 split with `random_state=42`.
2. **Standardize the feature data** using `StandardScaler`. Fit the scaler on the training set only, then apply it to both sets.
3. **Initialize** the SVM classifier with the default hyperparameters (`kernel='rbf'`, `C=1.0`, `gamma='scale'`) and set `random_state=42`.
4. **Train** the model on the training data.
5. **Predict** the classes for the test data.
6. **Calculate the accuracy score**, **precision**, **recall**, and **F1 score** for the test data.
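
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the solution for the culvert data: `make_classification` generates a stand-in dataset, and all variable names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the culvert dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# 80-20 split
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
toy_scaler = StandardScaler()
X_tr_s = toy_scaler.fit_transform(X_tr)
X_te_s = toy_scaler.transform(X_te)

# RBF-kernel SVM with scikit-learn's default C and gamma
clf = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)
clf.fit(X_tr_s, y_tr)
pred = clf.predict(X_te_s)

acc = accuracy_score(y_te, pred)
print(f"accuracy={acc:.2f}")
print(f"precision={precision_score(y_te, pred):.2f}")
print(f"recall={recall_score(y_te, pred):.2f}")
print(f"f1={f1_score(y_te, pred):.2f}")
```

Calling `fit_transform` on the training split and only `transform` on the test split keeps test-set statistics out of the scaling step.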

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Separate features and target variable
X = data.drop('Cul_rating', axis=1)
y = data['Cul_rating']

# Split the data
X_train, X_test, y_train, y_test = # YOUR CODE HERE

# Standardize the feature data
scaler = # YOUR CODE HERE
X_train_scaled = # YOUR CODE HERE
X_test_scaled = # YOUR CODE HERE

# Initialize the SVM classifier
svm_model = # YOUR CODE HERE

# Train the model
# YOUR CODE HERE

# Predict the labels for the test set
y_pred = # YOUR CODE HERE

# Calculate evaluation metrics
accuracy = # YOUR CODE HERE
precision = # YOUR CODE HERE
recall = # YOUR CODE HERE
f1 = # YOUR CODE HERE

# Display the results
print(f"Accuracy Score: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

### Question 4: Cross-Validation

To get a better estimate of the model's performance, perform cross-validation.

1. **Perform 5-fold cross-validation** on the training data and compute the average accuracy.

**Hint**: Use `cross_val_score` from `sklearn.model_selection` with `scoring='accuracy'`.
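
A minimal sketch of 5-fold cross-validation on synthetic data is shown below. As an assumption that goes beyond the exercise, it wraps the scaler and classifier in a pipeline so the scaler is refit inside each fold; the exercise itself can pass the already scaled training data to `cross_val_score` directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the culvert dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# The pipeline refits StandardScaler on the training part of every fold
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))

# One accuracy value per fold
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores)
print(f"mean={scores.mean():.2f}, std={scores.std():.2f}")
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score varies between folds.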

In [None]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = # YOUR CODE HERE

# Compute average accuracy
avg_accuracy = # YOUR CODE HERE

print(f"Cross-Validation Accuracy Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {avg_accuracy:.2f}")


### Question 5: Hyperparameter Tuning (Advanced)

Optimize the SVM classifier's hyperparameters to improve performance.

1. **Perform a grid search** over `C` and `gamma` for the RBF kernel, using `GridSearchCV` with `cv=5` and `scoring='accuracy'`.
2. **Train the SVM classifier** with the best parameters found. Use `random_state=42`.
3. **Evaluate the model** using the test data.

**Hint**: Use `GridSearchCV` from `sklearn.model_selection`. For background on tuning an SVM classifier with grid search, see the
[Scikit-learn Grid Search Documentation](https://scikit-learn.org/stable/modules/grid_search.html).
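
The grid search workflow can be sketched on synthetic data as follows. The data and variable names are hypothetical; the grid values mirror the ones used in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the culvert dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_tr_s = toy_scaler.fit_transform(X_tr)
X_te_s = toy_scaler.transform(X_te)

# 3 x 3 = 9 candidate combinations, each scored with 5-fold cross-validation
toy_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf", random_state=42), toy_grid, cv=5, scoring="accuracy")
search.fit(X_tr_s, y_tr)

print(search.best_params_)
print(f"best CV accuracy={search.best_score_:.2f}")

# With the default refit=True, best_estimator_ is already retrained
# on the full training split using the best parameters
tuned = search.best_estimator_
test_acc = tuned.score(X_te_s, y_te)
print(f"test accuracy={test_acc:.2f}")
```

An equivalent route is to build a new `SVC(**search.best_params_, random_state=42)` and fit it on the training data, which makes the retraining step explicit.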

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

# Initialize GridSearchCV
grid_search = # YOUR CODE HERE

# Perform grid search on training data
# YOUR CODE HERE

# Get the best parameters
best_params = # YOUR CODE HERE
print("Best Parameters:", best_params)

# Train the model with best parameters
best_svm_model = # YOUR CODE HERE

# Predict on test data
y_pred_best = # YOUR CODE HERE

# Evaluate the model
accuracy_best = # YOUR CODE HERE
precision_best = # YOUR CODE HERE
recall_best = # YOUR CODE HERE
f1_best = # YOUR CODE HERE

# Display the results
print(f"Accuracy Score after Hyperparameter Tuning: {accuracy_best:.2f}")
print(f"Precision: {precision_best:.2f}")
print(f"Recall: {recall_best:.2f}")
print(f"F1 Score: {f1_best:.2f}")

After hyperparameter tuning, the SVM model's test accuracy increased from **87% to 98%**. The optimized model, with **C=10** and **gamma=1**, achieved a **precision of 1.00** and a **recall of 0.96** on the test set: it produced no false positives and identified 96% of the positive cases (class `1`, satisfactory to good condition). This demonstrates how strongly the choice of `C` and `gamma` can affect the performance of an RBF-kernel SVM.
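
To make the link between precision, recall, and error counts concrete, the sketch below uses hypothetical labels chosen so the arithmetic is easy to follow. The counts are an assumption for illustration and are not the notebook's actual confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 25 positives (24 found, 1 missed) and 25 negatives (all correct)
y_true_toy = [1] * 25 + [0] * 25
y_pred_toy = [1] * 24 + [0] * 1 + [0] * 25

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_toy = tp / (tp + fp)                   # 24 / (24 + 0)
recall_toy = tp / (tp + fn)                      # 24 / (24 + 1)
accuracy_toy = (tp + tn) / (tp + tn + fp + fn)   # 49 / 50

print(tn, fp, fn, tp)
print(precision_toy, recall_toy, accuracy_toy)
```

With no false positives the precision is exactly 1, while a single missed positive out of 25 gives a recall of 0.96.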