<p style="font-family: 'Arial', sans-serif; font-size: 3rem; color: #6a1b9a; text-align: center; margin: 0; 
           text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.1); background-color: #f5f5f5; padding: 10px; 
           border-radius: 10px; border: 4px solid #6a5acd; box-shadow: 2px 2px 12px rgba(0, 0, 0, 0.1); width: 97%;">
    <span style="font-weight: bold; color: #6a1b9a; animation: pulse 2s infinite;"></span>COMPX310-2025 Lab 7 <br> Support Vector Machines
<br>Banking Marketing Campaign Analysis</p>

<div align="center">
  <img src="https://www.driveresearch.com/wp-content/uploads/2024/10/bank-marketing-strategies.jpg" width="800" height="500">
</div>

### Lab Overview
**Total Points: 3**

In this lab, you will apply **Support Vector Machines (SVM)** for classification and **Support Vector Regression (SVR)** to analyze banking data.

### Learning Objectives
- Understand and apply Support Vector Machines for classification
- Implement Support Vector Regression for numerical prediction
- Evaluate model performance using appropriate metrics

---

<div align="center">
  <img src="https://ts2.tc.mm.bing.net/th/id/OIP-C.JFff7dHIUtNoMov7paoXRQHaEq?rs=1&pid=ImgDetMain&o=7&rm=3" width="800" height="500">
</div>

## 1. Business Understanding

### Background
StirBank has been running marketing campaigns to encourage customers to set up regular savings deposits. However, calling many customers is costly and risks annoying people. The bank needs a data-driven approach to identify which customers are most likely to respond positively.

### Problem Statement
**Part 1 - Classification Task:** Predict whether a customer will make a deposit (`made_deposit`) based on their characteristics and previous interactions.

**Part 2 - Regression Task:** Predict a customer's current balance based on their demographic and banking characteristics.

### Why Data Mining?
Data mining is suitable because:
- We have historical data with known outcomes
- The relationships between variables are likely to be complex and non-linear
- Manual rules would be difficult to create and maintain
- Automated prediction can save costs and improve targeting

### Key Terminology
- **Model**: A mathematical representation learned from data
- **Features/Variables**: Input attributes used for prediction
- **Target**: The variable we want to predict
- **SVM**: Support Vector Machine - finds optimal decision boundaries
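- **Kernel**: Function that measures similarity between data points as if they had been mapped into a higher-dimensional space, without computing that mapping explicitly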
- **Hyperparameters**: Parameters that control the learning process (C, gamma, kernel type)

---

## 2. Import Required Libraries

Import all necessary Python libraries for this lab. You will need:
- Data manipulation libraries (pandas, numpy)
- Visualization libraries (matplotlib, seaborn)
- Preprocessing tools from scikit-learn (StandardScaler, LabelEncoder, train_test_split)
- SVM models from scikit-learn (SVC for classification, SVR for regression)
- Model selection tools (GridSearchCV for hyperparameter tuning)
- Evaluation metrics for both classification and regression tasks
- Configure any necessary settings (warnings, plot styles)
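
One possible import cell is sketched below. It only covers the libraries named in the list above; the `RANDOM_STATE` constant and the choice of which warnings to silence are assumptions, not lab requirements.

```python
import warnings

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and model selection
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV

# SVM models
from sklearn.svm import SVC, SVR

# Evaluation metrics (classification, then regression)
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_curve, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score,
)

# Settings: hide FutureWarning noise only and use a light plot style
warnings.filterwarnings("ignore", category=FutureWarning)
sns.set_style("whitegrid")

RANDOM_STATE = 42  # hypothetical helper constant, reused for every split
```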

---
## 3. Data Understanding 

### Load and Explore the Dataset

Load the 'bank-tr.csv' file and perform initial exploration:
- Read the CSV file into a pandas DataFrame
- Display the shape of the dataset (should be 8000 rows × 20 columns)
- Show the first few rows to understand the data structure
- Display information about column data types and any missing values
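
The exploration steps above can be sketched as follows. So that the cell runs without the data file, it reads a tiny made-up CSV from memory containing only four of the 20 columns; in the lab you would pass `'bank-tr.csv'` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for bank-tr.csv (the values are illustrative only)
sample_csv = io.StringIO(
    "accountID,age,job,made_deposit\n"
    "1,34,technician,yes\n"
    "2,51,retired,no\n"
    "3,,student,yes\n"
)

# In the lab: df = pd.read_csv("bank-tr.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns); the real file should give (8000, 20)
print(df.head())  # first few rows
df.info()         # column dtypes and non-null counts

# Missing values per column
missing = df.isnull().sum()
print(missing)
```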

### Data Dictionary

Review the variables in the dataset and fill in the **Type** and **Usefulness** columns of the table below:
- Go through each column and classify it as **categorical** or **numerical**
- Mark it as **nominal, ordinal, continuous, or discrete**
- Decide whether it is useful for prediction or should be dropped

| Variable | Description | Type | Usefulness |
|----------|-------------|------|------------|
| accountID | Unique customer identifier |  |  |
| town | Customer's town |  |  |
| country | Customer's country |  |  |
| age | Customer's age |  |  |
| job | Type of job |  |  |
| marital | Marital status |  |  |
| education | Education level |  |  |
| defaulted? | Has credit in default |  |  |
| current_balance | Current account balance |  |  |
| housing | Has housing loan |  |  |
| has_loan | Has personal loan |  |  |
| last_contact | Contact communication type |  |  |
| cc_tr | Number of contacts in campaign |  |  |
| last_contact_day | Day of last contact |  |  |
| last_contact_month | Month of last contact |  |  |
| campaign | Number of contacts in this campaign |  |  |
| days_since_last_contact | Days since previous campaign |  |  |
| previous | Number of previous contacts |  |  |
| poutcome | Outcome of previous campaign |  |  |
| made_deposit | Did customer make deposit? (**TARGET VARIABLE**) |  |  |

---


### Summary Statistics

Generate and display summary statistics:
- Use appropriate pandas methods to display summary statistics for numerical variables
- For categorical variables, display the unique values and their frequencies
- Check for any missing values in the dataset

### Target Variable Analysis

Analyze the distribution of the target variable 'made_deposit':
- Display value counts and percentages
- Create visualizations to show the distribution (bar chart and/or pie chart)
- Assess whether the classes are balanced or imbalanced
- Comment on what you observe about the class distribution
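
A minimal sketch of the counts and percentages, using a made-up ten-row target column in place of `df['made_deposit']`. The `imbalance_ratio` name is a hypothetical helper, not something the lab requires.

```python
import pandas as pd

# Made-up target values; in the lab use df["made_deposit"]
made_deposit = pd.Series(["no"] * 6 + ["yes"] * 4, name="made_deposit")

counts = made_deposit.value_counts()
percentages = made_deposit.value_counts(normalize=True) * 100

summary = pd.DataFrame({"count": counts, "percent": percentages.round(1)})
print(summary)

# Rarest class divided by most common class: values well below 1 suggest imbalance
imbalance_ratio = counts.min() / counts.max()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

For the chart, `counts.plot(kind="bar")` is one simple option.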

---
## 4. Data Preparation 

### Step 4.1: Data Quality Assessment

Examine the data for quality issues:
- Check for any inconsistent values in categorical columns (e.g., different spellings or abbreviations)
- Look specifically at binary columns like 'housing', 'has_loan', and 'defaulted?' for any unexpected values
- Identify any typos or data entry errors that need correction
- Document any issues you find

### Step 4.2: Data Cleaning

Clean any data quality issues you identified:
- Create a copy of the dataframe for processing
- Fix any inconsistent values (e.g., standardize abbreviations)
- Document what changes you made and why
- Verify that the cleaning was successful by checking the unique values again
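
As an illustration of the cleaning pattern, the sketch below assumes the binary columns contain the abbreviations `'y'` and `'n'` alongside `'yes'` and `'no'`. That is a hypothetical example; the issues you actually find in the data may differ.

```python
import pandas as pd

# Made-up data with hypothetical inconsistencies
df = pd.DataFrame({
    "housing": ["yes", "no", "y", "n", "yes"],
    "has_loan": ["no", "n", "no", "yes", "y"],
})

# Work on a copy so the raw data stays available for comparison
df_clean = df.copy()

# Map each abbreviation to its full form (assumed mapping)
replacements = {"y": "yes", "n": "no"}

for col in ["housing", "has_loan"]:
    print(col, "before:", sorted(df_clean[col].unique()))
    df_clean[col] = df_clean[col].replace(replacements)
    print(col, "after: ", sorted(df_clean[col].unique()))
```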

### Step 4.3: Visualize Distributions (Before Preprocessing)

Create histograms to visualize the distribution of key numerical variables:
- Select important numerical features (age, current_balance, campaign, days_since_last_contact, previous)
- Create histogram plots to understand their distributions
- Look for skewness, outliers, or unusual patterns
- These visualizations will help you understand if any transformations are needed

### Step 4.4: Remove Irrelevant Features

Remove features that won't contribute to prediction:
- Drop the 'accountID' column (it's just an identifier with no predictive value)
- Drop the 'country' column (all values are the same - UK, so it provides no information)
- Consider whether to keep or drop 'town' (too many unique values may not be useful)
- Print the remaining columns to verify the changes
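
A short sketch of the column removal on made-up rows. Checking `nunique()` first is a quick way to confirm that a column is constant or has very high cardinality before deciding to drop it.

```python
import pandas as pd

# Made-up rows carrying only the columns relevant to this step
df_clean = pd.DataFrame({
    "accountID": [101, 102, 103],
    "town": ["Stirling", "Leeds", "Bath"],
    "country": ["UK", "UK", "UK"],
    "age": [34, 51, 28],
    "made_deposit": ["yes", "no", "yes"],
})

# A constant column has exactly one unique value
print("country unique values:", df_clean["country"].nunique())
print("town unique values:   ", df_clean["town"].nunique())

# Drop the identifier and the constant column
df_model = df_clean.drop(columns=["accountID", "country"])
print(df_model.columns.tolist())
```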

### Step 4.5: Feature Engineering (Optional)

Create new features that might improve model performance:
- Consider creating a binary feature indicating whether the customer was contacted before
- Consider creating age groups from the continuous age variable
- Think about other features that might be useful based on domain knowledge
- Document any new features you create and your reasoning
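
The first two ideas could look like the sketch below. The column names `contacted_before` and `age_group`, and the bin edges, are illustrative assumptions; choose boundaries that make sense for the data you see.

```python
import pandas as pd

# Made-up rows
df = pd.DataFrame({"age": [19, 30, 45, 67], "previous": [0, 2, 0, 5]})

# 1 if the customer had at least one earlier contact, else 0
df["contacted_before"] = (df["previous"] > 0).astype(int)

# Bin age into groups; each interval includes its right edge
age_bins = [0, 25, 40, 60, 120]
age_labels = ["<=25", "26-40", "41-60", "60+"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)

print(df)
```

Remember that a new categorical feature such as `age_group` must be encoded like any other categorical column before it reaches the SVM.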

---
## 5. Modeling - Part 1: SVM Classification

### Objective
Build a Support Vector Machine classifier to predict whether a customer will make a deposit.

### Step 5.1: Prepare Data for Classification

Separate features and target variable:
- Create X (features) by dropping the target column 'made_deposit' and any variables you won't use
- Create y (target) from the 'made_deposit' column
- Convert the target to binary format (1 for 'yes', 0 for 'no')
- Verify the shapes and distribution of your features and target

### Step 5.2: Encode Categorical Variables

Convert categorical variables to numerical format:
- Identify which columns are categorical and which are numerical
- Use LabelEncoder to encode categorical variables into numerical values
- Store the encoders in a dictionary in case you need them later
- Verify that all columns are now numerical
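
Steps 5.1 and 5.2 can be sketched together on a made-up four-row frame. Note that `LabelEncoder` assigns integers in alphabetical order of the class names, so the resulting numbers imply an ordering that may not be meaningful; scikit-learn documents it as a target encoder, and `OneHotEncoder` is a common alternative for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows
df = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "job": ["technician", "retired", "student", "technician"],
    "marital": ["single", "married", "single", "divorced"],
    "made_deposit": ["yes", "no", "yes", "no"],
})

# Step 5.1: separate features and a binary target
X = df.drop(columns=["made_deposit"])
y = (df["made_deposit"] == "yes").astype(int)

# Step 5.2: encode every non-numeric column and keep its encoder
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col])
    encoders[col] = encoder

print(X.dtypes)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```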

### Step 5.3: Split Data into Train, Validation, and Test Sets

Create three datasets for proper model evaluation:
- Split the data into 60% training, 20% validation, and 20% test
- First split 80-20 for (train+validation) and test
- Then split the 80% into 75-25 to get 60% train and 20% validation
- Use stratification to maintain class balance across splits
- Set random_state=42 for reproducibility
- Print the sizes of each set to verify the split
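
The two-stage split can be sketched as below, with random arrays standing in for the encoded features and binary target. Because 25% of the remaining 80% equals 20% of the whole, the second call uses `test_size=0.25`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, roughly 30% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.3).astype(int)

# First split: 80% train+validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Second split: 75% / 25% of the 80%, giving 60% train and 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.3f}")
```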

### Step 5.4: Feature Scaling

**⚠️ CRITICAL:** SVMs are very sensitive to feature scales. You must standardize features.

Apply StandardScaler to normalize features:
- Initialize a StandardScaler
- Fit the scaler on the training data ONLY
- Transform all three sets (train, validation, test) using the fitted scaler
- Never fit the scaler on validation or test data to avoid data leakage
- Verify that the scaled training features have mean≈0 and std≈1 (the validation and test sets will be close to these values but not exact)
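
A minimal sketch of leakage-free scaling, using random arrays in place of the real splits. The scaler learns its mean and standard deviation from the training set only and reuses them for the other two sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the three splits
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(600, 3))
X_val = rng.normal(loc=50, scale=10, size=(200, 3))
X_test = rng.normal(loc=50, scale=10, size=(200, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data ONLY
X_val_scaled = scaler.transform(X_val)          # reuse training statistics
X_test_scaled = scaler.transform(X_test)

print("train mean:", X_train_scaled.mean(axis=0).round(3))
print("train std: ", X_train_scaled.std(axis=0).round(3))
print("val mean:  ", X_val_scaled.mean(axis=0).round(3))
```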

### Step 5.5: Build Baseline SVM Model

Create a simple baseline SVM model:
- Initialize an SVC (Support Vector Classifier) with default parameters
- Use the RBF kernel (this is the default)
- Train the model on the scaled training data
- Make predictions on the validation set
- Calculate and display the accuracy score
- Print a classification report showing precision, recall, and F1-score
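
The baseline steps can be sketched as follows. The data comes from `make_classification`, a synthetic generator used here only so the cell runs on its own; in the lab you would use your scaled bank features instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale using training statistics only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Baseline: default hyperparameters (RBF kernel is the default)
baseline_svm = SVC(kernel="rbf")
baseline_svm.fit(X_train_scaled, y_train)

val_pred = baseline_svm.predict(X_val_scaled)
baseline_acc = accuracy_score(y_val, val_pred)
print(f"Baseline validation accuracy: {baseline_acc:.3f}")
print(classification_report(y_val, val_pred, target_names=["No deposit", "Deposit"]))
```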

### Step 5.6: Hyperparameter Tuning

Use GridSearchCV to find the best hyperparameters:
- Define a parameter grid with different values for:
  - **C**: Regularization parameter (try 0.1, 1, 10, 100)
  - **gamma**: Kernel coefficient for the RBF kernel (try 'scale', 'auto', 0.001, 0.01, 0.1); it is ignored when the kernel is 'linear'
  - **kernel**: Kernel type (try 'rbf' and 'linear')
- Create a GridSearchCV object with 5-fold cross-validation
- Fit the grid search on the scaled training data
- Display the best parameters found
- Display the best cross-validation score
- This process may take several minutes to complete
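
A sketch of the grid search with the values listed above, run on a small synthetic training set so that it finishes quickly. On the full 8000-row dataset the same grid will be much slower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for the scaled training set
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],  # has no effect for kernel="linear"
    "kernel": ["rbf", "linear"],
}

grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train_scaled, y_train)

print("Best parameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")
```

Since `gamma` is ignored by the linear kernel, passing a list of two dictionaries (one per kernel) as `param_grid` is an optional way to avoid refitting identical linear models.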

### Step 5.7: Analyze Grid Search Results

Examine the grid search results:
- Extract the cv_results_ from the grid search
- Create a DataFrame to view the results in a structured format
- Display the top 10 parameter combinations
- Note which parameters seem to work best
- Compare different kernels and their performance
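
The results table can be built as sketched below. To stay self-contained, this cell fits a deliberately reduced grid on synthetic data; in the lab you would reuse the grid search object you already fitted.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data and a reduced grid, just to have results to inspect
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# One row per parameter combination
results = pd.DataFrame(grid_search.cv_results_)

columns = ["param_C", "param_kernel", "mean_test_score",
           "std_test_score", "rank_test_score"]
top = results[columns].sort_values("rank_test_score").head(10)
print(top)

# Best mean CV score achieved by each kernel
kernel_summary = results.groupby("param_kernel")["mean_test_score"].max()
print(kernel_summary)
```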

### Step 5.8: Train Final SVM Model

Use the best model from grid search:
- Extract the best estimator from the grid search results
- Make predictions on the validation set
- Calculate accuracy and compare it to the baseline model
- Print a detailed classification report
- Note the improvement (if any) over the baseline

### Step 5.9: Final Evaluation on Test Set

Evaluate the final model on unseen test data:
- Use the best model to make predictions on the test set
- Calculate and display the test accuracy
- Generate a comprehensive classification report
- This is your final model performance - report these numbers

---
## 6. Evaluation - Classification 

### Step 6.1: Confusion Matrix

Create and visualize a confusion matrix:
- Generate a confusion matrix comparing actual vs predicted values
- Visualize it using a heatmap with proper labels
- Extract the four values: True Negatives, False Positives, False Negatives, True Positives
- Interpret what each quadrant means:
  - True Negatives: Correctly predicted "No deposit"
  - False Positives: Predicted "Deposit" but actually "No" (Type I error)
  - False Negatives: Predicted "No" but actually "Deposit" (Type II error)
  - True Positives: Correctly predicted "Deposit"
- Discuss the business implications of each type of error
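
A small worked example of extracting the four counts, using ten made-up labels rather than real model output. With `labels=[0, 1]`, rows are actual classes and columns are predicted classes, so `ravel()` returns the values in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = deposit, 0 = no deposit
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}  (called, but no deposit)")
print(f"False Negatives: {fn}  (missed a likely depositor)")
print(f"True Positives:  {tp}")
```

For the heatmap, `seaborn.heatmap(cm, annot=True, fmt="d")` with axis labels for actual and predicted classes is one straightforward option.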

### Step 6.2: ROC Curve and AUC Score

Generate and plot the ROC (Receiver Operating Characteristic) curve:
- Train a new SVM with probability=True to get probability predictions
- Use the same best parameters from grid search
- Get probability predictions for the positive class on the test set
- Calculate the false positive rate, true positive rate, and thresholds
- Calculate the AUC (Area Under Curve) score
- Plot the ROC curve with a diagonal reference line
- Interpret the AUC score:
  - 0.9-1.0: Excellent
  - 0.8-0.9: Good
  - 0.7-0.8: Fair
  - Below 0.7: Poor
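
The ROC computation can be sketched as follows on synthetic data. The `C` and `gamma` values here are placeholders; in the lab, substitute the best parameters found by your grid search.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# probability=True enables predict_proba (placeholder hyperparameters)
prob_svm = SVC(kernel="rbf", C=1, gamma="scale", probability=True, random_state=42)
prob_svm.fit(X_train_scaled, y_train)

# Column 1 holds the probability of the positive class
test_proba = prob_svm.predict_proba(X_test_scaled)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, test_proba)
auc_score = roc_auc_score(y_test, test_proba)
print(f"AUC: {auc_score:.3f}")
```

Plotting `fpr` against `tpr`, plus a dashed line from (0, 0) to (1, 1), gives the ROC curve with its random-guess reference.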

### Step 6.3: Error Analysis by Segments

Analyze where the model performs well and where it struggles:
- Create a dataframe combining test features, actual values, and predictions
- Add a column indicating whether the prediction was correct
- Calculate accuracy for different customer segments (e.g., by job type, age group, marital status)
- Identify segments where the model performs poorly
- Discuss potential reasons for lower performance in certain segments
- Suggest improvements based on this analysis
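
A minimal sketch of per-segment accuracy on eight made-up rows. In the lab, the segment column should come from the original (unencoded, unscaled) test rows so that the group names are readable.

```python
import pandas as pd

# Made-up segment labels, actual outcomes and predictions
analysis = pd.DataFrame({
    "job": ["retired", "retired", "student", "student",
            "technician", "technician", "technician", "student"],
    "actual":    [1, 0, 1, 1, 0, 0, 1, 0],
    "predicted": [1, 0, 0, 1, 0, 1, 1, 0],
})

# True where the prediction matched the actual outcome
analysis["correct"] = analysis["actual"] == analysis["predicted"]

# Accuracy and sample size per segment, worst segments first
segment_accuracy = (
    analysis.groupby("job")["correct"]
    .agg(accuracy="mean", n="count")
    .sort_values("accuracy")
)
print(segment_accuracy)
```

Keep the sample size `n` in view when interpreting the table: accuracy on a very small segment is unreliable.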

---
## 7. Modeling - Part 2: Support Vector Regression

### Objective
Build a Support Vector Regression model to predict a customer's current_balance.

### Step 7.1: Prepare Data for Regression

Set up the regression task:
- Create X (features) by dropping 'current_balance' (target), 'made_deposit' (classification target), and any engineered features you won't use
- Create y (target) as the 'current_balance' column
- Display summary statistics of the target variable
- Create visualizations (histogram and boxplot) to understand the distribution of current_balance
- Note any skewness or outliers in the target variable

### Step 7.2: Encode Categorical Variables

Encode categorical features for regression:
- Identify categorical and numerical columns
- Apply LabelEncoder to each categorical column
- Store encoders for potential future use
- Verify all columns are now numerical

### Step 7.3: Split Data for Regression

Create train, validation, and test sets:
- Use the same 60-20-20 split strategy as in classification
- First split: 80% (train+val) and 20% test
- Second split: 75% train and 25% validation from the 80%
- Use random_state=42 for reproducibility
- Print the sizes of each set

### Step 7.4: Feature Scaling for Regression

Scale both features and target variable:
- Initialize two StandardScalers: one for features (X) and one for target (y)
- Fit the feature scaler on training features and transform all three sets
- Fit the target scaler on training target and transform all three target sets
- Note: Scaling the target helps SVR performance, but remember to inverse transform predictions
- Verify the scaling was applied correctly
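
Target scaling needs one extra detail: `StandardScaler` expects a 2D array, so a 1D target must be reshaped before scaling and flattened afterwards. The sketch below uses made-up balances and shows the round trip through `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up balances standing in for the training and validation targets
y_train = np.array([120.0, 850.0, 1500.0, 40.0, 2300.0])
y_val = np.array([600.0, 90.0])

y_scaler = StandardScaler()

# reshape(-1, 1) makes a single column; ravel() flattens back to 1D
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
y_val_scaled = y_scaler.transform(y_val.reshape(-1, 1)).ravel()

# Round trip back to the original units
y_val_restored = y_scaler.inverse_transform(y_val_scaled.reshape(-1, 1)).ravel()

print("scaled train mean:", round(float(y_train_scaled.mean()), 6))
print("restored validation targets:", y_val_restored)
```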

### Step 7.5: Build Baseline SVR Model

Create a baseline Support Vector Regression model:
- Initialize an SVR with default parameters and RBF kernel
- Train on the scaled training data
- Make predictions on the validation set (remember these are scaled)
- Inverse transform the predictions to get them back to original scale
- Calculate and display:
  - RMSE (Root Mean Squared Error)
  - MAE (Mean Absolute Error)
  - R² Score (coefficient of determination)
- Interpret these metrics in the context of the problem
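
The baseline SVR workflow can be sketched as below, with `make_regression` providing synthetic data in place of the bank features and `current_balance`. RMSE is computed as the square root of the MSE, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data
X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Separate scalers for features and target, both fitted on training data
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_train_s = x_scaler.transform(X_train)
X_val_s = x_scaler.transform(X_val)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

baseline_svr = SVR(kernel="rbf")
baseline_svr.fit(X_train_s, y_train_s)

# Predictions are in scaled units; convert back before computing metrics
val_pred_s = baseline_svr.predict(X_val_s)
val_pred = y_scaler.inverse_transform(val_pred_s.reshape(-1, 1)).ravel()

rmse = np.sqrt(mean_squared_error(y_val, val_pred))
mae = mean_absolute_error(y_val, val_pred)
r2 = r2_score(y_val, val_pred)
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.3f}")
```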

### Step 7.6: Hyperparameter Tuning for SVR

Find optimal hyperparameters using GridSearchCV:
- Define a parameter grid with:
  - **C**: Regularization (try 0.1, 1, 10, 100)
  - **gamma**: Kernel coefficient for the RBF kernel (try 'scale', 'auto', 0.001, 0.01); it is ignored when the kernel is 'linear'
  - **epsilon**: Epsilon-tube (try 0.01, 0.1, 0.2)
  - **kernel**: Kernel type (try 'rbf', 'linear')
- Use GridSearchCV with 5-fold cross-validation
- Use 'neg_mean_squared_error' as the scoring metric
- Fit on the scaled training data
- Display the best parameters and best score (the score is the negated MSE in scaled units, so values closer to 0 are better)
- Note: This may take several minutes
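
A sketch of the SVR grid search using the values listed above, on a small synthetic training set so it runs quickly. Converting the best score to an RMSE, as done at the end, is an optional convenience.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Small synthetic stand-in for the scaled training data
X_train, y_train = make_regression(
    n_samples=150, n_features=5, noise=15.0, random_state=42
)
X_train_s = StandardScaler().fit_transform(X_train)
y_train_s = StandardScaler().fit_transform(y_train.reshape(-1, 1)).ravel()

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01],  # has no effect for kernel="linear"
    "epsilon": [0.01, 0.1, 0.2],
    "kernel": ["rbf", "linear"],
}

svr_grid = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_squared_error")
svr_grid.fit(X_train_s, y_train_s)

print("Best parameters:", svr_grid.best_params_)
print(f"Best score (negative MSE, scaled units): {svr_grid.best_score_:.4f}")

# Flip the sign and take the root to get an RMSE in scaled units
best_cv_rmse = np.sqrt(-svr_grid.best_score_)
print(f"Best CV RMSE (scaled units): {best_cv_rmse:.4f}")
```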

### Step 7.7: Train Final SVR Model

Evaluate the tuned SVR model:
- Extract the best estimator from grid search
- Make predictions on the validation set
- Inverse transform predictions to original scale
- Calculate RMSE, MAE, and R² score
- Compare with the baseline model
- Report the improvement achieved through tuning

### Step 7.8: Final SVR Evaluation on Test Set

Evaluate the final SVR model on unseen test data:
- Use the best model to predict on the test set
- Inverse transform predictions
- Calculate and display final test metrics (RMSE, MAE, R²)
- Compare test performance with validation performance
- Report the mean and standard deviation of the target variable for context

---
## 8. Evaluation - Regression

### Step 8.1: Predicted vs Actual Values

Visualize model predictions:
- Create a scatter plot with actual values on x-axis and predicted values on y-axis
- Add a diagonal reference line representing perfect predictions
- Format the plot with proper labels and title
- Interpret the plot:
  - Points close to the line indicate good predictions
  - Systematic deviations suggest bias
  - Spread around the line indicates variance in predictions
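
One way to draw this plot is sketched below with five made-up value pairs. In the lab, `y_test` and `y_pred` would be the actual balances and the inverse-transformed SVR predictions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual and predicted balances
y_test = np.array([100.0, 250.0, 400.0, 900.0, 1500.0])
y_pred = np.array([120.0, 230.0, 420.0, 850.0, 1450.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.6, label="Predictions")

# Diagonal spanning the full data range: points on it are perfect predictions
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], "r--", label="Perfect prediction")

ax.set_xlabel("Actual current_balance")
ax.set_ylabel("Predicted current_balance")
ax.set_title("SVR: predicted vs actual")
ax.legend()
```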

### Step 8.2: Residual Analysis

Analyze prediction errors:
- Calculate residuals (actual - predicted)
- Create two plots:
  1. Residual plot: residuals vs predicted values
  2. Histogram of residuals
- Check for patterns in the residual plot:
  - Random scatter is good (homoscedasticity)
  - Patterns suggest model is missing something
  - Funnel shape suggests heteroscedasticity
- Check if residuals are normally distributed (from histogram)
- Calculate and display residual statistics (mean, std, min, max)
- Ideally, mean should be close to 0

---
## 9. Summary and Conclusions

### Step 9.1: Comprehensive Performance Summary

Create a comprehensive summary of both models:
- Summarize classification results:
  - Best hyperparameters found
  - Test accuracy
  - AUC score
  - Confusion matrix breakdown
- Summarize regression results:
  - Best hyperparameters found
  - Test RMSE, MAE, R²
  - Typical prediction error
- Compare both models' performance relative to baselines

### Step 9.2: Business Insights and Recommendations

Write a text analysis covering:

**Key Findings:**
- What did you learn about customer behavior from the classification model?
- What factors seem most important for predicting deposits?
- How accurate is the balance prediction from regression?

**Business Recommendations:**
- How should StirBank use the classification model?
- Which customers should they prioritize for marketing calls?
- What is the cost-benefit trade-off between false positives and false negatives?
- Should they consider different strategies for different customer segments?

**Model Limitations:**
- Where do the models perform poorly?
- What assumptions might not hold in practice?
- How might model performance degrade over time?

---
## Reflection Questions

Answer the following questions based on your work:

1. **Kernel Selection:** How does the RBF kernel differ from the linear kernel? When might you prefer one over the other?

2. **Hyperparameter C:** What is the role of the C parameter in SVM? What happens when C is very large? Very small?

3. **Feature Scaling:** Why is feature scaling critical for SVM? What would happen if you forgot to scale?

4. **Error Trade-offs:** In the classification task, which is worse for the bank: false positives or false negatives? Why?

5. **Model Improvements:** Based on your analysis, what are three specific ways you could improve these models?

6. **CRISP-DM:** How did following the CRISP-DM framework help structure your analysis? What stages were most important?

7. **Practical Deployment:** What challenges might arise when deploying these models in production at StirBank?

---
**End of Lab 7**

### Submission Checklist:
- ✅ All code cells executed without errors
- ✅ All visualizations displayed correctly
- ✅ Results interpreted and explained
- ✅ Reflection questions answered
- ✅ Business recommendations provided
- ✅ Code is well-commented and readable