# Activity 6
For this activity, let's consider the previous classification problem:

## Predicting Customer Churn in a Telecom Company
### Problem Overview
In this classification problem, the goal is to predict whether a customer will churn (leave) or stay with a telecom company based on several features such as customer demographics, service usage, and account information.

- Class label: Churn (1 = Yes, 0 = No)

    - 1: Customer has churned.
    - 0: Customer has stayed.
    
- Features
    - Customer ID: Unique identifier for each customer.
    - Gender: Whether the customer is male or female.
    - Age: Age of the customer.
    - Tenure: Number of months the customer has been with the company.
    - Service Plan: Type of service plan (e.g., Basic, Premium).
    - Monthly Charges: Monthly bill of the customer.
    - Total Charges: Total amount billed to the customer.
    - Internet Service: Whether the customer has internet service (Yes/No).
    - Tech Support: Whether the customer has tech support (Yes/No).
    - Paperless Billing: Whether the customer opts for paperless billing (Yes/No).
    - Payment Method: Payment method (e.g., Bank Transfer, Credit Card).
    - Contract Type: Contract type (e.g., Month-to-month, One year, Two year).
    - Phone Service: Whether the customer has phone service (Yes/No).
    - Multiple Lines: Whether the customer has multiple lines (Yes/No).
    
### Classification Models:
- K-Nearest Neighbors (KNN) density estimation
    - KNN density estimation approximates the probability density at a data point from the volume needed to enclose its k nearest neighbors: the smaller that volume, the higher the estimated density.
- Support Vector Machine (SVM) with linear kernel for hard and soft margin
    - A linear SVM is the simplest form of SVM. It separates the two classes with a hyperplane, choosing the one with the largest margin (the distance from the hyperplane to the closest training points).
    - A hard margin SVM requires the hyperplane to separate the classes with no margin violations (i.e., no points inside the margin or on the wrong side of the hyperplane). This is only possible when the data is perfectly linearly separable.
    - A soft margin SVM allows some margin violations (points inside the margin, including misclassified points) but penalizes them through a cost parameter C. The parameter C controls the trade-off between maximizing the margin and minimizing violations. A large C penalizes violations heavily and yields a narrower margin that fits the training data more closely, while a small C tolerates more violations in exchange for a wider margin.
- Non-Linear SVM
    - When the data is not linearly separable, we use non-linear kernels such as RBF (Radial Basis Function), Polynomial, and Sigmoid. A kernel computes similarities between points as if they had been mapped into a higher-dimensional space, where a linear separation may be possible, without computing that mapping explicitly.
    - RBF is a popular non-linear kernel that measures similarity with a Gaussian function of the distance between two points.
    - A polynomial kernel corresponds to a feature space of polynomial combinations of the original features. This can be useful when the features interact non-linearly.
    - The sigmoid kernel measures similarity with a tanh (sigmoid-shaped) function of the dot product between two points. It's less commonly used, but still an option.
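
For reference, the role of C described above can be read off the standard soft-margin objective. The symbols $w$, $b$, $\xi_i$, and $\gamma$ are notation introduced here rather than taken from the activity, and the labels are written as $y_i \in \{-1, +1\}$ instead of the 0/1 coding used for Churn:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

The hard margin case corresponds to forcing every slack $\xi_i$ to zero, and the margin width is $2/\lVert w\rVert$. The RBF kernel mentioned above is usually written as

$$
K(x, x') = \exp\!\left(-\gamma\,\lVert x - x'\rVert^2\right), \qquad \gamma > 0 .
$$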


### Dataset
We will use the same dataset we used in Activity 4. This dataset contains information for over 7000 customers.

## Import all necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns

## Load data from the file

In [2]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Preliminary EDA

In [3]:
# Drop the identifier column; it carries no predictive information
df_cleaned = df.drop(columns=['customerID'])

# Convert 'TotalCharges' to numeric before inspecting dtypes (non-numeric entries become NaN).
# If it was read as text, it would otherwise be treated as categorical and label-encoded below.
df_cleaned['TotalCharges'] = pd.to_numeric(df_cleaned['TotalCharges'], errors='coerce')

# Identify numerical columns
numerical_columns = df_cleaned.select_dtypes(include=['int64', 'float64']).columns

# Identify categorical columns (object, category)
categorical_columns = df_cleaned.select_dtypes(include=['object', 'category']).columns

# Impute missing values with the mean for numerical columns (including 'TotalCharges')
df_cleaned[numerical_columns] = df_cleaned[numerical_columns].fillna(df_cleaned[numerical_columns].mean())
# Impute missing values for categorical columns with the mode (most frequent value)
for column in categorical_columns:
    mode_value = df_cleaned[column].mode()[0]  # Get the most frequent value
    df_cleaned[column] = df_cleaned[column].fillna(mode_value)

# Encode categorical variables (including the target 'Churn') as integer codes
le = LabelEncoder()
for col in categorical_columns:
    df_cleaned[col] = le.fit_transform(df_cleaned[col])

# Split the data into features and target
X = df_cleaned.drop('Churn', axis=1)
y = df_cleaned['Churn']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

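# Standardize the numerical features (important for distance- and margin-based models such as KNN and SVM).
# The scaler is fitted on the training set only and then applied to the test set.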
scaler = StandardScaler()
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])

## Train Different Models

### Visualizing the difference between hard and soft margin SVM
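
The difference is easiest to see in two dimensions, so the sketch below uses a synthetic 2-D dataset from `make_blobs` rather than the churn data. A very large `C` stands in for a hard margin, since scikit-learn's `SVC` has no separate hard-margin mode. The names `X_demo`, `margin_models`, and `margin_width` are illustrative, not part of the activity.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 2-D stand-in data (NOT the churn data): two noisy clusters
X_demo, y_demo = make_blobs(n_samples=100, centers=[[-2, -2], [2, 2]],
                            cluster_std=1.2, random_state=0)

# A very large C approximates a hard margin; a small C gives a soft margin
margin_models = {
    "near-hard margin (C=1000)": SVC(kernel="linear", C=1000).fit(X_demo, y_demo),
    "soft margin (C=0.01)": SVC(kernel="linear", C=0.01).fit(X_demo, y_demo),
}

def margin_width(model):
    """Width of the margin band, 2 / ||w||, for a fitted linear SVC."""
    return 2.0 / np.linalg.norm(model.coef_)

# Evaluate the decision function on a grid to draw the boundary and margins
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 200),
    np.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(11, 4.5), sharex=True, sharey=True)
for ax, (name, model) in zip(axes, margin_models.items()):
    scores = model.decision_function(grid).reshape(xx.shape)
    ax.scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, cmap="coolwarm", s=20)
    # Solid line: decision boundary; dashed lines: the two margin edges
    ax.contour(xx, yy, scores, levels=[-1, 0, 1], colors="k",
               linestyles=["--", "-", "--"])
    # Circle the support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=90, facecolors="none", edgecolors="k")
    ax.set_title(f"{name}\nwidth={margin_width(model):.2f}, "
                 f"support vectors={len(model.support_vectors_)}")
plt.tight_layout()
```

The panel with the small `C` should show a wider band between the dashed lines, with more points falling inside it.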

### Linear SVM
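
A minimal sketch of fitting a linear-kernel SVM is given below. To keep the cell runnable on its own it uses synthetic stand-in data from `make_classification`. The helper name `fit_linear_svm` and the variables prefixed `Xs_`/`ys_` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fit_linear_svm(X_tr, y_tr, C=1.0):
    """Fit a soft-margin SVM with a linear kernel; C is the penalty on margin violations."""
    return SVC(kernel="linear", C=C).fit(X_tr, y_tr)

# Synthetic stand-in data so the cell runs on its own (NOT the churn data)
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   n_informative=5, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

# Fit the scaler on the training split only, then apply it to both splits
syn_scaler = StandardScaler().fit(Xs_train)

linear_svm = fit_linear_svm(syn_scaler.transform(Xs_train), ys_train, C=1.0)
linear_acc = accuracy_score(ys_test, linear_svm.predict(syn_scaler.transform(Xs_test)))
print(f"Linear SVM accuracy on the stand-in data: {linear_acc:.3f}")
```

In this notebook the same helper could be called as `fit_linear_svm(X_train, y_train)`, since the preprocessing cell has already scaled the numerical features.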

### Non-linear SVMs
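
One way to compare kernels is to fit the same `SVC` with each kernel and score it on held-out data. The sketch below does this on a synthetic `make_moons` dataset, chosen because its classes are not linearly separable. It is not the churn data, and the names `Xm_*`, `ym_*`, and `kernel_scores` are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a curved class boundary (NOT the churn data)
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42, stratify=y_moons)

kernel_scores = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    # Scaling lives inside the pipeline so it is fitted on the training split only
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(Xm_train, ym_train)
    kernel_scores[kernel] = clf.score(Xm_test, ym_test)  # test accuracy

for kernel, score in kernel_scores.items():
    print(f"{kernel:>8}: {score:.3f}")
```

Which kernel scores best depends on the data, so the ranking seen here should not be assumed to carry over to the churn features.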

### Hyperparameter tuning in RBF SVM
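
For an RBF SVM the two main hyperparameters are `C` and `gamma`. A cross-validated grid search over both could look like the sketch below. The grid values and the helper name `tune_rbf_svm` are assumptions, and the demo runs on synthetic data that is assumed to be standardized already.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_rbf_svm(X_tr, y_tr, cv=5):
    """Grid-search C and gamma for an RBF SVM using cross-validated accuracy.

    Assumes X_tr has already been standardized.
    """
    param_grid = {
        "C": [0.1, 1, 10, 100],            # penalty on margin violations
        "gamma": ["scale", 0.01, 0.1, 1],  # width of the Gaussian kernel
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
    return search.fit(X_tr, y_tr)

# Synthetic stand-in data so the cell runs on its own (NOT the churn data)
X_rbf, y_rbf = make_classification(n_samples=300, n_features=8,
                                   n_informative=5, random_state=0)
X_rbf = StandardScaler().fit_transform(X_rbf)

rbf_search = tune_rbf_svm(X_rbf, y_rbf)
print("Best parameters:", rbf_search.best_params_)
print(f"Best cross-validated accuracy: {rbf_search.best_score_:.3f}")
```

Because `GridSearchCV` refits the best model on all of the data it was given by default, `rbf_search.best_estimator_` can then be used directly for prediction.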

### KNN Density Estimation for Classification
KNN is generally used as a classifier that takes the majority vote of the k nearest neighbors. However, for density estimation, we look at how the KNN algorithm can be used to estimate the likelihood of the target variable for a given data point.
In a KNN density estimation setup, we are typically interested in:
- **Probability Estimation**: The probability of a class for a given point can be approximated by the proportion of neighbors that belong to that class. For example, if 3 out of the 5 nearest neighbors of a data point belong to class 1 (churn), the estimated probability of that point belonging to class 1 is 0.6.

For simplicity, we will use KNN directly as a classifier, since that fits the task more naturally, and obtain its predictions from these estimated class probabilities.
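
The neighbor-proportion idea can be checked against scikit-learn directly: with the default uniform weights, `KNeighborsClassifier.predict_proba` returns the share of the k neighbors in each class. The sketch below verifies this on synthetic stand-in data. The variable names are illustrative and the data is not the churn data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the churn data): 250 training points, 50 query points
Xk, yk = make_classification(n_samples=300, n_features=6, random_state=0)
Xk_train, yk_train, Xk_query = Xk[:250], yk[:250], Xk[250:]

k = 5
knn = KNeighborsClassifier(n_neighbors=k).fit(Xk_train, yk_train)

# predict_proba returns one column per class; column 1 is the estimate for class 1
proba_class1 = knn.predict_proba(Xk_query)[:, 1]

# The same estimate by hand: fraction of the k nearest neighbors labeled 1
_, neighbor_idx = knn.kneighbors(Xk_query)
manual_class1 = yk_train[neighbor_idx].mean(axis=1)

print("Estimates agree:", np.allclose(proba_class1, manual_class1))

# With an odd k and two classes there are no ties, so the predicted class
# is simply the one with estimated probability above 0.5
knn_pred = (proba_class1 > 0.5).astype(int)
```

Passing `weights="distance"` instead would weight closer neighbors more heavily, in which case the probabilities are no longer simple neighbor counts.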

### Hyperparameter tuning
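
For KNN the main choices are the number of neighbors and how neighbors are weighted. A hedged sketch of a cross-validated search follows. The grid, the helper name `tune_knn`, and the synthetic stand-in data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def tune_knn(X_tr, y_tr, cv=5):
    """Grid-search k and the neighbor weighting using cross-validated accuracy.

    Assumes X_tr has already been standardized, since KNN is distance-based.
    """
    param_grid = {
        "n_neighbors": [3, 5, 7, 11, 15, 21],  # odd values avoid ties in binary voting
        "weights": ["uniform", "distance"],
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
    return search.fit(X_tr, y_tr)

# Synthetic stand-in data so the cell runs on its own (NOT the churn data)
X_knn, y_knn = make_classification(n_samples=300, n_features=8,
                                   n_informative=5, random_state=1)
X_knn = StandardScaler().fit_transform(X_knn)

knn_search = tune_knn(X_knn, y_knn)
print("Best parameters:", knn_search.best_params_)
print(f"Best cross-validated accuracy: {knn_search.best_score_:.3f}")
```

In this notebook the call would presumably be `tune_knn(X_train, y_train)`, keeping the test set untouched until the final evaluation.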

## Model Evaluation

#### Predictions

#### Performance Measurement
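
A small helper can report the same metrics for every fitted model so that results are comparable. The sketch below is one possible shape for it. `evaluate_classifier` is a hypothetical name, the class names assume the 0 = stayed / 1 = churned coding from the problem overview, and the demo uses synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_classifier(name, model, X_te, y_te):
    """Print a per-class report and return accuracy plus the confusion matrix."""
    y_pred = model.predict(X_te)
    print(f"=== {name} ===")
    print(classification_report(y_te, y_pred, labels=[0, 1],
                                target_names=["Stayed (0)", "Churned (1)"],
                                zero_division=0))
    return {
        "model": name,
        "accuracy": accuracy_score(y_te, y_pred),
        # Rows are true classes, columns are predicted classes, in the order [0, 1]
        "confusion_matrix": confusion_matrix(y_te, y_pred, labels=[0, 1]),
    }

# Synthetic stand-in data so the cell runs on its own (NOT the churn data)
X_ev, y_ev = make_classification(n_samples=300, n_features=6, random_state=2)
Xe_train, Xe_test, ye_train, ye_test = train_test_split(
    X_ev, y_ev, test_size=0.2, random_state=2, stratify=y_ev)

demo_model = KNeighborsClassifier(n_neighbors=5).fit(Xe_train, ye_train)
demo_result = evaluate_classifier("KNN (k=5)", demo_model, Xe_test, ye_test)
print(f"Accuracy: {demo_result['accuracy']:.3f}")
```

If the classes are imbalanced, the per-class precision and recall in the report are more informative than accuracy alone.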

## Performance Comparison

### Conclusion?