## 1. Importing Relevant Libraries

In [None]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler, FunctionTransformer, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import RFECV, RFE, SelectPercentile, chi2, f_classif
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier

## 2. Loading the data

In [None]:
df = pd.read_csv("train.csv")
df.head()

## 3. EDA

We create a copy of the DataFrame for EDA so that the original data stays untouched.

In [None]:
df_eda = df.copy()  # work on a copy; `df` is split and preprocessed later

In [None]:
df_eda.shape

In [None]:
df_eda.columns

In [None]:
df_eda.head()

In [None]:
df_eda.info()

In [None]:
df_eda.isnull().sum()

### 3.2. Handling missing values


**Note:**  
We are modifying a **copy** of the DataFrame (`df_eda`) for **Exploratory Data Analysis (EDA)**. The actual modifications will only be applied **after the train-test split** to avoid data leakage.
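
To make the leakage concern concrete, here is a minimal, library-free sketch. The toy lists `train_jobs` and `valid_jobs` and both helper functions are hypothetical and only stand in for the real split; the point is that the fill value is learned from the training rows and then reused unchanged on the validation rows.

```python
from collections import Counter

def fit_mode(train_values):
    """Learn the most frequent non-null value from the training rows only."""
    return Counter(v for v in train_values if v is not None).most_common(1)[0][0]

def apply_fill(values, fill_value):
    """Fill nulls with a value learned elsewhere; nothing is re-estimated here."""
    return [fill_value if v is None else v for v in values]

# Hypothetical toy data standing in for the `job` column after a split
train_jobs = ["technician", "management", "technician", None]
valid_jobs = ["management", None, "management"]

job_mode = fit_mode(train_jobs)                  # statistic learned on train only
valid_filled = apply_fill(valid_jobs, job_mode)  # validation rows never influence it
```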

---

### Missing Value Handling Strategy

We will handle missing values in the following columns as described below:

#### **1. `job`**
- **Missing Values:** 229 null values (0.5% of the data).
- **Imputation Strategy:** Replace null values with the **mode** of the column (most frequent value).

#### **2. `education`, `contact`, and `poutcome`**
- **Reasoning:** The metadata indicates that these variables have "unknown" as a valid category. Therefore, we will replace null values with **"unknown"**.
  
  - **`poutcome`**: 29,451 null values (75% of the data).
  - **`contact`**: 10,336 null values (26% of the data).

---

### Important Note:
Imputing with the **mode** will not produce meaningful results for `poutcome` and `contact` due to the **large proportion of missing data** in these columns. Instead, imputing with `"unknown"` is a better approach.
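
The difference between the two strategies can be sketched on a toy column. The data below is invented for illustration (75% missing, similar in spirit to `poutcome`), and the helper names are assumptions rather than part of the notebook.

```python
from collections import Counter

def impute_mode(values):
    """Replace None with the most frequent non-null value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_constant(values, fill="unknown"):
    """Replace None with an explicit placeholder category."""
    return [fill if v is None else v for v in values]

# Toy column: 3 observed values, 9 missing
toy_poutcome = ["failure", "success", "failure"] + [None] * 9

mode_counts = Counter(impute_mode(toy_poutcome))
constant_counts = Counter(impute_constant(toy_poutcome))
```

With mode imputation, the nine missing rows are all relabeled as `failure`, which changes the observed category balance; the constant fill keeps them in a separate `unknown` group.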

In [None]:
# Impute missing `job` values with the column mode
imputer_eda = SimpleImputer(strategy='most_frequent')
df_eda['job'] = imputer_eda.fit_transform(df_eda[['job']]).ravel()

In [None]:
# Convert null values to "unknown" for the education, contact, and poutcome columns
df_eda['education'] = df_eda['education'].fillna('unknown')
df_eda['contact'] = df_eda['contact'].fillna('unknown')
df_eda['poutcome'] = df_eda['poutcome'].fillna('unknown')

In [None]:
# all null values are handled now
df_eda.isnull().sum()

### 3.3. Descriptive Statistics of Numerical Features


In [None]:
numerical_features = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact','poutcome']
cat_out = ["target"]

In [None]:
df_eda.describe()

#### **1. `balance`**
- **Observation:** 2,800 people have a balance of **`0`** in their accounts.

#### **2. `pdays`**
- **Observation:** 29,446 values are **`-1`**, which indicates that the **client was not previously contacted**.

#### **3. `previous`**
- **Observation:** 29,456 values are **`0`**, which also indicates that the **client was not previously contacted**.


In [None]:
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
axes = axes.flatten()
bins = [20, 10, 10, 10, 10, 10]  # one bin count per numerical feature

for ax, feature, n_bins in zip(axes, numerical_features, bins):
  df_eda[feature].plot(kind='hist', bins=n_bins, ax=ax)
  ax.set_title(feature)
  ax.set_xlabel(feature)
  ax.set_ylabel('Frequency')

### 3.4. Descriptive statistics of Categorical features

In [None]:
for feature in categorical_features:
  # Take labels and counts from the same value_counts() call so they stay aligned;
  # unique() returns values in order of appearance, not in order of frequency.
  counts = df_eda[feature].value_counts()
  df_temp = pd.DataFrame({
      'values': counts.index,
      'value_counts': counts.values,
      'proportions': counts.values / counts.sum(),
  })
  print(feature)
  print(df_temp)
  print("\n\n")


The following plots provide insights into the distributions of key categorical features in the dataset:

#### **1. `job`**
- The most common occupations are **blue-collar**, **management**, and **technician**.
- Other categories, such as **student**, **housemaid**, and **unemployed**, have significantly lower representation.

#### **2. `marital`**
- The majority of individuals are **married**, followed by **single** individuals.
- A smaller proportion of individuals are **divorced**.

#### **3. `education`**
- Most individuals have completed **secondary education**, followed by **tertiary education**.
- A small proportion has **primary education** or an **unknown** level of education.

#### **4. `default`**
- The vast majority of individuals do not have credit in default (**no**).
- Only a very small fraction of individuals are in default (**yes**).

#### **5. `housing`**
- More than half of the individuals have a housing loan (**yes**), while a significant proportion do not (**no**).

#### **6. `loan`**
- The majority of individuals do not have personal loans (**no**), with a smaller percentage having loans (**yes**).

#### **7. `contact`**
- Most individuals were contacted via **cellular phones**, with fewer being contacted via **telephone**.
- A large number of contact methods are categorized as **unknown**.

#### **8. `poutcome`**
- The **previous campaign outcome** is predominantly labeled as **unknown**, which aligns with a large proportion of clients not being previously contacted.
- Other categories, such as **failure**, **other**, and **success**, have much smaller proportions.


In [None]:
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i in range(len(categorical_features)):
  feature = categorical_features[i]

  df_eda[feature].value_counts().plot(kind='bar', ax=axes[i])
  axes[i].set_title(feature)
  axes[i].set_xlabel(feature)
  axes[i].set_ylabel('Count')

plt.subplots_adjust(hspace=0.6)
plt.subplots_adjust(wspace=0.25)


### 3.5. Outlier Detection

The box plots provide insights into the distribution and potential outliers for key numerical features in the dataset:

#### **1. `age`**
- The majority of individuals are aged between **30 and 50 years**, as shown by the interquartile range (IQR).
- A few outliers are observed for individuals older than **60 years**, but the range remains reasonable.

#### **2. `balance`**
- Account balances are heavily skewed, with most individuals having balances close to **0**.
- There are significant outliers where balances exceed **1,000,000**, indicating a few individuals with exceptionally high balances.

#### **3. `duration`**
- The duration of calls varies greatly, with most calls lasting less than **500 seconds**.
- Several extreme outliers exist for calls exceeding **4,000 seconds**, indicating very long interactions for some clients.

#### **4. `campaign`**
- Most clients were contacted fewer than **10 times**, with the IQR showing most data points between **1 and 5** contacts.
- Extreme outliers exist with clients contacted over **50 times**, which could indicate aggressive marketing strategies.

#### **5. `pdays`**
- A large proportion of values for `pdays` are concentrated at **-1**, indicating clients not previously contacted.
- For the rest of the data, there are extreme values, with some clients having a contact gap exceeding **800 days**.

#### **6. `previous`**
- Most clients had **0 to 1 previous contacts**, with the IQR concentrated at **0**.
- Significant outliers exist where some clients had over **200 previous contacts**, suggesting repeated interactions for a small subset of clients.

---

### Key Observations:
- **Outliers:** Most features, especially `balance`, `duration`, `campaign`, `pdays`, and `previous`, contain significant outliers.
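
Box plots conventionally flag points that lie more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule with the standard library only; the `toy_campaign` values are made up for illustration, and `iqr_outliers` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical contact counts: mostly small, one extreme value
toy_campaign = [1, 1, 2, 2, 3, 3, 4, 5, 55]
outliers = iqr_outliers(toy_campaign)
```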

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(10, 5))
axes = axes.flatten()

for ax, feature in zip(axes, numerical_features):
  # Draw on the axes created above instead of re-creating them with plt.subplot
  sns.boxplot(x=df_eda[feature], ax=ax)
  ax.set_title(f"Box Plot of {feature}")

plt.subplots_adjust(hspace=0.6)

### 3.6 Class Frequency

In [None]:
df_eda["target"].value_counts()

In [None]:
sns.countplot(x="target", data=df_eda)
plt.show()

### 3.7 Encoding Categorical Data


**Note:**  
We are modifying a **copy** of the DataFrame (`df_eda`) for **Exploratory Data Analysis (EDA)**. The actual modifications will only be applied **after the train-test split** to avoid data leakage.

---

### Encoding Approach

The categorical features in the dataset will be encoded as follows:

#### **1. Binary Features**
- The following features are binary and will be directly encoded using **binary encoding**:
  - **`default`**
  - **`housing`**
  - **`loan`**

#### **2. Non-Ordinal Categorical Features**
- The following features will be encoded using **OneHotEncoding**:
  - **`job`**
  - **`marital`**
  - **`education`**
  - **`contact`**
  - **`poutcome`**

**Reasoning:**  
- These features lack an **ordinal relationship** between categories.  
- Even if some categories exhibit ordinal relationships, these relationships may not be linear, making **OneHotEncoding** the preferred choice.
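
As a rough illustration of why one-hot encoding imposes no order, the sketch below encodes a toy `marital` column with the first category dropped as the baseline. It is a simplified, library-free stand-in for `OneHotEncoder(drop="first")`; the helper names and the toy data are assumptions made for this example.

```python
def fit_categories(train_values):
    """Learn the sorted list of categories from the training values."""
    return sorted(set(train_values))

def one_hot_drop_first(values, categories):
    """Encode values as 0/1 indicators, omitting the first category as baseline."""
    kept = categories[1:]
    return [[1 if v == c else 0 for c in kept] for v in values]

toy_marital = ["married", "single", "divorced", "married"]
categories = fit_categories(toy_marital)  # sorted alphabetically
encoded = one_hot_drop_first(toy_marital, categories)
```

Each remaining category gets its own indicator column, and the dropped baseline (`divorced` here) is represented by all zeros, so no category is numerically "larger" than another.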

#### **3. Target Variable**
- The target variable (`target`) will also be encoded using **binary encoding**.


In [None]:
ordinal_encoder_eda = OrdinalEncoder()
ordinal_encoder_eda.fit(df_eda[['default', 'housing', 'loan']])
df_eda[['default', 'housing', 'loan']] = ordinal_encoder_eda.transform(df_eda[['default', 'housing', 'loan']])

In [None]:
one_hot_cols = ['job', 'marital', 'education', 'contact', 'poutcome']
one_hot_encoder_eda = OneHotEncoder(drop="first")
one_hot_encoded_eda = one_hot_encoder_eda.fit_transform(df_eda[one_hot_cols])
# Reuse df_eda's index so the concat below aligns rows correctly
df_encoded_eda = pd.DataFrame(one_hot_encoded_eda.toarray(), columns=one_hot_encoder_eda.get_feature_names_out(), index=df_eda.index)
df_eda = pd.concat([df_eda, df_encoded_eda], axis=1)
df_eda.drop(columns=one_hot_cols, inplace=True)

In [None]:
df_eda['target'] = df_eda['target'].map({"yes": 1, "no": 0})

In [None]:
df_eda.info()

**Handling Date column `last contact date`**

In [None]:
last_contact = pd.to_datetime(df_eda['last contact date'])  # parse once, reuse below
df_eda['last_contact_year'] = last_contact.dt.year
df_eda['last_contact_month'] = last_contact.dt.month
df_eda['last_contact_day'] = last_contact.dt.day
df_eda.drop(columns=['last contact date'], inplace=True)

### 3.8 Correlation

In [None]:
correlation_matrix = df_eda.corr()
correlation_matrix

In [None]:
correlation_matrix['target'].sort_values(ascending=False)

In [None]:
# 10 features most correlated with the label (head(11) because `target` itself ranks first)
correlation_matrix['target'].abs().sort_values(ascending=False).head(11)

In [None]:
best_features = correlation_matrix['target'].abs().sort_values(ascending=False).head(11).index
best_features

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
correlation_matrix_best = df_eda[best_features].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_best, cmap='coolwarm')
plt.title('Correlation Matrix of the 10 Features Most Correlated with the Target')
plt.show()

## 4. Train Test Split

In [None]:
X = df.drop(columns=['target'])
y = df['target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

## 5. Data Preprocessing

### 5.1. Handling Missing Values

The missing values in the dataset will be addressed using the following strategies:

---

#### **1. `job`**
- **Missing Values:** 229 null values (0.5% of the data).
- **Imputation Strategy:** Replace null values with the **mode** of the column.

---

#### **2. `education`, `contact`**
- These input variables have **"unknown"** as a valid category based on the metadata.
- **Imputation Strategy:** Replace null values with **"unknown"**.

---

### Important Note:
- **Imputation with Mode Limitation:**  
  Imputation with the **mode** will not be effective for columns like `poutcome` and `contact` due to their large proportion of missing data:
  - **`poutcome`:** 23,578 null values (75% of the data).
  - **`contact`:** 8,267 null values (26% of the data).

Replacing these with **"unknown"** helps preserve the data.


### 5.2. Encoding Categorical Features

The categorical features in the dataset will be encoded as follows:

---

#### **1. Binary Features**
- The following features are binary and will be encoded directly using **binary encoding**:
  - **`default`**
  - **`housing`**
  - **`loan`**

---

#### **2. Non-Ordinal Categorical Features**
- The following features will be encoded using **OneHotEncoding**:
  - **`job`**
  - **`marital`**
  - **`education`**
  - **`contact`**

**Reasoning:** There is no ordinal relationship among the categories of these variables.

---

### 5.3. Feature Scaling

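We will use **Standard Scaler** for scaling the dataset for the following reason:

- **Comparable feature scales:** Standard Scaler centers each numerical feature at zero mean and scales it to unit variance, which suits the linear, SGD, and SVM models used below. Because it relies on the mean and standard deviation, it is still influenced by the outliers seen in the EDA, though it compresses the bulk of the data less than min-max scaling does.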

---

#### Key Point:
- **Scaling Scope:**  
  Scaling will be applied **only to numerical features** in the dataset.
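
For reference, standardization maps each value to `(value - mean) / std`, with the mean and standard deviation estimated on the training data. The sketch below uses only the standard library; the toy ages and the helper names are assumptions for illustration. It uses the population standard deviation, which is what `StandardScaler` uses as well.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate mean and population std (ddof=0) on the training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Apply statistics learned on the training data to any set of values."""
    return [(v - mean) / std for v in values]

toy_train_age = [25, 35, 45, 55]
toy_valid_age = [45, 65]

mean, std = fit_standardizer(toy_train_age)
train_z = standardize(toy_train_age, mean, std)
valid_z = standardize(toy_valid_age, mean, std)  # reuses the training statistics
```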


### 5.4. Data Preprocessing Pipeline

In [None]:
date_features = ['last_contact_year', 'last_contact_month', 'last_contact_day', 'last_contact_weekday']
numeric_features = ['age', 'balance', 'duration', 'campaign']
special_num = ["previous"]
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact']
special_cat = ['poutcome']

### Special Transformations: `previous`

The column **`previous`**, which represents the number of contacts performed before this campaign, is being transformed as follows:

| Original Value (`previous`) | Transformed Value (`new_val`) | Description                               |
|-----------------------------|-------------------------------|-------------------------------------------|
| **0**                       | **-1**                       | Indicates no prior contacts               |
| **> 0**                     | **+1**                       | Indicates one or more prior contacts      |

**Purpose:**  
This transformation simplifies the `previous` column into a binary representation for easier modeling.

In [None]:
def transform_previous(values):
    # +1 if the client had at least one prior contact, -1 otherwise
    return np.where(values > 0, 1, -1).reshape(-1, 1)
previous_transformer = FunctionTransformer(transform_previous)


We will extract the following components from the `last contact date`:
- **last_contact_year**
- **last_contact_month**
- **last_contact_day**
- **last_contact_weekday**
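
As a small illustration of these four components, the sketch below splits a single date with the standard library. The ISO date string is a made-up example (the format of the real column is not assumed here), and `extract_date_parts` is a hypothetical helper. Weekdays are numbered with Monday as 0, the same convention as pandas' `dt.weekday`.

```python
from datetime import date

def extract_date_parts(iso_date):
    """Split an ISO-formatted date string into year, month, day, and weekday."""
    d = date.fromisoformat(iso_date)
    return {
        "last_contact_year": d.year,
        "last_contact_month": d.month,
        "last_contact_day": d.day,
        "last_contact_weekday": d.weekday(),  # Monday=0, Sunday=6
    }

parts = extract_date_parts("2009-04-17")  # a Friday
```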


In [None]:
def extract_date(X):
  X = X.copy()
  last_contact = pd.to_datetime(X['last contact date'])  # parse once, reuse below
  X['last_contact_year'] = last_contact.dt.year
  X['last_contact_month'] = last_contact.dt.month
  X['last_contact_day'] = last_contact.dt.day
  X['last_contact_weekday'] = last_contact.dt.weekday  # Monday=0, Sunday=6
  return X.drop(columns=['last contact date'])
date_transformer = FunctionTransformer(extract_date)

#### Special Transformations: `poutcome` and `pdays`

#### Transformation for `poutcome`
The column **`poutcome`**, which represents the outcome of the previous marketing campaign, is being transformed as follows:

| Original Value (`poutcome`)       | Transformed Value (`new_val`) | Description                                   |
|-----------------------------------|-------------------------------|-----------------------------------------------|
| **failure**                       | **-1**                        | Indicates a failure in the previous campaign  |
| **other**, **unknown** *(null)*   | **0**                         | Indicates no clear outcome or unknown result  |
| **success**                       | **+1**                        | Indicates a success in the previous campaign  |

**Purpose:**  
This transformation simplifies `poutcome` into a numerical representation for modeling.

---

#### Action for `pdays`
The column **`pdays`**, which represents the number of days since the last contact, will be **dropped** from the dataset.

**Reason:**  
The transformed `poutcome` column provides equivalent information, making `pdays` redundant.

In [None]:
def transform_poutcome(values):
    # Map "failure" to -1 and "success" to 1; "other", "unknown", and nulls become 0.
    # Series.map avoids relying on NaN as a dict key, which is fragile.
    mapping = {'failure': -1, 'success': 1}
    column = pd.DataFrame(values).iloc[:, 0]
    return column.map(mapping).fillna(0).to_numpy(dtype=int).reshape(-1, 1)

poutcome_transformer = FunctionTransformer(transform_poutcome)

In [None]:
preprocess_pipeline = ColumnTransformer([
    ('date', Pipeline([
        ("extract", date_transformer),
        ("scale", StandardScaler())
    ]), ['last contact date']),
    
    # Numeric features: only scaling
    ('num', StandardScaler(), numeric_features),

    # Job: mode imputation and one-hot encoding
    ('job', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(drop='first'))
    ]), ['job']),

    # Education, contact: constant imputation and one-hot encoding
    ('edu_con', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('onehot', OneHotEncoder(drop='first'))
    ]), ['education', 'contact']),

    # Default, housing, loan: constant imputation and ordinal encoding
    ('def_hou_loan', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('ordinal', OrdinalEncoder())
    ]), ['default', 'housing', 'loan']),

    # Marital: one-hot encoding (no imputation needed)
    ('marital', OneHotEncoder(drop='first'), ['marital']),

    ("previous", previous_transformer, ["previous"]),

    ("poutcome", poutcome_transformer, ["poutcome"])
])

In [None]:
# Fit and transform the data
X_transformed = preprocess_pipeline.fit_transform(X_train)

# Get feature names
feature_names = (
    date_features +
    numeric_features +
    preprocess_pipeline.named_transformers_['job'].named_steps['onehot'].get_feature_names_out(['job']).tolist() +
    preprocess_pipeline.named_transformers_['edu_con'].named_steps['onehot'].get_feature_names_out(['education', 'contact']).tolist() +
    ['default', 'housing', 'loan'] +
    preprocess_pipeline.named_transformers_['marital'].get_feature_names_out(['marital']).tolist() + ["previous", "p_outcome"]
)

# Create DataFrame with appropriate column names
X_transformed_df = pd.DataFrame(X_transformed, columns=feature_names, index=X_train.index)


In [None]:
X_transformed_df.head(10)

In [None]:
X_train.head(10)

We will use **SMOTE (Synthetic Minority Oversampling Technique)** to address class imbalance by generating synthetic samples for the minority class, for the following reasons:

- **Our dataset is small**: SMOTE augments the minority class with synthetic samples rather than duplicated rows, which suits smaller datasets.
- **Lower overfitting risk**: Compared with random oversampling, which repeats existing rows, creating new synthetic samples reduces the risk of overfitting.
- **Preserves important patterns**: By interpolating between existing samples, SMOTE makes the minority class samples more diverse while maintaining the overall structure of the data.

This gives the model a balanced training set, so its predictions are less biased toward the majority class.
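
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified, library-free illustration and not the `imblearn` implementation: the function name, the toy points, and the choice of `k` are assumptions. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points in two dimensions
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(toy_minority, n_new=5)
```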

In [None]:
# Preprocessing Pipeline with SMOTE
smote = SMOTE(sampling_strategy="auto", random_state=42)
imb_pipe = ImbPipeline([
    ('preprocess', preprocess_pipeline),
    ('smote', smote)
])

In [None]:
X_resampled, y_resampled = imb_pipe.fit_resample(X_train, y_train)

In [None]:
from collections import Counter
print("Class distribution after resampling:", Counter(y_resampled))

## 6. Baseline Model (Logistic Regression)

In [None]:
log_reg_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', LogisticRegression())
])

In [None]:
log_reg_pipe.fit(X_train, y_train)

In [None]:
y_pred = log_reg_pipe.predict(X_valid)


In [None]:
accuracy = accuracy_score(y_valid, y_pred)
print(f'Accuracy: {accuracy:.2f}')

## 7. Linear Model

### 7.1. Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
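# Only valid solver/penalty pairs are searched: liblinear supports l1 and l2,
# while elasticnet (which needs l1_ratio) and no penalty require saga.
common = {'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 300]}
param_grid = [
    {**common, 'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2']},
    {**common, 'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5]},  # l1_ratio=0.5 is an assumed value
    {'solver': ['saga'], 'penalty': [None], 'max_iter': [100, 200, 300]},  # penalty=None needs scikit-learn >= 1.2
]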

In [None]:
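from sklearn.metrics import f1_score, make_scorer
# y_train holds the string labels "yes"/"no", so the positive class is named explicitly
grid_search = GridSearchCV(model, param_grid, cv=5, scoring=make_scorer(f1_score, pos_label="yes"))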

In [None]:
X_train_trans = preprocess_pipeline.fit_transform(X_train)

In [None]:
grid_search.fit(X_train_trans, y_train)

In [None]:
grid_search.best_params_

In [None]:
log_reg_grid_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', LogisticRegression(C=0.01, max_iter=100, penalty='l1', solver='liblinear'))
])

In [None]:
log_reg_grid_pipe.fit(X_train, y_train)

In [None]:
log_reg_grid_pipe.score(X_valid, y_valid)

In [None]:
print(classification_report(y_valid, log_reg_grid_pipe.predict(X_valid)))

## 8. Stochastic Gradient Descent

### 8.1. SGD Classifier

In [None]:
model = SGDClassifier()

In [None]:
X_train_trans = preprocess_pipeline.transform(X_train)

In [None]:
param_grid = {
    'loss': ['hinge', 'modified_huber'],  # Different loss functions for classification
    'penalty': ['l2', 'l1'],  # Regularization types
    'alpha': [0.0001, 0.001, 0.01],  # Regularization strength
    'eta0': [0.001, 0.01, 0.1],  # Initial learning rate
    'max_iter': [1000, 3000],  # Maximum iterations
    'tol': [1e-3, 1e-5]  # Tolerance for stopping criteria
}

In [None]:
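from sklearn.metrics import f1_score, make_scorer
# y_train holds the string labels "yes"/"no", so the positive class is named explicitly
grid_search = GridSearchCV(model, param_grid, cv=5, scoring=make_scorer(f1_score, pos_label="yes"), n_jobs=-1, verbose=1)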

In [None]:
grid_search.fit(X_train_trans, y_train)

In [None]:
grid_search.best_params_

In [None]:
sgd_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', SGDClassifier(alpha=0.0001, eta0=0.001, loss="hinge", max_iter=1000, penalty="l2", tol=0.001))
])

In [None]:
sgd_pipe.fit(X_train, y_train)

In [None]:
sgd_pipe.score(X_train, y_train)

In [None]:
sgd_pipe.score(X_valid, y_valid)

## 9. Support Vector Machines (SVM)

### 9.1 Support Vector Classifier (SVC)

In [None]:
svc_model = SVC(kernel='linear', random_state=42)

In [None]:
svc_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', svc_model)
])

In [None]:
svc_pipe.fit(X_train, y_train)

In [None]:
svc_pipe.score(X_train, y_train)

In [None]:
preds = svc_pipe.predict(X_valid)

In [None]:
accuracy = accuracy_score(y_valid, preds)
accuracy

In [None]:
print(classification_report(y_valid, preds))

## 10. Decision Trees

### 10.1. Decision Tree Classifier

In [None]:
decision_tree_model = DecisionTreeClassifier(max_depth=8, random_state=42)

In [None]:
decision_tree_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', decision_tree_model),
])

In [None]:
decision_tree_pipe.fit(X_train, y_train)

In [None]:
decision_tree_pipe.score(X_train, y_train)

In [None]:
print(classification_report(y_valid, decision_tree_pipe.predict(X_valid)))

## 11. Random Forest

### 11.1. Random Forest Classifier

In [None]:
rand_frst_clf = RandomForestClassifier()

In [None]:
rand_frst_pipe = Pipeline([
    ('preprocess', preprocess_pipeline),
    ('model', rand_frst_clf)
])

In [None]:
rand_frst_pipe.fit(X_train, y_train)

In [None]:
rand_frst_pipe.score(X_train, y_train)

In [None]:
rand_frst_pipe.score(X_valid, y_valid)

## 12. XGBoost

In [None]:
xg_boost = XGBClassifier()

In [None]:
xg_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('model', xg_boost)
])

In [None]:
y_train_trans = y_train.map({"yes": 1, "no": 0})
xg_pipe.fit(X_train, y_train_trans)

In [None]:
xg_pipe.score(X_train, y_train_trans)

In [None]:
y_valid_trans = y_valid.map({"yes": 1, "no": 0})
xg_pipe.score(X_valid, y_valid_trans)

In [None]:
print(classification_report(y_valid_trans, xg_pipe.predict(X_valid)))

## 13. Fine-Tuning the Best Models

### Support Vector Classifier (SVC)

In [None]:
svc_model = SVC(random_state=42)

In [None]:
sp = SelectPercentile(f_classif)

In [None]:
svc_pipe = ImbPipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('sp', sp),
    ('model', svc_model)
])


In [None]:
param_grid = {
    # PolynomialFeatures: Experiment with degrees
    'poly_features__degree': [1, 2],  # Linear, quadratic

    # SelectPercentile: Tune the percentage of features to select
    'sp__percentile': [50, 75, 100],  # Select 50%, 75%, or all features

    # SVC Hyperparameters
    'model__C': [0.01, 0.1, 1],  # Regularization parameter
}

In [None]:
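from sklearn.metrics import f1_score, make_scorer

grid_search_svc = GridSearchCV(
    estimator=svc_pipe,
    param_grid=param_grid,
    # y_train holds the string labels "yes"/"no", so the positive class is named explicitly
    scoring=make_scorer(f1_score, pos_label="yes"),
    cv=2,
    verbose=10,
    n_jobs=-1  # Use all available cores for faster computation
)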

In [None]:
X_train_trans = preprocess_pipeline.fit_transform(X_train)
grid_search_svc.fit(X_train_trans, y_train)

In [None]:
grid_search_svc.best_estimator_

In [None]:
best_svc_pipe = Pipeline([
    ("preprocess", preprocess_pipeline),
    ("poly_features", PolynomialFeatures(degree=1, include_bias=False)),
    ('sp', SelectPercentile(f_classif, percentile=75)),
    ('model', SVC(C=0.1, random_state=42))
])

In [None]:
best_svc_pipe.fit(X_train, y_train)

In [None]:
best_svc_pipe.score(X_train, y_train)

In [None]:
best_svc_pipe.score(X_valid, y_valid)

**How SVC Works:**  
SVC constructs a hyperplane in a high-dimensional space to separate classes, aiming to maximize the margin between them. It’s effective for datasets with clear class separation and can use kernels for non-linear relationships.
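
For the linear case, the decision rule and the margin can be written out directly. The sketch below assumes a hyperplane `w · x + b = 0` with made-up weights; the function names are hypothetical and the values are not taken from the fitted model. The margin width `2 / ||w||` is the quantity a linear SVC tries to maximize.

```python
import math

def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters
w, b = (3.0, 4.0), -5.0
```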

---

**Performance Analysis**
- **Majority Class (`no`)**: High precision (0.89) and recall (0.95), leading to an F1-score of 0.92. SVC performs well on this dominant class.
- **Minority Class (`yes`)**: Precision (0.55) and recall (0.37) are low, resulting in an F1-score of 0.44. The model struggles with the minority class due to imbalanced data.
- **Overall**: Accuracy is 86%, but the macro average F1-score (0.68) highlights poor balance between the two classes.
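
The F1-scores quoted above follow from the reported precision and recall, since F1 is their harmonic mean. A short check, using only the rounded figures from this report (so the last digit may differ slightly from values computed on unrounded inputs):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_no = f1_from(0.89, 0.95)      # majority class
f1_yes = f1_from(0.55, 0.37)     # minority class
macro_f1 = (f1_no + f1_yes) / 2  # unweighted mean over the two classes
```

The macro average weights both classes equally, which is why it sits well below the accuracy when the minority class is predicted poorly.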

---

**Why It Performed This Way**
- **Class Imbalance**: SVC does not handle imbalanced data well and tends to prioritize the majority class.
- **Margin Maximization**: It relies on maximizing the margin, which can lead to ignoring minority class samples if they don’t significantly affect the margin.







### Random Forest

In [None]:
rand_frst_clf = RandomForestClassifier()

In [None]:
sp = SelectPercentile(f_classif)

In [None]:
rand_frst_pipe = ImbPipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('sp', sp),
    ('model', rand_frst_clf)
])

In [None]:
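param_grid = {
    # The estimator is a pipeline, so parameters are addressed as <step name>__<parameter>
    'model__max_depth': [5, 10, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5, 10],
}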

In [None]:
grid_search_rf = GridSearchCV(
    estimator=rand_frst_pipe,
    param_grid=param_grid,
    cv=3,  # Cross-validation
    verbose=2,
    n_jobs=-1,
)

In [None]:
X_train_trans = preprocess_pipeline.fit_transform(X_train)
grid_search_rf.fit(X_train_trans, y_train)

In [None]:
grid_search_rf.best_params_

In [None]:
best_rf_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ('smote', SMOTE(sampling_strategy='auto', random_state=4)),
    ('sp', SelectPercentile(f_classif, percentile=75)),
    ('model', RandomForestClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=42))
  ])

In [None]:
best_rf_pipe.fit(X_train, y_train)

In [None]:
best_rf_pipe.score(X_train, y_train)

In [None]:
best_rf_pipe.score(X_valid, y_valid)

#### Handling Overfitting

In [None]:
data = []  # rows of [max_depth, train_score, valid_score]
for i in [1, 5, 8, 10, 20]:
  rf_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ('smote', SMOTE(sampling_strategy='auto', random_state=4)),
    ('sp', SelectPercentile(f_classif, percentile=75)),
    ('model', RandomForestClassifier(max_depth=i, min_samples_leaf=1, min_samples_split=2, random_state=42))
  ])

  rf_pipe.fit(X_train, y_train)
  train_score = rf_pipe.score(X_train, y_train)
  valid_score = rf_pipe.score(X_valid, y_valid)
  data.append([i, train_score, valid_score])

In [None]:
scores_df = pd.DataFrame(data, columns=['max_depth', 'train_score', 'valid_score'])
sns.lineplot(x='max_depth', y='train_score', data=scores_df, label='train')
sns.lineplot(x='max_depth', y='valid_score', data=scores_df, label='validation')
plt.ylabel("accuracy")  # Pipeline.score reports accuracy for classifiers
plt.xticks([1, 5, 8, 10, 20])  # the max_depth values tried above

In [None]:
ideal_rf_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ("poly", PolynomialFeatures(degree=1, include_bias=False)),
    ('smote', SMOTE(sampling_strategy='auto', random_state=4)),
    ('sp', SelectPercentile(f_classif, percentile=75)),
    ('model', RandomForestClassifier(max_depth=10, min_samples_leaf=1, min_samples_split=2, random_state=42))
  ])

In [None]:
ideal_rf_pipe.fit(X_train, y_train)

In [None]:
ideal_rf_pipe.score(X_train, y_train)

In [None]:
ideal_rf_pipe.score(X_valid, y_valid)

In [None]:
print(classification_report(y_valid, ideal_rf_pipe.predict(X_valid)))


**How It Works:**  
Random Forest is an ensemble model that builds multiple decision trees during training and outputs the class that is the mode of their predictions. It handles non-linear relationships and is more resistant to overfitting than a single decision tree.
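
The voting step can be sketched as follows. The per-tree predictions are invented for illustration and `majority_vote` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a simplified picture of the idea.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single client
toy_votes = ["no", "yes", "no", "no", "yes"]
forest_prediction = majority_vote(toy_votes)
```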

---

### **Performance Analysis**
- **Majority Class (`no`)**: High precision (0.96) but slightly lower recall (0.84), resulting in an F1-score of 0.89.
- **Minority Class (`yes`)**: Improved recall (0.80) compared to SVC, with a precision of 0.47 and an F1-score of 0.59.
- **Overall**: Accuracy is 83%, and the macro average F1-score (0.74) indicates better balance between the two classes than SVC.

---

### **Why It Performed This Way**
- **Class Imbalance Handling**: Unlike the final SVC pipeline, this pipeline oversamples the minority class with SMOTE before fitting, and the ensemble of trees raises recall for the minority class.
- **Non-Linear Modeling**: Its ability to model non-linear relationships improves overall performance, though precision for the minority class remains moderate.


### XGBoost

In [None]:
xg_boost = XGBClassifier()

In [None]:
xg_pipe = ImbPipeline([
    # ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('model', xg_boost)
])

In [None]:
param_grid = {
    'model__learning_rate': [0.01, 0.2],  # Step size shrinkage
    'model__max_depth': [3, 10],  # Tree depth
    'model__min_child_weight': [1, 10],  # Minimum child weight
    'model__subsample': [0.6, 1.0],  # Fraction of samples used per tree
    'model__colsample_bytree': [0.6, 1.0],  # Fraction of features used per tree
    'model__n_estimators': [100, 300],  # Number of boosting rounds
    'model__gamma': [0, 5],  # Minimum loss reduction for splits
    'model__reg_alpha': [0, 0.1, 1],  # L1 regularization
}


In [None]:
grid_search_xg = GridSearchCV(
    estimator=xg_pipe,
    param_grid=param_grid,
    scoring='f1',
    cv=2,
    verbose=2,
    n_jobs=1
)

In [None]:
X_train_trans = preprocess_pipeline.fit_transform(X_train)
y_train_trans = y_train.map({"yes": 1, "no": 0})
grid_search_xg.fit(X_train_trans, y_train_trans)

In [None]:
grid_search_xg.best_params_

In [None]:
best_xg_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('model', XGBClassifier(colsample_bytree=0.6, gamma=0, learning_rate=0.001, max_depth=12, min_child_weight=10, n_estimators=110, reg_alpha=1, subsample=1))
])

In [None]:
best_xg_pipe = ImbPipeline([
    ("preprocess", preprocess_pipeline),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('model', XGBClassifier(colsample_bytree=0.6, gamma=5, learning_rate=0.01, max_depth=10, min_child_weight=10, n_estimators=100, reg_alpha=0.1, subsample=0.6))
])

In [None]:
y_train_trans = y_train.map({"yes": 1, "no": 0})
best_xg_pipe.fit(X_train, y_train_trans)

In [None]:
best_xg_pipe.score(X_train, y_train_trans)

In [None]:
y_valid_trans = y_valid.map({"yes": 1, "no": 0})
best_xg_pipe.score(X_valid, y_valid_trans)

In [None]:
y_valid_trans = y_valid.map({"yes": 1, "no": 0})
preds = best_xg_pipe.predict(X_valid)
print(classification_report(y_valid_trans, preds))


**How It Works:**  
XGBoost is a gradient-boosting framework that builds trees sequentially, focusing on correcting errors made by previous trees. It uses regularization to prevent overfitting and works well with imbalanced and non-linear data.
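
The idea of "correcting errors made by previous trees" can be sketched in plain Python. The toy below fits one-split trees (stumps) to the residuals of the running prediction under squared error. It is only a conceptual sketch: XGBoost itself uses a regularized objective with second-order gradient information and, for classification, a logistic loss. The data and the names `fit_stump`, `boost`, and `mse` are invented for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy 1-D data with one noisy point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

for rounds in (1, 5, 20):
    print(rounds, round(mse(ys, boost(xs, ys, n_rounds=rounds)), 4))
```

The training error shrinks as rounds are added, which is also why settings such as `learning_rate`, `max_depth`, and `gamma` matter for keeping the real model from overfitting.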



### **Performance Analysis**
- **Majority Class (`0`)**: Excellent precision (0.96) and recall (0.88), leading to an F1-score of 0.91.
- **Minority Class (`1`)**: Balanced precision (0.53) and recall (0.77), resulting in the highest F1-score for the minority class (0.63) among all models.
- **Overall**: Accuracy is 86%, with a macro average F1-score of 0.77, indicating the best balance between the two classes.

---

### **Why It Performed This Way**
- **Boosting Mechanism**: XGBoost’s sequential tree-building approach effectively captures minority class patterns, improving both precision and recall.
- **Robustness**: It handles non-linear relationships and class imbalance well, outperforming both SVC and Random Forest on the minority class.


## 13. Model Comparison

In [None]:
models = [best_svc_pipe, ideal_rf_pipe, best_xg_pipe]
preds = [
    best_svc_pipe.predict(X_valid),
    ideal_rf_pipe.predict(X_valid),
    # XGBoost was trained on 0/1 labels, so map its output back to "no"/"yes"
    np.array(["yes" if x == 1 else "no" for x in best_xg_pipe.predict(X_valid)]),
]

In [None]:
eval_data = []
for pred in preds:
    # Per-class and overall metrics as a nested dict
    report = classification_report(y_valid, pred, output_dict=True)

    precision_no_class = report['no']['precision']
    recall_no_class = report['no']['recall']
    f1_no_class = report['no']['f1-score']

    precision_yes_class = report['yes']['precision']
    recall_yes_class = report['yes']['recall']
    f1_yes_class = report['yes']['f1-score']

    # report["accuracy"] is the overall accuracy, not an F1-score
    accuracy = report["accuracy"]

    eval_data.append([precision_no_class, recall_no_class, f1_no_class, precision_yes_class, recall_yes_class, f1_yes_class, accuracy])



In [None]:
# Collect the metrics in a DataFrame; the last column holds report["accuracy"]
eval_df = pd.DataFrame(eval_data, columns=["precision_no", "recall_no", "f1_no", "precision_yes", "recall_yes", "f1_yes", "accuracy"])
eval_df["model"] = ["SVC", "Random Forest", "XGBoost"]
eval_df

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(eval_df['model'], eval_df['f1_no'], marker='o', label='F1-Score (no)', color='blue')
ax.plot(eval_df['model'], eval_df['f1_yes'], marker='o', label='F1-Score (yes)', color='orange')
ax.plot(eval_df['model'], eval_df['accuracy'], marker='o', label='Accuracy', color='green')

ax.set_title('F1-Scores and Accuracy by Model')
ax.set_ylabel('Score')
ax.set_xlabel('Model')
ax.legend()
plt.show()




| Model         | Precision (no) | Recall (no) | F1-Score (no) | Precision (yes) | Recall (yes) | F1-Score (yes) | Accuracy |
|---------------|----------------|-------------|---------------|-----------------|--------------|----------------|----------|
| SVC           | 0.893807       | 0.951242    | 0.921630      | 0.579767        | 0.373122     | 0.454038       | 0.862935 |
| Random Forest | 0.966479       | 0.850414    | 0.904739      | 0.502004        | 0.836394     | 0.627426       | 0.848272 |
| XGBoost       | 0.959462       | 0.890444    | 0.923665      | 0.565632        | 0.791319     | 0.659708       | 0.875303 |

### Key Observations:

1. **Overall Accuracy**:
   - **XGBoost** achieves the highest overall accuracy (0.8753, taken from `report["accuracy"]`), followed by **SVC** (0.8629) and **Random Forest** (0.8483).

2. **Majority Class Performance**:
   - **SVC** and **XGBoost** both demonstrate strong performance on the `no` class, with high F1-scores (0.9216 and 0.9237, respectively). **Random Forest** is slightly behind with 0.9047.

3. **Minority Class Performance**:
   - **XGBoost** outperforms both **SVC** and **Random Forest** for the `yes` class, achieving the highest F1-score (0.6597) and maintaining a good balance between precision (0.5656) and recall (0.7913).
   - **Random Forest** has the highest recall for the minority class (0.8364) but the lowest precision (0.5020), leading to a lower F1-score (0.6274).
   - **SVC** struggles with the `yes` class, with a low F1-score (0.4540) due to its low recall (0.3731).

4. **Class Imbalance Handling**:
   - **SVC** struggles most with class imbalance: its minority-class recall is only 0.3731, so it misses most of the `yes` cases.
   - **Random Forest** improves recall for the minority class but sacrifices precision, leading to moderate performance.
   - **XGBoost** effectively balances precision and recall, making it the most reliable for imbalanced data.

5. **Model Complexity**:
   - **SVC** is conceptually the simplest of the three models, but it performed worst on the minority class in this imbalanced setting.
   - **Random Forest** handles non-linearity well but still requires careful tuning to address imbalance.
   - **XGBoost** is the most robust, leveraging boosting and regularization to achieve a balanced performance across all metrics.
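
One reason to look at the per-class columns first: the last column of the table comes from `report["accuracy"]`, and accuracy is the class-share-weighted average of the per-class recalls. Solving that relation for the class share, as in the sketch below, suggests that roughly 85% of the validation rows are `no`, so a model that always predicts `no` would reach a similar accuracy while finding none of the `yes` cases. `implied_no_share` is a helper name introduced for this example, and the numbers are copied from the table.

```python
# (recall_no, recall_yes, accuracy) per model, copied from the comparison table.
table = {
    "SVC": (0.951242, 0.373122, 0.862935),
    "Random Forest": (0.850414, 0.836394, 0.848272),
    "XGBoost": (0.890444, 0.791319, 0.875303),
}


def implied_no_share(recall_no, recall_yes, accuracy):
    """Solve accuracy = p * recall_no + (1 - p) * recall_yes for p."""
    return (accuracy - recall_yes) / (recall_no - recall_yes)


shares = {name: implied_no_share(*row) for name, row in table.items()}
for name, share in shares.items():
    print(f"{name}: implied share of 'no' rows = {share:.3f}")
```

All three models imply nearly the same share, which is a useful consistency check on the table.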

---

### Conclusion:
**XGBoost** is the best-performing model: it has the highest overall accuracy and the highest F1-score for both the majority and minority classes, with a reasonable balance between precision and recall on the `yes` class. For this imbalanced dataset, **XGBoost** is the most suitable choice of the three models.


## 14. Training the model on the entire dataset

In [None]:
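# XGBClassifier expects numeric labels, so encode the target before fitting.
# The prediction step below maps 1/0 back to "yes"/"no".
y_trans = y.map({"yes": 1, "no": 0})
best_xg_pipe.fit(X, y_trans)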

## 15. Predicting on the Test Set and Saving the Results

In [None]:
df_test = pd.read_csv('test.csv')

In [None]:
preds = best_xg_pipe.predict(df_test)
preds = ["yes" if i == 1 else "no" for i in preds]

In [None]:
out_df = pd.DataFrame()
out_df['id'] = df_test.index
out_df['target'] = preds
out_df

In [None]:
out_df.to_csv('prediction.csv', index=False)