# US Accidents

A Countrywide Traffic Accident Dataset (2016 - 2023)



This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.

#### About the dataset



**ID**: This is a unique identifier of the accident record.



**Source**: Source of raw accident data



**Severity**: Shows the severity of the accident, a number between 1 and 4. 1 indicates the least impact on traffic (short delay), and 4 indicates a significant impact (long delay).



**Start_Time**: Shows the start time of the accident in local time zone.



**End_Time**: Shows the end time of the accident impact on traffic flow in local time zone.



**Start_Lat**: Shows latitude in GPS coordinates of the start point.



**Start_Lng**: Shows longitude in GPS coordinates of the start point.



**End_Lat**: Shows latitude in GPS coordinates of the end point, if applicable.



**End_Lng**: Shows longitude in GPS coordinates of the end point, if applicable.



**Distance(mi)**: The length of the road affected by the accident, in miles.



**Description**: Natural language description of the accident.



**Number**: Street number in the address field, if available.



**Street**: Street name in the address field, if available.



**Side**: Relative side of the street (Right/Left) in the address field, if available.



**City**: City in the address field, if available.



**County**: County in the address field, if available.



**State**: State in the address field, if available.



**Zipcode**: Zipcode in the address field, if available.



**Country**: Country in the address field, if available.



**Timezone**: Timezone based on the location of the accident (e.g., Eastern, Central).



**Airport_Code**: Airport-based weather station closest to the accident location.



**Weather_Timestamp**: Timestamp of weather observation record in local time.



**Temperature(F)**: Temperature at the time of the accident, in Fahrenheit.



**Wind_Chill(F)**: Wind chill at the time of the accident, in Fahrenheit.



**Humidity(%)**: Humidity at the time of the accident, as a percentage.



**Pressure(in)**: Air pressure at the time of the accident, in inches.



**Visibility(mi)**: Visibility at the time of the accident, in miles.



**Wind_Direction**: Wind direction at the time of the accident.



**Wind_Speed(mph)**: Wind speed at the time of the accident, in miles per hour.



**Precipitation(in)**: Precipitation amount at the time of the accident, in inches, if any.



**Weather_Condition**: Weather condition at the time of the accident (e.g., rain, snow, fog).



**Amenity**: Presence of amenity in a nearby location, annotated as Point of Interest (POI).



**Bump**: Presence of speed bump or hump in a nearby location, annotated as POI.



**Crossing**: Presence of crossing in a nearby location, annotated as POI.



**Give_Way**: Presence of give_way in a nearby location, annotated as POI.



**Junction**: Presence of junction in a nearby location, annotated as POI.



**No_Exit**: Presence of no_exit in a nearby location, annotated as POI.



**Railway**: Presence of railway in a nearby location, annotated as POI.



**Roundabout**: Presence of roundabout in a nearby location, annotated as POI.



**Station**: Presence of station in a nearby location, annotated as POI.



**Stop**: Presence of stop in a nearby location, annotated as POI.



**Traffic_Calming**: Presence of traffic_calming in a nearby location, annotated as POI.



**Traffic_Signal**: Presence of traffic_signal in a nearby location, annotated as POI.



**Turning_Loop**: Presence of turning_loop in a nearby location, annotated as POI.



**Sunrise_Sunset**: Period of day (day or night) based on sunrise/sunset.



**Civil_Twilight**: Period of day (day or night) based on civil twilight.



**Nautical_Twilight**: Period of day (day or night) based on nautical twilight.



**Astronomical_Twilight**: Period of day (day or night) based on astronomical twilight.




#### Import Libraries

In [None]:
# Importing necessary libraries for data manipulation, visualization, and modeling

# Data manipulation and display settings
import pandas as pd
pd.set_option('display.max_colwidth', None)  # Show full content of columns
pd.set_option('display.max_rows', None)     # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.options.display.float_format = '{:.6f}'.format  # Display floats up to 6 decimal places

# Libraries for numerical operations and plotting
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Suppress warnings for cleaner output
from warnings import filterwarnings
filterwarnings('ignore')

# Libraries for data splitting and statistical analysis
from sklearn.model_selection import train_test_split
import scipy.stats as stats
from scipy.stats import f_oneway, kruskal, chi2_contingency

# For timezone operations
import pytz

# Importing statsmodels for statistical modeling
import statsmodels.api as sm

# Preprocessing libraries for scaling and imputing missing values
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.impute import KNNImputer

# Libraries for evaluating model performance
from sklearn import metrics
from sklearn.metrics import (
    classification_report, cohen_kappa_score, confusion_matrix, roc_curve, 
    accuracy_score, roc_auc_score, recall_score
)

# Feature selection techniques
from sklearn.feature_selection import RFE

# Libraries for handling imbalanced datasets
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

# Machine learning models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
)
from xgboost import XGBClassifier

# Hyperparameter tuning and cross-validation
from sklearn.model_selection import GridSearchCV
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

#### Data Preparation

##### Read the Data

**Due to the large size of the initial dataset (7.7 million records), we used stratified sampling based on the severity levels (1 to 4) of the accidents. This method allowed us to efficiently reduce the dataset to 3% for processing, thereby optimizing computational resources.**
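
The stratified sampling step itself is not shown in this notebook, so the following is only a minimal sketch of how such a 3% sample could be drawn with pandas. The `full` DataFrame, its class proportions, and the random seeds are made-up stand-ins, not the real data or the method actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset: 10,000 rows with an imbalanced Severity
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "ID": [f"A-{i}" for i in range(10_000)],
    "Severity": rng.choice([1, 2, 3, 4], size=10_000, p=[0.01, 0.80, 0.16, 0.03]),
})

# Draw 3% of the rows within each severity level (stratified sampling)
sample = full.groupby("Severity", group_keys=False).sample(frac=0.03, random_state=42)

# Compare class proportions in the full data and in the sample
comparison = pd.DataFrame({
    "full": full["Severity"].value_counts(normalize=True),
    "sample": sample["Severity"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

Because the fraction is applied inside each severity group, the rare classes keep roughly the same share in the sample as in the full data.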

**Read the dataset and print the first five observations.**

In [None]:
df_US_Accidents = pd.read_csv("/kaggle/input/us-accidents-stratified-sample/US_Accidents_March23_Stratified.csv")
df_US_Accidents.head()

**Let us now see the number of variables and observations in the data.**

In [None]:
df_US_Accidents.shape

**Interpretation:** The data has 231851 observations and 46 variables.

##### Discover Unique Identifier

**Find the unique identifier in the dataset and check it for duplicates.**

In [None]:
df_US_Accidents['ID'].nunique()

**Interpretation:** As we can see, the number of unique values in the ID column matches the total number of rows. Therefore, ID is a unique identifier for the dataset.
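
As a small illustration of the same check, the sketch below compares the number of unique IDs with the number of rows and lists any duplicated records. The `records` DataFrame is a made-up example that deliberately contains one repeated ID.

```python
import pandas as pd

# Toy data with one repeated ID (hypothetical values)
records = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3", "A-3"],
    "Severity": [2, 2, 3, 3],
})

n_unique = records["ID"].nunique()   # number of distinct IDs
n_rows = len(records)                # number of rows
is_key = records["ID"].is_unique     # True only if no ID repeats

# Show every row whose ID appears more than once
duplicates = records[records["ID"].duplicated(keep=False)]
print(n_unique, n_rows, is_key)
print(duplicates)
```

When `n_unique` equals `n_rows`, as in the notebook's data, the duplicates frame is empty and the column can serve as a key.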

##### Check the Data Type

**Check the data type of each variable. If the data type is not as per the data definition, change the data type.**

In [None]:
df_US_Accidents.info()

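**Severity is our target variable, so its data type should change from int to object. The conversion is left commented out here because it is applied after the binary mapping in the Feature Correction step.**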

In [None]:
# df_US_Accidents['Severity'] = df_US_Accidents['Severity'].astype(object)

**The Start_Time and End_Time columns include fractional-second values; we can remove them, as they provide no additional information.**



Because the datetime strings follow the ISO 8601 format (e.g., "2021-12-07T15:44:49.123456"), we parse them with format='ISO8601', which requires pandas 2.0 or later.

In [None]:
# Convert to datetime with specific format
df_US_Accidents['Start_Time'] = pd.to_datetime(df_US_Accidents['Start_Time'], format='ISO8601')
df_US_Accidents['End_Time'] = pd.to_datetime(df_US_Accidents['End_Time'], format='ISO8601')
df_US_Accidents['Weather_Timestamp'] = pd.to_datetime(df_US_Accidents['Weather_Timestamp'], format='ISO8601')
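
The parsing above keeps the fractional seconds. As a minimal sketch (assuming pandas 2.0 or later, and using two made-up timestamp strings), the following shows how mixed ISO 8601 strings parse and how the fractional part could be dropped with `dt.floor`:

```python
import pandas as pd

# Two hypothetical timestamps: one without and one with fractional seconds
raw = pd.Series(["2021-12-07 15:44:49", "2021-12-07 15:44:49.123456"])

# format="ISO8601" accepts both variants in the same column
parsed = pd.to_datetime(raw, format="ISO8601")

# Round down to whole seconds to discard the fractional part
trimmed = parsed.dt.floor("s")
print(trimmed)
```

After flooring, both rows hold the same timestamp, which is what makes the fractional part safe to ignore for hour-level features.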

**Converting all the boolean columns into integer for simplicity**

In [None]:
for col in df_US_Accidents.select_dtypes(include='bool').columns:
  df_US_Accidents[col] = df_US_Accidents[col].astype(int)

**Recheck the data type after the conversion**

In [None]:
df_US_Accidents.info()

##### Feature Correction

**Binary reduction of the target column Severity**

In [None]:
df_US_Accidents['Severity'].value_counts()

In [None]:
# Define the mapping
severity_map = {1: 0, 2: 0, 3: 1, 4: 1}

# Transform the 'Severity' column
df_US_Accidents['Severity'] = df_US_Accidents['Severity'].map(severity_map)
df_US_Accidents['Severity'] = df_US_Accidents['Severity'].astype(object)
df_US_Accidents['Severity'].value_counts()

0 - Low Severity



1 - High Severity
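
One thing worth checking with `map` is that any value missing from the dictionary silently becomes NaN. The sketch below applies the same mapping to a small made-up Series and counts unmapped values; the input values are illustrative only.

```python
import pandas as pd

# Hypothetical severity values on the original 1-4 scale
severity = pd.Series([1, 2, 2, 3, 4, 2])

# Same mapping as in the notebook: 1-2 -> low (0), 3-4 -> high (1)
severity_map = {1: 0, 2: 0, 3: 1, 4: 1}
binary = severity.map(severity_map)

# Values not covered by the mapping would show up as NaN here
unmapped = int(binary.isna().sum())
counts = binary.value_counts().to_dict()
print(unmapped, counts)
```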

**Counting unique values in each column**

In [None]:
for col in df_US_Accidents.select_dtypes(include = 'object').columns[1:]:
    print(f"{col} having {df_US_Accidents[col].nunique()} unique values")

**Remove Insignificant Variables**

In [None]:
df_US_Accidents.drop(["Description","City","County","Zipcode","Country","Airport_Code","Turning_Loop"],axis = 1,inplace = True)

**Interpretation:**

1. The Description column contains long free text; as we are not performing NLP, we have dropped it.

2. City, County, Zipcode, and Airport_Code have too many unique values to be useful in a classification model: they would inflate the feature count, which might lead to a less significant model.

3. The Country column has only one value, so it is redundant.

4. The Turning_Loop column has only one value (False), so it does not help the model in any way.

**Cardinality correction**

In [None]:
for col in df_US_Accidents.select_dtypes(include = 'object').columns[1:]:
  print(df_US_Accidents[col].value_counts(normalize=True))
  print("\n")

**Checking whether the Street column can be reduced to a workable number of unique values; if not, we can drop the column**

In [None]:
street = df_US_Accidents["Street"]
# Split each street name on whitespace and keep its first and second words
first = street.apply(lambda s: str(s).split()[0])
second = street.apply(lambda s: str(s).split()[1] if len(str(s).split()) > 1 else None)
print("Unique values of only the first word in the Street column:", first.nunique())
print("Unique values of only the second word in the Street column:", second.nunique())

In [None]:
first.value_counts(normalize=True).head(10)

In [None]:
second.value_counts(normalize=True).head(10)

In [None]:
df_US_Accidents.drop(["Street"],axis = 1,inplace = True)

**Interpretation:** Even though the number of unique values drops considerably, the 10 most frequent values do not add up to more than 70% of the rows, so the Street column is dropped.

**Reducing the cardinality of the State, Wind_Direction, and Weather_Condition columns to make the data more workable**

**State**

In [None]:
Normalized_State = df_US_Accidents["State"].value_counts(normalize=True)
cumulative_sum = Normalized_State.cumsum()
Effective_State = list(cumulative_sum[cumulative_sum <= 0.8].index)
print(Effective_State)

In [None]:
df_US_Accidents["State"] = df_US_Accidents["State"].apply(lambda x: x if x in Effective_State else "Other")

In [None]:
df_US_Accidents["State"].value_counts(normalize=True)

**Interpretation:** The cardinality of the state column is reduced to 16 from 49.
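
To make the cumulative-frequency cutoff concrete, here is a minimal sketch on made-up data. The `states` Series and its counts are hypothetical; only the technique (keep categories while the cumulative share stays at or below 0.8, and bucket the rest as "Other") mirrors the cells above.

```python
import pandas as pd

# Hypothetical state labels with shares 40%, 30%, 15%, 10%, 5%
states = pd.Series(["CA"] * 40 + ["FL"] * 30 + ["TX"] * 15 + ["NY"] * 10 + ["OH"] * 5)

# Shares sorted from most to least frequent, then accumulated
shares = states.value_counts(normalize=True)
cumulative = shares.cumsum()

# Keep the categories whose cumulative share is still <= 0.8
kept = list(cumulative[cumulative <= 0.8].index)

# Everything else is collapsed into a single "Other" bucket
reduced = states.where(states.isin(kept), "Other")
print(kept)
print(reduced.value_counts(normalize=True))
```

Note that the category that crosses the threshold (TX here, at a cumulative 85%) is excluded, so the kept categories cover somewhat less than 80% of the rows.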

**Wind_Direction**

In [None]:
df_US_Accidents["Wind_Direction"].value_counts(normalize=True)

**Performing label correction, as this feature has several redundant labels**

In [None]:
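# Merge spelled-out labels into their abbreviations; assigning the result back
# avoids relying on replace(inplace=True) on a column selection
df_US_Accidents['Wind_Direction'] = df_US_Accidents['Wind_Direction'].replace({
    "South": "S",
    "West": "W",
    "North": "N",
    "East": "E",
    "Variable": "VAR",
    "Calm": "CALM"
})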

In [None]:
df_US_Accidents["Wind_Direction"].value_counts(normalize=True)

**Interpretation:** After performing label correction, the cardinality of the feature is reduced to 18 from 24.

**Weather_Condition**

In [None]:
Normalized_Weather_Condition = df_US_Accidents["Weather_Condition"].value_counts(normalize=True)
cumulative_sum = Normalized_Weather_Condition.cumsum()
Effective_Weather_Condition = list(cumulative_sum[cumulative_sum <= 0.8].index)
print(Effective_Weather_Condition)

In [None]:
df_US_Accidents["Weather_Condition"] = df_US_Accidents["Weather_Condition"].apply(lambda x: x if x in Effective_Weather_Condition else "Other")

In [None]:
df_US_Accidents["Weather_Condition"].value_counts(normalize=True)

**Interpretation:** The cardinality of the Weather Condition column is reduced to 6 from 109.

****

##### Feature Creation

**Creating new features from the timestamps, as on their own they do not represent anything useful:**

1. The time difference between End_Time and Start_Time, in decimal hours.

2. The day of the week on which the accident happened.

3. The year and month of the accident.

4. The day of the month and the hour of the day at which the accident took place.

In [None]:
timediff = df_US_Accidents["End_Time"]-df_US_Accidents["Start_Time"]
df_US_Accidents["Traffic_Delay_Hours"] = round(timediff.dt.total_seconds() / (60*60),3)

In [None]:
df_US_Accidents["Accident_Year"]=df_US_Accidents["Start_Time"].dt.year
# Monday=0, Sunday=6
df_US_Accidents["Dayofweek"]= df_US_Accidents["Start_Time"].dt.dayofweek
df_US_Accidents['Accident_Month'] = df_US_Accidents['Start_Time'].dt.month
df_US_Accidents['Dayofmonth'] = df_US_Accidents['Start_Time'].dt.day
df_US_Accidents['Hourofday'] = df_US_Accidents['Start_Time'].dt.hour

In [None]:
df_US_Accidents.drop(["Start_Time","End_Time","Weather_Timestamp"],axis=1,inplace = True)

**Checking that the new columns are created and the old timestamp columns are deleted**

In [None]:
df_US_Accidents.head()

In [None]:
df_US_Accidents.columns

#### Data Visualization

##### Correlation matrix

In [None]:
df_numeric = df_US_Accidents.select_dtypes(include = 'number')
correlation_matrix = df_numeric.corr()
correlation_matrix

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(correlation_matrix)
plt.show()

**Interpretation:**

1. From the above heatmap we can see that temperature and wind chill are highly correlated, so we drop the wind chill feature.
2. Other highly correlated pairs are "Bump & Traffic_Calming" and "Crossing & Traffic_Signal".
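
Reading pairs off a heatmap by eye can be error-prone, so here is a minimal sketch of listing highly correlated pairs programmatically. The toy data and the 0.8 threshold are assumptions made for illustration; they are not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: wind chill tracks temperature closely, humidity is independent
rng = np.random.default_rng(0)
temp = rng.normal(60, 15, size=500)
toy = pd.DataFrame({
    "Temperature(F)": temp,
    "Wind_Chill(F)": temp - rng.normal(2, 1, size=500),
    "Humidity(%)": rng.uniform(10, 100, size=500),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and filter by absolute value
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.8]
print(high_pairs)
```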

In [None]:
df_US_Accidents.drop(["Wind_Chill(F)"],axis=1,inplace=True)

##### Descriptive Statistics

**Categorical features**

In [None]:
df_US_Accidents.describe(include=[object]).T

**Numerical features**

In [None]:
df_US_Accidents.describe(include=[np.number]).T

How many categorical columns are there, and are they ordinal or nominal?

Severity, Weather_Condition, and Wind_Direction are the only three columns treated as ordinal categorical columns; all the others are nominal.

##### Scatter plot

In [None]:
sns.scatterplot(data=df_US_Accidents, x='Start_Lng', y='Start_Lat',hue = 'Severity', alpha=0.7)
plt.show()

##### Univariate Analysis

**Continuous Numeric Variables**

In [None]:
df_numeric = df_US_Accidents.select_dtypes(include=['float'])

# Calculate number of rows needed based on number of columns
rows = (len(df_numeric.columns) - 1) // 3 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=3, figsize=(15, rows * 5))

# Loop through each column in df_numeric and create a boxplot
for i, col in enumerate(df_numeric.columns):
    row, col_index = divmod(i, 3)  # Get row and column index for the subplot
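    # Trim the tails for plotting: keep values above the 5th percentile,
    # then values below the 95th percentile of the remaining values
    df_numeric[col] = df_numeric[col][df_numeric[col] > df_numeric[col].quantile(0.05)]
    df_numeric[col] = df_numeric[col][df_numeric[col] < df_numeric[col].quantile(0.95)]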
    sns.boxplot(data=df_numeric, x=col, ax=ax[row, col_index])

# Remove any unused subplots
for j in range(len(df_numeric.columns), rows * 3):
    fig.delaxes(ax.flatten()[j])
    
plt.tight_layout()
plt.show()

**Interpretation:**

1. **Visibility:**
   - Most accidents occur in clear visibility conditions. This suggests that other factors, such as speeding or obstacles, might have contributed to the accidents.
2. **Wind Speed:**
   - **Light Winds (0-15 mph):** These are prevalent during accidents, indicating minimal impact on driving.
   - **Moderate Winds (15-30 mph):** These winds might start to affect high-profile vehicles and cause minor steering issues.
   - **Strong Winds (30-45 mph):** These can significantly affect vehicle stability, especially for high-profile vehicles, and may contribute to accidents.
   - **Very Strong Winds (45+ mph):** These conditions can create severe driving difficulties and are less common during accidents in this dataset.
   - **Finding:** Winds are light during most accidents.
3. **Precipitation:**
   - **0 inches to 0.1 inches (Drizzle to Light Rain):** Most accidents occur during light precipitation.
   - **0.1 to 0.5 inches (Moderate Precipitation):** Some accidents occur under moderate precipitation conditions.
   - **0.5 inches to 2.0 inches (Significant to Heavy Precipitation):** Heavy precipitation is less common but still a factor.
   - **Finding:** Judging by the average precipitation, most values fall in the drizzle-to-significant range.
4. **Pressure:**
   - **High Pressure (Above 29.92 inHg):** Most accidents occur during high-pressure conditions, which are generally associated with clearer skies.
   - **Low Pressure (Below 29.92 inHg):** Low-pressure systems, often linked with stormy weather, are less common during accidents in this dataset.
   - **Finding:** Most air pressure readings are above 29.5 inHg, which suggests that most accidents happen under clear skies.
5. **Temperature:**
   - **Cold Temperatures (Below 32°F):** These can lead to icy conditions and higher accident risks, but such conditions are less common.
   - **Moderate Temperatures (32°F to 50°F):** Still somewhat risky, especially with moisture present, but more favorable than very cold conditions.
   - **Warm Temperatures (Above 50°F):** Most accidents occur in these conditions, which generally provide better driving conditions.
   - **Finding:** Most accidents happen in otherwise favorable driving conditions.
6. **Humidity:**
   - **High Humidity (Above 70-80%):** This can lead to fog or dew, reducing visibility and creating slippery surfaces, which might contribute to accidents.
   - **Low Humidity (Below 30%):** Dry air might lead to other issues like dust, but it is less impactful than high humidity.
   - **Finding:** Most accidents happened at relatively high humidity.
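
The boxplots above are drawn after trimming each column to roughly its 5th to 95th percentile range. The sketch below shows a single-pass variant of that trimming on made-up values. Unlike the plotting cell, it computes both cut-offs from the untrimmed data, which is an assumption of this example rather than what the cell does.

```python
import pandas as pd

# Hypothetical measurements: the integers 1 to 100
values = pd.Series(range(1, 101), dtype=float)

# Both cut-offs are computed once, from the original values
low, high = values.quantile(0.05), values.quantile(0.95)

# Keep only values strictly between the two cut-offs
trimmed = values[values.between(low, high, inclusive="neither")]
print(low, high, len(trimmed))
```

Keeping the trimmed values in a separate variable also leaves the original column intact for later modeling steps.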

**Discrete Numeric Variables**

In [None]:
df_numeric = df_US_Accidents.select_dtypes(include=['int'])

# Calculate number of rows needed based on number of columns
rows = (len(df_numeric.columns) - 1) // 3 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=3, figsize=(15, rows * 5))

# Loop through each column in df_numeric and create a count plot
for i, col in enumerate(df_numeric.columns):
    row, col_index = divmod(i, 3)  # Get row and column index for the subplot
    sns.countplot(data=df_numeric, x=col, ax=ax[row, col_index])
    
# Remove any unused subplots
for j in range(len(df_numeric.columns), rows * 3):
    fig.delaxes(ax.flatten()[j])

plt.tight_layout()
plt.show()

**Interpretation:**

1. From the initial analysis, we can see that almost all binary columns are skewed towards false.
2. Accident Year: 2022 is the year in which the maximum number of accidents occurred.
3. Day of the Week: Most accidents happen on Fridays, but the difference compared to Mondays is not significant. Interestingly, Saturdays and Sundays have the fewest accidents, possibly because fewer people go out or there is less pressure to be on time.
4. Accident Month: December has the highest number of accidents, possibly due to low visibility or slippery roads, while July has the lowest.
5. Day of the Month: The 31st has the fewest accidents. This may seem surprising, since many businesses pay salaries on that day and one might expect higher excitement levels that could lead to more accidents, but only seven months of the year have a 31st, so a lower count is expected.
6. Hours of the Day: The hours corresponding to office commutes (in and out) clearly see the most accidents.

**Categorical variable**

In [None]:
df_object = df_US_Accidents.select_dtypes(include=['object']).iloc[:,1:]

# Calculate number of rows needed based on number of columns
rows = (len(df_object.columns) - 1) // 3 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=3, figsize=(15, rows * 5))

# Loop through each column in df_object and create a count plot
for i, col in enumerate(df_object.columns):
    row, col_index = divmod(i, 3)  # Get row and column index for the subplot
    sns.countplot(data=df_object, x=col, ax=ax[row, col_index])

# Remove any unused subplots
for j in range(len(df_object.columns), rows * 3):
    fig.delaxes(ax.flatten()[j])

plt.tight_layout()
plt.show()

**Interpretation:**

1. The US/Eastern region accounts for the maximum number of accidents.
2. Surprisingly, the maximum number of accidents happened when the wind was calm.
3. Most accidents occurred in fair weather conditions.
4. We have more data for severity 2, which needs to be checked for potential bias.
5. The maximum number of accidents happened during daytime.

##### Bivariate Analysis

###### Target vs Continuous Numerical

In [None]:
df_numeric = df_US_Accidents.select_dtypes(include=['float'])
df = df_US_Accidents.copy()

# Calculate number of rows needed based on number of columns
rows = (len(df_numeric.columns) - 1) // 3 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=3, figsize=(20, rows * 5))

# Loop through each column in df_numeric and create a boxplot
for i, col in enumerate(df_numeric.columns):
    row, col_index = divmod(i, 3)  # Get row and column index for the subplot
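    # Trim the tails for plotting: keep values above the 5th percentile,
    # then values below the 95th percentile of the remaining values
    df[col] = df[col][df[col] > df[col].quantile(0.05)]
    df[col] = df[col][df[col] < df[col].quantile(0.95)]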
    sns.boxplot(data=df,hue='Severity',y=col, ax=ax[row, col_index])
    ax[row, col_index].set_xlabel('Severity')  # Set the x-axis label
    ax[row, col_index].set_ylabel(col)  # Set the y-axis label
    ax[row, col_index].set_title(f'Boxplot of {col} by Severity')  # Set the title

# Remove any unused subplots
for j in range(len(df_numeric.columns), rows * 3):
    fig.delaxes(ax.flatten()[j])

plt.tight_layout()
plt.show()

**Interpretation:**

1. Most variables do not show significant differences across severity levels, indicating that these factors might not be strong predictors of severity.
2. There are some outliers in each variable that might be worth investigating further to understand their impact on severity.
3. Variables like distance and traffic delay hours show more variation with severity, suggesting they might have a stronger relationship with incident severity compared to other factors.

**Hypothesis testing Target vs Continuous Numerical**

In [None]:
columns_to_test = ['Distance(mi)','Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)','Wind_Speed(mph)', 'Precipitation(in)', 'Traffic_Delay_Hours']

# Initialize a list to store results
results = []

# Loop through each column and perform the Kruskal-Wallis test
for column in columns_to_test:
    # Drop rows with NaN values in the current column
    df_clean = df_US_Accidents.dropna(subset=[column, 'Severity'])
    # Collect the column's values for each severity level that has data
    valid_categories = [df_clean[df_clean['Severity'] == cat][column] for cat in df_clean['Severity'].unique() if len(df_clean[df_clean['Severity'] == cat]) > 0]
    h_stat, p_val = kruskal(*valid_categories)

    # Interpret the results
    if p_val < 0.05:
        result = f"Reject the null hypothesis: There is a significant difference in the median {column} across different severity levels."
    else:
        result = f"Fail to reject the null hypothesis: There is no significant difference in the median {column} across different severity levels."

    # Append the results to the list
    results.append({
        'Title': f"Severity vs {column}",
        'P-value': p_val,
        'H-statistic': h_stat,
        'Result': result
    })

# Convert the results list to a DataFrame
results_df = pd.DataFrame(results)

In [None]:
results_df
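
To show what the Kruskal-Wallis test responds to, here is a minimal sketch on two made-up pairs of groups: one where the second group is clearly shifted upward, and one where both groups are identical. The group names and values are hypothetical.

```python
from scipy.stats import kruskal

# Hypothetical feature values for two severity classes
low_sev = list(range(1, 31))    # values 1..30
high_sev = list(range(21, 51))  # values 21..50, shifted upward

# Shifted groups: the test should detect a difference
h_shift, p_shift = kruskal(low_sev, high_sev)

# Identical groups: no evidence of a difference
h_same, p_same = kruskal(low_sev, low_sev)

print(f"shifted:   H={h_shift:.2f}, p={p_shift:.4g}")
print(f"identical: H={h_same:.2f}, p={p_same:.4g}")
```

The test is rank-based, so it does not assume normally distributed values, which suits skewed columns such as distance or precipitation.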

###### Target vs Discrete numerical

In [None]:
# Select only integer columns
df_numeric = df_US_Accidents.select_dtypes(include=['int'])

# Calculate number of rows needed based on number of columns, assuming 2 columns per row
rows = (len(df_numeric.columns) - 1) // 2 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=2, figsize=(15, rows * 5))

# Flatten the axes array to make indexing easier
ax = ax.flatten()

# Loop through each column in df_numeric and create a stacked bar plot
for i, col in enumerate(df_numeric.columns):
    # Create the crosstab plot with normalized counts
    crosstab_data = pd.crosstab(df_US_Accidents[col], df_US_Accidents['Severity'], normalize='index')
    crosstab_data.plot(kind='bar', stacked=True, ax=ax[i])
    ax[i].set_xlabel(col)  # The x-axis carries the feature's values
    ax[i].set_ylabel('Proportion')  # The y-axis carries the share of each severity class
    ax[i].set_title(f'Stacked Bar Plot of {col} by Severity')  # Set the title

# Remove any unused subplots
for j in range(i + 1, len(ax)):
    fig.delaxes(ax[j])

plt.tight_layout()
plt.show()

**Interpretation:**

1. Across all features, Severity 2 (orange) is the most common, indicating that the majority of accidents have a medium level of severity.
2. Severity 1 (blue) and Severity 4 (red) are relatively rare, suggesting that very mild and very severe accidents are less common.
3. Severity 3 (green) is the second most common severity level in most features.
4. The pattern is consistent across different features, implying that the severity distribution is fairly uniform regardless of the specific feature.
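
For reference, this is a minimal sketch of what `pd.crosstab(..., normalize='index')` computes for the stacked bars: each row (one value of the feature) is converted to proportions that sum to 1. The toy `Junction` and `Severity` values are made up.

```python
import pandas as pd

# Hypothetical binary feature and binary target
toy = pd.DataFrame({
    "Junction": [0, 0, 0, 0, 1, 1, 1, 1],
    "Severity": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Rows: feature values; columns: severity classes; cells: within-row proportions
proportions = pd.crosstab(toy["Junction"], toy["Severity"], normalize="index")
print(proportions)
```

In the resulting bar chart, the feature's values sit on the x-axis and each bar is split by the share of every severity class.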

In [None]:
# Define numerical columns
numerical_columns = df_US_Accidents.select_dtypes(include=['int']).columns

# Results storage
results = []

# Loop through the integer columns and perform a Chi-Square test against Severity
for col in numerical_columns:
    if col == 'Severity':  # Skip the Severity column
        continue
    # Create a contingency table
    contingency_table = pd.crosstab(df_US_Accidents['Severity'], df_US_Accidents[col])
    # Perform Chi-Square Test
    chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
    # Store results
    result = {
        'Title': f'Severity vs {col}',
        'Chi-Square Statistic': chi2_stat,
        'P-Value': p_val,
        'Degrees of Freedom': dof,
        'Result': f'Reject the null hypothesis: There is a significant association between severity and {col}.' if p_val < 0.05 else f'Fail to reject the null hypothesis: There is no significant association between severity and {col}.'
    }
    results.append(result)

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Display results
print("Chi-Square Test Results:")
results_df
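
As a small worked example of the same test, the sketch below builds a 2x2 contingency table from made-up data and runs `chi2_contingency` on it. The column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: signals are mostly absent for class 0 and mostly present for class 1
toy = pd.DataFrame({
    "Severity": [0] * 40 + [1] * 40,
    "Traffic_Signal": [0] * 30 + [1] * 10 + [0] * 10 + [1] * 30,
})

# Observed counts for every (Severity, Traffic_Signal) combination
table = pd.crosstab(toy["Severity"], toy["Traffic_Signal"])

# expected holds the counts we would see if the two variables were independent
chi2_stat, p_val, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2_stat:.2f}, p={p_val:.4g}, dof={dof}")
```

Comparing `table` with `expected` shows where the association comes from: here every expected cell is 20, while the observed cells are 30 or 10.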

###### Target vs categorical feature

In [None]:
# Select the categorical (object) columns, excluding the target and the ID
df_object = df_US_Accidents.select_dtypes(include=['object']).drop(["Severity","ID"],axis=1)

# Calculate number of rows needed based on number of columns, assuming 2 columns per row
rows = (len(df_object.columns) - 1) // 2 + 1

# Create the figure and axes
fig, ax = plt.subplots(nrows=rows, ncols=2, figsize=(15, rows * 5))

# Flatten the axes array to make indexing easier
ax = ax.flatten()

# Loop through each column in df_object and create a stacked bar plot
for i, col in enumerate(df_object.columns):
    # Create the crosstab plot with normalized counts
    crosstab_data = pd.crosstab(df_US_Accidents[col], df_US_Accidents['Severity'], normalize='index')
    crosstab_data.plot(kind='bar', stacked=True, ax=ax[i])
    ax[i].set_xlabel(col)  # The x-axis carries the feature's categories
    ax[i].set_ylabel('Proportion')  # The y-axis carries the share of each severity class
    ax[i].set_title(f'Stacked Bar Plot of {col} by Severity')  # Set the title

# Remove any unused subplots
for j in range(i + 1, len(ax)):
    fig.delaxes(ax[j])

plt.tight_layout()
plt.show()

**Interpretation:**

1. Stacked Bar Plot of Source by Severity

Interpretation: The majority of accident data across all sources (Source1, Source2, and Source3) are classified under severity level 2, indicating moderate impact on traffic flow. Severity levels 3 and 4, which signify higher traffic impacts, are less frequent but present in all sources.

2. Stacked Bar Plot of State by Severity
   
Interpretation: Most states show a similar pattern where severity level 2 is the most common. Severity levels 3 and 4 have a noticeable presence in states like FL, GA and TX, indicating that these states experience a higher proportion of more severe accidents. Some states like AZ and NC show a higher proportion of severity level 1, indicating minor impacts on traffic.

3. Stacked Bar Plot of Timezone by Severity

Interpretation: Across different time zones (US/Eastern, US/Central, US/Mountain, US/Pacific), the majority of accidents fall under severity level 2. This indicates a consistent pattern of accident severity across the country, irrespective of the timezone.

4. Stacked Bar Plot of Wind Direction by Severity

Interpretation: Accidents occur under a wide range of wind directions, with severity level 2 being the most common across all directions. There is no significant variation in accident severity based on wind direction, suggesting that wind direction might not be a strong factor in determining accident severity.

5. Stacked Bar Plot of Weather Condition by Severity

Interpretation: Clear weather conditions see the highest proportion of severity level 2 accidents, followed by Cloudy, Fair, and Partly Cloudy conditions. Severity level 3 and 4 accidents are more frequent under adverse weather conditions (like Rain), highlighting the impact of weather on accident severity.

6. Stacked Bar Plot of Sunrise_Sunset by Severity
 
Interpretation: Both day and night periods show that severity level 2 is the most common. However, there is a slightly higher proportion of severity level 3 and 4 accidents during night time, indicating that night conditions may contribute to more severe accidents.

7. Stacked Bar Plot of Civil_Twilight by Severity
 
Interpretation: Similar to Sunrise_Sunset, both day and night periods under civil twilight show that severity level 2 is the most prevalent. There is no significant difference in the distribution of severity levels between day and night periods under civil twilight.

8. Stacked Bar Plot of Nautical_Twilight by Severity
    
Interpretation: The pattern remains consistent with the previous plots, where severity level 2 is the most frequent during both day and night periods under nautical twilight. Night time has a slightly higher proportion of severity level 3 and 4 accidents.

9. Stacked Bar Plot of Astronomical_Twilight by Severity
    
Interpretation: Both day and night periods under astronomical twilight show a higher prevalence of severity level 2 accidents. As with other twilight conditions, there is a marginal increase in the proportion of severity level 3 and 4 accidents during night time.

**Overall Insights**

Consistency in Severity Level 2: Across various factors such as source, state, timezone, wind direction, weather conditions, and different times of the day, severity level 2 remains the most common. This indicates that most accidents have a moderate impact on traffic flow.

Higher Severity at Night: There is a consistent observation that night time conditions (irrespective of the type of twilight) have a slightly higher proportion of more severe accidents (levels 3 and 4). This suggests that driving conditions at night may contribute to more severe accidents.

Impact of Adverse Weather: Adverse weather conditions, such as rain, contribute to a higher proportion of severe accidents, highlighting the importance of weather conditions in traffic safety.

These insights can help in formulating better traffic management strategies and safety measures, especially focusing on night time driving and adverse weather conditions.
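
The "slightly higher proportion" statements above can be checked numerically with a row-normalized crosstab. The following is a minimal sketch on a small made-up frame: the `toy` data is invented for illustration, and only the column names `Sunrise_Sunset` and `Severity` come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_US_Accidents
toy = pd.DataFrame({
    "Sunrise_Sunset": ["Day", "Day", "Day", "Day", "Night", "Night", "Night", "Night"],
    "Severity":       [2,     2,     2,     3,     2,       2,       3,       4],
})

# Share of each severity level within Day and within Night (each row sums to 1)
severity_share = pd.crosstab(toy["Sunrise_Sunset"], toy["Severity"], normalize="index")
print(severity_share)

# Combined share of severity 3 and 4 accidents per period
severe_share = severity_share[[3, 4]].sum(axis=1)
print(severe_share)
```

Comparing shares rather than raw counts matters here because the day and night groups are likely to differ in size.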

In [None]:
# Define categorical columns
categorical_columns = df_US_Accidents.select_dtypes(include=['object']).drop(["Severity","ID"],axis=1).columns

# Results storage
results = []

# Loop through categorical columns and perform Chi-Square Test with 'Severity'
for col in categorical_columns:
    # Create a contingency table
    contingency_table = pd.crosstab(df_US_Accidents['Severity'], df_US_Accidents[col])

    # Perform Chi-Square Test
    chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)

    # Store results
    result = {
        'Title': f'Severity vs {col}',
        'Chi-Square Statistic': chi2_stat,
        'P-Value': p_val,
        'Degrees of Freedom': dof,
        'Result': f'Reject the null hypothesis: There is a significant association between Severity and {col}.' if p_val < 0.05 else f'Fail to reject the null hypothesis: There is no significant association between Severity and {col}.'
    }
    results.append(result)

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Display results
print("Chi-Square Test Results:")
results_df
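
With several million rows, even very weak associations tend to produce p-values below 0.05, so an effect-size measure may be useful alongside the test results. The sketch below computes Cramér's V for one pair of columns; the helper name `cramers_v` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V effect size (0 = no association, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical data where Side fully determines Severity
toy = pd.DataFrame({
    "Severity": [0, 0, 0, 0, 1, 1, 1, 1],
    "Side":     ["R", "R", "R", "R", "L", "L", "L", "L"],
})
v_perfect = cramers_v(toy["Severity"], toy["Side"])

# Hypothetical data where Side and Severity are unrelated
toy_indep = pd.DataFrame({
    "Severity": [0, 0, 1, 1, 0, 0, 1, 1],
    "Side":     ["R", "L", "R", "L", "R", "L", "R", "L"],
})
v_none = cramers_v(toy_indep["Severity"], toy_indep["Side"])

print(v_perfect, v_none)
```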

**Removing Negative Values**

In [None]:
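# Keep only rows with a positive temperature.
# Note: this also drops rows where Temperature(F) is exactly 0 or NaN.
df_US_Accidents = df_US_Accidents[df_US_Accidents["Temperature(F)"] > 0]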

In [None]:
df_US_Accidents.shape

**Removing garbage values from the Distance(mi) column**

In [None]:
# Where End_Lat is missing, a Distance(mi) of 0.0 is treated as garbage and replaced with NaN
missing_end = df_US_Accidents["End_Lat"].isna()
df_US_Accidents.loc[missing_end, "Distance(mi)"] = df_US_Accidents.loc[missing_end, "Distance(mi)"].replace(0.0, np.nan)

In [None]:
df_US_Accidents.drop(["Start_Lat","End_Lat","Start_Lng","End_Lng"],axis=1,inplace=True)

In [None]:
df_US_Accidents.shape

#### Train Test Split

In [None]:
X = df_US_Accidents.drop(["Severity"],axis=1)
y = df_US_Accidents["Severity"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)

In [None]:
df_US_Accidents_Train = pd.concat([X_train,y_train],axis=1)
df_US_Accidents_Test = pd.concat([X_test,y_test],axis=1)

#### Outliers Treatment

**Interpretation:** Wherever End_Lat is NaN, the Distance(mi) value is zero, which indicates a garbage value rather than a measured distance.

**Checking for outliers in the numeric variables, using the 95th percentile as the upper whisker and the 5th percentile as the lower whisker.**

In [None]:
# Select only numerical columns
num_columns = df_US_Accidents_Train.select_dtypes(include='number').columns

# Determine the number of plots
num_plots = len(num_columns)
num_rows = (num_plots + 2) // 3  # Calculate the number of rows needed

# Create subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))
axes = axes.flatten()  # Flatten the 2D array of axes

# Plot histograms using Seaborn
for i, col in enumerate(num_columns):
    sns.histplot(df_US_Accidents_Train[col], ax=axes[i], bins=10, kde=True)
    axes[i].set_title(col)

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()

**Interpretation:** From the histograms above, outlier analysis is needed only for a few columns: 'Distance(mi)', 'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)', and 'Traffic_Delay_Hours'. The other columns are boolean or date/time columns whose values are bounded within a fixed range.

In [None]:
for col in df_US_Accidents_Train[['Distance(mi)', 'Temperature(F)', 'Humidity(%)','Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)','Precipitation(in)','Traffic_Delay_Hours']].columns:
    lower_whisker = df_US_Accidents_Train[col].quantile(.05)
    upper_whisker = df_US_Accidents_Train[col].quantile(.95)
    count = df_US_Accidents_Train[col][(df_US_Accidents_Train[col] < lower_whisker) | (df_US_Accidents_Train[col] > upper_whisker)].count()
    print(f"{col} has {count} outliers")

**Interpretation:** Roughly 5-10% of the values in each column fall outside the 5th-95th percentile range, so we scale these columns with RobustScaler.

RobustScaler is used for scaling because it utilizes percentiles, which makes it effective for handling numerical input variables that contain outliers, reducing their impact.
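
To make the percentile-based scaling concrete, the sketch below compares `RobustScaler(quantile_range=(5, 95))` with the same computation done by hand, `(x - median) / (q95 - q5)`. The input values are made up for illustration, with one deliberately extreme entry.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values 1..99 plus one extreme outlier
x = np.arange(1.0, 101.0)
x[-1] = 10000.0
x = x.reshape(-1, 1)

scaler = RobustScaler(quantile_range=(5, 95))
scaled = scaler.fit_transform(x)

# Same computation by hand: (x - median) / (95th percentile - 5th percentile)
median = np.median(x)
q05, q95 = np.percentile(x, [5, 95])
manual = (x - median) / (q95 - q05)

print(np.allclose(scaled, manual))
```

Because the center and scale come from the median and the 5th and 95th percentiles, the single extreme value does not change how the remaining points are scaled, although it still maps to a very large scaled value itself.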

In [None]:
# Select the desired columns
df = df_US_Accidents_Train[['Distance(mi)', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)', 'Traffic_Delay_Hours']]
num_columns = df.columns

# Determine the number of plots
num_plots = len(num_columns)
num_rows = (num_plots + 2) // 3  # Calculate the number of rows needed

# Create subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))
axes = axes.flatten()  # Flatten the 2D array of axes

# Plot boxplots using Seaborn
for i, col in enumerate(num_columns):
    sns.boxplot(y=df[col], ax=axes[i])
    axes[i].set_title(col)

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Create copies of the DataFrames
df_train = df_US_Accidents_Train.copy()
df_test = df_US_Accidents_Test.copy()

# Select only float columns, taken from the training data so both sets use the same list
num_columns_train = df_train.select_dtypes(include='float').columns

# Keep one fitted scaler per column, so that each test column is scaled with the
# statistics of the matching training column (a single shared scaler would only
# remember the last column it was fitted on)
scalers = {}

# Fit a scaler per column and transform the training data
for col in num_columns_train:
    # Isolate the column as a writable 2D array
    col_data_train = df_train[col].to_numpy(dtype=float, copy=True).reshape(-1, 1)

    # Identify the non-NaN indices
    non_nan_indices_train = ~np.isnan(col_data_train).flatten()

    # Fit the scaler on non-NaN values of the training data
    scalers[col] = RobustScaler(quantile_range=(5, 95))
    scalers[col].fit(col_data_train[non_nan_indices_train])

    # Transform the non-NaN values of the training data
    col_data_train[non_nan_indices_train] = scalers[col].transform(col_data_train[non_nan_indices_train])

    # Update the DataFrame with the scaled values
    df_train[col] = col_data_train.ravel()

# Transform the test data using the per-column scalers fitted on the training data
for col in num_columns_train:
    # Isolate the column as a writable 2D array
    col_data_test = df_test[col].to_numpy(dtype=float, copy=True).reshape(-1, 1)

    # Identify the non-NaN indices
    non_nan_indices_test = ~np.isnan(col_data_test).flatten()

    # Transform the non-NaN values of the test data
    col_data_test[non_nan_indices_test] = scalers[col].transform(col_data_test[non_nan_indices_test])

    # Update the DataFrame with the scaled values
    df_test[col] = col_data_test.ravel()

# Assign the scaled DataFrames back to the original DataFrames
df_US_Accidents_Train.update(df_train)
df_US_Accidents_Test.update(df_test)

In [None]:
# Select the desired columns
df = df_US_Accidents_Train[['Distance(mi)', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)', 'Traffic_Delay_Hours']]
num_columns = df.columns

# Determine the number of plots
num_plots = len(num_columns)
num_rows = (num_plots + 2) // 3  # Calculate the number of rows needed

# Create subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))
axes = axes.flatten()  # Flatten the 2D array of axes

# Plot boxplots using Seaborn
for i, col in enumerate(num_columns):
    sns.boxplot(y=df[col], ax=axes[i])
    axes[i].set_title(col)

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()

**Interpretation:** Comparing the two sets of box plots, we can see that RobustScaler has rescaled the data to a significant degree.

#### Missing Value Imputation

Check for the presence of missing values and garbage values.

In [None]:
df_US_Accidents_Train.isna().mean() * 100

**For categorical columns, replacing missing values with the mode**

In [None]:
missing_categorical = ["Sunrise_Sunset","Civil_Twilight","Nautical_Twilight","Astronomical_Twilight","Wind_Direction","Timezone"]
for col in missing_categorical:
    # Assign the result back instead of using inplace=True on a column selection
    df_US_Accidents_Train[col] = df_US_Accidents_Train[col].fillna(df_US_Accidents_Train[col].mode()[0])

In [None]:
missing_categorical = ["Sunrise_Sunset","Civil_Twilight","Nautical_Twilight","Astronomical_Twilight","Wind_Direction","Timezone"]
for col in missing_categorical:
    # Use the mode of the training data so that no statistic is learned from the test set
    df_US_Accidents_Test[col] = df_US_Accidents_Test[col].fillna(df_US_Accidents_Train[col].mode()[0])

In [None]:
df_US_Accidents_Train.isna().mean() * 100

**Checking for outliers using box plot for missing value columns**

In [None]:
# List of numerical columns with missing values
missing_numerical = ['Temperature(F)', 'Humidity(%)', 'Pressure(in)',
                     'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)']

# Calculate number of rows needed for subplots
rows = (len(missing_numerical) + 2) // 3  # Equivalent to math.ceil(len(missing_numerical) / 3)

# Create subplots
fig, ax = plt.subplots(nrows=rows, ncols=3, figsize=(15, rows * 5))

# Iterate through columns and plot box plots
for i, col in enumerate(missing_numerical):
    row, col_index = divmod(i, 3)  # Get row and column index for the subplot
    sns.boxplot(data=df_US_Accidents_Train, x=col, ax=ax[row, col_index])
    ax[row, col_index].set_title(col)  # Set title for each subplot

# Hide empty subplots if any
for i in range(len(missing_numerical), rows * 3):
    row, col_index = divmod(i, 3)
    ax[row, col_index].axis('off')

plt.tight_layout()
plt.show()

**Interpretation:** The box plots above show outliers in almost all features, so we replace the NaN values with the median, which is less sensitive to outliers than the mean.
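
As a small illustration of why the median is the safer fill value here, the sketch below compares the mean and the median of a column containing one extreme value, and fills both sets with the statistic learned from the training data. The values are invented; only the column name `Wind_Speed(mph)` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test columns; 500.0 plays the role of an outlier
train = pd.DataFrame({"Wind_Speed(mph)": [3.0, 5.0, 7.0, np.nan, 500.0]})
test = pd.DataFrame({"Wind_Speed(mph)": [4.0, np.nan]})

# The mean is pulled up by the outlier, the median is not
train_mean = train["Wind_Speed(mph)"].mean()      # (3 + 5 + 7 + 500) / 4
train_median = train["Wind_Speed(mph)"].median()  # midpoint of 5 and 7

# Learn the fill value on the training data only, then reuse it for the test data
train_filled = train.fillna({"Wind_Speed(mph)": train_median})
test_filled = test.fillna({"Wind_Speed(mph)": train_median})

print(train_mean, train_median)
```

Reusing the training median for the test set keeps the test data from influencing any preprocessing statistic.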

In [None]:
# Create copies of the DataFrames
df_train = df_US_Accidents_Train.copy()
df_test = df_US_Accidents_Test.copy()

# Fill missing values with the training median for both training and test data,
# so that no statistic is learned from the test set
for col in df_train.select_dtypes(include="number").columns:
    train_median = df_train[col].median()
    df_train[col] = df_train[col].fillna(train_median)
    df_test[col] = df_test[col].fillna(train_median)

# Create an instance of KNNImputer
# Note: after the median fill above, no NaN values should remain in the numeric
# columns, so the KNN imputer below is expected to leave the data unchanged
knn_imputer = KNNImputer(n_neighbors=5)

# Fit the imputer on the training data
knn_imputer.fit(df_train.select_dtypes(include="number"))

# Transform the training data
df_train_imputed = knn_imputer.transform(df_train.select_dtypes(include="number"))

# Convert the result back to a DataFrame
df_train_imputed = pd.DataFrame(df_train_imputed,
                                columns=df_train.select_dtypes(include="number").columns,
                                index=df_train.index)

# Assign the imputed DataFrame back to the original DataFrame
for col in df_train_imputed.columns:
    df_train[col] = df_train_imputed[col]

# Transform the test data
df_test_imputed = knn_imputer.transform(df_test.select_dtypes(include="number"))

# Convert the result back to a DataFrame
df_test_imputed = pd.DataFrame(df_test_imputed,
                               columns=df_test.select_dtypes(include="number").columns,
                               index=df_test.index)

# Assign the imputed DataFrame back to the original DataFrame
for col in df_test_imputed.columns:
    df_test[col] = df_test_imputed[col]

# If you need to update the original DataFrames
df_US_Accidents_Train.update(df_train)
df_US_Accidents_Test.update(df_test)

In [None]:
df_US_Accidents_Train.head()

**Dropping identifier, source, and other columns that don't provide meaningful information to the model**

In [None]:
df_US_Accidents_Train.drop(["ID","Source","Roundabout","Give_Way","Timezone","Sunrise_Sunset","Civil_Twilight","Nautical_Twilight","Astronomical_Twilight"],axis=1,inplace=True)
df_US_Accidents_Test.drop(["ID","Source","Roundabout","Give_Way","Timezone","Sunrise_Sunset","Civil_Twilight","Nautical_Twilight","Astronomical_Twilight"],axis=1,inplace=True)

**Interpretation:** All the missing values are handled

In [None]:
X_train = df_US_Accidents_Train.drop(["Severity"],axis=1)
y_train = df_US_Accidents_Train["Severity"]
X_test = df_US_Accidents_Test.drop(["Severity"],axis=1)
y_test = df_US_Accidents_Test["Severity"]

In [None]:
df_cat = X_train.select_dtypes(include="object")
df_num = X_train.select_dtypes(include="number")
df_dummies = pd.get_dummies(df_cat,drop_first=True,dtype=int)
X_train = pd.concat([df_num,df_dummies],axis=1)

In [None]:
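df_cat = X_test.select_dtypes(include="object")
df_num = X_test.select_dtypes(include="number")
df_dummies = pd.get_dummies(df_cat,drop_first=True,dtype=int)
X_test = pd.concat([df_num,df_dummies],axis=1)
# Align the test columns with the training columns: categories missing from the
# test set become all-zero columns, and categories unseen in training are dropped
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)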
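
Because the training and test sets are one-hot encoded in separate cells, their dummy columns can differ whenever a category appears in only one of them. The sketch below shows the mismatch on invented data and one way to align the test matrix to the training columns with `reindex`; the frames `train_cat` and `test_cat` are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column; "Snow" appears only in train, "Fog" only in test
train_cat = pd.DataFrame({"Weather_Condition": ["Clear", "Rain", "Snow"]})
test_cat = pd.DataFrame({"Weather_Condition": ["Clear", "Fog"]})

train_dummies = pd.get_dummies(train_cat, drop_first=True, dtype=int)
test_dummies = pd.get_dummies(test_cat, drop_first=True, dtype=int)

# The two encodings do not share the same columns
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Force the test matrix onto the training columns: missing columns are filled
# with 0, and columns unseen during training are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

A model fitted on the training columns expects exactly that set of columns, in that order, at prediction time.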

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
y_train = y_train.astype(int)
y_test = y_test.astype(int)

#### ADASYN

In [None]:
# Initialize ADASYN to oversample the minority class
adasyn = ADASYN(sampling_strategy='minority', random_state=100)

# Optional: combine with RandomUnderSampler to undersample the majority class.
# This step helps if the dataset is still very imbalanced after ADASYN.
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=100)
# Pipeline here must be imblearn.pipeline.Pipeline; sklearn's Pipeline has no fit_resample
pipeline = Pipeline(steps=[('adasyn', adasyn), ('under', undersample)])

# Fit the resampling pipeline on the training data only
X_resampled, y_resampled = pipeline.fit_resample(X_train, y_train)

In [None]:
print(X_resampled.shape)
print(y_resampled.shape)

In [None]:
X_train = X_resampled
y_train = y_resampled

#### Modeling Functions

In [None]:
df_result_Test = pd.DataFrame(columns=['Model Name','Recall 0','Recall 1','Precision 0','Precision 1','F1-Score 0','F1-Score 1'])
df_result_Train = pd.DataFrame(columns=['Model Name','Recall 0','Recall 1','Precision 0','Precision 1','F1-Score 0','F1-Score 1'])

In [None]:
df_result_Test

In [None]:
df_result_Train

**Classification Report**

In [None]:
import numpy as np
from sklearn.metrics import classification_report, roc_curve, roc_auc_score

def get_test_report(model, test_data, target_data):
    # Predict the probabilities for the positive class
    y_proba = model.predict_proba(test_data)[:, 1]
    # Compute ROC curve
    fpr, tpr, thresholds = roc_curve(target_data, y_proba)
    # Compute Youden's J statistic
    j_scores = tpr - fpr
    best_index = np.argmax(j_scores)
    best_threshold = thresholds[best_index]
    print(f'Best threshold according to Youden\'s J statistic: {best_threshold}')
    print("\n")
    # Predict classes based on a specific threshold
    y_pred = (y_proba >= best_threshold).astype(int)  # Using best_threshold for prediction
    # Generate classification report
    report = classification_report(target_data, y_pred)
    print(report)
    # Return the classification report for test data
    return classification_report(target_data, y_pred, output_dict=True)
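
The threshold selection used in `get_test_report` can be traced on a few hand-made predictions. The sketch below uses invented labels and probabilities to show how Youden's J (TPR minus FPR) picks a cut-off from the ROC curve. Note that choosing the threshold on the same data that is then evaluated tends to give optimistic metrics; a held-out validation set is a common alternative.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_proba)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]

# Classify as positive when the probability reaches the selected threshold
y_pred = (y_proba >= best_threshold).astype(int)

print(best_threshold)
print(y_pred)
```

In this toy case the selected threshold captures all four positives at the cost of one false positive.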

**Confusion Matrix**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve
from matplotlib.colors import ListedColormap
def plot_confusion_matrix(model, test_data, target_data):
    # Predict the probabilities for the positive class
    y_proba = model.predict_proba(test_data)[:, 1]
    # Compute ROC curve
    fpr, tpr, thresholds = roc_curve(target_data, y_proba)
    # Compute Youden's J statistic
    j_scores = tpr - fpr
    best_index = np.argmax(j_scores)
    best_threshold = thresholds[best_index]
    # Predict classes based on the best threshold
    y_pred = (y_proba >= best_threshold).astype(int)
    # Create a confusion matrix
    cm = confusion_matrix(target_data, y_pred)
    # Label the confusion matrix
    conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'], index=['Actual:0', 'Actual:1'])
    # Plot a heatmap to visualize the confusion matrix
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap=ListedColormap(['lightskyblue']), cbar=False,
                linewidths=0.1, annot_kws={'size': 12})
    # Set the font size of x-axis and y-axis ticks
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    # Display the plot
    plt.show()

**ROC AUC Curve**

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

def plot_roc(model, X_test, y_test):
    """
    Plots the ROC curve for a binary classification model and calculates the ROC-AUC score.
    Parameters:
    - model: Trained machine learning model with `predict_proba` method
    - X_test: Features of the test set
    - y_test: True target values of the test set
    """
    # Predict the probability of the positive class (class 1)
    y_pred_prob = model.predict_proba(X_test)[:, 1]

    # Compute ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)

    # Compute AUC score
    auc_score = roc_auc_score(y_test, y_pred_prob)

    # Plot the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {auc_score:.4f})')

    # Plot the diagonal line representing a random classifier
    plt.plot([0, 1], [0, 1], 'r--')

    # Set the limits for the x and y axes
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])

    # Add plot and axes labels
    plt.title('ROC Curve for Binary Classification', fontsize=15)
    plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=15)
    plt.ylabel('True Positive Rate (Sensitivity)', fontsize=15)

    # Add a legend to the plot
    plt.legend(loc='lower right')

    # Add grid lines for better readability
    plt.grid(True)

    # Display the plot
    plt.show()

#### K Nearest Neighbors

**Initial complete modeling**

In [None]:
# instantiate the 'KNeighborsClassifier' with default settings
# n_neighbors: number of neighbors to consider (default is 5)
# default metric is minkowski, and with p=2 it is equivalent to the euclidean metric
knn_classification = KNeighborsClassifier()

# fit the model using fit() on train data
knn_model = knn_classification.fit(X_train, y_train)
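
The comment above notes that the Minkowski metric with p=2 is the Euclidean distance. A minimal numeric check, using two made-up points and a hypothetical helper `minkowski`, is sketched below; p=1 gives the Manhattan distance that appears later in the hyperparameter search.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

d_p2 = minkowski(a, b, 2)                 # order p = 2
d_euclid = float(np.linalg.norm(a - b))   # Euclidean distance
d_p1 = minkowski(a, b, 1)                 # order p = 1, the Manhattan distance

print(d_p2, d_euclid, d_p1)
```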

###### Base Model

In [None]:
plot_confusion_matrix(knn_model,X_test,y_test)

In [None]:
report = get_test_report(knn_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(knn_model,X_test,y_test)

In [None]:
df_result_Test.loc[0] = ['KNN Classification Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
plot_confusion_matrix(knn_model,X_train,y_train)

In [None]:
report = get_test_report(knn_model,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(knn_model,X_train,y_train)

In [None]:
df_result_Train.loc[0] = ['KNN Classification Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
X = X_train
y = y_train

# Split data into train and test sets
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(params):
    n_neighbors = int(params['n_neighbors'])
    metric = params['metric']
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric, n_jobs=-1)
    knn.fit(X_train_hype, y_train_hype)
    y_pred_hype = knn.predict(X_test_hype)
    recall = recall_score(y_test_hype, y_pred_hype)
    return -recall  # Hyperopt minimizes the objective, so negate the recall to maximize it

space = {
    'n_neighbors': hp.quniform('n_neighbors', 1, 20, 1),
    'metric': hp.choice('metric', ['euclidean', 'manhattan'])
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)

print('Best parameters: ', best)

In [None]:
knn_classification = KNeighborsClassifier(n_neighbors = int(best['n_neighbors']),metric = ['euclidean', 'manhattan'][best['metric']])
knn_model = knn_classification.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(knn_model,X_test,y_test)

In [None]:
report = get_test_report(knn_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(knn_model,X_test,y_test)

In [None]:
df_result_Test.loc[1] = ['KNN Classification Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

#### Naive Bayes Algorithm

###### Base Model

In [None]:
# instantiate the 'GaussianNB'
gnb = GaussianNB()

# fit the model using fit() on train data
gnb_model = gnb.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gnb_model,X_test,y_test)

In [None]:
report = get_test_report(gnb_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gnb_model,X_test,y_test)

In [None]:
df_result_Test.loc[2] = ['Naive Bayes Algorithm Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
plot_confusion_matrix(gnb_model,X_train,y_train)

In [None]:
report = get_test_report(gnb_model,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gnb_model,X_train,y_train)

In [None]:
df_result_Train.loc[1] = ['Naive Bayes Algorithm Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
# Sample data
X = X_train  # Your feature matrix
y = y_train  # Your target variable

# Split data into train and test sets
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
def objective(params):
    # Create the Gaussian Naive Bayes classifier
    gnb = GaussianNB(var_smoothing=params['var_smoothing'])

    # Fit the model
    gnb.fit(X_train_hype, y_train_hype)

    # Make predictions
    y_pred_hype = gnb.predict(X_test_hype)

    # Calculate recall score
    recall = recall_score(y_test_hype, y_pred_hype, average='binary')  # Change 'average' if multi-class
    return -recall  # Hyperopt minimizes the objective, so negate the recall to maximize it

# Define the search space
space = {
    'var_smoothing': hp.loguniform('var_smoothing', np.log(1e-9), np.log(1e+0))  # Positive values only
}

# Set up Trials object
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)

print('Best parameters: ', best)

In [None]:
# instantiate the 'GaussianNB'
gnb = GaussianNB(var_smoothing=float(best['var_smoothing']))

# fit the model using fit() on train data
gnb_model = gnb.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gnb_model,X_test,y_test)

In [None]:
report = get_test_report(gnb_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gnb_model,X_test,y_test)

In [None]:
df_result_Test.loc[3] = ['Naive Bayes Algorithm Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

#### Multinomial Naive Bayes

###### Base model

In [None]:
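MultiNB = MultinomialNB()
# Fit the scaler on the training data only and reuse it for the test data,
# so both sets are scaled with the same minimum and maximum
minmax_scaler = MinMaxScaler()
X_train_sc = minmax_scaler.fit_transform(X_train)
X_test_sc = minmax_scaler.transform(X_test)
MultiNB.fit(X_train_sc, y_train)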

In [None]:
y_pred = MultiNB.predict(X_test_sc)

In [None]:
# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create a DataFrame for the confusion matrix
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'],
                           index=['Actual:0', 'Actual:1'])

# Plot the heatmap to visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap=ListedColormap(['lightskyblue']), cbar=False,
            linewidths=0.1, annot_kws={'size':12})

# Set the font size of x-axis and y-axis ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Display the plot
plt.show()

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
report = classification_report(y_test, y_pred,output_dict=True)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
df_result_Test.loc[4] = ['Multinomial Naive Bayes Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

###### HyperOpt Model

In [None]:
X = X_train_sc
y = y_train

# Split data into train and test sets
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(params):
    alpha = params['alpha']

    # Initialize and fit the model
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train_hype, y_train_hype)
    y_pred_hype = model.predict(X_test_hype)

    # Compute recall score
    recall = recall_score(y_test_hype, y_pred_hype)

    # Return the negative recall score since Hyperopt minimizes the objective
    return -recall

# Define the parameter space for `alpha`
space = {
    'alpha': hp.uniform('alpha', 0.01, 2.0)  # Uniform distribution for alpha
}

# Initialize Trials object
trials = Trials()

# Perform hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)

print('Best parameters: ', best)

In [None]:
MultiNB = MultinomialNB(alpha=float(best['alpha']))
MultiNB.fit(X_train_sc, y_train)

In [None]:
plot_confusion_matrix(MultiNB,X_test_sc,y_test)

In [None]:
report = get_test_report(MultiNB,X_test_sc,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(MultiNB,X_test_sc,y_test)

In [None]:
df_result_Test.loc[5] = ['Multinomial Naive Bayes Classifier Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

#### Decision Tree

###### Base Model

In [None]:
# instantiate the 'DecisionTreeClassifier' with default settings (default criterion is 'gini')
# no 'random_state' is passed, so results can differ between runs
decision_tree_classification = DecisionTreeClassifier()

# fit the model using fit() on train data
decision_tree = decision_tree_classification.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(decision_tree,X_test,y_test)

In [None]:
report = get_test_report(decision_tree,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(decision_tree,X_test,y_test)

In [None]:
df_result_Test.loc[6] = ['Decision Tree Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
# create a dataframe that stores the feature names and their importance
# 'feature_importances_' returns the features based on the gini importance
important_features = pd.DataFrame({'Features': X_train.columns,
                                   'Importance': decision_tree.feature_importances_})

# sort the dataframe in the descending order according to the feature importance
important_features = important_features.sort_values('Importance', ascending = False)

# create a barplot to visualize the features based on their importance
sns.barplot(x = 'Importance', y = 'Features', data = important_features[:10])

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Feature Importance', fontsize = 10)
plt.xlabel('Importance', fontsize = 10)
plt.ylabel('Features', fontsize = 10)
# display the plot
plt.show()

In [None]:
plot_confusion_matrix(decision_tree,X_train,y_train)

In [None]:
report = get_test_report(decision_tree,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(decision_tree,X_train,y_train)

In [None]:
df_result_Train.loc[2] = ['Decision Tree Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
X = X_train
y = y_train

# Split data into train and test sets
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(params):
    criterion = params['criterion']
    max_depth = int(params['max_depth'])
    min_samples_split = int(params['min_samples_split'])
    min_samples_leaf = int(params['min_samples_leaf'])
    max_features = params['max_features']

    # Initialize and fit the model
    model = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42
    )
    model.fit(X_train_hype, y_train_hype)
    y_pred_hype = model.predict(X_test_hype)

    # Compute recall score
    recall = recall_score(y_test_hype, y_pred_hype)

    # Return the negative recall score since Hyperopt minimizes the objective
    return -recall

# Define the parameter space
space = {
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),  # Integer values between 3 and 10
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),  # Integer values between 2 and 10
    'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 5, 1),  # Integer values between 1 and 5
    'max_features': hp.choice('max_features', ['sqrt', 'log2', None])
}

# Initialize Trials object
trials = Trials()

# Perform hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
print('Best parameters: ', best)

In [None]:
decision_tree_classification = DecisionTreeClassifier(criterion = ['gini','entropy'][best['criterion']], # hp.choice returns the index of the chosen option, so map it back to the string
                                                       max_depth = int(best['max_depth']),
                                                       max_features = ['sqrt', 'log2', None][best['max_features']], # index into the same list used in the search space
                                                       min_samples_leaf = int(best['min_samples_leaf']),
                                                       min_samples_split = int(best['min_samples_split']),
                                                       random_state = 42) # same seed as in the objective function

# fit the model using fit() on train data
decision_tree = decision_tree_classification.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(decision_tree,X_test,y_test)

In [None]:
report = get_test_report(decision_tree,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(decision_tree,X_test,y_test)

In [None]:
df_result_Test.loc[7] = ['Decision Tree Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test
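
The six-line block that pulls precision, recall and F1 out of `report` is repeated for every model. A small helper could build the whole results row in one call. This is a minimal sketch that assumes `get_test_report` returns a nested dictionary keyed by class label and metric name, as the indexing above suggests. `metrics_row` and `example_report` are hypothetical, and the numbers are illustrative only.

```python
def metrics_row(model_name, report, digits=4):
    """Build one results row: name, recall 0/1, precision 0/1, f1-score 0/1."""
    row = [model_name]
    for metric in ('recall', 'precision', 'f1-score'):
        for label in ('0', '1'):
            row.append(round(report[label][metric], digits))
    return row


# Hypothetical report values, for illustration only
example_report = {
    '0': {'precision': 0.81234, 'recall': 0.75556, 'f1-score': 0.78291},
    '1': {'precision': 0.70001, 'recall': 0.76549, 'f1-score': 0.73129},
}

example_row = metrics_row('Example Model', example_report)
print(example_row)
```

With such a helper, each results cell would reduce to a single assignment like `df_result_Test.loc[7] = metrics_row('Decision Tree Hypertuned Model', report)`.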

#### Random Forest

###### Base Model

In [None]:
# instantiate the 'RandomForestClassifier' with its default hyperparameters
# (optionally pass 'n_estimators' to set the number of trees in the forest
#  and 'random_state' to obtain the same samples each time the code runs)
rf_classification = RandomForestClassifier()

# use fit() to fit the model on the train set
rf_model = rf_classification.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(rf_model,X_test,y_test)

In [None]:
report = get_test_report(rf_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(rf_model,X_test,y_test)

In [None]:
df_result_Test.loc[8] = ['Random Forest Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
plot_confusion_matrix(rf_model,X_train,y_train)

In [None]:
report = get_test_report(rf_model,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(rf_model,X_train,y_train)

In [None]:
df_result_Train.loc[3] = ['Random Forest Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
X = X_train  # Feature matrix used for tuning (training data only)
y = y_train  # Target variable used for tuning

# Split the training data into a tuning-train set and a validation set
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
def objective(params):
    rf = RandomForestClassifier(
        criterion=params['criterion'],
        n_estimators=int(params['n_estimators']),
        max_depth=int(params['max_depth']),
        max_features=params['max_features'],
        min_samples_split=int(params['min_samples_split']),
        min_samples_leaf=int(params['min_samples_leaf']),
        max_leaf_nodes=int(params['max_leaf_nodes']),
        random_state=42
    )
    rf.fit(X_train_hype, y_train_hype)
    y_pred_hype = rf.predict(X_test_hype)
    recall = recall_score(y_test_hype, y_pred_hype, average='binary')  # Change 'average' if multi-class
    return -recall  # Hyperopt minimizes the objective, so negate the recall to maximize it

# Define the search space
space = {
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'n_estimators': hp.randint('n_estimators', 90) + 10,  # Random integers from 10 to 99 (hp.randint excludes its upper bound)
    'max_depth': hp.randint('max_depth', 15) + 5,          # Random integers from 5 to 19
    'max_features': hp.choice('max_features', ['sqrt', 'log2', None]),
    'min_samples_split': hp.randint('min_samples_split', 8) + 2,  # Random integers from 2 to 9
    'min_samples_leaf': hp.randint('min_samples_leaf', 5) + 1,    # Random integers from 1 to 5
    'max_leaf_nodes': hp.randint('max_leaf_nodes', 25) + 5         # Random integers from 5 to 29
}

# Set up Trials object
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
print('Best parameters for Random Forest Classifier:', best)
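
`hp.randint(label, upper)` draws integers from 0 up to but not including `upper`, so an expression such as `hp.randint('n_estimators', 90) + 10` covers 10 through 99. The sketch below works out the inclusive bounds for each integer parameter in the random forest search space. `randint_range` is a hypothetical helper written for this illustration, not part of Hyperopt.

```python
def randint_range(upper, offset=0):
    """Inclusive (low, high) bounds of `hp.randint(label, upper) + offset`.

    hp.randint samples from [0, upper), so the largest raw draw is upper - 1.
    """
    return offset, upper - 1 + offset


# (upper, offset) pairs copied from the random forest search space
rf_ranges = {
    'n_estimators': randint_range(90, 10),
    'max_depth': randint_range(15, 5),
    'min_samples_split': randint_range(8, 2),
    'min_samples_leaf': randint_range(5, 1),
    'max_leaf_nodes': randint_range(25, 5),
}

for name, (low, high) in rf_ranges.items():
    print(f'{name}: {low} to {high}')
```

Note that the `best` dictionary printed above holds the raw draws, before the offset is added, so the offsets have to be re-applied when the final model is built.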

In [None]:
rf_classification = RandomForestClassifier(criterion = ['gini', 'entropy'][best['criterion']], # hp.choice returns an index; map it back to the string
                                           # fmin reports the raw hp.randint draws, so re-apply the offsets used in the search space
                                           n_estimators = int(best['n_estimators']) + 10,
                                           max_depth = int(best['max_depth']) + 5,
                                           max_features = ['sqrt', 'log2', None][best['max_features']],
                                           max_leaf_nodes = int(best['max_leaf_nodes']) + 5,
                                           min_samples_leaf = int(best['min_samples_leaf']) + 1,
                                           min_samples_split = int(best['min_samples_split']) + 2,
                                           random_state = 42) # same seed as in the tuning objective
rf_model = rf_classification.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(rf_model,X_test,y_test)

In [None]:
report = get_test_report(rf_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(rf_model,X_test,y_test)

In [None]:
df_result_Test.loc[9] = ['Random Forest Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

#### Ensemble Techniques

##### AdaBoosting Classifier

###### Base Model

In [None]:
# instantiate the 'AdaBoostClassifier' with its default hyperparameters
# (optionally pass 'n_estimators', the number of estimators at which boosting is terminated,
#  and 'random_state' to obtain the same results on every run)
ada_model = AdaBoostClassifier()

# fit the model using fit() on train data
ada_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(ada_model,X_test,y_test)

In [None]:
report = get_test_report(ada_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(ada_model,X_test,y_test)

In [None]:
df_result_Test.loc[10] = ['AdaBoosting Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
plot_confusion_matrix(ada_model,X_train,y_train)

In [None]:
report = get_test_report(ada_model,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(ada_model,X_train,y_train)

In [None]:
df_result_Train.loc[4] = ['AdaBoosting Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
X = X_train  # Feature matrix used for tuning (training data only)
y = y_train  # Target variable used for tuning

# Split the training data into a tuning-train set and a validation set
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
def objective(params):
    # Build an AdaBoost model from the sampled hyperparameters
    # (the default base estimator is used; it is not part of the search space)
    ada = AdaBoostClassifier(
        learning_rate=params['learning_rate'],
        n_estimators=int(params['n_estimators']),
        algorithm=params['algorithm'],
        random_state=42
    )
    ada.fit(X_train_hype, y_train_hype)
    y_pred_hype = ada.predict(X_test_hype)
    recall = recall_score(y_test_hype, y_pred_hype, average='binary')  # Change 'average' if multi-class
    return -recall  # Hyperopt minimizes the objective, so negate the recall to maximize it

space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 2.0),  # Uniform distribution between 0.01 and 2.0
    'n_estimators': hp.randint('n_estimators', 41) + 10,  # Random integers between 10 and 50
    'algorithm': hp.choice('algorithm', ['SAMME', 'SAMME.R'])  # Algorithm choices
}

# Set up Trials object
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
print('Best parameters for AdaBoost classifier:', best)
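
Each objective returns the negative recall of the positive class, because Hyperopt minimizes its objective. As a reminder of what is being optimized, the sketch below computes binary recall, TP / (TP + FN), in plain Python and negates it. `binary_recall` and the example labels are hypothetical, and the function is a stand-in for `recall_score`, not its implementation.

```python
def binary_recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    total_pos = true_pos + false_neg
    # Return 0.0 when there are no positive samples to recall
    return true_pos / total_pos if total_pos else 0.0


# Made-up labels: four actual positives, three of them predicted correctly
y_true_example = [1, 0, 1, 1, 0, 1]
y_pred_example = [1, 0, 0, 1, 1, 1]

recall_example = binary_recall(y_true_example, y_pred_example)
loss_example = -recall_example  # the value a Hyperopt objective would return
print(recall_example, loss_example)
```

Because only recall is optimized, a model that predicts the positive class for every sample would score a perfect 1.0, so it is worth checking precision for the tuned model as well.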

In [None]:
ada_model = AdaBoostClassifier(algorithm = 'SAMME' if best['algorithm'] == 0 else 'SAMME.R', # hp.choice returns an index; map it back to the algorithm name
                               learning_rate = best['learning_rate'],
                               n_estimators = int(best['n_estimators']) + 10, # fmin reports the raw hp.randint draw, so re-apply the "+ 10" offset from the search space
                               random_state = 42) # same seed as in the tuning objective
ada_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(ada_model,X_test,y_test)

In [None]:
report = get_test_report(ada_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(ada_model,X_test,y_test)

In [None]:
df_result_Test.loc[11] = ['AdaBoosting Classifier Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

##### Gradient Boosting

###### Base Model

In [None]:
# instantiate the 'GradientBoostingClassifier' with its default hyperparameters
# (optionally pass 'n_estimators', the number of estimators to consider,
#  'max_depth', the maximum depth of each tree,
#  and 'random_state' to obtain the same results on every run)
gboost_model = GradientBoostingClassifier()

# fit the model using fit() on train data
gboost_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gboost_model,X_test,y_test)

In [None]:
report = get_test_report(gboost_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gboost_model,X_test,y_test)

In [None]:
df_result_Test.loc[12] = ['Gradient Boosting Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

In [None]:
plot_confusion_matrix(gboost_model,X_train,y_train)

In [None]:
report = get_test_report(gboost_model,X_train,y_train)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gboost_model,X_train,y_train)

In [None]:
df_result_Train.loc[5] = ['Gradient Boosting Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Train

###### HyperOpt Model

In [None]:
X = X_train  # Feature matrix used for tuning (training data only)
y = y_train  # Target variable used for tuning

# Split the training data into a tuning-train set and a validation set
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
def objective(params):
    gboost = GradientBoostingClassifier(
        n_estimators=int(params['n_estimators']),
        learning_rate=params['learning_rate'],
        max_depth=int(params['max_depth']),
        min_samples_split=int(params['min_samples_split']),
        subsample=params['subsample'],
        random_state=42
    )
    gboost.fit(X_train_hype, y_train_hype)
    y_pred_hype = gboost.predict(X_test_hype)
    recall = recall_score(y_test_hype, y_pred_hype, average='binary')  # Change 'average' if multi-class
    return -recall  # Hyperopt minimizes the objective, so negate the recall to maximize it

# Define the search space
space = {
    'n_estimators': hp.randint('n_estimators', 51) + 50,  # Random integers between 50 and 100
    'learning_rate': hp.uniform('learning_rate', 0.1, 1.0),  # Uniform distribution between 0.1 and 1.0
    'max_depth': hp.randint('max_depth', 3) + 3,            # Random integers between 3 and 5
    'min_samples_split': hp.randint('min_samples_split', 6) + 5,  # Random integers between 5 and 10
    'subsample': hp.uniform('subsample', 0.8, 1.0)           # Uniform distribution between 0.8 and 1.0
}

# Set up Trials object
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
print('Best parameters for Gradient Boosting Classifier:', best)

In [None]:
gboost_model = GradientBoostingClassifier(learning_rate = best['learning_rate'],
                                          # fmin reports the raw hp.randint draws, so re-apply the offsets used in the search space
                                          max_depth = int(best['max_depth']) + 3,
                                          min_samples_split = int(best['min_samples_split']) + 5,
                                          n_estimators = int(best['n_estimators']) + 50,
                                          subsample = best['subsample'],
                                          random_state = 42) # same seed as in the tuning objective
gboost_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gboost_model,X_test,y_test)

In [None]:
report = get_test_report(gboost_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(gboost_model,X_test,y_test)

In [None]:
df_result_Test.loc[13] = ['Gradient Boosting Classifier Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

##### XGBoost Classification

###### Base Model

In [None]:
# instantiate the 'XGBClassifier' with its default hyperparameters
# (optionally set the maximum depth of the tree with 'max_depth'
#  and the minimum loss reduction required to partition a leaf node with 'gamma')
xgb_model = XGBClassifier()

# fit the model using fit() on train data
xgb_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(xgb_model,X_test,y_test)

In [None]:
report = get_test_report(xgb_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(xgb_model,X_test,y_test)

In [None]:
df_result_Test.loc[14] = ['XGBoost Classifier Base Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

###### HyperOpt Model

In [None]:
X = X_train  # Feature matrix used for tuning (training data only)
y = y_train  # Target variable used for tuning

# Split the training data into a tuning-train set and a validation set
X_train_hype, X_test_hype, y_train_hype, y_test_hype = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
def objective(params):
    xgb_model = XGBClassifier(
        learning_rate=params['learning_rate'],
        max_depth=int(params['max_depth']),
        gamma=int(params['gamma']),
        use_label_encoder=False,
        eval_metric='logloss'
    )
    xgb_model.fit(X_train_hype, y_train_hype)
    y_pred_proba = xgb_model.predict_proba(X_test_hype)[:, 1]  # Get probabilities for the positive class
    roc_auc = roc_auc_score(y_test_hype, y_pred_proba)
    return -roc_auc  # Hyperopt minimizes the objective, so negate the ROC AUC to maximize it

# Define the search space
space = {
    'learning_rate': hp.uniform('learning_rate', 0.1, 0.6),  # Uniform distribution between 0.1 and 0.6
    'max_depth': hp.randint('max_depth', 7) + 3,            # Random integers between 3 and 9
    'gamma': hp.randint('gamma', 5)                          # Random integers between 0 and 4
}

# Set up Trials object
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
print('Best parameters for XGBoost classifier:', best)
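
Unlike the earlier objectives, the XGBoost objective maximizes ROC AUC computed from predicted probabilities rather than recall on hard predictions. ROC AUC can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting as half. The sketch below implements that pairwise definition in plain Python. `pairwise_roc_auc` is a hypothetical helper for illustration and is far slower than `roc_auc_score` on large data.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('ROC AUC needs at least one sample of each class')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only
auc_example = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

Since this metric is threshold-independent, the XGBoost configuration it selects is not directly comparable with configurations chosen by recall at the default 0.5 threshold.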

In [None]:
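xgb_model = XGBClassifier(gamma = int(best['gamma']), # sampled without an offset (0 to 4)
                          learning_rate = best['learning_rate'],
                          max_depth = int(best['max_depth']) + 3) # fmin reports the raw hp.randint draw, so re-apply the "+ 3" offset from the search space
xgb_model.fit(X_train, y_train)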

In [None]:
plot_confusion_matrix(xgb_model,X_test,y_test)

In [None]:
report = get_test_report(xgb_model,X_test,y_test)

precision_0 = round(report['0']['precision'],4)
recall_0 = round(report['0']['recall'],4)
f1score_0 = round(report['0']['f1-score'],4)

precision_1 = round(report['1']['precision'],4)
recall_1 = round(report['1']['recall'],4)
f1score_1 = round(report['1']['f1-score'],4)

In [None]:
plot_roc(xgb_model,X_test,y_test)

In [None]:
df_result_Test.loc[15] = ['XGBoost Classifier Hypertuned Model',recall_0, recall_1,precision_0,precision_1,f1score_0,f1score_1]
df_result_Test

### Result

In [None]:
df_result_Test.sort_values(by=['Recall 0','Recall 1'],ascending=False)

In [None]:
df_result_Train.sort_values(by=['Recall 0'],ascending=False)
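
Sorting by `['Recall 0','Recall 1']` in descending order ranks models by recall on class 0 first and uses recall on class 1 only to break ties. The plain-Python sketch below mirrors that ordering on a few hypothetical rows. The `Model` key and all values are made up for illustration and are not results from this notebook.

```python
# Hypothetical rows, for illustration only (not results from this notebook)
example_rows = [
    {'Model': 'model_a', 'Recall 0': 0.80, 'Recall 1': 0.60},
    {'Model': 'model_b', 'Recall 0': 0.85, 'Recall 1': 0.55},
    {'Model': 'model_c', 'Recall 0': 0.80, 'Recall 1': 0.70},
]

# Sort by 'Recall 0' first, then by 'Recall 1' among ties, both descending
ranked = sorted(
    example_rows,
    key=lambda row: (row['Recall 0'], row['Recall 1']),
    reverse=True,
)
ranked_names = [row['Model'] for row in ranked]
print(ranked_names)
```

Here `model_b` comes first despite having the lowest recall on class 1, which shows that the second key only matters when the first one ties.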