<h2 style="font-family:camtasia;
          font-size:320%;
          font-weight: bold;
          color: #990033;
          text-align:center;
          margin: 0 auto;
          padding:10px; 
          border-radius:20px 20px;
          background-color: white;">
Project 1: Road Accidents Analysis
</h2>


<a id="import"></a>
# <p style="background-color: #990033; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:20px 20px;">Dataset Descriptions and Data Understanding:</p>




1. Status: The status of the accident (e.g., reported, under investigation).

2. Accident_Index: A unique identifier for each reported accident, formed by combining accident_year and accident_ref_no. It can be used to join to the Vehicle and Casualty tables.

3. Accident_Year: The year in which the accident occurred.

4. Accident_Reference: A reference number associated with the accident.

5. Vehicle_Reference: A reference number for the vehicle involved in the accident.

6. Casualty_Reference: A reference number for the casualty involved in the accident.

7. Casualty_Class: The class of the casualty (driver or rider, passenger, or pedestrian).

8. Sex_of_Casualty: The sex of the casualty (male, female, or unknown).

9. Age_of_Casualty: The age of the casualty.

10. Age_Band_of_Casualty: Age group to which the casualty belongs (e.g., 0-5, 6-10, 11-15).

11. Casualty_Severity: The severity of the casualty's injuries (fatal, serious, or slight).

12. Pedestrian_Location: The location of the pedestrian at the time of the accident.

13. Pedestrian_Movement: The movement of the pedestrian during the accident.

14. Car_Passenger: Indicates whether the casualty was a car passenger and, if so, whether in the front or rear seat.

15. Bus_or_Coach_Passenger: Indicates whether the casualty was a bus or coach passenger and, if so, whether boarding, alighting, standing, or seated.

16. Pedestrian_Road_Maintenance_Worker: Indicates whether the casualty was a road maintenance worker (yes, no, probable, or not known).

17. Casualty_Type: The type of road user the casualty was (e.g., pedestrian, cyclist, car occupant).

18. Casualty_Home_Area_Type: The type of area in which the casualty resides (e.g., urban, rural).

19. Casualty_IMD_Decile: The IMD decile of the area where the casualty resides (a measure of deprivation).

20. LSOA_of_Casualty: The Lower Layer Super Output Area (LSOA) associated with the casualty's location.
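
Since `accident_index` is described above as a join key, a merge against a vehicle table might look like the sketch below. Both frames are tiny invented stand-ins, and the `vehicle_type` column is an assumption rather than something taken from this notebook. The join also uses `vehicle_reference` so that each casualty is matched to its own vehicle.

```python
import pandas as pd

# Hypothetical miniature casualty and vehicle tables (all values invented)
casualties = pd.DataFrame({
    "accident_index": ["2022A001", "2022A001", "2022A002"],
    "vehicle_reference": [1, 2, 1],
    "casualty_reference": [1, 2, 1],
    "casualty_severity": [3, 2, 3],
})
vehicles = pd.DataFrame({
    "accident_index": ["2022A001", "2022A001", "2022A002"],
    "vehicle_reference": [1, 2, 1],
    "vehicle_type": [9, 1, 9],  # assumed column name, for illustration only
})

# Left join keeps every casualty; validate raises if a vehicle key is duplicated
merged = casualties.merge(
    vehicles,
    on=["accident_index", "vehicle_reference"],
    how="left",
    validate="many_to_one",
)
print(merged)
```

The `validate="many_to_one"` argument is a cheap guard: it fails loudly if the vehicle table contains the same key twice, which would otherwise silently duplicate casualty rows.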

<a id="contents_tabel"></a>    
<div style="border-radius: 10px; padding: 15px; background-color: #ffffff; font-size: 115%; text-align: left;">

<h3 align="left" style="color: #990033;">Table of Contents:</h3>
    
* [Step 1 | Import Libraries](#import)
* [Step 2 | Read Dataset](#read)
* [Step 3 | Dataset Overview](#overview)
    - [Step 3.1 | Dataset Basic Information](#basic)
    - [Step 3.2 | Summary Statistics for Numerical Variables](#num_statistics)
    - [Step 3.3 | Summary Statistics for Categorical Variables](#cat_statistics)
* [Step 4 | EDA](#eda)
    - [Step 4.1 | Numerical Features vs Target](#num_target)
    - [Step 4.2 | Categorical Features vs Target](#cat_target)
* [Step 5 | Data Preprocessing](#preprocessing)
    - [Step 5.1 | Check Missing Values](#missing)
    - [Step 5.2 | Encode Categorical Features](#encode)
    - [Step 5.3 | Split the Dataset](#split)
* [Step 6 | Build XGBoost Regressor](#xgb)
    - [Step 6.1 | XGBoost Base Model Definition](#xgb_base)
    - [Step 6.2 | XGBoost Hyperparameter Tuning](#xgb_hp)
    - [Step 6.3 | XGBoost Regressor Evaluation](#xgb_eval)
* [Step 7 | Build CatBoost Regressor](#ctb)
    - [Step 7.1 | CatBoost Base Model Definition](#ctb_base)
    - [Step 7.2 | CatBoost Hyperparameter Tuning](#ctb_hp)
    - [Step 7.3 | CatBoost Regressor Evaluation](#ctb_eval)
* [Step 8 | Conclusion](#conclusion)

</div>


<a id="import"></a>
# <p style="background-color:#990033; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:20px 20px;">Step 1: Import Libraries</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer

In [None]:
import warnings
warnings.filterwarnings("ignore", "is_categorical_dtype")
warnings.filterwarnings("ignore", "use_inf_as_na")
warnings.simplefilter(action='ignore', category=FutureWarning)


<a id="read"></a>
# <p style="background-color:#990033; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:20px 20px;">Step 2: Read Dataset </p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
df = pd.read_csv('dft-road-casualty-statistics-casualty-provisional-mid-year-unvalidated-2022.csv').drop_duplicates()

print(df.shape)
df.sample(5)

<a id="overview"></a>
# <p style="background-color:#990033; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:20px 20px;">Step 3: Dataset Overview</p>

⬆️ [Table of Contents](#contents_tabel)


1. Status: The status of the accident (e.g., reported, under investigation).

2. Accident_Index: A unique identifier for each reported accident, formed by combining accident_year and accident_ref_no. It can be used to join to the Vehicle and Casualty tables.

3. Accident_Year: The year in which the accident occurred.

4. Accident_Reference: A reference number associated with the accident.

5. Vehicle_Reference: A reference number for the involved vehicle in the accident.

6. Casualty_Reference: A reference number for the casualty involved in the accident.

7. Casualty_Class: Indicates the class of the casualty:<pre>
1	Driver or rider
2	Passenger
3	Pedestrian
</pre>

8. Sex_of_Casualty: The sex of the casualty.<pre>
1	Male
2	Female
9	unknown (self reported)
-1	Data missing or out of range
</pre>
    
9. Age_of_Casualty: The age of the casualty (-1 = data missing or out of range).

10. Age_Band_of_Casualty: Age group to which the casualty belongs (e.g., 0-5, 6-10, 11-15).<pre>
1	0 - 5
2	6 - 10
3	11 - 15
4	16 - 20
5	21 - 25
6	26 - 35
7	36 - 45
8	46 - 55
9	56 - 65
10	66 - 75
11	Over 75
-1	Data missing or out of range
</pre>

11. Casualty_Severity: The severity of the casualty's injuries:<pre>
1	Fatal
2	Serious
3	Slight
</pre>

12. Pedestrian_Location: The location of the pedestrian at the time of the accident.<pre>
0	Not a Pedestrian
1	Crossing on pedestrian crossing facility
2	Crossing in zig-zag approach lines
3	Crossing in zig-zag exit lines
4	Crossing elsewhere within 50m. of pedestrian crossing
5	In carriageway, crossing elsewhere
6	On footway or verge
7	On refuge, central island or central reservation
8	In centre of carriageway - not on refuge, island or central reservation
9	In carriageway, not crossing
10	Unknown or other
-1	Data missing or out of range
</pre>

13. Pedestrian_Movement: The movement of the pedestrian during the accident.<pre>
0	Not a Pedestrian
1	Crossing from driver's nearside
2	Crossing from nearside - masked by parked or stationary vehicle
3	Crossing from driver's offside
4	Crossing from offside - masked by  parked or stationary vehicle
5	In carriageway, stationary - not crossing  (standing or playing)
6	In carriageway, stationary - not crossing  (standing or playing) - masked by parked or stationary vehicle
7	Walking along in carriageway, facing traffic
8	Walking along in carriageway, back to traffic
9	Unknown or other
-1	Data missing or out of range
</pre>

14. Car_Passenger: Indicates whether the casualty was a car passenger at the time of the accident and, if so, the seating position.<pre>
0	Not car passenger
1	Front seat passenger
2	Rear seat passenger
9	unknown (self reported)
-1	Data missing or out of range
</pre>

15. Bus_or_Coach_Passenger: Indicates whether the casualty was a bus or coach passenger and, if so, what they were doing (boarding, alighting, standing, or seated).<pre>
0	Not a bus or coach passenger
1	Boarding
2	Alighting
3	Standing passenger
4	Seated passenger
9	unknown (self reported)
-1	Data missing or out of range
</pre>

16. Pedestrian_Road_Maintenance_Worker: Indicates whether the casualty was a road maintenance worker (yes, no, probable, or not known).<pre>
0	No / Not applicable
1	Yes
2	Not Known
3	Probable
-1	Data missing or out of range
</pre>

17. Casualty_Type: The type of road user the casualty was (e.g., pedestrian, cyclist, car occupant).<pre>
0	Pedestrian
1	Cyclist
2	Motorcycle 50cc and under rider or passenger
3	Motorcycle 125cc and under rider or passenger
4	Motorcycle over 125cc and up to 500cc rider or  passenger
5	Motorcycle over 500cc rider or passenger
8	Taxi/Private hire car occupant
9	Car occupant
10	Minibus (8 - 16 passenger seats) occupant
11	Bus or coach occupant (17 or more pass seats)
16	Horse rider
17	Agricultural vehicle occupant
18	Tram occupant
19	Van / Goods vehicle (3.5 tonnes mgw or under) occupant
20	Goods vehicle (over 3.5t. and under 7.5t.) occupant
21	Goods vehicle (7.5 tonnes mgw and over) occupant
22	Mobility scooter rider
23	Electric motorcycle rider or passenger
90	Other vehicle occupant
97	Motorcycle - unknown cc rider or passenger
98	Goods vehicle (unknown weight) occupant
99	Unknown vehicle type (self rep only)
-1	Data missing or out of range
</pre>

18. Casualty_Home_Area_Type: The type of area in which the casualty resides (e.g., urban, rural).<pre>
1	Urban area
2	Small town
3	Rural
-1	Data missing or out of range
</pre>

19. Casualty_IMD_Decile: The IMD decile of the area where the casualty resides (a measure of deprivation).<pre>
1	Most deprived 10%
2	More deprived 10-20%
3	More deprived 20-30%
4	More deprived 30-40%
5	More deprived 40-50%
6	Less deprived 40-50%
7	Less deprived 30-40%
8	Less deprived 20-30%
9	Less deprived 10-20%
10	Least deprived 10%
-1	Data missing or out of range
</pre>

20. LSOA_of_Casualty: The Lower Layer Super Output Area (LSOA) associated with the casualty's location.
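
The code lists above can be turned into lookup dictionaries so that plots and tables show labels instead of integers. The sketch below does this for two of the columns on a toy frame. The helper name `decode` and the choice to give every unmapped code (including -1) a single label are assumptions made for illustration.

```python
import pandas as pd

# Lookup tables copied from the code lists above
CASUALTY_CLASS = {1: "Driver or rider", 2: "Passenger", 3: "Pedestrian"}
CASUALTY_SEVERITY = {1: "Fatal", 2: "Serious", 3: "Slight"}


def decode(series, mapping, missing_label="Data missing or out of range"):
    """Map integer codes to labels; unmapped codes (including -1) get missing_label."""
    return series.map(mapping).fillna(missing_label)


# Toy frame standing in for the real data
toy = pd.DataFrame({
    "casualty_class": [1, 3, 2, -1],
    "casualty_severity": [3, 1, 2, 3],
})
toy["casualty_class_label"] = decode(toy["casualty_class"], CASUALTY_CLASS)
toy["casualty_severity_label"] = decode(toy["casualty_severity"], CASUALTY_SEVERITY)
print(toy)
```

Writing the labels to new `*_label` columns keeps the original codes available for any later numeric processing.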

<a id="basic"></a>
# <b><span style='color:#8fc265'>Step 3.1 |</span><span style='color:#990033'> Dataset Basic Information</span></b>

In [None]:
shape = df.shape
print(f'Dataset has {shape[0]} rows and {shape[1]} columns.')

In [None]:
df.info()

<a id="num_statistics"></a>
# <b><span style='color:#8fc265'>Step 3.2 |</span><span style='color:#990033'> Summary Statistics for Numerical Variables</span></b>

In [None]:
df.describe().T

<a id="cat_statistics"></a>
# <b><span style='color:#8fc265'>Step 3.3 |</span><span style='color:#990033'> Summary Statistics for Categorical Variables</span></b>

In [None]:
# Get the summary statistics for categorical variables
df.describe(include='object')

In [None]:
# Replace all occurrences of -1 with NaN
# df.replace(-1, np.nan, inplace=True)
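
The -1 sentinel matters for the summary statistics above because it is counted as a real number. The sketch below uses a toy frame to show how converting -1 to NaN changes a mean. The column selection in `sentinel_columns` is an assumption, and the replacement is limited to columns where -1 is documented as "data missing or out of range".

```python
import numpy as np
import pandas as pd

# Toy frame with -1 standing in for missing values
toy = pd.DataFrame({
    "age_of_casualty": [25, 40, -1, 31],
    "casualty_imd_decile": [3, -1, 7, 5],
})

# Treat -1 as missing only in columns where it is a documented sentinel
sentinel_columns = ["age_of_casualty", "casualty_imd_decile"]
cleaned = toy.copy()
cleaned[sentinel_columns] = cleaned[sentinel_columns].replace(-1, np.nan)

raw_mean = toy["age_of_casualty"].mean()        # pulled down by the -1 sentinel
clean_mean = cleaned["age_of_casualty"].mean()  # computed over observed ages only
print(raw_mean, clean_mean)
print(cleaned.isna().sum())
```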

<a id="eda"></a>
# <p style="background-color:#990033; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:20px 20px;">Step 4: EDA</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
# Define the columns to plot
columns_to_plot = ['casualty_class', 'sex_of_casualty', 'age_of_casualty', 
                   'age_band_of_casualty', 'casualty_severity', 'pedestrian_location', 'pedestrian_movement', 
                   'car_passenger', 'bus_or_coach_passenger', 'pedestrian_road_maintenance_worker', 
                   'casualty_type', 'casualty_home_area_type', 'casualty_imd_decile']

# Set up the subplot layout
num_rows = 5
num_cols = (len(columns_to_plot) + num_rows - 1) // num_rows
plt.figure(figsize=(20, 20))

# Create subplots
for i, col in enumerate(columns_to_plot, start=1):
    plt.subplot(num_rows, num_cols, i)
    sns.countplot(x=col, data=df)
    plt.title(col)

plt.tight_layout()
plt.show()


In [None]:
cmap = sns.color_palette("Blues", as_cmap=True)

In [None]:
# Compute the correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(18, 8))

# Plot the heatmap with annotations in the lower triangular part
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f", linewidth=.3, cmap='coolwarm', square=True)

plt.title('Correlation Matrix with Values', fontsize=25)
plt.show()


In [None]:
# Setting Style of Dataframes :
def style(df):
    formatted_df = df.applymap(lambda x: f'{x:.3f}' if isinstance(x, float) else x)
    styled_df = formatted_df.style.set_properties(**{'border': '1.3px solid blue',
                                           'color': 'darkblue',
                                           'background-color': '#C2DFFF'})
    
    return styled_df

In [None]:
corr = pd.DataFrame({'Correlation with casualty_severity by percentage': (df.corr(numeric_only=True)['casualty_severity'].sort_values(ascending=False) * 100).round(3)})
style(corr)
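
Most columns here are category codes, so a Pearson correlation on the raw integers can be hard to interpret. One possible complement, which is not part of the original analysis, is Cramér's V, computed from the chi-squared statistic of a contingency table. The sketch below implements it with pandas and NumPy only. The function name `cramers_v` and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")


# Toy data in which severity is fully determined by casualty class
toy = pd.DataFrame({
    "casualty_class": [1, 1, 3, 3, 2, 2],
    "casualty_severity": [3, 3, 2, 2, 3, 3],
})
v_dependent = cramers_v(toy["casualty_class"], toy["casualty_severity"])

# Toy data in which the two variables are independent
v_independent = cramers_v(pd.Series([0, 0, 1, 1]), pd.Series(["a", "b", "a", "b"]))
print(v_dependent, v_independent)
```

This version applies no bias correction, so on small samples it tends to overstate the association.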

In [None]:
df.head()
df.columns

In [None]:
df.nunique()

In [None]:
# Drop unnecessary columns
# There is only one value for status and accident_year columns.
df = df.drop(['status', 'accident_year'], axis=1)
df = df.drop(['accident_index', 'accident_reference'], axis=1)

df.shape

In [None]:
df.columns

In [None]:
# Mapping of casualty severity codes to descriptions
severity_mapping = {1: 'Fatal', 2: 'Serious', 3: 'Slight'}

# Map the 'casualty_severity' column using the provided descriptions
df['casualty_severity_label'] = df['casualty_severity'].map(severity_mapping)
df.columns

In [None]:

# Plot the distribution of accident severities
plt.figure(figsize=(8, 6))
sns.countplot(x='casualty_severity_label', data=df, order=['Fatal', 'Serious', 'Slight'])  # Specify order of bars
plt.title('Distribution of Accident Severities')
plt.xlabel('Casualty Severity')
plt.ylabel('Count')
plt.show()


In [None]:
# Plot the relationship between casualty severity and age of casualty
plt.figure(figsize=(10, 8))
sns.boxplot(x='casualty_severity_label', y='age_of_casualty', data=df,
            order=['Fatal', 'Serious', 'Slight'])  # fix the order of the boxes
plt.title('Casualty Severity by Age of Casualty')
plt.xlabel('Casualty Severity')
plt.ylabel('Age of Casualty')
plt.show()


In [None]:
# Map numerical codes to meaningful labels
sex_mapping = {1: 'Male', 2: 'Female', 9: 'Unknown (self reported)', -1: 'Missing'}

# Replace numerical codes with corresponding labels
df['sex_of_casualty_label'] = df['sex_of_casualty'].map(sex_mapping)


In [None]:
df.columns

In [None]:
# Plot the relationship between casualty severity and gender
plt.figure(figsize=(8, 6))
sns.countplot(x='casualty_severity', hue='sex_of_casualty_label', data=df)
plt.title('Casualty Severity by Gender')
plt.xlabel('Casualty Severity')
plt.ylabel('Count')
plt.legend(title='Sex of Casualty')
plt.show()

In [None]:

# Plot the relationship between casualty severity and gender
plt.figure(figsize=(8, 6))
sns.countplot(x='casualty_severity_label',
              hue='sex_of_casualty_label',
              data=df,
              order=['Fatal', 'Serious', 'Slight'],
            )
plt.title('Casualty Severity by Gender')
plt.xlabel('Casualty Severity')
plt.ylabel('Count')
plt.legend(title='Sex of Casualty')
plt.show()


In [None]:
# Map numerical codes to the labels listed in the Dataset Overview
casualty_home_area_type_mapping = {1: 'Urban area', 2: 'Small town', 3: 'Rural', -1: 'Missing'}

# Replace numerical codes with corresponding labels
# (this overwrites the column, so the cell should be run only once)
df['casualty_home_area_type'] = df['casualty_home_area_type'].map(casualty_home_area_type_mapping)


In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='casualty_home_area_type', data=df)
plt.title('Countplot of Casualty Home Area Type')
plt.xlabel('Casualty Home Area Type')
plt.ylabel('Count')
plt.show()

In [None]:


plt.figure(figsize=(8, 6))
sns.countplot(x='casualty_severity_label',
              hue='casualty_home_area_type',
              data=df,
              order=['Fatal', 'Serious', 'Slight'],
)
plt.title('Casualty Severity by Home Area Type')
plt.xlabel('Casualty Severity')
plt.ylabel('Count')
plt.legend(title='Home Area Type')
plt.show()

## 2. Data Cleaning

In [None]:
nulls = df.isnull().sum()
nulls

In [None]:
if sum(nulls) == 0:
    print('There is no Null data!')
else:
    print('There are Null data!')


In [None]:
print(df.columns)
print(df.shape)

In [None]:
# (df.accident_index == '2022' + df.accident_reference).sum()
# # df.accident_index is equal to '2022' + df.accident_reference
# df = df.drop(['accident_index'], axis=1)
# df.shape
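
The commented-out check above compares `accident_index` with the year followed by `accident_reference`. A row-by-row version of that check is sketched below on invented values. If pandas parsed `accident_reference` as an integer, leading zeros would be lost, so reading that column as a string (for example with `dtype=str`) may be needed; that is an assumption about the file, not something verified here.

```python
import pandas as pd

# Invented rows; the third index deliberately does not match its year
toy = pd.DataFrame({
    "accident_index": ["2022010352073", "2022010352573", "2021010355191"],
    "accident_year": [2022, 2022, 2022],
    "accident_reference": ["010352073", "010352573", "010355191"],
})

# Rebuild the index from its two parts and compare row by row
rebuilt = toy["accident_year"].astype(str) + toy["accident_reference"].astype(str)
matches = toy["accident_index"] == rebuilt

print(f"{matches.sum()} of {len(toy)} rows match")
print(toy.loc[~matches])  # rows where the index is not year + reference
```

Dropping `accident_index` as redundant is only safe when every row matches.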

In [None]:
def value_counts(df, column):
    # Calculate the frequency of each value in the given column
    frequency = df[column].value_counts()
    
    # Print the frequency of every value
    print(f"Frequency of each value for DataFrame['{column}']: \n{frequency}")

In [None]:

# def plot_countplot(df, column):
#     """
#     Plot a countplot for the specified column in the DataFrame.

#     Parameters:
#         df (DataFrame): The DataFrame containing the data.
#         column (str): The name of the column to plot.

#     Returns:
#         None
#     """
#     plt.figure(figsize=(10, 6))
#     sns.countplot(x=column, data=df)
#     plt.title(f'Countplot of {column}')
#     plt.xlabel(column)
#     plt.ylabel('Count')
#     plt.show()


In [None]:
def remove_outliers(df, column, values_to_remove):
    """
    Remove rows from a DataFrame where the specified column contains specified values.

    Args:
        df (DataFrame): The pandas DataFrame.
        column (str): The column name to check for outliers.
        values_to_remove (list): A list of values to remove from the specified column.

    Returns:
        DataFrame: The DataFrame with outliers removed.
    """
    # Create a boolean mask to identify rows with the specified values to remove
    mask = df[column].isin(values_to_remove)
    
    # Use the mask to filter out rows with the specified values and create a new DataFrame
    filtered_df = df[~mask]
    
    # Optionally, you can modify the original DataFrame in place by uncommenting the following line:
    # df.drop(df[mask].index, inplace=True)
    
    return filtered_df

In [None]:
def count_plot_df(df, column, label=None):
    # Set the style of the seaborn plot
    sns.set_style("whitegrid")
    
    # Create the count plot
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=column, palette="viridis")
    
    # Rotate x-axis labels for better readability (if needed)
    plt.xticks(rotation=45)
    
    # Add labels and title (fall back to the column name when no label is given)
    label = label if label is not None else column
    plt.xlabel(label)
    plt.ylabel("Frequency")
    plt.title(f"Frequency Plot of {label}")
        
    
    # Show the plot
    plt.tight_layout()
    plt.show()


In [None]:
df['vehicle_reference'].unique()

In [None]:
value_counts(df, 'vehicle_reference')

In [None]:
count_plot_df(df, 'vehicle_reference')

In [None]:
vehicle_reference_to_remove = [5, 6, 7, 8, 9, 61, 227]
df = remove_outliers(df, "vehicle_reference", vehicle_reference_to_remove)
df.shape

In [None]:
value_counts(df, 'casualty_reference')
count_plot_df(df, 'casualty_reference', label="Casualty Reference")

In [None]:
# Values to remove
casualty_reference_to_remove = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 22, 148]

df = remove_outliers(df, "casualty_reference", casualty_reference_to_remove)
df.shape

In [None]:
# Calculate the frequency of each value in the "casualty_class" column
value_counts(df, 'casualty_class')

In [None]:
df.shape


In [None]:
# Calculate the frequency of each value in the "casualty_type" column
value_counts(df, 'casualty_type')

In [None]:
# Calculate the percentage of each casualty type
casualty_type_counts = df['casualty_type'].value_counts(normalize=True) * 100
casualty_type_counts
# Find casualty types that occur less than 1% of all data
rare_types = casualty_type_counts[casualty_type_counts < 1].index
rare_types
# Filter out rows with rare casualty types
df_filtered = df[~df['casualty_type'].isin(rare_types)]

# Print the shape of the filtered DataFrame to see how many rows were removed
print("Original DataFrame shape:", df.shape)
print("Filtered DataFrame shape:", df_filtered.shape)


In [None]:
df = df_filtered
df.shape
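
The rare-category filter above is repeated later for another column, so it could be wrapped in a small helper. The sketch below is one way to do that. The name `drop_rare_categories` and its `min_pct` parameter are hypothetical, and the toy counts are invented.

```python
import pandas as pd


def drop_rare_categories(frame, column, min_pct=1.0):
    """Keep rows whose value in `column` makes up at least `min_pct` percent of all rows."""
    share = frame[column].value_counts(normalize=True) * 100
    keep = share[share >= min_pct].index
    return frame[frame[column].isin(keep)]


# 200 toy rows: code 16 appears once (0.5%), so it falls under the 1% threshold
toy = pd.DataFrame({"casualty_type": [9] * 150 + [0] * 49 + [16]})
filtered = drop_rare_categories(toy, "casualty_type", min_pct=1.0)
print("Original shape:", toy.shape)
print("Filtered shape:", filtered.shape)
```

Returning a new frame instead of reassigning `df` inside the helper makes it easier to compare shapes before committing to the filter.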

In [None]:
# # Values to remove
# casualty_type_to_remove = [-1]

# df = remove_outliers(df, 'casualty_type', casualty_type_to_remove)
# df.shape

In [None]:
# Calculate the frequency of each value in the "sex_of_casualty" column
value_counts(df, 'sex_of_casualty')

In [None]:
# Values to remove
sex_of_casualty_to_remove = [-1, 9]

df = remove_outliers(df, 'sex_of_casualty', sex_of_casualty_to_remove)
df.shape

In [None]:
# Calculate the frequency of each value in the "age_of_casualty" column
value_counts(df, 'age_of_casualty')

In [None]:
# Calculate the percentage of each age value
age_of_casualty_counts = df['age_of_casualty'].value_counts(normalize=True) * 100
age_of_casualty_counts
# Find ages that account for less than 1% of all data
rare_ages = age_of_casualty_counts[age_of_casualty_counts < 1].index
rare_ages
# Filter out rows with rare ages
df_filtered = df[~df['age_of_casualty'].isin(rare_ages)]

# Print the shape of the filtered DataFrame to see how many rows were removed
print("Original DataFrame shape:", df.shape)
print("Filtered DataFrame shape:", df_filtered.shape)


In [None]:
count_plot_df(df, 'age_of_casualty', "Age of Casualty")

In [None]:
age_of_casualty_to_remove = [-1]

df = remove_outliers(df, 'age_of_casualty', age_of_casualty_to_remove)
df.shape

In [None]:
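# df_filtered was built before the -1 ages were removed, so drop them here as well
df = df_filtered[df_filtered['age_of_casualty'] != -1]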
df.shape

In [None]:
# Calculate the frequency of each value in the "age_band_of_casualty" column
value_counts(df, 'age_band_of_casualty')
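
Because `age_band_of_casualty` is derived from `age_of_casualty`, the two columns can be checked against each other. The sketch below rebuilds the band code from the age using the band boundaries listed in the Dataset Overview. The helper name `age_to_band` and the sample ages are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges for bands 1-11 (0-5, 6-10, ..., over 75); intervals are right-closed
BAND_EDGES = [-1, 5, 10, 15, 20, 25, 35, 45, 55, 65, 75, np.inf]


def age_to_band(age):
    """Derive the age-band code from an age; ages outside the bins (e.g. -1) map to -1."""
    codes = pd.cut(age, bins=BAND_EDGES, labels=False)  # 0-based bin index, NaN if out of range
    return (codes + 1).fillna(-1).astype(int)


ages = pd.Series([0, 5, 6, 20, 21, 76, -1], name="age_of_casualty")
bands = age_to_band(ages)
print(bands.tolist())
```

On the real data, a comparison such as `(age_to_band(df['age_of_casualty']) == df['age_band_of_casualty']).all()` would show whether the two columns agree.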

In [None]:
# Calculate the frequency of each value in the "casualty_severity" column
value_counts(df, 'casualty_severity')

In [None]:
df.nunique()

In [None]:
df.isnull().sum()

In [None]:
# # Replace all occurrences of -1 with NaN
# df.replace(-1, np.nan, inplace=True)

In [None]:
df.columns


In [None]:
# Calculate the percentage of each casualty severity within each IMD decile category
severity_imd_percentage = df.groupby(['casualty_imd_decile', 'casualty_severity']).size() / df.groupby('casualty_imd_decile').size() * 100
severity_imd_percentage = severity_imd_percentage.reset_index(name='Percentage')
severity_imd_percentage
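
The same within-decile percentages can also be obtained from a row-normalised crosstab, which is sketched below on toy data. Unlike the groupby version, the crosstab includes severity levels that never occur in a decile, with a value of 0. The variable names `pct` and `long_pct` are illustrative.

```python
import pandas as pd

# Toy data: four casualties in decile 1, two in decile 2
toy = pd.DataFrame({
    "casualty_imd_decile": [1, 1, 1, 1, 2, 2],
    "casualty_severity": [3, 3, 3, 2, 3, 1],
})

# Row-normalised crosstab: the severities within each IMD decile sum to 100%
pct = pd.crosstab(
    toy["casualty_imd_decile"], toy["casualty_severity"], normalize="index"
) * 100
print(pct.round(1))

# Long format, convenient for a grouped bar plot with hue='casualty_severity'
long_pct = pct.stack().rename("Percentage").reset_index()
print(long_pct)
```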


In [None]:
print(severity_imd_percentage.dtypes)


In [None]:
severity_imd_percentage['casualty_severity'] = severity_imd_percentage['casualty_severity'].astype(str)


In [None]:
# Plotting the relationship between casualty severity and IMD decile
plt.figure(figsize=(10, 6))
sns.barplot(x='casualty_imd_decile', y='Percentage', hue='casualty_severity', data=severity_imd_percentage)
plt.title('Percentage of Casualty Severity by IMD Decile of Casualty Home Postcode')
plt.xlabel('IMD Decile')
plt.ylabel('Percentage')
plt.legend(title='Casualty Severity', loc='upper right')
plt.show()

In [None]:

# Plot the relationship between casualties by road user type and IMD decile
plt.figure(figsize=(12, 8))
sns.countplot(x='casualty_type', hue='casualty_imd_decile', data=df)
plt.title('Casualties by Road User Type and IMD Decile')
plt.xlabel('Road User Type')
plt.ylabel('Number of Casualties')
plt.legend(title='IMD Decile')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()


In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# # Calculate the percentage of each road user type within each IMD decile category
# road_user_imd_percentage = df.groupby(['casualty_imd_decile', 'casualty_type']).size() / df.groupby('casualty_imd_decile').size() * 100
# road_user_imd_percentage = road_user_imd_percentage.reset_index(name='Percentage')

# # Plot the relationship between road user type and IMD decile
# plt.figure(figsize=(12, 8))
# sns.barplot(x='casualty_imd_decile', y='Percentage', hue='casualty_type', data=road_user_imd_percentage)
# plt.title('Casualties by Road User Type and IMD Decile')
# plt.xlabel('IMD Decile')
# plt.ylabel('Percentage of Casualties')
# plt.legend(title='Road User Type', loc='upper right')
# plt.show()


In [None]:
# Plot the relationship between casualty severity and the IMD decile
plt.figure(figsize=(10, 8))
sns.boxplot(x='casualty_severity', y='casualty_imd_decile', data=df)
# sns.boxplot(x='casualty_imd_decile', y='casualty_severity',  data=df)
plt.title('Casualty Severity by IMD Decile')
plt.xlabel('Casualty Severity')
plt.ylabel('IMD Decile')
plt.show()

In [None]:
# imputer = KNNImputer(n_neighbors=3)
# df = imputer.fit_transform(df)
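
If the commented-out imputation were enabled, two details would need attention: `KNNImputer` only fills NaN (not -1), and `fit_transform` returns a NumPy array rather than a DataFrame. The sketch below handles both on a toy frame. The choice of columns is an assumption, and KNN averaging suits numeric or ordinal columns better than nominal codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: the third casualty has a missing age, coded as -1
toy = pd.DataFrame({
    "age_of_casualty": [20, 22, -1, 60, 62],
    "casualty_imd_decile": [2, 2, 2, 9, 9],
})

numeric_cols = ["age_of_casualty", "casualty_imd_decile"]
features = toy[numeric_cols].replace(-1, np.nan)  # KNNImputer only fills NaN

# Wrap the returned array back into a DataFrame to keep column names and index
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features), columns=numeric_cols, index=toy.index
)
print(imputed)
```

Here the missing age is filled from the two rows with the closest IMD decile, so the result depends heavily on which columns are passed to the imputer.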

In [None]:
# Calculate the frequency of each value in the "pedestrian_location" column
value_counts(df, 'pedestrian_location')

In [None]:
# Calculate the frequency of each value in the "pedestrian_movement" column
value_counts(df, 'pedestrian_movement')

In [None]:
# Calculate the frequency of each value in the "car_passenger" column
value_counts(df, 'car_passenger')

In [None]:
# Calculate the frequency of each value in the "bus_or_coach_passenger" column
value_counts(df, 'bus_or_coach_passenger')

In [None]:
# Calculate the frequency of each value in the "pedestrian_road_maintenance_worker" column
value_counts(df, 'pedestrian_road_maintenance_worker')

In [None]:
# Calculate the frequency of each value in the "casualty_type" column
value_counts(df, 'casualty_type')

In [None]:
# Calculate the frequency of each value in the "casualty_home_area_type" column
value_counts(df, 'casualty_home_area_type')

In [None]:
# Calculate the frequency of each value in the "casualty_imd_decile" column
value_counts(df, 'casualty_imd_decile')

In [None]:
df.casualty_imd_decile.nunique()

In [None]:
# Calculate the frequency of each value in the "lsoa_of_casualty" column
value_counts(df, 'lsoa_of_casualty')

In [None]:
df.lsoa_of_casualty.nunique()

In [None]:
(df['lsoa_of_casualty'].value_counts()>10).sum()
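
With this many distinct LSOA codes, one common way to keep the column usable is to collapse infrequent codes into a single bucket. The sketch below shows that idea. The helper `collapse_rare`, the `"Other"` label, and the threshold are assumptions, and the LSOA codes are invented placeholders.

```python
import pandas as pd


def collapse_rare(series, min_count=10, other_label="Other"):
    """Replace values seen fewer than `min_count` times with `other_label`."""
    counts = series.map(series.value_counts())  # per-row frequency of that row's value
    return series.where(counts >= min_count, other_label)


# Invented codes: one frequent LSOA, one rare LSOA, and the "-1" missing marker
lsoa = pd.Series(
    ["E01000001"] * 12 + ["E01000002"] * 3 + ["-1"] * 2, name="lsoa_of_casualty"
)
collapsed = collapse_rare(lsoa, min_count=10)
print(collapsed.value_counts())
```

Note that this sketch keeps values seen at least `min_count` times, whereas the count above uses a strict `> 10` comparison.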

In [None]:
# count_plot_df(df, 'lsoa_of_casualty')

In [None]:
df.head(10)

In [None]:
shape = df.shape
print(f'Dataset has {shape[0]} rows and {shape[1]} columns.')
