## <font color='Blue'> Road Accident.</font>

![](https://cdn-hinbp.nitrocdn.com/bNGDGkJzqqclpSTwXkdOJqLhhVktqhvt/assets/images/optimized/rev-dea4b79/www.kraftlaw.com/wp-content/uploads/2021/10/types-of-car-accidents.jpg)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Importing Libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from yellowbrick.classifier import ConfusionMatrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from matplotlib.ticker import ScalarFormatter

## <font color='blue'> 1. Importing Dataset. </font>

#### Here we read the dataset into a DataFrame named `df`.

In [None]:
df = pd.read_csv("/Users/riteshkumar/Downloads/ML projects/Road Accident/accident.csv")

#### We can see that we have categorical and continuous variables. The dataset contains the following features:
- Age: the person's age
- Gender: the person's gender
- Speed_of_Impact: the speed at the moment of impact
- Helmet_Used: whether a helmet was used
- Seatbelt_Used: whether a seat belt was used
- Survived: whether the person survived

In [None]:
pd.set_option('display.max_columns', None)
df.head(5)

In [None]:
unique_counts = df.nunique()
print(unique_counts)

In [None]:
df.info()

In [None]:
df.dtypes

#### Here, taking a first look at our continuous data, we can see that we have people ranging from the youngest to the oldest in our dataset, as well as a wide variety in our impact speed variable.

In [None]:
df.describe()

#### As we saw above and can confirm below, we have 3 null values. Since they are very few, I will choose to remove these null values to proceed with the study.

In [None]:
df.isnull().sum()/len(df)

In [None]:
df = df.dropna()
df.isnull().sum()/len(df)
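
The cells above print the fraction of nulls per column, while the text talks about a count of 3. A minimal sketch of how both views can be obtained, using a small made-up frame (`demo` and its values are hypothetical stand-ins, not rows from `accident.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for accident.csv with a few missing values.
demo = pd.DataFrame({
    "Age": [25, 40, np.nan, 61],
    "Gender": ["Female", None, "Male", "Female"],
    "Speed_of_Impact": [30.0, 85.0, 110.0, np.nan],
    "Survived": [1, 0, 1, 0],
})

null_counts = demo.isnull().sum()                 # absolute number of nulls per column
rows_with_null = demo.isnull().any(axis=1).sum()  # rows that dropna() would remove
clean = demo.dropna()

print(null_counts)
print(f"Rows with at least one null: {rows_with_null} of {len(demo)}")
print(f"Rows kept after dropna: {len(clean)}")
```

Counting affected rows before calling `dropna()` makes it explicit how much of a small dataset is being discarded.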

## <font color='blue'> 2. Data Analysis. </font>

#### Categorical Variables:
#### Looking at our categorical variables, we can start to analyze the population of our dataset and how our data is distributed. Here, we can see that we have a predominantly female population. In our dataset, there is a predominance of people using helmets/seat belts (although we still have about 45% who don’t use them, which is quite alarming). When we look at our target variable, we can see that it is well balanced, with almost 50% of the data in each class.

In [None]:
def add_percentage(ax, total):
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(percentage, (x, y), ha='center', va='bottom', fontsize=15)

def add_mean_line(ax, total):
    heights = [p.get_height() for p in ax.patches]
    mean_count = sum(heights) / len(heights)  

    ax.axhline(mean_count, color='red', linestyle='--', label=f'Mean: {mean_count:.1f}')
    
    ax.legend()

def add_count(ax):
    for p in ax.patches:
        count = int(p.get_height())
        x = p.get_x() + p.get_width() / 2
        y = p.get_height() / 2  
        ax.annotate(f'{count}', (x, y), ha='center', va='center', fontsize=15)

plt.figure(figsize=(25, 13))

plt.subplot(2, 2, 1)
plt.gca().set_title('Variable Gender')
ax1 = sns.countplot(x='Gender', palette='Set2', data=df)
add_percentage(ax1, len(df['Gender']))
add_mean_line(ax1, len(df['Gender']))
add_count(ax1)

plt.subplot(2, 2, 2)
plt.gca().set_title('Variable Helmet_Used')
ax2 = sns.countplot(x='Helmet_Used', palette='Set2', data=df)
add_percentage(ax2, len(df['Helmet_Used']))
add_mean_line(ax2, len(df['Helmet_Used']))
add_count(ax2)

plt.subplot(2, 2, 3)
plt.gca().set_title('Variable Seatbelt_Used')
ax3 = sns.countplot(x='Seatbelt_Used',  palette='Set2', data=df)
add_percentage(ax3, len(df['Seatbelt_Used']))
add_mean_line(ax3, len(df['Seatbelt_Used']))
add_count(ax3)

plt.subplot(2, 2, 4)
plt.gca().set_title('Variable Survived')
ax4 = sns.countplot(x='Survived', palette='Set2', data=df)
add_percentage(ax4, len(df['Survived']))
add_mean_line(ax4, len(df['Survived']))
add_count(ax4)

plt.tight_layout()
plt.show()

#### Continuous Variables:
#### When we look at our continuous variables, we can confirm that our dataset is well distributed, despite being a small dataset. When we look at the first variable, which is age, we can see that the data is well distributed between 18 and 70 years, with some variations at certain ages and a peak around 45 years, but it is still a very well-distributed variable.

#### The speed of impact variable is less evenly distributed, although every speed range is still represented. The histogram peaks around 120, which is the highest speed in the dataset, but we have examples of accidents at all speeds.

In [None]:
from matplotlib.ticker import ScalarFormatter

plt.figure(figsize=(25, 13))
sns.set(color_codes=True)


def create_histogram(data, subplot_position):
    ax = plt.subplot(2, 1, subplot_position) 
    sns.histplot(data, kde=False, ax=ax)
    
    # Set the Y-axis format
    ax.yaxis.set_major_formatter(ScalarFormatter())
    ax.ticklabel_format(useOffset=False) 

# Create the histograms
create_histogram(df['Age'], 1)
create_histogram(df['Speed_of_Impact'], 2)

plt.tight_layout()
plt.show()

#### Looking at the boxplots, we can verify that we don't have any outliers (which is great, as this means there is no need for treatment, given that we don’t have a large dataset). Looking at our age variable, we can confirm what we saw in our histogram. Our average age is 43 years, with the youngest person being 18 years old and the oldest being 69. We can also see that 50% of our data is between 18 and 44 years, and the other 50% is between 44 and 69 years.

#### Looking at our other continuous variable, which is the accident speed, we can see that the average is 70 km/h, but we have accidents at all speeds. We can also see that 50% of our accidents are below 71 km/h, which is a number that surprised me a lot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(20, 10))  

data_1 = df["Age"]
quartiles_1 = np.percentile(data_1, [25, 50, 75])
min_1, max_1, mean_1 = data_1.min(), data_1.max(), data_1.mean()

axes[0].set_title("Boxplot Age", fontdict={'fontsize': 20}) 
sns.boxplot(x=data_1, ax=axes[0], boxprops=dict(color='#4C72B0'), medianprops=dict(color='#56B4E9'))

axes[0].text(quartiles_1[0], 0.80, f"Q1: {quartiles_1[0]:.2f}", ha='center', va='bottom', fontsize=12, color='#4C72B0')
axes[0].text(quartiles_1[1], 0.80, f"Median: {quartiles_1[1]:.2f}", ha='center', va='bottom', fontsize=12, color='#56B4E9')
axes[0].text(quartiles_1[2], 0.80, f"Q3: {quartiles_1[2]:.2f}", ha='center', va='bottom', fontsize=12, color='#4C72B0')
axes[0].text(min_1, 1.1, f"Min: {min_1:.2f}", ha='center', va='bottom', fontsize=12, color='green')
axes[0].text(max_1, 1.1, f"Max: {max_1:.2f}", ha='center', va='bottom', fontsize=12, color='red')
axes[0].text(mean_1, 1.1, f"Mean: {mean_1:.2f}", ha='center', va='bottom', fontsize=12, color='purple')

data_2 = df["Speed_of_Impact"]
quartiles_2 = np.percentile(data_2, [25, 50, 75])
min_2, max_2, mean_2 = data_2.min(), data_2.max(), data_2.mean()

axes[1].set_title("Boxplot Speed_of_Impact", fontdict={'fontsize': 20})  
sns.boxplot(x=data_2, ax=axes[1], boxprops=dict(color='#4C72B0'), medianprops=dict(color='#56B4E9'))

# Add statistics annotations for Speed_of_Impact
axes[1].text(quartiles_2[0], 0.80, f"Q1: {quartiles_2[0]:.2f}", ha='center', va='bottom', fontsize=12, color='#4C72B0')
axes[1].text(quartiles_2[1], 0.80, f"Median: {quartiles_2[1]:.2f}", ha='center', va='bottom', fontsize=12, color='#56B4E9')
axes[1].text(quartiles_2[2], 0.80, f"Q3: {quartiles_2[2]:.2f}", ha='center', va='bottom', fontsize=12, color='#4C72B0')
axes[1].text(min_2, 1.1, f"Min: {min_2:.2f}", ha='center', va='bottom', fontsize=12, color='green')
axes[1].text(max_2, 1.1, f"Max: {max_2:.2f}", ha='center', va='bottom', fontsize=12, color='red')
axes[1].text(mean_2, 1.1, f"Mean: {mean_2:.2f}", ha='center', va='bottom', fontsize=12, color='purple')

plt.tight_layout()
plt.show()
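
The "no outliers" reading of the boxplots can also be checked numerically. Boxplot whiskers conventionally use the 1.5 × IQR rule, and the sketch below applies that rule directly. The helper name `iqr_outliers` and the sample speeds are assumptions made for illustration, not values from the dataset:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up speeds; the last value is deliberately extreme.
speeds = pd.Series([20, 45, 60, 70, 71, 80, 95, 120, 400], name="Speed_of_Impact")

print(iqr_outliers(speeds))            # only the extreme value is flagged
print(iqr_outliers(speeds.iloc[:-1]))  # empty once that value is removed
```

On the real data, an empty result for `df['Age']` and `df['Speed_of_Impact']` would confirm what the boxplots suggest.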

#### Bivariate Analysis.

#### Let’s start our bivariate analysis by comparing our categorical variables with our target variable (Survived). When we look at the first variable, Gender, we can observe that a female person has a higher probability of dying than a male person (which I thought would be the complete opposite). We don't have information on whether the person was driving or just riding in the car, so all we can say is that, in this dataset, women are more likely to die in an accident.

#### One thing that really surprised me when looking at the categorical variables was the 'Helmet_Used' variable, where people who didn’t use a helmet had a higher chance of surviving than those who did. As I mentioned before, we don’t have much information to complement this, but it is an interesting finding.

#### Looking at the last variable, which is about seat belt usage, we see the behavior we would expect: people who don’t use a seat belt have a higher probability of not surviving, while people who do use one are more likely to survive.

In [None]:
def add_percentage_and_count_by_group(ax, data, x, hue):
    counts = data.groupby([x, hue]).size().unstack(fill_value=0)
    
    percentages = counts.apply(lambda c: c / c.sum() * 100, axis=1)
    
    hue_order = ax.legend_.get_texts()
    hue_labels = [t.get_text() for t in hue_order]
    
    for i, c in enumerate(ax.containers):
        labels = []
        for j, v in enumerate(c):
            height = v.get_height()
            if height > 0:
                try:
                    percentage = percentages.iloc[j, percentages.columns.get_loc(hue_labels[i])]
                except KeyError:
                    percentage = percentages.iloc[j, i]
                
                count = int(v.get_height())  
                labels.append(f'{count}\n({percentage:.1f}%)')
            else:
                labels.append('')
        
        ax.bar_label(c, labels=labels, label_type='center', fontsize=12)

    plt.tight_layout()
    ax.set_ylim(0, ax.get_ylim()[1] * 1.1)

plt.figure(figsize=(20, 15))
plt.suptitle("Analysis Of Variable target (Survived)", fontweight="bold", fontsize=20)

plt.subplot(2, 2, 1)
plt.gca().set_title('Variable Gender')
ax1 = sns.countplot(x='Gender', hue='Survived', palette='Set2', data=df)
add_percentage_and_count_by_group(ax1, df, 'Gender', 'Survived')

plt.subplot(2, 2, 2)
plt.gca().set_title('Variable Helmet_Used')
ax2 = sns.countplot(x='Helmet_Used', hue='Survived', palette='Set2', data=df)
add_percentage_and_count_by_group(ax2, df, 'Helmet_Used', 'Survived')

plt.subplot(2, 2, 3)
plt.gca().set_title('Variable Seatbelt_Used')
ax3 = sns.countplot(x='Seatbelt_Used', hue='Survived', palette='Set2', data=df)
add_percentage_and_count_by_group(ax3, df, 'Seatbelt_Used', 'Survived')
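
With around 200 rows, small differences in survival rates between groups can easily arise by chance. One hedged way to check this is a chi-square test of independence on the contingency table. The sketch below uses made-up counts (the `demo` frame is hypothetical, not the real data) and `scipy.stats.chi2_contingency`:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up example: 50 people with a helmet, 50 without.
demo = pd.DataFrame({
    "Helmet_Used": ["Yes"] * 50 + ["No"] * 50,
    "Survived": [1] * 24 + [0] * 26 + [1] * 27 + [0] * 23,
})

# Raw counts and row-wise survival rates per group.
counts = pd.crosstab(demo["Helmet_Used"], demo["Survived"])
rates = pd.crosstab(demo["Helmet_Used"], demo["Survived"], normalize="index")

# Null hypothesis: helmet usage and survival are independent.
chi2_stat, p_value, dof, expected = chi2_contingency(counts)

print(rates.round(2))
print(f"chi2={chi2_stat:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value means the observed gap is compatible with pure chance, so a surprising pattern such as the helmet result should not be over-interpreted.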

#### Moving on to our continuous variables, we can also draw some insights. Although people of all ages appear among those who did not survive, the first quartile (Q1, the value below which 25% of the data falls) of the age variable differs between the groups: it is 32 for people who survived, compared to 28 for those who did not. In this dataset, fatal accidents therefore lean slightly towards younger people and are somewhat less common above 30 years old, which makes a lot of sense.

#### When we look at the impact speed variable, we don’t have a well-defined pattern, which is quite strange because we usually tend to think that accidents involving higher speeds would have a higher chance of being fatal.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def add_quartile_labels_by_hue(ax, x, hue, data):
    # Compute the quartiles for each 'hue' group
    quartiles = data.groupby(hue)[x].quantile([0.25, 0.5, 0.75]).unstack()

    # Get the boxplot positions on the x-axis
    positions = ax.get_xticks()  # Boxplot positions
    hue_values = data[hue].unique()  # Unique values of 'hue'

    # Reverse the order of the hue classes (class 1 first, then class 0)
    hue_values = hue_values[::-1]

    # Make sure the 'hue' values match the boxplots correctly
    for i, pos in enumerate(positions):
        # Take the 'hue' class that corresponds to this boxplot position
        preference = hue_values[i]
        
        # Get the quartiles (Q1, Q2, Q3) for this group
        Q1 = quartiles.loc[preference, 0.25]
        Q2 = quartiles.loc[preference, 0.5]  # Median
        Q3 = quartiles.loc[preference, 0.75]

        # Add the quartile annotations (Q1, median and Q3) at the correct position
        ax.text(pos, Q1, f'Q1: {Q1:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='blue')
        ax.text(pos, Q2, f'Median: {Q2:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='green')
        ax.text(pos, Q3, f'Q3: {Q3:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='red')


plt.figure(figsize=(25, 13))
plt.suptitle("Analysis Of Variable target (Survived)", fontweight="bold", fontsize=20)

plt.subplot(2, 3, 1)
ax1 = sns.boxplot(x='Survived', y='Age', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('Boxplot of Age vs Survived')
add_quartile_labels_by_hue(ax1, 'Age', 'Survived', df)

plt.subplot(2, 3, 2)
ax2 = sns.violinplot(x='Survived', y='Age', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('Violin plot of Age vs Survived')

plt.subplot(2, 3, 3)
ax3 = sns.stripplot(x='Survived', y='Age', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('stripplot of Age vs Survived')

plt.subplot(2, 3, 4)
ax4 = sns.boxplot(x='Survived', y='Speed_of_Impact', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('Boxplot of Speed_of_Impact vs Survived')
add_quartile_labels_by_hue(ax4, 'Speed_of_Impact', 'Survived', df)

plt.subplot(2, 3, 5)
ax5 = sns.violinplot(x='Survived', y='Speed_of_Impact', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('Violin plot of Speed_of_Impact vs Survived')

plt.subplot(2, 3, 6)
ax6 = sns.stripplot(x='Survived', y='Speed_of_Impact', data=df, hue="Survived", palette='Set3', hue_order=[1, 0])
plt.title('stripplot of Speed_of_Impact vs Survived')
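
The quartile labels on the boxplots depend on matching each group to the right x position, which is easy to get wrong. A plain table of quartiles per group is a simple cross-check. This is a minimal sketch on made-up ages (`demo` is a hypothetical stand-in for `df`):

```python
import pandas as pd

# Made-up ages for the two outcome groups.
demo = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [20, 30, 40, 50, 30, 40, 50, 60],
})

# One row per group, one column per quartile.
quartiles = (
    demo.groupby("Survived")["Age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "Median", "Q3"]

print(quartiles)
```

Because the table is indexed by the group value itself, it does not rely on the order in which the plotting library draws the boxes.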

#### We compared the age of the people in our dataset with the impact speed to check whether younger people are more likely to drive faster, but we couldn’t find any pattern between age and impact speed.

In [None]:
axes = []

axes.append(plt.subplot(1, 1, 1))
sns.scatterplot(x='Age', y='Speed_of_Impact', data=df)
plt.title('Scatterplot of Age vs Speed_of_Impact')

for ax in axes:
    ax.ticklabel_format(style='plain', axis='both')

plt.tight_layout(rect=[0, 0, 1, 0.96])  
plt.show()
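
The visual impression from the scatterplot can be backed by a correlation coefficient. The sketch below uses randomly generated, independent columns (an assumption made purely for illustration) to show what "no pattern" looks like numerically:

```python
import numpy as np
import pandas as pd

# Independent random columns, so no relationship is expected.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Speed_of_Impact": rng.integers(20, 120, size=200),
})

# Pearson measures linear association, Spearman measures monotonic association.
pearson = demo["Age"].corr(demo["Speed_of_Impact"], method="pearson")
spearman = demo["Age"].corr(demo["Speed_of_Impact"], method="spearman")

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Values close to zero on the real `df` would support the conclusion that age and impact speed are unrelated in this dataset.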

#### Comparing our age variable with helmet/seat belt usage, although there is a slight difference, we couldn’t find much difference between people who use them and those who don’t.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def add_quartile_labels_by_hue(ax, x, hue, data):
    quartiles = data.groupby(hue)[x].quantile([0.25, 0.5, 0.75]).unstack()
    positions = ax.get_xticks()
    hue_values = data[hue].unique()
    # The reversal used in the previous version (hue_values = hue_values[::-1]) is intentionally left out here
    
    for i, pos in enumerate(positions):
        preference = hue_values[i]
        Q1 = quartiles.loc[preference, 0.25]
        Q2 = quartiles.loc[preference, 0.5]
        Q3 = quartiles.loc[preference, 0.75]
        ax.text(pos, Q1, f'Q1: {Q1:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='blue')
        ax.text(pos, Q2, f'Median: {Q2:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='green')
        ax.text(pos, Q3, f'Q3: {Q3:.2f}', horizontalalignment='center', verticalalignment='center', fontsize=10, color='red')


plt.figure(figsize=(25, 13))
plt.suptitle("Other Analysis", fontweight="bold", fontsize=20)


plt.subplot(2, 3, 1)
ax1 = sns.boxplot(x='Helmet_Used', y='Age', data=df, hue="Helmet_Used", palette='Set3')
plt.title('Boxplot of Age vs Helmet_Used')
add_quartile_labels_by_hue(ax1, 'Age', 'Helmet_Used', df)

plt.subplot(2, 3, 2)
ax2 = sns.violinplot(x='Helmet_Used', y='Age', data=df, hue="Helmet_Used", palette='Set3')
plt.title('Violin plot of Age vs Helmet_Used')

plt.subplot(2, 3, 3)
ax3 = sns.stripplot(x='Helmet_Used', y='Age', data=df, hue="Helmet_Used", palette='Set3')
plt.title('stripplot of Age vs Helmet_Used')

plt.subplot(2, 3, 4)
ax4 = sns.boxplot(x='Seatbelt_Used', y='Age', data=df, hue="Seatbelt_Used", palette='Set3')
plt.title('Boxplot of Age vs Seatbelt_Used')
add_quartile_labels_by_hue(ax4, 'Age', 'Seatbelt_Used', df)

plt.subplot(2, 3, 5)
ax5 = sns.violinplot(x='Seatbelt_Used', y='Age', data=df, hue="Seatbelt_Used", palette='Set3')
plt.title('Violin plot of Age vs Seatbelt_Used')

plt.subplot(2, 3, 6)
ax6 = sns.stripplot(x='Seatbelt_Used', y='Age', data=df, hue="Seatbelt_Used", palette='Set3')
plt.title('stripplot of Age vs Seatbelt_Used')

#### Now, in our final analysis, we got a very interesting result. We found that higher-speed impacts are much more likely to involve females than males, with a significant difference. When we compare the upper quartile (the values above Q3), we can see that the top 25% of female accidents are between 104 and 120 km/h, whereas for males this range is from 86.50 to 120 km/h.

In [None]:
plt.figure(figsize=(25, 5))
plt.suptitle("Other Analysis", fontweight="bold", fontsize=20)


plt.subplot(1, 3, 1)
ax1 = sns.boxplot(x='Gender', y='Speed_of_Impact', data=df, hue="Gender", palette='Set3')
plt.title('Boxplot of Speed_of_Impact vs Gender')
add_quartile_labels_by_hue(ax1, 'Speed_of_Impact', 'Gender', df)

plt.subplot(1, 3, 2)
ax2 = sns.violinplot(x='Gender', y='Speed_of_Impact', data=df, hue="Gender", palette='Set3')
plt.title('Violin plot of Speed_of_Impact vs Gender')

plt.subplot(1, 3, 3)
ax3 = sns.stripplot(x='Gender', y='Speed_of_Impact', data=df, hue="Gender", palette='Set3')
plt.title('stripplot of Speed_of_Impact vs Gender')

## <font color='blue'> 3. Model Building. </font>

#### Label Encoder
- Here we are going to use the LabelEncoder to transform our categorical variables into numeric variables.

In [None]:
label_encoder_Gender = LabelEncoder()
label_encoder_Helmet_Used = LabelEncoder()
label_encoder_Seatbelt_Used = LabelEncoder()

df['Gender'] = label_encoder_Gender.fit_transform(df['Gender'])
df['Helmet_Used'] = label_encoder_Helmet_Used.fit_transform(df['Helmet_Used'])
df['Seatbelt_Used'] = label_encoder_Seatbelt_Used.fit_transform(df['Seatbelt_Used'])
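
After encoding, it helps to know which number stands for which category. `LabelEncoder` sorts the classes and numbers them from 0, and the mapping can be read from `classes_`. A small sketch on a made-up `Gender` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values; the real column may contain other labels.
gender = pd.Series(["Male", "Female", "Female", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ is sorted, and each class is encoded as its position in that array.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)                           # {'Female': 0, 'Male': 1}
print(encoder.inverse_transform(codes))  # back to the original labels
```

Keeping one encoder per column, as done above, is what makes `inverse_transform` usable later when interpreting feature importances.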

#### Separating into features variables and target variable.

In [None]:
X = df.drop('Survived', axis = 1)
X = X.values
y = df['Survived']

#### StandardScaler
- Here we will use StandardScaler to put all our features on the same scale.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)

#### Splitting the data into train and test sets. Here we will use 30% of our data to test the machine learning models.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X_standard, y, test_size = 0.3, random_state = 0)
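
In the cells above the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. An alternative, sketched here on random data (`X_demo` and `y_demo` are hypothetical), is to split first and wrap the scaler and the model in a pipeline, so that scaling is learned from the training rows only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data with 5 features and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

# Split first; stratify keeps the class balance equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# The scaler inside the pipeline is fitted only on the data passed to fit().
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)

# Cross-validation refits the scaler inside every fold as well.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)

print(f"Test accuracy: {test_acc:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

On a dataset this small the difference is probably minor, but the pipeline form keeps the evaluation honest and is easier to reuse across models.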

#### Naive Bayes
- Running the Gaussian model.
- Here we will use the Naive Bayes model. We will test the Gaussian variant on our standardized data.

#### In our first machine learning model, we obtained a very poor result. The model was terrible at predicting both the negative outcome (death) and the positive outcome (survival), achieving only 47% accuracy.

In [None]:
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)
previsoes = naive_bayes.predict(X_test)

cm = ConfusionMatrix(naive_bayes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
classification_naive_gaussian = (classification_report(y_test, previsoes))
print(classification_naive_gaussian)

In [None]:
score_naive_gaussian = 0.4745762711864407
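
Instead of copying the score by hand, the accuracy can be computed from the predictions and compared with a trivial baseline. The sketch below uses random data (all arrays are hypothetical) and `DummyClassifier`, which always predicts the most frequent training class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Random stand-in data: features carry no information about the target.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(140, 5)), rng.normal(size=(60, 5))
y_tr, y_te = rng.integers(0, 2, size=140), rng.integers(0, 2, size=60)

# Accuracy of the actual model, computed rather than hard-coded.
model = GaussianNB().fit(X_tr, y_tr)
score_model = accuracy_score(y_te, model.predict(X_te))

# Accuracy of a baseline that ignores the features entirely.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
score_baseline = accuracy_score(y_te, baseline.predict(X_te))

print(f"Model: {score_model:.3f}, baseline: {score_baseline:.3f}")
```

A model that cannot beat this baseline has not learned anything useful from the features, which is the situation the results in this notebook point to.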

#### Decision Tree
- Here we will use the Decision Tree model and test both the entropy and Gini criteria.
- Here we apply a randomized hyperparameter search (RandomizedSearchCV) to find the best hyperparameters to use.

In [None]:
parameters = {'max_depth': [3, 4, 5, 6, 7, 9, 11],
              'min_samples_split': [2, 3, 4, 5, 6, 7],
              'criterion': ['entropy', 'gini']
             }

model = DecisionTreeClassifier()
gridDecisionTree = RandomizedSearchCV(model, parameters, cv = 3, n_jobs = -1)
gridDecisionTree.fit(X_train, y_train)

print('Min Split: ', gridDecisionTree.best_estimator_.min_samples_split)
print('Max Depth: ', gridDecisionTree.best_estimator_.max_depth)
print('Algorithm: ', gridDecisionTree.best_estimator_.criterion)
print('Score: ', gridDecisionTree.best_score_)
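
Note that `RandomizedSearchCV` does not try every combination: by default it samples `n_iter=10` candidates, here out of 7 × 6 × 2 = 84. Without a fixed `random_state`, the selected hyperparameters can therefore change between runs. A minimal reproducible sketch on random data (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(140, 5))
y_demo = rng.integers(0, 2, size=140)

param_distributions = {
    "max_depth": [3, 4, 5, 6, 7, 9, 11],
    "min_samples_split": [2, 3, 4, 5, 6, 7],
    "criterion": ["entropy", "gini"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,        # number of sampled combinations (the default)
    cv=3,
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination as a dict.
print(search.best_params_)
print(f"Best CV score: {search.best_score_:.3f}")
```

Passing `**search.best_params_` to the final model avoids retyping the values by hand. `GridSearchCV` is the exhaustive alternative when the grid is small enough.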

#### Running Decision Tree

#### We can say that it was another terrible result, despite a slight improvement in accuracy. Still, it’s a very poor outcome. The probability of getting it right is the same as flipping a coin and checking if it lands on heads or tails.

In [None]:
decision_tree = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 3, max_depth= 5, random_state=0)
decision_tree.fit(X_train, y_train)
previsoes = decision_tree.predict(X_test)

cm = ConfusionMatrix(decision_tree)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
classification_decision = (classification_report(y_test, previsoes))
print(classification_decision)

In [None]:
score_tree = 0.5084745762711864

In [None]:
columns = df.drop('Survived', axis = 1).columns
feature_imp = pd.Series(decision_tree.feature_importances_, index = columns).sort_values(ascending = False)
feature_imp

#### RandomForest
- Here we will use the Random Forest model and test both the entropy and Gini criteria.
- Applying a randomized hyperparameter search (RandomizedSearchCV).

In [None]:
from sklearn.ensemble import RandomForestClassifier

parameters = {'max_depth': [3, 4, 5, 6, 7, 9, 11],
              'min_samples_split': [2, 3, 4, 5, 6, 7],
              'criterion': ['entropy', 'gini']
             }

model = RandomForestClassifier()
gridRandomForest = RandomizedSearchCV(model, parameters, cv = 5, n_jobs = -1)
gridRandomForest.fit(X_train, y_train)

print('Algorithm: ', gridRandomForest.best_estimator_.criterion)
print('Score: ', gridRandomForest.best_score_)
print('Min Split: ', gridRandomForest.best_estimator_.min_samples_split)
print('Max Depth: ', gridRandomForest.best_estimator_.max_depth)

#### Running Random Forest

#### In the Random Forest model, we obtained an even worse result than the other two, with the same issue of being unable to correctly predict both outcomes.

In [None]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, max_depth= 4,  criterion = 'entropy', random_state = 0)
random_forest.fit(X_train, y_train)
previsoes = random_forest.predict(X_test)

accuracy = accuracy_score(y_test, previsoes)
confusion = confusion_matrix(y_test, previsoes)
classification_report_result = classification_report(y_test, previsoes)

cm = ConfusionMatrix(random_forest)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
classification_random = (classification_report(y_test, previsoes))
print(classification_random)

In [None]:
score_random = 0.4406779661016949

In [None]:
feature_imp_random = pd.Series(random_forest.feature_importances_, index = columns).sort_values(ascending = False)
feature_imp_random

#### K-Neighbors
- Here we will use the K-Neighbors model, with GridSearchCV to find the best number of neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

k_list = list(range(1,10))
k_values = dict(n_neighbors = k_list)
grid = GridSearchCV(knn, k_values, cv = 2, scoring = 'accuracy', n_jobs = -1)
grid.fit(X_train, y_train)


grid.best_params_, grid.best_score_

#### Running K-Neighbors

#### I won’t go too deep to avoid being repetitive, but we obtained the same pattern as the other models: a terrible performance.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 9, metric = 'minkowski', p = 1)
knn.fit(X_train, y_train)
previsoes = knn.predict(X_test)

accuracy = accuracy_score(y_test, previsoes)
confusion = confusion_matrix(y_test, previsoes)
classification_report_result = classification_report(y_test, previsoes)

cm = ConfusionMatrix(knn)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
classification_knn = (classification_report(y_test, previsoes))
print(classification_knn)

In [None]:
score_knn = 0.4915254237288136

#### Logistic Regression
- Here we will use the Logistic Regression model.

#### I won’t go too deep to avoid being repetitive, but we obtained the same pattern as the other models: a terrible performance.

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(random_state = 1, max_iter=10000)
logistic.fit(X_train, y_train)
previsoes = logistic.predict(X_test)

accuracy = accuracy_score(y_test, previsoes)
confusion = confusion_matrix(y_test, previsoes)
classification_report_result = classification_report(y_test, previsoes)

cm = ConfusionMatrix(logistic)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
logistic_normal = (classification_report(y_test, previsoes))
print(logistic_normal)

In [None]:
logistic_normal = 0.5084745762711864

#### AdaBoost
- Here we will use the AdaBoost model, with a hyperparameter search to find the best learning rate and number of estimators.
- Applying a randomized hyperparameter search (RandomizedSearchCV).

In [None]:
from sklearn.ensemble import AdaBoostClassifier

parameters = {'learning_rate': [0.01, 0.02, 0.05, 0.07, 0.09, 0.1, 0.3, 0.001, 0.005],
              'n_estimators': [300, 500]
             }

model = AdaBoostClassifier()
gridAdaBoost = RandomizedSearchCV(model, parameters, cv = 2, n_jobs = -1)
gridAdaBoost.fit(X_train, y_train)

print('Learning Rate: ', gridAdaBoost.best_estimator_.learning_rate)
print('Score: ', gridAdaBoost.best_score_)

#### Running AdaBoost

#### I won’t go too deep to avoid being repetitive, but we obtained the same pattern as the other models: a terrible performance.

In [None]:
ada_boost = AdaBoostClassifier(n_estimators = 500, learning_rate =  0.09, random_state = 0)
ada_boost.fit(X_train, y_train)
previsoes = ada_boost.predict(X_test)

accuracy = accuracy_score(y_test, previsoes)
confusion = confusion_matrix(y_test, previsoes)
classification_report_result = classification_report(y_test, previsoes)

cm = ConfusionMatrix(ada_boost)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

In [None]:
classification_ada_scaler = (classification_report(y_test, previsoes))
print(classification_ada_scaler)

In [None]:
score_ada_scaler = 0.4915254237288136

#### Checking key variables to predict the outcome.
- Chi-2

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

features = X
target = y

best_features = SelectKBest(score_func = chi2,k = 'all')
fit = best_features.fit(features,target)

featureScores = pd.DataFrame(data = fit.scores_,index = list(columns),columns = ['Chi Squared Score']) 

featureScores.sort_values(by = 'Chi Squared Score', ascending = False).round(2)
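
The chi-squared score in scikit-learn is intended for non-negative features such as booleans or frequencies, so its values for continuous columns like Age depend on the scale of the column. As a hedged complement, mutual information handles both kinds of feature. This sketch uses random data (`demo` and `target` are hypothetical) and `mutual_info_classif`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Random stand-in features: two continuous, one binary.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200).astype(float),
    "Gender": rng.integers(0, 2, size=200).astype(float),
    "Speed_of_Impact": rng.integers(20, 120, size=200).astype(float),
})
target = rng.integers(0, 2, size=200)

# The boolean mask tells the estimator which columns are discrete.
discrete_mask = np.array([False, True, False])
mi = mutual_info_classif(
    demo, target, discrete_features=discrete_mask, random_state=0
)
mi_scores = pd.Series(mi, index=demo.columns).sort_values(ascending=False)

print(mi_scores.round(3))
```

Scores near zero for every feature would be consistent with the weak model results: the features carry little information about the target.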

- Decision Tree

In [None]:
feature_imp

- Random Forest

In [None]:
feature_imp_random

In [None]:
# All models were trained on the StandardScaler output (X_standard), and the
# scores are test-set accuracies, so the table is labelled accordingly.
Naive_dict_v1 = {'Model':'Naive Bayes',
               'Scaling':'StandardScaler',
               'Type':'Gaussian',
               'Accuracy':score_naive_gaussian}

Decision_dict = {'Model':'Decision Tree',
               'Scaling':'StandardScaler',
               'Type': 'Entropy',
               'Accuracy':score_tree}

Random_dict = {'Model':'Random Forest',
               'Scaling':'StandardScaler',
               'Type': 'Entropy',
               'Accuracy':score_random}


KNN_dict_v2 = {'Model':'KNN',
               'Scaling':'StandardScaler',
               'Type':'-',
               'Accuracy':score_knn}

Logistic_dict_v1 = {'Model':'Logistic Regression',
               'Scaling':'StandardScaler',
               'Type':'-',
               'Accuracy':logistic_normal}

ada_dict_v1 = {'Model':'AdaBoost',
               'Scaling':'StandardScaler',
               'Type':'-',
               'Accuracy':score_ada_scaler}

resume = pd.DataFrame({'Naive Bayes':pd.Series(Naive_dict_v1),
                       'Decision Tree':pd.Series(Decision_dict),
                       'Random Forest':pd.Series(Random_dict),
                       'KNN':pd.Series(KNN_dict_v2),
                       'Logistic Regression':pd.Series(Logistic_dict_v1),
                       'AdaBoost':pd.Series(ada_dict_v1)
                      })

resume
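
To compare the models at a glance, the six scores recorded in the cells above can be placed in one sorted Series and averaged. The variable names `scores` and `summary` are introduced here for illustration only:

```python
import pandas as pd

# Test-set accuracies as recorded in the cells above.
scores = {
    "Naive Bayes": 0.4745762711864407,
    "Decision Tree": 0.5084745762711864,
    "Random Forest": 0.4406779661016949,
    "KNN": 0.4915254237288136,
    "Logistic Regression": 0.5084745762711864,
    "AdaBoost": 0.4915254237288136,
}

summary = pd.Series(scores, name="Accuracy").sort_values(ascending=False)
mean_accuracy = summary.mean()

print(summary.round(3))
print(f"Mean accuracy: {mean_accuracy:.3f}")
```

The mean comes out just below 0.49, so on average the models sit at chance level for a balanced binary target.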

## <font color='blue'> 4. Conclusion </font>

#### To conclude, this is a very basic project because unfortunately we have very little data. The theme is interesting and the idea is quite intriguing, but the amount of data is limited. We have 6 columns with fairly interesting data, but only 200 rows (our total population), which makes the work quite challenging. The columns we do have make it a promising project, but we lacked the quantity of data needed to carry out a deeper analysis.

#### Looking at our dataset, we can see that we have some categorical variables and some continuous variables. We can also verify that we have 3 null values. Even though the dataset is very small, I chose to remove them, since they affect only 3 rows.

#### Looking at our exploratory analysis, we can see that our dataset is quite well distributed. When we look at the gender variable, we see that there are more women than men in our dataset. We can also see that about 55% of the people in our dataset use a seat belt/helmet (which I found alarming, considering that 45% don’t use them). When we look at our target variable, we can see that it is very well balanced, with nearly 50% of the data in each class (which would be a worrying fatality rate if the data is real).

#### Looking at our continuous variables, we see a very similar pattern to the categorical variables in terms of data distribution. The data is very well distributed across our population, showing that we have all types of people available for analysis. There are some peaks, but still, the dataset is well distributed. When we look at the boxplots, we can see that we don’t have any outliers (which means there is no need to treat these data). Looking at the mean, we can verify that the average age of our population is 43, showing that we have an "older" population. The same applies when we look at the impact speed variable, with an average impact speed of 70 km/h.

#### When we started our bivariate analysis, we were able to identify some interesting patterns despite not having much information about the accidents. When we look at the gender variable, we observe a higher probability of a woman dying compared to a man, but as mentioned earlier, we cannot distinguish whether the woman was driving or just riding in the car, the type of accident, etc. Looking at the other two categorical variables, we are surprised to find that people who use helmets in this dataset are more likely to die than those who do not, which doesn’t make much sense. We see the opposite behavior for seat belts: people who wear one are more likely to survive (which makes a lot of sense to me, as using a seat belt makes travel safer).

#### When we look at our continuous variables compared to our target variable, we don’t find a very well-defined pattern, but we can see some differences. When we look at the age variable, we find that fatal accidents are more likely to happen to younger people (though not much difference). When we look at the other variable, which is impact speed, we don’t see a clear pattern, which is quite strange, as it automatically makes us think that the faster the person is driving, the more likely they are to die.

#### Now, comparing the other variables to understand the behavior between them, when we compare the Age x Impact Speed variables to try to understand if there is a correlation between speed and age, we couldn’t find this differentiation, with a variety of different speeds regardless of age. When we look at age compared to the probability of using a seat belt/helmet, we didn’t find any defined pattern that differentiates age and the habit of using a seat belt/helmet. In our last graph, we can have another interesting insight where we see that most high-speed accidents are more likely to involve females. 25% of the data at the top for males range from 84 km/h to 120 km/h, while for females, it’s from 104 km/h to 120 km/h, showing a predominance of high speeds.

#### For the Machine Learning part, we transformed our categorical variables into numeric variables using LabelEncoder, then scaled the data using StandardScaler. After that, we split our dataset into training and testing sets with a 70/30 split. All of our machine learning models performed poorly at predicting both outcomes, with accuracies between 44% and 51% and an average of about 49%. This means the models are no better than flipping a coin to predict the outcome. This could be caused by the small amount of data we have for training, or by the features simply being unable to explain the final result. I believe it’s a combination of both, with the lack of data being the main factor.
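
The effect of the small test set can be made concrete with a confidence interval around an accuracy score. The sketch below uses the normal approximation to the binomial. The test-set size of 59 is an assumption, inferred from the recorded score 0.5084745762711864 being equal to 30/59, and the helper name `accuracy_interval` is hypothetical:

```python
import math

def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n test rows."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# Assumed: 30 correct predictions out of 59 test rows.
low, high = accuracy_interval(30 / 59, 59)

print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval spans roughly 0.38 to 0.64, so with a test set this small none of the models can be told apart from a coin flip or from one another.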