### Assignment-03

- Formulate three original hypotheses based on the raw US Accidents dataset, and perform appropriate statistical hypothesis tests to validate each one.

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Task
Analyze the US Accidents dataset by loading and exploring it, then formulate and statistically test three hypotheses: one on the impact of weather conditions on accident severity/frequency, one on the impact of time of day on accident frequency/severity, and one on the influence of road features on accident occurrence/severity. Finally, summarize and interpret the results of these tests.

## Load and Explore Data

### Subtask:
Load the US Accidents dataset and perform initial exploration to understand its structure, columns, and potential areas for hypothesis formulation. This includes displaying head, info, and descriptive statistics.


**Reasoning**:
First, I need to import the pandas library to work with DataFrames. Then, I will load the 'US_Accidents_March23.csv' dataset into a pandas DataFrame named `df` as per the instructions. After loading, I will display the first 5 rows, print a concise summary of the DataFrame (info), and generate descriptive statistics to fulfill the subtask requirements.



In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/RoadSafety_Nov25/data/raw/US_Accidents_March23.csv')

# Display the first 5 rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Print a concise summary of the DataFrame
print("\nDataFrame Information:")
df.info()

# Generate descriptive statistics for numerical columns
print("\nDescriptive Statistics:")
print(df.describe())

First 5 rows of the DataFrame:
    ID   Source  Severity           Start_Time             End_Time  \
0  A-1  Source2         3  2016-02-08 05:46:00  2016-02-08 11:00:00   
1  A-2  Source2         2  2016-02-08 06:07:59  2016-02-08 06:37:59   
2  A-3  Source2         2  2016-02-08 06:49:27  2016-02-08 07:19:27   
3  A-4  Source2         3  2016-02-08 07:23:34  2016-02-08 07:53:34   
4  A-5  Source2         2  2016-02-08 07:39:07  2016-02-08 08:09:07   

   Start_Lat  Start_Lng  End_Lat  End_Lng  Distance(mi)  ... Roundabout  \
0  39.865147 -84.058723      NaN      NaN          0.01  ...      False   
1  39.928059 -82.831184      NaN      NaN          0.01  ...      False   
2  39.063148 -84.032608      NaN      NaN          0.01  ...      False   
3  39.747753 -84.205582      NaN      NaN          0.01  ...      False   
4  39.627781 -84.188354      NaN      NaN          0.01  ...      False   

  Station   Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset  \
0   False  F
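
The tests in this notebook only read a handful of columns, so loading the whole CSV is more than they need. The sketch below shows one way to restrict the load with the `usecols` parameter of `pd.read_csv` and to count missing values per column. The in-memory CSV, its two rows, and the names `needed_columns`, `sample_csv`, and `df_small` are illustrative assumptions; they stand in for the real file and are not part of the original notebook.

```python
import io

import pandas as pd

# Columns referenced by the three hypotheses in this notebook.
needed_columns = ["Severity", "Sunrise_Sunset", "Traffic_Signal", "Visibility(mi)"]

# Tiny in-memory stand-in for the real CSV (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "ID,Severity,Sunrise_Sunset,Traffic_Signal,Visibility(mi),Extra\n"
    "A-1,3,Night,False,10.0,x\n"
    "A-2,2,Day,True,,y\n"
)

# usecols makes pandas parse only the listed columns.
df_small = pd.read_csv(sample_csv, usecols=needed_columns)

# Missing values per column; these rows drop out of the group comparisons.
missing_counts = df_small.isna().sum()

print(df_small.dtypes)
print(missing_counts)
```

On the real file, the same call would take the Drive path in place of `sample_csv`.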

In [4]:
import numpy as np

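def permutation_test(x, y, n_permutations=1000, alternative="greater", seed=42):
    """Permutation test for the difference in means, mean(x) - mean(y).

    alternative: "greater" tests mean(x) > mean(y), "less" tests
    mean(x) < mean(y), and any other value runs a two-sided test.
    Returns (observed_diff, p_value).
    """
    # Fixed seed so repeated runs give the same p-value.
    np.random.seed(seed)

    x = np.asarray(x)
    y = np.asarray(y)

    observed_diff = x.mean() - y.mean()
    # Under the null hypothesis the group labels are exchangeable,
    # so both samples are pooled and relabeled at random.
    combined = np.concatenate([x, y])

    perm_diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        np.random.shuffle(combined)  # in-place shuffle of the pooled data
        perm_x = combined[:len(x)]
        perm_y = combined[len(x):]
        perm_diffs[i] = perm_x.mean() - perm_y.mean()

    # p-value: share of permuted differences at least as extreme as observed.
    # Its resolution is 1 / n_permutations, so it can come out as exactly 0.0.
    if alternative == "greater":
        p_value = np.mean(perm_diffs >= observed_diff)
    elif alternative == "less":
        p_value = np.mean(perm_diffs <= observed_diff)
    else:
        # Two-sided: compare absolute differences.
        p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

    return observed_diff, p_value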
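
A proportion of permuted differences can be exactly 0 when none of the random relabelings reaches the observed value, although a permutation p-value is never truly zero. A common adjustment counts the observed statistic as one of the permutations, giving p = (b + 1) / (m + 1) for b exceedances in m permutations. The following is a minimal sketch of that variant for the one-sided "greater" case. The name `permutation_test_corrected`, the use of `np.random.default_rng`, and the synthetic samples are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def permutation_test_corrected(x, y, n_permutations=1000, seed=42):
    """One-sided (greater) permutation test with the add-one p-value."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed_diff = x.mean() - y.mean()
    combined = np.concatenate([x, y])

    exceed_count = 0
    for _ in range(n_permutations):
        permuted = rng.permutation(combined)  # shuffled copy of the pooled data
        perm_diff = permuted[:len(x)].mean() - permuted[len(x):].mean()
        if perm_diff >= observed_diff:
            exceed_count += 1

    # Add-one correction: the smallest possible p-value is 1 / (m + 1).
    p_value = (exceed_count + 1) / (n_permutations + 1)
    return observed_diff, p_value

# Synthetic illustration only (not the US Accidents data).
demo_rng = np.random.default_rng(0)
sample_a = demo_rng.normal(loc=1.0, scale=1.0, size=200)
sample_b = demo_rng.normal(loc=0.0, scale=1.0, size=200)

demo_diff, demo_p = permutation_test_corrected(sample_a, sample_b, n_permutations=100)
print("difference:", demo_diff)
print("corrected p-value:", demo_p)
```

With 100 permutations this variant can never report less than 1/101, which makes the limited resolution of a small permutation count visible in the result.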

Hypothesis 1: Accidents at night have a higher mean severity than daytime accidents (null hypothesis: the mean severity is the same for both)

In [5]:
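# Severity by period of day; rows with a missing Sunrise_Sunset match neither group.
day = df.loc[df["Sunrise_Sunset"] == "Day", "Severity"].dropna()
night = df.loc[df["Sunrise_Sunset"] == "Night", "Severity"].dropna()

# One-sided test: is mean(night) greater than mean(day)?
# With 100 permutations the p-value has a resolution of 0.01.
diff, p_value = permutation_test(
    night, day,
    n_permutations=100,
    alternative="greater"
)
print("Mean severity difference (Night - Day):", diff)
print("p-value:", p_value)
print(
    "Hypothesis supported" if p_value < 0.05
    else "Hypothesis not supported"
)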

Mean severity difference (Night - Day): 0.009673063529520132
p-value: 0.0
Hypothesis supported
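
The observed difference is about 0.01 on a severity scale whose values start at 2 and 3 in the rows shown earlier, so the p-value alone says little about practical importance: with very large groups, even tiny differences are flagged as significant. A hedged sketch of two complementary summaries follows, a standardized effect size (Cohen's d) and a percentile bootstrap interval for the difference in means. The helper names `cohens_d` and `bootstrap_ci_mean_diff` and the synthetic `night_demo` and `day_demo` arrays are assumptions for illustration; no values here come from the dataset.

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def bootstrap_ci_mean_diff(x, y, n_boot=1000, level=0.95, seed=42):
    """Percentile bootstrap interval for mean(x) - mean(y)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        boot_x = rng.choice(x, size=len(x), replace=True)
        boot_y = rng.choice(y, size=len(y), replace=True)
        diffs[i] = boot_x.mean() - boot_y.mean()
    alpha = 1.0 - level
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic severity-like values in 1..4 (illustration only).
demo_rng = np.random.default_rng(0)
night_demo = demo_rng.integers(1, 5, size=5000)
day_demo = demo_rng.integers(1, 5, size=5000)

d = cohens_d(night_demo, day_demo)
ci_low, ci_high = bootstrap_ci_mean_diff(night_demo, day_demo, n_boot=500)
print("Cohen's d:", d)
print("95% bootstrap CI for the mean difference:", (ci_low, ci_high))
```

Applied to the real `night` and `day` series, an interval that excludes zero together with a very small d would indicate a difference that is statistically detectable but small in magnitude.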


Hypothesis 2: Accidents near traffic signals have a higher mean severity than accidents away from them (null hypothesis: the mean severity is the same for both)

In [6]:
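# Severity for accidents with and without a nearby traffic signal.
signal = df.loc[df["Traffic_Signal"].eq(True), "Severity"].dropna()
no_signal = df.loc[df["Traffic_Signal"].eq(False), "Severity"].dropna()

# One-sided test: is mean(signal) greater than mean(no_signal)?
# A negative observed difference gives a p-value near 1 for this alternative.
diff, p_value = permutation_test(
    signal, no_signal,
    n_permutations=100,
    alternative="greater"
)

print("Mean severity difference (Signal - No Signal):", diff)
print("p-value:", p_value)
print(
    "Hypothesis supported" if p_value < 0.05
    else "Hypothesis not supported"
)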

Mean severity difference (Signal - No Signal): -0.1440243048909231
p-value: 1.0
Hypothesis not supported
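
The observed difference is negative, so the one-sided "greater" test cannot support the hypothesis; the data point in the opposite direction. Because `Severity` is an ordinal code rather than a continuous measurement, a mean difference can hide how the individual levels are distributed. The sketch below tabulates the share of each severity level per group with `pd.crosstab`. The small `demo` frame is made-up data used only to show the shape of the output.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use `df` instead.
demo = pd.DataFrame({
    "Traffic_Signal": [True, True, True, False, False, False, False, False],
    "Severity":       [2, 2, 2, 2, 3, 3, 4, 2],
})

# normalize="index" turns each row into proportions that sum to 1,
# so groups of different sizes can be compared directly.
severity_share = pd.crosstab(
    demo["Traffic_Signal"], demo["Severity"], normalize="index"
)
print(severity_share)
```

Each row of the table is the severity distribution within one group, which shows whether a lower mean comes from fewer high-severity cases or from a shift across all levels.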


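Hypothesis 3: Accidents in low visibility (under 3 miles) have a higher mean severity than accidents in higher visibility (null hypothesis: the mean severity is the same for both)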

In [None]:
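# Split at 3 miles of visibility. Comparisons with NaN are False,
# so rows with a missing visibility fall into neither group.
low_vis = df.loc[df["Visibility(mi)"] < 3, "Severity"].dropna()
high_vis = df.loc[df["Visibility(mi)"] >= 3, "Severity"].dropna()

# One-sided test: is mean(low_vis) greater than mean(high_vis)?
diff, p_value = permutation_test(
    low_vis, high_vis,
    n_permutations=100,
    alternative="greater"
)

print("Mean severity difference (Low - High visibility):", diff)
print("p-value:", p_value)
print(
    "Hypothesis supported" if p_value < 0.05
    else "Hypothesis not supported"
)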
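
Each of the three tests is judged against 0.05 on its own, so the chance of at least one false positive across the set is higher than 5%. One common remedy is the Holm step-down adjustment, sketched below under stated assumptions. The function name `holm_adjust` and the p-values in `illustrative_p` are hypothetical and are not results from this notebook.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)  # indices from smallest to largest p-value
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank k-1) is multiplied by (m - k + 1).
        candidate = (m - rank) * p[idx]
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Hypothetical p-values for three tests (illustration only).
illustrative_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(illustrative_p)
print(adjusted_p)
```

A hypothesis would then count as supported only if its adjusted p-value stays below 0.05.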