Assignment-03

Formulate three original hypotheses based on the raw US Accidents dataset, and perform appropriate statistical hypothesis tests to validate each one.

--> Importing Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Task**

Analyze the US Accidents dataset by loading and exploring it, then formulate and statistically test three hypotheses: one on the impact of weather conditions on accident severity/frequency, one on the impact of time of day on accident frequency/severity, and one on the influence of road features on accident occurrence/severity. Finally, summarize and interpret the results of these tests.


**Load and Explore Data**

**Subtask:**

Load the US Accidents dataset and perform initial exploration to understand its structure, columns, and potential areas for hypothesis formulation. This includes displaying head, info, and descriptive statistics.

Reasoning: First, I need to import the pandas library to work with DataFrames. Then, I will load the 'US_Accidents_March23.csv' dataset into a pandas DataFrame named df as per the instructions. After loading, I will display the first 5 rows, print a concise summary of the DataFrame (info), and generate descriptive statistics to fulfill the subtask requirements.

In [None]:
DATA_PATH = "/content/drive/MyDrive/Colab Notebooks/RoadSafety_Nov25/data/raw/US_Accidents_March23.csv"

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv(DATA_PATH)
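
The exploration steps named in the subtask (head, info, descriptive statistics) could look roughly like the sketch below. It runs on a tiny made-up CSV rather than the real file, so `toy_csv`, `toy_df`, `missing_share`, and all values in it are illustrative assumptions; on the real data the same calls would be applied to `df`.

```python
import io
import pandas as pd

# Toy stand-in for the real CSV; these rows are invented for illustration only.
toy_csv = io.StringIO(
    "ID,Severity,Weather_Condition,Visibility(mi)\n"
    "A-1,2,Rain,1.5\n"
    "A-2,3,Clear,10.0\n"
    "A-3,2,Clear,\n"
    "A-4,4,Rain,0.5\n"
)
toy_df = pd.read_csv(toy_csv)

print(toy_df.head())      # first rows of the frame
toy_df.info()             # column dtypes and non-null counts
print(toy_df.describe())  # summary statistics for the numeric columns

# Share of missing values per column, handy before picking test variables
missing_share = toy_df.isna().mean()
print(missing_share)
```

Checking the missing-value share early is useful here because columns such as `Visibility(mi)` are later turned into categories, and missing entries need an explicit decision.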

1. Creating an original hypothesis test for Rain vs. Clear severity based on the US Accidents dataset

In [None]:
print(df.columns)

Index(['ID', 'Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat',
       'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
       'Street', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
       'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
       'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
       'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
       'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')


In [None]:
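# Hypothesis test (independent two-sample t-test) for Rain vs Clear severity.
# Note: stats.ttest_ind is two-sided by default, so the hypotheses actually tested are:
#
# H0: mean accident severity is the same in rainy and clear weather (μ_rain = μ_clear)
# H1: mean accident severity differs between rainy and clear weather (μ_rain ≠ μ_clear)
#
# A one-tailed test of μ_rain > μ_clear would need alternative="greater".

rain_severity = df.loc[df["Weather_Condition"]=="Rain","Severity"]
clear_severity = df.loc[df["Weather_Condition"]=="Clear","Severity"]

# Two-sided t-test; the sign of t_stat gives the direction
# (negative means the Rain mean is below the Clear mean)
t_stat, p_val = stats.ttest_ind(rain_severity, clear_severity)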
print("Rain vs Clear severity")
print("t-statistic =", t_stat)
print("p-value =", p_val)

# Interpretation
if p_val < 0.05:
    print(" Severity differs significantly between Rain and Clear.")
else:
    print(" No significant difference in severity between Rain and Clear.")

Rain vs Clear severity
t-statistic = -55.56007113511793
p-value = 0.0
 Severity differs significantly between Rain and Clear.
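
A one-sided variant of this comparison could be sketched as follows. This is a minimal example on synthetic severity scores, not the notebook's actual method: `toy_rain`, `toy_clear`, and their probabilities are made up, Welch's test (`equal_var=False`) is swapped in as an assumption because the two groups need not share a variance, and Cohen's d is added because with millions of rows even tiny differences produce p-values near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic severity scores (1-4); NOT drawn from the US Accidents data.
toy_rain = rng.choice([1, 2, 3, 4], size=500, p=[0.05, 0.75, 0.15, 0.05])
toy_clear = rng.choice([1, 2, 3, 4], size=500, p=[0.05, 0.65, 0.22, 0.08])

# Two-sided Welch t-test (does not assume equal variances)
t_two, p_two = stats.ttest_ind(toy_rain, toy_clear, equal_var=False)

# One-sided Welch t-test for H1: mean(rain) > mean(clear)
t_one, p_one = stats.ttest_ind(
    toy_rain, toy_clear, equal_var=False, alternative="greater"
)

# Cohen's d as an effect size (simple pooled SD, valid for equal group sizes)
pooled_sd = np.sqrt((toy_rain.var(ddof=1) + toy_clear.var(ddof=1)) / 2)
cohens_d = (toy_rain.mean() - toy_clear.mean()) / pooled_sd

print("two-sided p =", p_two)
print("one-sided p =", p_one)
print("Cohen's d   =", cohens_d)
```

The t-statistic is identical in both calls; only the p-value changes. When the statistic is negative, as in the output above, the one-sided p-value for "greater" is close to 1, so the direction of the effect has to be read from the sign rather than from the two-sided p-value alone.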


2. Creating an original hypothesis test for Visibility vs. Severity based on the US Accidents dataset



In [None]:
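# Hypothesis test for Visibility vs Severity
# Here we use the chi-square test of independence

# H0: Severity is independent of Visibility_Category
# H1: Severity is dependent on Visibility_Category

# Low = visibility below 2 miles, High = everything else.
# Note: rows with a missing Visibility(mi) also land in "High", because NaN < 2 is False.
df["Visibility_Category"] = np.where(df["Visibility(mi)"] < 2, "Low", "High")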
contingency = pd.crosstab(df["Visibility_Category"], df["Severity"])

# Chi-square test
chi2, p_val, dof, expected = stats.chi2_contingency(contingency)

print(" Visibility vs Severity distribution")
print("Chi-square =", chi2)
print("p_value =", p_val)

# Interpretation
if p_val < 0.05:
    print("Severity distribution depends on visibility (Low vs High).")
else:
    print(" No significant difference in severity distribution between Low and High visibility.")

 Visibility vs Severity distribution
Chi-square = 1261.4335754832514
p_value = 3.4347534840364016e-273
Severity distribution depends on visibility (Low vs High).
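
Two refinements to this test can be sketched on toy data: dropping rows with unknown visibility before binning, and reporting Cramér's V alongside the p-value. The frame `toy` and its values are invented for illustration, and the counts are far too small for the chi-square approximation to be trusted; the point is only the mechanics.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented rows for illustration; NOT taken from the US Accidents data.
toy = pd.DataFrame({
    "Visibility(mi)": [0.5, 1.0, 10.0, 10.0, np.nan, 5.0,
                       1.5, 10.0, np.nan, 0.2, 7.0, 10.0],
    "Severity":       [3, 4, 2, 2, 2, 2, 3, 3, 4, 4, 2, 2],
})

# np.where alone sends NaN to "High", because NaN < 2 evaluates to False
naive = np.where(toy["Visibility(mi)"] < 2, "Low", "High")

# Alternative: drop rows with unknown visibility before categorising
known = toy.dropna(subset=["Visibility(mi)"]).copy()
known["Visibility_Category"] = np.where(known["Visibility(mi)"] < 2, "Low", "High")

table = pd.crosstab(known["Visibility_Category"], known["Severity"])
chi2, p_val, dof, expected = stats.chi2_contingency(table)

# Cramér's V: effect size for a contingency table, between 0 and 1
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(table)
print("chi2 =", chi2, "p =", p_val, "dof =", dof)
print("Cramér's V =", cramers_v)
```

With a sample as large as the full dataset, almost any association is statistically significant, so an effect size such as Cramér's V helps judge whether the dependence is practically meaningful.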


3. Creating an original hypothesis test for Urban vs. Rural severe accident rates based on the US Accidents dataset

In [None]:
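# Hypothesis test for Urban vs Rural severe accident rates
# Here we use a one-tailed two-proportion z-test

# H0: proportion of severe accidents (Severity >= 3) in urban areas is equal to or less than in rural areas
# H1: proportion of severe accidents (Severity >= 3) in urban areas is greater than in rural areas

# The dataset has no urban/rural column, so Location_Type is a proxy:
# "Urban" = accident near a traffic signal or junction, "Rural" = everything else.
df["Location_Type"] = np.where((df["Traffic_Signal"] == True) | (df["Junction"] == True), "Urban", "Rural")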


urban_severe = np.sum((df["Location_Type"] == "Urban") & (df["Severity"] >= 3))
urban_total = np.sum(df["Location_Type"] == "Urban")

rural_severe = np.sum((df["Location_Type"] == "Rural") & (df["Severity"] >= 3))
rural_total = np.sum(df["Location_Type"] == "Rural")

# Proportions
p1 = urban_severe / urban_total
p2 = rural_severe / rural_total
p_pool = (urban_severe + rural_severe) / (urban_total + rural_total)

# Standard error of difference
std_err_diff = np.sqrt(p_pool * (1 - p_pool) * (1/urban_total + 1/rural_total))

# Z-test statistic
z_stat = (p1 - p2) / std_err_diff
p_val = stats.norm.sf(z_stat) # One-tailed p-value (H1: p1 > p2); sf(z) equals 1 - cdf(z) but is more precise in the tail

print("\nUrban vs Rural severe accident rates")
print("z-statistic =", z_stat)
print("p-value =", p_val)

# Interpretation
if p_val < 0.05:
    print("Severe accident rates are significantly higher in Urban areas than in Rural areas.")
else:
    print(" No significant evidence that severe accident rates are higher in Urban areas than in Rural areas.")


Urban vs Rural severe accident rates
z-statistic = -160.209243311526
p-value = 1.0
 No significant evidence that severe accident rates are higher in Urban areas than in Rural areas.
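
The pooled two-proportion z-test used above can be wrapped in a small reusable helper. The sketch below is one possible way to do it, not the notebook's code: the function name `two_proportion_ztest`, the `alternative` labels, and the counts in the usage example are all hypothetical.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x1, n1, x2, n2, alternative="two-sided"):
    """Pooled two-proportion z-test (hypothetical helper).

    x1, x2: number of "successes" in each group; n1, n2: group sizes.
    alternative: "two-sided", "larger" (p1 > p2) or "smaller" (p1 < p2).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    std_err = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    if alternative == "larger":
        p = stats.norm.sf(z)            # upper tail
    elif alternative == "smaller":
        p = stats.norm.cdf(z)           # lower tail
    elif alternative == "two-sided":
        p = 2 * stats.norm.sf(abs(z))   # both tails
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")
    return z, p

# Made-up counts: 30 of 200 severe in group 1, 60 of 200 severe in group 2
z, p_larger = two_proportion_ztest(30, 200, 60, 200, alternative="larger")
_, p_smaller = two_proportion_ztest(30, 200, 60, 200, alternative="smaller")
_, p_two = two_proportion_ztest(30, 200, 60, 200, alternative="two-sided")

print("z =", z)
print("p (larger) =", p_larger)
print("p (smaller) =", p_smaller)
print("p (two-sided) =", p_two)
```

A strongly negative z with a one-sided p-value of 1.0, as in the output above, means the observed proportion in the first group was lower than in the second. That is evidence against H1 as stated; it should not be converted into a claim for the opposite direction without stating that hypothesis before looking at the data.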
