# Crash Insight: A Data-Driven Analysis of Road Accidents in India
### Mini Project – AIML SEM IV (Descriptive EDA + ML)

India records a large number of road accidents each year, with a significant loss of life. This project performs a descriptive exploratory data analysis (EDA) on real accident data and fits a simple linear regression model to predict fatalities from the number of road accidents.

### Objectives:
- Analyze road accident and fatality trends year-wise and state-wise.
- Identify the top 5 states contributing to road fatalities.
- Predict fatality count from accident count using Linear Regression.
- Recommend actionable insights for road safety.

### Importing Libraries

In [None]:
# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling and evaluation (StandardScaler was imported but never used, so it is dropped)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

### Loading Datasets

In [None]:
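# Load the accident dataset (the CSV must be in the working directory)
df = pd.read_csv("RS_Session_266_AU_2647_1.csv")

# Normalize column names: trim whitespace and replace spaces with underscores
df.columns = df.columns.str.strip().str.replace(" ", "_")

# Preview the first rows
df.head()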
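The column clean-up above can be checked on a small frame before relying on names such as `State_UT` later in the notebook. The following is a minimal sketch; the raw header strings are hypothetical and the real CSV headers may differ.

```python
import pandas as pd

# Hypothetical raw headers with stray spaces; not taken from the real CSV.
raw = pd.DataFrame({" State UT ": ["A", "B"], "2021 ": [10, 20], " 2022": [12, 18]})

cleaned = raw.copy()
# Same clean-up as in the notebook: trim, then replace inner spaces.
cleaned.columns = cleaned.columns.str.strip().str.replace(" ", "_")

print(list(cleaned.columns))
```

Only spaces are replaced, so a header containing other separators (for example a slash) would keep them.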

### Data Cleaning

In [None]:
# Count missing values per column
print("Missing values in dataset:")
print(df.isnull().sum())

# Drop every row that has at least one missing value
df = df.dropna()

print(df.head())
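`dropna()` removes a row if any column is missing, and it does not catch numbers stored as text or aggregate rows. A slightly more defensive variant might look like the sketch below. The sample values are made up, and the presence of a row labelled "Total" is an assumption, not something confirmed for this dataset.

```python
import pandas as pd

# Hypothetical sample: the values and the "Total" row are made up for illustration.
sample = pd.DataFrame({
    "State_UT": ["A", "B", "C", "Total"],
    "2021": ["100", "250", None, "350"],
    "2022": ["110", "240", "90", "440"],
})

year_cols = ["2021", "2022"]
# Convert the year columns to numbers; unparseable entries become NaN.
sample[year_cols] = sample[year_cols].apply(pd.to_numeric, errors="coerce")

# Drop only rows that are missing a value in the columns the analysis uses.
clean = sample.dropna(subset=year_cols)

# Assumption: aggregate rows are labelled "Total"; adjust to the real file.
clean = clean[clean["State_UT"].str.lower() != "total"].reset_index(drop=True)

print(clean)
```

Leaving an aggregate row in place would put it at the top of any "top 5" ranking, so it is worth checking for one before sorting.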

### Total Accidents in 2021 and 2022

In [None]:
# Sum each year's column across all rows
yearly_stats = df[["2021", "2022"]].sum().reset_index()

# Rename the resulting columns for plotting
yearly_stats.columns = ['Year', 'Total_Road_Accidents']

plt.figure(figsize=(12, 5))
sns.lineplot(data=yearly_stats, x="Year", y="Total_Road_Accidents", marker="o")
plt.title("Total Road Accidents for 2021 and 2022")
plt.ylabel("Total Road Accidents")
plt.xlabel("Year")
plt.show()

### Top 5 States by Accidents in 2021 and 2022

In [None]:
# Five rows with the highest 2021 counts
top_accidents_2021 = df[["State_UT", "2021"]].sort_values(by="2021", ascending=False).head(5)

# Five rows with the highest 2022 counts
top_accidents_2022 = df[["State_UT", "2022"]].sort_values(by="2022", ascending=False).head(5)

plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
sns.barplot(data=top_accidents_2021, y="State_UT", x="2021", palette="Blues_d")
plt.title("Top 5 States by Road Accidents in 2021")

plt.subplot(1, 2, 2)
sns.barplot(data=top_accidents_2022, y="State_UT", x="2022", palette="Reds_d")
plt.title("Top 5 States by Road Accidents in 2022")

plt.tight_layout()
plt.show()

### Fatality Rate for the top 5 States (2022)

In [None]:
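# NOTE: "2021" and "2022" are the columns plotted as accident counts earlier in this notebook,
# so this ratio is the 2022 value expressed as a percentage of the 2021 value.
# A true fatality rate (deaths per accident) would need a separate fatalities column.
df["Fatality_Rate_%"] = (df["2022"] / df["2021"]) * 100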

top_fatality_rates = df.sort_values(by="Fatality_Rate_%", ascending=False).head(5)

plt.figure(figsize=(10, 6))
sns.barplot(data=top_fatality_rates, x="Fatality_Rate_%", y="State_UT", palette="magma")
plt.title("Top 5 States by Fatality Rate (%) – 2022")
plt.xlabel("Fatality Rate (%)")
plt.ylabel("State/UT")
plt.show()
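The cell above divides the `2022` column by the `2021` column. For comparison, a fatality rate in the sense of deaths per 100 accidents could be computed as in the sketch below. The column names `Accidents_2022` and `Fatalities_2022` and all figures are hypothetical; the notebook does not show a fatalities column being loaded.

```python
import pandas as pd

# Hypothetical figures and column names, for illustration only.
rates = pd.DataFrame({
    "State_UT": ["A", "B", "C"],
    "Accidents_2022": [1000, 400, 50],
    "Fatalities_2022": [200, 120, 30],
})

# Deaths per 100 accidents in the same year.
rates["Fatality_Rate_%"] = rates["Fatalities_2022"] * 100 / rates["Accidents_2022"]

# Rank from most to least severe.
ranked = rates.sort_values(by="Fatality_Rate_%", ascending=False)
print(ranked[["State_UT", "Fatality_Rate_%"]])
```

In this made-up sample, state C has the fewest accidents but the highest rate, which is the kind of pattern a per-accident rate is meant to surface.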

### Correlation matrix

In [None]:
df_numeric = df[["2021", "2022"]]

plt.figure(figsize=(8, 5))
sns.heatmap(df_numeric.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix: Accidents in 2021 & 2022")
plt.show()

### Actual vs Predicted Fatalities

In [None]:
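# Predictor: the "2021" column; target: the "2022" column
X = df[["2021"]]
y = df["2022"]

# Fit ordinary least squares on all rows (no train/test split)
model = LinearRegression()
model.fit(X, y)

# In-sample predictions for the same rows used for fitting
y_pred = model.predict(X)

df["Predicted_Fatalities"] = y_pred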

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y, y=y_pred, label="Predicted vs Actual")
plt.plot([y.min(), y.max()], [y.min(), y.max()], '--r', label="Ideal Fit (y = x)")
plt.xlabel("Actual Fatalities (2022)")
plt.ylabel("Predicted Fatalities")
plt.title("Linear Regression: Actual vs Predicted Fatalities")
plt.legend(loc="upper left")
plt.grid(True)
plt.show()

# R² on the same data used for fitting (a goodness-of-fit measure, not a held-out score)
r2 = r2_score(y, y_pred)
print(f"R² Score: {r2:.2f}")
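The R² above is computed on the same rows the model was fitted on. One way to get a less optimistic estimate is to hold out part of the data, as in the following sketch. It uses synthetic numbers rather than the real dataset, and the 75/25 split and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-state counts; not the real data.
rng = np.random.default_rng(0)
x = rng.uniform(100, 60000, size=36)
y = 1.05 * x + rng.normal(0, 1500, size=36)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Compare the fit on seen rows with the fit on unseen rows.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"Train R²: {r2_train:.2f}, Test R²: {r2_test:.2f}")
```

With only a few dozen rows, a single split is noisy; cross-validation would be a reasonable alternative.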

## Conclusion

- Road accidents across Indian states remain a significant concern, with Tamil Nadu and Madhya Pradesh consistently reporting high accident counts.

- In 2022, Tamil Nadu recorded the highest number of road accidents (64,105), while Madhya Pradesh followed with 54,432.

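- The linear regression of the 2022 column on the 2021 column yields an R² score of approximately 0.99, indicating a strong linear relationship between one year's figures and the next. This score is computed on the same data the model was fitted on, so it measures goodness of fit rather than out-of-sample accuracy.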

- States with fewer accidents are not always the safest: their fatality rates (deaths per accident) sometimes reveal a more severe risk.

## Future Scope

- Add more granular data such as vehicle type, time of day, road type, and weather conditions to enhance the predictive power of models.

- Use time series forecasting to project future accident trends per state.

- Develop a classification model to flag "high-risk" states based on multi-year patterns in accidents and fatalities.

- Incorporate geospatial visualization (e.g., choropleth maps) to show accident density across India visually.
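As a possible starting point for the "high-risk" flag mentioned above, a simple threshold rule can produce labels before any classifier is trained. The sketch below is a rule-based flag, not a classification model; the counts are made up and the 75th-percentile cut-off is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical counts for six states; values are made up.
counts = pd.DataFrame({
    "State_UT": ["A", "B", "C", "D", "E", "F"],
    "2021": [55000, 48000, 12000, 9000, 3000, 700],
    "2022": [64000, 54000, 11000, 10000, 3500, 800],
})

year_cols = ["2021", "2022"]
# Per-year 75th percentile (the cut-off is an arbitrary choice).
thresholds = counts[year_cols].quantile(0.75)

# High-risk = above the threshold in every year considered.
counts["High_Risk"] = (counts[year_cols] > thresholds).all(axis=1)
print(counts[["State_UT", "High_Risk"]])
```

Labels produced this way could later serve as targets for a classifier once more features (vehicle type, road type, weather) are available.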

> **Recommendation:** Authorities and policymakers should leverage such predictive insights to allocate resources, improve infrastructure, and educate drivers in regions identified as high-risk zones.