#### In this notebook, let's go through the major factors that lead to crashes. By major I mean weather, lighting, road defects, etc.

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
import folium
import geopandas as gpd


Import the dataset into the notebook using pd.read_csv and view the first 5 rows using the .head() command (5 rows is the default).

In [None]:
data = pd.read_csv('Traffic_Crashes_-_Crashes.csv')
data.head()

In [None]:
data.columns

We replace all the null values with 'N' or 0, depending on whether the column holds strings or numbers (integer or float).

In [28]:

numeric_cols = data.select_dtypes(include=['int64', 'float64']).columns
string_cols = data.select_dtypes(include=['object', 'string']).columns

# Fill numeric columns with 0
data[numeric_cols] = data[numeric_cols].fillna(0)

# Fill string columns with 'N'
data[string_cols] = data[string_cols].fillna('N')


In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values.sort_values(ascending=True, inplace=True)
print(missing_values)



# IMPORTANT INFO
The data we have spans 2013 to 2025, but the years 2013, 2014, 2015 and 2025 have very few records: 2, 6, 9000 and 5814 respectively. These low numbers would cause inconsistencies in the seasonality plot and obscure the patterns, so we remove those years.

In [None]:
data['CRASH_DATE'] = pd.to_datetime(data['CRASH_DATE'])
# Drop the sparsely populated years; .copy() avoids SettingWithCopyWarning when new columns are added later
data = data[~data['CRASH_DATE'].dt.year.isin([2013, 2014, 2015, 2025])].copy()
data['CRASH_DATE'].dt.year.value_counts()

Convert the CRASH_DATE column to datetime format, which is useful for time series analysis and visualization, and create a new column for the time of the crash, as it will be useful for later analysis.

In [31]:
# Extract features from CRASH_DATE
data['CRASH_DATE'] = pd.to_datetime(data['CRASH_DATE'])
data['TIME'] = data['CRASH_DATE'].dt.time


### YEARLY CRASHES

In [None]:
Yearly_data = data['CRASH_DATE'].dt.year.value_counts().sort_index()
Yearly_data

In [None]:
plt.figure(figsize = (15,6))
plt.plot(Yearly_data.index,Yearly_data.values, marker = 'o')
plt.xlabel('Years')
plt.ylabel('Number of accidents')
plt.title('Number of accidents per year')



It might seem that there were fewer crashes in 2016 and 2017, but this is because less data was recorded for those years than for the subsequent ones. Once we normalize the data (which we will do in the next cells), you will see that the crash rate relative to the amount of data is similar to the other years.
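One way to make years with different amounts of data comparable is to divide each year's crash count by the number of distinct days that have at least one record. The following is a minimal sketch under that assumption, run on made-up timestamps: the `toy` frame, its values and the helper columns `YEAR` and `DAY` are hypothetical, and only the `CRASH_DATE` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data: 2016 has records on 2 days, 2018 on 3 days
toy = pd.DataFrame({
    "CRASH_DATE": pd.to_datetime([
        "2016-01-01 08:00", "2016-01-01 17:30", "2016-01-02 09:15",
        "2018-03-01 07:45", "2018-03-01 12:00", "2018-03-02 18:20",
        "2018-03-02 22:10", "2018-03-03 06:05", "2018-03-03 14:40",
    ])
})

# Helper columns (illustrative names): calendar year and calendar day
toy["YEAR"] = toy["CRASH_DATE"].dt.year
toy["DAY"] = toy["CRASH_DATE"].dt.date

# Per year: total crashes and number of distinct days with records
yearly = toy.groupby("YEAR").agg(
    crashes=("DAY", "size"),
    recorded_days=("DAY", "nunique"),
)

# Crashes per recorded day, so years with partial coverage are comparable
yearly["crashes_per_day"] = yearly["crashes"] / yearly["recorded_days"]
print(yearly)
```

On the real data the same pattern would apply to `data['CRASH_DATE']` directly. Note that this only corrects for missing days, not for days on which reporting was incomplete.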

# Explanation for the above graph

A very notable trend is the dip in crashes from 2019 to 2020, followed by a rise. The dip is, as we know, due to COVID-19, which reduced traffic tremendously because of quarantine. BUT, compared to other major cities, Chicago's crash rate during the pandemic did not fall drastically. This was explained by the **ISSA LAW** firm on their website: "...During the first few months of 2020, Illinois alone saw an increase of 11 percent in vehicle-related deaths."

So, since fewer vehicles were on the road, there was indeed a decrease in crashes, but for those who were on the road, the crash rate increased. This increase was small in absolute terms, which is why the graph still shows a decline. There were various reasons why the fatality rate increased during this time:
* Motorists may be speeding since there are not as many cars on the road now.
* Distracted driving may also be more prevalent since there is less traffic. Drivers might be practicing less caution since there are fewer cars on the road.
* Drug and alcohol usage spiked during lockdown and, with the pent-up desire to hit the road, some drivers took their vehicles out while under the influence, which as we know impairs their thinking and response time, inevitably leading to accidents.

So this explains why, even when the pandemic was at its peak (2020-2021), crashes in Chicago were increasing, albeit slowly. That was until early 2021, after which they were controlled with stricter rules and regulations.

The case of Chicago during the pandemic is a very interesting one: other major cities had their accident rates hit record lows during the pandemic, while Chicago had its all-time highest fatality rate. Below are a few links for reference.

* [NHTSA: 2020 Had The Most Fatal Crashes Since ’07](https://www.coplancrane.com/posts/2020-had-most-fatal-crashes-since-07/)
* [Why Are Illinois Vehicle Fatality Rates Up During the Pandemic?](https://www.issalawoffices.com/personal-injury-criminal-law-blog/illinois-vehicle-fatality-rates-pandemic)
* [Average Chicagoan Spent 102 Hours Stuck In Traffic Last Year — Among Worst Gridlock In The World](https://blockclubchicago.org/2025/01/06/chicago-has-2nd-worst-traffic-in-the-world-with-average-driver-spending-102-hours-gridlocked-study/#:~:text=The%20bumper%2Dto%2Dbumper%20bump,year%2C%20according%20to%20the%20study.)

In [None]:
crashes_c = pd.DataFrame(data['CRASH_DATE'])
crashes_c['CRASH_DATE'] = pd.to_datetime(crashes_c['CRASH_DATE'], errors='coerce')
crashes = crashes_c.groupby(crashes_c['CRASH_DATE'].dt.date).size()
crashes.index= pd.to_datetime(crashes.index)
crashes = crashes.reset_index()
crashes = crashes.set_index('CRASH_DATE')
crashes=crashes.rename(columns = {0:'COUNT'})


sns.set(style="darkgrid")
cubehelix_palette = sns.color_palette("cubehelix", 8)  

plot = crashes.plot(y='COUNT', figsize=(15, 5), color=cubehelix_palette[0])
plot.set_xlabel("Year")
plot.set_ylabel("Crashes")
plot.set_title("Crashes per day")
plot.legend(["Crashes"])
plot.grid()

This graph is similar to the previous one, but here we plot the counts day by day. Unlike the previous graph (which was very smooth), this one is very noisy, which is to be expected when plotting individual days.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

# Reuse the per-day crash counts (grouped by CRASH_DATE) from the previous cell
daily_crashes = crashes

# Set plot style
sns.set(style="darkgrid")
cubehelix_palette = sns.color_palette("cubehelix", 8)  # Generate 8 colors from the cubehelix palette

# Plot the daily crashes time series
plt.figure(figsize=(15, 6))
plt.plot(daily_crashes, label='Daily crashes', color=cubehelix_palette[0])  # Use palette color
plt.title('Daily Motor Vehicle Collisions in Chicago', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Crashes', fontsize=14)
plt.legend()
plt.grid(alpha=0.5)
plt.tight_layout()
plt.show()

# Decompose the time series
decomposition = seasonal_decompose(daily_crashes, model='additive', period=365)

# Plot the decomposed components
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15, 12))
decomposition.trend.plot(ax=ax1, color=cubehelix_palette[1])  # Use palette color
ax1.set_title('Trend', fontsize=14)
ax1.grid(alpha=0.5)

decomposition.seasonal.plot(ax=ax2, color=cubehelix_palette[2])  # Use palette color
ax2.set_title('Seasonality', fontsize=14)
ax2.grid(alpha=0.5)

decomposition.resid.plot(ax=ax3, color=cubehelix_palette[3])  # Use palette color
ax3.set_title('Residuals', fontsize=14)
ax3.grid(alpha=0.5)

plt.tight_layout()
plt.show()

# Analyze residuals for significant fluctuations
residuals = decomposition.resid

# Calculate mean and standard deviation of the residuals (ignoring NaNs)
mean_resid = np.nanmean(residuals)
std_resid = np.nanstd(residuals)

# Define threshold for significant fluctuations (e.g., ±2 standard deviations)
upper_threshold = mean_resid + 2 * std_resid
lower_threshold = mean_resid - 2 * std_resid

# Find dates with significant fluctuations
significant_fluctuations = residuals[(residuals > upper_threshold) | (residuals < lower_threshold)]

# Output the significant dates and their residual values
print("Significant Fluctuations Detected:")
print(significant_fluctuations)

# Plot residuals with highlighted significant points
plt.figure(figsize=(15, 6))
plt.plot(residuals.index, residuals, label='Residuals', color=cubehelix_palette[0])  # Use palette color
plt.axhline(upper_threshold, color=cubehelix_palette[4], linestyle='--', label='Upper Threshold (+2σ)')  # Use palette color
plt.axhline(lower_threshold, color=cubehelix_palette[4], linestyle='--', label='Lower Threshold (-2σ)')  # Use palette color
plt.scatter(significant_fluctuations.index, significant_fluctuations, color=cubehelix_palette[5], label='Significant Fluctuations', zorder=5)  # Use palette color
plt.title('Residuals with Significant Fluctuations Highlighted', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.legend()
plt.grid(alpha=0.5)
plt.tight_layout()
plt.show()


In [None]:
sns.palplot(cubehelix_palette)
plt.title("Cubehelix Palette", fontsize=16)
plt.show()


In [None]:
significant_fluctuations.sort_values(ascending=False)

In [None]:
crashes[crashes.index == '2020-03-22']

# TIME SERIES ANALYSIS

What we did above is called time series decomposition. We usually use this technique when we have data that changes with time and exhibits patterns accordingly. In this instance we are extracting three of the most important patterns:
* Trend: The long-term direction of the data over the years, with the short-term fluctuations smoothed out.
* Seasonality: The pattern that repeats at regular intervals (here every year, since the decomposition uses `period=365`).
* Residual: The unpredictable variation and outliers that the trend and seasonality could not capture. To forecast, one can model these residuals with machine learning or deep learning models and then add the trend and seasonality back to get a full prediction.
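To make the three components concrete, here is a simplified, hand-rolled additive decomposition on a synthetic series. It is only a sketch of the idea (centered moving average for the trend, per-position averages for the seasonality) and not the exact code inside `seasonal_decompose`. The weekly period, the synthetic values and all variable names are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

period = 7  # hypothetical weekly cycle (the notebook uses period=365)
idx = pd.date_range("2021-01-01", periods=8 * period, freq="D")

# Synthetic series: slow upward trend + repeating weekly pattern (sums to 0)
weekly_pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -3.0, 2.0])
observed = pd.Series(
    100 + 0.5 * np.arange(len(idx)) + np.tile(weekly_pattern, 8),
    index=idx,
)

# Trend: centered moving average over one full period (NaN at the edges)
trend = observed.rolling(window=period, center=True).mean()

# Seasonality: average detrended value at each position in the cycle,
# shifted so the seasonal component averages to zero over one period
detrended = observed - trend
position = np.arange(len(idx)) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=idx)

# Residual: whatever the trend and seasonality leave unexplained
resid = observed - trend - seasonal

print(seasonal.iloc[:period].round(2).tolist())
print(resid.dropna().abs().max())
```

Because the synthetic series contains nothing but a trend and a repeating pattern, the residual is essentially zero here. On real crash counts the residual is where one-off events such as storms or lockdowns show up.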

## Understanding the plots

###  Trend
As seen above, crashes rose drastically from 2016, which is not surprising: in Illinois there were 1,082 fatalities that year (the second time the count crossed 1,000 since 2008). This was due to a lot of factors, such as more drivers on the road, not wearing seat belts and drunk driving. Chicago contributed to that with a total of 113 deaths, followed by an increase to 133 fatalities in 2017, an 18% rise. This was brought under control in 2018, when the City of Chicago revised its [VISION ZERO CHICAGO](https://activetrans.org/our-work/walking/vision-zero/) plan, which included more advanced traffic equipment, enhanced child safety zones, etc. Since then crashes remained roughly constant until the pandemic, when they dropped in 2020, rose again, and have stayed stable since (we discussed the trend during the pandemic in the previous section).

A few links for further reading:
* [1,082 People Died in Illinois Car Accidents in 2016. Here’s Why](https://www.sgklawyers.com/blog/2018/06/car-accidents-fatality/)
* [Chicago Traffic Accident Deaths Rise as Fatalities Nationwide Top 40K](https://www.cooneyconway.com/blog/chicago-traffic-deaths-rise-fatalities-nationwide-top-40k)


### Seasonality
As we can see from the above plots, the highest number of accidents annually occurs in October. This is because many days in October are rainy (stats are provided in the link below), and moreover it is almost the start of winter, so the mix of wet and cold conditions is very tricky for drivers. On the opposite side of the spectrum, the lowest number of accidents annually occurs in late December, which is no surprise, as this is during the winter break and a lot of people tend to stay home and enjoy Christmas and New Year's Eve with hot chocolate!

links:
* [When Do Most Car Accidents Occur in Chicago?](https://www.dopplr.com/when-do-most-car-accidents-occur-in-chicago-2016-2020/)

### Residual

These are the irregularities whose patterns the trend and seasonality were not able to pick up. Out of all the random spikes, let's look at the highest positive spike (more crashes than the upper threshold) and the highest negative spike (fewer crashes than the lower threshold).

***Highest positive spike:*** The highest positive spike was on 2019-01-12, when a total of 583 crashes happened. This can be attributed to the snowstorm that hit on that day and carried on into the 13th. Furthermore, there were many fatal accidents: a driver crashing into a state trooper, a 35-year-old man in a head-on collision, a baby's death during a U-turn crash, etc. All these factors made the day too unpredictable to be picked up by either the trend or the seasonality.

links: [Winter storm results in nearly 300 reported traffic crashes](https://www.theintelligencer.com/news/article/Winter-storm-results-in-nearly-300-reported-13530315.php)

***Highest negative spike:*** The highest negative spike was on 2020-03-22, with only 96 accidents. This day was a nightmare, combining a huge increase in coronavirus cases in Chicago with heavy snowfall in northern Illinois. These two reasons can explain why there were just 96 crashes on this day.

links: 
* [March 22-23, 2020: Late Season Moderate to Heavy Wet Snow Event](https://www.weather.gov/lot/2020mar2223_snow)
* [Coronavirus in Illinois updates: Here’s what happened March 21-22 with COVID-19 in the Chicago area](https://www.chicagotribune.com/2020/03/22/coronavirus-in-illinois-updates-heres-what-happened-march-21-22-with-covid-19-in-the-chicago-area/)


# WEATHER

In [None]:
weather = data['WEATHER_CONDITION'].value_counts()
print(weather)

In [None]:
weather_df = pd.DataFrame(weather)
weather_df

Clearly, "CLEAR" weather is the winner by a huge margin here, with a whopping 704,941 accidents: 78.6% of all crashes happened on a clear day! But we can't jump to conclusions so early, because many days of the year have clear skies, unlike snowy and rainy weather, which are season-specific. So let's try to normalize the data, because some weather conditions occur only a few months a year yet result in a lot of accidents.

Before that, let's have a look at the statistics before normalizing.

In [None]:

# Create the dot plot
sns.set(style="darkgrid")
cubehelix_palette = sns.color_palette("cubehelix", 8)
plt.figure(figsize=(8, 6))
plt.scatter(weather_df.index,weather_df['count'], color='blue', s=100, edgecolor='black')

# Add labels and title (x-axis holds the categories, y-axis the raw counts)
plt.xlabel('Weather Condition')
plt.xticks(rotation=45)
plt.ylabel('Number of Crashes')
plt.title('Number of Crashes by Weather Condition')
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()


We have to remember that the dots after the "OTHER" label are not 0, but small non-zero counts. The largest category is so high that the less frequent conditions look almost negligible. Let's normalize to at least mitigate this.
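When the plot squashes the small categories, a table of percentage shares can be easier to read. This is a small sketch using `value_counts(normalize=True)` on made-up labels. The `toy_weather` series and its values are hypothetical, and on the real data the call would be applied to `data['WEATHER_CONDITION']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['WEATHER_CONDITION']
toy_weather = pd.Series(
    ["CLEAR"] * 8 + ["RAIN"] * 1 + ["SNOW"] * 1,
    name="WEATHER_CONDITION",
)

# normalize=True returns fractions that sum to 1; scale to percentages
share_pct = toy_weather.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

Shares make the rare categories legible, but they still do not account for how often each weather condition occurs, which is what a per-day normalization addresses.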

# EXPLAIN

In [None]:
# Ensure CRASH_DATE is a datetime column
data['CRASH_DATE'] = pd.to_datetime(data['CRASH_DATE'], errors='coerce')

# Extract only the date
data['CRASH_DATE_ONLY'] = data['CRASH_DATE'].dt.date

# Group by WEATHER_CONDITION
weather_grouped = data.groupby('WEATHER_CONDITION').agg(
    accidents_count=('CRASH_RECORD_ID', 'count'),  # Count of crashes
    days_of_weather=('CRASH_DATE_ONLY', 'nunique')  # Unique crash days
).reset_index()

# Calculate normalized rate
weather_grouped['normalized_rate'] = (
    weather_grouped['accidents_count'] / weather_grouped['days_of_weather']
)

# Visualize
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(10, 10))

# Total crashes by weather condition
ax[0].bar(weather_grouped['WEATHER_CONDITION'], weather_grouped['accidents_count'])
ax[0].set_title('Total Crashes by Weather Condition')
ax[0].set_ylabel('Number of Crashes')
plt.setp(ax[0].get_xticklabels(), rotation=45, ha='right')  # rotate labels without re-setting them

# Normalized crash rates
ax[1].bar(weather_grouped['WEATHER_CONDITION'], weather_grouped['normalized_rate'])
ax[1].set_title('Normalized Crash Rate by Weather Condition')
ax[1].set_ylabel('Crashes per Day')
plt.setp(ax[1].get_xticklabels(), rotation=45, ha='right')  # rotate labels without re-setting them

plt.tight_layout()
plt.show()


What we observed from the above graph:
* Surprisingly, the number of accidents on "CLEAR" days still dominates the others even after normalization. This was explained in a [research paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC1449863/): "Crash counts are not inevitably higher in snowy weather than in dry weather. On the one hand, snow makes driving more dangerous, by reducing tire adherence and impairing visibility. On the other hand, experienced drivers typically drive more slowly and carefully in snowy weather, and many people avoid or postpone unnecessary travel." Well, at least Chicago drivers are being careful here!
* Accidents on "SNOW" days have risen above "RAIN" accidents, indicating that accidents tend to happen more often per day on snowy days than on rainy ones.
* The other categories rose a little, but nothing major.

# LIGHTING

In [None]:
lighting = data['LIGHTING_CONDITION'].value_counts()
lighting

In [None]:
plt.figure(figsize = (15,6))
plt.bar(lighting.index,lighting.values)
plt.xticks(rotation=45)
plt.xlabel('Lighting Condition')
plt.ylabel('Number of accidents')
plt.title('Number of accidents per lighting condition')


Accidents during "DAYLIGHT" seem to happen about three times as often as those during "DARKNESS, LIGHTED ROAD" (night). This is not a surprise, as city activity during the day is much higher than at night. Ideally we would normalize here too, since in the few active hours at night there were still 196763 accidents, but we don't have data on how many vehicles are on the road at each time, so normalization is not possible. The distinction looks very reasonable, so let's proceed with it.

**NOW, let's see how WEATHER and LIGHTING have a combined effect on accidents**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Create a pivot table
heatmap_data = data.pivot_table(index='WEATHER_CONDITION', columns='LIGHTING_CONDITION', aggfunc='size', fill_value=0)

# Calculate the percentage for each cell
total = heatmap_data.sum().sum()  # Total crashes
percentage_data = heatmap_data / total * 100  # Convert to percentages

# Calculate row and column totals
row_totals = percentage_data.sum(axis=1)  # Row-wise totals
col_totals = percentage_data.sum(axis=0)  # Column-wise totals

# Add the totals as a new row and column
percentage_data_with_totals = percentage_data.copy()
percentage_data_with_totals.loc['Total'] = col_totals
percentage_data_with_totals['Total'] = pd.concat([row_totals, pd.Series(col_totals.sum(), index=['Total'])])

# Combine raw counts and percentages for annotations
annot_data = heatmap_data.astype(str) + "\n(" + percentage_data.round(2).astype(str) + "%)"
annot_data.loc['Total'] = col_totals.round(2).astype(str) + "%"
annot_data['Total'] = pd.concat([row_totals.round(2).astype(str) + "%", pd.Series("100%", index=['Total'])])

# Plot the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(
    percentage_data_with_totals, 
    annot=annot_data, 
    fmt="",  # Let the formatted annotations handle display
    cmap="YlGnBu", 
    linewidths=0.5, 
    linecolor="gray", 
    cbar_kws={"label": "Percentage of Total Crashes"}
)

# Enhance labels and title
plt.title(
    "Crashes by Weather and Lighting Conditions\n(Note: Percentages rounded to two decimal places)", 
    fontsize=16, pad=20
)
plt.xlabel("Lighting Condition", fontsize=14, labelpad=10)
plt.ylabel("Weather Condition", fontsize=14, labelpad=10)
plt.xticks(fontsize=12, rotation=45, ha="right")
plt.yticks(fontsize=12, rotation=0)

# Display the heatmap
plt.tight_layout()
plt.show()


A heat map is a very good visualization for this kind of relationship, and it did not disappoint: it gives us clear information.

What we understood from the plot:
* Tons of accidents happen on a CLEAR day during DAYLIGHT! I wouldn't have believed this without the analysis, but numbers don't lie, and the next-largest counts also fall on CLEAR days.
* More accidents happen on a CLOUDY day during DAYLIGHT than on a SNOW day during DAYLIGHT. This was unexpected for me too, since anyone would expect more accidents on snowy days!
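The heatmap percentages are shares of all crashes, so rare weather conditions always look small. A complementary view is to normalize each weather row to 100%, which answers "given this weather, how are crashes split across lighting conditions?". This is a minimal sketch with `pd.crosstab(..., normalize="index")` on made-up rows. The `toy` frame is hypothetical, and only the two column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows: 4 CLEAR crashes and 2 SNOW crashes
toy = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "CLEAR", "CLEAR", "CLEAR", "SNOW", "SNOW"],
    "LIGHTING_CONDITION": ["DAYLIGHT", "DAYLIGHT", "DAYLIGHT", "DARKNESS",
                           "DAYLIGHT", "DARKNESS"],
})

# normalize="index" divides every row by its own total
row_share = pd.crosstab(
    toy["WEATHER_CONDITION"], toy["LIGHTING_CONDITION"], normalize="index"
) * 100
print(row_share.round(1))
```

In this toy example, 75% of CLEAR crashes fall in DAYLIGHT while SNOW crashes are split evenly, a difference that shares of the grand total would hide.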


## CRASHES(With Geospatial visualization)

In [None]:
data['ROAD_DEFECT'].value_counts()

In [None]:
street = data[['STREET_NAME', 'STREET_NO', 'CRASH_DATE', 'LONGITUDE', 'LATITUDE']]
street.head()

In [48]:
street = street[~((street['LATITUDE'] == 0) & (street['LONGITUDE'] == 0))]
street.reset_index(drop=True, inplace=True)

In [None]:
street

In [None]:
street_count = street['STREET_NAME'].value_counts()
street_count

# Explain

In [None]:
import folium
from folium.plugins import HeatMap
import pandas as pd

data_geo = street.dropna(subset=['LATITUDE', 'LONGITUDE'])

# Center the map around the mean latitude and longitude
map_center = [data_geo['LATITUDE'].mean(), data_geo['LONGITUDE'].mean()]
m = folium.Map(location=map_center, zoom_start=12)

# Prepare heatmap data
heat_data = data_geo[['LATITUDE', 'LONGITUDE']].values.tolist()  # vectorized; avoids a slow iterrows() loop

# Create the heatmap
HeatMap(heat_data, radius=8, max_zoom=13).add_to(m)

# Save the map to an HTML file
#m.save("Heatmap.html")     # Uncomment this line to save the map as an HTML file
#m          # Uncomment this line to display the map


Finally, the city of Chicago!

One notable thing at a glance is how dense the crashes are in Downtown (Near North Side, Near South Side and New Eastside). This is to be expected, as these are the most important parts of Chicago. First of all, the Loop: the Loop is THE downtown, and it contains a lot of skyscrapers housing many tech companies, similar to Manhattan in NYC (Manhattan is better ;) ). This explains why so many accidents happen in such a small area.

Not surprisingly, we can see that a LOT of accidents happen at or close to intersections, which is to be expected in major cities.

* Western Avenue: Western Avenue is the longest street in Chicago, and a longer street means a lot of intersections, which in turn leads to more accidents. For reference, in 2024 alone, there were 59 serious crashes on the street.

* Cicero Avenue: Cicero Avenue is the gateway for people arriving from Midway Airport, and it is one of the most used truck routes in the county.

The main pattern is: the more intersections a street has, the more accidents tend to occur.

Links: [Most Dangerous Roads in Chicago, IL](https://www.makeroadssafe.org/most-dangerous-roads-in-chicago-il/)



## Let's have a look at the areas more prone to accidents during snowy weather

In [None]:
data_snow = data[data["WEATHER_CONDITION"].isin(["SNOW", "BLOWING SNOW"])]
data_snow = data_snow[["CRASH_DATE", "WEATHER_CONDITION", "LATITUDE", "LONGITUDE"]]
data_snow.reset_index(drop=True, inplace=True)
data_snow['WEATHER_CONDITION'].unique()

We filter so that we keep only the crashes that happened in "SNOW" or "BLOWING SNOW" conditions.

In [None]:
import folium
from folium.plugins import MarkerCluster

# Filter out rows with missing coordinates
data_geo = data_snow.dropna(subset=["LATITUDE", "LONGITUDE"])

# Create a base map
base_map = folium.Map(location=[41.8781, -87.6298], zoom_start=10)  # Chicago as an example

# Define colors for the two weather conditions
color_mapping = {
    "SNOW": "blue",
    "BLOWING SNOW": "red"
}

# Create separate marker clusters for each weather condition
for condition, color in color_mapping.items():
    condition_group = folium.FeatureGroup(name=condition, show=True).add_to(base_map)
    condition_cluster = MarkerCluster().add_to(condition_group)
    
    # Add markers for the specific condition
    for _, row in data_geo[data_geo["WEATHER_CONDITION"] == condition].iterrows():
        folium.Marker(
            location=[row["LATITUDE"], row["LONGITUDE"]],
            popup=f"Date: {row['CRASH_DATE']}, Condition: {row['WEATHER_CONDITION']}",
            icon=folium.Icon(color=color)
        ).add_to(condition_cluster)

# Add layer control to switch between groups
folium.LayerControl().add_to(base_map)

# Save or display the map
#base_map.save("snow_accidents_cluster_colored.html")     # Uncomment this line to save the map as an HTML file
#base_map         # Uncomment this line to display the map


Not to our surprise, most of the accidents during snow occur near downtown Chicago. Especially since it's such a dense area, even a millisecond's delay in braking can lead to minor crashes.

Moreover, we should observe that a lot of crashes tend to happen near or around the elevated bridges, often at the entries or exits.

In [None]:
data_snow



In [None]:
from folium.plugins import HeatMap

# Create a base map
base_map = folium.Map(location=[41.8781, -87.6298], zoom_start=10)  # Chicago coordinates

# Prepare data for heatmap (remove invalid coordinates)
heat_data = data_snow[(data_snow["LATITUDE"] != 0) & (data_snow["LONGITUDE"] != 0)]
heat_data = heat_data[["LATITUDE", "LONGITUDE"]].values.tolist()

# Add heatmap to the base map
HeatMap(heat_data, radius=10, blur=15, max_zoom=1).add_to(base_map)

# Save or display the map
#base_map.save("snow_accidents_heatmap.html")    # Uncomment this line to save the map as an HTML file
#base_map        # Uncomment this line to display the map


# DEFECTS ON ROAD

In [None]:
defects = data[['ROAD_DEFECT','CRASH_DATE','LONGITUDE','LATITUDE', 'STREET_NAME']]
defects = defects[(defects["LATITUDE"] != 0) & (defects["LONGITUDE"] != 0)]

defects.reset_index(drop=True, inplace=True)
defects_count = defects['ROAD_DEFECT'].value_counts()
defects['ROAD_DEFECT'].unique()

In [None]:
plt.figure(figsize = (15,6))
plt.bar(defects_count.index,defects_count.values)
plt.xticks(rotation=45)
plt.xlabel('Road Defect')
plt.ylabel('Number of accidents')
plt.title('Number of accidents per road defect')


It's a good thing that most accidents are due to human error and not infrastructure defects, but there are a few cases where road defects led to minor or major crashes, so let's go through those.

In [None]:
# Check for rows with LATITUDE or LONGITUDE equal to 0
zero_coordinates = defects[(defects["LATITUDE"] == 0) | (defects["LONGITUDE"] == 0)]

# Display rows with zero coordinates
print(f"Number of rows with zero coordinates: {len(zero_coordinates)}")
print(zero_coordinates)


In [84]:
defects_valid = defects[~((defects['ROAD_DEFECT']== 'UNKNOWN') | (defects['ROAD_DEFECT']== 'OTHER') | (defects['ROAD_DEFECT']== 'NO DEFECTS'))]
defects_valid.reset_index(drop=True, inplace=True)
defects_valid_count = defects_valid['ROAD_DEFECT'].value_counts()

In [None]:
import matplotlib.pyplot as plt

# Plotting the bar chart
plt.figure(figsize=(15, 6))
plt.bar(defects_valid_count.index, defects_valid_count.values, color='skyblue', edgecolor='black')

# Adding labels and title with enhancements
plt.xticks(rotation=45, fontsize=12, ha='right',fontweight='bold')
plt.xlabel('Road Defect', fontsize=14, fontweight='bold')
plt.ylabel('Number of Accidents', fontsize=14,fontweight='bold')
plt.title('Number of Accidents per Road Defect', fontsize=16, fontweight='bold')

# Adding gridlines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adjusting layout for better fit
plt.tight_layout()

# Display the plot
plt.show()



In [None]:

fig = plt.figure(figsize=(12, 8))
plt.scatter(defects_valid['LONGITUDE'], defects_valid['LATITUDE'], alpha=0.5, c='r', s=1)
plt.title('Accidents by Road Defects')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.annotate('Ashland Ave', xy=(-87.673, 41.86), xytext=(-87.57, 41.86),  arrowprops=dict(facecolor='blue', shrink=0.05, width=1, headwidth=6), fontsize=12,color='blue')
plt.annotate('Western Ave', xy=(-87.687, 41.90), xytext=(-87.7, 42),  arrowprops=dict(facecolor='blue', shrink=0.05, width=1, headwidth=6), fontsize=12,color='blue')
plt.annotate('Lake shore dr NB', xy=(-87.652, 41.97), xytext=(-87.6, 42),  arrowprops=dict(facecolor='blue', shrink=0.05, width=1, headwidth=6), fontsize=12,color='blue')
plt.show()


The above graph vaguely looks like the city of Chicago, because it is! The red dots are the accidents that happened due to road defects, and if you look carefully, you can see the crashes along Western Ave, Ashland Ave and Lake Shore Dr NB. I think you can guess who will top our list.

The road defects on Lake Shore Drive are especially severe: potholes as large as manholes caused heavy damage to vehicles, and multiple vehicles suffered flat tires, which inevitably led to minor crashes.

links: [Crews work to repair stretch of DuSable Lake Shore Drive after cars damaged by potholes](https://abc7chicago.com/post/chicago-traffic-cdot-crews-repair-stretch-dusable-lake-shore-drive-after-cars-damaged-potholes-belmont-avenue/15329317/)

In [None]:
Top_10 = defects_valid.groupby('STREET_NAME').size().sort_values(ascending=False)
Top_10.nunique()  # call the method; without parentheses only the bound method is displayed

#### Since we have a lot of streets, let's take a look at the top 10 streets where accidents were caused by road defects
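A top-N selection can also be written without a hard-coded count threshold, by taking the first N entries of the sorted counts. The sketch below uses made-up street names and `n = 2` to stay small. The `toy_streets` series and its values are hypothetical, and the notebook's case would use `n = 10` on `defects_valid['STREET_NAME']`.

```python
import pandas as pd

# Hypothetical street names and counts, for illustration only
toy_streets = pd.Series(
    ["A ST"] * 5 + ["B AVE"] * 3 + ["C RD"] * 2 + ["D BLVD"] * 1,
    name="STREET_NAME",
)

n = 2  # the notebook's case would use 10
counts = toy_streets.value_counts()   # sorted in descending order
top_streets = counts.head(n).index    # names of the n most frequent streets

# Keep only the rows that belong to those streets
filtered = toy_streets[toy_streets.isin(top_streets)]
print(list(top_streets), len(filtered))
```

Unlike a fixed threshold, `head(n)` keeps working if the data is refreshed and the counts change, although ties at the cut-off are broken arbitrarily.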

In [None]:
valid_streets = Top_10[Top_10>= 139].index
filtered_defects_top_10 = defects_valid[defects_valid['STREET_NAME'].isin(valid_streets)].reset_index(drop=True)
filtered_defects_top_10

In [None]:
from folium.plugins import HeatMap

# Create a base map
base_map = folium.Map(location=[41.8781, -87.6298], zoom_start=10)  # Chicago coordinates

# Prepare data for heatmap (remove invalid coordinates)
heat_data = filtered_defects_top_10[(filtered_defects_top_10["LATITUDE"] != 0) & (filtered_defects_top_10["LONGITUDE"] != 0)]
heat_data = heat_data[["LATITUDE", "LONGITUDE"]].values.tolist()

# Add heatmap to the base map
HeatMap(heat_data, radius=10, blur=15, max_zoom=1).add_to(base_map)

# Save or display the map
#base_map.save("road_defects_heatmap.html")   # Uncomment this line to save the map as an HTML file (separate name so the snow heatmap is not overwritten)
#base_map       # Uncomment this line to display the map


If you zoom in a little, you can see that the major road defects occur only on the main streets. This is mainly due to heavy usage by cars, trucks, buses, etc., unlike connecting streets, which are not used as much by heavy-duty vehicles.

Crashes over time due to defects

In [None]:
defects_valid['ROAD_DEFECT'].value_counts()

In [None]:
crashes_d = defects_valid[['ROAD_DEFECT','CRASH_DATE']].copy()  # copy so the column assignment below does not act on a slice
crashes_d['CRASH_DATE'] = pd.to_datetime(crashes_d['CRASH_DATE']).dt.year
crashes_d

In [None]:
crashes_dy = crashes_d.groupby('CRASH_DATE').count()
crashes_dy = pd.DataFrame(crashes_dy)
crashes_dy.mean()

In [None]:
plt.figure(figsize=(15,6))
plt.plot(crashes_dy.index,crashes_dy['ROAD_DEFECT'], marker = 'o')
plt.xlabel('Years')
plt.ylabel('Crashes due to defects')
plt.title('Crashes due to defects over the years')

In [None]:
crashes_holes = crashes_d[crashes_d['ROAD_DEFECT']=='RUT, HOLES']
crashes_holes.reset_index(drop=True,inplace=True)
crashes_holes_y = crashes_holes.groupby('CRASH_DATE').count()
crashes_holes_y = pd.DataFrame(crashes_holes_y)

# worn surface
crashes_worn = crashes_d[crashes_d['ROAD_DEFECT']=='WORN SURFACE']
crashes_worn.reset_index(drop=True,inplace=True)
crashes_worn_y = crashes_worn.groupby('CRASH_DATE').count()
crashes_worn_y = pd.DataFrame(crashes_worn_y)

#Debris
crashes_deb = crashes_d[crashes_d['ROAD_DEFECT']=='DEBRIS ON ROADWAY']
crashes_deb.reset_index(drop=True,inplace=True)
crashes_deb_y = crashes_deb.groupby('CRASH_DATE').count()
crashes_deb_y = pd.DataFrame(crashes_deb_y)

#shoulder
crashes_sh = crashes_d[crashes_d['ROAD_DEFECT']=='SHOULDER DEFECT']
crashes_sh.reset_index(drop=True,inplace=True)
crashes_sh_y = crashes_sh.groupby('CRASH_DATE').count()
crashes_sh_y = pd.DataFrame(crashes_sh_y)

#PLOT
fig, ax = plt.subplots(nrows=2, ncols=2,figsize = (20,8))
ax[0,0].plot(crashes_holes_y.index, crashes_holes_y['ROAD_DEFECT'],color='red',marker = 'o', label='Pot holes')  # label gives legend() an entry
ax[0,0].set_title('Pot holes')
ax[0,0].set_ylabel('Crashes')
ax[0,0].legend()

# Plot on the second subplot (axs[1])
ax[0,1].plot(crashes_worn_y.index, crashes_worn_y['ROAD_DEFECT'], color='orange', marker = 'o')
ax[0,1].set_title('Worn surfaces')
ax[0,1].set_ylabel('Crashes')
ax[0,1].legend()

ax[1,0].plot(crashes_deb_y.index, crashes_deb_y['ROAD_DEFECT'],color='green',marker = 'o')
ax[1,0].set_title('Debris on road')
ax[1,0].set_xlabel('Years')
ax[1,0].set_ylabel('Crashes')
ax[1,0].legend()

# Plot on the second subplot (axs[1])
ax[1,1].plot(crashes_sh_y.index, crashes_sh_y['ROAD_DEFECT'],marker = 'o')
ax[1,1].set_title('Shoulder defect')
ax[1,1].set_xlabel('Years')
ax[1,1].set_ylabel('Crashes')
ax[1,1].legend()
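
The four per-defect filters above repeat the same filter, group and count steps. As a minimal sketch, assuming a frame with the same `ROAD_DEFECT` and `CRASH_DATE` (year) columns as `crashes_d`, `pd.crosstab` can build all the yearly counts in one table. The rows in `sample` are invented purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with the same two columns as crashes_d (values are invented)
sample = pd.DataFrame({
    'ROAD_DEFECT': ['RUT, HOLES', 'WORN SURFACE', 'RUT, HOLES',
                    'DEBRIS ON ROADWAY', 'WORN SURFACE'],
    'CRASH_DATE': [2022, 2022, 2023, 2023, 2023],
})

# Rows are years, columns are defect types, cells are crash counts
yearly_by_defect = pd.crosstab(sample['CRASH_DATE'], sample['ROAD_DEFECT'])
print(yearly_by_defect)
```

Each column of `yearly_by_defect` could then be passed to `ax.plot` in a loop instead of building one frame per defect type.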

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Filter for 'WORN SURFACE' road defects
area_worn = defects_valid[defects_valid['ROAD_DEFECT'] == 'WORN SURFACE']

# Group by 'STREET_NAME' and count occurrences
worn_street_counts = area_worn['STREET_NAME'].value_counts()

# Select the 40 streets with the highest counts (the variable name is kept from an earlier top-10 version)
top_10_worn_streets = worn_street_counts.head(40).index.tolist()

# Determine the number of rows and columns for subplots
num_streets = len(top_10_worn_streets)
num_cols = 2
num_rows = (num_streets + num_cols - 1) // num_cols  # Ceiling division

# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 5 * num_rows), sharex=True)
axes = axes.flatten()  # Flatten in case of a 2D array

# Iterate over the top streets and corresponding axes
for i, (street, ax) in enumerate(zip(top_10_worn_streets, axes)):
    # Filter data for the current street (copy to avoid SettingWithCopyWarning)
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()
    
    # Reduce 'CRASH_DATE' to the crash year
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE']).dt.year
    
    # Count crashes per year; size() returns a single series,
    # whereas count() would return one column per field and plot a line for each
    street_g = street_data.groupby('CRASH_DATE').size()

    # Plot the number of crashes per year
    ax.plot(street_g.index, street_g.values, marker='o', linestyle='-')
    
    # Set title and labels (counts are per year, not cumulative)
    ax.set_title(f'Crashes on {street} Over Time')
    ax.set_xlabel('Year')
    ax.set_ylabel('Number of Crashes')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()


In [None]:

area_worn = defects_valid[defects_valid['ROAD_DEFECT'] == 'WORN SURFACE']

# Group by 'STREET_NAME' and count occurrences
worn_street_counts = area_worn['STREET_NAME'].value_counts()

# Select the 856 streets with the highest counts
top_10_worn_streets = worn_street_counts.head(856).index.tolist()

# Initialize a dictionary to store differences
positive_differences = {}
name = {}

# Iterate over the top streets
for street in top_10_worn_streets:
    # Filter data for the current street
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()  # copy to avoid SettingWithCopyWarning
    
    # Convert 'CRASH_DATE' to datetime format
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE'])
    street_data['CRASH_YEAR'] = street_data['CRASH_DATE'].dt.year
    
    # Group by year and count crashes
    street_g = street_data.groupby('CRASH_YEAR').size()
    
    # Ensure both 2023 and 2024 exist in the data
    if 2023 in street_g.index and 2024 in street_g.index:
        difference = street_g.loc[2024] - street_g.loc[2023]
        if difference > 0:
            positive_differences[street] = difference
            name[street] = street

# Create a Geo Map for streets with positive differences
geo_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)  # Centered on Chicago (example location)

# Add proportional circles for streets with positive differences
for street, difference in positive_differences.items():
    # Filter the location data for the street for 2024
    location_data_2024 = area_worn[
        (area_worn['STREET_NAME'] == street) & 
        (area_worn['CRASH_DATE'].dt.year == 2024)
    ]
    
    # Use the median latitude and longitude for better accuracy
    lat = location_data_2024['LATITUDE'].median()
    lon = location_data_2024['LONGITUDE'].median()
    
    # Add a proportional circle marker to the map
    folium.CircleMarker(
        location=[lat, lon],
        radius=int(difference) * 4,  # Convert difference to a standard int
        tooltip=f"Street: {street}",  # Label the street name
        popup=f"{street}: {difference} more crashes in 2024",  # Additional details in popup
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(geo_map)

# Save the map to an HTML file or display it
#geo_map.save("proportional_crash_differences_with_labels.html")  #Uncomment this line to save the map as an HTML file
from IPython.display import FileLink

# Provide a download link for the HTML file
#display(FileLink("proportional_crash_differences_with_labels.html")) #Uncomment this line to display the download link

#geo_map       #Uncomment this line to display the map
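
The loop above filters `area_worn` once per street to compare 2023 with 2024. As a rough sketch of the same comparison done in a single pass, a two-level `groupby` followed by `unstack` gives one row per street and one column per year. The `sample` frame and its street names are hypothetical, and the `CRASH_YEAR` column is assumed to exist already.

```python
import pandas as pd

# Hypothetical rows standing in for area_worn (street names and years are invented)
sample = pd.DataFrame({
    'STREET_NAME': ['A ST', 'A ST', 'A ST', 'B AVE', 'B AVE', 'C RD'],
    'CRASH_YEAR':  [2023,   2024,   2024,   2023,    2023,    2024],
})

# One row per street, one column per year; a street with no crashes in a year gets NaN
per_year = sample.groupby(['STREET_NAME', 'CRASH_YEAR']).size().unstack()

# Keep only streets that have crashes in both years, as the loop does
both = per_year.dropna(subset=[2023, 2024])
diff = (both[2024] - both[2023]).astype(int)

# Streets whose count rose from 2023 to 2024, mapped to the size of the increase
increases = diff[diff > 0].to_dict()
print(increases)
```

The resulting dictionary has the same shape as `positive_differences`, so it could feed the map-drawing loop unchanged.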


In [None]:
# Ensure 'CRASH_DATE' is in datetime format
defects_valid['CRASH_DATE'] = pd.to_datetime(defects_valid['CRASH_DATE'])

# Filter for streets in the 'name' list and for the year 2024
filtered_2024_data = defects_valid[
    (defects_valid['STREET_NAME'].isin(name.keys())) &
    (defects_valid['CRASH_DATE'].dt.year == 2024)
]

filtered_2024_data_worn = filtered_2024_data[filtered_2024_data['ROAD_DEFECT']=='WORN SURFACE']
filtered_2024_data_worn.reset_index(drop=True, inplace=True)


from folium.plugins import HeatMap
import folium

# Define the center of the map (e.g., Chicago) based on the mean latitude and longitude
map_center_lat = filtered_2024_data_worn['LATITUDE'].mean()
map_center_lon = filtered_2024_data_worn['LONGITUDE'].mean()

# Create a folium map centered on the data
heat_map = folium.Map(location=[map_center_lat, map_center_lon], zoom_start=12)

# Prepare data for the heat map
heat_data = filtered_2024_data_worn[['LATITUDE', 'LONGITUDE']].values.tolist()

# Add a heat map layer
HeatMap(heat_data).add_to(heat_map)

# Save the heat map to an HTML file
#heat_map.save("filtered_worn_surface_2024_heatmap.html")   #Uncomment this line to save the map as an HTML file

# Provide a link to download the HTML file
from IPython.display import FileLink
#display(FileLink("filtered_worn_surface_2024_heatmap.html")) #Uncomment this line to display the download link

#heat_map      #Uncomment this line to display the map




In [None]:
import folium

# Create a map centered at the calculated mean latitude and longitude
proportional_map = folium.Map(location=[map_center_lat, map_center_lon], zoom_start=12)

# Group by coordinates and count the crashes at each location
# (size() builds CRASH_COUNT directly, so no helper column is needed)
crash_counts = filtered_2024_data_worn.groupby(['LATITUDE', 'LONGITUDE']).size().reset_index(name='CRASH_COUNT')

# Ensure CRASH_COUNT is a plain integer
crash_counts['CRASH_COUNT'] = crash_counts['CRASH_COUNT'].astype(int)

# Define a function to determine the marker color
def get_color(crash_count):
    if crash_count == 1:
        return 'red'  # Red for 1 crash
    elif crash_count == 2:
        return 'blue'  # Blue for 2 crashes
    return 'gray'  # Default color (in case of unexpected values)

# Create proportional markers with dynamic colors
for _, row in crash_counts.iterrows():
    count = int(row['CRASH_COUNT'])  # iterrows() upcasts mixed rows to float, so cast back
    folium.CircleMarker(
        location=[row['LATITUDE'], row['LONGITUDE']],
        radius=count * 5,  # Scale circle size for better visibility
        popup=f"<b>Crashes:</b> {count}",  # HTML-styled popup
        tooltip=f"Crashes: {count}",  # Tooltip for hover
        color='black',  # Circle border color
        fill=True,
        fill_color=get_color(count),  # Dynamic fill color
        fill_opacity=0.8,  # Enhanced opacity for better visibility
    ).add_to(proportional_map)
legend_html = """
<div style="
position: fixed;
bottom: 50px;
left: 50px;
width: 350px;
background-color: white;
border: 2px solid black;
z-index: 1000;
padding: 10px;
font-size: 14px;
">
<b>Crash Count Legend:</b><br>
<span style="color: red;">●</span> 1 Crash<br>
<span style="color: blue;">●</span> 2 Crashes<br>
</div>
"""
proportional_map.get_root().html.add_child(folium.Element(legend_html))
# Save and display the map
#proportional_map.save("proportional_map_final.html")  #Uncomment this line to save the map as an HTML file
from IPython.display import FileLink

#display(FileLink("proportional_map_final.html")) #Uncomment this line to display the download link
#proportional_map     #Uncomment this line to display the map
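
In the cell above the marker colors come from `get_color` while the legend is written by hand, so the two can drift apart. One possible way to keep them in step, shown here only as a sketch, is to derive both from a single lookup table. The names `COUNT_COLORS`, `DEFAULT_COLOR`, `color_for` and `legend_rows` are made up for this example.

```python
# Hypothetical single source of truth for crash-count colors
COUNT_COLORS = {1: 'red', 2: 'blue'}
DEFAULT_COLOR = 'gray'  # used for any count not listed above


def color_for(crash_count):
    """Return the fill color for a crash count, falling back to the default."""
    return COUNT_COLORS.get(int(crash_count), DEFAULT_COLOR)


def legend_rows():
    """Build the legend's HTML rows from the same table the markers use."""
    rows = [
        f'<span style="color: {color};">●</span> {count} crash(es)<br>'
        for count, color in COUNT_COLORS.items()
    ]
    rows.append(f'<span style="color: {DEFAULT_COLOR};">●</span> other<br>')
    return '\n'.join(rows)


print(legend_rows())
```

The string returned by `legend_rows()` could be placed inside the legend `<div>` in place of the hand-written rows.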


In [None]:
import pandas as pd
import folium
from geopy.geocoders import Nominatim
import time

# Initialize the geolocator
geolocator = Nominatim(user_agent="traffic_analysis_tool")

# Assuming 'defects_valid' is your DataFrame containing the data

# Filter for 'WORN SURFACE' road defects
area_worn = defects_valid[defects_valid['ROAD_DEFECT'] == 'WORN SURFACE']

# Group by 'STREET_NAME' and count occurrences
worn_street_counts = area_worn['STREET_NAME'].value_counts()

# List every street, ordered from the highest count to the lowest
top_worn_streets = worn_street_counts.index.tolist()

# Initialize a dictionary to store differences by ZIP code
zip_code_differences = {}

# Iterate over the top streets
for street in top_worn_streets:
    # Filter data for the current street
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()  # copy to avoid SettingWithCopyWarning
    
    # Convert 'CRASH_DATE' to datetime format
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE'])
    street_data['CRASH_YEAR'] = street_data['CRASH_DATE'].dt.year
    
    # Group by year and count crashes
    street_g = street_data.groupby('CRASH_YEAR').size()
    
    # Ensure both 2023 and 2024 exist in the data
    if 2023 in street_g.index and 2024 in street_g.index:
        difference = street_g.loc[2024] - street_g.loc[2023]
        if difference > 0:
            # Get the mean latitude and longitude for the street
            lat = street_data['LATITUDE'].mean()
            lon = street_data['LONGITUDE'].mean()
            
            # Reverse-geocode the street's mean location
            try:
                location = geolocator.reverse((lat, lon), exactly_one=True)
                # reverse() can return None when nothing is found
                zip_code = location.raw['address'].get('postcode', 'Unknown') if location else 'Unknown'
            except Exception as e:
                print(f"Error retrieving ZIP code for {street}: {e}")
                zip_code = 'Unknown'
            time.sleep(1)  # Delay of 1 second to prevent rate-limiting, also after a failed request
            
            # Aggregate differences by ZIP code
            if zip_code != 'Unknown':
                zip_code_differences[zip_code] = zip_code_differences.get(zip_code, 0) + difference

# Create a Geo Map for ZIP codes with positive differences
geo_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)  # Centered on Chicago (example location)

# Add markers for ZIP codes with positive differences
for zip_code, difference in zip_code_differences.items():
    # Use the ZIP code to get a representative location
    try:
        location = geolocator.geocode(f"{zip_code}, Chicago, IL")
        if location:
            folium.Marker(
                location=[location.latitude, location.longitude],
                popup=f"ZIP Code: {zip_code}<br>{difference} more crashes in 2024",  # popups render HTML, so use <br> rather than \n
                icon=folium.Icon(color="blue", icon="info-sign")
            ).add_to(geo_map)
            time.sleep(1)  # Delay of 1 second to prevent rate-limiting
    except Exception as e:
        print(f"Error retrieving location for ZIP code {zip_code}: {e}")

# Save the map to an HTML file
#geo_map.save("zip_code_crash_differences.html") #Uncomment this line to save the map as an HTML file

# Provide a download link for the HTML file
from IPython.display import FileLink
#display(FileLink("zip_code_crash_differences.html")) #Uncomment this line to display the download link

#geo_map      #Uncomment this line to display the map
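
Every reverse-geocoding request in the cell above costs at least one second, and streets that sit close together may resolve to the same ZIP code. A small cache keyed on rounded coordinates is one way to avoid repeating near-identical requests. The sketch below is an assumption-laden illustration: `make_cached_lookup` and `fake_lookup` are hypothetical helpers, `fake_lookup` stands in for the real geocoder call, and the ZIP code it returns is a placeholder.

```python
def make_cached_lookup(lookup, precision=3):
    """Wrap a (lat, lon) -> ZIP function so nearby repeats reuse a stored answer."""
    cache = {}

    def cached(lat, lon):
        # Coordinates that agree to `precision` decimals share one cache entry
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            cache[key] = lookup(lat, lon)
        return cache[key]

    return cached


calls = []  # records how often the underlying lookup is actually invoked


def fake_lookup(lat, lon):
    """Stand-in for a geocoder request; always returns a placeholder ZIP code."""
    calls.append((lat, lon))
    return '60601'


zip_for = make_cached_lookup(fake_lookup)
first = zip_for(41.88512, -87.62211)
second = zip_for(41.88514, -87.62209)  # rounds to the same key, so no second lookup
print(first, second, len(calls))
```

Three decimals of latitude is roughly a city block, so the precision is a trade-off between fewer requests and assigning a nearby street the wrong ZIP code.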


In [None]:
heatmap_data = []

# Iterate over the top streets
for street in top_worn_streets:
    # Filter data for the current street
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()  # copy to avoid SettingWithCopyWarning
    
    # Convert 'CRASH_DATE' to datetime format
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE'])
    street_data['CRASH_YEAR'] = street_data['CRASH_DATE'].dt.year
    
    # Group by year and count crashes
    street_g = street_data.groupby('CRASH_YEAR').size()
    
    # Ensure both 2023 and 2024 exist in the data
    if 2023 in street_g.index and 2024 in street_g.index:
        difference = street_g.loc[2024] - street_g.loc[2023]
        if difference > 0:
            # Get the mean latitude and longitude for the street
            lat = street_data['LATITUDE'].mean()
            lon = street_data['LONGITUDE'].mean()
            
            # Append the data for the heatmap (convert to float)
            heatmap_data.append([float(lat), float(lon), float(difference)])
heatmap_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)  # Centered on Chicago (example location)

# Add the heat map layer
HeatMap(
    data=heatmap_data,
    radius=15,  # Size of each point on the heatmap
    blur=10,    # Blurring to smooth out the visualization
    max_zoom=12
).add_to(heatmap_map)

#heatmap_map.save("heatmap_crash_differences.html") #Uncomment this line to save the map as an HTML file
from IPython.display import FileLink
#display(FileLink("heatmap_crash_differences.html")) #Uncomment this line to display the download link

#heatmap_map     #Uncomment this line to display the map


In [None]:
import folium
from folium import Choropleth
import pandas as pd

# Path to the converted GeoJSON file
geojson_path = "C:/Users/byash/OneDrive/Desktop/TDSP/Zip_Codes_20250127.geojson"

boundary_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)

# Add the Choropleth layer
Choropleth(
    geo_data=geojson_path,
    data=pd.DataFrame(list(zip_code_differences.items()), columns=['ZIP', 'DIFFERENCE']),
    columns=['ZIP', 'DIFFERENCE'],
    key_on="feature.properties.ZIP",
    fill_color='Reds',  
    fill_opacity=0.8,  
    line_opacity=0.4, 
    legend_name='Crash Differences (2024 - 2023)'
).add_to(boundary_map)

#boundary_map      #Uncomment this line to display the map


In [None]:
import pandas as pd

# Coordinates of the two endpoints of the line
point1 = (41.81392, -87.76234)  
point2 = (41.86721, -87.5938)  

# Calculate slope (m) and intercept (c)
x1, y1 = point1[1], point1[0]
x2, y2 = point2[1], point2[0]
m = (y2 - y1) / (x2 - x1)
c = y1 - m * x1

# Define a function to determine the position relative to the line
def assign_group(row):
    y_line = m * row['LONGITUDE'] + c
    if row['LATITUDE'] > y_line:
        return 'Above'
    else:
        return 'Below'

# Apply this to your dataset
defects_valid['Group'] = defects_valid.apply(assign_group, axis=1)

# Separate the data into two groups
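# copy() so later in-place changes to these subsets do not trigger SettingWithCopyWarning
above_data = defects_valid[defects_valid['Group'] == 'Above'].copy()
below_data = defects_valid[defects_valid['Group'] == 'Below'].copy()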

# Output the results
print("Number of data points above the line:", len(above_data))
print("Number of data points below the line:", len(below_data))
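
The cell above classifies each crash with a row-wise `apply`, comparing its latitude with the line's latitude at the same longitude. The same test can be written as one vectorised comparison, which is usually faster on large frames. This is a sketch only: the endpoints are copied from the cell above, while the two rows in `pts` are invented coordinates.

```python
import numpy as np
import pandas as pd

# Endpoints copied from the cell above, as (latitude, longitude)
point1 = (41.81392, -87.76234)
point2 = (41.86721, -87.5938)

# Slope and intercept with longitude as x and latitude as y
m = (point2[0] - point1[0]) / (point2[1] - point1[1])
c = point1[0] - m * point1[1]

# Hypothetical coordinates (invented for illustration)
pts = pd.DataFrame({
    'LATITUDE':  [41.95, 41.75],
    'LONGITUDE': [-87.70, -87.65],
})

# Latitude of the line at each point's longitude, then one comparison for all rows
line_lat = m * pts['LONGITUDE'] + c
pts['Group'] = np.where(pts['LATITUDE'] > line_lat, 'Above', 'Below')
print(pts)
```

As in the original function, a point lying exactly on the line falls into the 'Below' group.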


In [None]:
import folium

# Define the center of the map (e.g., the median latitude and longitude of the data)
map_center_lat = defects_valid['LATITUDE'].median()
map_center_lon = defects_valid['LONGITUDE'].median()

# Create a Folium map centered on the data
map_defects = folium.Map(location=[map_center_lat, map_center_lon], zoom_start=12)

# Add markers for each point in defects_valid
for _, row in defects_valid.iterrows():
    folium.CircleMarker(
        location=[row['LATITUDE'], row['LONGITUDE']],
        radius=3,
        color='blue',  
        fill=True,
        fill_color='red',  
        fill_opacity=0.7,
        tooltip=f"Street: {row['STREET_NAME']}<br>Defect: {row['ROAD_DEFECT']}"
    ).add_to(map_defects)

# Save the map to an HTML file
#map_defects.save("defects_valid_points_map.html") #Uncomment this line to save the map as an HTML file

# Provide a link to download or view the HTML file
from IPython.display import FileLink
#display(FileLink("defects_valid_points_map.html")) #Uncomment this line to display the download link

#map_defects    #Uncomment this line to display the map


In [None]:
below_data.reset_index(drop=True, inplace=True)


below_data_lat = below_data['LATITUDE'].mean()
below_data_lon = below_data['LONGITUDE'].mean()
below_data_map = folium.Map(location=[below_data_lat, below_data_lon], zoom_start=12)

# Add markers for each point in below_data
for _, row in below_data.iterrows():
    folium.CircleMarker(
        location=[row['LATITUDE'], row['LONGITUDE']],
        radius=3,
        color='blue',  
        fill=True,
        fill_opacity=0.7,
        tooltip=f"Street: {row['STREET_NAME']}<br>Defect: {row['ROAD_DEFECT']}"
    ).add_to(below_data_map)

#below_data_map.save("below_data_map.html") #Uncomment this line to save the map as an HTML file

# Provide a link to download or view the HTML file
from IPython.display import FileLink
#display(FileLink("below_data_map.html")) #Uncomment this line to display the download link

#below_data_map   #Uncomment this line to display the map


In [None]:
# Keep only the defect type and crash year; copy() avoids SettingWithCopyWarning
below_data_d = below_data[['ROAD_DEFECT','CRASH_DATE']].copy()
below_data_d['CRASH_DATE'] = pd.to_datetime(below_data_d['CRASH_DATE']).dt.year

above_data_d = above_data[['ROAD_DEFECT','CRASH_DATE']].copy()
above_data_d['CRASH_DATE'] = pd.to_datetime(above_data_d['CRASH_DATE']).dt.year
above_data_d

In [None]:
# Group by crash date and count for below_data_d
below_data_dy = below_data_d.groupby('CRASH_DATE').count()

# Group by crash date and count for above_data_d
above_data_dy = above_data_d.groupby('CRASH_DATE').count()

# Create subplots
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(8, 6), sharex=True)

# Plot for below_data_d
ax[0].plot(below_data_dy.index, below_data_dy['ROAD_DEFECT'], marker='o', color='blue', label='Below Line')
ax[0].set_title('Crashes Below Line', fontsize=10)
ax[0].set_ylabel('Crashes', fontsize=9)
ax[0].legend(fontsize=8)
ax[0].grid(alpha=0.3)

# Plot for above_data_d
ax[1].plot(above_data_dy.index, above_data_dy['ROAD_DEFECT'], marker='o', color='green', label='Above Line')
ax[1].set_title('Crashes Above Line', fontsize=10)
ax[1].set_xlabel('Years', fontsize=9)
ax[1].set_ylabel('Crashes', fontsize=9)
ax[1].legend(fontsize=8)
ax[1].grid(alpha=0.3)

# Adjust layout for compactness
plt.tight_layout()
plt.show()


In [None]:
below_data_worn = below_data_d[below_data_d['ROAD_DEFECT']=='WORN SURFACE']
below_data_worn.reset_index(drop=True,inplace=True)
below_data_worn_y = below_data_worn.groupby('CRASH_DATE').count()
below_data_worn_y

In [None]:
above_data_dy = above_data_d.groupby('CRASH_DATE').count()
above_data_dy

In [None]:
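# Filter matches the variable name and the 'below' cell above (the original filtered on 'RUT, HOLES')
above_data_worn = above_data_d[above_data_d['ROAD_DEFECT']=='WORN SURFACE']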
above_data_worn.reset_index(drop=True,inplace=True)
above_data_worn_y = above_data_worn.groupby('CRASH_DATE').count()
above_data_worn_y

In [None]:
# RUT, HOLES
below_data_holes = below_data_d[below_data_d['ROAD_DEFECT'] == 'RUT, HOLES']
below_data_holes.reset_index(drop=True, inplace=True)
below_data_holes_y = below_data_holes.groupby('CRASH_DATE').count()
below_data_holes_y = pd.DataFrame(below_data_holes_y)

# WORN SURFACE
below_data_worn = below_data_d[below_data_d['ROAD_DEFECT'] == 'WORN SURFACE']
below_data_worn.reset_index(drop=True, inplace=True)
below_data_worn_y = below_data_worn.groupby('CRASH_DATE').count()
below_data_worn_y = pd.DataFrame(below_data_worn_y)

# DEBRIS ON ROADWAY
below_data_deb = below_data_d[below_data_d['ROAD_DEFECT'] == 'DEBRIS ON ROADWAY']
below_data_deb.reset_index(drop=True, inplace=True)
below_data_deb_y = below_data_deb.groupby('CRASH_DATE').count()
below_data_deb_y = pd.DataFrame(below_data_deb_y)

# SHOULDER DEFECT
below_data_sh = below_data_d[below_data_d['ROAD_DEFECT'] == 'SHOULDER DEFECT']
below_data_sh.reset_index(drop=True, inplace=True)
below_data_sh_y = below_data_sh.groupby('CRASH_DATE').count()
below_data_sh_y = pd.DataFrame(below_data_sh_y)

# PLOT
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 8))
# Top-left: ruts and holes (label added so legend() has an entry to show)
ax[0, 0].plot(below_data_holes_y.index, below_data_holes_y['ROAD_DEFECT'], color='red', marker='o', label='RUT, HOLES')
ax[0, 0].set_title('Pot holes')
ax[0, 0].set_ylabel('Crashes')
ax[0, 0].legend()

# Top-right: worn surfaces
ax[0, 1].plot(below_data_worn_y.index, below_data_worn_y['ROAD_DEFECT'], color='orange', marker='o', label='WORN SURFACE')
ax[0, 1].set_title('Worn surfaces')
ax[0, 1].set_ylabel('Crashes')
ax[0, 1].legend()

# Bottom-left: debris on roadway
ax[1, 0].plot(below_data_deb_y.index, below_data_deb_y['ROAD_DEFECT'], color='green', marker='o', label='DEBRIS ON ROADWAY')
ax[1, 0].set_title('Debris on road')
ax[1, 0].set_xlabel('Years')
ax[1, 0].set_ylabel('Crashes')
ax[1, 0].legend()

# Bottom-right: shoulder defects
ax[1, 1].plot(below_data_sh_y.index, below_data_sh_y['ROAD_DEFECT'], marker='o', label='SHOULDER DEFECT')
ax[1, 1].set_title('Shoulder defect')
ax[1, 1].set_xlabel('Years')
ax[1, 1].set_ylabel('Crashes')
ax[1, 1].legend()

plt.tight_layout()
plt.show()


In [None]:
# RUT, HOLES
above_data_holes = above_data_d[above_data_d['ROAD_DEFECT'] == 'RUT, HOLES']
above_data_holes.reset_index(drop=True, inplace=True)
above_data_holes_y = above_data_holes.groupby('CRASH_DATE').count()
above_data_holes_y = pd.DataFrame(above_data_holes_y)

# WORN SURFACE
above_data_worn = above_data_d[above_data_d['ROAD_DEFECT'] == 'WORN SURFACE']
above_data_worn.reset_index(drop=True, inplace=True)
above_data_worn_y = above_data_worn.groupby('CRASH_DATE').count()
above_data_worn_y = pd.DataFrame(above_data_worn_y)

# DEBRIS ON ROADWAY
above_data_deb = above_data_d[above_data_d['ROAD_DEFECT'] == 'DEBRIS ON ROADWAY']
above_data_deb.reset_index(drop=True, inplace=True)
above_data_deb_y = above_data_deb.groupby('CRASH_DATE').count()
above_data_deb_y = pd.DataFrame(above_data_deb_y)

# SHOULDER DEFECT
above_data_sh = above_data_d[above_data_d['ROAD_DEFECT'] == 'SHOULDER DEFECT']
above_data_sh.reset_index(drop=True, inplace=True)
above_data_sh_y = above_data_sh.groupby('CRASH_DATE').count()
above_data_sh_y = pd.DataFrame(above_data_sh_y)

# PLOT
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 8))
# Top-left: ruts and holes (label added so legend() has an entry to show)
ax[0, 0].plot(above_data_holes_y.index, above_data_holes_y['ROAD_DEFECT'], color='red', marker='o', label='RUT, HOLES')
ax[0, 0].set_title('Pot holes')
ax[0, 0].set_ylabel('Crashes')
ax[0, 0].legend()

# Top-right: worn surfaces
ax[0, 1].plot(above_data_worn_y.index, above_data_worn_y['ROAD_DEFECT'], color='orange', marker='o', label='WORN SURFACE')
ax[0, 1].set_title('Worn surfaces')
ax[0, 1].set_ylabel('Crashes')
ax[0, 1].legend()

# Bottom-left: debris on roadway
ax[1, 0].plot(above_data_deb_y.index, above_data_deb_y['ROAD_DEFECT'], color='green', marker='o', label='DEBRIS ON ROADWAY')
ax[1, 0].set_title('Debris on road')
ax[1, 0].set_xlabel('Years')
ax[1, 0].set_ylabel('Crashes')
ax[1, 0].legend()

# Bottom-right: shoulder defects
ax[1, 1].plot(above_data_sh_y.index, above_data_sh_y['ROAD_DEFECT'], marker='o', label='SHOULDER DEFECT')
ax[1, 1].set_title('Shoulder defect')
ax[1, 1].set_xlabel('Years')
ax[1, 1].set_ylabel('Crashes')
ax[1, 1].legend()

plt.tight_layout()
plt.show()


In [None]:
# Filter for 'WORN SURFACE' road defects
area_worn = below_data[below_data['ROAD_DEFECT'] == 'WORN SURFACE']

# Group by 'STREET_NAME' and count occurrences
worn_street_counts = area_worn['STREET_NAME'].value_counts()

# Select the top streets
top_worn_streets = worn_street_counts.index.tolist()

# Initialize a list for streets with increasing crashes from 2023 to 2024
increasing_streets = []

# Iterate over the top streets
for street in top_worn_streets:
    # Filter data for the current street
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()  # copy to avoid SettingWithCopyWarning
    
    # Convert 'CRASH_DATE' to datetime format
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE'])
    street_data['CRASH_YEAR'] = street_data['CRASH_DATE'].dt.year
    
    # Group by year and count crashes
    street_g = street_data.groupby('CRASH_YEAR').size()
    
    # Ensure both 2023 and 2024 exist in the data
    if 2023 in street_g.index and 2024 in street_g.index:
        if street_g.loc[2024] > street_g.loc[2023]:  # Check for increase
            increasing_streets.append(street)

# Create subplots for increasing streets
num_streets = len(increasing_streets)
num_cols = 2
num_rows = (num_streets + num_cols - 1) // num_cols  # Ceiling division

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 5 * num_rows), sharex=True)
axes = axes.flatten()  # Flatten in case of a 2D array

# Plot the data for streets with increasing crashes
for i, (street, ax) in enumerate(zip(increasing_streets, axes)):
    # Yearly crash counts for this street (copy to avoid SettingWithCopyWarning)
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE']).dt.year
    street_g = street_data.groupby('CRASH_DATE').size()

    ax.plot(street_g.index, street_g.values, marker='o', linestyle='-')
    ax.set_title(f'Crashes on {street} Over Time')
    ax.set_xlabel('Year')
    ax.set_ylabel('Number of Crashes')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()


In [None]:

# Create a Geo Map for streets with positive differences with the default background (OpenStreetMap)
geo_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11, tiles='OpenStreetMap')

# Add proportional circles for streets with positive differences
for street, difference in positive_differences.items():
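    # Filter the location data for the street for 2024
    location_data_2024 = area_worn[
        (area_worn['STREET_NAME'] == street) & 
        (area_worn['CRASH_DATE'].dt.year == 2024)
    ]
    
    # positive_differences was built city-wide, while area_worn is now the below-line subset;
    # skip streets with no 2024 rows here, otherwise the median coordinates would be NaN
    if location_data_2024.empty:
        continue
    
    # Use the median latitude and longitude so outlying points do not shift the marker
    lat = location_data_2024['LATITUDE'].median()
    lon = location_data_2024['LONGITUDE'].median()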
    
    # Add a proportional circle marker to the map
    folium.CircleMarker(
        location=[lat, lon],
        radius=int(difference) * 2,  # Convert difference to a standard int
        tooltip=folium.Tooltip(f"<b>Street:</b> {street}<br><b>Crash Increase:</b> {difference}"),  # Styled tooltip with bold text
        popup=folium.Popup(f"<b>Street:</b> {street}<br><b>Crash Increase:</b> {difference}<br><b>Year:</b> 2024", max_width=300),
        color='darkblue',
        fill=True,
        fill_color='orange',  # Use a contrasting fill color for better visibility
        fill_opacity=0.8
    ).add_to(geo_map)

# Add a legend using an HTML element
legend_html = """
<div style="
position: fixed;
bottom: 50px;
left: 50px;
width: 250px;
background-color: white;
border: 2px solid black;
z-index: 1000;
padding: 10px;
font-size: 14px;
box-shadow: 5px 5px 5px rgba(0,0,0,0.3);
">
<b>Legend:</b><br>
<span style="color: darkblue;">●</span> <b>Proportional circles:</b> Crash increase in 2024.<br>
Circle size indicates the magnitude of the increase.
</div>
"""
geo_map.get_root().html.add_child(folium.Element(legend_html))

# Save the map to an HTML file or display it
#geo_map.save("default_background_crash_map.html") #Uncomment this line to save the map as an HTML file
from IPython.display import FileLink

# Provide a download link for the HTML file
#display(FileLink("default_background_crash_map.html")) #Uncomment this line to display the download link

#geo_map    #Uncomment this line to display the map


In [None]:
# Ensure 'CRASH_DATE' is in datetime format
below_data['CRASH_DATE'] = pd.to_datetime(below_data['CRASH_DATE'])

# Filter for streets in the 'name' list and for the year 2024
filtered_2024_data = below_data[
    (below_data['STREET_NAME'].isin(name.keys())) &
    (below_data['CRASH_DATE'].dt.year == 2024)
]

filtered_2024_data_worn = filtered_2024_data[filtered_2024_data['ROAD_DEFECT']=='WORN SURFACE']
filtered_2024_data_worn.reset_index(drop=True, inplace=True)


from folium.plugins import HeatMap
import folium

# Define the center of the map (e.g., Chicago) based on the mean latitude and longitude
map_center_lat = filtered_2024_data_worn['LATITUDE'].mean()
map_center_lon = filtered_2024_data_worn['LONGITUDE'].mean()

# Create a folium map centered on the data
heat_map = folium.Map(location=[map_center_lat, map_center_lon], zoom_start=12)

# Prepare data for the heat map
heat_data = filtered_2024_data_worn[['LATITUDE', 'LONGITUDE']].values.tolist()

# Add a heat map layer
HeatMap(heat_data).add_to(heat_map)

# Save the heat map to an HTML file
#heat_map.save("filtered_worn_surface_2024_heatmap.html")  #Uncomment this line to save the map as an HTML file

# Provide a link to download the HTML file
from IPython.display import FileLink
#display(FileLink("filtered_worn_surface_2024_heatmap.html")) #Uncomment this line to display the download link

#heat_map   #Uncomment this line to display the map




In [None]:
# Create a map centered at the calculated mean latitude and longitude
proportional_map = folium.Map(location=[map_center_lat, map_center_lon], zoom_start=12)

# Group by coordinates and count the crashes at each location
# (size() builds CRASH_COUNT directly, so no helper column is needed)
crash_counts = filtered_2024_data_worn.groupby(['LATITUDE', 'LONGITUDE']).size().reset_index(name='CRASH_COUNT')

# Ensure CRASH_COUNT is a plain integer
crash_counts['CRASH_COUNT'] = crash_counts['CRASH_COUNT'].astype(int)

# Define a function to determine the marker color
def get_color(crash_count):
    if crash_count == 1:
        return 'red'  
    elif crash_count == 2:
        return 'blue'  
    return 'gray'  

# Create proportional markers with dynamic colors
for _, row in crash_counts.iterrows():
    count = int(row['CRASH_COUNT'])  # iterrows() upcasts mixed rows to float, so cast back
    folium.CircleMarker(
        location=[row['LATITUDE'], row['LONGITUDE']],
        radius=count * 5,  
        popup=f"<b>Crashes:</b> {count}", 
        tooltip=f"Crashes: {count}",  
        color='black', 
        fill=True,
        fill_color=get_color(count), 
        fill_opacity=0.8,  
    ).add_to(proportional_map)
legend_html = """
<div style="
position: fixed;
bottom: 50px;
left: 50px;
width: 350px;
background-color: white;
border: 2px solid black;
z-index: 1000;
padding: 10px;
font-size: 14px;
">
<b>Crash Count Legend:</b><br>
<span style="color: red;">●</span> 1 Crash<br>
<span style="color: blue;">●</span> 2 Crashes<br>
</div>
"""
proportional_map.get_root().html.add_child(folium.Element(legend_html))
# Save and display the map
#proportional_map.save("proportional_map_final.html")
from IPython.display import FileLink

#display(FileLink("proportional_map_final.html"))
#proportional_map


In [None]:
import pandas as pd
import folium
from geopy.geocoders import Nominatim
import time

# Initialize the geolocator
geolocator = Nominatim(user_agent="traffic_analysis_tool")

# Filter for 'WORN SURFACE' road defects
area_worn = below_data[below_data['ROAD_DEFECT'] == 'WORN SURFACE']

# Group by 'STREET_NAME' and count occurrences
worn_street_counts = area_worn['STREET_NAME'].value_counts()

# List every street, ordered from the highest count to the lowest
top_worn_streets = worn_street_counts.index.tolist()

# Initialize a dictionary to store differences by ZIP code
zip_code_differences = {}

# Iterate over the top streets
for street in top_worn_streets:
    # Filter data for the current street
    street_data = area_worn[area_worn['STREET_NAME'] == street].copy()  # copy to avoid SettingWithCopyWarning
    
    # Convert 'CRASH_DATE' to datetime format
    street_data['CRASH_DATE'] = pd.to_datetime(street_data['CRASH_DATE'])
    street_data['CRASH_YEAR'] = street_data['CRASH_DATE'].dt.year
    
    # Group by year and count crashes
    street_g = street_data.groupby('CRASH_YEAR').size()
    
    # Ensure both 2023 and 2024 exist in the data
    if 2023 in street_g.index and 2024 in street_g.index:
        difference = street_g.loc[2024] - street_g.loc[2023]
        if difference > 0:
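            # Get the mean latitude and longitude for the street, skipping rows
            # whose missing coordinates were filled with 0 during cleaning
            coords = street_data[(street_data['LATITUDE'] != 0) & (street_data['LONGITUDE'] != 0)]
            lat = coords['LATITUDE'].mean()
            lon = coords['LONGITUDE'].mean()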
            
            # Reverse-geocode the mean coordinates to a ZIP code
            try:
                location = geolocator.reverse((lat, lon), exactly_one=True)
                # reverse() returns None when no address is found
                zip_code = location.raw['address'].get('postcode', 'Unknown') if location else 'Unknown'
            except Exception as e:
                print(f"Error retrieving ZIP code for {street}: {e}")
                zip_code = 'Unknown'
            finally:
                time.sleep(1)  # Delay of 1 second to prevent rate-limiting, even after a failure
            
            # Aggregate differences by ZIP code
            if zip_code != 'Unknown':
                zip_code_differences[zip_code] = zip_code_differences.get(zip_code, 0) + difference

# Create a Geo Map for ZIP codes with positive differences
geo_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)  # Centered on Chicago (example location)

# Add markers for ZIP codes with positive differences
for zip_code, difference in zip_code_differences.items():
    # Use the ZIP code to get a representative location
    try:
        location = geolocator.geocode(f"{zip_code}, Chicago, IL")
        if location:
            folium.Marker(
                location=[location.latitude, location.longitude],
                # Popups render HTML, so use <br> rather than \n for the line break
                popup=f"ZIP Code: {zip_code}<br>{difference} more crashes in 2024",
                icon=folium.Icon(color="blue", icon="info-sign")
            ).add_to(geo_map)
    except Exception as e:
        print(f"Error retrieving location for ZIP code {zip_code}: {e}")
    finally:
        time.sleep(1)  # Delay of 1 second after every request to prevent rate-limiting

# Save the map to an HTML file
#geo_map.save("zip_code_crash_differences.html")

# Provide a download link for the HTML file
from IPython.display import FileLink
#display(FileLink("zip_code_crash_differences.html"))

#geo_map
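
The per-street loop above counts crashes by year and keeps streets where 2024 exceeds 2023. The same comparison can be sketched without a Python loop by grouping on street and year at once. This is only an illustration on made-up rows: the `toy` frame and its values are hypothetical stand-ins for `area_worn`, and the geocoding step is left out.

```python
import pandas as pd

# Hypothetical rows standing in for area_worn (not real crash data)
toy = pd.DataFrame({
    "STREET_NAME": ["A ST", "A ST", "A ST", "B AVE", "B AVE", "C RD"],
    "CRASH_DATE": pd.to_datetime([
        "2023-01-05", "2024-02-10", "2024-03-15",
        "2023-06-01", "2024-07-04", "2024-08-09",
    ]),
})

# Count crashes per street and year, then spread the years into columns
counts = (
    toy.assign(CRASH_YEAR=toy["CRASH_DATE"].dt.year)
       .groupby(["STREET_NAME", "CRASH_YEAR"])
       .size()
       .unstack("CRASH_YEAR")
)

# Keep only streets with crashes in both years, mirroring the loop's check
both_years = counts.dropna(subset=[2023, 2024])
yearly_diff = (both_years[2024] - both_years[2023]).astype(int)

# Streets where crashes increased from 2023 to 2024
increased = yearly_diff[yearly_diff > 0]
print(increased)
```

Streets missing either year drop out at the `dropna` step, so the result covers the same streets as the `if 2023 in street_g.index and 2024 in street_g.index` check.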


In [None]:
import folium
from folium import Choropleth
import pandas as pd

# Path to the converted GeoJSON file
geojson_path = "C:/Users/byash/OneDrive/Desktop/TDSP/Zip_Codes_20250127.geojson"

# Create a map
boundary_map = folium.Map(location=[41.8781, -87.6298], zoom_start=11)

# Add the Choropleth layer
Choropleth(
    geo_data=geojson_path,
    data=pd.DataFrame(list(zip_code_differences.items()), columns=['ZIP', 'DIFFERENCE']),
    columns=['ZIP', 'DIFFERENCE'],
    key_on="feature.properties.ZIP",  # Match this to the GeoJSON property for ZIP
    fill_color='Reds',  # Use shades of red
    fill_opacity=0.8,  # Increase fill opacity
    line_opacity=0.4,  # Semi-transparent boundary lines
    legend_name='Crash Differences (2024 - 2023)'
).add_to(boundary_map)

# Display the map directly in the notebook
#boundary_map
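
The choropleth only colors a region when the value in the `ZIP` column exactly matches the GeoJSON property named in `key_on`. Postcodes returned by a geocoder can carry a ZIP+4 suffix or stray whitespace, so it may help to normalize the keys first. The sketch below assumes the GeoJSON stores plain 5-digit strings, which has not been checked against the file; `normalize_zip` and `sample_differences` are hypothetical names used for illustration.

```python
import re

def normalize_zip(raw):
    """Return the 5-digit ZIP from a postcode string, or None if it is not a US ZIP."""
    match = re.match(r"\s*(\d{5})(?:-\d{4})?\s*$", str(raw))
    return match.group(1) if match else None

# Hypothetical values in the same shape as zip_code_differences
sample_differences = {"60601": 3, "60601-1234": 2, "Unknown": 4, "60614 ": 1}

# Merge entries that refer to the same 5-digit ZIP and drop unusable keys
normalized = {}
for raw_zip, diff in sample_differences.items():
    zip5 = normalize_zip(raw_zip)
    if zip5 is not None:
        normalized[zip5] = normalized.get(zip5, 0) + diff

print(normalized)
```

A dictionary cleaned this way could be passed to the `data=` DataFrame in place of the raw one.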


In [1]:
import os

def find_file(filename, search_path):
    for root, dirs, files in os.walk(search_path):
        if filename in files:
            return os.path.join(root, filename)
    return None

# Define the filename and the search path (current directory)
filename = 'map_with_street_name.html'
search_path = '.'

# Find the file
file_path = find_file(filename, search_path)

if file_path:
    print(f"File found: {file_path}")
else:
    print("File not found")

File not found
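
The same recursive search can be sketched more compactly with `pathlib`. This is an alternative under one assumption: the filename contains no glob wildcard characters, since `rglob` treats its argument as a pattern. The temporary directory and its contents exist only to demonstrate the call.

```python
from pathlib import Path
import tempfile

def find_file(filename, search_path="."):
    """Return the first path under search_path named filename, or None."""
    return next(Path(search_path).rglob(filename), None)

# Demonstrate on a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "outputs" / "maps"
    nested.mkdir(parents=True)
    (nested / "map_with_street_name.html").write_text("<html></html>")

    found = find_file("map_with_street_name.html", tmp)
    missing = find_file("no_such_file.html", tmp)
    found_name = found.name if found else None

print(found_name)
print(missing)
```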
