# 03 Data Visualization

## Introduction

This notebook focuses on the **data visualization** phase of the Traffic Accident Analysis project. The purpose of this step is to explore accident patterns through charts and geospatial maps, providing insights that are not immediately visible in raw tables.  

During this phase, I will:  
- Visualize the distribution of accident severity and location.  
- Identify regional accident hotspots using geospatial heatmaps.  
- Explore temporal patterns and relationships between variables.  
- Incorporate interactive filters (e.g., by severity, state, and date range) to allow dynamic exploration of the data.  

By the end of this notebook, I will have a clearer understanding of accident trends across geography and time. These insights will serve as a foundation for building predictive models in Notebook 4.


## Step 1(a): Load and Inspect Data

Here I load the cleaned dataset and perform basic checks to confirm it was imported correctly.  
This includes viewing the shape, column names, and the first few rows.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load the cleaned dataset (this absolute path is specific to my machine)
df = pd.read_csv(
    "C:/Users/tuite/Desktop/Software Portfolio/python/Traffic_Accident_Analysis/data/final_cleaned_accident_data.csv"
)

# Check dataset shape (rows, columns)
print("Shape:", df.shape)

# View column names
print("\nColumns:")
print(df.columns.tolist())

# Preview first 5 rows
df.head()

## Step 1(b): Data Types and Missing Values

Before visualizing, I want to confirm the data types of each column and check how many non-null values each contains.


In [None]:
# Display datatypes and non-null counts
df.info()
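
`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. As a complement, here is a minimal sketch of a per-column missing-value summary. The helper name `missing_summary` and the small `demo` frame (including its values and the `Temperature(F)` column) are hypothetical stand-ins so the snippet runs on its own; in this notebook it would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in data, not taken from the real dataset
demo = pd.DataFrame(
    {
        "Severity": [2, 3, 2, 4],
        "Temperature(F)": [70.0, np.nan, 65.0, np.nan],
        "City": ["Houston", None, "Dayton", "Miami"],
    }
)

missing_report = missing_summary(demo)
print(missing_report)
```

Sorting by the missing count puts the columns that need the most attention at the top, which is easier to scan than the full `info()` listing on a 40-column frame.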


## Step 1(c): Summary Statistics

Next, I’ll generate summary statistics for the numerical columns to understand their ranges and averages, and to spot obvious outliers that might affect my analysis.


In [None]:
# Summary statistics for numerical columns
df.describe().T
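
Reading outliers off a `describe()` table relies on eyeballing the min and max. One way to make that check more systematic is to count values outside the usual 1.5 × IQR fences. This is only a sketch under that assumption, not a rule the project has settled on. The helper name `iqr_outlier_counts`, the column names, and the values in `demo` are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum().sort_values(ascending=False)


# Hypothetical stand-in data with one extreme distance value
demo = pd.DataFrame(
    {
        "distance_mi": [0.1, 0.2, 0.3, 0.2, 0.1, 450.0],
        "temperature_f": [60, 62, 65, 61, 63, 64],
    }
)

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

The IQR fences are a generic screen. For heavily skewed columns they can flag many legitimate values, so the counts are a prompt for a closer look rather than a filter to apply blindly.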


## Step 1(d): Categorical Summary

In addition to numerical features, the dataset contains several categorical columns (e.g., `State`, `City`, `Weather_Condition`).  
To better understand them, I’ll summarize these columns to see their most frequent values, such as which state or weather condition occurs most often.


In [None]:
# Summary statistics for categorical columns
df.describe(include=["object", "category"]).T
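
`describe()` reports only the single most frequent value per column (`top` and `freq`). To see the runners-up as well, a short sketch using `value_counts()` could look like the following. The helper name `top_values` and the values in `demo` are hypothetical; `State` and `Weather_Condition` are column names from the dataset.

```python
import pandas as pd


def top_values(frame: pd.DataFrame, columns: list[str], n: int = 3) -> dict[str, pd.Series]:
    """Return the n most frequent values for each requested column."""
    return {col: frame[col].value_counts().head(n) for col in columns}


# Hypothetical stand-in data, not taken from the real dataset
demo = pd.DataFrame(
    {
        "State": ["CA", "CA", "TX", "FL", "CA", "TX"],
        "Weather_Condition": ["Fair", "Fair", "Rain", "Fair", "Cloudy", "Fair"],
    }
)

top = top_values(demo, ["State", "Weather_Condition"], n=2)
for column, counts in top.items():
    print(f"\n{column}:")
    print(counts)
```

Passing `normalize=True` to `value_counts()` would return shares instead of raw counts, which can be easier to compare across columns.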


## Step 1 Summary

The dataset was successfully loaded with just under 7 million rows and 40 columns.  
- Numerical features mostly show realistic ranges, but there are some extreme outliers (e.g., distances >400 miles, unrealistic temperatures, and very high wind speeds).  
- Categorical features show California, Los Angeles County, and Houston as major accident hotspots.  
- Most accidents occurred during the day and under fair weather conditions.  

Now I know what to expect going into visualization, where these statistics can be shown more effectively through charts and maps.
