# **EDA**

## Objectives

* Describe the data using basic statistics
* Use visualisations to explore the distribution, skew and kurtosis of numerical variables
* Use bar charts to explore the distribution of categorical variables
* Compare the median clearance time across the levels of each categorical variable
* Compute statistical significance between variables of interest
* Use a correlation matrix to explore relationships 

## Inputs

* The dataset, "US_Accidents_For_EDA.csv", saved locally in "Data/EDA"

## Outputs

* The dataset, "US_Accidents_For_ML.csv", saved locally in "Data/ML" 

## Summary of Steps

* Load the dataset



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, so when you run the notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'
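One caveat with the cells above is that re-running the `os.chdir` cell moves up another level each time. The following is a minimal sketch of a guarded version, assuming the notebooks live in a subfolder named `jupyter_notebooks` (as in the path printed above). The helper name `move_to_parent` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_parent(subfolder_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while we are still inside the notebooks subfolder.

    `subfolder_name` is an assumption taken from the path shown above;
    adjust it if the notebooks are stored elsewhere.
    """
    cwd = Path.cwd()
    if cwd.name == subfolder_name:
        # Same effect as os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to run more than once: the second call leaves the directory unchanged
print(move_to_parent())
```

Because the function checks the folder name first, running the cell twice does not climb above the project root.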

---

## Required Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from feature_engine import transformation as vt
from statsmodels.graphics.gofplots import qqplot
import pingouin as pg
import math

---

## Load the Dataset

I will use pandas to open the CSV file.

In [5]:
df = pd.read_csv("Data/EDA/US_Accidents_For_EDA.csv")
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,Severity,Start_Lat,Start_Lng,Distance(mi),Timezone,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Station,Stop,Traffic_Calming,Traffic_Signal,Sunrise_Sunset,Clearance_Time(hr),Clearance_Class,Weather_Simplified,State_Other,Population,County_Other,Month
0,2,32.456486,-93.774536,0.501,Central,78.0,78.0,62.0,29.61,10.0,CALM,0.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,26.205833,Very Long,Fair,LA,187540,Caddo,Sep
1,2,36.804693,-76.189728,0.253,Eastern,54.0,54.0,90.0,30.4,7.0,CALM,0.0,0.0,False,False,True,False,False,False,False,False,False,False,True,Night,81.274444,Very Long,Fair,VA,459444,Virginia Beach,May
2,2,29.895741,-90.090026,1.154,Pacific,40.0,33.0,58.0,30.28,10.0,N,10.0,0.0,False,False,False,False,True,False,False,False,False,False,False,Day,8761.75,Very Long,Cloudy,LA,440784,Jefferson,Jan
3,2,32.456459,-93.779709,0.016,Central,62.0,62.0,75.0,29.8,10.0,SSE,8.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,28.096667,Very Long,Cloudy,LA,187540,Caddo,Nov
4,2,26.966433,-82.255414,0.057,Eastern,84.0,84.0,69.0,29.99,10.0,E,18.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Day,27.26,Very Long,Cloudy,FL,186824,Other,Sep


---

## Descriptive Statistics

I am going to look at the descriptive statistics of the numerical variables. This will give: the count (number of non-missing values); the mean, which is a measure of central tendency, and the standard deviation, which describes how far the data spread around the mean; the minimum and maximum values; and the quartiles, which are the values at or below which 25 %, 50 % and 75 % of the data points fall. The 50 % quartile is also known as the median.

In general, when data is approximately normally distributed, the mean and standard deviation are typically used to describe it, whereas when data is skewed, the median and interquartile range are typically used instead.

In [8]:
df.describe()

Unnamed: 0,Severity,Start_Lat,Start_Lng,Distance(mi),Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in),Clearance_Time(hr),Population
count,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0,9960.0
mean,2.157028,35.731321,-95.102106,0.937225,61.28743,60.001466,64.901004,29.295206,9.053608,7.388765,0.005092,239.888811,680950.7
std,0.484565,5.549589,16.939049,2.42448,18.732327,20.949519,22.854755,1.27407,2.790325,5.579474,0.036646,1396.189647,1387282.0
min,1.0,25.435661,-124.430507,0.0,-17.0,-36.0,4.0,20.33,0.0,0.0,0.0,0.091667,34.0
25%,2.0,32.461804,-116.33859,0.008,49.0,48.0,49.0,29.19,10.0,3.0,0.0,0.995278,61055.0
50%,2.0,35.155197,-90.932593,0.181,63.0,63.0,68.0,29.73,10.0,7.0,0.0,3.038056,226525.0
75%,2.0,40.034891,-80.414568,0.882,75.0,75.0,84.0,29.97,10.0,10.0,0.0,13.474653,652522.0
max,4.0,48.999569,-68.732984,65.308,112.0,112.0,100.0,30.71,75.0,51.0,1.67,26304.992778,10017400.0
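In the table above, "Clearance_Time(hr)" has a mean of about 239.9 hours but a median of about 3.0 hours, which suggests a strongly right-skewed distribution where the median and interquartile range are the more informative summary. The following is a minimal sketch of how both kinds of summary could be placed side by side for one column. The helper name `robust_summary` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def robust_summary(values: pd.Series) -> pd.Series:
    """Return mean/std next to median/IQR and skew for one numeric column."""
    q1, q3 = values.quantile([0.25, 0.75])
    return pd.Series(
        {
            "mean": values.mean(),
            "std": values.std(),
            "median": values.median(),
            "iqr": q3 - q1,          # interquartile range: 75 % quartile minus 25 % quartile
            "skew": values.skew(),   # positive values indicate a long right tail
        }
    )


# Hypothetical toy values with one large outlier, for illustration only
toy = pd.Series([0.5, 1.0, 2.0, 3.0, 4.0, 250.0])
print(robust_summary(toy))
```

In the notebook, the same function could be applied to a real column, for example `robust_summary(df["Clearance_Time(hr)"])`.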


For the purposes of EDA, I am going to treat the variable "Severity" as categorical. It will be converted back to a numeric type at the end of this notebook.

In [None]:
# astype returns a new Series, so the result must be assigned back to the column
df["Severity"] = df["Severity"].astype("object")
df["Severity"].dtype

dtype('O')
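Note that `astype` does not modify a column in place: it returns a new Series, so the dtype only changes if the result is assigned back. The following is a minimal sketch of the full round trip described above (categorical for EDA, numeric again before saving), using a small hypothetical DataFrame named `demo` in place of the notebook's `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame
demo = pd.DataFrame({"Severity": [2, 2, 3, 1, 4]})

# Treat Severity as categorical during EDA (assign the result back)
demo["Severity"] = demo["Severity"].astype("object")
dtype_during_eda = demo["Severity"].dtype

# Convert back to an integer type before saving the data for ML
demo["Severity"] = demo["Severity"].astype("int64")
dtype_for_ml = demo["Severity"].dtype

print(dtype_during_eda, dtype_for_ml)
```

Checking `.dtype` straight after each conversion, as the notebook does, is a quick way to confirm that the change actually took effect.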

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.