In [272]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [273]:
import matplotlib.pyplot as plt
import seaborn as sns



road_accidents=pd.read_csv('../input/road-accidents/road-accidents.csv',sep='|',skiprows=9,index_col='state')
np.random.seed(0)

**Read in and get an overview of the data**

Print the first five rows of the road accident data.

In [274]:
road_accidents.head()

**Create a textual and a graphical summary of the data**

In the next cell we summarize the data with some descriptive statistics, to get a feel for the data we're dealing with.

In [275]:
road_accidents.describe()

In the next cell we visualize the distribution of the data in each column.

In [276]:
fig,ax=plt.subplots(2,2,figsize=(10,10))
sns.histplot(road_accidents['drvr_fatl_col_bmiles'],ax=ax[0][0])
sns.histplot(road_accidents['perc_fatl_speed'],ax=ax[0][1])
sns.histplot(road_accidents['perc_fatl_alcohol'],ax=ax[1][0])
sns.histplot(road_accidents['perc_fatl_1st_time'],ax=ax[1][1])
plt.show()

In the next cell we use a scatter plot matrix to visualize the pairwise relationships between the columns.

The plots on the diagonal are histograms of each column.

In [277]:
pd.plotting.scatter_matrix(road_accidents,figsize=(10,8))
plt.show()

**Quantify the association of features and accidents**

In the next cell we calculate the correlation coefficients to quantify the relationship between each pair of columns in the dataframe road_accidents.

In [278]:
road_accidents.corr()

- As the table above shows, among the features, **perc_fatl_alcohol** has the strongest correlation with our target column **drvr_fatl_col_bmiles**. Note that correlation measures association, not causal influence.

- The highest correlation coefficient overall is between the two features **perc_fatl_alcohol** and **perc_fatl_speed**.
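
Reading the strongest pairs off a correlation table by eye gets error-prone as the number of columns grows. One option is to keep only the upper triangle of the matrix and sort it by absolute value. The sketch below uses a made-up DataFrame (`toy` and its values are illustrative, not the road-accident data):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not the road-accident values
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],   # almost a linear function of "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to decrease as "a" grows
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The same steps should apply to `road_accidents.corr()`, where the rows involving the target column show which feature is most strongly associated with it.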

**Fit a multivariate linear regression**

In [279]:
from sklearn.linear_model import LinearRegression

linear_regression=LinearRegression()
X=road_accidents.copy()
Y=X.pop(road_accidents.columns[0])
linear_regression.fit(X,Y)
print(*X.columns)
print(*linear_regression.coef_)

From the results above we see that the third coefficient, the regression coefficient of **perc_fatl_1st_time** for the target column **drvr_fatl_col_bmiles**, is positive, while the correlation coefficient between the two columns is negative. This points to a masked relationship: the pairwise correlation ignores the other features, whereas the regression coefficient is estimated while accounting for them.
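
A sign flip like this can be reproduced on synthetic data. The sketch below is not based on the road-accident data: the variables `x1`, `x2` and `y` are invented so that `y` depends positively on both features while the two features move in opposite directions. It uses NumPy's least squares solver in place of `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data, for illustration only: x2 moves opposite to x1,
# and y depends positively on both features, but more strongly on x2.
x1 = rng.normal(size=n)
x2 = -x1 + rng.normal(scale=0.5, size=n)
y = 1.0 * x1 + 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Pairwise (marginal) correlation between y and x1
corr_y_x1 = np.corrcoef(y, x1)[0, 1]

# Multivariate least squares fit with an intercept column
A = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"correlation(y, x1)     = {corr_y_x1:.2f}")  # negative
print(f"regression coef for x1 = {coef[0]:.2f}")    # positive
```

The correlation is negative because `x1` drags `x2` down with it, and `x2` has the larger effect on `y`. Once `x2` is held fixed in the regression, the positive effect of `x1` becomes visible.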

**Perform PCA on standardized data**

In [280]:
from sklearn.decomposition import PCA

X=(X-X.mean())/X.std()

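pca=PCA(random_state=0).fit_transform(X)
# Label the columns as principal components; they are no longer the original features
X=pd.DataFrame(pca,columns=[f'PC{i+1}' for i in range(pca.shape[1])],index=road_accidents.index)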


X.head()
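
Before relying on the first two components, it can help to check how much of the variance each component captures. The notebook does not show this step. The following is a sketch on synthetic data (`data` is made up) that standardizes the columns the same way as above and derives the explained variance ratios from a singular value decomposition. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_` on a fitted object.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for three feature columns (illustrative only):
# the first two share a common signal, the third is independent noise
base = rng.normal(size=(51, 1))
data = np.hstack([
    base + rng.normal(scale=0.3, size=(51, 1)),
    base + rng.normal(scale=0.3, size=(51, 1)),
    rng.normal(size=(51, 1)),
])

# Standardize: zero mean and unit sample standard deviation per column
scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# The squared singular values of the standardized data give the component variances
_, singular_values, _ = np.linalg.svd(scaled, full_matrices=False)
explained_variance = singular_values**2 / (scaled.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(explained_ratio.round(3))           # share of variance per component
print(explained_ratio.cumsum().round(3))  # cumulative share
```

If the first two cumulative shares are high, a two-dimensional scatter plot of the components loses little information.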

**Visualize the first two principal components**

In [281]:
sns.scatterplot(x=X.iloc[:,0],y=X.iloc[:,1])
plt.show()

**Find clusters of similar states in the data**

In [282]:
from sklearn.cluster import KMeans

inertias=[]
ks=range(1,10)
for k in ks:
    kmeans=KMeans(n_clusters=k,random_state=0)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(ks,inertias,'ro-')
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.show()

- As we can see from the scree plot above, there is no clear elbow in the graph, so the plot does not point to an optimal value for n_clusters.
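
When the elbow is hard to judge by eye, a rough aid is to print how much the inertia falls with each added cluster. The sketch below uses made-up inertia values (`inertias` here is illustrative, not the output of the cell above), and `relative_drops` is a hypothetical helper written for this example.

```python
def relative_drops(inertias):
    """Return the fractional decrease in inertia for each step k -> k+1."""
    return [
        (prev - curr) / prev
        for prev, curr in zip(inertias, inertias[1:])
    ]

# Made-up inertia values for k = 1..6, for illustration only
inertias = [150.0, 100.0, 75.0, 60.0, 51.0, 45.9]

for k, drop in enumerate(relative_drops(inertias), start=2):
    print(f"k={k}: inertia fell by {drop:.0%} compared with k={k - 1}")
```

A drop that shrinks gradually with no sharp break, as in these values, corresponds to a plot without a clear elbow.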

**KMeans to visualize clusters in the PCA scatter plot**

In [283]:
kmeans=KMeans(n_clusters=3,random_state=0)
X['cluster']=kmeans.fit_predict(X)
X['cluster']=X['cluster'].astype('category')
print(X.head())

In [284]:
sns.scatterplot(x=X.iloc[:,0],y=X.iloc[:,1],hue='cluster',data=X)
plt.show()

**Visualize the feature differences between the clusters**

In [285]:
Xcluster=X['cluster']

X=road_accidents.copy()
X.pop(road_accidents.columns[0])

X['cluster']=Xcluster

sns.scatterplot(x=X.iloc[:,0],y=X.iloc[:,1],hue='cluster',data=X)
plt.show()

**Compute the number of accidents within each cluster**

In [286]:
miles_driven=pd.read_csv('../input/miles-driven/miles-driven.csv',sep='|',index_col='state')
# rate per billion miles * million miles / 1000 = estimated number of fatal collisions per state
miles_driven['num_drvr_fatl_col']=miles_driven['million_miles_annually']*road_accidents['drvr_fatl_col_bmiles']/1000
X=X.join(miles_driven)
X=X.loc[:,['cluster','num_drvr_fatl_col']]
sns.violinplot(data=X,x='cluster',y='num_drvr_fatl_col')
plt.ylabel('total fatal traffic accidents')
plt.ylim(0,6000)
plt.show()
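
The division by 1000 in the cell above is a unit conversion. It assumes that `drvr_fatl_col_bmiles` is a rate per billion miles, as the column name suggests, while the mileage is given in millions of miles. A worked example with hypothetical numbers for a single state:

```python
# Hypothetical values for one state, for illustration only
rate_per_billion_miles = 18.8      # fatal collisions per billion miles driven
million_miles_annually = 64_914    # miles driven per year, in millions

# 1 billion miles = 1000 million miles, so convert the mileage to billions first
billion_miles_annually = million_miles_annually / 1000
fatal_collisions = rate_per_billion_miles * billion_miles_annually

print(round(fatal_collisions, 1))
```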

**Make a decision when there is no clear right choice**

As we can see from the plot above, the three distributions differ only slightly, so we cannot single out one cluster as more dangerous than the other two. If we had to pick one cluster to focus on first, I would choose cluster 1, as it reaches the highest values for the total number of fatal accidents.
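
A per-cluster summary table can back up a visual judgment like this with numbers. The sketch below uses toy data (the `toy` frame, its column name `fatal_collisions` and all values are invented for illustration) and aggregates with a pandas `groupby`.

```python
import pandas as pd

# Toy data, for illustration only: a cluster label and an estimated
# number of fatal collisions for six made-up states
toy = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 1, 2],
        "fatal_collisions": [900.0, 1100.0, 400.0, 600.0, 3500.0, 800.0],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Number of states, mean, maximum and total per cluster
summary = toy.groupby("cluster")["fatal_collisions"].agg(["count", "mean", "max", "sum"])
print(summary)
```

Comparing the `mean`, `max` and `sum` columns shows whether a cluster stands out because of many states, a high typical value, or a single extreme state, which a violin plot alone can hide.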