# Exploring the different data sets

The following data sets will be looked into further:
* Train.csv - Contains all accidents with time and location over 18 months (01/2018 - 07/2019)
* Weather_Nairobi_daily_GFS.csv - One row per day with 6 weather features for Nairobi over 24 months (01/2018 - 01/2020)
* Segment_info.csv - Column meanings unknown; usefulness still to be determined

Check out documentation at https://github.com/caiomiyashiro/geospatial_data_analysis/blob/master/AMLD-2020/Presentation_AMLD_2020.ipynb

### Importing packages

In [2]:
import pandas as pd
import math
import seaborn as sns

## Accident location over time

In [None]:
df = pd.read_csv('Train.csv', parse_dates=['datetime'])
print(df.shape)
df.head()

Grouping accidents into eight 3-hour time windows (demand "containers") makes it easier to interpret demand over the course of a day.

In [None]:
# Integer-divide the hour by 3 to get window numbers 1..8
df["time_window"] = df["datetime"].dt.hour // 3 + 1

In [None]:
dict_windows = {1: "00-03", 2: "03-06", 3: "06-09", 4: "09-12", 5: "12-15", 6: "15-18", 7: "18-21", 8: "21-24"}

In [None]:
df["time_window_str"] = df["time_window"].map(dict_windows)
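
As a quick sanity check of the window logic, the following sketch applies the same hour-to-window rule to a few made-up timestamps. The `sample` frame and the `window_labels` name are hypothetical and only serve the illustration; they are not part of Train.csv.

```python
import pandas as pd

# Hypothetical sample timestamps, not taken from Train.csv
sample = pd.DataFrame({"datetime": pd.to_datetime([
    "2018-01-01 00:15", "2018-01-01 02:59", "2018-01-01 03:00",
    "2018-01-01 17:45", "2018-01-01 23:59",
])})

window_labels = {1: "00-03", 2: "03-06", 3: "06-09", 4: "09-12",
                 5: "12-15", 6: "15-18", 7: "18-21", 8: "21-24"}

# Integer-divide the hour by 3 to get window numbers 1..8
sample["time_window"] = sample["datetime"].dt.hour // 3 + 1
sample["time_window_str"] = sample["time_window"].map(window_labels)
print(sample)
```

Each window includes its lower bound and excludes its upper bound, so 02:59 falls into "00-03" and 03:00 into "03-06".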

In [None]:
dict_months = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
               7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec"}

In [None]:
df["day"] = df["datetime"].dt.day

In [None]:
df["month"] = df["datetime"].dt.month.map(dict_months)

In [None]:
df["year"] = df["datetime"].dt.year

In [None]:
df["weekday"] = df["datetime"].dt.weekday  # Monday = 0, Sunday = 6

In [None]:
df.tail()

In [None]:
df.groupby("time_window_str").datetime.count()

### Overall accidents per time window for 2018 and 2019

In [None]:
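# Pass the window labels explicitly so the bars appear in chronological order
fig01 = sns.countplot(data=df, x="time_window_str", order=list(dict_windows.values()), palette="Greens")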
fig01.set_title("Overall accidents per time window");

### Max, min, mean and median accidents per time window for 2018 and 2019

In [None]:
# Rows: time windows, columns: day of month (1-31), pooled over all months and years
ct = pd.crosstab(df["time_window_str"], df["day"])
max_acc = ct.max(axis=1)
min_acc = ct.min(axis=1)
mean_acc = ct.mean(axis=1)
median_acc = ct.median(axis=1)

In [None]:
df_stats = pd.DataFrame([max_acc, min_acc, mean_acc, median_acc]).T
df_stats.columns = ["max", "min", "mean", "median"]
df_stats.reset_index(inplace=True)
df_stats.head()

In [None]:
fig = sns.barplot(data=df_stats, x="time_window_str", y="max", palette="Reds")
fig.set_title("Maximum number of accidents per time window");

Note: A day without any recorded accident does not appear as a column in the crosstab, so it never contributes a 0 to the minimum. This still needs to be fixed.
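
One possible way around this, sketched here on made-up data, is to count per calendar date and then add a zero-filled column for every date in the covered range. The `sample` frame and the `date` column are hypothetical, and treating a day without records as a day with 0 accidents is an assumption that may not hold if the data is simply missing.

```python
import pandas as pd

# Hypothetical sample: accidents on 1 and 3 January, none on 2 January
sample = pd.DataFrame({"datetime": pd.to_datetime([
    "2018-01-01 07:10", "2018-01-01 08:30", "2018-01-03 07:45",
])})
sample["time_window"] = sample["datetime"].dt.hour // 3 + 1
sample["date"] = sample["datetime"].dt.normalize()

# Count accidents per (time window, calendar date)
counts = pd.crosstab(sample["time_window"], sample["date"])

# Add a zero-filled column for every calendar date in the covered range
all_dates = pd.date_range(sample["date"].min(), sample["date"].max(), freq="D")
counts_full = counts.reindex(columns=all_dates, fill_value=0)

print(counts.min(axis=1))       # minimum over observed dates only
print(counts_full.min(axis=1))  # minimum including the empty date
```

With the extra column in place, the minimum for the 06-09 window drops from 1 to 0 in this toy example.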

In [None]:
fig = sns.barplot(data=df_stats, x="time_window_str", y="min", palette="Reds")
fig.set_title("Minimum number of accidents per time window");

In [None]:
fig = sns.barplot(data=df_stats, x="time_window_str", y="mean", palette="Reds")
fig.set_title("Mean number of accidents per time window");

In [None]:
fig = sns.barplot(data=df_stats, x="time_window_str", y="median", palette="Reds")
fig.set_title("Median number of accidents per time window");

### Overall accidents per month for 2018

Note: Only 2018 is plotted, so that the first half of the year is not counted twice (once for 2018 and once for 2019).

In [None]:
fig02 = sns.countplot(data=df[df.year == 2018], x="month", palette="Blues")
fig02.set_title("Overall accidents per month");

## Accidents per weekday

In [None]:
fig03 = sns.countplot(data=df[df.year == 2018], x="weekday", palette="Greens")
fig03.set_title("Overall accidents per weekday");

### Accidents per time window and weekday for 2018

In [None]:
sns.catplot(x="time_window_str", col="weekday", data=df[df.year == 2018], kind="count", col_wrap=3, palette="Greens");

### Overall accidents per month and time window for 2018

In [None]:
sns.catplot(x="time_window_str", col="month", data=df[df.year == 2018], kind="count", col_wrap=3, palette="Greens");

### Overall accidents per month and day for 2018

In [None]:
sns.catplot(x="day", col="month", data=df[df.year == 2018], kind="count", col_wrap=3, palette="Blues");

Note: **It seems that we have missing data for some days.**

In [None]:
df_check = df[(df.month == "Sep") & (df.day > 20) & (df.day < 26)]
df_check

In [None]:
df_check.shape
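
The spot check above covers a single date range. A minimal sketch for listing every calendar date without any record could look like the following; the `sample` frame is made up for illustration, and the names `observed`, `expected` and `missing_dates` are hypothetical. Whether such a date is truly missing or simply had no accidents cannot be decided from this check alone.

```python
import pandas as pd

# Hypothetical sample with a gap on 22 and 23 September 2018
sample = pd.DataFrame({"datetime": pd.to_datetime([
    "2018-09-21 09:00", "2018-09-24 18:20", "2018-09-25 06:05",
])})

# Calendar dates that have at least one record
observed = pd.DatetimeIndex(sample["datetime"].dt.normalize().unique())

# Every calendar date between the first and the last record
expected = pd.date_range(observed.min(), observed.max(), freq="D")

# Dates in the expected range that never occur in the data
missing_dates = expected.difference(observed)
print(missing_dates)
```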

We will have to create rows for these missing days and decide what to fill them with.

### To Do
* Create rows for non-existent time windows (merge with a template dataframe of time windows?)
* Fill these rows with some data to get better statistics
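
The template idea from the list above could be sketched as follows, assuming that every combination of calendar date and time window should get exactly one row. The `sample` frame, the column name `accidents` and the choice to fill gaps with 0 are assumptions for illustration, not decisions taken in this notebook.

```python
import pandas as pd

# Hypothetical sample, not taken from Train.csv
sample = pd.DataFrame({"datetime": pd.to_datetime([
    "2018-01-01 07:10", "2018-01-01 08:30", "2018-01-02 19:00",
])})
sample["date"] = sample["datetime"].dt.normalize()
sample["time_window"] = sample["datetime"].dt.hour // 3 + 1

# Observed accident counts per (date, time window)
counts = (sample.groupby(["date", "time_window"])
                .size()
                .reset_index(name="accidents"))

# Template with one row for every date and each of the 8 time windows
all_dates = pd.date_range(sample["date"].min(), sample["date"].max(), freq="D")
template = pd.MultiIndex.from_product(
    [all_dates, range(1, 9)], names=["date", "time_window"]
).to_frame(index=False)

# Left-merge keeps unobserved combinations; fill their counts with 0 (assumption)
full = template.merge(counts, on=["date", "time_window"], how="left")
full["accidents"] = full["accidents"].fillna(0).astype(int)
print(full)
```

Two dates times eight windows give 16 rows, of which only two carry a non-zero count in this toy example. A different fill value, such as a per-window mean, could be swapped in at the `fillna` step.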