# Project - Programming for Data Analysis
***


### References
***
**Road Safety Authority (RSA)**       
    - www.rsa.ie/en/RSA/Road-Safety/Our-Research/Deaths-injuries-on-Irish-roads  
    - www.rsa.ie/Documents 
**Irish Times** 
    - https://www.irishtimes.com/news/environment/crash-report
**Technical References**       
    - http://pandas.pydata.org/pandas-docs/stable/
    - https://docs.scipy.org/doc/numpy/reference/routines.random.html
    - https://www.bogotobogo.com/python/python_fncs_map_filter_reduce.php
    - https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/
    - http://effbot.org/zone/python-list.htm
    - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html
    


***


**Real Scenario** <br> <br>
The summary below is based on the road accident statistics prepared by the Road Safety Authority for the year 2016.

    - There were 175 fatal collisions on Irish roads, which resulted in 187 fatalities
    - This was 13% more collisions and 15% more deaths than in the previous year (2015)
    - The highest numbers of fatalities occurred in counties Dublin and Cork
    - The highest number of fatalities occurred in the age group "66 and above"
    - The highest number of fatalities occurred for the road user type "Driver"
    - The highest number of fatalities occurred on the weekday "Sunday"

***
**Project** <br> <br>
This project is inspired by the real-world scenario above. The objectives of the project are listed below.

    - Generate 200 data sets using the Python random sampling functions
    - Each data set to contain 6 variables 
    
               - Irish county where the accident took place
               - Age group of the driver [Traditionalists, Baby Boomers, Gen-X, Gen-Y, Gen-Z]
               - Type of vehicle [Car, Van, Bus, Lorry, bi-cycle, SUV]
               - Road type [Two-way single carriageway, One-way single carriageway, Dual Carriageway]
               - Weather on the particular day [Sunny, Rainy, Snow, Windy, Cloudy]
               - Number of accidents
        
    - Investigate the types of variables involved, their likely distributions, and their relationships with each other.
    - Summarise the findings.
    
***

In [140]:
# Import the pandas library
import pandas as pd

# Variable 1 - Counties
# The Irish counties are stored in a JSON file
# Create a dataframe for the Irish counties
url = "https://raw.githubusercontent.com/SomanathanSubramaniyan/PDA-Project/master/Counties.json"
df_counties = pd.read_json(url, orient='columns')

# Variable 2 - Age group of the driver
# Create a list of the age groups
AgeGroup =['Traditionalists', 'Baby Boomers', 'Gen-X', 'Gen-Y', 'Gen-Z']

# Variable 3 - Type of vehicle
# Create a list of the different vehicle types
VehicleType = ['Van', 'Bus', 'bi-cycle', 'Car','SUV', 'Lorry']

# Variable 4 - Road type
# Create a list of the different road types
RoadType = ['Two-way single carriageway', 'One-way single carriageway', 'Dual Carriageway']

# Variable 5 - Weather
# Create a list of the different weather scenarios
Weather = ['Sunny','Cloudy','Rainy', 'Windy','Snow']

df_counties, AgeGroup, VehicleType,RoadType,Weather

(            0
 0      Antrim
 1      Armagh
 2      Carlow
 3       Cavan
 4       Clare
 5        Cork
 6       Derry
 7     Donegal
 8        Down
 9      Dublin
 10  Fermanagh
 11     Galway
 12      Kerry
 13    Kildare
 14   Kilkenny
 15      Laois
 16    Leitrim
 17   Limerick
 18   Longford
 19      Louth
 20       Mayo
 21      Meath
 22   Monaghan
 23     Offaly
 24  Roscommon
 25      Sligo
 26  Tipperary
 27     Tyrone
 28  Waterford
 29  Westmeath
 30    Wexford
 31    Wicklow,
 ['Traditionalists', 'Baby Boomers', 'Gen-X', 'Gen-Y', 'Gen-Z'],
 ['Van', 'Bus', 'bi-cycle', 'Car', 'SUV', 'Lorry'],
 ['Two-way single carriageway',
  'One-way single carriageway',
  'Dual Carriageway'],
 ['Sunny', 'Cloudy', 'Rainy', 'Windy', 'Snow'])
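
The samplers later in the notebook pass hard-coded upper limits (4, 5, 2 and 4) that have to match the last valid index of each reference list. A minimal sketch of deriving those limits from the lists themselves is shown below; the `categories` and `max_index` names are illustrative assumptions and are not part of the original notebook.

```python
# Reference lists copied from the cell above
AgeGroup = ['Traditionalists', 'Baby Boomers', 'Gen-X', 'Gen-Y', 'Gen-Z']
VehicleType = ['Van', 'Bus', 'bi-cycle', 'Car', 'SUV', 'Lorry']
RoadType = ['Two-way single carriageway', 'One-way single carriageway', 'Dual Carriageway']
Weather = ['Sunny', 'Cloudy', 'Rainy', 'Windy', 'Snow']

# Hypothetical helper: group the lists by name so they can be inspected together
categories = {
    'AgeGroup': AgeGroup,
    'VehicleType': VehicleType,
    'RoadType': RoadType,
    'Weather': Weather,
}

# The largest valid index of a list is its length minus one
max_index = {name: len(values) - 1 for name, values in categories.items()}
print(max_index)
# → {'AgeGroup': 4, 'VehicleType': 5, 'RoadType': 2, 'Weather': 4}
```

Deriving the limits this way would keep the samplers in step with the lists if a category were added or removed.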

In [143]:
# This section of the code creates a data set of 100 rows
# Create a dataframe with the columns County, AgeGroup, VechicleType, RoadType, Weather and NoofAccidents
# Use for loops to populate the 100 rows

from scipy.stats import truncnorm,poisson, uniform
import numpy as np
import random
import pandas as pd

# Function to return a frozen truncated NORMAL distribution
# bounded to the range [low, upp]; call .rvs() on the result to draw samples

def truncatednormal(mu=0, sigma=1, low=0, upp=10):
    # truncnorm expects its bounds in standard-deviation units relative to mu
    return truncnorm( (low - mu)/sigma, (upp - mu)/ sigma, mu, sigma)

# Function to return POISSON random values truncated to the range [0, maxval]
# Inverse-CDF sampling: draw u uniformly below cdf(maxval), then map u back through ppf

def tpoisson(sample_size=1, maxval=5, mu=3.2):
    cutoff = poisson.cdf(maxval, mu)
    u = uniform.rvs(scale=cutoff, size= sample_size)
    y = poisson.ppf(u, mu)
    return y

dataset = pd.DataFrame(columns=['County','AgeGroup','VechicleType','RoadType', 'Weather','NoofAccidents'])

### Variable 1  -- County ###
# Use a discrete UNIFORM DISTRIBUTION to populate the County column in the dataframe
# so that every county is equally likely to be chosen.
# On average about 31 of the 32 counties appear in each 100-row run.
# random.randint draws the row index directly; rounding a continuous uniform draw
# would make the first and last counties half as likely as the others.

for x in range(100):
    icounty = random.randint(0, 31)
    dataset.loc[x,'County'] = df_counties.at[icounty,0]

# County - Unique value and their counts - results of the UNIFORM random distribution
dataset.County.value_counts()
    
### Variable 2  -- Age Group of the Driver ###
# Use a TRUNCATED NORMAL DISTRIBUTION (mean 2.2, sd 1, bounded to [0, 4]) to populate the AgeGroup column
# so that most rows fall in "Gen-X" or "Gen-Y".
# Round the float draw to the nearest integer and use it as an index into AgeGroup.

for x in range(100):
    y = truncatednormal(2.2,1,0,4)
    iAG = y.rvs(1)
    z = int(round(iAG[0],0))
    dataset.loc[x,'AgeGroup'] = AgeGroup[z]
    
# Age Group - Unique values and their counts - results of the truncated normal distribution
dataset.AgeGroup.value_counts()

### Variable 3, Variable 4 and Variable 5  -- Vehicle Type, Road Type and Weather ###
# Use a truncated POISSON DISTRIBUTION to pick the vehicle type, road type and weather from the reference lists
# For the vehicle type (mu=3.2) this makes "Car", "SUV" and "bi-cycle" the most common values

for x in range(100):
    # call tpoisson with the sample size, upper limit (last list index) and mu
    y = tpoisson(1,5,3.2)
    dataset.loc[x,'VechicleType'] = VehicleType[int(y[0])]
    # call tpoisson with the sample size, upper limit (last list index) and mu
    y = tpoisson(1,4,1.5)
    dataset.loc[x,'Weather'] = Weather[int(y[0])]
    # call tpoisson with the sample size, upper limit (last list index) and mu
    y = tpoisson(1,2,0.5)
    dataset.loc[x,'RoadType'] = RoadType[int(y[0])]    
    
dataset.County.value_counts()


Dublin       8
Kilkenny     6
Cork         4
Derry        4
Kildare      4
Roscommon    4
Cavan        4
Wexford      4
Clare        4
Laois        4
Kerry        4
Meath        4
Down         3
Westmeath    3
Louth        3
Wicklow      3
Leitrim      3
Mayo         3
Fermanagh    3
Sligo        3
Monaghan     3
Galway       3
Donegal      2
Carlow       2
Waterford    2
Longford     2
Tyrone       2
Offaly       2
Tipperary    1
Antrim       1
Armagh       1
Limerick     1
Name: County, dtype: int64
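
The counts above come from a single 100-row run, so they are too noisy to show whether every county index is equally likely. The sketch below compares two ways of drawing an index from 0 to 31 over a much larger sample: rounding a continuous uniform draw, and drawing the integer directly with `random.randint`. The sample size `N`, the seed and the variable names are illustrative assumptions.

```python
import random
from collections import Counter

random.seed(0)   # fixed seed so the comparison is repeatable
N = 310_000      # large sample so the proportions are stable

# Method 1: round a continuous uniform draw on [0, 31] to the nearest integer
rounded = Counter(int(round(random.uniform(0, 31))) for _ in range(N))

# Method 2: draw the integer index directly (both ends inclusive)
direct = Counter(random.randint(0, 31) for _ in range(N))

for label, counts in (("rounded", rounded), ("direct", direct)):
    # Average count of the 30 interior indices (1..30)
    interior = sum(counts[i] for i in range(1, 31)) / 30
    # Ratio of each endpoint's count to the interior average
    print(label, round(counts[0] / interior, 2), round(counts[31] / interior, 2))
```

With rounding, indices 0 and 31 each cover only half a unit of the interval, so their ratios should come out near 0.5, while the direct draw should give ratios near 1.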

In [136]:
import numpy as np

# Draw 200 samples from a Poisson distribution with mean 2
np.random.poisson(2,200)

array([3, 0, 3, 3, 2, 2, 3, 0, 2, 0, 1, 0, 6, 2, 3, 6, 3, 3, 2, 2, 1, 3,
       2, 1, 2, 2, 2, 0, 2, 1, 7, 5, 0, 3, 1, 2, 0, 1, 1, 4, 0, 2, 4, 0,
       0, 7, 1, 0, 2, 1, 2, 1, 2, 3, 0, 2, 2, 2, 0, 1, 0, 1, 2, 0, 4, 0,
       0, 4, 1, 0, 3, 1, 3, 1, 3, 1, 2, 0, 4, 0, 0, 0, 2, 3, 2, 2, 1, 3,
       1, 4, 1, 1, 5, 7, 3, 2, 2, 2, 2, 4, 3, 1, 2, 3, 1, 2, 0, 2, 1, 0,
       1, 0, 1, 1, 2, 0, 2, 1, 3, 3, 3, 2, 0, 1, 3, 1, 3, 0, 2, 4, 1, 2,
       3, 2, 2, 1, 4, 7, 2, 1, 0, 0, 4, 3, 3, 3, 2, 3, 1, 3, 2, 3, 1, 2,
       1, 2, 8, 1, 3, 0, 1, 1, 0, 1, 1, 3, 2, 0, 1, 0, 1, 4, 6, 1, 0, 2,
       2, 1, 1, 0, 3, 3, 2, 3, 6, 5, 0, 5, 1, 4, 0, 0, 1, 2, 1, 2, 3, 1,
       4, 3])
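
The raw Poisson draws above are unbounded (values as high as 8 appear), whereas a draw used as a list index must stay within the list. The following sketch contrasts raw draws with an inverse-CDF truncation in the style of `tpoisson`. The function name `truncated_poisson`, the use of `np.random.default_rng` and the seed are assumptions made for illustration, not the notebook's own code.

```python
import numpy as np
from scipy.stats import poisson

def truncated_poisson(size, maxval, mu, rng):
    """Draw `size` Poisson(mu) values restricted to 0..maxval (illustrative helper)."""
    # Probability mass of the untruncated distribution at or below maxval
    cutoff = poisson.cdf(maxval, mu)
    # Uniform draws confined to [0, cutoff) can only map back to values <= maxval
    u = rng.uniform(0.0, cutoff, size=size)
    values = poisson.ppf(u, mu)
    # Guard against the edge case u == 0, where ppf returns -1
    return np.maximum(values, 0).astype(int)

rng = np.random.default_rng(42)   # seeded generator so the run is repeatable

raw = rng.poisson(2, 200)                        # unbounded draws, as in the cell above
bounded = truncated_poisson(200, 5, 3.2, rng)    # draws usable as indices 0..5

print("raw max:", raw.max())
print("bounded counts for 0..5:", np.bincount(bounded, minlength=6))
```

Because the truncation rescales the uniform draws instead of rejecting out-of-range values, every call returns exactly `size` samples.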