# Data 
### Data requirement
According to the problem description, we need a dataset that combines many kinds of information about collisions in a particular place, so that we can build a suitable model and predict the severity of an accident. The dataset should be large, both in the number of attributes and in the number of entries, because more data generally improves model accuracy.

The qualities we need in our data are:
* It should contain a large amount of data for better model training.
* It should have a column that gives the severity level of each accident.
* Other necessary attributes are:
    * Road, weather, and light conditions at the place of the accident
    * Driver behaviour and consciousness
    * Detailed description of the collision, with date, time, and location
    * Number of persons and vehicles involved

### Data description
The data we will use for the project is from Seattle, Washington, US, in a file named “Data-Collisions.csv”, which is publicly available at this link: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv. It contains records from 2004 to the present. It is a large dataset, with dimensions 194673 x 38. It has a dedicated column giving the severity of each collision, which can be used as the target when training the model and making predictions.

#### Description of some data attributes

* **OBJECTID:** ESRI unique identifier (ObjectID)
* **ADDRTYPE:** Collision address type: Alley, Block, Intersection (Text, 12)
* **LOCATION:** Description of the general location of the collision (Text, 255)
* **SEVERITYCODE:** A code that corresponds to the severity of the collision (Text, 100)
* **COLLISIONTYPE:** Collision type (Text, 300)
* **PERSONCOUNT:** The total number of people involved in the collision (Double)
* **PEDCOUNT:** The number of pedestrians involved in the collision (Double)
* **PEDCYLCOUNT:** The number of bicycles involved in the collision (Double)
* **VEHCOUNT:** The number of vehicles involved in the collision (Double)
* **INCDTTM:** The date and time of the incident (Text, 30)
* **WEATHER:** A description of the weather conditions during the time of the collision (Text, 300)
* **ROADCOND:** The condition of the road during the collision (Text, 300)
* **LIGHTCOND:** The light conditions during the collision (Text, 300)
* **PEDROWNOTGRNT:** Whether or not the pedestrian right of way was not granted (Text, 1)
* **SPEEDING:** Whether or not speeding was a factor in the collision (Text, 1)

Let's import the data and take a look at it.
### Import Basic Libraries

In [2]:
# firstly, let's import the basic data libraries
import pandas as pd
import numpy as np

The data file is a .csv file; let's import it now.

### Load Data From CSV File

In [3]:
df = pd.read_csv('Data-Collisions.csv')

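
The `ST_COLCODE` column is reported with `object` dtype even though most of its values look numeric, which usually points to mixed types in the raw file. The following is a minimal sketch of one way to make the parse explicit, assuming (this is not confirmed by the source) that reading the column as text is acceptable. The inline sample is made up for illustration and stands in for `Data-Collisions.csv`; `ST_COLCODE_NUM` is a hypothetical column name.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Data-Collisions.csv.
sample_csv = io.StringIO(
    "SEVERITYCODE,ST_COLCODE,ROADCOND\n"
    "2,10,Wet\n"
    "1,,Dry\n"
    "1,32,Dry\n"
)

# Reading ST_COLCODE as text keeps every value in one predictable type;
# low_memory=False makes pandas infer dtypes from the whole file at once.
sample_df = pd.read_csv(sample_csv, dtype={"ST_COLCODE": str}, low_memory=False)

# Missing or non-numeric codes become NaN instead of raising an error.
sample_df["ST_COLCODE_NUM"] = pd.to_numeric(sample_df["ST_COLCODE"], errors="coerce")
print(sample_df)
```

The same `dtype` and `low_memory` arguments could be passed when reading the real file.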


In [4]:
df.head()  # let's see the first 5 rows of our data

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | NaN | NaN | NaN | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | NaN | ... | Wet | Dark - Street Lights On | NaN | 6354039.0 | NaN | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | NaN | ... | Dry | Daylight | NaN | 4323031.0 | NaN | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | NaN | ... | Dry | Daylight | NaN | NaN | NaN | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | NaN | 4028032.0 | NaN | 10 | Entering at angle | 0 | 0 | N |


## Data visualization and pre-processing:
Let's take a look at what kind of data we have, and prepare the data that we will need.

In [5]:
df.describe(include='all')

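| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 189339.0 | 189339.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673 | 192747 | 65070.0 | ... | 189661 | 189503 | 4667 | 114936.0 | 9333 | 194655.0 | 189769 | 194673.0 | 194673.0 | 194673 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | 194670.0 | 2 | 3 | NaN | ... | 9 | 9 | 1 | NaN | 1 | 115.0 | 62 | NaN | NaN | 2 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | 1782439.0 | Matched | Block | NaN | ... | Dry | Daylight | Y | NaN | Y | 32.0 | One parked--one moving | NaN | NaN | N |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 189786 | 126926 | NaN | ... | 124510 | 116137 | 4667 | NaN | 9333 | 27612.0 | 44421 | NaN | NaN | 187457 |
| mean | 1.298901 | -122.330518 | 47.619543 | 108479.36493 | 141091.45635 | 141298.811381 | NaN | NaN | NaN | 37558.450576 | ... | NaN | NaN | NaN | 7972521.0 | NaN | NaN | NaN | 269.401114 | 9782.452 | NaN |
| std | 0.457778 | 0.029976 | 0.056157 | 62649.722558 | 86634.402737 | 86986.54211 | NaN | NaN | NaN | 51745.990273 | ... | NaN | NaN | NaN | 2553533.0 | NaN | NaN | NaN | 3315.776055 | 72269.26 | NaN |
| min | 1.0 | -122.419091 | 47.495573 | 1.0 | 1001.0 | 1001.0 | NaN | NaN | NaN | 23807.0 | ... | NaN | NaN | NaN | 1007024.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 25% | 1.0 | -122.348673 | 47.575956 | 54267.0 | 70383.0 | 70383.0 | NaN | NaN | NaN | 28667.0 | ... | NaN | NaN | NaN | 6040015.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 50% | 1.0 | -122.330224 | 47.615369 | 106912.0 | 123363.0 | 123363.0 | NaN | NaN | NaN | 29973.0 | ... | NaN | NaN | NaN | 8023022.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 75% | 2.0 | -122.311937 | 47.663664 | 162272.0 | 203319.0 | 203459.0 | NaN | NaN | NaN | 33973.0 | ... | NaN | NaN | NaN | 10155010.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |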


#### Get the Dimensions of Data

In [6]:
df.shape #To get the dimensions of the data

(194673, 38)

#### Get the DataTypes of Data
Check the data type of each column and note the attributes we need along with their data types, so that we can plan the changes to make to them.

In [7]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object
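
`INCDATE` and `INCDTTM` are loaded as `object` (plain strings), so any time-based feature needs an explicit conversion first. The following is a minimal sketch, assuming the timestamps look like the made-up samples below (month/day/year with a 12-hour clock). The actual format in the file should be checked before reusing this, and the column names `HOUR` and `WEEKDAY` are hypothetical.

```python
import pandas as pd

# Made-up timestamps in an assumed format; verify against the real column.
sample = pd.DataFrame({"INCDTTM": ["3/27/2013 2:54:00 PM", "12/20/2006 6:55:00 PM"]})

# An explicit format avoids ambiguous day/month guessing.
sample["INCDTTM"] = pd.to_datetime(sample["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p")

# Derive simple time features that a model could use.
sample["HOUR"] = sample["INCDTTM"].dt.hour
sample["WEEKDAY"] = sample["INCDTTM"].dt.dayofweek  # Monday = 0
print(sample)
```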

We can separate the data that we need from the unnecessary data, such as the multiple IDs and the coordinates.
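
One way to do this separation is to drop the identifier and coordinate columns by name. The following is a minimal sketch on a made-up two-row frame. Which columns count as unnecessary is an assumption here, not a choice taken from the analysis.

```python
import pandas as pd

# Made-up rows that reuse a few column names from the dataset.
sample = pd.DataFrame({
    "SEVERITYCODE": [2, 1],
    "X": [-122.32, -122.35],
    "Y": [47.70, 47.65],
    "OBJECTID": [1, 2],
    "INCKEY": [1307, 52200],
    "ROADCOND": ["Wet", "Dry"],
})

# Assumed list of identifier/coordinate columns to discard.
id_like_columns = ["X", "Y", "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO"]

# errors="ignore" skips names that are absent from the frame.
features = sample.drop(columns=id_like_columns, errors="ignore")
print(features.columns.tolist())
```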

#### Count the Total SEVERITYCODE:
Take a look at the target label that we have to predict and plan our further steps. Here we can see that SEVERITYCODE is already in numeric code form, so we don't have to convert it.

In [8]:
#The data we have to predict is 'severitycode'
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

Clearly, the data contains only two codes: 1 and 2.
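
The two classes are not equally frequent, which is worth keeping in mind when judging model accuracy. The following short sketch turns the counts printed above into proportions. The Series is rebuilt by hand from those counts so that the snippet runs without the CSV.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class; on the full frame the equivalent call is
# df['SEVERITYCODE'].value_counts(normalize=True).
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 70% of the records carry code 1 and 30% carry code 2.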

##### What the severity codes mean:

* Code: Severity
* 0: unknown
* 1: property damage
* 2: injury
* 2b: serious injury
* 3: fatality

Looking at the null values, the original data has many columns that contain NaN values, so we have to remove these during data analysis too. Even though pandas skips NaN values in many computations, they can still affect our model.

#### Null Value of Each Column

In [9]:
#checking nan values in each column
df.isnull().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

By analyzing the null values, we can weigh the possible options and choose the most suitable attributes for use in the model.
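
One simple way to act on these counts is to convert them to a fraction of all rows and set aside the columns that are mostly empty. The following is a minimal sketch using a few of the counts listed above. The 50% cutoff is an arbitrary assumption, and a column such as `SPEEDING` may be mostly empty simply because `Y` is the only value ever recorded, so dropping it is not always the right response.

```python
import pandas as pd

n_rows = 194673  # from df.shape

# A few null counts copied from the isnull().sum() output above.
null_counts = pd.Series({
    "WEATHER": 5081,
    "ROADCOND": 5012,
    "INTKEY": 129603,
    "INATTENTIONIND": 164868,
    "SPEEDING": 185340,
})

# On the full frame the same fractions come from df.isnull().mean().
null_fraction = null_counts / n_rows

threshold = 0.5  # assumed cutoff, chosen only for illustration
mostly_empty = null_fraction[null_fraction > threshold].index.tolist()
print(mostly_empty)
```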

#### Data Grouping

In [10]:
df.groupby(['ROADCOND','LIGHTCOND','SPEEDING','ST_COLCODE'])['SEVERITYCODE'].value_counts()

ROADCOND  LIGHTCOND                SPEEDING  ST_COLCODE  SEVERITYCODE
Dry       Dark - No Street Lights  Y         14          2               1
                                             23          1               1
                                             32          1               5
                                                         2               2
                                             50          2               7
                                                                        ..
Wet       Unknown                  Y         28          1               1
                                             30          1               1
                                             32          1               3
                                                         2               1
                                             50          1               4
Name: SEVERITYCODE, Length: 898, dtype: int64
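
Rows with a missing `SPEEDING` value cannot appear in this result, because `groupby` leaves out rows whose grouping key is NaN by default. The following is a small sketch of the difference on made-up rows, assuming pandas 1.1 or later for the `dropna` argument.

```python
import pandas as pd

# Made-up rows; SPEEDING is either 'Y' or missing.
sample = pd.DataFrame({
    "ROADCOND": ["Dry", "Dry", "Wet", "Wet"],
    "SPEEDING": ["Y", None, None, "Y"],
    "SEVERITYCODE": [1, 2, 1, 2],
})

# Default behaviour: rows with a missing SPEEDING value are skipped.
default_groups = sample.groupby(["ROADCOND", "SPEEDING"])["SEVERITYCODE"].count()

# dropna=False keeps the missing key as its own group.
all_groups = sample.groupby(["ROADCOND", "SPEEDING"], dropna=False)["SEVERITYCODE"].count()

print(len(default_groups), len(all_groups))
```

With the default setting, a grouping like the one above describes only the collisions where speeding was flagged, not the whole dataset.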

In [11]:
df.groupby('PEDCYLCOUNT')['SEVERITYCODE'].value_counts()

PEDCYLCOUNT  SEVERITYCODE
0            1               135806
             2                53383
1            2                 4762
             1                  679
2            2                   43
Name: SEVERITYCODE, dtype: int64
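
The counts are easier to compare as within-group shares. The following short sketch rebuilds the table above by hand and computes the fraction of severity code 2 for each `PEDCYLCOUNT` value. On the full frame, `pd.crosstab(df['PEDCYLCOUNT'], df['SEVERITYCODE'], normalize='index')` should give the same kind of table.

```python
import pandas as pd

# Counts copied from the groupby output above
# (rows: PEDCYLCOUNT, columns: SEVERITYCODE).
counts = pd.DataFrame(
    {1: [135806, 679, 0], 2: [53383, 4762, 43]},
    index=pd.Index([0, 1, 2], name="PEDCYLCOUNT"),
)

# Divide each row by its total to get per-group shares.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares[2].round(3))
```

In this data, collisions involving one bicycle are coded as injuries about 88% of the time, compared with about 28% when no bicycle is involved.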

#### End of the data description and pre-processing