In [1]:
import pandas as pd

### Data Types: All objects in Python have a type. You can check the type with the `type()` function. Here are a few standard ones:

In [2]:
type(1.5)

float

In [3]:
type(3)

int

In [4]:
type('abc')

str

In [5]:
type(True)

bool

### You can convert between types.

In [6]:
float(1)

1.0

In [7]:
str(1)

'1'

In [8]:
int('9')

9

In [9]:
int(9.9)

9
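
A few conversion edge cases are worth keeping in mind. The sketch below uses only built-in functions; `to_int` is a hypothetical helper written for this illustration, not part of Python or pandas.

```python
def to_int(text):
    """Convert text to an int, returning None when the text is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


# int() on a float truncates toward zero; it does not round
print(int(9.9))    # 9
print(int(-9.9))   # -9
print(round(9.9))  # 10

# int() on a string only accepts whole-number text
print(to_int('9'))    # 9
print(to_int('9.9'))  # None

# Going through float first works, and then truncates as above
print(int(float('9.9')))  # 9
```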

### DataFrames also have a type

In [10]:
accidents = pd.read_csv('../data/Traffic_Accidents__2019_.csv')

In [11]:
type(accidents)

pandas.core.frame.DataFrame

### And each column has a type

In [13]:
accidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34692 entries, 0 to 34691
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Accident Number             34692 non-null  int64  
 1   Date and Time               34692 non-null  object 
 2   Number of Motor Vehicles    34692 non-null  int64  
 3   Number of Injuries          34692 non-null  int64  
 4   Number of Fatalities        34692 non-null  int64  
 5   Property Damage             2495 non-null   object 
 6   Hit and Run                 34691 non-null  object 
 7   Reporting Officer           34684 non-null  float64
 8   Collision Type Code         34688 non-null  float64
 9   Collision Type Description  34688 non-null  object 
 10  Weather Code                34641 non-null  float64
 11  Weather Description         34641 non-null  object 
 12  Illumination Code           34665 non-null  float64
 13  Illumination Description    346

Notice that quite a few of the columns are of the "object" type. By default, pandas will convert text data into the object datatype.

You can convert between types using the `.astype` method. For example, if we needed to treat the accident number as text instead of an integer, we could use the following:

In [16]:
accidents['Accident Number'] = accidents['Accident Number'].astype(str)

In [17]:
accidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34692 entries, 0 to 34691
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Accident Number             34692 non-null  object 
 1   Date and Time               34692 non-null  object 
 2   Number of Motor Vehicles    34692 non-null  int64  
 3   Number of Injuries          34692 non-null  int64  
 4   Number of Fatalities        34692 non-null  int64  
 5   Property Damage             2495 non-null   object 
 6   Hit and Run                 34691 non-null  object 
 7   Reporting Officer           34684 non-null  float64
 8   Collision Type Code         34688 non-null  float64
 9   Collision Type Description  34688 non-null  object 
 10  Weather Code                34641 non-null  float64
 11  Weather Description         34641 non-null  object 
 12  Illumination Code           34665 non-null  float64
 13  Illumination Description    346
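
Several of the code columns above are `float64` even though they hold whole numbers. A likely reason is that they contain missing values, which a plain `int64` column cannot store. The sketch below uses a small made-up Series (not the accidents file) to show how `.astype` behaves in that case, assuming a pandas version that supports the nullable `Int64` dtype.

```python
import pandas as pd

# Toy data: a code column with one missing value
codes = pd.Series([4.0, 11.0, None])
print(codes.dtype)  # float64, because the missing value is stored as NaN

# 'Int64' (capital I) is pandas' nullable integer dtype
nullable = codes.astype('Int64')
print(nullable)

# Plain 'int64' cannot represent the missing value, so the conversion fails
try:
    codes.astype('int64')
except (ValueError, TypeError) as err:
    print('Conversion to int64 failed:', err)
```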

Notice that the `Date and Time` column is currently being treated as an `object`. This would make it quite difficult to do comparisons or aggregations, for example, between months or days of the week.

Fortunately, we can convert it to a more useful data type, the `datetime` data type.

In order to do this, we can use the `pd.to_datetime` function.

If we don't tell it otherwise, this function will infer the different date and time components of the string. This can be slow, especially when we have a large number of rows of data.

However, we can help it out by being explicit about the format. To do this, you will have to use datetime format codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [18]:
accidents['Date and Time'] = pd.to_datetime(accidents['Date and Time'], 
                                            format = '%m/%d/%Y %I:%M:%S %p')
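
To see how the format codes line up with the text, here is a small sketch using Python's built-in `datetime.strptime`, which uses the same codes. The sample string is made up to match the layout implied by the format above.

```python
from datetime import datetime

# %m = month, %d = day, %Y = 4-digit year,
# %I = hour on a 12-hour clock, %M = minute, %S = second, %p = AM/PM
fmt = '%m/%d/%Y %I:%M:%S %p'

sample = '01/15/2019 07:40:00 PM'  # made-up string in the same layout
parsed = datetime.strptime(sample, fmt)
print(parsed)  # 2019-01-15 19:40:00
```

If a string does not match the format, `strptime` raises a `ValueError`.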

In [19]:
# Now the column is a datetime64[ns]
accidents.dtypes

Accident Number                       object
Date and Time                 datetime64[ns]
Number of Motor Vehicles               int64
Number of Injuries                     int64
Number of Fatalities                   int64
Property Damage                       object
Hit and Run                           object
Reporting Officer                    float64
Collision Type Code                  float64
Collision Type Description            object
Weather Code                         float64
Weather Description                   object
Illumination Code                    float64
Illumination Description              object
Harmful Code                          object
Harmful Code Description              object
Street Address                        object
City                                  object
State                                 object
ZIP                                  float64
RPA                                  float64
Precinct                              object
Latitude  

In [20]:
# The values in the Date and Time column look different now
accidents.head()

Unnamed: 0,Accident Number,Date and Time,Number of Motor Vehicles,Number of Injuries,Number of Fatalities,Property Damage,Hit and Run,Reporting Officer,Collision Type Code,Collision Type Description,...,Harmful Code Description,Street Address,City,State,ZIP,RPA,Precinct,Latitude,Longitude,Mapped Location
0,20190038972,2019-01-15 19:40:00,2,0,0,,N,256374.0,4.0,ANGLE,...,MOTOR VEHICLE IN TRANSPORT,BELL RD & CEDAR POINTE PKWY,ANTIOCH,TN,37013.0,8753.0,SOUTH,36.0449,-86.6671,POINT (-86.6671 36.0449)
1,20190045402,2019-01-17 23:09:00,2,0,0,,Y,405424.0,11.0,Front to Rear,...,PARKED MOTOR VEHICLE,3248 PERCY PRIEST DR,NASHVILLE,TN,37214.0,8955.0,HERMIT,36.1531,-86.6291,POINT (-86.6291 36.1531)
2,20190051468,2019-01-20 12:57:00,2,0,0,,N,834798.0,6.0,SIDESWIPE - OPPOSITE DIRECTION,...,PARKED MOTOR VEHICLE,700 THOMPSON LN,NASHVILLE,TN,37204.0,8305.0,MIDTOW,36.1122,-86.7625,POINT (-86.7625 36.1122)
3,20190088097,2019-02-02 00:38:00,2,0,0,,Y,660929.0,4.0,ANGLE,...,MOTOR VEHICLE IN TRANSPORT;PARKED MOTOR VEHICLE,400 RADNO,NASHVILLE,TN,,,,36.0483,-86.4369,POINT (-86.4369 36.0483)
4,20190091289,2019-02-03 13:25:00,2,0,0,,N,212369.0,4.0,ANGLE,...,MOTOR VEHICLE IN TRANSPORT,ELLINGTON AG CENTER PVTDR & EDMONDSON PK,NASHVILLE,TN,37220.0,8615.0,MIDTOW,36.0618,-86.7405,POINT (-86.7405 36.0618)


In [21]:
# And we can see each value is a timestamp
accidents.loc[0, 'Date and Time']

Timestamp('2019-01-15 19:40:00')

### Once you have a `datetime` object, you can pull out [individual parts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html)
- Use the `.dt` accessor, followed by the attribute or method you want to pull out
- Pull out the month from the 'Date and Time' column and save it to a new column called 'month'

In [22]:
accidents['month'] = accidents['Date and Time'].dt.month
accidents.head()

Unnamed: 0,Accident Number,Date and Time,Number of Motor Vehicles,Number of Injuries,Number of Fatalities,Property Damage,Hit and Run,Reporting Officer,Collision Type Code,Collision Type Description,...,Street Address,City,State,ZIP,RPA,Precinct,Latitude,Longitude,Mapped Location,month
0,20190038972,2019-01-15 19:40:00,2,0,0,,N,256374.0,4.0,ANGLE,...,BELL RD & CEDAR POINTE PKWY,ANTIOCH,TN,37013.0,8753.0,SOUTH,36.0449,-86.6671,POINT (-86.6671 36.0449),1
1,20190045402,2019-01-17 23:09:00,2,0,0,,Y,405424.0,11.0,Front to Rear,...,3248 PERCY PRIEST DR,NASHVILLE,TN,37214.0,8955.0,HERMIT,36.1531,-86.6291,POINT (-86.6291 36.1531),1
2,20190051468,2019-01-20 12:57:00,2,0,0,,N,834798.0,6.0,SIDESWIPE - OPPOSITE DIRECTION,...,700 THOMPSON LN,NASHVILLE,TN,37204.0,8305.0,MIDTOW,36.1122,-86.7625,POINT (-86.7625 36.1122),1
3,20190088097,2019-02-02 00:38:00,2,0,0,,Y,660929.0,4.0,ANGLE,...,400 RADNO,NASHVILLE,TN,,,,36.0483,-86.4369,POINT (-86.4369 36.0483),2
4,20190091289,2019-02-03 13:25:00,2,0,0,,N,212369.0,4.0,ANGLE,...,ELLINGTON AG CENTER PVTDR & EDMONDSON PK,NASHVILLE,TN,37220.0,8615.0,MIDTOW,36.0618,-86.7405,POINT (-86.7405 36.0618),2
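
The same pattern works for other parts of a datetime. The sketch below uses a two-row toy Series rather than the accidents data and pulls out several parts at once; the column names in `parts` are chosen only for this illustration.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2019-01-15 19:40:00', '2019-02-02 00:38:00']))

parts = pd.DataFrame({
    'month': stamps.dt.month,              # attribute: no parentheses
    'hour': stamps.dt.hour,
    'weekday': stamps.dt.dayofweek,        # Monday=0 ... Sunday=6
    'month_name': stamps.dt.month_name(),  # method: needs parentheses
})
print(parts)
```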


### Now, let's see if we can take advantage of the datetime format to answer some questions.

#### Question 1: What is the maximum number of cars involved in a single accident in July?
- First, subset the `accidents` DataFrame to get the July accidents
- Then, find the maximum `Number of Motor Vehicles` for accidents that happened in July

In [1]:
# Fill in the code here

And if we want to get more information on this accident, we can use the `nlargest` method.

In [32]:
accidents[accidents['month']==7].nlargest(1, 'Number of Motor Vehicles')

Unnamed: 0,Accident Number,Date and Time,Number of Motor Vehicles,Number of Injuries,Number of Fatalities,Property Damage,Hit and Run,Reporting Officer,Collision Type Code,Collision Type Description,...,Street Address,City,State,ZIP,RPA,Precinct,Latitude,Longitude,Mapped Location,month
1818,20190560008,2019-07-27 14:45:00,8,0,0,,Y,299267.0,11.0,Front to Rear,...,MM 1 4 I 440,NASHVILLE,TN,37209.0,52320.0,WEST,36.1496,-86.8227,POINT (-86.8227 36.1496),7
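
One thing to watch for with `nlargest` is ties: by default it keeps only the first row among equal values. The sketch below uses a made-up three-row DataFrame to show the `keep='all'` option, which returns every tied row.

```python
import pandas as pd

# Toy data with a tie for the largest value
toy = pd.DataFrame({
    'vehicles': [2, 5, 5],
    'city': ['NASHVILLE', 'ANTIOCH', 'NASHVILLE'],
})

top_first = toy.nlargest(1, 'vehicles')            # only the first of the tied rows
top_all = toy.nlargest(1, 'vehicles', keep='all')  # every row tied for the top value

print(top_first)
print(top_all)
```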


#### Question 2: How many total accidents happened in December?

In [2]:
# Fill in the code here

### There are [many different attributes associated with datetimes](https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587)

In [33]:
accidents['Date and Time'].dt.time.head()

0    19:40:00
1    23:09:00
2    12:57:00
3    00:38:00
4    13:25:00
Name: Date and Time, dtype: object

In [34]:
accidents['Date and Time'].dt.date.head()

0    2019-01-15
1    2019-01-17
2    2019-01-20
3    2019-02-02
4    2019-02-03
Name: Date and Time, dtype: object

In [35]:
accidents['Date and Time'].dt.day_name().head()

0     Tuesday
1    Thursday
2      Sunday
3    Saturday
4      Sunday
Name: Date and Time, dtype: object

In [36]:
accidents['Date and Time'].dt.is_leap_year.head()

0    False
1    False
2    False
3    False
4    False
Name: Date and Time, dtype: bool

### You can use comparison operators on `datetime` objects as well

In [37]:
# How many accidents happened before March 3, 2019?
(accidents['Date and Time'] < '03/03/2019').sum()

# Note: The comparison value can be given as a string, and pandas will
# attempt to infer its format. An unambiguous format such as '2019-03-03'
# is the safest choice.
# Try putting in different formats and rerunning this cell.

5558
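
A string is not the only option for the comparison value. As a sketch on made-up timestamps, the cutoff can also be built as a `pd.Timestamp`, which avoids any ambiguity between month and day, and `Series.between` can be used to select a range (both ends inclusive by default).

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    '2019-03-02 23:59:00',
    '2019-03-03 00:00:00',
    '2019-04-01 08:00:00',
]))

# An explicit cutoff: midnight at the start of March 3, 2019
cutoff = pd.Timestamp(year=2019, month=3, day=3)
before = (stamps < cutoff).sum()

# Count the timestamps that fall in March
march_start = pd.Timestamp(year=2019, month=3, day=1)
march_end = pd.Timestamp(year=2019, month=3, day=31, hour=23, minute=59, second=59)
in_march = stamps.between(march_start, march_end).sum()

print(before, in_march)
```

Note that a date with no time component means midnight at the start of that day, so a timestamp at exactly `2019-03-03 00:00:00` is not counted as "before March 3".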

### You can also perform calculations on `datetime` objects

The difference between two datetime objects is a [Timedelta](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html).

In [29]:
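# How long between the accidents labeled 0 and 100?
accidents = accidents.sort_values('Date and Time')
accidents.loc[100, 'Date and Time'] - accidents.loc[0, 'Date and Time']

# Note: sort_values keeps the original index labels, and .loc selects by label,
# so these are the rows originally labeled 0 and 100, not the 1st and 101st
# accidents in time order. Use .iloc to select by position after sorting.
# The result appears as a Timedelta, or a change in time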

Timedelta('81 days 16:05:00')
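
The difference between selecting by label and selecting by position after a sort is easy to miss. The sketch below uses a three-row toy DataFrame (the column name `when` is made up) to show that `.loc` follows the original index labels, while `.iloc` follows the sorted order.

```python
import pandas as pd

toy = pd.DataFrame({'when': pd.to_datetime(['2019-03-01', '2019-01-01', '2019-02-01'])})
by_time = toy.sort_values('when')

# The index labels travel with their rows
print(by_time.index.tolist())  # [1, 2, 0]

# .loc uses labels: label 2 is 2019-02-01 and label 0 is 2019-03-01
label_gap = by_time.loc[2, 'when'] - by_time.loc[0, 'when']

# .iloc uses positions: the last row minus the first row in time order
position_gap = by_time['when'].iloc[2] - by_time['when'].iloc[0]

print(label_gap)     # -28 days +00:00:00
print(position_gap)  # 59 days 00:00:00
```

Calling `.reset_index(drop=True)` after sorting is another option, since it makes the labels match the sorted positions.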