<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>


# Formatting Data (dates)

Dates (some combination of year, month, day of the week, and time) are very common in data collected in real time and in other data sets that organize information about events.

Let's see a data frame that comes with dates from an API.

In [1]:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)

results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
calls911 = pd.DataFrame.from_records(results)



Let's check some information:

In [2]:
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   address                      2000 non-null   object
 1   type                         2000 non-null   object
 2   datetime                     2000 non-null   object
 3   latitude                     1998 non-null   object
 4   longitude                    1998 non-null   object
 5   report_location              1998 non-null   object
 6   incident_number              2000 non-null   object
 7   :@computed_region_ru88_fbhk  1991 non-null   object
 8   :@computed_region_kuhn_3gp2  1991 non-null   object
 9   :@computed_region_q256_3sug  1998 non-null   object
 10  :@computed_region_2day_rhn5  175 non-null    object
 11  :@computed_region_cyqu_gs94  168 non-null    object
dtypes: object(12)
memory usage: 187.6+ KB


Let's get rid of some columns:

In [3]:
calls911=calls911.iloc[:,:7]

Let's check the column _datetime_:

In [4]:
calls911.datetime.head()

0    2023-03-13T14:24:00.000
1    2023-03-13T14:21:00.000
2    2023-03-13T14:18:00.000
3    2023-03-13T14:13:00.000
4    2023-03-13T14:10:00.000
Name: datetime, dtype: object

In [5]:
# check the type of a single value
type(calls911.datetime[0])

str

At this point the date and time information is of limited use, because each value is just a string.
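
To make the limitation concrete, here is a minimal sketch using two hypothetical values in the same text format shown above (the names `as_text`, `as_time`, and `elapsed` are illustrative and not part of the notebook). Strings cannot be subtracted, so elapsed time is unavailable until the values are converted:

```python
import pandas as pd

# Two timestamps in the same text format shown in the column above
as_text = pd.Series(["2023-03-13T14:24:00.000", "2023-03-13T14:21:00.000"])

# Subtracting two Python strings is not defined, so this raises TypeError
try:
    as_text[0] - as_text[1]
    text_subtraction_works = True
except TypeError:
    text_subtraction_works = False

# After conversion, the same subtraction yields a Timedelta
as_time = pd.to_datetime(as_text)
elapsed = as_time[0] - as_time[1]
print(text_subtraction_works, elapsed)
```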

Let's make it useful:

In [6]:
calls911.datetime=pd.to_datetime(calls911.datetime,format='%Y-%m-%dT%H:%M:%S.%f')
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   address          2000 non-null   object        
 1   type             2000 non-null   object        
 2   datetime         2000 non-null   datetime64[ns]
 3   latitude         1998 non-null   object        
 4   longitude        1998 non-null   object        
 5   report_location  1998 non-null   object        
 6   incident_number  2000 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 109.5+ KB
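
The `format` argument has to describe the whole string, including the literal `T` and the fractional seconds. The sketch below uses a small hypothetical series (`raw`) rather than the API data, and it assumes you would rather get a missing value than an error when a string cannot be parsed, which is what `errors="coerce"` does:

```python
import pandas as pd

# One well-formed value and one deliberately malformed value
raw = pd.Series(["2023-03-13T14:24:00.000", "not a date"])

# %f matches the fractional seconds; the literal T separates date and time.
# errors="coerce" turns anything that does not match into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S.%f", errors="coerce")
print(parsed)
```

Counting the `NaT` values afterwards (for example with `parsed.isna().sum()`) is a quick way to see how many rows failed to parse.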


Once you have this data type, you can retrieve important information:

In [7]:
calls911['date']=calls911.datetime.dt.date
calls911['year']=calls911.datetime.dt.year
calls911['month']=calls911.datetime.dt.month
calls911['month_name']=calls911.datetime.dt.month_name()
calls911['day']=calls911.datetime.dt.day
calls911['weekday']=calls911.datetime.dt.day_name()
calls911['hour']=calls911.datetime.dt.hour
calls911['minute']=calls911.datetime.dt.minute

In [8]:
calls911.head()

,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,month_name,day,weekday,hour,minute
0,509 3rd Ave,Aid Response,2023-03-13 14:24:00,47.602114,-122.330809,"{'type': 'Point', 'coordinates': [-122.330809,...",F230031084,2023-03-13,2023,3,March,13,Monday,14,24
1,937 N 96th St,Aid Response,2023-03-13 14:21:00,47.698721,-122.34681,"{'type': 'Point', 'coordinates': [-122.34681, ...",F230031083,2023-03-13,2023,3,March,13,Monday,14,21
2,908 Jefferson St,Medic Response,2023-03-13 14:18:00,47.604879,-122.324007,"{'type': 'Point', 'coordinates': [-122.324007,...",F230031081,2023-03-13,2023,3,March,13,Monday,14,18
3,4535 17th Ave Ne,Aid Response,2023-03-13 14:13:00,47.661569,-122.309717,"{'type': 'Point', 'coordinates': [-122.309717,...",F230031080,2023-03-13,2023,3,March,13,Monday,14,13
4,724 26th Ave,Nurseline/AMR,2023-03-13 14:10:00,47.608238,-122.298865,"{'type': 'Point', 'coordinates': [-122.298865,...",F230031079,2023-03-13,2023,3,March,13,Monday,14,10
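
These date parts are mostly useful for grouping. As a rough sketch, assuming a tiny hypothetical frame (`sample`) with three made-up timestamps instead of the real calls, you could count events per weekday and per hour like this:

```python
import pandas as pd

# Three made-up timestamps: two on a Monday, one on the Sunday before
sample = pd.DataFrame({"datetime": pd.to_datetime([
    "2023-03-13 14:24:00", "2023-03-13 14:10:00", "2023-03-12 23:05:00"])})

# Number of events per day name
calls_per_weekday = sample.datetime.dt.day_name().value_counts()

# Number of events per hour of the day
calls_per_hour = sample.groupby(sample.datetime.dt.hour).size()

print(calls_per_weekday)
print(calls_per_hour)
```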


Let's create a new column from what we have. In this case, a boolean that tells whether the call came in at night (from 8 pm until 6 am):

In [9]:
calls911['nightTime']=((calls911['hour']<6) | (calls911['hour']>=20))
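
It is worth checking a condition like this at its boundaries. The sketch below uses a hypothetical series of hours (`hours`) and assumes that 6 am itself counts as daytime, so only hours 20 to 23 and 0 to 5 are flagged:

```python
import pandas as pd

# Hours on both sides of each boundary
hours = pd.Series([5, 6, 19, 20])

# Night means before 6 am or from 8 pm onward
night = (hours < 6) | (hours >= 20)
print(night.tolist())
```

If you decide that the 6 am hour belongs to the night, change `<` to `<=`; the point is to make the choice explicit.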

Can we use several columns to build a _datetime_?

In [10]:
pd.to_datetime(calls911[['month', 'day','year','hour','minute']])

0      2023-03-13 14:24:00
1      2023-03-13 14:21:00
2      2023-03-13 14:18:00
3      2023-03-13 14:13:00
4      2023-03-13 14:10:00
               ...        
1995   2023-03-07 09:17:00
1996   2023-03-07 09:17:00
1997   2023-03-07 09:08:00
1998   2023-03-07 09:04:00
1999   2023-03-07 08:54:00
Length: 2000, dtype: datetime64[ns]
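
This works because `pd.to_datetime` recognizes the column names: a data frame needs at least `year`, `month`, and `day` columns, and `hour` and `minute` are optional extras. A minimal sketch, assuming a hypothetical frame (`parts`) whose columns have other names and therefore need renaming first:

```python
import pandas as pd

# Hypothetical frame with abbreviated names for the date columns
parts = pd.DataFrame({"yr": [2023, 2023], "mo": [3, 3], "dy": [13, 7],
                      "hour": [14, 8], "minute": [24, 54]})

# Rename to the names pd.to_datetime expects, then assemble the timestamps
renamed = parts.rename(columns={"yr": "year", "mo": "month", "dy": "day"})
rebuilt = pd.to_datetime(renamed)
print(rebuilt)
```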

Notice that the latitude and longitude columns are not numeric. Let's fix that:

In [11]:
calls911[['longitude','latitude']]=calls911[['longitude','latitude']].apply(pd.to_numeric)
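
Real data sometimes contains text that is not a number. The sketch below assumes a hypothetical frame (`coords`) with one missing value and one non-numeric value, and shows how `errors="coerce"` turns both into `NaN` instead of stopping with an error:

```python
import pandas as pd

# Coordinates stored as text, with a missing value and a bad value
coords = pd.DataFrame({"latitude": ["47.602114", None],
                       "longitude": ["-122.330809", "unknown"]})

# Extra keyword arguments given to apply are passed on to pd.to_numeric
numeric = coords.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
```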

In [None]:
calls911.info()

Let's save what we have:

In [None]:
import os

# create the target folder if it does not exist yet
os.makedirs('data', exist_ok=True)
where=os.path.join('data','calls911.pkl')
calls911.to_pickle(where)
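
A pickle file keeps the data types, which matters after all this formatting work. As a rough sketch, assuming a one-row hypothetical frame (`frame`) and a temporary folder instead of the `data` folder, compare what comes back from a pickle with what comes back from a CSV file read with default settings:

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"datetime": pd.to_datetime(["2023-03-13 14:24:00"])})

# Write both formats to a throwaway folder and read them back
with tempfile.TemporaryDirectory() as folder:
    pkl_path = os.path.join(folder, "demo.pkl")
    csv_path = os.path.join(folder, "demo.csv")
    frame.to_pickle(pkl_path)
    frame.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps the datetime type; the CSV returns text by default
print(from_pickle.datetime.dtype, from_csv.datetime.dtype)
```

Keep in mind that pickle files should only be loaded from sources you trust.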