<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Formatting Dates in Python

It is very common to find dates (some combination of year, month, day of the week and time) in data that is collected in real time, and in other data sets that organize event information.

Let's see a data frame that comes with dates from an API.

In [11]:
#!pip install sodapy

In [12]:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)

results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
calls911 = pd.DataFrame.from_records(results)



Let's check the data types:

In [13]:
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   address                      2000 non-null   object
 1   type                         2000 non-null   object
 2   datetime                     2000 non-null   object
 3   latitude                     2000 non-null   object
 4   longitude                    2000 non-null   object
 5   report_location              2000 non-null   object
 6   incident_number              2000 non-null   object
 7   :@computed_region_ru88_fbhk  1997 non-null   object
 8   :@computed_region_kuhn_3gp2  1997 non-null   object
 9   :@computed_region_q256_3sug  2000 non-null   object
 10  :@computed_region_2day_rhn5  154 non-null    object
 11  :@computed_region_cyqu_gs94  146 non-null    object
dtypes: object(12)
memory usage: 187.6+ KB


Let's get rid of some columns:

In [14]:
calls911=calls911.iloc[:,:7]

Let's check the column _datetime_:

In [15]:
calls911.datetime.head()

0    2024-02-11T07:06:00.000
1    2024-02-11T07:04:00.000
2    2024-02-11T06:57:00.000
3    2024-02-11T06:54:00.000
4    2024-02-11T06:53:00.000
Name: datetime, dtype: object

In [16]:
# verify data type
type(calls911.datetime[0])


str

The date and time information is of limited use right now: each value is just a string, so we cannot extract parts of the date or do arithmetic with it.

Let's make it useful:

In [17]:
calls911.datetime=pd.to_datetime(calls911.datetime)
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   address          2000 non-null   object        
 1   type             2000 non-null   object        
 2   datetime         2000 non-null   datetime64[ns]
 3   latitude         2000 non-null   object        
 4   longitude        2000 non-null   object        
 5   report_location  2000 non-null   object        
 6   incident_number  2000 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 109.5+ KB
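
To see what the conversion buys us, here is a minimal sketch on a hand-made Series. The first two values are copied from the output above; the names `sample` and `parsed`, the explicit `format` and the `errors="coerce"` option are illustrative choices, not part of the original cell.

```python
import pandas as pd

# two strings in the same layout as the API column, plus one bad value
sample = pd.Series(["2024-02-11T07:06:00.000",
                    "2024-02-11T07:04:00.000",
                    "not a date"])

# an explicit format makes parsing strict;
# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(sample, format="%Y-%m-%dT%H:%M:%S.%f", errors="coerce")

print(parsed.dtype)            # a datetime64 dtype
print(parsed.isna().tolist())  # [False, False, True]

# arithmetic now works: time elapsed between the first two calls
print(parsed[0] - parsed[1])   # 0 days 00:02:00
```

Counting the `NaT` values after a coerced conversion is a quick way to find out how many rows did not follow the expected layout.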


In [18]:
calls911

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number
0,N 46th St / Woodland Park Ave N,Illegal Burn,2024-02-11 07:06:00,47.662124,-122.344551,"{'type': 'Point', 'coordinates': [-122.344551,...",F240021372
1,4501 12th Ave Ne,Aid Response,2024-02-11 07:04:00,47.661285,-122.315442,"{'type': 'Point', 'coordinates': [-122.315442,...",F240021371
2,5240 University Way Ne,Aid Response,2024-02-11 06:57:00,47.666712,-122.313047,"{'type': 'Point', 'coordinates': [-122.313047,...",F240021369
3,2835 S Bayview St,Triaged Incident,2024-02-11 06:54:00,47.58101,-122.296025,"{'type': 'Point', 'coordinates': [-122.296025,...",F240021367
4,4th Ave / Madison St,Medic Response,2024-02-11 06:53:00,47.606088,-122.332909,"{'type': 'Point', 'coordinates': [-122.332909,...",F240021366
...,...,...,...,...,...,...,...
1995,4700 38th Ave Sw,Aid Response,2024-02-05 14:40:00,47.561082,-122.380063,"{'type': 'Point', 'coordinates': [-122.380063,...",F240018709
1996,77 S Washington St,Aid Response,2024-02-05 14:35:00,47.600885,-122.334925,"{'type': 'Point', 'coordinates': [-122.334925,...",F240018708
1997,9019 Rainier Ave S,Low Acuity Response,2024-02-05 14:32:00,47.522815,-122.269988,"{'type': 'Point', 'coordinates': [-122.269988,...",F240018707
1998,10750 Greenwood Ave N,Aid Response,2024-02-05 14:19:00,47.707197,-122.355469,"{'type': 'Point', 'coordinates': [-122.355469,...",F240018704


Once you have this data type, you can retrieve important information:

In [19]:
calls911['date']=calls911.datetime.dt.date
calls911['year']=calls911.datetime.dt.year
calls911['month']=calls911.datetime.dt.month_name()
calls911['weekday']=calls911.datetime.dt.day_name()
calls911['hour']=calls911.datetime.dt.hour

In [20]:
calls911.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour
0,N 46th St / Woodland Park Ave N,Illegal Burn,2024-02-11 07:06:00,47.662124,-122.344551,"{'type': 'Point', 'coordinates': [-122.344551,...",F240021372,2024-02-11,2024,February,Sunday,7
1,4501 12th Ave Ne,Aid Response,2024-02-11 07:04:00,47.661285,-122.315442,"{'type': 'Point', 'coordinates': [-122.315442,...",F240021371,2024-02-11,2024,February,Sunday,7
2,5240 University Way Ne,Aid Response,2024-02-11 06:57:00,47.666712,-122.313047,"{'type': 'Point', 'coordinates': [-122.313047,...",F240021369,2024-02-11,2024,February,Sunday,6
3,2835 S Bayview St,Triaged Incident,2024-02-11 06:54:00,47.58101,-122.296025,"{'type': 'Point', 'coordinates': [-122.296025,...",F240021367,2024-02-11,2024,February,Sunday,6
4,4th Ave / Madison St,Medic Response,2024-02-11 06:53:00,47.606088,-122.332909,"{'type': 'Point', 'coordinates': [-122.332909,...",F240021366,2024-02-11,2024,February,Sunday,6
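
The same `.dt` accessor can be tried on a tiny Series before applying it to a full data frame. This is only a sketch: `stamps` and `parts` are made-up names, the two timestamps are taken from the table above, and the `strftime` label is an extra that the notebook does not use.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2024-02-11 07:06:00",
                                   "2024-02-05 14:40:00"]))

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month_name(),
    "weekday": stamps.dt.day_name(),
    "hour": stamps.dt.hour,
    "yearmonth": stamps.dt.strftime("%Y-%m"),  # custom text label
})
print(parts)
```

Note that `dt.date` and `dt.strftime` return plain Python objects or strings, so columns built with them are no longer of datetime type.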


Let's create a new column from what we have. In this case, a boolean that flags night-time calls, defined here as 8 pm or later, or any time up to 6:59 am (that is, `hour` is 20 or more, or 6 or less):

In [43]:
calls911['nightTime']=((calls911['hour']<=6) | (calls911['hour']>=20))

calls911.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour,nightTime
0,N 46th St / Woodland Park Ave N,Illegal Burn,2024-02-11 07:06:00,47.662124,-122.344551,"{'type': 'Point', 'coordinates': [-122.344551,...",F240021372,2024-02-11,2024,February,Sunday,7,False
1,4501 12th Ave Ne,Aid Response,2024-02-11 07:04:00,47.661285,-122.315442,"{'type': 'Point', 'coordinates': [-122.315442,...",F240021371,2024-02-11,2024,February,Sunday,7,False
2,5240 University Way Ne,Aid Response,2024-02-11 06:57:00,47.666712,-122.313047,"{'type': 'Point', 'coordinates': [-122.313047,...",F240021369,2024-02-11,2024,February,Sunday,6,True
3,2835 S Bayview St,Triaged Incident,2024-02-11 06:54:00,47.58101,-122.296025,"{'type': 'Point', 'coordinates': [-122.296025,...",F240021367,2024-02-11,2024,February,Sunday,6,True
4,4th Ave / Madison St,Medic Response,2024-02-11 06:53:00,47.606088,-122.332909,"{'type': 'Point', 'coordinates': [-122.332909,...",F240021366,2024-02-11,2024,February,Sunday,6,True
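
Boundary hours are easy to get wrong, so it can help to test the rule on a few hand-picked hours first. The sketch below compares the rule used in the cell above with a stricter variant; `hours`, `night_inclusive` and `night_strict` are illustrative names.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 7, 19, 20, 23], name="hour")

# same rule as the cell above: 20:00 or later, or any time up to 06:59
night_inclusive = (hours <= 6) | (hours >= 20)

# stricter variant: the 6 am hour counts as daytime
night_strict = (hours < 6) | (hours >= 20)

print(pd.DataFrame({"hour": hours,
                    "inclusive": night_inclusive,
                    "strict": night_strict}))
```

The two variants differ only for hour 6, which is exactly the case that appears in the first rows of the data.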


Let's save what we have:

In [44]:
import os 

calls911.to_pickle(os.path.join('DataFiles','calls911.pkl'))
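
A pickle file keeps the data types, which is why it is a convenient format here. The sketch below illustrates this with a throwaway data frame in a temporary folder; the folder layout and file names are hypothetical. It also creates the folder first, because `to_pickle` does not create missing folders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2024-02-11 07:06:00"]),
                   "type": ["Illegal Burn"]})

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "DataFiles")   # hypothetical location
    os.makedirs(folder, exist_ok=True)        # to_pickle does not create folders

    pkl_path = os.path.join(folder, "demo.pkl")
    csv_path = os.path.join(folder, "demo.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle["when"].dtype)  # still a datetime64 dtype
print(from_csv["when"].dtype)     # text again unless parse_dates is used
```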

What about data that comes in Spanish?

In [23]:
#!pip install bs4

In [24]:
link="https://es.wikipedia.org/wiki/Pandemia_de_COVID-19"

import pandas as pd

covid=pd.read_html(link, flavor="bs4", attrs={"class":"wikitable sortable"})

Let's keep the second data frame:

In [25]:
covidDF=covid[1].copy()
covidDF

Unnamed: 0,Territorios,Territorios.1,Fecha del análisis,Porcentaje con anticuerpos,Personas que han sido infectadas,Referencia
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​


Notice the accented character (the _á_ in _análisis_) and the spaces in the column names:

In [26]:
covidDF.columns

Index(['Territorios', 'Territorios.1', 'Fecha del análisis',
       'Porcentaje con anticuerpos', 'Personas que han sido infectadas',
       'Referencia'],
      dtype='object')

Let's get rid of those:

In [27]:
#!pip install unidecode

In [28]:
import unidecode as ud

[ud.unidecode(c) for c in covidDF.columns]

['Territorios',
 'Territorios.1',
 'Fecha del analisis',
 'Porcentaje con anticuerpos',
 'Personas que han sido infectadas',
 'Referencia']

In [29]:
#or
import re

[re.sub('\\s','',ud.unidecode(c)) for c in covidDF.columns]

['Territorios',
 'Territorios.1',
 'Fechadelanalisis',
 'Porcentajeconanticuerpos',
 'Personasquehansidoinfectadas',
 'Referencia']

In [30]:
#then
covidDF.columns=[re.sub('\\s','',ud.unidecode(c)) for c in covidDF.columns]
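
If installing `unidecode` is not an option, a similar result for Latin-script accents can be sketched with the standard library alone. The helper `clean_name` below is hypothetical and only removes combining accent marks; unlike `unidecode`, it does not transliterate other scripts.

```python
import re
import unicodedata


def clean_name(name):
    """Strip accents and whitespace from a column name (illustrative helper)."""
    # NFKD splits 'á' into 'a' plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", name)
    # drop the combining marks, keeping the base letters
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    return re.sub(r"\s", "", no_accents)


print(clean_name("Fecha del análisis"))  # Fechadelanalisis
```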

Let's focus on the _Fechadelanalisis_ column:

In [31]:
# use " a " to split:
covidDF.Fechadelanalisis.str.split(" a ",expand=True)

Unnamed: 0,0,1
0,23 de abril de 2020,3 de junio de 2020
1,6 de abril de 2020,9 de mayo de 2020
2,15 de diciembre de 2020,
3,16 de septiembre de 2020,
4,16 de diciembre de 2020,
5,27 de junio de 2020,10 de julio de 2020
6,19 de abril de 2020,28 de abril de 2020
7,19 de abril de 2020,28 de abril de 2020
8,noviembre de 2020,


In [32]:
# create the two columns

covidDF[["fecha1","fecha2"]]=covidDF.Fechadelanalisis.str.split(" a ",expand=True)
covidDF

Unnamed: 0,Territorios,Territorios.1,Fechadelanalisis,Porcentajeconanticuerpos,Personasquehansidoinfectadas,Referencia,fecha1,fecha2
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​,23 de abril de 2020,3 de junio de 2020
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​,6 de abril de 2020,9 de mayo de 2020
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​,15 de diciembre de 2020,
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​,16 de septiembre de 2020,
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​,16 de diciembre de 2020,
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​,27 de junio de 2020,10 de julio de 2020
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​,19 de abril de 2020,28 de abril de 2020
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​,19 de abril de 2020,28 de abril de 2020
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​,noviembre de 2020,
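
The splitting step can be rehearsed on two sample strings. In this sketch, `n=1` limits the split to the first `" a "`, and the missing end date is filled with the start date; that last step is an assumption about how single dates should be treated, not something the notebook does.

```python
import pandas as pd

ranges = pd.Series(["23 de abril de 2020 a 3 de junio de 2020",
                    "15 de diciembre de 2020"])

# n=1 splits at the first " a " only, so at most two columns are produced
parts = ranges.str.split(" a ", n=1, expand=True)
parts.columns = ["start", "end"]

# one possible convention (an assumption): a single date is a one-day range
parts["end"] = parts["end"].fillna(parts["start"])
print(parts)
```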


Let's format one of those columns:

In [33]:
covidDF.fecha1

0         23 de abril de 2020
1          6 de abril de 2020
2     15 de diciembre de 2020
3    16 de septiembre de 2020
4     16 de diciembre de 2020
5         27 de junio de 2020
6         19 de abril de 2020
7         19 de abril de 2020
8           noviembre de 2020
Name: fecha1, dtype: object

In [34]:
# row 8 only has month and year; assume the first day of the month
covidDF.loc[8,'fecha1']='1 de noviembre de 2020'

In [35]:
# let's split again:

covidDF.fecha1.str.split(" de ",expand=True)

Unnamed: 0,0,1,2
0,23,abril,2020
1,6,abril,2020
2,15,diciembre,2020
3,16,septiembre,2020
4,16,diciembre,2020
5,27,junio,2020
6,19,abril,2020
7,19,abril,2020
8,1,noviembre,2020


We can create three new columns:

In [36]:
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]=covidDF.fecha1.str.split(" de ",expand=True)
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]

Unnamed: 0,fecha1_dia,fecha1_mes,fecha1_anho
0,23,abril,2020
1,6,abril,2020
2,15,diciembre,2020
3,16,septiembre,2020
4,16,diciembre,2020
5,27,junio,2020
6,19,abril,2020
7,19,abril,2020
8,1,noviembre,2020


We should use the month number instead of the month name. Let's prepare a dict of changes:

In [37]:
monthName=('enero','febrero','marzo','abril','mayo','junio','julio','agosto','septiembre','octubre','noviembre','diciembre')
changes={name:number for name,number in zip(monthName,range(1,len(monthName)+1))}
changes

{'enero': 1,
 'febrero': 2,
 'marzo': 3,
 'abril': 4,
 'mayo': 5,
 'junio': 6,
 'julio': 7,
 'agosto': 8,
 'septiembre': 9,
 'octubre': 10,
 'noviembre': 11,
 'diciembre': 12}
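
An equivalent way to build the same mapping is `enumerate` with `start=1`. The sketch below also wraps the lookup in a small helper that tolerates capital letters and stray spaces; `month_number` and `to_month_number` are illustrative names.

```python
month_names = ("enero", "febrero", "marzo", "abril", "mayo", "junio",
               "julio", "agosto", "septiembre", "octubre", "noviembre",
               "diciembre")

# enumerate(..., start=1) pairs each name with its month number
month_number = {name: i for i, name in enumerate(month_names, start=1)}


def to_month_number(text):
    """Look up a month name, ignoring case and surrounding spaces."""
    return month_number[text.strip().lower()]


print(to_month_number(" Abril "))  # 4
```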

In [38]:
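# assign the result back: replace(..., inplace=True) on a selected column may not update covidDF in recent pandas versions
covidDF['fecha1_mes']=covidDF['fecha1_mes'].map(changes)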

Now we have:

In [39]:
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]

Unnamed: 0,fecha1_dia,fecha1_mes,fecha1_anho
0,23,4,2020
1,6,4,2020
2,15,12,2020
3,16,9,2020
4,16,12,2020
5,27,6,2020
6,19,4,2020
7,19,4,2020
8,1,11,2020


We will use those columns to create a date:

In [40]:
pd.to_datetime(dict(year=covidDF.fecha1_anho, month=covidDF.fecha1_mes, day=covidDF.fecha1_dia))

0   2020-04-23
1   2020-04-06
2   2020-12-15
3   2020-09-16
4   2020-12-16
5   2020-06-27
6   2020-04-19
7   2020-04-19
8   2020-11-01
dtype: datetime64[ns]
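
`pd.to_datetime` can also assemble dates from a data frame whose columns are named `year`, `month` and `day`. The sketch below renames a copy of three made-up columns instead of building the dict by hand; `pieces`, `renamed` and `dates` are illustrative names.

```python
import pandas as pd

pieces = pd.DataFrame({"dia": ["23", "6"],
                       "mes": [4, 4],
                       "anho": ["2020", "2020"]})

# to_datetime looks for columns called year, month and day
renamed = pieces.rename(columns={"anho": "year", "mes": "month", "dia": "day"})
dates = pd.to_datetime(renamed[["year", "month", "day"]])
print(dates)
```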

In [46]:
# creating the column
covidDF['newDate']=pd.to_datetime(dict(year=covidDF.fecha1_anho, month=covidDF.fecha1_mes, day=covidDF.fecha1_dia))

# so far
covidDF.head()

Unnamed: 0,Territorios,Territorios.1,Fechadelanalisis,Porcentajeconanticuerpos,Personasquehansidoinfectadas,Referencia,fecha1,fecha2,fecha1_dia,fecha1_mes,fecha1_anho,newDate
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​,23 de abril de 2020,3 de junio de 2020,23,4,2020,2020-04-23
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​,6 de abril de 2020,9 de mayo de 2020,6,4,2020,2020-04-06
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​,15 de diciembre de 2020,,15,12,2020,2020-12-15
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​,16 de septiembre de 2020,,16,9,2020,2020-09-16
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​,16 de diciembre de 2020,,16,12,2020,2020-12-16
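
The split, map and assemble steps can also be folded into one function that parses a single Spanish date string. This is a sketch under assumptions: the function name, the regular expression and the default day for strings without a day are all choices made for illustration.

```python
import re
from datetime import date

MONTHS = {"enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5,
          "junio": 6, "julio": 7, "agosto": 8, "septiembre": 9,
          "octubre": 10, "noviembre": 11, "diciembre": 12}

# optional "<day> de ", then "<month> de <year>"
PATTERN = re.compile(r"^(?:(\d{1,2}) de )?([a-z]+) de (\d{4})$")


def parse_spanish_date(text, default_day=1):
    """Parse strings such as '23 de abril de 2020' or 'noviembre de 2020'."""
    match = PATTERN.match(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    day, month, year = match.groups()
    # when the day is missing, fall back to default_day (an assumption)
    return date(int(year), MONTHS[month], int(day) if day else default_day)


print(parse_spanish_date("23 de abril de 2020"))  # 2020-04-23
print(parse_spanish_date("noviembre de 2020"))    # 2020-11-01
```

With pandas, such a function could be applied to a text column with `Series.map` and the result passed to `pd.to_datetime`.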


In [45]:
# data types
covidDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Territorios                   9 non-null      object        
 1   Territorios.1                 9 non-null      object        
 2   Fechadelanalisis              9 non-null      object        
 3   Porcentajeconanticuerpos      9 non-null      object        
 4   Personasquehansidoinfectadas  9 non-null      object        
 5   Referencia                    9 non-null      object        
 6   fecha1                        9 non-null      object        
 7   fecha2                        5 non-null      object        
 8   fecha1_dia                    9 non-null      object        
 9   fecha1_mes                    9 non-null      int64         
 10  fecha1_anho                   9 non-null      object        
 11  newDate                       9 non-

Note that two columns hold numbers stored as text, with percent signs, decimal commas and spaces as thousands separators:

In [52]:
covidDF.loc[:,['Porcentajeconanticuerpos','Personasquehansidoinfectadas']]

Unnamed: 0,Porcentajeconanticuerpos,Personasquehansidoinfectadas
0,57%,635 000
1,"10,9%",54 000
2,"9,9%",4 700 000
3,"27,3%",19 300 000
4,"25,0%",32 000 000
5,"23,5%",5 111 000
6,"22,7%",1 907 000
7,14%,2 139 000
8,"8,8%",5 900 000


Let's clean and **format**:

In [65]:
covidDF[['Porcentajeconanticuerpos','Personasquehansidoinfectadas']]=covidDF.iloc[:,[3,4]].replace('\\%|\\s',"",regex=True).replace(',','.',regex=True)
covidDF[['Porcentajeconanticuerpos','Personasquehansidoinfectadas']]

Unnamed: 0,Porcentajeconanticuerpos,Personasquehansidoinfectadas
0,57.0,635000
1,10.9,54000
2,9.9,4700000
3,27.3,19300000
4,25.0,32000000
5,23.5,5111000
6,22.7,1907000
7,14.0,2139000
8,8.8,5900000


In [66]:
#but
covidDF[['Porcentajeconanticuerpos','Personasquehansidoinfectadas']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Porcentajeconanticuerpos      9 non-null      object
 1   Personasquehansidoinfectadas  9 non-null      object
dtypes: object(2)
memory usage: 276.0+ bytes


The values are clean, but they are still strings and need a data type change:

In [72]:
covidDF[['Porcentajeconanticuerpos','Personasquehansidoinfectadas']]=covidDF.iloc[:,[3,4]].apply(pd.to_numeric)

In [74]:
covidDF[['Porcentajeconanticuerpos','Personasquehansidoinfectadas']]

Unnamed: 0,Porcentajeconanticuerpos,Personasquehansidoinfectadas
0,57.0,635000
1,10.9,54000
2,9.9,4700000
3,27.3,19300000
4,25.0,32000000
5,23.5,5111000
6,22.7,1907000
7,14.0,2139000
8,8.8,5900000
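
The cleaning and the type change can be combined in one small function applied to each column. The sketch below uses two made-up columns with values in the same style as the table; `raw`, `to_number` and `clean` are illustrative names.

```python
import pandas as pd

raw = pd.DataFrame({"pct": ["57%", "10,9%"],
                    "people": ["635 000", "4 700 000"]})


def to_number(col):
    """Remove '%' and whitespace, swap the decimal comma, then convert."""
    stripped = col.str.replace(r"[%\s]", "", regex=True)
    return pd.to_numeric(stripped.str.replace(",", ".", regex=False))


clean = raw.apply(to_number)
print(clean)
print(clean.dtypes)
```

Selecting the columns by name, as here, is less fragile than selecting them by position with `iloc`, because the code keeps working if the column order changes.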


Now we have it:

In [71]:
covidDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Territorios                   9 non-null      object        
 1   Territorios.1                 9 non-null      object        
 2   Fechadelanalisis              9 non-null      object        
 3   Porcentajeconanticuerpos      9 non-null      float64       
 4   Personasquehansidoinfectadas  9 non-null      int64         
 5   Referencia                    9 non-null      object        
 6   fecha1                        9 non-null      object        
 7   fecha2                        5 non-null      object        
 8   fecha1_dia                    9 non-null      object        
 9   fecha1_mes                    9 non-null      int64         
 10  fecha1_anho                   9 non-null      object        
 11  newDate                       9 non-