<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Formatting Dates in Python

It is very common to find dates (some combination of year, month, day of week and time) in data that is collected in real time, and in other data sets that organize event information.

Let's see a data frame that comes with dates from an API.

In [2]:
#!pip install sodapy



In [1]:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)

results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
calls911 = pd.DataFrame.from_records(results)



Let's check some information:

In [2]:
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   address                      2000 non-null   object
 1   type                         2000 non-null   object
 2   datetime                     2000 non-null   object
 3   latitude                     2000 non-null   object
 4   longitude                    2000 non-null   object
 5   report_location              2000 non-null   object
 6   incident_number              2000 non-null   object
 7   :@computed_region_ru88_fbhk  1993 non-null   object
 8   :@computed_region_kuhn_3gp2  1993 non-null   object
 9   :@computed_region_q256_3sug  2000 non-null   object
 10  :@computed_region_2day_rhn5  149 non-null    object
 11  :@computed_region_cyqu_gs94  143 non-null    object
dtypes: object(12)
memory usage: 187.6+ KB


Let's keep only the first seven columns and drop the computed-region ones:

In [3]:
calls911=calls911.iloc[:,:7]
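
Selecting by position works here because the columns of interest come first. As an alternative sketch, the same cut can be made by name. The toy frame and the prefix test below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame reusing a few of the column names listed by info() above
toy = pd.DataFrame(columns=["address", "type", "datetime",
                            ":@computed_region_ru88_fbhk"])

# Keep every column that is not an automatically computed region
keep = [c for c in toy.columns if not c.startswith(":@computed_region")]
trimmed = toy[keep]
print(trimmed.columns.tolist())
```

Filtering by name keeps working if the API ever returns the columns in a different order.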

Let's check the column _datetime_:

In [4]:
calls911.datetime.head()

0    2024-02-10T11:21:00.000
1    2024-02-10T11:18:00.000
2    2024-02-10T11:16:00.000
3    2024-02-10T11:15:00.000
4    2024-02-10T11:14:00.000
Name: datetime, dtype: object

In [5]:
# check the type of a single value
type(calls911.datetime[0])


str

At this point the date and time information is of limited use because each value is just a string: you cannot extract the hour or compute the time between two calls.

Let's make it useful:

In [6]:
calls911.datetime=pd.to_datetime(calls911.datetime)
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   address          2000 non-null   object        
 1   type             2000 non-null   object        
 2   datetime         2000 non-null   datetime64[ns]
 3   latitude         2000 non-null   object        
 4   longitude        2000 non-null   object        
 5   report_location  2000 non-null   object        
 6   incident_number  2000 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 109.5+ KB
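
The call above lets pandas infer the format. A minimal sketch of a stricter variant follows, assuming the strings always look like the ones printed earlier. The toy series and the invalid entry are made up for illustration.

```python
import pandas as pd

# Two values in the API's layout plus one deliberately broken entry
raw = pd.Series(["2024-02-10T11:21:00.000",
                 "2024-02-10T11:18:00.000",
                 "not a date"])

# An explicit format avoids guessing; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S.%f", errors="coerce")
print(parsed)
```

Counting the missing values with parsed.isna().sum() afterwards is a quick way to see how many entries failed to parse.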


In [7]:
calls911

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number
0,100 Melrose Ave E,Medic Response,2024-02-10 11:21:00,47.618497,-122.327854,"{'type': 'Point', 'coordinates': [-122.327854,...",F240021003
1,8635 Fauntleroy Pl Sw,Aid Response,2024-02-10 11:18:00,47.526068,-122.390815,"{'type': 'Point', 'coordinates': [-122.390815,...",F240021002
2,3870 Montlake Blvd Ne,EVENT - Special Event,2024-02-10 11:16:00,47.651797,-122.303502,"{'type': 'Point', 'coordinates': [-122.303502,...",F240021001
3,6400 15th Ave Nw,Automatic Medical Alarm,2024-02-10 11:15:00,47.675279,-122.376211,"{'type': 'Point', 'coordinates': [-122.376211,...",F240021000
4,2821 S Walden St,Nurseline/AMR,2024-02-10 11:14:00,47.572163,-122.296173,"{'type': 'Point', 'coordinates': [-122.296173,...",F240020999
...,...,...,...,...,...,...,...
1995,9999 Holman Rd Nw,Nurseline/AMR,2024-02-04 17:08:00,47.701637,-122.362244,"{'type': 'Point', 'coordinates': [-122.362244,...",F240018336
1996,220 W Olympic Pl,Investigate Out Of Service,2024-02-04 17:07:00,47.626727,-122.359559,"{'type': 'Point', 'coordinates': [-122.359559,...",F240018337
1997,3855 34th Ave W,Aid Response,2024-02-04 16:57:00,47.65455,-122.400898,"{'type': 'Point', 'coordinates': [-122.400898,...",F240018335
1998,1811 Eastlake Ave,Aid Response,2024-02-04 16:53:00,47.618002,-122.329182,"{'type': 'Point', 'coordinates': [-122.329182,...",F240018334


Once you have this data type, you can retrieve important information:

In [8]:
calls911['date']=calls911.datetime.dt.date
calls911['year']=calls911.datetime.dt.year
calls911['month']=calls911.datetime.dt.month_name()
calls911['weekday']=calls911.datetime.dt.day_name()
calls911['hour']=calls911.datetime.dt.hour

In [9]:
calls911.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour
0,100 Melrose Ave E,Medic Response,2024-02-10 11:21:00,47.618497,-122.327854,"{'type': 'Point', 'coordinates': [-122.327854,...",F240021003,2024-02-10,2024,February,Saturday,11
1,8635 Fauntleroy Pl Sw,Aid Response,2024-02-10 11:18:00,47.526068,-122.390815,"{'type': 'Point', 'coordinates': [-122.390815,...",F240021002,2024-02-10,2024,February,Saturday,11
2,3870 Montlake Blvd Ne,EVENT - Special Event,2024-02-10 11:16:00,47.651797,-122.303502,"{'type': 'Point', 'coordinates': [-122.303502,...",F240021001,2024-02-10,2024,February,Saturday,11
3,6400 15th Ave Nw,Automatic Medical Alarm,2024-02-10 11:15:00,47.675279,-122.376211,"{'type': 'Point', 'coordinates': [-122.376211,...",F240021000,2024-02-10,2024,February,Saturday,11
4,2821 S Walden St,Nurseline/AMR,2024-02-10 11:14:00,47.572163,-122.296173,"{'type': 'Point', 'coordinates': [-122.296173,...",F240020999,2024-02-10,2024,February,Saturday,11
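
Besides extracting components, the same accessor can turn timestamps back into text in any layout. The sketch below uses two timestamps taken from the table above. The day/month/year layout is only an example choice.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2024-02-10 11:21:00",
                                   "2024-02-04 17:08:00"]))

# strftime builds a string from format codes (%d day, %m month, %Y year, ...)
labels = stamps.dt.strftime("%d/%m/%Y %H:%M")

# day_name() returns English names unless a locale is requested
day_names = stamps.dt.day_name()

print(labels.tolist())
print(day_names.tolist())
```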


Let's create a new column from what we have. In this case, a boolean that tells whether the call came in at night (from 8 pm until just before 6 am):

In [10]:
calls911['nightTime']=((calls911['hour']<6) | (calls911['hour']>=20))
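
To see how such a rule behaves at its boundaries, here is a small sketch on made-up hour values. It assumes that 6:00 itself already counts as daytime and 20:00 as night.

```python
import pandas as pd

# Hypothetical hours, chosen to hit both edges of the night window
hours = pd.Series([0, 5, 6, 12, 19, 20, 23])

# Night: strictly before 6 am, or from 8 pm onwards
night = (hours < 6) | (hours >= 20)
print(pd.DataFrame({"hour": hours, "night": night}))
```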

Let's look at what we have so far:

In [11]:
calls911

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour,nightTime
0,100 Melrose Ave E,Medic Response,2024-02-10 11:21:00,47.618497,-122.327854,"{'type': 'Point', 'coordinates': [-122.327854,...",F240021003,2024-02-10,2024,February,Saturday,11,False
1,8635 Fauntleroy Pl Sw,Aid Response,2024-02-10 11:18:00,47.526068,-122.390815,"{'type': 'Point', 'coordinates': [-122.390815,...",F240021002,2024-02-10,2024,February,Saturday,11,False
2,3870 Montlake Blvd Ne,EVENT - Special Event,2024-02-10 11:16:00,47.651797,-122.303502,"{'type': 'Point', 'coordinates': [-122.303502,...",F240021001,2024-02-10,2024,February,Saturday,11,False
3,6400 15th Ave Nw,Automatic Medical Alarm,2024-02-10 11:15:00,47.675279,-122.376211,"{'type': 'Point', 'coordinates': [-122.376211,...",F240021000,2024-02-10,2024,February,Saturday,11,False
4,2821 S Walden St,Nurseline/AMR,2024-02-10 11:14:00,47.572163,-122.296173,"{'type': 'Point', 'coordinates': [-122.296173,...",F240020999,2024-02-10,2024,February,Saturday,11,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,9999 Holman Rd Nw,Nurseline/AMR,2024-02-04 17:08:00,47.701637,-122.362244,"{'type': 'Point', 'coordinates': [-122.362244,...",F240018336,2024-02-04,2024,February,Sunday,17,False
1996,220 W Olympic Pl,Investigate Out Of Service,2024-02-04 17:07:00,47.626727,-122.359559,"{'type': 'Point', 'coordinates': [-122.359559,...",F240018337,2024-02-04,2024,February,Sunday,17,False
1997,3855 34th Ave W,Aid Response,2024-02-04 16:57:00,47.65455,-122.400898,"{'type': 'Point', 'coordinates': [-122.400898,...",F240018335,2024-02-04,2024,February,Sunday,16,False
1998,1811 Eastlake Ave,Aid Response,2024-02-04 16:53:00,47.618002,-122.329182,"{'type': 'Point', 'coordinates': [-122.329182,...",F240018334,2024-02-04,2024,February,Sunday,16,False
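
If the frame is written to disk at this stage, keep in mind that a CSV file stores dates as plain text. The sketch below uses an in-memory buffer in place of a real file and a two-row toy frame, both assumptions for illustration. It shows that the column has to be parsed again on reading.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-02-10 11:21:00", "2024-02-04 17:08:00"]),
    "hour": [11, 17],
})

# Write to an in-memory buffer standing in for a CSV file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without hints gives the dates back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks pandas to convert the column while reading
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["datetime"])

print(plain.dtypes)
print(restored.dtypes)
```

Binary formats such as pickle keep the data type, at the cost of being readable only from Python.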


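Dates do not always arrive in a standard layout. The next example reads a table from the Spanish Wikipedia, where the dates are written out as Spanish text.

In [16]: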
#!pip install bs4



In [12]:
link="https://es.wikipedia.org/wiki/Pandemia_de_COVID-19"

import pandas as pd

covid=pd.read_html(link, flavor="bs4", attrs={"class":"wikitable sortable"})

In [13]:
covidDF=covid[1].copy()
covidDF

Unnamed: 0,Territorios,Territorios.1,Fecha del análisis,Porcentaje con anticuerpos,Personas que han sido infectadas,Referencia
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​


In [14]:
covidDF.columns

Index(['Territorios', 'Territorios.1', 'Fecha del análisis',
       'Porcentaje con anticuerpos', 'Personas que han sido infectadas',
       'Referencia'],
      dtype='object')

In [16]:
#!pip install unidecode



In [17]:
import unidecode as ud

[ud.unidecode(c) for c in covidDF.columns]

['Territorios',
 'Territorios.1',
 'Fecha del analisis',
 'Porcentaje con anticuerpos',
 'Personas que han sido infectadas',
 'Referencia']
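
If installing an extra package is not an option, a similar result can be sketched with the standard library. This swaps in unicodedata instead of unidecode and only removes combining accents, so it is not a full replacement. The helper name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'á' becomes 'a'."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

columns = ["Fecha del análisis", "Territorios"]
cleaned = [strip_accents(c) for c in columns]
print(cleaned)
```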

In [18]:
covidDF.columns=[ud.unidecode(c) for c in covidDF.columns]

In [19]:
# remove every whitespace character from the column names
covidDF.columns=covidDF.columns.str.replace(r"\s+","",regex=True)

In [20]:
covidDF.Fechadelanalisis.str.split(" a ",expand=True)

Unnamed: 0,0,1
0,23 de abril de 2020,3 de junio de 2020
1,6 de abril de 2020,9 de mayo de 2020
2,15 de diciembre de 2020,
3,16 de septiembre de 2020,
4,16 de diciembre de 2020,
5,27 de junio de 2020,10 de julio de 2020
6,19 de abril de 2020,28 de abril de 2020
7,19 de abril de 2020,28 de abril de 2020
8,noviembre de 2020,
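
The split above relies on " a " (Spanish for "to") separating the start and end of a period. A small sketch on two values copied from the table shows what happens when there is no end date. The n=1 argument is an added precaution, not part of the original call.

```python
import pandas as pd

periods = pd.Series(["23 de abril de 2020 a 3 de junio de 2020",
                     "15 de diciembre de 2020"])

# n=1 splits at most once; expand=True returns one column per piece
parts = periods.str.split(" a ", n=1, expand=True)
print(parts)

# Rows without a separator get a missing value in the second column
print(parts[1].isna().tolist())
```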


In [21]:
covidDF[["fecha1","fecha2"]]=covidDF.Fechadelanalisis.str.split(" a ",expand=True)
covidDF

Unnamed: 0,Territorios,Territorios.1,Fechadelanalisis,Porcentajeconanticuerpos,Personasquehansidoinfectadas,Referencia,fecha1,fecha2
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​,23 de abril de 2020,3 de junio de 2020
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​,6 de abril de 2020,9 de mayo de 2020
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​,15 de diciembre de 2020,
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​,16 de septiembre de 2020,
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​,16 de diciembre de 2020,
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​,27 de junio de 2020,10 de julio de 2020
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​,19 de abril de 2020,28 de abril de 2020
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​,19 de abril de 2020,28 de abril de 2020
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​,noviembre de 2020,


In [22]:
covidDF.fecha1

0         23 de abril de 2020
1          6 de abril de 2020
2     15 de diciembre de 2020
3    16 de septiembre de 2020
4     16 de diciembre de 2020
5         27 de junio de 2020
6         19 de abril de 2020
7         19 de abril de 2020
8           noviembre de 2020
Name: fecha1, dtype: object

In [41]:
covidDF.loc[8,'fecha1']='1 de noviembre de 2020'
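
The manual fix above targets one known row. As a sketch of a more general approach, the snippet below adds a default day to any value that lacks one. Using the first of the month is an assumption carried over from the manual fix, and the toy series is illustrative.

```python
import pandas as pd

dates = pd.Series(["23 de abril de 2020", "noviembre de 2020"])

# True where the text starts with a one- or two-digit day followed by " de "
has_day = dates.str.match(r"\d{1,2} de ")

# Keep complete values; prefix the others with day 1
completed = dates.where(has_day, "1 de " + dates)
print(completed.tolist())
```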

In [42]:
covidDF.fecha1.str.split(" de ",expand=True)

Unnamed: 0,0,1,2
0,23,abril,2020
1,6,abril,2020
2,15,diciembre,2020
3,16,septiembre,2020
4,16,diciembre,2020
5,27,junio,2020
6,19,abril,2020
7,19,abril,2020
8,1,noviembre,2020


In [48]:
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]=covidDF.fecha1.str.split(" de ",expand=True)
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]

Unnamed: 0,fecha1_dia,fecha1_mes,fecha1_anho
0,23,abril,2020
1,6,abril,2020
2,15,diciembre,2020
3,16,septiembre,2020
4,16,diciembre,2020
5,27,junio,2020
6,19,abril,2020
7,19,abril,2020
8,1,noviembre,2020


In [51]:
changesMonth={'abril':4,'diciembre':12,'septiembre':9,'junio':6,'noviembre':11}
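# assign the result back instead of relying on an in-place change of a column view;
# month names missing from the dictionary would become NaN
covidDF['fecha1_mes']=covidDF.fecha1_mes.map(changesMonth)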
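
The dictionary above only covers the months that appear in fecha1. A sketch with all twelve Spanish month names follows, which would also cover values such as "mayo" or "julio" from the second date column. The name spanish_months is hypothetical, and regional spellings such as "setiembre" are not included.

```python
import pandas as pd

# Hypothetical lookup table: lowercase Spanish month name -> month number
spanish_months = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

months = pd.Series(["abril", "mayo", "julio", "noviembre"])

# map() looks up each value; names not in the table would become NaN
numbers = months.map(spanish_months)
print(numbers.tolist())
```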

In [52]:
covidDF[['fecha1_dia','fecha1_mes','fecha1_anho']]

Unnamed: 0,fecha1_dia,fecha1_mes,fecha1_anho
0,23,4,2020
1,6,4,2020
2,15,12,2020
3,16,9,2020
4,16,12,2020
5,27,6,2020
6,19,4,2020
7,19,4,2020
8,1,11,2020


In [56]:
covidDF[['fecha1_anho','fecha1_mes','fecha1_dia']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   fecha1_anho  9 non-null      object
 1   fecha1_mes   9 non-null      int64 
 2   fecha1_dia   9 non-null      object
dtypes: int64(1), object(2)
memory usage: 348.0+ bytes


In [59]:
covidDF[['fecha1_anho','fecha1_mes','fecha1_dia']]=covidDF[['fecha1_anho','fecha1_mes','fecha1_dia']].apply(pd.to_numeric)

In [62]:
pd.to_datetime(dict(year=covidDF.fecha1_anho, month=covidDF.fecha1_mes, day=covidDF.fecha1_dia))

0   2020-04-23
1   2020-04-06
2   2020-12-15
3   2020-09-16
4   2020-12-16
5   2020-06-27
6   2020-04-19
7   2020-04-19
8   2020-11-01
dtype: datetime64[ns]
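
The steps above (split the text, translate the month, convert to numbers, assemble the date) can be gathered into one helper. This is a minimal sketch, assuming every value follows the "day de month de year" pattern. The function name, the regular expression and the month table are illustrative and not part of the original notebook.

```python
import pandas as pd

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

def parse_spanish_dates(texts):
    """Convert strings like '23 de abril de 2020' into timestamps."""
    # Named groups become the columns 'day', 'month' and 'year'
    parts = texts.str.extract(
        r"^(?P<day>\d{1,2}) de (?P<month>\w+) de (?P<year>\d{4})$"
    )
    frame = pd.DataFrame({
        "year": pd.to_numeric(parts["year"]),
        "month": parts["month"].map(SPANISH_MONTHS),
        "day": pd.to_numeric(parts["day"]),
    })
    # to_datetime assembles dates from year/month/day columns
    return pd.to_datetime(frame)

# First two periods from the table: start and end of each analysis
start = parse_spanish_dates(pd.Series(["23 de abril de 2020", "6 de abril de 2020"]))
end = parse_spanish_dates(pd.Series(["3 de junio de 2020", "9 de mayo de 2020"]))

# With real dates, the length of each period is a simple subtraction
days = (end - start).dt.days
print(days.tolist())
```

Once both columns are real dates, differences, sorting and filtering by period work without any further string handling.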