# Formatting Data (dates)

Dates (some combination of year, month, day of the week, and time) are very common in data collected in real time and in other data sets that organize event information.

Let's see a data frame that comes with dates from an API.

In [2]:
!pip install sodapy

Collecting sodapy
  Downloading sodapy-2.2.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: sodapy
Successfully installed sodapy-2.2.0


In [3]:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)

results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
calls911 = pd.DataFrame.from_records(results)
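
To see what `from_records` does without calling the API, here is a minimal sketch. It assumes the response is a list of dictionaries whose values are mostly text; `sample_records` and `sample_df` are hypothetical names, and the two records only imitate the shape of the real data.

```python
import pandas as pd

# Hypothetical records shaped like the API response: every value is a string.
sample_records = [
    {"address": "746 19th Ave E", "type": "Aid Response",
     "datetime": "2024-02-09T09:06:00.000", "latitude": "47.625602"},
    {"address": "400 Broad St", "type": "Automatic Fire Alarm False",
     "datetime": "2024-02-09T08:50:00.000", "latitude": "47.619744"},
]

# One dictionary becomes one row; the keys become the column names.
sample_df = pd.DataFrame.from_records(sample_records)
print(sample_df.dtypes)
```

Because the values arrive as text, pandas does not infer dates or numbers on its own; that conversion has to be requested explicitly.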



Let's check some information:

In [4]:
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   address                      2000 non-null   object
 1   type                         2000 non-null   object
 2   datetime                     2000 non-null   object
 3   latitude                     2000 non-null   object
 4   longitude                    2000 non-null   object
 5   report_location              2000 non-null   object
 6   incident_number              2000 non-null   object
 7   :@computed_region_ru88_fbhk  1992 non-null   object
 8   :@computed_region_kuhn_3gp2  1992 non-null   object
 9   :@computed_region_q256_3sug  2000 non-null   object
 10  :@computed_region_2day_rhn5  149 non-null    object
 11  :@computed_region_cyqu_gs94  141 non-null    object
dtypes: object(12)
memory usage: 187.6+ KB


Let's get rid of some columns:

In [5]:
calls911=calls911.iloc[:,:7]
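
Selecting by position works here because the columns to drop happen to come last. A sketch of an alternative that selects by name instead, assuming the unwanted columns all share the `:@computed_region` prefix seen in the `info()` output (the `demo` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with two useful columns and two computed-region columns.
demo = pd.DataFrame({
    "address": ["746 19th Ave E"],
    "datetime": ["2024-02-09T09:06:00.000"],
    ":@computed_region_ru88_fbhk": ["18"],
    ":@computed_region_kuhn_3gp2": ["31"],
})

# Keep every column whose name does not start with the computed-region prefix.
keep = [c for c in demo.columns if not c.startswith(":@computed_region")]
demo = demo[keep]
print(demo.columns.tolist())
```

Selecting by name keeps working if the API ever returns the columns in a different order.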

Let's check the column _datetime_:

In [6]:
calls911.datetime.head()

0    2024-02-09T09:06:00.000
1    2024-02-09T09:05:00.000
2    2024-02-09T09:05:00.000
3    2024-02-09T08:50:00.000
4    2024-02-09T08:46:00.000
Name: datetime, dtype: object

In [7]:
# check the type of a single value
type(calls911.datetime[0])


str

At this point the date and time information is of limited use, because each value is just a string.

Let's make it useful:

In [9]:
calls911.datetime=pd.to_datetime(calls911.datetime)
calls911.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   address          2000 non-null   object        
 1   type             2000 non-null   object        
 2   datetime         2000 non-null   datetime64[ns]
 3   latitude         2000 non-null   object        
 4   longitude        2000 non-null   object        
 5   report_location  2000 non-null   object        
 6   incident_number  2000 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 109.5+ KB
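
`pd.to_datetime` guessed the layout of the strings correctly here. When the layout is known, it can also be stated explicitly. The following sketch uses a small made-up series (`raw`) with one deliberately invalid value to show the `format` and `errors` parameters:

```python
import pandas as pd

# Two timestamps in the same layout as the column above, plus one bad value.
raw = pd.Series(["2024-02-09T09:06:00.000", "2024-02-09T08:50:00.000", "not a date"])

# An explicit format documents the expected layout; errors="coerce" turns
# values that do not match into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S.%f", errors="coerce")
print(parsed)
print("unparsed values:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to check how many rows failed to parse.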


In [10]:
calls911

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number
0,746 19th Ave E,Aid Response,2024-02-09 09:06:00,47.625602,-122.307314,"{'type': 'Point', 'coordinates': [-122.307314,...",F240020453
1,4225 Beach Dr Sw,Aid Response,2024-02-09 09:05:00,47.565587,-122.407826,"{'type': 'Point', 'coordinates': [-122.407826,...",F240020452
2,77 S Washington St,Aid Response,2024-02-09 09:05:00,47.600885,-122.334925,"{'type': 'Point', 'coordinates': [-122.334925,...",F240020454
3,400 Broad St,Automatic Fire Alarm False,2024-02-09 08:50:00,47.619744,-122.348859,"{'type': 'Point', 'coordinates': [-122.348859,...",F240020450
4,2802 Nw 91st St,Medic Response,2024-02-09 08:46:00,47.69532,-122.392919,"{'type': 'Point', 'coordinates': [-122.392919,...",F240020449
...,...,...,...,...,...,...,...
1995,300 Pine St,Aid Response,2024-02-03 13:52:00,47.610743,-122.338702,"{'type': 'Point', 'coordinates': [-122.338702,...",F240017806
1996,2119 3rd Ave,Aid Response,2024-02-03 13:50:00,47.613308,-122.342432,"{'type': 'Point', 'coordinates': [-122.342432,...",F240017805
1997,808 Fir St,Alarm Bell,2024-02-03 13:44:00,47.602699,-122.322029,"{'type': 'Point', 'coordinates': [-122.322029,...",F240017803
1998,2340 21st Ave S,Automatic Fire Alarm Resd,2024-02-03 13:43:00,47.582554,-122.305564,"{'type': 'Point', 'coordinates': [-122.305564,...",F240017802


Once you have this data type, you can retrieve important information:

In [11]:
calls911['date']=calls911.datetime.dt.date
calls911['year']=calls911.datetime.dt.year
calls911['month']=calls911.datetime.dt.month_name()
calls911['weekday']=calls911.datetime.dt.day_name()
calls911['hour']=calls911.datetime.dt.hour

In [12]:
calls911.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour
0,746 19th Ave E,Aid Response,2024-02-09 09:06:00,47.625602,-122.307314,"{'type': 'Point', 'coordinates': [-122.307314,...",F240020453,2024-02-09,2024,February,Friday,9
1,4225 Beach Dr Sw,Aid Response,2024-02-09 09:05:00,47.565587,-122.407826,"{'type': 'Point', 'coordinates': [-122.407826,...",F240020452,2024-02-09,2024,February,Friday,9
2,77 S Washington St,Aid Response,2024-02-09 09:05:00,47.600885,-122.334925,"{'type': 'Point', 'coordinates': [-122.334925,...",F240020454,2024-02-09,2024,February,Friday,9
3,400 Broad St,Automatic Fire Alarm False,2024-02-09 08:50:00,47.619744,-122.348859,"{'type': 'Point', 'coordinates': [-122.348859,...",F240020450,2024-02-09,2024,February,Friday,8
4,2802 Nw 91st St,Medic Response,2024-02-09 08:46:00,47.69532,-122.392919,"{'type': 'Point', 'coordinates': [-122.392919,...",F240020449,2024-02-09,2024,February,Friday,8
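
Besides extracting components, datetime values support comparison and arithmetic, which strings do not. A small sketch using two timestamps taken from the output above (`stamps`, `span` and `recent` are names chosen for this example):

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series(["2024-02-09 09:06:00", "2024-02-03 13:43:00"]))

# Components through the .dt accessor, as in the cell above.
print(stamps.dt.day_name().tolist())
print(stamps.dt.hour.tolist())

# Subtracting two timestamps gives a Timedelta.
span = stamps.max() - stamps.min()
print(span)

# Comparison against a date string filters rows chronologically.
recent = stamps[stamps >= "2024-02-05"]
print(recent)
```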


Let's create a new column from what we have. In this case, a boolean that tells whether the call came in at night (from 8 pm until just before 6 am):

In [13]:
# hour runs from 0 to 23, so night is 20-23 or 0-5
calls911['nightTime']=((calls911['hour']<6) | (calls911['hour']>=20))
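
Boundary hours are easy to get wrong in a rule like this, so it helps to try the condition on a handful of values first. This sketch assumes that "before 6 am" excludes the whole 6 o'clock hour; `hours` and `is_night` are illustrative names:

```python
import pandas as pd

# A few hours around both boundaries of the night window.
hours = pd.Series([0, 5, 6, 12, 19, 20, 23])

# Night is taken here as 20:00-05:59, that is, hour >= 20 or hour < 6.
is_night = (hours >= 20) | (hours < 6)
print(dict(zip(hours.tolist(), is_night.tolist())))
```

If the 6 o'clock hour should count as night instead, the second comparison becomes `hours <= 6`.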

Let's see what we have:

In [11]:
calls911

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,date,year,month,weekday,hour,nightTime
0,4th Ave N / Mercer St,Triaged Incident,2022-08-18 10:15:00,47.624564,-122.348877,"{'type': 'Point', 'coordinates': [-122.348877,...",F220099503,2022-08-18,2022,August,Thursday,10,False
1,815 S Dearborn St,Investigate Out Of Service,2022-08-18 10:08:00,47.595831,-122.322292,"{'type': 'Point', 'coordinates': [-122.322292,...",F220099502,2022-08-18,2022,August,Thursday,10,False
2,9401 Myers Way S,Triaged Incident,2022-08-18 10:07:00,47.518658,-122.333265,"{'type': 'Point', 'coordinates': [-122.333265,...",F220099501,2022-08-18,2022,August,Thursday,10,False
3,11030 5th Ave Ne,Auto Fire Alarm,2022-08-18 09:11:00,47.709488,-122.323301,"{'type': 'Point', 'coordinates': [-122.323301,...",F220099179,2022-08-18,2022,August,Thursday,9,False
4,3013 Harvard Ave E,MVI - Motor Vehicle Incident,2022-08-18 09:06:00,47.647935,-122.322101,"{'type': 'Point', 'coordinates': [-122.322101,...",F220099178,2022-08-18,2022,August,Thursday,9,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1236 S King St,Aid Response,2022-08-12 12:07:00,47.598337,-122.316533,"{'type': 'Point', 'coordinates': [-122.316533,...",F220096444,2022-08-12,2022,August,Friday,12,False
1996,6312 California Ave Sw,Medic Response,2022-08-12 12:02:00,47.546604,-122.387196,"{'type': 'Point', 'coordinates': [-122.387196,...",F220096443,2022-08-12,2022,August,Friday,12,False
1997,1023 E Alder St,Auto Fire Alarm,2022-08-12 12:00:00,47.60436,-122.319104,"{'type': 'Point', 'coordinates': [-122.319104,...",F220096442,2022-08-12,2022,August,Friday,12,False
1998,1401 2nd Ave,Aid Response,2022-08-12 11:58:00,47.608292,-122.337995,"{'type': 'Point', 'coordinates': [-122.337995,...",F220096441,2022-08-12,2022,August,Friday,11,False

Dates do not always arrive in a standard format. Let's scrape a table from the Spanish Wikipedia, where the dates are written out in Spanish:


In [16]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [17]:
link="https://es.wikipedia.org/wiki/Pandemia_de_COVID-19"

import pandas as pd

covid=pd.read_html(link, flavor="bs4", attrs={"class":"wikitable sortable"})

In [20]:
covidDF=covid[1].copy()
covidDF

Unnamed: 0,Territorios,Territorios.1,Fecha del análisis,Porcentaje con anticuerpos,Personas que han sido infectadas,Referencia
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​


In [21]:
covidDF.columns

Index(['Territorios', 'Territorios.1', 'Fecha del análisis',
       'Porcentaje con anticuerpos', 'Personas que han sido infectadas',
       'Referencia'],
      dtype='object')

In [23]:
import unidecode as ud

[ud.unidecode(c) for c in covidDF.columns]

['Territorios',
 'Territorios.1',
 'Fecha del analisis',
 'Porcentaje con anticuerpos',
 'Personas que han sido infectadas',
 'Referencia']

In [24]:
covidDF.columns=[ud.unidecode(c) for c in covidDF.columns]

In [32]:
# remove all whitespace from the column names (no strip needed afterwards)
covidDF.columns=covidDF.columns.str.replace(r"\s+","",regex=True)
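
The two cleaning steps (removing accents, removing whitespace) can be tried on a few column names in isolation. The sketch below swaps `unidecode` for the standard-library `unicodedata` module so that it runs without extra installs. This is a substitute technique, not what the notebook uses, and `strip_accents` is a helper written for this example:

```python
import unicodedata
import pandas as pd

def strip_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

cols = pd.Index(["Territorios", "Fecha del análisis", "Porcentaje con anticuerpos"])

# First remove accents, then remove every run of whitespace.
clean = pd.Index([strip_accents(c) for c in cols]).str.replace(r"\s+", "", regex=True)
print(clean.tolist())
```

Names without accents or spaces can be used with attribute access, as in `covidDF.Fechadelanalisis`.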

In [37]:
covidDF.Fechadelanalisis.str.split(" a ",expand=True)

Unnamed: 0,0,1
0,23 de abril de 2020,3 de junio de 2020
1,6 de abril de 2020,9 de mayo de 2020
2,15 de diciembre de 2020,
3,16 de septiembre de 2020,
4,16 de diciembre de 2020,
5,27 de junio de 2020,10 de julio de 2020
6,19 de abril de 2020,28 de abril de 2020
7,19 de abril de 2020,28 de abril de 2020
8,noviembre de 2020,


In [38]:
covidDF[["fecha1","fecha2"]]=covidDF.Fechadelanalisis.str.split(" a ",expand=True)
covidDF

Unnamed: 0,Territorios,Territorios.1,Fechadelanalisis,Porcentajeconanticuerpos,Personasquehansidoinfectadas,Referencia,fecha1,fecha2
0,Bérgamo,Italia,23 de abril de 2020 a 3 de junio de 2020,57%,635 000,[123]​,23 de abril de 2020,3 de junio de 2020
1,Ginebra,Suiza,6 de abril de 2020 a 9 de mayo de 2020,"10,9%",54 000,[124]​ [125]​,6 de abril de 2020,9 de mayo de 2020
2,España,Europa,15 de diciembre de 2020,"9,9%",4 700 000,[126]​,15 de diciembre de 2020,
3,Karnataka,India,16 de septiembre de 2020,"27,3%",19 300 000,[127]​,16 de septiembre de 2020,
4,México,América,16 de diciembre de 2020,"25,0%",32 000 000,[128]​,16 de diciembre de 2020,
5,Nueva Delhi,India,27 de junio de 2020 a 10 de julio de 2020,"23,5%",5 111 000,[129]​,27 de junio de 2020,10 de julio de 2020
6,Nueva York (ciudad),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,"22,7%",1 907 000,[130]​ [125]​,19 de abril de 2020,28 de abril de 2020
7,Nueva York (estado),Estados Unidos,19 de abril de 2020 a 28 de abril de 2020,14%,2 139 000,[130]​,19 de abril de 2020,28 de abril de 2020
8,Reino Unido,Europa,noviembre de 2020,"8,8%",5 900 000,[131]​,noviembre de 2020,
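
Some rows hold a date range and others a single date, so `fecha2` is empty for part of the table. A sketch of how that could be made explicit, using three values copied from the column above (`periods`, `parts` and `is_range` are names chosen for this example):

```python
import pandas as pd

periods = pd.Series([
    "23 de abril de 2020 a 3 de junio de 2020",
    "15 de diciembre de 2020",
    "noviembre de 2020",
])

# n=1 splits on the first " a " only, so at most two columns come back.
parts = periods.str.split(" a ", n=1, expand=True)
parts.columns = ["fecha1", "fecha2"]

# Rows with a single date have no second part; flag them explicitly.
parts["is_range"] = parts["fecha2"].notna()
print(parts)
```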


In [40]:
covidDF.fecha1

0         23 de abril de 2020
1          6 de abril de 2020
2     15 de diciembre de 2020
3    16 de septiembre de 2020
4     16 de diciembre de 2020
5         27 de junio de 2020
6         19 de abril de 2020
7         19 de abril de 2020
8           noviembre de 2020
Name: fecha1, dtype: object

In [42]:
covidDF.fecha1.str.split(" de ",expand=True)

Unnamed: 0,0,1,2
0,23,abril,2020
1,6,abril,2020
2,15,diciembre,2020
3,16,septiembre,2020
4,16,diciembre,2020
5,27,junio,2020
6,19,abril,2020
7,19,abril,2020
8,noviembre,2020,
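
Note that the last row has no day, so its month lands in column 0 and its year in column 1. One way to keep the pieces aligned is a regular expression with an optional day. This is a sketch under the assumption that every value looks like either `D de MES de YYYY` or `MES de YYYY`; the pattern and the group names are written for this example:

```python
import pandas as pd

fecha1 = pd.Series(["23 de abril de 2020", "6 de abril de 2020", "noviembre de 2020"])

# Named groups keep day, month and year in fixed columns; the day is optional.
pattern = r"^(?:(?P<day>\d{1,2}) de )?(?P<month>[a-z]+) de (?P<year>\d{4})$"
pieces = fecha1.str.extract(pattern)
print(pieces)
```

With this layout the month is always in the `month` column, whether or not the day is present.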


In [47]:
from googletrans import Translator
translator = Translator()
[translator.translate(x, src='es', dest='en') for x in covidDF.fecha1.str.split(" de ",expand=True).iloc[:,1]]

AttributeError: 'NoneType' object has no attribute 'group'
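
The translation call fails in this run. Since there are only twelve month names, one alternative that needs no online service is a lookup table. The sketch below is not the notebook's method: `MONTHS_ES` and `parse_spanish_date` are hypothetical helpers, and defaulting a missing day to 1 is an assumption made for illustration.

```python
import pandas as pd

# Spanish month names mapped to month numbers.
MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

def parse_spanish_date(text):
    """Parse 'D de MES de YYYY' or 'MES de YYYY'; a missing day defaults to 1."""
    tokens = text.split(" de ")
    if len(tokens) == 2:  # no day given, e.g. 'noviembre de 2020'
        tokens = ["1"] + tokens
    day, month, year = tokens
    return pd.Timestamp(year=int(year), month=MONTHS_ES[month.lower()], day=int(day))

fecha1 = pd.Series(["23 de abril de 2020", "noviembre de 2020"])
parsed = fecha1.apply(parse_spanish_date)
print(parsed)
```

Once the values are real timestamps, the `.dt` accessor used for the 911 calls works on them as well.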