# Notebook 2: Data Cleaning and Preparation (2) => Traffic Data

#### In this notebook we load the monthly traffic data from January 2015 up to April 2018.
#### The goal is a final dataframe with all the traffic observations for the devices I chose previously.

## IMPORTANT:
#### The monthly files are heavy, so please check your disk capacity ;)
#### Each file has millions of observations (one per device, every 15 minutes, 24 hours a day, about 30 days per month).
#### Opening all of them at once gave me memory problems.
#### Because of the kernel's memory limits, I had to delete the variables holding the dataframes once I was done with them. You will see the procedure below.
#### The process was: open the csv of a month, apply the function 'actionsDataFrameTraffic' (defined below) to keep the data I want, save the new dataframe to a new csv, and delete the variable holding the initial dataframe. It can also be worth deleting the original csv files from the folder, since the new csv files are much lighter.
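To get a feeling for the file sizes, the row count described above can be worked out directly. This is only a back-of-the-envelope sketch: the helper `rows_per_month` and the device count of 4000 are hypothetical values chosen for illustration, not figures taken from the data.

```python
# Rows expected per device in one month of 15-minute readings.
READINGS_PER_HOUR = 60 // 15  # one reading every 15 minutes -> 4 per hour

def rows_per_month(n_devices, days=30):
    """Upper bound on rows in a monthly file if every device reports every 15 minutes."""
    return n_devices * days * 24 * READINGS_PER_HOUR

n_devices_assumed = 4000  # hypothetical device count, NOT taken from the data

print(rows_per_month(1))                  # 2880 rows per device per month
print(rows_per_month(n_devices_assumed))  # 11520000 rows for the assumed device count
```

Even with a rough device count, a single month lands in the millions of rows, which is why the files are processed one at a time.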

In [1]:
import pandas as pd
import numpy as np
import calendar
import datetime
import folium
import zipfile
import warnings
import itertools
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline

In [143]:
pd.options.display.max_columns = None

## 2015 files

In [4]:
# Unzipping the traffic data for 2015

zip_ref = zipfile.ZipFile('DatosTrafico2015.zip', 'r')
zip_ref.extractall()
zip_ref.close()
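The same extraction can be written with a context manager, so the archive is closed even if the extraction fails halfway. This is a small sketch; the helper name `extract_archive` is an assumption made for illustration and does not appear elsewhere in the notebook.

```python
import zipfile

def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    # The with-block closes the archive even if extraction raises an error.
    with zipfile.ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    return names

# Example call (commented out because it needs the real archive on disk):
# extract_archive('DatosTrafico2015.zip')
```

Returning the member names makes it easy to see which monthly files were actually in the archive.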

#### Let's open some random files to see how they look (because of the kernel memory issues mentioned above, you can skip this step or open only some of the files)

In [5]:
# I originally opened every file to inspect it; the code below opens only a few of them...
# ...so you can see how they look without running into the memory problems

data201512 = pd.read_csv('Datos201512.csv', sep = ';')
#data201511 = pd.read_csv('Datos201511.csv', sep = ';')
#data201510 = pd.read_csv('Datos201510.csv', sep = ';')
data201509 = pd.read_csv('Datos201509.csv', sep = ';')
#data201508 = pd.read_csv('Datos201508.csv', sep = ';')
#data201507 = pd.read_csv('Datos201507.csv', sep = ';')
data201506 = pd.read_csv('Datos201506.csv', sep = ';')
#data201505 = pd.read_csv('Datos201505.csv', sep = ';')
#data201504 = pd.read_csv('Datos201504.csv', sep = ';')
#data201503 = pd.read_csv('Datos201503.csv', sep = ';')
#data201502 = pd.read_csv('Datos201502.csv', sep = ';')
data201501 = pd.read_csv('Datos201501.csv', sep = ';')
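If the only goal is to see what a file looks like, reading just the first rows avoids loading millions of observations. The sketch below uses the `nrows` parameter of `pd.read_csv` on a tiny in-memory stand-in for a monthly file; the three rows are copied from the January 2015 preview and the reduced column set is an assumption made to keep the example short.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a monthly file (subset of columns, for illustration only).
sample = io.StringIO(
    "idelem;fecha;intensidad\n"
    "1047;2015-01-01 00:00:00;180\n"
    "1046;2015-01-01 00:00:00;45\n"
    "6703;2015-01-01 00:15:00;83\n"
)

# nrows stops parsing after the requested number of data rows.
preview = pd.read_csv(sample, sep=";", nrows=2)
print(preview.shape)  # (2, 3)
```

With a real file, the call would be something like `pd.read_csv('Datos201501.csv', sep = ';', nrows = 5)`.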



In [6]:
data201501.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,tipo,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1047,2015-01-01 00:00:00,03FL08PM01,494,Z,180,1,0,71,N,8
1,1046,2015-01-01 00:00:00,03FL08PM02,494,Z,45,-1,0,57,N,4
2,6703,2015-01-01 00:15:00,PM12981,494,M,83,1,2,55,N,15
3,6712,2015-01-01 00:15:00,PM20002,494,M,12,0,0,20,N,15
4,6652,2015-01-01 00:15:00,PM10343,494,M,21,0,1,30,N,15


In [7]:
data201506.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,tipo,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,4517,2015-06-03 17:15:00,31032,495,E,138,2,57,0,N,15
1,5771,2015-06-06 04:45:00,97014,495,E,0,0,0,0,N,15
2,4553,2015-06-03 17:15:00,32011,495,E,223,4,32,0,N,15
3,9850,2015-06-25 15:00:00,84040,495,G,0,0,0,0,N,8
4,4586,2015-06-29 01:45:00,33002,495,E,80,49,26,0,N,3


In [8]:
data201509.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3797,2015-09-25 12:30:00,PM10005,PUNTOS MEDIDA M-30,917,11,43,70,N,15
1,6657,2015-09-25 12:30:00,PM10442,PUNTOS MEDIDA M-30,2969,11,65,72,N,15
2,6658,2015-09-25 12:30:00,PM104431,PUNTOS MEDIDA M-30,1571,6,65,69,N,15
3,6659,2015-09-25 12:30:00,PM10444,PUNTOS MEDIDA M-30,896,7,40,71,N,15
4,6660,2015-09-25 12:30:00,PM104541,PUNTOS MEDIDA M-30,1371,24,38,36,N,15


In [9]:
data201512.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3841,2015-12-25 10:15:00,1002,PUNTOS MEDIDA URBANOS,72,0,1,0,N,8
1,3999,2015-12-25 10:15:00,7013,PUNTOS MEDIDA URBANOS,48,1,2,0,N,5
2,4617,2015-12-25 10:15:00,33034,PUNTOS MEDIDA URBANOS,56,0,4,0,N,9
3,3690,2015-12-25 10:15:00,34033,PUNTOS MEDIDA URBANOS,23,5,5,0,N,12
4,5954,2015-12-25 10:15:00,45008,PUNTOS MEDIDA URBANOS,37,0,4,0,N,8


In [15]:
data201501[data201501['tipo_elem'] == 495]['vmed'].max(), data201501[data201501['tipo_elem'] == 494]['vmed'].max()

(0, 169)

In [67]:
data201509.dtypes

idelem                  int64
fecha                  object
identif                object
tipo_elem              object
intensidad              int64
ocupacion               int64
carga                   int64
vmed                    int64
error                  object
periodo_integracion     int64
dtype: object

In [132]:
# as mentioned previously, I delete the opened files

del(data201501,data201506,data201509,data201512)

## Information about the variables in the file:
    -intensidad: traffic intensity in vehicles/hour, measured over a 15 min period
    -ocupacion: occupancy time of the device over a 15 min period
    -carga: vehicle load over a 15 min period. It rates the use of the road from 0 to 100, taking into account intensidad, ocupacion and the capacity of the road.
    -vmed: average speed over a 15 min period. Only for M-30 devices; for urban devices it is 0
    -error: indicates whether the sample taken by the device is not valid
    -periodo_integracion: number of samples received and used

### Some considerations after checking how the 2015 files look:
#### - the column identifying the device is named 'idelem'.
#### - urban observations have 'tipo_elem' 495 (I checked that their 'vmed' values are 0)
#### - M-30 observations have 'tipo_elem' 494 (their 'vmed' values are non-zero)
#### - some columns, such as 'tipo', appear in some months but not in others, and 'tipo_elem' is a numeric code in some months and text in others
#### - some columns I won't use


## Before proceeding with the 2015 data, and having checked how the files look, let's prepare 2 features to add to the final dataframe:
- a) IntensidadSat
- b) Calendar days

#### - a) The most important variable for traffic analysis is 'intensidad', and during data acquisition I found a file with an interesting related variable: 'intensidadSat'. The code below extracts this variable per device and saves it so it can be added to the final dataframe

In [2]:
# The following code defines a function that parses an XML file with the live traffic situation and extracts...
# ...the column 'intensidadSat' of each device, to be added later to the full traffic observations dataframe

import xml.etree.ElementTree as ET

df_cols = ["idelem", "descripcion", "accessoAsociado", "intensidad", "ocupacion", "carga", "nivelServicio", "intensidadSat", "error", "subarea", "st_x", "st_y"]

def parse_XML(xml_file, df_cols): 
    """Parse the input XML file and store the result in a pandas DataFrame 
    with the given columns. The first element of df_cols is supposed to be 
    the identifier variable, which is an attribute of each node element in 
    the XML data; other features will be parsed from the text content of 
    each sub-element. """
    
    # Use the argument instead of a hard-coded path; accept either a path / file object
    # or an already parsed ElementTree
    xtree = xml_file if isinstance(xml_file, ET.ElementTree) else ET.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        # First column comes from the node attribute, the rest from sub-element text
        res = [node.attrib.get(df_cols[0])]
        for el in df_cols[1:]: 
            child = node.find(el)
            res.append(child.text if child is not None else None)
        rows.append(res)
        
    # Build the DataFrame once at the end; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns = df_cols)
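To make the parsing logic easier to follow, here is a reduced sketch of the same idea using only the standard library. The XML snippet is a hypothetical miniature: the real layout of `pm.xml` is assumed here, not confirmed, and the helper `extract_fields` is introduced only for this illustration. The two device values are taken from the output shown further down.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the live-traffic file (assumed structure, not the real pm.xml).
SAMPLE_XML = """
<pms>
  <fecha_hora>(timestamp)</fecha_hora>
  <pm><idelem>3409</idelem><intensidadSat>3000</intensidadSat></pm>
  <pm><idelem>4747</idelem><intensidadSat>1200</intensidadSat></pm>
</pms>
"""

def extract_fields(xml_text, fields):
    """Return one dict per child of the root, with None for missing sub-elements."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root:
        record = {}
        for name in fields:
            child = node.find(name)
            record[name] = child.text if child is not None else None
        records.append(record)
    return records

records = extract_fields(SAMPLE_XML, ["idelem", "intensidadSat"])
print(records[0])  # {'idelem': None, 'intensidadSat': None}
print(records[1])  # {'idelem': '3409', 'intensidadSat': '3000'}
```

Under this assumed layout, a first child that is not a device produces a record full of `None`, which would explain an empty first row in the resulting dataframe. Note also that every value comes back as text, so a numeric conversion is still needed afterwards.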

In [3]:
# parsing the XML file

xml_live_traffic = ET.parse('pm.xml')

In [4]:
# Saving it into a dataframe
live_traffic = parse_XML(xml_live_traffic, ["pm","idelem", "descripcion", "accessoAsociado", "intensidad", "ocupacion", "carga", "nivelServicio", "intensidadSat", "error", "subarea", "st_x", "st_y"]
)

In [5]:
live_traffic.head()

Unnamed: 0,pm,idelem,descripcion,accessoAsociado,intensidad,ocupacion,carga,nivelServicio,intensidadSat,error,subarea,st_x,st_y
0,,,,,,,,,,,,,
1,,3409.0,SEPULVEDA Ø118 N-S (CEBREROS-CJAL. FCO. J. JIM...,,265.0,2.0,11.0,0.0,3000.0,N,1718.0,436008175534995.0,447259378531503.0
2,,4739.0,CJAL. FCO. J. JIMENEZ Ø126 E-O (BERLANAS-SEPUL...,,303.0,2.0,12.0,0.0,3000.0,N,1718.0,436039395885266.0,447239754735486.0
3,,4740.0,CJAL. FCO. J. JIMENEZ Ø86 O-E (F. CALVO-ALHAMBRA),,360.0,3.0,14.0,0.0,3000.0,N,1712.0,43671847941987.0,447247932966933.0
4,,4741.0,CJAL. FCO. J. JIMENEZ Ø76 E-O (HURTUMPASCUAL-A...,,270.0,2.0,10.0,0.0,3000.0,N,1712.0,436901191978887.0,447251503855835.0


In [6]:
# the dataframe has many variables, but the only one I am interested in is 'intensidadSat': the others I either already have...
# ...or are live data that I don't want
# Listing the columns I want to keep

columns_from_live_traffic = ['idelem', 'intensidadSat']

In [7]:
# Dropping the first (empty) row with df[1:] and keeping only the selected columns

live_traffic = live_traffic[1:][columns_from_live_traffic]

In [8]:
live_traffic.head(10)

Unnamed: 0,idelem,intensidadSat
1,3409,3000
2,4739,3000
3,4740,3000
4,4741,3000
5,4742,3000
6,4743,3000
7,4744,3000
8,4746,3000
9,4747,1200
10,4748,3000


In [9]:
# Making a function to rename column 'idelem' to 'id' in all the files affected
def renameColumnId(df):
    df.rename(columns = {'idelem':'id'}, inplace = True)
    return df

In [10]:
# Use the renameColumnId function to change 'idelem' to 'id'

live_traffic_set = renameColumnId(live_traffic)

In [11]:
live_traffic_set.head()

Unnamed: 0,id,intensidadSat
1,3409,3000
2,4739,3000
3,4740,3000
4,4741,3000
5,4742,3000


In [12]:
live_traffic_set.dtypes

# Need to change the type of the variables since, as we saw in Notebook 1, 'id' and 'intensidad' are integers

id               object
intensidadSat    object
dtype: object

In [13]:
# Convert column 'id' from string (object) to a numeric type and keep the result
# pd.to_numeric parses strings into numbers
live_traffic_set['id'] = pd.to_numeric(live_traffic_set['id'])

# these columns sometimes hold missing or invalid values, so we pass errors = 'coerce': anything that cannot be parsed is set to NaN
live_traffic_set['intensidadSat'] = pd.to_numeric(live_traffic_set['intensidadSat'], errors = 'coerce')
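A short illustration of what `errors = 'coerce'` does, on made-up input. The values `None` and `"n/a"` are hypothetical examples of entries that cannot be parsed; they are not taken from the live traffic file.

```python
import pandas as pd

# Strings as they come out of the XML, plus two entries that cannot be parsed (made up).
raw = pd.Series(["3000", "1200", None, "n/a"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising an error.
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed.isna().sum())  # 2
print(parsed.dtype)         # float64
```

The presence of NaN is also why the column ends up as `float64` rather than `int64`.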

In [14]:
live_traffic_set.dtypes

id                 int64
intensidadSat    float64
dtype: object

In [15]:
live_traffic_set.isnull().any()

# it has null values, but let's see whether they affect the devices I will focus on

id               False
intensidadSat     True
dtype: bool

In [16]:
devices_to_analyze = [4353,4354,4265,3478,4211,5104,4386,4384,3848,3850,7138,7139,4301,4305]

In [17]:
# Checking that the devices are in the dataframe
live_traffic_set[live_traffic_set['id'].isin(devices_to_analyze)]

Unnamed: 0,id,intensidadSat
2046,3848,2800.0
2047,3850,3250.0
2377,7138,5770.0
2378,7139,5810.0
2421,4354,2850.0
2422,4301,1850.0
2427,4305,500.0
2451,4384,900.0
2453,4386,3110.0
2931,4265,3350.0


In [18]:
live_traffic_set = live_traffic_set[live_traffic_set['id'].isin(devices_to_analyze)]

In [19]:
len(live_traffic_set)

14

In [20]:
# Order by 'id'

live_traffic_set.sort_values(by = 'id', ascending = True).reset_index(drop=True).head(20)

Unnamed: 0,id,intensidadSat
0,3478,4000.0
1,3848,2800.0
2,3850,3250.0
3,4211,5300.0
4,4265,3350.0
5,4301,1850.0
6,4305,500.0
7,4353,2400.0
8,4354,2850.0
9,4384,900.0


In [21]:
# Saving it into a csv

live_traffic_set.to_csv('livetraffic.csv', index = False)

#### - b) Importing and adding working days, weekends and bank holidays

In [22]:
# Open the downloaded calendar (note: this rebinds the name 'calendar', shadowing the calendar module imported at the top)

calendar = pd.read_csv('calendario_2013-2019.csv', sep = ';', encoding = 'ISO-8859-1')
calendar.head(10)

# the date is in DD/MM/YYYY format, while in my traffic files it is YYYY-MM-DD

Unnamed: 0,Dia,Dia_semana,laborable / festivo / domingo festivo,Tipo de Festivo,Festividad
0,01/01/2013,martes,festivo,Festivo nacional,Año Nuevo
1,02/01/2013,miercoles,laborable,,
2,03/01/2013,jueves,laborable,,
3,04/01/2013,viernes,laborable,,
4,05/01/2013,sabado,sabado,,
5,06/01/2013,domingo,domingo,,
6,07/01/2013,lunes,festivo,Festivo nacional,Traslado Epifania del Señor
7,08/01/2013,martes,laborable,,
8,09/01/2013,miercoles,laborable,,
9,10/01/2013,jueves,laborable,,


In [23]:
# first check the types

calendar.dtypes

Dia                                      object
Dia_semana                               object
laborable / festivo / domingo festivo    object
Tipo de Festivo                          object
Festividad                               object
dtype: object

In [24]:
# the NaN values are in 'Tipo de Festivo' and 'Festividad', which I will not use
calendar.isnull().any()

Dia                                      False
Dia_semana                               False
laborable / festivo / domingo festivo    False
Tipo de Festivo                           True
Festividad                                True
dtype: bool

In [25]:
calendar['laborable / festivo / domingo festivo'].value_counts()

# let's look at the "martes" observation...

laborable    1737
domingo       359
sabado        355
festivo       103
martes          1
Name: laborable / festivo / domingo festivo, dtype: int64

In [26]:
# there is a 'martes' value in this column; after checking it against a calendar I replace it with 'laborable'

calendar[calendar['laborable / festivo / domingo festivo'] == 'martes']

Unnamed: 0,Dia,Dia_semana,laborable / festivo / domingo festivo,Tipo de Festivo,Festividad
2197,08/01/2019,martes,martes,,


In [27]:
# Replace 'martes' with 'laborable' (assigning the result back instead of using a chained inplace replace)

calendar['laborable / festivo / domingo festivo'] = calendar['laborable / festivo / domingo festivo'].replace('martes', 'laborable')

In [28]:
calendar['laborable / festivo / domingo festivo'].value_counts()

laborable    1738
domingo       359
sabado        355
festivo       103
Name: laborable / festivo / domingo festivo, dtype: int64

In [29]:
calendar.tail()

Unnamed: 0,Dia,Dia_semana,laborable / festivo / domingo festivo,Tipo de Festivo,Festividad
2550,27/12/2019,viernes,laborable,,
2551,28/12/2019,sabado,sabado,,
2552,29/12/2019,domingo,domingo,,
2553,30/12/2019,lunes,laborable,,
2554,31/12/2019,martes,laborable,,


In [30]:
# 'Dia' is still a string at this point, so max() compares text rather than dates
calendar['Dia'].max()

'31/12/2019'

In [31]:
calendar['Tipo de Festivo'].value_counts()

Festivo nacional                        69
Festivo local de la ciudad de Madrid    14
Festivo de la Comunidad de Madrid       13
Festivo de la comunidad de Madrid        7
Name: Tipo de Festivo, dtype: int64

In [32]:
calendar['Festividad'].value_counts()

Viernes Santo                                               7
Todos los Santos                                            7
Fiesta del Trabajo                                          7
Dos de Mayo. Fiesta de la Comunidad de Madrid               7
Jueves Santo                                                7
Asuncion de la Virgen                                       7
Natividad del Señor                                         6
Inmaculada Concepcion                                       6
San Isidro Labrador                                         6
Fiesta Nacional de España                                   6
Año Nuevo                                                   6
Ntra. Sra. de la Almudena. Patrona de la ciudad             6
Dia de la Constitucion                                      6
Epifania del Señor                                          5
Corpus Christi                                              2
Traslado Natividad del Señor                                1
Traslado

In [33]:
calendar.dtypes

Dia                                      object
Dia_semana                               object
laborable / festivo / domingo festivo    object
Tipo de Festivo                          object
Festividad                               object
dtype: object

In [34]:
# transforming the 'Dia' column to datetime (dayfirst = True because the source format is DD/MM/YYYY)

calendar["Dia"] = pd.to_datetime(calendar["Dia"], dayfirst = True)
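As a small aside, the day-first parsing can also be made explicit with a format string, which removes any ambiguity between day and month. This is a sketch on three dates copied from the calendar file, not a replacement for the cell above.

```python
import pandas as pd

# DD/MM/YYYY strings, as they appear in the calendar file.
dates = pd.Series(["01/01/2013", "02/01/2013", "08/01/2019"])

# An explicit format tells pandas exactly which part is the day and which is the month.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2013-01-01', '2013-01-02', '2019-01-08']
```

Without `dayfirst = True` or an explicit format, a value like 02/01/2013 could be read as the 1st of February instead of the 2nd of January.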

In [35]:
calendar.dtypes

Dia                                      datetime64[ns]
Dia_semana                                       object
laborable / festivo / domingo festivo            object
Tipo de Festivo                                  object
Festividad                                       object
dtype: object

In [36]:
# renaming column 'Dia' to 'Fecha_corta' and 'laborable / festivo / domingo festivo' to 'Tipo_dia' to simplify

calendar.rename(columns={'Dia': 'Fecha_corta','laborable / festivo / domingo festivo': 'Tipo_dia' }, inplace= True)

In [37]:
calendar.head()

Unnamed: 0,Fecha_corta,Dia_semana,Tipo_dia,Tipo de Festivo,Festividad
0,2013-01-01,martes,festivo,Festivo nacional,Año Nuevo
1,2013-01-02,miercoles,laborable,,
2,2013-01-03,jueves,laborable,,
3,2013-01-04,viernes,laborable,,
4,2013-01-05,sabado,sabado,,


In [38]:
# Selecting only the columns that I need
columns_calendar = ['Fecha_corta', 'Tipo_dia']
calendar = calendar[columns_calendar]

In [39]:
calendar['Tipo_dia'].value_counts()

laborable    1738
domingo       359
sabado        355
festivo       103
Name: Tipo_dia, dtype: int64

In [40]:
calendar.head()

Unnamed: 0,Fecha_corta,Tipo_dia
0,2013-01-01,festivo
1,2013-01-02,laborable
2,2013-01-03,laborable
3,2013-01-04,laborable
4,2013-01-05,sabado


In [41]:
len(calendar)

2555

In [42]:
# spot check on a single date to confirm that the converted date column works

calendar[calendar['Fecha_corta'].astype(str).str.contains('2019-03-12')]

Unnamed: 0,Fecha_corta,Tipo_dia
2260,2019-03-12,laborable


In [43]:
# Save to a csv

calendar.to_csv('calendar_ok.csv', index = False)

### Now that I have prepared the extra variables to add to the traffic dataframes, I define a function to apply to each monthly file. It returns a "clean" dataframe per month, and all of them are concatenated as a final step

#### The function performs the following actions in a single step:

 - Define the hours of interest, the order of the columns and the devices the analysis focuses on
 - Rename column 'idelem' to 'id' to identify the devices: the goal is to merge with the devices location dataset
 - Replace the values of column 'tipo_elem': from 494 and 495 to 'M30' and 'URB'
 - Replace the values of column 'tipo_elem': from 'PUNTOS MEDIDA M-30' and 'PUNTOS MEDIDA URBANOS' to 'M30' and 'URB'
 - Filter by 'tipo_elem': 'URB'
 - Add 'volumen' => dividing 'intensidad' by 4 gives the number of vehicles that pass by each device every 15 min
 - Create columns 'Fecha_corta' and 'Hora' from the column 'fecha'
 - Transform 'fecha' and 'Fecha_corta' to datetime types
 - Create column 'Hour24' from the 'Hora' column, using a previously defined function
 - Create column 'Dia_nombre'
 - Create column 'Dia_numero' by passing a dictionary with the weekdays and their numbers
 - Create column 'Mes_nombre'
 - Create column 'Mes_numero' by passing a dictionary with the months and their numbers
 - Filter by the devices I focus on
 - Merge with calendar to get the 'Tipo_dia'
 - Merge with live_traffic_set to get the 'intensidadSat'
 - Add 'volumenSat' from 'intensidadSat'
 - Filter by the hours of interest
 - Keep only the columns I am interested in, in the chosen order
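The division by 4 deserves a quick worked example. Since 'intensidad' is expressed in vehicles/hour, the count for one 15 minute period is a quarter of it. The helper name `vehicles_per_quarter_hour` is introduced here only for illustration; the two input values come from the first row of the 2015 result shown later in this notebook.

```python
def vehicles_per_quarter_hour(vehicles_per_hour):
    """Convert an hourly rate into a count for one 15-minute period (floor division)."""
    return vehicles_per_hour // 4

# intensidad and intensidadSat of the first row of the 2015 result
print(vehicles_per_quarter_hour(2886))  # 721
print(vehicles_per_quarter_hour(5770))  # 1442
```

Floor division drops the fractional part (2886 / 4 is 721.5), so 'volumen' can undercount by up to three quarters of a vehicle per period.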


In [44]:
# Here I create the dictionaries that map day and month names to their numbers

day_number = {'Monday': 1, 'Tuesday':  2, 'Wednesday' : 3, 'Thursday' : 4, 'Friday' : 5, 'Saturday' : 6, 'Sunday' : 7}

month_number = {'January': 1, 'February':  2, 'March' : 3, 'April' : 4, 'May' : 5, 'June' : 6, 'July' : 7,
                'August' : 8,'September' : 9,'October' : 10,'November' : 11, 'December':12}
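The numbering in these dictionaries (Monday = 1 ... Sunday = 7, January = 1 ... December = 12) is the same convention the standard library uses, which gives a quick way to sanity-check the mapping. The sketch below is only a cross-check on two dates that appear in the 2015 output further down, not the approach used in the notebook.

```python
from datetime import date

monday = date(2015, 1, 12)    # shown as Monday in the 2015 output
thursday = date(2015, 1, 15)  # shown as Thursday in the 2015 output

# isoweekday() numbers Monday as 1 and Sunday as 7, like day_number.
print(monday.isoweekday())    # 1
print(thursday.isoweekday())  # 4

# .month is 1 for January, the same value month_number assigns.
print(monday.month)           # 1
```

In pandas, the equivalent accessors would be `.dt.dayofweek + 1` and `.dt.month`, which avoid going through the English day and month names.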

In [45]:
# To reduce the number of distinct time values, create a function that maps each quarter-hour time to its hour (0 to 23)

def hour24(x):
    if x in ('00:00:00', '00:15:00', '00:30:00', '00:45:00'):
        return '0'
    elif x in ('01:00:00', '01:15:00', '01:30:00', '01:45:00'):
        return '1'
    elif x in ('02:00:00', '02:15:00', '02:30:00', '02:45:00'):
        return '2'
    elif x in ('03:00:00', '03:15:00', '03:30:00', '03:45:00'):
        return '3'
    elif x in ('04:00:00', '04:15:00', '04:30:00', '04:45:00'):
        return '4'
    elif x in ('05:00:00', '05:15:00', '05:30:00', '05:45:00'):
        return '5'
    elif x in ('06:00:00', '06:15:00', '06:30:00', '06:45:00'):
        return '6'
    elif x in ('07:00:00', '07:15:00', '07:30:00', '07:45:00'):
        return '7'
    elif x in ('08:00:00', '08:15:00', '08:30:00', '08:45:00'):
        return '8'
    elif x in ('09:00:00', '09:15:00', '09:30:00', '09:45:00'):
        return '9'
    elif x in ('10:00:00', '10:15:00', '10:30:00', '10:45:00'):
        return '10'
    elif x in ('11:00:00', '11:15:00', '11:30:00', '11:45:00'):
        return '11'
    elif x in ('12:00:00', '12:15:00', '12:30:00', '12:45:00'):
        return '12'
    elif x in ('13:00:00', '13:15:00', '13:30:00', '13:45:00'):
        return '13'
    elif x in ('14:00:00', '14:15:00', '14:30:00', '14:45:00'):
        return '14'
    elif x in ('15:00:00', '15:15:00', '15:30:00', '15:45:00'):
        return '15'
    elif x in ('16:00:00', '16:15:00', '16:30:00', '16:45:00'):
        return '16'
    elif x in ('17:00:00', '17:15:00', '17:30:00', '17:45:00'):
        return '17'
    elif x in ('18:00:00', '18:15:00', '18:30:00', '18:45:00'):
        return '18'
    elif x in ('19:00:00', '19:15:00', '19:30:00', '19:45:00'):
        return '19'
    elif x in ('20:00:00', '20:15:00', '20:30:00', '20:45:00'):
        return '20'
    elif x in ('21:00:00', '21:15:00', '21:30:00', '21:45:00'):
        return '21'
    elif x in ('22:00:00', '22:15:00', '22:30:00', '22:45:00'):
        return '22'
    elif x in ('23:00:00', '23:15:00', '23:30:00', '23:45:00'):
        return '23'
    else:
        return 'Other'
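For reference, the same mapping can be expressed with a lookup table built in a comprehension. This is an alternative sketch with the same behaviour for string inputs, not the function used in the rest of the notebook; the names `QUARTER_HOUR_TO_HOUR` and `hour24_compact` are hypothetical.

```python
# Lookup table from every quarter-hour label to its hour, e.g. '09:45:00' -> '9'.
QUARTER_HOUR_TO_HOUR = {
    f"{hour:02d}:{minute:02d}:00": str(hour)
    for hour in range(24)
    for minute in (0, 15, 30, 45)
}

def hour24_compact(x):
    """Return the hour as a string for valid quarter-hour labels, 'Other' otherwise."""
    return QUARTER_HOUR_TO_HOUR.get(x, "Other")

print(hour24_compact("09:45:00"))  # 9
print(hour24_compact("23:00:00"))  # 23
print(hour24_compact("09:10:00"))  # Other
```

The table has 96 entries (24 hours times 4 quarter-hours), one for each value listed in the long `if`/`elif` chain.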

In [46]:
def actionsDataFrameTraffic(df):
    
    hours_of_interest = ['6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22']
    columnsOrder = ['fecha','Fecha_corta', 'Hora', 'Hour24', 'Dia_nombre', 'Dia_numero','Tipo_dia', 'Mes_nombre',
                    'Mes_numero', 'id', 'tipo_elem', 'intensidad','intensidadSat','volumen','volumenSat','ocupacion','carga']
    
    devices_to_analyze = [4353,4354,4265,3478,4211,5104,4386,4384,3848,3850,7138,7139,4301,4305]

    # Harmonise the device id column and the 'tipo_elem' labels (numeric codes in some months, text in others)
    df.rename(columns = {'idelem':'id'}, inplace = True)
    df['tipo_elem'] = df['tipo_elem'].replace(to_replace = [494, 495], value = ['M30', 'URB'])
    df['tipo_elem'] = df['tipo_elem'].replace(to_replace = ['PUNTOS MEDIDA M-30', 'PUNTOS MEDIDA URBANOS'], value = ['M30', 'URB'])

    # Keep urban devices only; .copy() avoids assigning new columns to a view of the original dataframe
    df = df[df['tipo_elem'] == 'URB'].copy()

    # intensidad is in vehicles/hour, so a quarter of it is the count per 15 min period
    df['volumen'] = df['intensidad'] // 4

    # Date and time features
    df[['Fecha_corta', 'Hora']] = df['fecha'].str.split(expand=True)
    df['fecha'] = pd.to_datetime(df['fecha'])
    df['Fecha_corta'] = pd.to_datetime(df['Fecha_corta'])
    df['Hour24'] = df['Hora'].apply(hour24)
    df['Dia_nombre'] = df['Fecha_corta'].dt.day_name()
    df['Dia_numero'] = df['Dia_nombre'].apply(lambda x: day_number[x])
    df['Mes_nombre'] = df['Fecha_corta'].dt.month_name()
    df['Mes_numero'] = df['Mes_nombre'].apply(lambda x: month_number[x])

    # Keep the devices of interest and add 'Tipo_dia' and 'intensidadSat'
    df = df[df['id'].isin(devices_to_analyze)]
    df = pd.merge(df, calendar, how = 'left', on = 'Fecha_corta')
    df = pd.merge(df, live_traffic_set, left_on = 'id', right_on = 'id')
    df['volumenSat'] = df['intensidadSat'] // 4

    # Keep the hours of interest and the final columns, in order
    df = df[df['Hour24'].isin(hours_of_interest)]
    df = df[columnsOrder]
    
    return df
      

### Once the function is defined, apply it to each monthly traffic file

## Important: this is where the long process mentioned in the README and at the beginning of this notebook starts.
### Processing each month takes quite a long time: I basically apply the function to every month of every year. At the end of the notebook I save all the results into a csv file, which you already downloaded in Notebook 0 in case you want to skip the entire process

#### January 2015

In [145]:
data201501 = pd.read_csv('Datos201501.csv', sep = ';')
data201501_clean = actionsDataFrameTraffic(data201501)
# I save the resulting df and delete the former one due to memory issues
data201501_clean.to_csv('data201501_clean.csv', sep = ';', index = False) 
del data201501

#### February 2015

In [144]:
data201502 = pd.read_csv('Datos201502.csv', sep = ';')
data201502_clean = actionsDataFrameTraffic(data201502)
data201502_clean.to_csv('data201502_clean.csv', sep = ';', index = False)
del data201502

#### March 2015

In [147]:
data201503 = pd.read_csv('Datos201503.csv', sep = ';')
data201503_clean = actionsDataFrameTraffic(data201503)
data201503_clean.to_csv('data201503_clean.csv', sep = ';', index = False)
del data201503

#### April 2015

In [157]:
data201504 = pd.read_csv('Datos201504.csv', sep = ';')
data201504_clean = actionsDataFrameTraffic(data201504)
data201504_clean.to_csv('data201504_clean.csv', sep = ';',index = False)
del data201504

#### May 2015

In [158]:
data201505 = pd.read_csv('Datos201505.csv', sep = ';')
data201505_clean = actionsDataFrameTraffic(data201505)
data201505_clean.to_csv('data201505_clean.csv', sep = ';', index = False)
del data201505

#### June 2015

In [160]:
data201506 = pd.read_csv('Datos201506.csv', sep = ';')
data201506_clean = actionsDataFrameTraffic(data201506)
data201506_clean.to_csv('data201506_clean.csv', sep = ';', index = False)
del data201506

#### July 2015

In [161]:
data201507 = pd.read_csv('Datos201507.csv', sep = ';')
data201507_clean = actionsDataFrameTraffic(data201507)
data201507_clean.to_csv('data201507_clean.csv', sep = ';', index = False)
del data201507

#### August 2015

In [162]:
data201508 = pd.read_csv('Datos201508.csv', sep = ';')
data201508_clean = actionsDataFrameTraffic(data201508)
data201508_clean.to_csv('data201508_clean.csv',sep = ';', index = False)
del data201508

#### September 2015

In [164]:
data201509 = pd.read_csv('Datos201509.csv', sep = ';')
data201509_clean = actionsDataFrameTraffic(data201509)
data201509_clean.to_csv('data201509_clean.csv', sep = ';', index = False)
del data201509

#### October 2015

In [165]:
data201510 = pd.read_csv('Datos201510.csv', sep = ';')
data201510_clean = actionsDataFrameTraffic(data201510)
data201510_clean.to_csv('data201510_clean.csv', sep = ';', index = False)
del data201510

#### November 2015

In [166]:
data201511 = pd.read_csv('Datos201511.csv', sep = ';')
data201511_clean = actionsDataFrameTraffic(data201511)
data201511_clean.to_csv('data201511_clean.csv',sep = ';', index = False)
del data201511

#### December 2015

In [167]:
data201512 = pd.read_csv('Datos201512.csv', sep = ';')
data201512_clean = actionsDataFrameTraffic(data201512)
data201512_clean.to_csv('data201512_clean.csv', sep = ';', index = False)
del data201512
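The twelve cells above all follow the same read, clean, save, delete pattern, so they could also be written as a loop. The sketch below is one possible way to do it, assuming the file naming used above; the helpers `clean_month` and `clean_year` are hypothetical and the cleaning function is passed in as the `transform` argument.

```python
import os
import pandas as pd

def clean_month(year, month, transform, folder="."):
    """Read one monthly file, apply transform, save the result and return it."""
    stem = f"{year}{month:02d}"
    raw = pd.read_csv(os.path.join(folder, f"Datos{stem}.csv"), sep=";")
    clean = transform(raw)
    clean.to_csv(os.path.join(folder, f"data{stem}_clean.csv"), sep=";", index=False)
    # raw goes out of scope when the function returns, so no explicit del is needed.
    return clean

def clean_year(year, transform, folder=".", months=range(1, 13)):
    """Clean the requested months of a year and concatenate them into one dataframe."""
    frames = [clean_month(year, m, transform, folder) for m in months]
    return pd.concat(frames, ignore_index=True)
```

With the function defined in this notebook, the call would look like `data2015 = clean_year(2015, actionsDataFrameTraffic)`. Only one raw monthly dataframe is alive at any time, which is the same memory-saving idea as the manual `del` statements.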

In [168]:
# Now that all the 2015 files are ready for the analysis, I collect them in a list to concatenate them

files_2015_clean = [data201501_clean, data201502_clean, data201503_clean, data201504_clean, data201505_clean, data201506_clean, 
             data201507_clean, data201508_clean, data201509_clean, data201510_clean, data201511_clean, data201512_clean]


In [169]:
# Concatenating all the cleaned 2015 dataframes

data2015 = pd.concat(files_2015_clean, ignore_index = True)

In [170]:
data2015.shape

(285132, 17)

In [171]:
data2015.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2015-01-12 10:45:00,2015-01-12,10:45:00,10,Monday,1,laborable,January,1,7138,URB,2886,5770.0,721,1442.0,14,50
1,2015-01-15 20:45:00,2015-01-15,20:45:00,20,Thursday,4,laborable,January,1,7138,URB,1975,5770.0,493,1442.0,6,31
2,2015-01-12 16:00:00,2015-01-12,16:00:00,16,Monday,1,laborable,January,1,7138,URB,2603,5770.0,650,1442.0,10,42
3,2015-01-12 17:45:00,2015-01-12,17:45:00,17,Monday,1,laborable,January,1,7138,URB,2238,5770.0,559,1442.0,10,38
4,2015-01-12 11:00:00,2015-01-12,11:00:00,11,Monday,1,laborable,January,1,7138,URB,2881,5770.0,720,1442.0,12,45


In [172]:
# Checking nulls
data2015.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [173]:
# Delete the individual dataframes to free memory, since they are already concatenated

del(data201501_clean, data201502_clean, data201503_clean, data201504_clean, data201505_clean, data201506_clean,
   data201507_clean, data201508_clean, data201509_clean, data201510_clean, data201511_clean, data201512_clean)

In [174]:
# Save the ready-to-use 2015 dataset to a csv to avoid repeating all the steps above!

data2015.to_csv('data2015.csv', sep = ';', index = False)

## 2016

In [176]:
zip_ref = zipfile.ZipFile('DatosTrafico2016.zip', 'r')
zip_ref.extractall()
zip_ref.close()

#### First of all, have a look at all the files to see whether the function used for the 2015 files works for this year or needs an update
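A lighter complement to this check is to read only the header line of each file and confirm that the columns the cleaning function needs are present. The sketch below uses the standard `csv` module; the helper names and the UTF-8 encoding are assumptions, and the check does not look at the values themselves (for example the 'tipo_elem' labels), so opening a few rows is still useful.

```python
import csv

def read_header(path, delimiter=";"):
    """Return the column names from the first line of a delimited text file."""
    # Encoding is assumed to be UTF-8; adjust it if the files use another one.
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter=delimiter))

def columns_missing(path, required, delimiter=";"):
    """List the required columns that are absent from the file header."""
    header = set(read_header(path, delimiter))
    return [name for name in required if name not in header]

# Raw columns that the cleaning function reads from each monthly file.
REQUIRED = ["idelem", "fecha", "tipo_elem", "intensidad", "ocupacion", "carga"]

# Example call (commented out because it needs the real file on disk):
# columns_missing('Datos201601.csv', REQUIRED)
```

An empty list means the header has everything the function expects; any name returned points to a month whose layout changed.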

In [177]:
data201601 = pd.read_csv('Datos201601.csv', sep = ';')
data201601.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3581,2016-01-01 09:15:00,61079,PUNTOS MEDIDA URBANOS,40,0,0,0,N,1
1,3982,2016-01-01 00:00:00,6042,PUNTOS MEDIDA URBANOS,52,0,2,0,N,13
2,4291,2016-01-01 09:15:00,16013,PUNTOS MEDIDA URBANOS,43,4,12,0,N,7
3,4200,2016-01-01 09:15:00,14002,PUNTOS MEDIDA URBANOS,52,0,4,0,N,14
4,5928,2016-01-01 09:15:00,44030,PUNTOS MEDIDA URBANOS,124,0,6,0,N,13


In [178]:
del(data201601)

In [179]:
data201602 = pd.read_csv('Datos201602.csv', sep = ';')
data201602.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,6799,2016-02-15 10:45:00,PM42352,PUNTOS MEDIDA M-30,428,3,21,79,N,15
1,6805,2016-02-15 10:45:00,PM43031,PUNTOS MEDIDA M-30,1484,9,26,86,N,15
2,6806,2016-02-15 10:45:00,PM43032,PUNTOS MEDIDA M-30,560,8,30,72,N,15
3,6807,2016-02-15 10:45:00,PM43151,PUNTOS MEDIDA M-30,3203,13,50,57,N,15
4,5352,2016-02-15 11:15:00,87021,PUNTOS MEDIDA URBANOS,330,3,15,0,N,14


In [180]:
del(data201602)

In [181]:
data201603 = pd.read_csv('Datos201603.csv', sep = ';')
data201603.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,6639,2016-03-11 21:00:00,PM10001,PUNTOS MEDIDA M-30,4488,14,65,71,N,15
1,3797,2016-03-11 21:00:00,PM10005,PUNTOS MEDIDA M-30,1252,13,56,73,N,15
2,6640,2016-03-11 21:00:00,PM10013,PUNTOS MEDIDA M-30,2021,11,47,66,N,15
3,6641,2016-03-11 21:00:00,PM10021,PUNTOS MEDIDA M-30,4856,15,70,82,N,15
4,6642,2016-03-11 21:00:00,PM10091,PUNTOS MEDIDA M-30,5732,16,81,87,N,15


In [182]:
del(data201603)

In [183]:
data201604 = pd.read_csv('Datos201604.csv', sep = ';')
data201604.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3406,2016-04-14 17:00:00,39043,PUNTOS MEDIDA URBANOS,264,3,21,0,N,15
1,4990,2016-04-14 17:00:00,39044,PUNTOS MEDIDA URBANOS,124,1,10,0,N,15
2,4992,2016-04-14 17:00:00,39046,PUNTOS MEDIDA URBANOS,139,40,47,0,N,15
3,4993,2016-04-14 17:00:00,39047,PUNTOS MEDIDA URBANOS,7,6,6,0,N,15
4,6455,2016-04-09 20:30:00,67017,PUNTOS MEDIDA URBANOS,257,1,12,0,N,10


In [184]:
del(data201604)

In [185]:
data201605 = pd.read_csv('Datos201605.csv', sep = ';')
data201605.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,4551,2016-05-15 10:15:00,32009,PUNTOS MEDIDA URBANOS,148,9,17,0,N,15
1,6592,2016-05-28 15:45:00,79045,PUNTOS MEDIDA URBANOS,261,1,9,0,N,15
2,6614,2016-05-28 15:45:00,79018,PUNTOS MEDIDA URBANOS,316,4,24,0,N,15
3,3928,2016-05-08 17:00:00,4022,PUNTOS MEDIDA URBANOS,17,10,10,0,N,7
4,3772,2016-05-28 15:45:00,82414,PUNTOS MEDIDA URBANOS,25,16,15,0,N,8


In [186]:
del(data201605)

In [187]:
data201606 = pd.read_csv('Datos201606.csv', sep = ';')
data201606.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,9843,2016-06-09 09:45:00,03002,PUNTOS MEDIDA URBANOS,1178,53,71,0,N,15
1,3797,2016-06-04 12:30:00,PM10005,PUNTOS MEDIDA M-30,1011,7,46,81,N,15
2,9842,2016-06-09 09:45:00,03003,PUNTOS MEDIDA URBANOS,193,3,33,0,N,15
3,3799,2016-06-04 12:30:00,PM10611,PUNTOS MEDIDA M-30,4964,10,54,88,N,15
4,3949,2016-06-09 09:45:00,03004,PUNTOS MEDIDA URBANOS,2650,14,66,0,N,15


In [188]:
del(data201606)

In [189]:
data201607 = pd.read_csv('Datos201607.csv', sep = ';')
data201607.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,6694,2016-07-08 07:45:00,PM12331,PUNTOS MEDIDA M-30,4131,10,59,86,N,15
1,3800,2016-07-02 03:30:00,PM10612,PUNTOS MEDIDA M-30,467,1,11,89,N,15
2,3488,2016-07-02 03:30:00,PM10712,PUNTOS MEDIDA M-30,672,2,11,85,N,15
3,6668,2016-07-02 03:30:00,PM107631,PUNTOS MEDIDA M-30,128,0,5,42,N,15
4,3611,2016-07-02 03:30:00,PM107642,PUNTOS MEDIDA M-30,121,1,1,45,N,15


In [190]:
del(data201607)

In [191]:
data201608 = pd.read_csv('Datos201608.csv', sep = ';')
data201608.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3396,2016-08-10 06:45:00,23012,PUNTOS MEDIDA URBANOS,82,0,5,0,N,10
1,3397,2016-08-10 06:45:00,23013,PUNTOS MEDIDA URBANOS,91,0,9,0,N,15
2,9961,2016-08-10 06:45:00,23016,PUNTOS MEDIDA URBANOS,518,1,9,0,N,14
3,9962,2016-08-10 06:45:00,23017,PUNTOS MEDIDA URBANOS,143,1,12,0,N,15
4,4477,2016-08-10 06:45:00,23033,PUNTOS MEDIDA URBANOS,70,0,6,0,N,14


In [192]:
del(data201608)

In [193]:
data201609 = pd.read_csv('Datos201609.csv', sep = ';')
data201609.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,4023,2016-09-23 05:00:00,9007,PUNTOS MEDIDA URBANOS,56,0,3,0,N,15
1,3616,2016-09-23 05:00:00,9021,PUNTOS MEDIDA URBANOS,212,0,7,0,N,15
2,4036,2016-09-23 05:00:00,9022,PUNTOS MEDIDA URBANOS,107,0,6,0,N,15
3,4075,2016-09-23 05:00:00,10031,PUNTOS MEDIDA URBANOS,167,10,11,0,N,15
4,4077,2016-09-23 05:00:00,10033,PUNTOS MEDIDA URBANOS,16,4,5,0,N,15


In [194]:
del(data201609)

In [195]:
data201610 = pd.read_csv('Datos201610.csv', sep = ';')
data201610.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,4016,2016-10-05 11:30:00,08017,PUNTOS MEDIDA URBANOS,363,7,51,0,N,15
1,3664,2016-10-18 19:00:00,71001,PUNTOS MEDIDA URBANOS,847,8,45,0,N,15
2,6512,2016-10-18 19:00:00,71010,PUNTOS MEDIDA URBANOS,1046,51,68,0,N,15
3,1050,2016-10-30 23:15:00,08RR02PM01,PUNTOS MEDIDA M-30,228,1,0,61,N,5
4,6879,2016-10-30 23:15:00,11XC46PM01,PUNTOS MEDIDA M-30,1032,2,0,67,N,5


In [196]:
del(data201610)

In [197]:
data201611 = pd.read_csv('Datos201611.csv', sep = ';')
data201611.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,5924,2016-11-25 13:00:00,44020,PUNTOS MEDIDA URBANOS,847,19,49,0,N,3
1,6831,2016-11-25 13:00:00,37033,PUNTOS MEDIDA URBANOS,188,0,12,0,N,15
2,6952,2016-11-19 13:45:00,PM41891,PUNTOS MEDIDA M-30,3324,12,71,90,N,15
3,3605,2016-11-12 23:30:00,04025,PUNTOS MEDIDA URBANOS,20,0,14,0,N,1
4,6599,2016-11-19 13:45:00,78010,PUNTOS MEDIDA URBANOS,303,4,10,0,N,15


In [198]:
del(data201611)

In [5]:
data201612 = pd.read_csv('Datos201612.csv', sep = ';')
data201612.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3652,2016-12-01 11:30:00,16300,PUNTOS MEDIDA URBANOS,467,67,48,0,N,6
1,6952,2016-12-31 23:30:00,PM41891,PUNTOS MEDIDA M-30,177,1,5,96,N,15
2,4368,2016-12-31 23:30:00,17037,PUNTOS MEDIDA URBANOS,132,7,4,0,N,5
3,6026,2016-12-31 23:30:00,46019,PUNTOS MEDIDA URBANOS,40,0,1,0,N,1
4,4387,2016-12-13 19:30:00,18010,PUNTOS MEDIDA URBANOS,486,5,27,0,N,15


In [6]:
del(data201612)
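
Since these preview cells only need the first few rows, a lighter way to look at a file is to ask pandas for just those rows. This is a minimal sketch, not what was run above: `preview_csv` and the tiny demo file are hypothetical names used for illustration, while `nrows` is a standard `pd.read_csv` parameter.

```python
import os
import tempfile

import pandas as pd


def preview_csv(path, sep=';', nrows=5):
    """Return only the first `nrows` rows of a CSV without loading the rest."""
    return pd.read_csv(path, sep=sep, nrows=nrows)


# Tiny stand-in file so the sketch runs on its own; in the notebook the path
# would be one of the monthly files such as 'Datos201612.csv'.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo_month.csv')
with open(demo_path, 'w') as f:
    f.write('idelem;fecha;intensidad\n')
    for i in range(100):
        f.write(f'{3000 + i};2016-12-01 00:00:00;{i}\n')

preview = preview_csv(demo_path)
print(preview.shape)
```

With this approach there is no large dataframe to delete afterwards, because only the requested rows are ever parsed.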

#### The 2016 files all have the same columns, so the same function used for 2015 can be applied. I only changed the function's name in case I need it in the future
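
The per-month cells that follow repeat the same four steps (read, clean, save, delete). As a rough sketch of how that pattern could be wrapped in one helper, assuming the file naming used in this notebook: `process_month` and `demo_clean` are hypothetical names, and `demo_clean` is only a placeholder standing in for `actionsDataFrameTraffic`, whose body is not reproduced here.

```python
import os
import tempfile

import pandas as pd


def process_month(year, month, clean_fn, sep=';', folder='.'):
    """Read one monthly file, clean it, save the lighter result and return it."""
    raw_path = os.path.join(folder, f'Datos{year}{month:02d}.csv')
    clean_path = os.path.join(folder, f'data{year}{month:02d}_clean.csv')
    raw = pd.read_csv(raw_path, sep=sep)
    clean = clean_fn(raw)
    clean.to_csv(clean_path, sep=';', index=False)
    # `raw` goes out of scope when the function returns, so no explicit del
    return clean


def demo_clean(df):
    # Hypothetical placeholder for the notebook's cleaning function
    return df[df['intensidad'] > 0]


# Two tiny stand-in monthly files so the sketch runs on its own
demo_folder = tempfile.mkdtemp()
for m in (1, 2):
    pd.DataFrame({'idelem': [1, 2], 'intensidad': [0, 10 * m]}).to_csv(
        os.path.join(demo_folder, f'Datos2016{m:02d}.csv'), sep=';', index=False)

cleaned = [process_month(2016, m, demo_clean, folder=demo_folder) for m in (1, 2)]
demo_2016 = pd.concat(cleaned, ignore_index=True)
print(demo_2016)
```

A month with a different separator (as happens later with one 2017 file) could be handled by passing `sep=','` for that month only.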

#### January 2016

In [59]:
data201601 = pd.read_csv('Datos201601.csv', sep = ';')
data201601_clean = actionsDataFrameTraffic(data201601)
data201601_clean.to_csv('data201601_clean.csv', sep = ';', index = False)
del data201601

#### February 2016

In [60]:
data201602 = pd.read_csv('Datos201602.csv', sep = ';')
data201602_clean = actionsDataFrameTraffic(data201602)
data201602_clean.to_csv('data201602_clean.csv', sep = ';', index = False)
del data201602

#### March 2016

In [61]:
data201603 = pd.read_csv('Datos201603.csv', sep = ';')
data201603_clean = actionsDataFrameTraffic(data201603)
data201603_clean.to_csv('data201603_clean.csv', sep = ';', index = False)
del data201603

#### April 2016

In [62]:
data201604 = pd.read_csv('Datos201604.csv', sep = ';')
data201604_clean = actionsDataFrameTraffic(data201604)
data201604_clean.to_csv('data201604_clean.csv', sep = ';', index = False)
del data201604

#### May 2016

In [63]:
data201605 = pd.read_csv('Datos201605.csv', sep = ';')
data201605_clean = actionsDataFrameTraffic(data201605)
data201605_clean.to_csv('data201605_clean.csv', sep = ';', index = False)
del data201605

#### June 2016

In [64]:
data201606 = pd.read_csv('Datos201606.csv', sep = ';')
data201606_clean = actionsDataFrameTraffic(data201606)
data201606_clean.to_csv('data201606_clean.csv', sep = ';', index = False)
del data201606

#### July 2016

In [65]:
data201607 = pd.read_csv('Datos201607.csv', sep = ';')
data201607_clean = actionsDataFrameTraffic(data201607)
data201607_clean.to_csv('data201607_clean.csv', sep = ';', index = False)
del data201607

#### August 2016

In [66]:
data201608 = pd.read_csv('Datos201608.csv', sep = ';')
data201608_clean = actionsDataFrameTraffic(data201608)
data201608_clean.to_csv('data201608_clean.csv', sep = ';', index = False)
del data201608

#### September 2016

In [67]:
data201609 = pd.read_csv('Datos201609.csv', sep = ';')
data201609_clean = actionsDataFrameTraffic(data201609)
data201609_clean.to_csv('data201609_clean.csv', sep = ';', index = False)
del data201609

#### October 2016

In [68]:
data201610 = pd.read_csv('Datos201610.csv', sep = ';')
data201610_clean = actionsDataFrameTraffic(data201610)
data201610_clean.to_csv('data201610_clean.csv', sep = ';', index = False)
del data201610

#### November 2016

In [69]:
data201611 = pd.read_csv('Datos201611.csv', sep = ';')
data201611_clean = actionsDataFrameTraffic(data201611)
data201611_clean.to_csv('data201611_clean.csv', sep = ';', index = False)
del data201611

#### December 2016

In [70]:
data201612 = pd.read_csv('Datos201612.csv', sep = ';')
data201612_clean = actionsDataFrameTraffic(data201612)
data201612_clean.to_csv('data201612_clean.csv', sep = ';', index = False)
del data201612

In [71]:
# Create a list to concatenate

files_2016_clean = [data201601_clean, data201602_clean, data201603_clean, data201604_clean, data201605_clean, data201606_clean, 
             data201607_clean, data201608_clean, data201609_clean, data201610_clean, data201611_clean, data201612_clean]


In [72]:
# Concatenating all the cleaned 2016 months

data2016 = pd.concat(files_2016_clean, ignore_index = True)
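
Because every cleaned month was also written to disk, the yearly dataframe could alternatively be rebuilt from those files, for example after a kernel restart. A minimal sketch under that assumption: `concat_clean_files` is a hypothetical helper and the demo files are stand-ins for the real `data2016MM_clean.csv` files.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_clean_files(pattern, sep=';'):
    """Read every file matching `pattern` and stack them into one dataframe."""
    # Zero-padded month numbers make alphabetical order chronological
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p, sep=sep) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Stand-in cleaned files, written out of order on purpose
demo_folder = tempfile.mkdtemp()
for m in (2, 1):
    pd.DataFrame({'Mes_numero': [m, m], 'intensidad': [100 * m, 100 * m + 1]}).to_csv(
        os.path.join(demo_folder, f'data2016{m:02d}_clean.csv'), sep=';', index=False)

demo_year = concat_clean_files(os.path.join(demo_folder, 'data2016*_clean.csv'))
print(demo_year)
```

Note that `pd.concat` raises a `ValueError` when the pattern matches no files, which is a useful signal that the path is wrong.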

In [73]:
data2016.shape

(319029, 17)

In [74]:
data2016.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2016-01-16 20:00:00,2016-01-16,20:00:00,20,Saturday,6,sabado,January,1,4265,URB,855,3350.0,213,837.0,4,19
1,2016-01-01 09:15:00,2016-01-01,09:15:00,9,Friday,5,festivo,January,1,4265,URB,151,3350.0,37,837.0,0,3
2,2016-01-18 14:30:00,2016-01-18,14:30:00,14,Monday,1,laborable,January,1,4265,URB,690,3350.0,172,837.0,2,18
3,2016-01-09 07:30:00,2016-01-09,07:30:00,7,Saturday,6,sabado,January,1,4265,URB,266,3350.0,66,837.0,1,8
4,2016-01-01 09:30:00,2016-01-01,09:30:00,9,Friday,5,festivo,January,1,4265,URB,176,3350.0,44,837.0,0,4


In [75]:
data2016.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia          True
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [76]:
data2016[data2016['Tipo_dia'].isnull()]

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
30803,2016-02-29 06:00:00,2016-02-29,06:00:00,6,Monday,1,,February,2,5104,URB,177,6500.0,44,1625.0,0,4
30809,2016-02-29 06:15:00,2016-02-29,06:15:00,6,Monday,1,,February,2,5104,URB,259,6500.0,64,1625.0,1,6
30810,2016-02-29 06:30:00,2016-02-29,06:30:00,6,Monday,1,,February,2,5104,URB,300,6500.0,75,1625.0,1,10
30811,2016-02-29 06:45:00,2016-02-29,06:45:00,6,Monday,1,,February,2,5104,URB,407,6500.0,101,1625.0,1,13
30812,2016-02-29 07:00:00,2016-02-29,07:00:00,7,Monday,1,,February,2,5104,URB,468,6500.0,117,1625.0,2,14
30813,2016-02-29 07:15:00,2016-02-29,07:15:00,7,Monday,1,,February,2,5104,URB,579,6500.0,144,1625.0,1,17
30814,2016-02-29 07:30:00,2016-02-29,07:30:00,7,Monday,1,,February,2,5104,URB,758,6500.0,189,1625.0,2,23
30815,2016-02-29 07:45:00,2016-02-29,07:45:00,7,Monday,1,,February,2,5104,URB,872,6500.0,218,1625.0,4,30
30816,2016-02-29 08:00:00,2016-02-29,08:00:00,8,Monday,1,,February,2,5104,URB,1066,6500.0,266,1625.0,4,33
30817,2016-02-29 09:00:00,2016-02-29,09:00:00,9,Monday,1,,February,2,5104,URB,1106,6500.0,276,1625.0,5,38


In [81]:
# 2016-02-29 was a Monday, so I fill it with 'laborable'
data2016['Tipo_dia'] = data2016['Tipo_dia'].fillna('laborable')
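
The preview above only shows the first rows with a missing `Tipo_dia`, so it can be worth confirming which dates are affected before filling everything with one label. A small sketch on toy data (the `toy` frame is made up for illustration; the column names are the ones used in this notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Fecha_corta': ['2016-02-28', '2016-02-29', '2016-02-29'],
    'Dia_numero': [7, 1, 1],
    'Tipo_dia': ['domingo', np.nan, np.nan],
})

missing = toy['Tipo_dia'].isnull()

# Which dates have no day type? If more than one date shows up,
# a single fill value may not be appropriate.
print(toy.loc[missing, 'Fecha_corta'].unique())

# Fill only the missing rows that fall on Monday-Friday (Dia_numero 1-5)
weekday = toy['Dia_numero'].between(1, 5)
toy.loc[missing & weekday, 'Tipo_dia'] = 'laborable'
print(toy)
```

Restricting the fill to weekdays is an assumption of this sketch; a missing value on a public holiday would still need to be labelled by hand.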

In [82]:
data2016.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [83]:
# Delete the individual dataframes to free memory since they are already concatenated

del(data201601_clean, data201602_clean, data201603_clean, data201604_clean, data201605_clean, data201606_clean, 
    data201607_clean, data201608_clean, data201609_clean, data201610_clean, data201611_clean, data201612_clean)

In [84]:
# Save the 2016 dataset to a csv, ready to use!

data2016.to_csv('data2016.csv', sep = ';', index = False)

## 2017

In [85]:
# The context manager closes the zip file automatically
with zipfile.ZipFile('DatosTrafico2017.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [86]:
data201701 = pd.read_csv('Datos201701.csv', sep = ';')
data201701.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,6900,2017-01-01 01:00:00,04TL40PM01,PUNTOS MEDIDA M-30,168,1,0,49,N,5
1,3846,2017-01-01 00:45:00,01007,PUNTOS MEDIDA URBANOS,1182,2,34,0,N,15
2,1047,2017-01-01 00:30:00,03FL08PM01,PUNTOS MEDIDA M-30,60,0,0,35,N,10
3,6691,2017-01-01 00:30:00,PM12121,PUNTOS MEDIDA M-30,540,0,8,82,N,15
4,7125,2017-01-01 00:30:00,PM12122,PUNTOS MEDIDA M-30,64,1,3,39,N,15


In [87]:
del(data201701)

In [88]:
data201702 = pd.read_csv('Datos201702.csv', sep = ';')
data201702.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3753,2017-02-18 01:45:00,9004,PUNTOS MEDIDA URBANOS,1986,23,93,0,N,15
1,4021,2017-02-18 01:45:00,9005,PUNTOS MEDIDA URBANOS,330,1,19,0,N,15
2,4022,2017-02-18 01:45:00,9006,PUNTOS MEDIDA URBANOS,298,1,31,0,N,15
3,4023,2017-02-18 01:45:00,9007,PUNTOS MEDIDA URBANOS,263,4,23,0,N,15
4,4024,2017-02-18 01:45:00,9008,PUNTOS MEDIDA URBANOS,95,0,10,0,N,15


In [89]:
del(data201702)

In [90]:
data201703 = pd.read_csv('Datos201703.csv', sep = ',') # watch out: in this file the separator is a comma
data201703.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,6655,2017-03-05 00:00:00,PM10402,PUNTOS MEDIDA M-30,1580,4,18,80,N,15
1,6656,2017-03-05 00:00:00,PM10441,PUNTOS MEDIDA M-30,1451,3,22,90,N,15
2,6674,2017-03-05 00:00:00,PM10945,PUNTOS MEDIDA M-30,788,7,36,78,N,15
3,7124,2017-03-05 00:00:00,PM10948,PUNTOS MEDIDA M-30,149,1,7,59,N,15
4,6676,2017-03-05 00:00:00,PM10981,PUNTOS MEDIDA M-30,2400,6,36,88,N,15


In [91]:
del(data201703)
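
Since one monthly file uses a comma while the others use a semicolon, the separator could be guessed from the header line instead of being remembered by hand. This is only a sketch: `guess_sep` is a hypothetical helper that counts candidate characters in the first line, and the demo files are stand-ins.

```python
import os
import tempfile

import pandas as pd


def guess_sep(path, candidates=(';', ',')):
    """Pick the candidate separator that appears most often in the header line."""
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        header = f.readline()
    # On a tie, max() keeps the first candidate, i.e. ';'
    return max(candidates, key=header.count)


# Two stand-in files, one per separator
demo_dir = tempfile.mkdtemp()
semi_path = os.path.join(demo_dir, 'semi.csv')
comma_path = os.path.join(demo_dir, 'comma.csv')
with open(semi_path, 'w') as f:
    f.write('idelem;fecha;intensidad\n6799;2016-02-15 10:45:00;428\n')
with open(comma_path, 'w') as f:
    f.write('idelem,fecha,intensidad\n6655,2017-03-05 00:00:00,1580\n')

for path in (semi_path, comma_path):
    sep = guess_sep(path)
    df = pd.read_csv(path, sep=sep)
    print(repr(sep), df.shape)
```

The heuristic only looks at the header, so it would be fooled by a header that contains the other character inside a column name.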

In [92]:
data201704 = pd.read_csv('Datos201704.csv', sep = ';')
data201704.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,3854,2017-04-02 13:00:00,1020,PUNTOS MEDIDA URBANOS,112,2,22,0,N,15
1,3855,2017-04-02 13:00:00,1021,PUNTOS MEDIDA URBANOS,289,5,38,0,N,15
2,3852,2017-04-02 13:00:00,1030,PUNTOS MEDIDA URBANOS,640,2,13,0,N,15
3,7026,2017-04-02 13:00:00,1301,PUNTOS MEDIDA URBANOS,661,5,26,0,N,15
4,7011,2017-04-02 13:00:00,1302,PUNTOS MEDIDA URBANOS,511,11,39,0,N,15


In [93]:
del(data201704)

In [94]:
data201705 = pd.read_csv('Datos201705.csv', sep = ';')
data201705.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-05-01 00:00:00,05FT10PM01,PUNTOS MEDIDA M-30,636,1,0,64,N,5
1,1002,2017-05-01 00:00:00,05FT37PM01,PUNTOS MEDIDA M-30,600,3,0,72,N,5
2,1003,2017-05-01 00:00:00,05FT66PM01,PUNTOS MEDIDA M-30,840,2,0,77,N,5
3,1006,2017-05-01 00:00:00,04FT74PM01,PUNTOS MEDIDA M-30,816,3,0,68,N,5
4,1009,2017-05-01 00:00:00,03FT52PM01,PUNTOS MEDIDA M-30,768,2,0,65,N,5


In [95]:
del(data201705)

In [96]:
data201706 = pd.read_csv('Datos201706.csv', sep = ';')
data201706.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-06-01 00:00:00,05FT10PM01,PUNTOS MEDIDA M-30,576,4,0,73,N,5
1,1002,2017-06-01 00:00:00,05FT37PM01,PUNTOS MEDIDA M-30,888,4,0,70,N,5
2,1003,2017-06-01 00:00:00,05FT66PM01,PUNTOS MEDIDA M-30,1008,4,0,75,N,5
3,1006,2017-06-01 00:00:00,04FT74PM01,PUNTOS MEDIDA M-30,888,3,0,67,N,5
4,1009,2017-06-01 00:00:00,03FT52PM01,PUNTOS MEDIDA M-30,756,2,0,64,N,5


In [97]:
del(data201706)

In [98]:
data201707 = pd.read_csv('Datos201707.csv', sep = ';')
data201707.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-07-01 00:00:00,05FT10PM01,PUNTOS MEDIDA M-30,1164,3,0,62,N,5
1,1002,2017-07-01 00:00:00,05FT37PM01,PUNTOS MEDIDA M-30,996,4,0,68,N,5
2,1003,2017-07-01 00:00:00,05FT66PM01,PUNTOS MEDIDA M-30,1392,4,0,74,N,5
3,1006,2017-07-01 00:00:00,04FT74PM01,PUNTOS MEDIDA M-30,1044,3,0,65,N,5
4,1009,2017-07-01 00:00:00,03FT52PM01,PUNTOS MEDIDA M-30,1296,3,0,62,N,5


In [99]:
del(data201707)

In [100]:
data201708 = pd.read_csv('Datos201708.csv', sep = ';')
data201708.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-08-01 00:00:00,05FT10PM01,PUNTOS MEDIDA M-30,480,2,0,69,N,5
1,1002,2017-08-01 00:00:00,05FT37PM01,PUNTOS MEDIDA M-30,480,2,0,71,N,5
2,1003,2017-08-01 00:00:00,05FT66PM01,PUNTOS MEDIDA M-30,864,3,0,76,N,5
3,1006,2017-08-01 00:00:00,04FT74PM01,PUNTOS MEDIDA M-30,732,2,0,67,N,5
4,1009,2017-08-01 00:00:00,03FT52PM01,PUNTOS MEDIDA M-30,804,2,0,65,N,5


In [101]:
del(data201708)

In [102]:
data201709 = pd.read_csv('Datos201709.csv', sep = ';')
data201709.head()

Unnamed: 0,idelem,fecha,identif,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-09-01 00:00:00,05FT10PM01,PUNTOS MEDIDA M-30,1128,3,0,65,N,5
1,1002,2017-09-01 00:00:00,05FT37PM01,PUNTOS MEDIDA M-30,948,4,0,70,N,5
2,1003,2017-09-01 00:00:00,05FT66PM01,PUNTOS MEDIDA M-30,1140,2,0,76,N,5
3,1006,2017-09-01 00:00:00,04FT74PM01,PUNTOS MEDIDA M-30,1284,4,0,66,N,5
4,1009,2017-09-01 00:00:00,03FT52PM01,PUNTOS MEDIDA M-30,1068,2,0,68,N,5


In [103]:
del(data201709)

In [104]:
data201710 = pd.read_csv('Datos201710.csv', sep = ';')
data201710.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2017-10-01 00:00:00,PUNTOS MEDIDA M-30,1356,4,0,61,N,5
1,1002,2017-10-01 00:00:00,PUNTOS MEDIDA M-30,1152,6,0,68,N,5
2,1003,2017-10-01 00:00:00,PUNTOS MEDIDA M-30,1404,5,0,71,N,5
3,1006,2017-10-01 00:00:00,PUNTOS MEDIDA M-30,1608,5,0,68,N,5
4,1009,2017-10-01 00:00:00,PUNTOS MEDIDA M-30,1848,4,0,67,N,5


In [105]:
del(data201710)
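
The preview above shows a different layout from the earlier months: the identifier column is called `id` instead of `idelem`, and there is no `identif` column. How `actionsDataFrameTraffic` copes with this is not shown in this part of the notebook. One possible way to bring both layouts to a common set of columns is sketched below; `COLUMN_ALIASES` and `normalize_columns` are hypothetical names, and treating `idelem` as equivalent to `id` is an assumption based only on the previews.

```python
import pandas as pd

# Assumption: the older column name maps onto the newer one
COLUMN_ALIASES = {'idelem': 'id'}


def normalize_columns(df):
    """Rename old-style columns and drop the one missing from newer files."""
    out = df.rename(columns=COLUMN_ALIASES)
    # errors='ignore' makes this a no-op for files that never had 'identif'
    return out.drop(columns=['identif'], errors='ignore')


# Toy rows shaped like the two layouts seen in the previews
old_style = pd.DataFrame({'idelem': [6900], 'fecha': ['2017-01-01 01:00:00'],
                          'identif': ['04TL40PM01'], 'intensidad': [168]})
new_style = pd.DataFrame({'id': [1001], 'fecha': ['2017-10-01 00:00:00'],
                          'intensidad': [1356]})

old_norm = normalize_columns(old_style)
new_norm = normalize_columns(new_style)
print(list(old_norm.columns), list(new_norm.columns))
```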

In [106]:
data201711 = pd.read_csv('Datos201711.csv', sep = ';')
data201711.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,0,2017-11-01 00:00:00,PUNTOS MEDIDA M-30,156,1,0,55,N,15
1,1001,2017-11-01 00:00:00,PUNTOS MEDIDA M-30,1008,12,0,64,N,5
2,1002,2017-11-01 00:00:00,PUNTOS MEDIDA M-30,972,4,0,73,N,5
3,1003,2017-11-01 00:00:00,PUNTOS MEDIDA M-30,1344,4,0,77,N,5
4,1006,2017-11-01 00:00:00,PUNTOS MEDIDA M-30,984,3,0,73,N,5


In [107]:
del(data201711)

In [108]:
data201712 = pd.read_csv('Datos201712.csv', sep = ';')
data201712.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,0,2017-12-01 00:00:00,PUNTOS MEDIDA M-30,188,1,0,60,N,15
1,1001,2017-12-01 00:00:00,PUNTOS MEDIDA M-30,384,23,0,68,N,5
2,1002,2017-12-01 00:00:00,PUNTOS MEDIDA M-30,504,2,0,70,N,5
3,1003,2017-12-01 00:00:00,PUNTOS MEDIDA M-30,804,3,0,75,N,5
4,1006,2017-12-01 00:00:00,PUNTOS MEDIDA M-30,648,2,0,70,N,5


In [109]:
del(data201712)

### January 2017

In [110]:
data201701 = pd.read_csv('Datos201701.csv', sep = ';')
data201701_clean = actionsDataFrameTraffic(data201701)
data201701_clean.to_csv('data201701_clean.csv', sep = ';', index = False)
del data201701

### February 2017

In [111]:
data201702 = pd.read_csv('Datos201702.csv', sep = ';')
data201702_clean = actionsDataFrameTraffic(data201702)
data201702_clean.to_csv('data201702_clean.csv', sep = ';', index = False)
del data201702

### March 2017

In [112]:
data201703 = pd.read_csv('Datos201703.csv', sep = ',') # comma separator
data201703_clean = actionsDataFrameTraffic(data201703)
data201703_clean.to_csv('data201703_clean.csv', sep = ';', index = False)
del data201703

### April 2017

In [113]:
data201704 = pd.read_csv('Datos201704.csv', sep = ';')
data201704_clean = actionsDataFrameTraffic(data201704)
data201704_clean.to_csv('data201704_clean.csv', sep = ';', index = False)
del data201704

### May 2017

In [114]:
data201705 = pd.read_csv('Datos201705.csv', sep = ';')
data201705_clean = actionsDataFrameTraffic(data201705)
data201705_clean.to_csv('data201705_clean.csv', sep = ';', index = False)
del data201705

### June 2017

In [115]:
data201706 = pd.read_csv('Datos201706.csv', sep = ';')
data201706_clean = actionsDataFrameTraffic(data201706)
data201706_clean.to_csv('data201706_clean.csv', sep = ';', index = False)
del data201706

### July 2017

In [116]:
data201707 = pd.read_csv('Datos201707.csv', sep = ';')
data201707_clean = actionsDataFrameTraffic(data201707)
data201707_clean.to_csv('data201707_clean.csv', sep = ';', index = False)
del data201707

### August 2017

In [117]:
data201708 = pd.read_csv('Datos201708.csv', sep = ';')
data201708_clean = actionsDataFrameTraffic(data201708)
data201708_clean.to_csv('data201708_clean.csv', sep = ';', index = False)
del data201708

### September 2017

In [118]:
data201709 = pd.read_csv('Datos201709.csv', sep = ';')
data201709_clean = actionsDataFrameTraffic(data201709)
data201709_clean.to_csv('data201709_clean.csv', sep = ';', index = False)
del data201709

### October 2017

In [119]:
data201710 = pd.read_csv('Datos201710.csv', sep = ';')
data201710_clean = actionsDataFrameTraffic(data201710)
data201710_clean.to_csv('data201710_clean.csv', sep = ';', index = False)
del data201710

### November 2017

In [120]:
data201711 = pd.read_csv('Datos201711.csv', sep = ';')
data201711_clean = actionsDataFrameTraffic(data201711)
data201711_clean.to_csv('data201711_clean.csv', sep = ';', index = False)
del data201711

### December 2017

In [122]:
data201712 = pd.read_csv('Datos201712.csv', sep = ';')
data201712_clean = actionsDataFrameTraffic(data201712)
data201712_clean.to_csv('data201712_clean.csv', sep = ';', index = False)
del data201712
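
If loading a whole month still strains the kernel's memory, the file could be read in pieces and filtered down to the chosen devices chunk by chunk. This is a sketch of that idea, not the method used above: `filter_in_chunks` is a hypothetical helper, the device ids in the demo are arbitrary, and `chunksize` is a standard `pd.read_csv` parameter.

```python
import os
import tempfile

import pandas as pd


def filter_in_chunks(path, device_ids, id_col='id', sep=';', chunksize=100_000):
    """Keep only the rows of the chosen devices, reading the file in pieces."""
    kept = []
    for chunk in pd.read_csv(path, sep=sep, chunksize=chunksize):
        kept.append(chunk[chunk[id_col].isin(device_ids)])
    return pd.concat(kept, ignore_index=True)


# Small stand-in file so the sketch runs on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo_month.csv')
pd.DataFrame({
    'id': [1001, 3478, 1002, 3478, 4301, 1003, 4301],
    'intensidad': [0, 1, 2, 3, 4, 5, 6],
}).to_csv(demo_path, sep=';', index=False)

# A tiny chunksize forces several chunks even on this small file
selected = filter_in_chunks(demo_path, {3478, 4301}, chunksize=3)
print(selected)
```

Only the filtered rows of each chunk are kept in memory, so the peak usage depends on `chunksize` rather than on the size of the monthly file.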

In [123]:
files_2017_clean = [data201701_clean, data201702_clean, data201703_clean, data201704_clean, data201705_clean, data201706_clean, 
             data201707_clean, data201708_clean, data201709_clean, data201710_clean, data201711_clean, data201712_clean]


In [124]:
# Concatenating all the cleaned 2017 months

data2017 = pd.concat(files_2017_clean, ignore_index = True)

In [125]:
data2017.shape

(321060, 17)

In [126]:
data2017.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2017-01-01 06:00:00,2017-01-01,06:00:00,6,Sunday,7,domingo,January,1,4305,URB,97,500.0,24,125.0,2,13
1,2017-01-01 06:45:00,2017-01-01,06:45:00,6,Sunday,7,domingo,January,1,4305,URB,81,500.0,20,125.0,2,14
2,2017-01-01 06:15:00,2017-01-01,06:15:00,6,Sunday,7,domingo,January,1,4305,URB,64,500.0,16,125.0,11,15
3,2017-01-01 06:30:00,2017-01-01,06:30:00,6,Sunday,7,domingo,January,1,4305,URB,55,500.0,13,125.0,2,10
4,2017-01-01 07:00:00,2017-01-01,07:00:00,7,Sunday,7,domingo,January,1,4305,URB,76,500.0,19,125.0,4,12


In [127]:
data2017.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [128]:
# Delete the individual dataframes to free memory since they are already concatenated

del(data201701_clean, data201702_clean, data201703_clean, data201704_clean, data201705_clean, data201706_clean, 
    data201707_clean, data201708_clean, data201709_clean, data201710_clean, data201711_clean, data201712_clean)

In [129]:
# Save the 2017 dataset to a csv, ready to use

data2017.to_csv('data2017.csv', sep = ';', index = False)

## 2018

In [130]:
# The context manager closes the zip file automatically
with zipfile.ZipFile('DatosTrafico2018.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [47]:
data201801 = pd.read_csv('Datos201801.csv', sep = ';')
data201801.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-01-01 00:00:00,PUNTOS MEDIDA M-30,204,12,0,73,N,5
1,1002,2018-01-01 00:00:00,PUNTOS MEDIDA M-30,252,1,0,79,N,5
2,1003,2018-01-01 00:00:00,PUNTOS MEDIDA M-30,420,2,0,82,N,5
3,1006,2018-01-01 00:00:00,PUNTOS MEDIDA M-30,288,1,0,75,N,5
4,1009,2018-01-01 00:00:00,PUNTOS MEDIDA M-30,276,0,0,76,N,5


In [48]:
del(data201801)

In [49]:
data201802 = pd.read_csv('Datos201802.csv', sep = ';')
data201802.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-02-01 00:00:00,M30,480,20,0,71,N,4
1,1002,2018-02-01 00:00:00,M30,660,3,0,72,N,4
2,1003,2018-02-01 00:00:00,M30,1155,4,0,75,N,4
3,1006,2018-02-01 00:00:00,M30,435,1,0,70,N,4
4,1009,2018-02-01 00:00:00,M30,690,1,0,65,N,4


In [50]:
del(data201802)

In [51]:
data201803 = pd.read_csv('Datos201803.csv', sep = ';')
data201803.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-03-01 00:00:00,M30,720,26,0,53,N,4
1,1002,2018-03-01 00:00:00,M30,540,2,0,68,N,4
2,1003,2018-03-01 00:00:00,M30,600,2,0,76,N,4
3,1006,2018-03-01 00:00:00,M30,690,2,0,59,N,4
4,1009,2018-03-01 00:00:00,M30,510,1,0,66,N,4


In [52]:
del(data201803)

In [53]:
data201804 = pd.read_csv('Datos201804.csv', sep = ';')
data201804.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-04-01 00:00:00,M30,1212,3,0,66,N,5
1,1002,2018-04-01 00:00:00,M30,1176,4,0,72,N,5
2,1003,2018-04-01 00:00:00,M30,1176,3,0,78,N,5
3,1006,2018-04-01 00:00:00,M30,948,3,0,67,N,5
4,1009,2018-04-01 00:00:00,M30,1116,2,0,67,N,5


In [54]:
del(data201804)

In [55]:
data201805 = pd.read_csv('Datos201805.csv', sep = ';')
data201805.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-05-01 00:00:00,M30,876,2,0,70,N,5
1,1002,2018-05-01 00:00:00,M30,816,3,0,74,N,5
2,1003,2018-05-01 00:00:00,M30,972,3,0,77,N,5
3,1006,2018-05-01 00:00:00,M30,852,2,0,69,N,5
4,1009,2018-05-01 00:00:00,M30,780,2,0,67,N,5


In [56]:
del(data201805)

In [57]:
data201806 = pd.read_csv('Datos201806.csv', sep = ';')
data201806.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-06-01 00:00:00,M30,804,2,0,61,N,5
1,1002,2018-06-01 00:00:00,M30,816,4,0,72,N,5
2,1003,2018-06-01 00:00:00,M30,900,4,0,74,N,5
3,1006,2018-06-01 00:00:00,M30,1020,3,0,68,N,5
4,1009,2018-06-01 00:00:00,M30,984,2,0,72,N,5


In [58]:
del(data201806)

In [59]:
data201807 = pd.read_csv('Datos201807.csv', sep = ';')
data201807.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-07-01 00:00:00,M30,1512,4,0,61,N,5
1,1002,2018-07-01 00:00:00,M30,1368,5,0,70,N,5
2,1003,2018-07-01 00:00:00,M30,1920,6,0,76,N,5
3,1006,2018-07-01 00:00:00,M30,1680,5,0,66,N,5
4,1009,2018-07-01 00:00:00,M30,1536,3,0,65,N,5


In [60]:
del(data201807)

In [61]:
data201808 = pd.read_csv('Datos201808.csv', sep = ';')
data201808.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-08-01 00:00:00,M30,684,1,0,53,N,5
1,1002,2018-08-01 00:00:00,M30,540,2,0,71,N,5
2,1003,2018-08-01 00:00:00,M30,732,1,0,79,N,5
3,1006,2018-08-01 00:00:00,M30,1212,3,0,71,N,5
4,1009,2018-08-01 00:00:00,M30,1284,2,0,67,N,5


In [62]:
del(data201808)

In [63]:
data201809 = pd.read_csv('Datos201809.csv', sep = ';')
data201809.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-09-01 00:00:00,M30,1140.0,3.0,,62.0,N,5
1,1001,2018-09-01 00:15:00,M30,1140.0,2.0,,55.0,N,5
2,1001,2018-09-01 00:30:00,M30,1488.0,4.0,,55.0,N,5
3,1001,2018-09-01 00:45:00,M30,1068.0,3.0,,55.0,N,5
4,1001,2018-09-01 01:00:00,M30,1224.0,3.0,,59.0,N,5


In [64]:
del(data201809)

In [65]:
data201810 = pd.read_csv('Datos201810.csv', sep = ';')
data201810.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-10-01 00:00:00,M30,696.0,1.0,,67.0,N,5
1,1001,2018-10-01 00:15:00,M30,816.0,2.0,,64.0,N,5
2,1001,2018-10-01 00:30:00,M30,696.0,2.0,,58.0,N,5
3,1001,2018-10-01 00:45:00,M30,444.0,1.0,,59.0,N,5
4,1001,2018-10-01 01:00:00,M30,300.0,1.0,,63.0,N,5


In [66]:
del(data201810)

In [67]:
data201811 = pd.read_csv('Datos201811.csv', sep = ';')
data201811.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-11-01 00:00:00,M30,828.0,3.0,,56.0,N,5
1,1001,2018-11-01 00:15:00,M30,1032.0,3.0,,54.0,N,5
2,1001,2018-11-01 00:30:00,M30,1080.0,2.0,,56.0,N,5
3,1001,2018-11-01 00:45:00,M30,912.0,2.0,,54.0,N,5
4,1001,2018-11-01 01:00:00,M30,768.0,3.0,,49.0,N,5


In [68]:
del(data201811)

In [69]:
data201812 = pd.read_csv('Datos201812.csv', sep = ';')
data201812.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2018-12-01 00:00:00,M30,804,2.0,0,60.0,N,5
1,1001,2018-12-01 00:15:00,M30,1260,4.0,0,59.0,N,5
2,1001,2018-12-01 00:30:00,M30,996,2.0,0,62.0,N,5
3,1001,2018-12-01 00:45:00,M30,732,2.0,0,69.0,N,5
4,1001,2018-12-01 01:00:00,M30,792,2.0,0,62.0,N,5


In [70]:
del(data201812)

### January 2018

In [71]:
data201801 = pd.read_csv('Datos201801.csv', sep = ';')
data201801_clean = actionsDataFrameTraffic(data201801)
data201801_clean.to_csv('data201801_clean.csv', sep = ';', index = False)
del data201801

### February 2018

In [72]:
data201802 = pd.read_csv('Datos201802.csv', sep = ';')
data201802_clean = actionsDataFrameTraffic(data201802)
data201802_clean.to_csv('data201802_clean.csv',  sep = ';', index = False)
del data201802

### March 2018

In [73]:
data201803 = pd.read_csv('Datos201803.csv', sep = ';')
data201803_clean = actionsDataFrameTraffic(data201803)
data201803_clean.to_csv('data201803_clean.csv',  sep = ';', index = False)
del data201803

### April 2018

In [74]:
data201804 = pd.read_csv('Datos201804.csv', sep = ';')
data201804_clean = actionsDataFrameTraffic(data201804)
data201804_clean.to_csv('data201804_clean.csv',  sep = ';', index = False)
del data201804

### May 2018

In [75]:
data201805 = pd.read_csv('Datos201805.csv', sep = ';')
data201805_clean = actionsDataFrameTraffic(data201805)
data201805_clean.to_csv('data201805_clean.csv',  sep = ';', index = False)
del data201805

### June 2018

In [76]:
data201806 = pd.read_csv('Datos201806.csv', sep = ';')
data201806_clean = actionsDataFrameTraffic(data201806)
data201806_clean.to_csv('data201806_clean.csv',  sep = ';', index = False)
del data201806

### July 2018

In [77]:
data201807 = pd.read_csv('Datos201807.csv', sep = ';')
data201807_clean = actionsDataFrameTraffic(data201807)
data201807_clean.to_csv('data201807_clean.csv',  sep = ';', index = False)
del data201807

### August 2018

In [78]:
data201808 = pd.read_csv('Datos201808.csv', sep = ';')
data201808_clean = actionsDataFrameTraffic(data201808)
data201808_clean.to_csv('data201808_clean.csv',  sep = ';', index = False)
del data201808

### September 2018

In [79]:
data201809 = pd.read_csv('Datos201809.csv', sep = ';')
data201809_clean = actionsDataFrameTraffic(data201809)
data201809_clean.to_csv('data201809_clean.csv',  sep = ';', index = False)
del data201809

### October 2018

In [80]:
data201810 = pd.read_csv('Datos201810.csv', sep = ';')
data201810_clean = actionsDataFrameTraffic(data201810)
data201810_clean.to_csv('data201810_clean.csv', sep = ';', index = False)
del data201810

### November 2018

In [81]:
data201811 = pd.read_csv('Datos201811.csv', sep = ';')
data201811_clean = actionsDataFrameTraffic(data201811)
data201811_clean.to_csv('data201811_clean.csv', sep = ';', index = False)
del data201811

### December 2018

In [82]:
data201812 = pd.read_csv('Datos201812.csv', sep = ';')
data201812_clean = actionsDataFrameTraffic(data201812)
data201812_clean.to_csv('data201812_clean.csv',  sep = ';', index = False)
del data201812

In [83]:
files_2018_clean = [data201801_clean, data201802_clean, data201803_clean, data201804_clean, data201805_clean, data201806_clean, 
             data201807_clean, data201808_clean, data201809_clean, data201810_clean, data201811_clean, data201812_clean]


In [84]:
# Concatenating all the cleaned 2018 months

data2018 = pd.concat(files_2018_clean, ignore_index = True)

In [86]:
data2018.shape

(327030, 17)

In [87]:
data2018.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2018-01-01 06:00:00,2018-01-01,06:00:00,6,Monday,1,festivo,January,1,3478,URB,400.0,4000.0,100.0,1000.0,3.0,12.0
1,2018-01-01 06:15:00,2018-01-01,06:15:00,6,Monday,1,festivo,January,1,3478,URB,361.0,4000.0,90.0,1000.0,2.0,11.0
2,2018-01-01 06:30:00,2018-01-01,06:30:00,6,Monday,1,festivo,January,1,3478,URB,483.0,4000.0,120.0,1000.0,3.0,14.0
3,2018-01-01 06:45:00,2018-01-01,06:45:00,6,Monday,1,festivo,January,1,3478,URB,468.0,4000.0,117.0,1000.0,4.0,14.0
4,2018-01-01 07:00:00,2018-01-01,07:00:00,7,Monday,1,festivo,January,1,3478,URB,434.0,4000.0,108.0,1000.0,3.0,13.0


In [88]:
data2018.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad        True
intensidadSat    False
volumen           True
volumenSat       False
ocupacion         True
carga             True
dtype: bool

#### Dealing with NaN values using 'ffill'

In [89]:
data2018[data2018['intensidad'].isnull()]

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
228777,2018-09-01 07:15:00,2018-09-01,07:15:00,7,Saturday,6,sabado,September,9,4301,URB,,1850.0,,462.0,47.0,37.0
228778,2018-09-01 09:00:00,2018-09-01,09:00:00,9,Saturday,6,sabado,September,9,4301,URB,,1850.0,,462.0,57.0,33.0
228779,2018-09-02 06:15:00,2018-09-02,06:15:00,6,Sunday,7,domingo,September,9,4301,URB,,1850.0,,462.0,44.0,33.0
228780,2018-09-02 08:45:00,2018-09-02,08:45:00,8,Sunday,7,domingo,September,9,4301,URB,,1850.0,,462.0,50.0,33.0
228781,2018-09-02 13:30:00,2018-09-02,13:30:00,13,Sunday,7,domingo,September,9,4301,URB,,1850.0,,462.0,41.0,36.0
228782,2018-09-03 09:00:00,2018-09-03,09:00:00,9,Monday,1,laborable,September,9,4301,URB,,1850.0,,462.0,50.0,41.0
228784,2018-09-03 12:30:00,2018-09-03,12:30:00,12,Monday,1,laborable,September,9,4301,URB,,1850.0,,462.0,50.0,41.0
228786,2018-09-03 13:15:00,2018-09-03,13:15:00,13,Monday,1,laborable,September,9,4301,URB,,1850.0,,462.0,50.0,18.0
228787,2018-09-26 09:45:00,2018-09-26,09:45:00,9,Wednesday,3,laborable,September,9,4301,URB,,1850.0,,462.0,87.0,48.0
231263,2018-09-07 21:45:00,2018-09-07,21:45:00,21,Friday,5,laborable,September,9,4353,URB,,2400.0,,600.0,66.0,42.0


#### Method for filling NaN: after checking where the NaNs are, I consider the neighbouring 15-minute readings more realistic than the mean, so I fill the NaN values with the 'ffill' method, which copies the last valid value from the preceding rows. In any case, these are 90 observations out of 327030 (0.0275%) of the 2018 dataset


In [90]:
# Forward-fill the four columns that contain NaN; assigning the result back
# avoids calling fillna(..., inplace = True) on a column selection
nan_columns = ['intensidad', 'ocupacion', 'carga', 'volumen']
data2018[nan_columns] = data2018[nan_columns].ffill()
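
One caveat with a plain forward fill on the concatenated frame: if a NaN happens to be the first row of a device, it inherits the last value of the previous device. A possible variation, sketched here on a toy frame (the values are made up, and it assumes the rows are already in chronological order within each `id`), is to fill within each device only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': [3478, 3478, 4301, 4301],
    'intensidad': [400.0, np.nan, np.nan, 120.0],
})

# Plain forward fill: the first row of device 4301 takes the value of device 3478
plain = toy['intensidad'].ffill()

# Grouped forward fill: values only propagate within the same device id,
# so a leading NaN of a device stays NaN and has to be handled separately
grouped = toy.groupby('id')['intensidad'].ffill()
```

Whether the difference matters depends on where the 90 NaN rows fall, which is something to check on the real data rather than assume.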

In [91]:
data2018.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [92]:
# Delete the individual dataframes to free memory, since they are already concatenated

del(data201801_clean, data201802_clean, data201803_clean, data201804_clean, data201805_clean, data201806_clean, 
    data201807_clean, data201808_clean, data201809_clean, data201810_clean, data201811_clean, data201812_clean)

In [93]:
# Save the 2018 dataset to a CSV, ready to use!

data2018.to_csv('data2018.csv',sep = ';', index = False)

## 2019

In [94]:
# Unzipping the traffic data for 2019 (the context manager closes the file automatically)
with zipfile.ZipFile('DatosTrafico2019.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [95]:
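# Note: this cell loads Datos201902.csv (February) although the variable is named data201901,
# so the preview below shows February rows
data201901 = pd.read_csv('Datos201902.csv', sep = ';')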
data201901.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2019-02-01 00:00:00,M30,12,,0,10.0,N,5
1,1001,2019-02-01 00:15:00,M30,36,,0,34.0,N,5
2,1001,2019-02-01 00:30:00,M30,0,0.0,0,0.0,N,5
3,1001,2019-02-01 00:45:00,M30,0,0.0,0,0.0,N,5
4,1001,2019-02-01 01:00:00,M30,0,0.0,0,0.0,N,5


In [96]:
del(data201901)

In [97]:
data201902 = pd.read_csv('Datos201902.csv', sep = ';')
data201902.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2019-02-01 00:00:00,M30,12,,0,10.0,N,5
1,1001,2019-02-01 00:15:00,M30,36,,0,34.0,N,5
2,1001,2019-02-01 00:30:00,M30,0,0.0,0,0.0,N,5
3,1001,2019-02-01 00:45:00,M30,0,0.0,0,0.0,N,5
4,1001,2019-02-01 01:00:00,M30,0,0.0,0,0.0,N,5


In [98]:
del(data201902)

In [99]:
data201903 = pd.read_csv('Datos201903.csv', sep = ';')
data201903.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2019-03-01 00:00:00,M30,32,0.0,0,19.0,N,5
1,1001,2019-03-01 00:15:00,M30,0,0.0,0,0.0,N,5
2,1001,2019-03-01 00:30:00,M30,0,0.0,0,0.0,N,5
3,1001,2019-03-01 00:45:00,M30,32,0.0,0,23.0,N,5
4,1001,2019-03-01 01:00:00,M30,16,0.0,0,13.0,N,5


In [100]:
del(data201903)

In [101]:
data201904 = pd.read_csv('Datos201904.csv', sep = ';')
data201904.head()

Unnamed: 0,id,fecha,tipo_elem,intensidad,ocupacion,carga,vmed,error,periodo_integracion
0,1001,2019-04-01 00:00:00,M30,828,2.0,0,61.0,N,5
1,1001,2019-04-01 00:15:00,M30,684,2.0,0,62.0,N,5
2,1001,2019-04-01 00:30:00,M30,396,2.0,0,60.0,N,5
3,1001,2019-04-01 00:45:00,M30,288,1.0,0,54.0,N,5
4,1001,2019-04-01 01:00:00,M30,480,1.0,0,60.0,N,5


In [102]:
del(data201904)
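
Since these previews only need the first rows, the memory problems mentioned at the top of the notebook can be sidestepped by asking `read_csv` for a handful of rows instead of loading the whole month and deleting it afterwards. A minimal sketch, using a temporary demo file (`Datos_demo.csv` is a made-up name) rather than the real monthly files:

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Datos_demo.csv')  # hypothetical file name
    demo = pd.DataFrame({'id': range(1000), 'intensidad': range(1000)})
    demo.to_csv(path, sep=';', index=False)

    # nrows limits parsing to the first 5 data rows of the file
    preview = pd.read_csv(path, sep=';', nrows=5)
```

For the real data the call would look like `pd.read_csv('Datos201904.csv', sep = ';', nrows = 5)`.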

### January 2019

In [103]:
data201901 = pd.read_csv('Datos201901.csv', sep = ';')
data201901_clean = actionsDataFrameTraffic(data201901)
data201901_clean.to_csv('data201901_clean.csv', sep = ';', index = False)
del data201901

### February 2019

In [104]:
data201902 = pd.read_csv('Datos201902.csv', sep = ';')
data201902_clean = actionsDataFrameTraffic(data201902)
data201902_clean.to_csv('data201902_clean.csv',sep =';', index = False)
del data201902

### March 2019

In [105]:
data201903 = pd.read_csv('Datos201903.csv', sep = ';')
data201903_clean = actionsDataFrameTraffic(data201903)
data201903_clean.to_csv('data201903_clean.csv', sep = ';', index = False)
del data201903

### April 2019

In [106]:
data201904 = pd.read_csv('Datos201904.csv', sep = ';')
data201904_clean = actionsDataFrameTraffic(data201904)
data201904_clean.to_csv('data201904_clean.csv', sep = ';', index = False)
del data201904

In [107]:
files_2019_clean = [data201901_clean, data201902_clean, data201903_clean, data201904_clean]


In [108]:
# Concatenate all the cleaned 2019 months

data2019 = pd.concat(files_2019_clean, ignore_index = True)

In [109]:
data2019.shape

(109744, 17)

In [110]:
data2019.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2019-01-01 06:00:00,2019-01-01,06:00:00,6,Tuesday,2,festivo,January,1,3478,URB,196,4000.0,49,1000.0,1.0,6
1,2019-01-01 06:15:00,2019-01-01,06:15:00,6,Tuesday,2,festivo,January,1,3478,URB,252,4000.0,63,1000.0,2.0,9
2,2019-01-01 06:30:00,2019-01-01,06:30:00,6,Tuesday,2,festivo,January,1,3478,URB,259,4000.0,64,1000.0,2.0,9
3,2019-01-01 06:45:00,2019-01-01,06:45:00,6,Tuesday,2,festivo,January,1,3478,URB,288,4000.0,72,1000.0,4.0,10
4,2019-01-01 07:00:00,2019-01-01,07:00:00,7,Tuesday,2,festivo,January,1,3478,URB,180,4000.0,45,1000.0,2.0,6


In [111]:
data2019.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [122]:
data2019.dtypes

fecha            datetime64[ns]
Fecha_corta      datetime64[ns]
Hora                     object
Hour24                   object
Dia_nombre               object
Dia_numero                int64
Tipo_dia                 object
Mes_nombre               object
Mes_numero                int64
id                        int64
tipo_elem                object
intensidad                int64
intensidadSat           float64
volumen                   int64
volumenSat              float64
ocupacion               float64
carga                     int64
dtype: object

In [112]:
# Delete the individual dataframes to free memory, since they are already concatenated

del(data201901_clean,data201902_clean, data201903_clean, data201904_clean)

In [113]:
# Save the 2019 dataset to a CSV, ready to use!

data2019.to_csv('data2019.csv', sep = ';', index = False)

## With the traffic data ready, we now merge it with the devices location file

#### First of all we concatenate the observations from all the years into one dataframe

In [115]:
data_compilation = [data2015, data2016, data2017, data2018, data2019]

In [126]:
del(data2015, data2016, data2017, data2018, data2019)

In [128]:
data_traffic = pd.concat(data_compilation, ignore_index = True)

In [129]:
data_traffic.shape

(1361995, 17)

In [130]:
data_traffic.isnull().any()

fecha            False
Fecha_corta      False
Hora             False
Hour24           False
Dia_nombre       False
Dia_numero       False
Tipo_dia         False
Mes_nombre       False
Mes_numero       False
id               False
tipo_elem        False
intensidad       False
intensidadSat    False
volumen          False
volumenSat       False
ocupacion        False
carga            False
dtype: bool

In [131]:
data_traffic.dtypes

fecha             object
Fecha_corta       object
Hora              object
Hour24            object
Dia_nombre        object
Dia_numero         int64
Tipo_dia          object
Mes_nombre        object
Mes_numero         int64
id                 int64
tipo_elem         object
intensidad       float64
intensidadSat    float64
volumen          float64
volumenSat       float64
ocupacion        float64
carga            float64
dtype: object
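
Note that `fecha` was `datetime64[ns]` in `data2019` but is `object` after the concatenation, presumably because some of the yearly frames hold the dates as plain strings. If real datetimes are needed later (for example for the time series), the column can be converted back. This is a sketch on a toy frame and assumes every value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the previews:

```python
import pandas as pd

toy = pd.DataFrame({'fecha': ['2015-01-12 10:45:00', '2018-01-01 06:00:00']})

# Parse the strings into real timestamps using an explicit format
toy['fecha'] = pd.to_datetime(toy['fecha'], format='%Y-%m-%d %H:%M:%S')

# Datetime columns expose the .dt accessor, e.g. to extract the hour
hours = toy['fecha'].dt.hour
```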

In [132]:
data_traffic.head()

# the Fecha_corta column changed format (it now includes the time), so I will slice it to remove the hours

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2015-01-12 10:45:00,2015-01-12 00:00:00,10:45:00,10,Monday,1,laborable,January,1,7138,URB,2886.0,5770.0,721.0,1442.0,14.0,50.0
1,2015-01-15 20:45:00,2015-01-15 00:00:00,20:45:00,20,Thursday,4,laborable,January,1,7138,URB,1975.0,5770.0,493.0,1442.0,6.0,31.0
2,2015-01-12 16:00:00,2015-01-12 00:00:00,16:00:00,16,Monday,1,laborable,January,1,7138,URB,2603.0,5770.0,650.0,1442.0,10.0,42.0
3,2015-01-12 17:45:00,2015-01-12 00:00:00,17:45:00,17,Monday,1,laborable,January,1,7138,URB,2238.0,5770.0,559.0,1442.0,10.0,38.0
4,2015-01-12 11:00:00,2015-01-12 00:00:00,11:00:00,11,Monday,1,laborable,January,1,7138,URB,2881.0,5770.0,720.0,1442.0,12.0,45.0


In [133]:
# Keep only the first 10 characters (YYYY-MM-DD), since the Fecha_corta column changed format

data_traffic['Fecha_corta'] = data_traffic['Fecha_corta'].map(lambda x: str(x)[:10])
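
The same slicing can also be written with the vectorised `.str` accessor, which avoids the Python-level lambda. A small sketch on made-up values covering both formats that appear in the column:

```python
import pandas as pd

toy = pd.Series(['2015-01-12 00:00:00', '2018-01-01'])

# Cast to string first, then keep the first 10 characters (YYYY-MM-DD)
short = toy.astype(str).str[:10]
```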

In [134]:
data_traffic.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga
0,2015-01-12 10:45:00,2015-01-12,10:45:00,10,Monday,1,laborable,January,1,7138,URB,2886.0,5770.0,721.0,1442.0,14.0,50.0
1,2015-01-15 20:45:00,2015-01-15,20:45:00,20,Thursday,4,laborable,January,1,7138,URB,1975.0,5770.0,493.0,1442.0,6.0,31.0
2,2015-01-12 16:00:00,2015-01-12,16:00:00,16,Monday,1,laborable,January,1,7138,URB,2603.0,5770.0,650.0,1442.0,10.0,42.0
3,2015-01-12 17:45:00,2015-01-12,17:45:00,17,Monday,1,laborable,January,1,7138,URB,2238.0,5770.0,559.0,1442.0,10.0,38.0
4,2015-01-12 11:00:00,2015-01-12,11:00:00,11,Monday,1,laborable,January,1,7138,URB,2881.0,5770.0,720.0,1442.0,12.0,45.0


## This is the file you can use directly from Notebook 0

In [136]:
# Save the full traffic dataset (all years) to a CSV, ready to use!

data_traffic.to_csv('data_traffic.csv', sep = ';', index = False)


### Opening devices_to_analize.csv to merge with the data_traffic

In [137]:
devices = pd.read_csv('devices_to_analize.csv', sep = ';', usecols = ['id','nombre','distrito','latitud', 'longitud'])

In [138]:
len(devices['id'])

14

In [139]:
len(devices['id'].unique())

14

In [140]:
devices.head()

Unnamed: 0,distrito,id,latitud,longitud,nombre
0,1.0,3478,40.407849,-3.712531,(AFOROS)GRAN VMA DE SAN FRANCISCO N-S(AGUILA-P...
1,7.0,3848,40.426925,-3.694187,(AFOROS) Genova 13 E-O - Zurbano-Campoamor
2,1.0,3850,40.426004,-3.692525,(AFOROS) Genova O-E - General Castaños-Pl. Colon
3,2.0,4211,40.405533,-3.700601,(AFOROS) RONDA VALENCIA O-E(MESON DE PAREDES-F...
4,1.0,4265,40.409298,-3.713404,(AFOROS)Gran Vía San Francisco S-N - San Berna...


### Merging with the data_traffic

In [141]:
final_data = pd.merge(data_traffic, devices, how = 'left', on = 'id')
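
A left merge silently duplicates rows if the right-hand table has repeated keys, and leaves NaN where a key has no match. The uniqueness of `id` was checked by hand above; `pd.merge` can also enforce it through its `validate` argument, and `indicator` flags unmatched rows. The frames below are toy data (the id 9999 is invented to show an unmatched case):

```python
import pandas as pd

traffic = pd.DataFrame({
    'id': [7138, 7138, 3478, 9999],
    'intensidad': [2886.0, 1975.0, 400.0, 10.0],
})
devices_demo = pd.DataFrame({'id': [7138, 3478], 'distrito': [1.0, 1.0]})

# validate='many_to_one' raises MergeError if 'id' is duplicated in devices_demo;
# indicator=True adds a '_merge' column telling where each row found a match
merged = pd.merge(traffic, devices_demo, how='left', on='id',
                  validate='many_to_one', indicator=True)

# Traffic ids with no matching device (their device columns are NaN)
unmatched_ids = merged.loc[merged['_merge'] == 'left_only', 'id'].unique().tolist()
```

An empty `unmatched_ids` on the real data would confirm that every observation found its device.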

In [144]:
final_data.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga,distrito,latitud,longitud,nombre
0,2015-01-12 10:45:00,2015-01-12,10:45:00,10,Monday,1,laborable,January,1,7138,URB,2886.0,5770.0,721.0,1442.0,14.0,50.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...
1,2015-01-15 20:45:00,2015-01-15,20:45:00,20,Thursday,4,laborable,January,1,7138,URB,1975.0,5770.0,493.0,1442.0,6.0,31.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...
2,2015-01-12 16:00:00,2015-01-12,16:00:00,16,Monday,1,laborable,January,1,7138,URB,2603.0,5770.0,650.0,1442.0,10.0,42.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...
3,2015-01-12 17:45:00,2015-01-12,17:45:00,17,Monday,1,laborable,January,1,7138,URB,2238.0,5770.0,559.0,1442.0,10.0,38.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...
4,2015-01-12 11:00:00,2015-01-12,11:00:00,11,Monday,1,laborable,January,1,7138,URB,2881.0,5770.0,720.0,1442.0,12.0,45.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...


In [145]:
final_data.shape

(1361995, 21)

### For the Time Series and further analysis I add 3 new columns

#### Create a function to create a column that indicates when Madrid Central starts

In [146]:
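def Madrid_Central(x):
    # x is an ISO 'YYYY-MM-DD' string; such strings sort chronologically,
    # so a plain string comparison is enough
    if x >= '2018-12-01':
        return 1
    else:
        return 0
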

#### Create a function to create a column that indicates when the Gran Via remodelling works were under way

In [147]:
def Gran_Via_remodelation(x):
    # Both ends of the interval are included; `and` is the idiomatic operator for scalar booleans
    if (x >= '2018-03-09') and (x <= '2018-11-23'):
        return 1
    else:
        return 0

In [148]:
final_data['Madrid-Central'] = final_data['Fecha_corta'].apply(Madrid_Central)

In [149]:
final_data['Gran-Via-remodelation'] = final_data['Fecha_corta'].apply(Gran_Via_remodelation)
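
With more than a million rows, a row-by-row `apply` is noticeably slower than a vectorised comparison. The sketch below builds the same two flags on a toy frame by parsing the dates once and comparing whole columns; the cut-off dates are the ones used in the functions above, and the sample rows are invented to sit on the boundaries.

```python
import pandas as pd

toy = pd.DataFrame({'Fecha_corta': ['2018-03-08', '2018-03-09', '2018-11-23', '2018-12-01']})
dates = pd.to_datetime(toy['Fecha_corta'], format='%Y-%m-%d')

# 1 from 2018-12-01 onwards, 0 before
toy['Madrid-Central'] = (dates >= pd.Timestamp('2018-12-01')).astype(int)

# 1 between 2018-03-09 and 2018-11-23, both ends included (the default for between)
toy['Gran-Via-remodelation'] = dates.between(
    pd.Timestamp('2018-03-09'), pd.Timestamp('2018-11-23')
).astype(int)
```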

#### Create a column with Year and month for the Time Series

In [152]:
final_data['Año_mes'] = final_data['Fecha_corta'].map(lambda x: str(x)[:7])

In [153]:
final_data.head()

Unnamed: 0,fecha,Fecha_corta,Hora,Hour24,Dia_nombre,Dia_numero,Tipo_dia,Mes_nombre,Mes_numero,id,tipo_elem,intensidad,intensidadSat,volumen,volumenSat,ocupacion,carga,distrito,latitud,longitud,nombre,Madrid-Central,Gran-Via-remodelation,Año_mes
0,2015-01-12 10:45:00,2015-01-12,10:45:00,10,Monday,1,laborable,January,1,7138,URB,2886.0,5770.0,721.0,1442.0,14.0,50.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...,0,0,2015-01
1,2015-01-15 20:45:00,2015-01-15,20:45:00,20,Thursday,4,laborable,January,1,7138,URB,1975.0,5770.0,493.0,1442.0,6.0,31.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...,0,0,2015-01
2,2015-01-12 16:00:00,2015-01-12,16:00:00,16,Monday,1,laborable,January,1,7138,URB,2603.0,5770.0,650.0,1442.0,10.0,42.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...,0,0,2015-01
3,2015-01-12 17:45:00,2015-01-12,17:45:00,17,Monday,1,laborable,January,1,7138,URB,2238.0,5770.0,559.0,1442.0,10.0,38.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...,0,0,2015-01
4,2015-01-12 11:00:00,2015-01-12,11:00:00,11,Monday,1,laborable,January,1,7138,URB,2881.0,5770.0,720.0,1442.0,12.0,45.0,1.0,40.412659,-3.692902,(AFOROS) Pº del Prado S-N - Espalter-Pl.Canova...,0,0,2015-01


In [151]:
final_data.dtypes

fecha                     object
Fecha_corta               object
Hora                      object
Hour24                    object
Dia_nombre                object
Dia_numero                 int64
Tipo_dia                  object
Mes_nombre                object
Mes_numero                 int64
id                         int64
tipo_elem                 object
intensidad               float64
intensidadSat            float64
volumen                  float64
volumenSat               float64
ocupacion                float64
carga                    float64
distrito                 float64
latitud                  float64
longitud                 float64
nombre                    object
Madrid-Central             int64
Gran-Via-remodelation      int64
dtype: object

### Let's reorder the columns so that related fields sit together

In [154]:
columnsOrder = ['fecha','Año_mes','Fecha_corta', 'Hora', 'Hour24', 'Dia_nombre', 'Dia_numero','Tipo_dia', 'Mes_nombre', 
                'Mes_numero', 'id','distrito', 'tipo_elem', 'intensidad','intensidadSat','volumen','volumenSat','ocupacion',
                'carga','latitud','longitud','nombre', 'Madrid-Central', 'Gran-Via-remodelation']

In [155]:
final_data = final_data[columnsOrder]
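
Selecting with a hand-written list drops any column that is missing from the list and raises a `KeyError` for any misspelled name. A quick set comparison before reordering catches both cases. This is a sketch on a toy frame; `order` plays the role of `columnsOrder`.

```python
import pandas as pd

toy = pd.DataFrame({'b': [1], 'a': [2], 'c': [3]})
order = ['a', 'b', 'c']

missing = set(toy.columns) - set(order)   # columns that would be silently dropped
unknown = set(order) - set(toy.columns)   # names that would raise a KeyError

# Reorder only when the list is an exact permutation of the existing columns
if not missing and not unknown:
    toy = toy[order]
```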

In [157]:
final_data.shape

(1361995, 24)

In [158]:
# Saving it to CSV

final_data.to_csv('final_data.csv', sep = ';', index = False)

#### Continue with Notebook 3