# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## Authors
Write the names (first and last) of the team members responsible for developing this solution.

## External links and resources
Paste here all the links to external resources that are necessary to understand and run your code. Add descriptions to make it clear how to use them during evaluation.

## Introduction
Describe how you framed the challenge: tell us what problem you are trying to solve and how your solution solves it.

## Development
Start coding here! 👩‍💻

Don't hesitate to create markdown cells to include descriptions of your work where you see fit, as well as commenting your code.

We know that you know exactly where to start when it comes to crunching data and building models, but don't forget that WDL is all about social impact, so take that into consideration as well.

In [76]:
import pandas as pd
import shapefile  # provided by the pyshp package

def read_shapefile(shp_path):
    """
    Read a shapefile into a pandas DataFrame with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """

    # Read the file and parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    # The first entry of sf.fields is pyshp's deletion flag, not a data column
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    # One list of (x, y) vertices per shape
    shps = [s.points for s in sf.shapes()]

    # Write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)

    return df

In [80]:
import shapefile

tmp = read_shapefile('data/census_data/Torino_ACE81_3003.shp')
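The `coords` column produced by `read_shapefile` holds one list of `(x, y)` vertices per shape. As a minimal sketch of how that column might be used, the block below derives a bounding box per row. The frame, the `zone_id` column and all coordinates are made up for illustration; the real census shapefile may use a projected coordinate system rather than latitude and longitude, so its values would not be directly comparable with the sensor positions.

```python
import pandas as pd

# Toy stand-in for the frame returned by read_shapefile (hypothetical values):
# one row per shape, 'coords' is a list of (x, y) vertex tuples.
zones = pd.DataFrame({
    "zone_id": [1, 2],
    "coords": [
        [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0), (0.0, 0.0)],
        [(2.0, 0.0), (5.0, 0.0), (5.0, 4.0), (2.0, 4.0), (2.0, 0.0)],
    ],
})

def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

# Expand the per-row tuples into four numeric columns
bboxes = zones["coords"].apply(bounding_box)
zones[["min_x", "min_y", "max_x", "max_y"]] = pd.DataFrame(
    bboxes.tolist(), index=zones.index
)
```

A bounding box is only a coarse summary of a polygon; an exact point-in-polygon test would need a geometry library.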

In [2]:
import pandas as pd

In [68]:
sensor_list = pd.read_csv('data/noise_sensor_list.csv', sep = ';')
sensor_list['Sensor_ID'] = ['C1', 'C2', 'C3', 'C4', 'C5']

sensor_list['Lat'] = sensor_list['Lat'].str.replace(',', '.').astype(float)
sensor_list['Long'] = sensor_list['Long'].str.replace(',', '.').astype(float)
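The two `str.replace` calls above convert decimal commas to decimal points after loading. A possible alternative, sketched here on a small inline sample rather than the real file, is to let `pd.read_csv` handle the decimal comma through its `decimal` parameter. The sample layout is an assumption that mimics the sensor list; the actual CSV may contain more columns.

```python
import io
import pandas as pd

# Inline sample mimicking the sensor list: ';' separated, decimal commas.
# Values are copied from two sensors of this notebook; the layout is assumed.
raw = io.StringIO(
    "code;address;Lat;Long\n"
    "s_01;Via Saluzzo, 26 Torino;45,059172;7,678986\n"
    "s_06;Corso Marconi, 27 Torino;45,055554;7,68259\n"
)

# decimal=',' makes pandas parse '45,059172' as the float 45.059172,
# while text columns such as 'address' keep their commas untouched.
sensors = pd.read_csv(raw, sep=";", decimal=",")
```

With this option the coordinate columns arrive as floats, so no post-processing of `Lat` and `Long` is needed.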

In [69]:
sensor_list

| | code | address | Lat | Long | streaming | Sensor_ID |
|---|---|---|---|---|---|---|
| 0 | s_01 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 | https://userportal.smartdatanet.it/userportal/... | C1 |
| 1 | s_02 | Via Principe Tommaso, 18bis Torino | 45.057837 | 7.681555 | https://userportal.smartdatanet.it/userportal/... | C2 |
| 2 | s_03 | Largo Saluzzo Torino | 45.058518 | 7.678854 | https://userportal.smartdatanet.it/userportal/... | C3 |
| 3 | s_05 | Via Principe Tommaso angolo via Baretti Torino | 45.057603 | 7.681348 | https://userportal.smartdatanet.it/userportal/... | C4 |
| 4 | s_06 | Corso Marconi, 27 Torino | 45.055554 | 7.68259 | https://userportal.smartdatanet.it/userportal/... | C5 |


Two other noise sensors, TTO-001 and TT0-002, are located outside San Salvario, in areas with an increasingly busy nightlife. They are ignored for now.

In [71]:
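def load_noise_data(file_paths, sensor_list):
    """
    Load the hourly noise CSV files, reshape them from wide format (one column
    per sensor) to long format (one row per sensor and hour), and attach the
    sensor metadata (address, latitude, longitude) from sensor_list.
    """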
    concat_data = []
    for file in file_paths:
        df = pd.read_csv(file, header=8, sep=';')
        df = df.melt(id_vars=['Data', 'Ora'])
        df['Timestamp'] = pd.to_datetime(df['Data'] + ' ' + df['Ora'])
        df.columns = ['Date', 'Time', 'Sensor_ID', 'Intensity', 'Timestamp']

        concat_data.append(df)

    concat_df = pd.concat(concat_data)

    output = concat_df.merge(sensor_list, on=['Sensor_ID'])
    return output[['Timestamp', 'Sensor_ID', 'Intensity', 'address', 'Lat', 'Long']]
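Two steps in `load_noise_data` are worth illustrating on a tiny sample: the wide-to-long reshape with `melt`, and the date parsing. The sketch below assumes the `Data` column is written day-first (`dd-mm-yyyy`), which the notebook does not confirm. One hint in that direction is the size of the loaded data: roughly 25,680 hourly rows across 5 sensors is about 214 days, which matches 1 June to 31 December rather than a series starting on 6 January. The sample values are illustrative only.

```python
import io
import pandas as pd

# Inline sample in the assumed raw layout: one column per sensor, ';' separated.
raw = io.StringIO(
    "Data;Ora;C1;C2\n"
    "01-06-2016;00:00;611;598\n"
    "01-06-2016;01:00;572;560\n"
)

wide = pd.read_csv(raw, sep=";")

# Wide to long: one row per (date, hour, sensor)
long_df = wide.melt(
    id_vars=["Data", "Ora"], var_name="Sensor_ID", value_name="Intensity"
)

stamp = long_df["Data"] + " " + long_df["Ora"]

# Explicit day-first format (assumption): '01-06-2016' is read as 1 June 2016
long_df["Timestamp"] = pd.to_datetime(stamp, format="%d-%m-%Y %H:%M")

# For comparison, a month-first reading turns the same string into 6 January 2016
month_first = pd.to_datetime(stamp, format="%m-%d-%Y %H:%M")
```

Passing an explicit `format` removes the ambiguity for days 1 to 12, where both readings are valid dates; which convention applies should be checked against the data provider's documentation.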

In [114]:
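# Defined here so that the cells run from top to bottom
file_paths = ['data/noise_data/san_salvario_2016.csv']
data = load_noise_data(file_paths, sensor_list)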

In [111]:
data

| | Timestamp | Sensor_ID | Intensity | address | Lat | Long |
|---|---|---|---|---|---|---|
| 0 | 2016-01-06 00:00:00 | C1 | 611 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 |
| 1 | 2016-01-06 01:00:00 | C1 | 572 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 |
| 2 | 2016-01-06 02:00:00 | C1 | 525 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 |
| 3 | 2016-01-06 03:00:00 | C1 | 506 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 |
| 4 | 2016-01-06 04:00:00 | C1 | 474 | Via Saluzzo, 26 Torino | 45.059172 | 7.678986 |
| ... | ... | ... | ... | ... | ... | ... |
| 25675 | 2016-12-31 19:00:00 | C5 | 618 | Corso Marconi, 27 Torino | 45.055554 | 7.682590 |
| 25676 | 2016-12-31 20:00:00 | C5 | 639 | Corso Marconi, 27 Torino | 45.055554 | 7.682590 |
| 25677 | 2016-12-31 21:00:00 | C5 | 592 | Corso Marconi, 27 Torino | 45.055554 | 7.682590 |
| 25678 | 2016-12-31 22:00:00 | C5 | 576 | Corso Marconi, 27 Torino | 45.055554 | 7.682590 |


In [30]:
file_paths = ['data/noise_data/san_salvario_2016.csv']

In [102]:
# Build the prediction target (label) from a date offset: with the default of
# 24 hours, the label is the intensity at the same time on the next day.
def create_target(df_resampled, date_col = 'Timestamp', target_col = 'Intensity', entity_id='Sensor_ID', date_offset = 24):
    """
    Function for creating lagged or future features for a specific date offset (in hours).
    By default, this adds a new column with the intensity value 24 hours in the future for each row.
    Rows whose offset timestamp is missing from the data get NaN as target.
    Note that a helper column is also added to the input dataframe in place.
    """
    
    df_resampled[f'date_col_{target_col}'] = df_resampled[date_col] + pd.DateOffset(hours=date_offset)
    tmp = df_resampled[[entity_id, date_col, f'date_col_{target_col}', target_col]].merge(
        df_resampled[[entity_id, date_col, f'date_col_{target_col}', target_col]], 
        left_on = [entity_id, f'date_col_{target_col}'], 
        right_on=[entity_id, date_col], 
        how='left'
    )

    tmp = tmp[[entity_id, f'{date_col}_x', f'{target_col}_y']]
    tmp.columns = [entity_id, date_col, f'target_{target_col}_{str(date_offset)}h']

    df_resampled = df_resampled.merge(tmp, on=[entity_id, date_col])
    
    return df_resampled

In [115]:
df = create_target(data, target_col='Intensity', date_offset=1)
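`create_target` looks up the future value by timestamp rather than by row position. The sketch below is not the notebook's function; it is a compact illustration of the same idea on made-up data with a missing hour, compared against a row-based `shift`. With a gap in the series, the timestamp join leaves the target empty, while the shift silently takes a value from two hours ahead.

```python
import pandas as pd

# Toy hourly series for one sensor with the 02:00 reading missing (made-up data)
toy = pd.DataFrame({
    "Sensor_ID": ["C1"] * 4,
    "Timestamp": pd.to_datetime([
        "2016-06-01 00:00", "2016-06-01 01:00",
        "2016-06-01 03:00", "2016-06-01 04:00",
    ]),
    "Intensity": [611, 572, 506, 474],
})

# Re-key each reading to the time at which it should serve as the 1-hour-ahead label
future = toy[["Sensor_ID", "Timestamp", "Intensity"]].copy()
future["Timestamp"] = future["Timestamp"] - pd.Timedelta(hours=1)
future = future.rename(columns={"Intensity": "target_Intensity_1h"})

# Timestamp-based lookup: NaN where the reading one hour later does not exist
by_join = toy.merge(future, on=["Sensor_ID", "Timestamp"], how="left")

# Row-based lookup: ignores the gap and pairs 01:00 with the 03:00 reading
by_shift = toy.groupby("Sensor_ID")["Intensity"].shift(-1)
```

If the sensor data is guaranteed to be complete and regularly sampled, both approaches give the same labels; the join is the safer choice when that is not certain.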

In [108]:
import holidays
import numpy as np

it_holidays = holidays.CountryHoliday('Italy')

# Extract calendar features from a datetime column using the pandas Series.dt accessor
def get_date_features(df_resampled, date_col, suffix, holidays_list):
    """
    Function for getting date features (day, hour, month, day of week, quarter,
    holiday flag, weekend flag) from a datetime column.
    """
    df_resampled[f'day_{suffix}'] = df_resampled[date_col].dt.day
    df_resampled[f'hour_{suffix}'] = df_resampled[date_col].dt.hour
    df_resampled[f'month_{suffix}'] = df_resampled[date_col].dt.month
    df_resampled[f'dayofweek_{suffix}'] = df_resampled[date_col].dt.dayofweek
    # df_resampled[f'year_{suffix}'] = df_resampled[date_col].dt.year
    df_resampled[f'quarter_{suffix}'] = df_resampled[date_col].dt.quarter
    df_resampled[f'is_holiday_{suffix}'] = df_resampled[date_col].apply(lambda x: x in holidays_list)
    # df_resampled[f'is_year_end_{suffix}'] = df_resampled[date_col].dt.is_year_end
    df_resampled[f'is_weekend_{suffix}'] = np.where(df_resampled[f'dayofweek_{suffix}'].isin([5, 6]), 1, 0)
                                                  
    return df_resampled

In [109]:
df = get_date_features(df, date_col='Timestamp', suffix='today', holidays_list=it_holidays)

In [110]:
df.head(3)

| | Date | Time | Sensor_ID | Intensity | Timestamp | dateobserved_tomorrow | date_col_Intensity | target_Intensity_1h | day_today | hour_today | month_today | dayofweek_today | quarter_today | is_holiday_today | is_weekend_today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01-06-2016 | 00:00 | C1 | 611 | 2016-01-06 00:00:00 | 2016-01-06 01:00:00 | 2016-01-06 01:00:00 | 572 | 6 | 0 | 1 | 2 | 1 | True | 0 |
| 1 | 01-06-2016 | 01:00 | C1 | 572 | 2016-01-06 01:00:00 | 2016-01-06 02:00:00 | 2016-01-06 02:00:00 | 525 | 6 | 1 | 1 | 2 | 1 | True | 0 |
| 2 | 01-06-2016 | 02:00 | C1 | 525 | 2016-01-06 02:00:00 | 2016-01-06 03:00:00 | 2016-01-06 03:00:00 | 506 | 6 | 2 | 1 | 2 | 1 | True | 0 |


### Sensor Location

In [75]:
import folium

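# Center the map on the sensors so that the markers are visible at this zoom level
m = folium.Map(location=[sensor_list['Lat'].mean(), sensor_list['Long'].mean()], zoom_start=14)

# Add one marker per sensor, with the address shown in the popup
for _, row in sensor_list.iterrows():
    folium.Marker(
        location=[row["Lat"], row["Long"]],
        popup=row['address'],
        icon=folium.Icon(color="red", icon='automobile', prefix='fa')
    ).add_to(m)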

m

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀