# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## Authors
Write the name (first and last) of the people on your team that are responsible for developing this solution.

## External links and resources
Paste here all the links to external resources that are necessary to understand and run your code. Add descriptions to make it clear how to use them during evaluation.

## Introduction
Describe how you framed the challenge by telling us what problem you are trying to solve and how your solution solves that problem.

## Development
Start coding here! 👩‍💻

Don't hesitate to create markdown cells to include descriptions of your work where you see fit, as well as commenting your code.

We know that you know exactly where to start when it comes to crunching data and building models, but don't forget that WDL is all about social impact...so take that into consideration as well.

In [None]:
# Imports
import os
import glob
import math
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import wkt
import shapefile
from geopy.distance import geodesic
from geopy.geocoders import Nominatim

ModuleNotFoundError: No module named 'geopandas'

In [None]:
# Helper functions for loading data

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package
    """

    #read file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]

    #write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)

    return df



# Function: Load Noise Data
def load_noise_data(file_paths, sensor_list):
    """
    Function for loading noise data into the correct format
    """
    concat_data = []
    for file in file_paths:
        df = pd.read_csv(file, header=8, sep=';')
        df = df.melt(id_vars=['Data', 'Ora'])
        df['Timestamp'] = pd.to_datetime(df['Data'] + ' ' + df['Ora'])
        df.columns = ['Date', 'Time', 'Sensor_ID', 'Intensity', 'Timestamp']
        df['Intensity'] = df['Intensity'].str.replace(',', '.').astype(float)

        concat_data.append(df)

    concat_df = pd.concat(concat_data)

    output = concat_df.merge(sensor_list, on=['Sensor_ID'])
    
    return output[['Timestamp', 'Sensor_ID', 'Intensity', 'address', 'Lat', 'Long', 'day_max_db', 'night_max_db', 'area_type']]



# Function that calculates the distance in km between two points using the latitude and longitude data 
def get_lat_lon_dist(row):
    
    latlon1 = tuple(row[['latitude1', 'longitude1']])
    latlon2 = tuple(row[['latitude2', 'longitude2']])

    return geodesic(latlon1, latlon2).kilometers



def compute_distances(df_pois, data_entities):

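    # To apply get_lat_lon_dist to both tables, copy their coordinate columns to common 'latitude'/'longitude' names.
    # The merge suffixes below then yield latitude1/longitude1 (sensors) and latitude2/longitude2 (points of interest).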
    data_entities[['latitude', 'longitude']] = data_entities[["Lat", "Long"]]
    df_pois[['latitude', 'longitude']] = df_pois[["Latitude", "Longitude"]]
    

    # Cross-join to get all combinations of latitude/longitude
    dist = pd.merge(data_entities.copy().assign(k=1), df_pois.copy().assign(k=1), on='k', suffixes=('1', '2')).drop('k', axis=1)
    
    # Application of the get_lat_lon_dist function defined with the data from the points of interest and the sensors. 
    # Creation of a new column "dist_NOME_DA_ENTIDADE" with the distance in km from each sensor to each point of interest
    dist['dist_NOME_DA_ENTIDADE'] = dist.apply(get_lat_lon_dist, axis=1)
    
    return dist


def return_amount_of_points_of_interest_per_sensor(distances_df, threshold):
    
    # Flag every sensor/point-of-interest pair whose distance is at most `threshold` (in km).
    # We consider up to 1.5 km a walkable distance within the study area, so for each sensor this marks
    # the restaurants, hotels, shopping centers, etc. that lie within the chosen threshold.
    dist_new = distances_df.copy()
    dist_new['is_below_threshold'] = np.where(dist_new['dist_NOME_DA_ENTIDADE'] <= threshold, 1, 0)
    
    
    # Check this later
    # sensor_categ = dist_new.groupby(['entity_id', 'category'])['is_below_threshold'].sum().reset_index()
    # sensor_categ = sensor_categ.pivot_table(index = "entity_id", columns = "category", values = "is_below_threshold")
    
    return dist_new
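
For reference, the per-pair distance used above can also be approximated without geopy. The sketch below uses the haversine (great-circle) formula, not the ellipsoidal geodesic that `geodesic` computes, so results differ slightly. `haversine_km` and `EARTH_RADIUS_KM` are illustrative names that do not exist in this notebook. The coordinates are those of sensors s_01 and s_06 from the sensor list.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption: spherical Earth)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance between sensor s_01 (Via Saluzzo) and sensor s_06 (Corso Marconi)
d = haversine_km(45.059172, 7.678986, 45.055554, 7.68259)
print(f"{d:.3f} km")
```

At these distances (a few hundred metres) the spherical approximation should be close enough for thresholding, but this has not been checked against the geodesic values in the notebook.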

In [None]:
# Load list of sensors

sensor_list = pd.read_csv('data/noise_sensor_list.csv', sep = ';')
sensor_list['Sensor_ID'] = ['C1', 'C2', 'C3', 'C4', 'C5']
sensor_list['Lat'] = sensor_list['Lat'].str.replace(',', '.').astype(float)
sensor_list['Long'] = sensor_list['Long'].str.replace(',', '.').astype(float)

# Get mapping locations and correspondence to area type
# Link: https://webgis.arpa.piemonte.it/Geoviewer2D/?config=other-configs/acustica_config.json

mapping_location_area_code = pd.DataFrame(
    [['s_01', 65, 55, 'IV - Aree di intensa attività umana'],
    ['s_02', 60, 50, 'III - Aree di tipo misto'],
    ['s_03', 60, 50, 'III - Aree di tipo misto'],
    ['s_05', 65, 55, 'IV - Aree di intensa attività umana'],
    ['s_06', 60, 50, 'III - Aree di tipo misto']],
    columns=['code', 'day_max_db', 'night_max_db', 'area_type']
)

sensor_list = sensor_list.merge(mapping_location_area_code, on=['code'])

In [None]:
sensor_list

Unnamed: 0,code,address,Lat,Long,streaming,Sensor_ID,day_max_db,night_max_db,area_type
0,s_01,"Via Saluzzo, 26 Torino",45.059172,7.678986,https://userportal.smartdatanet.it/userportal/...,C1,65,55,IV - Aree di intensa attività umana
1,s_02,"Via Principe Tommaso, 18bis Torino",45.057837,7.681555,https://userportal.smartdatanet.it/userportal/...,C2,60,50,III - Aree di tipo misto
2,s_03,Largo Saluzzo Torino,45.058518,7.678854,https://userportal.smartdatanet.it/userportal/...,C3,60,50,III - Aree di tipo misto
3,s_05,Via Principe Tommaso angolo via Baretti Torino,45.057603,7.681348,https://userportal.smartdatanet.it/userportal/...,C4,65,55,IV - Aree di intensa attività umana
4,s_06,"Corso Marconi, 27 Torino",45.055554,7.68259,https://userportal.smartdatanet.it/userportal/...,C5,60,50,III - Aree di tipo misto


In [None]:
file_paths_noise_data = [
    'data/noise_data/san_salvario_2016.csv',
    'data/noise_data/san_salvario_2017.csv',
    'data/noise_data/san_salvario_2018.csv',
    'data/noise_data/san_salvario_2019.csv',
]
data = load_noise_data(file_paths_noise_data, sensor_list)

In [None]:
# Police complaints

file_paths=glob.glob('data/police_complaints/*.csv')

concat_data = []
for file in file_paths:
    df = pd.read_csv(file, sep=',')
    df['Timestamp'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    concat_data.append(df)
df_final = pd.concat(concat_data)

# Keep only the noise-related complaint sub-categories
# (named `sub_categories` to avoid shadowing the built-in `filter`)
sub_categories = ['Facilities disturbances', 'Disturbing noises', 'Youth aggregation']
df_filtered = df_final.loc[df_final['Criminal sub-category'].isin(sub_categories)]

# .copy() gives an independent DataFrame, so assigning columns does not raise SettingWithCopyWarning
df_filtered_san_salvario = df_filtered[df_filtered.District == 8].copy()

df_filtered_san_salvario['Localization'] = df_filtered_san_salvario['Localization'].str.lower().str.strip()


In [None]:
localization_address_mapping = {
   'principe tommaso/(via)':  'Via Principe Tommaso, 18bis Torino',
   'baretti/giuseppe (via)': 'Via Principe Tommaso angolo via Baretti Torino',
   'marconi/guglielmo (corso)' : 'Corso Marconi, 27 Torino',
   'saluzzo/(largo)': 'Largo Saluzzo Torino',
   'saluzzo/(via)': 'Via Saluzzo, 26 Torino'
}

In [None]:
# .assign returns a new DataFrame, which avoids writing into a slice of another DataFrame
df_filtered_san_salvario = df_filtered_san_salvario.assign(
    address=df_filtered_san_salvario['Localization'].map(localization_address_mapping)
)


In [None]:
df_filtered_san_salvario = df_filtered_san_salvario[~df_filtered_san_salvario.address.isna()]
                                                    
# Tiago starts here

In [None]:
def create_target_complaints(df_filtered_san_salvario, range_days = 2, target_forecast = 1):

    complaint_dates = pd.to_datetime(df_filtered_san_salvario['Date']).unique()

    # Target dates: by default, one and two days before each complaint
    range_deltas = [pd.Timedelta(x + target_forecast, 'd') for x in np.arange(0, range_days)]
    
    target_dates = []
    for x in range_deltas:
        for y in complaint_dates:
            target_dates.append(y - x)
            
    return target_dates, complaint_dates

In [None]:
target_dates, complaint_dates = create_target_complaints(df_filtered_san_salvario)

In [None]:
# Complaint filed on the 24th
# The noise occurred on the 24th or the 23rd
# We want to predict it on the 23rd or the 22nd

In [None]:
sorted(target_dates)[0:2], sorted(complaint_dates)[0]

([Timestamp('2016-02-22 00:00:00'), Timestamp('2016-02-23 00:00:00')],
 numpy.datetime64('2016-02-24T00:00:00.000000000'))
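
The date arithmetic behind `create_target_complaints` can be checked on a single complaint. This is a minimal sketch using only the standard library. `target_dates_for` is a hypothetical helper that mirrors the defaults above (`range_days=2`, `target_forecast=1`); it is not the notebook's function.

```python
from datetime import date, timedelta


def target_dates_for(complaint_dates, range_days=2, target_forecast=1):
    """Return the sorted days that precede a complaint by 1..range_days days (by default)."""
    offsets = [timedelta(days=k + target_forecast) for k in range(range_days)]
    # A set removes duplicates when two complaints share a target day
    return sorted({d - off for d in complaint_dates for off in offsets})


# First complaint in the data: 24 February 2016
targets = target_dates_for([date(2016, 2, 24)])
print(targets)  # the 22nd and the 23rd, matching the output above
```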

In [None]:
# Floor each timestamp to midnight so it can be compared with the date-only target dates
data['Timestamp_trunc'] = data['Timestamp'].dt.normalize()

In [None]:
data['complaint_followed'] = np.where(data['Timestamp_trunc'].isin(target_dates), 1, 0)

In [None]:
tmp = df_filtered_san_salvario.merge(data, on=['address'])

# Filter the rows where timestamp_x >= timestamp_y + 1 day

# Create a pandas Series / dictionary keyed by the day/timestamp on which a complaint was filed on the same day or on the following day (this will be label=1)
# label = 0 will be the days for which nothing could be mapped

In [None]:
# Filter the training dataset by these dates

In [None]:
# Get the Points of Interest of this Region
# We start by loading the .CSV file
businesses = pd.read_csv('data/businesses.csv', delimiter=';')

# Let's show the first rows
businesses.head()

In [None]:
# TODO (if we have time): Change this loop to a more efficient loop

# To obtain the coordinates we geocode each address (address -> latitude/longitude)
# Create a Geolocator
geolocator = Nominatim(user_agent="wdl-tech-moguls")

# Iterate through the businesses dataframe
for i in range(len(businesses)):
    location = geolocator.geocode(businesses.loc[i, "ADDRESS"])
    # geocode returns None when the address cannot be resolved; leave those rows empty
    if location is not None:
        businesses.loc[i, "Longitude"] = location.longitude
        businesses.loc[i, "Latitude"] = location.latitude

    
# We saved this into a .CSV for further use
businesses.to_csv("data/businesses_proc.csv")
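
One possible way to address the TODO above is to geocode each distinct address only once. The sketch below is an illustration under stated assumptions: `geocode_unique` and `fake_geocode` are hypothetical names, and the geocoder is passed in as a plain function so the example runs without network access. With geopy, that function could wrap `geolocator.geocode`, and geopy's `RateLimiter` helper (in `geopy.extra.rate_limiter`) could be used to space out the requests.

```python
def geocode_unique(addresses, geocode_fn):
    """Geocode a list of addresses, calling geocode_fn once per distinct address."""
    cache = {}
    coords = []
    for address in addresses:
        if address not in cache:
            # geocode_fn may return None for an address it cannot resolve
            cache[address] = geocode_fn(address)
        coords.append(cache[address])
    return coords


# Stand-in geocoder for the example: records its calls and knows a single address
calls = []

def fake_geocode(address):
    calls.append(address)
    known = {"Via Saluzzo, 26 Torino": (45.059172, 7.678986)}
    return known.get(address)


result = geocode_unique(
    ["Via Saluzzo, 26 Torino", "Via Saluzzo, 26 Torino", "unknown address"],
    fake_geocode,
)
print(result, len(calls))
```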

In [None]:
# Let's check that everything is OK
businesses = pd.read_csv("data/businesses_proc.csv")
businesses

In [None]:
# Let's now compute the distances between each point of interest and each sensor
distances = compute_distances(businesses, sensor_list)
distances

In [None]:
# We save it into CSV for further use
distances.to_csv("data/distances_sensors_pois.csv")

In [None]:
# Get different dataframes per threshold
for thresh in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5]:
    thresh_df = return_amount_of_points_of_interest_per_sensor(distances, thresh)
    thresh_w_mapping = thresh_df.merge(mapping_location_area_code, on=['code'])
    thresh_w_mapping.to_csv(f"data/thresh_df_{thresh}.csv")
    
# Show an example (the dataframe for the last threshold, 1.5 km)
thresh_w_mapping

In [None]:
cols_to_keep = [
    'code', 
    'address', 
    'Sensor_ID', 
    'day_max_db_x', 
    'night_max_db_x', 
    'area_type_x', 
    'TYPE', 
    'Description', 
    'Merchandise Type', 
    'dist_NOME_DA_ENTIDADE', 
    'is_below_threshold'
]

thresh_w_mapping = thresh_w_mapping[cols_to_keep]
thresh_w_mapping

In [None]:
# Group by number of counts per category
# Read the file we want (each file has a different threshold)
# thresh_w_mapping = pd.read_csv("data/thresh_df_0.1.csv")

# Some merchandise types are not informative (they are given only as numeric codes)
cols_to_remove = ['205', '207', '208', '214', '217', '99']

sensor_categ = thresh_w_mapping.copy().groupby(['Sensor_ID', 'Merchandise Type'])['is_below_threshold'].sum().reset_index()
sensor_categ = sensor_categ.copy().pivot_table(index = "Sensor_ID", columns = "Merchandise Type", values = "is_below_threshold")
sensor_categ = sensor_categ.copy().drop(cols_to_remove, axis = 1).reset_index()
sensor_categ

In [None]:
# Merge again to have the remaining columns ("day_max_db", "night_max_db", "area_type")
final_df = sensor_categ.merge(sensor_list.copy()[["Sensor_ID", "day_max_db", "night_max_db", "area_type"]], on=["Sensor_ID"])
final_df

# Exploratory Data Analysis

## Sensor Location

TODO: Description here about the sensors' proximity

In [None]:
import folium

m = folium.Map(location=[45.0530, 7.6798], zoom_start=15)

for indice, row in sensor_list.iterrows():
    folium.Marker(
        location=[row["Lat"], row["Long"]],
        popup=row['address'],
        icon=folium.Icon(color="red", icon='automobile', prefix='fa')
        ).add_to(m)

m

## Seasonality and regular behavior studies

In [None]:
import holidays
import numpy as np

it_holidays = holidays.CountryHoliday('Italy')

# We created a function to get some interesting date features, based on pandas' built-in datetime accessors
def get_date_features(df_resampled, date_col, suffix, holidays_list):
    """
    Function for getting date features from a datetime column. 
    """
    df_resampled[f'day_{suffix}'] = df_resampled[date_col].dt.day
    df_resampled[f'hour_{suffix}'] = df_resampled[date_col].dt.hour
    df_resampled[f'month_{suffix}'] = df_resampled[date_col].dt.month
    df_resampled[f'dayofweek_{suffix}'] = df_resampled[date_col].dt.dayofweek
    # df_resampled[f'year_{suffix}'] = df_resampled[date_col].dt.year
    df_resampled[f'quarter_{suffix}'] = df_resampled[date_col].dt.quarter
    df_resampled[f'is_holiday_{suffix}'] = df_resampled[date_col].apply(lambda x: x in holidays_list)
    # df_resampled[f'is_year_end_{suffix}'] = df_resampled[date_col].dt.is_year_end
    df_resampled[f'is_weekend_{suffix}'] = np.where(df_resampled[f'dayofweek_{suffix}'].isin([5, 6]), 1, 0)
                                                  
    return df_resampled

data = get_date_features(data, date_col='Timestamp', suffix='now', holidays_list=it_holidays)

In [None]:
def noise_threshold(data, date_col='hour_now', suffix='now', value_col='Intensity'):
    # Day period: 06:00-21:59; night period: 22:00-05:59, so every hour belongs to exactly one period
    is_day = (data[date_col] >= 6) & (data[date_col] < 22)

    # Compare each reading with the limit of its own period
    mask_day = is_day & (data[value_col] > data['day_max_db'])
    mask_night = ~is_day & (data[value_col] > data['night_max_db'])
    mask = mask_day | mask_night

    data[f'noise_exceeds_threshold_{suffix}'] = np.where(mask, 1, 0)
    
    return data
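
The day/night rule can be illustrated on a few hand-picked readings. This is a minimal sketch, assuming the day period runs from 06:00 to 21:59 and the night period covers the remaining hours. `exceeds_limit` is a hypothetical helper, and the limits 65/55 dB are those of the class IV sensors in the mapping table above.

```python
import numpy as np


def exceeds_limit(hour, intensity, day_max_db, night_max_db):
    """Return 1 where the reading is above the limit that applies at that hour, else 0."""
    hour = np.asarray(hour)
    intensity = np.asarray(intensity, dtype=float)
    is_day = (hour >= 6) & (hour < 22)          # assumption: day = 06:00-21:59
    limit = np.where(is_day, day_max_db, night_max_db)
    return (intensity > limit).astype(int)


flags = exceeds_limit(
    hour=[3, 12, 23, 23],
    intensity=[58.0, 62.0, 50.0, 57.0],
    day_max_db=65,
    night_max_db=55,
)
print(flags.tolist())
```

Note that a quiet reading at 23:00 (50 dB) is not flagged: the hour alone never triggers an exceedance, only the comparison with the limit does.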

In [None]:
data = noise_threshold(data, date_col='hour_now', suffix='now', value_col='Intensity')

In [None]:
def current_db(data, date_col='hour_now'):
    # Day period: 06:00-21:59; every other hour counts as night
    is_day = (data[date_col] >= 6) & (data[date_col] < 22)

    # Pick the legal limit that applies at each row's hour
    data['current_max_db_value'] = np.where(is_day, data['day_max_db'], data['night_max_db'])
    
    return data

In [None]:
data = current_db(data, date_col='hour_now')
data['relative_diff'] = (data.Intensity - data.current_max_db_value) / data.Intensity

In [None]:
def dbmean(levels, axis=None):
    """
    Energetic average of levels.
    :param levels: Sequence of levels.
    :param axis: Axis over which to perform the operation.
    .. math:: L_{mean} = 10 \\log_{10}{\\frac{1}{n}\\sum_{i=1}^{n}{10^{L_i/10}}}
    """
    # levels = np.asanyarray(levels)
    return 10.0 * np.log10((10.0**(levels / 10.0)).mean(axis=axis))
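
A small worked example shows why the energetic average is used instead of the arithmetic mean. The sketch below re-implements the same formula with the standard library only. `energetic_mean_db` is an illustrative name, and the two levels are made-up values.

```python
import math


def energetic_mean_db(levels):
    """Average sound levels in the power domain and convert the result back to dB."""
    powers = [10 ** (level / 10) for level in levels]
    return 10 * math.log10(sum(powers) / len(powers))


levels = [60.0, 70.0]
energetic = energetic_mean_db(levels)
arithmetic = sum(levels) / len(levels)
print(round(energetic, 2), arithmetic)
```

Because decibels are logarithmic, the louder reading dominates: the energetic mean of 60 dB and 70 dB is about 67.4 dB, above the arithmetic mean of 65 dB.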

In [None]:
avg_intensity_per_hour = data[data.Sensor_ID == 'C1'].groupby('hour_now')['Intensity'].apply(dbmean)

In [None]:
import matplotlib.pyplot as plt

plt.plot(np.arange(len(avg_intensity_per_hour)), avg_intensity_per_hour.values)
plt.title('Average sensor behavior for sensor C1 in the different times of day')
plt.ylabel('Noise (dB)')
plt.xlabel('Hour of day (h)')

In [None]:
# We create a function to create our targets
# As you can see, we created our target (label) based on a date offset (i.e., our label will be the intensity of the next day at the same time)
def create_target(df_resampled, date_col = 'Timestamp', target_col = 'Intensity', entity_id='Sensor_ID', date_offset = 24):
    """
    Function for creating lagged or future features for a specific date offset.
    For instance, this adds a new column with the intensity values 24 hours in the future, for each row, by default.    
    """
    
    df_resampled[f'date_col_{target_col}'] = df_resampled[date_col] + pd.DateOffset(hours=date_offset)
    tmp = df_resampled[[entity_id, date_col, f'date_col_{target_col}', target_col]].merge(
        df_resampled[[entity_id, date_col, f'date_col_{target_col}', target_col]], 
        left_on = [entity_id, f'date_col_{target_col}'], 
        right_on=[entity_id, date_col], 
        how='left'
    )

    tmp = tmp[[entity_id, f'{date_col}_x', f'{target_col}_y']]
    tmp.columns = [entity_id, date_col, f'target_{target_col}_{str(date_offset)}h']

    df_resampled = df_resampled.merge(tmp, on=[entity_id, date_col])
    
    return df_resampled
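
The effect of `create_target` can be seen on a toy frame. The sketch below is an alternative single-merge formulation of the same idea, not the function above: `add_future_target` and the toy readings are hypothetical, and it assumes at most one reading per sensor and timestamp.

```python
import pandas as pd


def add_future_target(df, hours=24):
    """Attach, to each row, the intensity observed `hours` later at the same sensor."""
    future = df[['Sensor_ID', 'Timestamp', 'Intensity']].copy()
    # Move each reading back by `hours` so it lines up with the row it is the target of
    future['Timestamp'] = future['Timestamp'] - pd.Timedelta(hours=hours)
    future = future.rename(columns={'Intensity': f'target_Intensity_{hours}h'})
    # Left merge: rows without a reading `hours` later get NaN as target
    return df.merge(future, on=['Sensor_ID', 'Timestamp'], how='left')


toy = pd.DataFrame({
    'Sensor_ID': ['C1', 'C1', 'C1'],
    'Timestamp': pd.to_datetime(['2016-06-01 10:00', '2016-06-02 10:00', '2016-06-03 10:00']),
    'Intensity': [55.0, 61.0, 58.0],
})
out = add_future_target(toy)
print(out)
```

The last row has no reading 24 hours later, so its target is NaN. This is why rows with a missing target are dropped before training.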

In [None]:
data = create_target(data, target_col='Intensity', date_offset=24)

In [None]:
data = get_date_features(data, date_col='date_col_Intensity', suffix='target', holidays_list=it_holidays)


In [None]:
data = noise_threshold(data, date_col='hour_target', suffix='target', value_col='target_Intensity_24h')

In [None]:
# Avg noise intensity next 3 hours

#indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
#data['average_intensity_next_3h'] = data.groupby(['Sensor_ID'])['Intensity'].rolling(window=indexer, min_periods=1).agg(dbmean).reset_index()['Intensity']
# data = noise_threshold(data, date_col='hour_target', suffix='target_2')


## Train first model

In [None]:
# We create a list of columns that we do not need to train our model
# (identifiers, raw datetime columns that cannot be cast to float, and the target itself)
COLS_TO_REMOVE = [
    'Timestamp',
    'Timestamp_trunc',
    'Sensor_ID',
    'address',
    'Lat',
    'Long',
    'area_type',
    'target_Intensity_24h',
    'date_col_Intensity',
    'noise_exceeds_threshold_target',
    #'average_intensity_next_3h',
    #'noise_exceeds_threshold_target_2'
]

# Based on the previous list, we create a new list with the features that we actually need!
COLS_TO_KEEP = [x for x in data.columns if x not in COLS_TO_REMOVE]

In [None]:
COLS_TO_REMOVE

In [None]:
data = data.sort_values(by= ['Timestamp', 'Sensor_ID']).reset_index(drop=True)

In [None]:
target_1 = 'target_Intensity_24h'

from xgboost import XGBClassifier

# Train model
# Please note that we use the first 70% of the time-sorted data set as our train set!
X_train = data[0:int(0.7*len(data))]

# We remove the rows whose label is NaN
X_train = X_train[~X_train[target_1].isna()]

# We use the remaining 30% as test set
X_test = data[int(0.7*len(data)):]

# We remove the rows whose label is NaN
# (.copy() lets us add the prediction column below without a SettingWithCopyWarning)
X_test = X_test[~X_test[target_1].isna()].copy()

# Our labels column
y_train = X_train['noise_exceeds_threshold_target']

# We train an XGBoost classifier.
# Since it is built from decision trees, it becomes easier to explain the decisions of our model
xgb = XGBClassifier(n_estimators=100)

# We train our model
xgb.fit(X_train[COLS_TO_KEEP].fillna(9999).astype(float), y_train)

# Predicted probability of the positive class (noise exceeds the threshold 24h later)
y_pred = xgb.predict_proba(X_test[COLS_TO_KEEP].fillna(9999).astype(float))
X_test['pred_score'] = y_pred[:, 1]
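
The split above is chronological rather than random, so the model is never evaluated on readings that precede its training data. A minimal sketch of that idea follows. `chronological_split` is a hypothetical helper, and the integer list stands in for rows already sorted by `Timestamp`.

```python
def chronological_split(rows, train_frac=0.7):
    """Split time-ordered rows into an earlier train part and a later test part."""
    cut = int(train_frac * len(rows))
    return rows[:cut], rows[cut:]


# Stand-in for 100 rows already sorted by timestamp
timeline = list(range(100))
train, test = chronological_split(timeline)
print(len(train), len(test), max(train) < min(test))
```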

In [None]:
X_train.Timestamp.max(), X_test.Timestamp.min()

In [None]:
X_test[(X_test.noise_exceeds_threshold_target == 1)][['Timestamp', 'Sensor_ID', 'Intensity', 'target_Intensity_24h']]

In [None]:
from sklearn.metrics import roc_auc_score

print("ROC AUC", roc_auc_score(X_test['noise_exceeds_threshold_target'], y_pred[:, 1]))
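
To make the metric concrete: ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it by brute force over all pairs. `pairwise_auc` is an illustrative name and the scores are made-up values. For real data, `roc_auc_score` is far more efficient.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 corresponds to random ranking.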

In [None]:
import shap
# Shap explanation

# We now explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.Explainer(xgb)
shap_values = explainer(X_test[COLS_TO_KEEP].fillna(9999).astype(float))

In [None]:
# Let's get a nice plot with the shap values so you can have an intuition on the rationale behind the model learned by the XGBoost classifier
shap.plots.beeswarm(shap_values, max_display=15)

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀