# Location Grouping Multivariate LSTM



By Nida Copty, India Tory, Emily Nguyen, Thomas Nguyen


This notebook builds on our two earlier models: the univariate LSTM and the multivariate LSTM trained on city-wide occupancy rate data. Here, we train the model on individual shelter data, aiming to capture the patterns unique to each shelter.

To avoid overfitting, we take care in how we one-hot encode shelter identity. With over 200 shelters, a direct one-hot encoding would add more than 200 feature dimensions and risk over-parameterization. Instead, we devise a grouping strategy that uses geolocation information to cluster shelters with similar characteristics into groups, and we encode the group rather than the individual shelter.
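To make the dimensionality argument concrete, the sketch below one-hot encodes a few shelters first by ID and then by group label. The column names `shelter_id` and `group`, and all values, are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: six shelters assigned to two location groups.
shelters = pd.DataFrame({
    "shelter_id": [101, 102, 103, 104, 105, 106],
    "group": [0, 0, 1, 1, 1, 0],
})

# One column per shelter: the width grows with the number of shelters.
per_shelter = pd.get_dummies(shelters["shelter_id"], prefix="shelter")

# One column per group: the width grows only with the number of clusters.
per_group = pd.get_dummies(shelters["group"], prefix="group")

print(per_shelter.shape)  # (6, 6)
print(per_group.shape)    # (6, 2)
```

With more than 200 shelters, the per-shelter encoding would need more than 200 columns, while the grouped encoding needs only as many columns as there are clusters.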

The key challenge lies in determining the optimal number of clusters for grouping shelters. We take a data-driven approach: we run k-means clustering with different numbers of centroids and compare the results using several methods of analysis, aiming to identify the number of centroids that best captures the patterns within the shelter data.
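One common way to compare different numbers of centroids is to track the within-cluster sum of squares (inertia) as k grows and look for the point where the improvement levels off. The sketch below is a plain NumPy version of Lloyd's k-means on made-up coordinates. The helper `kmeans_inertia` and the sample points are assumptions for illustration, and inertia is only one possible criterion; it is not necessarily the analysis used in this notebook.

```python
import numpy as np

def kmeans_inertia(points, k, n_iter=50, seed=0):
    """Run plain Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

# Hypothetical (latitude, longitude) pairs forming three loose clumps.
rng = np.random.default_rng(42)
centres = np.array([[43.65, -79.38], [43.77, -79.25], [43.70, -79.52]])
coords = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centres])

# Inertia for k = 1..6; a sharp drop followed by a plateau suggests a candidate k.
inertias = {k: kmeans_inertia(coords, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(float(inertia), 4))
```

Because inertia can only improve as clusters are added, it is usually read alongside a second measure rather than minimised on its own.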

### Imports

In [1]:
import pandas as pd
import numpy as np
from copy import deepcopy as dc
from torch.utils.data import Dataset
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sklearn.preprocessing import MinMaxScaler
from geopy.geocoders import Nominatim
import matplotlib.pyplot as plt
import random

### Dataset Load

In [2]:
#Occupancy Rate (Output Data):
data_23 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2023.csv"
data_22 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2022.csv"
data_21 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2021.csv"
data_24 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/Daily shelter overnight occupancy.csv"
links = [data_24, data_23, data_22, data_21]

#Weather Data
data_w_23 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2023_P1D.csv"
data_w_24 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2024_P1D.csv"
data_w_22 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2022_P1D.csv"
data_w_21 = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2021_P1D.csv"
links_weather = [data_w_24, data_w_23, data_w_22, data_w_21]

#Housing
data_housing = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/Housing.csv"

#Crisis helpline
data_crisis = r"/content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/Persons_in_Crisis_Calls_for_Service_Attended_Open_Data.csv"

Helper function that reads a CSV file into a pandas DataFrame and reports any error encountered while reading.

In [5]:
def load_csv_to_pandas(file_path):
    try:
        # Load CSV file into a pandas dataFrame
        df = pd.read_csv(file_path, header=0, low_memory=False, encoding='unicode_escape')
        print("Number of rows in the dataFrame:", file_path, len(df))
        return df
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return None
    except Exception as e:
        print("An error occurred:", str(e))
        return None
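The `encoding='unicode_escape'` argument has a side effect worth knowing about: if a CSV starts with a UTF-8 byte-order mark, the mark is decoded as the three characters `ï»¿` and becomes part of the first column name. This is a likely explanation for names such as `'ï»¿OBJECTID'` in `loadData`. The sketch below reproduces the effect on a small temporary file; the file contents are made up, and `utf-8-sig` is shown only as a point of comparison.

```python
import os
import tempfile
import pandas as pd

# Write a tiny CSV that starts with a UTF-8 byte-order mark (BOM).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8-sig"
) as f:
    f.write('"Longitude (x)","Latitude (y)"\n-79.4,43.67\n')
    path = f.name

try:
    # unicode_escape decodes the BOM bytes as literal characters in the header.
    escaped = pd.read_csv(path, encoding="unicode_escape")
    # utf-8-sig strips the BOM before the header is parsed.
    clean = pd.read_csv(path, encoding="utf-8-sig")
finally:
    os.remove(path)

print(list(escaped.columns))
print(list(clean.columns))
```

Any code that refers to columns by name has to match whichever encoding was used when the file was read.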

The function below combines the datasets for the different features into a single pandas DataFrame and also builds a hashmap (dictionary) that holds the data for each individual shelter.

In [3]:
def loadData(output_data, weather_data, housing, crisis):

    #-------Output Data-------#
    #Loading up the links to the output dataset
    for i in range(len(output_data)):
        output_data[i] = load_csv_to_pandas(output_data[i])

    #Dropping irrelevant columns for output datasets
    for i in range(len(output_data)):
        #print(output_data[i])
        output_data[i] = output_data[i].drop(columns = ['_id', 'ORGANIZATION_ID', 'SHELTER_ID', 'LOCATION_ID', 'LOCATION_CITY', 'LOCATION_PROVINCE', 'PROGRAM_NAME', 'SECTOR', 'PROGRAM_MODEL','OVERNIGHT_SERVICE_TYPE', 'PROGRAM_AREA', 'SERVICE_USER_COUNT', 'CAPACITY_FUNDING_BED', 'UNOCCUPIED_BEDS', 'UNAVAILABLE_BEDS', 'CAPACITY_FUNDING_ROOM', 'UNOCCUPIED_ROOMS', 'UNAVAILABLE_ROOMS'])
        #Parse the date column into datetime values
        output_data[i]['OCCUPANCY_DATE'] =  pd.to_datetime(output_data[i]['OCCUPANCY_DATE'], format='%Y-%m-%d')

    #Joining the Output data together
    big_data = output_data[0]
    for i in range(1,len(output_data)):
        big_data = pd.concat([big_data, output_data[i]], ignore_index = True)

    #Determine the max and min date in the dataset to create a date vector to fill out empty values
    max_date = big_data['OCCUPANCY_DATE'].max()
    min_date = big_data['OCCUPANCY_DATE'].min()
    date_range = pd.date_range(start=min_date, end=max_date, freq = 'D')
    date_df = pd.DataFrame({'OCCUPANCY_DATE': date_range})

    #-------Weather Data-------#

    #loading up the links to the weather dataset
    for i in range(len(weather_data)):
        weather_data[i] = load_csv_to_pandas(weather_data[i])

    #Dropping irrelevant columns for weather datasets
    for i in range(len(weather_data)):
        weather_data[i] = weather_data[i].drop(columns = ['ï»¿"Longitude (x)"', 'Latitude (y)', 'Station Name', 'Climate ID', 'Year', 'Month', 'Day', 'Data Quality', 'Max Temp Flag', 'Min Temp Flag', 'Mean Temp Flag', 'Heat Deg Days Flag', 'Cool Deg Days Flag', 'Total Rain (mm)', 'Total Rain Flag', 'Total Snow (cm)', 'Total Snow Flag', 'Total Precip Flag',
        'Snow on Grnd Flag', 'Dir of Max Gust (10s deg)', 'Dir of Max Gust Flag', 'Spd of Max Gust (km/h)', 'Spd of Max Gust Flag'])
        weather_data[i]['Date/Time'] = weather_data[i]['Date/Time'].astype(str)
        weather_data[i]['Date/Time'] = pd.to_datetime(weather_data[i]['Date/Time'])

    #Joining the Weather data together
    big_weather = weather_data[0]
    for i in range(1, len(weather_data)):
        big_weather = pd.concat([big_weather, weather_data[i]], ignore_index = True)

    #Keep only the weather rows whose dates fall within the date range of the output data
    big_weather = big_weather[big_weather['Date/Time'] <= max_date]
    big_weather = big_weather[big_weather['Date/Time'] >= min_date]

    #Fill missing weather entries with 0
    big_weather = big_weather.fillna(0)

    #Changing non output dataset's date column to 'OCCUPANCY_DATE'
    big_weather = big_weather.rename(columns = {'Date/Time': 'OCCUPANCY_DATE'})

    #-------Housing Data-------#

    #loading up housing data
    housing = load_csv_to_pandas(housing)

    #Dropping irrelevant columns for housing dataset
    housing = housing[housing['GEO'] == 'Toronto, Ontario']
    housing = housing[housing['New housing price indexes'] == 'Total (house and land)']
    housing = housing.drop(columns = ['GEO', 'DGUID', 'New housing price indexes', 'UOM', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'])
    housing = housing.rename(columns = {housing.columns[0]: 'OCCUPANCY_DATE'})
    housing["OCCUPANCY_DATE"] = pd.to_datetime(housing["OCCUPANCY_DATE"])
    housing = housing[housing["OCCUPANCY_DATE"] >= min_date]
    housing = housing[housing["OCCUPANCY_DATE"] <= max_date].reset_index(drop=True)
    housing = pd.merge(housing, date_df, on = 'OCCUPANCY_DATE', how = 'outer')
    housing = housing.sort_values(by='OCCUPANCY_DATE').reset_index(drop=True)
    housing = housing.ffill()

    #-------Crisis Data-------#

    #Loading the crisis dataset
    crisis = load_csv_to_pandas(crisis)

    #Drop irrelevant columns, then count the crisis calls per day and event type
    crisis = crisis.drop(columns = ['ï»¿OBJECTID', 'EVENT_ID', 'EVENT_YEAR', 'EVENT_MONTH', 'EVENT_DOW', 'EVENT_HOUR', 'DIVISION', 'OCCURRENCE_CREATED', 'APPREHENSION_MADE', 'MCIT_ATTEND', 'HOOD_158', 'NEIGHBOURHOOD_158', 'HOOD_140', 'NEIGHBOURHOOD_140'])
    crisis = crisis.rename(columns = {'EVENT_DATE': 'OCCUPANCY_DATE'})
    crisis = crisis.groupby(['OCCUPANCY_DATE', 'EVENT_TYPE']).size().unstack(fill_value=0)
    crisis.reset_index(inplace=True)
    crisis = crisis.rename_axis(None, axis=1)
    crisis['OCCUPANCY_DATE'] = pd.to_datetime(crisis['OCCUPANCY_DATE']).dt.date
    crisis['OCCUPANCY_DATE'] = pd.to_datetime(crisis['OCCUPANCY_DATE'])
    crisis = crisis[crisis["OCCUPANCY_DATE"] >= min_date]
    crisis = crisis[crisis["OCCUPANCY_DATE"] <= max_date]
    crisis = pd.merge(date_df, crisis, on='OCCUPANCY_DATE', how='left')

    #-------Final Data Prep-------#

    #Merge the datasets together through date
    big_data = pd.merge(big_data, big_weather, on = 'OCCUPANCY_DATE', how = 'inner')
    big_data = pd.merge(big_data, housing, on = 'OCCUPANCY_DATE', how = 'inner')
    big_data = pd.merge(big_data, crisis, on = 'OCCUPANCY_DATE', how = 'inner')

    big_data = big_data.sort_values(by='OCCUPANCY_DATE')

    #Placing the bed and room occupancy column last
    room_occupancy = big_data.pop('OCCUPANCY_RATE_ROOMS')
    bed_occupancy = big_data.pop('OCCUPANCY_RATE_BEDS')
    big_data['OCCUPANCY_RATE_BEDS'] = bed_occupancy
    big_data['OCCUPANCY_RATE_ROOMS'] = room_occupancy

    grouped_data = big_data.groupby('PROGRAM_ID')
    shelter_data_frames = {}
    for shelter_id, shelter_group in grouped_data:
        #Copy each group so the per-shelter frame is independent of big_data
        shelter_group = shelter_group.copy()
        shelter_group['OCCUPANCY_DATE'] = pd.to_datetime(shelter_group['OCCUPANCY_DATE'])
        shelter_data_frames[shelter_id] = shelter_group

    #Reset the index after sorting, discarding the old one
    big_data = big_data.reset_index(drop=True)

    return big_data, shelter_data_frames
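Two steps in `loadData` are easier to follow on small inputs: stretching the housing index onto a daily calendar with an outer merge and forward fill, and turning individual crisis events into per-day counts by event type. The sketch below mirrors both steps on hypothetical data; the `VALUE` column, the event type names, and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily calendar, standing in for date_df.
date_df = pd.DataFrame(
    {"OCCUPANCY_DATE": pd.date_range("2023-01-01", "2023-02-03", freq="D")}
)

# Hypothetical monthly index values, stamped on the first day of each month.
housing = pd.DataFrame({
    "OCCUPANCY_DATE": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "VALUE": [100.0, 101.5],
})

# The outer merge adds the missing days as NaN rows; sorting and forward-filling
# then carries each month's value forward until the next observation.
daily = (
    pd.merge(housing, date_df, on="OCCUPANCY_DATE", how="outer")
    .sort_values("OCCUPANCY_DATE")
    .reset_index(drop=True)
    .ffill()
)

# Hypothetical crisis events: one row per call.
events = pd.DataFrame({
    "OCCUPANCY_DATE": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "EVENT_TYPE": ["Overdose", "Person in Crisis", "Overdose"],
})

# Count calls per day and event type, producing one column per event type.
counts = (
    events.groupby(["OCCUPANCY_DATE", "EVENT_TYPE"]).size().unstack(fill_value=0)
    .reset_index()
    .rename_axis(None, axis=1)
)

print(daily.iloc[[0, 30, 31, 33]])
print(counts)
```

Note that in `loadData` the crisis counts are left-merged onto the full calendar, so days without any recorded call end up as missing values rather than zeros unless they are filled afterwards.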

Running the function to get the dataframe and hashmap

In [6]:
dataframe, iso_data = loadData(links.copy(), links_weather.copy(), data_housing, data_crisis)

Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/Daily shelter overnight occupancy.csv 11459
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2023.csv 48345
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2022.csv 49478
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/daily-shelter-overnight-service-occupancy-capacity-2021.csv 50944
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2024_P1D.csv 366
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI | Lets Solve IT/Datasets/en_climate_daily_ON_6158355_2023_P1D.csv 365
Number of rows in the dataFrame: /content/drive/MyDrive/RBC | Borealis AI 