# General Preprocessing Steps
Based on the previous analysis, the following pre-processing steps are required:

1. Standardize Categorical Values:

Ensure all categorical values are consistent (e.g., standardizing "Drug Toxicity" and "Drug toxicity" to a single format).

2. Handle Missing Values:

Since we found no missing values, this step can be skipped for now.

3. Remove Duplicates:

Since we found no duplicate records, this step can be skipped for now.

4. Verify Data Types:

Ensure all columns have the correct data types. Convert columns if necessary.

5. Identify and Handle Outliers:

Review identified outliers and decide whether to keep them, remove them, or handle them differently based on the context.
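
The checks behind steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the analysis code used earlier: the helper name `summarize_quality` and the toy rows are assumptions made up for this example, and only the column names come from the datasets described here.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return basic data-quality indicators for a DataFrame (hypothetical helper)."""
    return {
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Data type of each column, as strings for easy printing
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy data, invented for illustration only
sample = pd.DataFrame({
    "Year of death": [2020, 2021, 2021],
    "Cause_of_death": ["Drug Toxicity", "Cancer", "Cancer"],
    "Count": [10, 3, 3],
})

report = summarize_quality(sample)
print(report)
```

If `duplicate_rows` and every entry of `missing_per_column` are zero, steps 2 and 3 can be skipped, as they are for these datasets.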


# Specific Preprocessing Steps for Each Dataset

## 1. homeless-deaths-by-cause.csv
Standardize Cause_of_death Values:

Convert all values to a single case (lower case or upper case) and strip surrounding whitespace to ensure consistency.

Convert Data Types (if necessary):

Ensure all columns are of the correct type.
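
To see why the standardization matters, the following sketch runs it on a handful of invented values. The sample strings are assumptions for illustration and are not taken from the actual file.

```python
import pandas as pd

# Invented spellings of the same category, plus one different category
causes = pd.Series(["Drug Toxicity", "Drug toxicity ", " drug toxicity", "Cancer"])

# Before standardization, every spelling counts as its own category
raw_categories = causes.nunique()

# Strip surrounding whitespace, then lower-case
standardized = causes.str.strip().str.lower()
clean_categories = standardized.nunique()

print(raw_categories, clean_categories)
print(standardized.value_counts())
```

Without this step, a group-by on `Cause_of_death` would split the same cause across several rows.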

## 2. homeless-deaths-by-demographics.csv
Verify Consistency of Age_group and Gender:

No inconsistencies found, so this step can be skipped.

Convert Data Types (if necessary):

Ensure all columns are of the correct type.

## 3. homeless-deaths-by-month.csv
Verify Consistency of Month of death:

Ensure all month values are consistent and correctly spelled.

Convert Data Types (if necessary):

Ensure all columns are of the correct type.
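
One possible way to verify the month values is sketched below. It assumes the `Month of death` column holds full English month names, which has not been confirmed against the file. The helper `find_invalid_months` and the sample values are hypothetical.

```python
import pandas as pd

# Assumption: months are recorded as full English names
VALID_MONTHS = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
}


def find_invalid_months(months: pd.Series) -> list:
    """Return the sorted values that are not valid month names after trimming and title-casing."""
    cleaned = months.astype(str).str.strip().str.title()
    return sorted(set(cleaned) - VALID_MONTHS)


# Invented values for illustration only
sample_months = pd.Series(["January", "february ", "Marhc", "Unknown"])
invalid = find_invalid_months(sample_months)
print(invalid)
```

An empty list would mean every value is a recognized month. A placeholder such as "Unknown" may be legitimate in the real data and would then need to be added to the allowed set rather than corrected.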

In [1]:
import os

# Resolve the project root, three levels above the current working directory
current_path = os.getcwd()
project_folder = os.path.abspath(os.path.join(current_path, '..', '..', '..'))
print(project_folder)

C:\projects\BigData2Project


In [2]:
from Downloaders import OpenCanada as oc
from Datasets import HomelessnessDataset as homelessDS

downloader = oc.TorontoDownloader()
category = homelessDS.HomelessnessDataset()

downloader.load_pages(category.get_urls_death_people_experiencing_homelessness())
downloader.get_datasets_info()

download_folder = os.path.join(project_folder, 'data', 'raw', 'homeless', 'death_by')
downloader.download_datasets(output_directory=download_folder)

[Downloader] Added 1 urls to the list.
0) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month, Last Modified: None, Type: CSV, Size: 0.00 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/datastore/dump/8b2e5ec9-7cee-49cc-a67e-bba38e5077be
1) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month.csv, Last Modified: 2024-04-17T21:09:06.846517, Type: CSV, Size: 0.00 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/a7ae08f3-c512-4a88-bb3c-ab40eca50c5e/resource/dc4ec2fa-d655-46ca-af32-f216d26e9804/download/homeless-deaths-by-month.csv
2) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month.xml, Last Modified: 2024-04-17T21:09:08.113343, Type: XML, Size: 0.01 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/a7ae08f3-c512-4a88-bb3c-ab40eca50c5e/resource/2ba00593

# Death by Cause

In [3]:
import pandas as pd

# Load the dataset
filename = "homeless-deaths-by-cause.csv"
data_by_cause = pd.read_csv(downloader.get_path_for_downloaded_file(filename))

# Standardize Cause_of_death: trim surrounding whitespace and lower-case
data_by_cause['Cause_of_death'] = data_by_cause['Cause_of_death'].str.strip().str.lower()

# Convert data types
data_by_cause['Year of death'] = data_by_cause['Year of death'].astype(int)
data_by_cause['Count'] = data_by_cause['Count'].astype(int)

# Handle outliers: cap Count at 50, so any value above 50 is replaced by 50
data_by_cause['Count'] = data_by_cause['Count'].clip(upper=50)

# Define the path for the cleaned data
cleaned_folder = os.path.join(project_folder, 'data', 'cleaned', 'homeless', 'death_by')
os.makedirs(cleaned_folder, exist_ok=True)

cleaned_file_path = os.path.join(cleaned_folder, filename)
data_by_cause.to_csv(cleaned_file_path, index=False)
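
The cap of 50 applied to `Count` above is a fixed value. As an alternative, a cap could be derived from the data with the common 1.5 × IQR rule. The sketch below shows that rule on invented numbers. It is not the method used in this notebook, and the helper `iqr_upper_bound` is hypothetical.

```python
import pandas as pd


def iqr_upper_bound(values: pd.Series, k: float = 1.5) -> float:
    """Upper fence of the IQR rule: Q3 + k * (Q3 - Q1)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Invented counts with one extreme value
counts = pd.Series([2, 3, 4, 5, 6, 7, 120])

upper = iqr_upper_bound(counts)
# Replace values above the fence with the fence itself
capped = counts.clip(upper=upper)

print(upper)
print(capped.tolist())
```

Either approach overwrites the recorded counts, so keeping the raw file alongside the cleaned one (as done here with the separate `raw` and `cleaned` folders) allows the original values to be recovered.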

# Death by Demographic

In [4]:
import os
import pandas as pd

# Load the dataset
filename = "homeless-deaths-by-demographics.csv"
data_by_demographics = pd.read_csv(downloader.get_path_for_downloaded_file(filename))

# Convert Data Types (if necessary)
data_by_demographics['Year of death'] = data_by_demographics['Year of death'].astype(int)
data_by_demographics['Count'] = data_by_demographics['Count'].astype(int)

# Define the path for the cleaned data
cleaned_folder = os.path.join(project_folder, 'data', 'cleaned', 'homeless', 'death_by')
os.makedirs(cleaned_folder, exist_ok=True)  # Creates the directory if it doesn't exist

cleaned_file_path = os.path.join(cleaned_folder, filename)
data_by_demographics.to_csv(cleaned_file_path, index=False)

# Death by Month

In [5]:
import os
import pandas as pd

# Load the dataset
filename = "homeless-deaths-by-month.csv"
data_by_month = pd.read_csv(downloader.get_path_for_downloaded_file(filename))

# Convert Data Types (if necessary)
data_by_month['Year of death'] = data_by_month['Year of death'].astype(int)
data_by_month['Count'] = data_by_month['Count'].astype(int)

# Define the path for the cleaned data
cleaned_folder = os.path.join(project_folder, 'data', 'cleaned', 'homeless', 'death_by')
os.makedirs(cleaned_folder, exist_ok=True)  # Creates the directory if it doesn't exist

cleaned_file_path = os.path.join(cleaned_folder, filename)
data_by_month.to_csv(cleaned_file_path, index=False)