# Analysis for the Homelessness Dataset
Category: Deaths of People Experiencing Homelessness

This analysis examines the City of Toronto open dataset on deaths of people experiencing homelessness.
It covers three tables in turn: deaths by month for each year, deaths by demographics, and deaths by cause.

In [1]:
import os

# Get the current path
current_path = os.getcwd()
project_folder = os.path.abspath(os.path.join(current_path, '..', '..', '..'))
print(project_folder)


C:\projects\BigData2Project


In [2]:
# Import the OpenCanada downloader module and the homelessness dataset definitions
from Downloaders import OpenCanada as oc
from Datasets import HomelessnessDataset as homelessDS

# Create an instance of the TorontoDownloader class
downloader = oc.TorontoDownloader()

category = homelessDS.HomelessnessDataset()

# Download only the URLs for the dataset on deaths of people experiencing homelessness
downloader.load_pages(category.get_urls_death_people_experiencing_homelessness())
downloader.get_datasets_info()

download_folder = os.path.join(project_folder, 'data', 'raw', 'homeless', 'death_by')
downloader.download_datasets(output_directory=download_folder)

[Downloader] Added 1 urls to the list.
0) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month, Last Modified: None, Type: CSV, Size: 0.00 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/datastore/dump/8b2e5ec9-7cee-49cc-a67e-bba38e5077be
1) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month.csv, Last Modified: 2024-04-17T21:09:06.846517, Type: CSV, Size: 0.00 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/a7ae08f3-c512-4a88-bb3c-ab40eca50c5e/resource/dc4ec2fa-d655-46ca-af32-f216d26e9804/download/homeless-deaths-by-month.csv
2) Page URL: https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/, Name: Homeless deaths by month.xml, Last Modified: 2024-04-17T21:09:08.113343, Type: XML, Size: 0.01 MB, URL: https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/a7ae08f3-c512-4a88-bb3c-ab40eca50c5e/resource/2ba00593

# Death by Month

In [3]:
import pandas as pd

data_by_month = pd.read_csv(downloader.get_path_for_downloaded_file("homeless-deaths-by-month.csv"))
data_by_month.head()

Unnamed: 0,_id,Year of death,Month of death,Count
0,1,2022,July,19
1,2,2023,February,13
2,3,2023,April,8
3,4,2017,October,6
4,5,2023,March,14


In [17]:
from Datasets import Datasets as ds
dataset_check = ds.DatasetCheck(data_by_month)
# Perform the data quality check
quality_check_results = dataset_check.data_quality_check()

# Report missing values: if every column has a count of 0, say so; otherwise print the per-column counts
if all(value == 0 for value in quality_check_results['missing_values'].values()):
    print("No missing values were found in the dataset.")
else:
    print("Missing values were found in the dataset:")
    print(quality_check_results['missing_values'])
    
# Check for duplicates
if quality_check_results['duplicates'] == 0:
    print("No duplicate records were found.")
else:
    print("Duplicate records were found.")

Missing Values:
_id               0
Year of death     0
Month of death    0
Count             0
dtype: int64
Duplicates:
0
No missing values were found in the dataset.
No duplicate records were found.


In [7]:
# Perform the data types verification
types = dataset_check.get_data_types()
types

{'_id': 'int64',
 'Year of death': 'int64',
 'Month of death': 'object',
 'Count': 'int64'}

## Data Quality Check

In [8]:
# checking for missing values and duplicates
missing_values_month = data_by_month.isnull().sum()
duplicates_month = data_by_month.duplicated().sum()

missing_values_month, duplicates_month

(_id               0
 Year of death     0
 Month of death    0
 Count             0
 dtype: int64,
 0)

### Results
- Missing Values: No missing values were found in the dataset.
- Duplicates: No duplicate records were found.
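
The two checks above can be bundled into one small helper. This is a minimal sketch, not the project's `DatasetCheck` implementation: the function name `quality_report` is hypothetical, and the sample rows are the five shown by `data_by_month.head()`.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values per column and fully duplicated rows."""
    return {
        # Number of null cells in each column, as a plain dict
        "missing_values": {col: int(n) for col, n in df.isnull().sum().items()},
        # Number of rows that repeat an earlier row exactly
        "duplicates": int(df.duplicated().sum()),
    }


# Sample rows copied from the head() output above
sample = pd.DataFrame(
    {
        "_id": [1, 2, 3, 4, 5],
        "Year of death": [2022, 2023, 2023, 2017, 2023],
        "Month of death": ["July", "February", "April", "October", "March"],
        "Count": [19, 13, 8, 6, 14],
    }
)

report = quality_report(sample)
print(report)
```

Returning plain Python values keeps the result easy to test with `all(...)` and `==`, in the same way the earlier cell inspects `quality_check_results`.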

## Data Types Verification

We check the data type of each column to make sure it is as expected.

In [9]:
data_types_month = data_by_month.dtypes
data_types_month

_id                int64
Year of death      int64
Month of death    object
Count              int64
dtype: object

## Data Types Verification Results
- _id: Integer (int64) - Correct
- Year of death: Integer (int64) - Correct
- Month of death: String (object) - Correct
- Count: Integer (int64) - Correct

*All data types are as expected.*
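
The comparison against expected types can also be automated, so it does not rely on reading the output by eye. The sketch below is one possible approach; the helper `dtype_mismatches` is a hypothetical name, and both dictionaries are copied from the outputs above.

```python
# Expected dtypes, taken from the verification results above
expected_types = {
    "_id": "int64",
    "Year of death": "int64",
    "Month of death": "object",
    "Count": "int64",
}

# Observed dtypes, copied from the get_data_types() output above
observed_types = {
    "_id": "int64",
    "Year of death": "int64",
    "Month of death": "object",
    "Count": "int64",
}


def dtype_mismatches(expected, observed):
    """Return {column: (expected, observed)} for every column that differs."""
    columns = expected.keys() | observed.keys()
    return {
        col: (expected.get(col), observed.get(col))
        for col in sorted(columns)
        if expected.get(col) != observed.get(col)
    }


mismatches = dtype_mismatches(expected_types, observed_types)
print(mismatches or "All column types match the expected schema.")
```

A column that is missing on either side shows up with `None` in its tuple, so renamed or dropped columns are reported as well.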

# Identify Outliers

We look for outliers in the Count column, which holds the number of deaths, to spot unusually high or low monthly values.

## Data Consistency Check
We will also verify the consistency of categorical values in the column Month of death.

In [10]:
outliers_month = data_by_month['Count'].describe()
unique_month_of_death = data_by_month['Month of death'].unique()
outliers_month, unique_month_of_death

(count    84.000000
 mean     12.166667
 std       5.060676
 min       1.000000
 25%       8.000000
 50%      11.000000
 75%      15.000000
 max      26.000000
 Name: Count, dtype: float64,
 array(['July', 'February', 'April', 'October', 'March', 'August',
        'November', 'May', 'January', 'December', 'June', 'September'],
       dtype=object))

## Outliers and Data Consistency Check Results

### Outliers in Count Column:
- Count: 84 records
- Mean: 12.17
- Standard Deviation: 5.06
- Minimum: 1
- Maximum: 26

The maximum of 26 is about 2.7 standard deviations above the mean of 12.17. It also lies just above the 1.5 × IQR upper fence of 25.5 (Q3 = 15, IQR = 7), so it is at most a mild outlier.
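
One common way to make this judgment explicit is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles printed by `describe()`; the helper name `tukey_fences` is illustrative and not part of the notebook.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes copied from the describe() output above
q1, q3 = 8.0, 15.0
minimum, maximum = 1.0, 26.0

lower, upper = tukey_fences(q1, q3)
print(f"fences: ({lower}, {upper})")
print("min below lower fence:", minimum < lower)
print("max above upper fence:", maximum > upper)
```

Under this rule, any value outside the two fences would be flagged for a closer look rather than removed automatically.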

### Consistency of Categorical Values:
Month of death:
- Unique Values: ['July', 'February', 'April', 'October', 'March', 'August', 'November', 'May', 'January', 'December', 'June', 'September']

These categories are consistent and cover all months of the year.
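
The same check can be done programmatically against the standard library's month names. This is a sketch under one assumption: the interpreter runs with an English (or default C) locale, so that `calendar.month_name` yields English names.

```python
import calendar

# Unique values copied from the unique() output above
observed_months = [
    "July", "February", "April", "October", "March", "August",
    "November", "May", "January", "December", "June", "September",
]

# calendar.month_name[0] is an empty string, so skip it
expected_months = list(calendar.month_name)[1:]

missing = set(expected_months) - set(observed_months)
unexpected = set(observed_months) - set(expected_months)
print("missing:", missing, "unexpected:", unexpected)

# Sort the observed labels chronologically instead of alphabetically
month_order = {name: i for i, name in enumerate(expected_months, start=1)}
chronological = sorted(observed_months, key=month_order.__getitem__)
print(chronological)
```

The `month_order` mapping is also handy later, since month names sort alphabetically by default and seasonal patterns are easier to read in calendar order.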

# Summary and Documentation for PySpark

1. Schema Definition:

```json
{
  "_id": "Integer - Unique identifier for each record",
  "Year of death": "Integer - The year when the death occurred",
  "Month of death": "String - The month when the death occurred",
  "Count": "Integer - The number of deaths"
}
```

2. Data Quality:
- No missing values.
- No duplicate records.
- Data types are correct.
- No significant outliers in the Count column (maximum of 26, about 2.7 standard deviations above the mean of 12.17).

3. Data Consistency:
- Month of death: Consistent and correctly categorized.

4. Dataset Overview:
This dataset contains records of homeless deaths in Toronto, categorized by month and year, with the count of deaths for each month.
The data spans multiple years and provides insights into the seasonal trends in deaths among the homeless population.

5. Source Information:
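City of Toronto Open Data portal (https://open.toronto.ca/dataset/deaths-of-people-experiencing-homelessness/). The downloaded `homeless-deaths-by-month.csv` resource was last modified on 2024-04-17.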

6. Metadata:
Any additional notes or observations about the data.
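
For the PySpark step, the schema in item 1 could be written as a DDL-formatted string. The following is a minimal sketch, assuming Spark SQL types `INT` and `STRING` for the columns and backtick quoting for names that contain spaces; the helper `to_ddl` is a hypothetical name.

```python
# Column names and Spark SQL types, following the schema definition above
columns = {
    "_id": "INT",
    "Year of death": "INT",
    "Month of death": "STRING",
    "Count": "INT",
}


def to_ddl(columns: dict[str, str]) -> str:
    """Build a DDL schema string; backticks protect names that contain spaces."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns.items())


schema_ddl = to_ddl(columns)
print(schema_ddl)

# Assumed usage with an existing SparkSession named `spark` (not run here):
# df = spark.read.csv(path, header=True, schema=schema_ddl)
```

Passing an explicit schema avoids type inference on load, which keeps the column types identical to those verified in pandas.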

# Death by Demographics

In [11]:
import pandas as pd
data_by_demographics = pd.read_csv(downloader.get_path_for_downloaded_file("homeless-deaths-by-demographics.csv"))
data_by_demographics.head()

Unnamed: 0,_id,Year of death,Age_group,Gender,Count
0,1,2017,Unknown,Female,1
1,2,2017,20-39,Transgender,1
2,3,2017,60+,Female,6
3,4,2017,40-59,Female,12
4,5,2017,60+,Male,20


In [None]:
# Check for missing values
missing_values_demographics = data_by_demographics.isnull().sum()

# Check for duplicates
duplicates_demographics = data_by_demographics.duplicated().sum()

missing_values_demographics, duplicates_demographics


In [None]:
# Verify data types
data_types_demographics = data_by_demographics.dtypes

data_types_demographics


In [None]:
# Identify outliers in the 'Count' column
outliers_demographics = data_by_demographics['Count'].describe()

# Verify consistency of categorical values
unique_age_group_demographics = data_by_demographics['Age_group'].unique()
unique_gender_demographics = data_by_demographics['Gender'].unique()

outliers_demographics, unique_age_group_demographics, unique_gender_demographics


## Summary and Documentation for PySpark

1. Schema Definition:
{
  "_id": "Integer - Unique identifier for each record",
  "Year of death": "Integer - The year when the death occurred",
  "Age_group": "String - The age group of the deceased (e.g., Unknown, 20-39, 40-59, 60+, <20)",
  "Gender": "String - The gender of the deceased (e.g., Male, Female, Transgender, Unknown)",
  "Count": "Integer - The number of deaths"
}

2. Data Quality:
- No missing values.
- No duplicate records.
- Data types are correct.
- One potential outlier in the Count column (value of 70).

3. Data Consistency:
- Age_group and Gender: Consistent and correctly categorized.

4. Dataset Overview:
This dataset contains records of homeless deaths in Toronto, categorized by age group and gender, with the count of deaths for each category.
The data spans multiple years and provides insights into the trends and demographics of deaths among the homeless population.

5. Source Information:
[TODO] We need to provide information about the source and date of the last update.

6. Metadata:
Any additional notes or observations about the data.
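
The consistency claim for `Age_group` and `Gender` can be backed by an explicit check against allowed labels. The sketch below assumes the allowed sets are exactly the examples listed in the schema, which may not be exhaustive; the helper `unexpected_labels` is a hypothetical name, and the rows are taken from the `head()` output above.

```python
# Allowed labels, assumed from the examples listed in the schema above;
# the real dataset may contain additional categories
ALLOWED = {
    "Age_group": {"Unknown", "<20", "20-39", "40-59", "60+"},
    "Gender": {"Male", "Female", "Transgender", "Unknown"},
}

# (Age_group, Gender) values copied from the head() output above
rows = [
    {"Age_group": "Unknown", "Gender": "Female"},
    {"Age_group": "20-39", "Gender": "Transgender"},
    {"Age_group": "60+", "Gender": "Female"},
    {"Age_group": "40-59", "Gender": "Female"},
    {"Age_group": "60+", "Gender": "Male"},
]


def unexpected_labels(rows, allowed):
    """Return {column: sorted labels not in the allowed set} for offending columns."""
    found = {}
    for column, valid in allowed.items():
        extra = {row[column] for row in rows} - valid
        if extra:
            found[column] = sorted(extra)
    return found


print(unexpected_labels(rows, ALLOWED))
```

An empty result means every label is in its allowed set; anything else lists the labels that need review, such as a stray spelling or spacing variant.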

# Death by Cause

In [None]:
import pandas as pd

# Load the CSV file from the download folder
data_by_cause = pd.read_csv(downloader.get_path_for_downloaded_file("homeless-deaths-by-cause.csv"))

# Display the first few rows of the dataset
data_by_cause.head()


In [None]:
# Check for missing values
missing_values = data_by_cause.isnull().sum()

# Check for duplicates
duplicates = data_by_cause.duplicated().sum()

missing_values, duplicates


In [None]:
# Verify data types
data_types = data_by_cause.dtypes

data_types


In [None]:
# Identify outliers in the 'Count' column
outliers = data_by_cause['Count'].describe()

# Verify consistency of categorical values
unique_cause_of_death = data_by_cause['Cause_of_death'].unique()
unique_age_group = data_by_cause['Age_group'].unique()
unique_gender = data_by_cause['Gender'].unique()

outliers, unique_cause_of_death, unique_age_group, unique_gender


## Summary and Documentation for PySpark

1. Schema Definition:
{
  "_id": "Integer - Unique identifier for each record",
  "Year of death": "Integer - The year when the death occurred",
  "Cause_of_death": "String - The cause of death (e.g., Cardiovascular Disease, Other, Suicide, Accident)",
  "Age_group": "String - The age group of the deceased (e.g., Unknown, 20-39, 40-59, 60+, <20)",
  "Gender": "String - The gender of the deceased (e.g., Male, Female, Unknown)",
  "Count": "Integer - The number of deaths"
}

2. Data Quality:
- No missing values.
- No duplicate records.
- Data types are correct.
- One potential outlier in the Count column (value of 51).

3. Data Consistency:
- Cause_of_death: Standardize "Drug Toxicity" and "Drug toxicity" to a single category.
- Age_group and Gender: Consistent and correctly categorized.

4. Dataset Overview:
This dataset contains records of homeless deaths in Toronto, categorized by cause, age group, and gender, with the count of deaths for each category.
The data spans multiple years and provides insights into the trends and causes of deaths among the homeless population.

5. Source Information:
[TODO] We need to provide information about the source and date of the last update.
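
The standardization noted under Data Consistency ("Drug Toxicity" versus "Drug toxicity") could be handled by grouping labels case-insensitively. This is a sketch of one possible approach, not the notebook's method: the helper `standardize_labels` is hypothetical, and keeping the first spelling seen as the canonical one is an arbitrary choice.

```python
def standardize_labels(labels):
    """Map each label to one canonical spelling per case-insensitive group.

    The canonical form is the first spelling seen for each group.
    """
    canonical = {}
    result = []
    for label in labels:
        # Collapse repeated whitespace and ignore case when comparing
        key = " ".join(label.split()).casefold()
        canonical.setdefault(key, label.strip())
        result.append(canonical[key])
    return result


# Illustrative labels: the two "Drug" spellings are the ones named above;
# the others come from the schema examples
causes = ["Drug Toxicity", "Suicide", "Drug toxicity", "Accident"]
print(standardize_labels(causes))
```

Applying the mapping before any group-by keeps the two spellings from being counted as separate causes.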