# COGS 108 - Final Project 

# Overview

This project examines the relationship between environmental health and the presence of parks within San Diego communities. It draws on two datasets: environmental health screenings for California locations and a set of park locations in different parts of San Diego.

# Name & GitHub ID

- Name: Dylan Cokic
- GitHub Username: dylpc

# Research Question

Does the establishment of parks in San Diego communities directly improve their overall environmental health?

## Background and Prior Work

Parks are widely reported to promote environmental quality and health, so this study considers whether parks play a significant role in boosting the overall health quality of communities in San Diego.
One way parks benefit the environment is by bringing more nature, especially trees, into their communities. Trees can improve air quality, which reduces pollution and the risk of physical illness [5]. Parks also increase rates of outdoor activity, which can strengthen people's health by reducing the risk of various diseases, prolonging lifespan, and improving mental health [1].
Relating these factors to the presence of parks in various San Diego communities will help determine whether there is a direct relationship between park establishment and healthy living (as measured by environmental health screenings).

References:
- 1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3091337/
- 2) https://www.americantrails.org/resources/improving-public-health-through-public-parks-and-trails
- 3) https://data.sandiegocounty.gov/Environment/A-2-2-Increase-County-Tree-Planting/4yft-ep46
- 4) https://www.sandiegocounty.gov/hhsa/programs/phs/community_health_statistics/
- 5) https://www.cdc.gov/healthyplaces/healthtopics/parks.htm#:~:text=The%20physical%20activity%20you%20get,control%20your%20weight

# Hypothesis


I hypothesize that parks do improve the overall environmental health of communities in San Diego, as their presence improves air quality and leads to lower pollution levels.

# Dataset(s)

Two datasets were used for this analysis: San Diego parks and recreation location data, and California environmental health screening data.

- Dataset Name: SD Parks Locations
- Link to the dataset: https://data.sandiego.gov/datasets/park-locations/
- Number of observations: 2770
- Description: Locations of parks in San Diego. The file was converted to CSV format prior to data analysis.

- Dataset Name: CalEnviroScreen 2.0
- Link to the dataset: https://data.ca.gov/dataset/calenviroscreen-2-0
- Number of observations: 8036
- Description: Various communities in California are screened for environmental health via their water, waste, pollution, asthma levels, poverty levels, and more. All of these variables are used to calculate each community's CalEnviroScreen score, which summarizes its overall environmental health.

The two datasets will be used to compare the relative environmental health levels of different communities with respect to the prominence of parks within them. This allows me to determine whether there is a relationship between park presence and overall environmental health. Additionally, looking at individual variables in the health screening dataset will be useful for identifying possible outliers and oddities in the analysis, as factors such as employment rate and poverty may be confounds.
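
The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `community`, `park_name`, and `ces_score` are hypothetical, and every value in the toy frames is an invented placeholder, not data from either file. It counts parks per community, averages the score per community, and computes a Pearson correlation between the two.

```python
import pandas as pd

# Toy stand-ins for the two datasets. All values and column names are
# hypothetical placeholders, not taken from the real files.
parks = pd.DataFrame({
    "park_name": ["A", "B", "C", "D", "E", "F"],
    "community": ["North", "North", "North", "East", "East", "South"],
})
env = pd.DataFrame({
    "community": ["North", "North", "East", "South", "West"],
    "ces_score": [10.0, 14.0, 22.0, 30.0, 41.0],
})

# Number of parks listed for each community.
park_counts = parks.groupby("community").size().rename("park_count")

# Mean score per community (a community may appear in several rows).
mean_scores = env.groupby("community")["ces_score"].mean()

# Left-join so communities with no listed parks are kept with a count of 0.
combined = mean_scores.to_frame().join(park_counts, how="left")
combined["park_count"] = combined["park_count"].fillna(0).astype(int)

# Pearson correlation between park count and mean score.
r = combined["park_count"].corr(combined["ces_score"])
print(combined)
print(f"Pearson r = {r:.2f}")
```

Because a higher CalEnviroScreen score indicates a greater environmental burden, a negative correlation in the real data would be the pattern consistent with the hypothesis. A correlation alone would not show that parks are the cause.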

# Setup

In [3]:
#imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#read data
df_env = pd.read_csv("data/calenviroscreen-final-report.csv")
df_parks = pd.read_csv("data/parks_datasd.csv")

# Data Cleaning

I stripped both datasets of information that is unnecessary for this analysis. For the environmental screening dataset, I removed most of the columns for the individual screened variables (such as drinking water, low birth weight, and traffic), because all of them feed into the CES (CalEnviroScreen) score, which is the main measure I will use for my conclusions. However, I kept a few columns, such as the asthma, pollution, and poverty percentiles, as they may be useful in the final analysis. I also dropped the rows for locations outside San Diego, since this analysis focuses on San Diego communities and those rows alone provide enough data to draw viable conclusions.
For the parks dataset, I removed the coordinates, as park locations can be determined from their names and community owners. I also removed the park type column, as knowing whether a park is state-owned or locally owned is not relevant to this analysis.

In [4]:
#drop unnecessary info (columns) from environmental health screening
df_env = df_env.drop(["Census Tract", "Click for interactive map", "Hyperlink", "Ozone", "Ozone Pctl", "PM2.5", "PM2.5 Pctl", "Diesel PM", "Diesel PM Pctl", "Drinking Water", "Drinking Water Pctl", "Pesticides", "Pesticides Pctl", "Tox. Release", "Tox. Release Pctl", "Traffic", "Traffic Pctl", "Cleanup Sites", "Cleanup Sites Pctl", "Groundwater Threats", "Groundwater Threats Pctl", "Haz. Waste", "Haz. Waste Pctl", "Imp. Water Bodies", "Imp Water Bodies Pctl", "Solid Waste", "Solid Waste Pctl", "Pollution Burden", "Age", "Asthma", "Low Birth Weight", "Low Birth Weight Pctl", "Education", "Education Pctl", "Linguistic Isolation", "Linguistic Isolation Pctl", "Poverty", "Unemployment", "Pop. Char.", "Pop. Char. Score", "Pop. Char. Pctl"], axis=1)
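#drop non-San Diego rows (== compares values; a single = is an assignment and fails here)
#the county column is referred to as "California" here; adjust if the CSV labels it differently
df_env = df_env[df_env["California"] == "San Diego"]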

#drop columns from parks dataset (columns= is required; without it drop() looks for row labels)
df_parks = df_parks.drop(columns=["X", "Y", "park_type"])
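
A small, self-contained sketch of the same two cleaning steps (drop columns, then keep one county's rows) on a toy frame, followed by sanity checks. The helper name `drop_and_filter`, the column names `county` and `score`, and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

def drop_and_filter(df, drop_cols, county_col, county):
    """Drop the listed columns, then keep only the rows for one county."""
    # drop(columns=...) raises a KeyError on a misspelled column name,
    # which surfaces typos early.
    trimmed = df.drop(columns=drop_cols)
    # == builds a boolean mask that selects the matching rows.
    mask = trimmed[county_col] == county
    return trimmed[mask].reset_index(drop=True)

# Toy frame with invented values.
toy = pd.DataFrame({
    "county": ["San Diego", "Los Angeles", "San Diego"],
    "Ozone": [0.1, 0.2, 0.3],
    "score": [12.5, 40.1, 25.3],
})
cleaned = drop_and_filter(toy, ["Ozone"], "county", "San Diego")

# Sanity checks: the dropped column is gone and only one county remains.
assert "Ozone" not in cleaned.columns
assert (cleaned["county"] == "San Diego").all()
print(cleaned)
```

Running similar checks on the real frames after cleaning is a cheap way to confirm the filter and the column drops behaved as intended.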


# Data Analysis & Results

This section is unfinished; please see Ethics & Privacy.

In [5]:
## UNFINISHED SECTION

# Ethics & Privacy

As stated previously, people's health may be affected not only by the prevalence of parks in their area but also by other life factors such as income and age, since lower income or old age may be associated with greater health risks. Additionally, low-income and poverty-stricken areas are less likely to be able to afford parks and other environmentally beneficial amenities, so it would be biased and unethical to state that they are less healthy solely because of their relative lack of parks. For these reasons, I kept the aforementioned possible confounds in the environmental screening dataset, as they are necessary to explain inconsistencies or bias in the data. In terms of privacy, I found no issues: the datasets used are publicly available and do not contain any private or revealing information.
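
One way to probe the confound concern above is to compare the raw correlation between park count and score with the correlation that remains after removing the linear effect of poverty from both (a partial correlation). The sketch below uses purely synthetic data in which poverty drives both variables by construction; the variable names and coefficients are invented for illustration and say nothing about the real datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: poverty lowers park count and raises the score, and there
# is no direct link between parks and score. All numbers are invented.
poverty = rng.uniform(0, 100, n)
parks = 10 - 0.08 * poverty + rng.normal(0, 1, n)
ces = 5 + 0.4 * poverty + rng.normal(0, 5, n)

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation between parks and score.
raw_r = np.corrcoef(parks, ces)[0, 1]

# Partial correlation: correlate what is left of each variable once the
# linear effect of poverty has been removed.
partial_r = np.corrcoef(residuals(parks, poverty),
                        residuals(ces, poverty))[0, 1]

print(f"raw r = {raw_r:.2f}, partial r (poverty removed) = {partial_r:.2f}")
```

In this synthetic setup the raw correlation is strongly negative while the partial correlation is close to zero, which illustrates how a confound alone can produce an apparent park effect. Applied to the real data, a similar comparison using the retained poverty percentile column could indicate how much of any park relationship survives once poverty is accounted for.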

# Conclusion & Discussion

This section is unfinished.