In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import plotly.express as px

from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [2]:
data=pd.read_csv('machine_learning_data_Final.csv')
data=data.iloc[:,1:]
data.head()

Unnamed: 0,year,district,NO2,PM2_5,O3,SO2,asthma_hosp,Boroughs
0,2009,101,23.2,11.03,23.67,6.62,61,Bronx
1,2009,102,22.39,10.68,26.82,5.38,282,Bronx
2,2009,103,24.82,11.1,24.47,9.48,693,Bronx
3,2009,104,22.83,10.59,26.72,5.15,588,Bronx
4,2009,105,28.07,11.76,23.08,9.36,631,Bronx


In [3]:
def rep_array(array,nrep=5):
    # Repeat the whole sequence nrep times, end to end, as a flat list
    return list(array)*nrep

"""def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))
"""

"""def NormalizeData(array):
    #norm = np.linalg.norm(array)
    norm = np.mean(array)
    return array/norm"""

def NormalizeData(array):
    return array

# Mean asthma hospitalizations per year
asthma_per_year=NormalizeData([data[data.year==el].asthma_hosp.mean() for el in np.unique(data.year)])

# Mean pollutants per year
PM2_5_per_year=NormalizeData([data[data.year==el].PM2_5.mean() for el in np.unique(data.year)])
NO2_per_year=NormalizeData([data[data.year==el].NO2.mean() for el in np.unique(data.year)])
O3_per_year=NormalizeData([data[data.year==el].O3.mean() for el in np.unique(data.year)])
SO2_per_year=NormalizeData([data[data.year==el].SO2.mean() for el in np.unique(data.year)])

years=np.unique(data.year)

# One label per (series, year) pair; use len(years) instead of a hard-coded 8
labels=[np.repeat(name,len(years)) for name in
        ['PM2_5','NO2','O3','SO2','Asthma Hospitalizations']]

labels=np.hstack(labels)
y=rep_array(years)
values=np.hstack([PM2_5_per_year,NO2_per_year,O3_per_year,SO2_per_year,asthma_per_year])
values

line_plot_df=pd.DataFrame({'year':y,'value':values,'label':labels})
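
The per-year means and the long-format table built above can also be obtained in one pass with `groupby` and `melt`. The following is a minimal sketch, not the notebook's own method: `toy` is a made-up stand-in for `data` (its numbers are illustrative only), and `yearly_means` and `to_long` are hypothetical helper names.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "PM2_5": [11.0, 10.0, 9.0, 8.0],
    "asthma_hosp": [60, 280, 50, 250],
})

def yearly_means(df, columns):
    """Mean of each requested column per year, one row per year."""
    return df.groupby("year")[columns].mean()

def to_long(means):
    """Reshape wide yearly means into long (year, label, value) rows."""
    return means.reset_index().melt(
        id_vars="year", var_name="label", value_name="value"
    )

means = yearly_means(toy, ["PM2_5", "asthma_hosp"])
long_df = to_long(means)
print(long_df)
```

The long table has the same `year`, `label` and `value` columns as `line_plot_df`, without having to track how many years each series covers.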

# Air pollution and asthma hospitalizations in children. What does data visualization tell us?

It is widely accepted that air pollution may be one of the main causes of respiratory diseases, and of asthma in particular. Asthma is a chronic respiratory disease characterized by variable airflow obstruction, bronchial hyperresponsiveness, and airway inflammation. [[1]](https://pubmed.ncbi.nlm.nih.gov/32867076/) Researchers have long linked asthma with exposure to air pollution, which can make asthma symptoms worse and trigger asthma attacks. Moreover, an estimated six million children in the United States have asthma and are especially vulnerable to air pollution. [[2]](https://www.epa.gov/sciencematters/links-between-air-pollution-and-childhood-asthma#:~:text=Researchers%20have%20long%20linked%20asthma,worse%20and%20trigger%20asthma%20attacks) The vulnerability is not limited to children who already have the disease: exposure to major air pollutants such as NO2, CO, and PM2.5 has also been linked to regional DNA methylation differences in asthma. [[3]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5756438/pdf/13148_2017_Article_433.pdf)

The aim of this study is to dig deeper into the relation between air pollution and asthma cases in children. To do so, NYC air pollution data is combined with asthma hospitalization data for children. Is there a clear correlation between air pollution and the number of asthma hospitalizations? If so, which air pollutants are most strongly correlated? How do these values vary between neighbourhoods? Where is it most urgent to reduce emissions of hazardous air pollutants? Is it possible to predict the number of asthma hospitalizations, so that the hospitals facing the heaviest workloads can prepare?

But to start with, let's take a look at how the number of asthma cases in children and the levels of pollutants have evolved in recent years, and which trends they show.

In [5]:
fig = make_subplots(
    rows=1, cols=5,column_widths=[0.4, 0.15,0.15,0.15,0.15],
    subplot_titles=("Asthma hosp","PM2.5", "NO2", "O3", "SO2"))

fig.add_trace(go.Scatter(x=years, y=asthma_per_year,name='Asthma hosp'),
              row=1, col=1)

fig.add_trace(go.Scatter(x=years, y=PM2_5_per_year,name='PM2.5'),
              row=1, col=2)

fig.add_trace(go.Scatter(x=years, y=NO2_per_year,name='NO2'),
              row=1, col=3)

fig.add_trace(go.Scatter(x=years, y=O3_per_year,name='O3'),
              row=1, col=4)

fig.add_trace(go.Scatter(x=years, y=SO2_per_year,name='SO2'),
              row=1, col=5)

# Update xaxis properties (with no row/col given, this applies to every subplot)
fig.update_xaxes(title_text="year")


fig.update_layout(height=350, width=1000,
                  title_text='Evolution of Air Pollution and Children Asthma Hospitalizations in NYC')

fig.update_layout(showlegend=False)
fig.show()

Both the number of asthma hospitalizations and the levels of the main pollutants show a clear downward trend. With the exception of O3, the levels of PM2.5, NO2 and SO2 have been significantly reduced over the last decade, which is consistent with sustainability policies having been put into practice. The air quality levels registered during the 2000-2010 decade and their impact on the population's health drew the attention of key decision makers and the public, which led to policies aiming to improve local air quality, reduce greenhouse gas emissions and reverse the worsening air pollution. [[4]](https://www.healthypeople.gov/2020/healthy-people-in-action/story/new-york-city-air-quality-programs-reduce-harmful-air-pollutants)
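
One way to put numbers on these trends is to report, for each yearly series, a least-squares slope and the relative change between the first and last year. The sketch below assumes a hypothetical helper `trend_summary` and uses made-up values rather than the notebook's data.

```python
import numpy as np

def trend_summary(years, values):
    """Return (slope per year, relative change from first to last year)."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    # Center the years so the fit is well conditioned; the slope is unchanged.
    slope, _intercept = np.polyfit(years - years.mean(), values, 1)
    rel_change = (values[-1] - values[0]) / values[0]
    return slope, rel_change

# Made-up yearly means for illustration only.
example_years = [2009, 2010, 2011, 2012]
example_values = [10.0, 9.0, 8.0, 7.0]

slope, rel_change = trend_summary(example_years, example_values)
print(f"slope per year: {slope:.2f}, relative change: {rel_change:.0%}")
```

Applied to series such as `PM2_5_per_year` or `asthma_per_year`, a negative slope and a negative relative change would support the visual impression of a decline, while a slope near zero would correspond to a flat series.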

Having seen the trend that both asthma cases and air pollutants follow, the next questions are: is it possible to build a model that relates the number of asthma hospitalizations to the levels of the different air pollutants? And what impact does each district have?

### Air Pollutant type vs Pearson product-moment correlation coefficients

(Should we include any more descriptive plot before directly implementing the machine learning model?)

In [19]:
corr1=np.corrcoef(data.PM2_5,data.asthma_hosp)[0][1]
corr2=np.corrcoef(data.NO2,data.asthma_hosp)[0][1]
corr3=np.corrcoef(data.O3,data.asthma_hosp)[0][1]
corr4=np.corrcoef(data.SO2,data.asthma_hosp)[0][1]

corr_df=pd.DataFrame({'Pollutants':['PM2.5','NO2','O3','SO2'],'Correlation Coef.':[corr1,corr2,corr3,corr4]})

fig = px.bar(corr_df,y='Pollutants',x='Correlation Coef.')
fig.show()
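
For reference, the Pearson product-moment coefficient that `np.corrcoef` returns is the covariance of the two variables divided by the product of their standard deviations. The following is a plain-Python sketch of that formula, using a hypothetical helper `pearson_r` and made-up inputs.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, increasing
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly linear, decreasing
r_partial = pearson_r([1, 2, 3], [1, 3, 2])         # weaker positive relation
print(r_perfect, r_inverse, r_partial)
```

The coefficient only measures linear association, so a value near zero for a pollutant does not by itself rule out a non-linear relation with hospitalizations.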

### Asthma Hospitalizations vs SO2

In [22]:
fig = px.scatter(data, x='SO2', y="asthma_hosp", trendline="ols",color="year")
fig.show()

To answer these questions, a machine learning model has been built.

## Introduce the machine learning model and all that stuff

## Pollution per Borough: The case of PM2.5

In [24]:
fig = px.scatter(data, x="PM2_5", y="asthma_hosp", color="year",trendline="ols", facet_col="Boroughs")
fig.show()

### More money = Better healthcare. When the trendline varies widely depending on the borough
Something really interesting can be observed here: the regression trendlines differ markedly from borough to borough. While in the Bronx the number of asthma hospitalizations increases strongly as PM2.5 levels increase, in Manhattan exactly the opposite happens.

The Bronx, Brooklyn and Queens seem to show the expected behaviour: as PM2.5 levels increase, the air becomes more polluted, and so the number of asthma hospitalizations increases. However, why does the number of hospitalizations decrease in Manhattan as air pollution levels increase?
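
The sign and size of each borough's trendline can be checked numerically by fitting a separate least-squares line per group. This is a sketch under stated assumptions: `toy` is a made-up frame that reuses the column names of `data`, the borough labels `A` and `B` are placeholders, and `ols_slope` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up data with the same column names as `data`; "A" and "B" are placeholders.
toy = pd.DataFrame({
    "Boroughs": ["A"] * 3 + ["B"] * 3,
    "PM2_5": [8.0, 9.0, 10.0, 8.0, 9.0, 10.0],
    "asthma_hosp": [100.0, 150.0, 200.0, 90.0, 80.0, 70.0],
})

def ols_slope(group, x="PM2_5", y="asthma_hosp"):
    """Slope of the least-squares line of y on x within one group."""
    xs = group[x].to_numpy()
    ys = group[y].to_numpy()
    # Centering x keeps the fit well conditioned and does not change the slope.
    slope, _intercept = np.polyfit(xs - xs.mean(), ys, 1)
    return slope

slopes = {name: ols_slope(g) for name, g in toy.groupby("Boroughs")}
print(slopes)
```

A positive slope corresponds to the pattern described for the Bronx, Brooklyn and Queens, and a negative one to the pattern described for Manhattan.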

Different hypotheses arise here. Is it because in the more polluted areas of Manhattan there is less space for hospitals that can treat asthma cases? Or is it because healthcare and quality of life are simply better in Manhattan?

It is hard to believe that the first hypothesis is the cause, especially because the more polluted areas of Manhattan are home to hospitals such as Metropolitan [[5]](https://www.nychealthandhospitals.org/metropolitan/about-us/), which has a Children’s Asthma Program, and Bellevue [[6]](https://www.nychealthandhospitals.org/bellevue/health-care-services/childrens-health/), which provides multidisciplinary care for children and adolescents with asthma.

Thus, it seems that these differences might be due to better quality of life and healthcare services in Manhattan. Looking at the Median Household Income in NYC in 2017, the Bronx has the lowest value, 37,397 \\$, while Manhattan reaches 85,071 \\$.

