# Data story: tracheal, bronchus, and lung cancer



# Introduction

Cancer is a serious disease that affects many people worldwide. Its causes depend on the type of cancer. For tracheal, bronchus, and lung (TBL) cancer there are two main causes: smoking and poor air quality (Safiri et al., 2021). This data story looks at both perspectives: is TBL cancer driven by poor air quality or by smoking?
To investigate this, we use the following dataset on cancer deaths by type: https://www.kaggle.com/datasets/antimoni/cancer-deaths-by-country-and-type-1990-2016
The Cancer Deaths by Country and Type dataset covers 18 types of cancer (prostate cancer, liver cancer, etc.) and gives the number of deaths per cancer type, per year, and per country. The data runs from 1990 to 2016 and was collected from the WHO (World Health Organization).
We compare it with the following emissions dataset:
https://www.kaggle.com/datasets/thedevastator/global-fossil-co2-emissions-by-country-2002-2022

The Emissions by Country dataset contains global data on fossil-fuel CO2 emissions per country per year. It gives the total CO2 emissions per country per year, as well as the share from gas, oil, and coal and the emissions per capita. The data runs from 1750 to 2021.

And the smoking dataset:
https://www.kaggle.com/datasets/mexwell/us-smoking-trend

The Global Smoking Trend dataset contains the number of smokers per country per year. The data runs from 1980 to 2012.
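
Because the three datasets cover different periods, any comparison across them is limited to the years they share. The following is a minimal sketch of that intersection, using only the year ranges stated above. The helper name `common_year_window` is hypothetical and is not part of the notebook.

```python
def common_year_window(ranges):
    """Return the (first, last) year covered by every dataset in `ranges`."""
    start = max(first for first, _ in ranges.values())
    end = min(last for _, last in ranges.values())
    if start > end:
        raise ValueError("the datasets share no common years")
    return start, end


# Year ranges as stated in the dataset descriptions above.
dataset_years = {
    "cancer": (1990, 2016),
    "emissions": (1750, 2021),
    "smoking": (1980, 2012),
}

window = common_year_window(dataset_years)
print(window)
```

The shared window ends in 2012, which is the upper bound used by the year filters in the later cells.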



In [202]:
# Import packages
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go 
import matplotlib.pyplot as plt
import seaborn as sns
import IPython
from plotly.subplots import make_subplots

In [203]:
from IPython.display import display

# Load the Cancer Data Set
cancer_df = pd.read_csv("csv/CancerDeaths.csv")
# print("Cancer Data Set")
# display(cancer_df.head(n=5))

# Load the Emissions Data Set
emissions_df = pd.read_csv("csv/Emissions.csv")
# print("Emissions Data Set")
# display(emissions_df.iloc[250:255])

# Load the Population Data Set
population_df = pd.read_csv("csv/Population.csv")
# print("Population Data Set")
# display(population_df.head(n=5))

# Load the Smoking Data Set
smoking_df = pd.read_csv("csv/smoking.csv")
# print("Smoking Data Set")
# display(smoking_df.head(n=5))

## Particulate matter

Air pollution, and particulate matter in particular, is one of the major causes of lung cancer.

Particulate matter is produced by the incomplete combustion of fossil fuels (Cohen & Pope, 1995). In 2016, TBL cancer accounted for 19 percent of all cancer-related deaths, making it the leading cause of death among all cancer types (Safiri et al., 2021). Global emissions are rising, and with them the emission of particulate matter, which contributes to an increase in TBL cancer. The plots below place global TBL cancer deaths per year next to total global emissions per year to show how the two develop over time.

In [204]:
# Filter the DataFrame for the desired country and year range
cancer_country = 'World'  # Replace with the desired country code
emissions_country = 'Global'  # Replace with the desired country code

cancer_data = cancer_df[(cancer_df['Country'] == cancer_country) & (cancer_df['Year'] >= 2001)]
emissions_data = emissions_df[(emissions_df['Country'] == emissions_country) & (emissions_df['Year'] >= 2001)]

# Extract the Year and lung cancer columns. Column names are stripped first so the
# lookup works both before and after the whitespace cleanup in the next cell.
tbl_column = 'Tracheal, bronchus, and lung cancer'
cancer_data = cancer_data.rename(columns=str.strip)
year_lung_cancer = cancer_data[['Year', tbl_column]]
year_total_emissions = emissions_data[['Year', 'Total']]

# Create a subplot figure with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=('TBL cancer deaths per year', 'Total global emissions per year'))

# Add the lung cancer data trace to the first subplot
fig.add_trace(go.Scatter(x=year_lung_cancer['Year'], y=year_lung_cancer[tbl_column],
                         mode='lines', name='TBL cancer deaths'),
              row=1, col=1)

# Add the total emissions data trace to the second subplot
fig.add_trace(go.Scatter(x=year_total_emissions['Year'], y=year_total_emissions['Total'],
                         mode='lines', name='Total emissions'),
              row=1, col=2)

# Update the layout for the entire figure
fig.update_layout(
    title_text='TBL cancer deaths compared with total emissions per year',
    showlegend=False
)

# Update x-axis and y-axis titles for each subplot
fig.update_xaxes(title_text='Year', row=1, col=1)
fig.update_yaxes(title_text='TBL cancer deaths', row=1, col=1)
fig.update_xaxes(title_text='Year', row=1, col=2)
fig.update_yaxes(title_text='Total emissions', row=1, col=2)

# Show the plot
fig.show()
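
The two series above are in different units, so their slopes cannot be compared directly. One common way to make growth comparable is to index each series to its first year (first year = 100). The sketch below illustrates this on made-up numbers; the helper `index_to_base` and all values are hypothetical and do not come from the datasets.

```python
import pandas as pd


def index_to_base(series):
    """Rescale a series so that its first value equals 100."""
    return series / series.iloc[0] * 100.0


# Made-up yearly values, only to illustrate the rescaling.
years = [2001, 2002, 2003]
deaths = pd.Series([1.50, 1.56, 1.62], index=years)
emissions = pd.Series([25.0, 26.0, 27.5], index=years)

deaths_index = index_to_base(deaths)
emissions_index = index_to_base(emissions)

# Both series now start at 100, so later values read as growth since 2001.
print(pd.DataFrame({"deaths": deaths_index, "emissions": emissions_index}))
```

In this toy example the deaths index ends at roughly 108 and the emissions index at roughly 110, which can be read as about 8 and 10 percent growth since the first year.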

# World maps

Lung cancer also differs widely between regions and cultures. The world maps below show which parts of the world are most affected by lung cancer and by air pollution.

In [205]:
# Load the datasets
df_lung = cancer_df
df_emissions = emissions_df
df_population = population_df

# Strip any leading/trailing whitespace from column names
df_lung.columns = df_lung.columns.str.strip()
df_emissions.columns = df_emissions.columns.str.strip()
df_population.columns = df_population.columns.str.strip()

# Ensure correct data types for population columns
df_population['PopTotal'] = df_population['PopTotal'].str.replace(',', '').astype(float)

# Standardize country names using a mapping dictionary.
# A dict keeps only one value per key, so each source name is listed once.
country_name_mapping = {
    'Russian Federation': 'Russia',
    'United States of America': 'USA',
    # Add other mappings if necessary
}

# Apply the mapping to the population dataset
df_population['Location'] = df_population['Location'].replace(country_name_mapping)

# Merge the lung cancer data with population data using country names and years
df_lung_merged = pd.merge(df_lung, df_population, left_on=['Country', 'Year'], right_on=['Location', 'Time'])
df_lung_merged['Lung Cancer Per Capita'] = df_lung_merged['Tracheal, bronchus, and lung cancer'] / df_lung_merged['PopTotal']

# Merge the emissions data with population data using country names and years
df_emissions_merged = pd.merge(df_emissions, df_population, left_on=['Country', 'Year'], right_on=['Location', 'Time'])
df_emissions_merged['Emissions Per Capita'] = df_emissions_merged['Total'] / df_emissions_merged['PopTotal']

# Aggregate data by country and code
df_lung_per_capita = df_lung_merged.groupby(['Country', 'Code'])['Lung Cancer Per Capita'].mean().reset_index()
df_emissions_per_capita = df_emissions_merged.groupby(['Country', 'ISO 3166-1 alpha-3'])['Emissions Per Capita'].mean().reset_index()

# Rename the columns for better readability
df_lung_per_capita.columns = ['Country', 'Code', 'Lung Cancer Per Capita']
df_emissions_per_capita.columns = ['Country', 'Code', 'Emissions Per Capita']

# Create the choropleth map for lung cancer rates
fig_lung = px.choropleth(df_lung_per_capita,
                         locations='Code',
                         color='Lung Cancer Per Capita',
                         hover_name='Country',
                         color_continuous_scale=px.colors.sequential.Plasma,
                         title='Lung Cancer Rates per Capita by Country')

# Update layout for a larger map
fig_lung.update_layout(
    title_text='TBL cancer deaths per capita by country',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600,
    coloraxis_colorbar=dict(
        title="TBL cancer<br>per capita",
    )
)

# Create the choropleth map for emissions
fig_emissions = px.choropleth(df_emissions_per_capita,
                              locations='Code',
                              color='Emissions Per Capita',
                              hover_name='Country',
                              color_continuous_scale=px.colors.sequential.Plasma,
                              title='Emissions per Capita by Country'
                              )

# Update layout for a larger map
fig_emissions.update_layout(
    title_text='Emissions per capita by country',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600,
    coloraxis_colorbar=dict(
        title="Emissions<br>per capita",
    )
)

# Show the figures
fig_lung.show()
fig_emissions.show()
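
The maps above depend on country names matching across datasets: an inner merge silently drops every country whose name differs between the two tables. The sketch below shows one way to list such countries with the `indicator=True` option of `pd.merge`. The toy frames, their values, and the helper `unmatched_countries` are hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Toy stand-ins for the cancer and population tables (values are made up).
cancer = pd.DataFrame({
    "Country": ["India", "United States", "Russia"],
    "Year": [2010, 2010, 2010],
    "Deaths": [100.0, 200.0, 50.0],
})
population = pd.DataFrame({
    "Location": ["India", "United States of America", "Russian Federation"],
    "Time": [2010, 2010, 2010],
    "PopTotal": [1000.0, 300.0, 140.0],
})


def unmatched_countries(left, right):
    """List countries in `left` that find no partner row in `right`."""
    merged = pd.merge(
        left, right,
        left_on=["Country", "Year"], right_on=["Location", "Time"],
        how="left", indicator=True,
    )
    missing = merged.loc[merged["_merge"] == "left_only", "Country"]
    return sorted(missing.unique())


before = unmatched_countries(cancer, population)

# Apply a name mapping to the population table and check again.
name_map = {
    "United States of America": "United States",
    "Russian Federation": "Russia",
}
population["Location"] = population["Location"].replace(name_map)
after = unmatched_countries(cancer, population)

print("unmatched before mapping:", before)
print("unmatched after mapping:", after)
```

Running such a check before each merge makes it visible which countries still need an entry in the mapping dictionary.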

# India and the United States

The figure below contains two plots, one per country, each showing TBL cancer deaths per capita alongside emissions per capita. They show that TBL cancer deaths rise markedly when emissions per capita rise as well.

In [206]:
# Standardize country names using a mapping dictionary
country_name_mapping = {
    'Russian Federation': 'Russia',
    'United States of America': 'USA',
    'United States': 'USA',
    # Add other mappings if necessary
}

# Apply the mapping to the emissions dataset
df_emissions['Country'] = df_emissions['Country'].replace(country_name_mapping)

# Apply the mapping to the lung cancer dataset
df_lung['Country'] = df_lung['Country'].replace(country_name_mapping)

# Filter data to include only years 1990-2012
df_lung_filtered = df_lung[(df_lung['Year'] >= 1990) & (df_lung['Year'] <= 2012)]
df_emissions_filtered = df_emissions[(df_emissions['Year'] >= 1990) & (df_emissions['Year'] <= 2012)]

# Merge the lung cancer data with population data
df_lung_merged = pd.merge(df_lung_filtered, df_population, left_on=['Country', 'Year'], right_on=['Location', 'Time'])
df_lung_merged['Lung Cancer Per Capita'] = df_lung_merged['Tracheal, bronchus, and lung cancer'] / df_lung_merged['PopTotal']

# Merge the emissions data with population data
df_emissions_merged = pd.merge(df_emissions_filtered, df_population, left_on=['Country', 'Year'], right_on=['Location', 'Time'])
df_emissions_merged['Emissions Per Capita'] = df_emissions_merged['Total'] / df_emissions_merged['PopTotal']

# Aggregate data by country, year, and code
df_lung_per_capita = df_lung_merged.groupby(['Country', 'Year', 'Code'])['Lung Cancer Per Capita'].mean().reset_index()
df_emissions_per_capita = df_emissions_merged.groupby(['Country', 'Year', 'ISO 3166-1 alpha-3'])['Emissions Per Capita'].mean().reset_index()

# Rename the columns for better readability
df_lung_per_capita.columns = ['Country', 'Year', 'Code', 'Lung Cancer Per Capita']
df_emissions_per_capita.columns = ['Country', 'Year', 'Code', 'Emissions Per Capita']

# Merge the two datasets on country code and year
df_combined = pd.merge(df_lung_per_capita, df_emissions_per_capita, on=['Country', 'Year', 'Code'])

In [207]:
def plot_country_data(df, countries):
    """Plot TBL cancer and emissions per capita for two countries, side by side."""
    # Create figure with secondary y-axes and two columns
    fig = make_subplots(rows=1, cols=2, specs=[[{"secondary_y": True}, {"secondary_y": True}]],
                        subplot_titles=(countries[0], countries[1]))

    for i, country in enumerate(countries):
        df_country = df[df['Country'] == country]
        # Add traces for Lung Cancer Per Capita
        fig.add_trace(
            go.Scatter(x=df_country['Year'], y=df_country['Lung Cancer Per Capita'], name=f"TBL cancer per capita - {country}"),
            row=1, col=i+1, secondary_y=False,
        )

        # Add traces for Emissions Per Capita
        fig.add_trace(
            go.Scatter(x=df_country['Year'], y=df_country['Emissions Per Capita'], name=f"Emissions per capita - {country}"),
            row=1, col=i+1, secondary_y=True,
        )

        # Set y-axes titles for each subplot
        fig.update_yaxes(title_text="<b>TBL cancer per capita</b>", row=1, col=i+1, secondary_y=False)
        fig.update_yaxes(title_text="<b>Emissions per capita</b>", row=1, col=i+1, secondary_y=True)

    # Update x-axis title and figure title
    fig.update_xaxes(title_text="Year", row=1, col=1)
    fig.update_xaxes(title_text="Year", row=1, col=2)
    fig.update_layout(
        title_text=f"TBL cancer and emissions per capita on dual axes: {countries[0]} and {countries[1]}",
        height=600
    )

    fig.show()

# Plot for USA and India
plot_country_data(df_combined, ['USA', 'India'])
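
The dual-axis plots show the relationship visually. To put a number on it per country, one option is to compute the Pearson correlation between the two per-capita columns within each country. The sketch below does this on made-up data for two hypothetical countries "A" and "B"; the helper `correlation_by_country` is not part of the notebook, and only the column names follow `df_combined`.

```python
import pandas as pd

# Made-up per-capita values for two hypothetical countries.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Lung Cancer Per Capita": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
    "Emissions Per Capita": [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
})


def correlation_by_country(df):
    """Pearson correlation of emissions and lung cancer per capita, per country."""
    return {
        country: group["Emissions Per Capita"].corr(group["Lung Cancer Per Capita"])
        for country, group in df.groupby("Country")
    }


result = correlation_by_country(toy)
for country, r in result.items():
    print(f"{country}: r = {r:.2f}")
```

In the toy data, country A moves in step with emissions and country B moves against them, so the two coefficients have opposite signs. Applied to `df_combined`, the same helper would give one coefficient per country instead of a visual impression.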

# Correlation plots

Below are several plots that show correlations between different variables from our datasets.

In [208]:
# Filter data to include only years 2000-2012
df_lung_filtered = df_lung[(df_lung['Year'] >= 2000) & (df_lung['Year'] <= 2012)]
df_emissions_filtered = df_emissions[(df_emissions['Year'] >= 2000) & (df_emissions['Year'] <= 2012)]
df_population_filtered = df_population[(df_population['Time'] >= 2000) & (df_population['Time'] <= 2012)]

# Extract relevant columns
df_lung_relevant = df_lung_filtered[['Country', 'Year', 'Tracheal, bronchus, and lung cancer']]
df_emissions_relevant = df_emissions_filtered[['Country', 'Year', 'Total']]
df_population_relevant = df_population_filtered[['Location', 'Time', 'PopTotal']]

# Merge datasets on Country and Year
df_merged = pd.merge(df_lung_relevant, df_emissions_relevant, left_on=['Country', 'Year'], right_on=['Country', 'Year'])
df_merged = pd.merge(df_merged, df_population_relevant, left_on=['Country', 'Year'], right_on=['Location', 'Time'])

# Calculate per capita values
df_merged['LungCancerPerCapita'] = df_merged['Tracheal, bronchus, and lung cancer'] / df_merged['PopTotal']
df_merged['EmissionsPerCapita'] = df_merged['Total'] / df_merged['PopTotal']

# Select only the numeric columns
df_numeric = df_merged[['Year', 'LungCancerPerCapita', 'EmissionsPerCapita']]

# Group by Year and average the per capita values
df_grouped = df_numeric.groupby('Year').mean().reset_index()

# Create the scatter plot with trendline using plotly
correlation_coefficient = df_grouped['EmissionsPerCapita'].corr(df_grouped['LungCancerPerCapita'])
fig = px.scatter(df_grouped, x="EmissionsPerCapita", y="LungCancerPerCapita", trendline="ols",
                 title='Correlation between TBL cancer and emissions per capita (MtCO2 per person)',
                 hover_data=["Year"],
                 labels={
                     "EmissionsPerCapita": "Total emissions per capita (MtCO2 per person)",
                     "LungCancerPerCapita": "TBL cancer deaths per capita",
                     "Year": "Year"
                 })

# Add correlation coefficient as annotation
fig.add_annotation(
    x=max(df_grouped['EmissionsPerCapita']),  # Position the annotation at the far right of the x-axis
    y=min(df_grouped['LungCancerPerCapita']),  # Position the annotation at the bottom of the y-axis
    text=f'Correlation Coefficient: {correlation_coefficient:.2f}',
    showarrow=False,
    font=dict(size=12, color="black"),
    xanchor='right',
    yanchor='bottom'
)

fig.update_layout(
    height = 600
)
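
The coefficient annotated in the plot is the Pearson correlation returned by `Series.corr`. As a reminder of what that number measures, the sketch below computes it by hand (covariance divided by the product of the standard deviations) and compares the result with pandas. The input values are made up and the helper `pearson_r` is hypothetical.

```python
import math

import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly averages, one value per year.
emissions = [4.0, 4.2, 4.5, 4.4, 4.9]
deaths = [2.0, 2.1, 2.4, 2.2, 2.6]

manual = pearson_r(emissions, deaths)
via_pandas = pd.Series(emissions).corr(pd.Series(deaths))
print(f"manual: {manual:.4f}, pandas: {via_pandas:.4f}")
```

Because the cell above first averages over all countries per year, the coefficient in the plot is computed from one point per year, so at most 13 points for 2000 to 2012.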

# Smoking

Smoking is the leading cause of TBL cancer.
Cigarettes contain many carcinogenic substances. Cigarette smoke contains more than 60 different carcinogens, including small amounts of polycyclic aromatic hydrocarbons (PAHs), one of the most potent groups of carcinogens. PAHs are products of incomplete combustion and also occur in exhaust fumes from, among other sources, cars (Hecht, 2006).

# Smokers per capita

The number of smokers per capita is a good indicator of the number of TBL cancer patients.

In [209]:
# Load the smoking data CSV file
smoking_file_path = 'csv/smoking.csv'
smoking_data = smoking_df

# Load the population data CSV file
population_file_path = 'csv/Population.csv'
population_data = pd.read_csv(population_file_path)

# Standardize country names using a mapping dictionary
country_name_mapping = {
    "United States of America": "United States",
    'Russian Federation': 'Russia',
    # Add more mappings as needed
}

# Apply country name mapping to smoking data
smoking_data['Country'] = smoking_data['Country'].replace(country_name_mapping)

# Apply country name mapping to population data
population_data['Location'] = population_data['Location'].replace(country_name_mapping)


# Remove commas from the 'PopTotal' column and convert to numeric
population_data['PopTotal'] = population_data['PopTotal'].str.replace(',', '').astype(float)

# # Convert the 'Population' column to numeric, forcing errors to NaN
# population_data['PopTotal'] = pd.to_numeric(population_data['PopTotal'], errors='coerce')

# Filter smoking data for the years 2000 to 2012
filtered_smoking_data = smoking_data[(smoking_data['Year'] >= 2000) & (smoking_data['Year'] <= 2012)]

# Group by country and calculate the average number of smokers
avg_smokers = filtered_smoking_data.groupby('Country')['Data.Smokers.Total'].mean().reset_index()

# Calculate the average population for the same period
filtered_population_data = population_data[(population_data['Time'] >= 2000) & (population_data['Time'] <= 2012)]
avg_population = filtered_population_data.groupby('Location')['PopTotal'].mean().reset_index()

# Rename columns to facilitate merging
avg_population.rename(columns={'Location': 'Country'}, inplace=True)

# Merge the average smokers and average population dataframes on the country column
merged_data = pd.merge(avg_smokers, avg_population, on='Country')

# Calculate smokers per capita
merged_data['Smokers_Per_Capita'] = merged_data['Data.Smokers.Total'] / merged_data['PopTotal']

# Create a choropleth map
fig = px.choropleth(
    merged_data,
    locations="Country",
    locationmode="country names",
    color="Smokers_Per_Capita",
    hover_name="Country",
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Average number of smokers per capita (2000-2012)"
)

fig.update_layout(
    coloraxis_colorbar=dict(
        title="Smokers<br>per capita",
    ),
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    height=600,    
)
# Show the figure
fig.show()
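
The cell above divides the average number of smokers by the average population (a ratio of means). The correlation cell further down instead divides per year and averages afterwards (a mean of ratios). The two generally give different numbers, as the sketch below illustrates on made-up values for a single hypothetical country.

```python
import pandas as pd

# Made-up yearly values for one hypothetical country.
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Smokers": [10.0, 30.0, 90.0],
    "PopTotal": [100.0, 200.0, 300.0],
})

# Ratio of means: average the counts first, then divide (as in the map cell).
ratio_of_means = toy["Smokers"].mean() / toy["PopTotal"].mean()

# Mean of ratios: divide per year first, then average.
mean_of_ratios = (toy["Smokers"] / toy["PopTotal"]).mean()

print(f"ratio of means: {ratio_of_means:.4f}")
print(f"mean of ratios: {mean_of_ratios:.4f}")
```

Neither choice is wrong, but using the same one throughout keeps the map and the correlation plot directly comparable.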

# Correlation between smokers and TBL cancer

In [210]:
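# Align the smoking country names with the cancer and population tables, which
# earlier cells mapped to 'USA'; otherwise the merges below drop that country.
# Assumes the smoking data uses one of the two spellings listed here.
smoking_data['Country'] = smoking_data['Country'].replace({
    **country_name_mapping,
    'United States': 'USA',
    'United States of America': 'USA',
})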
# Filter data to include only years 2000-2012
df_smoking_filtered = smoking_data[(smoking_data['Year'] >= 2000) & (smoking_data['Year'] <= 2012)]

# Extract relevant columns
df_smoking_relevant = df_smoking_filtered[['Country', 'Year', 'Data.Smokers.Total']]


# Merge datasets on Country and Year
df_merged = pd.merge(df_lung_relevant, df_smoking_relevant, left_on=['Country', 'Year'], right_on=['Country', 'Year'])
df_merged = pd.merge(df_merged, df_population_relevant, left_on=['Country', 'Year'], right_on=['Location', 'Time'])

# Calculate per capita values
df_merged['LungCancerPerCapita'] = df_merged['Tracheal, bronchus, and lung cancer'] / df_merged['PopTotal']
df_merged['SmokersPerCapita'] = df_merged['Data.Smokers.Total'] / df_merged['PopTotal']

# Select only the numeric columns
df_numeric = df_merged[['Year', 'LungCancerPerCapita', 'SmokersPerCapita']]

# Group by Year and average the per capita values
df_grouped = df_numeric.groupby('Year').mean().reset_index()

# Create the scatter plot with trendline using plotly
correlation_coefficient_smoke = df_grouped['SmokersPerCapita'].corr(df_grouped['LungCancerPerCapita'])
fig = px.scatter(df_grouped, x="SmokersPerCapita", y="LungCancerPerCapita", trendline="ols",
                 title='Correlation between TBL cancer and smokers per capita',
                 hover_data=["Year"],
                 labels={
                     "SmokersPerCapita": "Total smokers per capita",
                     "LungCancerPerCapita": "TBL cancer deaths per capita",
                     "Year": "Year"
                 })

# Add correlation coefficient as annotation
fig.add_annotation(
    x=max(df_grouped['SmokersPerCapita']),  # Position the annotation at the far right of the x-axis
    y=min(df_grouped['LungCancerPerCapita']),  # Position the annotation at the bottom of the y-axis
    text=f'Correlation Coefficient: {correlation_coefficient_smoke:.2f}',
    showarrow=False,
    font=dict(size=12, color="black"),
    xanchor='right',
    yanchor='bottom'
)

fig.update_layout(
    height = 600
)



# References

> Cohen, A. J., & Pope, C. A., III. (1995). Lung cancer and air pollution. _Environmental Health Perspectives, 103_(Suppl. 8), 219-224.
>
> Safiri, S., Sohrabi, M. R., Carson-Chahhoud, K., Bettampadi, D., Taghizadieh, A., Almasi-Hashiani, A., ... & Kolahi, A. A. (2021). Burden of tracheal, bronchus, and lung cancer and its attributable risk factors in 204 countries and territories, 1990 to 2019. _Journal of Thoracic Oncology, 16_(6), 945-959.
