# Climate Change Dataset - Dominic Simpson, La Fosse Data Hackathon


## Steps & Deliverables


1. **Choose a Dataset**
- Go to **Kaggle** and find a dataset you find interesting (small-to-medium size so you can work quickly – < 100MB recommended)
- Make sure it has at least one numeric column you can predict with regression or one categorical column you can classify
- Upload it to **Databricks**


2. **Ask Questions & Create Hypotheses**
- Write 3–5 analysis questions you want to answer
- Write 1–2 hypotheses you can test
- Decide which column will be your target variable for Machine Learning

##### Exposition:
For this hackathon project, "From Data to Insights to Predictions", I have chosen the following dataset from Kaggle: https://www.kaggle.com/datasets/bhadramohit/climate-change-dataset/data

- Title: Climate Change Dataset - "Dataset of Temperature, Emissions, and Environmental Trends (2000-2024)"
- File: climate_change_dataset.csv
- File Size: 53.21kB - 90kB (depending on encoding)
- Number of Rows: 1000
- Number of Columns: 10


Analysis Questions:
1. Does the data show that the combined average temperatures of the thirteen countries in the data have risen overall over roughly the last 25 years?
2. If so, can rising global temperatures be correlated with rising CO₂ emissions per capita?
3. Has there been an increase in extreme weather over the 25-year period?
4. Is there a relationship between CO₂ emissions per capita and renewable energy usage?
5. Has there been an inexorable rise in sea level throughout the world?


Hypotheses:
1. Countries throughout the world have seen a general rise in temperatures overall.
2. A country's rising temperature can be correlated with its CO₂ emissions per capita.

Target variable for Machine Learning:

- Avg Temperature (°C) [_column name will be modified_]
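
One way Hypothesis 1 could be tested numerically is to fit a straight line to the yearly mean temperature and inspect the sign of its slope. The sketch below is illustrative only: the `toy` frame holds invented values rather than rows from the Kaggle file, and the column names assume the cleaned, lowercase naming used later in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only: these values are NOT from the Kaggle dataset.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "avg_temperature": [10.0, 20.0, 11.0, 21.0, 12.0, 22.0],
})

# Mean temperature across all rows for each year
yearly_mean = toy.groupby("year")["avg_temperature"].mean()

# Least-squares straight line; the first coefficient is the slope in °C per year
slope, intercept = np.polyfit(
    yearly_mean.index.to_numpy(), yearly_mean.to_numpy(), deg=1
)
print(f"Trend: {slope:.2f} °C per year")
```

A positive slope would be consistent with the hypothesis, although a slope on its own says nothing about statistical significance.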


In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

3. **Data Cleaning & Transformation**


- Load your dataset in a `Jupyter Notebook` inside Databricks
- Handle missing values, duplicates, and incorrect data types
- Create new columns if needed
- Filter, group, and sort data to prepare it for analysis
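
The cleaning steps listed above can be sketched on a small invented frame. The `raw` data below is hypothetical and deliberately contains one duplicate row, one missing value and one numeric column stored as text. The real dataset turned out to need none of these fixes, so this only illustrates the general pattern.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
raw = pd.DataFrame({
    "Year": ["2000", "2001", "2001", "2002"],
    "Country": ["India", "China", "China", None],
    "Rainfall (mm)": [1200, 900, 900, 1500],
})

cleaned = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["Country"])         # drop rows with no country
       .astype({"Year": "int64"})          # fix a numeric column stored as text
       .sort_values(["Year", "Country"])   # order for analysis
       .reset_index(drop=True)
)
print(cleaned)
```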

In [0]:
df = pd.read_csv("data/climate_change_dataset.csv")
df.head()

In [0]:
df.tail()

In [0]:
df.describe(include='all')


In [0]:
df.info()

In [0]:
df.shape

In [0]:
# Count missing values per column; an empty result means there are none
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

In [0]:
# Count fully duplicated rows; prints False when there are none
duplicated_values = df.duplicated().sum()
print(duplicated_values > 0)

Formatting columns

In [0]:
# Display floats with two decimal places. This changes only how values are printed;
# the stored values keep their original precision
# (in climate change studies, small differences can be meaningful when looking at long-term trends)
pd.options.display.float_format = '{:.2f}'.format
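
The difference between a display option and actual rounding can be shown with a short sketch. The series `s` is invented, and `pd.option_context` is used here instead of the global setting above so that the example leaves the notebook's options untouched.

```python
import pandas as pd

s = pd.Series([12.345678, 7.1])  # invented values for illustration

# Temporarily apply the two-decimal display format
with pd.option_context("display.float_format", "{:.2f}".format):
    print(s)          # printed with two decimal places

print(s.iloc[0])      # the stored value still has full precision

# round() returns a new Series whose stored values are actually rounded
rounded = s.round(2)
print(rounded.iloc[0])
```

If rounded values are needed in the saved CSV, `round()` is the call that changes the data; the display option alone does not.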

In [0]:
# `Year` is already int64 and `Country` is already object, so neither needs conversion.
# `Avg Temperature (°C)` is already float64; the display option set above shows it
# with two decimal places, although the source file stores only one.
df['Avg Temperature (°C)'].head(10)



In [0]:
# `Sea Level Rise (mm)` is already float64; the display option set above shows it
# with two decimal places, although the source file stores only one.
df['Sea Level Rise (mm)'].head(10)


In [0]:
# The original data in this column has no decimal places, so it stays int64
df['Rainfall (mm)'].head(10)


In [0]:
# population data contains errors and is not required for this project
df.drop('Population', axis=1, inplace=True, errors='ignore')


In [0]:
# The original data in this column has no decimal places, so it stays int64
df['Extreme Weather Events'].head(10)


In [0]:
# `Forest Area (%)` is already float64; the display option set above shows it
# with two decimal places, although the source file stores only one.
df['Forest Area (%)'].head(10)

In [0]:
# Column names
# Standardize to lowercase with underscores and strip units such as
# (°C), (%), (mm) and (tons/capita) via regex.
# The units will still appear in data visualizations and ML models.
df.columns = (
    df.columns
        .str.strip() # remove leading/trailing spaces
        .str.lower() # convert to lowercase
        .str.replace(r'\s+', '_', regex=True) # replace whitespace between words with underscores
        .str.replace(r'\(°c\)', '', regex=True) # remove '(°c)'
        .str.replace(r'\(%\)', '', regex=True) # remove '(%)'
        .str.replace(r'\(mm\)', '', regex=True) # remove '(mm)'
        .str.replace(r'\(tons/capita\)', '', regex=True) # remove '(tons/capita)'
        .str.replace(r'_+$', '', regex=True) # delete trailing underscores at end of column name
)
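
As a possible alternative to listing each unit separately, a single pattern can strip any parenthesised suffix before the spaces are replaced. This is a sketch of a different approach, not the method used in the cell above, and the sample column names are assumed from the labels that appear elsewhere in this notebook.

```python
import pandas as pd

# Sample names assumed from labels used in this notebook
cols = pd.Index([
    "Year",
    "Avg Temperature (°C)",
    "CO2 Emissions (Tons/Capita)",
    "Sea Level Rise (mm)",
    "Forest Area (%)",
])

normalized = (
    cols.str.strip()
        .str.lower()
        .str.replace(r"\s*\([^)]*\)", "", regex=True)  # drop any "(unit)" suffix
        .str.replace(r"\s+", "_", regex=True)          # spaces -> underscores
)
print(list(normalized))
```

The trade-off is that this pattern removes every parenthesised group, so it would also drop parentheses that are not units.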


In [0]:
df.info()

In [0]:
# Reorder data by year (earliest first) and country (alphabetical)
df_sorted = df.sort_values(['year', 'country'])
df_sorted.head(10)


In [0]:
df_sorted.to_csv('data/cleaned_climate_change_data.csv', index=False)


In [0]:
df1 = pd.read_csv('data/cleaned_climate_change_data.csv')
df1.head()

In [0]:
df1['year'].unique()

4. **Data Visualization**


- Use Matplotlib (and optionally Seaborn) to create at least 5 meaningful plots that help answer your questions
- Each plot should have a clear title, axis labels, and legends if needed

In [0]:
# Save combined countries' average temperatures for each year
yearly_avgtemp_df = (
    df1.groupby('year', as_index=False)['avg_temperature']
       .mean()
)

print(yearly_avgtemp_df)

In [0]:
# Lineplot of Average Temperature Rise of Selected Countries (2000-2024)
plt.figure(figsize=(12, 6))
sns.lineplot(data=yearly_avgtemp_df,
            x='year',
            y='avg_temperature',
            marker='o')

plt.title('Average Temperature Rise of Selected Countries (2000-2024)')
plt.xlabel('Year')
plt.ylabel('Average Temperature (°C)')

plt.show()

In [0]:
country_tempco2_avg = (
    df1.groupby('country', as_index=False)[['avg_temperature', 'co2_emissions']].mean()
)

country_tempco2_avg.head(10)

In [0]:
# Can rising global temperatures be correlated with rising CO₂ emissions per capita?

plt.figure(figsize=(12, 6))

# Average temperature over time (from the cleaned data)
sns.lineplot(
            x='year',
            y='avg_temperature',
            data=df1,
            label='Average Temperature (°C)'
)

# CO₂ emissions per capita over time (from the cleaned data)
sns.lineplot(
            x='year',
            y='co2_emissions',
            data=df1,
            label='CO2 Emissions (Tons/Capita)'
)

plt.title('Climate Change Trends in Sample Countries (2000-2024)')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.legend()

plt.show()
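
The plot above draws °C and tons per capita against one shared "Value" axis, which can flatten whichever series has the smaller range. One option is a second y-axis via `Axes.twinx()`. The sketch below uses an invented `toy_yearly` frame rather than the real data, so the shape of the lines means nothing.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic yearly means, invented for illustration
toy_yearly = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "avg_temperature": [19.5, 19.8, 20.1, 20.0],
    "co2_emissions": [9.8, 10.1, 10.4, 10.2],
})

fig, ax_temp = plt.subplots(figsize=(12, 6))
(line_temp,) = ax_temp.plot(
    toy_yearly["year"], toy_yearly["avg_temperature"],
    color="darkred", marker="o", label="Average Temperature (°C)",
)
ax_temp.set_xlabel("Year")
ax_temp.set_ylabel("Average Temperature (°C)")

# Second y-axis sharing the same x-axis, so each unit gets its own scale
ax_co2 = ax_temp.twinx()
(line_co2,) = ax_co2.plot(
    toy_yearly["year"], toy_yearly["co2_emissions"],
    color="royalblue", marker="s", label="CO2 Emissions (Tons/Capita)",
)
ax_co2.set_ylabel("CO2 Emissions (Tons/Capita)")

# Combine the entries from both axes into a single legend
ax_temp.legend(handles=[line_temp, line_co2], loc="upper left")
ax_temp.set_title("Temperature and CO2 Emissions on Separate Axes (toy data)")
fig.tight_layout()
```

In a notebook, calling `plt.show()` afterwards renders the figure. Two independent axes can make unrelated series look correlated, so the scales should be read with care.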



In [0]:
# Scatterplot of Temperature vs CO₂ Emissions per Capita (2000–2024) by Country

plt.figure(figsize=(12, 6))
sns.scatterplot(data=country_tempco2_avg,
                x='co2_emissions',
                y='avg_temperature',
                )

for i, row in country_tempco2_avg.iterrows():
    plt.text(row['co2_emissions'] + 0.1, 
             row['avg_temperature'],
             row['country'],
             fontsize=10
            )

plt.title('Temperature vs CO₂ Emissions per Capita (2000 - 2024) by Country')
plt.xlabel('CO₂ Emissions per Capita (Tons/Capita) by Country')
plt.ylabel('Average Temperature (°C)')
plt.tight_layout()

plt.show()
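
To put a number on the relationship shown in the scatterplot, a Pearson correlation coefficient could be computed from the per-country averages. The sketch below uses an invented `toy_avg` frame with placeholder country names; with the real data, `country_tempco2_avg` would take its place.

```python
import pandas as pd

# Synthetic per-country means, invented for illustration
toy_avg = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "co2_emissions": [2.0, 4.0, 6.0, 8.0],
    "avg_temperature": [14.0, 15.0, 17.0, 18.0],
})

# Pearson correlation coefficient (the default method of Series.corr)
r = toy_avg["co2_emissions"].corr(toy_avg["avg_temperature"])
print(f"Pearson r = {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. With only thirteen countries, any coefficient should be treated as indicative, and correlation alone does not establish causation.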



In [0]:
# Regression plot showing slight rise in sea level
yearly_sea_rise = (
    df1.groupby('year', as_index=False)['sea_level_rise']
       .mean()
)

plt.figure(figsize=(12, 6))

sns.regplot(
    data=yearly_sea_rise,
    x='year',
    y='sea_level_rise',
    scatter_kws={'s': 60, 'alpha': 0.8, 'color': 'royalblue'},
    line_kws={'linewidth': 2, 'color': 'darkred'},
    ci=None
)

plt.title('Global Mean Sea Level Rise with Linear Trend')
plt.xlabel('Year')
plt.ylabel('Sea Level Rise (mm)')
plt.grid(True, axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()

plt.show()


In [0]:
# Have ex