# Climate Change Dataset - Dominic Simpson, La Fosse Data Hackathon


## Steps & Deliverables


1. **Choose a Dataset**
- Go to **Kaggle** and find a dataset you find interesting (small-to-medium size so you can work quickly – < 100MB recommended)
- Make sure it has at least one numeric column you can predict with regression or one categorical column you can classify
- Upload it to **Databricks**


2. **Ask Questions & Create Hypotheses**
- Write 3–5 analysis questions you want to answer
- Write 1–2 hypotheses you can test
- Decide which column will be your target variable for Machine Learning

##### Exposition:
For this hackathon project, "From Data to Insights to Predictions", I have chosen the following dataset from Kaggle: https://www.kaggle.com/datasets/bhadramohit/climate-change-dataset/data

- Title: Climate Change Dataset - "Dataset of Temperature, Emissions, and Environmental Trends (2000-2024)"
- File: climate_change_dataset.csv
- File Size: 53.21kB - 90kB (depending on encoding)
- Number of Rows: 1000
- Number of Columns: 10


Analysis Questions:
- Does the data show that the combined average temperature of the thirteen countries in the data has risen overall over roughly the last 25 years?
- Can rising global temperatures be correlated with rising CO₂ emissions per capita?
- Has there been an inexorable increase in sea level rise throughout the world?
- Has there been an increase in extreme weather over the 25-year period?
- Can relationships be established between a country's renewable energy programme and forest area (both %), on the one hand, and average temperature, sea level rise, and extreme weather events on the other?


Hypotheses:
1. Countries throughout the world have seen a general rise in temperatures overall.
2. Rising global temperatures can be correlated with the trend for increasing CO₂ emissions per capita - despite attempts to bring down CO₂ levels.

Target variable for Machine Learning:

- Avg Temperature (°C) [_column name will be modified_]
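
One way the second hypothesis could be checked is with a Pearson correlation between the two columns. The sketch below is only an illustration: the `toy` frame and its values are made up, and the column names `avg_temperature` and `co2_emissions` assume the cleaned names used later in this notebook.

```python
import pandas as pd

# Illustrative values only; these are not rows from climate_change_dataset.csv
toy = pd.DataFrame({
    "avg_temperature": [14.0, 14.6, 14.9, 15.7, 16.0],
    "co2_emissions":   [4.0, 5.0, 6.0, 7.0, 8.0],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one
r = toy["avg_temperature"].corr(toy["co2_emissions"])
print(f"Pearson r = {r:.2f}")
```

A high `r` only indicates that the two series move together; it does not show that one causes the other.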

In [0]:
# Testing testing
print("Hello World!")

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

3. **Data Cleaning & Transformation**


- Load your dataset in a `Jupyter Notebook` inside Databricks
- Handle missing values, duplicates, and incorrect data types
- Create new columns if needed
- Filter, group, and sort data to prepare it for analysis
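
The cleaning steps listed above can be sketched on a small toy frame. Everything in `raw` is hypothetical, and dropping rows with missing values is just one option (imputing them is another); the real dataset may need none of these steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one duplicate row, one missing value,
# and a numeric column (year) stored as text
raw = pd.DataFrame({
    "year": ["2000", "2001", "2001", "2002", "2003"],
    "country": ["India", "China", "China", "India", "China"],
    "rainfall": [1200.0, 800.0, 800.0, np.nan, 900.0],
})

clean = (
    raw.drop_duplicates()                             # remove exact duplicate rows
       .assign(year=lambda d: d["year"].astype(int))  # fix the data type
       .dropna(subset=["rainfall"])                   # drop rows missing rainfall
       .sort_values(["year", "country"])              # sort for readability
       .reset_index(drop=True)
)

# Group and aggregate: mean rainfall per country
rainfall_by_country = clean.groupby("country", as_index=False)["rainfall"].mean()
print(clean)
print(rainfall_by_country)
```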

In [0]:
df = pd.read_csv("data/climate_change_dataset.csv")
df.head()

In [0]:
df.tail()

In [0]:
df.describe(include='all')


In [0]:
df.info()

In [0]:
df.shape

In [0]:
# no missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

In [0]:
# no duplicate values
duplicated_values = df.duplicated().sum()
print(duplicated_values > 0)

Formatting columns

In [0]:
# Display floats to two decimal places. This only changes how values are shown;
# the stored values keep their original precision
# (in climate change studies, small differences can be meaningful when looking at long-term trends)
pd.options.display.float_format = '{:.2f}'.format

In [0]:
# `Year` is already stored as int64
# `Country` is already stored as object
# `Avg Temperature (°C)` is already stored as float64; the display option above
# shows it with two decimal places instead of the original one
df['Avg Temperature (°C)'].head(10)



In [0]:
# `Sea Level Rise (mm)` is already stored as float64; the display option above
# shows it with two decimal places instead of the original one
df['Sea Level Rise (mm)'].head(10)


In [0]:
# There are no decimal places in the original data in this column, so I have left it as int64
df['Rainfall (mm)'].head(10)


In [0]:
# population data contains errors and is not required for this project
df.drop('Population', axis=1, inplace=True, errors='ignore')


In [0]:
# There are no decimal places in the original data in this column, so I have left it as int64
df['Extreme Weather Events'].head(10)


In [0]:
# `Forest Area (%)` is already stored as float64; the display option above
# shows it with two decimal places instead of the original one
df['Forest Area (%)'].head(10)

In [0]:
# Column names
# Standardized them to lowercase with underscores, and removed units
# such as (°C), (%), (mm) and (tons/capita) via regex
# The units will still appear in data visualization labels
df.columns = (
    df.columns
        .str.strip() # remove leading/trailing spaces
        .str.lower() # convert to lowercase
        .str.replace(r'\s+', '_', regex=True) # replace spaces between words with underscores
        .str.replace(r'\(°c\)', '', regex=True) # remove '(°c)'
        .str.replace(r'\(%\)', '', regex=True) # remove '(%)'
        .str.replace(r'\(mm\)', '', regex=True) # remove '(mm)'
        .str.replace(r'\(tons/capita\)', '', regex=True) # remove '(tons/capita)'
        .str.replace(r'_+$', '', regex=True) # delete trailing underscores at end of column name
)
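
As an alternative to one regex per unit, a single pattern can strip any parenthesised suffix. This is a sketch, not the method used in this notebook: `clean_column_name` is a hypothetical helper, and the names in `original` are assumed spellings of the raw columns.

```python
import re

def clean_column_name(name: str) -> str:
    """Lowercase, drop any parenthesised unit, and join words with underscores."""
    name = re.sub(r"\([^)]*\)", "", name)  # remove "(°C)", "(mm)", "(%)", ...
    name = name.strip().lower()            # trim spaces left behind, then lowercase
    return re.sub(r"\s+", "_", name)       # spaces between words become underscores

# Assumed raw column names, for illustration
original = ["Year", "Avg Temperature (°C)", "CO2 Emissions (Tons/Capita)",
            "Sea Level Rise (mm)", "Forest Area (%)"]
cleaned = [clean_column_name(c) for c in original]
print(cleaned)
```

Applied to a DataFrame, this would be `df.columns = [clean_column_name(c) for c in df.columns]`. The trade-off is that it removes every parenthesised part, including ones that might be worth keeping.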


In [0]:
df.info()

In [0]:
#Reorder data by year (earliest first) and country (alphabetical)
df_sorted = df.sort_values(['year', 'country'])
df_sorted.head(10)


In [0]:
df_sorted.to_csv('data/cleaned_climate_change_data.csv', index=False)


In [0]:
df1 = pd.read_csv('data/cleaned_climate_change_data.csv')
df1.head()

In [0]:
df1['year'].unique()

4. **Data Visualization**


- Use Matplotlib (and optionally Seaborn) to create at least 5 meaningful plots that help answer your questions
- Each plot should have a clear title, axis labels, and legends if needed

In [0]:
# Does the data show that the combined average temperature of the thirteen countries in the data has risen overall over roughly the last 25 years?

# Lineplot of Average Temperature Rise of Selected Countries (2000-2024)

# Save combined countries' average temperatures for each year
yearly_avgtemp_df = (
    df1.groupby('year', as_index=False)['avg_temperature']
       .mean()
)

plt.figure(figsize=(12, 6))
sns.lineplot(data=yearly_avgtemp_df,
            x='year',
            y='avg_temperature',
            marker='o')

plt.title('Average Temperature Rise of Selected Countries (2000-2024)')
plt.xlabel('Year')
plt.ylabel('Average Temperature (°C)')

plt.show()

In [0]:
# Can rising global temperatures be correlated with rising CO₂ emissions per capita?

#lineplot with legend to show two variables

plt.figure(figsize=(12, 6))

#average temperature rises
sns.lineplot(
            x='year',
            y='avg_temperature',
            data=df1,
            label='Average Temperature (°C)'
)

#average CO2 emissions
sns.lineplot(
            x='year',
            y='co2_emissions',
            data=df1,
            label='CO2 Emissions (Tons/Capita)'
)

plt.title('Climate Change Trends in Sample Countries (2000-2024)')
plt.xlabel('Year')
plt.ylabel('Value')
plt.legend()

plt.show()



In [0]:
# Can rising global temperatures be correlated with rising CO₂ emissions per capita per country?
# Scatterplot of Temperature vs CO₂ Emissions per Capita (2000–2024) by Country

country_tempco2_avg = (
    df1.groupby('country', as_index=False)[['avg_temperature', 'co2_emissions']].mean()
)

plt.figure(figsize=(12, 6))
sns.scatterplot(data=country_tempco2_avg,
                x='co2_emissions',
                y='avg_temperature',
                )

for i, row in country_tempco2_avg.iterrows():
    plt.text(row['co2_emissions'] + 0.1, 
             row['avg_temperature'],
             row['country'],
             fontsize=10
            )

plt.title('Temperature vs CO₂ Emissions per Capita (2000 - 2024) by Country')
plt.xlabel('CO₂ Emissions per Capita (Tons/Capita) by Country')  # x-axis plots co2_emissions
plt.ylabel('Average Temperature (°C)')  # y-axis plots avg_temperature
plt.tight_layout()

plt.show()



In [0]:
# Regression plot showing slight rise in sea level
yearly_sea_rise = (
    df1.groupby('year', as_index=False)['sea_level_rise']
       .mean()
)

plt.figure(figsize=(12, 6))

sns.regplot(
    data=yearly_sea_rise,
    x='year',
    y='sea_level_rise',
    scatter_kws={'s': 60, 'alpha': 0.8, 'color': 'royalblue'},
    line_kws={'linewidth': 2, 'color': 'darkred'},
    ci=None
)

plt.title('Global Mean Sea Level Rise with Linear Trend')
plt.xlabel('Year')
plt.ylabel('Sea Level Rise (mm)')
plt.grid(True, axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()

plt.show()


In [0]:
# Has there been a mean increase in extreme weather over the 25 year period?
# It's hard to tell from this bar plot.

yearly_extreme_weather = (
    df1.groupby('year', as_index=False)['extreme_weather_events']
       .mean()
)

plt.figure(figsize=(12, 6))

sns.barplot(
    data=yearly_extreme_weather,
    x='year',
    y='extreme_weather_events',
    color='royalblue'
)

plt.title('Extreme Weather Events by Year')
plt.xlabel('Year')
plt.ylabel('Number of Extreme Weather Events')
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()


In [0]:
# Has there been a mean increase in extreme weather over the 25-year period?
# This graph shows the cumulative total of the yearly means. A cumulative sum of
# non-negative values always rises, so the steepness of the curve, rather than its
# height, indicates whether extreme weather events are becoming more frequent.

yearly_extreme_weather_cumulative = df1.groupby('year', as_index=False)['extreme_weather_events'].mean()
yearly_extreme_weather_cumulative = yearly_extreme_weather_cumulative.sort_values('year')
yearly_extreme_weather_cumulative['cumulative'] = yearly_extreme_weather_cumulative['extreme_weather_events'].cumsum()

plt.figure(figsize=(12, 6))
plt.fill_between(yearly_extreme_weather_cumulative['year'], yearly_extreme_weather_cumulative['cumulative'], alpha=0.3)
plt.bar(yearly_extreme_weather_cumulative['year'], yearly_extreme_weather_cumulative['cumulative'], width=0.8, alpha=0.7)
plt.title('Cumulative Extreme Weather Events of Selected Countries (2000 - 2024)')
plt.xlabel('Year')
plt.ylabel('Cumulative Events (frequency)')
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()
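
A least-squares slope on the yearly means is one way to put a number on a trend that the bar plots only show visually. The sketch below uses made-up yearly values (not the dataset's) and `np.polyfit`; with the real data, `years` and `events` would come from `yearly_extreme_weather`.

```python
import numpy as np

# Made-up yearly means, for illustration only
years = np.arange(2000, 2010)
events = np.array([7.0, 6.5, 7.5, 7.0, 6.8, 7.2, 7.1, 6.9, 7.3, 7.0])

# Degree-1 least-squares fit; polyfit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(years, events, deg=1)
print(f"Trend: {slope:+.3f} events per year")

# The running total keeps climbing even though the yearly values are roughly flat
cumulative = np.cumsum(events)
print(f"Cumulative total after {len(years)} years: {cumulative[-1]:.1f}")
```

A slope close to zero would suggest no clear change in events per year, whatever the cumulative chart looks like.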

In [0]:
# Can relationships be established between a country's renewable energy programme and forest area (both %), on the one hand, and average temperature, sea level rise, and extreme weather events on the other?
# The correlations are complex, as this heatmap shows: China has both the biggest renewable energy programme and the largest forest area, yet still faces a relatively high sea level rise.

country_averages = df1.groupby('country', as_index=False).agg({
    'renewable_energy': 'mean',
    'forest_area': 'mean',
    'avg_temperature': 'mean',
    'sea_level_rise': 'mean',
    'extreme_weather_events': 'mean'
})

# sort by renewable energy percentage
country_averages = country_averages.sort_values('renewable_energy', ascending=False)
# set country as index
country_averages = country_averages.set_index('country')


white_cols = ['renewable_energy', 'forest_area'] # these two columns should not show any impact colour
impact_cols = ['avg_temperature', 'sea_level_rise', 'extreme_weather_events'] # impact colour columns
cols_order = white_cols + impact_cols

display_df = country_averages[cols_order].copy()  # shown as text in cells

#colour matrix
color_df = pd.DataFrame(0.0, index=display_df.index, columns=display_df.columns)

for c in impact_cols:
    col = display_df[c].astype(float)
    rng = col.max() - col.min()
    if rng == 0:
        color_df[c] = 0.0
    else:
        color_df[c] = (col - col.min()) / rng

fmt_df = pd.DataFrame(index=display_df.index, columns=display_df.columns, dtype=object)
fmt_df['renewable_energy']      = display_df['renewable_energy'].map(lambda v: f"{v:.2f}%")
fmt_df['forest_area']           = display_df['forest_area'].map(lambda v: f"{v:.2f}%")
fmt_df['avg_temperature']       = display_df['avg_temperature'].map(lambda v: f"{v:.2f}")
fmt_df['sea_level_rise']        = display_df['sea_level_rise'].map(lambda v: f"{v:.2f}")
fmt_df['extreme_weather_events']= display_df['extreme_weather_events'].map(lambda v: f"{int(round(v))}")


plt.figure(figsize=(12, 8))
ax = sns.heatmap(
    color_df,
    cmap='Reds',
    vmin=0,
    vmax=1,
    linewidths=.5,
    linecolor='white',
    cbar_kws={'label':'Higher = Darker Red'},
    annot=fmt_df.values,
    fmt='',
)

ax.set_title('Climate and Environmental Indicators (Averages 2000 - 2024), grouped by countries and renewable energy')
ax.set_xlabel('')
ax.set_ylabel('Country')

plt.xticks(rotation=0)
plt.tight_layout()

plt.show()


5. **Machine Learning Predictions**


- Use `scikit-learn` to create a basic model:
  - Regression (if predicting a number)
  - Classification (if predicting a category)
- Steps:
  - Split data into train/test sets
  - Choose a simple model
  - Train and evaluate the model
  - Show accuracy score, R², or other relevant metrics
- Interpret the results

In [0]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline # ensures that preprocessing happens during training and prediction
from sklearn.preprocessing import OneHotEncoder # converts year and country into 
#separate binary columns so that the model can capture country-specific and year-specific climate patterns

# y variable is average temperatures
y = df1['avg_temperature']

# X is the remaining feature columns
X = df1[['year', 
         'country', 
         'co2_emissions', 
         'sea_level_rise',
         'rainfall',
         'renewable_energy',
         'extreme_weather_events',
         'forest_area']]

#categorical columns
categorical_cols = ['year', 'country']
#numeric columns
numeric_cols = ['co2_emissions', 
                'sea_level_rise', 
                'rainfall', 
                'renewable_energy', 
                'extreme_weather_events', 
                'forest_area'
                ]

# Preprocessor: one-hot encode the categorical columns, pass the numeric columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)

# Create pipeline for preprocessing + model
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

model = pipe

# Fit the pipeline
model.fit(X, y)



In [0]:
# Predict
predictions = model.predict(X)
predictions


In [0]:
# Average predicted temperature per country
df1['predicted_temp'] = predictions

country_pred_temp = df1.groupby('country')['predicted_temp'].mean()
print(country_pred_temp)

In [0]:
overall_mean = df1['predicted_temp'].mean()
print("Overall mean predicted temp:", overall_mean)

In [0]:
pred = model.predict(X)

plt.scatter(pred, y, alpha=0.6)
plt.title('Predicted vs Actual Average Temperatures (°C)')
plt.xlabel('Predicted Average Temperatures')
plt.ylabel('Actual Average Temperatures')

plt.show()
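
The predictions plotted above are made on the same rows the model was fitted on, so the scatter is likely to look tighter than it would on unseen data. The sketch below illustrates the gap with out-of-fold predictions from `cross_val_predict`. The data is synthetic (the target is pure noise), and `X_demo`, `y_demo` and `forest` are names introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic data, for illustration only: the target is pure noise,
# so no model should be able to predict it on unseen rows
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# In-sample: fit and predict on the same rows
in_sample_r2 = r2_score(y_demo, forest.fit(X_demo, y_demo).predict(X_demo))

# Out-of-fold: each row is predicted by a model that never saw it
oof_pred = cross_val_predict(forest, X_demo, y_demo, cv=5)
oof_r2 = r2_score(y_demo, oof_pred)

print(f"In-sample R2:   {in_sample_r2:.2f}")
print(f"Out-of-fold R2: {oof_r2:.2f}")
```

A large difference between the two scores is a sign that the in-sample plot mostly reflects memorised training rows.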

In [0]:
from sklearn.model_selection import GridSearchCV
import time # used to time the grid search

param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10]
}

new_model = GridSearchCV(
    estimator=model, #pipeline
    param_grid = param_grid,
    cv=3, 
    scoring='neg_mean_squared_error',
    n_jobs=1,
    verbose=2,
    return_train_score=False
)

t0 = time.time()

new_model.fit(X, y)

print(f"Elapsed: {time.time()-t0:.1f}s")
print("Best parameters:", new_model.best_params_)
print("Best score:", new_model.best_score_)
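
With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error averaged over the folds, so it is zero or negative and closer to zero is better. A minimal sketch of turning it back into an error in the target's units, using synthetic data and a `Ridge` model chosen only to keep the example fast (`search`, `X_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=120)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Flip the sign to recover the mean cross-validated MSE,
# then take the square root to get back to the target's units
cv_mse = -search.best_score_
cv_rmse = np.sqrt(cv_mse)
print(f"CV MSE:  {cv_mse:.3f}")
print(f"CV RMSE: {cv_rmse:.3f}")
```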

In [0]:
results_df = pd.DataFrame(new_model.cv_results_)

#Sort so that the best score comes first
results_df = results_df.sort_values(by='mean_test_score', ascending=False)

results_df.head()


In [0]:
print("Best parameters:", new_model.best_params_)




In [0]:
best_model = new_model.best_estimator_ # pipeline refitted with the best parameters found by the grid search
best_params = new_model.best_params_

#Predicts with the best model
new_y_pred = best_model.predict(X)

plt.scatter(new_y_pred, y, color='seagreen')
plt.xlabel('Predicted Average Temperatures')
plt.ylabel('Actual Average Temperatures')
plt.title(f'Predicted versus Actual Average Temperatures (n_estimators={best_params["model__n_estimators"]}) (°C)')
plt.grid(True)

plt.show()

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit the grid search on train only
new_model.fit(X_train, y_train)

# inspect best params and grab best fitted pipeline
print('Best parameters:', new_model.best_params_)
best_model = new_model.best_estimator_

# predict on the test set (unseen during training / tuning)
y_test_pred = best_model.predict(X_test)

# metrics
r2 = r2_score(y_test, y_test_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE; avoids the `squared` argument, which recent scikit-learn releases removed
print(f'R2: {r2:.2f}')
print(f'RMSE: {rmse:.2f}')

# plot
plt.scatter(
    y_test_pred,
    y_test,
    alpha=0.6,
    color='seagreen'
)

mn = min(y_test.min(), y_test_pred.min())
mx = max(y_test.max(), y_test_pred.max())
plt.plot([mn, mx], [mn, mx], linestyle='--')

plt.title(
    'Test Set: Predicted vs Actual Average Temperatures '
    f'(n_estimators={new_model.best_params_["model__n_estimators"]})'
)
plt.xlabel('Predicted Average Temperatures (°C)')
plt.ylabel('Actual Average Temperatures (°C)')
plt.grid(True)

plt.show()

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_test_pred)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)  # RMSE; avoids the `squared` argument, which recent scikit-learn releases removed
r2 = r2_score(y_test, y_test_pred)

print(f'MAE: {mae:.2f} °C')
print(f'MSE: {mse:.2f} °C^2')
print(f'RMSE: {rmse:.2f} °C')
print(f'R2: {r2:.2f}')
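
To help interpret these metrics, the sketch below computes them by hand with NumPy on a few made-up values and compares RMSE against a baseline that always predicts the mean. The arrays and the `_demo` names are illustrative only.

```python
import numpy as np

# Made-up actual and predicted temperatures, for illustration only
y_true_demo = np.array([10.0, 12.0, 14.0, 16.0])
y_hat_demo = np.array([11.0, 12.0, 13.0, 18.0])

errors = y_true_demo - y_hat_demo
mae_demo = np.mean(np.abs(errors))   # mean absolute error, in °C
mse_demo = np.mean(errors ** 2)      # mean squared error, in °C^2
rmse_demo = np.sqrt(mse_demo)        # root of MSE, back in °C

ss_res = np.sum(errors ** 2)                              # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_demo = 1 - ss_res / ss_tot        # 1 = perfect, 0 = no better than the mean

# Baseline: always predict the mean of the actual values
baseline_rmse = np.sqrt(np.mean((y_true_demo - y_true_demo.mean()) ** 2))

print(f"MAE: {mae_demo:.2f}  MSE: {mse_demo:.2f}  RMSE: {rmse_demo:.2f}  R2: {r2_demo:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

An R² near zero, or an RMSE close to the baseline, would mean the model adds little over simply predicting the average temperature.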
      