# Climate Change Dataset - Dominic Simpson, La Fosse Data Hackathon


## Steps & Deliverables


1. **Choose a Dataset**
- Go to **Kaggle** and find a dataset you find interesting (small-to-medium size so you can work quickly – < 100MB recommended)
- Make sure it has at least one numeric column you can predict with regression or one categorical column you can classify
- Upload it to **Databricks**


2. **Ask Questions & Create Hypotheses**
- Write 3–5 analysis questions you want to answer
- Write 1–2 hypotheses you can test
- Decide which column will be your target variable for Machine Learning

##### Exposition:
For this hackathon project, "From Data to Insights to Predictions", I have chosen the following dataset from Kaggle: https://www.kaggle.com/datasets/bhadramohit/climate-change-dataset/data

- Title: Climate Change Dataset - "Dataset of Temperature, Emissions, and Environmental Trends (2000-2024)"
- File: climate_change_dataset.csv
- File Size: 53.21 kB to 90 kB (depending on encoding)
- Number of Rows: 1000
- Number of Columns: 10


Analysis Questions:
1. Does the data show that every country in the world has seen rising temperatures overall across approximately the last 25 years?
2. Can a country's rising temperatures be correlated with its CO₂ emissions per capita?
3. How do parts of the world compare on extreme weather statistics?
4. Is there a relationship between CO₂ emissions per capita and renewable energy usage?
5. Has sea level risen steadily throughout the world?


Hypotheses:
1. Countries throughout the world have seen a general rise in temperatures overall.
2. A country's rising temperature can be correlated with its CO₂ emissions per capita.
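
One possible way to check hypothesis 1 is to fit a straight line to each country's temperature readings against year and look at the sign of the slope. The sketch below is illustrative only: the helper `temperature_trend_per_country` is a hypothetical name, the default column names assume the standardized lowercase names used later in the notebook, and the toy values are made up rather than taken from the Kaggle file.

```python
import numpy as np
import pandas as pd


def temperature_trend_per_country(frame, country_col="country",
                                  year_col="year", temp_col="avg_temperature"):
    """Return the least-squares slope (°C per year) of temperature vs. year per country."""
    slopes = {}
    for country, group in frame.groupby(country_col):
        # A slope needs at least two distinct years
        if group[year_col].nunique() < 2:
            continue
        x = group[year_col].to_numpy(dtype=float)
        y = group[temp_col].to_numpy(dtype=float)
        slope, _intercept = np.polyfit(x, y, deg=1)
        slopes[country] = slope
    return pd.Series(slopes, name="slope_c_per_year")


# Made-up data: country A warms, country B cools
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2010, 2020, 2000, 2010, 2020],
    "avg_temperature": [10.0, 10.5, 11.0, 20.0, 19.5, 19.0],
})
trends = temperature_trend_per_country(toy)
print(trends)
```

A positive slope for every country would be consistent with hypothesis 1; any negative slope would count against the "every country" wording of the first analysis question.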

Target variable for Machine Learning:

- Avg Temperature (°C) [_column name will be modified_]

In [0]:
# Testing testing
print("Hello World!")

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

3. **Data Cleaning & Transformation**


- Load your dataset in a `Jupyter Notebook` inside Databricks
- Handle missing values, duplicates, and incorrect data types
- Create new columns if needed
- Filter, group, and sort data to prepare it for analysis
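
The cleaning steps listed above can be sketched on a small made-up frame. The values here are invented for illustration, and the `Decade` column is a hypothetical example of a derived column, not something the original dataset contains.

```python
import pandas as pd

# Made-up frame with a duplicate row, a missing value, and years stored as text
raw = pd.DataFrame({
    "Year": ["2001", "2001", "2005", "2010"],
    "Country": ["A", "A", "B", "B"],
    "Rainfall (mm)": [100.0, 100.0, None, 300.0],
})

cleaned = (
    raw.drop_duplicates()                               # remove exact duplicate rows
       .assign(Year=lambda d: d["Year"].astype(int))    # fix an incorrect data type
       .dropna(subset=["Rainfall (mm)"])                # drop rows missing the value
       .sort_values(["Country", "Year"])                # sort for readability
       .reset_index(drop=True)
)

# Create a new column, then group to prepare a summary for analysis
cleaned["Decade"] = (cleaned["Year"] // 10) * 10
mean_rainfall = cleaned.groupby("Country")["Rainfall (mm)"].mean()
print(cleaned)
print(mean_rainfall)
```

Dropping rows with missing values is only one option; filling them (for example with a per-country median) may suit a small dataset better.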

In [0]:
df = pd.read_csv("data/climate_change_dataset.csv")
df.head()

In [0]:
df.tail()

In [0]:
df.describe(include='all')


In [0]:
df.info()

In [0]:
df.shape

In [0]:
# Count missing values per column; an empty result means there are none
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

In [0]:
# Count fully duplicated rows; prints False when there are none
duplicated_values = df.duplicated().sum()
print(duplicated_values > 0)

Formatting columns

In [0]:
# Display floats to two decimal places in notebook output.
# This changes only how values are printed; the stored data keeps its original precision
# (in climate change studies, small differences can be meaningful when looking at long-term trends)
pd.options.display.float_format = '{:.2f}'.format
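
As a side note, the pandas `display.float_format` option affects only how floats are rendered, while `round` changes the stored values. The minimal sketch below uses a made-up series to show the difference; it scopes the option with `pd.option_context` so the global setting is left alone.

```python
import pandas as pd

values = pd.Series([14.1, 14.126, 14.999])  # made-up numbers

# Display-only formatting: the underlying floats are untouched
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = values.to_string(index=False)
print(shown)

# Rounding creates new data with reduced precision
rounded = values.round(2)
print(values.iloc[1], rounded.iloc[1])
```

If two-decimal values are needed in the data itself (for example before exporting), `round` is the call to reach for; otherwise the display option is the less destructive choice.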

In [0]:
# `Year` is already stored as int64
# `Country` is already stored as object
# `Avg Temperature (°C)` is already stored as float64; the display setting above
# shows it with two decimal places instead of the original one
df['Avg Temperature (°C)'].head(10)



In [0]:
# `Sea Level Rise (mm)` is already stored as float64; the display setting above
# shows it with two decimal places instead of the original one
df['Sea Level Rise (mm)'].head(10)


In [0]:
# There are no decimal places in the original data in this column, so I have left it as int64
df['Rainfall (mm)'].head(10)


In [0]:
# population data contains errors and is not required for this project
df.drop('Population', axis=1, inplace=True, errors='ignore')


In [0]:
# There are no decimal places in the original data in this column, so I have left it as int64
df['Extreme Weather Events'].head(10)


In [0]:
# `Forest Area (%)` is already stored as float64; the display setting above
# shows it with two decimal places instead of the original one
df['Forest Area (%)'].head(10)

In [0]:
# Column names
# Standardize them to lowercase with underscores, and use regex to remove
# units such as (°C), (%), (mm) and (tons/capita)
# The units will still appear in the data visualizations and ML models
df.columns = (
    df.columns
        .str.strip() # remove leading/trailing spaces
        .str.lower() # convert to lowercase
        .str.replace(r'\s+', '_', regex=True) # replace spaces between words with underscores
        .str.replace(r'\(°c\)', '', regex=True) # remove '(°c)'
        .str.replace(r'\(%\)', '', regex=True) # remove '(%)'
        .str.replace(r'\(mm\)', '', regex=True) # remove '(mm)'
        .str.replace(r'\((tons/capita)\)', '', regex=True) # remove '(tons/capita)'
        .str.replace(r'_+$', '', regex=True) # delete trailing underscores at end of column name
)
)
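
As an alternative sketch (not the approach used in the cell above), a single pattern can strip any parenthesized unit instead of listing each unit separately. The helper name `standardize_columns` is hypothetical, and the sample names are a subset of the column names referenced earlier in this notebook, with extra whitespace added to one of them for illustration.

```python
import pandas as pd


def standardize_columns(columns):
    """Lowercase names, drop any trailing '(unit)' part, and join words with underscores."""
    return (
        pd.Index(columns)
          .str.strip()                                   # remove leading/trailing spaces
          .str.replace(r"\s*\([^)]*\)", "", regex=True)  # drop '(...)' and the space before it
          .str.lower()                                   # convert to lowercase
          .str.replace(r"\s+", "_", regex=True)          # spaces between words become underscores
    )


sample = [
    "Year",
    "Avg Temperature (°C)",
    "Sea Level Rise (mm)",
    "Forest Area (%)",
    " Extreme Weather Events ",
]
new_names = standardize_columns(sample).tolist()
print(new_names)
```

The trade-off is that a general pattern will also remove parenthesized text that is not a unit, so the explicit per-unit version is safer when column names may contain other bracketed notes.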


In [0]:
df.head()