# Climate Change World Bank Dataset - Dominic Simpson


##### Exposition:
For this personal project, I have chosen a World Bank dataset on climate change, covering world data for the period 1901-2023.

##### Guide to the data:
- tas_annual: Average Mean Surface Air Temperature - annual average air temperature at 2m above
- tasmax_annual: Average Maximum Temperature - average daily maximums each year
- tasmin_annual: Average Minimum Temperature - average daily minimums each year
- pr_annual: Average Precipitation - average precipitation each year


Analysis Questions:
- Does the data show that the combined average temperature of all countries in the world has risen overall during the last 25 years (approx)?
- Can rising global temperatures be correlated with rising CO₂ emissions per capita?
- Has sea level risen steadily throughout the world?
- Has there been an increase in extreme weather over the 25 year period?
- Can relationships be established between a country's renewable energy program and forest area (both %), on the one hand, and average temperature, sea level rise, and extreme weather events on the other?


Hypotheses:
1. Countries throughout the world have seen a general rise in temperatures overall.
2. Rising global temperatures can be correlated with the trend for increasing CO₂ emissions per capita - despite attempts to bring down CO₂ levels.
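
As a minimal sketch of how hypothesis 1 could be checked once the data is in long format (one row per country and year), the yearly mean across countries can be fitted with a straight line. The data below is synthetic and the country codes are hypothetical. Note that the mean here is unweighted, so every country counts equally regardless of area, which is an assumption rather than a true global average.

```python
import numpy as np
import pandas as pd

# Synthetic long-format data (hypothetical values, NOT the World Bank series):
# three made-up countries, 1999-2023, each warming by 0.02 °C per year.
years = np.arange(1999, 2024)
frames = []
for code, base in [("AAA", 10.0), ("BBB", 20.0), ("CCC", 25.0)]:
    frames.append(pd.DataFrame({
        "code": code,
        "year": years,
        "tas": base + 0.02 * (years - years[0]),
    }))
toy = pd.concat(frames, ignore_index=True)

# Unweighted mean across countries for each year.
global_mean = toy.groupby("year")["tas"].mean()

# Least-squares straight line through the yearly means; slope is in °C per year.
slope, intercept = np.polyfit(
    global_mean.index.to_numpy(), global_mean.to_numpy(), 1
)
print(f"trend: {slope * 10:.2f} °C per decade")
```

A positive slope would be consistent with the hypothesis, although a fitted line alone does not show whether the rise is statistically significant.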

Decide which column will be your target variable for Machine Learning

- Avg Temperature (°C) [_column name will be modified_]

In [1]:
# Testing testing
print("Hello World!")

Hello World!


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

3. **Data Cleaning & Transformation**


### Combine the separate datasets into one:

In [3]:
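def load_wide_to_long(path, value_name):
    """Read a wide CSV (one column per year) and reshape it to long format."""
    df = pd.read_csv(path)
    # One row per (code, name, year) instead of one column per year
    long = df.melt(id_vars=["code","name"], var_name="year", value_name=value_name)
    # Keep only the four-digit year from each column label
    long["year"] = long["year"].str.extract(r"(\d{4})", expand=False).astype(int)
    return long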

tas = load_wide_to_long("data/tas_annual.csv", "tas")
tasmax = load_wide_to_long("data/tasmax_annual.csv", "tasmax")
tasmin = load_wide_to_long("data/tasmin_annual.csv", "tasmin")
pr = load_wide_to_long("data/pr_annual.csv", "pr")

merged = (
    tas.merge(tasmax, on=["code","name","year"], how="outer")
       .merge(tasmin, on=["code","name","year"], how="outer")
       .merge(pr, on=["code","name","year"], how="outer")
       .sort_values(["code","year"])
)

merged.to_csv("data/climate_annual_allvars.csv", index=False)
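
To make the reshaping step concrete, here is a small sketch of what `melt` does to a wide table. The column labels (`"1901-07"`) and all values are assumptions made up for illustration; the real files may label their year columns differently.

```python
import pandas as pd

# Hypothetical wide table in the assumed layout: one row per country,
# one column per year (column labels are an assumption, values are made up).
wide = pd.DataFrame({
    "code": ["AAA", "BBB"],
    "name": ["Country A", "Country B"],
    "1901-07": [10.0, 20.0],
    "1902-07": [10.5, 20.5],
})

# melt() stacks the year columns into rows: one row per (country, year).
long = wide.melt(id_vars=["code", "name"], var_name="year", value_name="tas")

# Keep only the four-digit year from labels such as "1901-07".
long["year"] = long["year"].str.extract(r"(\d{4})", expand=False).astype(int)

print(long.sort_values(["code", "year"]).to_string(index=False))
```

Two countries with two year columns become four rows, each carrying a single `tas` value, which is the shape the merges on `code`, `name` and `year` rely on.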

In [4]:
df = pd.read_csv("data/climate_annual_allvars.csv")


In [5]:
# Work on a copy so the original frame stays unchanged
df1 = df.copy()

In [6]:
df1.head()

Unnamed: 0,code,name,year,tas,tasmax,tasmin,pr
0,ABW,Aruba (Neth.),1901,28.22,31.78,24.72,420.9
1,ABW,Aruba (Neth.),1902,27.79,31.35,24.29,420.9
2,ABW,Aruba (Neth.),1903,27.89,31.45,24.39,420.9
3,ABW,Aruba (Neth.),1904,27.62,31.17,24.12,420.9
4,ABW,Aruba (Neth.),1905,27.68,31.23,24.18,420.9


In [7]:
df1 = df1.rename(columns={'name' : 'country_name'})
df1.head()

Unnamed: 0,code,country_name,year,tas,tasmax,tasmin,pr
0,ABW,Aruba (Neth.),1901,28.22,31.78,24.72,420.9
1,ABW,Aruba (Neth.),1902,27.79,31.35,24.29,420.9
2,ABW,Aruba (Neth.),1903,27.89,31.45,24.39,420.9
3,ABW,Aruba (Neth.),1904,27.62,31.17,24.12,420.9
4,ABW,Aruba (Neth.),1905,27.68,31.23,24.18,420.9


In [8]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30258 entries, 0 to 30257
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   code          30258 non-null  object 
 1   country_name  30258 non-null  object 
 2   year          30258 non-null  int64  
 3   tas           30011 non-null  float64
 4   tasmax        30011 non-null  float64
 5   tasmin        30010 non-null  float64
 6   pr            30012 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.6+ MB


In [9]:
df1.describe(include = 'all')

Unnamed: 0,code,country_name,year,tas,tasmax,tasmin,pr
count,30258,30258,30258.0,30011.0,30011.0,30010.0,30012.0
unique,246,246,,,,,
top,ABW,Aruba (Neth.),,,,,
freq,123,123,,,,,
mean,,,1962.0,19.18727,23.930472,14.48764,1262.459739
std,,,35.506455,8.674286,8.832801,8.762106,836.377345
min,,,1901.0,-21.07,-17.18,-25.01,10.89
25%,,,1931.0,11.58,16.29,6.75,622.5925
50%,,,1962.0,22.96,28.03,18.1,1101.57
75%,,,1993.0,25.94,30.19,21.42,1782.23


In [10]:
df1.shape

(30258, 7)

In [11]:
missing_values = df1.isnull().sum()
print(missing_values[missing_values > 0])

tas       247
tasmax    247
tasmin    248
pr        246
dtype: int64


In [12]:
# Show the first rows that contain at least one missing value
df1[df1.isnull().any(axis=1)].head()

Unnamed: 0,code,country_name,year,tas,tasmax,tasmin,pr
3198,BLM,Saint-Barthélemy (Fr.),1901,,,,
3199,BLM,Saint-Barthélemy (Fr.),1902,,,,
3200,BLM,Saint-Barthélemy (Fr.),1903,,,,
3201,BLM,Saint-Barthélemy (Fr.),1904,,,,
3202,BLM,Saint-Barthélemy (Fr.),1905,,,,


### The rows with missing values most likely belong to small island territories or overseas dependencies (as in the example above). Such places often have no independent climate series, either because their data is absorbed into that of the parent country (in this case France) or because they are not modelled separately.

In [13]:
# Places that have no climate data at all (every value missing in every variable)
missing_countries = (
    df1.groupby('code')[['tas','tasmax','tasmin','pr']]
       .apply(lambda x: x.isnull().all())
       .all(axis=1)
)
missing_countries[missing_countries]

code
BLM    True
BVT    True
dtype: bool

In [14]:
# Countries that have some missing data
countries_with_some_missing_data = (
    df1.groupby('code')[['tas','tasmax','tasmin','pr']]
       .apply(lambda x: x.isnull().any().any())
)
countries_with_some_missing_data[countries_with_some_missing_data]


code
BLM    True
BVT    True
CAN    True
KAZ    True
MNG    True
dtype: bool

In [15]:
# Only Saint Barthélemy and Bouvet Island have no data
df1[df1['code'].isin(['BLM', 'BVT'])].shape


(246, 7)

In [16]:
# Drop those two places
df1 = df1[~df1['code'].isin(['BLM', 'BVT'])]


In [17]:
df1.isnull().sum()


code            0
country_name    0
year            0
tas             1
tasmax          1
tasmin          2
pr              0
dtype: int64

In [18]:
df_missing = df1[df1.isnull().any(axis=1)]
print(df_missing)

      code country_name  year   tas  tasmax  tasmin      pr
4678   CAN       Canada  1905 -5.01     NaN  -10.03  532.67
13915  KAZ   Kazakhstan  1917  6.10   12.20     NaN  205.35
13945  KAZ   Kazakhstan  1947  5.76   11.54     NaN  282.96
18105  MNG     Mongolia  1925   NaN    6.86   -6.85  233.19


In [20]:
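# Sort so that each country's years are in order before interpolating
df1 = df1.sort_values(['code','year'])

# Fill the four remaining gaps by linear interpolation within each country
for col in ['tas', 'tasmax', 'tasmin', 'pr']: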
    df1[col] = (
        df1.groupby('code')[col]
            .transform(lambda g: g.interpolate(limit_direction='both'))
    )
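
The effect of the group-wise interpolation can be seen in a small sketch. The codes and values below are made up. With the default linear method, an interior gap gets the midpoint of its neighbours, and `limit_direction='both'` also lets a gap at the start of a series be filled from the nearest valid value.

```python
import numpy as np
import pandas as pd

# Made-up series for two hypothetical country codes, each with one gap.
toy = pd.DataFrame({
    "code": ["AAA"] * 4 + ["BBB"] * 4,
    "year": [1901, 1902, 1903, 1904] * 2,
    "tas": [10.0, np.nan, 12.0, 13.0,      # interior gap
            np.nan, 21.0, 22.0, 23.0],     # gap at the start of the series
})

# Interpolate within each country so values never leak across countries.
toy["tas_filled"] = (
    toy.groupby("code")["tas"]
       .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(toy)
```

The default linear method treats rows as equally spaced, so it only matches the calendar when each country's rows are sorted by year and no year is absent, which is why the sort comes first.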

In [21]:
# Confirm that no missing values remain (an empty frame is expected)
df1[df1[['tas', 'tasmax', 'tasmin', 'pr']].isnull().any(axis=1)]

Unnamed: 0,code,country_name,year,tas,tasmax,tasmin,pr
