In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/Climate_Change_Indicators.csv')
df.head(5)

,Year,Global Average Temperature (°C),CO2 Concentration (ppm),Sea Level Rise (mm),Arctic Ice Area (million km²)
0,1948,13.17,397.04,116.25,5.97
1,1996,13.1,313.17,277.92,9.66
2,2015,14.67,311.95,290.32,8.4
3,1966,14.79,304.25,189.71,11.83
4,1992,13.15,354.52,14.84,11.23


In [4]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 Year                               0
Global Average Temperature (°C)    0
CO2 Concentration (ppm)            0
Sea Level Rise (mm)                0
Arctic Ice Area (million km²)      0
dtype: int64


###### The line `missing_values = df.isnull().sum()` counts the missing (null) values in each column of the DataFrame.
###### `df.isnull()` returns a boolean DataFrame of the same shape as `df`: `True` marks a missing value (NaN) and `False` marks a non-missing value. `.sum()` then adds up the `True` values in each column (each `True` counts as 1), and the resulting per-column counts are printed.
###### `missing_values` is a pandas `Series` whose index holds the column names and whose values hold the count of missing values in each column.

###### Since there are no missing values in the data, no handling or cleaning of missing values is needed.
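
###### As a minimal sketch of how `isnull().sum()` behaves when values are actually missing, the example below uses a small made-up DataFrame (the name `toy` and its numbers are hypothetical and not taken from the climate dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the climate dataset) with two missing entries
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "CO2 Concentration (ppm)": [369.5, np.nan, 373.2],
    "Sea Level Rise (mm)": [np.nan, 3.1, 3.3],
})

mask = toy.isnull()          # boolean DataFrame: True where a value is missing
missing_counts = mask.sum()  # each True counts as 1, summed per column
print(missing_counts)
```

###### Here `Year` has no missing entries, while each of the other two columns has exactly one, so a non-zero count signals a column that needs attention.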

In [5]:
# Check for inconsistent values

# Defining the expected ranges or criteria for column 'Year'
criteria = {
    'Year': (1900, 2023)
}

# Function to check for inconsistent values
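def check_inconsistent_values(df, criteria):
    """Return, for each column in `criteria`, the rows whose value lies outside [min_val, max_val]."""
    inconsistent_values = {}
    for column, (min_val, max_val) in criteria.items():
        # Flag rows strictly below the minimum or strictly above the maximum
        out_of_range = (df[column] < min_val) | (df[column] > max_val)
        inconsistent_values[column] = df[out_of_range]
    return inconsistent_values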

In [6]:
# Check for inconsistent values
inconsistent_values = check_inconsistent_values(df, criteria)

# Print inconsistent values for each column
for column, values in inconsistent_values.items():
    if not values.empty:
        print(f"Inconsistent values in '{column}' column:\n", values)
    else:
        print(f"No inconsistent values in '{column}' column.")

No inconsistent values in 'Year' column.


###### To check for inconsistent values in the 'Year' column, we first define the expected range: `criteria = { 'Year': (1900, 2023) }`.
###### Then we define the function `check_inconsistent_values`, which takes two arguments: a DataFrame `df` and a dictionary `criteria`. It initializes an empty dictionary `inconsistent_values`, then iterates over each column and its range `(min_val, max_val)` in `criteria`. For each column, it filters the DataFrame for rows where the value is less than `min_val` or greater than `max_val`, and stores those rows in `inconsistent_values` under the column name.
###### Lastly, we print the inconsistent rows for each column, or a message saying that none were found.
###### If we know the expected ranges for other columns, we can check them as well simply by adding those ranges to the `criteria` dictionary.
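
###### A minimal sketch of extending the criteria to a second column follows. The `sample` rows are made up, and the CO2 range `(280, 420)` is an assumed range chosen only for illustration, not a validated physical bound.

```python
import pandas as pd

def check_inconsistent_values(df, criteria):
    # Same logic as the function defined above: collect out-of-range rows per column
    inconsistent_values = {}
    for column, (min_val, max_val) in criteria.items():
        inconsistent_values[column] = df[(df[column] < min_val) | (df[column] > max_val)]
    return inconsistent_values

# Hypothetical rows, two of which violate a range on purpose
sample = pd.DataFrame({
    "Year": [1950, 2030, 1999],
    "CO2 Concentration (ppm)": [310.0, 400.0, 150.0],
})

extended_criteria = {
    "Year": (1900, 2023),
    "CO2 Concentration (ppm)": (280, 420),  # assumed range, for illustration only
}

result = check_inconsistent_values(sample, extended_criteria)
for column, rows in result.items():
    print(f"{column}: out-of-range row indices = {list(rows.index)}")
```

###### Each column is checked independently, so a row can show up under one column and not another.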

In [8]:
summary_statistics = df.describe()
print(summary_statistics)

               Year  Global Average Temperature (°C)  CO2 Concentration (ppm)  \
count  1.048576e+06                     1.048576e+06             1.048576e+06   
mean   1.961505e+03                     1.449954e+01             3.500280e+02   
std    3.579736e+01                     8.661005e-01             4.042409e+01   
min    1.900000e+03                     1.300000e+01             2.800000e+02   
25%    1.930000e+03                     1.375000e+01             3.149900e+02   
50%    1.962000e+03                     1.450000e+01             3.500700e+02   
75%    1.993000e+03                     1.525000e+01             3.850200e+02   
max    2.023000e+03                     1.600000e+01             4.200000e+02   

       Sea Level Rise (mm)  Arctic Ice Area (million km²)  
count         1.048576e+06                   1.048576e+06  
mean          1.499900e+02                   9.000896e+00  
std           8.657659e+01                   3.462551e+00  
min           0.000000e+00    

In [17]:
# Aggregate the data by year, computing the average for each climate variable

df_grouped_by_year = df.groupby("Year").mean()
df_grouped_by_year

Year,Global Average Temperature (°C),CO2 Concentration (ppm),Sea Level Rise (mm),Arctic Ice Area (million km²)
1900,14.506663,350.373405,150.408288,8.978659
1901,14.485343,349.757140,150.548828,8.947272
1902,14.476262,349.299686,152.174821,9.035554
1903,14.492360,349.644375,150.138338,9.056501
1904,14.494241,349.537032,150.667318,8.990691
...,...,...,...,...
2019,14.500105,348.642249,151.020415,9.014690
2020,14.496937,350.021731,150.219741,9.054254
2021,14.501424,350.150302,150.187456,8.968700
2022,14.495233,350.493023,148.857646,8.942012
