# COMP 4462 Project
## Visualization of Hong Kong Compared to Global

This notebook documents the data preprocessing for the bubble chart, in which each country's emissions are plotted against its temperature rise.

Because this visualization compares Hong Kong with the global average, it cannot be built from Data.gov data alone. I have therefore imported several datasets from Kaggle.com. All other visualizations in our group use data from Data.gov.

Acknowledgement of the data sources:


<strong>Data for greenhouse gas emission:</strong>
https://www.kaggle.com/datasets/unitednations/international-greenhouse-gas-emissions


<strong>Data for temperature rise by Country</strong>
https://www.kaggle.com/datasets/shabou/ghg-emissions?select=owid-co2-data.csv



In [1]:
import pandas as pd
import numpy as np

# Data for temperature rise
df_temp = pd.read_csv('GlobalLandTemperaturesByCountry.csv', parse_dates = ["dt"])

# Data for greenhouse gas emission
df_emission = pd.read_csv('owid-co2-data.csv')

### Exploring the Data

Let's see what our data looks like.

In [2]:
df_temp.shape

(577462, 4)

In [3]:
df_temp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country
0,1743-11-01,4.384,2.294,Åland
1,1743-12-01,,,Åland
2,1744-01-01,,,Åland
3,1744-02-01,,,Åland
4,1744-03-01,,,Åland


In [4]:
df_emission.shape

(24016, 38)

In [5]:
df_emission.head()

Unnamed: 0,iso_code,country,year,co2,co2_growth_prct,co2_growth_abs,consumption_co2,trade_co2,trade_co2_share,co2_per_capita,...,ghg_per_capita,methane,methane_per_capita,nitrous_oxide,nitrous_oxide_per_capita,primary_energy_consumption,energy_per_capita,energy_per_gdp,population,gdp
0,AFG,Afghanistan,1949,0.015,,,,,,0.0,...,,,,,,,,,7663783.0,
1,AFG,Afghanistan,1950,0.084,475.0,0.07,,,,0.0,...,,,,,,,,,7752000.0,19494800000.0
2,AFG,Afghanistan,1951,0.092,8.696,0.007,,,,0.0,...,,,,,,,,,7840000.0,20063850000.0
3,AFG,Afghanistan,1952,0.092,0.0,0.0,,,,0.0,...,,,,,,,,,7936000.0,20742350000.0
4,AFG,Afghanistan,1953,0.106,16.0,0.015,,,,0.0,...,,,,,,,,,8040000.0,22015460000.0


### Filtering the data

The temperature dataset is recorded by month. This is too fine for our visualization, which works by year, so I will average the monthly data into yearly data.

For the greenhouse gas emission data, only the fields country, year, co2 and population are useful.

In [6]:
# copy() so later column edits act on an independent dataframe, not a view
df_emission = df_emission[["country", "year", "co2", "population"]].copy()

In [7]:
df_emission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24016 entries, 0 to 24015
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     24016 non-null  object 
 1   year        24016 non-null  int64  
 2   co2         23372 non-null  float64
 3   population  19394 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 750.6+ KB


In [8]:
# Rename the columns to match the temperature dataset so the two can be merged easily later

df_emission = df_emission.rename(columns = {'year':'dt', 'country':'Country'})

In [9]:
# Parse the integer year into a datetime (January 1 of that year)

df_emission["dt"] = pd.to_datetime(df_emission["dt"], format = "%Y")

In [10]:
df_emission.head()

Unnamed: 0,Country,dt,co2,population
0,Afghanistan,1949-01-01,0.015,7663783.0
1,Afghanistan,1950-01-01,0.084,7752000.0
2,Afghanistan,1951-01-01,0.092,7840000.0
3,Afghanistan,1952-01-01,0.092,7936000.0
4,Afghanistan,1953-01-01,0.106,8040000.0


In [11]:
# Keep only rows after 1990-01-01 (more on that later)
# Yearly rows are stamped January 1, so this keeps 1991 onward

df_emission = df_emission.loc[df_emission["dt"] > '1990-1-1']

In [12]:
df_emission.shape

(6757, 4)
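
As a quick illustration of how that date filter behaves, here is a small sketch on a made-up three-row frame (`toy` and its values are hypothetical, not taken from the dataset). Because every yearly row is stamped January 1, the strict `>` comparison drops the 1990 row itself.

```python
import pandas as pd

# Hypothetical toy data: one row per year, stamped on January 1
toy = pd.DataFrame({
    "dt": pd.to_datetime(["1989", "1990", "1991"], format="%Y"),
    "co2": [1.0, 2.0, 3.0],
})

# Strict comparison against 1990-01-01 excludes the 1990 row
after_1990 = toy.loc[toy["dt"] > "1990-1-1"]
kept_years = after_1990["dt"].dt.year.tolist()
print(kept_years)
```

If 1990 itself were wanted, `>=` would be the comparison to use instead.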

### Processing the Temperature Rise Data

Processing the temperature rise data is arguably trickier.

As mentioned, we need to merge the monthly data into yearly data by taking the average. We also have to keep the countries separate while doing so: only the data of one country in one specific year should be merged together.

The dataset also only contains the average temperature. For our visualization, we are more interested in <strong> temperature rise</strong>. This is data we don't have, so we need a way to calculate it.

More details will be given in the final report.
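
The averaging step described above can be sketched on a tiny made-up table. The frame `monthly`, the countries "A" and "B", and the helper column name `year` are all hypothetical; this only illustrates grouping by country and calendar year, not the exact code used below.

```python
import pandas as pd

# Hypothetical monthly readings for two made-up countries
monthly = pd.DataFrame({
    "dt": pd.to_datetime(["2000-01-01", "2000-02-01", "2001-01-01",
                          "2000-01-01", "2000-02-01"]),
    "Country": ["A", "A", "A", "B", "B"],
    "AverageTemperature": [10.0, 12.0, 14.0, 20.0, None],
})

# Group by country and calendar year, then average the months in each group.
# mean() skips missing months by default.
year = monthly["dt"].dt.year.rename("year")
yearly = (monthly
          .groupby(["Country", year])["AverageTemperature"]
          .mean()
          .reset_index())
print(yearly)
```

Each (country, year) pair ends up as one row, and a year with some missing months is averaged over the months that are present.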

In [13]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577462 entries, 0 to 577461
Data columns (total 4 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   dt                             577462 non-null  datetime64[ns]
 1   AverageTemperature             544811 non-null  float64       
 2   AverageTemperatureUncertainty  545550 non-null  float64       
 3   Country                        577462 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 17.6+ MB


In [14]:
# To keep things simple, we drop the temperature uncertainty column

df_temp = df_temp.drop("AverageTemperatureUncertainty", axis = 1)

In [15]:
# Clone the AverageTemperature column:
# the original AverageTemperature column will later hold the temperature difference,
# while the clone keeps the plain average temperature

df_temp["NewAverageTemperature"] = df_temp["AverageTemperature"]

In [16]:
# Grouping the data by country

grouped = df_temp.groupby("Country")

In [17]:
grouped.head()

Unnamed: 0,dt,AverageTemperature,Country,NewAverageTemperature
0,1743-11-01,4.384,Åland,4.384
1,1743-12-01,,Åland,
2,1744-01-01,,Åland,
3,1744-02-01,,Åland,
4,1744-03-01,,Åland,
...,...,...,...,...
575497,1850-01-01,22.187,Zimbabwe,22.187
575498,1850-02-01,23.538,Zimbabwe,23.538
575499,1850-03-01,22.528,Zimbabwe,22.528
575500,1850-04-01,20.000,Zimbabwe,20.000


(This is probably not the best solution, but it works.)

We loop through each country group. Within each group, we group again by year and merge the monthly data into one data point per year.

In [18]:
# Collect one yearly dataframe per country
frames = []

for name, g in grouped:
    
    # Thanks to: https://stackoverflow.com/questions/49216357/how-to-keep-original-index-of-a-dataframe-after-groupby-2-columns
    # Average the monthly values within each year (numeric columns only)
    group_df = g.groupby(g['dt'].dt.year)[["AverageTemperature", "NewAverageTemperature"]].mean().reset_index()
    
    group_df["Country"] = name
    
    # After grouping, dt holds the integer year; turn it back into a datetime
    group_df["dt"] = pd.to_datetime(group_df["dt"], format = "%Y")
    
    # Baseline period: all years up to 1900 (this is the cutoff the code uses)
    baseline_years = (group_df['dt'] <= '1900-1-1')
    
    # Years kept for the chart: 1991 onward
    recent_years = (group_df['dt'] > '1990-1-1')
    
    # The temperature rise is the yearly mean minus the mean over the baseline period
    baseline_mean = group_df.loc[baseline_years, "AverageTemperature"].mean()
    
    # copy() so the subtraction below acts on an independent dataframe
    group_df = group_df.loc[recent_years].copy()
    
    group_df["AverageTemperature"] -= baseline_mean
    
    frames.append(group_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
big_df = pd.concat(frames)

print(big_df)

            dt  AverageTemperature  NewAverageTemperature      Country
153 1991-01-01            0.629435              14.370750  Afghanistan
154 1992-01-01            0.314768              14.056083  Afghanistan
155 1993-01-01            0.697935              14.439250  Afghanistan
156 1994-01-01            1.013435              14.754750  Afghanistan
157 1995-01-01            1.117851              14.859167  Afghanistan
..         ...                 ...                    ...          ...
266 2009-01-01            1.528716               6.489083        Åland
267 2010-01-01           -0.098451               4.861917        Åland
268 2011-01-01            2.210383               7.170750        Åland
269 2012-01-01            1.103549               6.063917        Åland
270 2013-01-01            1.269383               6.229750        Åland

[5589 rows x 4 columns]
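
For comparison, the baseline subtraction can also be written without an explicit loop. The following is only a sketch on made-up yearly data: the frame `yearly`, the constants `BASELINE_END` and `RECENT_START`, and the column name `TemperatureRise` are assumptions for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical yearly means for two made-up countries
yearly = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "year": [1890, 1895, 1995, 1890, 1895, 1995],
    "AverageTemperature": [10.0, 12.0, 12.5, 20.0, 22.0, 21.0],
})

BASELINE_END = 1900   # assumed last year of the baseline period
RECENT_START = 1991   # assumed first year kept for the chart

# Per-country mean over the baseline years
baseline = (yearly[yearly["year"] <= BASELINE_END]
            .groupby("Country")["AverageTemperature"]
            .mean())

# Keep the recent years and subtract each country's own baseline
recent = yearly[yearly["year"] >= RECENT_START].copy()
recent["TemperatureRise"] = (recent["AverageTemperature"]
                             - recent["Country"].map(baseline))
print(recent)
```

Mapping the per-country baseline back onto the rows replaces the loop, and a country with no baseline years simply gets NaN for its rise.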


### Merging the data

We now have two dataframes:

`big_df`: Storing the temperature rise and average temperature data, by year


`df_emission`: Storing the emission and population data, also by year

In [19]:
# Merging the data on date and country, keeping every row of the temperature data

bigger = pd.merge(big_df, df_emission, on=['dt', 'Country'], how='left')

In [20]:
bigger.shape

(5589, 6)

### Dropping invalid data

Many data points in the dataframe are now empty, mostly because some years are missing in the original datasets.

Each dataset also contains some countries that the other does not.

<strong>There are probably better ways to process the data, but for simplicity I just drop all rows with empty values. There should still be more than enough data for the visualization.</strong>
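
To see where such gaps come from, a left merge can be inspected with the `indicator` option of `pd.merge`. This is a sketch on hypothetical stand-in tables (`temps`, `emissions`, and the countries "A", "B", "C" are made up); it is not part of the actual pipeline.

```python
import pandas as pd

# Hypothetical stand-ins for the two yearly tables
temps = pd.DataFrame({"Country": ["A", "B"], "dt": [1991, 1991],
                      "rise": [0.5, 0.7]})
emissions = pd.DataFrame({"Country": ["A", "C"], "dt": [1991, 1991],
                          "co2": [3.0, 9.0]})

# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(temps, emissions, on=["dt", "Country"],
                  how="left", indicator=True)

# Rows marked left_only had no emission record, so their co2 is NaN
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# dropna() then removes exactly those incomplete rows
complete = merged.drop(columns="_merge").dropna()
print(complete)
```

Country "C" never appears because a left merge keeps only the rows of the left table.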

In [21]:
bigger = bigger.dropna()

### Taking log of the data

The visualization would be very hard to read without taking the log of the data, because the carbon emissions of some countries are far larger than those of others.

I will justify this in my final report and in the presentation.

In [22]:
bigger["co2_log"] = np.log(bigger["co2"])
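
One thing to keep in mind with `np.log` (the natural logarithm) is that the log of zero is negative infinity, which `dropna()` does not remove. Whether the emission column actually contains zeros is not checked here, so the following is only a precautionary sketch on made-up values; the series `co2` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical emission values spanning several orders of magnitude
co2 = pd.Series([0.0, 1.0, 100.0, 10000.0])

# Mask non-positive values to NaN first so np.log never sees them,
# then drop the resulting NaN entries
safe = co2.where(co2 > 0)
co2_log = np.log(safe).dropna()
print(co2_log)
```

After the transform, values that differed by a factor of 10000 sit within about ten units of each other, which is what makes the bubble chart readable.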

In [23]:
# Exporting the data, done!

bigger.to_csv('new.csv')