# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/Co2Emissions.csv



# Install python packages in the notebooks

In [2]:
%pip install -r /workspace/pp5_co2_oracle/requirements.txt

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.16/bin/python -m pip install --upgrade pip' command.[0m[33m
[0mNote: you may need to restart the kernel to use updated packages.


# Change working directory

We need to change the working directory from its current folder to its parent folder
-    We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5_co2_oracle/jupyter_notebooks'

I want to make the parent of the current directory the new current directory.
- os.path.dirname() gets the parent directory
- os.chir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5_co2_oracle'

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [6]:
%pip install kaggle

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.16/bin/python -m pip install --upgrade pip' command.[0m[33m
[0mNote: you may need to restart the kernel to use updated packages.


In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

In [8]:
KaggleDatasetPath = "thedevastator/global-fossil-co2-emissions-by-country-2002-2022"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading global-fossil-co2-emissions-by-country-2002-2022.zip to inputs/datasets/raw
 59%|██████████████████████▎               | 1.00M/1.71M [00:00<00:00, 2.61MB/s]
100%|██████████████████████████████████████| 1.71M/1.71M [00:00<00:00, 3.81MB/s]


In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
  

Archive:  inputs/datasets/raw/global-fossil-co2-emissions-by-country-2002-2022.zip
replace inputs/datasets/raw/GCB2022v27_MtCO2_flat.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


## Load and Inspect Kaggle data

- I removed the column named ISO 3166-1 alpha-3 from the DataFrame. The ISO 3166-1 alpha-3 column contains country codes which refer to another column called Country. To avoid redundancy, this column is removed. The inplace=True parameter ensures that the DataFrame df is modified directly, without needing to assign it back to df.

In [10]:
import pandas as pd

# Load the dataset
df = pd.read_csv("inputs/datasets/raw/GCB2022v27_MtCO2_flat.csv")

# Drop the 'Code' column
df.drop(columns=['ISO 3166-1 alpha-3'], inplace=True)

# Impute missing values with mean for all columns
df_imputed = df.fillna(df.mean())

# Display the DataFrame after imputation
print(df_imputed.head(100))
print(df_imputed.isnull().sum())  # Check for any remaining missing values








        Country  Year  Total       Coal        Oil        Gas    Cement  \
0   Afghanistan  1750    0.0  73.968916  55.760624  23.504285  4.330443   
1   Afghanistan  1751    0.0  73.968916  55.760624  23.504285  4.330443   
2   Afghanistan  1752    0.0  73.968916  55.760624  23.504285  4.330443   
3   Afghanistan  1753    0.0  73.968916  55.760624  23.504285  4.330443   
4   Afghanistan  1754    0.0  73.968916  55.760624  23.504285  4.330443   
..          ...   ...    ...        ...        ...        ...       ...   
95  Afghanistan  1845    0.0  73.968916  55.760624  23.504285  4.330443   
96  Afghanistan  1846    0.0  73.968916  55.760624  23.504285  4.330443   
97  Afghanistan  1847    0.0  73.968916  55.760624  23.504285  4.330443   
98  Afghanistan  1848    0.0  73.968916  55.760624  23.504285  4.330443   
99  Afghanistan  1849    0.0  73.968916  55.760624  23.504285  4.330443   

     Flaring      Other  Per Capita  
0   1.712695  10.951389    4.413363  
1   1.712695  10.951389

  df_imputed = df.fillna(df.mean())


DataFrame Summary

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63104 entries, 0 to 63103
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     63104 non-null  object 
 1   Year        63104 non-null  int64  
 2   Total       62904 non-null  float64
 3   Coal        21744 non-null  float64
 4   Oil         21717 non-null  float64
 5   Gas         21618 non-null  float64
 6   Cement      20814 non-null  float64
 7   Flaring     21550 non-null  float64
 8   Other       1620 non-null   float64
 9   Per Capita  18974 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 4.8+ MB


## Push file to Repo

In [12]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df.to_csv(f"outputs/datasets/collection/Co2Emissions.csv",index=False)