# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/Co2Emissions.csv



# Install Python packages in the notebook

In [5]:
%pip install -r C:\Users\Grampers\Desktop\CO2Oracle\requirements.txt





[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\Grampers\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


# Change working directory

We need to change the working directory from its current folder to its parent folder.
- We access the current directory with os.getcwd().

In [6]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Grampers\\Desktop\\CO2Oracle\\jupyter_notebooks'

I want to make the parent of the current directory the new current directory.
- os.path.dirname() gets the parent directory
- os.chdir() sets the new current directory
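
As a small illustration of the first bullet, the sketch below shows what os.path.dirname() returns for a folder path, without changing the working directory. The helper name `parent_directory` and the relative example path are assumptions made for this illustration; they are not part of the notebook.

```python
import os


def parent_directory(path):
    """Return the parent folder of `path` without touching the working directory."""
    # normpath strips a trailing separator, so dirname returns the real parent
    return os.path.dirname(os.path.normpath(path))


# Hypothetical relative path that mirrors the project layout
notebook_dir = os.path.join("CO2Oracle", "jupyter_notebooks")
print(parent_directory(notebook_dir))  # prints CO2Oracle
```

Passing the result to os.chdir() is what actually moves the session into that folder.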

In [7]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [8]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Grampers\\Desktop\\CO2Oracle'

# Fetch data from Kaggle

Install the Kaggle package to fetch the data.

In [9]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.





In [13]:
import os

# Tell the Kaggle API to look for kaggle.json in the current working directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
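
Because the download depends on kaggle.json being present in the folder that KAGGLE_CONFIG_DIR points to, it can help to check for the token before calling the Kaggle CLI. The following is a minimal sketch under that assumption; the helper name `kaggle_token_path` is hypothetical and not part of the notebook or the Kaggle package.

```python
import os


def kaggle_token_path(config_dir):
    """Return the path to kaggle.json inside config_dir, or raise if it is missing."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        # Fail early with a clear message instead of a CLI authentication error
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    return token_path
```

In the notebook this could be called as `kaggle_token_path(os.environ['KAGGLE_CONFIG_DIR'])` right before the download cell.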

In [14]:
KaggleDatasetPath = "thedevastator/global-fossil-co2-emissions-by-country-2002-2022"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/thedevastator/global-fossil-co2-emissions-by-country-2002-2022
License(s): CC0-1.0
global-fossil-co2-emissions-by-country-2002-2022.zip: Skipping, found more recently modified local copy (use --force to force download)


In [18]:
import os
import zipfile

# Define the destination folder
destination_folder = 'inputs/datasets/raw'

# Unzip all files in the destination folder
for item in os.listdir(destination_folder):
    if item.endswith('.zip'):  # Check for ZIP files
        file_path = os.path.join(destination_folder, item)
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(destination_folder)  # Extract all contents
        os.remove(file_path)  # Remove the ZIP file after extracting

# Delete the kaggle.json token so it is not left in the project folder
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
if os.path.exists(kaggle_json_path):  # Skip if the token was already removed
    os.remove(kaggle_json_path)

## Load and Inspect Kaggle data

- The ISO 3166-1 alpha-3 column holds country codes that duplicate the information in the Country column, so I removed it from the DataFrame to avoid redundancy. The inplace=True parameter modifies the DataFrame df directly, so the result does not need to be assigned back to df.
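
To illustrate the difference between inplace=True and reassignment, here is a small sketch on a toy DataFrame. The rows are made-up values for illustration only; just the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "ISO 3166-1 alpha-3": ["AFG", "ALB"],
    "Total": [0.0, 0.0],
})

# Option 1: drop() returns a new DataFrame and leaves the original untouched
reassigned = toy.drop(columns=["ISO 3166-1 alpha-3"])

# Option 2: inplace=True modifies the frame itself and returns None
in_place = toy.copy()
result = in_place.drop(columns=["ISO 3166-1 alpha-3"], inplace=True)

print(list(reassigned.columns))
print(list(in_place.columns))
print(result)
```

Both options end with the same two columns, Country and Total. The in-place call returns None, so its result should not be assigned back to the variable.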

In [19]:
import pandas as pd

# Load the dataset
df = pd.read_csv("inputs/datasets/raw/GCB2022v27_MtCO2_flat.csv")

# Drop the redundant 'ISO 3166-1 alpha-3' country-code column
df.drop(columns=['ISO 3166-1 alpha-3'], inplace=True)

# Display the first 100 rows
print(df.head(100))
print(df.isnull().sum())  # Count missing values per column








        Country  Year  Total  Coal  Oil  Gas  Cement  Flaring  Other  \
0   Afghanistan  1750    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
1   Afghanistan  1751    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
2   Afghanistan  1752    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
3   Afghanistan  1753    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
4   Afghanistan  1754    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
..          ...   ...    ...   ...  ...  ...     ...      ...    ...   
95  Afghanistan  1845    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
96  Afghanistan  1846    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
97  Afghanistan  1847    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
98  Afghanistan  1848    0.0   NaN  NaN  NaN     NaN      NaN    NaN   
99  Afghanistan  1849    0.0   NaN  NaN  NaN     NaN      NaN    NaN   

    Per Capita  
0          NaN  
1          NaN  
2          NaN  
3          NaN  
4          NaN  
..         ...  
95         NaN  

DataFrame Summary

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63104 entries, 0 to 63103
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     63104 non-null  object 
 1   Year        63104 non-null  int64  
 2   Total       62904 non-null  float64
 3   Coal        21744 non-null  float64
 4   Oil         21717 non-null  float64
 5   Gas         21618 non-null  float64
 6   Cement      20814 non-null  float64
 7   Flaring     21550 non-null  float64
 8   Other       1620 non-null   float64
 9   Per Capita  18974 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 4.8+ MB


## Push file to Repo

In [21]:
import os

# Create the outputs/datasets/collection folder if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/Co2Emissions.csv", index=False)