# Data Cleaning

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv("Carbon_Projects.csv")
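# Note: df.head without parentheses prints the bound method, whose repr shows the whole frame.
# Call df.head() to see only the first five rows.
print(df.head)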

<bound method NDFrame.head of         ID                                               Name  \
0     5302       SD Vista - Solar Water Pump Project in Kenya   
1     5264                 Yunfu LFG Power Generation Project   
2     5214  Pingjiang County Domestic Waste Harmless Landf...   
3     5203  Luoding BCCY New Power CO., Ltd. MSW biogas to...   
4     5186              Quanzhou Canhua PET Recycling Project   
...    ...                                                ...   
1951  2166  Chongqing Youyang County Youchou Hydropower St...   
1952  2162  2 x 3.5 MW Ullunkal Hydro Power Project in Ker...   
1953  2157                                    BAESA Project `   
1954  2136                     cancelled duplicate of VCSR218   
1955  2126  7.3 MW Bundled Wind Power Project by Oswal Cables   

                                              Proponent  \
0                              Sunculture Kenya Limited   
1                                   Multiple Proponents   
2           

In [2]:
# The dataset has 1956 rows and 13 columns.
# Next, check whether any column contains missing values.
print(df.isnull().any())

ID                                      False
Name                                     True
Proponent                               False
Project Type                            False
AFOLU Activities                         True
Methodology                             False
Status                                  False
Country/Area                            False
Estimated Annual Emission Reductions    False
Region                                   True
Project Registration Date                True
Crediting Period Start Date              True
Crediting Period End Date                True
dtype: bool
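
`isnull().any()` only reports whether a column has gaps at all. A per-column count and share can help when deciding how to handle them. The following is a minimal sketch on a small made-up frame (`toy` is hypothetical; the column names are borrowed from the dataset, the values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["Project A", None, "Project C", "Project D"],
    "Region": ["Africa", "Asia", None, None],
    "Status": ["Registered", "Registered", "Withdrawn", "Registered"],
})

# Number of missing values per column, and their share of all rows.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

The same two calls should work unchanged on `df`, assuming it has been loaded as in the first cell.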


In [3]:
# Check the data type of each column.
# df.info() prints its report directly and returns None, so it does not need print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1956 entries, 0 to 1955
Data columns (total 13 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   ID                                    1956 non-null   int64 
 1   Name                                  1955 non-null   object
 2   Proponent                             1956 non-null   object
 3   Project Type                          1956 non-null   object
 4   AFOLU Activities                      533 non-null    object
 5   Methodology                           1956 non-null   object
 6   Status                                1956 non-null   object
 7   Country/Area                          1956 non-null   object
 8   Estimated Annual Emission Reductions  1956 non-null   object
 9   Region                                1875 non-null   object
 10  Project Registration Date             1553 non-null   object
 11  Crediting Period Start Date   

In [4]:
# Convert the Estimated Annual Emission Reductions column from text to integers for further analysis.
# The values contain thousands separators, so the commas are stripped first (literal match, no regex needed).
df['Estimated Annual Emission Reductions'] = df['Estimated Annual Emission Reductions'].str.replace(',', '', regex=False).astype('Int64')

In [5]:
# Confirm that the column now has dtype Int64 (pandas' nullable integer type)
print(df['Estimated Annual Emission Reductions'].dtype)

Int64


In [6]:
print(df['Estimated Annual Emission Reductions'])

0        40000
1        47759
2        60882
3        44555
4        58574
         ...  
1951    334000
1952     16125
1953    318793
1954    115912
1955     14832
Name: Estimated Annual Emission Reductions, Length: 1956, dtype: Int64
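
The conversion above works here because every entry in the column is a number written with thousands separators. If a future version of the file contained entries that are not clean numbers, `astype('Int64')` would raise. One possible alternative, sketched on a made-up sample (`raw` and its values are hypothetical), is to let `pd.to_numeric` turn unparseable entries into missing values:

```python
import pandas as pd

# Hypothetical sample; "n/a" stands in for any entry that is not a number.
raw = pd.Series(
    ["40,000", "47,759", "n/a", None],
    name="Estimated Annual Emission Reductions",
)

# Strip thousands separators literally, then coerce anything non-numeric
# to a missing value instead of raising an error.
without_commas = raw.str.replace(",", "", regex=False)
cleaned = pd.to_numeric(without_commas, errors="coerce").astype("Int64")

print(cleaned)
```

The trade-off is that bad entries are silently replaced by `<NA>`, so it is worth counting them with `cleaned.isna().sum()` afterwards.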


In [None]:
# The Project Type column has no missing values.
# Next, look at the carbon projects by project type, starting with the distinct values in the column.
unique_count = df['Project Type'].unique()

In [9]:
unique_count

array(['Energy distribution',
       'Energy industries (renewable/non-renewable sources); Waste handling and disposal',
       'Waste handling and disposal', 'Transport',
       'Energy industries (renewable/non-renewable sources); Mining/mineral production',
       'Agriculture Forestry and Other Land Use', 'Energy demand',
       'Livestock, enteric fermentation, and manure management; Waste handling and disposal',
       'Energy industries (renewable/non-renewable sources); Fugitive emissions from fuels (solid, oil and gas); Mining/mineral production',
       'Energy industries (renewable/non-renewable sources); Livestock, enteric fermentation, and manure management; Waste handling and disposal',
       'Energy demand; Waste handling and disposal',
       'Energy industries (renewable/non-renewable sources); Transport',
       'Livestock, enteric fermentation, and manure management',
       'Energy industries (renewable/non-renewable sources)',
       'Transport; Waste handling and

In [13]:
df['Project Type'].agg(['count', 'nunique'])

count      1956
nunique      32
Name: Project Type, dtype: int64

In [14]:
df['Project Type'].nunique()

32

In this dataset, there are 32 unique project types.
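
Many of these labels combine several sectors separated by `; `, so counting projects per label and per individual sector gives two different views. A minimal sketch, assuming `; ` is the only separator used; the sample series `types` is hypothetical, with labels taken from the output above and invented frequencies:

```python
import pandas as pd

# Hypothetical sample; labels come from the output above, frequencies are invented.
types = pd.Series([
    "Energy distribution",
    "Energy industries (renewable/non-renewable sources); Waste handling and disposal",
    "Waste handling and disposal",
    "Waste handling and disposal",
], name="Project Type")

# Projects per combined label, most frequent first.
by_label = types.value_counts()

# Projects per individual sector: split each label on "; " and
# give every sector its own row before counting.
by_sector = types.str.split("; ").explode().value_counts()

print(by_label)
print(by_sector)
```

Applied to `df['Project Type']`, the per-sector counts add up to more than the number of projects, because a project with a combined label is counted once for each sector it lists.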

In the Project Type column, the project types registered with Verra are:

- Energy distribution
- Energy industries (renewable/non-renewable sources); Waste handling and disposal
- Waste handling and disposal
- Transport
- 