# Exploratory Data Analysis of CDP dataset 


See the documentation of the dataset [here](https://github.com/OpenGeoScales/ogs-data-exploration/blob/main/data/ghg-emissions/cdp/README.md) for more details on the data source and calculation methods.

### Summary :
0. Stacking every yearly report
1. Missing values
2. Geospatial coverage
3. Temporal coverage
4. Emissions analysis
5. Gases included

**To Do :**
- [ ] apply preprocessing for measurement year
- [ ] stack all yearly reports (maybe rename every column to match a reference list and create the missing columns, then stack)
- [x] geospatial analysis
- [x] update the geospatial analysis, considering that many cities do not have any emissions (490 vs 723)
- [x] create new columns Scope_1, Scope_2 and Scope_3 with the total value of emissions
- [ ] check what are BASIC emissions (and check if there are cases with BASIC but not Scope_X)
- [ ] analysis of emissions time series (min/max, distribution)

Also :
- [ ] see why we have duplicates within the same year's report
- [ ] Missing cities: extract the city name from "Organization" by matching it against a reference list (of city names per country)
- [ ] plot cities on a map to check that the coordinates are correct (do it for some countries)

### Steps of preprocessing needed for each dataset: (draft)
The year of measurement must be handled differently depending on the report year:
- split accounting year start/end for 2013 and 2017-2020
- clean measurement year for 2015 and 2016
- already clean for 2012 and 2014
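
A minimal sketch of the "clean measurement year" step, assuming the messy reports contain a four-digit year somewhere in a free-text field. The sample strings are invented; the real 2015/2016 formats still need to be checked.

```python
import pandas as pd

# Invented raw values; the real 2015/2016 formats may differ.
raw_year = pd.Series(["2013", "2013 (calendar year)", "FY 2014", "unknown", None])

# Extract the first plausible four-digit year (1900-2099) from each string.
extracted = raw_year.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

# Strings without a recognizable year become NaN instead of raising.
measurement_year = pd.to_numeric(extracted, errors="coerce")
```

Values that cannot be parsed stay missing, so they can be reviewed by hand rather than silently guessed.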

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [67]:
data_17 = pd.read_csv("../../../data/ghg-emissions/cdp/2017_-_Cities_Community_Wide_Emissions.csv")
data_16 = pd.read_csv("../../../data/ghg-emissions/cdp/2016_-_Citywide_GHG_Emissions.csv")
data_15 = pd.read_csv("../../../data/ghg-emissions/cdp/2015_-_Citywide_Emissions.csv")
data_14 = pd.read_csv("../../../data/ghg-emissions/cdp/2014_-_Citywide_GHG_Emissions.csv")
data_13 = pd.read_csv("../../../data/ghg-emissions/cdp/Citywide_GHG_Emissions_2013.csv")
data_12 = pd.read_csv("../../../data/ghg-emissions/cdp/2012_-_Citywide_GHG_Emissions.csv")

data_18 = pd.read_csv("../../../data/ghg-emissions/cdp/2018_-_2019_City-wide_Emissions.csv")
data_19 = pd.read_csv("../../../data/ghg-emissions/cdp/2019_City-wide_Emissions.csv")
data_20 = pd.read_csv("../../../data/ghg-emissions/cdp/2020_-_City-Wide_Emissions.csv")


In [68]:
cols_mapping = pd.read_excel("../../../data/ghg-emissions/cdp/columns_mapping.xls", sheet_name='Sheet1')

# Stacking all the reports

Every report has a different number of columns, different column names, and a different column order.
Here we use the mapping file `columns_mapping.xls` to rename the columns to a common reference and thus be able to combine all reports into one dataframe.
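
The idea can be sketched on toy data. The two small reports and the three-row mapping below are invented; only the layout (one row per reference column, one column per report year) mirrors the mapping file. `reindex` is used here as one possible way to add the missing columns and fix the order in a single step.

```python
import pandas as pd

# Hypothetical mapping in the same layout as the mapping file.
mapping = pd.DataFrame({
    "Mapped Columns": ["Account Number", "Organization", "CDP Region"],
    2012: ["Account No", "City Name", None],
    2020: ["Account Number", "Organization", "CDP Region"],
})

# Two invented reports with different names, order and number of columns.
reports = {
    2012: pd.DataFrame({"Account No": [1], "City Name": ["A"]}),
    2020: pd.DataFrame({"Organization": ["B"], "CDP Region": ["Europe"],
                        "Account Number": [2]}),
}

reference = mapping["Mapped Columns"].tolist()
harmonized = []
for year, report in reports.items():
    # Build {original name: reference name} for this year, skipping gaps.
    pairs = mapping[[year, "Mapped Columns"]].dropna()
    rename = dict(zip(pairs[year], pairs["Mapped Columns"]))
    renamed = report.rename(columns=lambda c: rename.get(c.strip(), c))
    # reindex adds the missing reference columns (as NaN) and fixes the order.
    harmonized.append(renamed.reindex(columns=reference))

stacked = pd.concat(harmonized, ignore_index=True)
```

Because `reindex` returns a new DataFrame, the results are collected in a new list instead of being modified in place.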

In [69]:
# This is the mapping file that shows which columns are present in each report
cols_mapping

Unnamed: 0,Mapped Columns,2012,2013,2014,2015,2016,2017,2018,2019,2020,Commentaires
0,Year Reported to CDP,Reporting Year,Reporting Year,Reporting Year,Reporting Year,Reporting Year,Reporting year,Year Reported to CDP,Year Reported to CDP,Year Reported to CDP,
1,Account Number,Account No,Account No,Account No,Account No,Account Number,Account number,Account Number,Account Number,Account Number,
2,Organization,City Name,City Name,City Name,City Name,City Name,Organization,Organization,Organization,Organization,
3,City,City Short Name,City Short Name,City Short Name,City Short Name,City Short Name,City,City,City,City,
4,Country,Country,Country,Country,Country,Country,Country,Country,Country,Country,
5,CDP Region,,,,,,Region,CDP Region,CDP Region,CDP Region,
6,C40,C40,C40,C40,C40,C40,C40,,,,
7,Reporting Authority,,,,,,,Reporting Authority,Reporting Authority,,
8,Access,,,,,,Access,Access,Access,Access,
9,City-wide emissions inventory,,,,,,,City-wide Emissions Inventory,City-wide Emissions Inventory,City-wide emissions inventory,


In [70]:
datasets = [data_12, data_13, data_14, data_15, data_16, data_17, data_18, data_19, data_20]


In [71]:
# rename the columns that cause problems
# these column names contain unusual characters
data_18.rename(columns = {data_18.columns[19]: 'Emissions occurring outside city boundary/ Scope 3 (metric tonnes CO2e) for Total generation of grid supplied energy',
                         data_18.columns[20]: 'Emissions occurring outside city boundary/ Scope 3 (metric tonnes CO2e) for Total emissions (excluding generation of grid supplied energy)'},
              inplace = True)
data_17.rename(columns = {'​Average altitude (m)': 'Average altitude (m)'}, inplace = True)

In [72]:
# check that every column name is spelled correctly in the mapping file (to avoid issues)
# if a column is not recognized in the mapping file, its name is printed below

year = 2012
for dataset in datasets:
    print(f'Report of the year : {year}')
    
    for col in dataset.columns:
        if col.strip() not in cols_mapping[year].values:
            print(col)
    
    year += 1

Report of the year : 2012
Report of the year : 2013
Report of the year : 2014
Report of the year : 2015
Report of the year : 2016
Report of the year : 2017
Report of the year : 2018
Report of the year : 2019
Report of the year : 2020


In [73]:
# Give every report the same column names
year = 2012
for dataset in datasets:
    # Rename each column to its reference name from the mapping.
    # No membership check is needed here: the previous cell confirmed that
    # every column of every report appears in the mapping file.
    for col in list(dataset.columns):
        dataset.rename(
            columns = {col: cols_mapping[cols_mapping.loc[:, year] == col.strip()]['Mapped Columns'].values[0]},
            inplace = True
        )
    
    for ref_col in cols_mapping['Mapped Columns'].values:
        # if a reference column does not exist in the yearly report, create it and fill it with NaN
        if ref_col not in list(dataset.columns):
            dataset[ref_col] = np.nan
    # The following line would put the columns in the right order, but it only rebinds the loop
    # variable and does not change the DataFrames in the list `datasets`, so we do it separately
    #dataset = dataset[list(cols_mapping['Mapped Columns'].values)]
    year += 1

In [74]:
data_12 = data_12[list(cols_mapping['Mapped Columns'].values)]
data_13 = data_13[list(cols_mapping['Mapped Columns'].values)]
data_14 = data_14[list(cols_mapping['Mapped Columns'].values)]
data_15 = data_15[list(cols_mapping['Mapped Columns'].values)]
data_16 = data_16[list(cols_mapping['Mapped Columns'].values)]
data_17 = data_17[list(cols_mapping['Mapped Columns'].values)]
data_18 = data_18[list(cols_mapping['Mapped Columns'].values)]
data_19 = data_19[list(cols_mapping['Mapped Columns'].values)]
data_20 = data_20[list(cols_mapping['Mapped Columns'].values)]

In [77]:
cols_mapping['Mapped Columns'].to_list()

['Year Reported to CDP',
 'Account Number',
 'Organization',
 'City',
 'Country',
 'CDP Region',
 'C40',
 'Reporting Authority',
 'Access',
 'City-wide emissions inventory',
 'Accounting year',
 'Accounting year start',
 'Accounting year end',
 'Administrative city boundary',
 'Inventory boundary (compared to Administrative city boundary)',
 'Primary Protocol',
 'Primary Protocol Comment',
 'Common Reporting Framework inventory format (GPC)',
 'Gases included',
 'Scopes Included ',
 'Scope 1 generation of grid supplied energy',
 'Scope 1 excluding generation of grid supplied energy',
 'Scope 2 generation of grid supplied energy',
 'Scope 2 excluding generation of grid supplied energy',
 'Scope 3 generation of grid supplied energy',
 'Scope 3 excluding generation of grid supplied energy',
 'Scope 1',
 'Scope 2 ',
 'Scope 3',
 'TOTAL BASIC Emissions (GPC)',
 'TOTAL BASIC+ Emissions (GPC)',
 'Total City-wide Emissions',
 'Comment',
 'Change in emissions',
 'Primary reason for the change i

In [75]:
pd.DataFrame({"data_18": data_18.columns,
            "data_19": data_19.columns,
            "data_20": data_20.columns})

Unnamed: 0,data_18,data_19,data_20
0,Year Reported to CDP,Year Reported to CDP,Year Reported to CDP
1,Account Number,Account Number,Account Number
2,Organization,Organization,Organization
3,City,City,City
4,Country,Country,Country
5,CDP Region,CDP Region,CDP Region
6,C40,C40,C40
7,Reporting Authority,Reporting Authority,Reporting Authority
8,Access,Access,Access
9,City-wide emissions inventory,City-wide emissions inventory,City-wide emissions inventory


In [79]:
datasets[2].head()

Unnamed: 0,Organization,Account Number,Country,City,C40,Year Reported to CDP,Accounting year,Primary Protocol,Primary Protocol Comment,Total City-wide Emissions,...,Land area (in square km),Population,Population Year,Average altitude (m),Average annual temperature (in Celsius),City GDP,GDP Currency,Year of GDP,GDP Source,Last update
0,Municipalidad de La Paz,50364,Bolivia,La Paz,,2014,2012,Other: Global Protocol for Community-Scale Gre...,"Norma Boliviana NB-ISO 14064:1 ""Gases de efect...",1440.5,...,,,,,,,,,,
1,Bogotá Distrito Capital,31154,Colombia,Bogotá,C40,2014,2013,2006 IPCC Guidelines for National Greenhouse G...,A partir de las proyecciones del inventario GE...,16077576.18,...,,,,,,,,,,
2,Taipei City Government,31446,Taiwan,Taipei,,2014,2012,Other: International Emissions Analysis Protoc...,GHG emissions counting in Taipei City contain ...,14416100.0,...,,,,,,,,,,
3,"City of London, ON",50558,Canada,"London, ON",,2014,2012,International Emissions Analysis Protocol (ICLEI),"Please see our report, 2012 Community Energy &...",2920000.0,...,,,,,,,,,,
4,Pretoria - Tshwane,49360,South Africa,Pretoria,,2014,2012,International Emissions Analysis Protocol (ICLEI),The methodology as defined in the Internationa...,11984729.0,...,,,,,,,,,,


In [80]:
data_14['a'] = 444

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [81]:
data_14['a']

0     444
1     444
2     444
3     444
4     444
     ... 
83    444
84    444
85    444
86    444
87    444
Name: a, Length: 88, dtype: int64
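
The assignment goes through, but the warning is probably raised because `data_14` was created by selecting columns from another DataFrame. A minimal sketch, on invented data, of taking an explicit copy so that later assignments are unambiguous:

```python
import pandas as pd

original = pd.DataFrame({"City": ["La Paz", "Bogotá"], "extra": [1, 2]})

# .copy() makes the selection an independent DataFrame, so adding a column
# to it is unambiguous and never touches `original`.
subset = original[["City"]].copy()
subset["a"] = 444
```

The same pattern could be applied to the column-reordering cell above, at the cost of holding a second copy of each report in memory.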

In [87]:
datasets[2]['b'] = 555

In [88]:
data_14.columns

Index(['Year Reported to CDP', 'Account Number', 'Organization', 'City',
       'Country', 'CDP Region', 'C40', 'Reporting Authority', 'Access',
       'City-wide emissions inventory', 'Accounting year',
       'Accounting year start', 'Accounting year end',
       'Administrative city boundary',
       'Inventory boundary (compared to Administrative city boundary)',
       'Primary Protocol', 'Primary Protocol Comment',
       'Common Reporting Framework inventory format (GPC)', 'Gases included',
       'Scopes Included ', 'Scope 1 generation of grid supplied energy',
       'Scope 1 excluding generation of grid supplied energy',
       'Scope 2 generation of grid supplied energy',
       'Scope 2 excluding generation of grid supplied energy',
       'Scope 3 generation of grid supplied energy',
       'Scope 3 excluding generation of grid supplied energy', 'Scope 1',
       'Scope 2 ', 'Scope 3', 'TOTAL BASIC Emissions (GPC)',
       'TOTAL BASIC+ Emissions (GPC)', 'Total City-wide E

In [90]:
data_20.head()

Unnamed: 0,Year Reported to CDP,Account Number,Organization,City,Country,CDP Region,C40,Reporting Authority,Access,City-wide emissions inventory,...,Population Year,City Location,Country Location,Average altitude (m),Average annual temperature (in Celsius),City GDP,GDP Currency,Year of GDP,GDP Source,Last update
0,2020,834289,Municipality of Rauch,,Argentina,Latin America,,,public,Yes,...,2014,,,,,,,,,2021-04-10T02:18:55.717
1,2020,50671,Município de Fafe,Fafe,Portugal,Europe,,,public,Not intending to undertake,...,2011,POINT (-8.17286 41.4508),,,,,,,,2021-04-10T02:18:55.717
2,2020,55334,Município de Braga,Braga,Portugal,Europe,,,public,Yes,...,2020,POINT (-8.43821 41.5337),,,,,,,,2021-04-10T02:18:55.717
3,2020,10894,City of Los Angeles,Los Angeles,United States of America,North America,,,public,Yes,...,2018,POINT (-118.244 34.0522),,,,,,,,2021-04-10T02:18:55.717
4,2020,840269,"Town of Whitby, ON",,Canada,North America,,,public,Yes,...,2018,,,,,,,,,2021-04-10T02:18:55.717


In [60]:
data_14[list(cols_mapping['Mapped Columns'].values)].columns == data_20[list(cols_mapping['Mapped Columns'].values)].columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

In [61]:
data_20 = data_20[list(cols_mapping['Mapped Columns'].values)]

In [62]:
data_20.head()

Unnamed: 0,Year Reported to CDP,Account Number,Organization,City,Country,CDP Region,C40,Reporting Authority,Access,City-wide emissions inventory,...,Population Year,City Location,Country Location,Average altitude (m),Average annual temperature (in Celsius),City GDP,GDP Currency,Year of GDP,GDP Source,Last update
0,2020,834289,Municipality of Rauch,,Argentina,Latin America,,,public,Yes,...,2014,,,,,,,,,2021-04-10T02:18:55.717
1,2020,50671,Município de Fafe,Fafe,Portugal,Europe,,,public,Not intending to undertake,...,2011,POINT (-8.17286 41.4508),,,,,,,,2021-04-10T02:18:55.717
2,2020,55334,Município de Braga,Braga,Portugal,Europe,,,public,Yes,...,2020,POINT (-8.43821 41.5337),,,,,,,,2021-04-10T02:18:55.717
3,2020,10894,City of Los Angeles,Los Angeles,United States of America,North America,,,public,Yes,...,2018,POINT (-118.244 34.0522),,,,,,,,2021-04-10T02:18:55.717
4,2020,840269,"Town of Whitby, ON",,Canada,North America,,,public,Yes,...,2018,,,,,,,,,2021-04-10T02:18:55.717


In [None]:
# Adding the missing column for 2018 and 2019 : "Administrative city boundary"
# Adding the missing column for 2020 : "Reporting Authority"

data_18.insert( 10, "Administrative city boundary", "not present in this year's report")
data_19.insert( 10, "Administrative city boundary", "not present in this year's report")

data_20.insert(6, "Reporting Authority", "not present in this year's report")

In [None]:
# check that columns are the same in all 3 dataframes
pd.DataFrame({"data_18": data_18.columns,
            "data_19": data_19.columns,
            "data_20": data_20.columns})

In [None]:
print(f"number of records in 2018 : {data_18.shape[0]}")
print(f"number of records in 2019 : {data_19.shape[0]}")
print(f"number of records in 2020 : {data_20.shape[0]}")

In [None]:
# we concatenate the dataframes and drop duplicates
data_18.columns = data_20.columns
data_19.columns = data_20.columns

data = pd.concat([data_18, data_19, data_20], ignore_index = False)
data = data.drop_duplicates()
data.reset_index(inplace = True)
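
Before dropping duplicates it can help to look at them. A small sketch on invented rows (only the column names come from the dataset), showing exact duplicates and repeated keys separately:

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Year Reported to CDP": [2019, 2019, 2020],
    "Account Number": [10894, 10894, 10894],
    "Organization": ["City of Los Angeles"] * 3,
})

# keep=False flags every member of a duplicated group, not just the repeats,
# which makes it easier to compare the rows side by side.
dupes = toy[toy.duplicated(keep=False)]

# Repeated keys: the same account appearing more than once in the same report year.
key = ["Year Reported to CDP", "Account Number"]
per_key = toy.groupby(key).size()
repeated_keys = per_key[per_key > 1]
```

Rows that share a key but differ in other columns would show up in `repeated_keys` and not in `dupes`, which is the case worth investigating.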

In [None]:
print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns")

# Missing values of emissions

In [None]:
# Mask to select rows without any emissions data
has_no_emissions = \
    data['Direct emissions (metric tonnes CO2e) for Total generation of grid-supplied energy'].isna() \
    & data['Direct emissions (metric tonnes CO2e) for Total emissions (excluding generation of grid-supplied energy)'].isna() \
    & data['Indirect emissions from use of grid supplied energy (metric tonnes CO2e) for Total generation of grid supplied energy'].isna() \
    & data['Indirect emissions from use of grid supplied energy (metric tonnes CO2e) for Total Emissions (excluding generation of grid-supplied energy)'].isna() \
    & data['Emissions occurring outside city boundary (metric tonnes CO2e) for Total Generation of grid supplied energy'].isna() \
    & data['Emissions occurring outside city boundary (metric tonnes CO2e) for Total Emissions (excluding generation of grid-supplied energy)'].isna() \
    & data['TOTAL Scope 1 Emissions (metric tonnes CO2e)'].isna() \
    & data['TOTAL Scope 2 emissions (metric tonnes CO2e)'].isna() \
    & data['TOTAL Scope 3 Emissions'].isna() \
    & data['TOTAL BASIC Emissions (GPC)'].isna() \
    & data['TOTAL BASIC+ Emissions (GPC)'].isna()

In [None]:
# Show the value counts of 'City-wide emissions inventory' for the full dataset / rows with emissions / rows without any emissions
# We cannot use this column to filter out rows without any emissions;
# instead we need to use the has_no_emissions mask created just before
pd.DataFrame({
    'full dataset' : data['City-wide emissions inventory'].value_counts(),
    'rows with emissions' : data[~ has_no_emissions]['City-wide emissions inventory'].value_counts(),
    'rows without any emissions' : data[has_no_emissions]['City-wide emissions inventory'].value_counts()
})

In [None]:
#data[has_no_emissions].to_excel('cities without emissions.xlsx')

In [None]:
# Ratio of cities without any emissions
print( f"Ratio of cities without any emissions : {data[has_no_emissions].City.shape[0] / data.City.shape[0]}")
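
The mask and the ratio can also be written more compactly. A sketch on invented values, using only two of the emission columns to keep it short:

```python
import numpy as np
import pandas as pd

# Two of the emission columns used above, with invented values.
emission_cols = [
    "TOTAL Scope 1 Emissions (metric tonnes CO2e)",
    "TOTAL BASIC Emissions (GPC)",
]
toy = pd.DataFrame({
    "City": ["A", "B", "C"],
    emission_cols[0]: [100.0, np.nan, np.nan],
    emission_cols[1]: [np.nan, np.nan, 50.0],
})

# True only where every listed emission column is missing.
has_no_emissions_toy = toy[emission_cols].isna().all(axis=1)

# The mean of a boolean mask is the share of True values.
ratio = has_no_emissions_toy.mean()
```

Keeping the column names in one list also makes it easy to add or remove an emission column later.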

# Spatial coverage

Summary of the analysis below:
- 98 countries from all continents (68 when removing rows without any emissions); the most represented regions are North/South America and Europe
- 723 cities (381 when removing rows without any emissions)
- 30% of city names are missing, but in most cases we should be able to infer the city name from 'Organization'
- Can we easily merge this dataset with other sources? The country/city names are clean, so it should be feasible if we link them to city/country codes

In [None]:
# Keeping only geo-related data so it's easier to display
geo_data = data.loc[:, ['Account Number', 'Organization', 'City', 'Country', 'CDP Region','Reporting Authority', 'Access',
                        'City-wide emissions inventory', 'Administrative city boundary', 'Inventory boundary (compared to Administrative city boundary)',
                       'Land area (in square km)', 'City Location']]
geo_data.head()
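
The 'City Location' column holds points such as `POINT (-8.17286 41.4508)` (see the 2020 preview above). A sketch for splitting them into numeric columns before plotting, assuming the first number is the longitude and the second the latitude, which matches the values shown for Braga and Los Angeles:

```python
import pandas as pd

# Two values taken from the 2020 preview, plus a missing one.
locations = pd.Series(["POINT (-8.17286 41.4508)", "POINT (-118.244 34.0522)", None])

# Named groups become the column names of the resulting DataFrame.
coords = locations.str.extract(
    r"POINT \((?P<longitude>-?\d+(?:\.\d+)?) (?P<latitude>-?\d+(?:\.\d+)?)\)"
).astype(float)

# Basic sanity check before plotting: coordinates must fall in valid ranges.
valid = coords["longitude"].between(-180, 180) & coords["latitude"].between(-90, 90)
```

Rows with a missing location end up as NaN and are flagged as not valid, so they are simply left off the map.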

In [None]:
# Number of records (=cities) per region
fig, ax = plt.subplots(figsize=(16, 6))
sns.countplot(x = 'CDP Region', data = geo_data)

In [None]:
# Cities do not have a unique account number
print("---- On the full dataset ----")
print(f"Number of countries : {geo_data.groupby(by = 'Country')['Country'].count().size}")
print(f"Number of cities : {geo_data.groupby(by = 'City')['City'].count().size}")
print(f"Number of account number : {geo_data.groupby(by = 'Account Number')['Account Number'].count().size}")
print("\n---- Only for rows with emissions ----")
print(f"Number of countries : {geo_data[~has_no_emissions].groupby(by = 'Country')['Country'].count().size}")
print(f"Number of cities : {geo_data[~has_no_emissions].groupby(by = 'City')['City'].count().size}")
print(f"Number of account number : {geo_data[~has_no_emissions].groupby(by = 'Account Number')['Account Number'].count().size}")

In [None]:
# How many times does each city appear across the reports?
# 76 cities appear 4 times, which is odd since there are only 3 reports
geo_data['City'].value_counts().value_counts()

In [None]:
# How many cities recorded per country? (null value of cities are included, the top 20 countries are shown)
print(geo_data.groupby(by = 'Country')['Country'].count().sort_values(ascending = False)[:20])

In [None]:
# Let's see an example
geo_data[geo_data['Country']=='France'];

## Can we infer city name from 'Organization' ?

In (I guess) all cases, the name of the city can be extracted from 'Organization'.

The question is: does the emission measurement concern only the city, or a broader area?

I checked some examples below, and in many cases the area covered by the emission measurement is the city itself (when `Administrative city boundary = City / Municipality` and `Inventory boundary = Same – covers entire city and nothing else`).
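
A rough sketch of the matching idea, assuming a reference list of city names is available. The list below is hypothetical and tiny; the organization names are taken from the 2020 preview above.

```python
import re
import pandas as pd

# Hypothetical reference list; a real one would hold city names per country.
reference_cities = ["Braga", "Fafe", "Los Angeles", "Whitby"]

# Longest names first so that a longer name wins over a shorter one it contains.
ordered = sorted(reference_cities, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b"

orgs = pd.DataFrame({
    "Organization": ["Município de Braga", "Town of Whitby, ON", "Municipality of Rauch"],
    "City": ["Braga", None, None],
})

# Fill 'City' only where it is missing and a reference name is found.
inferred = orgs["Organization"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
orgs["City"] = orgs["City"].fillna(inferred)
```

Organizations with no match in the reference list (Rauch here) keep a missing city, which gives a list of cases to handle manually.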

In [None]:
# Ratio of missing  values (%)
geo_data.isna().sum() / geo_data.shape[0] * 100

In [None]:
# Can we infer the name of cities from 'Organization' when 'City' is missing ?

geo_data[geo_data['City'].isna()].sample(10)

In [None]:
# checking for some 'Account Number' from the previous table if there is one record that contains the city name,
# but it is not the case
geo_data[
    geo_data['Account Number'] == 841491
]

In [None]:
# This column is only present in 2020's report
geo_data['Administrative city boundary'].value_counts()

# Temporal coverage

- a bit of engineering is required to split the start/end year into two separate columns
- most years are between 2014 and 2018, but they range from 1990 to 2021
- in almost every case the emissions are given over a one-year window

'Year Reported to CDP' and 'Last update' each take a single value within a given year's report (2020, for example)

In [None]:
data['Accounting year'].head()

In [None]:
# split 'Accounting year' in start/end date and cast to datetime format
data['Accounting year start'] = data['Accounting year'].str.split(' - ', n = 1, expand = True)[0]
data['Accounting year end'] = data['Accounting year'].str.split(' - ', n = 1, expand = True)[1]

data['Accounting year start'] = pd.to_datetime(data['Accounting year start'], errors = 'coerce')
data['Accounting year end'] = pd.to_datetime(data['Accounting year end'], errors = 'coerce')

In [None]:
# there are many missing values
# in most cases both start/end date are missing
data[['Accounting year start', 'Accounting year end']].isna().sum()

In [None]:
# they are null both at same time 
data[ data['Accounting year start'].isna() & data['Accounting year end'].isna() ].shape[0]

In [None]:
# distribution of 'Accounting year start'
fig, ax = plt.subplots(figsize = (12, 6))
sns.countplot(x = data['Accounting year start'].dt.year)

In [None]:
# For which period of time are emissions given ?
(data['Accounting year end'] - data['Accounting year start']).value_counts()
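
To check the "one-year window" claim more directly, the durations can be converted to days. A sketch on invented periods, assuming a `start - end` layout with ISO dates:

```python
import pandas as pd

# Invented periods in an assumed "start - end" layout.
toy = pd.DataFrame({"Accounting year": [
    "2017-01-01 - 2017-12-31",
    "2016-07-01 - 2017-06-30",
    "2015-01-01 - 2016-12-31",
    None,
]})

parts = toy["Accounting year"].str.split(" - ", n=1, expand=True)
start = pd.to_datetime(parts[0], errors="coerce")
end = pd.to_datetime(parts[1], errors="coerce")

# Length of each reporting window in days (NaN when a date is missing).
days = (end - start).dt.days

# From first to last day, a one-year window spans 364 days (365 in a leap year).
is_one_year = days.between(364, 365)
```

Rows with a missing start or end date are counted as not being a one-year window, so they should be excluded before computing a share.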

# Emissions analysis

Emissions span a very wide range of values, with many extreme values.
For each scope they are given as either:

- total emissions
- or split into including/excluding generation of grid-supplied energy

I decided to create a column 'Scope_X' for each scope to make the analysis easier.

I still need to check what BASIC/BASIC+ emissions are; there is some explanation [in this document](https://ghgprotocol.org/sites/default/files/standards_supporting/GPC_Executive_Summary_1.pdf) from the GPC.

In [None]:
data.iloc[:, 15:29];

In [None]:
# We create a column with the total emissions for each scope.
# It was verified that when we have 'TOTAL Scope 1 Emissions (metric tonnes CO2e)' we do not have the including/excluding grid supplied energy
# and vice-versa
# the rows with missing values are kept as missing values thanks to 'min_count=1'
data['Scope_1'] = data[
    ['Direct emissions (metric tonnes CO2e) for Total generation of grid-supplied energy',
    'Direct emissions (metric tonnes CO2e) for Total emissions (excluding generation of grid-supplied energy)',
     'TOTAL Scope 1 Emissions (metric tonnes CO2e)']
].sum(axis=1, min_count=1)

data['Scope_2'] = data[
    ['Indirect emissions from use of grid supplied energy (metric tonnes CO2e) for Total generation of grid supplied energy',
    'Indirect emissions from use of grid supplied energy (metric tonnes CO2e) for Total Emissions (excluding generation of grid-supplied energy)',
    'TOTAL Scope 2 emissions (metric tonnes CO2e)']
].sum(axis=1, min_count=1)

data['Scope_3'] = data[
    ['Emissions occurring outside city boundary (metric tonnes CO2e) for Total Generation of grid supplied energy',
    'Emissions occurring outside city boundary (metric tonnes CO2e) for Total Emissions (excluding generation of grid-supplied energy)',
    'TOTAL Scope 3 Emissions']
].sum(axis=1, min_count=1)
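
A small sketch of why `min_count=1` matters, on invented values. The short column names stand in for the long CDP headers; the last check looks for rows where both the split and the total are filled, which would be double counted by the sum.

```python
import numpy as np
import pandas as pd

# Invented values; short column names stand in for the long CDP headers.
toy = pd.DataFrame({
    "grid": [10.0, np.nan, np.nan],
    "excluding_grid": [5.0, np.nan, np.nan],
    "total": [np.nan, 20.0, np.nan],
})

# The default sum treats an all-NaN row as 0, which would look like zero emissions.
default_sum = toy.sum(axis=1)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
scope = toy.sum(axis=1, min_count=1)

# Rows where both the split and the total are filled would be double counted.
split_filled = toy[["grid", "excluding_grid"]].notna().any(axis=1)
double_counted = split_filled & toy["total"].notna()
```

Running the `double_counted` check on the real columns is a quick way to re-verify the "either total or split" assumption after new reports are added.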

In [None]:
# The data is heavily skewed: some values are so high that we cannot plot them on a histogram
data[['Scope_1', 'Scope_2', 'Scope_3']].describe(percentiles=[0.25, 0.5, 0.75])

In [None]:
data[['Scope_1', 'Scope_2', 'Scope_3']].skew(axis=0, skipna=True)

In [None]:
# Top values for scope 1
data[['Country', 'Organization', 'Administrative city boundary','Scope_1']].sort_values(
    by='Scope_1', ascending=False)[:15]

In [None]:
# Extreme values are filtered out so we can have a look at the distribution
fig, axes = plt.subplots(3, 1, figsize=(7, 7))
sns.histplot(x = data[data['Scope_1'] < 1e8].loc[:, 'Scope_1'], ax=axes[0])
sns.histplot(x = data[data['Scope_2'] < 1e8].loc[:, 'Scope_2'], ax=axes[1])
sns.histplot(x = data[data['Scope_3'] < 1e8].loc[:, 'Scope_3'], ax=axes[2])
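
An alternative to a hard cut-off is to look at the values on a logarithmic scale. A sketch with invented numbers, using only NumPy so the binning is explicit:

```python
import numpy as np
import pandas as pd

# Invented, heavily right-skewed values spanning several orders of magnitude.
scope_1 = pd.Series([1.2e3, 4.5e4, 3.0e5, 2.1e6, 9.8e9, np.nan, 0.0])

# log10 is undefined for 0 and NaN, so keep strictly positive values only.
positive = scope_1[scope_1 > 0]
log_scope_1 = np.log10(positive)

# Histogram counts per order of magnitude (bin edges 3, 4, ..., 10).
counts, edges = np.histogram(log_scope_1, bins=np.arange(3, 11))
```

For plotting, `sns.histplot` also accepts `log_scale=True`, which gives a similar view without transforming the column by hand.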

In [None]:
data['Scope_2'].isna().sum()

In [None]:
data['Scope_1_std'] = (data['Scope_1'] - data['Scope_1'].mean()) / data['Scope_1'].std()

In [None]:
data['Scope_1'].mean()

In [None]:
# each comparison must be parenthesized because & binds more tightly than <= and >=
sns.histplot(x=data[
    (data['Scope_1_std'] <= data['Scope_1_std'].mean() + 3*data['Scope_1_std'].std())
    & (data['Scope_1_std'] >= data['Scope_1_std'].mean() - 3*data['Scope_1_std'].std())
]['Scope_1_std'])
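
Since the standardized column has mean 0 and standard deviation 1 by construction, the filter reduces to an absolute-value check. A sketch on invented values with one extreme outlier:

```python
import pandas as pd

# Invented values with one extreme outlier.
values = pd.Series([1.0] * 20 + [1000.0])

# Standardize: subtract the mean and divide by the (sample) standard deviation.
z = (values - values.mean()) / values.std()

# z has mean 0 and standard deviation 1 by construction,
# so "within 3 standard deviations" reduces to |z| <= 3.
within_3_sigma = z.abs() <= 3
kept = values[within_3_sigma]
```

With very skewed data the mean and standard deviation are themselves pulled by the outliers, so this filter tends to remove only the most extreme points.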

In [None]:
data['Scope_1_std']

# Gases included

In [None]:
data['Gases Included'].value_counts()

In [None]:
data_16['Gases included'].value_counts()

In [None]:
data_17['Gases included'].value_counts()

In [None]:
data_18['Gases Included'].value_counts()
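
If each record lists several gases in one string, counting per gas is more informative than counting per combination. A sketch on invented entries; the `"; "` separator is an assumption about the raw format and needs to be checked against the actual values.

```python
import pandas as pd

# Invented entries; the separator ("; ") is an assumption about the raw format.
gases = pd.Series(["CO2; CH4; N2O", "CO2", "CO2; CH4", None])

# One row per (record, gas), then count how many records mention each gas.
per_gas = gases.dropna().str.split("; ").explode().str.strip()
gas_counts = per_gas.value_counts()
```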