# Operationalising the Calculation

Now that you understand the calculation and have worked through a toy case, we can operationalise it.

In [2]:
import pandas as pd

## Finding New Production Data

Recall from the previous Start Here notebook that the New Production of a device dominates the calculation.

If we can create a database that aggregates new production data, we will have data for the most important part of the calculation.

Many Original Equipment Manufacturers (OEMs) use MIT's Product Attribute to Impact Algorithm (PAIA) to calculate the emissions at all stages of their products' lifecycles, and share the data on their websites.

These files provide high-quality data for the following four components of the calculation:
* PRODUCTION = Environmental impact from the avoided new production;
* TRANSPORTp = Environmental impact from the avoided transport from production;
* WASTE = Environmental impact from the avoided waste handling (of the product that was not produced);
* TRANSPORTw = Environmental impact from the avoided transport to waste handling.

**However**, these files are often difficult to find and are published as PDFs.

This makes the data very difficult to analyse.

## Boavizta's Database

[Boavizta](https://boavizta.org/en) is a group of French data professionals who work to help organisations measure the carbon effects of their IT equipment.

One of Boavizta's projects aggregates the data from the files that various OEMs have published using MIT's algorithm.

You can find a dashboard for the project [here](https://dataviz.boavizta.org/). 

You will want the latest data, which you can export as a CSV from the site.

In the accompanying [GitHub repository](https://github.com/Boavizta/environmental-footprint-data), you can see the process through which they have extracted the data.

In summary, they:
* Collected the files from the OEMs in the GitHub repository.
* Developed Python files, one for each brand, which parse the PDFs with Natural Language Processing.
* Added the extracted data to a single database.



This database, with some transformation, will underpin the calculation.

Let's take a look at the database:

In [8]:
df = pd.read_csv('CSVs/boavizta-data/boavizta-data-us.csv')
df.sample(5)

| Unnamed: 0 | manufacturer | name | category | subcategory | gwp_total | gwp_use_ratio | yearly_tec | lifetime | use_location | report_date | ... | height | added_date | add_method | gwp_transport_ratio | gwp_eol_ratio | gwp_ssd_ratio | gwp_mainboard_ratio | gwp_daughterboard_ratio | gwp_enclosure_ratio | comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 872 | Lenovo | Legion C730 Cube | Workplace | Desktop | 1026.0 | 0.73 | | 6.0 | WW | April 2018 | ... | | 01-11-2020 | Initial Parsing | | | | | | | |
| 819 | Lenovo | ideapad 310-15ISK | Workplace | Laptop | 339.0 | 0.26 | | 5.0 | US | March 2016 | ... | | 01-11-2020 | Initial Parsing | | | | | | | |
| 322 | Dell | OptiPlex 7770 All-in-One Desktop | Workplace | Desktop | 514.0 | 0.368 | 90.26 | 4.0 | EU | July 2019 | ... | | 2022-09-14 | Dell Auto Parser | 0.161 | 0.01 | | | | | |
| 464 | Dell | Precision 7750 | Workplace | Desktop | 495.0 | 0.126 | 27.59 | 4.0 | EU | May 2020 | ... | | 2022-09-08 | Dell Auto Parser | | | | | | | |
| 462 | Dell | Precision 7730 | Workplace | Desktop | 476.0 | 0.189 | 25.11 | 4.0 | EU | July 2019 | ... | | 2022-09-08 | Dell Auto Parser | | | | | | | |


### Column information

The manufacturer, name and subcategory columns should be intuitive.
The category column distinguishes between workplace devices and those used in datacenters.

The others that are relevant to our purposes include:
* gwp_total: GHG emissions (estimated as CO2 equivalent, in kgCO2eq) over the total lifecycle of the product (manufacturing, transportation, use phase and recycling)
* gwp_use_ratio: the proportion of the GHG emissions coming from the use phase (the assumptions for this use phase are detailed in the other columns, especially lifetime and use_location)
* report_date: the date on which the Product Carbon Footprint report of the device was published
* gwp_manufacturing_ratio: the proportion of the GHG emissions coming from the manufacturing phase
* weight: product weight in kg
* assembly_location: the region of the world in which the device is assembled
    * US: United States of America
    * EU: Europe
    * CN: China
    * Asia: Asia
* screen_size: in inches
* server_type: the type of server
* hard_drive: the hard drive of the device if any
* memory: RAM in GB
* number_cpu: number of CPUs
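
Because the ratio columns are described as proportions of gwp_total, absolute emissions for a phase can be recovered by multiplying the two. The following is a minimal sketch of that arithmetic on a small hand-made frame rather than the real CSV. The device names and ratio values are invented for illustration, and the derived column names `gwp_use` and `gwp_manufacturing` are assumptions, not part of the Boavizta schema.

```python
import pandas as pd

# Illustrative rows only: these values are made up, not taken from the database.
sample = pd.DataFrame(
    {
        "name": ["device_a", "device_b"],
        "gwp_total": [500.0, 300.0],              # kgCO2eq over the whole lifecycle
        "gwp_use_ratio": [0.20, 0.25],
        "gwp_manufacturing_ratio": [0.70, None],  # a missing ratio becomes NaN
    }
)

# Hypothetical derived columns: absolute emissions per phase, in kgCO2eq.
sample["gwp_use"] = sample["gwp_total"] * sample["gwp_use_ratio"]
sample["gwp_manufacturing"] = sample["gwp_total"] * sample["gwp_manufacturing_ratio"]

print(sample[["name", "gwp_use", "gwp_manufacturing"]])
```

Rows with a missing ratio produce a missing result rather than zero, which matters here because many entries in the sample above have empty ratio columns.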

## Transforming the Data

We have over a thousand entries, but the data is ungrouped.
If we have a device that doesn't exactly match a model name, the data won't be much use.

We need to aggregate the data across device types.
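
One possible way to do this is to group the rows by subcategory and summarise gwp_total for each group. The sketch below uses only the five rows sampled earlier, not the full CSV. The choice of count and median as summary statistics is an assumption for illustration, not necessarily the transformation this notebook goes on to use.

```python
import pandas as pd

# The five rows from the sample shown earlier, reduced to the columns needed here.
rows = pd.DataFrame(
    {
        "manufacturer": ["Lenovo", "Lenovo", "Dell", "Dell", "Dell"],
        "subcategory": ["Desktop", "Laptop", "Desktop", "Desktop", "Desktop"],
        "gwp_total": [1026.0, 339.0, 514.0, 495.0, 476.0],
    }
)

# Group by device type and summarise lifecycle emissions (kgCO2eq).
by_type = rows.groupby("subcategory")["gwp_total"].agg(["count", "median"])
print(by_type)
```

A median is used in this sketch because a single large value, such as the 1026.0 desktop, would pull a mean upwards.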