# BW \#64 Coal Power

One way to slow climate change is to generate less of our energy from fossil fuels.

The G7 just announced a plan to dramatically reduce the use of coal for power plants by the year 2035. What countries use coal, and are newer plants cleaner than older ones?

We try to understand which countries are still running coal plants, what kind of coal and process they're using, and how much they emit.

## Data and seven questions
To download the data, you'll need to go to the "download data" link:

https://globalenergymonitor.org/projects/global-coal-plant-tracker/download-data/

## Challenges
The learning goals include reducing memory usage, grouping, pivot tables, and various types of plotting.

- Download the Excel spreadsheet, and load the "Units" sheet from that document into a data frame. We'll only want the following columns: "Country", "Capacity (MW)", "Status", "Start year", "Combustion technology", "Coal type", "Region", and "Annual CO2 (million tonnes / annum)".
- How much memory does the data frame take up? How much memory do you save by turning columns into categories? Which columns are most (and least) likely to save us memory in this way? Are there any columns that we could turn into categories, but shouldn't?
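The question of which columns benefit most from categories comes down to how many distinct values a column holds relative to its length. Here is a minimal sketch on synthetic data (the `demo` frame and its `region` and `unit_name` columns are made up for illustration and are not part of the coal tracker) that compares per-column memory before and after conversion:

```python
import pandas as pd

# Synthetic stand-in data (hypothetical, not the coal tracker):
# "region" repeats 5 values, "unit_name" is unique on every row.
n = 10_000
demo = pd.DataFrame({
    "region": ["Europe", "Asia", "Africa", "Americas", "Oceania"] * (n // 5),
    "unit_name": [f"unit-{i}" for i in range(n)],
})

# Per-column memory, before and after converting every column to category.
before = demo.memory_usage(deep=True).drop("Index")
after = demo.astype("category").memory_usage(deep=True).drop("Index")

comparison = pd.DataFrame({
    "unique_values": demo.nunique(),
    "before_bytes": before,
    "after_bytes": after,
})
comparison["saved_bytes"] = comparison["before_bytes"] - comparison["after_bytes"]
print(comparison)
```

A column with few distinct values shrinks a lot, because each row only stores a small integer code. A column where nearly every value is unique saves little or nothing, since the category still has to store every string once, plus the codes.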


In [1]:
import pandas as pd

In [29]:
filename = "C:\\Users\\npigeon\\Git\\BW #64 Coal power\\global-coal-plant-tracker-january-2024.xlsx"
df = pd.read_excel(filename, sheet_name='Units',
                  usecols=["Country", "Capacity (MW)", "Status", 
                           "Start year", "Combustion technology", 
                           "Coal type", "Region", 
                           "Annual CO2 (million tonnes / annum)"])

In [9]:
df

Unnamed: 0,Country,Capacity (MW),Status,Start year,Combustion technology,Coal type,Region,Annual CO2 (million tonnes / annum)
0,Albania,800.0,cancelled,,ultra-supercritical,bituminous,Europe,3.1
1,Argentina,120.0,operating,2022.0,subcritical,bituminous,Americas,0.6
2,Argentina,120.0,construction,2023.0,subcritical,bituminous,Americas,0.6
3,Argentina,375.0,operating,1983.0,subcritical,bituminous,Americas,2.0
4,Australia,160.0,retired,1969.0,subcritical,lignite,Oceania,1.0
...,...,...,...,...,...,...,...,...
13901,Zimbabwe,135.0,construction,,unknown,unknown,Africa,0.7
13902,Zimbabwe,135.0,construction,,unknown,unknown,Africa,0.7
13903,Zimbabwe,300.0,announced,2025.0,unknown,unknown,Africa,1.3
13904,Zimbabwe,300.0,announced,2025.0,unknown,unknown,Africa,1.3


#### How much memory does the data frame take up?
The data frame takes up 4,301,719 bytes, or about 4.3 MB.

In [13]:
df.memory_usage(deep=True).sum()

4301719

In [14]:
df.memory_usage(deep=True)

Index                                     132
Country                                778963
Capacity (MW)                          111248
Status                                 801876
Start year                             111248
Combustion technology                  839171
Coal type                              800189
Region                                 747644
Annual CO2 (million tonnes / annum)    111248
dtype: int64
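The `deep=True` argument is what makes the string columns above so much larger than the float columns. Without it, pandas counts only the 8-byte pointers in an object column, not the Python strings they point to. A small sketch, using a made-up `status` series rather than the real data:

```python
import pandas as pd

# Hypothetical sample values; dtype=object forces Python string objects,
# the same storage pandas reports as "object" for string columns.
status = pd.Series(["operating", "retired", "cancelled"] * 1_000, dtype=object)

shallow = status.memory_usage(index=False)          # pointers only
deep = status.memory_usage(index=False, deep=True)  # pointers + the strings themselves

print(f"shallow:\t{shallow:>10,}")
print(f"deep:\t{deep:>10,}")
```

The shallow figure is the same for any object column of a given length, which is why it tells us little about the real footprint of string data.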

#### How much memory do you save by turning columns into categories?

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13906 entries, 0 to 13905
Data columns (total 8 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   Country                              13906 non-null  category
 1   Capacity (MW)                        13906 non-null  float64 
 2   Status                               13906 non-null  object  
 3   Start year                           10281 non-null  float64 
 4   Combustion technology                13906 non-null  object  
 5   Coal type                            13906 non-null  object  
 6   Region                               13906 non-null  object  
 7   Annual CO2 (million tonnes / annum)  13906 non-null  float64 
dtypes: category(1), float64(3), object(4)
memory usage: 779.1+ KB


In [24]:
df = df.apply(lambda col: col.astype('category'))
df.memory_usage(deep=True).sum()

186120

In [25]:
4301719 - 186120

4115599

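That looks like a saving of 4,115,599 bytes (about 4.1 MB). However, the `apply` call above converts all eight columns to categories, including the three float64 columns, which we want to keep numeric.

**Correction**

Converting only the non-numeric columns gives an after-conversion memory usage of 416,504 bytes.

So we actually saved 3,885,215 bytes (about 3.9 MB).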

In [30]:
for one_column in df.select_dtypes(exclude='float64').columns:
    df[one_column] = df[one_column].astype('category')

after_memory_usage = df.memory_usage(deep=True).sum()
print(f'After:\t{after_memory_usage:>12,}')


After:	     416,504


In [31]:
4301719- 416504

3885215
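Rather than copying the totals by hand into a subtraction cell, the before and after figures can be kept in variables and compared directly. The following is a sketch only: `categorize_non_numeric` is a hypothetical helper name, and `sample` is a tiny made-up frame standing in for the real one.

```python
import pandas as pd

def categorize_non_numeric(frame):
    """Return a copy with every non-numeric column converted to category."""
    result = frame.copy()
    for name in result.select_dtypes(exclude="number").columns:
        result[name] = result[name].astype("category")
    return result

# Made-up stand-in data with one string column and one float column.
sample = pd.DataFrame({
    "Status": ["operating", "retired", "cancelled", "operating"] * 500,
    "Capacity (MW)": [120.0, 375.0, 800.0, 160.0] * 500,
})

before_memory_usage = sample.memory_usage(deep=True).sum()
slim = categorize_non_numeric(sample)
after_memory_usage = slim.memory_usage(deep=True).sum()
saved = before_memory_usage - after_memory_usage

print(f"Before:\t{before_memory_usage:>12,}")
print(f"After:\t{after_memory_usage:>12,}")
print(f"Saved:\t{saved:>12,} ({saved / before_memory_usage:.1%})")
```

Using `exclude="number"` leaves the float columns alone, so they can still be summed, averaged, and plotted as numbers.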

By the way, I decided to try specifying the use of PyArrow for the dtype backend, rather than NumPy, to see what sort of memory savings we might enjoy:

In [32]:
dfa = pd.read_excel(filename, sheet_name='Units',
                  usecols=["Country", "Capacity (MW)", "Status", 
                           "Start year", "Combustion technology", 
                           "Coal type", "Region", 
                           "Annual CO2 (million tonnes / annum)"],
                   dtype_backend='pyarrow')


I then calculated the memory usage. Because PyArrow stores the values in Arrow memory rather than as Python objects, there’s no need to pass “deep=True”. Let’s see how much we save:



In [34]:
arrow_memory_usage = dfa.memory_usage().sum()
print(f'Arrow:\t{(arrow_memory_usage):>12,}')


Arrow:	   1,170,873


In [35]:
4301719 -   1170873

3130846

We saved 3,130,846 bytes (about 3.1 MB) with Arrow. So Arrow still saved us a lot of memory, and we didn’t have to use categories. However, it didn’t save as much as categories did (3,885,215 bytes).
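To put the two approaches side by side, the byte totals reported above can be turned into percentages of the original size. The numbers below are copied from the notebook output; the variable names are only for illustration.

```python
# Totals in bytes, taken from the memory_usage() results above.
original_bytes = 4_301_719   # NumPy backend, string columns as objects
category_bytes = 416_504     # non-numeric columns converted to category
arrow_bytes = 1_170_873      # dtype_backend='pyarrow', no categories

category_pct = (original_bytes - category_bytes) / original_bytes * 100
arrow_pct = (original_bytes - arrow_bytes) / original_bytes * 100

print(f"Categories:\t{category_pct:.1f}% saved")
print(f"Arrow:\t\t{arrow_pct:.1f}% saved")
```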