# ✏️ Outline / Pitch

This year, China faced its worst power shortage in a decade, with effects that rippled across factories and households. Power rationing for factories and businesses came into effect, and some provinces ordered factories to halt production for a few days each week. Citizens reacted with a mix of fear and frustration, as power cuts were often made with no warning. The shortage not only affected China’s domestic social order but also raised international concerns.

Why has China been hit by power shortages? Two interconnected explanations are commonly offered. First, the country has struggled to balance electricity supply with demand, which surged during the pandemic. Second, with Beijing aiming to make the country carbon neutral by 2060, China is trying to slow coal production, even though coal accounts for the majority of China’s electricity generation. Despite much discussion of these two arguments, little data has been cited as evidence. This notebook is an exploratory attempt to fill that gap.

In this notebook, I draw on data from China’s official websites and Wikipedia and examine these two explanations through four charts. The notebook is divided into three parts.

In the first part, I give an overview of China’s electricity generation, consumption, and power sources in 2021, using the nine most recent monthly reports released by the China Electricity Council (CEC). (Since the reports were in PDF format, I had to adjust them manually.) The “Chart: Electricity Deficit in China (2021)” [cell 11] shows an electricity deficit (consumption minus generation) in each of the nine reported months, most severe in May and August. Comparing this chart with the “Chart: Electricity Generation by Source (2021)” [cell 12], we find that generation in August is relatively high (leaving aside the Lunar New Year period). Even with this higher generation, however, the deficit remains large, which seems to foreshadow the electricity shortage in September this year.

In the second and third parts, I analyze how the supply-demand relationship and the generation mix changed over time. “Electricity Surplus in China (generation-consumption) (2001-2019)” [cell 24] shows that the annual statistics point in the opposite direction from the monthly statistics above: an electricity surplus that grows over time. Why is that? I think it is necessary to look at how each figure is calculated and to ask experts for insight.

Finally, I drew the “Chart: Electricity Production By Source (2008-2019)” [cell 36]. It shows that generation from fossil fuels, of which coal is the major part, has trended upward over the period. At the same time, hydro generation is also rising. This reflects Beijing’s effort to develop renewable resources, but more comparative analysis is needed to evaluate how far that effort has worked.
    

---

In [1]:
%load_ext lab_black
import pandas as pd
import altair as alt
import numpy as np

# 1. Electricity Statistics of China in 2021 (Monthly Data)

### *Note:
- Task description: Build an overview of China's 2021 electricity statistics from the 9 monthly reports released by the China Electricity Council (CEC)
- Preparation before coding: I downloaded the monthly statistics PDFs from the CEC website, converted them to Excel with Adobe Acrobat DC, and then double-checked the data for accuracy
- Reference: China Electricity Council (CEC): https://english.cec.org.cn/menu/index.html?251

### clean data

In [2]:
df_1 = pd.read_csv("./monthly.csv")

In [3]:
df_1.columns = df_1.columns.str.strip().str.lower()

In [4]:
df_1["electricity generation twh"] = (
    df_1["electricity generation twh"]
    .str.replace("-- Hydro", "Hydro", regex=False)
    .str.replace("-- Thermal", "Thermal", regex=False)
    .str.replace("-- Nuclear", "Nuclear", regex=False)
)

In [5]:
df_1

Unnamed: 0,electricity generation twh,unit,value,yoy(%),accumulation,yoy(%).1,month
0,Electricity Generation,TWh,1242.8,19.5,,,February
1,Hydro,TWh,129.2,8.5,,,February
2,Thermal,TWh,939.0,18.4,,,February
3,Nuclear,TWh,58.4,23.4,,,February
4,Total Electricity Consumption,TWh,1258.8,22.2,,,February
...,...,...,...,...,...,...,...
139,Hydro,GW,,,384.52,5.1,October
140,Thermal,GW,,,1282.89,3.6,October
141,Nuclear,GW,,,53.26,6.8,October
142,-- Wind,GW,,,299.63,30.4,October


In [6]:
df_1.keys()

Index(['electricity generation twh', 'unit', 'value', 'yoy(%)', 'accumulation',
       'yoy(%).1', 'month'],
      dtype='object')

In [7]:
# rename columns
df_1.columns = ["ge_m"] + [_ for _ in df_1.columns[1:]]
df_1.head(22)

Unnamed: 0,ge_m,unit,value,yoy(%),accumulation,yoy(%).1,month
0,Electricity Generation,TWh,1242.8,19.5,,,February
1,Hydro,TWh,129.2,8.5,,,February
2,Thermal,TWh,939.0,18.4,,,February
3,Nuclear,TWh,58.4,23.4,,,February
4,Total Electricity Consumption,TWh,1258.8,22.2,,,February
5,-- Primary Industry,TWh,14.2,26.5,,,February
6,-- Secondary Industry,TWh,801.2,25.8,,,February
7,Including: Industrial,TWh,784.3,25.7,,,February
8,-- Tertiary Industry,TWh,231.3,22.5,,,February
9,-- Residential,TWh,212.1,10.0,,,February


In [8]:
# change data types
df_1["value"] = df_1["value"].astype("float")

### 🌟 extract the target data (with a "while" loop) / calculate discrepancy

In [9]:
import datetime

# 1. Build a DataFrame for the electricity discrepancy.
# Here the discrepancy means "consumption - generation", so a positive value is a deficit.
df_1_gc_dict = {"month": [], "value": []}
index = 0
# The loop assumes each month occupies a block of 16 rows, with generation in the
# block's first row and total consumption four rows below it.
while index < len(df_1):
    df_1_gc_dict["month"].append(datetime.datetime.strptime(df_1["month"][index], "%B"))
    # new value = original consumption - original generation
    df_1_gc_dict["value"].append(df_1["value"][index + 4] - df_1["value"][index])
    index += 16
df_1_gc = pd.DataFrame.from_dict(df_1_gc_dict)
df_1_gc

Unnamed: 0,month,value
0,1900-02-01,16.0
1,1900-03-01,5.2
2,1900-04-01,13.1
3,1900-05-01,24.6
4,1900-06-01,17.3
5,1900-07-01,17.2
6,1900-08-01,22.4
7,1900-09-01,19.6
8,1900-10-01,20.9
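
The `1900` in the `month` column above comes from `datetime.datetime.strptime`: when the format string contains only `%B`, the missing year defaults to 1900 and the day to 1. Since all nine reports are from 2021, one option is to pin the year explicitly. The helper name `month_to_2021` is hypothetical and not part of the notebook; this is a minimal sketch, assuming English month names as in the `month` column.

```python
import datetime


def month_to_2021(month_name: str) -> datetime.datetime:
    """Parse an English month name and pin it to the year 2021 (hypothetical helper)."""
    parsed = datetime.datetime.strptime(month_name, "%B")  # year defaults to 1900
    return parsed.replace(year=2021)


print(datetime.datetime.strptime("February", "%B"))
print(month_to_2021("February"))
```

With the year set, a temporal axis in a chart would be labelled 2021 rather than 1900.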


In [10]:
# 2. Build DataFrame for Different Types of Energy
df_1_type_dict = {"month": [], "type": [], "value": []}
index = 1
while index < len(df_1):
    df_1_type_dict["month"].append(
        datetime.datetime.strptime(df_1["month"][index], "%B")
    )
    df_1_type_dict["type"].append("Hydro")
    df_1_type_dict["value"].append(df_1["value"][index])
    df_1_type_dict["month"].append(
        datetime.datetime.strptime(df_1["month"][index], "%B")
    )
    df_1_type_dict["type"].append("Thermal")
    df_1_type_dict["value"].append(df_1["value"][index + 1])
    df_1_type_dict["month"].append(
        datetime.datetime.strptime(df_1["month"][index], "%B")
    )
    df_1_type_dict["type"].append("Nuclear")
    df_1_type_dict["value"].append(df_1["value"][index + 2])
    index += 16
df_1_type = pd.DataFrame.from_dict(df_1_type_dict)
df_1_type

Unnamed: 0,month,type,value
0,1900-02-01,Hydro,129.2
1,1900-02-01,Thermal,939.0
2,1900-02-01,Nuclear,58.4
3,1900-03-01,Hydro,67.0
4,1900-03-01,Thermal,495.3
5,1900-03-01,Nuclear,34.2
6,1900-04-01,Hydro,77.6
7,1900-04-01,Thermal,451.7
8,1900-04-01,Nuclear,32.4
9,1900-05-01,Hydro,95.6
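
The loop above relies on a fixed stride of 16 rows per month. A label-based filter is a possible alternative that does not depend on row positions. The sketch below is not the notebook's method; it uses a toy frame (`toy`) built only from rows printed in the outputs above, and filters on `unit == "TWh"` because labels such as `Hydro` also appear with unit `GW`.

```python
import pandas as pd

# Toy rows copied from the outputs above; the real df_1 has many more rows.
toy = pd.DataFrame(
    {
        "ge_m": [
            "Electricity Generation", "Hydro", "Thermal", "Nuclear",
            "Total Electricity Consumption", "Hydro", "Thermal", "Nuclear", "Hydro",
        ],
        "unit": ["TWh"] * 8 + ["GW"],
        "value": [1242.8, 129.2, 939.0, 58.4, 1258.8, 67.0, 495.3, 34.2, None],
        "month": ["February"] * 5 + ["March"] * 3 + ["October"],
    }
)

sources = ["Hydro", "Thermal", "Nuclear"]
# Keep only generation rows (TWh) for the three sources of interest
mask = toy["ge_m"].isin(sources) & (toy["unit"] == "TWh")
long_form = (
    toy.loc[mask, ["month", "ge_m", "value"]]
    .rename(columns={"ge_m": "type"})
    .reset_index(drop=True)
)
print(long_form)
```

The result has the same `month` / `type` / `value` shape as `df_1_type`, so the area chart code would not need to change.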


### Chart: Electricity Deficit in China (2021)

In [11]:
base_figure = alt.Chart(df_1_gc).encode(x="month:T")
alt.layer(base_figure.mark_line(opacity=0.3, color="blue").encode(y="value:Q"))

### Chart: Electricity Generation by Source (2021)

In [12]:
alt.Chart(df_1_type).mark_area().encode(x="month:T", y=alt.Y("value:Q"), color="type:N")

___

# 2. Electricity Supply-Demand Balance (Annual Data)

### *Note
- Task description: Using China's national data, I look into the trend of China's electricity supply-demand balance over the past 20 years, laying a foundation for further exploration of the 2021 electricity shortage
- Preparation before coding: Simply download the data from China's official statistics database (see the link below)
- Reference:
National Bureau of Statistics of China: https://data.stats.gov.cn/english/easyquery.htm?cn=C01
BP Statistics (double check): https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html


### read data

In [13]:
df = pd.read_csv("./annual.csv")
df.head()

Unnamed: 0,Database???Annual,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,Year£ºLATEST20,,,,,,,,,,...,,,,,,,,,,
1,Indicators,2020.0,2019.0,2018.0,2017.0,2016.0,2015.0,2014.0,2013.0,2012.0,...,2010.0,2009.0,2008.0,2007.0,2006.0,2005.0,2004.0,2003.0,2002.0,2001.0
2,Total Electricity Available for Consumption(10...,,74866.3,71509.2,65914.0,61204.4,58021.3,57830.5,54204.1,49767.7,...,41936.5,37032.7,34540.8,32712.4,28588.4,24940.8,21972.3,19032.2,16466.0,14724.1
3,Outputof Electricity(100 million kw.h),,75034.3,71661.3,66044.5,61331.6,58145.7,57944.6,54316.4,49875.5,...,42071.6,37146.5,34668.8,32815.5,28657.3,25002.6,22033.1,19105.8,16540.0,14808.0
4,Production of Hydro Power Electricity(100 mill...,,13044.4,12317.9,11978.7,11840.5,11302.7,10728.8,9202.9,8721.1,...,7221.7,6156.4,5851.9,4852.6,4357.9,3970.2,3535.4,2836.8,2879.7,2774.3


### 🌟reorganize table

In [14]:
# transpose rows and columns
df_t = df.T
df_t.head(22)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
Database???Annual,Year£ºLATEST20,Indicators,Total Electricity Available for Consumption(10...,Outputof Electricity(100 million kw.h),Production of Hydro Power Electricity(100 mill...,Production of Thermal Power Electricity(100 mi...,Production of Nuclear Power Electricity(100 mi...,Production of Wind Power Electricity(100 milli...,Imports of Electricity(100 million kw.h),Exports of Electricity(100 million kw.h),...,Total Electricity Consumption Industry(100 mil...,Total Electricity Consumption Construction(100...,Total Electricity Consumption Transport Storag...,Total Electricity Consumption Wholesale and Re...,Total Electricity Consumption Others(100 milli...,Total Electricity Consumption Residential(100 ...,Total Electricity Consumption End-use(100 mill...,Total Electricity Consumption Industry End-use...,Losses in Transmission(100 million kw.h),Data Sources£ºNational Bureau of Statistics
Unnamed: 1,,2020.0,,,,,,,,,...,,,,,,,,,,
Unnamed: 2,,2019.0,74866.3,75034.3,13044.4,52201.5,3483.5,4060.3,48.6,216.5,...,50698.3,991.2,1752.3,3187.1,6263.8,10637.2,71536.0,47368.2,3330.1,
Unnamed: 3,,2018.0,71509.2,71661.3,12317.9,50963.2,2943.6,3659.7,56.9,209.1,...,49094.9,887.8,1608.5,2900.4,5716.5,10057.6,68156.5,45743.2,3351.7,
Unnamed: 4,,2017.0,65914.0,66044.5,11978.7,47546.0,2480.7,2972.3,64.2,194.7,...,46052.8,789.2,1418.0,2526.6,4880.6,9071.6,62718.1,42857.0,3195.8,
Unnamed: 5,,2016.0,61204.4,61331.6,11840.5,44370.7,2132.9,2370.7,61.9,189.1,...,42996.9,725.6,1251.5,2323.8,4394.8,8420.6,58142.2,39934.0,3062.9,
Unnamed: 6,,2015.0,58021.3,58145.7,11302.7,42841.9,1707.9,1857.7,62.1,186.5,...,41550.0,698.7,1125.6,2122.0,3918.6,7565.2,55032.1,38562.1,2987.9,
Unnamed: 7,,2014.0,57830.5,57944.6,10728.8,44001.1,1325.4,1599.8,67.5,181.6,...,42248.7,721.7,1059.2,1995.6,3615.0,7176.1,54729.8,39148.8,3099.9,
Unnamed: 8,,2013.0,54204.1,54316.4,9202.9,42470.1,1116.1,1412.0,74.4,186.7,...,39236.9,675.1,1000.9,1876.9,3397.6,6989.2,51062.7,36096.2,3140.7,
Unnamed: 9,,2012.0,49767.7,49875.5,8721.1,38928.1,973.9,959.8,68.7,176.5,...,36232.2,608.4,915.4,1691.5,3083.6,6219.0,46866.5,33336.1,2896.2,


In [15]:
# use the first row as the column headers
df_new = pd.DataFrame(df_t.values[1:], columns=df_t.values[0])

In [16]:
# drop the empty row (2020 has no data yet)
df_new = df_new.drop([0])
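
The transpose-and-relabel steps above could also be folded into the read itself. The sketch below is one possible approach, not the notebook's method: it skips the two title lines and the footer while reading, so the year row becomes the header directly. The `raw` text is a hypothetical miniature of the file layout; the exact wording of the title and footer lines is an assumption (the notebook output shows garbled characters there), and the numbers are the 2008-2010 values printed in this notebook.

```python
import io

import pandas as pd

# Hypothetical miniature of annual.csv: two title lines, a header row, data, a footer
raw = "\n".join(
    [
        "Database:Annual",
        "Year:LATEST20",
        "Indicators,2010,2009,2008",
        "Outputof Electricity(100 million kw.h),42071.6,37146.5,34668.8",
        "Total Electricity Consumption(100 million kw.h),41934.5,37032.2,34541.4",
        "Data Sources:National Bureau of Statistics",
    ]
)

tidy = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,        # skip the two title lines
    skipfooter=1,      # skip the data-source footer
    index_col=0,       # indicator names become the row index
    engine="python",   # skipfooter requires the python engine
).T                    # years become rows, indicators become columns

tidy.index = tidy.index.astype(int)
tidy.index.name = "Year"
tidy = tidy.sort_index()
print(tidy)
```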

### extract the target data

In [17]:
# extract the target data
df_new = df_new.loc[
    :,
    [
        "Indicators",
        "Outputof Electricity(100 million kw.h)",
        "Total Electricity Consumption(100 million kw.h)",
    ],
]

In [18]:
# rename columns
df_new.columns = ["Year", "Production", "Consumption"]

In [19]:
# change data types
# parse the year with an explicit format; casting to "datetime64[Y]" can fail on newer pandas
df_new["Year"] = pd.to_datetime(
    df_new["Year"].astype("int").astype("str"), format="%Y"
)
df_new["Production"] = df_new["Production"].astype("float")
df_new["Consumption"] = df_new["Consumption"].astype("float")

In [20]:
# reorder years
df_new = pd.DataFrame(df_new.sort_values(by="Year", ascending=True))
df_new

Unnamed: 0,Year,Production,Consumption
19,2001-01-01,14808.0,14723.5
18,2002-01-01,16540.0,16465.5
17,2003-01-01,19105.8,19031.6
16,2004-01-01,22033.1,21971.4
15,2005-01-01,25002.6,24940.3
14,2006-01-01,28657.3,28588.0
13,2007-01-01,32815.5,32711.8
12,2008-01-01,34668.8,34541.4
11,2009-01-01,37146.5,37032.2
10,2010-01-01,42071.6,41934.5


In [21]:
# Finally good to go！！ 😭

### Chart: Electricity Supply-Demand Relationship (2001-2019)

In [22]:
# Use Altair to draw a layered area chart of "Outputof Electricity(100 million kw.h)" & "Total Electricity Consumption(100 million kw.h)"

#### First (not ideal): Electricity generation & consumption (2001-2019)

In [23]:
base_figure = alt.Chart(df_new).encode(x="Year")
alt.layer(
    base_figure.mark_area(opacity=0.3, color="blue").encode(y="Production:Q"),
    base_figure.mark_area(opacity=0.3, color="red").encode(y="Consumption:Q"),
)
## blue + red = purple... the two series are too similar to tell apart

#### Second: Electricity Surplus in China (generation-consumption) (2001-2019)

In [24]:
# Calculate the Discrepancy so that I can draw the difference
df_new["P-C"] = df_new["Production"] - df_new["Consumption"]
## Second Drawing
base_figure = alt.Chart(df_new).encode(x="Year")
alt.layer(base_figure.mark_area(opacity=0.3, color="green").encode(y="P-C:Q"))
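
One possible reading of the annual surplus, offered as a hypothesis rather than a confirmed definition: in the NBS table printed in cell 14, the 2019 column suggests that "Total Electricity Available for Consumption" is roughly output plus imports minus exports. If annual consumption closely tracks that available total, then `Production - Consumption` would mostly reflect net exports rather than spare supply. The sketch below checks only the arithmetic, using the 2019 figures shown above; the variable names are illustrative.

```python
# 2019 figures from the transposed NBS table above (unit: 100 million kWh)
output_2019 = 75034.3
imports_2019 = 48.6
exports_2019 = 216.5
available_2019 = 74866.3  # "Total Electricity Available for Consumption"

# Hypothesised identity: available = output + imports - exports
implied_available = output_2019 + imports_2019 - exports_2019
net_exports = exports_2019 - imports_2019

print("implied available:", round(implied_available, 1))
print("output - available:", round(output_2019 - available_2019, 1))
print("net exports:", round(net_exports, 1))
```

If the identity holds across years, the monthly and annual series may simply be measuring different things, which is worth confirming against the CEC and NBS definitions.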

# 3. Power Resources

### web scraping

In [25]:
url = "https://en.wikipedia.org/wiki/Electricity_sector_in_China#cite_note-:5-26"

In [26]:
page = pd.read_html(url, attrs={"class": "wikitable sortable"})

In [27]:
df_3 = pd.DataFrame(page[0])

In [28]:
df_3.columns

MultiIndex([(           'Year',            'Year'),
            (          'Total',           'Total'),
            (         'Fossil',            'Coal'),
            (         'Fossil',             'Oil'),
            (         'Fossil',             'Gas'),
            (        'Nuclear',         'Nuclear'),
            (      'Renewable',           'Hydro'),
            (      'Renewable',            'Wind'),
            (      'Renewable',        'Solar PV'),
            (      'Renewable',        'Biofuels'),
            (      'Renewable',           'Waste'),
            (      'Renewable',   'Solar thermal'),
            (      'Renewable',     'Geo-thermal'),
            (      'Renewable',            'Tide'),
            ('Total renewable', 'Total renewable'),
            (    '% renewable',     '% renewable')],
           )

### 🌟 deal with multiIndex issue

In [29]:
# to proceed, flatten the MultiIndex columns into a single level (keeps only the top-level names)
lst = df_3.values
df_3 = pd.DataFrame(lst, columns=[_[0] for _ in df_3.keys()])
df_3.head()

Unnamed: 0,Year,Total,Fossil,Fossil.1,Fossil.2,Nuclear,Renewable,Renewable.1,Renewable.2,Renewable.3,Renewable.4,Renewable.5,Renewable.6,Renewable.7,Total renewable,% renewable
0,2008,3481985,2743767,23791,31028,68394,585187,14800,152,14715.0,0.0,0.0,144.0,7.0,615005,17.66%
1,2009,3741961,2940751,16612,50813,70134,615640,26900,279,20700.0,0.0,0.0,125.0,7.0,663651,17.74%
2,2010,4207993,3250409,13236,69027,73880,722172,44622,699,24750.0,9064.0,2.0,125.0,7.0,801441,19.05%
3,2011,4715761,3723315,7786,84022,86350,698945,70331,2604,31500.0,10770.0,6.0,125.0,7.0,814288,17.27%
4,2012,4994038,3785022,6698,85686,97394,872107,95978,6344,33700.0,10968.0,9.0,125.0,7.0,1019238,20.41%
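
Keeping only the top level leaves several columns with the same name (`Fossil`, `Renewable`). An alternative, sketched below as an assumption rather than the notebook's method, is to keep the second level, which is already unique in the column listing printed by `df_3.columns`. The toy frame uses the first eight column labels and the 2008 row from the table above.

```python
import pandas as pd

# Column labels as printed by df_3.columns above (first eight only, for brevity)
columns = pd.MultiIndex.from_tuples(
    [
        ("Year", "Year"),
        ("Total", "Total"),
        ("Fossil", "Coal"),
        ("Fossil", "Oil"),
        ("Fossil", "Gas"),
        ("Nuclear", "Nuclear"),
        ("Renewable", "Hydro"),
        ("Renewable", "Wind"),
    ]
)
# 2008 row copied from the table above
toy = pd.DataFrame(
    [[2008, 3481985, 2743767, 23791, 31028, 68394, 585187, 14800]], columns=columns
)

flat = toy.copy()
# Use the second level of the MultiIndex as the single-level column names
flat.columns = toy.columns.get_level_values(1)
print(list(flat.columns))
```

With names such as `Coal`, `Oil`, and `Gas`, the later manual renaming to `Fossil_1`, `Fossil_2`, `Fossil_3` would not be needed.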


### reorganize the table

In [30]:
df_3 = df_3.drop(columns=["% renewable"])
df_3 = df_3.drop([12])

In [31]:
df_3.columns = (
    ["Year_source"]
    + ["Total"]
    + ["Fossil_1"]
    + ["Fossil_2"]
    + ["Fossil_3"]
    + ["Nuclear"]
    + ["Hydro"]
    + [_ for _ in df_3.columns[7:]]
)
df_3

Unnamed: 0,Year_source,Total,Fossil_1,Fossil_2,Fossil_3,Nuclear,Hydro,Renewable,Renewable.1,Renewable.2,Renewable.3,Renewable.4,Renewable.5,Renewable.6,Total renewable
0,2008,3481985,2743767,23791,31028,68394,585187,14800,152,14715.0,0.0,0.0,144.0,7.0,615005
1,2009,3741961,2940751,16612,50813,70134,615640,26900,279,20700.0,0.0,0.0,125.0,7.0,663651
2,2010,4207993,3250409,13236,69027,73880,722172,44622,699,24750.0,9064.0,2.0,125.0,7.0,801441
3,2011,4715761,3723315,7786,84022,86350,698945,70331,2604,31500.0,10770.0,6.0,125.0,7.0,814288
4,2012,4994038,3785022,6698,85686,97394,872107,95978,6344,33700.0,10968.0,9.0,125.0,7.0,1019238
5,2013,5447231,4110826,6504,90602,111613,920291,141197,15451,38300.0,12304.0,26.0,109.0,8.0,1127686
6,2014,5678945,4115215,9517,114505,132538,1064337,156078,29195,44437.0,12956.0,34.0,125.0,8.0,1307170
7,2015,5859958,4108994,9679,145346,170789,1130270,185766,45225,52700.0,11029.0,27.0,125.0,8.0,1425180
8,2016,6217907,4241786,10367,170488,213287,1193374,237071,75256,64700.0,11413.0,29.0,125.0,11.0,1581979
9,2017,6452900,4178200,2700,203200,248100,1194700,304600,117800,81300.0,81300.0,,,,1700000


In [32]:
# change data type
df_3 = df_3.apply(pd.to_numeric)
# parse the year with an explicit format; casting to "datetime64[Y]" can fail on newer pandas
df_3["Year_source"] = pd.to_datetime(df_3["Year_source"].astype(str), format="%Y")

In [33]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Year_source      12 non-null     datetime64[ns]
 1   Total            12 non-null     int64         
 2   Fossil_1         12 non-null     int64         
 3   Fossil_2         12 non-null     int64         
 4   Fossil_3         12 non-null     int64         
 5   Nuclear          12 non-null     int64         
 6   Hydro            12 non-null     int64         
 7   Renewable        12 non-null     int64         
 8   Renewable        12 non-null     int64         
 9   Renewable        12 non-null     float64       
 10  Renewable        12 non-null     float64       
 11  Renewable        9 non-null      float64       
 12  Renewable        9 non-null      float64       
 13  Renewable        9 non-null      float64       
 14  Total renewable  12 non-null     int64      

In [34]:
# calculate total fossil category
df_3["Total fossil"] = df_3["Fossil_1"] + df_3["Fossil_2"] + df_3["Fossil_3"]

In [35]:
# extract the target data
df_3new = df_3.loc[
    :, ["Year_source", "Total fossil", "Nuclear", "Hydro"],
]
df_3new

Unnamed: 0,Year_source,Total fossil,Nuclear,Hydro
0,2008-01-01,2798586,68394,585187
1,2009-01-01,3008176,70134,615640
2,2010-01-01,3332672,73880,722172
3,2011-01-01,3815123,86350,698945
4,2012-01-01,3877406,97394,872107
5,2013-01-01,4207932,111613,920291
6,2014-01-01,4239237,132538,1064337
7,2015-01-01,4264019,170789,1130270
8,2016-01-01,4422641,213287,1193374
9,2017-01-01,4384100,248100,1194700


### Chart: Electricity Production By Source (2008-2019)

In [36]:
base_3figure = alt.Chart(df_3new).encode(x="Year_source")
alt.layer(
    base_3figure.mark_area(opacity=0.3, color="orange").encode(y="Total fossil:Q"),
    base_3figure.mark_area(opacity=0.3, color="red").encode(y="Nuclear:Q"),
    base_3figure.mark_area(opacity=0.3, color="green").encode(y="Hydro:Q"),
)
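
As a first step toward the comparison mentioned in the outline, shares of total generation can be easier to compare than absolute levels. The sketch below uses only the 2008 and 2017 rows printed above; the `share` helper and the `rows` dictionary are hypothetical names introduced for illustration.

```python
# Totals, total fossil, and hydro copied from the tables above (same unit as the source table)
rows = {
    2008: {"total": 3481985, "fossil": 2798586, "hydro": 585187},
    2017: {"total": 6452900, "fossil": 4384100, "hydro": 1194700},
}


def share(part: float, total: float) -> float:
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


for year, r in rows.items():
    print(
        year,
        "fossil %:", share(r["fossil"], r["total"]),
        "hydro %:", share(r["hydro"], r["total"]),
    )
```

In these two rows the fossil share is lower in 2017 than in 2008 even though absolute fossil generation is higher, which is the kind of contrast a share-based chart would make visible.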

___

## This class has been such a great journey:)