# How have global energy production trends changed over time?

In [1]:
import pandas as pd

## Goals (2 min)

By the end of this case, you should be very comfortable writing your own functions using `pandas` and applying them to entire datasets. You'll understand how functions work in Python, including anonymous functions (using the keyword `lambda`), and you'll feel comfortable analyzing and manipulating larger datasets. You'll also have gained experience with exploring a dataset that is only loosely organised and about which you have very little initial information.


## Introduction (5 min)

**Business Context.** Global electricity production, consumption, import, and export are complex and interesting for a variety of reasons. Each country has to keep track of a vast array of information to ensure that it produces enough electricity, yet balance these needs against medium-term financial implications and environmental concerns.

You are an analyst working at a non-governmental organization (NGO) that reports on global energy trends. Your department has acquired a large CSV file, but your colleagues are battling to extract relevant insights from it using Excel due to its size and format. Worse still, it has thousands of variables and they are not sure which ones are interesting. Thus, you have been made responsible for supporting your team's journalists by providing them with data and insights that they can turn into written reports.

**Business Problem.**  Your task is to **break the available data down into smaller files, understand the information that is available, and extract key insights for an upcoming report on global power patterns.** Specifically, your team wants you to answer the following questions:

* How much power is produced?
* How much power is consumed?
* How much power is imported and exported? 
* How much of this power is renewable?
* How are these trends in production, consumption, import, and export changing over time?

**Analytical Context.** The data is stored in a large CSV file containing information on power production and consumption by country and year. You will: 1) break down the data into summarized CSV files to share with your colleagues; 2) manipulate the data to create more categories from the existing columns; 3) find the biggest players in different categories, including total energy export and total production by type (e.g. nuclear); and finally 4) find trends in the data, such as which countries have the fastest growing energy production.

## Getting started with the International Energy Statistics data (25 min)

The data file you have been given is a single CSV located at `data/all_energy_statistics.csv`. Your colleagues have informed you that the data is from http://data.un.org/Explorer.aspx, but they don't know much else about it. 

They specifically note that the data is very ["narrow"](https://en.wikipedia.org/wiki/Wide_and_narrow_data). Although the file contains data for a wide variety of things, such as "Total Energy Production" all the way through to "Additives and Oxygenates - Exports", it has very few columns. 

Generally, when dealing with "wide" data, we can be fairly sure that all data in the same column is comparable. In this case, you'll notice a `unit` column. Not all numerical data in the `quantity` column is directly comparable. For example, sometimes the number in this column is defined in terms of "Metric tons, thousand" and sometimes in "Kilowatt-hours, million" -- evidently very different concepts!
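To make the comparability problem concrete, here is a minimal sketch on a made-up narrow table (the rows and the name `toy` are illustrative and not taken from the real file). A blind sum of `quantity` mixes units, while grouping by `unit` first only combines like with like.

```python
import pandas as pd

# Hypothetical narrow table: one row per measurement, with mixed units in one column
toy = pd.DataFrame({
    "commodity_transaction": [
        "Anthracite - Imports",
        "Electricity - imports",
        "Electricity - exports",
    ],
    "unit": [
        "Metric tons, thousand",
        "Kilowatt-hours, million",
        "Kilowatt-hours, million",
    ],
    "quantity": [5.0, 100.0, 40.0],
})

# A blind sum adds tons to kilowatt-hours, which is meaningless
blind_total = toy["quantity"].sum()

# Summing within each unit only combines comparable numbers
per_unit_total = toy.groupby("unit")["quantity"].sum()
print(per_unit_total)
```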

As always, our first step is to read the data from disk and take a look at the first few rows:

In [2]:
df = pd.read_csv("data/all_energy_statistics.csv")

In [3]:
df

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates
...,...,...,...,...,...,...,...
1189477,Viet Nam,Electricity - total wind production,2012,"Kilowatt-hours, million",92.0,1.0,wind_electricity
1189478,Viet Nam,Electricity - total wind production,2011,"Kilowatt-hours, million",87.0,,wind_electricity
1189479,Viet Nam,Electricity - total wind production,2010,"Kilowatt-hours, million",50.0,,wind_electricity
1189480,Viet Nam,Electricity - total wind production,2009,"Kilowatt-hours, million",10.0,,wind_electricity


You'll notice that there is more of a delay than before when running the `read_csv` function. This dataset has over 1 million rows, so it takes a while to load it all into memory. From the first rows, we can immediately gain some useful insights:

* The `category` column looks like it is well organized. All the samples we see are lowercase and underscores are used instead of spaces
* The `commodity_transaction` column looks more like a human-readable description. We can see how it includes a description of the category (e.g. "additives_and_oxygenates" matches with "Additives and Oxygenates" and "wind_electricity" matches with "Electricity - ....wind....")
* We see `year` ranges from at least 1995 to 2014 
* As mentioned before, we'll need to be careful when comparing quantities, as the `unit` column might change the meaning of the `quantity` column.

A good first question to ask is how many unique values there are for the following columns:

* `country_or_area`
* `commodity_transaction`
* `year`
* `category`

Let's find out:

In [4]:
print(df.year.min())
print(df.year.max())
print("----------")
print("commodity_transaction")
print(df.commodity_transaction.unique())
print()
print("num unique values: ", len(df.commodity_transaction.unique()))
print()
print("----------")
print(df.category.unique())
print()
print("num unique values: ", len(df.category.unique()))
print()
print("---------------")
print(df.country_or_area.unique())
print()
print("num unique values: ", len(df.country_or_area.unique()))


1990
2014
----------
commodity_transaction
['Additives and Oxygenates - Exports' 'Additives and Oxygenates - Imports'
 'Additives and Oxygenates - Production' ...
 'White spirit and special boiling point industrial spirits - Transformation'
 'White spirit and special boiling point industrial spirits - Transformation in petrochemical plants'
 'Electricity - total wind production']

num unique values:  2452

----------
['additives_and_oxygenates' 'animal_waste' 'anthracite'
 'aviation_gasoline' 'bagasse' 'biodiesel' 'biogases' 'biogasoline'
 'bitumen' 'black_liquor' 'blast_furnace_gas' 'brown_coal_briquettes'
 'brown_coal' 'charcoal' 'coal_tar' 'coke_oven_coke' 'coking_coal'
 'conventional_crude_oil' 'direct_use_of_geothermal_heat'
 'direct_use_of_solar_thermal_heat'
 'electricity_net_installed_capacity_of_electric_power_plants' 'ethane'
 'falling_water' 'fuel_oil' 'fuelwood' 'gas_coke' 'gas_oil_diesel_oil'
 'gasoline_type_jet_fuel' 'gasworks_gas' 'geothermal' 'hard_coal' 'heat'
 'hydro'

We can see that `country_or_area` has 243 unique values, more than the officially recognised 195, because this list includes some former countries such as the USSR as well as areas like Antarctic Fisheries which are not formal countries.

As expected, the `category` column is well standardized and breaks each row into one of 71 unique categories, while the `commodity_transaction` column is slightly more chaotic and consists of 2452 unique values.

In terms of time, our data ranges from 1990 - 2014 inclusive, so 25 years in total.

Note that the output of `unique()` is automatically truncated for large lists, with a `...` inserted to indicate this.
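When only the counts matter, `Series.nunique()` is a shorter alternative to `len(series.unique())` and avoids printing the values themselves. The sketch below uses a small made-up Series rather than the real dataset. One difference worth knowing: `nunique()` skips missing values by default, whereas `unique()` includes them.

```python
import pandas as pd

# Hypothetical stand-in for the `year` column
years = pd.Series([1990, 1991, 1991, 2014, 2014, 2014])

count_via_len = len(years.unique())   # build the array of unique values, then count it
count_via_nunique = years.nunique()   # count directly

print(count_via_len, count_via_nunique)
```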

Since the `commodity_transaction` column is a bit chaotic, we'll need to touch it up a bit. Let's create a copy of our dataframe before we start changing it so we can refer back to the original values if necessary.

In [5]:
df_orig = df.copy()

The first thing we noticed about the `commodity_transaction` column is that it uses hyphens (`-`) as separators. We can also see that it uses lowercase and capital letters - often something that makes analysis harder if we are going to do any string matching (e.g. find the word "production", which might skip descriptions which use "Production" instead). 

Let's start by lowercasing all of the descriptions. In the previous case, you learned how to do this by creating a separate list, looping through the dataframe, and then adding all the items from the list as a new column. We could achieve what we wanted as follows:

In [6]:
%%time
clean_transaction_list = []

for item in df['commodity_transaction']:
    item = item.lower()
    clean_transaction_list.append(item)
    
df['clean_transaction'] = clean_transaction_list

CPU times: user 571 ms, sys: 57.5 ms, total: 629 ms
Wall time: 643 ms


In [7]:
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates,additives and oxygenates - exports
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates,additives and oxygenates - exports
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates,additives and oxygenates - exports


We added `%%time` at the top of our cell to make Jupyter output information about how long it took to run that cell. We can see that looping through our DataFrame and adding the column took about 0.6 seconds to complete. It also took 5 lines of code.

As it's very common to need to apply the same operation to every row of a dataset, `pandas` provides a shortcut to do this. You can call the `.apply()` method on a DataFrame column and pass in a function to apply to every value in it. This is more efficient in two ways:

* It takes fewer lines of code, so it's faster to write the code (and to read it)
* `apply()` handles the iteration internally, which avoids some of the overhead of a hand-written loop, so it typically runs in less time (it still calls the function once per value, so it is not truly vectorized)

We can achieve exactly the same result as we did with our `for` loop using the `apply()` function as follows:

In [8]:
%%time
df['clean_transaction2'] = df['commodity_transaction'].apply(str.lower)

CPU times: user 261 ms, sys: 49.2 ms, total: 310 ms
Wall time: 324 ms


In [9]:
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction,clean_transaction2
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates,additives and oxygenates - exports,additives and oxygenates - exports
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates,additives and oxygenates - exports,additives and oxygenates - exports
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports,additives and oxygenates - exports
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports,additives and oxygenates - exports
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates,additives and oxygenates - exports,additives and oxygenates - exports


Here we can see that `.apply()` ran around twice as quickly as the iterative version and produced the same results (the `clean_transaction` and `clean_transaction2` columns are the same). You can read more about the `apply()` function [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), but in essence you call it from a column of a DataFrame and pass in a function. It applies that function to every row of that column in the DataFrame. In this case, we passed in the `str.lower` function, which converts a string to lowercase.
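For string clean-up specifically, `pandas` also offers the `.str` accessor, which applies a string method to each element without needing a function argument. The following is a minimal sketch on made-up data; it makes no claim about relative timings on the full dataset, which would have to be measured.

```python
import pandas as pd

# Hypothetical sample of descriptions
descriptions = pd.Series([
    "Additives and Oxygenates - Exports",
    "Electricity - total wind production",
])

via_apply = descriptions.apply(str.lower)  # call str.lower once per value
via_str = descriptions.str.lower()         # same result through the .str accessor

same = via_apply.equals(via_str)
print(same)
```

One practical difference: `.str.lower()` passes missing values through unchanged, while `apply(str.lower)` raises an error if it meets a value that is not a string.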

## Pre-processing and pivoting our data (45 min)

We noted before that the `commodity_transaction` column seemed to use hyphens to separate different concepts in a single column. Let's do some more analysis to see if this is true across the board.

### Exercise 1: (7 min)

Find out how many of the 2,000+ unique `commodity_transaction` values contain:

- 0 hyphens
- exactly 1 hyphen
- more than 1 hyphen

**Hint:** You can use Python's built-in [`count()`](https://www.w3schools.com/python/ref_string_count.asp) method to count the occurrences of a character in a string.

**Answer.** One possible solution is given below:

In [10]:
hyphens_0 = 0
hyphens_1 = 0
hyphens_2plus = 0

for value in df.commodity_transaction.unique():
    hyphen_count = value.count("-")
    if hyphen_count == 0:
        hyphens_0 += 1
    elif hyphen_count == 1:
        hyphens_1 += 1
    else:
        hyphens_2plus += 1
        
print("zero hyphens", hyphens_0)
print("one hyphen", hyphens_1)
print("two or more hyphens", hyphens_2plus)

zero hyphens 57
one hyphen 1845
two or more hyphens 550


We can see that most descriptions have exactly one hyphen, strengthening the idea that the first part of the description before the hyphen is linked to `category`, while the rest is more descriptive. We should take a closer look at the ones with zero hyphens as there are only 57 of these.
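The same tally can also be produced with `apply()` and `value_counts()` instead of three counter variables. This is a sketch on a few made-up descriptions; `hyphen_bucket` is a hypothetical helper name, not something defined elsewhere in this case.

```python
import pandas as pd

# Hypothetical sample of descriptions
values = pd.Series([
    "Hard coal",
    "Electricity - imports",
    "Gasoline-type jet fuel - imports",
])

def hyphen_bucket(text):
    """Label a description by how many hyphens it contains."""
    n = text.count("-")
    if n == 0:
        return "0"
    if n == 1:
        return "1"
    return "2+"

# Label every value, then count how often each label occurs
bucket_counts = values.apply(hyphen_bucket).value_counts()
print(bucket_counts)
```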

### Exercise 2: (5 min)

Write code to print out all descriptions with zero hyphens. What do you notice about these?

**Answer.** One possible solution is given below:

In [11]:
for value in df.commodity_transaction.unique():
    hyphen_count = value.count("-")
    if hyphen_count == 0:
        print(value)

From chemical sources – Autoproducer
From chemical sources – Autoproducer – CHP plants
From chemical sources – Autoproducer – Heat plants
From combustible fuels – Autoproducer
From combustible fuels – Autoproducer – CHP plants
From combustible fuels – Autoproducer – Heat plants
From combustible fuels – Main activity
From combustible fuels – Main activity – CHP plants
From combustible fuels – Main activity – Heat plants
From electric boilers – Main activity
From heat pumps – Main activity
From other sources – Autoproducer
From other sources – Autoproducer – CHP plants
From other sources – Autoproducer – Heat plants
From other sources – Main activity
From other sources – Main activity – CHP plants
From other sources – Main activity – Heat plants
Geothermal – Autoproducer
Geothermal – Autoproducer – CHP plants
Geothermal – Autoproducer – Heat plants
Geothermal – Main activity
Geothermal – Main activity – CHP plants
Geothermal – Main activity – Heat plants
Nuclear – Main activity
Nuclear –

Tricky! We see an inconsistency in the data, where some descriptions use m-dashes (`–`; strictly speaking, this character is an en dash) instead of hyphens (`-`). This is barely noticeable to a human reader, but can cause issues for computers, which see the two as completely distinct characters.
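One way to confirm that two similar-looking characters really differ is to ask Python for their Unicode names. This is a minimal sketch, assuming the dash in the output above is the character U+2013:

```python
import unicodedata

hyphen = "-"
dash = "\u2013"  # assumed to be the dash used in the descriptions above

print(hyphen == dash)             # False
print(unicodedata.name(hyphen))   # HYPHEN-MINUS
print(unicodedata.name(dash))     # EN DASH

# Counting hyphens therefore finds none in a description that uses the other dash
print("Nuclear \u2013 Main activity".count("-"))  # 0
```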

### Passing our own functions to `apply()` (12 min)

We previously passed the built-in `str.lower()` function to the `apply()` function to apply it to every row in our DataFrame. Now we want to clean up the m-dashes and lowercase the result at the same time. Let's write our own custom Python function to do both, and pass that to `apply()` instead. You can read more about writing your own custom functions in Python [here](https://www.w3schools.com/python/python_functions.asp):

In [12]:
def clean_transaction_description(transaction_description):
    """Lowercase the input and replace all m-dashes with hyphens"""
    clean = transaction_description.lower()
    clean = clean.replace("–", "-")
    return clean
    

# drop the columns we added before so we can recreate them with our new clean function
df = df.drop(columns=['clean_transaction', 'clean_transaction2'])
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates


In [13]:
df['clean_transaction'] = df['commodity_transaction'].apply(clean_transaction_description)
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates,additives and oxygenates - exports
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates,additives and oxygenates - exports
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates,additives and oxygenates - exports


Here we used `apply()` again, but this time passed in our own function which did both the lowercasing and the replacing of m-dashes with hyphens.

We've now seen how to use the `apply()` function with both built-in functions and our own custom functions. There's one more way we can use `apply()` though: with custom **anonymous functions** using the Python `lambda` keyword. Let's see how to achieve the same result using `lambda`:

In [14]:
df = df.drop(columns=['clean_transaction'])
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates


In [15]:
# lowercase the description and replace m-dashes with hyphens in one line
df['clean_transaction'] = df['commodity_transaction'].apply(lambda x: x.lower().replace("–", "-"))
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates,additives and oxygenates - exports
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates,additives and oxygenates - exports
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates - exports
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates,additives and oxygenates - exports


This code is functionally equivalent to what we ran before, but it's more concise. Instead of giving our function a name (`clean_transaction_description`), we can declare an anonymous function by using the [`lambda`](https://www.w3schools.com/python/python_lambda.asp) keyword. Here, `x` stands for a single description, and the expression after the colon describes what to return for it; `apply()` then calls this function on each value in turn. The disadvantage is that it can be harder to read, and it prevents us from using our function again later without redefining it.
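To see the trade-off side by side, the sketch below applies a named function and an equivalent `lambda` to the same made-up Series. The name `clean` is a shortened, hypothetical stand-in for the function defined earlier.

```python
import pandas as pd

def clean(text):
    """Lowercase the text and replace the dash U+2013 with a hyphen."""
    return text.lower().replace("\u2013", "-")

# Hypothetical sample of descriptions
sample = pd.Series(["Nuclear \u2013 Main activity", "Electricity - Gross demand"])

named_result = sample.apply(clean)
lambda_result = sample.apply(lambda x: x.lower().replace("\u2013", "-"))

print(named_result.equals(lambda_result))

# The named function can be reused elsewhere; the lambda would have to be retyped
reused = clean("Geothermal \u2013 Autoproducer")
print(reused)
```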

### Extracting the most interesting rows (6 min)

It's hard to manually inspect over 2,000 unique description values, but we know that we're mainly interested in: 

* Import
* Export
* Total production
* Total demand or consumption
* Renewables

We can search for some keywords in the descriptions using code similar to the following:

In [16]:
[x for x in df['clean_transaction'].unique() if "import" in x]

['additives and oxygenates - imports',
 'anthracite - imports',
 'aviation gasoline - imports',
 'biodiesel - imports',
 'biogases - imports',
 'biogasoline - imports',
 'bitumen - imports',
 'brown coal briquettes - imports',
 'brown coal - imports',
 'charcoal - imports',
 'coal tar - imports',
 'coking coal - imports',
 'conventional crude oil - imports',
 'ethane - imports',
 'fuel oil - imports',
 'fuelwood - imports',
 'gas coke - imports',
 'gas oil/ diesel oil - imports',
 'gasoline-type jet fuel - imports',
 'gasworks gas - imports',
 'hard coal - imports',
 'heat - imports',
 'industrial waste - imports',
 'kerosene-type jet fuel - imports',
 'lignite - imports',
 'liquefied petroleum gas (lpg) - imports',
 'lubricants - imports',
 'motor gasoline - imports',
 'municipal wastes - imports',
 'naphtha - imports',
 'natural gas (including lng) - imports',
 'natural gas liquids - imports',
 'of which: biodiesel - imports',
 'of which: biogasoline - imports',
 'oil shale - imports

This gives us a much more manageable list to look through, and we can see that "electricity - imports" is likely an interesting value. We can cross-check this in the main dataset (and see all columns to boot) as follows:

In [17]:
## Note the below is functionally equivalent to 
# df[df["clean_transaction"] == "electricity - imports"].head()
# but slightly easier to type

df[df.clean_transaction == "electricity - imports"].head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
1108326,Afghanistan,Electricity - imports,2014,"Kilowatt-hours, million",3710.8,,total_electricity,electricity - imports
1108327,Afghanistan,Electricity - imports,2013,"Kilowatt-hours, million",3615.2,,total_electricity,electricity - imports
1108328,Afghanistan,Electricity - imports,2012,"Kilowatt-hours, million",3071.0,,total_electricity,electricity - imports
1108329,Afghanistan,Electricity - imports,2011,"Kilowatt-hours, million",2732.0,,total_electricity,electricity - imports
1108330,Afghanistan,Electricity - imports,2010,"Kilowatt-hours, million",1867.0,,total_electricity,electricity - imports
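The keyword search shown earlier with a list comprehension can also be written with `Series.str.contains`, which returns a boolean mask that can filter either the unique values or the full DataFrame. This is a sketch on made-up descriptions; `regex=False` is used so that the keyword is treated as plain text.

```python
import pandas as pd

# Hypothetical sample of cleaned descriptions
descriptions = pd.Series([
    "electricity - imports",
    "electricity - exports",
    "hard coal - imports",
    "electricity - gross production",
])

# True where the description contains the keyword
is_import = descriptions.str.contains("import", regex=False)
import_descriptions = descriptions[is_import].tolist()
print(import_descriptions)
```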


### Exercise 3: (5 min)

Use the above method or any other method that you prefer to explore the transaction descriptions and define a Python list containing the 9 most interesting ones. These should cover the total values for import, export, total production, total demand, and renewable energy production.

**Answer.** One possible solution is given below:

In [98]:
# The first four values handle demand, production, import and exports
# The others are all values that match the `total ..... production` pattern except for `thermal` which 
# loosely describes all non-renewable sources of production
keep_values =  ["Electricity - Gross demand",
        "Electricity - Gross production",
        "Electricity - imports",
        "Electricity - exports",
        "Electricity - total hydro production",
        "Electricity - total wind production",
        "Electricity - total solar production",
        "Electricity - total geothermal production",
        "Electricity - total tide, wave production",
]

### Pivoting the interesting values into their own columns (10 min)

Of course, now that we've identified the most interesting transaction descriptions, we probably ought to pull them out of that single column that they're stuck in. Let's "pivot" our data to a more useable format, keeping each of these interesting values as new columns. This translates our data from a fairly narrow format into a wider one.
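As a warm-up, here is a minimal sketch of the narrow-to-wide idea on a made-up four-row table (the countries "A" and "B" and their quantities are invented). Each distinct value of `commodity_transaction` becomes its own column, and `quantity` fills the cells.

```python
import pandas as pd

# Hypothetical narrow table: one row per country, year and transaction type
narrow = pd.DataFrame({
    "country_or_area": ["A", "A", "B", "B"],
    "year": [2014, 2014, 2014, 2014],
    "commodity_transaction": [
        "Electricity - imports",
        "Electricity - exports",
        "Electricity - imports",
        "Electricity - exports",
    ],
    "quantity": [10.0, 2.0, 0.0, 7.0],
})

# One row per (country, year); one column per transaction type
wide = pd.pivot_table(
    narrow,
    values="quantity",
    index=["country_or_area", "year"],
    columns="commodity_transaction",
).reset_index()
print(wide)
```

Note that `pivot_table` averages the values by default if several rows share the same index and column combination, so it is worth checking for duplicates in real data.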

You might know of "pivot tables" from Excel. If not, don't worry - you'll come across them later and in more detail. But if you do know of them, you'll recognize that this is pretty much the exact same thing. We'll use the pivot function in pandas, which you can read more about [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html). For now, just try to understand how the following code works, but you won't be expected to do this yourself until you've gained more experience with `pandas`:

In [70]:
# we'll keep our "interesting" values after we turn them into columns
# but we'll also keep the "country" and "year" columns
final_keep_values = ["country_or_area", "year"] + keep_values

# Turn values in the 'commodity transaction' column
# into our new column names
# and keep only the 'quantity' column as the new values
df_countries = pd.pivot_table(
    df,
    values="quantity",
    index=["country_or_area", "year"],
    columns="commodity_transaction",
).reset_index()[final_keep_values]

# rename the columns to be more concise
df_countries.columns = [
    "country",
    "year",
    "demand",
    "production",
    "imports",
    "exports",
    "hydro",
    "wind",
    "solar",
    "geothermal",
    "tide",
]

# output with the energy production leaders first
df_countries.sort_values(by="production", ascending=False)

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide
1062,China,2014,5219096.0,5649583.4,6750.0,18158.0,1064337.0,156078.0,15189.0,,
1061,China,2013,5016127.0,5431637.4,7438.0,18669.0,920291.0,141197.0,5564.0,,
1060,China,2012,4609729.0,4987553.0,6874.0,17653.0,872107.0,95978.0,,,
1059,China,2011,4319132.0,4713019.0,6562.0,19307.0,698945.0,70331.0,,,
5322,United States,2010,4153664.0,4378422.0,45083.0,19107.0,286333.0,95148.0,3934.0,17577.0,
...,...,...,...,...,...,...,...,...,...,...,...
2873,Lesotho,1996,335.0,,335.0,,,,,,
2874,Lesotho,1997,395.0,,395.0,,,,,,
2875,Lesotho,1998,385.0,,385.0,,,,,,
3454,Namibia,1990,,,,,,,,,


We can see that our data is in a much more user-friendly format now. We have kept only the quantity column and each row now represents one country in a particular year. If we had data for each year for each of the 243 countries or areas, we would expect to have 6075 rows (243 × 25), but we have only 5568. This makes sense as some countries stopped existing and data collection in general has become much easier and more consistent over time. Let's take a look at how many countries we have data on for each year:

In [95]:
df_countries['year'].value_counts()

2013    229
2014    229
2012    229
2007    227
2011    226
2008    226
2005    226
2009    226
2010    226
2006    226
2002    225
2004    225
2003    225
1995    223
1996    223
2001    223
1997    223
2000    222
1992    222
1993    222
1994    222
1998    222
1999    222
1990    200
1991    199
Name: year, dtype: int64

As expected, in earlier years, we have data for fewer countries.

The final check we should do is whether any of the values we kept used a different "unit". A quick scan of the data shows that all of the values we are interested in are measured in "Kilowatt-hours, million", but it's possible that some small values could be measured as "Kilowatt-hours, thousand", for example. Let's look for unique values used in our `keep_values` list:

In [97]:
all_units = []

for value in keep_values:
    units_used = list(df[df.commodity_transaction == value]['unit'].unique())
    all_units += units_used
print(set(all_units))

{'Kilowatt-hours, million'}


All good! Only one unit is used. So we are done with data preparation and we can start exploring our dataset for information.
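The same unit check can be written without a loop by filtering with `isin`. The sketch below runs on a made-up table, and `wanted` is a hypothetical stand-in for `keep_values`.

```python
import pandas as pd

# Hypothetical narrow table with two different units
toy = pd.DataFrame({
    "commodity_transaction": [
        "Electricity - imports",
        "Electricity - exports",
        "Anthracite - Imports",
    ],
    "unit": [
        "Kilowatt-hours, million",
        "Kilowatt-hours, million",
        "Metric tons, thousand",
    ],
})
wanted = ["Electricity - imports", "Electricity - exports"]

# Keep only the rows of interest, then collect the distinct units they use
units_used = set(toy.loc[toy["commodity_transaction"].isin(wanted), "unit"])
print(units_used)
```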

## Exploring growth of power production and renewables (60 min)

As mentioned, the team is interested in analyzing countries based on their renewable energy production. We currently know how much power they produce in total and how much of this is due to each of a number of renewable options. We'll start by adding some supplementary data and then analyzing our dataset for interesting countries and patterns.

### Exercise 4: (7 min)

Add a new summary column called `renewable_percent` which gives the percentage of total power production which is made up of renewable energy.

**Hint:** You might notice that some values are `na`, meaning `not available`. We can probably assume that these are 0 (though this might not always be meaningful; e.g. if we don't have data on the USSR in 2014, it's not because its power plants are all turned off!). You can use the `pandas` [`fillna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method to replace `na` values with 0.

**Answer.** One possible solution is given below:

In [101]:
# replace the `na` values with 0
df_countries = df_countries.fillna(0)

# sum all renewable energy production sources and divide by the total energy production
# (note: the result is a fraction between 0 and 1, not a 0-100 percentage)
df_countries["renewable_percent"] = (
    df_countries["hydro"]
    + df_countries["wind"]
    + df_countries["solar"]
    + df_countries["geothermal"]
    + df_countries["tide"]
) / df_countries['production']
df_countries

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide,renewable_percent
0,Afghanistan,1990,1055.0,1128.0,0.0,0.0,764.0,0.0,0.0,0.0,0.0,0.677305
1,Afghanistan,1991,945.0,1015.0,0.0,0.0,690.0,0.0,0.0,0.0,0.0,0.679803
2,Afghanistan,1992,789.0,703.0,131.0,0.0,478.0,0.0,0.0,0.0,0.0,0.679943
3,Afghanistan,1993,780.0,695.0,130.0,0.0,475.0,0.0,0.0,0.0,0.0,0.683453
4,Afghanistan,1994,770.0,687.0,128.0,0.0,472.0,0.0,0.0,0.0,0.0,0.687045
...,...,...,...,...,...,...,...,...,...,...,...,...
5563,Zimbabwe,2010,9317.3,8602.9,1681.7,694.4,5762.8,0.0,0.0,0.0,0.0,0.669867
5564,Zimbabwe,2011,9645.5,9177.2,1578.7,988.2,5201.8,0.0,0.0,0.0,0.0,0.566818
5565,Zimbabwe,2012,9425.2,9148.6,1076.1,700.9,5387.3,0.0,0.0,0.0,0.0,0.588866
5566,Zimbabwe,2013,9919.7,9498.8,1722.0,1189.3,4981.8,0.0,0.0,0.0,0.0,0.524466
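Two details of this calculation are worth a closer look: the result is a fraction between 0 and 1 rather than a 0-100 percentage, and rows where `production` is 0 after `fillna(0)` lead to a division by zero. The sketch below shows one possible way to handle both on made-up numbers; the column names `renewable_share` and `renewable_percent_0_100` are hypothetical and not used elsewhere in this case.

```python
import numpy as np
import pandas as pd

# Hypothetical two-country table; the second row has no recorded production
toy = pd.DataFrame({
    "production": [100.0, 0.0],
    "hydro": [25.0, 0.0],
    "wind": [np.nan, np.nan],
}).fillna(0)

renewables = toy["hydro"] + toy["wind"]

# Treat zero production as missing so the ratio is NaN instead of a misleading number
toy["renewable_share"] = renewables / toy["production"].replace(0, np.nan)

# Multiply by 100 if a true percentage is wanted
toy["renewable_percent_0_100"] = toy["renewable_share"] * 100
print(toy)
```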


### Exercise 5: (10 min)

Considering only the most recent year that we have data for (2014), which 5 countries produced the largest proportion of their power through renewables, and which 5 countries produced the smallest proportion of their power through renewables?

**Hint:** You can use the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method in `pandas` to sort a DataFrame by a specific column, either descending or ascending.

**Answer.** One possible solution is given below:

In [103]:
# filter the dataframe by year to get only 2014 and then sort by renewable percent and take the top 5
df_countries[(df_countries["year"] == 2014)].sort_values(
    by="renewable_percent", ascending=False
).head(5)

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide,renewable_percent
2891,Lesotho,2014,783.48,515.2,271.2,2.92,515.2,0.0,0.0,0.0,0.0,1.0
49,Albania,2014,7791.43,4724.43,3250.45,183.45,4724.43,0.0,0.0,0.0,0.0,1.0
611,Bhutan,2014,2085.46,7003.86,187.37,4991.9,7003.36,0.0,0.0,0.0,0.0,0.999929
3948,Paraguay,2014,13432.0,55282.3,0.0,41400.1,55276.4,0.0,0.0,0.0,0.0,0.999893
2328,Iceland,2014,17475.0,18122.0,0.0,0.0,12873.0,8.0,0.0,5238.0,0.0,0.999834


We can see that Lesotho is on top, generating 100% of its power using hydro. Lesotho is a pretty tiny country though, and also imports about half as much power as it produces. The other countries on the list are also relatively small players in terms of total energy production.

In [104]:
# filter the dataframe by year to get only 2014, then sort ascending by renewable percent and take the bottom 5
df_countries[(df_countries["year"] == 2014)].sort_values(
    by="renewable_percent"
).head(5)

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide,renewable_percent
5053,Trinidad and Tobago,2014,9531.0,9891.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3873,Palau,2014,73.7,79.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
962,Cayman Islands,2014,620.74,620.74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1012,Chad,2014,206.0,225.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3798,Oman,2014,28343.0,29128.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can see even more relatively small players in our list of countries which produce no renewable power.

### Question: (5 min)

Why do you think we are seeing a lot of very small countries on both lists?

Very small countries are not particularly representative of the global renewable power situation, so your team asks you to restrict your analysis only to countries that produce a lot of power.

### Exercise 6: (5 min)

Repeat the above analysis but only look at the countries in the top 10% of total power production.

**Hint:** You can filter a DataFrame with multiple conditions by using the `&` symbol; e.g.:

`df_countries[(df_countries.year == 2014) & (df_countries.wind > 0)]` 

would give you a DataFrame of all countries in 2014 which had produced at least some wind power.
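One detail worth spelling out: in Python, `&` binds more tightly than `==` and `>`, so each comparison needs its own parentheses when conditions are combined. The sketch below uses a made-up toy DataFrame to show the pattern.

```python
import pandas as pd

# Toy data; names and values are invented for this example.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2013, 2014, 2013, 2014],
        "wind": [5.0, 0.0, 0.0, 12.0],
    }
)

# Wrap each comparison in parentheses before combining them with `&`.
mask = (toy["year"] == 2014) & (toy["wind"] > 0)
wind_2014 = toy[mask]

print(wind_2014)
```

Without the parentheses, Python would try to evaluate `2014 & toy["wind"]` first, which is not the intended comparison.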

**Answer.** One possible solution is given below:

In [114]:
threshold = df_countries["production"].quantile(0.9)
df_countries[
    (df_countries.production > threshold) & (df_countries.year == 2014)
].sort_values(by="renewable_percent", ascending=False).head(5)

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide,renewable_percent
3773,Norway,2014,124139.0,142327.0,6347.0,21932.0,136636.0,2216.0,0.0,0.0,0.0,0.975584
712,Brazil,2014,615629.0,590541.0,33778.0,3.0,373439.0,12211.0,16.0,0.0,0.0,0.653072
937,Canada,2014,591137.0,656225.0,12808.0,58421.0,382574.0,22538.0,1756.0,0.0,16.0,0.620037
4844,Sweden,2014,132375.0,153662.0,13852.0,29475.0,63872.0,11234.0,47.0,0.0,0.0,0.48908
5474,Viet Nam,2014,141136.0,145730.0,2053.0,880.0,61480.0,300.0,0.0,0.0,0.0,0.423935


This list now has more countries on it that most people are likely to associate with renewable power! We can see that hydro and wind are popular ways of generating renewable power (by contrast, our previous "top" list contained almost no wind generation; only Iceland reported any).

In [128]:
df_countries[
    (df_countries.production > threshold) & (df_countries.year == 2014)
].sort_values(by="renewable_percent").head(5)

Unnamed: 0,country,year,demand,production,imports,exports,hydro,wind,solar,geothermal,tide,renewable_percent
4294,Saudi Arabia,2014,304240.0,311806.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3e-06
2745,"Korea, Republic of",2014,523363.0,550933.0,0.0,0.0,7820.0,1146.0,2557.0,0.0,492.0,0.021808
4541,South Africa,2014,231445.0,252578.0,11177.0,13836.0,4082.0,1070.0,1120.0,0.0,0.0,0.024832
3823,Other Asia,2014,244755.0,260025.0,0.0,0.0,7439.0,1500.0,552.0,0.0,0.0,0.0365
4965,Thailand,2014,179330.0,180862.0,12260.0,2066.0,5540.0,305.0,1385.0,1.0,0.0,0.039981


And here we can see countries which produce a lot of power but barely any of it renewable, all in Asia or Africa.

Of course, your team is also interested in looking at change in renewable energy over time. Let's look at the top and bottom 5 countries where the percentage of renewable energy they produced in 2014 is **very different** from the percentage in 1990.

### Exercise 7: (15 min)

Add a new column to your DataFrame which displays the difference in percentage renewable energy production between 2014 and 1990. Which are the top and bottom 5 countries? What do you notice about these countries? Perform this analysis both with all countries and again with only those in the top 10% of total power production.

**Hint:** You can pivot the data again with `pd.pivot_table()` to create a DataFrame which has 1990 and 2014 as columns and `renewable_percent` as values, using the following code:

```
renewable_change = pd.pivot_table(
    df_countries, values="renewable_percent", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]
```
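To see what this reshaping does, here is a small sketch on made-up data with the same column names. The values are invented and only the shape of the result matters.

```python
import pandas as pd

# Toy long-format data: one row per country and year (values are invented).
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [1990, 2014, 1990, 2014],
        "renewable_percent": [0.20, 0.50, 0.90, 0.60],
    }
)

# One row per country, one column per year.
wide = pd.pivot_table(
    toy, values="renewable_percent", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]

print(wide)
```

The year columns are labeled with the integers `1990` and `2014`, not strings, so they are selected as `wide[1990]` and `wide[2014]`.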

**Answer.** One possible solution is given below:

In [144]:
# get a DataFrame with only the 1990 and 2014 values kept, and as columns
renewable_change = pd.pivot_table(
    df_countries, values="renewable_percent", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]

# add the diff column to see the change
renewable_change["diff"] = renewable_change[2014] - renewable_change[1990]
renewable_change.sort_values(by="diff", ascending=False).head(5)

year,country,1990,2014,diff
86,Greenland,0.0,0.683475,0.683475
185,Sierra Leone,0.0,0.653569,0.653569
75,French Guiana,0.0,0.605495,0.605495
20,Belize,0.0,0.507055,0.507055
58,Denmark,0.024555,0.42538,0.400824


As before, we can see some pretty small countries. All of them went from producing zero or nearly zero renewable energy in 1990 to 40% or more by 2014. These are some great countries for the team to dig more into.

In [145]:
renewable_change.sort_values(by="diff").head(5)

year,country,1990,2014,diff
195,Sri Lanka,0.998413,0.37765,-0.620763
176,Rwanda,0.976608,0.390884,-0.585724
96,Honduras,0.912549,0.373227,-0.539322
204,Suriname,0.8586,0.364792,-0.493808
226,United Rep. of Tanzania,0.896869,0.41936,-0.477509


And there are some drops too. Sri Lanka was almost 100% renewable energy in 1990, but only 38% in 2014. Let's take a look at the larger ones:

In [150]:
# get only the top producers and redo the analysis
threshold = df_countries.production.quantile(0.9)
df_countries_large = df_countries[df_countries.production > threshold]

renewable_change = pd.pivot_table(
    df_countries_large, values="renewable_percent", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]

renewable_change["diff"] = renewable_change[2014] - renewable_change[1990]
renewable_change.sort_values(by="diff", ascending=False).head(5)

year,country,1990,2014,diff
23,Spain,0.172486,0.389797,0.217312
12,Italy,0.176856,0.370429,0.193573
29,United Kingdom,0.022512,0.132286,0.109774
19,Poland,0.024305,0.065491,0.041187
1,Australia,0.095988,0.135047,0.039059


Spain, Italy and the UK have made good progress, with their renewable power shares growing by roughly 11 to 22 percentage points. But considering only larger countries makes the drop-off quite dramatic after these three, with Poland and Australia rounding out the top 5 with increases of only about 4 percentage points each.
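Because the `diff` column subtracts one share from another, it measures a change in percentage points, which is not the same as relative growth. The sketch below uses Spain's two values from the table above to show the difference between the two measures.

```python
# Spain's renewable share in 1990 and 2014, copied from the table above.
share_1990 = 0.172486
share_2014 = 0.389797

# Change in share, expressed in percentage points (what `diff` measures).
change_in_points = (share_2014 - share_1990) * 100

# Relative growth of the share itself, in percent.
relative_growth = (share_2014 - share_1990) / share_1990 * 100

print(f"Change: {change_in_points:.1f} percentage points")
print(f"Relative growth of the share: {relative_growth:.0f}%")
```

Both numbers are valid, but they answer different questions, so it helps to state which one a report is quoting.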

In [151]:
renewable_change.sort_values(by="diff").head(5)

year,country,1990,2014,diff
2,Brazil,0.927691,0.653072,-0.274618
9,India,0.247679,0.123335,-0.124344
24,Sweden,0.498512,0.48908,-0.009432
13,Japan,0.115881,0.114409,-0.001472
3,Canada,0.615727,0.620037,0.00431


At the bottom of the list, we see developing countries like Brazil and India finding it hard to keep growing their renewable energy sources at the same rate as their economies.

### Exercise 8: (15 min)

Your team is also interested in countries which are producing a lot more power now than they were 25 years ago. What are the top and bottom 10 countries in terms of growth of:

* Total power
* Renewable power

Note that because many countries were producing zero or very little renewable energy in 1990, doing a basic growth calculation will show that many countries have "infinite" (represented as `inf` in `pandas`) growth. To avoid this, restrict your results to countries which produced at least 1,000 units of renewable power in 1990 for the renewable growth analysis and at least 1,000 units of total power in 1990 for the total growth analysis.
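The following sketch shows where `inf` comes from and how a threshold on the 1990 value removes it. The data is made up, and the 1,000-unit cutoff is written here as `>=` to match "at least 1,000".

```python
import numpy as np
import pandas as pd

# Toy wide-format data with integer year columns (values are invented).
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        1990: [0.0, 500.0, 2000.0],
        2014: [50.0, 1500.0, 3000.0],
    }
)

# Percentage growth; dividing by a 1990 value of zero yields inf, not an error.
toy["growth"] = (toy[2014] - toy[1990]) / toy[1990] * 100

# Keep only countries with a meaningful 1990 baseline.
filtered = toy[toy[1990] >= 1000]

print(toy)
print(filtered)
```

Country B is also dropped by the filter: its growth is finite, but a small baseline would still make the percentage look misleadingly large.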

**Hint:** Assuming you add a column called `renewable_total`, you can use the following pivots to generate tables similar to before for both renewable growth and total growth:

```
renewable_growth = pd.pivot_table(
    df_countries, values="renewable_total", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]
```

```
total_growth = pd.pivot_table(
    df_countries, values="production", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]
```

**Answer.** One possible solution is given below:

In [184]:
# calculate renewable total
df_countries["renewable_total"] = (
    df_countries["hydro"]
    + df_countries["wind"]
    + df_countries["solar"]
    + df_countries["geothermal"]
    + df_countries["tide"]
)

In [185]:
# pivot to create year columns
renewable_growth = pd.pivot_table(
    df_countries, values="renewable_total", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]

# calculate growth rate in percentage terms
renewable_growth['growth'] = (renewable_growth[2014] - renewable_growth[1990])/renewable_growth[1990] * 100

# get the top values that had at least 1000 units renewable in 1990
renewable_growth[renewable_growth[1990] > 1000].sort_values(by='growth', ascending=False).head(10)

year,country,1990,2014,growth
235,Viet Nam,5371.0,61780.0,1050.25135
43,China,126720.0,1235604.0,875.066288
147,Myanmar,1193.0,8828.84,640.053646
227,United Kingdom,7198.0,44835.0,522.881356
87,Greece,1999.0,12088.0,504.702351
24,Bhutan,1557.0,7003.36,349.79833
196,Spain,26204.0,108656.0,314.654251
32,Bulgaria,1878.0,7746.0,312.460064
100,Iceland,4504.0,18119.0,302.286856
131,Malaysia,3982.0,13615.0,241.913611


That's some impressive growth: all of the countries in the top ten have over 200% growth in renewables.

In [187]:
# and the bottom
renewable_growth[renewable_growth[1990] > 1000].sort_values(by='growth').head(10)

year,country,1990,2014,growth
206,Suriname,1111.2,795.1,-28.446724
116,"Korea, Dem.Ppl's.Rep.",15600.0,13000.0,-16.666667
208,Sweden,73039.0,75153.0,2.894344
160,Norway,121382.0,138852.0,14.392579
162,Other Asia,8196.0,9491.0,15.80039
157,Nigeria,4387.0,5346.0,21.860041
110,Japan,97577.0,119063.0,22.019533
58,Côte d'Ivoire,1464.0,1913.0,30.669399
209,Switzerland,30983.0,40644.0,31.181616
154,New Zealand,25314.0,33824.0,33.617761


And only two countries have negative renewable growth. Even New Zealand with a seemingly impressive growth of 34% is included in the trailing 10. 

Let's do the same for total growth:

In [189]:

total_growth = pd.pivot_table(
    df_countries, values="production", index=["country"], columns="year",
).reset_index()[["country", 1990, 2014]]

total_growth['growth'] = (total_growth[2014] - total_growth[1990])/total_growth[1990]*100
total_growth[total_growth[1990] > 1000].sort_values(by='growth', ascending=False).head(10)

year,country,1990,2014,growth
235,Viet Nam,8722.0,145730.0,1570.832378
43,China,621200.0,5649583.4,809.462878
174,Qatar,4818.0,38692.0,703.071814
16,Bahrain,3792.0,27246.0,618.512658
17,Bangladesh,8057.0,55845.0,593.123992
226,United Arab Emirates,17081.0,116528.0,582.208302
161,Oman,4504.0,29128.0,546.714032
122,Lebanon,2825.0,17952.0,535.469027
131,Malaysia,25263.0,147461.0,483.70344
147,Myanmar,2478.0,14156.3,471.279257


In terms of total growth, we can see a strong concentration in Asia and the Middle East, but there's a definite overlap with countries with high growth in renewables.

In [191]:
total_growth[total_growth[1990] > 1000].sort_values(by='growth').head(10)

year,country,1990,2014,growth
116,"Korea, Dem.Ppl's.Rep.",27700.0,17909.0,-35.34657
0,Afghanistan,1128.0,1049.3,-6.97695
176,Romania,64310.0,65676.0,2.124086
20,Belgium,70923.0,72688.0,2.488614
99,Hungary,28436.0,29371.0,3.288086
242,Zimbabwe,9559.0,10023.0,4.854064
208,Sweden,146514.0,153662.0,4.878715
227,United Kingdom,319737.0,338925.0,6.001182
32,Bulgaria,42141.0,47485.0,12.681237
171,Poland,136311.0,159059.0,16.688308


By contrast, countries on the bottom-10 list do not overlap as much with the bottom-10 list in renewables growth. Sweden appears on both lists, but otherwise this list contains more examples of countries such as Zimbabwe and Afghanistan which have experienced major disruptive events in the last 25 years.

### Exercise 9: (8 min)

Finally, your team wants an easy-to-read label for each country based on total growth. They have given you the following specification for how the countries should be labeled:

* zero or negative growth = "No growth"
* more than 0% and up to 100% growth = "Growing"
* over 100% growth = "Growing fast"
* NaN (if the data from 1990 or 2014 is NaN) = "Not Applicable"

Calculate the label for each country, using the `apply()` method for efficiency. 

**Hint:** You can check if the value of variable `x` is NaN as follows:

```
import numpy as np
np.isnan(x)
```
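As a small illustration, the sketch below applies `np.isnan` to each element of a made-up Series and compares the result with the `isna()` method that `pandas` provides for whole Series.

```python
import numpy as np
import pandas as pd

# Toy growth values, one of which is missing (values are invented).
values = pd.Series([12.5, np.nan, -3.0])

# Element-by-element check, in the style used with apply().
flags = values.apply(np.isnan)

# Equivalent check on the whole Series at once.
vectorized = values.isna()

print(flags)
print(vectorized)
```

A plain comparison such as `x == np.nan` does not work for this purpose, because NaN is never equal to anything, including itself.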

**Answer.** One possible solution is given below:

In [218]:
import numpy as np
def calculate_label(growth):
    if np.isnan(growth):
        return "Not Applicable"
    elif growth > 100:
        return "Growing fast"
    elif growth > 0:
        return "Growing"
    else:
        return "No growth"
    
total_growth['label'] = total_growth['growth'].apply(calculate_label)
total_growth

year,country,1990,2014,growth,label
0,Afghanistan,1128.0,1049.300,-6.976950,No growth
1,Albania,3197.0,4724.430,47.776978,Growing
2,Algeria,16104.0,64242.000,298.919523,Growing fast
3,American Samoa,100.0,156.945,56.945000,Growing
4,Andorra,120.0,126.800,5.666667,Growing
...,...,...,...,...,...
238,Yemen Arab Rep. (former),830.0,,,Not Applicable
239,"Yemen, Dem. (former)",910.0,,,Not Applicable
240,"Yugoslavia, SFR (former)",82905.0,,,Not Applicable
241,Zambia,7771.0,14452.000,85.973491,Growing


## Largest importers and exporters of energy (15 min)

The final thing that your team wants to look into is imports and exports of energy by country.

### Exercise 10: (15 min)

Your team wants to know:

* Which countries have imported and exported the most power in total
* Which countries have imported the largest percentage of their *demand* and exported the largest percentage of their *production*

Do the analysis for all countries *and* for only countries with total production in the top 10%.

**Answer.** One possible solution is given below:

In [243]:
# sum each numeric column of interest over all years, per country
df_countries_agg = df_countries.groupby("country")[
    ["demand", "production", "imports", "exports", "renewable_total"]
].sum()
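To make the aggregation step concrete, here is a minimal sketch on made-up data showing how `groupby` followed by `sum` collapses the yearly rows into one row per country.

```python
import pandas as pd

# Toy data: two years for country A, one for country B (values are invented).
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [2013, 2014, 2014],
        "imports": [10.0, 15.0, 3.0],
        "exports": [1.0, 2.0, 30.0],
    }
)

# Select the columns to total, then sum them within each country.
totals = toy.groupby("country")[["imports", "exports"]].sum()

print(totals)
```

The result is indexed by `country`, which is why the country names appear as row labels rather than as a regular column in the output tables.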

In [244]:
df_countries_agg.sort_values(by="exports", ascending=False).head(5)

country,demand,production,imports,exports,renewable_total
France,11102400.0,13289334.0,187359.0,1633762.0,1801935.0
Germany,12887914.0,14146393.0,1013711.0,1183777.0,1266381.0
Canada,13520339.0,14700613.0,363518.0,1084799.0,8824536.0
Paraguay,171097.96,1197305.62,9.13,1019362.97,1196802.35
Switzerland,1444738.0,1632441.0,666095.0,754602.0,914723.0


France, Germany, and Canada export a lot of energy!

In [245]:
df_countries_agg.sort_values(by="imports", ascending=False).head(5)

country,demand,production,imports,exports,renewable_total
United States,93789000.0,98746617.0,1129039.0,377445.0,9014525.0
Italy,7431370.0,6837341.0,1096071.0,33423.0,1470013.0
Germany,12887914.0,14146393.0,1013711.0,1183777.0,1266381.0
Brazil,10291411.0,9481650.0,914672.0,8615.0,7860553.0
Switzerland,1444738.0,1632441.0,666095.0,754602.0,914723.0


The United States imports a lot of energy. We can guess that a lot of this comes from Canada, although the dataset doesn't actually contain this data.

Interestingly, Germany and Switzerland are on both lists, importing and exporting power. This is perhaps because Switzerland's large amount of hydro power is difficult to store.

Let's look at percentages:

In [269]:
df_countries_agg['percent_export'] = df_countries_agg['exports'] / df_countries_agg['production']
df_countries_agg['percent_import'] = df_countries_agg['imports'] / df_countries_agg['demand']

In [270]:
df_countries_agg.sort_values(by='percent_export', ascending=False).head(5)

country,demand,production,imports,exports,renewable_total,percent_export,percent_import
Paraguay,171097.96,1197306.0,9.13,1019363.0,1196802.0,0.851381,5.3e-05
Bhutan,20733.42,91613.21,595.631,70028.72,91586.65,0.764395,0.028728
Mozambique,170075.0,232838.0,116607.0,176300.0,229279.0,0.757179,0.685621
Luxembourg,144000.0,64395.0,153692.0,44394.0,24068.0,0.689401,1.067306
Lao People's Dem. Rep.,41285.7149,74112.16,9439.6614,41276.11,68534.16,0.556941,0.228642


There are some smaller countries that export over 50% of the power that they produce. Let's look again with the 10% threshold applied:

In [272]:
threshold = df_countries_agg["production"].quantile(0.9)
df_countries_agg[df_countries_agg.production > threshold].sort_values(
    by="percent_export", ascending=False
).head(5)

country,demand,production,imports,exports,renewable_total,percent_export,percent_import
France,11102400.0,13289334.0,187359.0,1633762.0,1801935.0,0.122938,0.016876
Sweden,3499543.0,3731851.0,320874.0,377042.0,1743724.0,0.101034,0.09169
Norway,2956855.0,3122858.0,170211.0,281560.0,3081819.0,0.090161,0.057565
Germany,12887914.0,14146393.0,1013711.0,1183777.0,1266381.0,0.08368,0.078656
Canada,13520339.0,14700613.0,363518.0,1084799.0,8824536.0,0.073793,0.026887


Larger countries export no more than about 12% of their produced power, and we can see many of the same countries that appeared on the earlier list of top exporters by total quantity.

In [274]:
df_countries_agg.sort_values(by='percent_import', ascending=False).head(5)

country,demand,production,imports,exports,renewable_total,percent_export,percent_import
Luxembourg,144000.0,64395.0,153692.0,44394.0,24068.0,0.689401,1.067306
State of Palestine,68445.9,6195.2,62250.7,0.0,0.0,0.0,0.909488
Benin,15761.0,2466.0,13303.0,0.0,14.0,0.0,0.844045
Liechtenstein,3140.9,577.6,2563.4,0.0,548.2,0.0,0.816136
Andorra,11293.2,2446.0,8850.1,3.0,2323.3,0.001226,0.783666


For import percentage, we again see small countries feature most. Luxembourg imports more power than its total demand, and more than twice what it produces! Let's add the 10% threshold back:

In [276]:
df_countries_agg[df_countries_agg.production > threshold].sort_values(
    by="percent_import", ascending=False
).head(5)

country,demand,production,imports,exports,renewable_total,percent_export,percent_import
Italy,7431370.0,6837341.0,1096071.0,33423.0,1470013.0,0.004888,0.147492
Sweden,3499543.0,3731851.0,320874.0,377042.0,1743724.0,0.101034,0.09169
Brazil,10291411.0,9481650.0,914672.0,8615.0,7860553.0,0.000909,0.088877
Germany,12887914.0,14146393.0,1013711.0,1183777.0,1266381.0,0.08368,0.078656
Norway,2956855.0,3122858.0,170211.0,281560.0,3081819.0,0.090161,0.057565


We can see that Italy is hugely dependent on its neighbors, being on the leader list for both total quantity of imported energy and as a percentage of demand.

## Writing new country-specific summary data to disk (5 min)

Your team is delighted that you've managed to make sense of the data and extract some insights. They want to explore the data themselves too, but all of their existing tools are designed to analyze data from only one country at a time. They have asked that you create separate CSV files for each country, using the country as the file name, with a maximum of 25 rows per file (one per year) and columns for imports, exports, etc.

To do this, we use the [`to_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) method of a given DataFrame to write it to a file. We create a new directory called "output_csvs" in our working directory so that we don't clutter up our workspace with 243 CSV files. Then we run the following code to write our data to disk:

In [340]:
import os

OUTPUT_DIRECTORY = "output_csvs"

# create the output directory if it does not already exist
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)

# write one CSV file per country, named after the country
for country in df_countries['country'].unique():
    country_df = df_countries[df_countries.country == country].drop(columns='country')
    country_df.to_csv(os.path.join(OUTPUT_DIRECTORY, f"{country}.csv"))
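An alternative way to split the data is to iterate over `groupby("country")` directly, which yields each country's name together with its rows. The sketch below is only an illustration on made-up data: it writes to a temporary directory rather than "output_csvs", and it passes `index=False` (an assumption about what your colleagues want) so that the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for df_countries (values are invented).
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [2013, 2014, 2014],
        "production": [10.0, 12.0, 7.0],
    }
)

# A temporary directory keeps this example from touching the real workspace.
with tempfile.TemporaryDirectory() as output_directory:
    # groupby yields (group key, sub-DataFrame) pairs.
    for country, country_df in toy.groupby("country"):
        path = os.path.join(output_directory, f"{country}.csv")
        country_df.drop(columns="country").to_csv(path, index=False)

    # Check what was written and read one file back.
    written = sorted(os.listdir(output_directory))
    round_trip = pd.read_csv(os.path.join(output_directory, "A.csv"))

print(written)
print(round_trip)
```

If any country name contained a character that is not valid in a file name, it would need to be replaced before building the path.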

## Conclusions (3 min)

We saw a number of interesting trends in the global energy industry. Specifically, we saw that many countries are relying more and more on renewables, but that some of the countries with fast-growing demand are forced to turn to non-renewable sources to keep up.

We also noticed that contrary to our expectations of some countries being "net importers" and others being "net exporters" of power, many countries actually both import *and* export large amounts of power.

## Takeaways (5 min)

In this case, we covered some more features of `pandas` and got more practice with the features we covered previously. Specifically we saw how to:

* Use the `apply()` method in `pandas` with built-in functions, custom functions, and anonymous functions
* Work with large datasets and explore these using basic string matching to find interesting columns, and reformat the results into more convenient formats
* Pivot between wide and narrow formats
* Plot basic line plots
* Break up a large dataset into smaller ones and write these back to disk

While you'll learn more advanced functionality than this in later cases, these basics will be used again and again, so keep coming back to this case as reference material as often as you need.