In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf



# Lab 03: Environmental Kuznets Curve (EKC)

Reference: [The evolution of the environmental Kuznets curve hypothesis assessment: A literature review under a critical analysis perspective](https://www.cell.com/heliyon/fulltext/S2405-8440(22)02809-2?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2405844022028092%3Fshowall%3Dtrue)

## Background of the EKC
### Origin of the EKC
"The results of the research of Kuznets disclosed an **inverted
U-shaped** relationship between ``income per capita`` and ``income inequality``.
According to Kuznets, the inverted U-shaped relationship revealed an
unequal income distribution in the early stages of income growth that
moves towards equal income distribution with increasing economic
productivity in the later stages of economic growth. Therefore, Kuznets specified that the transition from a pre-industrial to an industrial
development firstly led to income inequality. This is followed by a rising
income per capita together with superior income equality.

The EKC attracted a lot of attention from policymakers, theorists and empirical
researchers and started to be widely used in environmental studies through the seminal research of Grossman and Krueger, carried
out in 1991. They revealed that the relationship between income per
capita and environmental degradation, like the income per capita and
income inequality of Kuznets, also follows an inverted U-shaped curve.

In the early 1990s, the main idea in economics was “too poor to be
green” [15]. According to Beckerman's [15] point of view regarding the
effect of economic growth on environmental degradation, the author
argues that there is: "clear evidence that, although economic growth usually
leads to environmental deterioration in the early stages of the process, in the
end, the best and probably the only way to attain a decent environment in most
countries is to become rich". This view reflects the basic philosophy of the
EKC theory. The World Development Report in 1992 argues that some
environmental problems are aggravated by the growth of economic activity, and it suggests that accelerated equitable income growth will make
it possible to achieve higher world output and improved environmental
conditions [16, 17]. This proposal lays the foundation of the EKC literature."

### Conceptual framework of the EKC
The EKC is commonly interpreted in two ways:
#### Two Phases, namely the early and later stages of economic development:
1. The early stages are defined by a decreasing capacity for ecosystem regeneration as a consequence of **intensive use of resources**, which leads to a rising ecological footprint and pollution. The early stages are linked with **lax environmental regulations** associated with a low capacity to pay for environmental conservation.
2. The later stages are characterized by mitigation of environmental degradation resulting from the dissemination of **clean technology and innovation**, **society's environmental awareness**, and **effectiveness and institutional quality** associated with an increase in the level of income.

In addition, these stages are also
characterized by two effects, i.e., **policy effect** and **income effect**:
1. The policy effect consists of greater public concern about the environment,
which leads to rigorous regulatory requirements.
2. The income effect consists of the increase in income that leads to an increase in the willingness to pay for environmentally-friendly features.
#### Three phases of economic development:
1. the pre-industrial economy, mainly characterised by primary
sector and low levels of income;
2. the industrial economy, constituted
by the secondary sector and associated with middle-income levels; and
3. the post-industrial economy, formed by the tertiary sector and services, and associated with higher levels of income.

In the pre-industrial economy, economic activity is limited and results in a natural resource abundance and reduced formation of waste. In this phase, the use of pollutant technology, the lack of environmental awareness, and the prioritisation of economic growth result in rising environmental degradation.

The industrial economy is characterised by natural resources that are starting to run out and increasing waste accumulation because of industrialisation. In this phase, a **positive** relationship between economic growth and environmental deterioration is verified, and it occurs before the turning point is achieved.

The third phase of economic development is characterised by a structural change in the
economy, changing to information- and technology-intensive industries
and a services-directed economy. This change is linked with the reinforcement of environmental regulations, the use of cleaner and efficient technology, and a strengthening of environmental awareness, resulting in a mitigation of environmental degradation. In this phase, a **negative** relationship between economic growth and environmental deterioration is verified, and it occurs after the turning point has been
reached.                 

### Shape of the EKC
The EKC defines the pollution trajectory over time and income resulting from economic development. The EKC is a long-run concept.

Consider a linear regression of pollution level on income and income squared. With $β_1$ as the long-run coefficient on income and $β_2$ as the long-run coefficient on income squared, the EKC hypothesis is supported when $β_1 > 0$ and $β_2 < 0$.

Beyond this condition, two more cases can arise in EKC assessments. Both require adding a third polynomial term, income cubed, with coefficient $β_3$.
1. Figure (vi): $β_1 < 0$, $β_2 > 0$, and $β_3 < 0$. This is the opposite of the N-shaped curve.
2. Figure (vii): $β_1 > 0$, $β_2 < 0$, and $β_3 > 0$. This is the cubic polynomial or N-shaped curve.
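
For the quadratic case, the sign conditions can be tied to a turning point with a short derivation. This is a sketch under assumed notation: the intercept $\beta_0$, the error term $\varepsilon_i$, and the observation index $i$ are not named in the text above and are introduced here only for illustration.

$$
\text{pollution}_i = \beta_0 + \beta_1\,\text{income}_i + \beta_2\,\text{income}_i^2 + \varepsilon_i
$$

Setting the derivative with respect to income to zero gives the income level at which the fitted curve peaks:

$$
\frac{\partial\,\text{pollution}}{\partial\,\text{income}} = \beta_1 + 2\beta_2\,\text{income} = 0
\quad\Longrightarrow\quad
\text{income}^{*} = -\frac{\beta_1}{2\beta_2}
$$

When $\beta_1 > 0$ and $\beta_2 < 0$, the turning point $\text{income}^{*}$ is positive and the second derivative $2\beta_2$ is negative, so the curve rises before $\text{income}^{*}$ and falls after it, which is the inverted U.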

![image.png](attachment:49c7b085-de9a-40ff-9185-7b97da58d339.png)

#### Grossman-Krueger (1995)
![image.png](attachment:2f2be023-b2b7-4d02-886f-e8b0e6d0d948.png)

#### Data sources:

The main data we will use are extracted from the ["World Development Indicators DataBank"](https://databank.worldbank.org/source/world-development-indicators) from the World Bank, for 2019 only.

We will therefore explore the EKC hypothesis only from a cross-sectional perspective.

### Learning Objectives: 
- Importing and exporting dataframes
- Recognizing and handling missing values and NaNs
- Pivoting data
- Regression model behind EKC

---
## Part 1: Importing dataset

**Question 1.1:** Import the dataset `https://raw.githubusercontent.com/Mxywp/EnvEcon105-2025/refs/heads/data/wdi_gdp_pollution_2019.csv`

In [None]:
gdp_ekc = ...
gdp_ekc

---
## Part 2: Exploring the dataset

One of the first things that we will do with our dataset is to learn about its structure: how many rows and columns are there in the dataset? What values does each column store? What is the data type for each column (int, string, etc.)? For categorical variables, what are unique values? For numerical variables, what is the mean, median, min, and max? 
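
As an illustration of these checks, here is a minimal sketch on a made-up long-format table. The dataframe `toy` and its column names are hypothetical and are not the lab dataset; the pandas attributes and methods used (`shape`, `dtypes`, `nunique`, `unique`, `describe`) are standard.

```python
import pandas as pd

# Hypothetical toy data in long format; this is NOT the lab dataset.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "indicator": ["gdp", "co2", "gdp", "co2", "gdp", "co2"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

n_rows, n_cols = toy.shape                 # (number of rows, number of columns)
col_types = toy.dtypes                     # data type of each column
n_countries = toy["country"].nunique()     # count of distinct values
indicators = toy["indicator"].unique()     # the distinct values themselves
value_summary = toy["value"].describe()    # count, mean, std, min, quartiles, max

print(n_rows, n_cols, n_countries)
print(value_summary)
```

In the output of `describe()`, the median is reported under the label `50%`.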

**Question 2.1:** How many rows and columns are there in this dataframe `gdp_ekc`? Assign the number of rows to `N_rows` and the number of columns to `N_cols`. 

In [None]:
N_rows = ...
N_cols = ...
N_rows, N_cols

In [None]:
grader.check("q_2_1")

**Question 2.2:** How many unique countries are there in this dataframe `gdp_ekc`? Assign the number of unique countries to `N_unique_countries`.

In [None]:
N_unique_countries = ...
N_unique_countries

In [None]:
grader.check("q_2_2")

---
## Part 3: Pivot

You should know a bit about pivot tables from our lecture on `tidy data`. Look at the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html). For this lab analysis, we would like to use `.pivot()` to convert a long-form dataframe to a wide one.
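
To make the long-to-wide idea concrete, here is a small sketch on invented data. The names `long_df`, `wide_df`, `country`, `indicator`, and `value` are hypothetical; the lab dataset has its own column names, which you will need to look up yourself.

```python
import pandas as pd

# Hypothetical long-format table: one row per (country, indicator) pair.
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "indicator": ["gdp", "co2", "gdp", "co2"],
    "value": [10.0, 1.5, 20.0, 3.0],
})

# Each unique value of `indicator` becomes its own column,
# and each unique `country` becomes one row of the wide table.
wide_df = long_df.pivot(index="country", columns="indicator", values="value")
print(wide_df)
```

Note that `.pivot()` requires each index/column pair to appear at most once; duplicated pairs raise an error, which is one reason `pivot_table` (with an aggregation function) exists as a separate tool.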

**Question 3.1:** Convert the dataframe using `pandas.pivot()` and assign the pivot table to `ekc_wide` so that it contains new columns that correspond to the unique values of the column `Series Name`.

In [None]:
ekc_wide = ...
ekc_wide

In [None]:
grader.check("q_3_1")

**Question 3.2:** Drop the column that we won't use in this lab: 'CO2_emissionstons per capita)'

Don't create a new dataframe after dropping. Check `DataFrame.drop()` and its `inplace` argument to make changes directly to the existing dataframe.
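
A minimal sketch of how `inplace=True` behaves, using a hypothetical dataframe `toy` with made-up column names:

```python
import pandas as pd

# Hypothetical two-column dataframe; not the lab dataset.
toy = pd.DataFrame({"keep_me": [1, 2], "drop_me": [3, 4]})

# With inplace=True the dataframe is modified directly and the call returns None,
# so the result should not be assigned back to the variable.
result = toy.drop(columns=["drop_me"], inplace=True)

print(result)             # None
print(list(toy.columns))  # ['keep_me']
```

A common mistake is writing `toy = toy.drop(..., inplace=True)`, which overwrites `toy` with `None`.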

In [None]:
...
ekc_wide

In [None]:
grader.check("q_3_2")

In [None]:
You have probably noticed that the new column names are rather long and complex!

**Question 3.3:** Rename the columns to the simpler names below (use the new names exactly as written) to make them easier to work with:
1. 'CO2 emissions (metric tons per capita)' to 'CO2_tonpc'
2. 'GDP per capita (constant 2015 US$)' to 'GDP_pc'
3. 'PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)' to 'PM25_mcgpcm'
4. 'Population density (people per sq. km of land area)' to 'pop_den'

Don't create a new dataframe after renaming. Check `DataFrame.rename` and its `inplace` argument to make changes directly to the existing dataframe.
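
As a sketch of the renaming pattern, the example below maps old names to new ones with a dictionary. The dataframe `toy` and all four names in it are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with verbose column names; not the lab dataset.
toy = pd.DataFrame({
    "A long name (units)": [1.0],
    "Another long name (units)": [2.0],
})

# Keys are the existing names, values are the replacements.
toy.rename(
    columns={
        "A long name (units)": "short_a",
        "Another long name (units)": "short_b",
    },
    inplace=True,
)
print(list(toy.columns))
```

By default, `rename` silently skips dictionary keys that do not match an existing column, so a typo in an old name leaves that column unchanged without raising an error. Checking `.columns` afterwards is a quick way to confirm the rename worked.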

In [None]:
...
                        ...
                        ...
                        ...
ekc_wide.columns

In [None]:
grader.check("q_3_3")

---
## Part 4: Missing Values and NaNs

As we said in class, real-world data is rarely clean. In particular, many datasets have a significant amount of missing data. In `Pandas`, missing data is primarily represented by two special values:
- `None`: a `Python` object used to represent missing values, particularly in object-type (e.g., string) arrays.
- `NaN` (Not a Number): a special floating-point value from NumPy that is widely recognized as a missing-value indicator, especially in numerical arrays.

However, different data sources may record and/or report missing data in different ways. 

In our dataset, there are two types of 'missing values': "NaN" and "..". Let's see what they look like.
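
The difference between these markers can be seen on a few tiny, made-up Series. This is only a sketch; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

s_text = pd.Series(["a", None, "c"])       # None inside text data
s_float = pd.Series([1.0, np.nan, 3.0])    # NaN inside numeric data
s_mixed = pd.Series([1, None, 3])          # None among numbers is stored as NaN
s_dots = pd.Series(["1.0", "..", "3.0"])   # ".." is an ordinary string

# isna() flags None and NaN, but it has no way of knowing that ".." means "missing".
print(s_text.isna().sum(), s_float.isna().sum(), s_mixed.isna().sum(), s_dots.isna().sum())
print(s_mixed.dtype)
```

This is why a source-specific marker such as ".." has to be found with an explicit comparison rather than with `.isna()`.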

In [None]:
ekc_wide[ekc_wide["CO2_tonpc"].isna()]

In [None]:
ekc_wide[ekc_wide["CO2_tonpc"] == ".."][:5]

**Question 4.1:** For simplicity, drop all rows that contain missing values (either NaN or "..") for this lab.

*Hint:*
1. Check the data types before using `.dropna()`. It removes only values that pandas recognizes as missing, so the string '..' stored in an `object` column is not dropped. You need to convert the columns to `float` type.
2. However, the missing value '..' cannot be converted from `string` to `float`, so you first need to replace it with something that can be. There are a few ways to complete this. Here is my suggestion: 1) use `.replace()` to change '..' to 'NaN'; 2) use `.astype(float)` to change the columns that hold numerical values (but are stored as `object` type) to `float` type; 3) call `.dropna()`.
3. Finally, assign the number of rows to `n_rows`.

*Note:* As we said in class, this is not a good way to deal with missing values, so do not do this in real-world analysis.
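
The three suggested steps can be sketched on a small invented table. The dataframe `toy` and its columns `x` and `y` are hypothetical, and this sketch replaces ".." with `np.nan` directly rather than with the string 'NaN'; either choice converts to a float NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: numbers stored as text, with both kinds of missing markers.
toy = pd.DataFrame({
    "x": ["1.5", "..", "3.0", "4.0"],
    "y": ["10", "20", np.nan, "40"],
})

clean = toy.replace("..", np.nan)   # 1) turn the ".." marker into a real missing value
clean = clean.astype(float)         # 2) convert the text columns to float
clean = clean.dropna()              # 3) drop every row with at least one NaN

print(clean)
print(len(clean))
```

Only the first and last rows survive, because the second row had ".." in `x` and the third had NaN in `y`.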

In [None]:
ekc_no_missing = ekc_wide.copy()
...
ekc_no_missing[['GDP_pc', 'CO2_tonpc', 'PM25_mcgpcm', 'pop_den']] = ...
...
ekc_no_missing.head()
n_rows = ...
n_rows


In [None]:
grader.check("q_4_1")

---
## Part 5: EKC - Regression Model

Here I'd like to explore the regression model behind the EKC shown in the graphs at the beginning of this notebook. You can use `statsmodels.formula.api` (imported at the beginning) or whichever statistical package you like to get the regression results.
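
To show what a quadratic fit recovers, here is a rough sketch on synthetic, noise-free data. It deliberately uses `np.polyfit` rather than `statsmodels`, and the curve, the variable names, and the coefficient values are all invented for illustration. They say nothing about what the lab data will show.

```python
import numpy as np

# Synthetic data generated from a known inverted-U curve (an assumption for this demo):
# pollution = 5 + 4 * income - 0.5 * income**2
income = np.linspace(0.0, 8.0, 50)
pollution = 5.0 + 4.0 * income - 0.5 * income**2

# np.polyfit returns coefficients from the highest degree to the lowest.
beta_2, beta_1, beta_0 = np.polyfit(income, pollution, deg=2)

# The fitted parabola peaks (or bottoms out) where its slope is zero.
turning_point = -beta_1 / (2.0 * beta_2)
shape = "inverted U" if (beta_1 > 0 and beta_2 < 0) else "not an inverted U"

print(round(beta_1, 3), round(beta_2, 3), round(turning_point, 3), shape)
```

With real data the estimates come with standard errors, so the signs of the coefficients should be read together with their statistical significance, which a regression summary table reports and `np.polyfit` does not.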

<!-- BEGIN QUESTION -->

**Question 5.1:** Fit the following model to our data using a formula like `PM25_mcgpcm ~ GDP_pc + GDP_pc2`, where `GDP_pc2` is the square of `GDP_pc`, and print your regression results.

In [None]:
ekc_reg = ...
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 5.2:** Explain the meaning of the coefficient estimates (excluding the intercept) in terms of changes in `y` in response to changes in `x`, as we did in lecture. Do you observe a U-shaped or an inverted U-shaped relationship for the EKC?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Congratulations!** You're done with EnvEcon 105 Lab 03!

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)