In [245]:
import pandas as pd

data = pd.read_csv('data.csv')

In [247]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   26 non-null     object 
 1   Country Code   24 non-null     object 
 2   Series Name    24 non-null     object 
 3   Series Code    24 non-null     object 
 4   2016 [YR2016]  24 non-null     float64
 5   2017 [YR2017]  24 non-null     float64
 6   2018 [YR2018]  24 non-null     float64
 7   2019 [YR2019]  24 non-null     float64
 8   2020 [YR2020]  24 non-null     object 
dtypes: float64(4), object(5)
memory usage: 2.2+ KB


### Structure Overview
- **Columns:** 9 total: `Country Name`, `Country Code`, `Series Name`, `Series Code`, and five year columns (`2016 [YR2016]` through `2020 [YR2020]`).
- **Rows:** 29 total, but only 24 expected (8 countries × 3 indicators).
- **Missing Values:** 
  - `Country Name`: 26 non-null → 2 more than expected, so some rows are not country records
  - All other columns have 24 non-null values → matches expectation

### Observations
- **Country Name, Country Code, Series Name, Series Code**: Correctly typed as `object`.
- **2016 – 2019**: Correctly typed as `float64`.
- **2020**: Typed as `object`, likely due to presence of empty strings or non-numeric text.
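
One way to confirm which entries keep a column from parsing as numbers is to coerce it and look at what fails. This is a minimal sketch on a hypothetical sample `Series` (the three string values are copied from the `2020 [YR2020]` column of this dataset; the `None` stands in for an empty row), not on the real file:

```python
import pandas as pd

# Hypothetical sample that mimics an object-typed year column
sample = pd.Series(['100', '..', '514772410744.886', None], name='2020 [YR2020]')

# Coerce to numbers; anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(sample, errors='coerce')

# Entries that were present in the raw data but failed to parse
non_numeric = sample[as_number.isna() & sample.notna()]
print(non_numeric.unique())
```

Anything listed by `non_numeric` is text that has to be handled before the column can become `float64`.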

In [250]:
data.head(29)

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020]
0,Argentina,ARG,Access to electricity (% of population),EG.ELC.ACCS.ZS,99.84958,100.0,99.98958,100.0,100
1,Argentina,ARG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,4.201846,4.071308,3.975772,3.74065,..
2,Argentina,ARG,GDP (constant 2015 US$),NY.GDP.MKTP.KD,582376600000.0,598790900000.0,583118100000.0,571304500000.0,514772410744.886
3,Brazil,BRA,Access to electricity (% of population),EG.ELC.ACCS.ZS,99.7,99.8,99.7,99.8,100
4,Brazil,BRA,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,2.168575,2.196418,2.071855,2.057811,..
5,Brazil,BRA,GDP (constant 2015 US$),NY.GDP.MKTP.KD,1743173000000.0,1766233000000.0,1797737000000.0,1819683000000.0,1749103394213.21
6,Chile,CHL,Access to electricity (% of population),EG.ELC.ACCS.ZS,100.0,99.7,100.0,100.0,100
7,Chile,CHL,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,4.74983,4.71402,4.624338,4.821118,..
8,Chile,CHL,GDP (constant 2015 US$),NY.GDP.MKTP.KD,246747700000.0,250097800000.0,260076800000.0,262080800000.0,246412987238.941
9,Paraguay,PRY,Access to electricity (% of population),EG.ELC.ACCS.ZS,98.4,99.3,99.6,99.7,100


### Observations
1. `Country Name` and `Country Code` are redundant — keep one.
2. `Series Name` and `Series Code` are also redundant — keep one.
3. In `2020 [YR2020]`, all CO₂ emissions values are the placeholder `..` instead of numbers, while the other indicators hold numeric values. This mix of text and numbers is why **2020** is typed as `object`.
4. Indicators have different scales:
   - Electricity access: max ~100
   - CO₂ emissions: small values (single digit)
   - GDP: large absolute values → **may require standardization**
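
If standardization turns out to be needed, one option is a z-score computed within each indicator, so GDP and CO₂ end up on comparable scales. A minimal sketch, assuming a long layout with a hypothetical `value` column and shortened indicator labels (`CO2`, `GDP`); the numbers are the 2016 values for Argentina, Brazil, and Chile:

```python
import pandas as pd

# Hypothetical long-format sample: one row per country and indicator
sample = pd.DataFrame({
    'Series Name': ['CO2', 'CO2', 'CO2', 'GDP', 'GDP', 'GDP'],
    'value': [4.201846, 2.168575, 4.74983,
              582376600000.0, 1743173000000.0, 246747700000.0],
})

# z-score within each indicator: subtract the group mean, divide by the group std
grouped = sample.groupby('Series Name')['value']
sample['z'] = (sample['value'] - grouped.transform('mean')) / grouped.transform('std')
print(sample)
```

After this step each indicator has mean 0 and standard deviation 1, regardless of its original unit.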

In [253]:
data.describe()

Unnamed: 0,2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019]
count,24.0,24.0,24.0,24.0
mean,124750700000.0,127011500000.0,128602300000.0,129340600000.0
std,368267800000.0,373607000000.0,378974200000.0,382645100000.0
min,1.059329,1.17372,1.217642,1.165425
25%,3.754891,3.627642,3.569208,3.370855
50%,99.2,99.5,99.65,99.75
75%,41861300000.0,43443850000.0,44460650000.0,44386020000.0
max,1743173000000.0,1766233000000.0,1797737000000.0,1819683000000.0


### Observations
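1. `.describe()` is not informative here: each year column mixes three indicators with different units, so the mean and quartiles are meaningless, and `2020` is left out because it is typed as `object`.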
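
A summary per indicator is easier to interpret. The sketch below groups by `Series Name` before calling `.describe()`; it runs on a small hypothetical sample built from the 2016 values of three countries, so the numbers are only illustrative:

```python
import pandas as pd

# Hypothetical sample: two indicators, three countries, one year column
sample = pd.DataFrame({
    'Series Name': ['Access to electricity (% of population)',
                    'CO2 emissions (metric tons per capita)'] * 3,
    '2016 [YR2016]': [99.84958, 4.201846, 99.7, 2.168575, 100.0, 4.74983],
})

# One row of summary statistics per indicator instead of one mixed summary
per_indicator = sample.groupby('Series Name')['2016 [YR2016]'].describe()
print(per_indicator)
```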

---

Next step: filter valid rows, fix `2020` column type, and restructure data.


In [256]:
# First we filter valid rows with Country Name
data['Country Name'].unique()

array(['Argentina', 'Brazil', 'Chile', 'Paraguay', 'Uruguay', 'Bolivia',
       'Peru', 'Ecuador', nan,
       'Data from database: Sustainable Development Goals (SDGs)',
       'Last Updated: 07/22/2022'], dtype=object)

In [258]:
target_countries = ['Argentina', 'Brazil', 'Chile', 'Paraguay', 'Uruguay', 'Bolivia', 'Peru', 'Ecuador']
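# .copy() gives an independent frame, so later column assignments do not act on a slice of the original
data = data[data['Country Name'].isin(target_countries)].copy()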
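
Before discarding rows it can help to look at what the filter removes. A minimal sketch on a hypothetical five-row sample that imitates the footer rows of the export (a blank row plus two metadata lines); `mask`, `kept`, and `dropped` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two country rows followed by export footer rows
sample = pd.DataFrame({
    'Country Name': ['Argentina', 'Brazil', np.nan,
                     'Data from database: Sustainable Development Goals (SDGs)',
                     'Last Updated: 07/22/2022'],
    'Country Code': ['ARG', 'BRA', np.nan, np.nan, np.nan],
})

targets = ['Argentina', 'Brazil']
mask = sample['Country Name'].isin(targets)  # NaN never matches, so blank rows are False

kept = sample[mask].copy()   # independent frame that is safe to modify later
dropped = sample[~mask]      # rows to inspect before discarding them
print(dropped)
```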

In [260]:
# Checking if the cleaning is OK
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24 entries, 0 to 23
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   24 non-null     object 
 1   Country Code   24 non-null     object 
 2   Series Name    24 non-null     object 
 3   Series Code    24 non-null     object 
 4   2016 [YR2016]  24 non-null     float64
 5   2017 [YR2017]  24 non-null     float64
 6   2018 [YR2018]  24 non-null     float64
 7   2019 [YR2019]  24 non-null     float64
 8   2020 [YR2020]  24 non-null     object 
dtypes: float64(4), object(5)
memory usage: 1.9+ KB


Looking good, we shall proceed.

In [263]:
data['Series Name'].unique()

array(['Access to electricity (% of population)',
       'CO2 emissions (metric tons per capita)',
       'GDP (constant 2015 US$)'], dtype=object)

No need to clean, nice. Now we drop some redundant columns.

In [266]:
data = data.drop(columns = ['Country Code', 'Series Code'])

In [268]:
# Checking again
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24 entries, 0 to 23
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   24 non-null     object 
 1   Series Name    24 non-null     object 
 2   2016 [YR2016]  24 non-null     float64
 3   2017 [YR2017]  24 non-null     float64
 4   2018 [YR2018]  24 non-null     float64
 5   2019 [YR2019]  24 non-null     float64
 6   2020 [YR2020]  24 non-null     object 
dtypes: float64(4), object(3)
memory usage: 1.5+ KB


Looking good, yippee. Now we clean `2020`.

In [271]:
data['2020 [YR2020]'] = pd.to_numeric(data['2020 [YR2020]'], errors = 'coerce')

In [273]:
# Checking again
data['2020 [YR2020]'].dtype

dtype('float64')

In [275]:
data['2020 [YR2020]'].isna().sum()

8

Now every `..` placeholder in `2020` is NaN: 8 values, which matches one missing CO₂ emissions figure per country. I choose to keep them as NaN for now.
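
To see where the NaN values sit, the missing flags can be counted per indicator. A minimal sketch using the three Argentina rows as a hypothetical sample; `missing_by_series` is an illustrative name:

```python
import pandas as pd

# Hypothetical sample: the raw 2020 values for Argentina's three indicators
sample = pd.DataFrame({
    'Series Name': ['Access to electricity (% of population)',
                    'CO2 emissions (metric tons per capita)',
                    'GDP (constant 2015 US$)'],
    '2020 [YR2020]': ['100', '..', '514772410744.886'],
})

# Same conversion as above: unparseable text becomes NaN
sample['2020 [YR2020]'] = pd.to_numeric(sample['2020 [YR2020]'], errors='coerce')

# Count the NaN values per indicator
missing_by_series = sample['2020 [YR2020]'].isna().groupby(sample['Series Name']).sum()
print(missing_by_series)
```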

Next I make the data easier to read.

In [298]:
# Step 1: Making it tidy
data_melted = data.melt(id_vars = ['Country Name', 'Series Name'], 
                        value_vars = ['2016 [YR2016]', '2017 [YR2017]', '2018 [YR2018]', '2019 [YR2019]', '2020 [YR2020]'],
                        var_name = 'year', value_name = 'value')

# Step 2: Rearranging
data_final = data_melted.pivot_table(index = ['Country Name', 'year'],
                                     columns = 'Series Name',
                                     values = 'value').reset_index()
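
A note on the reshaping step: `pivot_table` aggregates with the mean by default, so duplicate country/year/indicator combinations would be averaged without any warning. When each combination should appear exactly once, checking for duplicates and using `pivot` makes that assumption explicit. A minimal sketch on a hypothetical four-row sample (`melted`, `keys`, and `wide` are illustrative names):

```python
import pandas as pd

# Hypothetical melted sample: one country, two indicators, two years
melted = pd.DataFrame({
    'Country Name': ['Argentina'] * 4,
    'Series Name': ['Access to electricity (% of population)',
                    'CO2 emissions (metric tons per capita)'] * 2,
    'year': ['2019 [YR2019]', '2019 [YR2019]', '2020 [YR2020]', '2020 [YR2020]'],
    'value': [100.0, 3.74065, 100.0, float('nan')],
})

# Each (country, year, indicator) combination should appear only once
keys = ['Country Name', 'year', 'Series Name']
n_dupes = melted.duplicated(subset=keys).sum()

# pivot raises an error on duplicates instead of averaging them
wide = melted.pivot(index=['Country Name', 'year'],
                    columns='Series Name',
                    values='value').reset_index()
print(n_dupes)
print(wide)
```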

In [300]:
# Check
data_final.head()

Series Name,Country Name,year,Access to electricity (% of population),CO2 emissions (metric tons per capita),GDP (constant 2015 US$)
0,Argentina,2016 [YR2016],99.849579,4.201846,582376600000.0
1,Argentina,2017 [YR2017],100.0,4.071308,598790900000.0
2,Argentina,2018 [YR2018],99.989578,3.975772,583118100000.0
3,Argentina,2019 [YR2019],100.0,3.74065,571304500000.0
4,Argentina,2020 [YR2020],100.0,,514772400000.0
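
The `year` column still carries the original labels such as `2016 [YR2016]`. If plain integers are preferred for sorting or plotting, the first four characters can be converted. This is an optional step that the notebook itself does not perform, sketched on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample of the year labels produced by melt
sample = pd.DataFrame({'year': ['2016 [YR2016]', '2017 [YR2017]', '2020 [YR2020]']})

# Keep the leading four digits and convert them to integers
sample['year'] = sample['year'].str.slice(0, 4).astype(int)
print(sample['year'].tolist())
```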


In [302]:
# Exporting the data
data_final.to_csv('cleaned_data.csv', index = False)
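
A quick round trip can confirm that the export preserves the shape and the missing values. The sketch below writes a hypothetical two-row sample to an in-memory buffer instead of `cleaned_data.csv`, so it does not touch the file system:

```python
import io
import pandas as pd

# Hypothetical sample shaped like the final table
sample = pd.DataFrame({
    'Country Name': ['Argentina', 'Argentina'],
    'year': ['2019 [YR2019]', '2020 [YR2020]'],
    'CO2 emissions (metric tons per capita)': [3.74065, float('nan')],
})

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape)
print(reloaded.isna().sum())
```

NaN is written as an empty field and read back as NaN, so the missing 2020 values survive the export.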