## Data Analysis on World CO2 Data
Drawing on Python and its data science libraries (Pandas, NumPy, and Matplotlib), __we can apply our skills to pressing global challenges__. This case study __examines the historical extent of climate change__.

We will be looking at the dataset provided by the World Bank: [CO2 emissions (metric tons per capita)](https://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=csv).

__But first, we must understand the dataset!__

## Understanding Climate Change and the Dataset


The specific .csv file in the downloaded zip file is titled "API_EN.ATM.CO2E.PC_DS2_en_csv_v2_5995557". We manually rename it to "World CO2 Data" so that it is easier to reference.

__Understanding Climate Change__

People often ask, _Why are CO2 emissions a prime indicator of climate change? Why not other metrics?_

The [United States Environmental Protection Agency](https://www.epa.gov/report-environment/greenhouse-gases) addresses this by highlighting that CO2 emissions presently constitute the largest share of the warming linked to human activities. This allows us to draw parallels between CO2 emissions and their climate effects.

__Understanding the Dataset__

Now that we know CO2 emissions are directly tied to climate change, __let's check out the actual data__. 

__The dataset reports CO2 emissions in metric tons per capita__. This means each value corresponds to the amount of CO2 emitted per person in a given country or region in a given year. In total, __the dataset has 266 rows (countries and regional aggregates) with yearly columns from 1960 to 2022__. It is important to note, however, __that the dataset contains some missing values__; we must address this later using Pandas.

Analyzing this data allows us to discern patterns and trends, revealing which areas contribute more significantly to per capita emissions.

## The Big Question

With the bigger picture in mind, we can now frame some questions that would give us further insight into the climate change problem. The questions we should address are:
1. What are the top 5 countries with the highest CO2 emissions per capita in the most recent year available?
2. What are the bottom 5 countries with the lowest CO2 emissions per capita in the most recent year available?
3. Which countries saw the biggest CO2 emissions change during COVID years?
4. Which countries saw minimal CO2 emissions change during COVID years?
5. Are there significant differences in CO2 emissions per capita between continents?
6. Which countries have consistently maintained high or low CO2 emissions per capita, against the average CO2 emissions per capita, throughout the dataset period?
7. Investigate any recent changes or trends in CO2 emissions per capita. Are there signs of improvement or deterioration in recent years?

__To address the questions above, we can use the aforementioned libraries to gain further insight.__


## Procedures to Perform Data Analysis on the Given Dataset

### Imports

In [1]:
import pandas as pd

### Creating and Cleaning the DataFrame by Reading the .csv Dataset

Cleaning and preprocessing a dataset is crucial for optimal data analysis. In our case, __the CSV file initially contains four rows of comments unrelated to the dataset, disrupting the expected structure__. To remedy this, we can __skip these comment rows when creating a Pandas dataframe__. 

We can use the `pd.read_csv()` function to read the contents of the CSV file into a Pandas DataFrame. To skip the unwanted rows, we pass `skiprows=4`, as in `df = pd.read_csv('your_file.csv', skiprows=4)`.

In [5]:
df = pd.read_csv('World CO2 Data.csv', skiprows=4)

This keeps the leading comment rows out of the DataFrame, so the real header row is parsed as the column names and the data is ready for exploration.
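To see the effect of `skiprows` in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The contents of `raw` (the four note lines, the country names, and the values) are hypothetical stand-ins, not data from the World Bank file.

```python
import io

import pandas as pd

# Hypothetical miniature file: four non-data lines come before the header row.
raw = (
    "note line 1\n"
    "note line 2\n"
    "note line 3\n"
    "note line 4\n"
    "Country Name,Country Code,1990,1991\n"
    "Country A,AAA,1.0,1.5\n"
    "Country B,BBB,,2.0\n"
)

# skiprows=4 discards the first four lines, so line 5 becomes the header.
df_demo = pd.read_csv(io.StringIO(raw), skiprows=4)

print(df_demo.columns.tolist())
print(df_demo)
```

The empty field for Country B in 1990 is read as `NaN`, which is the same kind of missing value that appears in the real dataset.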

__Now, let's take a peek at the DataFrame object, `df`, that we've created.__ To get a good general overview, we will look at it in several ways.

We can use `df.columns` to list all column labels of the DataFrame.

In [6]:
df.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       'Unnamed: 67'],
      dtype='object')

The column headers above are mostly as expected. However, there is one unexpected column header: 'Unnamed: 67'. __This column has no label in the file, so Pandas assigned it a placeholder name__. It's common to encounter such columns when reading CSV files, and __it can be dropped if not needed.__
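A common cause of such a column is a trailing comma at the end of every line in the CSV, which Pandas reads as one extra, empty field. Assuming that is what happens here, the sketch below reproduces the effect on a hypothetical miniature CSV (`raw`, with made-up names and values) and drops the placeholder column by name.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: every line ends with a trailing comma,
# which creates one extra, empty field per row.
raw = (
    "Country Name,1990,1991,\n"
    "Country A,1.0,1.5,\n"
    "Country B,,2.0,\n"
)
demo = pd.read_csv(io.StringIO(raw))
print(demo.columns.tolist())  # the last label is a placeholder

# Drop every placeholder column that Pandas named "Unnamed: <position>".
unnamed = [col for col in demo.columns if col.startswith("Unnamed")]
cleaned = demo.drop(columns=unnamed)
print(cleaned.columns.tolist())
```

On the real DataFrame the same idea would be `df.drop(columns=['Unnamed: 67'])`, although dropping all-empty columns in general also removes it.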

To get a better picture, we can use `df.head()` and `df.tail()` to view the first and last few rows, respectively.

In [7]:
df.head()

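| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | Unnamed: 67 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.013758 | 0.96043 | 0.941337 | 0.933874 | 0.921453 | 0.915294 | 0.79542 | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.283692 | 0.297972 | 0.268359 | 0.281196 | 0.299083 | 0.297564 | 0.223479 | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.493505 | 0.475577 | 0.479775 | 0.465166 | 0.475817 | 0.490837 | 0.46315 | NaN | NaN | NaN |
| 4 | Angola | AGO | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.091497 | 1.125185 | 1.012552 | 0.829723 | 0.755828 | 0.753638 | 0.592743 | NaN | NaN | NaN |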


In [8]:
df.tail()

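| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | Unnamed: 67 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | Kosovo | XKX | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.988347 | 0.47524 | 0.342802 | 0.32237 | 0.368614 | 0.354864 | 0.308515 | NaN | NaN | NaN |
| 263 | South Africa | ZAF | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.191153 | 7.607189 | 7.54459 | 7.683708 | 7.667377 | 7.688908 | 6.687563 | NaN | NaN | NaN |
| 264 | Zambia | ZMB | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.297755 | 0.305055 | 0.316995 | 0.393726 | 0.440527 | 0.414336 | 0.401903 | NaN | NaN | NaN |
| 265 | Zimbabwe | ZWE | CO2 emissions (metric tons per capita) | EN.ATM.CO2E.PC | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.866838 | 0.846962 | 0.723062 | 0.663069 | 0.735435 | 0.663338 | 0.530484 | NaN | NaN | NaN |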


This shows that __some rows have empty values__, meaning that a country or region has no CO2 emissions per capita data for those years. Additionally, __we can suspect that some columns are entirely empty__, meaning that there is no data at all for that specific year.
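One way to confirm this suspicion, rather than relying on the truncated previews, is to count the missing values per year column and per row. The sketch below uses a small hypothetical DataFrame (`demo`, with made-up countries and values) in place of the real `df`; the `year_cols` helper is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one all-empty year and one all-empty country.
demo = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "1960": [np.nan, np.nan, np.nan],
    "1990": [1.0, np.nan, 0.5],
    "1991": [1.5, np.nan, 0.6],
})

# Treat every column whose label is all digits as a year column.
year_cols = [col for col in demo.columns if col.isdigit()]

# Number of missing values in each year column.
missing_per_year = demo[year_cols].isna().sum()

# Number of missing year values in each row.
missing_per_row = demo[year_cols].isna().sum(axis=1)

print(missing_per_year)
print(missing_per_row)
```

A year column whose missing count equals the number of rows is entirely empty, and a row whose missing count equals the number of year columns has no emissions data at all.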

To address this, we look for columns that contain no values at all and for rows that contain no values at all, so that they can be removed. The cells below find both and then drop the empty columns:

In [9]:
# Collect the index labels of rows in which every value is NaN
empty_rows = df.index[df.isnull().all(axis=1)]

In [10]:
# Collect the labels of columns in which every value is NaN
empty_columns = df.columns[df.isnull().all()]

# Drop columns with all NaN values
df = df.drop(empty_columns, axis=1)

__Let's see the effect of the code above.__

In [12]:
df.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '2019', '2020'],
      dtype='object')
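
The all-empty year columns (1960 to 1989, 2021, and 2022) and 'Unnamed: 67' are gone. The row check in `In [9]`, however, tests every column, including `Country Name` and `Country Code`, which are always filled in. It would therefore likely flag no rows, even for entries such as Aruba or Kosovo that appear to have no yearly values in the truncated previews. A minimal sketch of one alternative, restricting the check to the year columns, is shown below on a hypothetical DataFrame (`demo`, `year_cols`, and `cleaned` are illustrative names, and the values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df after the empty columns were dropped.
demo = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    "1990": [1.0, np.nan, 0.5],
    "1991": [1.5, np.nan, np.nan],
})

# Checking all columns finds nothing, because the identifier columns are filled.
all_nan_rows = demo.index[demo.isnull().all(axis=1)]
print(len(all_nan_rows))

# Restrict the check to the year columns and drop rows with no yearly data.
year_cols = [col for col in demo.columns if col.isdigit()]
cleaned = demo.dropna(subset=year_cols, how="all")
print(cleaned["Country Name"].tolist())
```

Country C is kept because it still has one yearly value; only rows with no yearly values at all are removed.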