# Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the data from a file. The *skiprows* parameter skips the first *k* lines of the file before parsing starts.
> Check what the first 4 lines of the file contain.

In [None]:
df = pd.read_csv("co2emissions.csv", skiprows = 4)
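To see what *skiprows* does without opening the real file, here is a minimal sketch on in-memory text. The metadata lines and values in `raw` are made up for illustration and are not the actual contents of `co2emissions.csv`.

```python
import io
import pandas as pd

# Hypothetical file content: four metadata lines, then the header and the data.
raw = (
    "Data Source,Some Provider\n"
    "Last Updated,2016-01-01\n"
    "Note,first made-up note\n"
    "Note,second made-up note\n"
    "Country Name,Country Code,2010\n"
    "Canada,CAN,1.0\n"
    "Japan,JPN,2.0\n"
)

# Skip the four metadata lines so that the fifth line becomes the header.
toy = pd.read_csv(io.StringIO(raw), skiprows=4)
print(toy.columns.tolist())
print(toy.shape)
```

Without `skiprows=4`, the parser would treat the first metadata line as the header.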

Checking the type of the **df** object.

In [None]:
type(df)

Reviewing the data and its summary statistics.

In [None]:
df.head(10)

In [None]:
df.shape #(rows, columns)

In [None]:
df.describe()

For non-numeric attributes we can check the frequency distribution of the values.

In [None]:
df['Indicator Code'].value_counts()
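A small sketch of what `value_counts` returns, using a made-up Series rather than the real column. It also shows `nunique`, which is a quick way to spot a column that holds a single distinct value and therefore may carry no information.

```python
import pandas as pd

# Made-up categorical values, for illustration only.
codes = pd.Series(["A", "B", "A", "A"], name="code")

counts = codes.value_counts()  # frequencies, sorted from most to least common
distinct = codes.nunique()     # number of distinct values

print(counts)
print(distinct)
```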

In [None]:
df.columns[[0,1]] # Retrieving the names of the first two columns.

Retrieving rows depending on a condition.

In [None]:
df['Country Name'] == 'Canada' # Boolean Series: True for rows where the country name is Canada, False otherwise

In [None]:
df[df['Country Name'] == 'Canada'] #A subset of data based on a condition

Retrieving data based on several conditions.

In [None]:
df[(df['Country Name'] == 'Canada') | (df['Country Code'] == 'JPN')]

In [None]:
df[(df['Country Name'] == 'Canada') & (df['Country Code'] == 'CAN')]
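The two cells above combine boolean masks element-wise with `|` (or) and `&` (and). Each comparison needs its own parentheses because `|` and `&` bind more tightly than `==` in Python. A minimal sketch on a made-up frame, with the masks named for readability:

```python
import pandas as pd

# Toy frame, for illustration only.
toy = pd.DataFrame({
    "Country Name": ["Canada", "Japan", "France"],
    "Country Code": ["CAN", "JPN", "FRA"],
})

is_canada = toy["Country Name"] == "Canada"
is_japan = toy["Country Code"] == "JPN"

either = toy[is_canada | is_japan]  # rows matching at least one condition
both = toy[is_canada & is_japan]    # rows matching both conditions (none here)

print(either)
print(len(both))
```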

> What does the following line of code do?

In [None]:
df[['Country Name', 'Country Code', '2010']][16:21]

### Task 1. (1 point)
> Find the CO2 emissions per capita for France and Germany for 2010 and 2011. Display only the columns with the country name and CO2 emissions for the corresponding years.

In [None]:
# Your answer here


## Data cleaning

> What problems exist in the analyzed data set?

In [None]:
df.loc[[93,151,174,242]]

1. Some rows represent an aggregation of countries rather than countries (e.g. "World").
1. Some columns are redundant and can be removed (e.g. "Indicator Name").
1. For some years we do not have data for any country (e.g. 2012-2015).
1. Some countries do not have data for any year (e.g. "Taiwan", "China").

In [None]:
# loading detailed data about countries
countries = pd.read_csv("countries_metadata.csv", encoding = "utf-8") 

In [None]:
countries.head(10)

In [None]:
# Combining data with country information
merge = pd.merge(df, countries, on = "Country Code")

In [None]:
merge.head(10)
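Note that `pd.merge` performs an inner join by default, so rows whose key appears in only one of the two frames are silently dropped. The sketch below uses made-up frames (the code `XXX` and the region labels are hypothetical) to compare the default with an outer join that marks where each row came from.

```python
import pandas as pd

# Toy frames, for illustration only.
left = pd.DataFrame({"Country Code": ["CAN", "JPN", "XXX"],
                     "2010": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"Country Code": ["CAN", "JPN", "FRA"],
                      "Region": ["Region 1", "Region 2", "Region 3"]})

# Default: how="inner", only keys present in both frames survive.
inner = pd.merge(left, right, on="Country Code")

# Outer join with an indicator column to see unmatched rows.
outer = pd.merge(left, right, on="Country Code", how="outer", indicator=True)

print(sorted(inner["Country Code"]))
print(outer[["Country Code", "_merge"]])
```

Comparing `df.shape` with `merge.shape` after the real merge is a quick check on whether any rows were lost.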

We remove rows that do not represent countries.

**Note:** The value in the *Region* column is *NaN* only when the row does not represent a country.

In [None]:
merge = merge[pd.notnull(merge['Region'])]
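The same filter can also be written with `dropna` restricted to one column. A minimal sketch on a made-up frame, showing that both forms keep the same rows:

```python
import numpy as np
import pandas as pd

# Toy frame, for illustration only: the aggregate row has no region.
toy = pd.DataFrame({
    "Country Name": ["Canada", "World"],
    "Region": ["Some region", np.nan],
})

kept_mask = toy[pd.notnull(toy["Region"])]    # boolean-mask form used above
kept_dropna = toy.dropna(subset=["Region"])   # equivalent dropna form

print(kept_mask)
```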

Removing irrelevant columns.

In [None]:
merge.columns

In [None]:
merge = merge.drop(merge.columns[[60,61,64,65]], axis=1) # Note: these are zero-based column positions
merge = merge.drop('Indicator Name', axis=1)
merge = merge.drop('Indicator Code', axis=1)

In [None]:
merge.columns
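Dropping by position depends on the exact column order, so it can break if the input files change. As a sketch of the alternative, the made-up frame below shows that dropping by name gives the same result as dropping by position; the column layout here is hypothetical and much smaller than the real one.

```python
import pandas as pd

# Toy frame, for illustration only.
toy = pd.DataFrame({
    "Country Name": ["Canada"],
    "Indicator Name": ["x"],
    "Indicator Code": ["y"],
    "2010": [1.0],
})

by_position = toy.drop(toy.columns[[1, 2]], axis=1)
by_name = toy.drop(columns=["Indicator Name", "Indicator Code"])

print(by_name.columns.tolist())
```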

Removing columns without data.

In [None]:
merge.count()

In [None]:
merge['2015']

In [None]:
merge = merge.drop(['2012', '2013', '2014', '2015'], axis=1)

In [None]:
merge.count()
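`count()` returns the number of non-missing values per column, so a count of 0 marks a column with no data at all. Instead of listing such columns by hand, one option is `dropna(axis=1, how="all")`. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame, for illustration only: two of the three columns are empty.
toy = pd.DataFrame({
    "2011": [1.0, 2.0],
    "2012": [np.nan, np.nan],
    "2013": [np.nan, np.nan],
})

non_missing = toy.count()                 # 0 for the empty columns
trimmed = toy.dropna(axis=1, how="all")   # drop columns where every value is NaN

print(non_missing)
print(trimmed.columns.tolist())
```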

Removing countries for which there is no data.

In [None]:
merge.mean(axis=1, numeric_only=True) 

In [None]:
merge = merge[pd.notnull(merge.mean(axis=1, numeric_only=True))]

In [None]:
merge.head()
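The filter above works because the row mean skips missing values and is NaN only when every numeric value in the row is missing. It averages all numeric columns, so it assumes the year columns are the only numeric ones. The sketch below, on made-up data, compares it with a `dropna` call that names the year columns explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame, for illustration only: country "B" has no data for any year.
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "2010": [1.0, np.nan],
    "2011": [3.0, np.nan],
})

row_means = toy.mean(axis=1, numeric_only=True)  # NaN where the whole row is missing
kept = toy[row_means.notna()]

# Alternative: drop rows where all listed year columns are NaN.
year_cols = ["2010", "2011"]
kept_alt = toy.dropna(subset=year_cols, how="all")

print(row_means)
print(kept)
```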

### Task 2. (1 point)
> Find the countries with the highest CO2 emissions per capita in 2011. What can you say about them? Did anything surprise you? Why do you think these particular countries had the highest CO2 emissions per capita in 2011?

> **Tip:** Use sorting and the *head* function.

In [None]:
# Your answer here


### Task 3. (2 points)
> Draw a plot showing how average CO2 emissions per capita (averaged over all countries) changed over the years 1960-2011. What is the general trend? What can you say about the shape of the curve?

In [None]:
# Your answer here


### Task 4. (1 point)
> Which income groups had the lowest CO2 emissions per capita in 2010? Why do you think that is?

In [None]:
# Your answer here
