# SLU1 - Pandas 101: Exercise notebook

In [1]:
import pandas as pd

In this notebook the following is tested:

- Pandas Series
- Pandas DataFrames
- Adding columns to a dataframe
- Printing the columns
- Load a dataset
- Preview a dataframe
- Make use of info, describe and shape

## Useful information

|    Country    | Population |   Capital  | Area (km^2) | Population Density (/km^2) | Main Religion | GDP (B) | Has Sovereign State? |
|:-------------:|:----------:|:----------:|:-----------:|:--------------------------:|:-------------:|:-------:|:--------------------:|
|    Denmark    |   5724456  | Copenhagen |    43094    |            129.5           |    Lutheran   |  306.7  |          No          |
|    Finland    |   5498211  |  Helsinki  |    338145   |            16.2            |    Lutheran   |  236.8  |          No          |
|    Iceland    |   335878   |  Reykjavík |    103000   |             3.2            |    Lutheran   |   20.0  |          No          |
|     Norway    |   5265158  |    Oslo    |    323802   |            16.1            |    Lutheran   |  370.4  |          No          |
|   Greenland   |    56483   |    Nuuk    |   2166086   |            0.028           |    Lutheran   |   2.22  |          Yes         |
| Faroe Islands |    49188   |  Tórshavn  |     1393    |            35.5            |    Lutheran   |   2.77  |          Yes         |

---

## Exercise 1: Series

In this first exercise the goal is to get used to creating series.

#### 1.1) Create a series for the countries

In [2]:
# Create a series with the countries, using the order provided
# in "Useful Information". Don't forget to delete the raise error

# Create a series for the countries
# countries = ...

# YOUR CODE HERE
countries = pd.Series(["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"])

In [3]:
assert isinstance(countries, pd.Series), "Should be of type pd.Series"
assert (countries == ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]).all()

#### 1.2) Create a series for the capitals (using countries as index)

In [4]:
# Create a series with the capitals, using the order provided
# in "Useful Information", with countries as index

# Create a series for the capitals
# capitals = ...

# YOUR CODE HERE
capitals = pd.Series(["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"], index=countries)

In [5]:
assert isinstance(capitals, pd.Series), "Should be of type pd.Series"
assert (capitals == ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"]).all()
assert (capitals.index == ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]).all()
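
Because `capitals` uses the country names as its index, values can be looked up by label as well as by position. The following is a minimal sketch that rebuilds the same series from the table above; the two lookups are illustrative additions and not part of the exercise.

```python
import pandas as pd

countries = pd.Series(["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"])
capitals = pd.Series(
    ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"],
    index=countries,
)

# Label-based lookup uses the index values (the country names)
norway_capital = capitals.loc["Norway"]

# Position-based lookup ignores the labels
first_capital = capitals.iloc[0]
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup explicit.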

---

## Exercise 2: DataFrames

In this exercise the goal is to create a simple DataFrame comprising the information provided in the "Useful Information" section.

#### 2.1) Create a DataFrame with countries/capital/population columns

In [6]:
# Create a dataframe with the columns countries/capital/population.
# nordic_countries is the final dataframe.

# First create lists with the values
countries_list = ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]
capitals_list = ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"]
population_list = [5724456, 5498211, 335878, 5265158, 56483, 49188]

# Then use them to create Series
countries = pd.Series(countries_list)
capitals = pd.Series(capitals_list)
population = pd.Series(population_list)

# Finally create the DataFrame
nordic_countries = pd.DataFrame(dict(countries=countries, capital=capitals, population=population))


In [7]:
assert isinstance(nordic_countries, pd.DataFrame), "Should be of type pd.DataFrame"
assert (nordic_countries["capital"] == ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"]).all()
assert (nordic_countries["countries"] == ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]).all()
assert (nordic_countries["population"] == [5724456, 5498211, 335878, 5265158, 56483, 49188]).all()

The expected output is a dataframe with columns named "capital", "countries" and "population", indexed from 0 to 5.

#### 2.2) Create a DataFrame with the capital/population columns, but with countries as row indexes

In [8]:
# Create a dataframe with the columns capital/population
# and with countries as the index.
# You may use previously created lists
# nordic_countries_ind is the final dataframe

# Create the indexed Series
capitals_indexed = pd.Series(capitals_list, index=countries)
population_indexed = pd.Series(population_list, index=countries)

# Create the DataFrame
nordic_countries_ind = pd.DataFrame(dict(capital=capitals_indexed, population=population_indexed))


In [9]:
assert isinstance(nordic_countries_ind, pd.DataFrame), "Should be of type pd.DataFrame"
assert (nordic_countries_ind["capital"] == ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"]).all()
assert (nordic_countries_ind["population"] == [5724456, 5498211, 335878, 5265158, 56483, 49188]).all()
assert (nordic_countries_ind.index == ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]).all()

The expected output is a dataframe with columns named "capital", "population", indexed with the correct countries.
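
An equivalent construction, sketched here as an alternative rather than the required solution, passes plain lists to `pd.DataFrame` together with an `index` argument. This avoids building the intermediate Series. The `iceland` lookup at the end is an illustrative addition.

```python
import pandas as pd

countries_list = ["Denmark", "Finland", "Iceland", "Norway", "Greenland", "Faroe Islands"]
capitals_list = ["Copenhagen", "Helsinki", "Reykjavík", "Oslo", "Nuuk", "Tórshavn"]
population_list = [5724456, 5498211, 335878, 5265158, 56483, 49188]

# The index argument labels the rows with the country names
nordic_countries_ind = pd.DataFrame(
    {"capital": capitals_list, "population": population_list},
    index=countries_list,
)

# A whole row can then be selected by its country label
iceland = nordic_countries_ind.loc["Iceland"]
```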

#### 2.3) Going back to the 2.1 DataFrame, add columns with information concerning the gdp and the main religion (Do not create a dataframe from scratch)

In [10]:
# Using the dataframe created in 2.1) add columns with the gdp
# and main religion information.
# nordic_countries remains a dataframe

# Write lists with gdp and religion information
gdp_list = [306.7, 236.8, 20.0, 370.4, 2.22, 2.77]
religion_list = ["Lutheran"] * 6

# Write them as Series
gdp = pd.Series(gdp_list)
religion = pd.Series(religion_list)

# Add them to the dataframe
nordic_countries["gdp"] = gdp
nordic_countries["mainReligion"] = religion


In [11]:
assert isinstance(nordic_countries, pd.DataFrame), "Should be of type pd.DataFrame"
assert (nordic_countries["gdp"] == [306.7, 236.8, 20.0, 370.4, 2.22, 2.77]).all()
assert (nordic_countries["mainReligion"] == ["Lutheran", "Lutheran", "Lutheran", "Lutheran", "Lutheran", "Lutheran"]).all()
assert set(nordic_countries.columns) == {"capital", "countries", "population", "gdp", "mainReligion"}

The expected output is a dataframe with columns "capital", "countries", "population", "gdp" and "mainReligion", indexed from 0 to 5.
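
Assigning a Series as a new column aligns it on the index rather than by position. This works in 2.3 because both the dataframe and the new Series use the default 0 to 5 index. The sketch below uses a shortened two-country frame (an illustrative assumption) to show what happens with a country-indexed dataframe: a default-indexed Series produces missing values, while a plain list is assigned by position.

```python
import pandas as pd

df = pd.DataFrame(
    {"capital": ["Copenhagen", "Helsinki"]},
    index=["Denmark", "Finland"],
)

# A Series with the default 0..1 index shares no labels with df,
# so every value in the new column ends up missing (NaN)
df["gdp_from_series"] = pd.Series([306.7, 236.8])

# A plain list has no index and is assigned by position
df["gdp_from_list"] = [306.7, 236.8]
```

When adding columns to a dataframe like the one from 2.2, either pass a list or give the Series the same index as the dataframe.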

## Exercise 3: Load the Iris csv data set

In this exercise you will load the iris data set ([source](https://gist.github.com/curran/a08a1080b88344b0c8a7)), one of the standard data sets for learning data analysis. You will then preview it and retrieve some information about it.

#### 3.1) Load the Iris data set

In [12]:
# Load the Iris data set directly from its URL
iris = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# The raw file has no header row, so read it with header=None
iris_df = pd.read_csv(iris, sep=',', header=None)

# Assign the column names
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris_df.columns = attributes


In [13]:
assert isinstance(iris_df, pd.DataFrame), "Should be of type pd.DataFrame"
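
The raw UCI file has no header row, which is why `header=None` is passed and the column names are set afterwards. The following is a minimal offline sketch of the same idea. It reads three rows copied from the data set out of an in-memory string; the `io.StringIO` stand-in and the `names` argument are illustrative choices, not the notebook's method.

```python
import io

import pandas as pd

# Three rows in the same comma-separated, header-less layout as the raw file
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# names= supplies the column labels while reading, instead of assigning them afterwards
sample_df = pd.read_csv(io.StringIO(raw), header=None, names=attributes)
```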

#### 3.2) Get general information about the dataframe

In [14]:
# Print iris_df info

# ...

# YOUR CODE HERE
# info is a method, so it has to be called; it prints the summary itself
iris_df.info()
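
`DataFrame.info` is a method: referring to it without parentheses yields the bound method object, whereas calling it prints a summary of the index, columns, dtypes and memory usage. Below is a small sketch on a hypothetical three-row frame. It captures the summary through the `buf` parameter so that it can be inspected as a string.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "species": ["Iris-setosa"] * 3}
)

# info() writes to stdout by default; buf redirects the text to a buffer
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
```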

#### 3.3) Preview the top 10 entries

In [15]:
# Print the top 10 entries of iris_df

# ...

# YOUR CODE HERE
iris_df.head(n=10)

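|   | sepal_length | sepal_width | petal_length | petal_width |   species   |
|:-:|:------------:|:-----------:|:------------:|:-----------:|:-----------:|
| 0 |      5.1     |     3.5     |      1.4     |     0.2     | Iris-setosa |
| 1 |      4.9     |     3.0     |      1.4     |     0.2     | Iris-setosa |
| 2 |      4.7     |     3.2     |      1.3     |     0.2     | Iris-setosa |
| 3 |      4.6     |     3.1     |      1.5     |     0.2     | Iris-setosa |
| 4 |      5.0     |     3.6     |      1.4     |     0.2     | Iris-setosa |
| 5 |      5.4     |     3.9     |      1.7     |     0.4     | Iris-setosa |
| 6 |      4.6     |     3.4     |      1.4     |     0.3     | Iris-setosa |
| 7 |      5.0     |     3.4     |      1.5     |     0.2     | Iris-setosa |
| 8 |      4.4     |     2.9     |      1.4     |     0.2     | Iris-setosa |
| 9 |      4.9     |     3.1     |      1.5     |     0.1     | Iris-setosa |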


#### 3.4) Preview the bottom 10 entries

In [16]:
# Print the bottom 10 entries of iris_df

# YOUR CODE HERE
iris_df.tail(n=10)

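|     | sepal_length | sepal_width | petal_length | petal_width |     species    |
|:---:|:------------:|:-----------:|:------------:|:-----------:|:--------------:|
| 140 |      6.7     |     3.1     |      5.6     |     2.4     | Iris-virginica |
| 141 |      6.9     |     3.1     |      5.1     |     2.3     | Iris-virginica |
| 142 |      5.8     |     2.7     |      5.1     |     1.9     | Iris-virginica |
| 143 |      6.8     |     3.2     |      5.9     |     2.3     | Iris-virginica |
| 144 |      6.7     |     3.3     |      5.7     |     2.5     | Iris-virginica |
| 145 |      6.7     |     3.0     |      5.2     |     2.3     | Iris-virginica |
| 146 |      6.3     |     2.5     |      5.0     |     1.9     | Iris-virginica |
| 147 |      6.5     |     3.0     |      5.2     |     2.0     | Iris-virginica |
| 148 |      6.2     |     3.4     |      5.4     |     2.3     | Iris-virginica |
| 149 |      5.9     |     3.0     |      5.1     |     1.8     | Iris-virginica |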


#### 3.5) Print the number of rows and the number of columns

In [17]:
# Get the nr_rows and nr_cols from iris_df

nr_rows = iris_df.shape[0]
nr_cols = iris_df.shape[1]
iris_df.shape

(150, 5)

In [18]:
assert nr_rows == 150
assert nr_cols == 5
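
Since `shape` is a `(rows, columns)` tuple, both values can also be unpacked in a single statement. Here is a brief sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Tuple unpacking replaces the two separate index lookups
nr_rows, nr_cols = df.shape
```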

#### 3.6) Create a list with the variable (column) names

In [19]:
# Create a list with the column names

list_of_variables = list(iris_df.columns)
list_of_variables

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

In [20]:
assert "sepal_width" in list_of_variables
assert "species" in list_of_variables
assert "sepal_length" in list_of_variables

The expected output is a list with "sepal_length", "sepal_width", "petal_length", "petal_width" and "species".

#### 3.7) Using describe(), what is the mean value of the petal length?

In [21]:
# Print iris_df describe

iris_df.describe()

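|       | sepal_length | sepal_width | petal_length | petal_width |
|:-----:|:------------:|:-----------:|:------------:|:-----------:|
| count |     150.0    |    150.0    |     150.0    |    150.0    |
|  mean |   5.843333   |    3.054    |   3.758667   |   1.198667  |
|  std  |   0.828066   |   0.433594  |    1.76442   |   0.763161  |
|  min  |      4.3     |     2.0     |      1.0     |     0.1     |
|  25%  |      5.1     |     2.8     |      1.6     |     0.3     |
|  50%  |      5.8     |     3.0     |     4.35     |     1.3     |
|  75%  |      6.4     |     3.3     |      5.1     |     1.8     |
|  max  |      7.9     |     4.4     |      6.9     |     2.5     |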


In [22]:
# Manually input the petal length mean value
# This is just to force the use of the previously obtained information

petal_length_mean = 3.758667


In [23]:
assert 3.7 < petal_length_mean < 3.8
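
The value can also be read from the `describe()` result programmatically, since that result is itself a dataframe indexed by statistic name. The sketch below uses only the first three petal lengths from the preview in 3.3, so the number differs from the mean of the full data set.

```python
import pandas as pd

# First three petal_length values from the preview in 3.3
sample = pd.DataFrame({"petal_length": [1.4, 1.4, 1.3]})

# describe() returns a dataframe whose rows are labelled count, mean, std, ...
stats = sample.describe()
petal_length_mean = stats.loc["mean", "petal_length"]
```

For a single statistic, `sample["petal_length"].mean()` gives the same value without computing the full summary.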