# 11. NumPy, Pandas, Matplotlib

In this notebook, you will be introduced to the three essential Python libraries for data science: **NumPy**, **pandas**, and **Matplotlib**. NumPy provides fast and efficient operations on numerical data, especially in arrays and matrices. pandas builds on NumPy to offer powerful tools for working with structured data, such as tables, with convenient ways to clean, transform, and analyze it. Matplotlib is used to create visualizations, allowing us to explore and present data graphically.

A library is a collection of pre-written code that provides additional functionality to Python. You have already seen how to use the Python standard library through import, but in practice, many libraries are installed separately using tools such as `pip` or `conda`. If you created your conda environment with the yml file, you should already have NumPy, pandas, and Matplotlib installed.

Previously, we performed a small data analysis using only pure Python. While this is possible, it often requires writing a lot of code for simple, repetitive tasks. With NumPy, pandas, and Matplotlib, these tasks become shorter, faster, and easier to read. In a typical workflow, pandas is used to store and manipulate data, NumPy handles the underlying numerical operations, and Matplotlib creates visualizations from the results.

## 11.1 Introduction

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In W3 you learned the `import` statement and how to rename imported modules, as in `import math as m`. The imports above work the same way: we give each library a short name (an alias) so we can refer to it more easily in our code.

- `np` is the common alias for NumPy
- `pd` is the common alias for pandas
- `plt` is the common alias for Matplotlib’s plotting module

These short names are not required, but they are standard in almost all Python code for data science, so you should use them to make your code easier for others to read and understand.

Now let's see a single line of code from each:

In [1]:
# here is a numpy array
temperatures = np.array([22, 19, 23, 25, 20])
temperatures


In [None]:
# here is a pandas dataframe, based on the numpy array
df = pd.DataFrame(temperatures,
                  columns=['Temperature (°C)'],
                  index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
df

In [2]:
# here is a matplotlib plot, based on the pandas dataframe
plt.plot(df.index, df['Temperature (°C)'])


## 11.2 Motivation

Now let’s replicate some of the tasks we previously did with pure Python, this time using these libraries to highlight their advantages. The goal here is not to understand every detail yet, but to see how tasks that took tens of lines in pure Python can often be done in a single line. Tutorials will follow in the next section.

Remember the following problem from `7-dictionaries.ipynb`:

You are given the following dataset, which contains monthly average temperatures for 3 Dutch cities.

|City|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|
|--|--|--|--|--|--|--|--|--|--|--|--|--|
|Amsterdam|4.8|4.71|7.06|10.45|13.49|16.71|18.85|19.23|16.54|13.04|8.94|6.46|
|Breda|3.8|4.5|8.22|12.92|16.14|19.73|21.72|21.33|17.51|13.06|8.25|5.21|
|Eindhoven|3.6|4.27|7.8|12.26|15.31|18.72|20.61|20.24|16.62|12.39|7.83|4.94|

let's use a pandas dataframe, `df`, to store this data

In [3]:
data = {
    'Amsterdam': [4.8,4.71,7.06,10.45,13.49,16.71,18.85,19.23,16.54,13.04,8.94,6.46],
    'Breda': [3.8,4.5,8.22,12.92,16.14,19.73,21.72,21.33,17.51,13.06,8.25,5.21],
    'Eindhoven': [3.6,4.27,7.8,12.26,15.31,18.72,20.61,20.24,16.62,12.39,7.83,4.94]
    }

df = pd.DataFrame(data).T
df.columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

df


obtaining the average yearly temperature of every city is as simple as:

In [4]:
city_averages = df.mean(axis=1)
city_averages


obtaining the average temperature of every month is as simple as:

In [5]:
monthly_averages = df.mean(axis=0)
monthly_averages


In [6]:
# coldest city?
city_averages.idxmin()


In [7]:
# coldest month?
monthly_averages.idxmin()


and now, let's plot the data

In [8]:
plt.plot(df.loc['Amsterdam'])
plt.plot(df.loc['Breda'])
plt.plot(df.loc['Eindhoven'])


let's make it a bit more lovely:

In [9]:
plt.figure(dpi=200) # increase resolution
plt.plot(df.loc['Amsterdam'], label='Amsterdam') # add labels for the legend
plt.plot(df.loc['Breda'], label='Breda')
plt.plot(df.loc['Eindhoven'], label='Eindhoven')
plt.legend() # plot the legend
plt.ylabel("Temperature (C)") # add axis labels
plt.xlabel("Month")


Now, let's replicate `10-data-analysis.ipynb`

In [10]:
df = pd.read_csv('data.csv', sep=";")
df


In [11]:
df['value'] = df['value'].str.replace(">", "")
df['value'] = df['value'].str.replace("<", "")
df['value'] = df['value'].astype('int')
df


In [12]:
df['value'].mean()


And let's also do the final exercise:

**Exercise 10.9**

Find the geographical areas with proportion of population with primary reliance on clean fuels and technologies for cooking (%) less than or equal to 5 %.

Expected result

`['Rwanda',
 'Malawi',
 'Nigeria',
 'Guinea',
 'Niger',
 'Myanmar',
 'Sierra Leone',
 'South Sudan',
 'Madagascar',
 'Chad']`

In [None]:
df[df['value'] <= 5]

compare these to the functions you wrote in notebooks 7 and 10.

Hopefully this gives you a glimpse of how powerful these libraries can be compared to pure Python, and we haven’t even scratched the surface yet! Now let's take a step back and follow a short tutorial on each.

## 11.3 NumPy tutorial

The array (`numpy.ndarray`) is the central data structure of the NumPy library.

In [13]:
arr = np.array([1, 2, 3])
arr


In [14]:
type(arr)


NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogeneous.

Why use NumPy?

NumPy arrays are faster and more compact than Python lists. Because every element has the same type, an array stores its data in one contiguous block and uses much less memory than a list of separate Python objects. NumPy also lets you specify the data type (`dtype`) explicitly, which allows the code to be optimized even further.
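
The memory claim can be checked roughly with a sketch like the one below. It assumes a list of one million integers (an arbitrary size chosen for illustration) and compares the list's approximate footprint with the array's `nbytes`. Exact numbers depend on your Python version and platform.

```python
import sys
import numpy as np

n = 1_000_000  # arbitrary size chosen for illustration
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # 8 bytes per element

# a list stores references to separate int objects, so count both parts
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_arr.nbytes  # size of the raw data buffer only

print(f"list:  ~{list_bytes / 1e6:.1f} MB")
print(f"array: ~{array_bytes / 1e6:.1f} MB")

# choosing a smaller dtype shrinks the array further
small_arr = np_arr.astype(np.int32)  # 4 bytes per element
print(f"int32 array: ~{small_arr.nbytes / 1e6:.1f} MB")
```

Picking a smaller `dtype` only makes sense when all your values fit in it, so treat the last step as an option rather than a default.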

### 11.3.1 Array basics

**Exercise 11.1**

Create numpy arrays with the following types `int`, `bool`, `str`, `float`

In [17]:
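int_array = np.array([1, 2, 3, 4, 5])
bool_array = np.array([True, False, True])
str_array = np.array(['a', 'b', 'c'])
float_array = np.array([1.5, 2.5, 3.5])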

**Exercise 11.2**

All of the elements in a NumPy array should be homogeneous. Try creating a heterogeneous array and see what happens. Did you get an error?

In [18]:
import numpy as np

arr = np.array([1, 2.5, 'hello', True])
print(arr)
print(arr.dtype)

['1' '2.5' 'hello' 'True']
<U32


Let's go back to our initial array:

In [19]:
arr = np.array([1, 2, 3])
arr

array([1, 2, 3])

Many operations you use on Python lists can also be applied to NumPy arrays

In [20]:
arr[0] # index

np.int64(1)

In [21]:
arr[0:2] # slice

array([1, 2])

In [None]:
len(arr) # get length

an array can be 1D, 2D, 3D and so on

In [22]:
# a 2D array can be created with a nested list
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

In [23]:
# 2 rows, 3 columns
arr.shape

(2, 3)

In [24]:
# 2 dimensions
arr.ndim

2

In [25]:
# 6 elements in the array
arr.size

6

In [26]:
# contains integers
arr.dtype

dtype('int64')

**Exercise 11.3**

Create a 2D array with floats and of shape (3, 5), check its attributes `dtype`, `shape`, `size` and `ndim` to verify you did it correctly.


In [27]:
import numpy as np

arr = np.array([[1.1, 2.2, 3.3, 4.4, 5.5],
                [6.6, 7.7, 8.8, 9.9, 10.1],
                [11.1, 12.2, 13.3, 14.4, 15.5]])

print(arr)
print("dtype:", arr.dtype)
print("shape:", arr.shape)
print("size:", arr.size)
print("ndim:", arr.ndim)


[[ 1.1  2.2  3.3  4.4  5.5]
 [ 6.6  7.7  8.8  9.9 10.1]
 [11.1 12.2 13.3 14.4 15.5]]
dtype: float64
shape: (3, 5)
size: 15
ndim: 2


### 11.3.2 Creating arrays

`np.array()` is not the only way to create arrays

In [28]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [29]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [30]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

In [32]:
np.linspace(0, 100, 3)

array([  0.,  50., 100.])

In [33]:
np.random.randint(0, 100, size=10)

array([34, 42, 26,  8, 98, 71, 76, 67, 69, 50], dtype=int32)

you can set `dtype` while creating arrays

In [None]:
arr = np.array([1, 2, 3])
arr.dtype

In [34]:
arr = np.array([1, 2, 3], dtype='float')
arr.dtype

dtype('float64')

In [35]:
# or change dtype after creating the array
arr = np.array([1, 2, 3])
print(arr.dtype)
arr = arr.astype('float')
print(arr.dtype)

int64
float64


**Exercise 11.4**

Create a 2D array of shape (200, 300)

In [36]:
import numpy as np

arr = np.zeros((200, 300))
print(arr.shape)


(200, 300)


**Exercise 11.5**

Create a 3D array

In [38]:
import numpy as np

arr = np.random.rand(3, 4, 2)
print(arr)
print("shape:", arr.shape)


[[[0.02088364 0.13042408]
  [0.13718377 0.67558861]
  [0.56585859 0.97117067]
  [0.24399685 0.32193907]]

 [[0.26883562 0.70136383]
  [0.87732113 0.90489437]
  [0.71586411 0.99850348]
  [0.60211534 0.40142268]]

 [[0.07355542 0.2157464 ]
  [0.98851637 0.47662198]
  [0.03654219 0.16470604]
  [0.89172952 0.57098454]]]
shape: (3, 4, 2)


### 11.3.3 Reshaping arrays

In [None]:
arr = np.arange(6)
arr

In [None]:
arr.reshape(2, 3)

In [None]:
arr.reshape(3, 2)

**Exercise 11.6**

Create an array of shape (3, 4) and reshape it to (6, 2)

In [40]:
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr)

reshaped = arr.reshape(6, 2)
print(reshaped)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]


**Exercise 11.7**

Create an array of shape (3, 4) and reshape it to (5, 5). You will get an error. Why?

In [41]:
import numpy as np

arr = np.arange(12).reshape(3, 4)
arr = arr.reshape(5, 5)


ValueError: cannot reshape array of size 12 into shape (5,5)

This means the total number of elements must stay the same: a (3, 4) array has 3 × 4 = 12 elements, while a (5, 5) array needs 25.
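
The rule can be checked before reshaping with a small sketch like this one. The list of candidate shapes is made up for illustration, and the `-1` argument is an extra NumPy feature not covered in this section: it asks NumPy to work out one dimension for you.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

for new_shape in [(6, 2), (2, 6), (5, 5)]:
    # a reshape is only valid when the element counts match
    fits = arr.size == np.prod(new_shape)
    print(new_shape, "ok" if fits else "size mismatch")

# -1 asks NumPy to infer that dimension from the total size
inferred = arr.reshape(-1, 2)
print(inferred.shape)
```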

### 11.3.4 Understanding the shape of a 1D array

In [None]:
arr = np.arange(6)
arr

In [None]:
arr.shape

Does this mean we have one column and six rows, or one row and six columns?

Neither. It's a 1D array with only one axis, so rows and columns don't apply.

we can create (1, 6) and (6, 1) shaped arrays as follows:

In [None]:
arr_2D_row = np.array([[0, 1, 2, 3, 4, 5]])
arr_2D_row.shape

In [None]:
arr_2D_col = np.array([[0], [1], [2], [3], [4], [5]])
arr_2D_col.shape

In [None]:
arr_2D_row

In [None]:
arr_2D_col

`arr`, `arr_2D_row`, `arr_2D_col` are all related but not the same.

You can create one from the other by reshaping

In [None]:
arr = np.arange(6)
print(arr.shape)
arr = arr.reshape(1, 6)
print(arr.shape)
arr = arr.reshape(6, 1)
print(arr.shape)

### 11.3.5 Indexing, slicing, filtering arrays

Let's start with a 2D array

In [None]:
arr = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr

You can get any single item by row and column indices. For example the item in the first row and third column is `3`. You can obtain that item by:

In [None]:
arr[0, 2]

Let's say you would like to get the first two items in the second row, this is how you can do it:

In [None]:
arr[1, 0:2]

In [None]:
# first row
arr[0, :]

In [None]:
# first column
arr[:, 0]

**Exercise 11.8**

For the array below, return the following

- Last row (i.e. [9, 10, 11, 12])
- Second column (i.e. [2, 6, 10])
- Last 3 elements of the last row (i.e. [10, 11, 12])

In [None]:
arr = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr

In [None]:
# YOUR CODE HERE

Now let's look at filtering using our `arr`.

In [None]:
arr = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr

This is how you can get all the numbers in your array greater than 5:

In [None]:
# step one: create a boolean mask
arr_mask = arr > 5
arr_mask

In [None]:
# step two: use the mask for indexing
arr[arr_mask]

you can combine the steps as follows:

In [None]:
arr[arr>5]

It is essential you understand this operation because <mark>you will use this concept a lot!</mark> So let me repeat:

`arr[arr > 5]` uses **Boolean indexing** in NumPy: first, `arr > 5` compares each element of `arr` with 5 and produces a Boolean array of the same shape, with `True` where the element is greater than 5 and `False` otherwise; then, using this Boolean array inside the square brackets tells NumPy to return only the elements where the corresponding Boolean value is `True`, effectively filtering the array based on the condition.
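
As a small extension of the same idea, several conditions can be combined into one mask. The sketch below reuses the array from this section; the particular thresholds are arbitrary choices for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# & means "and", | means "or", ~ means "not"; the parentheses are required
between = arr[(arr > 5) & (arr < 10)]
outside = arr[(arr <= 2) | (arr >= 11)]
not_even = arr[~(arr % 2 == 0)]

print(between)
print(outside)
print(not_even)
```

Python's `and` and `or` keywords do not work element by element on arrays, which is why the symbols `&` and `|` are used here.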

You can filter one array based on another array as well.
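
A minimal sketch of that idea, using made-up data (the `cities` and `temps` arrays below are hypothetical and only serve as an illustration): the mask is built from one array and then used to index the other, which works because both arrays have the same length and order.

```python
import numpy as np

# hypothetical data, made up for this illustration
cities = np.array(["Amsterdam", "Breda", "Eindhoven", "Utrecht"])
temps = np.array([4.8, 3.8, 3.6, 4.1])

# the mask comes from temps, but it is used to index cities
mild_cities = cities[temps > 4.0]
print(mild_cities)
```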

**Exercise 11.9**

Given the two arrays below, find the adults (aged 18 or over). Do it in a single line.

Expected output:

`array(['Sanne', 'Daan', 'Joris', 'Lotte'])`

In [None]:
# Ages of people in a survey
ages = np.array([15, 22, 35, 17, 40, 29])

# Names corresponding to those ages
names = np.array(["Bram", "Sanne", "Daan", "Femke", "Joris", "Lotte"])

In [None]:
# YOUR CODE HERE

### 11.3.6 Array operations

In [None]:
arr_1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr_1

In [None]:
arr_2 = np.array([[9, 10, 11, 12], [13, 14, 15, 16]])
arr_2

In [None]:
# add a number to all elements
arr_1 + 1

In [None]:
# multiply an array by a number
arr_1 * -1

In [None]:
# add two arrays together
arr_1 + arr_2

In [None]:
# mean of an array
np.mean(arr_1)

In [None]:
# mean of all columns
np.mean(arr_1, axis=0)

In [None]:
# mean of all rows
np.mean(arr_1, axis=1)

In [None]:
# find unique values
arr = np.array([1, 1, 5, 5, 5])
np.unique(arr)

In [None]:
# together with the counts
unique_values, counts = np.unique(arr, return_counts=True)
print(unique_values, counts)

**Exercise 11.10**

Find the sum of each column in `arr_2`

Expected output:

`array([22, 24, 26, 28])`

In [None]:
arr_2 = np.array([[9, 10, 11, 12], [13, 14, 15, 16]])
arr_2

In [42]:
# arr_2 is the array defined in the cell above
column_sums = np.sum(arr_2, axis=0)
print(column_sums)


[22 24 26 28]


### 11.3.7 Math and NumPy


In this block, you are taking the course **[Foundational Mathematics (Y1A2)](https://ai.buas.nl/curriculum/foundational-mathematics.html)**. In the next block, you will take **[Mathematical Foundations for Machine Learning (Y1B2)](https://ai.buas.nl/curriculum/mathematical-foundations-for-machine-learning.html)**.

Many of the concepts you have learned, and will learn, can be done directly with **NumPy**. This library allows you to apply mathematical ideas to arrays of numbers efficiently, without writing long loops. Here are some examples:

In trigonometry, you might calculate `sin(θ)` or `cos(θ)` for a single angle. With NumPy, you can do this for an entire array of angles at once:

In [None]:
angles = np.array([0, np.pi/6, np.pi/2])
np.sin(angles)

Let's say you would like to find $2x^2 + 3x + 1$ for $x = 5$:

In [None]:
coeffs = [2, 3, 1]
x = 5
np.polyval(coeffs, x)

you can do this for multiple x values at once

In [None]:
coeffs = [2, 3, 1]
x = np.array([5, 0, -2])
np.polyval(coeffs, x)

You will learn about the dot product, the fundamental operation of linear algebra as well as neural networks:

$$
\begin{bmatrix}1 & 2 \\ 3 & 4\end{bmatrix}
\cdot
\begin{bmatrix}5 \\ 6\end{bmatrix}
=
\begin{bmatrix}1 \times 5 + 2 \times 6 \\ 3 \times 5 + 4 \times 6\end{bmatrix}
=
\begin{bmatrix}17 \\ 39\end{bmatrix}
$$

this can be done with NumPy as follows:

In [None]:
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

np.dot(A, b)
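
For comparison, here is a short sketch showing that the `@` operator gives the same result as `np.dot` for this example, next to the row-by-row arithmetic from the formula. The variable names `by_function`, `by_operator`, and `by_hand` are only illustrative.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

by_function = np.dot(A, b)
by_operator = A @ b  # matrix-multiplication operator

# the same result computed row by row, as in the formula
by_hand = np.array([1 * 5 + 2 * 6, 3 * 5 + 4 * 6])

print(by_function, by_operator, by_hand)
```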

The focus of this tutorial is NumPy, not teaching mathematics. However, from now on, when you learn new math concepts, also practice them in NumPy. This will both strengthen your understanding of the math and help you learn how to perform these operations in NumPy, which you will use frequently.

Take a look at the functions available:

https://numpy.org/doc/stable/reference/routines.math.html

https://numpy.org/doc/stable/reference/routines.linalg.html


**Exercise 11.11**

Find the base 10 logarithm of the numbers given below.

Expected answer:

`array([1.     , 2.     , 2.69897, 3.     ])`

In [43]:
import numpy as np

arr = np.array([10, 100, 500, 1000])
log_values = np.log10(arr)
print(log_values)


[1.      2.      2.69897 3.     ]


### 11.3.8 Vectorization

Vectorization in NumPy means performing operations on entire arrays (vectors, matrices) at once, instead of using loops to process one element at a time. NumPy does this by running optimized code written in C under the hood, which is much faster than pure Python loops. <mark>Learning and applying this concept can be the difference between your code running in 1 hour versus 1 minute.</mark>

Let's calculate the Body Mass Index (BMI) of 10 million people to demonstrate the concept. BMI is calculated as

$$
\text{BMI} = \frac{\text{weight (kg)}}{\left[\text{height (m)}\right]^2}
$$

In [None]:
# Generate data
import random
n = 10000000
heights_m = [random.uniform(1.5, 2.0) for _ in range(n)]
weights_kg = [random.uniform(50, 100) for _ in range(n)]

In [None]:
bmis = []
for i in range(n):
  bmi = weights_kg[i] / (heights_m[i] ** 2)
  bmis.append(bmi)

let's do the same with numpy

In [None]:
heights_np = np.array(heights_m)
weights_np = np.array(weights_kg)

In [None]:
bmis_from_numpy = weights_np / (heights_np ** 2)

The vectorized version is so fast that the difference is hard to notice by eye. We can measure the time using the `time` module from the Python Standard Library and compare.

In [None]:
import time

In [None]:
start = time.time()
bmis = []
for i in range(n):
  bmi = weights_kg[i] / (heights_m[i] ** 2)
  bmis.append(bmi)
end = time.time()

In [None]:
without_vectorization = end - start
without_vectorization

In [None]:
start = time.time()
bmis_from_numpy = weights_np / (heights_np ** 2)
end = time.time()

In [None]:
with_vectorization = end - start

In [None]:
print(f"Without vectorization the calculation took {without_vectorization:.4f} seconds.")
print(f"With vectorization the calculation took {with_vectorization:.4f} seconds.")
print(f"{without_vectorization/with_vectorization:.2f} times faster!")
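
A single measurement can be noisy, so here is one possible refinement, offered as a sketch rather than the only way to time code. It assumes a smaller `n` than above so that it runs quickly, uses `time.perf_counter()` (a clock intended for measuring short durations), and keeps the fastest of several runs. The helper names `bmi_loop`, `bmi_vectorized`, and `best_time` are made up for this example.

```python
import random
import time
import numpy as np

n = 200_000  # smaller than the 10 million above so the sketch runs quickly
heights_m = [random.uniform(1.5, 2.0) for _ in range(n)]
weights_kg = [random.uniform(50, 100) for _ in range(n)]
heights_np = np.array(heights_m)
weights_np = np.array(weights_kg)

def bmi_loop():
    # pure Python: one element at a time
    return [w / (h ** 2) for w, h in zip(weights_kg, heights_m)]

def bmi_vectorized():
    # NumPy: the whole array at once
    return weights_np / (heights_np ** 2)

def best_time(func, repeats=5):
    # run several times and keep the fastest run to reduce noise
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

loop_time = best_time(bmi_loop)
vector_time = best_time(bmi_vectorized)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")

# both approaches should give the same numbers
print(np.allclose(bmi_loop(), bmi_vectorized()))
```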

**Exercise 11.12**

The code below calculates the discounted prices of 10 products by applying a corresponding percentage discount to each original price; vectorize it with NumPy to run it in a single line.

In [None]:
prices = np.array([10.50, 1.35, 3.99, 0.50, 4.15, 20, 13.45, 21.10, 100, 70]) # euros
discounts = np.array([10, 5, 10, 20, 5, 10, 10, 10, 5, 20]) # % discount

In [None]:
discounted_prices = []
for i, price in enumerate(prices):
    discounted_price = prices[i] - prices[i] * discounts[i] / 100
    discounted_prices.append(discounted_price)
discounted_prices = np.array(discounted_prices)
discounted_prices

In [None]:
# YOUR CODE HERE

That is it for now.

Knowing NumPy is not about memorizing every function or method. It’s about understanding what tasks can be done with it and how to do them efficiently. We won’t cover every method or module here, so it’s important to get comfortable reading the documentation and learning features as you need them. As a data scientist, NumPy will be an essential part of your work, so invest the time to learn it well.

## 11.4 pandas tutorial

pandas builds on top of NumPy, using its fast and efficient array operations as the foundation. It adds powerful tools for working with structured, labeled data such as tables from spreadsheets, databases, or CSV files. Understanding NumPy will help you grasp how pandas stores and processes data, but pandas will take you further by making it easier to clean, explore, and analyze real-world datasets.

pandas provides two main classes for handling data:

1. `Series`: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

2. `DataFrame`: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
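
The relationship between the two classes and NumPy can be seen in a short sketch. The heights and weights below are hypothetical values made up for illustration: selecting one column of a `DataFrame` gives a `Series`, and `.to_numpy()` returns the values as a NumPy array.

```python
import numpy as np
import pandas as pd

# hypothetical values, made up for this illustration
df = pd.DataFrame({"Height (m)": [1.62, 1.80, 1.75],
                   "Weight (kg)": [55.0, 82.5, 70.0]})

column = df["Height (m)"]   # one column of a DataFrame is a Series
values = column.to_numpy()  # the column's values as a NumPy array

print(type(df).__name__)
print(type(column).__name__)
print(type(values).__name__, values.dtype)
```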

### 11.4.1 Series and DataFrame basics

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

In [None]:
type(s)

In [None]:
# let's generate data again as we did for the BMI calculation above
n = 100
heights_m = [random.uniform(1.5, 2.0) for _ in range(n)]

# create df
df = pd.DataFrame(heights_m, columns=['Height (m)'])
df

In [None]:
type(df)

we can do this with multiple columns by passing a dictionary

In [None]:
n = 100
heights_m = [random.uniform(1.5, 2.0) for _ in range(n)]
weights_kg = [random.uniform(50, 100) for _ in range(n)]

# create df
df = pd.DataFrame({'Height (m)':heights_m, 'Weight (kg)':weights_kg})
df

**Exercise 11.13**

Create a `Series` and a `DataFrame`

In [None]:
# YOUR CODE HERE

the following methods can help you understand what any `df` contains quickly

In [None]:
df.head() # first 5 items

In [None]:
df.tail() # last 5 items

In [None]:
df.describe() # stats per column

In [None]:
df.info() # overall info

**Exercise 11.14**

Apply `.head()`, `.tail()`, `.describe()`, `.info()` to the `df` you created in the previous exercise.

In [46]:
print(df.head())
print(df.tail())
print(df.describe())
print(df.info())


    Name  Age  Score
0   Carl   22     88
1  Jacob   30     94
2  Peter   35     95
3  David   41     80
4   Koen   45     85
    Name  Age  Score
0   Carl   22     88
1  Jacob   30     94
2  Peter   35     95
3  David   41     80
4   Koen   45     85
             Age      Score
count   5.000000   5.000000
mean   34.600000  88.400000
std     9.071935   6.268971
min    22.000000  80.000000
25%    30.000000  85.000000
50%    35.000000  88.000000
75%    41.000000  94.000000
max    45.000000  95.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   Score   5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 252.0+ bytes
None


### 11.4.2 Indexing, slicing, filtering DataFrames

In NumPy we use indices to access elements; in pandas we have two options:

1. `iloc`, integer based indexing like numpy
2. `loc`, label based indexing

the best way to understand is to see these in action

In [None]:
df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [40, 50, 60],
    'C': [70, 80, 90]
}, index=['x', 'y', 'z'])
df

I can access 80 as follows

In [None]:
df.iloc[1, 2]

In [None]:
df.loc['y', 'C']

**Exercise 11.15**

For the DataFrame given below, get `60` using both `iloc` and `loc`.

In [47]:
df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [40, 50, 60],
    'C': [70, 80, 90]
}, index=['x', 'y', 'z'])
df

    A   B   C
x  10  40  70
y  20  50  80
z  30  60  90


In [48]:
print(df.iloc[2, 1])
print(df.loc['z', 'B'])


60
60


slicing with `iloc` works very much like numpy

In [None]:
df.iloc[0:2, 0:2]

but slicing with `loc` is counterintuitive because the end label is included, so be very careful!

In [None]:
df.loc['x':'z', 'A':'C']

this is especially a problem when row labels are numbers, which is very common

In [None]:
df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [40, 50, 60],
    'C': [70, 80, 90]
})
df

let's slice with `0:2` using `iloc` and `loc`

In [None]:
df.iloc[0:2]

In [None]:
df.loc[0:2]

Now let’s make things a bit more confusing 🥲 Let's add more rows to our df and drop the second row. Dropping rows is common in data analysis when certain rows are corrupted, empty, or simply not relevant to the analysis.

In [None]:
df = pd.DataFrame({
    'A': [10, 20, 30, 31, 32],
    'B': [40, 50, 60, 61, 62],
    'C': [70, 80, 90, 91, 92]
})
df

In [None]:
df = df.drop(1)
df

let's use `iloc[2]` and `loc[2]` and see what happens

In [None]:
df.iloc[2]

In [None]:
df.loc[2]

we got different results 🥲 but why why why? Because:

- the `2` in `iloc` is the integer index which means the 3rd row, independent of the row label
- while the `2` in `loc` is the row label, independent of the row position
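
If you want labels and positions to line up again after dropping rows, one option (an addition to this tutorial, not something you are required to do) is `reset_index(drop=True)`. A minimal sketch with a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 31, 32]})
df = df.drop(1)  # labels are now 0, 2, 3, 4

# reset_index(drop=True) renumbers the labels 0, 1, 2, ...
# and discards the old labels instead of keeping them as a column
df_reset = df.reset_index(drop=True)

print(df_reset.index.tolist())
print(df_reset.loc[2, "A"], df_reset.iloc[2, 0])
```

After the reset, `loc[2]` and `iloc[2]` point at the same row again. Only do this when the old labels carry no meaning you need later.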

**Exercise 11.16**

What would happen if you apply `df.iloc[1]` and `df.loc[1]` to the `df` given below? First try to guess and then run the code to see if you guessed correctly.

In [49]:
df = pd.DataFrame({
    'A': [10, 20, 30, 31, 32],
    'B': [40, 50, 60, 61, 62],
    'C': [70, 80, 90, 91, 92]
})
df = df.drop(1)
df

    A   B   C
0  10  40  70
2  30  60  90
3  31  61  91
4  32  62  92


In [51]:
# guess before running the code
df.iloc[1]

A    30
B    60
C    90
Name: 2, dtype: int64

In [52]:
# guess before running the code
df.loc[1]

KeyError: 1

This might be confusing at first, but understanding it will give you a proper foundation in pandas, making it easier to build new knowledge on top. The rest of the concepts are intuitive.

In [None]:
df['A'] # get a column

In [None]:
# Select all rows where the value in column 'A' is greater than 30
df[df['A'] > 30]
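
Filters can also combine several conditions, just like Boolean masks in NumPy. The sketch below uses a small two-column DataFrame with arbitrary thresholds chosen for illustration, and also shows that `.loc` accepts a mask together with a column label.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 31, 32],
                   "B": [40, 50, 60, 61, 62]})

# each condition needs its own parentheses; & is "and", | is "or"
both = df[(df["A"] > 20) & (df["B"] < 62)]
print(both)

# .loc accepts the same kind of mask and can select a column at the same time
only_b = df.loc[df["A"] > 30, "B"]
print(only_b.tolist())
```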

### 11.4.3 Operations

Operations work like they do in NumPy. Let's replicate the BMI calculation and save it as a new column.

In [None]:
n = 100
heights_m = [random.uniform(1.5, 2.0) for _ in range(n)]
weights_kg = [random.uniform(50, 100) for _ in range(n)]

# create df
df = pd.DataFrame({'Height (m)':heights_m, 'Weight (kg)':weights_kg})
df

In [None]:
# that's it
df['BMI'] = df['Weight (kg)'] / (df['Height (m)']**2)
df

The vectorization idea works exactly the same in pandas. Instead of iterating over values, express your operations column-wise or row-wise to make them faster.

In [None]:
df.mean(axis=0) # mean of all columns

In [None]:
df.mean(axis=1) # mean of all rows

**Exercise 11.17**

convert the `Height (m)` column to cm and save it as a new column `Height (cm)`

In [None]:
# YOUR CODE HERE

### 11.4.4 Working with missing data

One way or another, you will encounter DataFrames with missing data. The most common representation is `NaN` (Not a Number, available in NumPy as `np.nan`), but missing values can also appear as `None`, `pd.NA`, `NaT`, and others. The `isna()` method in pandas checks whether a value is considered missing, so it's important to understand exactly what pandas treats as missing. For example, the string `'nan'` or `'NaN'` is just text and is **not** considered missing: `NaN` is a special floating-point value, not merely a word.


In [None]:
print(pd.isna(np.nan)) # not a number
print(pd.isna(pd.NA)) # Not Available
print(pd.isna(None))
print(pd.isna(pd.NaT)) # Not a Time

In [None]:
print(pd.isna('nan')) # 'nan' (string) → just text, not a real NaN value
print(pd.isna('NaN')) # 'NaN' (string) → also just text, not a real NaN
print(pd.isna(False))  # Boolean False → a valid value, not missing
print(pd.isna(0)) # Number 0 → valid numeric value, not missing
print(pd.isna('')) # Empty string → valid string, not considered missing by default
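
In practice, text placeholders often need to be turned into real missing values before `isna()` can find them. One possible approach, shown here as a sketch on a hypothetical column (the values in `raw` are made up), is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# hypothetical raw column where missing entries were typed as text
raw = pd.Series(["1.5", "unknown", "3", ""])

print(raw.isna().tolist())  # text placeholders are not detected as missing

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().tolist())
```

This only suits columns that are meant to be numeric; for text columns you would need a different strategy.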


now that you know what is considered missing and what is not, let's see the operations you can do on a `df` with missing values

In [None]:
arr = np.array([[np.nan, 1, 5, 10],
               [34, np.nan, 0, 0],
               [4, np.nan, np.nan, 0],
               [10, 10, 10, 0]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
df

In [None]:
df.isna() # Boolean mask of missing values

In [None]:
df.isna().sum() # Count NaNs per column

In [None]:
df.isna().sum(axis=1) # Count NaNs per row

In [None]:
df.isna().any() # Columns that contain at least one NaN (True/False)

In [None]:
df.columns[df.isna().any()] # Column names that contain NaN

In [None]:
df[df.isna().any(axis=1)]  # Rows that contain any NaN

In [None]:
np.argwhere(np.isnan(df.to_numpy()))  # array of (row, col) indices

In [None]:
df.dropna() # Drop rows with ANY NaN

In [None]:
df.dropna(axis=1) # Drop columns with ANY NaN

In [None]:
df.dropna(how='all') # Drop rows where ALL values are NaN

In [None]:
df.dropna(thresh=3) # Keep rows with at least 3 non-NaNs

In [None]:
df.fillna(-99) # Replace all NaNs with a number

In [None]:
df.fillna(df.mean(numeric_only=True)) # Column-wise mean imputation

if you would like to learn more, please check the documentation: https://pandas.pydata.org/docs/user_guide/missing_data.html

### 11.4.5 String methods

A pandas dataframe can contain strings and we can apply the string methods to the columns

In [None]:
df = pd.DataFrame([[1, 2, 'hey'], [2, 3, 'hola']], columns=['A', 'B', 'C'])
df

In [None]:
# you can see that the dtype is an object
# which is common for columns with text
df.info()

In [None]:
# here is how you can apply string methods to the whole column
df['C'].str.capitalize()

### 11.4.6 Merging dataframes

There are many situations where you need to combine data from different DataFrames. Common cases include:

- Adding rows: stacking DataFrames vertically.

- Adding columns: joining DataFrames side by side.

- Merging by key: bringing in one or more columns from another DataFrame by matching values in a shared column.

In [None]:
# vertical stack
df1 = pd.DataFrame({'id':[1,2], 'score':[80,85]})
df2 = pd.DataFrame({'id':[3,4], 'score':[90,88]})

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2], axis=0, ignore_index=True)

In [None]:
# horizontal stack
left  = pd.DataFrame({'id':[1,2,3]}, index=[10,11,12])
right = pd.DataFrame({'age':[21,19,25]}, index=[10,11,12])

In [None]:
left

In [None]:
right

In [None]:
pd.concat([left, right], axis=1)

In [None]:
# merge
students = pd.DataFrame({'id':[1,2,3], 'name':['Ana','Bo','Cam']})
scores   = pd.DataFrame({'id':[2,3,4], 'score':[85,90,70]})

In [None]:
students

In [None]:
scores

In [None]:
pd.merge(students, scores, on='id', how='left')
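
The `how` argument decides which rows survive the merge. The sketch below reuses `students` and `scores` to compare three options, and uses `indicator=True` (an extra `pd.merge` parameter not covered above) to show where each row came from.

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Bo", "Cam"]})
scores = pd.DataFrame({"id": [2, 3, 4], "score": [85, 90, 70]})

row_counts = {}
for how in ["left", "inner", "outer"]:
    merged = pd.merge(students, scores, on="id", how=how)
    row_counts[how] = len(merged)
    print(how, "->", sorted(merged["id"].tolist()))

# indicator=True adds a _merge column showing where each row came from
checked = pd.merge(students, scores, on="id", how="outer", indicator=True)
print(checked)
```

A `left` merge keeps every student, `inner` keeps only ids present in both tables, and `outer` keeps every id from either table, filling the gaps with `NaN`.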

### 11.4.7 Loading and saving DataFrames


In [None]:
pd.read_csv('data.csv', delimiter=';')

You can read many other file formats with functions named `pd.read_<format>()`; for example, for Excel you can use `pd.read_excel()`.

To save, use the matching `df.to_<format>()` method.

In [None]:
df = pd.DataFrame([[1, 2, 'hey'], [2, 3, 'hola']], columns=['A', 'B', 'C'])
df

In [None]:
df.to_csv('test.csv')
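
By default `to_csv` also writes the row labels as an extra first column. The sketch below shows the difference that `index=False` makes. It writes to in-memory buffers (`io.StringIO`) instead of real files purely to keep the example self-contained; with a file you would pass a filename as in the cell above.

```python
import io
import pandas as pd

df = pd.DataFrame([[1, 2, "hey"], [2, 3, "hola"]], columns=["A", "B", "C"])

# in-memory buffers stand in for files, for illustration only
with_index = io.StringIO()
df.to_csv(with_index)  # default: the row labels are written too
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())

# reading the second version back gives the original table
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.equals(df))
```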

In [None]:
# you will get an error if you don't have the openpyxl module in your conda environment
# if so, follow the instructions below
try:
    df.to_excel('text.xlsx')
except Exception as e:
    print("An error occurred:", e)

Installing `openpyxl` in your conda environment

We will use the `openpyxl` library to work with Excel (`.xlsx`) files in pandas.

1. **Open your terminal** (Anaconda Prompt on Windows, Terminal app on macOS/Linux).
2. **Activate your course environment**
    ```bash
    conda activate y1a-2025-26
    ```
3. **Install `openpyxl`**
    ```bash
    conda install openpyxl
    ```

After the installation, run `df.to_excel('text.xlsx')` again and check the Excel file.


### 11.4.8 .apply()

In the **string methods** section, you saw how we can use built-in functions directly on a column. In the **operations** section, you learned how to perform column-wise or row-wise calculations with operators like `+`, `-`, or `**2`. But what if we need an operation that is *not* built in, or we want to define our own?

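That’s where **`.apply()`** comes in. It allows us to pass any function, either one we write ourselves or an existing one. On a single column (a `Series`) the function is applied to each value; on a `DataFrame` it is applied to each column or, with `axis=1`, to each row.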

Let's repeat the processing of the `value` column in `data.csv` using `.apply()`

In [None]:
df = pd.read_csv('data.csv', sep=";")
df

In [None]:
def process_value_col(x):
    x = x.replace("<", "").replace(">", "")
    return int(x)

In [None]:
df['value'] = df['value'].apply(process_value_col)
df

If you would like to apply a function to every element of a whole DataFrame, use [.map()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) (called `.applymap()` in pandas versions before 2.1).
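
To make the difference between value-wise and row-wise use concrete, here is a small sketch. The data, the `row_bmi` and `bmi_category` helpers, and the threshold of 25 are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# hypothetical values, made up for this illustration
df = pd.DataFrame({"Height (m)": [1.60, 1.80, 2.00],
                   "Weight (kg)": [70.0, 65.0, 90.0]})

def row_bmi(row):
    # receives one row at a time (a Series indexed by column name)
    return row["Weight (kg)"] / row["Height (m)"] ** 2

def bmi_category(bmi):
    # receives one value at a time; simplified, illustrative threshold
    return "25 or above" if bmi >= 25 else "below 25"

# DataFrame.apply with axis=1 passes one row at a time
df["BMI"] = df.apply(row_bmi, axis=1)
# Series.apply passes one value at a time
df["Category"] = df["BMI"].apply(bmi_category)
print(df)
```

For plain arithmetic like the BMI formula, the vectorized column expression from the operations section is shorter and faster; `.apply()` is most useful when the logic cannot be written that way.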

## 11.5 Matplotlib tutorial

We’ll cover Matplotlib briefly; it’s best learned by exploring specific plot types and features as you need them. At its core, a Matplotlib figure is made up of multiple elements, such as a title, axis labels, and the plotted data, all of which can be customized through parameters like color, line style, or font. There’s no need to memorize these options; consult the documentation or ask ChatGPT for guidance when creating a specific type of plot.

<div style="text-align: center;">
  <img src="https://matplotlib.org/stable/_images/anatomy.png" alt="Anatomy of a Matplotlib figure" style="width: 40%;">
</div>

In [None]:
x = np.random.randint(100, size=100)
y = np.random.randint(100, size=100)

In [None]:
plt.scatter(x, y)
plt.xlabel('Name of the x-axis')
plt.ylabel('Name of the y-axis')
plt.title('The title')

you can have multiple plots

In [None]:
fig, ax = plt.subplots(2, 2)
ax[0, 0].scatter(x, y, color='black')
ax[0, 1].scatter(x, y, color='blue')
ax[1, 0].scatter(x, y, color='red')
ax[1, 1].scatter(x, y, alpha=0.1)
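
The `fig, ax` objects returned by `plt.subplots()` can also be used for a single plot. The sketch below is one possible way to build a labeled line plot through the `ax` object, using the first four monthly values for Amsterdam and Breda from the table in section 11.2. The title text is made up for illustration.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
amsterdam = [4.8, 4.71, 7.06, 10.45]  # first four values from the table in 11.2
breda = [3.8, 4.5, 8.22, 12.92]

fig, ax = plt.subplots()  # one figure with a single set of axes
ax.plot(months, amsterdam, label="Amsterdam")
ax.plot(months, breda, label="Breda")
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Monthly average temperature")
ax.legend()
```

Methods on `ax` mirror the `plt` functions used earlier (`ax.set_xlabel` instead of `plt.xlabel`, and so on), which makes it clearer which subplot you are changing when a figure has several.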

Matplotlib supports many types of plots, such as scatter plots, histograms, and more. You can browse the full list here:
https://matplotlib.org/stable/plot_types/index.html

For an overview of what’s possible, check out the Quick Start guide:
https://matplotlib.org/stable/users/explain/quick_start.html