# Chapter 2: Data Ingestion & Variables

In [None]:
%reset
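# Note: this only defines a Python variable; to affect parsing, pass low_memory=False to pd.read_csv
low_memory=False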
import numpy as np
import pandas as pd

## 2.1 Introduction & Motivation

In data processing, before we can perform any analysis, it's crucial to understand how to manipulate and explore data efficiently. Pandas is a powerful Python library that allows us to work with data in the form of DataFrames, which are essentially tables of data. In this notebook, we'll learn how to load datasets into pandas, access and index specific rows and columns, and perform basic operations on the data. These skills are fundamental for cleaning and preparing data.

## 2.2 Problem Setting

To make things a bit more concrete, we are going to explore this by looking at some of the world's most beloved creatures: Pokemon! We will be loading a dataset full of different kinds of pokemon and seeing what we can do with it.

## 2.3 Data Ingestion

As said before, pandas is the way to go when processing datasets. But how exactly? Well, first of all we have to make sure we can read our data and load it into memory. When we give pandas just a file name, it will go looking in the directory our notebook is running in. This means that if we want to open a file in a separate directory, we have to specify the path to it!

In [None]:
pokemon = pd.read_csv("Pokemon.csv", sep=";", index_col=0)
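As a minimal sketch of reading from another directory, the cell below builds a throwaway folder with a tiny semicolon-separated file and reads it by path. The folder name `data`, the file name `example.csv` and its contents are assumptions made up for this illustration; they are not part of the Pokemon dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical layout: a "data" subdirectory holding a small toy file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    csv_path = data_dir / "example.csv"
    csv_path.write_text("#;Name;HP\n1;Bulbasaur;45\n2;Ivysaur;60\n")

    # Pass the path to the file instead of only its name.
    example = pd.read_csv(csv_path, sep=";", index_col=0)

print(example)
```

The same idea applies to a plain string such as `"data/example.csv"`: pandas resolves a relative path against the directory the notebook is running in.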

We have our data loaded in memory, but we have no way of viewing it yet. We can do so by having a look at the "head" of the data, which shows the first few rows.

In [None]:
pokemon.head()

Beautiful! We can even check exactly what datatype pandas has stored our data in:

In [None]:
type(pokemon)

Remember how pandas had two different main datatypes? One was called a **DataFrame** and contains the entire dataset. The other was called a **Series** and contains just a single column of data.

In [None]:
type(pokemon["Name"])

Once everything is loaded in, we can gain some initial insights into the data. For example, we can see just how many records a dataset contains by checking the length.

In [None]:
len(pokemon)

You can also get a list of all the column names your dataset contains.

In [None]:
pokemon.columns

### Question 1: How can we see exactly how many records and columns/dimensions a dataset has in one single line of code?

## 2.4 Basic data selection, subsetting and slicing

Now that we have been able to read the data and perform some initial analysis, we can start querying the data a bit! Luckily pandas makes this super easy, as it uses a syntax similar to Python lists and dictionaries. To get a single column from a dataframe, simply put the name of the column you want in square brackets.

In [None]:
pokemon["Attack"]

If you wish to select multiple columns, simply pass them along within the square brackets. Notice that you need to add double brackets!

In [None]:
pokemon[["Attack", "HP"]]
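To see why the double brackets matter, here is a small sketch on a toy dataframe (the three rows below are made up for illustration and stand in for the real dataset). A single label returns a Series, while a list of labels returns a DataFrame, even if the list holds only one name.

```python
import pandas as pd

# Toy stand-in for the pokemon dataframe.
toy = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
        "HP": [45, 60, 80],
        "Attack": [49, 62, 82],
    }
)

single = toy["Attack"]    # one column label -> Series
double = toy[["Attack"]]  # list with one label -> DataFrame with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```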

You can also be more specific and select a certain value from a certain column using the 'loc' indexer, which looks up rows and columns by their labels.

In [None]:
pokemon.loc[4,"Name"]

If you don't know the name of the column, or just want to use integer positions, you can make use of the 'iloc' indexer.

In [None]:
pokemon.iloc[4,0]
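The difference between the two becomes visible as soon as the row labels do not start at 0. The sketch below uses a toy dataframe with made-up row labels 1 to 3 (an assumption for illustration) to show that 'loc' follows labels while 'iloc' counts positions from 0.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur", "Venusaur"], "HP": [45, 60, 80]},
    index=[1, 2, 3],  # row labels start at 1, positions start at 0
)

by_label = toy.loc[2, "Name"]  # the row labelled 2
by_position = toy.iloc[2, 0]   # the third row, first column

print(by_label)
print(by_position)
```

Here the same number 2 points at two different rows, which is worth keeping in mind whenever the index comes from a column of the file.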

### Question 2: Based on the slicing options you saw last lesson, how can you show 'Type 1' of only the first 100 pokemon?

You can even apply the slicing to the column names!

In [None]:
pokemon.loc[:,'Name':'Type 2']
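One detail that is easy to miss: a label slice with 'loc' includes its end label, whereas a position slice with 'iloc' stops before its end position, just like regular Python slicing. A minimal sketch on a toy dataframe (rows made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Ivysaur"],
        "Type 1": ["Grass", "Grass"],
        "Type 2": ["Poison", "Poison"],
        "HP": [45, 60],
    }
)

by_label = toy.loc[:, "Name":"Type 2"]  # end label "Type 2" is included
by_position = toy.iloc[:, 0:2]          # end position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```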

One last important thing about indexing you need to know is that not only can you specify certain columns/records, but you can also use comparisons and logical operators to further narrow down the data you're selecting.

In [None]:
pokemon["Attack"] < 50

Huh? We were expecting a list of all pokemon where the attack was below 50, but it seems we got a list of booleans instead. Actually, this is to be expected, as that is exactly what we asked for! The code above asks the same question for each 'Attack' record: is this value under 50? It then returns the result, a boolean, for each record as a Series. Luckily, we can use this Series as a **mask** to retrieve all pokemon with an attack below 50.

In [None]:
pokemon[pokemon["Attack"] < 50]

Alternatively, we can specify the column we want as an attribute.

In [None]:
pokemon[pokemon.Attack < 50]

If we wish to combine multiple conditions, we can easily do so by making use of the numpy 'logical_and' and 'logical_or' functions.

In [None]:
pokemon.loc[np.logical_and(pokemon["HP"] > 70, pokemon["Attack"] < 200)]
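As a side note, pandas also accepts the operators `&` (and) and `|` (or) between boolean masks, as long as each comparison is wrapped in parentheses. The sketch below checks on a toy dataframe (rows and thresholds chosen only for illustration) that this gives the same selection as 'logical_and'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
        "HP": [45, 60, 80, 39],
        "Attack": [49, 62, 82, 52],
    }
)

# Both conditions must hold.
with_numpy = toy[np.logical_and(toy["HP"] > 50, toy["Attack"] < 80)]
with_operator = toy[(toy["HP"] > 50) & (toy["Attack"] < 80)]

# At least one condition must hold.
either = toy[(toy["HP"] > 70) | (toy["Attack"] < 50)]

print(with_operator)
print(either)
```

The parentheses are needed because `&` and `|` bind more tightly than `<` and `>` in Python.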

By now you can see how this can quickly get rather complicated when combining everything!

### Question 3: How many pokemons are Legendary?

### Question 4: What is the name of the first flying (Type 2) Pokemon which has a Speed of 60? Return only the Name of this pokemon.

## 2.5 Merging observations and merging variables

Appending, merging and concatenation are ways to combine dataframes in pandas (closely related to unions and joins in SQL). Going over all the possibilities of the underlying pandas functions would require far too much time, as there are so many ways to combine dataframes. Instead, we will only explore two frequently encountered situations: adding new rows and adding new columns.

Adding a new column is super easy! Simply define the name of the new column and add your values.

In [None]:
liked = True
pokemon["Liked"] = liked
pokemon.head()

Dropping a column is even easier.

In [None]:
del pokemon["Liked"]
pokemon.head()

You can also set the individual values of each record this way. Keep in mind that you need to specify either one value for all records (as seen above) or a value for each record individually!

In [None]:
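# Take a copy so that adding a column does not trigger a warning about modifying a slice of pokemon
pokemonshort = pokemon.iloc[:5].copy()

caught = [True, False, False, False, True]
pokemonshort["Caught"] = caught
pokemonshort.head()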
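To make the "one value or one value per record" rule concrete, here is a small sketch on a toy dataframe (names made up for illustration). A single value is repeated for every row, a list of matching length is assigned row by row, and a list of the wrong length is rejected with a `ValueError`.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur", "Venusaur"]})

toy["Liked"] = True                  # one value, repeated for every row
toy["Caught"] = [True, False, True]  # one value per row

try:
    toy["Seen"] = [True, False]      # only two values for three rows
except ValueError as err:
    print("ValueError:", err)

print(toy)
```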

To demonstrate merging, let's make two subsets of our pokemon dataframe.

In [None]:
poke1 = pokemon.loc[:7, :'Attack']
poke1

In [None]:
poke2 = pokemon.loc[:7, 'Defense':'Speed']
poke2

In [None]:
pd.concat([poke1, poke2])

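This is not exactly what we were looking for... By default, 'concat' stacks the dataframes on top of each other, so every row is missing the columns of the other dataframe and those cells are filled with NaN. To place the dataframes side by side, we still need to specify the axis along which to combine them!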

In [None]:
poke3 = pd.concat([poke1, poke2], axis=1)
poke3
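The effect of the axis is easiest to see in the shapes of the results. The following sketch uses two toy dataframes with the same row labels but different columns (values made up for illustration) and compares both directions.

```python
import pandas as pd

left = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur"], "Attack": [49, 62]}, index=[1, 2]
)
right = pd.DataFrame({"Defense": [49, 63], "Speed": [45, 60]}, index=[1, 2])

stacked = pd.concat([left, right])               # default axis=0: rows are appended
side_by_side = pd.concat([left, right], axis=1)  # axis=1: columns are placed next to each other

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

With `axis=1`, pandas aligns the rows on their index labels, which is why the two halves line up without any missing values here.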

### Question 5: Create a new selection of the pokemon dataset containing the same columns as our dataset above, but for the pokemon with id 10-15. Merge them into poke3.

## 2.6 Variable types

Maintaining a good overview of your datatypes is important when working with a dataset. Luckily, pandas saves the day here once again, as it has a simple way of providing this overview!

In [None]:
pokemon.dtypes

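You can see how each column has its own datatype. The datatype of a column follows from the values it holds, so a column built from a calculation can end up with a different type than its source. For example, adding 0.00 to the 'Attack' column produces a column of floats.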

In [None]:
pokemon["Attack2"] = pokemon["Attack"]+0.00
pokemon.dtypes
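If you prefer to state the target type explicitly instead of relying on a calculation, the `astype` method is one option. A minimal sketch on a toy column (values made up for illustration; the new column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Attack": [49, 62, 82]})

toy["Attack_float"] = toy["Attack"].astype("float64")  # integers -> floats
toy["Attack_text"] = toy["Attack"].astype(str)         # integers -> strings

print(toy.dtypes)
```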

## 2.7 Date and time

Any data type to do with dates and times always causes issues. There are just so many ways to represent this, and there are so many different timezones! All of this leads to one giant mess where it can be extremely hard to compare two times to each other.

In order to create some peace in the chaos, pandas provides us with a few tools. In general, it tries to convert any date/time value into a 'Timestamp' object.

In [None]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S', utc = True)

If we want to attach a timezone to a time that has none, we can specify it with the 'tz_localize' method.

In [None]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S').tz_localize(tz = "Europe/Brussels")

As said before, there are a lot of timezones. You can get a handy list of the ones available here: https://gist.github.com/heyalexej/8bf688fd67d7199be4a1682b3eec7568

If you have loaded your time in a certain timezone and wish to convert it to another timezone, the 'astimezone' method provides the solution (for pandas timestamps it behaves the same as 'tz_convert').

In [None]:
pd.to_datetime('2025-02-17 09:10:12', format='%Y-%m-%d %H:%M:%S').tz_localize(tz = "Europe/Brussels").astimezone("UTC")
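To make the difference between attaching and converting a timezone explicit, here is a short sketch using the same timestamp as above. It uses 'tz_convert', the pandas counterpart of 'astimezone'; the variable names are chosen only for this illustration.

```python
import pandas as pd

naive = pd.to_datetime("2025-02-17 09:10:12", format="%Y-%m-%d %H:%M:%S")

# tz_localize keeps the clock time and declares which timezone it is in.
brussels = naive.tz_localize("Europe/Brussels")

# tz_convert keeps the moment in time and changes the clock it is shown on.
utc = brussels.tz_convert("UTC")

print(naive)
print(brussels)
print(utc)
```

Brussels is one hour ahead of UTC in February, so the clock time moves from 09:10 to 08:10 while both objects still describe the same instant.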

Now you may be wondering what all those percentage symbols are. They describe the format in which your date and time are written, so pandas knows how to read them. Once again, there are a whole lot of different options. A good cheat sheet can be found here: https://strftime.org/

In [None]:
pd.to_datetime('9:10 17/2/2025', format='%H:%M %d/%m/%Y', utc = True)

You can even use this notation to get the date as an actual string!

In [None]:
pd.to_datetime('9:10 17/2/2025', format='%H:%M %d/%m/%Y', utc = True).strftime("%B %d, pancake %Y")

When you combine it all, you start seeing exactly why dates can be such a mess, especially knowing that each data source you use will most likely have a different format.

In [None]:
pd.to_datetime('22:10 17/2/2025', format='%H:%M %d/%m/%Y').tz_localize(tz="US/Central").astimezone("UTC").strftime("%B %d, %Y")
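In practice you will usually convert a whole column rather than a single value. The sketch below applies the same format string to a toy Series of strings (values made up for illustration) and uses `errors="coerce"` so that unreadable entries become `NaT` instead of stopping the conversion.

```python
import pandas as pd

raw = pd.Series(["9:10 17/2/2025", "22:45 3/11/2024", "not a date"])

# Parse every entry with one format; entries that do not match become NaT.
parsed = pd.to_datetime(raw, format="%H:%M %d/%m/%Y", errors="coerce")

# The .dt accessor applies datetime operations to the whole column at once.
iso_dates = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(iso_dates)
```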