# Tutorial 4 - Introduction to Pandas

Now that we've learned the basics of programming in Python, we can start learning to work with data. As you can see, we are now working in a different environment than before. This is called a (Jupyter) notebook. You can do the same things as we did before, but you have to click on Run (at the top of your screen) or press Ctrl+Enter to run a cell. Keep in mind that this runs the cell your cursor is currently in. If the code in the cell produces output, this output will appear below the cell. Let's try it below!

In [None]:
a = 5
b = 7

In [None]:
c = a + b
c

## 4.0 Introducing pandas

To do data processing, we will make use of many convenient functions that are in the Python library called `pandas`. A library is typically a set of functions that somebody already wrote, and posted on the internet for other people to use. This is very useful, as we do not want to reinvent the wheel every time we start a new project!

Not all libraries are built into Python, so we might have to download them before we can use them. Pandas is a very common package, and should already have been installed when you installed Jupyter. We just need to import it into our notebook, so that we can actually use it.

Below is a piece of code ```import pandas as pd```. The ```import pandas``` part is the essential part. This tells the notebook that we want to use the library called pandas. We wrote ```as pd``` because every time we want to use a function from pandas, we have to specify that the function we are looking for comes from pandas (e.g., we have to write `pd.function_from_pandas()` instead of just `function_from_pandas()`). Writing out pandas every time would get tedious and messy, so we use this common shortcut.

In [None]:
import pandas as pd

## 4.1 Data in a DataFrame

Over the last three weeks, we learned that lists can be a useful tool for saving data. Let's say we have a nested list that stores some data. We can turn this data into a pandas `DataFrame`, to more easily access the data. A pandas dataframe is a data structure that looks a bit like a table. Conveniently, the pandas library has many available functions to easily manipulate our data, or extract information from it.

Below, we take a list of lists, which already looks like a sort of table in the code cell below. Wouldn't it be nice if we could easily select rows and columns like in a table, too? That's what dataframes do for us! Each list in the nested list `my_friends` will turn into one row in the pandas dataframe. Each of these lists contains five values, and each value will end up in its own column of the dataframe.

In [None]:
my_friends = [['Abby', 25, True, 20, 0],
              ['Sarah', 21, False, 7, 2],
              ['Ben', 23, False, 3, 1],
              ['Henry', 29, False, 1, 1]]

# Turn the nested list into a dataframe!
# Note that we use the pd. to indicate that the DataFrame type comes from the pandas package!
my_friends_df = pd.DataFrame(my_friends, columns = ['name', 'age', 'climber', 'distance_to_house_km', 'num_pets'])

# We can show the dataframe without using print():
my_friends_df

As you can see, we now have a nice looking table containing our data. This is the dataframe. Note that your dataframe has an index on the left side, similar to row numbers in Excel (although we can actually change the index values of a dataframe!).
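
As a minimal sketch of what changing the index can look like, the example below uses a small made-up table (not the full `my_friends` data) and the pandas function `set_index()`. The variable names are just for illustration, and the `.loc` lookup at the end is explained in section 4.3.

```python
import pandas as pd

# Small made-up table, similar to my_friends above
friends = pd.DataFrame([['Abby', 25], ['Sarah', 21], ['Ben', 23]],
                       columns=['name', 'age'])

# By default the index is simply 0, 1, 2, ...
print(friends.index.tolist())

# Use the 'name' column as the index instead (this returns a new dataframe)
friends_by_name = friends.set_index('name')
print(friends_by_name.index.tolist())

# Rows can now be looked up by name instead of by number
ben_age = friends_by_name.loc['Ben', 'age']
print(ben_age)
```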

Generally, we do not have our data in Python lists, but rather in Excel or csv (Comma Separated Values) files. Today we are working with a csv file that contains information about XTC. The file is called xtc_data.csv; you can have a look at the data the way you normally would.

*Note: reading Excel files (.xls or .xlsx) directly sometimes results in unexpected behaviour with column types and differences between localized versions. It is usually safest to first save your data as csv (you can do this in Excel) and then load it into Python.*

Pandas has a built-in function to load a csv file and turn it into a dataframe so we can analyse the data. It can be used like this:

In [None]:
# This line assumes that the csv file is in the same directory as this Python notebook.
# If it is not, move the file or use Google to find out how to point the function to the
# correct location!
df = pd.read_csv('xtc_data.csv')

# Calling .head() on a dataframe shows the first five rows
df.head()

`df.head()` is a useful way to inspect what your data looks like without printing it all. As you can see if you open the file in Excel, we have 498 rows in the data, so printing them all would result in a massive output! If you want to inspect the bottom rows of your dataframe, you can use `df.tail()`. If you want to see the number of rows and columns, you can use `df.shape` (no brackets, because `shape` is an attribute, not a function!).
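
The sketch below tries these three tools on a tiny made-up csv, so that the output is easy to check by eye. The csv text and the column names `sample` and `weight` are assumptions for this example only; `io.StringIO` is used so that no file on disk is needed.

```python
import io
import pandas as pd

# Made-up csv text standing in for a file on disk
csv_text = "sample,weight\nA,250\nB,310\nC,275\nD,290\nE,305\nF,260\nG,280\n"
small_df = pd.read_csv(io.StringIO(csv_text))

first_rows = small_df.head(3)    # first 3 rows instead of the default 5
last_rows = small_df.tail(2)     # last 2 rows
n_rows, n_cols = small_df.shape  # an attribute, so no brackets

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
```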

The XTC dataset comes from the CHAMP project (Collaborative Harmonization of Methods for Profiling of Amphetamine Type Stimulants), which is a project that aimed to create methods for drugs profiling. Using these methods, different seizures of drugs might be linked together.

The data was somewhat altered to make it applicable for this course. It consists of 160 street samples of XTC tablets. Of each street sample, at least two tablets were measured by one of several laboratories. The laboratories cooperating in this research are NBI (National Bureau of Investigation, Finland), LPS (Laboratoire de Police Scientifique de Lyon, France), UNIL (Université de Lausanne, Switzerland) and NFI (Netherlands Forensic Institute). The properties of the tablets measured are the diameter (mm), thickness (mm), weight (mg), and purity (percentage of MDMA).

We can use pandas to calculate simple statistics, such as the mean purity of all pills. We can also check the maximum diameter of all pills, or the date on which most samples were collected.

Let's obtain the mean for all columns:

*Note: we specify `numeric_only=True` because not all the columns in our dataframe are numeric. For example, we cannot take the mean of the Street_sample_name, because these are strings. (You may wonder how pandas knows which columns are numeric and which are not, because we never told it. We can specify column types when loading a file, and if we don't, pandas makes a 'best guess' that is in this case accurate. Always check!)*
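
One way to do this check is sketched below, on a small made-up csv rather than the XTC file. The column names are assumptions for illustration. It uses the `dtypes` attribute to see what pandas guessed, and the `dtype` parameter of `read_csv()` to override the guess for one column.

```python
import io
import pandas as pd

# Made-up csv text; the column names are for illustration only
csv_text = "sample,batch,weight\nA,1,250.5\nB,2,310.0\nC,2,275.2\n"

# Let pandas guess the column types
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed.dtypes)

# Tell pandas explicitly that 'batch' should be read as text
explicit = pd.read_csv(io.StringIO(csv_text), dtype={'batch': str})
print(explicit.dtypes)

# Only numeric columns are included in the result
numeric_means = explicit.mean(numeric_only=True)
print(numeric_means)
```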

In [None]:
df.mean(numeric_only=True)

### Exercise 4.1.1

Using similar syntax to the code cell above, find the maximum, minimum, sum and count of all columns. When do you need to exclude non-numeric columns?

In [None]:
# Your code here

## 4.2 Selecting columns

We might want to inspect only a certain row or column of the dataframe. We can point to our desired rows and columns in various ways, some of which you might recognise from what you learned about lists.

To select a single column, use square brackets and write the name of the column within the brackets. Don't forget the quotes! (Why do you need the quotes? What would happen if you asked Python for `df[Weight]`?)
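
The sketch below shows what square-bracket selection returns, using a small made-up dataframe (not the XTC data). A single quoted name gives one column, while a list of names gives a smaller dataframe.

```python
import pandas as pd

# Small made-up table for illustration
friends = pd.DataFrame([['Abby', 25, 0], ['Sarah', 21, 2], ['Ben', 23, 1]],
                       columns=['name', 'age', 'num_pets'])

# One column name in quotes gives a single column (a pandas Series)
ages = friends['age']
print(type(ages))

# A list of column names gives a smaller dataframe
subset = friends[['name', 'age']]
print(type(subset))
print(subset.shape)
```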

In [None]:
df['Weight']

### Exercise 4.2.1

Before, we calculated the mean for all columns in the dataframe. However, most of the time we might just want the mean over a specific column. Can you combine what you've learned up until now to calculate the mean of the weight column? Hint: we use the same function to calculate the mean, but now apply it to a single column instead of the whole dataframe!

In [None]:
# Your code here

### Exercise 4.2.2

If we're interested in the means of more than one column, we can put the names of the columns of interest in a list and use that list inside the square brackets when selecting. In the code cell below, write one line of code that will give the means of columns Weight and Purity.

In [None]:
# Your code here

## 4.3 Selecting rows

Next, we'll learn how to select only a subset of rows, in this case a subset of XTC samples that were collected. When we select columns, we write the square brackets directly after our dataframe, so for selecting rows we need a different syntax. To select a single row by its index, we use `df.loc[i]` where `i` should be replaced by the index. If you also want to select a subset of columns, you add a comma and then the name(s) of the column(s): `df.loc[5, 'Weight']` will return the value in the Weight column of the row with row index 5. `df.loc[[3, 10], ['Purity', 'Diameter']]` will return a dataframe with 2 rows and 2 columns (we passed a list for both the rows and columns to select!).
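
The different forms of `.loc` can be sketched on a small made-up dataframe as follows. The last line also shows a slice of row labels, which is an addition to the forms described above; note that with `.loc` both ends of the slice are included, unlike with list slicing.

```python
import pandas as pd

# Small made-up table for illustration
friends = pd.DataFrame([['Abby', 25, 0], ['Sarah', 21, 2],
                        ['Ben', 23, 1], ['Henry', 29, 1]],
                       columns=['name', 'age', 'num_pets'])

row_1 = friends.loc[1]                             # one row, returned as a Series
sarah_age = friends.loc[1, 'age']                  # one single value
two_by_two = friends.loc[[0, 3], ['name', 'age']]  # 2 rows and 2 columns
middle = friends.loc[1:2, 'name']                  # slice: includes both 1 and 2

print(row_1)
print(sarah_age)
print(two_by_two)
print(middle)
```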

### Exercise 4.3.1

Select all rows with a row index of 100 or larger, and columns Diameter and Thickness. Investigate the shape of this dataframe; it should have 398 rows and 2 columns.

In [None]:
# Your code here

We can also select rows that pass certain criteria, rather than selecting by row index. For instance, to select all rows where Street_sample_name is equal to "LPS_01", we write: `df.loc[df['Street_sample_name'] == 'LPS_01', :]`.

What is happening in this syntax? `df['Street_sample_name'] == 'LPS_01'` you may recognize as a statement that checks whether the values in the column Street_sample_name are equal to LPS_01. This will result in a sequence of True and False. By passing this to our row selector using `.loc`, we select only the rows where this statement evaluates to True. ", :" indicates that we want all the columns, as we do not select any in particular.
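
To make the two steps visible, the sketch below stores the sequence of True and False in its own variable before passing it to `.loc`. It uses a small made-up dataframe, and the variable name `mask` is just a common convention, not something pandas requires.

```python
import pandas as pd

# Small made-up table for illustration
friends = pd.DataFrame([['Abby', 25], ['Sarah', 21], ['Ben', 23], ['Henry', 29]],
                       columns=['name', 'age'])

# Step 1: the comparison gives one True or False per row
mask = friends['age'] > 22
print(mask.tolist())

# Step 2: .loc keeps only the rows where the mask is True
older = friends.loc[mask, :]
print(older)
```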

### Exercise 4.3.2

Select all rows where the Diameter is smaller than 8 and save this as a new dataframe called `df_small`. Then compare the means in this dataframe to the means for the full dataframe that you calculated earlier. Which means changed the most?

In [None]:
# Your code here

Now let's visualize what we just did. First, run the code cell below to get a histogram of all weights in our full dataset. We will go into more detail on how the code works later. Note that we are importing another library: matplotlib! This library contains many functions for making plots.

In [None]:
import matplotlib.pyplot as plt
plt.hist(df['Weight'])
plt.show()

We can make another such histogram, but now for the `df_small` dataframe! This dataframe contains only pills with a diameter smaller than 8, so we can expect that the weight distribution is also a bit different.

### Exercise 4.3.3

Create the same histogram for `df_small`. Note that we do not need to import matplotlib again, because Jupyter remembers that we imported it earlier.

In [None]:
# Your code here

Nice! You'll see that the pills in `df_small` are generally lighter, as the x-axis only goes up to 300 mg instead of 500 mg. In the next tutorial, you will also learn how to change the axes, label them, and make more complex plots. Visualizing your data is very helpful throughout any analysis!

## 4.4 Missing data

Something that occurs regularly when working with data is finding empty cells in your tables. When we process our data with pandas, these empty cells are represented by NaN, which stands for Not a Number. As you can imagine, this might result in errors or unexpected results. How does one multiply 5 by NaN? Therefore, you might want to check first whether your data contains NaN values and decide how to handle them.
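
A short sketch of how NaN behaves in calculations is given below, using made-up numbers. Multiplying by NaN does not raise an error but quietly gives NaN again, while pandas functions such as `mean()` skip NaN values by default (the `skipna` parameter controls this).

```python
import numpy as np
import pandas as pd

# Arithmetic with NaN does not crash, it just produces NaN again
product = 5 * np.nan
print(product)

# pandas skips NaN values by default when calculating statistics
values = pd.Series([1.0, np.nan, 3.0])
mean_skipping_nan = values.mean()
print(mean_skipping_nan)

# With skipna=False, a single NaN makes the whole result NaN
mean_with_nan = values.mean(skipna=False)
print(mean_with_nan)
```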

In [None]:
# We will artificially insert a row that contains a NaN at the end
import numpy as np
df.loc[498, :] = ['NBI_81', 161, 1, 7.2, 4.3, 288.0, np.nan, '23-05-2023']
df.tail()

In [None]:
# Check for NaN using the .isnull() function
# we add the .any() function to avoid printing the whole df: we just want to know if each column
# has any NaN values in it. You can remove it to see the difference!
print(df.isnull().any())

As you can see, there is a NaN value in the Purity column. Depending on our goals, there are two options to deal with NaN values. The first is to remove any row that misses an element. We use the `df.dropna()` function to do so. Alternatively, we can replace missing values using `df.fillna()`. `fillna()` takes an argument called value, in which you tell it what number you want to use to replace NaN. For instance, `df_filled = df.fillna(0)` results in a new dataframe that is a copy of the old one, but all NaN values are replaced by zeroes.

In either case, the dataframe doesn't contain any NaNs anymore after dropping or filling them!

*Note: NaN, null, None and n/a are different names for more or less the same thing: a missing value.*
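
The sketch below illustrates this with made-up data. Inside Python, pandas treats both `np.nan` and `None` as missing, while the text `'n/a'` is just an ordinary string. When loading a csv, however, `read_csv()` by default converts several such markers (including `NA`, `n/a` and `null`) to NaN.

```python
import io
import numpy as np
import pandas as pd

# Inside Python: NaN and None count as missing, a plain string does not
print(pd.isna(np.nan))
print(pd.isna(None))
print(pd.isna('n/a'))

# When reading a csv, some text markers are converted to NaN by default
csv_text = "sample,purity\nA,40.5\nB,n/a\nC,NA\nD,null\n"
loaded = pd.read_csv(io.StringIO(csv_text))
n_missing = loaded['purity'].isnull().sum()
print(loaded)
print(n_missing)
```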

In [None]:
df_dropped = df.dropna()
print(df_dropped.isnull().any())

In [None]:
df_filled = df.fillna(0)
print(df_filled.isnull().any())
df_filled.tail()

### Exercise 4.4.1

Run the code cell below to add some new rows to the dataframe with NaN values. Now apply the following rules:

* Drop all rows where Date_string is NaN (hint: you can tell `dropna()` to only consider specific columns using the subset parameter; check the documentation!)
* Replace all NaN values in columns batchnumber and measurement by 0
* Replace NaN values in columns Diameter, Thickness, Weight and Purity by the mean of that column.

You'll have to use `fillna()` on specific columns rather than the whole dataframe, and you need to make sure to save columns back to the dataframe. For instance, just running `df['Weight'].fillna(0)` will fill NaN values in column Weight with value 0, but it doesn't store it anywhere! You want to run `df['Weight'] = df['Weight'].fillna(0)` instead, assigning the filled column back to the column in the dataframe.
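
The difference between the two can be sketched on a tiny made-up dataframe. The column name `score` is hypothetical and not part of the XTC data.

```python
import numpy as np
import pandas as pd

# Tiny made-up table with one missing value
scores = pd.DataFrame([[1.0], [np.nan], [3.0]], columns=['score'])

# The filled column is computed here, but not stored anywhere
scores['score'].fillna(0)
still_missing = scores['score'].isnull().sum()
print(still_missing)

# Assigning the result back to the column stores it in the dataframe
scores['score'] = scores['score'].fillna(0)
now_missing = scores['score'].isnull().sum()
print(now_missing)
```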

In [None]:
# Run this first (adjust the path below if your csv file is stored somewhere else)
df_xtc = pd.read_csv('../tutorials/xtc_data.csv')
df_nans = pd.DataFrame([['NBI_81', np.nan, 1, 7.2, 4.3, 288.0, np.nan, np.nan],
           ['NBI_81', np.nan, 1, 7.2, 4.3, 288.0, np.nan, '23-05-2023'],
           ['NBI_81', 161, 1, np.nan, 4.3, 288.0, np.nan, np.nan],
           ['NBI_81', 161, np.nan, 7.2, 4.3, 288.0, np.nan, '23-05-2023'],
           ['NBI_81', np.nan, 1, 7.2, np.nan, 288.0, np.nan, '23-05-2023'],
           ['NBI_81', 161, np.nan, 7.2, 4.3, np.nan, np.nan, '23-05-2023']],
                       columns=df_xtc.columns)

df = pd.concat([df_xtc, df_nans]).reset_index(drop=True)
df.tail(10)

In [None]:
# Your code here


In other cases you might want to sort your dataframe, for example by the values in a certain column. We can do this with the function `sort_values()`. By default it sorts from low to high; pass `ascending=False` to sort in descending order instead:

In [None]:
df.sort_values('Purity')
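
As a minimal sketch on a small made-up dataframe, the example below sorts in both directions using the `ascending` parameter of `sort_values()`. Note that `sort_values()` returns a new, sorted dataframe and leaves the original unchanged, so the result has to be stored in a variable if we want to keep it.

```python
import pandas as pd

# Small made-up table for illustration
friends = pd.DataFrame([['Abby', 25], ['Sarah', 21], ['Ben', 23], ['Henry', 29]],
                       columns=['name', 'age'])

# Default: ascending order (lowest value first)
youngest_first = friends.sort_values('age')
print(youngest_first)

# Descending order (highest value first)
oldest_first = friends.sort_values('age', ascending=False)
print(oldest_first)

# The original dataframe keeps its original order
print(friends)
```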

## 4.5 Visualizing your data

As we saw before, we can create plots using matplotlib. As you can read [here](https://matplotlib.org/stable/plot_types/index.html), there are many different plots you can create with matplotlib. We have already used ```hist()```. Now let's make a simple scatterplot.

### Exercise 4.5.1

We will use ```plt.scatter()``` to plot the diameter and the weight of XTC pills in our dataset. As you can read in the documentation, `scatter` takes x and y as required arguments. First we will have to define what should be on the x and y axes. In the code cell below, replace the 0 in `y = 0` by the correct data.

In [None]:
x = df['Diameter']
y = 0

Next, we have to tell matplotlib that we want to make a scatterplot, and give it the right input. When we are done, we write ```plt.show()``` to generate the figure in this notebook. If you ever want to automatically save the image you've generated, you can use ```plt.savefig(name)``` where name is the filename you would like to use.
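
A sketch of the saving step is shown below, with made-up numbers instead of the XTC data. The file name `example_scatter.png` and the use of a temporary folder are assumptions for this example. The sketch calls `plt.savefig()` before `plt.show()`, because in a notebook the figure is typically cleared once it has been shown, which can result in an empty image if the order is reversed.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Made-up values for illustration
x_values = [1, 2, 3, 4]
y_values = [2, 4, 5, 8]

plt.scatter(x_values, y_values)

# Hypothetical file name, saved into a temporary folder for this example
out_path = os.path.join(tempfile.gettempdir(), 'example_scatter.png')
plt.savefig(out_path)  # save first ...
plt.show()             # ... then show the figure
```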

In [None]:
plt.scatter(x, y)
plt.show()

Look! There seems to be some correlation between the diameter and the weight of the pills.

Can you make another scatterplot that compares thickness and purity? Try to figure out how to give the axes a name by looking at the documentation and websites such as Stack Overflow.

In [None]:
# Your code here