# Dictionaries & dataframes

We have seen lists and arrays so far, but of course there are many different types of data, and correspondingly many data structures to store them. In this lecture, we introduce dictionaries and dataframes, which relate to each other in much the same way that lists and numpy arrays do.

First of all, we need to set up our virtual environment again. Start by calling `python -m venv .venv` from the terminal. Like last time, make sure the virtual environment is active in your terminal (indicated by the line starting with `(.venv)`) -- you may need to restart Visual Studio Code to get this. Unlike last time, we are not going to install our packages manually. Instead, we will tell `pip` to install all the packages listed in a specific file called `requirements.txt`. Go ahead and take a look at it! What do you think it means?

When you are ready, you can use the following command to install all the packages required for this lecture:
`pip install -r requirements.txt`

This might take quite a while -- psychopy in particular is a very large package. If you run into any problems, let us know as soon as possible! After you are done, run the cell below to confirm that everything is working as expected:

In [1]:
from helpers import check_venv, check_pandas
if check_venv() and check_pandas():
    print('Virtual environment and pandas installation detected!')

No virtual environment found.
Make sure to follow the instructions, including the `python -m venv .venv` command!
You may need to restart Visual Studio Code after creating the virtual environment.
On Mac or Linux, you may need to use `python3` instead of `python`, i.e. `python3 -m venv .venv`.


## Dictionaries...

Alright, let's start by looking at a different kind of data we might want to work with, still stored in lists for now:

In [3]:
first_names = ['Marcus', 'Kristin', 'Adam', 'Kimberly', 'Judith']
last_names = ['Mcguire', 'Cantu', 'Mendez', 'Wolf', 'Johnson']
grades = [3.3, 2.7, 3.7, 2.7, 3.0]

Using these lists, how would we go about getting Adam's grade?

In [4]:
# Using a for loop:
target_name = 'Adam'
for name, grade in zip(first_names, grades):
    if name == target_name:
        print(f'{name}\'s grade is {grade}')

# Or using the .index method:
target_index = first_names.index(target_name)
print(grades[target_index])

Adam's grade is 3.7
3.7
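One reason this approach can be fragile is that the lists are linked only by position. The following is a minimal sketch of what might go wrong if just one of the lists gets reordered; the variable names `sorted_names`, `wrong_grade` and `right_grade` are made up for this illustration.

```python
first_names = ['Marcus', 'Kristin', 'Adam', 'Kimberly', 'Judith']
grades = [3.3, 2.7, 3.7, 2.7, 3.0]

# Sorting only one of the lists silently breaks the pairing between them.
sorted_names = sorted(first_names)  # alphabetical order, 'Adam' comes first

# Looking up the position in the sorted list points at the wrong grade...
wrong_grade = grades[sorted_names.index('Adam')]
# ...while the position in the original list still matches.
right_grade = grades[first_names.index('Adam')]

print(wrong_grade)  # this is Marcus's grade, not Adam's
print(right_grade)
```

Nothing in the code complains here, which is exactly what makes this kind of mistake hard to spot.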


This is not ideal -- is there perhaps a better way?

Yes: use a dictionary instead of multiple lists! A dictionary stores each value under a named key, so we can represent a single student like this:

In [5]:
student = {
    'first_name': 'Marcus',
    'last_name': 'Mcguire',
    'grade': 3.3
}

We can then access information by key, rather than by index (as in lists):

In [6]:
print(student['last_name'])

Mcguire
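Keys can also be used to change or add information. Below is a small sketch of a few common dictionary operations; the name `example_student`, the `'email'` key and its value are hypothetical and only serve as an illustration.

```python
example_student = {
    'first_name': 'Marcus',
    'last_name': 'Mcguire',
    'grade': 3.3
}

# Assigning to an existing key overwrites its value
example_student['grade'] = 3.0

# Assigning to a new key adds it (hypothetical key, for illustration only)
example_student['email'] = 'marcus@example.com'

# Asking for a missing key with square brackets raises a KeyError,
# whereas .get() returns None (or a default you provide) instead
missing = example_student.get('age')
print(missing)
print('age' in example_student)  # check whether a key exists

# Loop over all key-value pairs
for key, value in example_student.items():
    print(f'{key}: {value}')
```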


As you can imagine, dictionaries are very nice to store a single set of datapoints in a more structured form. But we can combine lists and dictionaries as well, to get collections of such datapoints:

In [7]:
students = [
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
]

As with nested lists, we can chain the square brackets to get information:

In [8]:
print(students[2]['last_name'])

Mendez
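The same key-based access works inside a loop. As a rough sketch (the names `full_names` and `full_name` are introduced here for illustration only), we could build each student's full name like this:

```python
students = [
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
]

full_names = []
for student in students:
    # Each student is a dictionary, so we access its values by key
    full_name = student['first_name'] + ' ' + student['last_name']
    full_names.append(full_name)
    print(full_name, student['grade'])
```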


So that solved one of our problems: the general organisation and accessibility of our data. However, it doesn't make looking up rows of data *that* much easier:

In [9]:
target_name = 'Judith'
for student in students:
    if student['first_name'] == target_name:
        print(student['grade'])

3.0


However, I've heard that `pandas` are pretty cool animals...

## ... and Dataframes!

The `pandas` package in Python provides us with a new data structure: the `DataFrame`! It's essentially tabular data, very similar to the list of dictionaries we used before. In fact, we can use the list of dictionaries to create the `DataFrame`!

In [11]:
import pandas as pd

students_dataframe = pd.DataFrame(students)
students_dataframe  # This does something fancy if you're using a notebook

  first_name last_name  grade
0     Marcus   Mcguire    3.3
1    Kristin     Cantu    2.7
2       Adam    Mendez    3.7
3   Kimberly      Wolf    2.7
4     Judith   Johnson    3.0
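A list of dictionaries is not the only possible starting point. As a sketch of an alternative, `pd.DataFrame` also accepts a dictionary of lists, where each key becomes a column name. The variable name `from_lists` is made up for this example.

```python
import pandas as pd

first_names = ['Marcus', 'Kristin', 'Adam', 'Kimberly', 'Judith']
last_names = ['Mcguire', 'Cantu', 'Mendez', 'Wolf', 'Johnson']
grades = [3.3, 2.7, 3.7, 2.7, 3.0]

# Each key becomes a column; each list provides the values for that column
from_lists = pd.DataFrame({
    'first_name': first_names,
    'last_name': last_names,
    'grade': grades
})

print(from_lists.shape)  # (number of rows, number of columns)
print(list(from_lists.columns))
```

This should give the same table as before, built from the three separate lists we started with.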


We can access information from a dataframe in several ways:

In [12]:
print(students_dataframe['first_name'])  # By column name
print(students_dataframe.iloc[3])  # By row position
print(students_dataframe.loc[3, 'first_name'])  # By row label and column name

0      Marcus
1     Kristin
2        Adam
3    Kimberly
4      Judith
Name: first_name, dtype: object
first_name    Kimberly
last_name         Wolf
grade              2.7
Name: 3, dtype: object
Kimberly
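In this dataframe the row labels happen to be the same as the row positions, so `.loc` and `.iloc` appear interchangeable. The sketch below shows where they may differ: after sorting, the labels travel with their rows. The name `by_grade` is an assumption for this illustration.

```python
import pandas as pd

students_dataframe = pd.DataFrame([
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
])

# Sort from highest to lowest grade; each row keeps its original label
by_grade = students_dataframe.sort_values('grade', ascending=False)

first_by_position = by_grade.iloc[0]['first_name']  # the row that is now on top
first_by_label = by_grade.loc[0, 'first_name']  # the row that is labelled 0

print(first_by_position)
print(first_by_label)
```

Here `.iloc[0]` returns the top row after sorting (Adam), while `.loc[0]` still returns the row labelled 0 (Marcus).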


Unlike with standard lists or dictionaries, we can easily perform operations on a whole column -- like looking for a certain person:

In [13]:
target_name = 'Judith'
mask = students_dataframe['first_name'] == target_name
mask

0    False
1    False
2    False
3    False
4     True
Name: first_name, dtype: bool

This results in a 'mask': a series of Boolean (True/False) values, one for each row, indicating whether the condition was met.
In this case, it shows us in which rows the `first_name` column was equal to our `target_name` variable ('Judith'). We can then use this mask instead of an index to get information belonging to those rows where the value is equal to True:

In [14]:
print(students_dataframe.loc[mask, 'grade'])  # Like the grade
print(students_dataframe.loc[mask])  # Or the whole row
print(students_dataframe.loc[mask, ['last_name', 'grade']])  # Or multiple columns!

4    3.0
Name: grade, dtype: float64
  first_name last_name  grade
4     Judith   Johnson    3.0
  last_name  grade
4   Johnson    3.0


We broke this up into steps for you, but you can actually do all of this in a single line as well:

In [15]:
students_dataframe.loc[students_dataframe['first_name'] == target_name, 'grade']

4    3.0
Name: grade, dtype: float64
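Masks are not limited to checking for equality. As a minimal sketch, the example below uses a numeric comparison and combines two conditions with `&` (and) and `|` (or); note that each condition needs its own parentheses. The thresholds and variable names are arbitrary choices for this illustration.

```python
import pandas as pd

students_dataframe = pd.DataFrame([
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
])

# A numeric comparison also produces one True/False value per row
at_least_three = students_dataframe['grade'] >= 3.0
names_at_least_three = students_dataframe.loc[at_least_three, 'first_name'].tolist()

# Both conditions must hold (&)
both = (students_dataframe['grade'] >= 3.0) & (students_dataframe['last_name'] != 'Mendez')
names_both = students_dataframe.loc[both, 'first_name'].tolist()

# At least one condition must hold (|)
either = (students_dataframe['first_name'] == 'Kristin') | (students_dataframe['grade'] > 3.5)
names_either = students_dataframe.loc[either, 'first_name'].tolist()

print(names_at_least_three)
print(names_both)
print(names_either)
```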

Pandas, like numpy, has lots of built-in methods to make your life easier. For instance, to get the student with the highest grade:

In [16]:
index = students_dataframe['grade'].argmax()
students_dataframe.iloc[index]

first_name      Adam
last_name     Mendez
grade            3.7
Name: 2, dtype: object
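A few other built-in methods are sketched below as an illustration. Note that `argmax` returns a row position (to be used with `.iloc`), whereas `idxmax` returns a row label (to be used with `.loc`). The variable names are assumptions made for this example.

```python
import pandas as pd

students_dataframe = pd.DataFrame([
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
])

grade_column = students_dataframe['grade']

# Summary statistics of a whole column
mean_grade = grade_column.mean()
lowest_grade = grade_column.min()
highest_grade = grade_column.max()
print(round(mean_grade, 2), lowest_grade, highest_grade)

# idxmax gives the row *label* of the highest value, so we combine it with .loc
best_label = grade_column.idxmax()
best_student = students_dataframe.loc[best_label, 'first_name']
print(best_student)
```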

## Writing and reading dataframes

Now that we have a nicely structured type of data, we also want an easy way to write it to disk and read it back in. After all, it would be a bit of a hassle to have to construct our dataframes manually every time, using lists of dictionaries. Luckily, the `pandas` package has got us covered:

In [20]:
students_dataframe.to_csv('students.csv', index=False)

Go check it out! You can even add a row to the `csv` (comma-separated values) file yourself, like you might do during data collection. (Although, more likely, a script would do that for you.)

Now, just for completeness, let's look at how we load this data back in:

In [21]:
loaded_students = pd.read_csv('students.csv')
loaded_students

  first_name last_name  grade
0     Marcus   Mcguire    3.3
1    Kristin     Cantu    2.7
2       Adam    Mendez    3.7
3   Kimberly      Wolf    2.7
4     Judith   Johnson    3.0
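To see why we passed `index=False` when writing, and how a script might add a row, here is a minimal sketch. It writes to a temporary folder so that it does not touch your own `students.csv`; the extra student ('Alex Example') and the file names are made up for this illustration.

```python
import os
import tempfile

import pandas as pd

students_dataframe = pd.DataFrame([
    {'first_name': 'Marcus', 'last_name': 'Mcguire', 'grade': 3.3},
    {'first_name': 'Kristin', 'last_name': 'Cantu', 'grade': 2.7},
    {'first_name': 'Adam', 'last_name': 'Mendez', 'grade': 3.7},
    {'first_name': 'Kimberly', 'last_name': 'Wolf', 'grade': 2.7},
    {'first_name': 'Judith', 'last_name': 'Johnson', 'grade': 3.0}
])

# Add a (hypothetical) new student as an extra row
new_row = pd.DataFrame([{'first_name': 'Alex', 'last_name': 'Example', 'grade': 3.0}])
extended = pd.concat([students_dataframe, new_row], ignore_index=True)

with tempfile.TemporaryDirectory() as folder:
    # Without the row labels: the file contains only our three columns
    path = os.path.join(folder, 'students.csv')
    extended.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # With the row labels (the default): they come back as an extra column
    path_with_index = os.path.join(folder, 'students_with_index.csv')
    extended.to_csv(path_with_index)
    with_index = pd.read_csv(path_with_index)

print(list(reloaded.columns))
print(list(with_index.columns))
print(reloaded.loc[5, 'first_name'])
```

The second list of column names starts with `Unnamed: 0`, which is how the saved row labels show up when they are read back in as ordinary data.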
