# Dataframes as databases

Pandas dataframes offer different ways to query data. In some cases, these queries can become as elaborate as in traditional databases. In this notebook, we will see how to make simple queries to a dataset loaded from the internet.

## Loading data from the internet

There are several ways to work with data loaded from the internet. Here, we will download the dataset using a Linux tool and load it into Pandas.

### Downloading a dataset with `wget`

When you run a notebook on Colab, the code executes on a computer in Google's infrastructure. These computers run the Linux operating system, and we can take advantage of that whenever a Linux program can help us. One example is the program ``wget``, which downloads the file at the URL we provide. To run a Linux program on Colab, we write the terminal command in a code cell, prefixed with the symbol ``!``.

In this case, we use ``wget`` to download a dataset from the UFRN open data portal that contains the students who enrolled in 2019:

In [0]:
!wget http://dados.ufrn.br/dataset/554c2d41-cfce-4278-93c6-eb9aa49c5d16/resource/a55aef81-e094-4267-8643-f283524e3dd7/download/discentes-2019.csv

The file ``discentes-2019.csv`` should appear in the file list on the left side of the screen.

### Loading the dataset

Let's load the file as a Pandas dataframe:

In [0]:
import pandas as pd
data = pd.read_csv('discentes-2019.csv', sep=';')
data.head()

In case you have any questions, let's review the code above:

 - We import Pandas and call it ``pd``:
   ```python
   import pandas as pd
   ```
 - Since we gave Pandas a name, all of its commands are referenced through that name (e.g., ``pd.read_csv()``). We specify the character that the dataset uses as a feature delimiter with the option ``sep=';'`` (by default Pandas expects a comma, but Brazilian datasets often use a semicolon):
   ```python
   data = pd.read_csv('discentes-2019.csv', sep=';')
   ```
 - We view the first samples of the dataset with the method ``head()``:
   ```python
   data.head()
   ```
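To see what the ``sep`` option changes, here is a minimal sketch that parses a small semicolon-separated text held in memory instead of the downloaded file. The sample text and its two column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in a semicolon-separated layout (not real data).
sample_csv = "nome;curso\nAna;FISICA\nBruno;QUIMICA\n"

# With the default separator (a comma), each line is read as a single column.
wrong = pd.read_csv(io.StringIO(sample_csv))

# With sep=';', the two features are split as expected.
right = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 2)
```

If a freshly loaded dataframe shows a single wide column, the separator is a likely culprit.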


## Querying a dataframe

Well, we already have our dataframe ready for querying. The simplest forms of querying are **indexing** and **slicing**.

### Indexing a dataset

Queries on a Pandas dataframe are based on **indices**. The main index of a dataframe is its set of columns, which represent the features:

In [0]:
data.columns

This means that we can access any of these dataframe columns using the notations `data['column_name']` and `data.column_name`. Let's first translate the feature names, since this dataframe is provided in Brazilian Portuguese. Before we do that, we need to understand what each feature means, which we can look up in the [data dictionary](http://dados.ufrn.br/dataset/554c2d41-cfce-4278-93c6-eb9aa49c5d16/resource/b5144c99-81f3-4cfc-8938-18adb81ae3c0/download/discentesdicionario.pdf).

In [0]:
# New names, in the same order as the original columns.
data.columns = [
    "id",
    "student_name",
    "gender",
    "start_year",
    "start_semester",
    "start_type",
    "student_type",
    "status",
    "level_acronym",
    "level",
    "course_id",
    "course_name",
    "course_type",
    "unit_id",
    "unit_name",
    "main_unit_id",
    "main_unit_name",  # was a second "main_unit_id"; column names should be unique
]
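Assigning a full list to `data.columns` relies on getting both the order and the number of names right. As an alternative sketch, `rename()` with a dictionary changes only the columns we name. The toy dataframe below stands in for the real dataset; its values are placeholders, and the Portuguese column names other than `nome_discente` are assumptions.

```python
import pandas as pd

# Toy dataframe; the values are placeholders.
toy = pd.DataFrame({
    "nome_discente": ["A", "B"],
    "sexo": ["F", "M"],
    "status": ["ATIVO", "ATIVO"],
})

# Map old names to new names; columns not listed keep their names.
toy = toy.rename(columns={"nome_discente": "student_name", "sexo": "gender"})

print(list(toy.columns))      # ['student_name', 'gender', 'status']
print(toy.columns.is_unique)  # True: no name was used twice
```

Checking `columns.is_unique` after renaming is a cheap way to catch a name that was typed twice.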

As each column is a series (an object of type `Series`), we can use the methods of that type:

In [0]:
data["student_name"].head()

In [0]:
data.unit_name.tail()

The data in a series is also indexed. We can access each value individually using the notation `series[row_label]`; with the default index, the row labels are the row numbers:

In [0]:
student_names = data["student_name"]
student_names[0]

In [0]:
data["student_name"][0]

In [0]:
data.unit_name[0]
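The bracket notation looks values up by row label, which only coincides with the row number when the series has the default index. The following sketch, on a made-up series with labels 10, 20, and 30, shows the difference and uses `iloc` for access by position.

```python
import pandas as pd

# A series whose row labels are 10, 20, 30 instead of 0, 1, 2.
names = pd.Series(["Ana", "Bruno", "Carla"], index=[10, 20, 30])

first_by_label = names[10]         # looks up the label 10
first_by_position = names.iloc[0]  # looks up the first position

print(first_by_label, first_by_position)  # Ana Ana
print(0 in names.index)  # False: there is no row labelled 0 here
```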

To make our analysis more meaningful yet simple, we will discard a few features and translate the values of another.

*   We can drop features using the method `drop()`, passing the list of features and specifying that they are columns (`axis=1`):

In [0]:
data = data.drop(["student_type", "level_acronym", "level", "course_type"], axis=1)
data.head()

*   We can translate feature values if they are categorical. Changing a feature's type to categorical saves memory when the feature has few distinct values, so you should do that even when translation is not needed:

In [0]:
data["status"] = data["status"].astype("category")
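As a rough illustration of the memory savings, the sketch below compares the same made-up column stored as text and as a category. The values and the number of rows are assumptions chosen only to make the difference visible.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times.
status_text = pd.Series(["ATIVO", "CANCELADO", "TRANCADO"] * 10_000)
status_cat = status_text.astype("category")

# deep=True also counts the memory used by the text values themselves.
bytes_text = status_text.memory_usage(deep=True)
bytes_cat = status_cat.memory_usage(deep=True)

print(bytes_cat < bytes_text)  # True
```

The categorical version stores each distinct value once, plus a small integer code per row.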

To translate a feature, we first list its existing values. We do that using the method `unique()`:

In [0]:
data["status"].unique()

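To rename categories, we use the `rename_categories` method provided for categorical values, passing a list of new category names. The list must follow the order of the existing categories, available as `data["status"].cat.categories`, which can differ from the order of the values returned by `unique`: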

In [0]:
data["status"] = data["status"].cat.rename_categories(["CANCELLED", "ACTIVE", "SUSPENDED", "FINISHED", "REGISTERED", "ACTIVE - GRADUATING", "DEFENDED"])
data.head()
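Matching a list of new names to the existing categories by position is easy to get wrong. As a sketch of a safer alternative, `rename_categories` also accepts a dictionary that pairs each old value with its translation. The status values below are hypothetical; the real dataset may use different ones.

```python
import pandas as pd

# Hypothetical status values (assumption, not taken from the dataset).
status = pd.Series(["TRANCADO", "ATIVO", "CANCELADO", "ATIVO"]).astype("category")

print(list(status.unique()))        # order of first appearance
print(list(status.cat.categories))  # order used by a positional rename

# A dictionary pairs each old value with its translation explicitly.
translation = {"ATIVO": "ACTIVE", "CANCELADO": "CANCELLED", "TRANCADO": "SUSPENDED"}
status = status.cat.rename_categories(translation)

print(list(status))  # ['SUSPENDED', 'ACTIVE', 'CANCELLED', 'ACTIVE']
```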

Note that we could translate other features using the same approach, but that would exceed the scope of this notebook.

#### The `loc` and `iloc` methods

It is also possible to access the data directly using `loc` and `iloc`:
- Referring to columns by their names, using the notation `data.loc[row, column_name]`:

In [0]:
data.loc[0, "student_name"]

- Referring to columns by their position in the column index, using the notation `data.iloc[row, column_index]`:

In [0]:
data.iloc[0, 1]

Note that the indexes are counted from the number 0. Since `"student_name"` is the second column, we use index 1 to access it.

Both `loc` and `iloc` also accept a list of indexes:

In [0]:
data.loc[0, ["student_name","course_name"]]

In [0]:
data.iloc[[1,3,7], 1]
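With the default index, row labels and row positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a made-up dataframe with row labels 100, 200, and 300 to make that difference visible.

```python
import pandas as pd

toy = pd.DataFrame(
    {"student_name": ["Ana", "Bruno", "Carla"], "course_name": ["X", "Y", "Z"]},
    index=[100, 200, 300],  # row labels that differ from row positions
)

by_label = toy.loc[200, "student_name"]  # row labelled 200
by_position = toy.iloc[1, 0]             # second row, first column

print(by_label, by_position)  # Bruno Bruno

# Lists work in both: labels for loc, positions for iloc.
print(toy.loc[[100, 300], "course_name"].tolist())  # ['X', 'Z']
print(toy.iloc[[0, 2], 1].tolist())                 # ['X', 'Z']
```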

### Slicing the dataset

In most cases, we are only interested in a particular subset of the data made of contiguous columns/rows. We can select that subset using slicing operations:
- Slicing by rows, using `data.loc[row_start:row_end, column_name]`:

In [0]:
data.loc[0:500,'student_name']

- Slicing by rows and columns simultaneously, based on their positions in the dataframe, using `data.iloc[row_start:row_end, column_start:column_end]`:

In [0]:
data.iloc[0:5, 5:8]

It is important to notice that slicing operations in Python usually include the element at the first index but exclude the element at the second index. This means that when you select multiple rows or columns in this manner, the selection runs from the first number to the second number minus one. Because of this, the example `data.iloc[0:5, 5:8]` returns 5 rows and 3 columns.

The `loc` method is an exception: it also includes the row referenced by the second index. This is why the example `data.loc[0:500,'student_name']` returns 501 rows. You can also slice by columns with `loc`, but this has to be done using the column labels. This is probably the reason why the second element of the slice is also included.

In [0]:
data.loc[0:500, 'student_name':'start_year']
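The difference between the two slicing rules is easiest to check by counting what comes back. This is a small sketch on a made-up dataframe with ten rows and the default index.

```python
import pandas as pd

toy = pd.DataFrame({"a": range(10), "b": range(10), "c": range(10)})

rows_iloc = toy.iloc[0:5]       # positions 0..4, end position excluded
rows_loc = toy.loc[0:5]         # labels 0..5, end label included
cols_loc = toy.loc[:, "a":"b"]  # columns 'a' and 'b', end label included

print(len(rows_iloc), len(rows_loc))  # 5 6
print(list(cols_loc.columns))         # ['a', 'b']
```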

## Queries just like those done in database development



Indexing and slicing are inherent to the Python language, which is why Pandas implements them.

They partially cover **selection** and **projection**, two operations common in databases:
- **Selection**: choosing a subset of samples
- **Projection**: choosing a subset of features

The Pandas `DataFrame` provides more methods for these types of queries.

### Searching by the name of the features

The method **filter()** chooses a subset of features based on their names:

In [0]:
data.filter(like='start')

The result of the method **filter** is a new `DataFrame` that can be assigned to a new name:

In [0]:
start_date = data.filter(like='start')
start_date.head()
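Besides `like`, the method `filter()` also accepts `regex` and `items`. The sketch below tries the three options on an empty dataframe that only carries a few of the translated column names, since the selection depends on the names alone.

```python
import pandas as pd

toy = pd.DataFrame(
    columns=["id", "start_year", "start_semester", "course_id", "course_name"]
)

# like: keep names that contain the given text.
print(list(toy.filter(like="start").columns))  # ['start_year', 'start_semester']

# regex: keep names that match a regular expression (here, ending in "_id").
print(list(toy.filter(regex="_id$").columns))  # ['course_id']

# items: keep exactly the listed names.
print(list(toy.filter(items=["id", "course_name"]).columns))  # ['id', 'course_name']
```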

### Searching for conditions

Another way to filter dataframes is through **conditions**. To that end, we use the method `query('condition')`, where `condition` is a string containing a logical expression written in Python-like syntax. For example, we will choose only the samples for which **start_type** equals **REINGRESSO SEGUNDO CICLO**:

In [0]:
data.query("start_type == 'REINGRESSO SEGUNDO CICLO'")

Let’s discuss the example above:
* `start_type` is a `Series` (column) from the `DataFrame` that we call `data`.
* We compare each value in this series with the value `'REINGRESSO SEGUNDO CICLO'` using the equality operator `==`.
* We choose only the samples that satisfy that condition:
```python
data.query("start_type == 'REINGRESSO SEGUNDO CICLO'")
```

Note that we can also use names to reference the returned `DataFrame`:

In [0]:
data_second_cycle = data.query("start_type == 'REINGRESSO SEGUNDO CICLO'")
data_second_cycle.head()
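A query like this can also be written as a boolean mask, which some readers find easier to inspect step by step. The sketch below compares the two forms on a made-up dataframe; the value `'SISU'` is only a placeholder for some other start type.

```python
import pandas as pd

toy = pd.DataFrame({
    "student_name": ["Ana", "Bruno", "Carla"],
    "start_type": ["SISU", "REINGRESSO SEGUNDO CICLO", "SISU"],
})

with_query = toy.query("start_type == 'REINGRESSO SEGUNDO CICLO'")

# The comparison yields a Series of True/False, one value per row.
mask = toy["start_type"] == "REINGRESSO SEGUNDO CICLO"
with_mask = toy[mask]

print(with_query.equals(with_mask))  # True
```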

#### Conditions and comparison operators

In the example above, we use the equality operator. Note that `==` (equality comparison) is different from `=` (binding a name to an object). Python offers more comparison operators:

| Symbol | Meaning |
|:----:|---|
| == | Equal to |
| !=  | Not equal |
| < | Less than |
| > | Greater than |
| <=  | Less than or equal to |
| >=  | Greater than or equal to |

It is also important to observe that the operators less than/greater than (or equal to) are usually applied to numeric data. For non-numeric data, we can also use the operator `in`. Let’s choose only the observations whose status is "CANCELLED" or "SUSPENDED":


In [0]:
selected_status = ["CANCELLED", "SUSPENDED"]
data_status = data.query(f"status in {selected_status}")
data_status.tail()

Let’s discuss the example above:
* `selected_status` is a list with the statuses that we wish to keep.
* We filter the dataframe `data`, indicating that we want only the samples whose status value appears in the list `selected_status`:
```python
data.query(f"status in {selected_status}")
```
Note that we use a Python feature called `f-strings`, which replaces each expression between curly braces with its text representation (an `f-string` always starts with an `f` before the quotation marks).
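As an alternative sketch to building the condition text with an f-string, `query()` can refer to a Python variable directly by prefixing its name with `@`, and the method `isin()` expresses the same test as a boolean mask. The toy dataframe below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"status": ["ACTIVE", "CANCELLED", "SUSPENDED", "ACTIVE"]})
selected_status = ["CANCELLED", "SUSPENDED"]

with_fstring = toy.query(f"status in {selected_status}")
with_at = toy.query("status in @selected_status")  # @ refers to a Python variable
with_isin = toy[toy["status"].isin(selected_status)]

print(with_fstring.equals(with_at) and with_at.equals(with_isin))  # True
```

The `@` form avoids depending on how the list is converted to text.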

#### Conditions and Logical Operators
We can also use more complex conditions, using **logical operators**. Let's restrict the second-cycle query a little more: in addition to **start_type** having the value **REINGRESSO SEGUNDO CICLO**, we require **course_name** to have the value **ENGENHARIA DE SOFTWARE**:

In [0]:
condition_second_cycle = "start_type == 'REINGRESSO SEGUNDO CICLO'"
condition_software_engineering = "course_name == 'ENGENHARIA DE SOFTWARE'"
data_second_SE = data.query(f"{condition_second_cycle} and {condition_software_engineering}")
data_second_SE.head()

Reviewing the code above:
* Condition to choose only new entrants through second cycle re-entry:
```python
condition_second_cycle = "start_type == 'REINGRESSO SEGUNDO CICLO'"
```
* Condition to choose only those entering the software engineering course:
```python
condition_software_engineering = "course_name == 'ENGENHARIA DE SOFTWARE'"
```
* Combining the two conditions through the `and` operator:
```python
data_second_SE = data.query(f"{condition_second_cycle} and {condition_software_engineering}")
```

#### Other logical operators

In addition to the `and` operator, Pandas also provides the `or` operator. While `and` chooses a sample only if both conditions are true, `or` chooses it if at least one of the conditions is satisfied. Following this definition, what does the example below do?

In [0]:
condition_second_cycle = "start_type == 'REINGRESSO SEGUNDO CICLO'"
condition_software_engineering = "course_name == 'ENGENHARIA DE SOFTWARE'"
condition_computer_science = "course_name == 'CIÊNCIA DA COMPUTAÇÃO'"
condition_unit = f"{condition_computer_science} or {condition_software_engineering}"
condition_second_unit = data.query(f"{condition_second_cycle} and ({condition_unit})")
condition_second_unit.head()

Reviewing the code above:
* Condition to choose only new entrants through second cycle re-entry:
```python
condition_second_cycle = "start_type == 'REINGRESSO SEGUNDO CICLO'"
```
* Condition to choose only those entering the software engineering course:
```python
condition_software_engineering = "course_name == 'ENGENHARIA DE SOFTWARE'"
```
* Condition to choose only those entering the computer science course:
```python
condition_computer_science = "course_name == 'CIÊNCIA DA COMPUTAÇÃO'"
```
* Combining the two course conditions using the `or` operator:
```python
condition_unit = f"{condition_computer_science} or {condition_software_engineering}"
```
* Combining the re-entry condition with the course conditions using the `and` operator. The parentheses are needed because `and` takes precedence over `or`; without them, every software engineering student would be selected regardless of `start_type`:
```python
condition_second_unit = data.query(f"{condition_second_cycle} and ({condition_unit})")
```
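When conditions are assembled from smaller pieces of text, operator precedence matters: `and` binds more tightly than `or`. The sketch below, on a made-up dataframe, shows how the result changes when the `or` part is wrapped in parentheses. The names `a`, `b`, and `c` and the value `'SISU'` are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "start_type": ["SISU", "REINGRESSO SEGUNDO CICLO", "SISU"],
    "course_name": [
        "ENGENHARIA DE SOFTWARE",
        "ENGENHARIA DE SOFTWARE",
        "CIÊNCIA DA COMPUTAÇÃO",
    ],
})
a = "start_type == 'REINGRESSO SEGUNDO CICLO'"
b = "course_name == 'CIÊNCIA DA COMPUTAÇÃO'"
c = "course_name == 'ENGENHARIA DE SOFTWARE'"

without_parentheses = toy.query(f"{a} and {b} or {c}")  # read as (a and b) or c
with_parentheses = toy.query(f"{a} and ({b} or {c})")

print(len(without_parentheses))  # 2
print(len(with_parentheses))     # 1
```

Only the parenthesized version restricts both courses to second cycle re-entry.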

Note that we used the `or` operator when we could have used the `in` operator, which is more readable. In general, we adopt the `or` operator when conditions involve different features, instead of different values for the same feature.
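To check that the two forms select the same samples, here is a short sketch on a made-up dataframe; `'FÍSICA'` is a placeholder for any other course.

```python
import pandas as pd

toy = pd.DataFrame({
    "course_name": ["ENGENHARIA DE SOFTWARE", "CIÊNCIA DA COMPUTAÇÃO", "FÍSICA"],
})

with_or = toy.query(
    "course_name == 'CIÊNCIA DA COMPUTAÇÃO' or course_name == 'ENGENHARIA DE SOFTWARE'"
)
with_in = toy.query(
    "course_name in ['CIÊNCIA DA COMPUTAÇÃO', 'ENGENHARIA DE SOFTWARE']"
)

print(with_or.equals(with_in))  # True
print(len(with_in))             # 2
```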

Finally, the operator `not` is used to invert a condition:

In [0]:
data_direct_start = data.query(f"not {condition_second_cycle}")
data_direct_start.head()
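For a single comparison, negating the condition gives the same result as using the opposite operator. The sketch below checks this on a made-up dataframe and wraps the condition in parentheses, which keeps the negation applied to the whole condition if it later grows to include `and` or `or`.

```python
import pandas as pd

toy = pd.DataFrame({
    "start_type": ["SISU", "REINGRESSO SEGUNDO CICLO", "SISU"],
})
condition_second_cycle = "start_type == 'REINGRESSO SEGUNDO CICLO'"

# Parentheses keep the negation applied to the whole condition.
with_not = toy.query(f"not ({condition_second_cycle})")
with_ne = toy.query("start_type != 'REINGRESSO SEGUNDO CICLO'")

print(with_not.equals(with_ne))  # True
print(len(with_not))             # 2
```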

* **Note**: complex logical expressions deserve specific research on the subject. Covering this topic in depth is beyond the scope of this notebook 🙃