# APS106 - Fundamentals of Computer Programming
## Week 12 | Lecture 1 (12.1) - Installing third-party packages, managing environments, and Pandas

### This Week
| Lecture | Topics |
| --- | --- |
| **12.1** | **Pandas** |
| 12.2 | More Pandas, Data Visualization | 
| 12.3 | Design Problem: Stock Market, Part 1 |

### Lecture Structure
1. [csv Module](#section1)
2. [DataFrames](#section2)
3. [Breakout Session 1](#section3)
4. [Series](#section4)
5. [Data Extraction](#section5)
6. [Breakout Session 2](#section6)
7. [Looping](#section7)
8. [Saving DataFrames](#section8)

<a id='section1'></a>
## 1. csv Module
Let's first import the `csv` module, which comes pre-installed with Python.

In [None]:
import csv

We will now import a `csv` file which contains the results from past presidential elections in the United States. This file is in the same folder as this notebook and is named `elections.csv`.

In [None]:
# Open the file
file = open('elections.csv', 'r')

# Create a csv reader object
csv_reader = csv.reader(file)

# Loop through the rows and print out the contents of each row
for row in csv_reader:
    print(row)
    
# Close file
file.close()

So, let's say that we're asked to print the name of the candidate with the lowest popular vote percentage (%) in the 1996 election. The following code is how we could do this using pure Python and the `csv` module.

In [None]:
file = open('elections.csv', 'r')
csv_reader = csv.reader(file)

row_count = 0
name = None
popular_vote = 100

for row in csv_reader:
    
    # Skip first row (column names)
    if row_count > 0:
        
        # row[0] is the column "Year"
        if row[0] == '1996':    
            
            # row[5] is the column "%"
            if float(row[5]) < popular_vote:
                popular_vote = float(row[5])
                
                # row[1] is the column "Candidate"
                name = row[1]
   
    row_count += 1
    
print('Candidate', name, 'had the lowest popular vote in 1996, which was', popular_vote, '%.')
    
file.close()

The `csv` module allowed us to get the desired output, but the code has some limitations. For starters, it's a lot of code for a fairly simple query. Additionally, we are accessing the columns by index, which makes the code challenging to interpret and potentially prone to error. Lastly, all the data is imported as strings, which means we must convert certain columns to a numeric data type.
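
As an aside, the standard library's `csv.DictReader` addresses the second limitation by letting us refer to columns by name. The sketch below repeats the query on a small made-up sample: the three rows are placeholders for illustration, not data taken from `elections.csv`, and the column layout is assumed from the indices used above. Notice that the values still arrive as strings.

```python
import csv
import io

# Hypothetical sample using the column layout assumed for elections.csv.
sample = io.StringIO(
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1996,Candidate A,Party X,500,win,50.0\n"
    "1996,Candidate B,Party Y,400,loss,40.0\n"
    "1996,Candidate C,Party Z,100,loss,10.0\n"
)

name = None
popular_vote = 100.0

# DictReader uses the header row as dictionary keys,
# so we do not need a row counter to skip the header.
for row in csv.DictReader(sample):
    # Every value is still a string and must be converted by hand.
    if row['Year'] == '1996' and float(row['%']) < popular_vote:
        popular_vote = float(row['%'])
        name = row['Candidate']

print('Candidate', name, 'had the lowest popular vote, which was', popular_vote, '%.')
```

This is easier to read, but the query is still a hand-written loop with manual type conversion.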

In this lecture, we will introduce the `Pandas` library, which is the standard in academia and industry for working with tabular data in Python. 

If you're using Anaconda, this package will likely be installed in your `(base)` environment but if not, you can run this code in a notebook cell to install.

```python

!pip install pandas
```

First, let's import `Pandas` and give it the name `pd`.

In [None]:
import pandas as pd

Now, let's try to write some code to accomplish the above task.

Let's first load the `csv` file and print the first 5 rows.

In [None]:
elections = pd.read_csv('elections.csv')
elections.head()

For starters, this is a much nicer view for getting a quick snapshot of our dataset.

But what data type is `elections`?

In [None]:
print(type(elections))

Hmmmm.... what is a `DataFrame`? More on this later.

Now, let's try to find the 1996 candidate with the lowest popular vote %.

In [None]:
worse_candidate = elections.loc[elections['Year'] == 1996, :].sort_values('%', ascending=True).iloc[0]
print(worse_candidate)

And what data type is `worse_candidate`?

In [None]:
print(type(worse_candidate))

Hmmmm.... what is a `Series`? More on this later.

Now, let's print out the results as we did with the previous example.

In [None]:
print('Candidate', worse_candidate['Candidate'], 'had the lowest popular vote in 1996, which was', worse_candidate['%'], '%.')

So, how did we do? Well, we have reduced our code from 14 lines to 3. We are accessing columns by their name, not their position. Lastly, the data has been imported and automatically converted to the most likely data type. As you can see below, `'Candidate'` is a `string` and `'%'` is a `float`.

In [None]:
elections.dtypes

Note: `dtypes` is a `DataFrame` attribute that returns a `Series` listing each column name and its corresponding data type. Columns of strings are typically reported as `object` (newer versions of `Pandas` may show `str` instead).
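
To see the automatic type conversion in isolation, here is a minimal sketch that reads a tiny in-memory CSV. The two rows are made-up placeholders, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the values are placeholders.
sample = io.StringIO(
    "Year,Candidate,%\n"
    "1996,Candidate A,50.0\n"
    "1996,Candidate B,40.5\n"
)
df = pd.read_csv(sample)

# One data type per column, inferred while reading the file.
print(df.dtypes)

# Numeric columns can be used in arithmetic without any conversion.
total_percent = df['%'].sum()
print(total_percent)
```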

Now, let's learn how this `Pandas` library works.

<a id='section2'></a>
## 2. DataFrames
A `DataFrame` is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. `DataFrames` were popularized by the [**R Programming Language**](https://www.r-project.org/) and are generally the most commonly used pandas object. **Pandas** is the most popular Python package for working with `DataFrames`.
<br>
<img src="images/dataframe_overview.png" alt="drawing" width="850"/>
<br>

In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file.

### How to Create a `DataFrame`
We can create a `DataFrame` in a variety of ways. Here, we cover the following:

1. From a CSV file
1. Using a list and column names
1. From a dictionary

#### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object.

In [None]:
elections = pd.read_csv("elections.csv")
elections

Because we did not specify a column which should be used as the `index`, `pandas` creates a default index for us. The default index starts at zero and increases by 1 (`0, 1, 2, 3, 4, ...`).

#### Creating a `DataFrame` using a list and column names
You can create a `DataFrame` using two lists. One list will contain the column names as seen below.

In [None]:
columns = ['Salary', 'Job Title']

The other list will contain the data. Each sub-list represents a row and the number of items in the sub-list must be equal to the number of items in the `columns` list.

In [None]:
data = [
    [110000, 'Professor'],
    [90000, 'Developer'],
    [130000, 'Electrician']
]

We can then use the `pd.DataFrame` constructor by passing the columns list to the `columns` parameter and the data list to the `data` parameter.

In [None]:
salaries = pd.DataFrame(data=data, columns=columns)
salaries

Because we did not pass an argument to the `index` parameter, `pandas` created a default index for us. We could use a custom `index` as follows.

In [None]:
salaries = pd.DataFrame(index=[10, 12, 20], data=data, columns=columns)
salaries

#### Creating a `DataFrame` from a dictionary
You can create a `DataFrame` using a dictionary where the keys are the column names and the values are lists. The length of each list must be the same.

In [None]:
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange", "Banana"], 
                          "Price":[5.49, 3.99, 9.99]})
df_dict_1

Notice how the `"Fruit"` list and the `"Price"` list have the same number of items in them (3).

You can also create a `DataFrame` using a list of dictionaries where each dictionary holds the data for a given row.

In [None]:
df_dict_2 = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49}, 
                          {"Fruit": "Orange", "Price": 3.99},
                          {"Fruit": "Banana", "Price": 9.99}])
df_dict_2
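
As a quick check that the two constructions describe the same table, the sketch below builds a smaller version of each and compares them with `DataFrame.equals`. The variable names `by_column` and `by_row` are chosen here for illustration only.

```python
import pandas as pd

# Dictionary of lists: one key per column.
by_column = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                          "Price": [5.49, 3.99]})

# List of dictionaries: one dictionary per row.
by_row = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                       {"Fruit": "Orange", "Price": 3.99}])

# equals() compares the shape, the labels and the values.
same = by_column.equals(by_row)
print(same)
```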

### `DataFrame` attributes: `index`, `columns`, `shape`
The figure below displays the different components of a DataFrame, which include: `indices`, `columns`, `axes` (more on these later), and `Series`.
<br>
<img src="images/DataFrame.png" alt="drawing" width="450"/>
<br>
Let's check out our `elections` `DataFrame` again.

In [None]:
elections

### `index`
The `index` can be thought of as row labels. They can be numeric but they do not have to be. They could also be strings or datetimes, for example. If the `index` of a DataFrame is numeric, it does not have to be of the form `0, 1, 2, 3, 4, ...`.

#### `index` as a monotonic sequence of integers
The example below for `elections` is likely the expected behaviour when thinking about row indices of a data table.

In [None]:
elections.index

`index` returns a `RangeIndex()` object, which shows the start, end and step size of the row indices. `RangeIndex` is a memory-saving object used for representing monotonic ranges. This is similar to how we represent a sequence of integers using the `range` function in Python. For example, `range(0, 182, 1)`.
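
The comparison with `range` can be checked directly. This is a minimal sketch on a made-up four-row `DataFrame` rather than on `elections`.

```python
import pandas as pd

# A small DataFrame created without an explicit index.
df = pd.DataFrame({"value": [10, 20, 30, 40]})
idx = df.index

print(idx)
print(idx.start, idx.stop, idx.step)

# The labels are the same integers that range() would produce.
same_labels = list(idx) == list(range(0, 4, 1))
print(same_labels)
```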

#### Converting a column into the index
Sometimes, it's advantageous to set a column in your DataFrame as the index. Consider our `elections` `DataFrame` again.

In [None]:
elections

We can set the `"Year"` column as the index.

In [None]:
elections = elections.set_index('Year')
elections

And we can reset the `index` using a monotonic range.

In [None]:
elections = elections.reset_index()
elections

#### `index` as an array of numeric values
Let's go back to `"Year"` being the `index`.

In [None]:
elections = elections.set_index('Year')
elections

One important property of `index` is that these row labels do not have to be unique, which probably seems counterintuitive. As we can see from the example above, there is more than one row with the index `2020`.

Because the index is no longer a monotonic range, the `RangeIndex()` object can no longer be used to represent it.

In [None]:
elections.index

Instead, we get a list of index values.
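
Here is a small sketch of what a non-unique index means in practice. The three rows are made-up placeholders; the point is only that selecting a repeated label returns every matching row.

```python
import pandas as pd

# Hypothetical rows: two candidates share the same year.
df = pd.DataFrame({
    "Year": [2016, 2020, 2020],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C"],
})
by_year = df.set_index("Year")

print(type(by_year.index))
print(by_year.index.is_unique)

# Selecting a repeated label returns all rows carrying that label.
rows_2020 = by_year.loc[2020]
print(rows_2020)
```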

#### `index` as an array of non-numeric values
In the example below, you can see that we can also create an `index` using non-numeric data.

In [None]:
elections = elections.reset_index()
elections = elections.set_index('Candidate')
elections

In [None]:
elections = elections.reset_index()

### `columns`
`columns` are column labels in the same way the `indices` are row labels. We can get a list of `column` names using the following code.

In [None]:
elections.columns

### `shape`
`shape` is a `tuple` where the first item is the number of rows and the second item is the number of columns.
```python
>>> dataframe.shape
(num_rows, num_columns)
```

In [None]:
elections

In [None]:
elections.shape

In [None]:
df_dict_2 = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49}, 
                          {"Fruit": "Orange", "Price": 3.99},
                          {"Fruit": "Banana", "Price": 9.99}])
df_dict_2

In [None]:
df_dict_2.shape

<a id='section3'></a>
## 3. Breakout Session 1
Given the data below, create a `DataFrame` with the columns `phone_number`, `job`, and `years_of_experience`. The index should be `"name"` with the data coming from `data1`. There should be 4 rows in the `DataFrame`. Below is an example of the expected output of your code.

<img src="images/breakout_session_1.png" width="400" style="margin:auto"/>

In [None]:
data1 = {'name': ['John', 'Susan', 'Omid', 'Ava']}
data2 = {'phone_number': ['234-5678', '123-4567', '111-4444', '456-0987']}
data3 = {'job': ['Pizza Delivery', 'Teacher', 'Chemist', 'Coder']}
data4 = {'years_of_experience': [10, 2, 5, 8]}

# Write your code here
data1.update(data2)
data1.update(data3)
data1.update(data4)
df1 = pd.DataFrame(data1)
df1 = df1.set_index('name')

# Display DataFrame
df1

<a id='section4'></a>
## 4. Series
A `Series` is a 1-D labelled array of data. We can think of it as a column of data like you may have seen in an Excel spreadsheet.

### Creating a new `Series` object
Below, we create a `Series` object.

In [None]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

Series have three main components, `index`, `name`, `data` (also referred to as `values`) as you can see in the diagram below. As you can see, `DataFrames` are collections of `Series`. Therefore, a helpful way to think of a `Series` is as a column from a table of data, like an Excel sheet.

<img src="images/Pandas_Series.png" width="600" style="margin:auto"/>


So, for our new `Series` `my_series`, let's check out these attributes (`index`, `name`, `data`).

**name**

In [None]:
print(my_series.name)

We never gave our `Series` a name, so by default, it's `None`.

**data (values)**

In [None]:
print(my_series.values)

We passed this data to our `Series` constructor (`my_series = pd.Series(["welcome", "to", "APS106"])`).

**index**

In [None]:
print(my_series.index)

In the example above, `Pandas` automatically generated an `Index` of integer labels. The first item in the `Series` has `index = 0`, the second item has `index = 1`, and so on. Rather than storing a long list of monotonically increasing integers, `Pandas` has created a `RangeIndex` object to represent the same information in a simpler and more compact format. `RangeIndex(start=0, stop=3, step=1)` represents the same labels as `[0, 1, 2]`, but when you have millions of items in your `Series`, it's much more efficient. This should look familiar: `RangeIndex(start=0, stop=3, step=1)` is very similar to `range(0, 3, 1)`, which we learned in **Week 6**.

We can also create a `Series` object by providing a custom `Index`.

Here is an example where we are specifying an `index`, `data`, and `name` when we create a new `Series`.

In [None]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

We can see all the `Series` information in the print-out and we can also see it by using the attributes below.

In [None]:
print(my_series.name)
print(my_series.values)
print(my_series.index)

Now, because our index is no longer an evenly spaced range of integers (e.g. `[0, 2, 4, 6, 8]`), you can see that the `RangeIndex` object can no longer be used to represent the `index`. For this example, the `index` is simply a list of the index values `["Seb", "Ben", "Katia", "Joseph", "Tamara"]`.

**Note:** The idea of having non-numeric indices is new and a bit strange. We'll get into this more shortly.

After a `Series` has been created, we can reassign the `Index` of a `Series` to a new `Index`.

In [None]:
my_series.index = ['Goodfellow', 'Kinsella', 'Ossetchkina', 'Sebastian', 'Kecman']
print(my_series)

We can also do this for `name`.

In [None]:
my_series.name = 'income'
print(my_series)

But not for `data` (`values`). The cell below raises an error because `values` cannot be assigned directly.

In [None]:
my_series.values = [0, 0, 0, 0, 0]
print(my_series)

More on how to update column data next lecture.

<a id='section5'></a>
## 5. Data Extraction
We can use `.head()` to return the first 5 rows of a DataFrame or `.tail()` to return the last 5 rows of a `DataFrame`.

In [None]:
elections = pd.read_csv("elections.csv")
elections

In [None]:
elections.head() 

In [None]:
elections.tail() 

If you'd like to see the first `10` rows, simply pass `10` as an argument.

In [None]:
elections.head(10) 

### Label-Based Extraction using `.loc`
The `.loc` method in Pandas is primarily used for label-based indexing. It allows you to select data from a DataFrame based on labels or boolean arrays. The `.loc` method takes two arguments inside square brackets (not parentheses). The first is a row selection and the second is a column selection in the following form:
```python
dataframe.loc[row-selection, column-selection]
```

These arguments to `.loc`, `row-selection` and `column-selection` can be:
- A list.
- A slice (syntax is inclusive of the right-hand side of the slice).
- A single value.

Consider our `elections` DataFrame.

In [None]:
elections.head()

#### Selection using a list
The first argument is a list of row labels. These would be values in `.index`. The second argument is a list of column labels. These would be values in `.columns`.
```python
dataframe.loc[[List-of-row-labels], [List-of-column-labels]]
```
These are the options for the row labels:

In [None]:
elections.index

Basically, numbers between 0 and 181.

In [None]:
print(list(elections.index)[0:10])

In [None]:
print(list(elections.index)[-10:])

And these are the options for column labels.

In [None]:
elections.columns

So, let's try this out.

In [None]:
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

#### Selection using a slice
In Pandas, the `:` slice in the `.loc` method is used to select rows and/or columns by label. It represents a range of labels or all labels along a particular axis (rows or columns). Row and column slices can be passed to the `.loc` method in the following way.
```python
df.loc[row_label_slice, column_label_slice]
```
- `row_label_slice`: A slice or a sequence of labels representing the range of rows to select.
- `column_label_slice`: A slice or a sequence of labels representing the range of columns to select.

**Slice of Labels:** If you pass a slice of labels (e.g., `start_label:end_label`), `.loc` selects all rows or columns with labels within the specified range, including both the start and end labels. For example:

In [None]:
elections.loc[80:85, "Candidate":"%"]

In this case, our row slice was `80:85`, which produced the following list of row labels `[80, 81, 82, 83, 84, 85]`, and our column slice `"Candidate":"%"` produced the following list of column labels `["Candidate", "Party", "Popular vote", "Result", "%"]`.

You can also add a step size with the following syntax: `start_label:end_label:step_size`. For example, using a step size of `2` for the previous example would produce the following output.

In [None]:
elections.loc[80:85:2, "Candidate":"%":2]

**Important:** Compared to what we have previously learned with regards to slicing `strings`, `lists`, `tuples`, etc., the start and stop labels for `.loc` are inclusive. You'll recall that for `string` slicing, `string[0:5]` will include indices `0, 1, 2, 3, 4` and NOT `5`. In the example above, you'll notice row label `85` and column label `"%"` are included.
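
The difference is easy to see side by side. The sketch below uses a made-up five-row `DataFrame` with the default labels `0` to `4` and applies the same `1:3` slice to a plain list and to `.loc`.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
df = pd.DataFrame({"letter": letters})  # default row labels 0..4

# Python list slicing excludes the stop position.
list_slice = letters[1:3]

# .loc slicing includes the stop label.
loc_slice = df.loc[1:3, "letter"]

print(list_slice)
print(list(loc_slice))
```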

`.loc` also supports a negative step size.

In [None]:
elections.loc[80:85:2, "%":"Candidate":-1]

The one area where you can run into an error is when slicing row labels where there are non-unique values. Let's create a dummy DataFrame to show this behavior.

In [None]:
elections_test = elections.loc[80:90, "Year":"%"]
elections_test.index = [0, 1, 2, 3, 2, 5, 10, 12, 13, 8, 30]
elections_test

Now we have an `index` that is NOT MONOTONIC. A monotonically increasing index is one where each label is greater than or equal to the preceding label. Here, for example, the second `2` comes after `3`, and `8` comes after `13`.

In [None]:
elections_test.loc[2:8, "Year":"Party"]

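As you can see, we get an error. Here the index is not monotonic and the slice bound `2` appears more than once, so Pandas cannot tell where the slice should start. In general, avoid label slicing when the row labels or column labels are not monotonically increasing or decreasing.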
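
The same behaviour can be reproduced on a small made-up `DataFrame`. The sketch below catches the error and then shows one possible workaround, sorting the index with `sort_index()` first; this is offered as an illustration, not as the approach used in the rest of the lecture.

```python
import pandas as pd

# The label 2 is repeated and the labels are not in order.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]},
                  index=[0, 1, 2, 3, 2, 5])

try:
    df.loc[2:5]
    slice_failed = False
except KeyError as error:
    slice_failed = True
    print("KeyError:", error)

# After sorting, the index is monotonic and the slice is well defined.
sorted_df = df.sort_index()
sorted_slice = sorted_df.loc[2:5]
print(sorted_slice)
```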

**Selection of all Labels:**  If you use `:` alone, `.loc` selects all rows or columns. For example, if I want to select all columns but only a few rows, we could do the following:

In [None]:
elections.loc[[3, 10, 50], :]

**Selection of a single value:**  You can pass a single row or column label into the row or column selection arguments of the `loc` method.

Below is an example of selecting a single value in a table.

In [None]:
elections.loc[0, 'Candidate']

Below is an example of selecting a single row in a table.

In [None]:
elections.loc[0, :]

You can see the `type` is a `Series`, which Pandas uses to represent 1-D data, which is what a row is.

In [None]:
type(elections.loc[0, :])

However, if you want `.loc` to return a `DataFrame` with a single row, then you need to pass in a list of row labels with only one row label in it.

In [None]:
elections.loc[[0], :]

In [None]:
type(elections.loc[[0], :])

Below is an example of selecting a single column in a table, which returns a `Series`.

In [None]:
elections.loc[:, 'Candidate']

And the same example but where a `DataFrame` with one column is returned.

In [None]:
elections.loc[:, ['Candidate']]

### Integer-Based Extraction Using `.iloc`
In Pandas, the `.iloc` method is used for integer-based indexing, allowing you to select rows and/or columns by their integer position within the DataFrame. It is similar to Python's native indexing we use for `lists`, `strings`, `tuples`, and `range`.

- When using `.iloc`, integer indexes are used to select rows and columns.
- Slices in `.iloc` are exclusive of the end index, meaning the row or column at the end index is not included in the selection.
- Negative integer indexes can be used to select rows or columns from the end of the `DataFrame`.
- `.iloc` provides a way to access `DataFrame` elements based on their integer position, which can be useful when you want to select rows or columns by their position rather than by their labels.
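
The points above can be sketched on a small made-up `DataFrame` (the column names `x`, `y` and `z` are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [10, 20, 30, 40],
                   "z": [100, 200, 300, 400]})

# Positions 0 and 1 only: the end position 2 is excluded.
first_two = df.iloc[0:2, 0:2]
print(first_two)

# A negative position counts from the end, as with lists.
last_row = df.iloc[-1]
print(last_row)
```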

The figure below shows the differences between `loc` and `iloc`.

<img src="images/iloc_loc.png" width="900" style="margin:auto"/>

Let's check out our `elections` dataset again.

In [None]:
elections.head()

Below are some examples of using `iloc`.

In [None]:
elections.iloc[2:, 3:5]

In [None]:
elections.iloc[::2, 3:5]

In [None]:
elections.iloc[5:-1, 3:5]

#### Caution
We will use both `.loc` and `.iloc` in the course. `.loc` is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read because the reader doesn't have to know e.g. what column **#31** represents.
3. It is robust against permutations of the data, e.g. the Social Security administration switches the order of two columns.

However, `iloc` is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
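
To illustrate the third point in the list above, here is a minimal sketch in which two columns of a made-up one-row `DataFrame` are swapped. The label-based lookup is unaffected, while the position-based lookup now returns a different column.

```python
import pandas as pd

# Hypothetical single row; the values are placeholders.
df = pd.DataFrame({"Year": [1996],
                   "Candidate": ["Candidate A"],
                   "Party": ["Party X"]})

# Same data with the first two columns swapped.
swapped = df.loc[:, ["Candidate", "Year", "Party"]]

by_label_before = df.loc[0, "Candidate"]
by_label_after = swapped.loc[0, "Candidate"]

by_position_before = df.iloc[0, 1]
by_position_after = swapped.iloc[0, 1]

print(by_label_before, by_label_after)
print(by_position_before, by_position_after)
```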

### Data Extraction From a `Series` Using `.loc` and `.iloc`
`loc` and `iloc` can be used in the same way for a `Series`. The only difference is that they do not require row and column selections because a `Series` is 1D. Let's go back to our `Series` example.

In [None]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

In [None]:
my_series.loc['Ben']

In [None]:
my_series.loc['Ben':'Joseph']

In [None]:
my_series.iloc[2]

In [None]:
my_series.iloc[1:4]

<a id='section6'></a>
## 6. Breakout Session 2
You have been provided with a `DataFrame` containing information about students' scores in different subjects. 

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Math': [80, 85, 90, 95, 75],
    'Science': [75, 90, 85, 80, 95],
    'English': [85, 80, 75, 70, 90],
    'History': [90, 85, 80, 75, 85]
}

df = pd.DataFrame(data)
df

#### Question 1
Select the scores of the last 2 students in `"Math"` and `"Science"` using `.loc`.

In [None]:
df.loc[df.index[-2:], ['Math', 'Science']]

#### Question 2
Select the scores of the students with row positions 1, 2, and 3 in `"English"` and `"History"` using `.iloc`.

In [None]:
df.iloc[1:4, 3:]

#### Question 3
Select the scores of students with row position 0, 2, and 4 in all subjects using `.iloc`.

In [None]:
df.iloc[[0, 2, 4]]

#### Question 4
Select the scores of all students in `"Math"` and `"Science"` using `.loc`.

In [None]:
df.loc[:, ['Math', 'Science']]

#### Question 5
Create a new `DataFrame` including all the rows but with the columns reordered in reverse order using `.loc` and slicing.

In [None]:
df.loc[:, 'History':'Name':-1]

<a id='section7'></a>
## 7. Looping
There are MANY ways to loop over a `DataFrame`. Some are very fast and others very slow but for `APS106` we will learn two methods, one that relies on `.loc` and the other that relies on `.iloc`. 

Let's reload our trusty elections dataset.

In [None]:
elections = pd.read_csv("elections.csv")
elections.head()

### Looping over row labels (`.loc`)
In the code below, we are looping over the row labels and then using them to access the candidate using `.loc`. The candidate is saved as a string and then the first name is printed in all lower-case.

In [None]:
for row_label in elections.index:
    candidate = elections.loc[row_label, 'Candidate']
    print(candidate.split(' ')[0].lower())

### Looping over row label positions (`.iloc`)
In the code below, we are looping over the row label positions and then using them to access the candidate using `.iloc`. The candidate is saved as a string and then the first name is printed in all lower-case.

In [None]:
for row_id in range(elections.shape[0]):
    candidate = elections.iloc[row_id, 1]
    print(candidate.split(' ')[0].lower())

### Looping over a `Series`
We can loop over the values in a `Series` as follows.

In [None]:
for item in my_series:
    print(item)

Or, we could loop through the indices and then use them to access the values.

In [None]:
for index in my_series.index:
    print(index, my_series.loc[index])
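
As one of the many other looping options, `Series.items()` yields each label together with its value, which avoids the separate `.loc` lookup. The sketch below uses a shortened two-item version of the salary `Series`; the names `salaries` and `pairs` are introduced here for illustration.

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000],
                     index=["Seb", "Ben"],
                     name="salary")

pairs = []
# items() yields (label, value) pairs in order.
for label, value in salaries.items():
    pairs.append((label, value))
    print(label, value)
```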

<a id='section8'></a>
## 8. Saving DataFrames
In Pandas, saving `DataFrames` as `CSV` (Comma Separated Values) files is a straightforward process and can be done using the `to_csv()` function. This function allows you to save the contents of a `DataFrame` to a `CSV` file with various options for customization.

Let's create a custom `DataFrame`.

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Department': ['HR', 'Engineering', 'Marketing']}
df = pd.DataFrame(data)
df.head()

To save a `DataFrame` all we need is to specify the name of the file we're saving.

In [None]:
df.to_csv('example.csv')

If we don't want to include the `index`, simply pass `False` to the `index` parameter.

In [None]:
df.to_csv('example.csv', index=False)
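
To see what the `index` parameter changes, the sketch below writes a small `DataFrame` both ways and reads each result back. To avoid creating files, it calls `to_csv()` without a file name, which returns the CSV text as a string, and wraps that text in `io.StringIO` for `read_csv`.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Without a file name, to_csv() returns the CSV text.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

# The saved index comes back as an extra, unnamed column.
print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

When the index is just the default `0, 1, 2, ...`, saving with `index=False` keeps the file free of that extra column.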