# APS106 - Fundamentals of Computer Programming
## Week 12 | Lecture 1 (12.1) - Installing third-party packages, managing environments, and Pandas

### This Week
| Lecture | Topics |
| --- | --- |
| **12.1** | **Pandas** |
| 12.2 | More Pandas, Data Visualization | 
| 12.3 | Design Problem: Stock Market, Part 1 |

### Lecture Structure
1. [csv Module](#section1)
2. [Series](#section2)
3. [DataFrames](#section3)
4. [Indices](#section4)

<a id='section1'></a>
## 1. csv Module
Let's first import the `csv` module, which comes pre-installed with Python.

In [1]:
import csv

We will now import a `csv` file which contains the results from past presidential elections in the United States. This file is in the same folder as this notebook and is named `elections.csv`.

In [9]:
# Open the file
file = open('elections.csv', 'r')

# Create a csv reader object
csv_reader = csv.reader(file)

# Loop through the rows and print out the contents of each row
for row in csv_reader:
    print(row)
    
# Close file
file.close()

['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%']
['1824', 'Andrew Jackson', 'Democratic-Republican', '151271', 'loss', '57.21012204']
['1824', 'John Quincy Adams', 'Democratic-Republican', '113142', 'win', '42.78987796']
['1828', 'Andrew Jackson', 'Democratic', '642806', 'win', '56.20392707']
['1828', 'John Quincy Adams', 'National Republican', '500897', 'loss', '43.79607293']
['1832', 'Andrew Jackson', 'Democratic', '702735', 'win', '54.57478905']
['1832', 'Henry Clay', 'National Republican', '484205', 'loss', '37.6036283']
['1832', 'William Wirt', 'Anti-Masonic', '100715', 'loss', '7.821582644']
['1836', 'Hugh Lawson White', 'Whig', '146109', 'loss', '10.00598542']
['1836', 'Martin Van Buren', 'Democratic', '763291', 'win', '52.27247202']
['1836', 'William Henry Harrison', 'Whig', '550816', 'loss', '37.72154257']
['1840', 'Martin Van Buren', 'Democratic', '1128854', 'loss', '46.94878676']
['1840', 'William Henry Harrison', 'Whig', '1275583', 'win', '53.05121324']
['1844'

For example, let's say that we're asked to print the name of the candidate with the lowest popular vote percentage (%) in the 1996 election. The following code shows how we could do this using pure Python and the `csv` module.

In [25]:
file = open('elections.csv', 'r')
csv_reader = csv.reader(file)

row_count = 0
name = None
popular_vote = 100

for row in csv_reader:
    
    if row_count > 0:
        
        # row[0] is the column "Year"
        if row[0] == '1996':    
            
            # row[5] is the column "%"
            if float(row[5]) < popular_vote:
                popular_vote = float(row[5])
                
                # row[1] is the column "Candidate"
                name = row[1]
   
    row_count += 1
    
print('Candidate', name, 'had the lowest popular vote in 1996, which was', popular_vote, '%.')
    
file.close()

Candidate John Hagelin had the lowest popular vote in 1996, which was 0.118218738 %.


The `csv` module allowed us to get the desired output, but the code has some limitations. For starters, it's a lot of code for a fairly simple query. Additionally, we are accessing the columns by an index, which makes the code challenging to interpret and potentially prone to error. Lastly, all the data is imported as strings, which means we must convert certain columns to a numeric data type.
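To make the point about strings concrete, here is a minimal sketch that reads two of the rows printed earlier. It assumes an in-memory `io.StringIO` object as a stand-in for `elections.csv`, so the example runs on its own; the variable names are only for illustration.

```python
import csv
import io

# Two rows copied from the output above; io.StringIO stands in for the real file.
sample = io.StringIO(
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n"
    "1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n"
)

rows = list(csv.reader(sample))
header, first_row = rows[0], rows[1]

# Every field is a string, even the ones that look numeric.
print(type(first_row[0]), type(first_row[5]))

# We must convert explicitly before doing arithmetic or comparisons.
year = int(first_row[0])
percent = float(first_row[5])
print(year, percent)
```

Without the `int()` and `float()` conversions, a comparison such as `first_row[5] < 100` would raise a `TypeError`, because Python will not compare a `str` with an `int`.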

In this lecture we will introduce the `Pandas` library, which is the standard in academia and industry for working with tabular data in Python. 

If you're using Anaconda, this package will likely already be installed in your `(base)` environment. If not, you can run this code in a notebook cell to install it.

```python

!pip install pandas
```
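One quick way to confirm the installation worked is to import the package and print its version. This is only a sketch of a sanity check; the version number you see will depend on your environment.

```python
import pandas

# If this import succeeds, pandas is installed in the active environment.
print(pandas.__version__)
```

If the import fails with a `ModuleNotFoundError`, the package is not installed in the environment your notebook is running in.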

First, let's import `Pandas` and give it the name `pd`.

In [16]:
import pandas as pd

Now, let's try to write some code to accomplish the above task.

Let's first load the `csv` file and print the first 5 rows.

In [18]:
elections = pd.read_csv('elections.csv')
elections.head()

|  | Year | Candidate | Party | Popular vote | Result | % |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |


For starters, this is a much nicer view for getting a quick snapshot of our dataset.

But what data type is `elections`?

In [38]:
print(type(elections))

<class 'pandas.core.frame.DataFrame'>


Hmmmm.... what is a `DataFrame`? More on this later.

Now, let's try to find the 1996 candidate with the lowest popular vote %.

In [34]:
worse_candidate = elections[elections['Year'] == 1996].sort_values('%', ascending=True).iloc[0]
print(worse_candidate)

Year                    1996
Candidate       John Hagelin
Party            Natural Law
Popular vote          113670
Result                  loss
%                   0.118219
Name: 148, dtype: object
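The one-line query above chains several steps together. The sketch below spells them out one at a time on a small `DataFrame` built from the 1828 and 1832 rows printed earlier (an assumption made so the example runs without `elections.csv`). The intermediate variable names are only for illustration.

```python
import pandas as pd

# Rows copied from the 1828 and 1832 elections shown earlier,
# a small stand-in for the full elections.csv.
sample = pd.DataFrame({
    "Year": [1828, 1828, 1832, 1832, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "Henry Clay", "William Wirt"],
    "%": [56.20392707, 43.79607293, 54.57478905, 37.6036283, 7.821582644],
})

# Step 1: build a boolean Series that is True for rows from the chosen year.
is_1832 = sample["Year"] == 1832

# Step 2: keep only the rows where that boolean Series is True.
rows_1832 = sample[is_1832]

# Step 3: sort the remaining rows by the "%" column, smallest first.
sorted_rows = rows_1832.sort_values("%", ascending=True)

# Step 4: take the first row by position, which comes back as a Series.
lowest = sorted_rows.iloc[0]
print(lowest["Candidate"])
```

Writing the query as separate steps is longer, but it lets you print each intermediate result while you are learning how the pieces fit together.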


And what data type is `worse_candidate`?

In [39]:
print(type(worse_candidate))

<class 'pandas.core.series.Series'>


Hmmmm.... what is a `Series`? More on this later.

Now, let's print out the results as we did with the previous example.

In [35]:
print('Candidate', worse_candidate['Candidate'], 'had the lowest popular vote in 1996, which was', worse_candidate['%'], '%.')

Candidate John Hagelin had the lowest popular vote in 1996, which was 0.118218738 %.


So, how did we do? Well, we have reduced our code from 14 lines to 3. We are accessing columns by their name, not their position. Lastly, the data has been imported and automatically converted to the most likely data type. As you can see below, `'Candidate'` is a `string` and `'%'` is a `float`.

In [36]:
elections.dtypes

Year              int64
Candidate        object
Party            object
Popular vote      int64
Result           object
%               float64
dtype: object

Note: `object` is the data type `Pandas` uses here for columns of strings.
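As a small illustration of why the automatic conversion matters, the sketch below loads two of the rows shown earlier from an in-memory string (an `io.StringIO` stand-in for `elections.csv`, assumed here so the example is self-contained) and does arithmetic on a numeric column without any manual conversion.

```python
import io
import pandas as pd

# Two rows copied from the csv output earlier in the lecture.
sample = io.StringIO(
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n"
    "1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n"
)

df = pd.read_csv(sample)

# The numeric columns were converted for us while the file was read.
print(df.dtypes)

# So we can add up a column directly, with no int() or float() calls.
total_votes = df["Popular vote"].sum()
print(total_votes)
```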

Now, let's learn how this `Pandas` library works.

<a id='section2'></a>
## 2. Series
A `Series` is a 1-D labelled array of data. We can think of it as a column of data like you may have seen in an Excel spreadsheet.

### Creating a new `Series` object

In [3]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


Because we imported `Pandas` using this code: `import pandas as pd`, anytime we want to use a module, function, or class from the `Pandas` library, we must prefix it with the alias we used during import, which was `pd`.

```python
pd.module_or_function_or_class_name()
pd.Series()
pd.DataFrame()
```

If we just did this:
```python
Series()
DataFrame()
```

We would get an error.

In [4]:
my_series = Series(["welcome", "to", "APS106"])
print(my_series)

NameError: name 'Series' is not defined

Ok, back to the correct code.

In [5]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


A `Series` has three main components: `index`, `name`, and `data` (also referred to as `values`), as you can see in the diagram below. Don't worry about `DataFrames` for now, we'll get to them soon enough, but as you can see, `DataFrames` are collections of `Series`. Therefore, a helpful way to think of a `Series` is as a column from a table of data.

<img src="images/Pandas_Series.png" width="600" style="margin:auto"/>


So, for our new `Series` `my_series`, let's check out these attributes (`index`, `name`, `data`).

**name**

In [6]:
print(my_series.name)

None


We never gave our `Series` a name, so by default, it's `None`.

**data (values)**

In [7]:
print(my_series.values)

['welcome' 'to' 'APS106']


We passed this data to our `Series` constructor (`my_series = pd.Series(["welcome", "to", "APS106"])`).

**index**

In [15]:
print(my_series.index)

RangeIndex(start=0, stop=3, step=1)


In the example above, `Pandas` automatically generated an `Index` of integer labels because we did not provide one (we can also create a `Series` with a custom `Index`, as we will see shortly). The first item in the `Series` has `index = 0`, the second item has `index = 1`, and so on. Rather than storing the index as a long list of monotonically increasing integers, `Pandas` has created a `RangeIndex` object, which represents the same information in a simpler and more compact format. `RangeIndex(start=0, stop=3, step=1)` is the same as `[0, 1, 2]`, but when you have millions of items in your `Series`, it's much more efficient.
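The short sketch below checks both claims: the default index really does behave like `[0, 1, 2]`, and a `RangeIndex` can describe a million labels using only its `start`, `stop`, and `step`. The variable names are only for illustration.

```python
import pandas as pd

# The default index of a Series is a RangeIndex.
default_index = pd.Series(["welcome", "to", "APS106"]).index
print(type(default_index))

# Converting it to a list shows the labels it represents.
print(list(default_index))

# A RangeIndex stores only start, stop, and step, however long it is.
big_index = pd.RangeIndex(start=0, stop=1_000_000, step=1)
print(len(big_index))
print(big_index[999_999])
```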

Here is an example where we are specifying an `index`, `data`, and `name` when we create a new `Series`.

In [1]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64

We can see all the `Series` information in the printout and we can also see it by using the attributes below.

In [10]:
print(my_series.name)
print(my_series.values)
print(my_series.index)

salary
[ 20000  21000 100000  88000 101000]
Index(['Seb', 'Ben', 'Katia', 'Joseph', 'Tamara'], dtype='object')


Now, because our index is no longer a sequence of evenly spaced integers (e.g. `[0, 1, 2, 3, 4]`), a `RangeIndex` object can no longer be used to represent the `index`. For this example, the `index` is an `Index` object that simply stores the labels `["Seb", "Ben", "Katia", "Joseph", "Tamara"]`.

**Note:** The idea of having non-integer index labels is new and a bit strange. We'll get into this more shortly.

After a `Series` has been created, we can reassign the `Index` of a `Series` to a new `Index`.

In [11]:
my_series.index = ['Goodfellow', 'Kinsella', 'Ossetchkina', 'Sebastian', 'Kecman']
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: salary, dtype: int64
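One thing to watch for when reassigning the `Index`: the new list of labels must have exactly one label per item in the `Series`. The sketch below uses a shorter, three-item `Series` (an assumption to keep the example brief) and shows that a list of the wrong length is rejected with a `ValueError`.

```python
import pandas as pd

s = pd.Series(data=[20000, 21000, 100000],
              index=["Seb", "Ben", "Katia"],
              name="salary")

# Two labels for three items: pandas refuses and the Series is left unchanged.
try:
    s.index = ["Goodfellow", "Kinsella"]
except ValueError as error:
    print("Could not reassign the index:", error)

# With one label per item, the assignment succeeds.
s.index = ["Goodfellow", "Kinsella", "Ossetchkina"]
print(s)
```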


We can also do this for `name`.

In [12]:
my_series.name = 'income'
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: income, dtype: int64


But not for `data (values)`.

In [13]:
my_series.values = [0, 0, 0, 0, 0]
print(my_series)

AttributeError: can't set attribute 'values'

More on how to update column data when we get to `DataFrames`.
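As a brief preview, and only as a sketch, individual items in a `Series` can still be changed by assigning to their labels, even though the `values` attribute itself cannot be replaced. The new salary figures below are made up purely for illustration.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# Update a single item by its label.
my_series["Seb"] = 25000

# Update several items at once with a list of labels and a list of new values.
my_series[["Ben", "Katia"]] = [22000, 105000]

print(my_series)
```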

#### Selecting items in a `Series`
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [14]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64


**Selection using one label**

In [15]:
my_series['Seb']

20000

Notice how the return value is a single array element.

**Selection using a list of labels**

In [18]:
my_series[['Seb', 'Tamara']]

Seb        20000
Tamara    101000
Name: salary, dtype: int64

Notice how the return value is another `Series`.

In [19]:
type(my_series[['Seb', 'Tamara']])

pandas.core.series.Series

**Selection using a filter condition**

Filter condition: select all elements greater than 50,000.

In [21]:
print(my_series > 50000)

Seb       False
Ben       False
Katia      True
Joseph     True
Tamara     True
Name: salary, dtype: bool


What we get back is a `Series` of booleans. `True` if the data is > 50,000 and `False` if the data is <= 50,000.

We can use this boolean `Series` to filter our original `Series` to only have data with values > 50,000.

In [22]:
print(my_series[my_series > 50000])

Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64
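Filter conditions can also be combined. The sketch below uses the `&` (and) and `~` (not) operators on boolean `Series`; the `|` operator works the same way for "or". The variable names `in_range` and `not_above` are only for illustration.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# Each comparison needs its own parentheses because & binds more tightly than > and <.
in_range = my_series[(my_series > 50000) & (my_series < 101000)]
print(in_range)

# ~ flips every boolean, selecting the items that fail the condition.
not_above = my_series[~(my_series > 50000)]
print(not_above)
```

Note that Python's own `and`, `or`, and `not` keywords do not work element by element on a `Series`, which is why the symbol operators are used here.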


You could also make up your own list of booleans. As long as this list has as many booleans in it as there are items in the `Series`, you can use it in the following way.

In [23]:
print(my_series[[True, True, False, False, False]])

Seb    20000
Ben    21000
Name: salary, dtype: int64
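The list of booleans does not have to be typed out by hand. As a sketch, the example below builds one from the index labels with a list comprehension; the rule used (names ending in "a") is arbitrary and chosen only for illustration.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# One boolean per item in the Series: True if the label ends with "a".
mask = [name.endswith("a") for name in my_series.index]
print(mask)

# Use the list exactly like the hand-written list of booleans above.
selected = my_series[mask]
print(selected)
```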


#### Looping over a `Series`

In [28]:
for item in my_series:
    print(item)

20000
21000
100000
88000
101000
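The loop above visits only the data. If you also need the label for each item, one option is the `items()` method of a `Series`, which yields `(label, value)` pairs, as sketched below.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# items() yields one (label, value) pair per item, in order.
for label, value in my_series.items():
    print(label, value)
```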
