# APS106 - Fundamentals of Computer Programming
## Week 12 | Lecture 1 (12.1) - Installing third-party packages, managing environments, and Pandas

### This Week
| Lecture | Topics |
| --- | --- |
| **12.1** | **Pandas** |
| 12.2 | More Pandas, Data Visualization | 
| 12.3 | Design Problem: Stock Market, Part 1 |

### Lecture Structure
1. [csv Module](#section1)
2. [DataFrames](#section2)
3. [Series](#section3)
4. [Data Extraction](#section4)

<a id='section1'></a>
## 1. csv Module
Let's first import the `csv` module, which comes pre-installed with Python.

In [4]:
import csv

We will now import a `csv` file which contains the results from past presidential elections in the United States. This file is in the same folder as this notebook and is named `elections.csv`.

In [5]:
# Open the file
file = open('elections.csv', 'r')

# Create a csv reader object
csv_reader = csv.reader(file)

# Loop through the rows and print out the contents of each row
for row in csv_reader:
    print(row)
    
# Close file
file.close()

['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%']
['1824', 'Andrew Jackson', 'Democratic-Republican', '151271', 'loss', '57.21012204']
['1824', 'John Quincy Adams', 'Democratic-Republican', '113142', 'win', '42.78987796']
['1828', 'Andrew Jackson', 'Democratic', '642806', 'win', '56.20392707']
['1828', 'John Quincy Adams', 'National Republican', '500897', 'loss', '43.79607293']
['1832', 'Andrew Jackson', 'Democratic', '702735', 'win', '54.57478905']
['1832', 'Henry Clay', 'National Republican', '484205', 'loss', '37.6036283']
['1832', 'William Wirt', 'Anti-Masonic', '100715', 'loss', '7.821582644']
['1836', 'Hugh Lawson White', 'Whig', '146109', 'loss', '10.00598542']
['1836', 'Martin Van Buren', 'Democratic', '763291', 'win', '52.27247202']
['1836', 'William Henry Harrison', 'Whig', '550816', 'loss', '37.72154257']
['1840', 'Martin Van Buren', 'Democratic', '1128854', 'loss', '46.94878676']
['1840', 'William Henry Harrison', 'Whig', '1275583', 'win', '53.05121324']
['1844'

For example, let's say that we're asked to print the name of the candidate with the lowest popular vote percentage (%) in the 1996 election. The following code shows how we could do this using pure Python and the `csv` module.

In [6]:
file = open('elections.csv', 'r')
csv_reader = csv.reader(file)

row_count = 0
name = None
popular_vote = 100

for row in csv_reader:
    
    if row_count > 0:
        
        # row[0] is the column "Year"
        if row[0] == '1996':    
            
            # row[5] is the column "%"
            if float(row[5]) < popular_vote:
                popular_vote = float(row[5])
                
                # row[1] is the column "Candidate"
                name = row[1]
   
    row_count += 1
    
print('Candidate', name, 'had the lowest popular vote in 1996, which was', popular_vote, '%.')
    
file.close()

Candidate John Hagelin had the lowest popular vote in 1996, which was 0.118218738 %.


The `csv` module allowed us to get the desired output, but the code has some limitations. For starters, it's a lot of code for a fairly simple query. Additionally, we are accessing the columns by index, which makes the code hard to interpret and prone to error. Lastly, all the data is imported as strings, which means we must convert certain columns to a numeric data type.
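To make the string limitation concrete, here is a minimal sketch. It assumes a small in-memory sample (three rows copied from the output above) instead of `elections.csv`, and it uses `csv.DictReader`, which was not used in the lecture code, to read columns by name. The variable names are illustrative only.

```python
import csv
import io

# Three rows copied from the elections output above, held in memory so the
# sketch runs without elections.csv (an assumption made for illustration).
sample = io.StringIO(
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1832,Andrew Jackson,Democratic,702735,win,54.57478905\n"
    "1832,Henry Clay,National Republican,484205,loss,37.6036283\n"
    "1832,William Wirt,Anti-Masonic,100715,loss,7.821582644\n"
)

# DictReader uses the header row as keys, so columns are read by name.
rows = list(csv.DictReader(sample))

# Every value is still a string and must be converted before comparing.
value_type = type(rows[0]["%"])
lowest = min(rows, key=lambda row: float(row["%"]))

# Comparing the raw strings is misleading: "37.6..." sorts before "7.8...".
wrong_lowest = min(rows, key=lambda row: row["%"])

print(value_type)                 # <class 'str'>
print(lowest["Candidate"])        # William Wirt
print(wrong_lowest["Candidate"])  # Henry Clay
```

Even with name-based access, the numeric conversion remains our responsibility, which is one of the gaps `Pandas` fills.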

In this lecture we will introduce the `Pandas` library, which is the standard in academia and industry for working with tabular data in Python. 

If you're using Anaconda, this package will likely already be installed in your `(base)` environment. If not, you can run this code in a notebook cell to install it.

```python

!pip install pandas
```

First, let's import `Pandas` and give it the name `pd`.

In [7]:
import pandas as pd

Now, let's try to write some code to accomplish the above task.

Let's first load the `csv` file and print the first 5 rows.

In [8]:
elections = pd.read_csv('elections.csv')
elections.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


For starters, this is a much nicer view for getting a quick snapshot of our dataset.

But what data type is `elections`?

In [9]:
print(type(elections))

<class 'pandas.core.frame.DataFrame'>


Hmmmm.... what is a `DataFrame`? More on this later.

Now, let's try to find the 1996 candidate with the lowest popular vote %.

In [10]:
worse_candidate = elections[elections['Year'] == 1996].sort_values('%', ascending=True).iloc[0]
print(worse_candidate)

Year                    1996
Candidate       John Hagelin
Party            Natural Law
Popular vote          113670
Result                  loss
%                   0.118219
Name: 148, dtype: object
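The one-line query above chains several operations together. The sketch below splits the same pattern into separate steps. It assumes a small in-memory sample (rows copied from the `csv` output earlier) and queries the year 1832 rather than 1996, simply because those rows are visible in this notebook. The intermediate variable names are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for elections.csv (rows copied from the output above).
sample = io.StringIO(
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1832,Andrew Jackson,Democratic,702735,win,54.57478905\n"
    "1832,Henry Clay,National Republican,484205,loss,37.6036283\n"
    "1832,William Wirt,Anti-Masonic,100715,loss,7.821582644\n"
    "1836,Hugh Lawson White,Whig,146109,loss,10.00598542\n"
    "1836,Martin Van Buren,Democratic,763291,win,52.27247202\n"
)
elections_sample = pd.read_csv(sample)

# Step 1: build a boolean Series that is True for rows from 1832.
is_1832 = elections_sample['Year'] == 1832

# Step 2: keep only the rows where the condition is True.
only_1832 = elections_sample[is_1832]

# Step 3: sort those rows by the '%' column, smallest first.
sorted_1832 = only_1832.sort_values('%', ascending=True)

# Step 4: take the first row by position.
lowest = sorted_1832.iloc[0]

print(lowest['Candidate'])  # William Wirt
```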


And what data type is `worse_candidate`?

In [11]:
print(type(worse_candidate))

<class 'pandas.core.series.Series'>


Hmmmm.... what is a `Series`? More on this later.

Now, let's print out the results as we did in the previous example.

In [12]:
print('Candidate', worse_candidate['Candidate'], 'had the lowest popular vote in 1996, which was', worse_candidate['%'], '%.')

Candidate John Hagelin had the lowest popular vote in 1996, which was 0.118218738 %.


So, how did we do? Well, we have reduced our code from 14 lines to 3. We are accessing columns by their name, not their position. Lastly, the data has been imported and automatically converted to the most likely data type. As you can see below, `'Candidate'` is a `string` and `'%'` is a `float`.

In [13]:
elections.dtypes

Year              int64
Candidate        object
Party            object
Popular vote      int64
Result           object
%               float64
dtype: object

Note: `object` is the data type `Pandas` uses for strings here. `dtypes` is a `DataFrame` attribute that returns a `Series` showing the name of each column and its corresponding data type.
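As a small illustration of why the automatic conversion matters, the sketch below loads two rows (copied from the output above) from an in-memory string, which is an assumption made so the example runs without `elections.csv`. The numeric columns can then be used in arithmetic with no manual `int()` or `float()` calls.

```python
import io
import pandas as pd

# Two rows copied from the elections output above.
csv_text = (
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n"
    "1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# read_csv inferred numeric types for these columns.
print(df['Year'].dtype)
print(df['%'].dtype)

# Because 'Popular vote' holds integers, we can sum it directly.
total_votes = df['Popular vote'].sum()
print(total_votes)  # 264413
```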

Now, let's learn how this `Pandas` library works.

<a id='section2'></a>
## 2. DataFrames
A `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. Data frames were popularized by the [**R Programming Language**](https://www.r-project.org/), and the `DataFrame` is generally the most commonly used pandas object. **Pandas** is the most popular Python package for working with `DataFrames`.
<br>
<img src="images/dataframe_overview.png" alt="drawing" width="850"/>
<br>

In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file.

### How to Create a `DataFrame`
We can create a `DataFrame` in a variety of ways. Here, we cover the following:

1. From a CSV file
1. Using a list and column names
1. From a dictionary

#### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object.

In [49]:
elections = pd.read_csv("elections.csv")
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


Because we did not specify a column which should be used as the `index`, `pandas` creates a default index for us. The default index starts at zero and increases by 1 (`0, 1, 2, 3, 4, ...`).
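If we do want a column to act as the index at load time, `read_csv` accepts an `index_col` argument. The sketch below is a minimal example, assuming an in-memory sample of three rows in place of `elections.csv`. The names `default_index` and `year_index` are illustrative.

```python
import io
import pandas as pd

# Three rows copied from the elections output above.
csv_text = (
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n"
    "1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n"
    "1828,Andrew Jackson,Democratic,642806,win,56.20392707\n"
)

# No index column specified: pandas generates 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# Use the 'Year' column as the row labels instead.
year_index = pd.read_csv(io.StringIO(csv_text), index_col='Year')

print(list(default_index.index))  # [0, 1, 2]
print(list(year_index.index))     # [1824, 1824, 1828]
```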

#### Creating a `DataFrame` using a list and column names
You can create a `DataFrame` using two lists. One list will contain the column names, as seen below.

In [50]:
columns = ['Salary', 'Job Title']

The other list will contain the data. Each sublist represents a row, and the number of items in each sublist must be equal to the number of items in the `columns` list.

In [51]:
data = [
    [110000, 'Professor'],
    [90000, 'Developer'],
    [130000, 'Electrician']
]

We can then use the `pd.DataFrame` constructor by passing the columns list to the `columns` parameter and the data list to the `data` parameter.

In [52]:
salaries = pd.DataFrame(data=data, columns=columns)
salaries

Unnamed: 0,Salary,Job Title
0,110000,Professor
1,90000,Developer
2,130000,Electrician


Because we did not pass an argument to the `index` parameter, `pandas` creates a default index for us. We could use a custom `index` as follows.

In [53]:
salaries = pd.DataFrame(index=[10, 12, 20], data=data, columns=columns)
salaries

Unnamed: 0,Salary,Job Title
10,110000,Professor
12,90000,Developer
20,130000,Electrician


#### Creating a `DataFrame` from a dictionary
You can create a `DataFrame` using a dictionary where the keys are the column names and the values are lists. The length of each list must be the same.

In [54]:
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange", "Banana"], 
                          "Price":[5.49, 3.99, 9.99]})
df_dict_1

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99
2,Banana,9.99


Notice how the `"Fruit"` list and the `"Price"` list have the same number of items in them (3).

You can also create a `DataFrame` using a list of dictionaries where each dictionary holds the data for a given row.

In [55]:
df_dict_2 = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49}, 
                          {"Fruit": "Orange", "Price": 3.99},
                          {"Fruit": "Banana", "Price": 9.99}])
df_dict_2

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99
2,Banana,9.99
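The two dictionary-based approaches describe the same table, once column by column and once row by row. As a quick check, the sketch below rebuilds both and compares them with `DataFrame.equals`. The names `from_lists` and `from_rows` are illustrative.

```python
import pandas as pd

# Column-oriented: one key per column, one list of values per key.
from_lists = pd.DataFrame({"Fruit": ["Strawberry", "Orange", "Banana"],
                           "Price": [5.49, 3.99, 9.99]})

# Row-oriented: one dictionary per row.
from_rows = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                          {"Fruit": "Orange", "Price": 3.99},
                          {"Fruit": "Banana", "Price": 9.99}])

# equals() compares the labels, data types, and values of both tables.
same = from_lists.equals(from_rows)
print(same)  # True
```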


### `DataFrame` attributes: `index`, `columns`, `shape`
The figure below displays the different components of a DataFrame, which include: `indices`, `columns`, `axes` (more on these later), and `Series`.
<br>
<img src="images/DataFrame.png" alt="drawing" width="450"/>
<br>
Let's check out our `elections` `DataFrame` again.

In [56]:
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


### `index`
The `index` can be thought of as row labels. They can be numeric but they do not have to be. They could also be strings or datetimes, for example. If the `index` of a `DataFrame` is numeric, it does not have to be of the form `0, 1, 2, 3, 4, ...`.

#### `index` as a monotonic sequence of integers
The example below for `elections` is likely the behaviour you would expect when thinking about the row indices of a data table.

In [57]:
elections.index

RangeIndex(start=0, stop=182, step=1)

`index` returns a `RangeIndex()` object, which shows the start, stop, and step size of the row indices. `RangeIndex` is a memory-saving object used for representing monotonic ranges. This is similar to how we represent a sequence of integers using the `range` function in Python. For example, `range(0, 182, 1)`.
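The parallel with `range` can be checked directly. The sketch below uses a small three-row `DataFrame` (an assumption, so that it runs without `elections.csv`) and reads the `start`, `stop`, and `step` attributes of its `RangeIndex`.

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Strawberry", "Orange", "Banana"]})
idx = df.index

# A default index is a RangeIndex that stores only start, stop, and step.
print(idx.start, idx.stop, idx.step)  # 0 3 1

# Expanding it gives the same labels as the built-in range.
labels = list(idx)
print(labels == list(range(0, 3, 1)))  # True
```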

#### Converting a column into the index
Sometimes, it's advantageous to set a column in your `DataFrame` as the index. Consider our `elections` `DataFrame` again.

In [58]:
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


We can set the `"Year"` column as the index.

In [59]:
elections = elections.set_index('Year')
elections

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


And we can reset the `index` back to a monotonic range.

In [60]:
elections = elections.reset_index()
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


#### `index` as an array of numeric values
Let's go back to `"Year"` being the `index`.

In [61]:
elections = elections.set_index('Year')
elections

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


One important property of `index` is that these row labels do not have to be unique, which probably seems counterintuitive. As we can see from the example above, there is more than one row with the index `2020`.

Because the index is no longer a monotonic range, the `RangeIndex()` object can no longer be used to represent it.

In [62]:
elections.index

Index([1824, 1824, 1828, 1828, 1832, 1832, 1832, 1836, 1836, 1836,
       ...
       2016, 2016, 2016, 2016, 2016, 2016, 2020, 2020, 2020, 2020],
      dtype='int64', name='Year', length=182)

Instead, we get an `Index` object that lists the index values.
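Because row labels may repeat, it can help to check whether an index is unique before relying on it. The sketch below builds a tiny `DataFrame` with years as row labels (the rows are copied from the elections data, and the variable names are illustrative) and inspects the `is_unique` attribute of its index.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["Andrew Jackson", "win"],
          ["Henry Clay", "loss"],
          ["William Wirt", "loss"],
          ["Martin Van Buren", "win"]],
    index=[1832, 1832, 1832, 1836],
    columns=["Candidate", "Result"],
)

# is_unique is False as soon as any row label appears more than once.
unique = df.index.is_unique
print(unique)  # False

# Count how many rows share the label 1832.
count_1832 = (df.index == 1832).sum()
print(count_1832)  # 3
```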

#### `index` as an array of non-numeric values
In the example below, you can see that we can also create an `index` using non-numeric data.

In [64]:
elections = elections.reset_index()
elections = elections.set_index('Candidate')
elections

Unnamed: 0_level_0,Year,Party,Popular vote,Result,%
Candidate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Andrew Jackson,1824,Democratic-Republican,151271,loss,57.210122
John Quincy Adams,1824,Democratic-Republican,113142,win,42.789878
Andrew Jackson,1828,Democratic,642806,win,56.203927
John Quincy Adams,1828,National Republican,500897,loss,43.796073
Andrew Jackson,1832,Democratic,702735,win,54.574789
...,...,...,...,...,...
Jill Stein,2016,Green,1457226,loss,1.073699
Joseph Biden,2020,Democratic,81268924,win,51.311515
Donald Trump,2020,Republican,74216154,loss,46.858542
Jo Jorgensen,2020,Libertarian,1865724,loss,1.177979


In [67]:
elections = elections.reset_index()

### `columns`
`columns` are column labels in the same way that `indices` are row labels. We can get the `column` names using the following code.

In [68]:
elections.columns

Index(['Candidate', 'Year', 'Party', 'Popular vote', 'Result', '%'], dtype='object')

Column labels should be unique in practice. `pandas` does not strictly enforce this, but duplicate labels make column selection ambiguous. See what happens when we try to build a `DataFrame` from a dictionary that repeats the same key for every column.

In [71]:
df_dict_3 = pd.DataFrame({"Fruit":["Strawberry", "Orange", "Banana"], 
                          "Fruit":[0, 1, 2],
                          "Fruit":['a', 'b', 'c']})
df_dict_3

Unnamed: 0,Fruit
0,a
1,b
2,c


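You can see that only one `column` is added to the `DataFrame` and the `data` is from the last item in the dictionary. This happens because Python dictionary keys must be unique, so each repeated `"Fruit"` key overwrites the previous one before `pandas` ever sees the data.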
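The overwriting happens in plain Python, before any `DataFrame` is built. A minimal sketch using the same dictionary literal (the variable name `fruit` is illustrative):

```python
# A dictionary literal that repeats the same key three times.
fruit = {"Fruit": ["Strawberry", "Orange", "Banana"],
         "Fruit": [0, 1, 2],
         "Fruit": ['a', 'b', 'c']}

# Only one key survives, holding the last value assigned to it.
print(len(fruit))      # 1
print(fruit["Fruit"])  # ['a', 'b', 'c']
```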

### `shape`
`shape` is a `tuple` where the first item is the number of rows and the second item is the number of columns.

In [73]:
elections

Unnamed: 0,Candidate,Year,Party,Popular vote,Result,%
0,Andrew Jackson,1824,Democratic-Republican,151271,loss,57.210122
1,John Quincy Adams,1824,Democratic-Republican,113142,win,42.789878
2,Andrew Jackson,1828,Democratic,642806,win,56.203927
3,John Quincy Adams,1828,National Republican,500897,loss,43.796073
4,Andrew Jackson,1832,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,Jill Stein,2016,Green,1457226,loss,1.073699
178,Joseph Biden,2020,Democratic,81268924,win,51.311515
179,Donald Trump,2020,Republican,74216154,loss,46.858542
180,Jo Jorgensen,2020,Libertarian,1865724,loss,1.177979


In [75]:
elections.shape

(182, 6)

In [76]:
df_dict_2 = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49}, 
                          {"Fruit": "Orange", "Price": 3.99},
                          {"Fruit": "Banana", "Price": 9.99}])
df_dict_2

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99
2,Banana,9.99


In [77]:
df_dict_2.shape

(3, 2)
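Since `shape` is an ordinary tuple, it can be unpacked into two variables. The sketch below does this for the fruit table and cross-checks the result with `len`. The names `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                   {"Fruit": "Orange", "Price": 3.99},
                   {"Fruit": "Banana", "Price": 9.99}])

# Unpack the (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 3 2

# len() of a DataFrame is its number of rows.
print(len(df))          # 3

# len() of its columns is the number of columns.
print(len(df.columns))  # 2
```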

<a id='section4'></a>
## 4. Breakout Session 1
Given the data below, create a `DataFrame` with the columns `phone_number`, `job`, and `years_of_experience`. The index should be `"name"` with the data coming from `data1`. There should be 4 rows in the `DataFrame`.

In [87]:
data1 = {'name': ['John', 'Susan', 'Omid', 'Ava']}
data2 = {'phone_number': ['234-5678', '123-4567', '111-4444', '456-0987']}
data3 = {'job': ['Pizza Delivery', 'Teacher', 'Chemist', 'Coder']}
data4 = {'years_of_experience': [10, 2, 5, 8]}

# Write your code here
data1.update(data2)
data1.update(data3)
data1.update(data4)
df1 = pd.DataFrame(data1)
df1 = df1.set_index('name')

# Display DataFrame
df1

Unnamed: 0_level_0,phone_number,job,years_of_experience
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
John,234-5678,Pizza Delivery,10
Susan,123-4567,Teacher,2
Omid,111-4444,Chemist,5
Ava,456-0987,Coder,8
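The solution above modifies `data1` in place with `update`. One possible alternative, sketched below, merges the four dictionaries into a new one using dictionary unpacking so that the originals stay unchanged. The name `combined` is illustrative, and this is only one of several reasonable approaches.

```python
import pandas as pd

data1 = {'name': ['John', 'Susan', 'Omid', 'Ava']}
data2 = {'phone_number': ['234-5678', '123-4567', '111-4444', '456-0987']}
data3 = {'job': ['Pizza Delivery', 'Teacher', 'Chemist', 'Coder']}
data4 = {'years_of_experience': [10, 2, 5, 8]}

# Merge into a new dictionary; data1 through data4 are left untouched.
combined = {**data1, **data2, **data3, **data4}

# Build the DataFrame and move 'name' from a column to the index.
df1 = pd.DataFrame(combined).set_index('name')

print(df1.shape)     # (4, 3)
print(list(data1))   # ['name']
```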


<a id='section3'></a>
## 3. Series
A `Series` is a 1-D labelled array of data. We can think of it as a column of data like you may have seen in an Excel spreadsheet.

### Creating a new `Series` object
Below, we create a `Series` object.

In [13]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


Because we imported `Pandas` using `import pandas as pd`, any time we want to use a module, function, or class from the `Pandas` library, we must prefix it with the alias we used during import, which was `pd`.

```python
pd.module_or_function_or_class_name()
pd.Series()
pd.DataFrame()
```

If we just did this:
```python
Series()
DataFrame()
```

We would get an error. Don't believe me? Let's try it out.

In [14]:
my_series = Series(["welcome", "to", "APS106"])
print(my_series)

NameError: name 'Series' is not defined

Ok, back to the correct code.

In [15]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


A `Series` has three main components: `index`, `name`, and `data` (also referred to as `values`), as you can see in the diagram below. `DataFrames` are collections of `Series`. Therefore, a helpful way to think of a `Series` is as a column from a table of data, like an Excel sheet.

<img src="images/Pandas_Series.png" width="600" style="margin:auto"/>


So, for our new `Series` `my_series`, let's check out these attributes (`index`, `name`, `data`).

**name**

In [6]:
print(my_series.name)

None


We never gave our `Series` a name, so by default, it's `None`.

**data (values)**

In [7]:
print(my_series.values)

['welcome' 'to' 'APS106']


We passed this data to our `Series` constructor (`my_series = pd.Series(["welcome", "to", "APS106"])`).

**index**

In [15]:
print(my_series.index)

RangeIndex(start=0, stop=3, step=1)


In the example above, `Pandas` automatically generated an `Index` of integer labels. The first item in the `Series` has `index = 0`, the second item has `index = 1` and so on. Rather than storing the index as a long list of monotonically increasing integers, `Pandas` has created a `RangeIndex` object to represent the same information in a simpler and more compact format. `RangeIndex(start=0, stop=3, step=1)` is the same as `[0, 1, 2]`, but in the case where you have millions of items in your `Series`, it's much more efficient. This should look familiar to you, given that `RangeIndex(start=0, stop=3, step=1)` is very similar to the `range(0, 3, 1)` that we learned in **Week 6**. We can also create a `Series` object by providing a custom `Index`.

Here is an example where we are specifying an `index`, `data`, and `name` when we create a new `Series`.

In [16]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64


We can see all the `Series` information in the print-out and we can also see it by using the attributes below.

In [10]:
print(my_series.name)
print(my_series.values)
print(my_series.index)

salary
[ 20000  21000 100000  88000 101000]
Index(['Seb', 'Ben', 'Katia', 'Joseph', 'Tamara'], dtype='object')


Now, because our index is no longer a sequence of monotonically increasing integers (e.g. `[0, 2, 4, 6, 8]`), you can see that the `RangeIndex` object can no longer be used to represent the `index`. For this example, the `index` is simply an `Index` object holding the values `["Seb", "Ben", "Katia", "Joseph", "Tamara"]`.

**Note:** The idea of having non-numeric indices is new and a bit strange. We'll get into this more shortly.

After a `Series` has been created, we can reassign the `Index` of a `Series` to a new `Index`.

In [11]:
my_series.index = ['Goodfellow', 'Kinsella', 'Ossetchkina', 'Sebastian', 'Kecman']
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: salary, dtype: int64


We can also do this for `name`.

In [12]:
my_series.name = 'income'
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: income, dtype: int64


But not for `data (values)`.

In [13]:
my_series.values = [0, 0, 0, 0, 0]
print(my_series)

AttributeError: can't set attribute 'values'

More on how to update column data when we get to `DataFrames`.

#### Selecting items in a `Series`
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [14]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64


**Selection using one label**

In [15]:
my_series['Seb']

20000

Notice how the return value is a single array element.

**Selection using multiple labels**

In [18]:
my_series[['Seb', 'Tamara']]

Seb        20000
Tamara    101000
Name: salary, dtype: int64

Notice how the return value is another `Series` with two items in it.

In [19]:
type(my_series[['Seb', 'Tamara']])

pandas.core.series.Series

**Selection using a filter condition**

Filter condition: select all elements greater than 50,000.

In [21]:
print(my_series > 50000)

Seb       False
Ben       False
Katia      True
Joseph     True
Tamara     True
Name: salary, dtype: bool


What we get back is a `Series` of booleans. `True` if the data is > 50,000 and `False` if the data is <= 50,000.

We can use this boolean `Series` to filter our original `Series` to only have data with values > 50,000.

In [22]:
print(my_series[my_series > 50000])

Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64


You could also make up your own list of booleans. As long as this list has as many booleans in it as there are items in the `Series`, you can use it in the following way.

In [17]:
print(my_series[[True, True, False, False, False]])

Seb    20000
Ben    21000
Name: salary, dtype: int64
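Filter conditions can also be combined. The sketch below is one way to select salaries inside a range, using the `&` operator on two boolean `Series`. The upper bound of 101,000 and the name `mask` are chosen only for illustration. Each comparison needs its own parentheses because `&` binds more tightly than `>` and `<`.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# True only where BOTH conditions hold for the same element.
mask = (my_series > 50000) & (my_series < 101000)

in_range = my_series[mask]
print(list(in_range.index))  # ['Katia', 'Joseph']
```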


#### Looping over a `Series`
We can loop over the values in a `Series` as follows.

In [28]:
for item in my_series:
    print(item)

20000
21000
100000
88000
101000


Or, we could loop through the indices and then use them to access the values.

In [18]:
for index in my_series.index:
    print(index, my_series[index])

Seb 20000
Ben 21000
Katia 100000
Joseph 88000
Tamara 101000
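A third option, sketched below, is the `items()` method of a `Series`, which yields each label together with its value so that no separate lookup is needed. The name `pairs` is illustrative.

```python
import pandas as pd

my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                      name="salary")

# items() yields (label, value) pairs in order.
for label, value in my_series.items():
    print(label, value)

# The same pairs collected into a list.
pairs = list(my_series.items())
```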


<a id='section4'></a>
## 4. Data Extraction