# APS106 - Fundamentals of Computer Programming
## Week 12 | Lecture 1 (12.1) - Installing third-party packages, managing environments, and Pandas

### This Week
| Lecture | Topics | Reading |
| --- | --- | --- | 
| **12.1** | **Installing third-party packages, managing environments, and Pandas** | **Chapter 12**  |
| 12.2 | More Pandas, NumPy, Matplotlib/Seaborn | Chapter 12 | 
| 12.3 | Design Problem: Stock Market, Part 1 |  |

### Lecture Structure
1. [Installing Packages From A Jupyter Notebook](#section1)
2. [Importing Packages](#section2)
3. [Series](#section3)
4. [DataFrames](#section3)
5. [Indices](#section4)

<a id='section1'></a>
## 1. Installing Packages From A Jupyter Notebook
Below is how you can install a package from a Jupyter Notebook using `pip`. 

In [None]:
!pip install numpy

Below is how you can install a package from a Jupyter Notebook using `conda`. 

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} numpy

Note that we use `--yes` to automatically answer `y` if and when conda asks for user confirmation.

This is a quick and dirty way to install packages, and it is not recommended. I used this method once or twice in the previous week's lectures, so I wanted to explain it briefly.

I recommend using the terminal or Anaconda Navigator to install and manage your packages.
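
The `--prefix {sys.prefix}` argument in the conda command above points the install at the environment the notebook kernel is actually running in. As a small sketch, the standard-library `sys` module can show which interpreter and environment that is. The variable names below are only illustrative.

```python
import sys

# Path to the Python interpreter that is running this notebook kernel.
interpreter_path = sys.executable

# Root folder of the active environment; this is the value that
# {sys.prefix} expands to in the conda command above.
environment_root = sys.prefix

print(interpreter_path)
print(environment_root)
```

If the printed folder is not the environment you expected, packages installed from the terminal may be landing somewhere other than where the notebook looks for them.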

<a id='section2'></a>
## 2. Importing Packages
`pd` is the conventional alias for Pandas, as `np` is for NumPy.

In [2]:
import pandas as pd

If a package, such as `Pandas`, is not installed in the environment that your Notebook is running in, you will get an error.

For example, I'll try to import a package I know is not installed in my environment. For those who are interested, this is a package used to analyze physiological signals.

In [3]:
from biosppy.signals import ecg

ModuleNotFoundError: No module named 'biosppy'

If you ever see this error, you can install the package using the Anaconda terminal or Anaconda Navigator and then try the import again.
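
If you would rather check for a package before importing it, one possible approach (a sketch, not something the course requires) uses `importlib.util.find_spec` from the standard library. The helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of each package without triggering an import error.
for name in ["pandas", "biosppy"]:
    if is_installed(name):
        print(name, "is installed")
    else:
        print(name, "is NOT installed; install it, then import again")
```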

<a id='section3'></a>
## 3. Series
A `Series` is a 1-D labelled array of data. We can think of it as a column of data like you may have seen in an Excel spreadsheet.

### Creating a new `Series` object

In [3]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


Because we imported `Pandas` with `import pandas as pd`, any time we want to use a module, function, or class from the `Pandas` library, we must prefix it with the alias we chose during import, which was `pd`.

```python
pd.module_or_function_or_class_name()
pd.Series()
pd.DataFrame()
```

If we just did this:
```python
Series()
DataFrame()
```

We would get an error.

In [4]:
my_series = Series(["welcome", "to", "APS106"])
print(my_series)

NameError: name 'Series' is not defined
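
The bare name fails because `import pandas as pd` only creates the name `pd`. As an aside, and assuming Pandas is installed, importing the class by name is another valid style, although these lectures stick with the `pd.` prefix. The variable name `aside_series` is only for illustration.

```python
import pandas as pd
from pandas import Series  # makes the bare name Series available

# Both names refer to the same class.
print(Series is pd.Series)

aside_series = Series(["welcome", "to", "APS106"])
print(aside_series)
```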

Ok, back to the correct code.

In [5]:
my_series = pd.Series(["welcome", "to", "APS106"])
print(my_series)

0    welcome
1         to
2     APS106
dtype: object


A `Series` has three main components: `index`, `name`, and `data` (also referred to as `values`), as you can see in the diagram below. Don't worry about `DataFrames` for now; we'll get to them soon enough. As you can see, `DataFrames` are collections of `Series`, so a helpful way to think of a `Series` is as a column from a table of data.

<img src="images/Pandas_Series.png" width="600" style="margin:auto"/>


So, for our new `Series` `my_series`, let's check out these attributes (`index`, `name`, `data`).

**name**

In [6]:
print(my_series.name)

None


We never gave our `Series` a name, so by default, it's `None`.

**data (values)**

In [7]:
print(my_series.values)

['welcome' 'to' 'APS106']


We passed this data to our `Series` constructor (`my_series = pd.Series(["welcome", "to", "APS106"])`).

**index**

In [15]:
print(my_series.index)

RangeIndex(start=0, stop=3, step=1)


In the example above, `Pandas` automatically generated an `Index` of integer labels. The first item in the `Series` has `index = 0`, the second item has `index = 1`, and so on. Rather than storing the index as a long list of increasing integers, `Pandas` has created a `RangeIndex` object that represents the same information in a simpler and more compact format. `RangeIndex(start=0, stop=3, step=1)` is the same as `[0, 1, 2]`, but when you have millions of items in your `Series`, it's much more efficient. We can also create a `Series` object by providing a custom `Index`.
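
To see that a `RangeIndex` really does carry the same labels as the equivalent list, here is a short sketch. The variable names are just for illustration.

```python
import pandas as pd

words = pd.Series(["welcome", "to", "APS106"])

# The default index is a RangeIndex, which stores only start, stop and step.
default_index = words.index
print(type(default_index).__name__)

# Converting it to a list shows the labels it stands for.
labels = list(default_index)
print(labels)
```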

Here is an example where we are specifying an `index`, `data`, and `name` when we create a new `Series`.

In [1]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64

We can see all the `Series` information in the printout and we can also see it by using the attributes below.

In [10]:
print(my_series.name)
print(my_series.values)
print(my_series.index)

salary
[ 20000  21000 100000  88000 101000]
Index(['Seb', 'Ben', 'Katia', 'Joseph', 'Tamara'], dtype='object')


Now, because our index is no longer a regularly spaced sequence of integers (e.g. [0, 1, 2, 3, 4] or [0, 2, 4, 6, 8]), a `RangeIndex` can no longer represent it. For this example, the `index` is an `Index` object that stores each label explicitly: `["Seb", "Ben", "Katia", "Joseph", "Tamara"]`.

**Note:** The idea of having index labels that are not integers is new and a bit strange. We'll get into this more shortly.
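
A sketch contrasting the two cases: evenly spaced integers fit into a `RangeIndex` when we build one explicitly, whereas string labels are stored as a regular `Index`. The names `even_labels` and `salaries` are illustrative.

```python
import pandas as pd

# Evenly spaced integers can be described by start, stop and step.
even_labels = pd.RangeIndex(start=0, stop=10, step=2)
print(list(even_labels))

# String labels have no start/stop/step pattern, so each one is stored.
salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")
print(isinstance(salaries.index, pd.RangeIndex))
print(list(salaries.index))
```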

After a `Series` has been created, we can reassign the `Index` of a `Series` to a new `Index`.

In [11]:
my_series.index = ['Goodfellow', 'Kinsella', 'Ossetchkina', 'Sebastian', 'Kecman']
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: salary, dtype: int64


We can also do this for `name`.

In [12]:
my_series.name = 'income'
print(my_series)

Goodfellow      20000
Kinsella        21000
Ossetchkina    100000
Sebastian       88000
Kecman         101000
Name: income, dtype: int64


But not for `data (values)`.

In [13]:
my_series.values = [0, 0, 0, 0, 0]
print(my_series)

AttributeError: can't set attribute 'values'

More on how to update column data when we get to `DataFrames`.
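
In the meantime, here is a minimal sketch of one way to change the data in a `Series` without touching `.values`: assigning through a label. It rebuilds the salary data from earlier under the illustrative name `salaries` and is only a preview.

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# Update one item by its label; the other items are left unchanged.
salaries["Seb"] = 25000

print(salaries)
```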

#### Selecting items in a `Series`
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [14]:
my_series = pd.Series(data=[20000, 21000, 100000, 88000, 101000], 
                      index=["Seb", "Ben", "Katia", "Joseph", "Tamara"], 
                      name="salary")
print(my_series)

Seb        20000
Ben        21000
Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64


**Selection using one label**

In [15]:
my_series['Seb']

20000

Notice how the return value is a single element rather than a `Series`.
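
Selecting a label that does not exist raises a `KeyError`. The sketch below shows that behaviour, along with the `Series.get` method, which returns a fallback value instead. The label `"Alex"` is hypothetical, and `salaries` is just an illustrative name for the same data.

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# A label that exists returns the single stored value.
print(salaries["Seb"])

# A missing label raises KeyError when square brackets are used...
try:
    salaries["Alex"]
except KeyError:
    print("No entry for Alex")

# ...while get() returns the given default instead of raising.
print(salaries.get("Alex", 0))
```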

**Selection using a list of labels**

In [18]:
my_series[['Seb', 'Tamara']]

Seb        20000
Tamara    101000
Name: salary, dtype: int64

Notice how the return value is another `Series`.

In [19]:
type(my_series[['Seb', 'Tamara']])

pandas.core.series.Series
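
The list of labels also controls the order of the result. A brief sketch, reusing the same data under the illustrative name `salaries`:

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# Items come back in the order the labels were requested,
# not the order in which they are stored.
subset = salaries[["Tamara", "Seb"]]
print(subset)
```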

**Selection using a filter condition**

Filter condition: select all elements greater than 50,000.

In [21]:
print(my_series > 50000)

Seb       False
Ben       False
Katia      True
Joseph     True
Tamara     True
Name: salary, dtype: bool


What we get back is a `Series` of booleans. Each entry is `True` if the data is > 50,000 and `False` if the data is <= 50,000.

We can use this boolean `Series` to filter our original `Series` to only have data with values > 50,000.

In [22]:
print(my_series[my_series > 50000])

Katia     100000
Joseph     88000
Tamara    101000
Name: salary, dtype: int64
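
Conditions can be combined as well. The sketch below uses the element-wise operator `&` (and), with each condition wrapped in parentheses; `|` (or) works the same way. The thresholds and the names `salaries` and `in_range` are arbitrary choices for this example.

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# Keep salaries strictly between 50,000 and 101,000.
in_range = (salaries > 50000) & (salaries < 101000)
print(salaries[in_range])
```

The parentheses matter because `&` binds more tightly than `>` and `<` in Python.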


You could also make up your own list of booleans. As long as this list has as many booleans in it as there are items in the `Series`, you can use it in the following way.

In [23]:
print(my_series[[True, True, False, False, False]])

Seb    20000
Ben    21000
Name: salary, dtype: int64
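
Such a list does not have to be typed by hand. As a sketch, a list comprehension over the index can build it. The rule used here (names of three letters or fewer) is just an arbitrary example, and the variable names are illustrative.

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# One boolean per label, in the same order as the items in the Series.
short_name_mask = [len(label) <= 3 for label in salaries.index]
print(short_name_mask)

# Use the list of booleans to keep only the matching items.
print(salaries[short_name_mask])
```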


#### Looping over a `Series`

In [28]:
for item in my_series:
    print(item)

20000
21000
100000
88000
101000
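
Looping directly over a `Series` yields only the data. If the labels are needed too, the `Series.items()` method yields `(label, value)` pairs, as sketched here with the same data under the illustrative name `salaries`:

```python
import pandas as pd

salaries = pd.Series(data=[20000, 21000, 100000, 88000, 101000],
                     index=["Seb", "Ben", "Katia", "Joseph", "Tamara"],
                     name="salary")

# Each iteration unpacks one (label, value) pair.
for label, value in salaries.items():
    print(label, value)
```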
