# Introduction to Python - Lecture 09 (31 Oct 2018)

### Agenda for today:
+ Introduction to Pandas
+ Introduction to Seaborn

#### Recap

In [None]:
import numpy as np

lst = np.array(
      [[ 1,  1],
       [ 4,  3],
       [ 0,  1],
       [-1,  1],
       [ 0,  1],
       [ 4, -2],
       [-5,  1],
       [-1,  0],
       [-3,  3],
       [ 3,  3]])

#### How do you return the first column? [1, 4, 0, -1, 0, 4, -5, -1, -3, 3]

#### How do you return the second column? [1, 3, 1, 1, 1, -2, 1, 0, 3, 3]

#### How do you get the sum of each row?

#### How can we calculate the z score of each row?

---

### Setting a lower bound using numpy

You can find all values that match a criterion using comparison operations.

**eg**
```python
lst = np.arange(0, 20)
lst < 5
```

+ This will return an array with 20 elements, the first 5 will be True, the rest False.
+ We can then use this as the index to the numpy array and set those values to something else
```python
lst = np.arange(0, 20)
lst[lst < 5] = 0
lst
```
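
The example above replaces every value below 5 with 0. If the goal is a true lower bound, where every value below 5 becomes 5, the same mask can assign the bound itself, or `np.maximum` can do it in one step. This is a minimal sketch; the names `values`, `lower`, `bounded` and `bounded_max` are illustrative only.

```python
import numpy as np

lower = 5  # illustrative lower bound
values = np.arange(0, 20)

# Option 1: boolean mask, applied to a copy so the original is untouched
bounded = values.copy()
bounded[bounded < lower] = lower

# Option 2: element-wise maximum, which returns a new array
bounded_max = np.maximum(values, lower)

print(bounded)
```

Both options give the same result; the mask version is closer to what the lecture shows, while `np.maximum` avoids the explicit copy.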

### On removing noise from images

The method you use to remove noise will vary depending on the type of noise you are trying to remove.

Applying filters to the image is one method of trying to remove noise
+ median filter
+ gaussian filter
+ dilation/erosion
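
To give a feel for how a median filter removes isolated "salt" pixels, here is a small hand-written sketch using only numpy. The helper `median_filter_3x3` and the toy image are assumptions made for illustration; in practice, libraries such as `scipy.ndimage` provide ready-made filters.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    # Pad by one pixel so the output has the same shape as the input
    padded = np.pad(image, 1, mode="edge")
    # windows has shape (rows, cols, 3, 3): one 3x3 patch per pixel
    windows = sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))


# Toy image: flat background with a single noisy pixel
image = np.zeros((7, 7))
image[3, 3] = 255

cleaned = median_filter_3x3(image)
print(cleaned.max())
```

Because the noisy pixel is outnumbered in every 3x3 window, the median ignores it. A gaussian filter would instead smear it into its neighbours.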

---

## Pandas

Pandas is an external library like numpy and seaborn and needs to be installed using a package manager.

Anaconda:
+ `conda install pandas`

Pip:
+ `pip install pandas`

**Note** Seaborn requires pandas as a dependency, so you should already have it installed.

To use pandas, we first need to import it.

In [None]:
import numpy as np
import pandas as pd

#### What is a dataframe?

A dataframe is a two-dimensional, labeled table of data: each row is an observation and each column is a variable recorded for that observation.

#### Creating a dataframe

There are many ways to create dataframes:
+ Converting a dictionary to a dataframe
    ```python
df = pd.DataFrame.from_dict( << dict >> )
    ```
+ Loading the data from a csv file
    ```python
df = pd.read_csv( << csv_path >> )
    ```
+ Load the data from a url
    ```python
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
    ```

#### Loading data from a dictionary

Some of these examples are taken from the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html#pandas.DataFrame.from_dict)

###### By default, each item in the dictionary will represent a column

In [None]:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

###### This can be changed by setting the orient parameter to 'index' (the default is 'columns')

In [None]:
data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index')

###### The names of the columns can be set using the columns parameter

In [None]:
pd.DataFrame.from_dict(data, orient='index',
                        columns=['A', 'B', 'C', 'D'])

###### Alternatively you can specify the column names in the dictionary

In [None]:
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index')

**Note** Each row has an identifier (its index label), which should normally be unique; in the above example this is '**Tree#**'. By default the index is a range of integers from 0 to n-1.
+ In the above example we can replace the labels with integers using the reset_index() method; the old labels are kept in a new column named `index`.

In [None]:
pd.DataFrame.from_dict(data, orient='index').reset_index()
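
As a small sketch of what `reset_index()` does with the old labels, the example below rebuilds the tree data, renames the resulting column, and also shows the `drop=True` variant. The column name `tree` and the variable names are illustrative choices, not part of the lecture.

```python
import pandas as pd

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
trees = pd.DataFrame.from_dict(data, orient='index')

# reset_index() moves the old labels into a column called "index"
trees_reset = trees.reset_index()

# Give that column a more descriptive name ("tree" is an illustrative choice)
trees_named = trees_reset.rename(columns={'index': 'tree'})

# drop=True discards the old labels instead of keeping them
trees_dropped = trees.reset_index(drop=True)

print(trees_named)
```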

###### Loading Data from a URL

In [None]:
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
df

### Accessing values in the dataframe

In [None]:
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
df = pd.DataFrame.from_dict(data, orient='index')

#### Columns

+ To get a list of column names you can convert the dataframe into a list
```python
list(df)
```

In [None]:
list(df)

+ Access columns using the column name in square brackets to return a series containing the data.
```python
df["column_name"]
```

+ A series is a 1D array of (index label, value) pairs

In [None]:
df["girth"]

If instead of passing a single column name you pass a list of column names, a sub dataframe will be returned containing only those columns.

In [None]:
df[["girth", "height"]]
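
The difference between passing a single name and a list of names can be checked directly. The sketch below assumes a small dataframe `trees` with the same values as above; `df.columns.tolist()` is shown as an alternative way to list the column names.

```python
import pandas as pd

trees = pd.DataFrame(
    {'girth': [8.3, 8.6, 8.8], 'height': [70, 65, 63], 'volume': [10.3, 10.3, 10.2]},
    index=['Tree1', 'Tree2', 'Tree3'],
)

column_names = trees.columns.tolist()  # same result as list(trees)
one_column = trees['girth']            # single name -> Series
sub_frame = trees[['girth']]           # list of names -> DataFrame, even with one name

print(type(one_column).__name__, type(sub_frame).__name__)
```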

#### Rows

Rows are accessed using either **loc** or **iloc**

###### iloc
+ This will access rows depending on their integer index
+ The first row will have index 0
+ The next will have index 1, ...
+ To extract the first row you would use the following command
    + This will return a series containing the information from that row
```python
df.iloc[0]
```
+ To extract multiple rows you can pass a list of indices
    + This will return a dataframe containing the specified rows
```python
df.iloc[[0, 1, 2]]
```

In [None]:
df.iloc[[1, 2]]

###### loc

Various arguments will work with loc to extract rows from a dataframe

+ A single index label
    + Returns a series for that specific row
    ```python
df.loc["Tree2"]
    ```
+ A list of index labels
    + Returns a dataframe containing those rows
    ```python
df.loc[["Tree1", "Tree3"]]
    ```
+ A boolean list
    + Must contain one value per row
    + Returns a dataframe containing the rows where the value is True
    ```python
df.loc[[False, True, False]]
    ```

In [None]:
df.loc[[False, True, False]]
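
To make the contrast between the two accessors concrete, the sketch below fetches the same row by position and by label. It also uses slices, which the notes above do not cover: `iloc` excludes the end position while `loc` includes the end label. The dataframe `trees` is a small stand-in for the one used above.

```python
import pandas as pd

trees = pd.DataFrame(
    {'girth': [8.3, 8.6, 8.8], 'height': [70, 65, 63]},
    index=['Tree1', 'Tree2', 'Tree3'],
)

by_position = trees.iloc[1]    # second row, by integer position
by_label = trees.loc['Tree2']  # same row, by index label

# Slices differ: iloc excludes the end position, loc includes the end label
first_two_by_position = trees.iloc[0:2]
first_two_by_label = trees.loc['Tree1':'Tree2']

print(by_position.equals(by_label))
```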

#### Extracting data by value

Comparison operators can be applied to series objects (which are backed by numpy arrays).
For each value the result is either True or False depending on the comparison.

**eg**
```python
np.array([1, 1, 1, 5, 5, 5]) > 3
# array([False, False, False,  True,  True,  True])
```

This is convenient as **.loc** can use an array of booleans to extract rows.


This allows for specific rows to be extracted from the dataframe depending on their value


In [None]:
df

+ Trees that are shorter than 70
    + df["height"] will return a series
    + df["height"] < 70 will return a series of booleans
        + [False, True, True]
    + We can then use this to extract those rows from the dataframe
    ```python
df.loc[df["height"] < 70]
    ```
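
The two steps can also be written out separately, which makes the intermediate boolean mask easier to inspect. This sketch uses a different column (girth) and illustrative variable names such as `is_thick`.

```python
import pandas as pd

trees = pd.DataFrame(
    {'girth': [8.3, 8.6, 8.8], 'height': [70, 65, 63], 'volume': [10.3, 10.3, 10.2]},
    index=['Tree1', 'Tree2', 'Tree3'],
)

# Step 1: the comparison yields a boolean Series aligned with the rows
is_thick = trees['girth'] >= 8.6

# Step 2: loc keeps only the rows where the mask is True
thick_trees = trees.loc[is_thick]

# The mask can also be counted, since True counts as 1
n_thick = is_thick.sum()

print(thick_trees)
```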
    
##### How would you get the rows where the volume is equal to 10.3?

##### Combining conditions

Numpy has various bitwise operations which work on boolean arrays (bitwise operations work on binary sequences, a boolean list is a binary sequence)
+ **&**
    + This is and
    + The resulting list will only be true where both conditions are true
    ```python
l1 = np.array([True, False])
l2 = np.array([True, True])
l1 & l2
    ```
+ **|**
    + This is or
    + The resulting list will be true where any of the conditions is true
    ```python
l1 = np.array([True, False])
l2 = np.array([True, True])
l1 | l2
    ```
+ **~**
    + This is negation
    + The resulting True/False values will be flipped
    ```python
l1 = np.array([True, False])
~l1
    ```
    
**When combining conditions with pandas, put each condition in parentheses** (`&` and `|` bind more tightly than comparison operators such as `>`)
```python
df.loc[(df["something"] > 5) & (df["nothing"] != 4)]
```
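
One way to keep combined conditions readable is to give each mask a name first, so the parentheses are only needed where masks are combined. The sketch below uses a small stand-in dataframe and illustrative conditions on girth and height.

```python
import pandas as pd

trees = pd.DataFrame(
    {'girth': [8.3, 8.6, 8.8], 'height': [70, 65, 63]},
    index=['Tree1', 'Tree2', 'Tree3'],
)

thick = trees['girth'] > 8.5  # False, True, True
tall = trees['height'] > 64   # True, True, False

both = trees.loc[thick & tall]         # and: rows where both masks are True
either = trees.loc[thick | tall]       # or: rows where at least one mask is True
not_both = trees.loc[~(thick & tall)]  # negation of the combined condition

print(both)
```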

###### Using this how can we extract all rows with height < 70 and volume equal to 10.3?

###### Using this how can we extract all rows except those with height < 70 and volume equal to 10.3?


#### Loading test datasets

Seaborn has a few test datasets included with it

+ Flights
+ Iris
+ many more (https://github.com/mwaskom/seaborn-data)

These can be accessed using seaborn's load_dataset function.
This will return a pandas dataframe containing the data.

```python
df = sns.load_dataset("dataset_name")
```

We will now load the **flights** dataset and perform some analysis on it

In [None]:
import seaborn as sns
df = sns.load_dataset("flights")
df.head() # This will return the first 5 rows of the dataframe

#### Extracting rows using strings

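Previously we extracted rows by comparing numeric values. A single string can still be matched with `==`, but to match against one or more string options it is convenient to use **isin**.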

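+ To extract rows by string value we can use the **isin** function and a list of options
+ Like other comparison operations this will return a series of boolean values
+ This series can be used in conjunction with loc to access rows
+ The options must match the labels in the column exactly; if nothing is returned, check `df["month"].unique()`, as some versions of the dataset abbreviate the month names (for example "Jan")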
```python
df["month"].isin(["January"])
df.loc[df["month"].isin(["January"])]
```
+ As the result is a series of booleans it can be combined with the comparisons we discussed earlier
```python
df.loc[(df["month"].isin(["January"])) & (df["passengers"] > 300)]
```
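
The same pattern can be tried without downloading anything. The sketch below uses a few made-up rows in the shape of the flights data (the dataframe `flights` and its values are illustrative, not the real dataset) and shows `isin` with several options, its negation, and a combined condition.

```python
import pandas as pd

# Illustrative rows in the same shape as the flights dataset
flights = pd.DataFrame({
    'year': [1949, 1949, 1949, 1950, 1950, 1950],
    'month': ['January', 'February', 'March', 'January', 'February', 'March'],
    'passengers': [112, 118, 132, 115, 126, 141],
})

# isin matches any value from the list of options
winter = flights.loc[flights['month'].isin(['January', 'February'])]

# ~ flips the mask: every month that is NOT in the list
not_winter = flights.loc[~flights['month'].isin(['January', 'February'])]

# The boolean result combines with other conditions as before
busy_winter = flights.loc[
    flights['month'].isin(['January', 'February']) & (flights['passengers'] > 115)
]

print(busy_winter)
```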

### Groupby

The groupby function of pandas allows you to gather statistics on certain groups within the data.


On its own the groupby function will not actually do anything except create the groups. It needs to be combined with additional functions such as:
+ mean()
+ count()
+ nunique()
+ etc

It is possible to iterate over the groups using for loops, but this is generally not required

As an example we can group all of the flights by month and then sum all of the passengers for each month

In [None]:
df.groupby("month") # This will not do anything except create the group objects

In [None]:
df.groupby("month").count()

In [None]:
df[["month", "passengers"]].groupby("month").count()

In [None]:
df[["month", "passengers"]].groupby("month").sum()

In [None]:
df[["month", "passengers"]].groupby("month").mean()
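
For completeness, here is a small sketch of iterating over the groups and of computing several statistics in one call with `agg`. It uses made-up rows in the shape of the flights data so that it runs without a download; the variable names are illustrative.

```python
import pandas as pd

# Illustrative rows in the same shape as the flights dataset
flights = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': ['January', 'February', 'January', 'February'],
    'passengers': [112, 118, 115, 126],
})

grouped = flights.groupby('month')

# Iterating yields (group label, sub-dataframe) pairs
group_sizes = {}
for month, rows in grouped:
    group_sizes[month] = len(rows)

# Selecting a column first, then aggregating, gives one value per group
totals = grouped['passengers'].sum()

# agg applies several summary functions at once
summary = grouped['passengers'].agg(['mean', 'max'])

print(summary)
```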

##### How would we calculate which year had the most passengers?

### Melting Data / Gathering data

Data can be represented in two different forms:
+ Long
  + The flights data used previously would be an example of long data
  + Each row only contains a single value (passengers)
  + Long format is great for plotting data
+ Wide
  + Wide data contains multiple values per row
  + For example using the flights data:
    + each row could represent a year, each column a month
  + This format is sometimes easier for performing calculations
  + Often it is easier to store data in this format
  
Long data can be converted into wide data using the **pivot** function.

Pivot can take three arguments:
+ index
  + which column should act as the index (remember index values should be unique)
+ columns
  + which column should be split into multiple columns
+ values
  + which column should act as the values 

To convert the flights dataset both year and month could act as the index and columns interchangeably.
+ if index="year"
    + each year will be a row
+ if index="month"
    + each month will be a row
+ vice versa for the column

The value will always be set to passengers

```python
df_wide = df.pivot(index='year', columns='month', values='passengers')
df_wide
```

In [None]:
df_wide = df.pivot(index='year', columns='month', values='passengers')
df_wide
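
The same conversion on a tiny, made-up table makes it easier to see where each value ends up. In this sketch `long_df` and `wide_df` are illustrative names and the rows are not the real dataset. Note that `pivot` needs each index/column pair to occur only once.

```python
import pandas as pd

# Long format: one passengers value per row
long_df = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': ['January', 'February', 'January', 'February'],
    'passengers': [112, 118, 115, 126],
})

# Wide format: one row per year, one column per month
wide_df = long_df.pivot(index='year', columns='month', values='passengers')

print(wide_df)
```

After the pivot, `year` is the index of `wide_df` rather than an ordinary column.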

Wide data can be converted into long data using the **melt** function.

Four commonly used arguments:
+ id_vars
  + columns which act as identifiers
+ value_vars
  + columns which contain the values
  + if not specified it will use all columns except the id_vars columns
+ var_name
  + name to use for the variable column
+ value_name
  + name to use for the value column

In [None]:
# After pivoting, "year" is the row index rather than an ordinary column
# Rebuilding the dataframe from its records turns "year" back into a regular column for melt
df_wide = pd.DataFrame(df_wide.to_records())

In [None]:
df_wide.melt(
    id_vars=["year"], 
#     value_vars=["January", "February"], 
    var_name="months", 
    value_name="passengers"
)
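
The effect of `value_vars` is easiest to see on a small wide table. The sketch below builds one by hand (the values are illustrative) and melts it twice: once using every non-id column, and once restricted to a single month.

```python
import pandas as pd

# Wide format: one row per year, one column per month
wide_df = pd.DataFrame({
    'year': [1949, 1950],
    'January': [112, 115],
    'February': [118, 126],
})

# Without value_vars, every column except the id_vars is melted
long_df = wide_df.melt(id_vars=['year'], var_name='month', value_name='passengers')

# With value_vars, only the listed columns are melted
only_january = wide_df.melt(
    id_vars=['year'],
    value_vars=['January'],
    var_name='month',
    value_name='passengers',
)

print(long_df)
```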

---

## Seaborn



In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Let's make this interactive

Go [here](https://seaborn.pydata.org/api.html) and choose a type of plot; we will then discuss it.