# Pandas
<img src="https://media.discordapp.net/attachments/739234516857782373/759029377157038091/IMG_20200925_133132.jpg?width=669&height=892" width=50% align="left"></img>

In this session we'll look at Pandas, a widely used data analysis library for Python.
For this module, we'll cover the basics of importing data, the two main types of object which Pandas provides (the `DataFrame` and the `Series`), and ways of manipulating and interacting with them.

The primary motivation behind using Pandas is that it's a high-level library: with only a few lines of code we can perform powerful and robust operations on our data. We will be using it in this module to interrogate some data. This will become more important as the Programme continues (with Kevin's module: Fundamentals).

## Installing
It is unlikely that you will have the `pandas` pip package already installed. Therefore, we'll want to run `pip install pandas` from a command-line to install this, just as we have done before for Jupyter and Numpy.

In [1]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


The cell above uses some Jupyter functionality: prefixing a line with `!` executes it as a system command, so we can run `pip` without leaving the Notebook.

## Import
As always, we need to import our library, typically at the top of our file. Just like we did with the Numpy library, I'm going to alias this to `pd` so that we don't need to type so many pandas.

```python
import pandas as pd
```

If we have multiple libraries, we might have something that looks like the following:
    
```python
import numpy as np
import csv
import pandas as pd
import json
```

Each of these is executed in order, just like any other Python statement. `import` and `as` are keywords which Python recognises and knows how to handle.

In [3]:
import pandas as pd
print( type(pd) )

<class 'module'>


Now we have our pandas library imported, we can begin to use some functionality from it. It acts as a big kitchen sink full of useful Data Analysis utilities.

The first thing we need to do is to learn how to load some data in.
Pandas can do this via the `read_csv` function. (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

Looking at the documentation may seem rather overwhelming. The main parameter we're interested in is the only non-optional one, the first: `filepath_or_buffer`. We can give this function the location of a file we want to open, as a string. This file **MUST** be in the CSV file format to be interpreted correctly, without error.

When we invoke/call this function, it returns a DataFrame object. By convention, the variable returned by `read_csv` is often named `df`, for DataFrame. However, you can name it something more descriptive!

```python
my_dataframe = pd.read_csv('./iris.csv')
# I'm going to load a popular dataset which I have a file for.
```
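Because the first parameter is named `filepath_or_buffer`, `read_csv` also accepts a file-like object instead of a path. As a minimal sketch, assuming we don't have `iris.csv` to hand, the snippet below reads three rows (copied from the dataset output shown in this section) from an in-memory string. The names `csv_text` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# A tiny CSV held in a string; the first line is the header row.
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
6.7,3.0,5.2,2.3,Virginica
"""

# io.StringIO wraps the string so it behaves like an open text file.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)           # (rows, columns) -> (3, 5)
print(list(sample_df.columns))   # column names taken from the header row
```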

In [4]:
my_dataframe = pd.read_csv('./iris.csv')

We can check the type of `my_dataframe` to verify what it is.

In [5]:
print(type(my_dataframe))

<class 'pandas.core.frame.DataFrame'>


When the last line of a Notebook cell is an expression, its value is displayed as the cell's output. Jupyter has some nice ways of interpreting DataFrames for display.
If we use `print`, it looks horrible. But if we leave it out (in a Jupyter Notebook), we get a much prettier display.

In [6]:
print(my_dataframe) # Ugly

     sepal.length  sepal.width  petal.length  petal.width    variety
0             5.1          3.5           1.4          0.2     Setosa
1             4.9          3.0           1.4          0.2     Setosa
2             4.7          3.2           1.3          0.2     Setosa
3             4.6          3.1           1.5          0.2     Setosa
4             5.0          3.6           1.4          0.2     Setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica

[150 rows x 5 columns]


In [7]:
my_dataframe # Pretty
# Notice the cell output says Out[] with a number.

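|     | sepal.length | sepal.width | petal.length | petal.width | variety   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | Setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | Setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | Setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | Setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | Setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | Virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | Virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | Virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | Virginica |

150 rows × 5 columns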


## DataFrames and Series
Pandas has two main data objects, `DataFrame` and `Series`.

* <u>Series</u> - A `Series` object is a one-dimensional data structure. It stores sequential values, and each value has an index. This is conceptually similar to a Python list. However, as with all Pandas objects, additional functionality is defined which we can make use of.


* <u>DataFrame</u> - This represents a two-dimensional data structure. It is essentially a standard table consisting of rows of data records, where each column is an attribute. Each column has its attribute name, and each row has an index (starting from 0 by default).

Documentation outlining all attributes and methods of these Objects can be found in their respective Documentation pages. https://pandas.pydata.org/pandas-docs/stable/reference/series.html and https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

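To make the two objects concrete, here is a small sketch that builds each one by hand rather than loading it from a file. The values are taken from the first three rows of the dataset; the variable names `widths` and `table` are only illustrative.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label (0, 1, 2 by default).
widths = pd.Series([3.5, 3.0, 3.2], name="sepal.width")
print(widths[1])             # value stored at index label 1 -> 3.0
print(list(widths.index))    # [0, 1, 2]

# A DataFrame: a table whose columns are attributes and whose rows are records.
table = pd.DataFrame({
    "sepal.width": [3.5, 3.0, 3.2],
    "variety": ["Setosa", "Setosa", "Setosa"],
})
print(table.shape)                  # (rows, columns) -> (3, 2)
print(type(table["sepal.width"]))   # a single column is a Series
```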
## Indexing DataFrames

We can obtain an entire attribute (column) by indexing it from the DataFrame.
Previously we were used to indexing individual elements directly, where each element represented a unique person in our dataset. Here, however, we're getting ALL of the values of a specific attribute.

Our Dataset contains 5 Data attributes:

* sepal length
* sepal width
* petal length
* petal width
* flower name

<img src="https://miro.medium.com/max/1275/1*7bnLKsChXq94QjtAiRn40w.png" width=100%></img>

We can index a given DataFrame using square brackets (as we would expect). However, this behaves similarly to a dictionary, in that the keys are our attribute names, which are strings.

If we want a list of the column headings, we can use the `my_dataframe.columns` attribute.

```python
my_dataframe.columns
```

From this we can see the column heading for Sepal Length is called "sepal.length". This can be any name. In our CSV files it could be something like "Sepal Length (mm)" which one might expect from more user-friendly data.

If you open the provided CSV file in a text editor such as Notepad, you can see that its header row is what our DataFrame is using. Pandas automatically grabbed the headers for us when creating the original DataFrame!

```python
sepal_lengths = my_dataframe['sepal.length']
```

This should give us a `Series` object, because a single column is a 1D structure. We can check this by calling `type()` on it!

```python
print(type(sepal_lengths))
```
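The dictionary-like indexing can be sketched on a small hand-built table (three rows from the dataset, in a DataFrame named `flowers` purely for illustration). It also shows that passing a list of column names, rather than a single string, returns a DataFrame instead of a Series.

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 4.7],
    "sepal.width": [3.5, 3.0, 3.2],
    "variety": ["Setosa", "Setosa", "Setosa"],
})

one_column = flowers["sepal.length"]                 # string key -> Series
two_columns = flowers[["sepal.length", "variety"]]   # list of keys -> DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame

# Like a dictionary, a misspelt key raises KeyError, so check first if unsure.
print("sepal length" in flowers.columns)   # False: the real name uses a dot
print("sepal.length" in flowers.columns)   # True
```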

In [8]:
print(my_dataframe.columns)
#Or
for c in my_dataframe.columns:
    print(c)

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')
sepal.length
sepal.width
petal.length
petal.width
variety


In [9]:
sepal_lengths = my_dataframe['sepal.length']
print(type(sepal_lengths))

<class 'pandas.core.series.Series'>


### Statistical Measures of Data Series

We can calculate some very simple statistical measures of our data for this column.

Let's look at <u>min</u>, <u>max</u>, <u>mean</u>, <u>median</u>, and <u>mode</u>.

The first four are all measures which work on numbers. The last, mode, is most useful for categorical data such as the variety of flower (there are only three types of flower).

These are all methods defined on a Series object, so we can just invoke them! This is just like what we had for Numpy arrays.

```python
# Take our Data Series object
print( sepal_lengths.min() )
print( sepal_lengths.max() )
print( sepal_lengths.mean() )
print( sepal_lengths.median() )
```

In [10]:
# Take our Data Series object
print( sepal_lengths.min() )
print( sepal_lengths.max() )
print( sepal_lengths.mean() )
print( sepal_lengths.median() )

4.3
7.9
5.843333333333334
5.8

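To check these methods by hand, here is a sketch on just the first five sepal lengths from the dataset, which are few enough to verify with a calculator. The `agg` call at the end is an extra not used elsewhere in this session: it runs several of the methods in one go.

```python
import pandas as pd

# The first five sepal lengths from the dataset.
first_five = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

print(first_five.min())              # 4.6
print(first_five.max())              # 5.1
print(round(first_five.mean(), 2))   # 24.3 / 5 -> 4.86
print(first_five.median())           # middle value once sorted -> 4.9

# agg() takes a list of method names and returns a Series of the results.
summary = first_five.agg(["min", "max", "mean", "median"])
print(summary)
```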

### F-strings
If we want some fancier printing, we can use a feature introduced in Python 3.6 called `f-strings`. This allows us to insert variables directly into strings for display, without having to worry about concatenation.

E.g.
```python
f"" # This is an f-string. It has an f before the "" or ''
some_var = 42
print(f"We can write normal strings with it.")
print(f"also input variables, like {some_var} in it.")
```

Just put your variable between curly braces `{}` to include it automatically. Notice how it has also converted the integer to a string for us.

In [11]:
f"" # This is an f-string. It has an f before the "" or ''
some_var = 42
print(f"We can write normal strings with it.")
print(f"also input variables, like {some_var} in it.")

We can write normal strings with it.
also input variables, like 42 in it.


Let's use these with our fancy statistical metrics to print things a bit nicer.


In [12]:
print( f"Sepal Length Min: {sepal_lengths.min()}" )
print( f"Sepal Length Max: {sepal_lengths.max()}" )
print( f"Sepal Length Mean: {sepal_lengths.mean()}" )
print( f"Sepal Length Median: {sepal_lengths.median()}" )

Sepal Length Min: 4.3
Sepal Length Max: 7.9
Sepal Length Mean: 5.843333333333334
Sepal Length Median: 5.8

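The mean above prints with a long tail of decimal places. f-strings also accept a format specifier after a colon inside the braces, which can tidy this up. A small sketch, using the mean value printed above (the variable names are illustrative):

```python
mean_length = 5.843333333333334   # the mean printed above

# ".2f" means: fixed-point notation with two decimal places.
two_dp = f"Sepal Length Mean: {mean_length:.2f}"
print(two_dp)    # Sepal Length Mean: 5.84

# Any expression can go inside the braces, not just a variable name.
doubled = f"Doubled: {mean_length * 2:.1f}"
print(doubled)   # Doubled: 11.7
```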

Now let's look at the last one: mode. If we use this on Sepal Length, it's not very useful, as length is continuous data. So we'll use it on the variety.

This method behaves differently from the others: several values can tie for most common, so it returns every such value, in sorted order. The result is therefore a Series itself, rather than a single value. Here each of the three varieties appears 50 times, so all three are returned.

```python
modes = my_dataframe['variety'].mode()
for m in modes:
    print(m)
```

Notice how I didn't need to store the column in a variable first. Any expression that evaluates to the value you want can be used directly. In this case, I indexed our column, variety, then took the mode of the resulting Series, all in a single expression.

In [13]:
modes = my_dataframe['variety'].mode()
for m in modes:
    print(m)

Setosa
Versicolor
Virginica

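All three varieties are printed because `mode()` returns every value that ties for most frequent. The sketch below uses two small made-up Series (not the real dataset) to contrast a clear winner with a tie.

```python
import pandas as pd

# Hypothetical data: one value is clearly the most common.
clear_winner = pd.Series(["Setosa", "Setosa", "Virginica"])
# Hypothetical data: two values appear twice each.
two_way_tie = pd.Series(["Virginica", "Setosa", "Virginica", "Setosa", "Versicolor"])

print(list(clear_winner.mode()))   # ['Setosa']
print(list(two_way_tie.mode()))    # ['Setosa', 'Virginica']

# The result is always a Series, so take the first entry for a single value.
print(clear_winner.mode()[0])      # Setosa
```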

### Unique, Describe, Count.
We also have a method for obtaining the unique values in a given attribute (column). In our case this is useful to verify that we only have three types of flower.

```python
print( my_dataframe['variety'].unique() )
```

In [14]:
print( my_dataframe['variety'].unique() )

['Setosa' 'Versicolor' 'Virginica']

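Two related Series methods are worth knowing alongside `unique()`: `nunique()` gives the number of distinct values, and `value_counts()` gives how often each one occurs. A sketch on a small made-up Series (the name `varieties` and its contents are illustrative only):

```python
import pandas as pd

varieties = pd.Series(
    ["Setosa", "Virginica", "Setosa", "Versicolor", "Virginica", "Setosa"]
)

print(varieties.unique())     # distinct values, in order of first appearance
print(varieties.nunique())    # how many distinct values there are -> 3

counts = varieties.value_counts()   # occurrences of each distinct value
print(counts["Setosa"])             # 3
```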

`.count()` gives us the number of non-null elements within a column.

```python
print( my_dataframe['variety'].count() )
```

If the column contains missing values, this will be smaller than the number of rows.

In [15]:
print( my_dataframe['variety'].count() )

150

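Our dataset has no missing values, so `count()` matches the number of rows. To see the difference, here is a sketch with a made-up Series in which two entries are missing:

```python
import pandas as pd

# Hypothetical column with two missing entries (None becomes NaN).
widths = pd.Series([3.5, None, 3.2, None, 3.6])

print(len(widths))           # total number of rows -> 5
print(widths.count())        # non-null values only -> 3
print(widths.isna().sum())   # number of missing values -> 2
```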

Finally, `.describe()` allows us to look at various summary statistics of a DataFrame or Series.

```python 
print(my_dataframe['variety'].describe())
```

In [16]:
print(my_dataframe['variety'].describe())

count            150
unique             3
top       Versicolor
freq              50
Name: variety, dtype: object


Note: Many of these functions actually work on the whole DataFrame itself!
    
```python
print(my_dataframe.describe())
```

In [17]:
my_dataframe.describe() # Remember if it's output of the cell, it's prettier. Only works in Jupyter.

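|       | sepal.length | sepal.width | petal.length | petal.width |
|-------|--------------|-------------|--------------|-------------|
| count | 150.0        | 150.0       | 150.0        | 150.0       |
| mean  | 5.843333     | 3.057333    | 3.758        | 1.199333    |
| std   | 0.828066     | 0.435866    | 1.765298     | 0.762238    |
| min   | 4.3          | 2.0         | 1.0          | 0.1         |
| 25%   | 5.1          | 2.8         | 1.6          | 0.3         |
| 50%   | 5.8          | 3.0         | 4.35         | 1.3         |
| 75%   | 6.4          | 3.3         | 5.1          | 1.8         |
| max   | 7.9          | 4.4         | 6.9          | 2.5         |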

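Notice that the `variety` column is absent from the table: by default, `describe()` on a DataFrame summarises only the numeric columns. The result is itself a DataFrame, so individual figures can be pulled out of it. A sketch on four rows from the dataset (the names `flowers`, `summary` and `full` are illustrative):

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 6.7, 6.3],
    "variety": ["Setosa", "Setosa", "Virginica", "Virginica"],
})

summary = flowers.describe()                # numeric columns only by default
print(list(summary.columns))                # ['sepal.length']
print(summary.loc["max", "sepal.length"])   # 6.7

# include="all" also summarises non-numeric columns (unique, top, freq).
full = flowers.describe(include="all")
print(full.loc["unique", "variety"])        # 2
```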