<img src=images/gdd-logo.png width=300px align=right>

# Pandas introduction

The next notebooks will cover how to use the `pandas` library to explore datasets.

In this section we will cover:

* [Pandas overview](#overview)
* [Lambdas starter](#lambdas)
* [Benefits of using pandas](#benefits)
* [Data wrangling](#wrangling)
* [<mark>Exercise: Exploratory Data Analysis</mark>](#exploring)
    * [<mark>Exercise: Learn some key methods!</mark>](#exploring)
    * [<mark>Exercise: Explore a dataset</mark>](#ex-explore-data)
* [Analysis](#analysis)

<a id = 'lambdas'></a>
## <mark>Exercise: Lambda Starter</mark>

**Lambda functions** really start to come into their own when we use them with pandas. Therefore we need to be really comfortable with the syntax. 

Here is an example of a lambda function that adds 4 to a number:

In [1]:
add_4 = lambda x: x + 4

In [9]:
(lambda x,y:x+4*y)(10,4)

26

In [2]:
add_4(10)

14

This lambda function is essentially equivalent to formally defining a function with `def`:

In [3]:
def add_4(x):
    return x + 4

In [4]:
add_4(10)

14

We can also define multiple parameters in a lambda function by separating them with commas:

In [5]:
add_4y = lambda x, y: x + 4*y

In [6]:
add_4y(10, y = 2)

18

Now complete the following questions:

1. Create a lambda function that multiplies two numbers together (and check it)

In [12]:
product = lambda x, y: x * y
product(10,4)


40

2. Create a lambda function to check if a number is bigger than 10 (and check it)

In [15]:
grt_thn10 = lambda x: x > 10
grt_thn10(11)


True
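
As a preview of why lambdas pair well with pandas, here is a minimal sketch that passes lambdas to the `apply` method of a Series. The values in `weights` are made up for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical values, made up for illustration
weights = pd.Series([42, 51, 59])

# apply calls the lambda once per element and collects the results
weights_plus_4 = weights.apply(lambda x: x + 4)

# A lambda that returns a boolean gives one True/False per element
is_heavy = weights.apply(lambda x: x > 50)

print(weights_plus_4.tolist())  # [46, 55, 63]
print(is_heavy.tolist())        # [False, True, True]
```

For simple arithmetic like this, `weights + 4` gives the same result; lambdas become more useful when the per-element logic is harder to express directly.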

<a id = 'overview'></a>
## Pandas overview

Pandas is a specialised package that allows you to work with tabular data using Python.

First you need to import the package:

In [16]:
import pandas as pd

Then, to read in a csv file you can use:
```python
pd.read_csv('filepath/file.csv')
```

This notebook uses the `chickweight.csv` dataset, which is in the `data/` folder:

In [17]:
pd.read_csv('data/chickweight.csv')

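|     | rownum | weight | Time | Chick | Diet |
|-----|--------|--------|------|-------|------|
| 0   | 1      | 42     | 0    | 1     | 1    |
| 1   | 2      | 51     | 2    | 1     | 1    |
| 2   | 3      | 59     | 4    | 1     | 1    |
| 3   | 4      | 64     | 6    | 1     | 1    |
| 4   | 5      | 76     | 8    | 1     | 1    |
| ... | ...    | ...    | ...  | ...   | ...  |
| 573 | 574    | 175    | 14   | 50    | 4    |
| 574 | 575    | 205    | 16   | 50    | 4    |
| 575 | 576    | 234    | 18   | 50    | 4    |
| 576 | 577    | 264    | 20   | 50    | 4    |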


Aside from making tables look prettier in your Jupyter notebook, there are many advantages to using the pandas package when working with data.
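
If you do not have the file at hand, `pd.read_csv` can also be tried on an in-memory string. The sketch below assumes a small, hypothetical CSV snippet; the real `chickweight.csv` may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for illustration
csv_text = """rownum,weight,Time,Chick,Diet
1,42,0,1,1
2,51,2,1,1
3,59,4,1,1
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (3, 5)
print(list(df.columns))  # ['rownum', 'weight', 'Time', 'Chick', 'Diet']
```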

<a id = 'benefits'></a>

## The benefits of using Pandas (and Python)

**Question:** What benefits do you think you get from using `pandas` to work with data?

<details>
    <summary><font color=blue>Show answer</font></summary>

- **Automation**: You can automate otherwise tedious tasks such as merging multiple datasets.
- **Cleaning**: Pandas allows you to automate the cleaning of your datasets.
- **Speed**: When working with large datasets, it is much faster than tools like Excel.
- **Filtering**: You can easily filter rows to find specific values.
- **Groupby**: Chunk your dataset into pieces, apply a function to each piece, and put the results back together.
- **Creating new columns**: Easily create new columns from calculations on other columns.

    
*And much more!*
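
A minimal sketch of three of the points above (filtering, creating a new column, and groupby), using a small hypothetical DataFrame. The values and the `double_weight` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "weight": [42, 51, 40, 49],
    "diet": [1, 1, 2, 2],
})

# Filtering: keep only the rows where weight exceeds 45
heavy = df[df["weight"] > 45]

# Creating new columns: derive a column from an existing one
df["double_weight"] = df["weight"] * 2

# Groupby: split by diet, average the weights, combine the results
mean_by_diet = df.groupby("diet")["weight"].mean()

print(heavy["weight"].tolist())      # [51, 49]
print(df["double_weight"].tolist())  # [84, 102, 80, 98]
print(mean_by_diet.tolist())         # [46.5, 44.5]
```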

</details>


<a id = 'wrangling'></a>

## Data wrangling with Pandas

**Data Wrangling** is the process of transforming and mapping data, with the intent of making it more appropriate and valuable for a variety of downstream purposes such as for dashboards or analytics.

To demonstrate pandas' capabilities, let's load the `chickweight.csv` dataset.

In [19]:
chickweight = pd.read_csv('data/chickweight.csv').rename(str.lower, axis='columns')

chickweight.head()

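|   | rownum | weight | time | chick | diet |
|---|--------|--------|------|-------|------|
| 0 | 1      | 42     | 0    | 1     | 1    |
| 1 | 2      | 51     | 2    | 1     | 1    |
| 2 | 3      | 59     | 4    | 1     | 1    |
| 3 | 4      | 64     | 6    | 1     | 1    |
| 4 | 5      | 76     | 8    | 1     | 1    |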
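
The `rename(str.lower, axis='columns')` call above lower-cases every column name. A minimal sketch of the same call on a small hypothetical DataFrame shows that the function is applied to each label and that the original frame is left unchanged:

```python
import pandas as pd

# Hypothetical frame with mixed-case column names
df = pd.DataFrame({"weight": [42], "Time": [0], "Chick": [1], "Diet": [1]})

# rename calls str.lower on every column label and returns a new DataFrame
lowered = df.rename(str.lower, axis="columns")

print(list(df.columns))       # ['weight', 'Time', 'Chick', 'Diet']
print(list(lowered.columns))  # ['weight', 'time', 'chick', 'diet']
```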


<a id = 'exploring'></a>
## <mark> Exercise: Exploratory Data Analysis</mark>

### <mark>Part 1: Learn some key attributes/methods!</mark>

Fill in the comments to explain what each cell does **in your own words**. 

You can use `help(pd.DataFrame.X)` to access the documentation for the attribute/method. For example,

```python
help(chickweight.info)
```

The first one is done for you.

In [None]:
# the shape attribute... gives the number of rows and number of columns
chickweight.shape
# number of rows and columns in the dataset

(578, 5)

In [None]:
# the info method...
chickweight.info()
# summary of the columns: names, non-null counts, dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578 entries, 0 to 577
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   rownum  578 non-null    int64
 1   weight  578 non-null    int64
 2   time    578 non-null    int64
 3   chick   578 non-null    int64
 4   diet    578 non-null    int64
dtypes: int64(5)
memory usage: 22.7 KB


In [None]:
# the describe method...
chickweight.describe()
# summary statistics (count, mean, std, min, quartiles, max) per numeric column

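|       | rownum     | weight     | time      | chick     | diet     |
|-------|------------|------------|-----------|-----------|----------|
| count | 578.0      | 578.0      | 578.0     | 578.0     | 578.0    |
| mean  | 289.5      | 121.818339 | 10.717993 | 25.750865 | 2.235294 |
| std   | 166.998503 | 71.07196   | 6.7584    | 14.568795 | 1.162678 |
| min   | 1.0        | 35.0       | 0.0       | 1.0       | 1.0      |
| 25%   | 145.25     | 63.0       | 4.0       | 13.0      | 1.0      |
| 50%   | 289.5      | 103.0      | 10.0      | 26.0      | 2.0      |
| 75%   | 433.75     | 163.75     | 16.0      | 38.0      | 3.0      |
| max   | 578.0      | 373.0      | 21.0      | 50.0      | 4.0      |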


In [None]:
# the columns attribute...
chickweight.columns
# the column labels, returned as an Index

Index(['rownum', 'weight', 'time', 'chick', 'diet'], dtype='object')

In [None]:
# the head method...
chickweight.head()
# first 5 rows

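|   | rownum | weight | time | chick | diet |
|---|--------|--------|------|-------|------|
| 0 | 1      | 42     | 0    | 1     | 1    |
| 1 | 2      | 51     | 2    | 1     | 1    |
| 2 | 3      | 59     | 4    | 1     | 1    |
| 3 | 4      | 64     | 6    | 1     | 1    |
| 4 | 5      | 76     | 8    | 1     | 1    |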


In [None]:
# the tail method...
chickweight.tail()
# last 5 rows

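|     | rownum | weight | time | chick | diet |
|-----|--------|--------|------|-------|------|
| 573 | 574    | 175    | 14   | 50    | 4    |
| 574 | 575    | 205    | 16   | 50    | 4    |
| 575 | 576    | 234    | 18   | 50    | 4    |
| 576 | 577    | 264    | 20   | 50    | 4    |
| 577 | 578    | 264    | 21   | 50    | 4    |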


In [None]:
# the sample method...
chickweight.sample(5)
# a random sample of 5 rows

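|     | rownum | weight | time | chick | diet |
|-----|--------|--------|------|-------|------|
| 210 | 211    | 54     | 4    | 20    | 1    |
| 402 | 403    | 61     | 4    | 36    | 3    |
| 179 | 180    | 57     | 8    | 16    | 1    |
| 101 | 102    | 90     | 12   | 9     | 1    |
| 94  | 95     | 125    | 20   | 8     | 1    |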


In [None]:
# you can use square brackets to...
chickweight['diet']
# selects a single column as a Series (the display shows its first and last 5 rows)

0      1
1      1
2      1
3      1
4      1
      ..
573    4
574    4
575    4
576    4
577    4
Name: diet, Length: 578, dtype: int64

In [None]:
# the unique method...
chickweight['diet'].unique()
# Summarizes the different unique values that exist in a column

array([1, 2, 3, 4])

In [None]:
# the value_counts method...
chickweight['diet'].value_counts()
# counts how often each unique value occurs

diet
1    220
2    120
3    120
4    118
Name: count, dtype: int64

In [None]:
# the mean method...
chickweight.mean()
# computes the mean of each column

rownum    289.500000
weight    121.818339
time       10.717993
chick      25.750865
diet        2.235294
dtype: float64

<a id = 'ex-explore-data'></a>
### <mark>Part 2: Explore a dataset</mark>
Investigate the `weight` and `time` columns of the dataframe.

1. How many different unique values for `time` are there? What do you think time represents in this dataframe?

In [40]:
chickweight['weight'].value_counts()

weight
41     20
42     15
49     13
62      8
39      8
       ..
322     1
152     1
203     1
237     1
105     1
Name: count, Length: 212, dtype: int64
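
The cell above counts values in the `weight` column. For the question about `time`, one option is `nunique()`, which returns the number of distinct values, alongside `unique()`, which lists them. A minimal sketch on a hypothetical Series (the values are made up, so the numbers will differ from the real dataset):

```python
import pandas as pd

# Hypothetical measurement times, made up for illustration
times = pd.Series([0, 2, 4, 0, 2, 4, 6], name="time")

# nunique returns how many distinct values there are
n_unique = times.nunique()

# unique lists the distinct values in order of first appearance
distinct = times.unique().tolist()

print(n_unique)  # 4
print(distinct)  # [0, 2, 4, 6]
```

On the notebook's data the equivalent call would be `chickweight['time'].nunique()`.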

2. What are the min & max of the time and weight column?

In [57]:
chickweight[['weight','time']].agg(["min","max"])


|     | weight | time |
|-----|--------|------|
| min | 35     | 0    |
| max | 373    | 21   |


**Bonus:** What is the most common (i.e. the mode) weight of a chicken?

In [61]:
chickweight['weight'].mode()

0    41
Name: weight, dtype: int64
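
Note that `mode()` returns a Series rather than a single number, because several values can tie for most frequent. A minimal sketch with hypothetical values that contain such a tie:

```python
import pandas as pd

# Hypothetical weights with two values tied for most frequent
weights = pd.Series([41, 41, 42, 42, 49], name="weight")

# mode returns every value that shares the highest count, sorted ascending
modes = weights.mode()
print(modes.tolist())  # [41, 42]

# Take the first entry when a single value is enough
first_mode = modes.iloc[0]
print(first_mode)      # 41
```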

### Answers


<details>
    <summary><font style=font-weight:bold>Part 1:</font>
        <font color=blue>Show answer</font></summary>
  
Exploring a dataset includes:

* Checking the shape (`df.shape`) of the dataframe
* The length (`len(df)`) of the dataframe
* General information (`df.info()`) of the dataframe & columns
* Summary statistics of each numeric column (`df.describe()`)
* The column names (`df.columns`)
* Fetching the first/last or a sample of a few rows (`df.head()`, `df.tail()`, `df.sample()`)
* Selecting one (or more) columns (`df['column_name']`)
* Fetching the unique values of a column (`df['column_name'].unique()`)
* Counting how often each unique value of a column occurs (`df['column_name'].value_counts()`)

</details>
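
Two of the points in the Part 1 answer are easy to mix up, so here is a minimal sketch on a small hypothetical DataFrame: `len(df)` matches the first entry of `df.shape`, and selecting more than one column requires a list inside the square brackets.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({"weight": [42, 51, 59], "time": [0, 2, 4], "diet": [1, 1, 2]})

# shape is a (rows, columns) tuple; len counts rows only
n_rows, n_cols = df.shape
print(n_rows, n_cols)     # 3 3
print(len(df) == n_rows)  # True

# A list of names inside the brackets selects several columns at once
subset = df[["weight", "time"]]
print(list(subset.columns))  # ['weight', 'time']
```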


**Part 2:** Uncomment (remove the `# `) and run the cell to see the solution.

In [None]:
# %load answers/01_Introduction/ex-explore-data-1.py

In [None]:
# %load answers/01_Introduction/ex-explore-data-2.py

In [59]:
# %load answers/01_Introduction/ex-explore-data-3.py

chickweight['weight'].mode()


<a id = 'analysis'></a>
## Analysis

### What analysis could you do? 


<img src="images/01_Introduction/chick.png" width="240" height="240" align="center"/>

Imagine that you own a farm and have this dataset available.

Now that you have a feel for the dataset, what could you do with it?

In [None]:
chickweight.head()

### Potential areas for analysis:

Some questions you might want to answer are:

The main use case could be to figure out which diet is best, but it is good to think about some of the other use cases. 
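
As a starting point for the diet question, one possible approach is to compare the average weight per diet at the last measurement time. The sketch below uses a small hypothetical DataFrame with the same column names as `chickweight`; the values are made up, and a difference in means on its own does not show that one diet is better.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "weight": [42, 200, 40, 230, 41, 215],
    "time":   [0, 21, 0, 21, 0, 21],
    "chick":  [1, 1, 2, 2, 3, 3],
    "diet":   [1, 1, 2, 2, 2, 2],
})

# Keep only the rows from the last measurement time
final = df[df["time"] == df["time"].max()]

# Average final weight per diet, highest first
mean_final_weight = (
    final.groupby("diet")["weight"].mean().sort_values(ascending=False)
)

print(mean_final_weight.index.tolist())  # [2, 1]
print(mean_final_weight.tolist())        # [222.5, 200.0]
```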