# 3.5.19 Pandas

### Introduction to Pandas

`pandas` is a Python library for **data manipulation and analysis**. It offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "**pan**el **da**ta" (a term for data sets that include observations over multiple time periods for the same individuals). **Wes McKinney** built what would become pandas between 2007 and 2013; since then it has been an entirely community-managed project.

I also recommend checking out his **book on data analysis with Python**, which is freely available online: [Python for Data Analysis](https://wesmckinney.com/book/).




<figure>
<img src="img/wes-hadley.jpeg" alt="fishy" class="bg-primary mb-1" width="400">
<figcaption align = "center"> Hadley Wickham (R > Tidyverse) & Wes McKinney (Py > pandas) </figcaption>
</figure>

While pandas adopts many coding idioms from NumPy, the biggest difference is that **pandas is designed for working with tabular or heterogeneous data** (that is: numbers, dates, strings, ...). NumPy, by contrast, is best suited for working with homogeneously-typed numerical array data (that is: numbers).

Just like numpy is conventionally imported as `import numpy as np`, pandas has its own convention: 

In [1]:
import pandas as pd
import numpy as np

`pandas` provides two **data structures** that allow you to shape data into a readable form:

- **Series**: a one-dimensional array-like object containing a sequence of values
- **DataFrame**: a rectangular table of data where each column is a Series

So a Series is the data structure for a single column of a DataFrame: selecting one column of a DataFrame gives you a Series, and a DataFrame behaves like a collection of Series that share the same index.
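To make this relationship concrete, here is a minimal sketch. The toy column names and values are made up for illustration only; it simply checks that a single column comes back as a Series carrying the same index as the DataFrame:

```python
import pandas as pd

# hypothetical toy data, used only for this illustration
df_demo = pd.DataFrame({"city": ["Rome", "Milan"], "year": [2020, 2021]})

# selecting one column gives back a Series
col = df_demo["city"]
is_series = isinstance(col, pd.Series)

# the Series carries the same row index as the DataFrame it came from
same_index = col.index.equals(df_demo.index)

print(type(col).__name__, is_series, same_index)
```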

#### Series

A pandas `Series` is a **one-dimensional data structure** made up of key-value pairs. It contains a sequence of values of the same type and an associated array of data labels, called its index (these are the keys of the key-value pairs).

You can initialise a simple series with `pd.Series()`:

In [3]:
# integer series
pd.Series([1,2,3])

0    1
1    2
2    3
dtype: int64

In [4]:
# float series
pd.Series([5.0,5.3,5.5])

0    5.0
1    5.3
2    5.5
dtype: float64

In [6]:
# string series
pd.Series(["WoW","-","Owen Wilson"])

0            WoW
1              -
2    Owen Wilson
dtype: object

We said that a series is a collection of key-value pairs (similar to a Python dictionary); you can access the keys and the values via the `.index` and `.array` attributes:

In [8]:
# initialise a series "s"
s = pd.Series([10, 20, 30, 40, 50])

In [10]:
# get the keys (index) of the series 
s.index

RangeIndex(start=0, stop=5, step=1)

In [11]:
# get the values of the series
s.array

<PandasArray>
[10, 20, 30, 40, 50]
Length: 5, dtype: int64

You can also specify the index labels of the series explicitly through the `index` argument:

In [15]:
# initialise a series "s2"
s2 = pd.Series(["junior", "adult", "senior"], index = ["j", "a", "s"])

In [16]:
# get the keys (index) of the series 
s2.index

Index(['j', 'a', 's'], dtype='object')

In [17]:
# get the values of the series
s2.array

<PandasArray>
['junior', 'adult', 'senior']
Length: 3, dtype: object

Indexing, slicing, assignment and boolean masking work in an analogous way to Python lists and numpy's arrays: 

In [18]:
# selecting by index label
s2["j"]

'junior'

In [21]:
# slicing by position (the end of the slice is excluded)
s[1:3]

1    20
2    30
dtype: int64

In [22]:
# overwrite an existing value
s2["s"] = "Boomer"

In [24]:
# boolean masking
s[s > 30]

3    40
4    50
dtype: int64

In [23]:
s2

j    junior
a     adult
s    Boomer
dtype: object

In [25]:
# selecting by a single index label
s2["s"]

'Boomer'

In [26]:
# selecting by multiple index labels (notice the double square brackets)
s2[["s", "j"]]

s    Boomer
j    junior
dtype: object

In [27]:
# multiplying by a scalar
s * 5

0     50
1    100
2    150
3    200
4    250
dtype: int64

You can transform a dictionary into a Series (and a Series can be converted back to a dictionary with the `.to_dict()` method): 

In [28]:
# given the following dictionary...
cities = {"Rome":2761632, "Milan":1371498, "Naples":914758, "Turin":848885, "Palermo":np.nan}
cities

{'Rome': 2761632,
 'Milan': 1371498,
 'Naples': 914758,
 'Turin': 848885,
 'Palermo': nan}

In [31]:
# ...we can create a Series: 
cities_s = pd.Series(cities)
cities_s

Rome       2761632.0
Milan      1371498.0
Naples      914758.0
Turin       848885.0
Palermo          NaN
dtype: float64
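The reverse direction mentioned above, `.to_dict()`, is not shown in the cells, so here is a short sketch of the round trip using the same `cities` dictionary. Note that the missing value forces a float dtype, so the populations come back as floats:

```python
import numpy as np
import pandas as pd

cities = {"Rome": 2761632, "Milan": 1371498, "Naples": 914758,
          "Turin": 848885, "Palermo": np.nan}

# dictionary -> Series: keys become the index, values become the data
cities_s = pd.Series(cities)

# Series -> dictionary: index labels become the keys again
back = cities_s.to_dict()

print(list(back.keys()))
# the NaN forced a float dtype, so the numbers come back as floats
print(back["Rome"])
```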

The functions `pd.isna()` and `pd.notna()` can be used to detect missing data:

In [32]:
# detect missing values
pd.isna(cities_s)

Rome       False
Milan      False
Naples     False
Turin      False
Palermo     True
dtype: bool

In [33]:
pd.notna(cities_s)

Rome        True
Milan       True
Naples      True
Turin       True
Palermo    False
dtype: bool

A function like `pd.isna()` can be used as a boolean mask to select part of the Series. Here we select the missing value and assign it a new value:

In [39]:
# pd.isna() boolean mask + assignment of new value
cities_s[pd.isna(cities_s)] = 911
cities_s

Rome       2761632.0
Milan      1371498.0
Naples      914758.0
Turin       848885.0
Palermo        911.0
dtype: float64
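As an alternative to mask-plus-assignment, the `.fillna()` method may be more convenient when you want a new Series and prefer to leave the original untouched. The sketch below uses a shortened, hypothetical Series, and the replacement value `0` is an arbitrary placeholder, not a real population figure:

```python
import numpy as np
import pandas as pd

# shortened example data; Palermo's value is missing
s_missing = pd.Series({"Rome": 2761632, "Palermo": np.nan})

# .fillna() returns a new Series; s_missing itself is left untouched
filled = s_missing.fillna(0)  # 0 is an arbitrary placeholder value

print(filled)
print(s_missing.isna().sum())  # the original still has one missing value
```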

#### Data Frame

A pandas `DataFrame` is a **two-dimensional data structure** that can be thought of as a spreadsheet. A DataFrame can also be thought of as a **dictionary of Series**. The DataFrame has both a **row and column index** and each column can be of a **different data type** (numeric, string, boolean, etc.).

The most common way to **create a pandas DataFrame** is from a dictionary of equal-length lists or NumPy arrays:

In [34]:
data = {"gender": ["Males", "Males", "Males", "Males", "Females", "Females", "Females", "Females"],
        "year": [2018, 2019, 2020, 2021, 2018, 2019, 2020, 2021],
        "popM": [29.428, 29.131, 29.050, 28.866, 31.056, 30.685, 30.591, 30.370]}
frame = pd.DataFrame(data)
frame

|  | gender | year | popM |
|---|---|---|---|
| 0 | Males | 2018 | 29.428 |
| 1 | Males | 2019 | 29.131 |
| 2 | Males | 2020 | 29.05 |
| 3 | Males | 2021 | 28.866 |
| 4 | Females | 2018 | 31.056 |
| 5 | Females | 2019 | 30.685 |
| 6 | Females | 2020 | 30.591 |
| 7 | Females | 2021 | 30.37 |
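The cell above builds the DataFrame from lists. The same constructor also accepts a dictionary of NumPy arrays, as mentioned in the text; the following is a small sketch with a hypothetical three-row subset of the data:

```python
import numpy as np
import pandas as pd

# hypothetical subset of the data; every array must have the same length
arrays = {"year": np.array([2018, 2019, 2020]),
          "popM": np.array([29.428, 29.131, 29.050])}

frame_np = pd.DataFrame(arrays)

# each column keeps the dtype of the array it was built from
print(frame_np.dtypes)
print(frame_np.shape)
```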


You can access the first / last rows of your dataframe with the `.head()` and `.tail()` methods:

In [35]:
# top 5 rows (5 is the default)
frame.head()

|  | gender | year | popM |
|---|---|---|---|
| 0 | Males | 2018 | 29.428 |
| 1 | Males | 2019 | 29.131 |
| 2 | Males | 2020 | 29.05 |
| 3 | Males | 2021 | 28.866 |
| 4 | Females | 2018 | 31.056 |


In [36]:
# last 3 rows 
frame.tail(3)

|  | gender | year | popM |
|---|---|---|---|
| 5 | Females | 2019 | 30.685 |
| 6 | Females | 2020 | 30.591 |
| 7 | Females | 2021 | 30.37 |


You can **retrieve a column in a DataFrame** as a Series with the following two methods: 

- `frame["colname"]` indexing or square bracket access *(preferred method)*
- `frame.colname` attribute or dot access

In [40]:
# retrieve the popM column with the dict-like notation
frame["popM"]

0    29.428
1    29.131
2    29.050
3    28.866
4    31.056
5    30.685
6    30.591
7    30.370
Name: popM, dtype: float64

In [41]:
# retrieve the popM column with the dot-attribute notation
frame.popM

0    29.428
1    29.131
2    29.050
3    28.866
4    31.056
5    30.685
6    30.591
7    30.370
Name: popM, dtype: float64

Some notes on the **differences between the two** access methods: 

- indexing `[]` (square bracket access) has the full functionality to operate on DataFrame column data
- attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, but it has some limitations, namely:
    - you cannot add a column with this method
    - it only works if the column name does not conflict with any of the method names in DataFrame
    - it won't work if you have spaces in the column name 
    - it won't work if the column name is an integer

Check the [documentation](http://pandas-docs.github.io/pandas-docs-travis/user_guide/indexing.html#attribute-access) for more on attribute access.
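The limitations listed above can be seen in a small sketch. The DataFrame `demo` and its column names (`pop total`, `count`, `doubled`) are hypothetical and chosen only to trigger each case:

```python
import warnings
import pandas as pd

# hypothetical data: one simple name, one name with a space, one clashing with a method
demo = pd.DataFrame({"popM": [29.4, 31.1], "pop total": [59.0, 60.5], "count": [1, 2]})

# dot access works for an existing column with a simple name
print(demo.popM.tolist())

# assigning through dot access sets a plain Python attribute, not a column
# (pandas emits a UserWarning about this, silenced here to keep the output clean)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo.doubled = demo["popM"] * 2
created_by_dot = "doubled" in demo.columns
print(created_by_dot)

# "count" clashes with the DataFrame.count method, so dot access returns the method
print(callable(demo.count))

# square brackets work in all of these cases, including names with spaces
demo["doubled"] = demo["popM"] * 2
print(demo["pop total"].tolist(), demo["count"].tolist())
```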

**Rows can also be retrieved** by integer position with `.iloc` or by index label with `.loc`. Let's first **set a new index** on our dataframe with the `.set_index()` method, to better understand the difference between these two ways of selecting rows.

In [42]:
# to set a new index on our dataframe
frame.set_index([["m18", "m19", "m20", "m21", "f18", "f19", "f20", "f21"]], inplace=True)
frame

|  | gender | year | popM |
|---|---|---|---|
| m18 | Males | 2018 | 29.428 |
| m19 | Males | 2019 | 29.131 |
| m20 | Males | 2020 | 29.05 |
| m21 | Males | 2021 | 28.866 |
| f18 | Females | 2018 | 31.056 |
| f19 | Females | 2019 | 30.685 |
| f20 | Females | 2020 | 30.591 |
| f21 | Females | 2021 | 30.37 |


Using `.iloc[2]`, we select the row at **position** 2 (so the third row, since Python indexes start at 0):

In [43]:
# the i in iloc stands for integer: this retrieves the row at integer position 2
frame.iloc[2]

gender    Males
year       2020
popM      29.05
Name: m20, dtype: object

Using `.loc["f18"]`, we will select the row whose **index name** = "f18" (so the fifth row in this example): 

In [44]:
# .loc matches the index label, so this retrieves the row labelled "f18"
frame.loc["f18"]

gender    Females
year         2018
popM       31.056
Name: f18, dtype: object

*Check [this table](https://wesmckinney.com/book/pandas-basics.html#tbl-table_dataframe_loc_iloc) for a complete list of indexing solutions.*
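Both indexers also accept slices and a column selection. The following sketch uses a hypothetical three-row version of the data to show one difference that often surprises people: `.iloc` slices exclude the end position, while `.loc` slices include the end label.

```python
import pandas as pd

# hypothetical three-row version of the data
demo = pd.DataFrame(
    {"gender": ["Males", "Males", "Females"],
     "year": [2018, 2019, 2018],
     "popM": [29.428, 29.131, 31.056]},
    index=["m18", "m19", "f18"],
)

# .iloc: integer positions; the end of a slice is excluded
by_position = demo.iloc[0:2, [1, 2]]

# .loc: labels; the end of a slice is included
by_label = demo.loc["m18":"m19", ["year", "popM"]]

# both select the same rows (m18, m19) and columns (year, popM)
print(by_position.equals(by_label))
```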

To **create a new column**, you can use the same indexing / square bracket access that we saw before:

In [45]:
frame

|  | gender | year | popM |
|---|---|---|---|
| m18 | Males | 2018 | 29.428 |
| m19 | Males | 2019 | 29.131 |
| m20 | Males | 2020 | 29.05 |
| m21 | Males | 2021 | 28.866 |
| f18 | Females | 2018 | 31.056 |
| f19 | Females | 2019 | 30.685 |
| f20 | Females | 2020 | 30.591 |
| f21 | Females | 2021 | 30.37 |


In [49]:
frame["County"] = "Italy"
frame["Constant"] = 10
frame["variabele"] = np.arange(8)
frame["random"] = np.random.rand(8)
frame

|  | gender | year | popM | County | Constant | variabele | random |
|---|---|---|---|---|---|---|---|
| m18 | Males | 2018 | 29.428 | Italy | 10 | 0 | 0.165234 |
| m19 | Males | 2019 | 29.131 | Italy | 10 | 1 | 0.702245 |
| m20 | Males | 2020 | 29.05 | Italy | 10 | 2 | 0.395166 |
| m21 | Males | 2021 | 28.866 | Italy | 10 | 3 | 0.754924 |
| f18 | Females | 2018 | 31.056 | Italy | 10 | 4 | 0.987998 |
| f19 | Females | 2019 | 30.685 | Italy | 10 | 5 | 0.145037 |
| f20 | Females | 2020 | 30.591 | Italy | 10 | 6 | 0.476971 |
| f21 | Females | 2021 | 30.37 | Italy | 10 | 7 | 0.750043 |


You can also multiply one column by the other and save it as a new column:

In [53]:
frame["Multiply"] = frame["Constant"] * frame["random"]
frame

|  | gender | year | popM | County | Constant | variabele | random | Multiply |
|---|---|---|---|---|---|---|---|---|
| m18 | Males | 2018 | 29.428 | Italy | 10 | 0 | 0.165234 | 1.652344 |
| m19 | Males | 2019 | 29.131 | Italy | 10 | 1 | 0.702245 | 7.022449 |
| m20 | Males | 2020 | 29.05 | Italy | 10 | 2 | 0.395166 | 3.951657 |
| m21 | Males | 2021 | 28.866 | Italy | 10 | 3 | 0.754924 | 7.549237 |
| f18 | Females | 2018 | 31.056 | Italy | 10 | 4 | 0.987998 | 9.879983 |
| f19 | Females | 2019 | 30.685 | Italy | 10 | 5 | 0.145037 | 1.450374 |
| f20 | Females | 2020 | 30.591 | Italy | 10 | 6 | 0.476971 | 4.769714 |
| f21 | Females | 2021 | 30.37 | Italy | 10 | 7 | 0.750043 | 7.500433 |


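To **retrieve the column names** use the `.columns` attribute *(the output below was produced before the `Multiply` column was added, so that column does not appear in it)*: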

In [50]:
frame.columns

Index(['gender', 'year', 'popM', 'County', 'Constant', 'variabele', 'random'], dtype='object')

Similarly, use the `.index` attribute to **retrieve the index names**: 

In [54]:
frame.index

Index(['m18', 'm19', 'm20', 'm21', 'f18', 'f19', 'f20', 'f21'], dtype='object')

To **change the order of the variables** you can rearrange the column names in the following way: 

*(Note: omitting a name in the square brackets will automatically drop that column)* 

In [57]:
frame = frame[['County', 'gender', 'year', 'popM', 'random']]
frame

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 |


To **delete a column**, use the `.drop()` method and specify the column axis with `axis = 1` *(notice that with `axis = 0` you can delete specific rows by passing their index names)*:

In [58]:
frame.drop(["random"], axis = 1)

|  | County | gender | year | popM |
|---|---|---|---|---|
| m18 | Italy | Males | 2018 | 29.428 |
| m19 | Italy | Males | 2019 | 29.131 |
| m20 | Italy | Males | 2020 | 29.05 |
| m21 | Italy | Males | 2021 | 28.866 |
| f18 | Italy | Females | 2018 | 31.056 |
| f19 | Italy | Females | 2019 | 30.685 |
| f20 | Italy | Females | 2020 | 30.591 |
| f21 | Italy | Females | 2021 | 30.37 |


In [59]:
# .drop() returned a new DataFrame; the original frame still has the "random" column
frame

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 |
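The row-wise variant with `axis = 0` is only mentioned in passing, so here is a minimal sketch on a hypothetical three-row DataFrame. It also shows that you need to keep the returned object if you want the change to persist:

```python
import pandas as pd

# hypothetical three-row DataFrame
demo = pd.DataFrame({"year": [2018, 2019, 2020], "popM": [29.428, 29.131, 29.050]},
                    index=["m18", "m19", "m20"])

# axis=0 (the default) drops rows by index label
fewer_rows = demo.drop(["m19"], axis=0)

# axis=1 drops columns; the result is a new object, so store it to keep the change
demo_no_year = demo.drop(["year"], axis=1)

print(fewer_rows.index.tolist())
print(demo_no_year.columns.tolist())
print(demo.shape)  # the original is unchanged
```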


You can use **boolean conditions** to keep just a slice of the dataset: 

In [60]:
frame[frame["gender"] == "Males"]

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |


And, by placing another square bracket slicer next to the first one, you can select a specific column: 

In [61]:
frame[frame["gender"] == "Males"]["popM"]

m18    29.428
m19    29.131
m20    29.050
m21    28.866
Name: popM, dtype: float64
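The chained form above is fine for reading values. A common alternative is to pass the row mask and the column label to `.loc` in one step, which is also the form pandas recommends when you want to assign to the selection. A sketch on hypothetical data:

```python
import pandas as pd

# hypothetical three-row DataFrame
demo = pd.DataFrame({"gender": ["Males", "Females", "Males"],
                     "popM": [29.428, 31.056, 29.131]},
                    index=["m18", "f18", "m19"])

mask = demo["gender"] == "Males"

# chained brackets: two separate selections, one after the other
chained = demo[mask]["popM"]

# .loc: row mask and column label in a single selection
single_step = demo.loc[mask, "popM"]

same = chained.equals(single_step)
print(same)

# a single .loc call is also the form to use when assigning to the selection
demo.loc[mask, "popM"] = 0.0
print(demo["popM"].tolist())
```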

To **sort the DataFrame by the values of a column**, use the `.sort_values()` method specifying the column name like this: 

In [62]:
frame.sort_values("popM")

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 |
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 |


To sort in descending order add the `ascending=False` parameter: 

In [63]:
frame.sort_values("popM", ascending=False)

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 |
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |


To sort by multiple columns, pass the column names as a list (and, if necessary, the `ascending` parameter):

In [64]:
frame.sort_values(["gender","year"])

|  | County | gender | year | popM | random |
|---|---|---|---|---|---|
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 |
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 |
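When sorting by several columns, `ascending` can also be a list with one flag per column. A small sketch on hypothetical data, sorting gender alphabetically and, within each gender, the most recent year first:

```python
import pandas as pd

# hypothetical four-row DataFrame
demo = pd.DataFrame({"gender": ["Males", "Females", "Males", "Females"],
                     "year": [2018, 2018, 2019, 2019]},
                    index=["m18", "f18", "m19", "f19"])

# one ascending flag per column: gender A-Z, then year newest first
ordered = demo.sort_values(["gender", "year"], ascending=[True, False])

print(ordered.index.tolist())
```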


You can compute descriptive statistics using the `.sum()`, `.mean()`, `.cumsum()` (and many more) methods:

In [65]:
print(frame["popM"].min())
print(frame["popM"].mean())
print(frame["popM"].sum())
print(frame["popM"].max())

28.866
29.897125
239.177
31.056


In [74]:
# you can compute the cumulative sum with .cumsum()
# work on an explicit copy so pandas does not warn about setting values on a slice
frame = frame.copy()
frame["ones"] = 1
frame["cumsum"] = frame["ones"].cumsum()
frame

|  | County | gender | year | popM | random | ones | cumsum |
|---|---|---|---|---|---|---|---|
| m18 | Italy | Males | 2018 | 29.428 | 0.165234 | 1 | 1 |
| m19 | Italy | Males | 2019 | 29.131 | 0.702245 | 1 | 2 |
| m20 | Italy | Males | 2020 | 29.05 | 0.395166 | 1 | 3 |
| m21 | Italy | Males | 2021 | 28.866 | 0.754924 | 1 | 4 |
| f18 | Italy | Females | 2018 | 31.056 | 0.987998 | 1 | 5 |
| f19 | Italy | Females | 2019 | 30.685 | 0.145037 | 1 | 6 |
| f20 | Italy | Females | 2020 | 30.591 | 0.476971 | 1 | 7 |
| f21 | Italy | Females | 2021 | 30.37 | 0.750043 | 1 | 8 |


The `.info()` method lets us see the data type of the values in each of our DataFrame columns:

In [75]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, m18 to f21
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   County  8 non-null      object 
 1   gender  8 non-null      object 
 2   year    8 non-null      int64  
 3   popM    8 non-null      float64
 4   random  8 non-null      float64
 5   ones    8 non-null      int64  
 6   cumsum  8 non-null      int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 812.0+ bytes


And we can use the `.shape` attribute (an attribute, not a method, so no parentheses) to see the shape of our dataframe as (rows, columns):

In [76]:
frame.shape

(8, 7)

The `describe()` method allows you to create a **table of descriptive statistics** about the numerical columns in your dataset: 

In [78]:
frame.describe()

|  | year | popM | random | ones | cumsum |
|---|---|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 |
| mean | 2019.5 | 29.897125 | 0.547202 | 1.0 | 4.5 |
| std | 1.195229 | 0.866576 | 0.301973 | 0.0 | 2.44949 |
| min | 2018.0 | 28.866 | 0.145037 | 1.0 | 1.0 |
| 25% | 2018.75 | 29.11075 | 0.337683 | 1.0 | 2.75 |
| 50% | 2019.5 | 29.899 | 0.589608 | 1.0 | 4.5 |
| 75% | 2020.25 | 30.6145 | 0.751263 | 1.0 | 6.25 |
| max | 2021.0 | 31.056 | 0.987998 | 1.0 | 8.0 |


Check out [this table](https://wesmckinney.com/book/pandas-basics.html#tbl-table_descriptive_stats) for a list of more statistical functions you can use with Series and DataFrames.

Finally, if you need to find the unique elements in a column, or if you want to count the occurrences of each category within a column, you can use the `.unique()` and `.value_counts()` methods, respectively:

In [79]:
# to enumerate the unique values in a DataFrame column: 
frame["year"].unique()

array([2018, 2019, 2020, 2021])

In [81]:
# to count the occurrences of each category in a DataFrame column: 
frame["gender"].value_counts()

Males      4
Females    4
Name: gender, dtype: int64

---

# Exercise

In the cell below, create a dataframe from the following student names and exam scores:

- Student Names: Sarah, Jack, Alice, John, David
- Pre-Test Scores: 75, 80, 777, 60, 75
- Post-Test Scores: 87, 90, 92, 74, 80


In [85]:
# given the following dictionary...
df = {"name": ["Sarah", "Jack", "Alice", "John", "David"],
      "pre": [75, 80, 777, 60, 75],
      "post": [87, 90, 92, 74, 80]}
    
df = pd.DataFrame(df)
df.head()

|  | name | pre | post |
|---|---|---|---|
| 0 | Sarah | 75 | 87 |
| 1 | Jack | 80 | 90 |
| 2 | Alice | 777 | 92 |
| 3 | John | 60 | 74 |
| 4 | David | 75 | 80 |


It looks like there was a mistake in Alice's pre-test mark. Update the dataframe to correct her mark to 77.

In [98]:
df['pre'] = df['pre'].replace([777],77)
df.head()

|  | name | pre | post |
|---|---|---|---|
| 0 | Sarah | 75 | 87 |
| 1 | Jack | 80 | 90 |
| 2 | Alice | 77 | 92 |
| 3 | John | 60 | 74 |
| 4 | David | 75 | 80 |


In the cells below, write the code that would give you the answers to the following questions:

- What was the highest pre-test score and the highest post-test score? (Can you write some code that would give back the entire row for each?)
- Add a new column called Difference that calculates the difference between pre- and post-test scores. Who made the biggest improvement?
- What was the average test score for the pre- and post-tests?
- Use a boolean mask to return the rows for students who improved their score by more than 10 marks. How many students were there?


In [104]:
df.max(axis=0, numeric_only=True)

pre     80
post    92
dtype: int64

In [106]:
df["jungle diff"] = df["post"] - df["pre"]
df

|  | name | pre | post | jungle diff |
|---|---|---|---|---|
| 0 | Sarah | 75 | 87 | 12 |
| 1 | Jack | 80 | 90 | 10 |
| 2 | Alice | 77 | 92 | 15 |
| 3 | John | 60 | 74 | 14 |
| 4 | David | 75 | 80 | 5 |


In [111]:
df[df["pre"] == df["pre"].max()]

|  | name | pre | post | jungle diff |
|---|---|---|---|---|
| 1 | Jack | 80 | 90 | 10 |


In [108]:
# numeric_only=True skips the non-numeric "name" column
df.mean(numeric_only=True)

pre            73.4
post           84.6
jungle diff    11.2
dtype: float64

In [112]:
df.describe()

|  | pre | post | jungle diff |
|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 |
| mean | 73.4 | 84.6 | 11.2 |
| std | 7.765307 | 7.46994 | 3.962323 |
| min | 60.0 | 74.0 | 5.0 |
| 25% | 75.0 | 80.0 | 10.0 |
| 50% | 75.0 | 87.0 | 12.0 |
| 75% | 77.0 | 90.0 | 14.0 |
| max | 80.0 | 92.0 | 15.0 |


In [113]:
df[df["jungle diff"] > 10]

|  | name | pre | post | jungle diff |
|---|---|---|---|---|
| 0 | Sarah | 75 | 87 | 12 |
| 2 | Alice | 77 | 92 | 15 |
| 3 | John | 60 | 74 | 14 |
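The cells above leave a few of the questions without an explicit answer (the row with the highest post-test score, who improved most, and the count of students). One possible way to get them is sketched below. It rebuilds the corrected data so that it runs on its own, and it names the new column `Difference` as the exercise asks; that name is a choice made for this sketch, not the one used in the cells above.

```python
import pandas as pd

# corrected exercise data (Alice's pre-test mark already set to 77)
df = pd.DataFrame({"name": ["Sarah", "Jack", "Alice", "John", "David"],
                   "pre": [75, 80, 77, 60, 75],
                   "post": [87, 90, 92, 74, 80]})
df["Difference"] = df["post"] - df["pre"]

# full row of the student with the highest post-test score
top_post = df[df["post"] == df["post"].max()]

# .idxmax() returns the index label of the largest value
best = df.loc[df["Difference"].idxmax(), "name"]

# True counts as 1, so summing a boolean mask counts the matching rows
n_improved = (df["Difference"] > 10).sum()

print(top_post["name"].tolist(), best, n_improved)
```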
