# Week03 Section 

## Data Wrangling with Pandas

Wenjun Sun

## What did we learn this week?

Let's go through each topic, and please raise your hand (using Zoom's raise-hand button) if you have questions about this topic.


- Series and DataFrame
- Load data 
- Exploratory Data Analysis (EDA): `info`, `head`, `shape`, `describe`, `unique`, `ProfileReport`
- Method Chaining
- Select columns and rows of a dataframe
- Apply functions to a dataframe
- `groupby` and summarize
- Index
- Reshape a dataframe
- Visualize
- Other topics

## Topics we will cover today 

I planned some topics for this section, but answering your questions takes priority.

## Load and save data 

You should already know how to read in a dataset with one of the `pd.read_*()` functions, for example `pd.read_csv()`.

tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [1]:
import pandas as pd 

country = pd.read_csv('data/country.csv')
country.head(3)  # oops: the file has no header row, so the first data row became the column names

Unnamed: 0,Armenia,Asia
0,Algeria,Africa
1,U.S.S.R/Russia,Europe
2,U.S.,North America


In [2]:
pd.read_csv('data/country.csv', names=['country', 'continent']).head(3)

Unnamed: 0,country,continent
0,Armenia,Asia
1,Algeria,Africa
2,U.S.S.R/Russia,Europe


In [3]:
pd.read_csv('data/country.csv', names=['country', 'continent'], index_col='country').head(3) # set a column as index

country,continent
Armenia,Asia
Algeria,Africa
U.S.S.R/Russia,Europe
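
The effect of `names` and `index_col` can be reproduced without the data file. The sketch below is a minimal illustration that assumes an inline string stands in for `data/country.csv`; it reuses the three rows shown above.

```python
import io
import pandas as pd

# Inline stand-in for data/country.csv (assumption: the file has no header row).
csv_text = "Armenia,Asia\nAlgeria,Africa\nU.S.S.R/Russia,Europe\n"

# `names` supplies the column labels, so the first row is kept as data.
country = pd.read_csv(
    io.StringIO(csv_text),
    names=["country", "continent"],
    index_col="country",
)

# With `country` as the index, rows can be looked up by label.
print(country.loc["Algeria", "continent"])  # Africa
```

Setting the index at load time is equivalent to calling `.set_index('country')` afterwards; which one reads better is a matter of taste.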


Similarly, we can save a DataFrame as a CSV file with `.to_csv()`.

In [4]:
country = pd.read_csv('data/country.csv', names=['country', 'continent']).sample(5) # take a sample
country

Unnamed: 0,country,continent
14,India,Asia
39,Denmark,Europe
20,Afghanistan,Asia
33,Israel,Asia
7,Czechoslovakia,Europe


In [5]:
country.to_csv('data/country_sample.csv')
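
By default `.to_csv()` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference that `index=False` makes. It uses a small hand-built DataFrame rather than the real sample, and writes to a string instead of a file so it can be run anywhere.

```python
import io
import pandas as pd

# Small stand-in for the sampled DataFrame (values taken from the sample above).
country = pd.DataFrame(
    {"country": ["India", "Denmark"], "continent": ["Asia", "Europe"]},
    index=[14, 39],
)

# Calling to_csv() without a path returns the CSV text as a string.
with_index = country.to_csv()                # index written as an unnamed first column
without_index = country.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])     # ,country,continent
print(without_index.splitlines()[0])  # country,continent

# Reading the first version back shows where "Unnamed: 0" comes from.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'country', 'continent']
```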

## `.select_dtypes()`

Return a subset of the DataFrame’s columns based on the column dtypes.

tutorial: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html?highlight=select_dtypes

In [6]:
astronauts = pd.read_csv('data/astronauts.csv')
astronauts.head(3) 

Unnamed: 0,name,gender,birth,nationality,military_civilian,mission_number,occupation,year_of_mission,mission_hours,mission_title
0,"Gagarin, Yuri",male,1934,U.S.S.R/Russia,military,1,pilot,1961,1.77,Vostok 1
1,"Titov, Gherman",male,1935,U.S.S.R/Russia,military,1,pilot,1961,25.0,Vostok 2
2,"Glenn, John H., Jr.",male,1921,U.S.,military,1,pilot,1962,5.0,MA-6


In [7]:
astronauts.select_dtypes('number').head(3)

Unnamed: 0,birth,mission_number,year_of_mission,mission_hours
0,1934,1,1961,1.77
1,1935,1,1961,25.0
2,1921,1,1962,5.0


In [8]:
astronauts.select_dtypes(exclude='number').head(3)

Unnamed: 0,name,gender,nationality,military_civilian,occupation,mission_title
0,"Gagarin, Yuri",male,U.S.S.R/Russia,military,pilot,Vostok 1
1,"Titov, Gherman",male,U.S.S.R/Russia,military,pilot,Vostok 2
2,"Glenn, John H., Jr.",male,U.S.,military,pilot,MA-6
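
`include` and `exclude` also accept lists of dtypes. The sketch below uses a small made-up DataFrame (the `is_military` column is hypothetical and not in the astronauts data) to show that `'number'` does not cover boolean columns, which have to be requested separately.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Gagarin, Yuri", "Titov, Gherman"],
    "birth": [1934, 1935],
    "mission_hours": [1.77, 25.0],
    "is_military": [True, True],  # hypothetical column, for illustration only
})

# 'number' matches integer and float columns, but not booleans.
print(df.select_dtypes(include="number").columns.tolist())
# ['birth', 'mission_hours']

# A list selects several dtype families at once.
print(df.select_dtypes(include=["number", "bool"]).columns.tolist())
# ['birth', 'mission_hours', 'is_military']

# `exclude` keeps everything else.
print(df.select_dtypes(exclude=["number", "bool"]).columns.tolist())
# ['name']
```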


## Track cumulative statistics over rows

`cumsum()`, `cummax()`, `cummin()`, `cumprod()`

tutorial: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html?highlight=cumsum#pandas.DataFrame.cumsum

In [9]:
s = pd.Series([2, 3, 5, -1, 0])
s

0    2
1    3
2    5
3   -1
4    0
dtype: int64

In [10]:
s.cumsum()

0     2
1     5
2    10
3     9
4     9
dtype: int64

In [11]:
s.cumprod()

0     2
1     6
2    30
3   -30
4     0
dtype: int64
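
`cummax()` and `cummin()` work the same way: each position holds the largest (or smallest) value seen so far. A short sketch on the same Series:

```python
import pandas as pd

s = pd.Series([2, 3, 5, -1, 0])

# Running maximum: stays at 5 once 5 has been seen.
print(s.cummax().tolist())  # [2, 3, 5, 5, 5]

# Running minimum: drops to -1 at position 3 and stays there.
print(s.cummin().tolist())  # [2, 2, 2, -1, -1]
```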

`cumcount()` is only available on grouped data (the result of `groupby`). It numbers the rows within each group, starting at 0.

In [12]:
df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
                  columns=['col'])
df

Unnamed: 0,col
0,a
1,a
2,a
3,b
4,b
5,a


In [13]:
df.groupby('col').cumcount()

0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
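
The other cumulative methods can be applied per group as well. The sketch below adds a made-up `value` column (an assumption for illustration) and computes a running total within each group next to the `cumcount()` numbering.

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["a", "a", "a", "b", "b", "a"],
    "value": [1, 2, 3, 10, 20, 4],  # made-up numbers for illustration
})

# Position of each row within its group (0-based).
df["nth"] = df.groupby("col").cumcount()

# Running total of `value`, restarted for each group.
df["running_total"] = df.groupby("col")["value"].cumsum()

print(df)
```

The last row belongs to group `a`, so its running total continues from the earlier `a` rows (1 + 2 + 3 + 4 = 10) even though the `b` rows sit in between.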

## `to_markdown()`

Print DataFrame in Markdown-friendly format

tutorial: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html?highlight=markdown#pandas.DataFrame.to_markdown

In [14]:
astronauts.head(5)

Unnamed: 0,name,gender,birth,nationality,military_civilian,mission_number,occupation,year_of_mission,mission_hours,mission_title
0,"Gagarin, Yuri",male,1934,U.S.S.R/Russia,military,1,pilot,1961,1.77,Vostok 1
1,"Titov, Gherman",male,1935,U.S.S.R/Russia,military,1,pilot,1961,25.0,Vostok 2
2,"Glenn, John H., Jr.",male,1921,U.S.,military,1,pilot,1962,5.0,MA-6
3,"Glenn, John H., Jr.",male,1921,U.S.,military,2,pilot,1998,213.0,STS-95
4,"Carpenter, M. Scott",male,1925,U.S.,military,1,pilot,1962,5.0,Mercury-Atlas 7


In [15]:
print(astronauts.head(5).to_markdown())

|    | name                | gender   |   birth | nationality    | military_civilian   |   mission_number | occupation   |   year_of_mission |   mission_hours | mission_title   |
|---:|:--------------------|:---------|--------:|:---------------|:--------------------|-----------------:|:-------------|------------------:|----------------:|:----------------|
|  0 | Gagarin, Yuri       | male     |    1934 | U.S.S.R/Russia | military            |                1 | pilot        |              1961 |            1.77 | Vostok 1        |
|  1 | Titov, Gherman      | male     |    1935 | U.S.S.R/Russia | military            |                1 | pilot        |              1961 |           25    | Vostok 2        |
|  2 | Glenn, John H., Jr. | male     |    1921 | U.S.           | military            |                1 | pilot        |              1962 |            5    | MA-6            |
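|  3 | Glenn, John H., Jr. | male     |    1921 | U.S.           | military            |                2 | pilot        |              1998 |          213    | STS-95          |
|  4 | Carpenter, M. Scott | male     |    1925 | U.S.           | military            |                1 | pilot        |              1962 |            5    | Mercury-Atlas 7 |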

This is a markdown table:

|    | name                | gender   |   birth | nationality    | military_civilian   |   mission_number | occupation   |   year_of_mission |   mission_hours | mission_title   |
|---:|:--------------------|:---------|--------:|:---------------|:--------------------|-----------------:|:-------------|------------------:|----------------:|:----------------|
|  0 | Gagarin, Yuri       | male     |    1934 | U.S.S.R/Russia | military            |                1 | pilot        |              1961 |            1.77 | Vostok 1        |
|  1 | Titov, Gherman      | male     |    1935 | U.S.S.R/Russia | military            |                1 | pilot        |              1961 |           25    | Vostok 2        |
|  2 | Glenn, John H., Jr. | male     |    1921 | U.S.           | military            |                1 | pilot        |              1962 |            5    | MA-6            |
|  3 | Glenn, John H., Jr. | male     |    1921 | U.S.           | military            |                2 | pilot        |              1998 |          213    | STS-95          |
|  4 | Carpenter, M. Scott | male     |    1925 | U.S.           | military            |                1 | pilot        |              1962 |            5    | Mercury-Atlas 7 |
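
`to_markdown()` relies on the optional `tabulate` package, so it raises an `ImportError` when that package is missing. The sketch below guards against that case and also passes `index=False` to drop the index column. It uses a two-row stand-in for the astronauts data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Gagarin, Yuri", "Titov, Gherman"],
    "mission_hours": [1.77, 25.0],
})

try:
    # index=False leaves out the leading index column.
    table = df.to_markdown(index=False)
except ImportError:
    # pandas delegates the formatting to `tabulate`.
    table = None
    print("Install the `tabulate` package to use to_markdown().")
else:
    print(table)
```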

## Show a progress bar

`tqdm` is a useful package that displays a progress bar for long-running operations, along with an estimate of when they will finish.

tutorial: https://github.com/tqdm/tqdm

In [16]:
import numpy as np
from tqdm import tqdm  # install the tqdm package first if necessary

In [17]:
df = pd.DataFrame(np.random.randint(0, 100, (100000, 1000)))

# register `progress_apply` on pandas objects
tqdm.pandas()

# Now you can use `progress_apply` instead of `apply`
df.progress_apply(lambda x: x**2)

100%|██████████| 1000/1000 [00:02<00:00, 448.48it/s]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,3600,484,2809,2601,25,8649,576,8836,2116,1156,...,9409,4761,5041,49,1225,1296,3600,1764,64,400
1,2209,1296,3721,3364,5476,729,5929,8649,1521,8649,...,3364,196,9801,5776,1225,1849,784,1089,81,1936
2,3481,8100,1,1444,7225,784,5184,6724,361,576,...,7921,625,5329,841,5625,4356,5041,2025,2601,529
3,196,2209,6084,4624,64,3249,324,169,7569,400,...,144,3249,1521,3136,729,4225,4761,1849,7396,2304
4,2500,2025,3969,9801,3844,100,2809,1,4,2116,...,625,225,5476,25,225,576,144,5476,4761,4624
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1936,2500,1600,144,100,6724,121,2401,1681,25,...,4356,5476,5776,2916,169,1089,7744,576,6561,8281
99996,3969,8649,7921,8100,5625,2601,1600,225,4356,6561,...,4489,676,6561,144,81,1681,441,8649,7921,3969
99997,1225,8100,9,3721,529,2601,1849,5184,3364,625,...,4489,3481,1444,5776,2304,1444,2116,3969,100,8464
99998,5329,1681,225,2601,25,1024,400,1296,1681,3600,...,5476,7056,121,25,9801,6724,1089,16,100,900
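
`tqdm` is not tied to pandas: it can wrap any iterable. The sketch below is a minimal illustration on a tiny made-up DataFrame. It loops over the column names with a progress bar and records the largest squared value per column; the `desc` label is arbitrary.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm

# Tiny stand-in DataFrame: 4 rows, 3 columns, values 0..11.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

col_max = {}
# Wrapping the iterable in tqdm() prints a progress bar as the loop advances.
for name in tqdm(df.columns, desc="columns"):
    col_max[name] = int((df[name] ** 2).max())

print(col_max)  # {'a': 81, 'b': 100, 'c': 121}
```

For work that is already vectorized, such as `df ** 2`, a progress bar adds overhead without much benefit; it is most helpful around slow Python-level loops or `apply` calls.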


## Other topics I find useful

- [MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
- [How to handle time series data with ease?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html)

## Feedback is a gift! 

![](where_feedback.png)

![](form_feedback.png)