# Pandas Dataframes

Lesson Time: 75 minutes

## Agenda

1. About Dataframes   

2. Create Dataframes
    
3. View Dataframes

4. Summarize Dataframes   

5. Attributes of Dataframes

6. Work with the Data in Dataframes
    
7. Circling back to 'About Dataframes': Series vs. Dataframes

## Lesson Goals

Upon completion of this lesson and exercises, you should be able to:

- Describe what a dataframe is and how it differs from a series. 

- Identify when you are using a pandas method or a pandas function. `pd.function()`, `df.method()`

- Create a dataframe from a dictionary, list of lists, n-dimensional array. `pd.DataFrame()`

- Identify an object type. `type(object_name)`

- 'Peek' at the contents of a dataframe `print`, `df.head`, `df.tail`, `df.sample`

- Summarize information contained in a dataframe. `df.info`, `df.describe`

- Access attributes (datatypes, number of rows & columns, row labels, column names) of a dataframe `df.dtypes`, `df.shape`, `df.index`, `df.columns`

- Subset a dataframe by selecting or dropping columns `df[['col1', 'col2']]`, `df.drop(columns=['col1'])`

- Understand the difference between the single and double bracket. `df[col1]` => Series, `df[[col1]]` => DataFrame

- Subset a dataframe by filtering rows using a conditional. `df[df.col1 < 30]`

- Filter rows and subset columns in one step. `df[df.col1 < 30][['col1', 'col2']]`

- Rename columns `df.columns = [newname1, newname2]`

- Create a new column using an existing column. `df['newcol'] = df['col1'] + df['col2']`, `df.assign(newcol=df['col1'] + df['col2'])`

- Sort a dataframe by one or more columns. `df.sort_values(by = 'col1', ascending = False)`

- Chain dataframe methods together, understand when it should work, & troubleshoot when it doesn't. `df.sort_values(by='col1').head(8).col2`


## About Dataframes

- tabular  
- 2-dimensional   
- provide a number of facilities for manipulating and transforming the data   

**Pandas Help: Functions & Methods**

You will use the pandas documentation often. pandas.pydata.org/pandas-docs/stable/reference

- Documentation on Pandas Functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html). 

- Documentation on Pandas DataFrame Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). 

- Documentation on Pandas Series Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html). 

You may be asking, "What's the difference between a method and a function?"

Boyini (2019) describes a function as *a block of code to carry out a specific task, will contain its own scope and is called by name. All functions may contain zero(no) arguments or more than one arguments. On exit, a function can or can not return one or more values.* 

He then goes on to describe a method as *a function which belongs to an object.* 

Methods are *called on* objects, so Pandas DataFrame Methods will be called on dataframe objects, and Pandas Series Methods will be called on series objects. That means that when calling a method, you precede it with the name of your dataframe or series, such as `my_df.info()` or `my_series.info()`. A Pandas Function will begin with `pd`, such as `pd.concat()`. 
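As a minimal sketch of that distinction, the snippet below calls one pandas function and two methods. The variable names are only for illustration, and the values are a few rows taken from the standings table used later in this lesson.

```python
import pandas as pd

# Two small dataframes built from a few rows of the lesson's standings table
first_half = pd.DataFrame({'team': ['LAC', 'LAL'], 'w': [40, 35]})
second_half = pd.DataFrame({'team': ['UTA', 'PHX'], 'w': [45, 41]})

# pandas function: called from the pandas namespace (pd); dataframes are passed in as arguments
combined = pd.concat([first_half, second_half], ignore_index=True)

# dataframe method: called on the dataframe object itself
top_two = combined.head(2)

# series method: called on a series (here, a single column)
most_wins = combined['w'].max()
```

Reading the start of the call is usually enough: `pd.` means a function, while a dataframe or series name means a method.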



In [125]:
import pandas as pd

## Create Dataframes

1. We can pass a dictionary to create a dataframe, where the keys correspond to the names of the columns, and the values associated with those keys will make up the data.  

2. We can also pass lists or arrays to create a dataframe, where each list or array represents a row in the dataframe. 

3. We can also create dataframes by reading data from an existing structured data set, such as a csv, a sql table, or an excel file. 

For this lesson, we will create a dataframe using the existing records for NBA Western Conference using the first 2 methods above.


| Team| W  | L |
| ----| -- | --|
| LAC | 40 | 19 | 
| LAL | 35 | 23 | 
| UTA | 45 | 15 | 
| PHX |  41 | 16 | 
| GSW |  29 | 29 | 
| POR |  32 | 24 | 
| MEM |  29 | 27 | 
| SAS |  28 | 28 | 
| DAL |  30 | 26 | 
| DEN |  37 | 20 | 
| OKC |  20 | 38 | 
| NOR |  25 | 32 | 
| SAC |  23 | 34| 
| HOU |  15 | 43| 
| MIN |  15 | 43| 

**Method 1**  

Pass a dictionary where keys => column names, values => column values. 

`df = pd.DataFrame({'col1_name': [values], 'col2_name': [values], 'col3_name': [values]})`

In [126]:
Team = ['LAC', 'LAL', 'UTA', 'PHX', 'GSW', 'POR', 'MEM', 'SAS', 'DAL', 'DEN', 'OKC', 'NOR', 'SAC', 'HOU', 'MIN']
W = [40, 35, 45, 41, 29, 32, 29, 28, 30, 37, 20, 25, 23, 15, 15]
L = [19, 23, 15, 16, 29, 24, 27, 28, 26, 20, 38, 32, 34, 43, 43]

In [127]:
# create dataframe from dictionary with lists 
nba_df = pd.DataFrame({'Team': Team, 'W': W, 'L': L})

In [128]:
type(nba_df)

pandas.core.frame.DataFrame

In [129]:
nba_df.head(2)

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23


**Method 2**

Pass a list of lists where each nested list is a row in the dataframe. 

`df = pd.DataFrame([row1, row2, row3, row4, ...], columns=[col1_name, col2_name, col3_name])`

In [130]:
# each list represents the team vector, or a row. 

lac = ['LAC', 40, 19]
lal = ['LAL', 35, 23]
uta = ['UTA', 45, 15]
phx = ['PHX', 41, 16] 
gsw = ['GSW', 29, 29] 
por = ['POR', 32, 24] 
mem = ['MEM', 29, 27]
sas = ['SAS', 28, 28]
dal = ['DAL', 30, 26]
den = ['DEN', 37, 20] 
okc = ['OKC', 20, 38]
nor = ['NOR', 25, 32]
sac = ['SAC', 23, 34]
hou = ['HOU', 15, 43]
minn = ['MIN',15, 43]

In [131]:
# create dataframe from list of lists that represent rows 
import pandas as pd
nba_df = pd.DataFrame([lac, lal, uta, phx, gsw, por, mem, sas, dal, den, okc, nor, sac, hou, minn], 
                  columns = ['Team', 'W', 'L'])

In [132]:
type(nba_df)

pandas.core.frame.DataFrame

In [133]:
nba_df.head(2)

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23
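The same row-wise approach should also work with a 2-dimensional NumPy array in place of the list of lists. The sketch below assumes a purely numeric array (wins and losses for the first three teams) and adds the team labels afterwards; the names `records` and `array_df` are made up for this example.

```python
import numpy as np
import pandas as pd

# 2-D array: each inner list is one row of (wins, losses) for LAC, LAL, UTA
records = np.array([[40, 19], [35, 23], [45, 15]])

# each row of the array becomes a row of the dataframe
array_df = pd.DataFrame(records, columns=['W', 'L'])

# add the team labels afterwards so the numeric columns keep an integer dtype
array_df.insert(0, 'Team', ['LAC', 'LAL', 'UTA'])
```

Mixing strings and numbers in a single NumPy array would turn every value into a string, which is why the labels are added as a separate step here.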


**Method 3**

Read a structured dataset into a dataframe

From a csv: 

`my_df = pd.read_csv('file_name.csv')`
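To try `pd.read_csv` without a file on disk, one option is to wrap CSV text in `io.StringIO`, which behaves like an open file. This is only a sketch: `csv_text` stands in for the contents of a hypothetical `file_name.csv`.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file on disk
csv_text = """Team,W,L
LAC,40,19
LAL,35,23
UTA,45,15
"""

# read_csv accepts a file path or a file-like object; StringIO plays the file here
csv_df = pd.read_csv(io.StringIO(csv_text))
```

The first line of the text is used for the column names, and each following line becomes a row.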

From a built-in dataset from pydataset:

In [134]:
from pydataset import data
color_df = data('HairEyeColor')

In [135]:
type(color_df)

pandas.core.frame.DataFrame

In [136]:
color_df.head(2)

Unnamed: 0,Hair,Eye,Sex,Freq
1,Black,Brown,Male,32
2,Brown,Brown,Male,53


### Object Types

Guesses for the type of objects we just created?

`type(df)`

In [137]:
type(nba_df)

pandas.core.frame.DataFrame

### Naming Conventions for Dataframes

It is common for dataframes to contain `df` in their variable names, as we have here with `nba_df`. In many examples, you will see dataframes simply named `df`. In practice, especially when you are working with multiple dataframes, it is good to choose a name that describes what the dataframe contains or how it differs from the other dataframes in your environment, i.e. the other dataframes you have created in your notebook or current Python session or kernel. For example, the two dataframes created with Methods 1 and 2 above contain exactly the same data, so both are assigned to `nba_df`; if you needed to keep both around, you would give each a name that identifies how it was built, even if that makes the names a bit long. 

## View Dataframes

What's in these dataframes we just created? 

1. `nba_df` (without `print()`) gives a nice pretty display
2. `nba_df` (without `print()`) will not work outside of jupyter or ipython.   
3. `print(nba_df)` does not have an `Out[#]`, while `nba_df` does.    
4. `nba_df` (without `print()`) will not display anything if another statement follows it in the same cell; only the value of the last expression in a cell is displayed.  

In [138]:
nba_df

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
4,GSW,29,29
5,POR,32,24
6,MEM,29,27
7,SAS,28,28
8,DAL,30,26
9,DEN,37,20
10,OKC,20,38
11,NOR,25,32
12,SAC,23,34
13,HOU,15,43
14,MIN,15,43


In [139]:
print(nba_df)
nba_df

   Team   W   L
0   LAC  40  19
1   LAL  35  23
2   UTA  45  15
3   PHX  41  16
4   GSW  29  29
5   POR  32  24
6   MEM  29  27
7   SAS  28  28
8   DAL  30  26
9   DEN  37  20
10  OKC  20  38
11  NOR  25  32
12  SAC  23  34
13  HOU  15  43
14  MIN  15  43


Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
4,GSW,29,29
5,POR,32,24
6,MEM,29,27
7,SAS,28,28
8,DAL,30,26
9,DEN,37,20
10,OKC,20,38
11,NOR,25,32
12,SAC,23,34
13,HOU,15,43
14,MIN,15,43


We can also take a peek at the data in the dataframe by sampling: the first 5 rows, last 5 rows, a random sample, or anything in between. 

- `df.head()`: first n rows (default n = 5)  
- `df.tail()`: last n rows (default n = 5)  
- `df.sample(n, random_state=int)`: sample n rows
- `df.sample(frac=frac, random_state=int)`: sample frac (proportion) of rows; `frac` must be passed by keyword   
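The `random_state` argument is what makes a sample repeatable. A small sketch, using a made-up six-row dataframe named `teams`:

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['LAC', 'LAL', 'UTA', 'PHX', 'GSW', 'POR'],
                      'W': [40, 35, 45, 41, 29, 32]})

# same random_state -> the same rows are drawn each time
sample_a = teams.sample(3, random_state=123)
sample_b = teams.sample(3, random_state=123)
same_rows = sample_a.equals(sample_b)

# frac is passed by keyword; the first positional argument is n
half = teams.sample(frac=0.5, random_state=123)
```

Leaving `random_state` out gives a different sample on each run, which is fine for a quick look but makes results harder to reproduce.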

In [140]:
nba_df.sample(5, random_state=123)

Unnamed: 0,Team,W,L
7,SAS,28,28
10,OKC,20,38
4,GSW,29,29
0,LAC,40,19
5,POR,32,24


In [141]:
nba_df.sample(5, random_state=235)

Unnamed: 0,Team,W,L
0,LAC,40,19
13,HOU,15,43
9,DEN,37,20
2,UTA,45,15
4,GSW,29,29


In [142]:
nba_df.sample(frac=.25, random_state=123)

Unnamed: 0,Team,W,L
7,SAS,28,28
10,OKC,20,38
4,GSW,29,29
0,LAC,40,19


In [143]:
nba_df.head(8)

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
4,GSW,29,29
5,POR,32,24
6,MEM,29,27
7,SAS,28,28


In [144]:
nba_df.tail(3)

Unnamed: 0,Team,W,L
12,SAC,23,34
13,HOU,15,43
14,MIN,15,43


## Summarize Dataframes

- `df.info()`:  See total number of rows, column names, number of non-null values for each column, datatype of each column, size of the dataframe (memory usage) 

- `df.describe()`: Summary statistics of all the columns with numeric datatypes. 


In [145]:
# get the object type, index (row) range, column names, number of non-null values, datatypes, and size. 
nba_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Team    15 non-null     object
 1   W       15 non-null     int64 
 2   L       15 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 488.0+ bytes


In [146]:
# get summary stats of the numeric columns
nba_df.describe()

Unnamed: 0,W,L
count,15.0,15.0
mean,29.6,27.8
std,9.069572,8.889802
min,15.0,15.0
25%,24.0,21.5
50%,29.0,27.0
75%,36.0,33.0
max,45.0,43.0


## Dataframe Attributes

- `df.dtypes`: datatype of each column
- `df.shape`: tuple of number of rows & columns in the dataframe
- `df.index`: the labels for each row (usually autogenerated int)
- `df.columns`: the column names; you can also assign new values to this attribute. 

You will notice that when calling these attributes of dataframes, they are not followed by `()`. When you forget, you will just get a nice error to remind you :)
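A quick sketch of what that reminder looks like in practice, using a tiny illustrative dataframe. Calling `.shape` with parentheses tries to call the tuple it returns, which raises a `TypeError`.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['LAC', 'LAL'], 'W': [40, 35], 'L': [19, 23]})

# attribute: no parentheses
n_rows, n_cols = teams.shape

# adding parentheses tries to call the tuple that .shape returns
try:
    teams.shape()
    error_message = None
except TypeError as err:
    error_message = str(err)
```

The message stored in `error_message` says that a tuple object is not callable, which is the hint that the parentheses should be removed.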

In [147]:
# get the datatypes of each column
nba_df.dtypes

Team    object
W        int64
L        int64
dtype: object

In [148]:
# what type of object is returned when we call dtypes?
type(nba_df.dtypes)

pandas.core.series.Series

In [149]:
# see number of rows and columns, or the shape, of the df
nba_df.shape

(15, 3)

In [150]:
# what type of object does 'shape' return? 
type(nba_df.shape)

tuple

In [151]:
# how can I get the number of rows? 
nba_df.shape[0]

15

In [152]:
# what are the row labels, or index values, of my dataframe? 
nba_df.index

RangeIndex(start=0, stop=15, step=1)

In [153]:
# What type of object does .index return? 
type(nba_df.index)

pandas.core.indexes.range.RangeIndex

In [154]:
# what are the column names of my dataframe? 
nba_df.columns

Index(['Team', 'W', 'L'], dtype='object')

In [155]:
# what type of object is returned? 
type(nba_df.columns)

pandas.core.indexes.base.Index

What do you notice? 

Column names are also an index, like rows. Axis 0 = Rows, Axis 1 = Columns
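Many pandas methods take an `axis` argument that refers to these two axes. The sketch below uses `sum` as one example; the dataframe `records` and its team-name index are assumptions made only for this illustration.

```python
import pandas as pd

records = pd.DataFrame({'W': [40, 35, 45], 'L': [19, 23, 15]},
                       index=['LAC', 'LAL', 'UTA'])

# axis=0 works down the rows: one result per column
column_totals = records.sum(axis=0)

# axis=1 works across the columns: one result per row
games_played = records.sum(axis=1)
```

`column_totals` is labeled by the column index (`W`, `L`), while `games_played` is labeled by the row index (the team names).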

In [156]:
# rename the columns using the .columns attribute and the .upper and .lower methods. 
nba_df.columns = ['TEAM', 'W', 'L']

In [160]:
nba_df.columns = nba_df.columns.str.upper()
nba_df.columns

Index(['TEAM', 'W', 'L'], dtype='object')

In [161]:
nba_df.columns = nba_df.columns.str.lower()
nba_df.columns

Index(['team', 'w', 'l'], dtype='object')

## Subset/Filter Dataframes

### Columns

Return a dataframe

- `df[[col1, col2]]`  
- `df[[col1]]`  
- `mycols = [col1, col2]` -> `df[mycols]`  

Return a series
- `df[col1]`  
- `df.col1`  


A dataframe will have a column index while series will not. A series will instead have a name for the series that was the original column name you selected. 
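One way to check which kind of object you got back is to look at its type. A minimal sketch with a two-row illustrative dataframe:

```python
import pandas as pd

teams = pd.DataFrame({'team': ['LAC', 'LAL'], 'w': [40, 35]})

single = teams['team']      # single bracket -> Series
double = teams[['team']]    # double bracket -> DataFrame with one column

single_type = type(single).__name__   # 'Series'
double_type = type(double).__name__   # 'DataFrame'

series_name = single.name             # the original column name
frame_columns = list(double.columns)  # the dataframe keeps a column index
```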

In [162]:
# return a multi-column dataframe
nba_df[['team', 'w']].head(2)

Unnamed: 0,team,w
0,LAC,40
1,LAL,35


In [163]:
# return a single column dataframe
nba_df[['team']].head(2)

Unnamed: 0,team
0,LAC
1,LAL


In [164]:
# return a series using single bracket
nba_df['team'].head(2)

0    LAC
1    LAL
Name: team, dtype: object

In [165]:
# return a series using '.': df.colname
nba_df.team.head(2)

0    LAC
1    LAL
Name: team, dtype: object

In [166]:
nba_df[['team','w']].w.head(2)

0    40
1    35
Name: w, dtype: int64

### Rows

We can subset a dataframe by filtering rows using a conditional. 

For example, `df[df.col1 < x]` will return all columns and all rows where col1 value is less than x. 
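It can help to look at the conditional on its own first. In this sketch (with a made-up four-row dataframe), the comparison produces a boolean series, and passing that series inside the brackets keeps only the rows marked `True`.

```python
import pandas as pd

teams = pd.DataFrame({'team': ['LAC', 'GSW', 'UTA', 'OKC'],
                      'w': [40, 29, 45, 20]})

# the comparison is evaluated row by row and returns a boolean Series
mask = teams.w > 30

# rows where the mask is True are kept; their original index labels are preserved
winners = teams[mask]
```

Note that the filtered dataframe keeps the original row labels (0 and 2 here) rather than renumbering them.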

In [167]:
nba_df_winners = nba_df[nba_df.w > 30]
nba_df_winners

Unnamed: 0,team,w,l
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
5,POR,32,24
9,DEN,37,20


### Subset Columns and Filter Rows

- `df[df.col1 < x].col2`: column 2 and rows where col1 value is less than x. What kind of object is returned?     
- `df[df.col1 < x][[col1, col2]]`: columns 1 & 2 and rows where col1 value is less than x. What kind of object is returned?   
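The sketch below checks the object types for both patterns. It also shows `.loc`, an alternative not otherwise covered in this lesson, which takes the row condition and the column list in a single step. The dataframe is illustrative.

```python
import pandas as pd

teams = pd.DataFrame({'team': ['LAC', 'GSW', 'UTA'],
                      'w': [40, 29, 45],
                      'l': [19, 29, 15]})

# filter rows, then select one column -> Series
one_col = teams[teams.w > 30].team

# filter rows, then select a list of columns -> DataFrame
chained = teams[teams.w > 30][['team', 'l']]

# alternative: .loc[row_condition, column_list] does both at once
with_loc = teams.loc[teams.w > 30, ['team', 'l']]
same_result = chained.equals(with_loc)
```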

In [168]:
nba_df[nba_df.w > 30].team
nba_df[nba_df.w > 30]['team']

0    LAC
1    LAL
2    UTA
3    PHX
5    POR
9    DEN
Name: team, dtype: object

In [169]:
nba_df[nba_df.w > 30][['team','l']]

Unnamed: 0,team,l
0,LAC,19
1,LAL,23
2,UTA,15
3,PHX,16
5,POR,24
9,DEN,20


## Drop, Rename, Add Columns

**Drop**

`df.drop(columns=[])`

In the drop and rename methods (and many others in pandas), the original dataframe is not changed, but instead a new dataframe is produced. However, you can use the `inplace` argument to change the original dataframe. 
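A short sketch of that behavior, using a small made-up dataframe: calling `drop` without `inplace` leaves the original untouched, so the result has to be assigned to a name to be kept.

```python
import pandas as pd

teams = pd.DataFrame({'team': ['LAC', 'LAL'], 'l': [19, 23], 'l2': [19, 23]})

# drop returns a new dataframe; teams itself is not modified
dropped = teams.drop(columns=['l2'])

original_cols = list(teams.columns)   # still includes 'l2'
dropped_cols = list(dropped.columns)  # 'l2' is gone
```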


In [170]:
# add a column
nba_df['l2'] = nba_df['l']

In [171]:
nba_df.head(2)

Unnamed: 0,team,w,l,l2
0,LAC,40,19,19
1,LAL,35,23,23


In [172]:
# drop columns 
nba_df = nba_df.drop(columns=['l2'])
nba_df.head(2)

Unnamed: 0,team,w,l
0,LAC,40,19
1,LAL,35,23


In [175]:
# re-add the column
nba_df['l2'] = nba_df['l']

# using 'inplace'
nba_df.drop(columns=['l2'], inplace=True)

In [176]:
nba_df.head(2)

Unnamed: 0,team,w,l
0,LAC,40,19
1,LAL,35,23


**Rename**

`df.columns = [col1_new, col2_new, col3_new]`

In [177]:
# rename columns using .columns
nba_df.columns = ['name', 'wins','losses']
nba_df.columns

Index(['name', 'wins', 'losses'], dtype='object')

A benefit of the rename method is that you don't have to list all of the column names, only the ones you are renaming. 

`df.rename(columns={'original_name': 'new_name'})`


In [178]:
# rename columns using .rename with dictionary
nba_df.rename(columns={'name': 'team'}, inplace=True)
nba_df.columns

Index(['team', 'wins', 'losses'], dtype='object')

**Add**

`df['new_col'] = df['col1'] - df['col2']`

In [179]:
# create new columns: win_pct
nba_df['win_pct'] = nba_df['wins']/(nba_df['wins']+nba_df['losses'])*100

In [180]:
nba_df.head(2)

Unnamed: 0,team,wins,losses,win_pct
0,LAC,40,19,67.79661
1,LAL,35,23,60.344828


`df.assign(new_col=df['col1'] - df['col2'])`

In [181]:
# create new column using assign
# the new column name is passed as a keyword argument, not as a quoted string
# assign returns a new dataframe, so reassign it to keep the result
nba_df = nba_df.assign(win_pct=nba_df['wins']/(nba_df['wins']+nba_df['losses'])*100)
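As a sketch of how `assign` behaves: the new column name is given as a keyword argument (no quotes), and the method returns a new dataframe rather than changing the original. The dataframe `records` below is illustrative.

```python
import pandas as pd

records = pd.DataFrame({'team': ['LAC', 'LAL', 'UTA'],
                        'wins': [40, 35, 45],
                        'losses': [19, 23, 15]})

# keyword argument names the new column; records itself is left unchanged
with_pct = records.assign(
    win_pct=records['wins'] / (records['wins'] + records['losses']) * 100
)

# a callable receives the dataframe, which is handy in the middle of a method chain
with_games = records.assign(games=lambda d: d['wins'] + d['losses'])
```

Because the column name is a keyword, it must be a valid Python identifier; a name containing spaces would need the bracket form `df['new col'] = ...` instead.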

## Sort Dataframes

- `df.sort_values(by='col1', ascending=False)`: default is True, so `ascending` argument is not necessary if sorting in ascending order. 
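To sort by more than one column, `by` can take a list of column names and `ascending` a matching list of booleans. In this sketch (with a made-up five-row dataframe), ties on `wins` are broken alphabetically by `team`.

```python
import pandas as pd

records = pd.DataFrame({'team': ['GSW', 'SAS', 'MEM', 'HOU', 'MIN'],
                        'wins': [29, 28, 29, 15, 15]})

# wins from high to low; within equal wins, team names from A to Z
by_wins_then_team = records.sort_values(by=['wins', 'team'],
                                        ascending=[False, True])
sorted_teams = list(by_wins_then_team['team'])
```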

In [182]:
# sort by win_pct using sort_values
nba_df.sort_values(by='win_pct', ascending=False)

Unnamed: 0,team,wins,losses,win_pct
2,UTA,45,15,75.0
3,PHX,41,16,71.929825
0,LAC,40,19,67.79661
9,DEN,37,20,64.912281
1,LAL,35,23,60.344828
5,POR,32,24,57.142857
8,DAL,30,26,53.571429
6,MEM,29,27,51.785714
4,GSW,29,29,50.0
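7,SAS,28,28,50.0
11,NOR,25,32,43.859649
12,SAC,23,34,40.350877
10,OKC,20,38,34.482759
13,HOU,15,43,25.862069
14,MIN,15,43,25.862069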


In [183]:
# save sorted dataframe and select top/bottom
nba_df_sorted = nba_df.sort_values(by='win_pct', ascending=False)
nba_df_sorted.head(5)

Unnamed: 0,team,wins,losses,win_pct
2,UTA,45,15,75.0
3,PHX,41,16,71.929825
0,LAC,40,19,67.79661
9,DEN,37,20,64.912281
1,LAL,35,23,60.344828


In [184]:
nba_df_sorted.tail(5)

Unnamed: 0,team,wins,losses,win_pct
11,NOR,25,32,43.859649
12,SAC,23,34,40.350877
10,OKC,20,38,34.482759
13,HOU,15,43,25.862069
14,MIN,15,43,25.862069


## Chain Dataframe Methods

As long as each method is returning a dataframe, these can be chained together to quickly and easily create the dataframe you need. 

Challenge: find the teams that would be in the playoffs if they started right now. Find the top 8 teams by win_pct. 

`df.sort_values(by='win_pct', ascending=False).head(8).team`

In [187]:
nba_df.sort_values(by='win_pct', ascending = False).head(8).team

2    UTA
3    PHX
0    LAC
9    DEN
1    LAL
5    POR
8    DAL
6    MEM
Name: team, dtype: object
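When a chain fails, a useful first check is what each step returns. The sketch below shows one common cause, assuming a small illustrative dataframe: a method called with `inplace=True` returns `None`, so nothing can be chained after it.

```python
import pandas as pd

records = pd.DataFrame({'team': ['LAC', 'LAL', 'UTA'],
                        'wins': [40, 35, 45],
                        'losses': [19, 23, 15]})

# inplace=True modifies records and returns None
result = records.drop(columns=['losses'], inplace=True)

# chaining onto None fails with an AttributeError
try:
    result.head(2)
    chain_error = None
except AttributeError as err:
    chain_error = str(err)

# selecting a column returns a Series, so later steps must be Series methods
top_team = records.sort_values(by='wins', ascending=False).team.head(1)
```

Running the chain one method at a time and checking `type()` after each step is a simple way to find where a dataframe turned into a series or `None`.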

**IF TIME ALLOWS**


## Series vs. Dataframe

In [61]:
s_values = ['LAL', 'LAC', 'DEN', 'OKC']
s_name = 'team'
s_index = [0, 1, 2, 3]
s_dtype = 'str'

s_teams = pd.Series(data=s_values, 
                    index=s_index,  
                    dtype=s_dtype, 
                    name=s_name)

In [62]:
print("name: ", s_teams.name,  "\ncolumns: NA\nindex: ", s_teams.index, "\naxes: ", s_teams.axes, "\ndtypes: ", s_teams.dtypes, 
      "\nndim: ", s_teams.ndim, "\nsize: ", s_teams.size, "\nshape: ", s_teams.shape, "\nvalues: ", s_teams.values)

name:  team 
columns: NA
index:  Int64Index([0, 1, 2, 3], dtype='int64') 
axes:  [Int64Index([0, 1, 2, 3], dtype='int64')] 
dtypes:  object 
ndim:  1 
size:  4 
shape:  (4,) 
values:  ['LAL' 'LAC' 'DEN' 'OKC']


In [63]:
s_values = [52, 48, 46, 44]
s_name = 'wins'
s_index = [0, 1, 2, 3]
s_dtype = 'int'

s_wins = pd.Series(data=s_values, 
                    index=s_index,  
                    dtype=s_dtype, 
                    name=s_name)

In [64]:
print("name: ", s_wins.name,  "\ncolumns: NA\nindex: ", s_wins.index, "\naxes: ", s_wins.axes, "\ndtypes: ", s_wins.dtypes, 
      "\nndim: ", s_wins.ndim, "\nsize: ", s_wins.size, "\nshape: ", s_wins.shape, "\nvalues: ", s_wins.values)

name:  wins 
columns: NA
index:  Int64Index([0, 1, 2, 3], dtype='int64') 
axes:  [Int64Index([0, 1, 2, 3], dtype='int64')] 
dtypes:  int64 
ndim:  1 
size:  4 
shape:  (4,) 
values:  [52 48 46 44]


In [65]:
df = pd.DataFrame({'team': s_teams, 'wins': s_wins})

print("name: NA\ncolumns: ", df.columns, "\nindex: ", df.index, "\naxes: ", df.axes, 
      "\ndtypes: ", df.dtypes, "\nndim: ", df.ndim, "\nsize: ", df.size, 
      "\nshape: ", df.shape, "\nvalues: ", df.values)
# data=None, index=None, columns=None, dtype=None, copy=False

name: NA
columns:  Index(['team', 'wins'], dtype='object') 
index:  Int64Index([0, 1, 2, 3], dtype='int64') 
axes:  [Int64Index([0, 1, 2, 3], dtype='int64'), Index(['team', 'wins'], dtype='object')] 
dtypes:  team    object
wins     int64
dtype: object 
ndim:  2 
size:  8 
shape:  (4, 2) 
values:  [['LAL' 52]
 ['LAC' 48]
 ['DEN' 46]
 ['OKC' 44]]


In [66]:
df['team']

df[['team']]

Unnamed: 0,team
0,LAL
1,LAC
2,DEN
3,OKC


**References**

Boyini, K. (2019, February 19). Difference between Method and Function in Python. Retrieved January 20, 2021, from https://www.tutorialspoint.com/difference-between-method-and-function-in-python
