# Lesson: Data Science Libraries - DATAFRAMES

<a href = "https://www.canva.com/design/DAFvezSpCBM/tv-Cwl_qyrQXlCAefpdPgw/view?utm_content=DAFvezSpCBM&utm_campaign=designshare&utm_medium=link&utm_source=publishsharelink">![image.png](attachment:4460a3a1-ce2a-44b5-8978-4e9f4fdc5147.png)</a>

<hr style="border:2px solid gray">


## About Dataframes

- tabular  
- 2-dimensional   
- provide a number of facilities for manipulating and transforming the data   

**Pandas Help: Functions & Methods**

You will use the Pandas documentation often. The API reference lives at pandas.pydata.org/pandas-docs/stable/reference

- Documentation on Pandas Functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html). 

- Documentation on Pandas DataFrame Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). 

- Documentation on Pandas Series Methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html). 

You may be asking, "What's the difference between a method and a function?"

A function is *a block of code that carries out a specific task, has its own scope, and is called by name. A function may take zero or more arguments, and on exit it may or may not return one or more values.* 

A method is *a function which belongs to an object.* 

Methods are *called on* objects, so Pandas DataFrame Methods are called on DataFrame objects, and Pandas Series Methods are called on Series objects. That means that when calling a method, you precede it with the name of your DataFrame or Series, such as `my_df.info()` or `my_series.info()`. A Pandas Function is called from the `pd` namespace instead, such as `pd.concat()`. 
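To make the distinction concrete, here is a minimal sketch. The two small DataFrames (`first_half`, `second_half`) are made up for illustration; only `pd.concat`, `.head()`, and `.sum()` come from Pandas itself.

```python
import pandas as pd

# Two small, illustrative DataFrames (names are hypothetical)
first_half = pd.DataFrame({'Team': ['LAC', 'LAL'], 'W': [40, 35]})
second_half = pd.DataFrame({'Team': ['UTA', 'PHX'], 'W': [45, 41]})

# Function: called from the pd namespace, receives the objects as arguments
combined = pd.concat([first_half, second_half], ignore_index=True)

# Methods: called on an object that already exists
top_two = combined.head(2)        # DataFrame method
win_total = combined['W'].sum()   # Series method

print(top_two)
print(win_total)
```

Note that `pd.concat` does not belong to either DataFrame, which is why both are passed in as arguments, while `head` and `sum` act on the object to the left of the dot.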



In [1]:
# import!
import pandas as pd
import numpy as np


___
## Create Dataframes

1. We can pass a dictionary to create a DataFrame, where the keys correspond to the names of the columns, and the values associated with those keys will make up the data.  

2. We can also pass lists or arrays to create a DataFrame, where each list or array represents a row in the DataFrame. 

3. We can also create DataFrames by reading data from an existing structured data set, such as a CSV file, a SQL table, or an Excel file. 

For this lesson, we will create a DataFrame from the existing records for the NBA Western Conference, using the first 2 methods above.


| Team | W  | L  |
| ---- | -- | -- |
| LAC  | 40 | 19 |
| LAL  | 35 | 23 |
| UTA  | 45 | 15 |
| PHX  | 41 | 16 |
| GSW  | 29 | 29 |
| POR  | 32 | 24 |
| MEM  | 29 | 27 |
| SAS  | 28 | 28 |
| DAL  | 30 | 26 |
| DEN  | 37 | 20 |
| OKC  | 20 | 38 |
| NOR  | 25 | 32 |
| SAC  | 23 | 34 |
| HOU  | 15 | 43 |
| MIN  | 15 | 43 |

**Method 1**  

Pass a dictionary where keys => column names, values => column values. 

`df = pd.DataFrame({'col1_name': [values], 'col2_name': [values], 'col3_name': [values]})`

In [2]:
Team = ['LAC', 'LAL', 'UTA', 'PHX', 'GSW', 'POR', 'MEM', 'SAS', 'DAL', 'DEN', 'OKC', 'NOR', 'SAC', 'HOU', 'MIN']
W = [40, 35, 45, 41, 29, 32, 29, 28, 30, 37, 20, 25, 23, 15, 15]
L = [19, 23, 15, 16, 29, 24, 27, 28, 26, 20, 38, 32, 34, 43, 43]

In [3]:
# create dataframe from dictionary with lists 
nba_df = pd.DataFrame({'Team': Team, 'W': W, 'L': L})

In [4]:
type(nba_df)

pandas.core.frame.DataFrame

In [5]:
nba_df.head()

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
4,GSW,29,29


**Method 2**  

Pass a list of lists where each nested list is a row in the DataFrame. 

`df = pd.DataFrame([row1, row2, row3, row4, ...], columns=[col1_name, col2_name, col3_name])`

In [6]:
# each list represents the team vector, or a row. 

lac = ['LAC', 40, 19]
lal = ['LAL', 35, 23]
uta = ['UTA', 45, 15]
phx = ['PHX', 41, 16]
gsw = ['GSW', 29, 29]
por = ['POR', 32, 24]
mem = ['MEM', 29, 27]
sas = ['SAS', 28, 28]
dal = ['DAL', 30, 26]
den = ['DEN', 37, 20]
okc = ['OKC', 20, 38]
nor = ['NOR', 25, 32]
sac = ['SAC', 23, 34]
hou = ['HOU', 15, 43]
minn = ['MIN', 15, 43]

In [7]:
[lac, lal, uta, phx, gsw, por, mem, sas, dal, den, okc, nor, sac, hou, minn]

[['LAC', 40, 19],
 ['LAL', 35, 23],
 ['UTA', 45, 15],
 ['PHX', 41, 16],
 ['GSW', 29, 29],
 ['POR', 32, 24],
 ['MEM', 29, 27],
 ['SAS', 28, 28],
 ['DAL', 30, 26],
 ['DEN', 37, 20],
 ['OKC', 20, 38],
 ['NOR', 25, 32],
 ['SAC', 23, 34],
 ['HOU', 15, 43],
 ['MIN', 15, 43]]

In [8]:
# create dataframe from list of lists that represent rows 
nba_df_as_rows = pd.DataFrame([lac, lal, uta, phx, gsw, por, mem, sas, dal, den, okc, nor, sac, hou, minn], columns= ['Team', 'W', 'L'])

In [9]:
nba_df_as_rows.head(2)

Unnamed: 0,Team,W,L
0,LAC,40,19
1,LAL,35,23


**Method 3**  

Read a structured dataset into a DataFrame. 

From a csv: 

`my_df = pd.read_csv('file_name.csv')`
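Since `file_name.csv` above is only a placeholder, the sketch below reads CSV text from memory with `io.StringIO` so it runs without any file on disk. The variable names (`csv_text`, `standings`) are illustrative; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# CSV content held in a string; a real workflow would point at a file path
csv_text = """Team,W,L
LAC,40,19
LAL,35,23
UTA,45,15
"""

# read_csv accepts a path or any file-like object
standings = pd.read_csv(io.StringIO(csv_text))

print(standings)
print(standings.dtypes)
```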

From a built-in dataset from pydataset:

In [10]:
from pydataset import data

In [11]:
# let's take a look: show_doc=True displays the dataset's documentation
data('HairEyeColor', show_doc=True)


In [12]:
color_df = data('HairEyeColor')

In [13]:
type(color_df)

pandas.core.frame.DataFrame

In [14]:
color_df.tail(5)

Unnamed: 0,Hair,Eye,Sex,Freq
28,Blond,Hazel,Female,5
29,Black,Green,Female,2
30,Brown,Green,Female,14
31,Red,Green,Female,7
32,Blond,Green,Female,8


**Method 4**  

Copy a DataFrame from an existing DataFrame.

In [15]:
# .copy() returns an independent copy; plain assignment would only add a second name for the same object
nba_df = nba_df_as_rows.copy()

### Naming Conventions for DataFrames

It is common for DataFrames to contain `df` in their variable names, as we have here with `nba_df`. In many examples, you may see DataFrames simply named `df`. 

In practice, especially if you are working with multiple DataFrames, it is good to choose a name that describes what your DataFrame contains or how it differs from the other DataFrames in your environment, i.e. the other DataFrames you have created in your notebook or current Python session or kernel. 

For example, the first two DataFrames I created are exactly the same. Because my purpose was to demonstrate the different ways of creating them, I gave them names that identify how each was built, even though a name like `nba_df_as_rows` is ridiculously long ;). 

*(We will compare them later to prove they are identical, btw).*
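One possible way to make that comparison is `DataFrame.equals`, which checks that two DataFrames have the same shape, labels, dtypes, and values. The sketch below rebuilds a three-row version of each DataFrame so that it stands alone; `from_dict` and `from_rows` are illustrative names, not the lesson's variables.

```python
import pandas as pd

# Built from a dictionary of columns (Method 1)
from_dict = pd.DataFrame({
    'Team': ['LAC', 'LAL', 'UTA'],
    'W': [40, 35, 45],
    'L': [19, 23, 15],
})

# Built from a list of rows (Method 2)
from_rows = pd.DataFrame(
    [['LAC', 40, 19], ['LAL', 35, 23], ['UTA', 45, 15]],
    columns=['Team', 'W', 'L'],
)

# True only if shape, labels, dtypes, and values all match
frames_match = from_dict.equals(from_rows)
print(frames_match)
```

In the notebook itself, the equivalent check would be `nba_df.equals(nba_df_as_rows)`, run before any columns are renamed or added.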

___
## View DataFrames

What's in these DataFrames we just created? 

1. `nba_df` (without `print()`) gives a nicely formatted display.
    - This only works in Jupyter or IPython; in a plain Python script a bare expression displays nothing.   
    - Only the last expression in a cell is displayed, so `nba_df` shows nothing if another statement follows it in the same cell. 
2. `print(nba_df)` displays plain text and does not have an `Out[#]`, while a bare `nba_df` does.    


We can also take a peek at the data in the DataFrame by sampling: the first 5 rows, last 5 rows, a random sample, or anything in between. 

- `df.head(n)`: first n rows (default n = 5)  
- `df.tail(n)`: last n rows (default n = 5)  
- `df.sample(n=int, random_state=int)`: random sample of n rows (default n = 1)
- `df.sample(frac=float, random_state=int)`: random sample of a fraction (proportion) of the rows   
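A short sketch of these four calls on a six-row slice of the standings. The DataFrame is rebuilt here so the block runs on its own, and the `random_state` value of 42 is an arbitrary choice. Note that `frac` has to be passed by keyword, because the first positional argument of `sample` is `n`.

```python
import pandas as pd

nba_df = pd.DataFrame({
    'Team': ['LAC', 'LAL', 'UTA', 'PHX', 'GSW', 'POR'],
    'W': [40, 35, 45, 41, 29, 32],
    'L': [19, 23, 15, 16, 29, 24],
})

first_two = nba_df.head(2)   # first 2 rows
last_two = nba_df.tail(2)    # last 2 rows

# random_state makes the random draw repeatable
three_random = nba_df.sample(n=3, random_state=42)
same_three = nba_df.sample(n=3, random_state=42)

# frac=0.5 of 6 rows gives 3 rows
half_random = nba_df.sample(frac=0.5, random_state=42)

print(first_two)
print(three_random)
```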

___
## Summarize DataFrames

- `df.info()`:  See total number of rows, column names, number of non-null values for each column, datatype of each column, size of the DataFrame (memory usage) 

- `df.describe()`: Summary statistics of all the columns with numeric datatypes. 


In [None]:
# get the object type, index (row) range, column names, number of non-null values, datatypes, and size. 
nba_df.info()

In [None]:
# get summary stats of the numeric columns
nba_df.describe()


___
## DataFrame Attributes

- `df.dtypes`: datatype of each column
- `df.shape`: tuple of the number of rows & columns in the DataFrame
- `df.index`: the labels for each row (usually an autogenerated integer range)
- `df.columns`: the labels for each column; you can also assign new values to this attribute. 

You will notice that when calling these attributes of DataFrames, they are not followed by `()`. If you add the parentheses by mistake, Python raises a `TypeError` (the object is not callable) to remind you :)

In [None]:
# get the datatypes of each column
nba_df.dtypes

In [None]:
# what type of object is returned when we call dtypes?
type(nba_df.dtypes)

In [None]:
# see number of rows and columns, or the shape, of the df
nba_df.shape

In [None]:
# what type of object does 'shape' return? 
type(nba_df.shape)

In [None]:
# how can I get the number of rows? 
nba_df.shape[0]

In [None]:
# what are the row labels, or index values, of my dataframe? 
nba_df.index

In [None]:
# What type of object does .index return? 
type(nba_df.index)

In [None]:
# what are the column names of my dataframe? 
nba_df.columns

In [None]:
# what type of object is returned? 
type(nba_df.columns)

### What do you notice? 

![image.png](attachment:36c10eaa-3f79-4b6c-a83d-3b5d9c7608cf.png)
#### Column names are also an index, like rows. Axis 0 = Rows, Axis 1 = Columns
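
The sketch below illustrates both points on a three-row DataFrame built only for this example: the row labels and the column labels are both `pd.Index` objects, and the `axis` argument of a method such as `sum` chooses which of the two gets collapsed.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['LAC', 'LAL', 'UTA'],
    'w': [40, 35, 45],
    'l': [19, 23, 15],
})

# Both label sets are Index objects (RangeIndex is a subclass of Index)
rows_are_index = isinstance(df.index, pd.Index)
columns_are_index = isinstance(df.columns, pd.Index)

# axis=0 works down the rows: one result per column
column_totals = df[['w', 'l']].sum(axis=0)

# axis=1 works across the columns: one result per row
games_per_team = df[['w', 'l']].sum(axis=1)

print(column_totals)
print(games_per_team)
```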


In [20]:
# rename the columns using the .columns attribute and the .upper and .lower methods. 
nba_df.columns

Index(['Team', 'W', 'L'], dtype='object')

In [21]:
nba_df.columns = nba_df.columns.str.upper()
nba_df.columns 

Index(['TEAM', 'W', 'L'], dtype='object')

In [22]:
nba_df.columns = nba_df.columns.str.lower()
nba_df.columns

Index(['team', 'w', 'l'], dtype='object')

___
## Subset/Filter DataFrames

### Columns

Return a DataFrame

- `df[[col1, col2]]`  
- `df[[col1]]`  
- `mycols = [col1, col2]` -> `df[mycols]`  

Return a series
- `df[col1]`  
- `df.col1`  


A DataFrame will have a column index while a series will not. A series will instead have a name for the series that was the original column name you selected. 

In [23]:
# return a multi-column dataframe
nba_df[['team', 'w']].head()

Unnamed: 0,team,w
0,LAC,40
1,LAL,35
2,UTA,45
3,PHX,41
4,GSW,29


In [24]:
# return a single column dataframe
nba_df[['team']].head()

Unnamed: 0,team
0,LAC
1,LAL
2,UTA
3,PHX
4,GSW


In [25]:
# return a series using single bracket
nba_df['team'].head(2)

0    LAC
1    LAL
Name: team, dtype: object

In [26]:
# return a series using '.': df.colname
nba_df.team.head()

0    LAC
1    LAL
2    UTA
3    PHX
4    GSW
Name: team, dtype: object

### Rows

We can subset a DataFrame by filtering rows using a conditional. 

For example, `df[df.col1 < x]` will return all columns and all rows where `col1` value is less than x. 

In [27]:
nba_df.w > 30

0      True
1      True
2      True
3      True
4     False
5      True
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13    False
14    False
Name: w, dtype: bool

In [28]:
nba_df_winners = nba_df[nba_df.w > 30]
nba_df_winners

Unnamed: 0,team,w,l
0,LAC,40,19
1,LAL,35,23
2,UTA,45,15
3,PHX,41,16
5,POR,32,24
9,DEN,37,20


### Subset Columns and Filter Rows

- `df[df.col1 < x].col2`: column 2 and rows where col1 value is less than x. What kind of object is returned?     
- `df[df.col1 < x][[col1, col2]]`: columns 1 & 2 and rows where col1 value is less than x. What kind of object is returned?   
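
One way to answer the question in both bullets is to check the result with `type()`. This is a small self-contained sketch on a four-row DataFrame made up for the purpose; `as_series` and `as_frame` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['LAC', 'LAL', 'GSW', 'OKC'],
    'w': [40, 35, 29, 20],
    'l': [19, 23, 29, 38],
})

# Filter rows, then select one column with dot notation
as_series = df[df.w > 30].team

# Filter rows, then select a list of columns with double brackets
as_frame = df[df.w > 30][['team', 'l']]

print(type(as_series))
print(type(as_frame))
```

Selecting a single column by name gives back a Series, while selecting with a list of column names gives back a DataFrame, even when the list holds only one name.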

In [29]:
nba_df[nba_df.w > 30].team

0    LAC
1    LAL
2    UTA
3    PHX
5    POR
9    DEN
Name: team, dtype: object

In [30]:
nba_df[nba_df.w > 30][['team', 'l']]

Unnamed: 0,team,l
0,LAC,19
1,LAL,23
2,UTA,15
3,PHX,16
5,POR,24
9,DEN,20


In [31]:
# add a column
nba_df['l2'] = nba_df['l']

___
## Drop, Rename, Add Columns

**Drop**

`df.drop(columns=[])`

In the drop and rename methods (and many others in Pandas), the original DataFrame is not changed, but instead a new DataFrame is produced. However, you can pass `inplace=True` to change the original DataFrame instead. 
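
A minimal sketch of that difference, using a two-row DataFrame built just for this example. It also shows that a call with `inplace=True` returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({'team': ['LAC', 'LAL'], 'w': [40, 35], 'l': [19, 23]})

# Without inplace: a new DataFrame comes back, df itself is untouched
dropped = df.drop(columns=['l'])
columns_after_plain_drop = list(df.columns)

# With inplace=True: df is modified and the call returns None
result = df.drop(columns=['l'], inplace=True)
columns_after_inplace_drop = list(df.columns)

print(columns_after_plain_drop)
print(columns_after_inplace_drop)
print(result)
```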


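In [32]:
# drop columns 
nba_df = nba_df.drop(columns=['l2'])
nba_df.head(2)

Unnamed: 0,team,w,l
0,LAC,40,19
1,LAL,35,23

In [None]:
# re-add the column
nba_df['l2'] = nba_df['l']

# using 'inplace'
nba_df.drop(columns=['l2'], inplace=True)

**Rename**

`df.columns = [col1_new, col2_new, col3_new]`

In [None]:
# rename columns using .columns (the list must name every column, in order)
# 'wins' and 'losses' are illustrative names
nba_df.columns = ['team', 'wins', 'losses']
nba_df.columns

The benefit of the `rename` method is that you don't have to list all column names, only the ones you are renaming. 

`df.rename(columns={'original_name': 'new_name'})`


In [None]:
# rename columns using .rename with dictionary (here, back to the short names)
nba_df = nba_df.rename(columns={'wins': 'w', 'losses': 'l'})
nba_df.columns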

**Add**

`df['new_col'] = df['col1'] - df['col2']`

In [None]:
# create new columns: win_pct = wins / games played
nba_df['win_pct'] = nba_df['w'] / (nba_df['w'] + nba_df['l'])

In [None]:
nba_df.head(2)

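`df.assign(new_col=df['col1'] - df['col2'])`

In [None]:
# create new column using assign (it returns a new DataFrame, so reassign)
nba_df = nba_df.assign(win_pct=nba_df['w'] / (nba_df['w'] + nba_df['l']))
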

___
## Sort DataFrames

- `df.sort_values(by='col1', ascending=False)`: `ascending` defaults to `True`, so the argument is only needed when sorting in descending order. 

In [None]:
# sort by win_pct using sort_values
nba_df.sort_values(by='win_pct', ascending=False)

In [None]:
# save sorted dataframe and select top/bottom
nba_df_sorted = nba_df.sort_values(by='win_pct', ascending=False)
nba_df_sorted.head(3)

In [None]:
nba_df_sorted.tail(3)

___
## Chain DataFrame Methods

As long as each method returns a DataFrame, the calls can be chained together to quickly and easily create the DataFrame you need. 

Challenge: find the teams that would be in the playoffs if they started right now, i.e. the top 8 teams by win_pct. 

`df.sort_values(by='win_pct', ascending=False).head(8).team`
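
A sketch of the challenge as one chain, rebuilt from the standings table so it runs on its own. It assumes lowercase column names and computes `win_pct` inside the chain with `assign` and a lambda, which is just one possible way to do it. The sort is descending so that the highest `win_pct` comes first.

```python
import pandas as pd

nba_df = pd.DataFrame({
    'team': ['LAC', 'LAL', 'UTA', 'PHX', 'GSW', 'POR', 'MEM', 'SAS',
             'DAL', 'DEN', 'OKC', 'NOR', 'SAC', 'HOU', 'MIN'],
    'w': [40, 35, 45, 41, 29, 32, 29, 28, 30, 37, 20, 25, 23, 15, 15],
    'l': [19, 23, 15, 16, 29, 24, 27, 28, 26, 20, 38, 32, 34, 43, 43],
})

playoff_teams = (
    nba_df
    # the lambda receives the DataFrame at this point in the chain
    .assign(win_pct=lambda d: d['w'] / (d['w'] + d['l']))
    # highest win percentage first
    .sort_values(by='win_pct', ascending=False)
    # keep the top 8 rows
    .head(8)
    # final step returns a Series of team names
    .team
)

print(playoff_teams.tolist())
```

Wrapping the chain in parentheses lets each method sit on its own line, which keeps longer chains readable.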

___
## Series vs. DataFrames

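In [None]:
# build a Series from its parts, reusing the Team list defined earlier
s_values = Team
s_name = 'team'
s_index = range(len(Team))
s_dtype = 'object'

s_teams = pd.Series(s_values, name=s_name, index=s_index, dtype=s_dtype)

In [None]:
print(f"""name: {s_teams.name}
columns: NA
index: {s_teams.index}
axes: {s_teams.axes}
dtypes: {s_teams.dtypes}
ndim: {s_teams.ndim}
size: {s_teams.size}
shape: {s_teams.shape}
values: {s_teams.values}""")

In [None]:
# same pattern for the wins, reusing the W list defined earlier
s_values = W
s_name = 'w'
s_index = range(len(W))
s_dtype = 'int64'

s_wins = pd.Series(s_values, name=s_name, index=s_index, dtype=s_dtype)

In [None]:
print(f"""name: {s_wins.name}
columns: NA
index: {s_wins.index}
axes: {s_wins.axes}
dtypes: {s_wins.dtypes}
ndim: {s_wins.ndim}
size: {s_wins.size}
shape: {s_wins.shape}
values: {s_wins.values}""")

In [None]:
# compare against the full DataFrame
df = nba_df.copy()

print(f"""name: NA
columns: {df.columns}
index: {df.index}
axes: {df.axes}
dtypes: {df.dtypes}
ndim: {df.ndim}
size: {df.size}
shape: {df.shape}
values: {df.values}""")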



In [None]:
df['team']

In [None]:
df[['team']]