# Pandas Dataframes

## Agenda

1. **About** Dataframes   

2. **Create**   
    A. Object types  
    B. Naming conventions  
    
3. **View**  

4. **Compare**  

5. **Summarize**   

6. **Dataframe Attributes**  

7. **Subset/Filter**  
    A. Columns  
    B. Rows 
        
8. **Drop, Rename, Add Columns**  
    A. Dropping columns  
    B. Renaming columns  
    C. Adding columns   

9. **Sort**

10. **Chain DF Methods**  

## About Dataframes

- tabular  
- 2-dimensional   
- provide a number of facilities for manipulating and transforming the data   

**Pandas Help: Functions & Methods**

[Pandas Functions](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html) are called from the pandas namespace, so with the usual `import pandas as pd` they begin with `pd`, such as `pd.concat()`. (See the link for a list of functions.)

Methods are *called on* objects, so [Pandas DataFrame Methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) are called on dataframes. That means they begin with the name of your dataframe, such as `df.info()` where the name of your dataframe is `df`. Similarly, [Series Methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) are called on a series, such as `my_series.value_counts()`. (See the links for lists of methods.)

In [66]:
import pandas as pd
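
As a small illustration of the function/method distinction, the sketch below builds two tiny dataframes (the names `top_df`, `bottom_df`, and `combined_df` are made up for this example), combines them with a pandas function, and then calls a dataframe method and a series method on the result.

```python
import pandas as pd

# Two small, hypothetical dataframes built from a few rows of the standings.
top_df = pd.DataFrame({'team': ['LAL', 'LAC'], 'win': [52, 48]})
bottom_df = pd.DataFrame({'team': ['MIN', 'GS'], 'win': [19, 15]})

# Function: called from the pandas namespace, takes the objects as arguments.
combined_df = pd.concat([top_df, bottom_df], ignore_index=True)

# DataFrame method: called on the dataframe itself.
combined_df.info()

# Series method: called on a single column, which is a Series.
total_wins = combined_df['win'].sum()
print(total_wins)
```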

## Create Dataframes

1. We can pass a dictionary to create a dataframe, where the keys correspond to the names of the columns, and the values associated with those keys will make up the data.  

2. We can also pass lists or arrays to create a dataframe, where each list or array represents a row in the dataframe.   

3. We can copy an existing dataframe into a new dataframe.

4. We can also create dataframes by reading data from an existing structured data set, such as a CSV file, a SQL table, or an Excel file.

For this lesson, we will create a dataframe of the current standings for the NBA Western Conference using the first 3 methods above.

**Method 1**  

Pass a dictionary where keys => column names, values => column values. 

In [67]:
teams_col = ['LAL', 'LAC', 'DEN', 'OKC', 'HOU', 'UTA', 'DAL', 
             'POR', 'MEM', 'PHX', 'SAS', 'SAC', 'NO', 'MIN', 'GS']
wins_col = [52, 48, 46, 44, 44, 43, 43, 
            34, 33, 33, 32, 30, 30, 19, 15]
losses_col = [18, 23, 26, 27, 27, 28, 31, 
              39, 39, 39, 38, 41, 41, 45, 50]
games_back_col = [0, 4.5, 7, 8.5, 8.5, 9.5, 11, 
                  19.5, 20, 20, 20, 22.5, 22.5, 30, 34.5]

nba_df_from_dict_as_cols = pd.DataFrame(
    {'team': teams_col, 
     'win': wins_col, 
     'loss': losses_col, 
     'games_back': games_back_col}
)

**Method 2**

Pass a list of lists where each nested list is a row in the dataframe. 

In [68]:
nba_df_from_lists_as_rows = pd.DataFrame(
    [['LAL', 52, 18, 0], 
     ['LAC', 48, 23, 4.5], 
     ['DEN', 46, 26, 7.0], 
     ['OKC', 44, 27, 8.5], 
     ['HOU', 44, 27, 8.5], 
     ['UTA', 43, 28, 9.5], 
     ['DAL', 43, 31, 11], 
     ['POR', 34, 39, 19.5], 
     ['MEM', 33, 39, 20], 
     ['PHX', 33, 39, 20], 
     ['SAS', 32, 38, 20], 
     ['SAC', 30, 41, 22.5], 
     ['NO', 30, 41, 22.5], 
     ['MIN', 19, 45, 30], 
     ['GS', 15, 50, 34.5]
    ], 
    columns = ['team', 'win', 'loss', 'games_back']
)

**Method 3**

Copy a dataframe from an existing dataframe. 

In [69]:
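# .copy() makes an independent dataframe; plain assignment would only add a second name for the same object
nba_df = nba_df_from_lists_as_rows.copy()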
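
To see why the difference between assigning and copying matters, here is a minimal sketch with made-up variable names (`original_df`, `alias_df`, `copied_df`). Plain assignment binds a second name to the same object, while `.copy()` produces a separate dataframe.

```python
import pandas as pd

original_df = pd.DataFrame({'team': ['LAL', 'LAC'], 'win': [52, 48]})

alias_df = original_df          # second name for the very same object
copied_df = original_df.copy()  # new, independent dataframe

# Change a value through the original name.
original_df.loc[0, 'win'] = 0

print(alias_df is original_df)   # the alias is the same object
print(alias_df.loc[0, 'win'])    # so it sees the change
print(copied_df.loc[0, 'win'])   # the copy keeps the old value
```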

### Object Types

Guesses for the type of objects we just created?
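
One way to check a guess is the built-in `type()` function. The sketch below rebuilds a shortened version of the standings (first four teams only) so that it runs on its own.

```python
import pandas as pd

teams_col = ['LAL', 'LAC', 'DEN', 'OKC']
wins_col = [52, 48, 46, 44]

nba_df = pd.DataFrame({'team': teams_col, 'win': wins_col})

print(type(teams_col))  # the input we passed in is a plain Python list
print(type(nba_df))     # the object pandas built from it
```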

### Naming Conventions for Dataframes

It is common for dataframes to contain `df` in their variable names, as we have here with `nba_df`. In many examples, you will see dataframes simply named `df`. In practice, especially if you are working with multiple dataframes, it is good to choose a name that describes what the dataframe contains or how it differs from the other dataframes in your environment, i.e. the other dataframes you have created in your notebook or current Python session or kernel.

For example, the first two dataframes I created are exactly the same, but my purpose was to demonstrate the different ways of creating them, so I named them to identify that, albeit with ridiculously long names ;). (We will compare them later to prove they are identical, btw.)

## View Dataframes

What's in these dataframes we just created? 

1. `nba_df` (without `print()`) gives a nicely formatted display.
2. `nba_df` (without `print()`) displays nothing outside of Jupyter or IPython, for example in a regular Python script.
3. `print(nba_df)` does not have an `Out[#]`, while `nba_df` does.
4. Only the last expression in a cell is displayed, so `nba_df` (without `print()`) shows nothing if another statement follows it in the same cell.

## Compare Dataframes

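Are `nba_df_from_dict_as_cols` and `nba_df_from_lists_as_rows` equivalent dataframes? Don't take my word for it.

Remember `(my_array < 0).all()` from NumPy? On a dataframe, `.all()` reduces column by column, so a second `.all()` is needed to get a single boolean.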
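
A minimal sketch of that comparison, using a three-row version of the two dataframes so it runs on its own (the names `df_from_dict` and `df_from_rows` are shortened stand-ins for the longer names above):

```python
import pandas as pd

df_from_dict = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN'],
     'win': [52, 48, 46],
     'games_back': [0, 4.5, 7]}
)
df_from_rows = pd.DataFrame(
    [['LAL', 52, 0], ['LAC', 48, 4.5], ['DEN', 46, 7.0]],
    columns=['team', 'win', 'games_back']
)

elementwise = df_from_dict == df_from_rows  # dataframe of booleans
per_column = elementwise.all()              # series: one boolean per column
all_equal = per_column.all()                # a single boolean
print(all_equal)

# df.equals() does a similar check in one call and also compares dtypes.
print(df_from_dict.equals(df_from_rows))
```

Note that `==` between two dataframes requires them to have the same row and column labels; otherwise pandas raises an error rather than returning `False`.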

## Summarize Dataframes

- `df.info()`:  See total number of rows, column names, number of non-null values for each column, datatype of each column, size of the dataframe (memory usage) 

- `df.describe()`: Summary statistics of all the columns with numeric datatypes. 
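
For instance, on a four-row version of the standings (rebuilt here so the sketch is self-contained):

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC'],
     'win': [52, 48, 46, 44],
     'loss': [18, 23, 26, 27],
     'games_back': [0, 4.5, 7, 8.5]}
)

# info() prints its report directly and returns None.
nba_df.info()

# describe() returns a new dataframe of summary statistics,
# covering only the numeric columns by default.
summary = nba_df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc['mean', 'win'])
```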


## Dataframe Attributes

- `df.dtypes`: datatype of each column
- `df.shape`: tuple of number of rows & columns in the dataframe
- `df.index`: the labels for each row (usually autogenerated int)
- `df.columns`: the column labels; you can also assign new values to this attribute to rename the columns.

You will notice that these dataframe attributes are not followed by `()` when you use them. If you add the parentheses by mistake, you will just get a nice error to remind you :)
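
A short sketch of these attributes on a four-row version of the standings. The new column names assigned at the end (`wins`, `losses`) are only for illustration.

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC'],
     'win': [52, 48, 46, 44],
     'loss': [18, 23, 26, 27],
     'games_back': [0, 4.5, 7, 8.5]}
)

print(nba_df.dtypes)   # datatype of each column
print(nba_df.shape)    # (number of rows, number of columns)
print(nba_df.index)    # row labels
print(nba_df.columns)  # column labels

# Assigning to .columns renames every column at once (done on a copy here).
renamed_df = nba_df.copy()
renamed_df.columns = ['team', 'wins', 'losses', 'games_back']

# Calling an attribute like a method raises an error.
try:
    nba_df.shape()
except TypeError as err:
    print(err)
```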

## Subset/Filter Dataframes

### Columns

Return a dataframe

- `df[[col1, col2]]`  
- `df[[col1]]`  
- `mycols = [col1, col2]` -> `df[mycols]`  

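Return a series

- `df[col1]`  
- `df.col1` (only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method)  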
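
The difference between single and double brackets can be checked directly. A minimal sketch on a four-row version of the standings:

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC'],
     'win': [52, 48, 46, 44],
     'loss': [18, 23, 26, 27]}
)

two_cols_df = nba_df[['team', 'win']]  # list of names -> dataframe
one_col_df = nba_df[['team']]          # list with one name -> still a dataframe
one_col_series = nba_df['team']        # single name -> series
dot_series = nba_df.team               # dot notation -> series

for obj in (two_cols_df, one_col_df, one_col_series, dot_series):
    print(type(obj))
```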


### Rows

We can take a peek at the first 5 rows, last 5 rows, a random sample, or anything in between. 

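- `df.head(n)`: first n rows (default n = 5)  
- `df.tail(n)`: last n rows (default n = 5)  
- `df.sample(n)`: random sample of n rows  
- `df.sample(frac=p)`: random sample of a proportion p of the rows (`frac` must be passed by keyword, since the first positional argument is `n`)  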
- `df[df.col1 < x]`: all columns and all rows where col1 value is less than x
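
A sketch of these row selections on the first six teams of the standings. The `random_state` argument is an addition here, used only so the random samples are repeatable.

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC', 'HOU', 'UTA'],
     'win': [52, 48, 46, 44, 44, 43],
     'loss': [18, 23, 26, 27, 27, 28]}
)

first_two = nba_df.head(2)                              # first 2 rows
last_two = nba_df.tail(2)                               # last 2 rows
random_two = nba_df.sample(n=2, random_state=42)        # 2 random rows
random_half = nba_df.sample(frac=0.5, random_state=42)  # half of the rows

# Boolean mask: keep only the rows where the condition is True.
under_45_wins = nba_df[nba_df.win < 45]
print(under_45_wins)
```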

### Columns & Rows

- `df[df.col1 < x].col2`: column 2 and rows where col1 value is less than x. What kind of object is returned?     
- `df[df.col1 < x][[col1, col2]]`: columns 1 & 2 and rows where col1 value is less than x. What kind of object is returned?   
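
The two questions above can be answered by checking the types. The sketch below also shows `.loc`, which is not covered in this lesson but selects rows and columns in a single step and gives the same result here.

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC', 'HOU', 'UTA'],
     'win': [52, 48, 46, 44, 44, 43],
     'loss': [18, 23, 26, 27, 27, 28]}
)

# Filter rows, then take one column by name.
teams_under_45 = nba_df[nba_df.win < 45].team
print(type(teams_under_45))

# Filter rows, then take a list of columns.
subset_df = nba_df[nba_df.win < 45][['team', 'win']]
print(type(subset_df))

# Alternative: row mask and column list together with .loc.
loc_df = nba_df.loc[nba_df.win < 45, ['team', 'win']]
```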


## Drop, Rename, Add Columns


- `df.drop(columns=[])`
- `df.rename(columns={'original_name': 'new_name'})`
- `df['new_col'] = df.col1 < x`
- `df.assign(new_col=df.col1 < x)`

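The `drop`, `rename`, and `assign` methods (and many others in pandas) do not change the original dataframe; they return a new dataframe. To keep the result, assign it to a variable. For `drop` and `rename` you can instead pass `inplace=True` to change the original dataframe, in which case the method returns `None`. Assigning with `df['new_col'] = ...` always changes the original dataframe.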
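
A minimal sketch of these four operations. The new column names (`wins`, `over_45_wins`, `winning_record`) and the helper name `scratch_df` are invented for the example.

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC'],
     'win': [52, 48, 46, 44],
     'loss': [18, 23, 26, 27],
     'games_back': [0, 4.5, 7, 8.5]}
)

# drop, rename, and assign each return a new dataframe; nba_df is untouched.
no_games_back_df = nba_df.drop(columns=['games_back'])
renamed_df = nba_df.rename(columns={'win': 'wins'})
flagged_df = nba_df.assign(over_45_wins=nba_df.win > 45)

# With inplace=True the dataframe is modified and the method returns None.
scratch_df = nba_df.copy()
returned = scratch_df.drop(columns=['loss'], inplace=True)
print(returned)

# Bracket assignment adds the column to nba_df itself.
nba_df['winning_record'] = nba_df.win > nba_df.loss
print(nba_df.columns.tolist())
```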


## Sort Dataframes

- `df.sort_values(by='col1', ascending=False)`: sorts the rows by the values in `col1`. `ascending` defaults to `True`, so the argument is only needed when sorting in descending order.
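
As a sketch, the rows below are deliberately entered out of order so that the effect of sorting is visible:

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['DEN', 'LAL', 'OKC', 'LAC'],
     'win': [46, 52, 44, 48],
     'loss': [26, 18, 27, 23]}
)

# Ascending is the default: fewest losses first.
by_losses_asc = nba_df.sort_values(by='loss')

# Descending: most wins first.
by_wins_desc = nba_df.sort_values(by='win', ascending=False)
print(by_wins_desc)
```

The original index labels travel with their rows, so the index of the sorted dataframe is no longer in order. Like `drop` and `rename`, `sort_values` returns a new dataframe and leaves `nba_df` as it was.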

## Chain Dataframe Methods

As long as each method returns a dataframe, method calls can be chained together, one after another, to quickly and easily create the dataframe you need.
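
A sketch of one possible chain, wrapped in parentheses so each method can sit on its own line. The `win_pct` column is a hypothetical addition for this example, and the `lambda` lets `assign` refer to the dataframe as it exists at that point in the chain.

```python
import pandas as pd

nba_df = pd.DataFrame(
    {'team': ['LAL', 'LAC', 'DEN', 'OKC'],
     'win': [52, 48, 46, 44],
     'loss': [18, 23, 26, 27],
     'games_back': [0, 4.5, 7, 8.5]}
)

top_three_df = (
    nba_df
    .assign(win_pct=lambda df: df.win / (df.win + df.loss))  # add a column
    .rename(columns={'win': 'wins', 'loss': 'losses'})       # rename columns
    .drop(columns=['games_back'])                            # drop a column
    .sort_values(by='win_pct', ascending=False)              # sort rows
    .head(3)                                                 # keep the top 3
)
print(top_three_df)
```

Every step returns a new dataframe, so `nba_df` itself is unchanged at the end of the chain.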