# Pandas Methods
> [Main Table of Contents](../README.md)

In [116]:
import pandas as pd
print(pd.__version__)

1.2.0


In [117]:
# NOTEBOOK DATA
twod_dict = { 'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
              'color': ['Brown', 'Brown', 'Black','Black', 'Brown'],
           'height': [1, 1.5, 2, 2, 1 ],
           'weight': [25, 45, 65, pd.NA, pd.NA]}
# labels = ['US', 'RU', 'MO', 'EG', 'CH']
df = pd.DataFrame(twod_dict)
# df = pd.DataFrame(twod_dict, index=labels)

## Series Methods
- A Series is a single column of a df
- If a method is also a df method, it is applied to each column of the df

Method | DF Method as well? |  Description
--- | --- | ---
.head() | Yes | Return the first 5 rows
.tail() | Yes | Return the last 5 rows
.describe() | Yes | Summary stats on columns of numerical type
.count() | Yes | Count of all non-NA values
.sum() | Yes | Add values
.cumsum() | Yes | Cumulatively add values (shows the progression)
.mean() | Yes | Average only on columns of numerical type
.std() | Yes | Standard deviation only on columns of numerical type
.min() | Yes | Minimum value 
.max() | Yes | Maximum value
.diff() | Yes | Difference between consecutive elements
.pct_change() | Yes | Percent change between consecutive elements
.isna() | Yes | Returns boolean series or dataframe
.isna().any() | Yes | True if a column contains a NA value, else False<br>Returns boolean for each column
.isna().sum() | Yes | Total number of NA values per column
.fillna() | Yes | Specify how to deal with NA Values
.agg() | Yes | Apply one or more CUSTOM functions over the specified axis
.drop_duplicates() | Yes | Drop rows with duplicate values in chosen subset
~~.values~~<br>.to_numpy() | Yes | Convert a Series or df to a NumPy array
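
The running and element-to-element methods in the table are easiest to compare on one small Series. This is a minimal sketch that assumes a standalone `height` Series built from the notebook's height values; the variable names are illustrative only.

```python
import pandas as pd

# Illustrative Series using the notebook's height values
height = pd.Series([1, 1.5, 2, 2, 1], name="height")

running_total = height.cumsum()   # running sum, one value per element
step = height.diff()              # current element minus the previous one
ratio = height.pct_change()       # step divided by the previous element

# The first element has no predecessor, so diff and pct_change start with NaN
print(pd.DataFrame({"cumsum": running_total, "diff": step, "pct_change": ratio}))
```

All three results keep the original length, unlike `.sum()` or `.mean()`, which collapse the Series to a single value.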


### isna

In [118]:
df.isna()

| | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | True |
| 4 | False | False | False | True |


In [119]:
df.isna().any()

breed     False
color     False
height    False
weight     True
dtype: bool

In [120]:
df.isna().sum()

breed     0
color     0
height    0
weight    2
dtype: int64

### fillna
Options | Default | Description
--- | --- | ---
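method | None | One of 'backfill', 'bfill', 'pad', 'ffill'<br>'backfill'/'bfill': use the next valid value<br>'pad'/'ffill': carry the last valid value forward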

In [121]:
df.fillna('hello')           # fill with the word hello
df.fillna(0)                 # fill with 0
df.fillna(method ='bfill')   # fill with the next valid value in the column (only this last result is displayed)

| | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 0 | Beagle | Brown | 1.0 | 25.0 |
| 1 | Mixed | Brown | 1.5 | 45.0 |
| 2 | Lab | Black | 2.0 | 65.0 |
| 3 | Lab | Black | 2.0 | |
| 4 | Corgi | Brown | 1.0 | |
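
The output above still has missing weights in rows 3 and 4 because a backfill needs a later valid value to copy from, and there is none. The sketch below contrasts the fill directions on a hypothetical `weight` Series. It assumes the values are stored as floats with NaN rather than `pd.NA`, and it uses `.ffill()` / `.bfill()`, the shorthand for `fillna(method=...)`, since newer pandas releases deprecate the `method` keyword.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with gaps in the middle and at the end
weight = pd.Series([25, np.nan, 65, np.nan, np.nan], name="weight")

forward = weight.ffill()      # carry the last valid value forward
backward = weight.bfill()     # copy the next valid value backward
constant = weight.fillna(0)   # replace every NA with a fixed value

print(pd.DataFrame({"ffill": forward, "bfill": backward, "fillna(0)": constant}))
```

Trailing gaps survive a backfill and leading gaps survive a forward fill, so a constant fill is sometimes chained afterwards.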


### agg
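- Apply one or more functions, custom or external
- A function that returns one value per column aggregates; one that returns a value per element transforms
- .agg is an alias for .aggregate
- Prefer .agg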

In [122]:
import numpy as np
def quad(x):
  return x*4

def triple(x):
  return x*3

df.agg(quad)                 # one custom transformation function
df.agg(np.median)            # one external aggregation function
# df.agg([quad, np.median])  # --> ERROR: cannot combine transformation and aggregation functions
df.agg([np.median, np.max])  # Multiple external aggregation functions
df.agg([triple, quad])       # Multiple external transformation functions

| | breed<br>triple | breed<br>quad | color<br>triple | color<br>quad | height<br>triple | height<br>quad | weight<br>triple | weight<br>quad |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BeagleBeagleBeagle | BeagleBeagleBeagleBeagle | BrownBrownBrown | BrownBrownBrownBrown | 3.0 | 4.0 | 75.0 | 100.0 |
| 1 | MixedMixedMixed | MixedMixedMixedMixed | BrownBrownBrown | BrownBrownBrownBrown | 4.5 | 6.0 | 135.0 | 180.0 |
| 2 | LabLabLab | LabLabLabLab | BlackBlackBlack | BlackBlackBlackBlack | 6.0 | 8.0 | 195.0 | 260.0 |
| 3 | LabLabLab | LabLabLabLab | BlackBlackBlack | BlackBlackBlackBlack | 6.0 | 8.0 | | |
| 4 | CorgiCorgiCorgi | CorgiCorgiCorgiCorgi | BrownBrownBrown | BrownBrownBrownBrown | 3.0 | 4.0 | | |
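
Besides function objects, `.agg` also accepts the names of built-in aggregations as strings, and a dict that maps each column to its own function. A minimal sketch, assuming a numeric-only frame `df_num` (a hypothetical name) with the notebook's heights and weights, the missing weight stored as NaN:

```python
import numpy as np
import pandas as pd

# Numeric columns only; the missing weight is NaN instead of pd.NA
df_num = pd.DataFrame({"height": [1, 1.5, 2, 2, 1],
                       "weight": [25, 45, 65, np.nan, 27]})

# A list of names gives one row per aggregation
summary = df_num.agg(["min", "max"])

# A dict applies a different aggregation to each column and returns a Series
per_column = df_num.agg({"height": "mean", "weight": "max"})

print(summary)
print(per_column)
```

NaN values are skipped by these built-in aggregations, so the weight results are computed from the four known values.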


### drop_duplicates

In [123]:
df.drop_duplicates('breed')

| | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 0 | Beagle | Brown | 1.0 | 25.0 |
| 1 | Mixed | Brown | 1.5 | 45.0 |
| 2 | Lab | Black | 2.0 | 65.0 |
| 4 | Corgi | Brown | 1.0 | |


In [124]:
# Select column as a series and drop duplicates
df['breed'].drop_duplicates()
# Select column as a DataFrame and drop duplicates (only this last result is displayed)
df[['breed']].drop_duplicates()

| | breed |
| --- | --- |
| 0 | Beagle |
| 1 | Mixed |
| 2 | Lab |
| 4 | Corgi |
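
In the outputs above the first Lab row (index 2) is kept and the second (index 3) is dropped, which is the default `keep='first'` behaviour. The sketch below shows the other `keep` choices on a reduced copy of the notebook data; the frame name `dogs` is illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed", "Lab", "Lab", "Corgi"],
                     "color": ["Brown", "Brown", "Black", "Black", "Brown"]})

first_kept = dogs.drop_duplicates("breed")               # keep the first Lab row
last_kept = dogs.drop_duplicates("breed", keep="last")   # keep the last Lab row
none_kept = dogs.drop_duplicates("breed", keep=False)    # drop every duplicated breed
by_color = dogs.drop_duplicates(subset=["color"])        # first row for each color

print(first_kept.index.tolist(), last_kept.index.tolist(),
      none_kept.index.tolist(), by_color.index.tolist())
```

The original index labels are preserved, which makes it easy to see which rows survived.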


## DataFrame Methods

Methods | Description
--- | ---
.head() | First five rows
.tail() | Last five rows
.info() | Info on the dataframe: non-null count per column, data types, row count
.describe() | Summary statistics on columns of numerical type
.sort_values(by, ascending=True) | Sort rows by the values in one or more columns<br>Pass a list to sort by multiple columns
.count() | Count all non-NA values
.value_counts() | Count unique rows<br>Options for normalizing and sorting the counts
.groupby() | Method to group in order to then apply functions by chaining to this method
.pivot_table() | Easier way to Group and apply functions
.iterrows() | Iterate over df rows as (index, Series) pairs <br>Access specific column of a row by bracket notation
.itertuples() | Iterate over df rows as namedtuples<br>Access specific column of a row by dot notation
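
Sorting is the one method in this table without an example further down, so here is a minimal sketch. It assumes a reduced copy of the notebook data named `dogs` (an illustrative name) and sorts by two columns with a separate direction for each.

```python
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed", "Lab", "Lab", "Corgi"],
                     "height": [1, 1.5, 2, 2, 1]})

# Tallest first; rows with equal height are ordered alphabetically by breed
ordered = dogs.sort_values(["height", "breed"], ascending=[False, True])

print(ordered)
```

`ascending` takes either a single boolean for all columns or a list with one boolean per sort column.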

In [125]:
# NOTEBOOK DATA
twod_dict = { 'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
              'color': ['Brown', 'Brown', 'Black','Black', 'Brown'],
           'height': [1, 1.5, 2, 2, 1 ],
           'weight': [25, 45, 65, pd.NA, 27]}
# labels = ['US', 'RU', 'MO', 'EG', 'CH']
df = pd.DataFrame(twod_dict)
# df = pd.DataFrame(twod_dict, index=labels)

### count
- Easy way to count all non-NA values for each row or column
- Every non-NA value is counted, whether or not it duplicates another value

In [126]:
df.count()

breed     5
color     5
height    5
weight    4
dtype: int64
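
The example above counts per column, which is the default. Counting per row uses `axis=1`. A minimal sketch on a reduced copy of the notebook data, with the missing weight stored as NaN (an assumption made to keep the frame simple):

```python
import numpy as np
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed", "Lab", "Lab", "Corgi"],
                     "weight": [25, 45, 65, np.nan, 27]})

per_column = dogs.count()         # non-NA values in each column
per_row = dogs.count(axis=1)      # non-NA values in each row

print(per_column)
print(per_row)
```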

### value_counts
- Easy way to count unique rows and apply common count options

Option | Default | Description
--- | --- | ---
subset | None | Choose column(s)
normalize | False | Ratio to the total<br>Proportions instead of frequencies
sort | True | Sort by the result 
ascending | False | Sort direction<br>Default: Highest count on top
dropna | True | Exclude rows that contain NA values<br>TODO: not usable yet because the option is new as of 1.3.0 and the current version is 1.2.0<br>Add example after upgrading

In [127]:
df.value_counts(subset=['breed'])

breed 
Lab       2
Beagle    1
Corgi     1
Mixed     1
dtype: int64

In [128]:
df.value_counts(subset=['color', 'breed'])

color  breed 
Black  Lab       2
Brown  Beagle    1
       Corgi     1
       Mixed     1
dtype: int64

In [129]:
df.value_counts(subset=['color', 'breed'], normalize=True)

color  breed 
Black  Lab       0.4
Brown  Beagle    0.2
       Corgi     0.2
       Mixed     0.2
dtype: float64
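
The `ascending` option from the table is not exercised above, and a Series has its own `value_counts` as well. A short sketch covering both, on a reduced copy of the notebook data named `dogs` (an illustrative name):

```python
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed", "Lab", "Lab", "Corgi"],
                     "color": ["Brown", "Brown", "Black", "Black", "Brown"]})

# Lowest count on top instead of the default highest-first order
rarest_first = dogs.value_counts(subset=["breed"], ascending=True)

# Series version: share of each color among all rows
share = dogs["color"].value_counts(normalize=True)

print(rarest_first)
print(share)
```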

### groupby
- Returns a GroupBy object
- Group by a mapping, function, label, or list of labels
- Group first, then apply functions by chaining them onto the GroupBy object
	- chain any built-in method, e.g. sum(), max()
	- groupby.apply(...)
	- groupby.agg(...)
	- groupby.transform(...)
	- groupby.pipe(...)
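
Of the chained methods listed, `agg` and `transform` differ mainly in the shape of what they return: `agg` gives one row per group, while `transform` gives one value per original row. A minimal sketch on a reduced copy of the notebook data; the names `dogs`, `stats`, and `centered` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({"color": ["Brown", "Brown", "Black", "Black", "Brown"],
                     "height": [1, 1.5, 2, 2, 1]})

grouped = dogs.groupby("color")["height"]

# agg: several aggregations at once, one row per color
stats = grouped.agg(["min", "max", "mean"])

# transform: group mean broadcast back to every row, so it aligns with dogs
centered = dogs["height"] - grouped.transform("mean")

print(stats)
print(centered)
```

Because the `transform` result keeps the original index, it can be assigned straight back to the frame as a new column.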



In [130]:
df.groupby('color').size()           # apply to entire df
df.groupby('color')['height'].sum()  # apply to specified Series

color
Black    4.0
Brown    3.5
Name: height, dtype: float64

In [131]:
df.groupby('color').apply(lambda x: x*2)             # apply to entire df
df.groupby('color')['weight'].apply(lambda x: x*2)   # apply to specified Series

0      50
1      90
2     130
3    <NA>
4      54
Name: weight, dtype: object

In [132]:
# Group by multiple groups and apply method to height column
df.groupby(['color', 'breed'])['height'].mean()    

color  breed 
Black  Lab       2.0
Brown  Beagle    1.0
       Corgi     1.0
       Mixed     1.5
Name: height, dtype: float64

### pivot_table
- Another way to group and apply functions
- Pivot tables are automatically sorted on indices
- The default aggregate function is the mean

Option | Default | Description
--- | --- | ---
values | None | Column(s) to apply the aggregate function to
index | None | Keys for the rows: column, Grouper, or list of them
columns | None | Keys for the columns<br>Used in the more colloquial way of setting up a pivot table
margins | False | Add an extra row & column that represent totals

In [133]:
# This is the same as the last example in `groupby` section
# See below for an alternate way to express this
df.pivot_table(values='height', index=['color', 'breed'])   

color | breed | height
--- | --- | ---
Black | Lab | 2.0
Brown | Beagle | 1.0
Brown | Corgi | 1.0
Brown | Mixed | 1.5


In [134]:
# More colloquial way of writing the above example
# Pivot height by color (rows) vs breed (columns); fill_value=0 fills combinations with no data
df.pivot_table(values='height', index='color', columns='breed', fill_value=0) 

color (rows) / breed (columns) | Beagle | Corgi | Lab | Mixed
--- | --- | --- | --- | ---
Black | 0 | 0 | 2 | 0.0
Brown | 1 | 1 | 0 | 1.5
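
The `margins` option from the table can be combined with a non-default aggregate function. This sketch assumes the `aggfunc` parameter (not listed in the options table above) and a reduced copy of the notebook data named `dogs`; it reports the maximum height per color plus an overall row.

```python
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed", "Lab", "Lab", "Corgi"],
                     "color": ["Brown", "Brown", "Black", "Black", "Brown"],
                     "height": [1, 1.5, 2, 2, 1]})

# aggfunc replaces the default mean; margins adds a row labelled "All"
table = dogs.pivot_table(values="height", index="color",
                         aggfunc="max", margins=True)

print(table)
```

The "All" row applies the same aggregate function to the whole column, so here it is the overall maximum rather than a sum.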


### iterrows

- Iterate over df rows as (index, Series) pairs
- Access specific column of a row by bracket notation

In [135]:
twod_dict = { 'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
              'color': ['Brown', 'Brown', 'Black','Black', 'Brown'],
           'height': [1, 1.5, 2, 2, 1 ],
           'weight': [25, 45, 65, pd.NA, 27]}
gen = pd.DataFrame(twod_dict).iterrows()  # generator object

for idx, row in gen:
    print(row["color"])

Brown
Brown
Black
Black
Brown


### itertuples

- Iterate over df rows using named tuples
	- Access specific column of a row by dot notation

In [136]:
twod_dict = { 'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
              'color': ['Brown', 'Brown', 'Black','Black', 'Brown'],
           'height': [1, 1.5, 2, 2, 1 ],
           'weight': [25, 45, 65, pd.NA, 27]}
map_obj = pd.DataFrame(twod_dict).itertuples()

for row in map_obj:
    print(row.breed)


Beagle
Mixed
Lab
Lab
Corgi
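
By default each named tuple also carries the row label as its first field, `Index`. The sketch below, on a two-row frame with the illustrative name `dogs`, shows the `index` and `name` parameters that control this.

```python
import pandas as pd

dogs = pd.DataFrame({"breed": ["Beagle", "Mixed"], "height": [1, 1.5]})

labels = []
# index=False leaves out the Index field; name sets the namedtuple class name
for row in dogs.itertuples(index=False, name="Dog"):
    labels.append(f"{row.breed}: {row.height}")

# With the defaults, the row label is available as row.Index
row_labels = [row.Index for row in dogs.itertuples()]

print(labels, row_labels)
```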
