# Library Comparisons
> [Table of Contents](../README.md)

## Python, Numpy, Pandas

## In this notebook
- Select/Subset Data
	- Numpy
	- Pandas
		- In a nutshell
		- Pandas general notation
		- Pandas .loc[] label-based indexing
		- Pandas .iloc[] position-based indexing
		- Pandas conditional subsets
- Loop over data structures

In [252]:
import pandas as pd
import numpy as np

In [253]:
# NOTEBOOK DATA
oned_list = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
oned_bool_list = [True, False, False, False, True, True, True]
twod_list1 = [[1,2,3],[4,6,2],[0,7,1]]
oned_dict = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
twod_dict = { 'spain': { 'capital':'madrid', 'population':46.77 },
              'france': { 'capital':'paris', 'population':66.03 },
              'germany': { 'capital':'berlin', 'population':80.62 },
              'norway': { 'capital':'oslo', 'population':5.084 } }

## Select/Subset Data

### Numpy
- [row,col]

In [254]:
arr = np.array(twod_list1)

In [255]:
arr[2,1] 

7

In [256]:
arr[:2,1:2]  # slice rows, slice columns

array([[2],
       [6]])
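The slice above returns a 2-D result because the column is selected with a slice. As a small sketch (the variable names `col_1d` and `col_2d` are illustrative, not from the notebook), an integer column index drops that axis, while a one-element slice keeps it:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 6, 2], [0, 7, 1]])

col_1d = arr[:2, 1]     # integer column index drops the column axis
col_2d = arr[:2, 1:2]   # column slice keeps the column axis

print(col_1d.shape)     # (2,)
print(col_2d.shape)     # (2, 1)
```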

### Pandas

In [257]:
df = pd.DataFrame(twod_dict)
df

|  | spain | france | germany | norway |
| --- | --- | --- | --- | --- |
| capital | madrid | paris | berlin | oslo |
| population | 46.77 | 66.03 | 80.62 | 5.084 |


#### In a nutshell

> Slices work best on sorted indices because a slice returns a run of consecutive values

> loc + slice = power combo

```python
# general square bracket syntax
df[col]          # single column label -> column as series
df[[col]]        # list of column labels -> column(s) as dataframe
df[start:stop]   # a slice selects rows (the only way plain brackets select rows)
# df[row, col] is not supported; use .loc / .iloc to select rows & columns

# .loc, .iloc -> everything before the comma is rows, everything after is columns
df.loc['first_row':'last_row', 'first_col':'last_col']  # rows & cols -> last labels are inclusive
df.loc['row_label']        # row as series
df.loc[['row_label']]      # row as dataframe
df.loc[:, 'col_label']     # col as series
df.loc[:, ['col_label']]   # col as dataframe
df.iloc[1:5, 1:5]          # rows & cols -> last index is excluded
df.iloc[0]                 # row as series
df.iloc[[0]]               # row as dataframe
df.iloc[:, 1]              # col as series
df.iloc[:, 1:5]            # cols as dataframe (a slice always returns a dataframe)
df.iloc[:, [1, 2]]         # cols as dataframe

# multi-index subsets
df.loc[('first_row_outer_label', 'first_row_inner_label'):('last_row_outer_label', 'last_row_inner_label'), 'first_col':'last_col']
```
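To make the inclusive/exclusive difference concrete, here is a minimal sketch on a small made-up frame (`demo` and its labels are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical data: four labelled rows, three labelled columns
demo = pd.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
    index=['w', 'x', 'y', 'z'],
)

by_label = demo.loc['w':'y', 'a':'b']   # end labels are included
by_position = demo.iloc[0:2, 0:1]       # end positions are excluded

one_col_series = demo.loc[:, 'a']       # scalar column label -> Series
one_col_frame = demo.loc[:, ['a']]      # list of labels -> DataFrame

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 1)
```

Wrapping a label or position in a list is what switches the result from a series to a dataframe.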

#### Pandas general notation
- Useful for accessing columns
- Cannot access a single row by label or position; only a slice (e.g. `df[:1]`) selects rows
- Cannot access rows & columns in one expression

In [258]:
# row access (dataframe)
# df[0]              <--- error
# df['population']   <--- error
# df[['population']] <--- error
df[:1]               # must slice

|  | spain | france | germany | norway |
| --- | --- | --- | --- | --- |
| capital | madrid | paris | berlin | oslo |


In [259]:
# column access (series)
df['spain']  # bracket notation
df.spain     # dot notation

capital       madrid
population     46.77
Name: spain, dtype: object

In [260]:
# column access (dataframe)
df[['spain']]   # bracket notation
df[['spain', 'norway']]

|  | spain | norway |
| --- | --- | --- |
| capital | madrid | oslo |
| population | 46.77 | 5.084 |


In [261]:
# row & column access
# df[:, ['spain']]    <---- error
# df[:1, ['spain']]   <---- error
# df[1, ['spain']]    <---- error

#### Pandas .loc[] label-based indexing
- [row,col] similar to numpy arrays
- if the comma is missing then just row(s)
- selections can be out of order
- use slices for ordered sections

In [262]:
# row access (series)
df.loc['capital']

spain      madrid
france      paris
germany    berlin
norway       oslo
Name: capital, dtype: object

In [263]:
# row access (dataframe)
df.loc[:]                          # all rows
df.loc[['capital']]                # select rows
df.loc[['capital', 'population']]  # select rows

|  | spain | france | germany | norway |
| --- | --- | --- | --- | --- |
| capital | madrid | paris | berlin | oslo |
| population | 46.77 | 66.03 | 80.62 | 5.084 |


In [264]:
# column access (dataframe)
# df.loc['spain']       <-------- error 
# df.loc[['spain']]     <-------- error
df.loc[:, 'norway']     # one column (series)
df.loc[:, ['norway']]   # select columns (dataframe)

|  | norway |
| --- | --- |
| capital | oslo |
| population | 5.084 |


In [265]:
# row & column access (dataframe)
df.loc[:, :]                                             # all rows, all columns
df.loc[['population', 'capital'], ['norway', 'france']]  # select rows, select columns (out of order)

|  | norway | france |
| --- | --- | --- |
| population | 5.084 | 66.03 |
| capital | oslo | paris |
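The cells above use lists of labels; the "use slices for ordered sections" bullet can be sketched as follows. The frame is rebuilt here under the illustrative name `countries` so the snippet runs on its own:

```python
import pandas as pd

countries = pd.DataFrame({
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
    'germany': {'capital': 'berlin', 'population': 80.62},
    'norway': {'capital': 'oslo', 'population': 5.084},
})

# Label slice over columns: both end labels are included
middle = countries.loc[:, 'france':'germany']

print(list(middle.columns))   # ['france', 'germany']
```

A label slice follows the order in which the labels appear in the index, so it returns everything between the two labels in that order.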


#### Pandas .iloc[] position-based indexing
- [row,col] similar to numpy arrays
- if the comma is missing then just row(s)
- selections can be out of order
- use slices for ordered sections

In [266]:
# row access (series)
df.iloc[0]

spain      madrid
france      paris
germany    berlin
norway       oslo
Name: capital, dtype: object

In [267]:
# row access (dataframe)
df.iloc[:]                # all rows
df.iloc[[1,0]]            # select rows  (out of order)
df.iloc[0:]               # select rows

|  | spain | france | germany | norway |
| --- | --- | --- | --- | --- |
| capital | madrid | paris | berlin | oslo |
| population | 46.77 | 66.03 | 80.62 | 5.084 |


In [268]:
# column access (series)
df.iloc[:, 3]             # one column

capital        oslo
population    5.084
Name: norway, dtype: object

In [269]:
# column access (dataframe)
df.iloc[:, [3,2]]         # select columns  (out of order)

|  | norway | germany |
| --- | --- | --- |
| capital | oslo | berlin |
| population | 5.084 | 80.62 |


In [270]:
# row & column access (dataframe)
df.iloc[:, :]           # all rows, all columns
df.iloc[[1], [3,1]]     # select rows, select columns (out of order)

|  | norway | france |
| --- | --- | --- |
| population | 5.084 | 66.03 |


In [271]:
# row & column access (dataframe)
df.iloc[0:, :2]         # ordered slices

|  | spain | france |
| --- | --- | --- |
| capital | madrid | paris |
| population | 46.77 | 66.03 |


#### Pandas conditional subsets

In [272]:
twod_dict = { 'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
              'color': ['Brown', 'Brown', 'Black', 'Black', 'Brown'],
              'height': [1, 1.5, 2, 2, 1],
              'weight': [25, 45, 65, pd.NA, 27] }
df = pd.DataFrame(twod_dict)

##### Subset/Filter entire dataframe

```python
# NOTE: BEWARE of differing datatypes when filtering on entire dataframes
# df must have same data types as the value used for comparison 

# Subset/Filter ENTIRE dataframe and select ENTIRE df
new_df = df[df comparison_operator value]

# Subset/Filter ENTIRE dataframe AND select ONE column
new_df_contain_select_one_col = df[df comparison_operator value]['col_name']

# Subset/Filter ENTIRE dataframe AND select MULTIPLE columns
new_df_contain_select_multi_cols = df[df comparison_operator value][['col_name' , 'col_name']]
```
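As a concrete instance of the template above, the sketch below filters an all-numeric frame (`scores` is made-up data, used so that every column has the same type as the comparison value). Note that the result keeps the original shape: cells that fail the test become NaN rather than being dropped.

```python
import pandas as pd

# Hypothetical all-numeric data so the comparison is valid for every column
scores = pd.DataFrame({'math': [55, 80, 92], 'art': [70, 40, 88]})

masked = scores[scores > 60]            # same shape; failing cells become NaN
same_as_where = scores.where(scores > 60)

print(masked)
```

Indexing with a boolean dataframe behaves like `DataFrame.where`, which is why both results match.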

##### Subset/Filter one column

```python
# Subset/Filter ONE column and select ENTIRE df
new_df = df[df['column_name'] comparison_operator value]

# Subset/Filter ONE column and select one column
new_df_contain_select_one_col = df[df['column_name'] comparison_operator value]['col_name']

# Subset/Filter ONE column and select multiple columns
new_df_contain_select_multi_cols = df[df['column_name'] comparison_operator value][['col_name', 'col_name']]
```

In [273]:
# Subset/Filter on ONE column and select ENTIRE df
df[df['height'] > 1.75]

|  | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 2 | Lab | Black | 2.0 | 65.0 |
| 3 | Lab | Black | 2.0 | `<NA>` |


In [274]:
# Subset/Filter on ONE column and select one column
df[df['height'] > 1.75]['breed']

2    Lab
3    Lab
Name: breed, dtype: object

In [275]:
# Subset/Filter on ONE column and select multiple columns
df[df['height'] > 1.75][['breed', 'color', 'height']]

|  | breed | color | height |
| --- | --- | --- | --- |
| 2 | Lab | Black | 2.0 |
| 3 | Lab | Black | 2.0 |


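##### Subset/Filter multiple columns

> TODO: not working as expected. Indexing with a boolean dataframe masks individual cells instead of dropping rows, so every row is kept and non-matching cells show as NaN (see outputs below).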

```python
# Subset/Filter on MULTIPLE columns and select ENTIRE df
new_df = df[df[['column_name', 'col_name']] comparison_operator value]

# Subset/Filter on MULTIPLE columns and select one column
new_df_contain_select_one_col = df[df[['column_name', 'col_name']] comparison_operator value]['col_name']

# Subset/Filter on MULTIPLE columns and select multiple columns
new_df_contain_select_multi_cols = df[df[['column_name', 'col_name']] comparison_operator value][['col_name', 'col_name']]
```

In [276]:
# Subset/Filter on MULTIPLE columns and select ENTIRE df
df[df[['color', 'height', 'weight']] > ['Black', 1.5, 40]]

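|  | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 0 | NaN | Brown | NaN | NaN |
| 1 | NaN | Brown | NaN | 45.0 |
| 2 | NaN | NaN | 2.0 | 65.0 |
| 3 | NaN | NaN | 2.0 | NaN |
| 4 | NaN | Brown | NaN | NaN |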


In [277]:
# Subset/Filter on MULTIPLE columns and select ENTIRE df
df[df[['height', 'weight']] > [ 1.5, 40]]

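|  | breed | color | height | weight |
| --- | --- | --- | --- | --- |
| 0 | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | 45.0 |
| 2 | NaN | NaN | 2.0 | 65.0 |
| 3 | NaN | NaN | 2.0 | NaN |
| 4 | NaN | NaN | NaN | NaN |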


In [278]:
# Subset/Filter on MULTIPLE columns and select one column
df[df[['height', 'weight']] > [1.5, 40]]['weight']

0    NaN
1     45
2     65
3    NaN
4    NaN
Name: weight, dtype: object

In [279]:
# Subset/Filter on MULTIPLE columns and select multiple columns
df[df[['height', 'weight']] > [1.5, 40]][['height', 'weight']]

|  | height | weight |
| --- | --- | --- |
| 0 | NaN | NaN |
| 1 | NaN | 45.0 |
| 2 | 2.0 | 65.0 |
| 3 | 2.0 | NaN |
| 4 | NaN | NaN |
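The all-NaN rows in the outputs above appear because a boolean dataframe masks cells rather than selecting rows. One possible way to get row filtering, sketched here as a suggestion rather than the notebook's own solution, is to collapse the mask to one boolean per row with `.all(axis=1)` or `.any(axis=1)`. The frame `dogs` is a simplified copy of the data that uses a float NaN instead of `pd.NA`, which is an assumption made to keep the columns numeric.

```python
import pandas as pd

# Simplified copy of the dog data (float NaN instead of pd.NA: an assumption)
dogs = pd.DataFrame({
    'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
    'height': [1, 1.5, 2, 2, 1],
    'weight': [25.0, 45.0, 65.0, float('nan'), 27.0],
})

# One boolean per cell: height > 1.5 and weight > 40, column by column
mask = dogs[['height', 'weight']] > [1.5, 40]

all_match = dogs[mask.all(axis=1)]   # rows where every condition holds
any_match = dogs[mask.any(axis=1)]   # rows where at least one condition holds

print(all_match)
print(any_match)
```

Comparisons against NaN evaluate to False, so the row with the missing weight only survives the `any` variant.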


## Loop over data structures
Useful functions and methods

Data Structure | Name | Type | What does it do
--- | --- | --- | ---
List | enumerate | function | access to index and value
Dict | keys | method | access to keys
Dict | items | method | access to key and value
Dict | values | method | access to values
np.ndarray | np.nditer | function | access to every element in an N-D array
pd.DataFrame | df.iterrows | method | access to each row as two parts: label and series
pd.DataFrame | df.itertuples | method | access to each row as one part: named tuple
pd.Series | Series.apply(function_name) | method | apply the given function to every element of the selected series (column)



In [280]:
# Python List
for item in oned_list:
  pass

for index, item in enumerate(oned_list):
  pass

In [281]:
# Python dict
for key in oned_dict.keys():
  pass

for key, item in oned_dict.items():
  pass

for item in oned_dict.values():
  pass

In [282]:
# Numpy array
for item in np.array(oned_list):
  pass

for item in np.array(twod_list1):
  pass     # item are rows

for item in np.nditer(np.array(twod_list1)):
  pass     # item is every element in ND array
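Since the loop bodies above are placeholders, here is a small sketch of what each style of iteration yields for the 2-D case (the names `grid`, `rows`, and `elements` are illustrative):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 6, 2], [0, 7, 1]])

# Plain iteration walks the first axis, so each item is a whole row
rows = [row.tolist() for row in grid]

# np.nditer visits every element; each item is a zero-dimensional array
elements = [int(item) for item in np.nditer(grid)]

print(rows)       # [[1, 2, 3], [4, 6, 2], [0, 7, 1]]
print(elements)   # [1, 2, 3, 4, 6, 2, 0, 7, 1]
```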

In [283]:
# Pandas
twod_dict_two = { 'cars_per_cap': [809, 200, 70, 45, 150],
                  'country': ['United States', 'Russia', 'Morocco', 'Egypt', 'China'],
                  'drives_right': [True, False, False, True, True] }
labels = ['US', 'RU', 'MO', 'EG', 'CH']
df = pd.DataFrame(twod_dict_two, index=labels)

In [284]:
# Pandas dataframe
for row_label, series in df.iterrows():
  # print(row_label, series)
  pass


In [285]:
# Pandas dataframe
for row_as_named_tuple in df.itertuples():
  # print(row_as_named_tuple)
  pass
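To show how the two row iterators differ in practice, the sketch below reads the same field both ways on a two-row copy of the data (`cars` is an illustrative name). With `iterrows` the value is looked up by column label on a series; with `itertuples` it is an attribute on a named tuple, and the row label is available as `Index`.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 200], 'country': ['United States', 'Russia']},
    index=['US', 'RU'],
)

# iterrows: (label, Series) pairs; fields are read by column label
from_rows = [(label, row['country']) for label, row in cars.iterrows()]

# itertuples: one named tuple per row; the row label is the Index field
from_tuples = [(t.Index, t.country) for t in cars.itertuples()]

print(from_rows)
print(from_tuples)
```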

In [286]:
# Pandas dataframe
# Iterate over selected series (column) and apply given function
df['country'].apply(len)   # returns series

# add a new column to df
df['name_length'] = df['country'].apply(len)
df

|  | cars_per_cap | country | drives_right | name_length |
| --- | --- | --- | --- | --- |
| US | 809 | United States | True | 13 |
| RU | 200 | Russia | False | 6 |
| MO | 70 | Morocco | False | 7 |
| EG | 45 | Egypt | True | 5 |
| CH | 150 | China | True | 5 |
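For string columns specifically, pandas also offers vectorized string methods through the `.str` accessor. As a hedged aside (not something the notebook itself uses), the name-length column could be computed either way; `cars` below is a two-row illustrative copy of the data:

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Russia']},
    index=['US', 'RU'],
)

lengths_apply = cars['country'].apply(len)   # call len on each element
lengths_str = cars['country'].str.len()      # vectorized string length

print(lengths_apply.tolist())   # [13, 6]
print(lengths_str.tolist())     # [13, 6]
```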
