# Getting started
### Jupyter Notebook Shortcuts:

Commenting: ctrl + '/'<br>
Run code: shift + enter<br>

### Jupyter's magic function:
https://ipython.readthedocs.io/en/stable/interactive/magics.html
* <b>%%time</b>: times the total execution of a cell. It works only in a notebook (.ipynb file) and must be removed when the code is saved as a .py script.

Get started by importing the os and pandas libraries; abbreviate pandas as pd.

The os.getcwd() function returns the current working directory, which is where the notebook resides. The print() function displays it.

# <a name='toc'>Table of Contents</a>

<ul style='font-size:12pt;line-height:1.8em'>
    <li><a href='#3_0'>3.0 Data Wrangling</a></li>
    <ul>
        <li><a href='#3_1'>3.1 Working with numeric data</a></li>
        <li><a href='#3_2'>3.2 Working with string data</a></li>
        <li><a href='#3_3'>3.3 Working with datetime data</a></li>
        <li><a href='#3_4'>3.4 Data transformation</a></li>
        <li><a href='#3_5'>3.5 Data mapping</a></li>
        <li><a href='#3_6'>3.6 Pandas Data Joining</a></li>
        <ul>
            <li><a href='#3_6a'>3.6a Pandas Concat</a></li>
            <li><a href='#3_6b'>3.6b Pandas Merge</a></li>
            <li><a href='#3_6c'>3.6c Exclusive Joining</a></li>
        </ul>
        <li><a href='#3_7'>3.7 Check for duplicate keys w/ merge validation</a></li>
        <li><a href='#3_8'>3.8 Merge Indicator</a></li>
    </ul>
</ul>


In [1]:
import os  #https://docs.python.org/3/library/os.html
import pandas as pd  #https://pandas.pydata.org/pandas-docs/stable/reference/
import numpy as np  #https://numpy.org/doc/stable/reference/index.html#reference
import pyodbc
# pd.options.mode.chained_assignment = None
# note: instruction written for Pandas 1.2.4

## Setting up the working directory with a class object ##
class loc:
    d0 = os.getcwd() + '\\'  #get current working directory
    d1 = d0 + 'data\\'
#     pdr = '\\\\clinisilonhh\\ifs\\PHI_Access\\PHI-CO - Data Science Share\\'
#     odr = '\\\\clinisilonhh\\ifs\\PHI_Access\\PHI-CO - System Stroke Data Crosswalk\\'
    
print(loc.d0)  #printing the current working location

C:\Users\E1724299\Desktop\Python Curriculum 2.0\


In [2]:
%%time
def print_full(x):  # Display all df columns and rows
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    display(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')

def df_unique(df): #retrieve unique value from each column of a dataframe
    # https://stackoverflow.com/a/68813683
    uni = pd.Series({col:list(df[col].unique()) for col in df})
    return uni

# this function returns a Series object that shows the unique values in each column of a dataframe.
# df_unique(csv)

Wall time: 0 ns
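
Because %%time is an IPython cell magic, a plain .py script needs another way to time a block of code. The following is a minimal sketch using the standard library's time.perf_counter; the summation is only a stand-in workload and is not part of the original notebook.

```python
import time

start = time.perf_counter()        # high-resolution clock from the standard library
total = sum(range(1_000_000))      # stand-in for the work a notebook cell would do
elapsed = time.perf_counter() - start

print(f"Wall time: {elapsed:.4f} s")
```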


# Example dataset

In [3]:
# csv = pd.read_csv(loc.d1 + 'raw_data2.csv')
csv = pd.read_csv(loc.d1 + 'raw_data_alt.csv')
csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112443 entries, 0 to 112442
Data columns (total 30 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ids             112443 non-null  object 
 1   AdmitDate       112443 non-null  object 
 2   AdmitTime       112443 non-null  int64  
 3   DischargeDate   112293 non-null  object 
 4   DischargeTime   112293 non-null  float64
 5   EntityCode      112443 non-null  object 
 6   Loc1            112443 non-null  object 
 7   Loc2            86881 non-null   object 
 8   Loc3            63902 non-null   object 
 9   Loc4            31045 non-null   object 
 10  Loc5            14177 non-null   object 
 11  PtType1         112443 non-null  object 
 12  PtType2         86881 non-null   object 
 13  PtType3         63902 non-null   object 
 14  PtType4         31045 non-null   object 
 15  PtType5         14177 non-null   object 
 16  DateIn1         112443 non-null  object 
 17  DateIn2   

In [4]:
print_full(csv.head())

Unnamed: 0,ids,AdmitDate,AdmitTime,DischargeDate,DischargeTime,EntityCode,Loc1,Loc2,Loc3,Loc4,Loc5,PtType1,PtType2,PtType3,PtType4,PtType5,DateIn1,DateIn2,DateIn3,DateIn4,DateIn5,TimeIn1,TimeIn2,TimeIn3,TimeIn4,TimeIn5,EncStatus,FinancialClass,VisitType,cost
0,E0122776959240,1/1/2013,627,1/2/2013,1422.0,MC,EDMC,VUMC,VUMC,J6MB,,ER,OU,IP,IP,,1/1/2013,1/1/2013,1/1/2013,1/1/2013,,1031,1417.0,1417.0,1525.0,,Cancelled,MH - MEDICARE HMO,OP,$1.94
1,E0192437769970,1/1/2013,1507,1/1/2013,1845.0,MC,JANT,,,,,OU,,,,,1/1/2013,,,,,1244,,,,,Discharged,SP - SELF-PAY,Obs,$69.68
2,E0238210647860,1/1/2013,1353,1/2/2013,1726.0,SE,JANT,,,,,OU,,,,,1/1/2013,,,,,1637,,,,,Preadmit,HM - HMO,Obs,$75.59
3,E0295116781805,1/1/2013,625,1/2/2013,1520.0,HH,EDMC,VUMC,J7EC,,,ER,OU,OU,,,1/1/2013,1/1/2013,1/1/2013,,,1657,2016.0,2130.0,,,Active,MH - MEDICARE HMO,ED,$66.22
4,E0576457975867,1/1/2013,56,1/3/2013,1515.0,SE,EDMC,VUMC,VUMC,J7MB,,ER,OU,IP,IP,,1/1/2013,1/1/2013,1/1/2013,1/1/2013,,1050,1551.0,1703.0,1740.0,,Cancelled,MH - MEDICARE HMO,ED,$94.28


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

# <a name='3_0'>3.0 Data Wrangling</a>
## <a name='3_1'>3.1 Working with numeric data</a>

type | variation | function | example
- | - | - | -
int | int8, int16, int32, int64 | Stores whole numbers (integers) only. Does not accept null values | 1, 2
int | Int64 | A special nullable integer data type that can store null values | **
float | float16, float32, float64 | Stores numbers with decimal places. Accepts null values | 1.0, 2.2

** https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
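
The difference between the plain and nullable integer types in the table can be sketched as follows. The three-value Series is made up for illustration.

```python
import pandas as pd

raw = pd.Series([1, 2, None])      # a missing value makes pandas fall back to float64
nullable = raw.astype('Int64')     # nullable integer dtype keeps whole numbers plus <NA>

print(raw.dtype)                   # float64
print(nullable.dtype)              # Int64

try:
    raw.astype('int64')            # plain NumPy integers cannot represent a missing value
    failed = False
except (ValueError, TypeError) as err:
    failed = True
    print('conversion failed:', err)
```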

Additional Example:

https://towardsdatascience.com/converting-data-to-a-numeric-type-in-pandas-db9415caab0b

https://github.com/BindiChen/machine-learning/blob/main/data-analysis/036-pandas-change-data-to-numeric-type/change-data-to-a-numeric-type.ipynb

In [5]:
p31e1 = csv[['TimeIn1','TimeIn2']]  #changing datatype from int to float
display(p31e1, p31e1.astype('float64'))

Unnamed: 0,TimeIn1,TimeIn2
0,1031,1417.0
1,1244,
2,1637,
3,1657,2016.0
4,1050,1551.0
...,...,...
112438,1247,1611.0
112439,1536,2127.0
112440,30,500.0
112441,618,826.0


Unnamed: 0,TimeIn1,TimeIn2
0,1031.0,1417.0
1,1244.0,
2,1637.0,
3,1657.0,2016.0
4,1050.0,1551.0
...,...,...
112438,1247.0,1611.0
112439,1536.0,2127.0
112440,30.0,500.0
112441,618.0,826.0


### pd.to_numeric()
This function converts its argument to a numeric type. With errors='coerce', values that cannot be parsed become NaN. <br>
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html
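
As a small sketch of the errors argument, the example below compares the default behaviour (raise an error) with errors='coerce'. The three bed values are a made-up subset in the same style as the list used in the next cell.

```python
import pandas as pd

beds = pd.Series(['18', '1D', 'nan'])             # illustrative values only

coerced = pd.to_numeric(beds, errors='coerce')    # unparseable strings become NaN
print(coerced.tolist())

try:
    pd.to_numeric(beds)                           # default errors='raise' stops at the first bad value
    raised = False
except ValueError as err:
    raised = True
    print('to_numeric failed:', err)
```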

In [6]:
%%time
dat = ['18', '3', '0', '1', 'nan', '8', '5', '6', '14', '12', '7', '1P',
       '9', '2', '15', '10', '11', '4', '16', '27', '20', '19', '1D',
       '29', 'F3', 'F1', '17', '2W', '21', '13', 'SU', '24', '22', '34',
       '6A', '32', '36', '6B', '5A', '30', 'F5', '25', 'F2', '23', '43',
       '6D', '35', '33', '44', '26', '31', '5B', '47', '49', '46', '37',
       'F4', '28', '6C', '40', '41', '38', '39', '51', '45', '48', '50',
       '42', '71', '5D', 'F6', '57', 'A', '3B', 'F7', 'B', 'C', '60',
       'P1', 'F8', 'PX']

p31e2 = pd.DataFrame(dat, columns=['Bed1'])
p31e2['Bed1_new'] = pd.to_numeric(p31e2['Bed1'], errors='coerce')

print('before:\n',p31e2['Bed1'].unique(),'\nafter:\n',p31e2['Bed1_new'].unique())
display(p31e2[p31e2['Bed1'].isin(['1D','P1','F8','PX'])].filter(['Bed1','Bed1_new']))

before:
 ['18' '3' '0' '1' 'nan' '8' '5' '6' '14' '12' '7' '1P' '9' '2' '15' '10'
 '11' '4' '16' '27' '20' '19' '1D' '29' 'F3' 'F1' '17' '2W' '21' '13' 'SU'
 '24' '22' '34' '6A' '32' '36' '6B' '5A' '30' 'F5' '25' 'F2' '23' '43'
 '6D' '35' '33' '44' '26' '31' '5B' '47' '49' '46' '37' 'F4' '28' '6C'
 '40' '41' '38' '39' '51' '45' '48' '50' '42' '71' '5D' 'F6' '57' 'A' '3B'
 'F7' 'B' 'C' '60' 'P1' 'F8' 'PX'] 
after:
 [18.  3.  0.  1. nan  8.  5.  6. 14. 12.  7.  9.  2. 15. 10. 11.  4. 16.
 27. 20. 19. 29. 17. 21. 13. 24. 22. 34. 32. 36. 30. 25. 23. 43. 35. 33.
 44. 26. 31. 47. 49. 46. 37. 28. 40. 41. 38. 39. 51. 45. 48. 50. 42. 71.
 57. 60.]


Unnamed: 0,Bed1,Bed1_new
22,1D,
78,P1,
79,F8,
80,PX,


Wall time: 0 ns


In [7]:
%%time
p31e3 = csv[['cost']].copy()  # copy the slice so the assignment below does not trigger SettingWithCopyWarning
p31e3['cost_new'] = pd.to_numeric(p31e3['cost'], errors='coerce')  # values containing '$' cannot be parsed and become NaN
p31e3

Wall time: 46.9 ms


Unnamed: 0,cost,cost_new
0,$1.94,
1,$69.68,
2,$75.59,
3,$66.22,
4,$94.28,
...,...,...
112438,$19.26,
112439,$33.08,
112440,$95.27,
112441,$22.12,


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

## <a name='3_2'>3.2 Working with string data</a>

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

### str.replace()
This method replaces each occurrence of a pattern (a literal string or a regex) in each string with a given replacement string. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
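
Since '$' is a special character in regular expressions, it helps to state explicitly whether the pattern is literal or a regex. The sketch below shows both options on two made-up cost strings; inside a character class such as [$,] the dollar sign is matched literally.

```python
import pandas as pd

cost = pd.Series(['$1,234.50', '$69.68'])   # made-up values shaped like the cost column

# regex=False: '$' and ',' are removed as plain characters
literal = cost.str.replace('$', '', regex=False).str.replace(',', '', regex=False)

# regex=True: a character class removes both symbols in one pass
pattern = cost.str.replace(r'[$,]', '', regex=True)

print(literal.astype('float64').tolist())
print(pattern.astype('float64').tolist())
```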

In [8]:
%%time
p32e1 = csv[['cost']]
p32e1['cost_new'] = (p32e1['cost'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)).astype('float64')  # regex=False treats '$' and ',' as literal characters
p32e1

Wall time: 46.9 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,cost,cost_new
0,$1.94,1.94
1,$69.68,69.68
2,$75.59,75.59
3,$66.22,66.22
4,$94.28,94.28
...,...,...
112438,$19.26,19.26
112439,$33.08,33.08
112440,$95.27,95.27
112441,$22.12,22.12
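
The SettingWithCopyWarning printed above appears because the new column is assigned to a DataFrame that was created by selecting columns from csv. One common way to avoid it, sketched here on a small made-up frame standing in for csv, is to take an explicit copy before adding columns.

```python
import pandas as pd

# made-up stand-in for the csv DataFrame
df = pd.DataFrame({'cost': ['$1.94', '$69.68'], 'VisitType': ['OP', 'Obs']})

subset = df[['cost']].copy()   # explicit copy: later assignments target a separate DataFrame
subset['cost_new'] = subset['cost'].str.replace('$', '', regex=False).astype('float64')

print(subset)
print(df.columns.tolist())     # the original frame is unchanged
```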


In [9]:
%%time
p32e1['cost_new3'] = (p32e1['cost'].str.replace(r'[$,]', '', regex=True)).astype('float64')  # raw string; '$' and ',' need no escaping inside a character class

p32e1

Wall time: 78.3 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,cost,cost_new,cost_new3
0,$1.94,1.94,1.94
1,$69.68,69.68,69.68
2,$75.59,75.59,75.59
3,$66.22,66.22,66.22
4,$94.28,94.28,94.28
...,...,...,...
112438,$19.26,19.26,19.26
112439,$33.08,33.08,33.08
112440,$95.27,95.27,95.27
112441,$22.12,22.12,22.12


### str.split()
This method splits each string on a specified delimiter/separator. With expand=True, the pieces are returned as separate columns. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
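
A short sketch of the difference between the default list output and expand=True, using two values copied from the FinancialClass column. The n=1 argument, which limits the number of splits, is an addition for illustration and is not used in the notebook cell below.

```python
import pandas as pd

fin = pd.Series(['MH - MEDICARE HMO', 'SP - SELF-PAY'])

as_lists = fin.str.split(' - ')                       # one Python list per row
as_columns = fin.str.split(' - ', n=1, expand=True)   # at most one split, spread over two columns
as_columns.columns = ['FinClass', 'FinClass_desc']

print(as_lists.tolist())
print(as_columns)
```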

In [10]:
%%time
p32e2 = csv[['FinancialClass']].copy()  # copy the slice before adding columns
location = p32e2['FinancialClass'].str.split(' - ', expand=True)
p32e2['FinClass'] = location[0]
p32e2['FinClass_desc'] = location[1]

p32e2

Wall time: 136 ms


Unnamed: 0,FinancialClass,FinClass,FinClass_desc
0,MH - MEDICARE HMO,MH,MEDICARE HMO
1,SP - SELF-PAY,SP,SELF-PAY
2,HM - HMO,HM,HMO
3,MH - MEDICARE HMO,MH,MEDICARE HMO
4,MH - MEDICARE HMO,MH,MEDICARE HMO
...,...,...,...
112438,HX - HEALTH INSURANCE EXCHANGE,HX,HEALTH INSURANCE EXCHANGE
112439,MR - MEDICARE,MR,MEDICARE
112440,MR - MEDICARE,MR,MEDICARE
112441,DH - MEDICAID HMO,DH,MEDICAID HMO


### str.strip(), str.lstrip(), str.rstrip()
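- The strip method removes leading and trailing characters.
- The lstrip method removes leading characters only.
- The rstrip method removes trailing characters only. <br>

The argument is treated as a set of characters, not as a prefix or suffix: str.strip('E0') removes every leading and trailing 'E' or '0', which is why ids that end in 0 also lose their trailing zeros in the output below.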

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html
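
To make the character-set behaviour concrete, the sketch below compares str.strip('E0') with removing only the literal prefix 'E0'. The anchored regex is one possible approach and is not taken from the notebook; the two ids come from the example dataset.

```python
import pandas as pd

ids = pd.Series(['E0122776959240', 'E0295116781805'])

stripped = ids.str.strip('E0')                             # strips any leading/trailing 'E' or '0'
prefix_removed = ids.str.replace(r'^E0', '', regex=True)   # removes only the prefix 'E0'

print(stripped.tolist())
print(prefix_removed.tolist())
```

The first id ends in 0, so strip() shortens it on both sides, while the anchored replacement leaves the trailing zero in place.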

In [11]:
%%time
p32e3 = csv[['ids']].copy()  # copy the slice before adding columns
p32e3['ID_strip'] = p32e3['ids'].str.strip('E0')    # removes any leading/trailing 'E' or '0'
p32e3['ID_rstrip'] = p32e3['ids'].str.rstrip('E0')  # trailing characters only
p32e3['ID_lstrip'] = p32e3['ids'].str.lstrip('E0')  # leading characters only

p32e3

Wall time: 62.5 ms


Unnamed: 0,ids,ID_strip,ID_rstrip,ID_lstrip
0,E0122776959240,12277695924,E012277695924,122776959240
1,E0192437769970,19243776997,E019243776997,192437769970
2,E0238210647860,23821064786,E023821064786,238210647860
3,E0295116781805,295116781805,E0295116781805,295116781805
4,E0576457975867,576457975867,E0576457975867,576457975867
...,...,...,...,...
112438,E9870946410920,987094641092,E987094641092,9870946410920
112439,E9875397672366,9875397672366,E9875397672366,9875397672366
112440,E9904534873864,9904534873864,E9904534873864,9904534873864
112441,E9927247603461,9927247603461,E9927247603461,9927247603461


### str.lower(), str.upper(), str.title()
str.lower() converts each string to lowercase. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html

str.upper() converts each string to uppercase. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.upper.html

str.title() converts each string to title case. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.title.html

In [12]:
%%time
p32e3 = p32e2[['FinClass_desc']].copy()  # copy the slice before adding columns
p32e3['FinancialClass_lower'] = p32e3['FinClass_desc'].str.lower()
p32e3['FinancialClass_upper'] = p32e3['FinClass_desc'].str.upper()
p32e3['FinancialClass_title'] = p32e3['FinClass_desc'].str.title()

p32e3

Wall time: 62.5 ms


Unnamed: 0,FinClass_desc,FinancialClass_lower,FinancialClass_upper,FinancialClass_title
0,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
1,SELF-PAY,self-pay,SELF-PAY,Self-Pay
2,HMO,hmo,HMO,Hmo
3,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
4,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
...,...,...,...,...
112438,HEALTH INSURANCE EXCHANGE,health insurance exchange,HEALTH INSURANCE EXCHANGE,Health Insurance Exchange
112439,MEDICARE,medicare,MEDICARE,Medicare
112440,MEDICARE,medicare,MEDICARE,Medicare
112441,MEDICAID HMO,medicaid hmo,MEDICAID HMO,Medicaid Hmo


### str.contains(), str.startswith(), str.endswith()
str.contains() checks whether each string contains a given pattern; the pattern is treated as a regex by default. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html

str.startswith() checks whether each string starts with a given character sequence; regular expressions are not accepted. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.startswith.html

str.endswith() checks whether each string ends with a given character sequence; regular expressions are not accepted. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.endswith.html
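
A minimal sketch of the three checks on a made-up Series that includes a missing value. The na=False argument, which is not used in the notebook cells below, makes missing values count as False so the result can be used directly as a filter.

```python
import pandas as pd

desc = pd.Series(['MEDICARE HMO', 'MEDICAID STAR', 'SELF-PAY', None])

has_care = desc.str.contains('care', case=False, na=False)   # case-insensitive search
care_or_pay = desc.str.contains(r'CARE|PAY', na=False)       # alternation relies on regex support
starts_med = desc.str.startswith('MED', na=False)            # literal prefix
ends_hmo = desc.str.endswith('HMO', na=False)                # literal suffix

print(desc[has_care].tolist())
```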

In [13]:
%%time
p32e3['FinClass_desc'].unique()

Wall time: 0 ns


array(['MEDICARE HMO', 'SELF-PAY', 'HMO', 'MEDICARE',
       'GOV THIRD PRTY LIAB', 'MEDICAID HMO', 'MEDICAID TRADITIONAL',
       'PPO', 'COMMERCIAL LOW BENES', 'BLUE CROSS', 'CHAMPUS',
       'COMMERCIAL', 'MEDICAID PENDING', 'SP THIRD PRTY LIAB',
       'MEDICAID OUT STATE', 'MHHS EMPLOYEE COV',
       'HEALTH INSURANCE EXCHANGE', 'WORK COMP TWC', 'GOVT PROGRAMS',
       'INSTITUTIONAL', 'MEDICARE PART B', 'WORK COMP',
       'INTERNATIONAL SP', 'INTERNATIONAL INS', 'RESEARCH',
       'WORK COMP MHHS', 'MEDICAID STAR'], dtype=object)

In [14]:
%%time
p32e3a = p32e3[p32e3['FinClass_desc'].str.contains('care', case=False)]
display(p32e3a, p32e3a['FinClass_desc'].unique())

Unnamed: 0,FinClass_desc,FinancialClass_lower,FinancialClass_upper,FinancialClass_title
0,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
3,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
4,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
6,MEDICARE,medicare,MEDICARE,Medicare
7,MEDICARE,medicare,MEDICARE,Medicare
...,...,...,...,...
112434,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
112436,MEDICARE,medicare,MEDICARE,Medicare
112437,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
112439,MEDICARE,medicare,MEDICARE,Medicare


array(['MEDICARE HMO', 'MEDICARE', 'MEDICARE PART B'], dtype=object)

Wall time: 62.5 ms


In [15]:
%%time
p32e3b = p32e3[p32e3['FinClass_desc'].str.startswith('MED')]
display(p32e3b, p32e3b['FinClass_desc'].unique())

Unnamed: 0,FinClass_desc,FinancialClass_lower,FinancialClass_upper,FinancialClass_title
0,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
3,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
4,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
6,MEDICARE,medicare,MEDICARE,Medicare
7,MEDICARE,medicare,MEDICARE,Medicare
...,...,...,...,...
112436,MEDICARE,medicare,MEDICARE,Medicare
112437,MEDICARE HMO,medicare hmo,MEDICARE HMO,Medicare Hmo
112439,MEDICARE,medicare,MEDICARE,Medicare
112440,MEDICARE,medicare,MEDICARE,Medicare


array(['MEDICARE HMO', 'MEDICARE', 'MEDICAID HMO', 'MEDICAID TRADITIONAL',
       'MEDICAID PENDING', 'MEDICAID OUT STATE', 'MEDICARE PART B',
       'MEDICAID STAR'], dtype=object)

Wall time: 46.9 ms


In [16]:
%%time
p32e3c = p32e3[p32e3['FinClass_desc'].str.endswith('CARE')]
display(p32e3c, p32e3c['FinClass_desc'].unique())

Unnamed: 0,FinClass_desc,FinancialClass_lower,FinancialClass_upper,FinancialClass_title
6,MEDICARE,medicare,MEDICARE,Medicare
7,MEDICARE,medicare,MEDICARE,Medicare
8,MEDICARE,medicare,MEDICARE,Medicare
14,MEDICARE,medicare,MEDICARE,Medicare
17,MEDICARE,medicare,MEDICARE,Medicare
...,...,...,...,...
112430,MEDICARE,medicare,MEDICARE,Medicare
112432,MEDICARE,medicare,MEDICARE,Medicare
112436,MEDICARE,medicare,MEDICARE,Medicare
112439,MEDICARE,medicare,MEDICARE,Medicare


array(['MEDICARE'], dtype=object)

Wall time: 31.2 ms


### str.zfill()
This method pads strings with leading zeros ('0') until they reach the given width; strings that are already at least that wide are left unchanged. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.zfill.html
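
For example, integer times such as 627 or 56 can be turned into four-character HHMM strings. This is a small sketch with made-up values; the conversion to str is needed because zfill works on strings only.

```python
import pandas as pd

times = pd.Series([627, 56, 1507])          # illustrative HHMM values stored as integers

padded = times.astype(str).str.zfill(4)     # convert to text, then left-pad with zeros to width 4
print(padded.tolist())
```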

In [17]:
p32e4 = csv[['AdmitDate','AdmitTime']]

p32e4['AdmitTime2'] = p32e4['AdmitTime'].astype(str).str.zfill(4)
p32e4[['AdmitTime','AdmitTime2']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  p32e4['AdmitTime2'] = p32e4['AdmitTime'].astype(str).str.zfill(4)


Unnamed: 0,AdmitTime,AdmitTime2
0,627,0627
1,1507,1507
2,1353,1353
3,625,0625
4,56,0056
...,...,...
112438,1329,1329
112439,2257,2257
112440,2045,2045
112441,1626,1626


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

## <a name='3_3'>3.3 Working with datetime data</a>
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html

https://towardsdatascience.com/10-tricks-for-converting-numbers-and-strings-to-datetime-in-pandas-82a4645fc23d

### pd.to_datetime()
This function converts a given argument to a datetime type. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html <br>
The format argument uses strftime-style directives (for example %m, %d, %Y, %H, %M), the same codes used to produce a formatted string representation of a date. <br> 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.strftime.html
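
The sketch below parses concatenated date and time strings with an explicit format and shows how errors='coerce' turns an unparseable value into NaT. The third value is deliberately invalid and made up for illustration.

```python
import pandas as pd

stamps = pd.Series(['1/1/20130627', '6/30/20192257', 'nan9999'])

# %m/%d/%Y is the date part, %H%M the zero-padded 24-hour time part
parsed = pd.to_datetime(stamps, format='%m/%d/%Y%H%M', errors='coerce')
print(parsed)
```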

In [18]:
%%time
p32e4['AdmitDatetime'] = pd.to_datetime(p32e4['AdmitDate'] + p32e4['AdmitTime2'],format='%m/%d/%Y%H%M')
p32e4

Wall time: 202 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,AdmitDate,AdmitTime,AdmitTime2,AdmitDatetime
0,1/1/2013,627,0627,2013-01-01 06:27:00
1,1/1/2013,1507,1507,2013-01-01 15:07:00
2,1/1/2013,1353,1353,2013-01-01 13:53:00
3,1/1/2013,625,0625,2013-01-01 06:25:00
4,1/1/2013,56,0056,2013-01-01 00:56:00
...,...,...,...,...
112438,6/30/2019,1329,1329,2019-06-30 13:29:00
112439,6/30/2019,2257,2257,2019-06-30 22:57:00
112440,6/30/2019,2045,2045,2019-06-30 20:45:00
112441,6/30/2019,1626,1626,2019-06-30 16:26:00


In [19]:
%%time
# https://stackoverflow.com/questions/29206612/difference-between-data-type-datetime64ns-and-m8ns
# https://cognitivewaves.wordpress.com/data-types-python-numpy-pandas/

p32e4['Admit_dt'] = p32e4['AdmitDate'].astype('M8[ns]')
# p32e4['Admit_dt'] = p32e4['AdmitDate'].astype('datetime64')
p32e4

Wall time: 7.69 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,AdmitDate,AdmitTime,AdmitTime2,AdmitDatetime,Admit_dt
0,1/1/2013,627,0627,2013-01-01 06:27:00,2013-01-01
1,1/1/2013,1507,1507,2013-01-01 15:07:00,2013-01-01
2,1/1/2013,1353,1353,2013-01-01 13:53:00,2013-01-01
3,1/1/2013,625,0625,2013-01-01 06:25:00,2013-01-01
4,1/1/2013,56,0056,2013-01-01 00:56:00,2013-01-01
...,...,...,...,...,...
112438,6/30/2019,1329,1329,2019-06-30 13:29:00,2019-06-30
112439,6/30/2019,2257,2257,2019-06-30 22:57:00,2019-06-30
112440,6/30/2019,2045,2045,2019-06-30 20:45:00,2019-06-30
112441,6/30/2019,1626,1626,2019-06-30 16:26:00,2019-06-30


In [20]:
p32e4.info()  #checking the datatype of the Admit_dt column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112443 entries, 0 to 112442
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   AdmitDate      112443 non-null  object        
 1   AdmitTime      112443 non-null  int64         
 2   AdmitTime2     112443 non-null  object        
 3   AdmitDatetime  112443 non-null  datetime64[ns]
 4   Admit_dt       112443 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 4.3+ MB


### dt.strftime()
This method converts each datetime value to a string, using the date format given as an argument. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html

In [21]:
p32e4['AdmitDatetime_strftime'] = p32e4['AdmitDatetime'].dt.strftime('%m/%d/%y %I:%M %p')
p32e4

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  p32e4['AdmitDatetime_strftime'] = p32e4['AdmitDatetime'].dt.strftime('%m/%d/%y %I:%M %p')


Unnamed: 0,AdmitDate,AdmitTime,AdmitTime2,AdmitDatetime,Admit_dt,AdmitDatetime_strftime
0,1/1/2013,627,0627,2013-01-01 06:27:00,2013-01-01,01/01/13 06:27 AM
1,1/1/2013,1507,1507,2013-01-01 15:07:00,2013-01-01,01/01/13 03:07 PM
2,1/1/2013,1353,1353,2013-01-01 13:53:00,2013-01-01,01/01/13 01:53 PM
3,1/1/2013,625,0625,2013-01-01 06:25:00,2013-01-01,01/01/13 06:25 AM
4,1/1/2013,56,0056,2013-01-01 00:56:00,2013-01-01,01/01/13 12:56 AM
...,...,...,...,...,...,...
112438,6/30/2019,1329,1329,2019-06-30 13:29:00,2019-06-30,06/30/19 01:29 PM
112439,6/30/2019,2257,2257,2019-06-30 22:57:00,2019-06-30,06/30/19 10:57 PM
112440,6/30/2019,2045,2045,2019-06-30 20:45:00,2019-06-30,06/30/19 08:45 PM
112441,6/30/2019,1626,1626,2019-06-30 16:26:00,2019-06-30,06/30/19 04:26 PM


### dt.year, dt.month, dt.day, dt.hour, dt.minute
These are attributes of the .dt accessor rather than methods, so they are used without parentheses. <br> 
dt.year returns the year of each datetime. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.year.html <br> 
dt.month returns the month of each datetime. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html <br> 
dt.day returns the day of each datetime. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.day.html <br> 
dt.hour returns the hour of each datetime. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.hour.html <br> 
dt.minute returns the minute of each datetime. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.minute.html <br> 
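
As a compact sketch, the components can be collected into one DataFrame. The two timestamps are taken from the AdmitDatetime values shown earlier, and the parts frame is a name introduced here for illustration.

```python
import pandas as pd

admit = pd.Series(pd.to_datetime(['2013-01-01 06:27', '2019-06-30 22:57']))

parts = pd.DataFrame({
    'year': admit.dt.year,       # attributes, so no parentheses
    'month': admit.dt.month,
    'day': admit.dt.day,
    'hour': admit.dt.hour,
    'minute': admit.dt.minute,
})
print(parts)
```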

In [22]:
%%time
p32e4['admit_yr'] = p32e4['AdmitDatetime'].dt.year
p32e4['admit_month'] = p32e4['AdmitDatetime'].dt.month
p32e4['admit_hr'] = p32e4['AdmitDatetime'].dt.hour

p32e4

Wall time: 0 ns


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,AdmitDate,AdmitTime,AdmitTime2,AdmitDatetime,Admit_dt,AdmitDatetime_strftime,admit_yr,admit_month,admit_hr
0,1/1/2013,627,0627,2013-01-01 06:27:00,2013-01-01,01/01/13 06:27 AM,2013,1,6
1,1/1/2013,1507,1507,2013-01-01 15:07:00,2013-01-01,01/01/13 03:07 PM,2013,1,15
2,1/1/2013,1353,1353,2013-01-01 13:53:00,2013-01-01,01/01/13 01:53 PM,2013,1,13
3,1/1/2013,625,0625,2013-01-01 06:25:00,2013-01-01,01/01/13 06:25 AM,2013,1,6
4,1/1/2013,56,0056,2013-01-01 00:56:00,2013-01-01,01/01/13 12:56 AM,2013,1,0
...,...,...,...,...,...,...,...,...,...
112438,6/30/2019,1329,1329,2019-06-30 13:29:00,2019-06-30,06/30/19 01:29 PM,2019,6,13
112439,6/30/2019,2257,2257,2019-06-30 22:57:00,2019-06-30,06/30/19 10:57 PM,2019,6,22
112440,6/30/2019,2045,2045,2019-06-30 20:45:00,2019-06-30,06/30/19 08:45 PM,2019,6,20
112441,6/30/2019,1626,1626,2019-06-30 16:26:00,2019-06-30,06/30/19 04:26 PM,2019,6,16


### Calculate duration

In [23]:
%%time
# transform and clean the data before calculation
p32e5 = csv[['ids','AdmitDate','AdmitTime','DischargeDate','DischargeTime']].astype('str')
p32e5['AdmitTime'] = p32e5['AdmitTime'].str.zfill(4)
# remove only a trailing '.0'; where regex defaults to True, an unescaped '.0' matches any character followed by 0
p32e5['DischargeTime'] = p32e5['DischargeTime'].str.replace(r'\.0$', '', regex=True).str.zfill(4)

p32e5['Admit_dt'] = pd.to_datetime(p32e5['AdmitDate'] + p32e5['AdmitTime'],format='%m/%d/%Y%H%M')
p32e5['Discharge_dt'] = pd.to_datetime(p32e5['DischargeDate'] + p32e5['DischargeTime'],format='%m/%d/%Y%H%M',errors='coerce')


# p32e5['datediff'] = (p32e5['Discharge_dt'] - p32e5['Admit_dt'])/np.timedelta64(1, 'D')  #the order of the subtraction matters
# p32e5['datediff'] = abs(p32e5['Discharge_dt'] - p32e5['Admit_dt'])/np.timedelta64(1, 'D')  #the order doesn't matter
p32e5['datediff'] = round(abs(p32e5['Discharge_dt'] - p32e5['Admit_dt'])/np.timedelta64(1, 'D'), 3)
p32e5

Wall time: 534 ms


Unnamed: 0,ids,AdmitDate,AdmitTime,DischargeDate,DischargeTime,Admit_dt,Discharge_dt,datediff
0,E0122776959240,1/1/2013,0627,1/2/2013,1422,2013-01-01 06:27:00,2013-01-02 14:22:00,1.330
1,E0192437769970,1/1/2013,1507,1/1/2013,1845,2013-01-01 15:07:00,2013-01-01 18:45:00,0.151
2,E0238210647860,1/1/2013,1353,1/2/2013,1726,2013-01-01 13:53:00,2013-01-02 17:26:00,1.148
3,E0295116781805,1/1/2013,0625,1/2/2013,1520,2013-01-01 06:25:00,2013-01-02 15:20:00,1.372
4,E0576457975867,1/1/2013,0056,1/3/2013,1515,2013-01-01 00:56:00,2013-01-03 15:15:00,2.597
...,...,...,...,...,...,...,...,...
112438,E9870946410920,6/30/2019,1329,7/3/2019,1616,2019-06-30 13:29:00,2019-07-03 16:16:00,3.116
112439,E9875397672366,6/30/2019,2257,7/9/2019,1607,2019-06-30 22:57:00,2019-07-09 16:07:00,8.715
112440,E9904534873864,6/30/2019,2045,7/3/2019,1325,2019-06-30 20:45:00,2019-07-03 13:25:00,2.694
112441,E9927247603461,6/30/2019,1626,7/2/2019,1730,2019-06-30 16:26:00,2019-07-02 17:30:00,2.044
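
Subtracting two datetime columns yields a timedelta column, which can then be expressed in whatever unit is convenient. The sketch below reuses the first two admit/discharge pairs from the output above; the los_days and los_hours names are introduced here for illustration.

```python
import pandas as pd

admit = pd.Series(pd.to_datetime(['2013-01-01 06:27', '2013-01-01 15:07']))
discharge = pd.Series(pd.to_datetime(['2013-01-02 14:22', '2013-01-01 18:45']))

los = discharge - admit                                 # timedelta Series
los_days = (los / pd.Timedelta(days=1)).round(3)        # duration in fractional days
los_hours = (los.dt.total_seconds() / 3600).round(2)    # same durations in hours

print(los_days.tolist())
print(los_hours.tolist())
```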


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

## <a name='3_4'>3.4 Data transformation</a>
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

https://pandas.pydata.org/docs/user_guide/gotchas.html#gotchas-udf-mutation

<img src="data/img/Pandas loc vs iloc.png" width="100%">

<img src="data/img/Pandas_loc_iloc_example.svg" width="100%">


* df[(condition1) & (condition2) & (...)] slices the data based on a set of conditions combined with operators (and = &, or = |).

* df[list()] slices the data by selecting specific columns, using a list of column names.

* df[\~df.duplicated(subset=[])] slices the data for non-duplicate records (~ = not), using a subset of columns to check for duplicates.

https://www.r-craft.org/r-news/using-iloc-loc-ix-to-select-rows-and-columns-in-pandas-dataframes/

https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9
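
The three slicing patterns listed above can be sketched on a small made-up DataFrame. The column names mirror the example dataset, but the values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ids': ['A1', 'A1', 'B2', 'C3'],
    'EntityCode': ['MC', 'MC', 'SE', 'MC'],
    'cost': [1.94, 1.94, 75.59, 66.22],
})

both = df[(df['EntityCode'] == 'MC') & (df['cost'] > 50)]    # rows meeting both conditions
either = df[(df['EntityCode'] == 'SE') | (df['cost'] < 2)]   # rows meeting at least one condition
cols = df[['ids', 'cost']]                                   # column selection with a list of names
unique_ids = df[~df.duplicated(subset=['ids'])]              # keeps the first row of each ids value

print(both, either, cols, unique_ids, sep='\n\n')
```

Each condition needs its own parentheses because & and | bind more tightly than comparison operators such as == and >.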

In [24]:
# create copy dataset for practice
p34 = csv[['EntityCode','ids','AdmitDate','DischargeDate','DateIn1','TimeIn1','cost',
           'FinancialClass','VisitType']].copy()  # explicit copy avoids SettingWithCopyWarning

p34['cost'] = p34['cost'].str.replace(r'[$,]', '', regex=True)  # strip '$' and ',' before converting to float
p34['TimeIn1'] = pd.to_numeric(p34['TimeIn1'], errors='coerce')
p34[['FinClass','FinClass_desc']] = p34['FinancialClass'].str.split(' - ', expand=True)


p34 = p34.astype({'AdmitDate':'M8[ns]','DischargeDate':'M8[ns]','cost':'float64'})
p34

Unnamed: 0,EntityCode,ids,AdmitDate,DischargeDate,DateIn1,TimeIn1,cost,FinancialClass,VisitType,FinClass,FinClass_desc
0,MC,E0122776959240,2013-01-01,2013-01-02,1/1/2013,1031,1.94,MH - MEDICARE HMO,OP,MH,MEDICARE HMO
1,MC,E0192437769970,2013-01-01,2013-01-01,1/1/2013,1244,69.68,SP - SELF-PAY,Obs,SP,SELF-PAY
2,SE,E0238210647860,2013-01-01,2013-01-02,1/1/2013,1637,75.59,HM - HMO,Obs,HM,HMO
3,HH,E0295116781805,2013-01-01,2013-01-02,1/1/2013,1657,66.22,MH - MEDICARE HMO,ED,MH,MEDICARE HMO
4,SE,E0576457975867,2013-01-01,2013-01-03,1/1/2013,1050,94.28,MH - MEDICARE HMO,ED,MH,MEDICARE HMO
...,...,...,...,...,...,...,...,...,...,...,...
112438,HH,E9870946410920,2019-06-30,2019-07-03,6/30/2019,1247,19.26,HX - HEALTH INSURANCE EXCHANGE,Obs,HX,HEALTH INSURANCE EXCHANGE
112439,SE,E9875397672366,2019-06-30,2019-07-09,6/30/2019,1536,33.08,MR - MEDICARE,Obs,MR,MEDICARE
112440,HH,E9904534873864,2019-06-30,2019-07-03,6/30/2019,30,95.27,MR - MEDICARE,Obs,MR,MEDICARE
112441,HH,E9927247603461,2019-06-30,2019-07-02,6/30/2019,618,22.12,DH - MEDICAID HMO,OP,DH,MEDICAID HMO


In [25]:
p34.info()  #confirm datatype change

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112443 entries, 0 to 112442
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   EntityCode      112443 non-null  object        
 1   ids             112443 non-null  object        
 2   AdmitDate       112443 non-null  datetime64[ns]
 3   DischargeDate   112293 non-null  datetime64[ns]
 4   DateIn1         112443 non-null  object        
 5   TimeIn1         112443 non-null  int64         
 6   cost            112443 non-null  float64       
 7   FinancialClass  112443 non-null  object        
 8   VisitType       112443 non-null  object        
 9   FinClass        112443 non-null  object        
 10  FinClass_desc   112443 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(1), object(7)
memory usage: 9.4+ MB


### df.loc[]
This indexer accesses a group of rows and columns by label(s) or by a boolean array. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
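
As a minimal sketch of the common `.loc` patterns, the block below uses a small made-up frame (`toy` and its values are hypothetical, loosely modelled on the columns used in this notebook) to show row selection by boolean mask, label slicing, and conditional assignment.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's csv
toy = pd.DataFrame({'EntityCode': ['MC', 'SE', 'MC', 'HH'],
                    'FinClass': ['MH', 'HM', 'SP', 'MR'],
                    'cost': [1.94, 75.59, 69.68, 66.22]})

# Boolean mask picks the rows, a list of labels picks the columns
mc_cost = toy.loc[toy['EntityCode'] == 'MC', ['EntityCode', 'cost']]

# Label slice on the index: both endpoints are included
first_three = toy.loc[0:2]

# Conditional assignment through .loc writes to toy itself;
# rows that do not meet the condition are left as NaN in the new column
toy.loc[toy['cost'] > 70, 'high_cost'] = True
```

Note that a label slice such as `0:2` includes the end label, unlike positional slicing.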

In [26]:
p34e1 = p34.loc[p34['EntityCode'] == 'MC']  #boolean mask keeps the rows where EntityCode is 'MC'
display(csv,p34e1)

Unnamed: 0,ids,AdmitDate,AdmitTime,DischargeDate,DischargeTime,EntityCode,Loc1,Loc2,Loc3,Loc4,...,DateIn5,TimeIn1,TimeIn2,TimeIn3,TimeIn4,TimeIn5,EncStatus,FinancialClass,VisitType,cost
0,E0122776959240,1/1/2013,627,1/2/2013,1422.0,MC,EDMC,VUMC,VUMC,J6MB,...,,1031,1417.0,1417.0,1525.0,,Cancelled,MH - MEDICARE HMO,OP,$1.94
1,E0192437769970,1/1/2013,1507,1/1/2013,1845.0,MC,JANT,,,,...,,1244,,,,,Discharged,SP - SELF-PAY,Obs,$69.68
2,E0238210647860,1/1/2013,1353,1/2/2013,1726.0,SE,JANT,,,,...,,1637,,,,,Preadmit,HM - HMO,Obs,$75.59
3,E0295116781805,1/1/2013,625,1/2/2013,1520.0,HH,EDMC,VUMC,J7EC,,...,,1657,2016.0,2130.0,,,Active,MH - MEDICARE HMO,ED,$66.22
4,E0576457975867,1/1/2013,56,1/3/2013,1515.0,SE,EDMC,VUMC,VUMC,J7MB,...,,1050,1551.0,1703.0,1740.0,,Cancelled,MH - MEDICARE HMO,ED,$94.28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112438,E9870946410920,6/30/2019,1329,7/3/2019,1616.0,HH,EDMC,VUMC,J7MB,,...,,1247,1611.0,1744.0,,,Discharged,HX - HEALTH INSURANCE EXCHANGE,Obs,$19.26
112439,E9875397672366,6/30/2019,2257,7/9/2019,1607.0,SE,EDMC,VUMC,J4EA,J3EA,...,7/2/2019,1536,2127.0,229.0,1827.0,1616.0,Preadmit,MR - MEDICARE,Obs,$33.08
112440,E9904534873864,6/30/2019,2045,7/3/2019,1325.0,HH,EDHH,WC5,WCAN,,...,,30,500.0,1739.0,,,Active,MR - MEDICARE,Obs,$95.27
112441,E9927247603461,6/30/2019,1626,7/2/2019,1730.0,HH,EDMC,VUMC,J4EC,,...,,618,826.0,1356.0,,,Discharged,DH - MEDICAID HMO,OP,$22.12


Unnamed: 0,EntityCode,ids,AdmitDate,DischargeDate,DateIn1,TimeIn1,cost,FinancialClass,VisitType,FinClass,FinClass_desc
0,MC,E0122776959240,2013-01-01,2013-01-02,1/1/2013,1031,1.94,MH - MEDICARE HMO,OP,MH,MEDICARE HMO
1,MC,E0192437769970,2013-01-01,2013-01-01,1/1/2013,1244,69.68,SP - SELF-PAY,Obs,SP,SELF-PAY
6,MC,E0599404356054,2013-01-01,2013-01-02,1/1/2013,246,91.60,MR - MEDICARE,IP,MR,MEDICARE
8,MC,E1190367068869,2013-01-01,2013-01-03,1/1/2013,1413,17.77,MR - MEDICARE,Obs,MR,MEDICARE
9,MC,E1457522035792,2013-01-01,2013-01-11,12/30/2012,921,73.01,MH - MEDICARE HMO,ED,MH,MEDICARE HMO
...,...,...,...,...,...,...,...,...,...,...,...
112429,MC,E9088892148513,2019-06-30,2019-07-02,6/30/2019,2319,6.73,PP - PPO,Obs,PP,PPO
112430,MC,E9098226409965,2019-06-30,2019-07-03,6/30/2019,905,31.06,MR - MEDICARE,OP,MR,MEDICARE
112432,MC,E9290699794441,2019-06-30,2019-07-05,6/30/2019,956,37.53,MR - MEDICARE,OP,MR,MEDICARE
112436,MC,E9449875063797,2019-06-30,2019-07-03,6/30/2019,1152,36.34,MR - MEDICARE,OP,MR,MEDICARE


In [27]:
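# FinClass holds the short code (e.g. 'HM') and FinClass_desc holds the description (e.g. 'HMO'),
# so comparing FinClass to 'HMO' matches nothing and p34e1b comes back empty
p34e1b = p34.loc[(p34['EntityCode'] == 'MC') & (p34['FinClass'] == 'HMO')]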
display(p34,p34e1b)

Unnamed: 0,EntityCode,ids,AdmitDate,DischargeDate,DateIn1,TimeIn1,cost,FinancialClass,VisitType,FinClass,FinClass_desc
0,MC,E0122776959240,2013-01-01,2013-01-02,1/1/2013,1031,1.94,MH - MEDICARE HMO,OP,MH,MEDICARE HMO
1,MC,E0192437769970,2013-01-01,2013-01-01,1/1/2013,1244,69.68,SP - SELF-PAY,Obs,SP,SELF-PAY
2,SE,E0238210647860,2013-01-01,2013-01-02,1/1/2013,1637,75.59,HM - HMO,Obs,HM,HMO
3,HH,E0295116781805,2013-01-01,2013-01-02,1/1/2013,1657,66.22,MH - MEDICARE HMO,ED,MH,MEDICARE HMO
4,SE,E0576457975867,2013-01-01,2013-01-03,1/1/2013,1050,94.28,MH - MEDICARE HMO,ED,MH,MEDICARE HMO
...,...,...,...,...,...,...,...,...,...,...,...
112438,HH,E9870946410920,2019-06-30,2019-07-03,6/30/2019,1247,19.26,HX - HEALTH INSURANCE EXCHANGE,Obs,HX,HEALTH INSURANCE EXCHANGE
112439,SE,E9875397672366,2019-06-30,2019-07-09,6/30/2019,1536,33.08,MR - MEDICARE,Obs,MR,MEDICARE
112440,HH,E9904534873864,2019-06-30,2019-07-03,6/30/2019,30,95.27,MR - MEDICARE,Obs,MR,MEDICARE
112441,HH,E9927247603461,2019-06-30,2019-07-02,6/30/2019,618,22.12,DH - MEDICAID HMO,OP,DH,MEDICAID HMO


Unnamed: 0,EntityCode,ids,AdmitDate,DischargeDate,DateIn1,TimeIn1,cost,FinancialClass,VisitType,FinClass,FinClass_desc


In [28]:
p34e1c = p34.copy()
p34e1c.loc[(p34e1c['FinClass'] == 'SP') & (p34e1c['AdmitDate'].dt.month == 4), 'condition1'] = 'met'

p34e1c[(p34e1c['condition1'] == 'met')]

Unnamed: 0,EntityCode,ids,AdmitDate,DischargeDate,DateIn1,TimeIn1,cost,FinancialClass,VisitType,FinClass,FinClass_desc,condition1
6841,HH,E1346457045531,2013-04-01,2013-04-02,3/31/2013,1328,69.53,SP - SELF-PAY,OP,SP,SELF-PAY,met
6854,MC,E2328013558681,2013-04-01,2013-04-01,4/1/2013,1800,87.41,SP - SELF-PAY,OP,SP,SELF-PAY,met
6860,MC,E2706514093644,2013-04-01,2013-04-01,4/1/2013,741,0.56,SP - SELF-PAY,Obs,SP,SELF-PAY,met
6900,HH,E0086393129504,2013-04-02,2013-04-03,4/2/2013,930,96.88,SP - SELF-PAY,DS,SP,SELF-PAY,met
6905,HH,E0677304534874,2013-04-02,2013-04-02,4/2/2013,558,72.12,SP - SELF-PAY,OP,SP,SELF-PAY,met
...,...,...,...,...,...,...,...,...,...,...,...,...
97854,HH,E8352911519004,2019-04-30,2019-04-30,4/29/2019,1115,24.20,SP - SELF-PAY,Obs,SP,SELF-PAY,met
97891,HH,E8787881636853,2019-04-30,2019-05-02,4/30/2019,1115,90.25,SP - SELF-PAY,OP,SP,SELF-PAY,met
97918,HH,E8791720141426,2019-04-30,2019-07-08,4/30/2019,2116,96.66,SP - SELF-PAY,OP,SP,SELF-PAY,met
97920,MC,E8792268831690,2019-04-30,2019-05-02,4/30/2019,2321,5.96,SP - SELF-PAY,OP,SP,SELF-PAY,met


### df.iloc[]
This indexer accesses a group of rows and columns by integer position (0-based; the end of a slice is excluded). <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
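
To make the position-based behaviour concrete, here is a small sketch on a made-up frame (`toy` is hypothetical). The index labels are deliberately not 0, 1, 2, to show that `.iloc` ignores labels.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels
toy = pd.DataFrame({'EntityCode': ['MC', 'SE', 'HH'],
                    'ids': ['E01', 'E02', 'E03'],
                    'cost': [1.94, 75.59, 66.22]},
                   index=[10, 20, 30])

first_two_rows = toy.iloc[:2]     # rows at positions 0 and 1 (end excluded)
last_column = toy.iloc[:, -1]     # last column, returned as a Series
single_cell = toy.iloc[1, 0]      # row position 1, column position 0
```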

In [29]:
p34e2 = p34.iloc[:,:3]
p34e2

Unnamed: 0,EntityCode,ids,AdmitDate
0,MC,E0122776959240,2013-01-01
1,MC,E0192437769970,2013-01-01
2,SE,E0238210647860,2013-01-01
3,HH,E0295116781805,2013-01-01
4,SE,E0576457975867,2013-01-01
...,...,...,...
112438,HH,E9870946410920,2019-06-30
112439,SE,E9875397672366,2019-06-30
112440,HH,E9904534873864,2019-06-30
112441,HH,E9927247603461,2019-06-30


### df.T
This property swaps the index and columns of a dataframe. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.T.html <br> 
It is shorthand for the df.transpose() method, which does the same thing. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html

In [30]:
p34e2.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,112433,112434,112435,112436,112437,112438,112439,112440,112441,112442
EntityCode,MC,MC,SE,HH,SE,HH,MC,HH,MC,MC,...,HH,SE,SE,MC,MC,HH,SE,HH,HH,HH
ids,E0122776959240,E0192437769970,E0238210647860,E0295116781805,E0576457975867,E0579318588523,E0599404356054,E0733758165988,E1190367068869,E1457522035792,...,E9297117118891,E9299623886311,E9435748672227,E9449875063797,E9778408231460,E9870946410920,E9875397672366,E9904534873864,E9927247603461,E9997051257232
AdmitDate,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-01 00:00:00,...,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00,2019-06-30 00:00:00


### df.stack()
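This method moves the specified column level(s) into the index, giving a longer (taller) result. A dataframe with a single level of columns becomes a Series with a MultiIndex. <br> 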
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
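
A minimal sketch of what stacking does to the index, using a two-row made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'EntityCode': ['MC', 'SE'],
                    'ids': ['E01', 'E02']})

# Each cell becomes one entry keyed by (row label, column name)
stacked = toy.stack()

# Look up a single value with the two-level key
value = stacked.loc[(1, 'ids')]
```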

In [31]:
pd.DataFrame(p34e2.stack())

Unnamed: 0,Unnamed: 1,0
0,EntityCode,MC
0,ids,E0122776959240
0,AdmitDate,2013-01-01 00:00:00
1,EntityCode,MC
1,ids,E0192437769970
...,...,...
112441,ids,E9927247603461
112441,AdmitDate,2019-06-30 00:00:00
112442,EntityCode,HH
112442,ids,E9997051257232


### df.unstack()

This method pivots a level of the (necessarily hierarchical) index labels into the columns and returns a DataFrame with a new innermost level of column labels. If the index is not a MultiIndex, the result is a Series. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
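
The sketch below, on a made-up frame (`toy` is hypothetical), illustrates both cases: unstacking a MultiIndex Series back into a dataframe, and unstacking a dataframe with a plain index, which yields a Series.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'EntityCode': ['MC', 'SE'],
                    'ids': ['E01', 'E02']})

stacked = toy.stack()          # Series with a (row label, column name) MultiIndex
restored = stacked.unstack()   # innermost index level moves back to the columns

# With a plain (non-MultiIndex) index, unstack returns a Series keyed by (column, row)
as_series = toy.unstack()
```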

In [32]:
pd.DataFrame(p34e2.unstack())

Unnamed: 0,Unnamed: 1,0
EntityCode,0,MC
EntityCode,1,MC
EntityCode,2,SE
EntityCode,3,HH
EntityCode,4,SE
...,...,...
AdmitDate,112438,2019-06-30 00:00:00
AdmitDate,112439,2019-06-30 00:00:00
AdmitDate,112440,2019-06-30 00:00:00
AdmitDate,112441,2019-06-30 00:00:00


### series.shift()
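This method shifts the values of the series by the given number of periods, leaving the index unchanged; positions left empty are filled with NaN/NaT. If the optional freq argument is passed, the index is shifted instead of the values. <br> 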
https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html
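
A typical use of `shift()` is comparing each row with the one before it. The following sketch uses made-up dates (hypothetical, not from the notebook's data) to compute the number of days since the previous admission.

```python
import pandas as pd

# Hypothetical admission dates
admit = pd.Series(pd.to_datetime(['2013-01-01', '2013-01-03', '2013-01-10']))

previous = admit.shift(1)                          # value from the row above; first row becomes NaT
days_since_previous = (admit - previous).dt.days   # NaN for the first row
```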

In [33]:
%%time
p34e3 = p34e2.copy()
p34e3['AdmitDate_shiftup'] = p34e3['AdmitDate'].shift(-1)
p34e3['AdmitDate_shiftdown'] = p34e3['AdmitDate'].shift(1)
p34e3

Wall time: 8.63 ms


Unnamed: 0,EntityCode,ids,AdmitDate,AdmitDate_shiftup,AdmitDate_shiftdown
0,MC,E0122776959240,2013-01-01,2013-01-01,NaT
1,MC,E0192437769970,2013-01-01,2013-01-01,2013-01-01
2,SE,E0238210647860,2013-01-01,2013-01-01,2013-01-01
3,HH,E0295116781805,2013-01-01,2013-01-01,2013-01-01
4,SE,E0576457975867,2013-01-01,2013-01-01,2013-01-01
...,...,...,...,...,...
112438,HH,E9870946410920,2019-06-30,2019-06-30,2019-06-30
112439,SE,E9875397672366,2019-06-30,2019-06-30,2019-06-30
112440,HH,E9904534873864,2019-06-30,2019-06-30,2019-06-30
112441,HH,E9927247603461,2019-06-30,2019-06-30,2019-06-30


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

## <a name='3_5'>3.5 Data mapping</a>
### series.map()
This method maps the values of a series according to a given mapping (dict or Series) or function. <br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html
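
One behaviour worth knowing when mapping with a dict: values that have no entry in the dict become NaN. The sketch below uses made-up codes (`'XX'` is a hypothetical unmapped value) and shows one possible way to keep the original value instead.

```python
import pandas as pd

codes = pd.Series(['MC', 'HH', 'XX'])           # 'XX' is a hypothetical code with no mapping
lookup = {'MC': 'memorial city', 'HH': 'TMC'}

mapped = codes.map(lookup)                       # 'XX' has no entry, so it becomes NaN
mapped_filled = codes.map(lookup).fillna(codes)  # fall back to the original code when unmapped
```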

In [34]:
%%time
p34e4 = p34e2.copy()
p34e4['EntityCode_desc'] = p34e4['EntityCode'].map({'MC':'memorial city', 'HH':'TMC', 'SE':'southeast'})
p34e4

Wall time: 0 ns


Unnamed: 0,EntityCode,ids,AdmitDate,EntityCode_desc
0,MC,E0122776959240,2013-01-01,memorial city
1,MC,E0192437769970,2013-01-01,memorial city
2,SE,E0238210647860,2013-01-01,southeast
3,HH,E0295116781805,2013-01-01,TMC
4,SE,E0576457975867,2013-01-01,southeast
...,...,...,...,...
112438,HH,E9870946410920,2019-06-30,TMC
112439,SE,E9875397672366,2019-06-30,southeast
112440,HH,E9904534873864,2019-06-30,TMC
112441,HH,E9927247603461,2019-06-30,TMC


### df.apply(), series.apply()
df.apply() applies a function along an axis of a DataFrame: to each column with axis=0, to each row with axis=1. series.apply() applies a function to each value of a Series. <br> 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
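
The difference between the two axes, and between the DataFrame and Series versions, is easiest to see on numbers. This is a small sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = toy.apply(sum, axis=0)            # sum receives each column as a Series
row_totals = toy.apply(sum, axis=1)            # sum receives each row as a Series
a_squared = toy['a'].apply(lambda x: x ** 2)   # Series.apply works value by value
```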

In [35]:
%%time
def count_str(x):
    return len(str(x))

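p34e5 = p34e2.copy()
# axis=0 passes each whole column (a Series) to count_str, so the result is the
# length of each column's printed representation, not a per-value length
p34e5.apply(count_str, axis=0)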

Wall time: 0 ns


EntityCode    190
ids           315
AdmitDate     274
dtype: int64

In [36]:
%%time
p34e5['ID_len1'] = p34e5['ids'].apply(count_str)

Wall time: 48.9 ms


In [37]:
%%time
# https://www.w3schools.com/python/python_lambda.asp
# https://realpython.com/python-lambda/
p34e5['ID_len2'] = p34e5['ids'].apply(lambda x: len(str(x)))
p34e5

Wall time: 34.5 ms


Unnamed: 0,EntityCode,ids,AdmitDate,ID_len1,ID_len2
0,MC,E0122776959240,2013-01-01,14,14
1,MC,E0192437769970,2013-01-01,14,14
2,SE,E0238210647860,2013-01-01,14,14
3,HH,E0295116781805,2013-01-01,14,14
4,SE,E0576457975867,2013-01-01,14,14
...,...,...,...,...,...
112438,HH,E9870946410920,2019-06-30,14,14
112439,SE,E9875397672366,2019-06-30,14,14
112440,HH,E9904534873864,2019-06-30,14,14
112441,HH,E9927247603461,2019-06-30,14,14


### df.applymap()
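This method applies an input function to each element in the DataFrame. In pandas 2.1 and later it is deprecated in favor of `DataFrame.map`, which does the same thing. <br> 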
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html
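
A minimal element-wise sketch on a made-up frame (`toy` is hypothetical). Because the method name depends on the pandas version, the example picks whichever is available; this fallback is an illustration, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'EntityCode': ['MC', 'SE'], 'ids': ['E0122', 'E02']})

# DataFrame.map is the newer name (pandas 2.1+); fall back to applymap on older versions
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
lengths = elementwise(len)   # len is called once per cell
```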

In [38]:
p34e5.applymap(lambda x: len(str(x)))

Unnamed: 0,EntityCode,ids,AdmitDate,ID_len1,ID_len2
0,2,14,19,2,2
1,2,14,19,2,2
2,2,14,19,2,2
3,2,14,19,2,2
4,2,14,19,2,2
...,...,...,...,...,...
112438,2,14,19,2,2
112439,2,14,19,2,2
112440,2,14,19,2,2
112441,2,14,19,2,2


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

## <a name='3_6'>3.6 Pandas Data Joining</a>
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Additional reading:
https://medium.com/analytics-vidhya/a-tip-a-day-python-tip-5-pandas-concat-append-dev-skrol-18e4950cc8cc#:~:text=Concat%20function%20will%20do%20a,of%20rows%20in%20second%20dataframe.

---
### <a name='3_6a'>3.6a Concat</a>
<img src='https://miro.medium.com/max/1400/1*NLnoAF5uOSBC2Y7IuzfM_Q.png'>

A | B
- | -
result = pd.concat([df1, df2, df3]) | result = pd.concat([df1, df4], ignore_index=True, sort=False)
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png) | ![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_ignore_index.png)

result = pd.concat([df1, df4], axis=1, sort=False)
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_axis1.png)

result = pd.concat([df1, df4], axis=1, join='inner')
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_axis1_inner.png)

result = pd.concat([df1, df4], axis=1).reindex(df1.index)  (the join_axes argument was removed in pandas 1.0; reindexing on df1.index gives the same result)
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_axis1_join_axes.png)
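
The variants pictured above can be compared side by side on a smaller pair of frames. `df_a` and `df_b` are hypothetical stand-ins for `df1` and `df4`; their indexes overlap only at label 1.

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only at label 1
df_a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df_b = pd.DataFrame({'B': ['B1', 'B2'], 'F': ['F1', 'F2']}, index=[1, 2])

rows = pd.concat([df_a, df_b], ignore_index=True, sort=False)    # stack rows, renumber the index
cols_outer = pd.concat([df_a, df_b], axis=1)                     # side by side, union of index labels
cols_inner = pd.concat([df_a, df_b], axis=1, join='inner')       # side by side, shared labels only
cols_left = pd.concat([df_a, df_b], axis=1).reindex(df_a.index)  # side by side, keep df_a's labels
```

Without `ignore_index=True`, row-wise concatenation keeps the original index labels, which can produce duplicates.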


In [39]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])


df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df2a = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[0, 1, 2, 3])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
    
result1 = pd.concat([df1, df2a, df3])
display(result1)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

---
### <a name='3_6b'>3.6b Pandas Merge</a>
Experienced users of relational databases like SQL will be familiar with the terminology used to describe join operations between two SQL-table-like structures (DataFrame objects). There are several cases to consider, and they are important to understand:

* <b>one-to-one joins:</b> the joining key must be unique in each of the joining datasets.
* <b>many-to-one joins:</b> the joining key is unique in one table but not in the other.
* <b>many-to-many joins:</b> the joining key is not unique in either of the joining datasets.
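
The effect of key uniqueness on the size of the result can be sketched with two made-up tables (`patients` and `visits` are hypothetical names and data):

```python
import pandas as pd

# Hypothetical tables: pid is unique in patients and repeated in visits
patients = pd.DataFrame({'pid': [1, 2, 3], 'name': ['a', 'b', 'c']})
visits = pd.DataFrame({'pid': [1, 1, 2], 'visit': ['v1', 'v2', 'v3']})

many_to_one = pd.merge(visits, patients, on='pid')   # one output row per visit
many_to_many = pd.merge(visits, visits, on='pid')    # every pairing of rows that share a pid
```

In the many-to-many case, pid 1 appears twice on each side and contributes 2 x 2 = 4 rows, which is why such merges can grow much larger than either input.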

---

Merge method | SQL Join Name | Description
- | - | -
left | LEFT OUTER JOIN | Use keys from left frame only
right | RIGHT OUTER JOIN | Use keys from right frame only
outer | FULL OUTER JOIN | Use union of keys from both frames
inner | INNER JOIN | Use intersection of keys from both frames
cross | CROSS JOIN | Create the cartesian product of rows of both frames (new in version 1.2)

<img src="data/img/SQL_pandas_Joins.svg" width="100%">

<img src='https://miro.medium.com/max/1400/1*av8Om3HpG1MC7YTLKvyftg.png'>

https://medium.com/dev-genius/combining-data-in-pandas-31c984afceb7

https://medium.com/@essharmav/combining-data-with-merge-join-and-concat-methods-in-pandas-465c6eab9a34

https://towardsdatascience.com/take-your-sql-from-good-to-great-part-3-687d797d1ede

---
### left join
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_left.png)

---
### right join
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_right.png)

---
### outer join
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_outer.png)

---
### inner join
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_inner.png)

---
### cross join
result = pd.merge(left, right, how="cross")
![alt](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_cross.png)
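
As a quick way to compare the five `how` options, the sketch below merges the same two made-up frames (hypothetical data, with only `K0` present in both) and records the number of rows each option returns.

```python
import pandas as pd

# Hypothetical frames: only K0 appears in both
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K3'], 'B': ['B0', 'B3']})

# Row count for each key-based join type
row_counts = {how: len(pd.merge(left, right, how=how, on='key'))
              for how in ['left', 'right', 'outer', 'inner']}

# A cross join pairs every left row with every right row and takes no key
row_counts['cross'] = len(pd.merge(left, right, how='cross'))
```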

<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

### <a name='3_6c'>3.6c Exclusive join</a>
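
An alternative to testing a data column for NaN, sketched here under the assumption that an extra `_merge` column is acceptable, is to do a single outer merge with `indicator=True` and filter on it. The frames below are hypothetical and smaller than the ones used in the cells that follow.

```python
import pandas as pd

# Hypothetical frames: only K0 appears in both
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K4'], 'E': ['E0', 'E4']})

merged = pd.merge(left, right, how='outer', on='key', indicator=True)

left_only = merged[merged['_merge'] == 'left_only']     # keys found only in left
right_only = merged[merged['_merge'] == 'right_only']   # keys found only in right
either_only = merged[merged['_merge'] != 'both']        # keys missing from one side
```

This avoids misclassifying rows when the column being tested for NaN already contains missing values in the source data.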

---
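#### Left Exclusive join
Rows whose key appears in `left` but not in `right`: do a left merge, then keep the rows where a column from `right` is missing.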

In [40]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                     'E': ['E0', 'E1', 'E2', 'E3'],
                     'F': ['F0', 'F1', 'F2', 'F3'],
                     'G': ['G0', 'G1', 'G2', 'G3'],
                     'H': ['H0', 'H1', 'H2', 'H3']})

result = pd.merge(left, right, how='left', on='key')
result2 = result[result['E'].isna()]
display(result,result2)

Unnamed: 0,key,A,B,C,D,E,F,G,H
0,K0,A0,B0,C0,D0,E0,F0,G0,H0
1,K1,A1,B1,C1,D1,E1,F1,G1,H1
2,K2,A2,B2,C2,D2,,,,
3,K3,A3,B3,C3,D3,,,,


Unnamed: 0,key,A,B,C,D,E,F,G,H
2,K2,A2,B2,C2,D2,,,,
3,K3,A3,B3,C3,D3,,,,


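#### Right Exclusive join
Rows whose key appears in `right` but not in `left`: do a right merge, then keep the rows where a column from `left` is missing.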

In [41]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                     'E': ['E0', 'E1', 'E2', 'E3'],
                     'F': ['F0', 'F1', 'F2', 'F3'],
                     'G': ['G0', 'G1', 'G2', 'G3'],
                     'H': ['H0', 'H1', 'H2', 'H3']})

result = pd.merge(left, right, how='right', on='key')
result2 = result[result['A'].isna()]
display(result,result2)

Unnamed: 0,key,A,B,C,D,E,F,G,H
0,K0,A0,B0,C0,D0,E0,F0,G0,H0
1,K1,A1,B1,C1,D1,E1,F1,G1,H1
2,K4,,,,,E2,F2,G2,H2
3,K5,,,,,E3,F3,G3,H3


Unnamed: 0,key,A,B,C,D,E,F,G,H
2,K4,,,,,E2,F2,G2,H2
3,K5,,,,,E3,F3,G3,H3


#### Full Outer Exclusive

In [42]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                     'E': ['E0', 'E1', 'E2', 'E3'],
                     'F': ['F0', 'F1', 'F2', 'F3'],
                     'G': ['G0', 'G1', 'G2', 'G3'],
                     'H': ['H0', 'H1', 'H2', 'H3']})

result = pd.merge(left, right, how='outer', on='key')
result2 = result[(result['A'].isna()) | (result['E'].isna())]
display(result,result2)

Unnamed: 0,key,A,B,C,D,E,F,G,H
0,K0,A0,B0,C0,D0,E0,F0,G0,H0
1,K1,A1,B1,C1,D1,E1,F1,G1,H1
2,K2,A2,B2,C2,D2,,,,
3,K3,A3,B3,C3,D3,,,,
4,K4,,,,,E2,F2,G2,H2
5,K5,,,,,E3,F3,G3,H3


Unnamed: 0,key,A,B,C,D,E,F,G,H
2,K2,A2,B2,C2,D2,,,,
3,K3,A3,B3,C3,D3,,,,
4,K4,,,,,E2,F2,G2,H2
5,K5,,,,,E3,F3,G3,H3


#### Cross join


In [43]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                     'E': ['E0', 'E1', 'E2', 'E3'],
                     'F': ['F0', 'F1', 'F2', 'F3'],
                     'G': ['G0', 'G1', 'G2', 'G3'],
                     'H': ['H0', 'H1', 'H2', 'H3']})

result = pd.merge(left, right, how='cross')
# result2 = result[(result['A'].isna()) | (result['E'].isna())]
display(result)

Unnamed: 0,key_x,A,B,C,D,key_y,E,F,G,H
0,K0,A0,B0,C0,D0,K0,E0,F0,G0,H0
1,K0,A0,B0,C0,D0,K1,E1,F1,G1,H1
2,K0,A0,B0,C0,D0,K4,E2,F2,G2,H2
3,K0,A0,B0,C0,D0,K5,E3,F3,G3,H3
4,K1,A1,B1,C1,D1,K0,E0,F0,G0,H0
5,K1,A1,B1,C1,D1,K1,E1,F1,G1,H1
6,K1,A1,B1,C1,D1,K4,E2,F2,G2,H2
7,K1,A1,B1,C1,D1,K5,E3,F3,G3,H3
8,K2,A2,B2,C2,D2,K0,E0,F0,G0,H0
9,K2,A2,B2,C2,D2,K1,E1,F1,G1,H1


---
#### Pandas Merge: example 1
![alt](https://pandas.pydata.org/pandas-docs/version/0.24.0/_images/merging_merge_on_key.png)

In [44]:
# test script #

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(left, right, on='key')
result

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


---
#### Pandas Merge: example 2
![alt](https://pandas.pydata.org/pandas-docs/version/0.24.0/_images/merging_merge_on_key_multiple.png)

In [45]:
# test script #

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(left, right, on=['key1', 'key2'])
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


---
#### Pandas Merge: example 3
![alt](https://pandas.pydata.org/pandas-docs/version/0.24.0/_images/merging_merge_on_key_dup.png)

In [46]:
left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

result = pd.merge(left, right, on='B', how='outer')
result

Unnamed: 0,A_x,B,A_y
0,1,2,4
1,1,2,5
2,1,2,6
3,2,2,4
4,2,2,5
5,2,2,6


<div align='right'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>

---
## <a name='3_7'>3.7 Check for duplicate keys w/ merge validation</a>

In [47]:
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
result

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
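
The `validate` argument makes `pd.merge` check key uniqueness before merging and raise `MergeError` when the stated relationship does not hold. The sketch below reuses the frames from the cell above, catches the error, and then repeats the merge with `one_to_many`, which should match this data because `B` is unique in `left` and repeated in `right`.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# one_to_one fails because B is repeated in right
try:
    pd.merge(left, right, on='B', how='outer', validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

# B is unique in left and repeated in right, so one_to_many passes
result = pd.merge(left, right, on='B', how='outer', validate='one_to_many')
```

The accepted values are `one_to_one`, `one_to_many`, `many_to_one` and `many_to_many` (or the short forms `1:1`, `1:m`, `m:1`, `m:m`).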

---
## <a name='3_8'>3.8 Merge Indicator</a>
Observation Origin | _merge value
- | -
Merge key only in 'left' frame | left_only
Merge key only in 'right' frame | right_only
Merge key in both frames | both

In [48]:
df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})

df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

pd.merge(df1, df2, on='col1', how='outer', indicator=True)

Unnamed: 0,col1,col_left,col_right,_merge
0,0,a,,left_only
1,1,b,2.0,both
2,2,,2.0,right_only
3,2,,2.0,right_only
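
`indicator` also accepts a string, which is then used as the name of the indicator column instead of `_merge`. The sketch below reuses the two frames from the cell above; the column name `source` is an arbitrary choice for illustration. It tallies how many rows came from each side.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

# 'source' is a hypothetical column name chosen for this example
merged = pd.merge(df1, df2, on='col1', how='outer', indicator='source')
counts = merged['source'].value_counts()   # rows per origin: left_only, right_only, both
```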


<div align='center'><a href='#toc' style='text-decoration:none;font-weight:bold;color:#0877ff;'>&#11014;&#65039; Back to the Top</a></div>