# Day 3: Data Wrangling with pandas

In [1]:
from IPython.display import display, HTML

CSS = """
.output {
    align-items: center;
}
"""

HTML('<style>{}</style>'.format(CSS))

## What can Pandas do in data analytics?

Pandas has so many uses that it might be easier to list what it can't do than what it can.

This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, merging, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

![Python-Pandas-Features.jpg](attachment:Python-Pandas-Features.jpg)


- Calculate statistics and answer questions about the data, like

    - What's the average, median, max, or min of each column? 
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?

- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria


- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more. 


- Store the cleaned, transformed data back into a CSV, another file format, or a database


Before we jump into modeling or complex visualizations, we need a good understanding of the nature of our datasets, with pandas as our key package.
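
The kinds of questions listed above can be sketched in a few lines. This is a minimal illustration, assuming a small made-up table; the column names `height_cm` and `weight_kg` and their values are invented for the example and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 58, 69, 80],
})

# Summary statistics for every numeric column
# (count, mean, std, min, quartiles, max).
summary = toy.describe()

# Pearson correlation between two columns.
corr = toy["height_cm"].corr(toy["weight_kg"])

# Median of a single column.
median_height = toy["height_cm"].median()

print(summary)
print(corr, median_height)
```

`describe()` is a quick first look at a new dataset; the individual methods (`mean`, `median`, `min`, `max`, `corr`) are covered in the exercises that follow.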


## Core components of pandas: Series and DataFrames

The two primary components of pandas are the `Series` and the `DataFrame`. 

A `Series` is essentially a column, and a `DataFrame` is a two-dimensional table made up of a collection of Series. 

![series-and-dataframe.width-1200.png](attachment:series-and-dataframe.width-1200.png)

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

You'll see how these components work when we start working with data below. 
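
As a small sketch of the relationship described above, the example below builds a `Series`, combines two of them into a `DataFrame`, and shows that selecting one column gives back a `Series`. The values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# A Series: one labelled column of values.
s = pd.Series([4, 5, 6], index=[1, 2, 3], name="a")

# A DataFrame: several Series sharing the same index.
frame = pd.DataFrame({"a": s, "b": pd.Series([7, 8, 9], index=[1, 2, 3])})

# Selecting one column gives back a Series.
col = frame["a"]
print(type(col).__name__)   # Series

# The same method works on both objects.
print(s.mean())             # mean of one column
print(frame.mean())         # mean of every column
```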

In [2]:
import pandas as pd
import numpy as np

## Tidy Data - A foundation for wrangling with pandas
In a tidy data set, each variable is saved in its own column and each observation is saved in its own row.

![Tidy data](images/tidy_data.png)

## Creating data frames
Our aim is to create a data frame with 3 __columns__ and 3 __rows__

![Dataframe1](images/df_1.png)

Below is the syntax to create the above data frame columnwise.

In [3]:
df = pd.DataFrame({"a":[4,5,6],
                 "b":[7,8,9],
                 "c": [10,11,12]},
                 index=[1,2,3])
df

Unnamed: 0,a,b,c
1,4,7,10
2,5,8,11
3,6,9,12


## Exercise 3.1
Find the following for the variables a, b and c
+ Mean for a. Hint: you can use df.mean(), which gives the means of all columns
+ Median for b. You can use df.median(), which gives the medians of all columns
+ Cumulative sum for c. You can use df.cumsum(), which gives the cumulative sums of all columns

In [6]:
df.mean()

a     5.0
b     8.0
c    11.0
dtype: float64

In [10]:
df["a"].mean()

5.0

In [11]:
df.median()
df["b"].median()

8.0

In [12]:
df.cumsum()
df["c"].cumsum()

1    10
2    21
3    33
Name: c, dtype: int64

## How can I get help??
We can use the `help` function, or add a question mark to the end of the function name (this works in IPython and Jupyter).

In [13]:
help(pd.DataFrame)

Help on class DataFrame in module pandas.core.frame:

class DataFrame(pandas.core.generic.NDFrame)
 |  DataFrame(data=None, index: Union[Collection, NoneType] = None, columns: Union[Collection, NoneType] = None, dtype: Union[str, numpy.dtype, ForwardRef('ExtensionDtype'), NoneType] = None, copy: bool = False)
 |  
 |  Two-dimensional, size-mutable, potentially heterogeneous tabular data.
 |  
 |  Data structure also contains labeled axes (rows and columns).
 |  Arithmetic operations align on both row and column labels. Can be
 |  thought of as a dict-like container for Series objects. The primary
 |  pandas data structure.
 |  
 |  Parameters
 |  ----------
 |  data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects.
 |  
 |      .. versionchanged:: 0.23.0
 |         If data is a dict, column order follows insertion-order for
 |         Python 3.6 and later.
 |  
 |      .. versionchanged:: 0.25.0

In [15]:
import statistics
statistics.mean?

Below is the syntax to create the above data frame rowwise.

In [5]:
df = pd.DataFrame(
    [[4, 7, 10],
    [5, 8, 11],
    [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
1,4,7,10
2,5,8,11
3,6,9,12


## Create DataFrame with a MultiIndex
A MultiIndex is an index with multiple levels, so each row is identified by more than one label.

![MultiIndex](images/df_2.png)

In [7]:
df_multi = pd.DataFrame(
    {"a" : [4 ,5, 6],
     "b" : [7, 8, 9], 
     "c" : [10, 11, 12]}, 
    index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2)],
        names=['n','v']))
df_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c
n,v,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
d,1,4,7,10
d,2,5,8,11
e,2,6,9,12
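
Once a MultiIndex is in place, rows can be selected by one level or by both. The sketch below rebuilds the same table under the name `multi` (a name chosen here for illustration) and shows three common selections.

```python
import pandas as pd

multi = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2)], names=["n", "v"]))

# All rows whose outer level 'n' equals 'd'.
rows_d = multi.loc["d"]

# One row, identified by both levels.
row_d2 = multi.loc[("d", 2)]

# All rows whose inner level 'v' equals 2, whatever 'n' is.
rows_v2 = multi.xs(2, level="v")

print(rows_d)
print(row_d2)
print(rows_v2)
```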


## Reshaping Data – Change the layout of a data set
Reshaping changes the layout of the data between wide and long format. _pd.melt_ gathers columns into rows (long format), and _df.pivot_ spreads rows into columns (wide format).

![Reshaping of DataFrames](images/df_melt_pivot.png)
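
A round trip makes the two operations concrete. The sketch below uses a tiny made-up table (the names `wide`, `long_df`, `time` and `score` are invented for illustration), melts it to long format and pivots it back.

```python
import pandas as pd

wide = pd.DataFrame({"id": ["x", "y"], "t1": [1, 2], "t2": [3, 4]})

# Wide -> long: one row per (id, time) pair.
long_df = pd.melt(wide, id_vars="id", var_name="time", value_name="score")

# Long -> wide: spread the 'time' values back into columns.
back = long_df.pivot(index="id", columns="time", values="score")

# pivot leaves 'id' as the index and names the column axis 'time';
# the next two lines restore the original flat layout.
back = back.reset_index()
back.columns.name = None

print(long_df)
print(back)
```

Note that `pivot` needs each (index, columns) pair to appear only once in the long table; otherwise it raises an error.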

In [16]:
df

Unnamed: 0,a,b,c
1,4,7,10
2,5,8,11
3,6,9,12


In [27]:
# Gather columns into rows using the df DataFrame we created
df_long = pd.melt(df)
df_long

Unnamed: 0,variable,value
0,a,4
1,a,5
2,a,6
3,b,7
4,b,8
5,b,9
6,c,10
7,c,11
8,c,12


In [29]:
df1 = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                    'B': {0: 1, 1: 3, 2: 5},
                    'C': {0: 2, 1: 4, 2: 6}})
df1

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [31]:
pd.melt(df1, id_vars=['A'], value_vars=['B', "C"])

Unnamed: 0,A,variable,value
0,a,B,1
1,b,B,3
2,c,B,5
3,a,C,2
4,b,C,4
5,c,C,6


## Exercise 3.2
Given the following data on four patients' CD4 counts for 4 visits at a local health facility:
+ **Simon**: 78, 98, 157, 570
+ **Charity**: 567, 324, 679, 721
+ **Mary**: 100, 320, 256, 432
+ **Joel**: 900, 856, 567, 432

+ a) Make a dataframe called **dfcd4_wide** in wide format with the variables name (index), CD4_visit1, ..., CD4_visit4

+ b) Make another dataframe called **dfcd4_long** in long format

+ c) Repeat a) and b) using a slightly different approach 

+ d) Use help(pd.pivot) to see how to change data from long back to wide format, and do it for the example above

In [54]:
dfcd4_wide = pd.DataFrame({"CD4_visit1":[78,567,100,900],
                 "CD4_visit2":[98,324, 320, 856],
                 "CD4_visit3": [157, 679, 256, 567],
                "CD4_visit4":[570, 721, 432, 432]},
                 index=["Simon", "Charity", "Mary", "Joel"])
dfcd4_wide

Unnamed: 0,CD4_visit1,CD4_visit2,CD4_visit3,CD4_visit4
Simon,78,98,157,570
Charity,567,324,679,721
Mary,100,320,256,432
Joel,900,856,567,432


In [56]:
dfcd4_w2 = pd.DataFrame({'name': {0: 'Simon', 1: 'Charity', 2: 'Mary', 3:'Joel'},
                    "CD4_visit1":{0: 78,1: 567,2: 100,3: 900},
                 "CD4_visit2":{0: 98,1: 324, 2: 320, 3: 856},
                 "CD4_visit3": {0: 157,1: 679, 2: 256, 3: 567},
                "CD4_visit4":{0: 570, 1: 721, 2: 432, 3: 432}})
dfcd4_w2

Unnamed: 0,name,CD4_visit1,CD4_visit2,CD4_visit3,CD4_visit4
0,Simon,78,98,157,570
1,Charity,567,324,679,721
2,Mary,100,320,256,432
3,Joel,900,856,567,432


In [71]:
dfcd4_long=pd.melt(dfcd4_w2, id_vars="name", var_name="visit", value_name="CD4_count")
dfcd4_long

Unnamed: 0,name,visit,CD4_count
0,Simon,CD4_visit1,78
1,Charity,CD4_visit1,567
2,Mary,CD4_visit1,100
3,Joel,CD4_visit1,900
4,Simon,CD4_visit2,98
5,Charity,CD4_visit2,324
6,Mary,CD4_visit2,320
7,Joel,CD4_visit2,856
8,Simon,CD4_visit3,157
9,Charity,CD4_visit3,679


In [85]:
df_wide_cd4=dfcd4_long.pivot(index="name", columns="visit", values="CD4_count")
df_wide_cd4

visit,CD4_visit1,CD4_visit2,CD4_visit3,CD4_visit4
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Charity,567,324,679,721
Joel,900,856,567,432
Mary,100,320,256,432
Simon,78,98,157,570


In [111]:
#help(pd.melt)
# Column-wise summaries (axis=0). A notebook cell displays only its last expression.
df_wide_cd4.mean(axis=0)
df_wide_cd4.max(axis=0)
df_wide_cd4.median(axis=0)
df_wide_cd4.sum(axis=0)
df_wide_cd4.min(axis=0)

visit
CD4_visit1     78
CD4_visit2     98
CD4_visit3    157
CD4_visit4    432
dtype: int64
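
The `axis` argument decides which direction is collapsed. The sketch below uses a small made-up table (patients `p1` and `p2` are hypothetical) to show both directions, plus a summary of a single row.

```python
import pandas as pd

scores = pd.DataFrame(
    {"visit1": [10, 20], "visit2": [30, 40]},
    index=["p1", "p2"])   # hypothetical patients

# axis=0 collapses the rows: one result per column.
per_column = scores.mean(axis=0)

# axis=1 collapses the columns: one result per row.
per_row = scores.mean(axis=1)

# A single row can be picked by label with .loc and then summarised.
p1_mean = scores.loc["p1"].mean()

print(per_column)
print(per_row)
print(p1_mean)
```

In a notebook only the last expression of a cell is displayed automatically, so `print` (or `display`) is needed to see several results from one cell.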

In [108]:
# Mean CD4 count for each patient (row-wise)
df_wide_cd4.mean(axis=1)
# Mean CD4 count for Charity only: select her row by label, then average
df_wide_cd4.loc["Charity"].mean()

572.75

In [10]:
### Recoding variables: string to numeric for binary variables
# 'col1' is a placeholder column name; df as defined above has no such column
df['col1'] = np.where(df['col1'] == "Yes", 1, 0)

variable,a,b,c
0,4.0,,
1,5.0,,
2,6.0,,
3,,7.0,
4,,8.0,
5,,9.0,
6,,,10.0
7,,,11.0
8,,,12.0
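
Recoding a Yes/No column with `np.where` can be sketched on a small made-up table. The `answers` DataFrame and its `smoker` column are invented for illustration; `Series.map` is shown as an alternative, not as the method used in this course.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers, invented for illustration.
answers = pd.DataFrame({"smoker": ["Yes", "No", "Yes", "No"]})

# np.where(condition, value_if_true, value_if_false)
answers["smoker_bin"] = np.where(answers["smoker"] == "Yes", 1, 0)

# Series.map with a dictionary is an alternative;
# values that are not in the dictionary become NaN.
answers["smoker_map"] = answers["smoker"].map({"Yes": 1, "No": 0})

print(answers)
```

With `np.where`, anything that is not exactly `"Yes"` (including missing values or `"yes"` in lower case) becomes 0, so it is worth checking the distinct values of the column first.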


### Ordering/Sorting DataFrame
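This involves ordering rows by the values of a column (low to high or high to low). The `df_wide` DataFrame used below has the same _variable_ and _value_ columns as `df_long` above.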

In [11]:
#Order rows by values of a column (low to high).
df_wide.sort_values('value')

Unnamed: 0,variable,value
0,a,4
1,a,5
2,a,6
3,b,7
4,b,8
5,b,9
6,c,10
7,c,11
8,c,12


In [12]:
#Order rows by values of a column (high to low).
df_wide.sort_values('value',ascending=False) 

Unnamed: 0,variable,value
8,c,12
7,c,11
6,c,10
5,b,9
4,b,8
3,b,7
2,a,6
1,a,5
0,a,4
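
Sorting can also use more than one column. The sketch below, on a small made-up table, sorts by `variable` first and then by `value` in the opposite direction.

```python
import pandas as pd

long_df = pd.DataFrame({
    "variable": ["b", "a", "b", "a"],
    "value": [7, 5, 9, 4],
})

# Sort by 'variable' ascending, then by 'value' descending
# within each variable.
ordered = long_df.sort_values(["variable", "value"], ascending=[True, False])

print(ordered)
```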


### Rename the columns of a DataFrame
Here we rename the column _variable_ to _var_ in our _'df_wide'_ DataFrame

In [13]:
df_wide.rename(columns = {'variable':'var'})

Unnamed: 0,var,value
0,a,4
1,a,5
2,a,6
3,b,7
4,b,8
5,b,9
6,c,10
7,c,11
8,c,12
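
One detail is easy to miss: `rename`, like most DataFrame methods in this notebook, returns a new DataFrame and leaves the original unchanged. A minimal sketch on made-up data:

```python
import pandas as pd

long_df = pd.DataFrame({"variable": ["a", "b"], "value": [4, 7]})

# rename returns a new DataFrame; long_df itself is unchanged.
renamed = long_df.rename(columns={"variable": "var"})

print(list(long_df.columns))   # ['variable', 'value']
print(list(renamed.columns))   # ['var', 'value']
```

To keep the change, assign the result to a name, as done with `renamed` here.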


### Sort the index of a DataFrame

In [14]:
df_wide.sort_index()

Unnamed: 0,variable,value
0,a,4
1,a,5
2,a,6
3,b,7
4,b,8
5,b,9
6,c,10
7,c,11
8,c,12


### Moving index to columns
Reset index of DataFrame to row numbers, moving index to columns.

In [15]:
df_wide.reset_index()

Unnamed: 0,index,variable,value
0,0,a,4
1,1,a,5
2,2,a,6
3,3,b,7
4,4,b,8
5,5,b,9
6,6,c,10
7,7,c,11
8,8,c,12
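
Two related operations are worth knowing: dropping the old index instead of keeping it, and moving a column into the index. The sketch below uses a small made-up table with a non-default index.

```python
import pandas as pd

long_df = pd.DataFrame({"variable": ["a", "b", "c"], "value": [4, 7, 10]},
                       index=[10, 20, 30])

# Keep the old index as a column called 'index'.
kept = long_df.reset_index()

# Discard the old index instead of keeping it.
dropped = long_df.reset_index(drop=True)

# The reverse operation: move a column into the index.
indexed = long_df.set_index("variable")

print(kept)
print(dropped)
print(indexed)
```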


### Deleting or dropping columns from a DataFrame
Here we drop the variable _value_ from our _'df_wide'_ DataFrame.

In [16]:
df_wide.drop(columns=['value'])

Unnamed: 0,variable
0,a
1,a
2,a
3,b
4,b
5,b
6,c
7,c
8,c


## Subset Observations
### Subsetting Rows
Subsetting means selecting part of a DataFrame, either by rows or by columns.

![Subset DataFrames](images/df_subset.png)

For example, to select only the rows with a _value_ greater than 9 from our _'df_wide'_ DataFrame:

In [17]:
## select only rows that have a value greater than 9 
df_wide[df_wide.value > 9]

Unnamed: 0,variable,value
6,c,10
7,c,11
8,c,12


The command above uses the _>_ comparison operator. Below are some of the most common logical and comparison operators in Python (and pandas).

![Subset DataFrames](images/df_logic.png)
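
Conditions can be combined. In pandas the element-wise operators are `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses. A minimal sketch on made-up data:

```python
import pandas as pd

long_df = pd.DataFrame({
    "variable": ["a", "a", "b", "b", "c", "c"],
    "value": [4, 5, 7, 9, 10, 12],
})

# AND: both conditions must hold.
both = long_df[(long_df["value"] > 4) & (long_df["value"] < 10)]

# OR: at least one condition holds.
either = long_df[(long_df["variable"] == "a") | (long_df["value"] >= 12)]

# NOT: invert a condition with ~.
not_a = long_df[~(long_df["variable"] == "a")]

# Membership test against a list of allowed values.
in_ab = long_df[long_df["variable"].isin(["a", "b"])]

print(both)
print(either)
print(not_a)
print(in_ab)
```

The Python keywords `and`, `or` and `not` do not work element-wise on a Series, which is why the symbols are used instead.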

In [18]:
## Select first n rows.
## Here we select the first 3 rows from the df_wide dataset
df_wide.head(3)

Unnamed: 0,variable,value
0,a,4
1,a,5
2,a,6


In [19]:
## Select last n rows.
## Here we select the last 3 rows from the df_wide dataset
df_wide.tail(3)

Unnamed: 0,variable,value
6,c,10
7,c,11
8,c,12


In [20]:
## Select rows by position.
## Here the slice 4:7 selects the rows at positions 4 to 6
## note that position 7 is excluded
df_wide.iloc[4:7]

Unnamed: 0,variable,value
4,b,8
5,b,9
6,c,10


In [21]:
## Select and order top n entries.
## here we select the 3 largest _value_ entries, ordered from largest to smallest
df_wide.nlargest(3, 'value')

Unnamed: 0,variable,value
8,c,12
7,c,11
6,c,10


In [22]:
#Select and order bottom n entries.
## here we select the 3 smallest _value_ and order them ascending
df_wide.nsmallest(3, 'value')

Unnamed: 0,variable,value
0,a,4
1,a,5
2,a,6


In [23]:
#help(df_wide.nsmallest)

#### Randomly subsetting the DataFrame rows

In [24]:
#Randomly select fraction of rows. 
df_wide.sample(frac=0.5)

Unnamed: 0,variable,value
0,a,4
7,c,11
2,a,6
5,b,9


In [25]:
#Randomly select 3 rows.
df_wide.sample(n=3)

Unnamed: 0,variable,value
5,b,9
6,c,10
2,a,6
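
Because `sample` is random, the rows shown above will differ from run to run. If a reproducible sample is needed, the `random_state` argument fixes the seed. A minimal sketch on made-up data (the seed 42 is an arbitrary choice):

```python
import pandas as pd

long_df = pd.DataFrame({"value": range(10)})

# The same random_state gives the same rows on every run.
first = long_df.sample(n=3, random_state=42)
second = long_df.sample(n=3, random_state=42)

print(first)
print(first.equals(second))
```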


### Subsetting Columns
Select multiple columns by name, or select columns whose names match a pattern using a regular expression (regex) - https://www.w3schools.com/python/python_regex.asp

![Regex](images/df_regex.png)

In [26]:
## Selecting two variables 
df[['a','c']]

Unnamed: 0,a,c
1,4,10
2,5,11
3,6,12


In [27]:
## selecting  a single variable/column with a specific name
df['a']  
df.a

1    4
2    5
3    6
Name: a, dtype: int64

In [28]:
#Select columns whose name matches regular expression regex.
df.filter(regex='c')

Unnamed: 0,c
1,10
2,11
3,12
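
With single-letter column names the regex does little work, so here is a sketch with longer names. The `visits` table is made up for illustration (the `weight` value in particular is invented); it shows anchoring a pattern at the start or end of a name, and the simpler `like` argument.

```python
import pandas as pd

# Hypothetical one-row table, for illustration only.
visits = pd.DataFrame({
    "name": ["Simon"],
    "CD4_visit1": [78],
    "CD4_visit2": [98],
    "weight": [80],
})

# Columns whose name starts with 'CD4' (^ anchors the start).
cd4_cols = visits.filter(regex=r"^CD4")

# Columns whose name ends with a digit ($ anchors the end).
numbered = visits.filter(regex=r"\d$")

# Columns whose name contains a plain substring, no regex needed.
has_visit = visits.filter(like="visit")

print(list(cd4_cols.columns))
print(list(numbered.columns))
print(list(has_visit.columns))
```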


In [29]:
#Select all columns between a and b (inclusive)
df.loc[:,'a':'b']


Unnamed: 0,a,b
1,4,7
2,5,8
3,6,9


In [30]:
#Select the first and third columns (positions 0 and 2; the first column is position 0).
df.iloc[:,[0,2]]

Unnamed: 0,a,c
1,4,10
2,5,11
3,6,12


In [31]:
#Select rows meeting logical condition, and only the specific columns .
df.loc[df['c'] > 10, ['a','c']]

Unnamed: 0,a,c
2,5,11
3,6,12


## Solved Exercise 
Create and save two DataFrames with 10 (id=1,2,3,4,5,6,7,8,9,10) individuals: __'df_personal'__ with _'id, weight, gender, age, province, income, insurance'_ and __'df_medical'__ with _'id, sbp, dbp, saltadd'_

In [32]:
df_personal =  pd.DataFrame({"id":[1,2,3,4,5,6,7,8,9,10],
                 "age":[65,50,55,45,25,45,35,70,20,30],
                 "gender": ['male', 'female', 'male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
                  "weight":[80,65,60,80,95,86,97,58,59,110],
                "province":['FS','GP','KZN','LP','MP','NC','NW','WC','GP','FS'],
                "income":[65000,50000,55000,45000,25000,45000,35000,70000,20000,30000],
                "insurance":[1,0,1,0,1,0,0,1,1,1]})

df_personal.set_index('id')

Unnamed: 0_level_0,age,gender,weight,province,income,insurance
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,65,male,80,FS,65000,1
2,50,female,65,GP,50000,0
3,55,male,60,KZN,55000,1
4,45,female,80,LP,45000,0
5,25,female,95,MP,25000,1
6,45,male,86,NC,45000,0
7,35,female,97,NW,35000,0
8,70,male,58,WC,70000,1
9,20,male,59,GP,20000,1
10,30,female,110,FS,30000,1


In [33]:
df_medical =  pd.DataFrame({"id":[1,2,3,4,5,6,7,8,9,10,11],
                 "sbp":[110, 85, 167, None, 180, 112, 110,None , 171, 133,None],
                 "dbp": [80, 55, 112, None, 120, 78, 70, None, 102, 75,88],
                  "saltadd":['yes', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'yes','no']})

df_medical.set_index('id')

Unnamed: 0_level_0,sbp,dbp,saltadd
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,110.0,80.0,yes
2,85.0,55.0,no
3,167.0,112.0,yes
4,,,yes
5,180.0,120.0,no
6,112.0,78.0,no
7,110.0,70.0,no
8,,,yes
9,171.0,102.0,yes
10,133.0,75.0,yes


In [34]:
## saving the DataFrame to CSV
df_medical.to_csv('df_medical.csv',index=True)
df_personal.to_csv('df_personal.csv',index=True)

__How can you read the CSV files?__
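
One way to answer the question above is `pd.read_csv`. The sketch below is self-contained: it writes a small made-up table (named `small` here, not part of the exercise) to a temporary folder and reads it back, once without and once with the index saved.

```python
import os
import tempfile

import pandas as pd

small = pd.DataFrame({"id": [1, 2, 3], "sbp": [110.0, None, 167.0]})

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "small.csv")

    # index=False leaves the row labels out of the file.
    small.to_csv(path, index=False)

    # read_csv rebuilds the DataFrame; empty cells come back as NaN.
    loaded = pd.read_csv(path)

    # If the file was written with index=True, index_col=0 reads the
    # first column back as the index instead of as 'Unnamed: 0'.
    small.to_csv(path, index=True)
    loaded_with_index = pd.read_csv(path, index_col=0)

print(loaded)
print(loaded_with_index)
```

For the files saved above, which were written with `index=True`, the equivalent call would be `pd.read_csv('df_medical.csv', index_col=0)`.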

## Handling Missing Data
The _dropna()_ function drops every row that has NA/null data in any column, and _fillna()_ replaces all NA/null data with a given value.

In [35]:
df_medical.dropna()

Unnamed: 0,id,sbp,dbp,saltadd
0,1,110.0,80.0,yes
1,2,85.0,55.0,no
2,3,167.0,112.0,yes
4,5,180.0,120.0,no
5,6,112.0,78.0,no
6,7,110.0,70.0,no
8,9,171.0,102.0,yes
9,10,133.0,75.0,yes


In [36]:
# Replace all NA/null data with a value. Here missing sbp and dbp become 0, for the sake of the exercise only
df_medical.fillna(0)

Unnamed: 0,id,sbp,dbp,saltadd
0,1,110.0,80.0,yes
1,2,85.0,55.0,no
2,3,167.0,112.0,yes
3,4,0.0,0.0,yes
4,5,180.0,120.0,no
5,6,112.0,78.0,no
6,7,110.0,70.0,no
7,8,0.0,0.0,yes
8,9,171.0,102.0,yes
9,10,133.0,75.0,yes


In [37]:
## replacing missing data in specific columns with specific values 
#here we replace missing sbp and dbp with the average of 120/80 respectively
df_medical.fillna(value={'sbp': 120, 'dbp': 80})

Unnamed: 0,id,sbp,dbp,saltadd
0,1,110.0,80.0,yes
1,2,85.0,55.0,no
2,3,167.0,112.0,yes
3,4,120.0,80.0,yes
4,5,180.0,120.0,no
5,6,112.0,78.0,no
6,7,110.0,70.0,no
7,8,120.0,80.0,yes
8,9,171.0,102.0,yes
9,10,133.0,75.0,yes
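
Before dropping or filling anything, it helps to count what is missing. The sketch below uses a small made-up table (`bp`, invented for illustration) to count missing values, drop rows based on a single column, and fill with column means. Filling with the mean is only one possible choice, shown as an example.

```python
import pandas as pd

bp = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sbp": [110.0, None, 167.0, None],
    "dbp": [80.0, None, 112.0, 88.0],
})

# Count the missing values in each column.
missing_per_column = bp.isna().sum()

# Drop a row only when 'dbp' is missing; other columns are ignored.
has_dbp = bp.dropna(subset=["dbp"])

# Fill each column with its own mean (one of several possible choices).
filled = bp.fillna({"sbp": bp["sbp"].mean(), "dbp": bp["dbp"].mean()})

print(missing_per_column)
print(has_dbp)
print(filled)
```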


## Make New Columns
This adds new columns to the data. We compute and add a column _'sd_bp_ratio'_, which is $sbp/dbp$.

![Adding new column](images/df_new_column.png)

In [38]:
##Compute and append one or more new columns.
df_medical.assign(sd_bp_ratio=lambda df_medical: df_medical.sbp/df_medical.dbp)

Unnamed: 0,id,sbp,dbp,saltadd,sd_bp_ratio
0,1,110.0,80.0,yes,1.375
1,2,85.0,55.0,no,1.545455
2,3,167.0,112.0,yes,1.491071
3,4,,,yes,
4,5,180.0,120.0,no,1.5
5,6,112.0,78.0,no,1.435897
6,7,110.0,70.0,no,1.571429
7,8,,,yes,
8,9,171.0,102.0,yes,1.676471
9,10,133.0,75.0,yes,1.773333


In [39]:
#Add single column.
df_medical['sd_bp_ratio'] = df_medical.sbp/df_medical.dbp
df_medical

Unnamed: 0,id,sbp,dbp,saltadd,sd_bp_ratio
0,1,110.0,80.0,yes,1.375
1,2,85.0,55.0,no,1.545455
2,3,167.0,112.0,yes,1.491071
3,4,,,yes,
4,5,180.0,120.0,no,1.5
5,6,112.0,78.0,no,1.435897
6,7,110.0,70.0,no,1.571429
7,8,,,yes,
8,9,171.0,102.0,yes,1.676471
9,10,133.0,75.0,yes,1.773333


## Combine/Merge Data Sets
This involves joining multiple DataFrames into a single DataFrame. In this example, we are going to merge the personal and the medical data.
![Merge](images/df_merge.png)

### Join matching rows from _'df_personal'_ to _'df_medical'_ .
You specify the two DataFrames to be joined and, __IMPORTANTLY__, the _on_ column. This is the column that uniquely identifies the rows in both DataFrames; to use _on_, it must have the same name in both.
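
When the key column has a different name in each DataFrame, `left_on` and `right_on` can be used instead of `on`. The sketch below uses two small made-up tables (`people`, `readings` and the column `patient_id` are invented for illustration) and also shows the optional `validate` check.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "age": [65, 50, 55]})
readings = pd.DataFrame({"patient_id": [1, 2, 4], "sbp": [110, 85, 133]})

# The key columns have different names, so name each side explicitly.
merged = pd.merge(people, readings, how="inner",
                  left_on="id", right_on="patient_id")

# validate raises a MergeError if a key is repeated on either side.
checked = pd.merge(people, readings, how="inner",
                   left_on="id", right_on="patient_id",
                   validate="one_to_one")

print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.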

In [40]:
## overview personal DF
print(df_personal.head(5))

   id  age  gender  weight province  income  insurance
0   1   65    male      80       FS   65000          1
1   2   50  female      65       GP   50000          0
2   3   55    male      60      KZN   55000          1
3   4   45  female      80       LP   45000          0
4   5   25  female      95       MP   25000          1


In [41]:
### overview medical DF
print(df_medical.head(11))

    id    sbp    dbp saltadd  sd_bp_ratio
0    1  110.0   80.0     yes     1.375000
1    2   85.0   55.0      no     1.545455
2    3  167.0  112.0     yes     1.491071
3    4    NaN    NaN     yes          NaN
4    5  180.0  120.0      no     1.500000
5    6  112.0   78.0      no     1.435897
6    7  110.0   70.0      no     1.571429
7    8    NaN    NaN     yes          NaN
8    9  171.0  102.0     yes     1.676471
9   10  133.0   75.0     yes     1.773333
10  11    NaN   88.0      no          NaN


A left join keeps every row of the left DataFrame (_'df_personal'_) and adds the matching columns from the right DataFrame (_'df_medical'_).

![Left join src:sqltutorial](images/left_join.png)

In [42]:
pd.merge(df_personal, df_medical,how='left', on='id')

Unnamed: 0,id,age,gender,weight,province,income,insurance,sbp,dbp,saltadd,sd_bp_ratio
0,1,65,male,80,FS,65000,1,110.0,80.0,yes,1.375
1,2,50,female,65,GP,50000,0,85.0,55.0,no,1.545455
2,3,55,male,60,KZN,55000,1,167.0,112.0,yes,1.491071
3,4,45,female,80,LP,45000,0,,,yes,
4,5,25,female,95,MP,25000,1,180.0,120.0,no,1.5
5,6,45,male,86,NC,45000,0,112.0,78.0,no,1.435897
6,7,35,female,97,NW,35000,0,110.0,70.0,no,1.571429
7,8,70,male,58,WC,70000,1,,,yes,
8,9,20,male,59,GP,20000,1,171.0,102.0,yes,1.676471
9,10,30,female,110,FS,30000,1,133.0,75.0,yes,1.773333


### Join matching rows from _'df_medical'_  to _'df_personal'_
Notice that _id_ 11 will be available in the merge below

In [43]:
pd.merge(df_personal, df_medical,how='right', on='id')

Unnamed: 0,id,age,gender,weight,province,income,insurance,sbp,dbp,saltadd,sd_bp_ratio
0,1,65.0,male,80.0,FS,65000.0,1.0,110.0,80.0,yes,1.375
1,2,50.0,female,65.0,GP,50000.0,0.0,85.0,55.0,no,1.545455
2,3,55.0,male,60.0,KZN,55000.0,1.0,167.0,112.0,yes,1.491071
3,4,45.0,female,80.0,LP,45000.0,0.0,,,yes,
4,5,25.0,female,95.0,MP,25000.0,1.0,180.0,120.0,no,1.5
5,6,45.0,male,86.0,NC,45000.0,0.0,112.0,78.0,no,1.435897
6,7,35.0,female,97.0,NW,35000.0,0.0,110.0,70.0,no,1.571429
7,8,70.0,male,58.0,WC,70000.0,1.0,,,yes,
8,9,20.0,male,59.0,GP,20000.0,1.0,171.0,102.0,yes,1.676471
9,10,30.0,female,110.0,FS,30000.0,1.0,133.0,75.0,yes,1.773333


### Join data. Retain only rows in both sets.

![Inner join src:sqltutorial.org](images/inner_join.png)

In [44]:
pd.merge(df_personal, df_medical,how='inner', on='id')

Unnamed: 0,id,age,gender,weight,province,income,insurance,sbp,dbp,saltadd,sd_bp_ratio
0,1,65,male,80,FS,65000,1,110.0,80.0,yes,1.375
1,2,50,female,65,GP,50000,0,85.0,55.0,no,1.545455
2,3,55,male,60,KZN,55000,1,167.0,112.0,yes,1.491071
3,4,45,female,80,LP,45000,0,,,yes,
4,5,25,female,95,MP,25000,1,180.0,120.0,no,1.5
5,6,45,male,86,NC,45000,0,112.0,78.0,no,1.435897
6,7,35,female,97,NW,35000,0,110.0,70.0,no,1.571429
7,8,70,male,58,WC,70000,1,,,yes,
8,9,20,male,59,GP,20000,1,171.0,102.0,yes,1.676471
9,10,30,female,110,FS,30000,1,133.0,75.0,yes,1.773333


### Join data. Retain all values, all rows.
![Full join src:sqltutorial.org](images/full_join.png)

In [45]:
pd.merge(df_personal, df_medical,how='outer', on='id')

Unnamed: 0,id,age,gender,weight,province,income,insurance,sbp,dbp,saltadd,sd_bp_ratio
0,1,65.0,male,80.0,FS,65000.0,1.0,110.0,80.0,yes,1.375
1,2,50.0,female,65.0,GP,50000.0,0.0,85.0,55.0,no,1.545455
2,3,55.0,male,60.0,KZN,55000.0,1.0,167.0,112.0,yes,1.491071
3,4,45.0,female,80.0,LP,45000.0,0.0,,,yes,
4,5,25.0,female,95.0,MP,25000.0,1.0,180.0,120.0,no,1.5
5,6,45.0,male,86.0,NC,45000.0,0.0,112.0,78.0,no,1.435897
6,7,35.0,female,97.0,NW,35000.0,0.0,110.0,70.0,no,1.571429
7,8,70.0,male,58.0,WC,70000.0,1.0,,,yes,
8,9,20.0,male,59.0,GP,20000.0,1.0,171.0,102.0,yes,1.676471
9,10,30.0,female,110.0,FS,30000.0,1.0,133.0,75.0,yes,1.773333
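
To see where each row of an outer join came from, `pd.merge` accepts `indicator=True`. The sketch below uses two small made-up tables (`left` and `right`, invented for illustration) rather than the exercise data.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "age": [65, 50, 55]})
right = pd.DataFrame({"id": [2, 3, 4], "sbp": [85, 167, 133]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'.
outer = pd.merge(left, right, how="outer", on="id", indicator=True)

# Rows that exist only in the right-hand table.
right_only = outer[outer["_merge"] == "right_only"]

print(outer)
print(right_only)
```

Columns that have no match on one side are filled with NaN, which is why integer columns such as `age` are shown as floats after an outer or right join.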


### Filtering Joins
A filtering join keeps the rows of _'df_personal'_ that have a matching _id_ in _'df_medical'_, so only individuals with medical data are displayed.

In [46]:
df_personal[df_personal.id.isin(df_medical.id)]

Unnamed: 0,id,age,gender,weight,province,income,insurance
0,1,65,male,80,FS,65000,1
1,2,50,female,65,GP,50000,0
2,3,55,male,60,KZN,55000,1
3,4,45,female,80,LP,45000,0
4,5,25,female,95,MP,25000,1
5,6,45,male,86,NC,45000,0
6,7,35,female,97,NW,35000,0
7,8,70,male,58,WC,70000,1
8,9,20,male,59,GP,20000,1
9,10,30,female,110,FS,30000,1


Here we check whether any individual (_id_) has personal data but is missing from the medical data. We get an empty DataFrame, since every individual (_id_) has medical data available.

In [47]:
df_personal[~df_personal.id.isin(df_medical.id)]

Unnamed: 0,id,age,gender,weight,province,income,insurance


__Exercise: Try filtering the individuals who have medical data but are missing personal data__
