# Fictitious Names

### Introduction:

This time you will create the data yourself again.

Special thanks to [Chris Albon](http://chrisalbon.com/) for sharing the dataset and materials.
All credit for this exercise belongs to him.

For a visual explanation of the join types used in this exercise, see [A Visual Explanation of SQL Joins](https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/).

### Step 1. Import the necessary libraries

In [51]:
import pandas as pd

### Step 2. Create the 3 DataFrames based on the following raw data

In [52]:
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

### Step 3. Assign each to a variable called data1, data2, data3

In [53]:
data1 = pd.DataFrame(raw_data_1)
data2 = pd.DataFrame(raw_data_2)
data3 = pd.DataFrame(raw_data_3)

In [54]:
data1 


| | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |


In [55]:
data2

| | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
| 3 | 7 | Bryce | Brice |
| 4 | 8 | Betty | Btisan |


In [56]:
data3

| | subject_id | test_id |
|---|---|---|
| 0 | 1 | 51 |
| 1 | 2 | 15 |
| 2 | 3 | 15 |
| 3 | 4 | 61 |
| 4 | 5 | 16 |
| 5 | 7 | 14 |
| 6 | 8 | 15 |
| 7 | 9 | 1 |
| 8 | 10 | 61 |
| 9 | 11 | 16 |


### Step 4. Join the two dataframes along rows and assign to all_data

This process is called "concatenation": DataFrames with a similar structure (such as data1 and data2) can be joined using the function pd.concat().
We assign the result to "all_data".

In [57]:
all_data = pd.concat([data1,data2], axis=0)

- axis=0: This parameter concatenates along the rows, which stacks the DataFrames on top of each other. With axis=1, it would concatenate along the columns, which is not what we want in this case.

After concatenating the raw data into one DataFrame, the combined DataFrame should be re-indexed so that its row labels stay consistent. The reset_index method does this.

In [58]:
all_data.reset_index(drop=True, inplace=True)

- drop=True in reset_index: This parameter controls whether the old index is discarded after resetting. When we concatenate DataFrames, the result retains the original indices of the input DataFrames, which here gives repeated, non-sequential row labels (0 to 4 followed by 0 to 4 again).
- By using reset_index(drop=True), we reset the index to the default integer index (0, 1, 2, ...); setting drop=True ensures the old index is not added as a new column.

By using axis=0 and drop=True, we ensure a clean and orderly concatenated dataframe without duplicated or misaligned indices.
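
As a small aside, the two-step pattern above (concatenate, then reset the index) can usually be collapsed into one call with the `ignore_index=True` argument of `pd.concat`. The sketch below uses two tiny stand-in frames, `top` and `bottom`, which are illustrative names and not part of the exercise, to check that both routes give the same result.

```python
import pandas as pd

# Two small illustrative frames with the same columns (stand-ins for data1 and data2).
top = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
bottom = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Billy", "Brian"]})

# Plain concatenation keeps each frame's original index, so labels repeat.
stacked = pd.concat([top, bottom], axis=0)
print(list(stacked.index))  # [0, 1, 0, 1]

# Option A: reset the index afterwards, discarding the old labels.
reset = stacked.reset_index(drop=True)

# Option B: ask concat to build a fresh 0..n-1 index directly.
ignored = pd.concat([top, bottom], axis=0, ignore_index=True)

print(list(reset.index))      # [0, 1, 2, 3]
print(reset.equals(ignored))  # True
```

Which form to prefer is mostly a matter of taste; `reset_index` is handy when the frame was already concatenated elsewhere.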

In [59]:
#Display the result
all_data

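| | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |
| 5 | 4 | Billy | Bonder |
| 6 | 5 | Brian | Black |
| 7 | 6 | Bran | Balwner |
| 8 | 7 | Bryce | Brice |
| 9 | 8 | Betty | Btisan |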


### Step 5. Join the two dataframes along columns and assign to all_data_col

Here we can see the difference between axis=0 and axis=1:

In [60]:
all_data_col = pd.concat([data1,data2], axis=1)

In [61]:
all_data_col

| | subject_id | first_name | last_name | subject_id | first_name | last_name |
|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
| 3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
| 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |

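One thing worth noticing about column-wise concatenation is that the column labels of both inputs are kept as they are, so the result carries each name twice. A possible way to keep the two halves apart is the `keys` argument of `pd.concat`, which adds an outer column level. The sketch below uses two small stand-in frames (`left` and `right` are illustrative names, not part of the exercise).

```python
import pandas as pd

# Small stand-ins for data1 and data2 with identical column names.
left = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
right = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Billy", "Brian"]})

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([left, right], axis=1)
print(list(side_by_side.columns))
# ['subject_id', 'first_name', 'subject_id', 'first_name']

# keys= labels each input, producing a two-level column index.
labelled = pd.concat([left, right], axis=1, keys=["data1", "data2"])
print(labelled.columns.nlevels)  # 2

# Selecting the outer label returns just that input's columns.
print(labelled["data2"])
```

With duplicated labels, an expression such as `side_by_side["subject_id"]` returns both columns at once, which is why labelling the inputs can be convenient.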

### Step 6. Print data3

In [62]:
print(data3)

  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16


### Step 7. Merge all_data and data3 along the subject_id value

This task asks us to merge the concatenated DataFrame "all_data" with "data3" on a specific column. We can do that using pd.merge(), specifying the key column and the join type (inner, outer, left, right).

Notes:

- on='subject_id': This parameter specifies that the merge is done on the subject_id column.
- how='inner': This parameter specifies the type of join. The options are:
  1. 'inner': Only include rows whose subject_id appears in both dataframes.
  2. 'outer': Include all rows from both dataframes, filling with NaN where there is no match.
  3. 'left': Include all rows from all_data and the matching rows from data3.
  4. 'right': Include all rows from data3 and the matching rows from all_data.

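To make the four join types concrete, the sketch below rebuilds the exercise data and counts the rows each `how` option returns. The variable `row_counts` is introduced here only for illustration. Because subject_id 4 and 5 each appear twice in all_data, each of them contributes two rows whenever it matches.

```python
import pandas as pd

# Rebuild the exercise data so the block runs on its own.
data1 = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5"],
    "first_name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "last_name": ["Anderson", "Ackerman", "Ali", "Aoni", "Atiches"]})
data2 = pd.DataFrame({
    "subject_id": ["4", "5", "6", "7", "8"],
    "first_name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "last_name": ["Bonder", "Black", "Balwner", "Brice", "Btisan"]})
data3 = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5", "7", "8", "9", "10", "11"],
    "test_id": [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

all_data = pd.concat([data1, data2], axis=0).reset_index(drop=True)

# Count the rows produced by each join type.
row_counts = {
    how: len(pd.merge(all_data, data3, on="subject_id", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)
# {'inner': 9, 'outer': 13, 'left': 10, 'right': 12}
```

The inner join drops subject 6 (no test) and subjects 9, 10, 11 (no name); the outer join keeps all of them.
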
In [63]:
merged_data = pd.merge(all_data, data3, on= "subject_id", how= "inner")

In [64]:
merged_data

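| | subject_id | first_name | last_name | test_id |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 51 |
| 1 | 2 | Amy | Ackerman | 15 |
| 2 | 3 | Allen | Ali | 15 |
| 3 | 4 | Alice | Aoni | 61 |
| 4 | 5 | Ayoung | Atiches | 16 |
| 5 | 4 | Billy | Bonder | 61 |
| 6 | 5 | Brian | Black | 16 |
| 7 | 7 | Bryce | Brice | 14 |
| 8 | 8 | Betty | Btisan | 15 |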


In [65]:
merged_data = pd.merge(all_data, data3, on= "subject_id", how= "outer")
merged_data

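| | subject_id | first_name | last_name | test_id |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 51.0 |
| 1 | 10 | NaN | NaN | 61.0 |
| 2 | 11 | NaN | NaN | 16.0 |
| 3 | 2 | Amy | Ackerman | 15.0 |
| 4 | 3 | Allen | Ali | 15.0 |
| 5 | 4 | Alice | Aoni | 61.0 |
| 6 | 4 | Billy | Bonder | 61.0 |
| 7 | 5 | Ayoung | Atiches | 16.0 |
| 8 | 5 | Brian | Black | 16.0 |
| 9 | 6 | Bran | Balwner | NaN |
| 10 | 7 | Bryce | Brice | 14.0 |
| 11 | 8 | Betty | Btisan | 15.0 |
| 12 | 9 | NaN | NaN | 1.0 |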


In [66]:
merged_data = pd.merge(all_data, data3, on= "subject_id", how= "left")
merged_data

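| | subject_id | first_name | last_name | test_id |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 51.0 |
| 1 | 2 | Amy | Ackerman | 15.0 |
| 2 | 3 | Allen | Ali | 15.0 |
| 3 | 4 | Alice | Aoni | 61.0 |
| 4 | 5 | Ayoung | Atiches | 16.0 |
| 5 | 4 | Billy | Bonder | 61.0 |
| 6 | 5 | Brian | Black | 16.0 |
| 7 | 6 | Bran | Balwner | NaN |
| 8 | 7 | Bryce | Brice | 14.0 |
| 9 | 8 | Betty | Btisan | 15.0 |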


In [67]:
merged_data = pd.merge(all_data, data3, on= "subject_id", how= "right")
merged_data

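| | subject_id | first_name | last_name | test_id |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 51 |
| 1 | 2 | Amy | Ackerman | 15 |
| 2 | 3 | Allen | Ali | 15 |
| 3 | 4 | Alice | Aoni | 61 |
| 4 | 4 | Billy | Bonder | 61 |
| 5 | 5 | Ayoung | Atiches | 16 |
| 6 | 5 | Brian | Black | 16 |
| 7 | 7 | Bryce | Brice | 14 |
| 8 | 8 | Betty | Btisan | 15 |
| 9 | 9 | NaN | NaN | 1 |
| 10 | 10 | NaN | NaN | 61 |
| 11 | 11 | NaN | NaN | 16 |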

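When comparing join types, it can help to see where each row of the result came from. One option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column with the values `both`, `left_only`, or `right_only`. The sketch below rebuilds the exercise data and applies it to the outer merge; `source_counts` is an illustrative name, not part of the exercise.

```python
import pandas as pd

# Rebuild the exercise data so the block runs on its own.
data1 = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5"],
    "first_name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "last_name": ["Anderson", "Ackerman", "Ali", "Aoni", "Atiches"]})
data2 = pd.DataFrame({
    "subject_id": ["4", "5", "6", "7", "8"],
    "first_name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "last_name": ["Bonder", "Black", "Balwner", "Brice", "Btisan"]})
data3 = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5", "7", "8", "9", "10", "11"],
    "test_id": [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

all_data = pd.concat([data1, data2], axis=0).reset_index(drop=True)

# indicator=True adds a "_merge" column describing each row's origin.
merged = pd.merge(all_data, data3, on="subject_id", how="outer", indicator=True)

# Count rows by origin.
source_counts = {
    label: int((merged["_merge"] == label).sum())
    for label in ["both", "left_only", "right_only"]
}
print(source_counts)
# {'both': 9, 'left_only': 1, 'right_only': 3}

# Show only the rows that did not find a partner.
print(merged.loc[merged["_merge"] != "both", ["subject_id", "_merge"]])
```

The single `left_only` row is subject 6, who has a name but no test; the three `right_only` rows are subjects 9, 10, and 11, who have a test but no name.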