# Fictitious Names

### Introduction:

This time you will create the data yourself again, then combine the resulting DataFrames with unions and joins.

Special thanks to [Chris Albon](http://chrisalbon.com/) for sharing the dataset and materials.
All credit for this exercise belongs to him.

For background on the join types used in this exercise, see this [visual explanation of SQL joins](https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/).

### Step 1. Import the necessary libraries

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=f985dcc14ce707bf0addf7d16ee4f6ee836fcc93b7b27b7d396a0a3c1c51533e
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
# Functions are reached through the F namespace so that names such as
# sum, min and max do not shadow the Python built-ins.
# Start (or reuse) a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Create the 3 DataFrames based on the following raw data

In [None]:
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

### Step 3. Assign each to a variable called data1, data2, data3

In [None]:
data1 = spark.createDataFrame(zip(*raw_data_1.values()), list(raw_data_1.keys()))
data1.show(5)

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         1|      Alex| Anderson|
|         2|       Amy| Ackerman|
|         3|     Allen|      Ali|
|         4|     Alice|     Aoni|
|         5|    Ayoung|  Atiches|
+----------+----------+---------+



In [None]:
data2 = spark.createDataFrame(zip(*raw_data_2.values()), list(raw_data_2.keys()))
data2.show(5)

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         4|     Billy|   Bonder|
|         5|     Brian|    Black|
|         6|      Bran|  Balwner|
|         7|     Bryce|    Brice|
|         8|     Betty|   Btisan|
+----------+----------+---------+



In [None]:
data3 = spark.createDataFrame(zip(*raw_data_3.values()), list(raw_data_3.keys()))
data3.show(5)

+----------+-------+
|subject_id|test_id|
+----------+-------+
|         1|     51|
|         2|     15|
|         3|     15|
|         4|     61|
|         5|     16|
+----------+-------+
only showing top 5 rows
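The `zip(*raw_data.values())` idiom used above turns a dict of column lists into row tuples, which is the shape `createDataFrame` expects. A minimal plain-Python sketch of that transposition, using `raw_data_1` from Step 2:

```python
raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
}

# Unpacking the column lists into zip() yields one tuple per row.
rows = list(zip(*raw_data_1.values()))
# The dict keys, in insertion order, become the column names.
columns = list(raw_data_1.keys())

print(columns)
print(rows[0])
```

Because the `subject_id` values are Python strings, Spark infers a string column for them, while the integer `test_id` values in `raw_data_3` are inferred as a numeric column.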



### Step 4. Join the two dataframes along rows and assign the result to all_data

In [None]:
all_data = data1.union(data2)

In [None]:
all_data.show()

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         1|      Alex| Anderson|
|         2|       Amy| Ackerman|
|         3|     Allen|      Ali|
|         4|     Alice|     Aoni|
|         5|    Ayoung|  Atiches|
|         4|     Billy|   Bonder|
|         5|     Brian|    Black|
|         6|      Bran|  Balwner|
|         7|     Bryce|    Brice|
|         8|     Betty|   Btisan|
+----------+----------+---------+



### Step 5. Join the two dataframes along columns and assign to all_data_col

### Step 6. Print data3

In [None]:
data3.show()

+----------+-------+
|subject_id|test_id|
+----------+-------+
|         1|     51|
|         2|     15|
|         3|     15|
|         4|     61|
|         5|     16|
|         7|     14|
|         8|     15|
|         9|      1|
|        10|     61|
|        11|     16|
+----------+-------+



### Step 7. Merge all_data and data3 along the subject_id value

In [None]:
all_data = all_data.join(data3, all_data.subject_id == data3.subject_id)
all_data.show()

+----------+----------+---------+----------+-------+
|subject_id|first_name|last_name|subject_id|test_id|
+----------+----------+---------+----------+-------+
|         1|      Alex| Anderson|         1|     51|
|         2|       Amy| Ackerman|         2|     15|
|         3|     Allen|      Ali|         3|     15|
|         4|     Alice|     Aoni|         4|     61|
|         4|     Billy|   Bonder|         4|     61|
|         5|    Ayoung|  Atiches|         5|     16|
|         5|     Brian|    Black|         5|     16|
|         7|     Bryce|    Brice|         7|     14|
|         8|     Betty|   Btisan|         8|     15|
+----------+----------+---------+----------+-------+



### Step 8. Merge only the data that has the same 'subject_id' in both data1 and data2

In [None]:
data1.join(data2, data1.subject_id == data2.subject_id, "inner").show()

+----------+----------+---------+----------+----------+---------+
|subject_id|first_name|last_name|subject_id|first_name|last_name|
+----------+----------+---------+----------+----------+---------+
|         4|     Alice|     Aoni|         4|     Billy|   Bonder|
|         5|    Ayoung|  Atiches|         5|     Brian|    Black|
+----------+----------+---------+----------+----------+---------+



### Step 9. Merge all values in data1 and data2, with matching records from both sides where available.

In [None]:
data1.join(data2, data1.subject_id == data2.subject_id, "outer").show()

+----------+----------+---------+----------+----------+---------+
|subject_id|first_name|last_name|subject_id|first_name|last_name|
+----------+----------+---------+----------+----------+---------+
|         1|      Alex| Anderson|      NULL|      NULL|     NULL|
|         2|       Amy| Ackerman|      NULL|      NULL|     NULL|
|         3|     Allen|      Ali|      NULL|      NULL|     NULL|
|         4|     Alice|     Aoni|         4|     Billy|   Bonder|
|         5|    Ayoung|  Atiches|         5|     Brian|    Black|
|      NULL|      NULL|     NULL|         6|      Bran|  Balwner|
|      NULL|      NULL|     NULL|         7|     Bryce|    Brice|
|      NULL|      NULL|     NULL|         8|     Betty|   Btisan|
+----------+----------+---------+----------+----------+---------+

