# Joining DataFrames 

You can merge DataFrames with the `join` method when working with related tables. A join takes the data from one DataFrame and links it to another according to a set of rules. To show this, let's first create a SparkSession.

## Creating a SparkSession

In [29]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()

## Loading the Datasets

Let's read LogIdentifier.csv first.

In [30]:
import os
DIRECTORY = "./data/broadcast_logs"
log_identifier = spark.read.csv(
    os.path.join(DIRECTORY, "ReferenceTables/LogIdentifier.csv"),
    sep="|",
    header=True,
    inferSchema=True,
)
log_identifier.printSchema()

root
 |-- LogIdentifierID: string (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- PrimaryFG: integer (nullable = true)



Let's take a look at the first five rows of the DataFrame with the `show` method.

In [31]:
log_identifier.show(5)

+---------------+------------+---------+
|LogIdentifierID|LogServiceID|PrimaryFG|
+---------------+------------+---------+
|           13ST|        3157|        1|
|         2000SM|        3466|        1|
|           70SM|        3883|        1|
|           80SM|        3590|        1|
|           90SM|        3470|        1|
+---------------+------------+---------+
only showing top 5 rows



Let's read the other DataFrame.

In [32]:
logs = spark.read.csv(
    os.path.join(DIRECTORY, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",
    header=True,
    inferSchema=True,
)
logs.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nulla

## The join method

We have two DataFrames and are ready to start joining!

Here is the blueprint: *[LEFT].join([RIGHT], on=[PREDICATES], how=[METHOD])*

The *on=[PREDICATES]* parameter specifies how to match records from the left table with records from the right table.
The *how=[METHOD]* parameter specifies the join type. PySpark defaults to an inner join, which keeps only the records that satisfy the predicate. Let's take a look at the other options for this parameter.

As you probably know, a *left* join adds the unmatched records from the left table to the joined table, filling the columns coming from the right table with null.

A *right* join adds the unmatched records from the right table to the joined table, filling the columns coming from the left table with null.

A *cross* join returns a record for every pair of records from the two tables.
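The join types described above can be sketched in plain Python on lists of dicts. This is only a toy model of the semantics, not how PySpark implements joins; the `join` helper and the sample rows are made up for illustration.

```python
from itertools import product

# Hypothetical toy rows; the values are invented for illustration.
left = [
    {"LogServiceID": 1, "ProgramTitle": "News"},
    {"LogServiceID": 2, "ProgramTitle": "Sports"},
]
right = [
    {"LogServiceID": 2, "LogIdentifierID": "B"},
    {"LogServiceID": 3, "LogIdentifierID": "C"},
]


def join(left, right, key, how="inner"):
    """Model inner, left, right, and cross joins on lists of dicts."""
    if how == "cross":
        # Every pair of records, regardless of the key values.
        return list(product(left, right))
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    rows = []
    matched_right = set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        matched_right.update(hits)
        for i in hits:
            rows.append({**lrow, **right[i]})
        if not hits and how == "left":
            # Unmatched left record: right-hand columns become None (null).
            rows.append({**lrow, **{c: None for c in right_cols}})
    if how == "right":
        for i, rrow in enumerate(right):
            if i not in matched_right:
                # Unmatched right record: left-hand columns become None (null).
                rows.append({**{c: None for c in left_cols}, **rrow})
    return rows


for how in ("inner", "left", "right", "cross"):
    print(how, len(join(left, right, "LogServiceID", how)))
```

With these toy rows, only `LogServiceID` 2 appears on both sides, so the inner join yields one record, the left and right joins yield two each, and the cross join yields all four pairs.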

Let's join the `logs` table (left) with the `log_identifier` table (right) using an inner join.

In [33]:
logs_and_channels = logs.join(
    log_identifier,
    on="LogServiceID",
    how="inner" 
)

In [34]:
print("The number of the log table columns: ", len(logs.columns))
print("The number of the log_identifier table columns: ", len(log_identifier.columns))
print("The number of the columns after joining: ", len(logs_and_channels.columns))

The number of the log table columns:  30
The number of the log_identifier table columns:  3
The number of the columns after joining:  32
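The counts add up as 30 + 3 - 1 = 32 because a join on a column name keeps a single copy of `LogServiceID`. The sketch below models that bookkeeping in plain Python. The `joined_columns` helper is hypothetical, and the column lists are shortened for illustration (the real `logs` table has 30 columns).

```python
def joined_columns(left_cols, right_cols, key):
    """Columns after joining on a shared column name: the key once, then the rest."""
    return (
        [key]
        + [c for c in left_cols if c != key]
        + [c for c in right_cols if c != key]
    )


# Abbreviated, illustrative column lists.
logs_cols = ["BroadcastLogID", "LogServiceID", "LogDate"]
log_identifier_cols = ["LogIdentifierID", "LogServiceID", "PrimaryFG"]

result = joined_columns(logs_cols, log_identifier_cols, "LogServiceID")
print(result)
# ['LogServiceID', 'BroadcastLogID', 'LogDate', 'LogIdentifierID', 'PrimaryFG']

# One column fewer than the sum, because the key is not duplicated.
print(len(result) == len(logs_cols) + len(log_identifier_cols) - 1)  # True
```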


## Naming the Merged Columns

Note that PySpark raises an error when we refer to an ambiguous column name. To show this, let's join the two tables with an explicit equality predicate on the `LogServiceID` column.

In [35]:
logs_and_channels_verbose = logs.join(
    log_identifier, 
    logs["LogServiceID"] == log_identifier["LogServiceID"]
)

In [36]:
logs_and_channels_verbose.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nulla

Let's try to select the LogServiceID:

In [37]:
"""
If you run this command *logs_and_channels_verbose.select("LogServiceID")*, you get an error like the following:
*Reference 'LogServiceID' is ambiguous, could be: LogServiceID, LogServiceID.*

Let's solve this problem.

"""

'\nIf you run this command *logs_and_channels_verbose.select("LogServiceID")*, you get an error like the following:\n*Reference \'LogServiceID\' is ambiguous, could be: LogServiceID, LogServiceID.*\n\nLet\'s solve this problem.\n\n'

As expected, we get an error. The joined table has two columns with the same name, `LogServiceID`, so PySpark cannot tell which one we mean when we refer to the column by name.
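The ambiguity can be pictured with a small plain-Python model of looking up a column by name. This is a toy sketch, not PySpark's resolver. The `resolve` helper is hypothetical and the schema is abbreviated for illustration.

```python
def resolve(columns, name):
    """Return the index of the single column called `name`, or raise if ambiguous."""
    matches = [i for i, c in enumerate(columns) if c == name]
    if not matches:
        raise KeyError(f"No column named {name!r}")
    if len(matches) > 1:
        raise ValueError(
            f"Reference {name!r} is ambiguous: {len(matches)} columns share this name"
        )
    return matches[0]


# Abbreviated schema after joining on an explicit equality expression:
# both tables contribute their own LogServiceID column.
columns = [
    "BroadcastLogID",
    "LogServiceID",
    "LogDate",
    "LogIdentifierID",
    "LogServiceID",
    "PrimaryFG",
]

print(resolve(columns, "LogDate"))  # 2

try:
    resolve(columns, "LogServiceID")
except ValueError as err:
    print(err)
```

A unique name resolves to one position, while the duplicated name cannot be resolved without extra information about which table it came from.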

To overcome this problem, here are three methods.

First, when the join predicate is given as a column name (a string) rather than an expression, the `join` method keeps only one copy of the predicate column. Let me show you:

In [38]:
logs_and_channels = logs.join(log_identifier, "LogServiceID")
logs_and_channels.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nulla

The second way is to rename or drop one of the two duplicated columns. Here we drop the copy coming from `log_identifier`.

In [39]:
logs_and_channels_verbose = logs.join(
    log_identifier, logs["LogServiceID"] == log_identifier["LogServiceID"]
)
logs_and_channels_verbose.drop(log_identifier["LogServiceID"]).select("LogServiceID")

DataFrame[LogServiceID: int]
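The cell above shows the drop variant. To compare it with the rename variant, here is a plain-Python toy model in which each column remembers the table it came from. The helpers and the new name `LogServiceID_right` are hypothetical and are not part of the PySpark API. In PySpark itself, one option is to rename the column on one table before joining, for example with `withColumnRenamed`.

```python
# Toy model: each column is a (table, name) pair; the schema is abbreviated.
joined = [
    ("logs", "BroadcastLogID"),
    ("logs", "LogServiceID"),
    ("log_identifier", "LogIdentifierID"),
    ("log_identifier", "LogServiceID"),
]


def drop(columns, table, name):
    """Remove the column `name` that came from `table`."""
    return [c for c in columns if c != (table, name)]


def rename(columns, table, old, new):
    """Rename the column `old` that came from `table` to `new`."""
    return [(t, new) if (t, n) == (table, old) else (t, n) for t, n in columns]


def names(columns):
    """Plain column names, as they would be seen when selecting by name."""
    return [n for _, n in columns]


dropped = drop(joined, "log_identifier", "LogServiceID")
renamed = rename(joined, "log_identifier", "LogServiceID", "LogServiceID_right")

print(names(dropped))  # ['BroadcastLogID', 'LogServiceID', 'LogIdentifierID']
print(names(renamed))
# ['BroadcastLogID', 'LogServiceID', 'LogIdentifierID', 'LogServiceID_right']
```

Either way, the name `LogServiceID` refers to exactly one column afterwards. Dropping discards the duplicate, while renaming keeps both columns addressable.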

In [40]:
logs_and_channels_verbose.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nulla

The last approach is to alias each table with the `alias` method and refer to the column through its table alias with `F.col`.

In [41]:
logs_and_channels_verbose = logs.alias("left").join( 
    log_identifier.alias("right"), 
    logs["LogServiceID"] == log_identifier["LogServiceID"],
)
logs_and_channels_verbose.drop(F.col("right.LogServiceID"))\
                         .select("LogServiceID") 

DataFrame[LogServiceID: int]

## Resource
- Data Analysis with Python and PySpark
