# 1. Dataset Integration

**Notebook:** `01_merge_datasets.ipynb`

The objectives of this Jupyter notebook are:

- Load the original Korean, Japanese, and Thai drama datasets
- Merge all sources into a single consolidated dataset
- Add metadata for dataset traceability

## 1. Create a unified dataframe

The selected dataset is split across three tables, each stored in its own CSV file, so they must be consolidated into a single dataframe. All three tables share the same column structure and data types, which allows them to be stacked directly without schema alignment or other complex transformations.

This step ensures a consistent and unified dataset that can be used as the basis for subsequent cleaning and transformation processes.

### 1.1 Inspect the data to verify column consistency

Before merging the datasets, an initial inspection is performed to confirm that all tables contain the same column headers and compatible data types. This validation step is essential to prevent structural inconsistencies and data integrity issues during the merge process.

In [None]:
import pandas as pd

# Load the three CSV tables

kor_drama_path = r"kor_drama.csv"
kor_drama = pd.read_csv(kor_drama_path)

jap_drama_path = r"jap_drama.csv"
jap_drama = pd.read_csv(jap_drama_path)

thai_drama_path = r"tha_drama.csv"
thai_drama = pd.read_csv(thai_drama_path)

## Look at the data

dramas_dic = {'kor_drama': kor_drama, 'jap_drama':jap_drama, 'thai_drama':thai_drama}

for drama_name, drama_data in dramas_dic.items():
    print(f"{drama_name} columns: ", drama_data.columns)
    print(f"{drama_name} info: ")
    drama_data.info()  # info() prints its report directly and returns None

kor_drama columns:  Index(['drama_id', 'drama_name', 'native_name', 'year', 'synopsis', 'genres',
       'tags', 'director', 'sc_writer', 'country', 'type', 'tot_eps',
       'ep_duration', 'start_dt', 'end_dt', 'aired_on', 'org_net',
       'tot_user_score', 'tot_num_user', 'tot_watched', 'content_rt', 'rank',
       'popularity'],
      dtype='object')
kor_drama info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1773 entries, 0 to 1772
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   drama_id        1773 non-null   object 
 1   drama_name      1773 non-null   object 
 2   native_name     1766 non-null   object 
 3   year            1773 non-null   int64  
 4   synopsis        1604 non-null   object 
 5   genres          1768 non-null   object 
 6   tags            1773 non-null   object 
 7   director        1045 non-null   object 
 8   sc_writer       967 non-null    object 
 9   country         177

Based on the inspection results, it can be confirmed that the datasets can be stacked into a single dataframe without structural conflicts. However, the exploratory analysis also reveals that several columns require cleaning and transformation in subsequent steps.
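The visual comparison above can also be automated. The following is a minimal sketch, assuming small stand-in dataframes rather than the real CSV files; the helper name `compare_schemas` and the toy values are hypothetical and not part of the notebook. For each table it reports the columns that are missing or extra relative to a reference table, plus any shared columns whose dtype differs.

```python
import pandas as pd


def compare_schemas(frames: dict[str, pd.DataFrame]) -> dict[str, dict]:
    """Compare every dataframe against the first one in `frames`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    names = list(frames)
    ref = frames[names[0]]  # the first table acts as the reference schema
    report = {}
    for name in names[1:]:
        df = frames[name]
        shared = [c for c in ref.columns if c in df.columns]
        report[name] = {
            # columns present in the reference but absent here
            "missing": [c for c in ref.columns if c not in df.columns],
            # columns present here but absent in the reference
            "extra": [c for c in df.columns if c not in ref.columns],
            # shared columns whose dtype differs: (reference dtype, this dtype)
            "dtype_mismatch": {
                c: (str(ref[c].dtype), str(df[c].dtype))
                for c in shared
                if ref[c].dtype != df[c].dtype
            },
        }
    return report


# Toy stand-ins for the real tables (values are made up for illustration)
kor_toy = pd.DataFrame({"drama_id": ["k1"], "year": [2020]})
jap_toy = pd.DataFrame({"drama_id": ["j1"], "year": ["2019"]})

schema_report = compare_schemas({"kor_toy": kor_toy, "jap_toy": jap_toy})
print(schema_report)
```

A check like this is mainly useful for dtype differences, which are easy to miss when reading `info()` output by eye. In the outputs shown in this notebook, for example, `year` is `int64` in `kor_drama` but `object` in the stacked dataframe, which suggests that at least one source table stores it differently.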

### 1.2 Stack the data and perform an initial inspection

After validating column consistency, the three datasets are vertically concatenated into a single dataframe. Once merged, an initial inspection is conducted to review the overall structure, identify missing values, detect potential anomalies, and highlight columns that will require further cleaning or feature engineering.
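One way to support the missing-value review described above is to count nulls per column and per source table once the data is stacked. The sketch below is illustrative only: it uses made-up toy dataframes (`kor_toy`, `jap_toy`) instead of the real files, and the variable `missing_by_source` is an assumed name, not something defined in the notebook.

```python
import pandas as pd

# Toy stand-ins for two source tables (values are made up for illustration)
kor_toy = pd.DataFrame(
    {"drama_id": ["k1", "k2", "k3"], "director": ["A", None, None]}
)
jap_toy = pd.DataFrame({"drama_id": ["j1", "j2"], "director": ["B", None]})

# Tag each table with its origin, then stack them vertically
kor_toy["dataset"] = "kor_toy"
jap_toy["dataset"] = "jap_toy"
stacked = pd.concat([kor_toy, jap_toy], ignore_index=True)

# Count missing values per column, broken down by source table
missing_by_source = (
    stacked.drop(columns="dataset")
    .isna()
    .groupby(stacked["dataset"])
    .sum()
)
print(missing_by_source)
```

Breaking the counts down by the `dataset` column shows whether a gap is spread across all sources or concentrated in one of them, which can inform how the column is cleaned later.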

In [None]:
# Stack the data

## To keep traceability of where each row came from, add a column with a source identifier

kor_drama['dataset'] = 'kor_drama'
jap_drama['dataset'] = 'jap_drama'
thai_drama['dataset'] = 'thai_drama'

## Stack the data

dramas = pd.concat(dramas_dic.values(), ignore_index=True)
print("dramas columns: ", dramas.columns)
print("dramas info: ")
dramas.info()  # info() prints its report directly and returns None

# Save the merged dataframe to a CSV file

dramas.to_csv("dramas.csv", index=False)

dramas columns:  Index(['drama_id', 'drama_name', 'native_name', 'year', 'synopsis', 'genres',
       'tags', 'director', 'sc_writer', 'country', 'type', 'tot_eps',
       'ep_duration', 'start_dt', 'end_dt', 'aired_on', 'org_net',
       'tot_user_score', 'tot_num_user', 'tot_watched', 'content_rt', 'rank',
       'popularity', 'dataset'],
      dtype='object')
dramas info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5442 entries, 0 to 5441
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   drama_id        5441 non-null   object 
 1   drama_name      5431 non-null   object 
 2   native_name     5418 non-null   object 
 3   year            5430 non-null   object 
 4   synopsis        4981 non-null   object 
 5   genres          5406 non-null   object 
 6   tags            5428 non-null   object 
 7   director        3511 non-null   object 
 8   sc_writer       3265 non-null   object 
 9   country       