# Phase 3: Intersection Analysis (Dataset 1 & 2)

---

## 1. Introduction
In this notebook, we join **Dataset 1** and **Dataset 2** on their shared TMDB IDs.  
Goals of this notebook:
- Understand the structure of the dataset.  
- Perform exploratory statistics.  
- Extract insights that will guide preprocessing and modeling.  
- *Show the bridge between Dataset 1 and Dataset 2.*

In [30]:
import pandas as pd


In [31]:
# Loading the datasets
links_df = pd.read_csv('../data/raw/dataset1/links.csv')
movies_df_2 = pd.read_csv('../data/raw/dataset2/tmdb_5000_movies.csv')
movies = pd.read_csv('../data/processed/dataset1/full_movies_cleaned.csv')

In [32]:
# dropping null ids
links_df = links_df.dropna()
# converting to integer
links_df['tmdbId'] = links_df['tmdbId'].astype('Int64')

# attaching the Dataset 1 titles to the (movieId, tmdbId) pairs
subset1 = links_df[['movieId', 'tmdbId']]
uniq = movies.drop_duplicates(subset='movieId')
subset2 = uniq[['movieId', 'title']]
merge1 = pd.merge(subset1, subset2, on='movieId', how='left')

# merging with dataset 2
subset_c = movies_df_2[['id', 'title']]
merged = pd.merge(
    merge1, subset_c,
    left_on='tmdbId', right_on='id',
    how='left', suffixes=('_links', '_tmdb'),
)
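
One way to see how many rows actually found a partner in a left merge is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses small made-up frames; `toy_links`, `toy_tmdb`, and their IDs and titles are hypothetical stand-ins, not values from the real files.

```python
import pandas as pd

# Hypothetical toy data standing in for merge1 and movies_df_2
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "tmdbId": pd.array([10, 20, 30, 40], dtype="Int64"),
})
toy_tmdb = pd.DataFrame({
    "id": [10, 30, 99],
    "title": ["Movie A", "Movie C", "Movie Z"],
})

# Left merge keeps every row of toy_links; _merge records whether a match was found
audit = pd.merge(
    toy_links, toy_tmdb,
    left_on="tmdbId", right_on="id",
    how="left", indicator=True,
)

match_counts = audit["_merge"].value_counts()
matched = int(match_counts["both"])         # rows present in both frames
unmatched = int(match_counts["left_only"])  # rows with no partner in toy_tmdb
print(matched, unmatched)
```

Counting `both` rows before calling `dropna()` gives the intersection size directly, and the `left_only` rows show which Dataset 1 movies have no TMDB counterpart.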


In [33]:
# dropping null values
merged = merged.dropna()
# dropping redundant columns
merged = merged.drop(columns=['title_tmdb','id'])
merged = merged.rename(columns={'title_links': 'title'})
# writing to CSV (index=False skips the non-contiguous row index left by dropna)
merged.to_csv('../data/processed/dataset2/intersection.csv', index=False)
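
Whether the row index is written matters when the file is read back: a saved index comes back as an extra `Unnamed: 0` column, and `tmdbId` is read as a plain integer unless a dtype is requested. A minimal sketch using an in-memory buffer instead of the real file; `toy` and `buffer` are illustrative names that do not appear elsewhere in the notebook.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by dropna()
toy = pd.DataFrame(
    {
        "movieId": [1, 3],
        "tmdbId": pd.array([10, 30], dtype="Int64"),
        "title": ["Movie A", "Movie C"],
    },
    index=[0, 2],
)

# Default to_csv writes the index as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False writes only the data columns; dtype restores the nullable integer type
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer, dtype={"tmdbId": "Int64"})

print(list(with_index.columns))
print(list(without_index.columns))
```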

In [34]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3534 entries, 0 to 9370
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  3534 non-null   int64 
 1   tmdbId   3534 non-null   Int64 
 2   title    3534 non-null   object
dtypes: Int64(1), int64(1), object(1)
memory usage: 113.9+ KB


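- Intersection size: **3,534 movies** (out of 9,370 in Dataset 1 and 4,803 in Dataset 2). The 9,370 figure matches the last index label in the output above, so the exact Dataset 1 count is worth confirming with `len(merge1)`.  
- Columns: `movieId` (Dataset 1 ID), `tmdbId` (TMDB ID, matching `id` in Dataset 2), `title` (Dataset 1 title, renamed from `title_links`).  
- The table gives a consistent ID mapping across the two datasets, which is critical for transfer learning and the hybrid recommender.  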
- Coverage:  
  - ~37.7% of Dataset 1 movies overlap.  
  - ~73.6% of Dataset 2 movies overlap.  
- Implication: roughly three quarters of Dataset 2 also appears in Dataset 1, so the intersection can bridge ratings (Dataset 1) with rich metadata (Dataset 2).  
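
The coverage percentages above follow from simple division. The sketch below reproduces them from the counts quoted in the text; `coverage` is a hypothetical helper introduced here for illustration, and the dataset sizes are taken as stated rather than recomputed from the files.

```python
def coverage(intersection_size: int, dataset_size: int) -> float:
    """Return the percentage of a dataset that falls inside the intersection."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return 100 * intersection_size / dataset_size


intersection_size = 3534  # rows in `merged`, from merged.info()
dataset1_size = 9370      # Dataset 1 count as quoted in the text
dataset2_size = 4803      # Dataset 2 count as quoted in the text

coverage_d1 = round(coverage(intersection_size, dataset1_size), 1)
coverage_d2 = round(coverage(intersection_size, dataset2_size), 1)
print(coverage_d1, coverage_d2)
```

In the notebook itself, the same helper could be fed `len(merged)`, `len(merge1)`, and `len(movies_df_2)` so the percentages stay in sync with the data.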
