# 4. Data Cleaning for Association

**Notebook:** `04_clean_association_data.ipynb`

The objectives of this Jupyter notebook are:

- Select categorical attributes required for association rule mining
- Transform variables into a transactional-like format
- Ensure compatibility with the Apriori algorithm
- Save the association dataset for Weka

## Clean and transform the dataset

Because the Classification, Clustering, and Association tasks rely on different attributes, a separate dataframe will be created and cleaned for each task. This ensures that the data matches the specific input requirements of each algorithm.

## 2.1 Create the "Association" dataframe

For the association task, the required columns are genres, country, and popularity.

Since the Apriori algorithm works with categorical data, all three columns will be transformed (when necessary) into a fully categorical format suitable for association rule mining.
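To make the idea of a "transactional-like" format concrete, the sketch below turns each row of a small toy dataframe into a set of items, which is how Apriori conceptually views the data. The toy values, the `row_to_transaction` helper, and the `country=` / `popularity=` item prefixes are illustrative assumptions; the `popularity_high_low` and genre columns mirror the ones this notebook creates in later steps.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "country": ["Japan", "South Korea"],
    "popularity_high_low": ["High", "Low"],
    "Comedy": [True, False],
    "Drama": [True, True],
})

genre_cols = ["Comedy", "Drama"]


def row_to_transaction(row, genre_cols):
    """Hypothetical helper: return the set of items present in one row."""
    items = {
        f"country={row['country']}",
        f"popularity={row['popularity_high_low']}",
    }
    # A genre becomes an item only when its boolean flag is True
    items.update(g for g in genre_cols if row[g])
    return items


transactions = [row_to_transaction(row, genre_cols) for _, row in toy.iterrows()]
for t in transactions:
    print(sorted(t))
```

Each transaction contains one country item, one popularity item, and one item per genre the drama belongs to.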

In [None]:
import pandas as pd

# Import the original dataset

association_path = r"your_dataset.csv"

association = pd.read_csv(association_path)

# Keep only the needed columns

print(association.columns)

association_cols = ['country', 'genres', 'popularity']

association = association[association_cols]

print("association cols: ", association.columns)
print("association info: ")
print(association.info())

Index(['drama_id', 'drama_name', 'native_name', 'year', 'synopsis', 'genres',
       'tags', 'director', 'sc_writer', 'country', 'type', 'tot_eps',
       'ep_duration', 'start_dt', 'end_dt', 'aired_on', 'org_net',
       'tot_user_score', 'tot_num_user', 'tot_watched', 'content_rt', 'rank',
       'popularity', 'dataset'],
      dtype='object')
association cols:  Index(['country', 'genres', 'popularity'], dtype='object')
association info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5442 entries, 0 to 5441
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     5428 non-null   object 
 1   genres      5406 non-null   object 
 2   popularity  5425 non-null   float64
dtypes: float64(1), object(2)
memory usage: 127.7+ KB
None


## 2.2 EDA and drop null values

As observed in the `03_clean_clustering_data.ipynb` notebook, the country column contains values that must be reviewed to ensure they do not negatively affect data quality.

In [10]:
# Cast country to string and strip whitespace (missing values become the
# string 'nan', which the filter below removes)

association['country'] = association['country'].astype(str).str.strip()

# Keep only the three countries of interest; .copy() avoids a SettingWithCopyWarning

association = association[association['country'].isin(['Japan', 'South Korea', 'Thailand'])].copy()

# Drop rows with missing values in any of the selected columns

association.dropna(subset=association_cols, inplace=True)

print("association info:")
print(association.info())

association info:
<class 'pandas.core.frame.DataFrame'>
Index: 5403 entries, 0 to 5441
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     5403 non-null   object 
 1   genres      5403 non-null   object 
 2   popularity  5403 non-null   float64
dtypes: float64(1), object(2)
memory usage: 168.8+ KB
None


Because country is already a categorical variable, no transformation is required for this column.

## 2.3 Transform popularity column

Because the goal of the Association task is to identify which combinations of country and genre are linked to high popularity, the popularity column must be discretized into two categorical values, High and Low.

Unlike the approach used in `02_clean_classification_data.ipynb`, the Pareto principle (80/20 rule) is not applied here. The reason lies in how the Apriori algorithm works: it relies on a minimum support threshold (`lowerBoundMinSupport`) to identify frequent itemsets. Since this parameter is typically set between 0.1 and 0.2 (10–20% of the dataset), the algorithm itself already enforces a frequency-based filter.

Applying the Pareto rule before association rule mining could artificially constrain the distribution of the popularity variable and eliminate meaningful but less frequent patterns. Therefore, popularity is encoded into High and Low using a simpler, data-driven threshold, the median, which preserves sufficient variability for Apriori to discover relevant associations.
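The effect of the threshold on support can be illustrated with a small sketch. The popularity values below are made up for the example. The point is only that a split at the median (`quantile(0.5)`, as in this notebook's code) leaves each label with a support of about 0.5, comfortably above a minimum support of 0.1 to 0.2. An 80/20 split would instead leave High at roughly 0.2, so any itemset that combines High with another item would start at or below the threshold.

```python
import numpy as np
import pandas as pd

# Toy popularity values (illustrative only, not taken from the dataset)
popularity = pd.Series([420.0, 1052.0, 1584.0, 1628.0, 2100.0, 99999.0])

# Median cutoff, as used in this notebook
cutoff = popularity.quantile(0.5)

# Same labelling rule as the notebook: strictly above the cutoff is High
labels = pd.Series(np.where(popularity > cutoff, "High", "Low"))

# Support of a single item = share of rows that contain it
support = labels.value_counts(normalize=True)

print("cutoff:", cutoff)
print(support)
```

On real data with ties at the median, the two shares may differ slightly from 0.5.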

In [11]:
import numpy as np

# Use the median (50th percentile) of popularity as the cutoff,
# so that roughly half of the dramas fall on each side

cutoff_value = association['popularity'].quantile(0.5)

# Label each drama as High (above the cutoff) or Low (at or below it)

association['popularity_high_low'] = np.select([association['popularity'] > cutoff_value,
                                                   association['popularity'] <= cutoff_value],
                                                  ['High', 'Low'], default="")

print("Data:")
print(association[['popularity_high_low', 'popularity']].head())

# Drop the original column
association.drop(columns=['popularity'], inplace=True)

Data:
  popularity_high_low  popularity
0                 Low      1052.0
1                High     99999.0
2                 Low      1584.0
3                 Low      1628.0
4                 Low       420.0


## 2.4 Transform genres column

As observed in the exploratory data analysis performed in both `02_clean_classification_data.ipynb` and `03_clean_clustering_data.ipynb`, the genres column cannot be correctly encoded using a single-label approach, since most dramas belong to multiple genres. Therefore, `MultiLabelBinarizer` is required to properly represent this multi-genre structure.

Because MultiLabelBinarizer transforms each genre into binary values (0 and 1), the resulting columns will be explicitly cast to the boolean data type. This step is necessary to ensure that the data is treated as nominal rather than numerical when used in the association task.
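On a toy series, the encoding described above can be sketched as follows. The genre strings are invented for illustration. The steps (split on commas, strip whitespace, binarize, cast to bool) mirror the cell below.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre strings (illustrative), comma-separated with stray spaces
genres = pd.Series(["Thriller,  Comedy", "Sitcom", "Comedy,  Drama"])

# Split on commas and strip whitespace so " Comedy" and "Comedy" are the same label
genre_lists = genres.str.split(",").apply(lambda items: [g.strip() for g in items])

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(genre_lists)

# One column per genre (the binarizer sorts the labels), cast from 0/1 to bool
genre_df = pd.DataFrame(matrix, columns=mlb.classes_, index=genres.index).astype(bool)

print(genre_df)
```

Without the strip step, `"Comedy"` and `"  Comedy"` would end up as two separate columns.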

In [12]:
# Prepare the data for the transformation

association['genres'] = association['genres'].fillna('').str.split(',')

print("association['genres'] split: ", association['genres'].head())

from sklearn.preprocessing import MultiLabelBinarizer

mlb=MultiLabelBinarizer()

# Clean the genres data
association['genres'] = association['genres'].apply(lambda row_list: [str.strip(item) for item in row_list])

# Build the binary indicator matrix (one 0/1 column per genre)
genre_matrix = mlb.fit_transform(association['genres'])

# Create a dataframe that contains the columns of the matrix
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=association.index)
genre_df = genre_df.astype(bool) 
print("genre_df:")
print(genre_df.head())
print("genre_df: ", genre_df.columns)

# Join the matrix with the original dataframe
association = association.join(genre_df)

print("association df:")
print(association)
print("association cols: ", association.columns)

# Due to the transformation, the original genres column can be deleted

association.drop(columns=['genres'], inplace=True)

association['genres'] split:  0         [Thriller,   Mystery,   Comedy,   Drama]
1                                         [Sitcom]
2    [Historical,   Romance,   Drama,   Melodrama]
3            [Music,   Comedy,   Romance,   Youth]
4    [Action,   Mystery,   Comedy,   Supernatural]
Name: genres, dtype: object
genre_df:


   Action  Adventure  Business  Comedy  Crime  Documentary  Drama  Family  \
0   False      False     False    True  False        False   True   False   
1   False      False     False   False  False        False  False   False   
2   False      False     False   False  False        False   True   False   
3   False      False     False    True  False        False  False   False   
4    True      False     False    True  False        False  False   False   

   Fantasy   Food  ...  Romance  Sci-Fi  Sitcom  Sports  Supernatural  \
0    False  False  ...    False   False   False   False         False   
1    False  False  ...    False   False    True   False         False   
2    False  False  ...     True   False   False   False         False   
3    False  False  ...     True   False   False   False         False   
4    False  False  ...    False   False   False   False          True   

   Thriller  Tokusatsu    War  Wuxia  Youth  
0      True      False  False  False  False  
1     

## 2.5 Save the dataframe
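As a rough illustration of what the saved file contains, the sketch below writes a toy dataframe with the same kinds of columns to an in-memory buffer and reads it back. The toy values and the use of `io.StringIO` instead of a real file are assumptions made for the example. It only shows how pandas serializes the boolean genre flags (as the text `True` / `False`); how Weka interprets the columns after loading is not checked here.

```python
import io

import pandas as pd

# Toy frame with the same kinds of columns as the association dataset
toy = pd.DataFrame({
    "country": ["Japan", "Thailand"],
    "popularity_high_low": ["High", "Low"],
    "Comedy": [True, False],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the columns survive the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.dtypes)
```

Passing `index=False`, as the cell below does, keeps the pandas row index out of the file so it does not show up as an extra attribute.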

In [13]:
print("association cols: ", association.columns)

print("association:")
print(association.head())

print("association info:")
print(association.info())

association.to_csv("association.csv", index=False)

association cols:  Index(['country', 'popularity_high_low', 'Action', 'Adventure', 'Business',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Food',
       'Historical', 'Horror', 'Law', 'Life', 'Martial Arts', 'Mature',
       'Medical', 'Melodrama', 'Military', 'Music', 'Mystery', 'Political',
       'Psychological', 'Romance', 'Sci-Fi', 'Sitcom', 'Sports',
       'Supernatural', 'Thriller', 'Tokusatsu', 'War', 'Wuxia', 'Youth'],
      dtype='object')
association:
       country popularity_high_low  Action  Adventure  Business  Comedy  \
0  South Korea                 Low   False      False     False    True   
1  South Korea                High   False      False     False   False   
2  South Korea                 Low   False      False     False   False   
3  South Korea                 Low   False      False     False    True   
4  South Korea                 Low    True      False     False    True   

   Crime  Documentary  Drama  Family  ...  Romance  S