# 3. Data Cleaning for Clustering

**Notebook:** `03_clean_clustering_data.ipynb`

The objectives of this Jupyter Notebook are:

- Select attributes relevant for clustering
- Clean and convert numeric attributes
- Encode categorical variables appropriately
- Prepare a dataset suitable for unsupervised learning
- Save the clustering dataset for Weka

# 3.1 Create the "Clustering" dataframe

For the clustering task, the following columns are required: country, genres, tot_eps, ep_duration and tot_user_score.

Because k-means (the algorithm chosen for this task) is based on distance calculations, it is highly sensitive to large differences in numerical scales. Therefore, the objective of the transformations applied in this step is to ensure that all numerical features are kept within a comparable range, ideally between 0 and 1.
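To make the scale problem concrete, here is a minimal sketch using three made-up dramas (the values are hypothetical, not taken from the dataset). It compares Euclidean distances before and after a manual min-max rescaling. On the raw values the episode count dominates the distance, while after rescaling a large score difference carries a comparable weight.

```python
import numpy as np

# Hypothetical rows: [tot_eps, ep_duration, tot_user_score]
a = np.array([16.0, 60.0, 8.5])
b = np.array([100.0, 30.0, 8.4])   # differs from a mainly in length
c = np.array([16.0, 60.0, 4.0])    # differs from a only in score


def euclidean(x, y):
    """Plain Euclidean distance, the metric k-means relies on."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


# Distances on the raw scales
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max rescaling of each column to [0, 1]
data = np.vstack([a, b, c])
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)

# Distances on the rescaled values
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

In the raw case the pair that differs in score looks almost identical compared with the pair that differs in length. After rescaling, both differences are of the same order of magnitude.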

In [None]:
import pandas as pd

# Import the original dataset

clustering_path = r"your_dataset.csv"

clustering = pd.read_csv(clustering_path)

# Keep only the needed columns

print(clustering.columns)

clustering_cols = ['country', 'genres', 'tot_eps', 'ep_duration', 'tot_user_score']

clustering = clustering[clustering_cols].copy()  # copy to avoid SettingWithCopyWarning in later assignments

print("clustering cols: ", clustering.columns)
print("clustering info: ")
print(clustering.info())

Index(['drama_id', 'drama_name', 'native_name', 'year', 'synopsis', 'genres',
       'tags', 'director', 'sc_writer', 'country', 'type', 'tot_eps',
       'ep_duration', 'start_dt', 'end_dt', 'aired_on', 'org_net',
       'tot_user_score', 'tot_num_user', 'tot_watched', 'content_rt', 'rank',
       'popularity', 'dataset'],
      dtype='object')
clustering cols:  Index(['country', 'genres', 'tot_eps', 'ep_duration', 'tot_user_score'], dtype='object')
clustering info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5442 entries, 0 to 5441
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         5428 non-null   object 
 1   genres          5406 non-null   object 
 2   tot_eps         5399 non-null   object 
 3   ep_duration     5262 non-null   object 
 4   tot_user_score  5251 non-null   float64
dtypes: float64(1), object(4)
memory usage: 212.7+ KB
None


# 3.2 Drop null values

As observed in the notebook 02_clean_classification_data.ipynb, the dataset contains a small number of missing values that appear to be Missing at Random (MAR). Given the relatively low proportion of missing data and the sufficient size of the dataset, all rows containing null values are removed for the clustering task.

In [140]:
# Before dropping, clean the dtype of the numeric columns

clustering['tot_eps'] = pd.to_numeric(clustering['tot_eps'], errors='coerce')

clustering['ep_duration'] = pd.to_numeric(clustering['ep_duration'], errors='coerce')

# Keep only the rows whose country is one of the three countries of interest

clustering['country'] = clustering['country'].astype(str).str.strip()

clustering = clustering[clustering['country'].isin(['Japan', 'South Korea', 'Thailand'])]

# Drop rows with null values

clustering = clustering.dropna(subset=clustering_cols)

print("clustering info:")
print(clustering.info())

clustering info:
<class 'pandas.core.frame.DataFrame'>
Index: 5048 entries, 0 to 5440
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         5048 non-null   object 
 1   genres          5048 non-null   object 
 2   tot_eps         5048 non-null   float64
 3   ep_duration     5048 non-null   float64
 4   tot_user_score  5048 non-null   float64
dtypes: float64(3), object(2)
memory usage: 236.6+ KB
None


In total, 394 rows (approximately 7% of the original 5,442) were removed by the country filter and the null drop combined, leaving 5,048 rows. This remains a sufficiently large sample for the clustering analysis.
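The row count reported above is the combined effect of two steps: the country filter and the null drop. The following sketch shows one way to separate the two contributions. It runs on a small hypothetical frame (`toy`), not on the real dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the clustering dataframe
toy = pd.DataFrame({
    "country": ["Japan", "South Korea", "China", "Thailand", None, "Japan"],
    "tot_eps": [12, 16, 40, np.nan, 10, 24],
    "tot_user_score": [7.5, 8.1, 7.9, 8.0, 6.5, np.nan],
})

wanted = ["Japan", "South Korea", "Thailand"]

n_start = len(toy)

# Step 1: country filter (rows with a missing country are excluded too)
after_country = toy[toy["country"].isin(wanted)]
n_after_country = len(after_country)

# Step 2: drop rows that still contain null values
after_dropna = after_country.dropna()
n_after_dropna = len(after_dropna)

removed_by_country = n_start - n_after_country
removed_by_nulls = n_after_country - n_after_dropna

print("removed by country filter:", removed_by_country)
print("removed by null drop:     ", removed_by_nulls)
print("rows kept:                ", n_after_dropna)
```

Recording both counts makes it easier to judge how much of the loss is due to missing data rather than to the deliberate choice of countries.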

# 3.3 Transform numerical data

## 3.3.1 Clean the tot_eps and ep_duration data

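Remove any rows where tot_eps or ep_duration equals 0 after the conversion to float64, since a drama cannot have zero episodes or zero-length episodes. In this run the row count stays at 5,048, so no such rows were present.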

In [141]:
# tot_eps

clustering = clustering[(clustering['tot_eps'] != 0.0)]

# ep_duration

clustering = clustering[(clustering['ep_duration'] != 0.0)]

# Validate that there are no null values

print(clustering.info())

<class 'pandas.core.frame.DataFrame'>
Index: 5048 entries, 0 to 5440
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         5048 non-null   object 
 1   genres          5048 non-null   object 
 2   tot_eps         5048 non-null   float64
 3   ep_duration     5048 non-null   float64
 4   tot_user_score  5048 non-null   float64
dtypes: float64(3), object(2)
memory usage: 236.6+ KB
None


## 3.3.2 Transform numerical data into a 0 to 1 scale

As mentioned at the beginning of this notebook, k-means is highly sensitive to differences in numerical scales. For this reason, all numerical variables must be transformed to a common scale.

Although in 02_clean_classification_data.ipynb we observed that tot_eps contains many outliers and applied StandardScaler for the classification task, in this case (given how k-means computes distances), all numerical columns will be scaled using MinMaxScaler, transforming their values to a range between 0 and 1.

Additionally, using a 0–1 scale allows the numerical variables to be directly comparable with the transformed categorical variables in the next step.

In [142]:
# Import the needed libraries
from sklearn.preprocessing import MinMaxScaler

# Scale every numerical column to the [0, 1] range

numerical_cols = ['tot_eps', 'ep_duration', 'tot_user_score']

minmax = MinMaxScaler()

# MinMaxScaler scales each column independently, so all three can be fitted at once
clustering[numerical_cols] = minmax.fit_transform(clustering[numerical_cols])

print(clustering[numerical_cols].head())
print(clustering.columns)

    tot_eps  ep_duration  tot_user_score
0  0.024291     0.388158        0.766667
1  0.008097     0.098684        0.777778
2  0.036437     0.519737        0.855556
3  0.028340     0.157895        0.788889
4  0.044534     0.453947        0.833333
Index(['country', 'genres', 'tot_eps', 'ep_duration', 'tot_user_score'], dtype='object')


# 3.4 Transform categorical variables

Because k-means expects numerical values on a comparable scale (0 to 1), multi-hot encoding and one-hot encoding will be applied to the genres and country columns, respectively, following the same approach used in the classification dataset.

However, due to the high dimensionality generated by the genres column after multi-hot encoding, Principal Component Analysis (PCA) will be applied to the genre features in order to reduce dimensionality while preserving the most relevant variance for the clustering task.

## 3.4.1 Transform country column

As observed during the initial exploratory analysis (Section 2.4.1), the *country* attribute contains three distinct categories: **South Korea**, **Japan**, and **Thailand**. To convert this categorical variable into a numerical format compatible with k-means, **one-hot encoding** is applied using the `get_dummies` method from the *pandas* library.

This transformation preserves the nominal nature of the variable while preventing the introduction of unintended ordinal relationships between countries.
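The point about unintended ordinal relationships can be illustrated with a small sketch. It compares one-hot encoding with a plain integer label encoding (used here only as a contrast, it is not part of the notebook) on the three country names.

```python
import numpy as np
import pandas as pd

countries = pd.Series(["Japan", "South Korea", "Thailand"])

# One-hot encoding: one 0/1 column per country
one_hot = pd.get_dummies(countries, dtype=int).to_numpy()

# Integer label encoding, shown only for contrast: Japan=0, South Korea=1, Thailand=2
labels = countries.astype("category").cat.codes.to_numpy().astype(float)


def euclidean(x, y):
    """Euclidean distance between two encoded rows."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))


# With one-hot encoding every pair of countries is equally far apart
one_hot_distances = [
    euclidean(one_hot[0], one_hot[1]),
    euclidean(one_hot[0], one_hot[2]),
    euclidean(one_hot[1], one_hot[2]),
]

# With integer labels Japan appears closer to South Korea than to Thailand
label_distances = [
    euclidean(labels[0], labels[1]),
    euclidean(labels[0], labels[2]),
    euclidean(labels[1], labels[2]),
]

print("one-hot:", one_hot_distances)
print("labels: ", label_distances)
```

Because k-means works on distances, the equal spacing produced by one-hot encoding is what keeps the variable nominal.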

In [143]:
# Create dummies

print("unique countries", clustering['country'].nunique())

countries = pd.get_dummies(clustering['country'])

# Join them with original dataframe

clustering = clustering.join(countries)

print("clustering: ")
print(clustering.head())
print("clustering columns: ", clustering.columns)

# Drop country column
clustering.drop(columns=['country'], inplace=True)

unique countries 3
clustering: 
       country                                    genres   tot_eps  \
0  South Korea       Thriller,  Mystery,  Comedy,  Drama  0.024291   
1  South Korea                                    Sitcom  0.008097   
2  South Korea  Historical,  Romance,  Drama,  Melodrama  0.036437   
3  South Korea          Music,  Comedy,  Romance,  Youth  0.028340   
4  South Korea  Action,  Mystery,  Comedy,  Supernatural  0.044534   

   ep_duration  tot_user_score  Japan  South Korea  Thailand  
0     0.388158        0.766667  False         True     False  
1     0.098684        0.777778  False         True     False  
2     0.519737        0.855556  False         True     False  
3     0.157895        0.788889  False         True     False  
4     0.453947        0.833333  False         True     False  
clustering columns:  Index(['country', 'genres', 'tot_eps', 'ep_duration', 'tot_user_score',
       'Japan', 'South Korea', 'Thailand'],
      dtype='object')


## 3.4.2 Transform genres column

In this step, a short exploratory data analysis (EDA) is performed on the genres column to understand how the data is stored in each row and to identify the true unique genre values. This is necessary to encode the column correctly with MultiLabelBinarizer, following the same approach used in 02_clean_classification_data.ipynb.
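As a minimal sketch of the multi-hot idea, the example below encodes three hypothetical genre strings written in the same comma-separated style as the dataset. It shows why each genre name has to be stripped of whitespace before calling `MultiLabelBinarizer`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows in the same "comma plus spaces" style as the genres column
raw = pd.Series(["Thriller,  Mystery,  Comedy", "Sitcom", "Comedy,  Romance"])

# Split on commas and strip whitespace, otherwise " Comedy" and "Comedy"
# would be treated as two different genres
genre_lists = raw.str.split(",").apply(lambda items: [g.strip() for g in items])

mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(
    mlb.fit_transform(genre_lists),
    columns=mlb.classes_,
    index=raw.index,
)

print(multi_hot)
```

Each row ends up with a 1 in every column whose genre it contains, so a drama with three genres has three 1s.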

In [144]:
# Perform EDA on the columns

print("genres unique", clustering['genres'].nunique())

print("clustering data:")
print(clustering['genres'].head()) # Shows that each row stores its genres as a single comma-separated string

# Split the data

clustering['genres'] = clustering['genres'].fillna('').str.split(',')

print("clustering['genres'] split: ", clustering['genres'].head())

# Use the explode method to put each genre of a list on its own row

clustering_exploded = clustering.explode('genres')
clustering_exploded['genres'] = clustering_exploded['genres'].str.strip()

print('Unique genres:', clustering_exploded['genres'].unique())

# Import the needed class

from sklearn.preprocessing import MultiLabelBinarizer

mlb=MultiLabelBinarizer()

# Clean the genres data: strip the whitespace around every genre name
clustering['genres'] = clustering['genres'].apply(lambda row_list: [item.strip() for item in row_list])

# Create the matrix that transforms the genres into numeric values
genre_matrix = mlb.fit_transform(clustering['genres'])

# Create a dataframe that contains the columns of the matrix
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=clustering.index)
print("genre_df:")
print(genre_df.head())
print("genre_df: ", genre_df.columns)

genres unique 1090
clustering data:
0         Thriller,  Mystery,  Comedy,  Drama
1                                      Sitcom
2    Historical,  Romance,  Drama,  Melodrama
3            Music,  Comedy,  Romance,  Youth
4    Action,  Mystery,  Comedy,  Supernatural
Name: genres, dtype: object
clustering['genres'] split:  0         [Thriller,   Mystery,   Comedy,   Drama]
1                                         [Sitcom]
2    [Historical,   Romance,   Drama,   Melodrama]
3            [Music,   Comedy,   Romance,   Youth]
4    [Action,   Mystery,   Comedy,   Supernatural]
Name: genres, dtype: object
Unique genres: ['Thriller' 'Mystery' 'Comedy' 'Drama' 'Sitcom' 'Historical' 'Romance'
 'Melodrama' 'Music' 'Youth' 'Action' 'Supernatural' 'Military' 'Fantasy'
 'Horror' 'Psychological' 'Life' 'Crime' 'Food' 'Adventure' 'Sci-Fi'
 'Business' 'Medical' 'Family' 'Political' 'Law' 'Mature' 'Sports'
 'Documentary' 'Martial Arts' 'Tokusatsu' 'War' 'Wuxia']
genre_df:
   Action  Adventure  Business 

Once the data has been transformed using MultiLabelBinarizer, a Principal Component Analysis (PCA) is applied to reduce dimensionality and improve clustering results. This decision is made because a high number of features can introduce noise and negatively affect k-means, which relies on distance calculations. Reducing dimensionality helps preserve the most relevant information while improving cluster quality and interpretability.
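One way to judge how much information a given number of components keeps is the cumulative explained variance ratio. The sketch below shows the idea on random binary data (`fake_genres`, a hypothetical stand-in with the same 33-column shape as the genre matrix), so the printed figures say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for the multi-hot genre matrix: 200 rows x 33 binary columns
fake_genres = (rng.random((200, 33)) < 0.15).astype(int)

pca = PCA(n_components=11)
reduced = pca.fit_transform(fake_genres)

# Share of the total variance retained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("shape after PCA:", reduced.shape)
print("variance kept by 11 components:", cumulative[-1])
print("value range of the components:", reduced.min(), reduced.max())
```

Running the same check on the real genre matrix would show how much variance the chosen number of components retains. Note also that PCA scores are centered on zero and are not bounded to the 0 to 1 range, which is visible as negative values in the component columns.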

In [145]:
from sklearn.decomposition import PCA

# Perform PCA

pca = PCA(n_components=11) # Keep 11 components, i.e. 33% of the 33 genre columns

genre_pca = pca.fit_transform(genre_df)

# To get more descriptive headers, extract the two highest-weighted genres of each of our 11 components

components_dic = {}

for i in range(pca.n_components_):
    # Get loadings for component i
    comp_recipe = pd.DataFrame(
        pca.components_[i],
        index=mlb.classes_,
        columns=['Weight']
    )
    
    # Top 2 genres (positive correlation)
    top_genres = comp_recipe.sort_values(by='Weight', ascending=False).head(2).index.tolist()
    
    # Optional: also look at bottom 2 (negative correlation)
    bottom_genres = comp_recipe.sort_values(by='Weight', ascending=True).head(2).index.tolist()
    
    # Create a descriptive name (e.g., "Romance-Drama")
    comp_name = "-".join(top_genres)
    
    # Save in dictionary
    components_dic[f'PC{i+1}'] = comp_name

print(components_dic)

pca_columns = [f'PC{i+1}' for i in range(genre_pca.shape[1])]
genre_pca_df = pd.DataFrame(genre_pca, columns=pca_columns, index=clustering.index)


# Join PCA features back to original dataframe
clustering = clustering.join(genre_pca_df)

clustering.rename(columns=components_dic, inplace=True)

# Drop original genres column (since you now have PCA features)
clustering.drop(columns=['genres'], inplace=True)

print("clustering df:")
print(clustering.head())
print("clustering cols: ", clustering.columns)

# Convert the country dummy columns from bool to int (0/1)

countries_list = ['Japan', 'South Korea', 'Thailand']

clustering[countries_list] = clustering[countries_list].astype(int)

{'PC1': 'Romance-Comedy', 'PC2': 'Drama-Romance', 'PC3': 'Romance-Mystery', 'PC4': 'Life-Youth', 'PC5': 'Youth-Drama', 'PC6': 'Mystery-Life', 'PC7': 'Action-Thriller', 'PC8': 'Supernatural-Fantasy', 'PC9': 'Fantasy-Mystery', 'PC10': 'Thriller-Fantasy', 'PC11': 'Melodrama-Family'}
clustering df:
    tot_eps  ep_duration  tot_user_score  Japan  South Korea  Thailand  \
0  0.024291     0.388158        0.766667  False         True     False   
1  0.008097     0.098684        0.777778  False         True     False   
2  0.036437     0.519737        0.855556  False         True     False   
3  0.028340     0.157895        0.788889  False         True     False   
4  0.044534     0.453947        0.833333  False         True     False   

   Romance-Comedy  Drama-Romance  Romance-Mystery  Life-Youth  Youth-Drama  \
0       -0.738250      -0.160334        -0.509042   -1.027447     0.324786   
1       -0.446158      -0.363106        -0.024570    0.342908    -0.119056   
2        0.248434       0

# 3.5 Save the dataframe to use it with Weka

In [146]:
print("clustering cols: ", clustering.columns)

print("clustering:")
print(clustering.head())

print("clustering info:")
print(clustering.info())

clustering.to_csv("clustering.csv", index=False)

clustering cols:  Index(['tot_eps', 'ep_duration', 'tot_user_score', 'Japan', 'South Korea',
       'Thailand', 'Romance-Comedy', 'Drama-Romance', 'Romance-Mystery',
       'Life-Youth', 'Youth-Drama', 'Mystery-Life', 'Action-Thriller',
       'Supernatural-Fantasy', 'Fantasy-Mystery', 'Thriller-Fantasy',
       'Melodrama-Family'],
      dtype='object')
clustering:
    tot_eps  ep_duration  tot_user_score  Japan  South Korea  Thailand  \
0  0.024291     0.388158        0.766667      0            1         0   
1  0.008097     0.098684        0.777778      0            1         0   
2  0.036437     0.519737        0.855556      0            1         0   
3  0.028340     0.157895        0.788889      0            1         0   
4  0.044534     0.453947        0.833333      0            1         0   

   Romance-Comedy  Drama-Romance  Romance-Mystery  Life-Youth  Youth-Drama  \
0       -0.738250      -0.160334        -0.509042   -1.027447     0.324786   
1       -0.446158      -0.3631