# Preprocessing Movie Genre Classification Data

This notebook demonstrates the preprocessing steps for the movie genre classification dataset. We will load the data, parse it, handle missing values, tokenize descriptions, and save the preprocessed data for further analysis.

## 1. Load Dataset

Load the train and test datasets from the provided file paths.

In [1]:
import pandas as pd
import numpy as np

# Load train and test data as raw text
with open('data/train_data.txt', 'r', encoding='utf-8') as f:
    train_lines = f.readlines()
with open('data/test_data.txt', 'r', encoding='utf-8') as f:
    test_lines = f.readlines()

print(f"Loaded {len(train_lines)} train samples and {len(test_lines)} test samples.")

Loaded 54214 train samples and 54200 test samples.
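
The counts printed above come from `readlines()`, so they include any blank lines in the files, which the parsing step later skips. A minimal sketch of a loader that drops blank lines up front follows; `read_nonempty_lines` is a hypothetical helper, not part of the original notebook, and the sample file contents are invented for the demonstration.

```python
import tempfile
from pathlib import Path

def read_nonempty_lines(path, encoding="utf-8"):
    """Return the stripped, non-blank lines of a text file."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration on a throwaway file (invented content, same ' ::: ' layout)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(
        "1 ::: Example One (2000) ::: drama ::: First plot.\n"
        "\n"
        "2 ::: Example Two (2001) ::: comedy ::: Second plot.\n",
        encoding="utf-8",
    )
    sample_lines = read_nonempty_lines(sample_path)

print(f"Loaded {len(sample_lines)} non-blank lines.")
```

With this helper the reported count would match the number of rows that reach the DataFrame, assuming every non-blank line parses cleanly.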


## 2. Parse Dataset

Each line separates its fields with ` ::: `. Split the train lines into four columns (ID, TITLE, GENRE, DESCRIPTION) and the test lines into three (ID, TITLE, DESCRIPTION).

In [2]:
# Parse train data; maxsplit=3 keeps any ' ::: ' inside a description in the last field
train_records = [line.strip().split(' ::: ', 3) for line in train_lines if line.strip()]
train_df = pd.DataFrame(train_records, columns=["ID", "TITLE", "GENRE", "DESCRIPTION"])

# Parse test data (three fields, so maxsplit=2)
test_records = [line.strip().split(' ::: ', 2) for line in test_lines if line.strip()]
test_df = pd.DataFrame(test_records, columns=["ID", "TITLE", "DESCRIPTION"])

# Show first few rows
print("Train Data Sample:")
display(train_df.head())
print("Test Data Sample:")
display(test_df.head())

Train Data Sample:


|   | ID | TITLE | GENRE | DESCRIPTION |
|---|----|-------|-------|-------------|
| 0 | 1 | Oscar et la dame rose (2009) | drama | Listening in to a conversation between his doc... |
| 1 | 2 | Cupid (1997) | thriller | A brother and sister with a past incestuous re... |
| 2 | 3 | Young, Wild and Wonderful (1980) | adult | As the bus empties the students for their fiel... |
| 3 | 4 | The Secret Sin (1915) | drama | To help their unemployed father make ends meet... |
| 4 | 5 | The Unrecovered (2007) | drama | The film's title refers not only to the un-rec... |


Test Data Sample:


|   | ID | TITLE | DESCRIPTION |
|---|----|-------|-------------|
| 0 | 1 | Edgar's Lunch (1998) | L.R. Brane loves his life - his car, his apart... |
| 1 | 2 | La guerra de papá (1977) | Spain, March 1964: Quico is a very naughty chi... |
| 2 | 3 | Off the Beaten Track (2010) | One year in the life of Albin and his family o... |
| 3 | 4 | Meu Amigo Hindu (2015) | His father has died, he hasn't spoken with his... |
| 4 | 5 | Er nu zhai (1955) | Before he was known internationally as a marti... |
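
The list comprehensions above assume every line splits into exactly the expected number of fields. As a hedged sketch, the same split can be wrapped in a function that also reports lines with too few fields instead of passing them on silently. `parse_lines`, `DELIMITER`, and the sample lines are illustrative names and data, not part of the original notebook.

```python
DELIMITER = " ::: "

def parse_lines(lines, n_fields):
    """Split each non-blank line into n_fields fields.

    Returns (records, bad_line_numbers). maxsplit keeps a delimiter that
    occurs inside the final field (the description) from adding columns.
    """
    records, bad_line_numbers = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, as the notebook does
        fields = line.split(DELIMITER, n_fields - 1)
        if len(fields) == n_fields:
            records.append(fields)
        else:
            bad_line_numbers.append(number)
    return records, bad_line_numbers

# Invented sample: one good line, one blank line, one line missing a field
sample = [
    "1 ::: Example One (2001) ::: drama ::: A story ::: with a delimiter inside.\n",
    "\n",
    "2 ::: Example Two (2002) ::: comedy\n",
]
sample_records, sample_bad = parse_lines(sample, n_fields=4)
print(sample_records)
print("Malformed line numbers:", sample_bad)
```

The returned records can be passed to `pd.DataFrame(records, columns=[...])` exactly as in the cell above; an empty list of malformed line numbers confirms that the file matched the expected layout.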


## 3. Handle Missing Values

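Count the missing values in each column. Note that `isnull()` flags only absent values; a description that is present but empty is an empty string and is not counted. No missing values are found in either dataset, so the fill step is left commented out.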

In [3]:
# Check for missing values in train and test data
def report_missing(df, name):
    missing = df.isnull().sum()
    print(f"Missing values in {name}:")
    print(missing)
    print()

report_missing(train_df, "Train Data")
report_missing(test_df, "Test Data")

# No missing values were found, so the fill is left disabled.
# Uncomment to replace missing descriptions with an empty string:
# train_df["DESCRIPTION"] = train_df["DESCRIPTION"].fillna("")
# test_df["DESCRIPTION"] = test_df["DESCRIPTION"].fillna("")

Missing values in Train Data:
ID             0
TITLE          0
GENRE          0
DESCRIPTION    0
dtype: int64

Missing values in Test Data:
ID             0
TITLE          0
DESCRIPTION    0
dtype: int64
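
Because every field here was produced by splitting text, an empty description would appear as an empty or whitespace-only string rather than as a null, and `isnull()` would not count it. The sketch below shows one way to count both cases together; `count_blank` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def count_blank(df, columns):
    """Count values per column that are null or contain only whitespace."""
    counts = {}
    for col in columns:
        values = df[col]
        # Treat nulls as blank, then test the remaining strings after stripping
        blank = values.isna() | values.fillna("").astype(str).str.strip().eq("")
        counts[col] = int(blank.sum())
    return counts

# Toy data: one real description, one whitespace-only, one missing
toy_df = pd.DataFrame({
    "TITLE": ["A (2000)", "B (2001)", "C (2002)"],
    "DESCRIPTION": ["A plot.", "   ", None],
})
blank_counts = count_blank(toy_df, ["TITLE", "DESCRIPTION"])
print(blank_counts)
```

Applied to the notebook's data, this would be called as `count_blank(train_df, ["TITLE", "GENRE", "DESCRIPTION"])`; whether any blank strings exist in the real files is not shown in the outputs above.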



## 4. Tokenize Descriptions

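Lowercase each description and split it into word tokens with a regular expression. The pattern matches runs of word characters (letters, digits, and underscores), so punctuation is dropped and a contraction such as "hasn't" becomes the two tokens `hasn` and `t`.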

In [None]:
import re

def simple_tokenizer(text):
    # Lowercase, then extract runs of word characters (letters, digits, underscore)
    return re.findall(r'\b\w+\b', text.lower())

# Tokenize descriptions
train_df['TOKENS'] = train_df['DESCRIPTION'].apply(simple_tokenizer)
test_df['TOKENS'] = test_df['DESCRIPTION'].apply(simple_tokenizer)

# Show tokenized samples
print("Tokenized Train Data Sample:")
display(train_df[['DESCRIPTION', 'TOKENS']].head())
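
To make the tokenizer's behavior concrete, the sketch below runs it on a short phrase adapted from one of the test descriptions and compares it with a variant that keeps internal apostrophes. `apostrophe_tokenizer` is a hypothetical alternative shown for contrast only; the notebook itself uses `simple_tokenizer`.

```python
import re

def simple_tokenizer(text):
    # Same tokenizer as in the notebook: lowercase, then runs of word characters
    return re.findall(r'\b\w+\b', text.lower())

def apostrophe_tokenizer(text):
    # Hypothetical variant: also keeps apostrophes that sit between word characters
    return re.findall(r"\w+(?:'\w+)*", text.lower())

example = "His father has died, he hasn't spoken"

simple_tokens = simple_tokenizer(example)
apostrophe_tokens = apostrophe_tokenizer(example)

print(simple_tokens)
print(apostrophe_tokens)
```

The first tokenizer splits "hasn't" into `hasn` and `t`, while the variant keeps it as one token. Which behavior is preferable depends on the downstream model, so this is a design choice rather than a correction.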

## 5. Save Preprocessed Data

Write both DataFrames to CSV files for later use. List-valued columns such as TOKENS are written as their string representation.

In [4]:
# Save preprocessed data
train_df.to_csv('data/train_data_preprocessed.csv', index=False)
test_df.to_csv('data/test_data_preprocessed.csv', index=False)
print("Preprocessed data saved to data/train_data_preprocessed.csv and data/test_data_preprocessed.csv")

Preprocessed data saved to data/train_data_preprocessed.csv and data/test_data_preprocessed.csv
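
If the TOKENS column from section 4 is present when the files are written, `to_csv` stores each token list as text, and `read_csv` returns it as a plain string. The sketch below shows one way to restore the lists on reload with `ast.literal_eval`; it uses an in-memory buffer and invented rows instead of the real files.

```python
import ast
import io

import pandas as pd

# Invented rows standing in for the preprocessed data
original = pd.DataFrame({
    "ID": ["1", "2"],
    "TOKENS": [["a", "story"], ["it's"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# A plain read returns the lists as strings such as "['a', 'story']"
buffer.seek(0)
raw = pd.read_csv(buffer)

# A converter parses each string back into a Python list
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"TOKENS": ast.literal_eval})

print(type(raw.loc[0, "TOKENS"]).__name__)
print(restored.loc[0, "TOKENS"])
```

For the real files, the same `converters` argument would be passed when reading `data/train_data_preprocessed.csv`. `ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for this purpose.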
