# Movie Genre Classification Model Training

This notebook trains a movie genre classifier. The input features are each movie's title and description, and the target is its genre label.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Load preprocessed data
train_df = pd.read_csv('data/train_data_preprocessed.csv')
train_df.head()

Unnamed: 0,ID,TITLE,GENRE,DESCRIPTION
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...


## Combine Title and Description

We combine the movie title and description into a single text feature for classification.

In [2]:
# Combine TITLE and DESCRIPTION into a single text feature
train_df['text'] = train_df['TITLE'].astype(str) + ' ' + train_df['DESCRIPTION'].astype(str)

# Features and labels
X = train_df['text']
y = train_df['GENRE']

X[:3], y[:3]  # Show sample combined text and sample labels

(0    Oscar et la dame rose (2009) Listening in to a...
 1    Cupid (1997) A brother and sister with a past ...
 2    Young, Wild and Wonderful (1980) As the bus em...
 Name: text, dtype: object,
 0       drama
 1    thriller
 2       adult
 Name: GENRE, dtype: object)
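
One caveat with joining columns via astype(str): in many pandas versions a missing value becomes the literal text 'nan' and ends up in the feature. The sketch below uses a small made-up frame (toy_df is hypothetical, not the notebook's data) to show one way to blank out missing fields before joining. Whether the preprocessed CSV contains missing values is not shown here, so treat this as an optional safeguard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook itself uses train_df loaded from CSV
toy_df = pd.DataFrame({
    'TITLE': ['Cupid (1997)', 'Untitled'],
    'DESCRIPTION': ['A brother and sister...', np.nan],
})

# fillna('') leaves missing fields empty; str.strip() drops the dangling space
clean = (toy_df['TITLE'].fillna('') + ' ' + toy_df['DESCRIPTION'].fillna('')).str.strip()

print(clean.tolist())
```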

## Train-Validation Split

Hold out 20% of the data as a validation set, stratified by genre, to evaluate model performance.

In [3]:
# train_test_split was already imported in the first cell.
# stratify=y keeps each genre's share the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train samples: {len(X_train)}, Validation samples: {len(X_val)}")

Train samples: 43371, Validation samples: 10843
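
To illustrate what stratify=y does, here is a minimal sketch on made-up labels (toy_X and toy_y are hypothetical stand-ins for X and y). It checks that the genre proportions come out the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% drama, 20% short
toy_y = pd.Series(['drama'] * 80 + ['short'] * 20)
toy_X = pd.Series([f'movie {i}' for i in range(len(toy_y))])

_, _, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both splits should mirror the genre proportions of the full data
train_share = y_tr.value_counts(normalize=True)
val_share = y_va.value_counts(normalize=True)
print(train_share)
print(val_share)
```

Without stratification, a rare genre could land almost entirely in one split, which would distort the per-class metrics reported later.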


## Text Vectorization

Convert the combined text into numerical features using TF-IDF. The vectorizer is fit on the training split only and keeps the 5,000 most frequent terms.

In [4]:
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

print(f"TF-IDF feature shape: {X_train_vec.shape}")

TF-IDF feature shape: (43371, 5000)


In [6]:
# Refit with English stop-word removal, unigrams plus bigrams, and a larger vocabulary.
# This replaces the vectorizer and features from the previous cell.
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english', ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

print(f"TF-IDF feature shape: {X_train_vec.shape}")

TF-IDF feature shape: (43371, 10000)


## Model Training

Train a Logistic Regression classifier on the vectorized text features.

In [None]:
# Address class imbalance with class_weight='balanced'
clf = LogisticRegression(max_iter=200, class_weight='balanced')
clf.fit(X_train_vec, y_train)
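
For intuition about class_weight='balanced': scikit-learn weights each class by n_samples / (n_classes * class_count), so rare genres count more in the loss. The sketch below computes those weights on made-up labels (toy_labels is hypothetical).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 drama, 2 short
toy_labels = np.array(['drama'] * 8 + ['short'] * 2)
classes = np.unique(toy_labels)

# Same weighting rule that class_weight='balanced' applies inside the classifier
weights = compute_class_weight(class_weight='balanced', classes=classes, y=toy_labels)
weight_by_class = dict(zip(classes, weights))
print(weight_by_class)
```

Here drama gets 10 / (2 * 8) = 0.625 and short gets 10 / (2 * 2) = 2.5, so each short example counts four times as much as a drama example.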

## Hyperparameter Tuning with GridSearchCV

Let's use GridSearchCV to find the best hyperparameters for the Logistic Regression model.

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [200, 500],
    'class_weight': ['balanced'],
    'solver': ['liblinear', 'lbfgs']
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring='f1_weighted', n_jobs=-1, verbose=2)
grid.fit(X_train_vec, y_train)

print('Best parameters:', grid.best_params_)
print('Best cross-validation score:', grid.best_score_)

# Use the best estimator for evaluation
grid_y_pred = grid.predict(X_val_vec)
print('GridSearchCV Validation Accuracy:', accuracy_score(y_val, grid_y_pred))
print(classification_report(y_val, grid_y_pred))

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=liblinear; total time=   4.2s
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=liblinear; total time=   4.6s
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=lbfgs; total time=   0.9s
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=lbfgs; total time=   0.8s
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=liblinear; total time=   2.5s
[CV] END C=0.01, class_weight=balanced, max_iter=200, solver=lbfgs; total time=   0.8s
[CV] END C=0.01, class_weight=balanced, max_iter=500, solver=liblinear; total time=   3.6s
[CV] END C=0.01, class_weight=balanced, max_iter=500, solver=liblinear; total time=   4.7s
[CV] END C=0.01, class_weight=balanced, max_iter=500, solver=lbfgs; total time=   0.8s
[CV] END C=0.01, class_weight=balanced, max_iter=500, solver=lbfgs; total time=   1.0s
[CV] END C=0.01, class_weight=balanced, max_iter=
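
In the search above, the TF-IDF vectorizer was fit on all of X_train before cross-validation, so each held-out fold has already influenced the vocabulary and IDF weights. One possible alternative, sketched here on a made-up corpus (toy_texts, toy_genres, and the step names 'tfidf' and 'clf' are all assumptions), is to put the vectorizer and classifier in a Pipeline so that each fold refits both.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
toy_texts = [
    'a detective hunts a killer', 'the killer stalks the city',
    'a murder shocks the town', 'police chase the suspect',
    'two friends fall in love', 'a wedding brings love',
    'a romance blooms in paris', 'love letters cross the sea',
]
toy_genres = ['thriller'] * 4 + ['romance'] * 4

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=500)),
])

# The 'clf__' prefix routes the parameter to the classifier step
toy_grid = GridSearchCV(
    pipe,
    param_grid={'clf__C': [0.1, 1, 10]},
    cv=2,
    scoring='f1_weighted',
)
toy_grid.fit(toy_texts, toy_genres)
print(toy_grid.best_params_)
```

A side benefit of this design is that best_estimator_ then accepts raw text directly, so the vectorizer and classifier cannot get out of sync at inference time.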

## Model Evaluation

Evaluate the baseline Logistic Regression model (clf, from the Model Training section) on the validation set and print accuracy and per-class metrics.

In [None]:
y_pred = clf.predict(X_val_vec)
print('Validation Accuracy:', accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))
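
With imbalanced genres, accuracy alone can look better than the model really is. The sketch below uses made-up labels (toy_true and toy_pred are hypothetical) where a model predicts only the majority class, and compares accuracy with macro-averaged F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical case: the model always predicts the majority genre
toy_true = ['drama'] * 8 + ['short'] * 2
toy_pred = ['drama'] * 10

acc = accuracy_score(toy_true, toy_pred)
# Macro F1 averages per-class F1, so the missed minority class pulls it down
macro_f1 = f1_score(toy_true, toy_pred, average='macro', zero_division=0)
print(f'accuracy={acc:.3f}, macro F1={macro_f1:.3f}')
```

Accuracy is 0.8 here while macro F1 is about 0.44, which is why the per-class rows of classification_report are worth reading alongside the headline accuracy.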

## Inference Function

Define a function to predict the genre for new movie titles and descriptions.

In [8]:
# Use the best estimator from GridSearchCV for inference and saving
best_clf = grid.best_estimator_

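def predict_genre(title, description):
    """Predict the genre for one movie from its title and description."""
    # Build the text the same way as the training feature: TITLE + ' ' + DESCRIPTION
    combined = str(title) + ' ' + str(description)
    vec = vectorizer.transform([combined])
    return best_clf.predict(vec)[0]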

# Example usage:
genre = predict_genre('The Matrix', 'A computer hacker learns about the true nature of reality and his role in the war against its controllers.')
print('Predicted genre:', genre)

# Save the best model and vectorizer
import os
import joblib

os.makedirs('model', exist_ok=True)
joblib.dump(best_clf, 'model/genre_classifier.joblib')
joblib.dump(vectorizer, 'model/vectorizer.joblib')
print('Best model and vectorizer saved to model/ directory.')

Predicted genre: short
Best model and vectorizer saved to model/ directory.
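
To use the saved artifacts later, both files need to be loaded together, since the classifier only understands features produced by the vectorizer it was trained with. The sketch below is a minimal round trip on a made-up two-document corpus; it writes to a temporary directory rather than model/, and all toy_ names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model standing in for best_clf and vectorizer
toy_texts = ['a killer stalks the city', 'two friends fall in love']
toy_genres = ['thriller', 'romance']
toy_vectorizer = TfidfVectorizer()
toy_clf = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_genres)

with tempfile.TemporaryDirectory() as tmp_dir:
    clf_path = os.path.join(tmp_dir, 'genre_classifier.joblib')
    vec_path = os.path.join(tmp_dir, 'vectorizer.joblib')
    joblib.dump(toy_clf, clf_path)
    joblib.dump(toy_vectorizer, vec_path)

    # Load both artifacts back, as a separate inference script would
    loaded_clf = joblib.load(clf_path)
    loaded_vectorizer = joblib.load(vec_path)

new_text = 'a killer in the city'
before = toy_clf.predict(toy_vectorizer.transform([new_text]))[0]
after = loaded_clf.predict(loaded_vectorizer.transform([new_text]))[0]
print(before, after)
```

Loading joblib files generally requires a scikit-learn version compatible with the one used for saving, so it can help to record the version alongside the model.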


## Try a Different Model: Random Forest

Let's also train a Random Forest classifier and compare its performance to Logistic Regression.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42, n_jobs=-1)
rf_clf.fit(X_train_vec, y_train)

rf_y_pred = rf_clf.predict(X_val_vec)
print('Random Forest Validation Accuracy:', accuracy_score(y_val, rf_y_pred))
print(classification_report(y_val, rf_y_pred))
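
To compare the two classifiers on equal footing, it can help to score them with the same metrics on the same split. The helper below is a sketch only: compare_models is a hypothetical function, and the toy corpus stands in for X_train_vec / X_val_vec.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


def compare_models(models, X_tr, y_tr, X_va, y_va):
    """Fit each model and return accuracy and macro F1 on the validation data."""
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_va)
        results[name] = {
            'accuracy': accuracy_score(y_va, pred),
            'macro_f1': f1_score(y_va, pred, average='macro', zero_division=0),
        }
    return results


# Hypothetical toy data, not the notebook's split
train_texts = ['a killer stalks the city', 'police chase a killer',
               'two friends fall in love', 'a wedding full of love']
train_genres = ['thriller', 'thriller', 'romance', 'romance']
val_texts = ['the killer escapes', 'love in paris']
val_genres = ['thriller', 'romance']

toy_vec = TfidfVectorizer()
X_tr = toy_vec.fit_transform(train_texts)
X_va = toy_vec.transform(val_texts)

results = compare_models(
    {
        'logreg': LogisticRegression(class_weight='balanced', max_iter=200),
        'random_forest': RandomForestClassifier(
            n_estimators=100, class_weight='balanced', random_state=42
        ),
    },
    X_tr, train_genres, X_va, val_genres,
)
for name, scores in results.items():
    print(name, scores)
```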