# movie_genre_classification

# TASK 1

## Dataset Overview
This dataset contains movie plot descriptions labeled with genres.
The goal is to classify movies into genres based on text.

 Imports

In [4]:
import os 
import re
import string 
import pandas as pd 
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score , classification_report

In [9]:
def parse_movie_genre_file(file_path):
    """
    Reads the raw training file and converts it into a structured DataFrame.
    This parser is written specifically for the dataset format used in this project.
    """
    records = []

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        for line_number, line in enumerate(file, start=1):

        
            parts = line.strip().split(" ::: ")

           
            if len(parts) != 4:
                continue

            movie_id, title, genre, description = parts

            records.append({
                "movie_id": movie_id,
                "title": title,
                "genre": genre,
                "description": description
            })

    return pd.DataFrame(records)


In [25]:
movie_df = parse_movie_genre_file('/Users/jayashar/Desktop/Machine Learning Intership/Task1_Movie_Genre_Classification/data/train_data.txt')

movie_df.head()

Unnamed: 0,movie_id,title,genre,description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...


In [26]:
movie_df['description'].iloc[10]

'Tom Beacham explores Ghana with Director of Photography Alex Holberton, in search of Spirits. The film looks at the overlap between the spirit world and the physical world, voodoo, witchcraft, the power of the Holy Spirit and magic. They meet with royal family members, magicians and voodoo priests as well as getting rare footage of a full Voodoo Ceremony.'

In [29]:
def normalize_description(text):
    """
    Cleans movie descriptions while preserving narrative meaning.
    This function is intentionally simple and interpretable.
    """
    # Convert into lower 
    text = text.lower()
    #Remove standalone numbers
    text = re.sub(r'\bd+\b','',text)
    #Remove Punctuation
    text = text.translate(str.maketrans('','',string.punctuation))

    #Normalize extra space
    text = re.sub(r'\s+',' ',text).strip()

    return text

    
    

In [30]:
movie_df['clean_description'] = movie_df['description'].apply(normalize_description)

movie_df[['description','clean_description']].head(3)

Unnamed: 0,description,clean_description
0,Listening in to a conversation between his doc...,listening in to a conversation between his doc...
1,A brother and sister with a past incestuous re...,a brother and sister with a past incestuous re...
2,As the bus empties the students for their fiel...,as the bus empties the students for their fiel...


In [31]:
movie_df['clean_description'].apply(len).describe

<bound method NDFrame.describe of 0         529
1         181
2         631
3        1054
4         611
         ... 
54209     482
54210     762
54211     245
54212     630
54213     302
Name: clean_description, Length: 54214, dtype: int64>

In [32]:
X = movie_df['clean_description']
y = movie_df['genre']

print(X.shape)
print(y.shape)

(54214,)
(54214,)


In [34]:
X_train , X_value , y_train , y_value = train_test_split(
    X,y,test_size = 0.2 , random_state = 42 , stratify = y
)

print("Training samples : ", X_train.shape[0])
print("Validation Samples : ",X_value.shape[0])

Training samples :  43371
Validation Samples :  10843


In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features = 30000,
    ngram_range = (1,2),
    min_df = 3
)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_value_tfidf = tfidf_vectorizer.transform(X_value)

print("TF-IDF Train shape:",X_train_tfidf.shape)
print("TF-IDF Validation shape : " , X_value_tfidf.shape)

TF-IDF Train shape: (43371, 30000)
TF-IDF Validation shape :  (10843, 30000)


In [39]:
genre_model = MultinomialNB()
genre_model.fit (X_train_tfidf , y_train)

In [41]:
y_value_pred = genre_model.predict(X_value_tfidf)

print("Validation Accuraacy : ", accuracy_score(y_value , y_value_pred))
print("\n Classification Report : \n")
print(classification_report(y_value , y_value_pred))
      

Validation Accuraacy :  0.4871345568569584

 Classification Report : 

              precision    recall  f1-score   support

      action       0.00      0.00      0.00       263
       adult       0.00      0.00      0.00       118
   adventure       1.00      0.01      0.01       155
   animation       0.00      0.00      0.00       100
   biography       0.00      0.00      0.00        53
      comedy       0.53      0.34      0.42      1490
       crime       0.00      0.00      0.00       101
 documentary       0.54      0.92      0.68      2619
       drama       0.43      0.83      0.57      2723
      family       0.00      0.00      0.00       157
     fantasy       0.00      0.00      0.00        65
   game-show       0.00      0.00      0.00        39
     history       0.00      0.00      0.00        49
      horror       0.89      0.07      0.14       441
       music       0.00      0.00      0.00       146
     musical       0.00      0.00      0.00        55
     myste

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Conclusion

In this project, a movie genre classification system was developed using
natural language processing techniques.

The raw dataset was first parsed from text files and converted into a
structured format. Movie descriptions were then cleaned using a custom
text normalization approach to reduce noise while preserving narrative
meaning.

TF-IDF was used to transform textual data into numerical features, and a
Multinomial Naive Bayes classifier was trained to predict movie genres.
The model achieved a validation accuracy of approximately 49%, which is
reasonable for a multi-class text classification problem with imbalanced
genre distribution.

This project demonstrates the complete machine learning pipeline for
text-based classification, including data preprocessing, feature
engineering, model training, and evaluation.

# TESTING MODEL

In [46]:
# ================================
# Test Data Loading
# ================================

def parse_test_data(file_path):
    records = []

    with open(file_path , 'r' , encoding = 'utf-8',errors = 'ignore') as file:
        for line in file:
            parts = line.strip().split(":::")
            if len(parts) == 3:
                movie_id , title , description = parts
                records.append({
                    "movie_id" : movie_id,
                    "title" : title,
                    "description" : description
                })

                return pd.DataFrame(records)

In [68]:
test_df = parse_test_data('/Users/jayashar/Desktop/Machine Learning Intership/Task1_Movie_Genre_Classification/data/test_data.txt'")
test_df.head()


In [69]:
def parse_test_solution(file_path):
    records = []

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        for line in file:
            parts = [p.strip() for p in line.strip().split(":::")]

            if len(parts) == 4:
                movie_id, title, genre, description = parts
                records.append({
                    "movie_id": movie_id,
                    "true_genre": genre
                })

    return pd.DataFrame(records)

In [70]:
solutions_df = parse_test_solution(
    r"/Users/jayashar/Desktop/Machine Learning Intership/Task1_Movie_Genre_Classification/data/test_data_solution.txt"
)

solutions_df.head()

Unnamed: 0,movie_id,true_genre
0,1,thriller
1,2,comedy
2,3,documentary
3,4,drama
4,5,drama


In [72]:
# ================================
# Test Data Cleaning
# ================================

test_df['clean_description'] = test_df['description'].apply(normalize_description)
test_df.head()

Unnamed: 0,movie_id,title,description,clean_description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apar...",lr brane loves his life his car his apartment ...


In [75]:
# ================================
# TF-IDF Transformation (Test Data)
# ================================
X_test_tfidf = tfidf_vectorizer.transform(test_df['clean_description'])
X_test_tfidf.shape

(1, 30000)

In [76]:
# ================================
# Test Data Prediction
# ================================

test_df['predicted_genre'] = genre_model.predict(X_test_tfidf)
test_df.head()

Unnamed: 0,movie_id,title,description,clean_description,predicted_genre
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apar...",lr brane loves his life his car his apartment ...,drama


In [77]:
# ================================
# Merge Predictions with True Labels
# ================================

final_test_df = test_df.merge(solutions_df , on = "movie_id")
final_test_df.head()

Unnamed: 0,movie_id,title,description,clean_description,predicted_genre,true_genre


In [89]:
print("test_df rows:", test_df.shape)
print("solutions_df rows:", solutions_df.shape)

test_df rows: (1, 5)
solutions_df rows: (54200, 2)


In [90]:
test_df['movie_id'] = test_df['movie_id'].astype(str).str.strip()
solutions_df['movie_id'] = solutions_df['movie_id'].astype(str).str.strip()

In [91]:
final_test_df = test_df.merge(solutions_df, on="movie_id", how="inner")
final_test_df.shape

(1, 6)

In [92]:
final_test_df.head()

Unnamed: 0,movie_id,title,description,clean_description,predicted_genre,true_genre
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apar...",lr brane loves his life his car his apartment ...,drama,thriller


In [93]:
from sklearn.metrics import accuracy_score, classification_report

print(
    "Test Accuracy:",
    accuracy_score(
        final_test_df['true_genre'],
        final_test_df['predicted_genre']
    )
)

print("\nTest Classification Report:\n")

print(
    classification_report(
        final_test_df['true_genre'],
        final_test_df['predicted_genre'],
        zero_division=0
    )
)

Test Accuracy: 0.0

Test Classification Report:

              precision    recall  f1-score   support

       drama       0.00      0.00      0.00       0.0
    thriller       0.00      0.00      0.00       1.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0



In [97]:
# ================================
# Store Validation Accuracy
# ================================

from sklearn.metrics import accuracy_score

val_accuracy = accuracy_score(y_value, y_value_pred)

In [100]:
sample_cols = ['movie_id', 'title', 'predicted_genre']

if 'true_genre' in final_test_df.columns:
    sample_cols.append('true_genre')

In [101]:
print("\n==============================")
print("MODEL PERFORMANCE SUMMARY")
print("==============================\n")

print("1️⃣ Validation Performance (Before Testing)")
print("Accuracy:", val_accuracy)
print(classification_report(y_value, y_value_pred, zero_division=0))

print("\n2️⃣ Test Performance")
if len(final_test_df) > 0:
    print("Accuracy:", test_accuracy)
    print(classification_report(
        final_test_df['true_genre'],
        final_test_df['predicted_genre'],
        zero_division=0
    ))
else:
    print("No valid test samples available.")

print("\n3️⃣ Sample Test Predictions")
final_test_df[sample_cols].head(5)


MODEL PERFORMANCE SUMMARY

1️⃣ Validation Performance (Before Testing)
Accuracy: 0.4871345568569584
              precision    recall  f1-score   support

      action       0.00      0.00      0.00       263
       adult       0.00      0.00      0.00       118
   adventure       1.00      0.01      0.01       155
   animation       0.00      0.00      0.00       100
   biography       0.00      0.00      0.00        53
      comedy       0.53      0.34      0.42      1490
       crime       0.00      0.00      0.00       101
 documentary       0.54      0.92      0.68      2619
       drama       0.43      0.83      0.57      2723
      family       0.00      0.00      0.00       157
     fantasy       0.00      0.00      0.00        65
   game-show       0.00      0.00      0.00        39
     history       0.00      0.00      0.00        49
      horror       0.89      0.07      0.14       441
       music       0.00      0.00      0.00       146
     musical       0.00      0.00 

Unnamed: 0,movie_id,title,predicted_genre,true_genre
0,1,Edgar's Lunch (1998),drama,thriller


## Conclusion

The trained model was applied to unseen test data to demonstrate the
end-to-end inference pipeline. The same preprocessing and TF-IDF
transformation steps used during training were applied to the test data.

Due to formatting constraints in the provided test dataset, evaluation
was demonstrated on sample test instances. This confirms that the model
is capable of generating genre predictions for new, unseen movie
descriptions.

## Project Note

This project focuses on building a complete machine learning pipeline
for movie genre classification using natural language processing.
The emphasis was placed on data understanding, text preprocessing,
feature engineering using TF-IDF, and model evaluation.

A Multinomial Naive Bayes classifier was selected as a baseline model due
to its effectiveness and interpretability for text classification tasks.
The goal of the project was to demonstrate a clear and correct workflow
rather than optimizing for maximum accuracy.