Author : Pradeeshkumar U
Date   : 18-01-2025
Topic  : Movie Genre Prediction

Importing necessary Packages:

In [1]:
import pandas as pd
import warnings 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Let's look at the dataset description:

In [103]:
warnings.filterwarnings('ignore')
description = pd.read_csv('data/description.txt')
description

Unnamed: 0,Train data:
0,ID ::: TITLE ::: GENRE ::: DESCRIPTION
1,ID ::: TITLE ::: GENRE ::: DESCRIPTION
2,ID ::: TITLE ::: GENRE ::: DESCRIPTION
3,ID ::: TITLE ::: GENRE ::: DESCRIPTION
4,Test data:
5,ID ::: TITLE ::: DESCRIPTION
6,ID ::: TITLE ::: DESCRIPTION
7,ID ::: TITLE ::: DESCRIPTION
8,ID ::: TITLE ::: DESCRIPTION
9,Source:


Function to read a dataset file and split each line into its fields:

In [104]:
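def data_path(file_path):
    """Read a ' ::: '-delimited text file and return one list of fields per line."""
    with open(file_path,'r',encoding='utf-8') as f:
        data = f.readlines()
    # Strip the trailing newline, then split each line on the ' ::: ' separator
    data = [line.strip().split(' ::: ') for line in data]
    return data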
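
To illustrate what this parsing step produces, here is a minimal sketch that applies the same split to two in-memory lines instead of a file. The sample lines and the helper name `parse_lines` are made up for illustration. Limiting the number of splits is an added assumption, not part of the original function: it keeps a description that happens to contain ` ::: ` from spilling into extra columns.

```python
import pandas as pd

def parse_lines(lines, n_fields=4):
    """Split ' ::: '-delimited lines into at most n_fields fields (hypothetical helper)."""
    return [line.strip().split(' ::: ', n_fields - 1) for line in lines]

# Made-up sample lines in the ID ::: TITLE ::: GENRE ::: DESCRIPTION layout
sample_lines = [
    "1 ::: Example Movie (2000) ::: drama ::: A short made-up description.\n",
    "2 ::: Another Film (2010) ::: comedy ::: Text with a stray ::: separator inside.\n",
]

rows = parse_lines(sample_lines)
sample_df = pd.DataFrame(rows, columns=['ID', 'Name', 'Genre', 'Description'])
print(sample_df)
```

Note that every field, including `ID`, comes back as a string.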

Importing Dataset and Preprocessing:

In [105]:
columns = ['ID','Name','Genre','Description']
test_col = [0,1,3]  # the test file has no Genre column

# Training data: ID, Name, Genre, Description
train_data = data_path('data/train_data.txt')
train_df = pd.DataFrame(train_data,columns=columns)

# Test data: ID, Name, Description
test_data = data_path('data/test_data.txt')
test_df = pd.DataFrame(test_data,columns=[columns[col] for col in test_col])

# Test data with the true genres, used later for evaluation
test_solution_data = data_path('data/test_data_solution.txt')
test_solution_df = pd.DataFrame(test_solution_data,columns=columns)

Checking train_df:

In [106]:
train_df.head()

Unnamed: 0,ID,Name,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...


Checking test_df:

In [107]:
test_df.head()

Unnamed: 0,ID,Name,Description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),Before he was known internationally as a marti...


Checking test_solution_df:

In [108]:
test_solution_df.head()

Unnamed: 0,ID,Name,Genre,Description
0,1,Edgar's Lunch (1998),thriller,"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),comedy,"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),documentary,One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),drama,"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),drama,Before he was known internationally as a marti...


Feature Extraction:

In [109]:
vectorizer = TfidfVectorizer(max_features=10000)

X_train_df_tfid = vectorizer.fit_transform(train_df["Description"])

X_test_df_tfid = vectorizer.transform(test_df["Description"])

print('Shape of X_train_df_tfid',X_train_df_tfid.shape)
print('Shape of X_test_df_tfid',X_test_df_tfid.shape)

Shape of X_train_df_tfid (54214, 10000)
Shape of X_test_df_tfid (54200, 10000)
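
The difference between `fit_transform` and `transform` can be seen on a toy example. This is a minimal sketch on made-up sentences, not the notebook's data: the vocabulary is learned from the training texts only, so both matrices have the same number of columns and words that appear only in the test text are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpora, for illustration only
toy_train = [
    "a detective hunts a killer",
    "two friends switch lives",
    "a killer stalks two friends",
]
toy_test = ["a detective meets unseen words"]

toy_vectorizer = TfidfVectorizer(max_features=10000)
toy_X_train = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary and IDF weights
toy_X_test = toy_vectorizer.transform(toy_test)        # reuses them; unseen words are dropped

print(toy_X_train.shape, toy_X_test.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

With the default tokenizer, single-character tokens such as "a" are not part of the vocabulary.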


Label Encoding:

In [110]:
encoder = LabelEncoder()
Y_train = encoder.fit_transform(train_df['Genre'])
print('Genres in the Dataset',encoder.classes_)

Genres in the Dataset ['action' 'adult' 'adventure' 'animation' 'biography' 'comedy' 'crime'
 'documentary' 'drama' 'family' 'fantasy' 'game-show' 'history' 'horror'
 'music' 'musical' 'mystery' 'news' 'reality-tv' 'romance' 'sci-fi'
 'short' 'sport' 'talk-show' 'thriller' 'war' 'western']
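
As a small illustration of what the encoder does, the sketch below runs `LabelEncoder` on a few made-up labels: each genre is mapped to its index in the sorted `classes_` array, and `inverse_transform` maps the integers back to strings.

```python
from sklearn.preprocessing import LabelEncoder

toy_genres = ['drama', 'comedy', 'drama', 'horror']  # made-up labels

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_genres)

print(toy_encoder.classes_)                      # classes are stored in sorted order
print(toy_codes)                                 # integer code = index into classes_
print(toy_encoder.inverse_transform(toy_codes))  # back to the original strings
```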


Model Building - Logistic Regression:

In [111]:
model = LogisticRegression(max_iter=200)

model.fit(X_train_df_tfid,Y_train)

Y_pred = model.predict(X_test_df_tfid)
predicted_genres = encoder.inverse_transform(Y_pred)

test_df['Predicted Genre']=predicted_genres

test_df[['Name','Predicted Genre']]

Unnamed: 0,Name,Predicted Genre
0,Edgar's Lunch (1998),drama
1,La guerra de papá (1977),drama
2,Off the Beaten Track (2010),documentary
3,Meu Amigo Hindu (2015),drama
4,Er nu zhai (1955),drama
...,...,...
54195,"""Tales of Light & Dark"" (2013)",drama
54196,Der letzte Mohikaner (1965),drama
54197,Oliver Twink (2007),comedy
54198,Slipstream (1973),drama
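
The vectorize-then-classify steps can also be bundled into a single scikit-learn `Pipeline`, so that raw text goes in and a genre comes out. This is an optional sketch on made-up examples, not what the notebook does above. It also skips the `LabelEncoder`, since scikit-learn classifiers accept string labels directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training examples, for illustration only
toy_texts = [
    "a detective investigates a murder",
    "a murder mystery in a small town",
    "friends share a hilarious road trip",
    "a hilarious wedding goes wrong",
]
toy_labels = ["thriller", "thriller", "comedy", "comedy"]

toy_pipeline = make_pipeline(
    TfidfVectorizer(max_features=10000),
    LogisticRegression(max_iter=200),
)
toy_pipeline.fit(toy_texts, toy_labels)

# predict() takes raw strings; the pipeline applies TF-IDF internally
toy_prediction = toy_pipeline.predict(["a detective in a small town"])
print(toy_prediction)
```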


In [112]:

merged = pd.merge(test_solution_df[['ID','Genre']],test_df[['ID','Predicted Genre']],on='ID')
merged

Unnamed: 0,ID,Genre,Predicted Genre
0,1,thriller,drama
1,2,comedy,drama
2,3,documentary,documentary
3,4,drama,drama
4,5,drama,drama
...,...,...,...
54195,54196,horror,drama
54196,54197,western,drama
54197,54198,adult,comedy
54198,54199,drama,drama
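
Because `pd.merge` performs an inner join by default, rows whose `ID` does not match on both sides are dropped silently. The sketch below, on made-up frames, shows one way to guard against that: `validate='one_to_one'` makes pandas raise an error if an ID is duplicated, and a row-count comparison catches unmatched IDs.

```python
import pandas as pd

# Made-up frames; IDs are strings, as they are after splitting text lines
toy_solution = pd.DataFrame({'ID': ['1', '2', '3'],
                             'Genre': ['drama', 'comedy', 'horror']})
toy_predicted = pd.DataFrame({'ID': ['1', '2', '3'],
                              'Predicted Genre': ['drama', 'drama', 'horror']})

# validate='one_to_one' raises pandas.errors.MergeError if an ID repeats on either side
toy_merged = pd.merge(toy_solution, toy_predicted, on='ID', validate='one_to_one')

# An inner merge drops unmatched IDs, so compare row counts before evaluating
assert len(toy_merged) == len(toy_solution) == len(toy_predicted)
print(toy_merged)
```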


Model Evaluation - Logistic Regression:

In [113]:
accuracy = accuracy_score(merged['Genre'],merged['Predicted Genre'])
print(f'accuracy score : {accuracy:.2f}')

print(classification_report(merged['Genre'],merged['Predicted Genre']))

accuracy score : 0.59
              precision    recall  f1-score   support

      action       0.51      0.29      0.37      1314
       adult       0.65      0.24      0.35       590
   adventure       0.67      0.16      0.25       775
   animation       0.59      0.04      0.08       498
   biography       0.00      0.00      0.00       264
      comedy       0.54      0.60      0.57      7446
       crime       0.41      0.03      0.06       505
 documentary       0.68      0.87      0.76     13096
       drama       0.55      0.79      0.65     13612
      family       0.49      0.08      0.14       783
     fantasy       0.65      0.03      0.06       322
   game-show       0.90      0.50      0.64       193
     history       0.00      0.00      0.00       243
      horror       0.66      0.57      0.61      2204
       music       0.68      0.46      0.55       731
     musical       0.44      0.01      0.03       276
     mystery       0.33      0.00      0.01       318
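
The report shows several small classes (for example biography and history) with an F1-score of 0.00, even though overall accuracy is 0.59. The sketch below uses made-up labels to show why accuracy alone can hide this: a rare class that is never predicted barely lowers accuracy, but it pulls the macro-averaged F1 down sharply.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: a frequent class and a rare one that is never predicted
toy_true = ['drama'] * 8 + ['biography'] * 2
toy_pred = ['drama'] * 10

toy_accuracy = accuracy_score(toy_true, toy_pred)
# zero_division=0 scores the never-predicted class as 0 without emitting a warning
toy_macro_f1 = f1_score(toy_true, toy_pred, average='macro', zero_division=0)
toy_weighted_f1 = f1_score(toy_true, toy_pred, average='weighted', zero_division=0)

print(f'accuracy    : {toy_accuracy:.2f}')
print(f'macro F1    : {toy_macro_f1:.2f}')
print(f'weighted F1 : {toy_weighted_f1:.2f}')
```

Macro F1 gives every class equal weight, while weighted F1 weights each class by its support, so the two can differ widely on imbalanced data.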
     

Test Case:

In [117]:
actual_genre_test_case = ['drama','thriller','comedy','documentary','horror','romance','action','sci-fi','fantasy','animation']

test_case_data = [
    # Drama
    "A small-town girl struggles to fulfill her dreams of becoming a Broadway star while dealing with personal loss and family expectations.",
    
    # Thriller
    "A detective races against time to uncover the truth behind a series of mysterious disappearances in a sleepy coastal town.",
    
    # Comedy
    "Two lifelong friends find themselves in ridiculous situations when they accidentally switch lives for a week.",
    
    # Documentary
    "A breathtaking exploration of the planet's most remote wilderness areas and the wildlife that inhabit them.",
    
    # Horror
    "A group of teenagers discover an ancient curse that brings their darkest fears to life.",
    
    # Romance
    "A chance encounter on a train leads to an unexpected love story that spans years and continents.",
    
    # Action
    "An ex-special forces soldier must fight his way through a city held hostage by a dangerous criminal organization.",
    
    # Sci-Fi
    "In a distant future, humanity fights for survival on a desolate Earth after alien invaders take control of the planet.",
    
    # Fantasy
    "A young wizard discovers her hidden powers and must embark on a perilous journey to save her kingdom from a dark sorcerer.",
    
    # Animation
    "A mischievous robot and a brave little girl team up to save their village from an evil inventor's destructive plans.",
]

test_case_df = vectorizer.transform(test_case_data)

test_case_predict = model.predict(test_case_df)

test_case_predicted_genres = encoder.inverse_transform(test_case_predict)

print('Prediction Completed\n')
for i,des in enumerate(test_case_data):
    print(f'Description : {des}\nPredicted Genre : {test_case_predicted_genres[i]}\n   Actual Genre : {actual_genre_test_case[i]}\n')

Prediction Completed

Description : A small-town girl struggles to fulfill her dreams of becoming a Broadway star while dealing with personal loss and family expectations.
Predicted Genre : drama
   Actual Genre : drama

Description : A detective races against time to uncover the truth behind a series of mysterious disappearances in a sleepy coastal town.
Predicted Genre : thriller
   Actual Genre : thriller

Description : Two lifelong friends find themselves in ridiculous situations when they accidentally switch lives for a week.
Predicted Genre : comedy
   Actual Genre : comedy

Description : A breathtaking exploration of the planet's most remote wilderness areas and the wildlife that inhabit them.
Predicted Genre : documentary
   Actual Genre : documentary

Description : A group of teenagers discover an ancient curse that brings their darkest fears to life.
Predicted Genre : horror
   Actual Genre : horror

Description : A chance encounter on a train leads to an unexpected love story

The model was successfully trained and evaluated.