Task 1 - MOVIE GENRE CLASSIFICATION


Create a machine learning model that can predict the genre of a movie
based on its plot summary or other textual information.
You can use techniques like TF-IDF or word embeddings with classifiers such as
Naive Bayes, Logistic Regression, or Support Vector Machines.

In [23]:
import os
os.listdir()

['.config',
 'description.txt',
 'train_data.txt',
 'test_data.txt',
 'test_data_solution.txt',
 '.ipynb_checkpoints',
 'sample_data']

1. Read or Extract data

In [24]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
data=pd.read_csv("description.txt")
data


Unnamed: 0,Train data:
0,ID ::: TITLE ::: GENRE ::: DESCRIPTION
1,ID ::: TITLE ::: GENRE ::: DESCRIPTION
2,ID ::: TITLE ::: GENRE ::: DESCRIPTION
3,ID ::: TITLE ::: GENRE ::: DESCRIPTION
4,Test data:
5,ID ::: TITLE ::: DESCRIPTION
6,ID ::: TITLE ::: DESCRIPTION
7,ID ::: TITLE ::: DESCRIPTION
8,ID ::: TITLE ::: DESCRIPTION
9,Source:


2. Create and apply a function to read data by splitting on ' ::: '

In [25]:
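def load_data(file_path):
    """Read a ' ::: '-delimited text file into a list of field lists."""
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Each line holds one record: strip the newline, then split into fields.
    return [line.strip().split(' ::: ') for line in lines]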
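
One edge case worth guarding against: if a description itself contains ' ::: ', a plain split returns more fields than there are columns, and building the DataFrame would fail. A minimal sketch of a stricter parser follows. `parse_line` and its `n_fields` parameter are hypothetical names, not part of the notebook, and the sketch assumes only the last field (the description) may contain the delimiter.

```python
def parse_line(line, n_fields, sep=' ::: '):
    """Split one record into exactly n_fields parts (hypothetical helper)."""
    # maxsplit = n_fields - 1 keeps any later separators inside the last field.
    parts = line.strip().split(sep, n_fields - 1)
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}: {line!r}")
    return parts

# Made-up record whose description contains the delimiter.
row = parse_line("7 ::: Some Title (2001) ::: drama ::: Part one ::: part two\n", 4)
print(row)
```

Raising on a malformed line makes bad records visible at load time rather than later, when the column count no longer matches.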

In [26]:
train_data = load_data("train_data.txt")
train_df = pd.DataFrame(train_data, columns=['ID', 'Title', 'Genre', 'Description'])
test_data = load_data("test_data.txt")
test_df = pd.DataFrame(test_data, columns=['ID', 'Title', 'Description'])
test_solution = load_data('test_data_solution.txt')
test_solution_df = pd.DataFrame(test_solution, columns=['ID', 'Title', 'Genre', 'Description'])


In [27]:
print("Train Data: ")
train_df

Train Data: 


Unnamed: 0,ID,Title,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...
...,...,...,...,...
54209,54210,"""Bonino"" (1953)",comedy,This short-lived NBC live sitcom centered on B...
54210,54211,Dead Girls Don't Cry (????),horror,The NEXT Generation of EXPLOITATION. The siste...
54211,54212,Ronald Goedemondt: Ze bestaan echt (2008),documentary,"Ze bestaan echt, is a stand-up comedy about gr..."
54212,54213,Make Your Own Bed (1944),comedy,Walter and Vivian live in the country and have...


In [28]:
print("\nTest Data: ")
test_df


Test Data: 


Unnamed: 0,ID,Title,Description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),Before he was known internationally as a marti...
...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard ..."


In [29]:
print("\nTest Solution:")
test_solution_df


Test Solution:


Unnamed: 0,ID,Title,Genre,Description
0,1,Edgar's Lunch (1998),thriller,"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),comedy,"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),documentary,One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),drama,"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),drama,Before he was known internationally as a marti...
...,...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),adult,"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard ..."


3. Feature extraction: TF-IDF

In [30]:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(train_df["Description"])
X_test_tfidf = vectorizer.transform(test_df["Description"])
print(f"Training data shape: {X_train_tfidf.shape}")
print(f"Test data shape: {X_test_tfidf.shape}")

Training data shape: (54214, 10000)
Test data shape: (54200, 10000)
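
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three made-up plot snippets. It follows the smoothed formula documented for scikit-learn's defaults, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, but it tokenizes by whitespace only, which is a simplification of what `TfidfVectorizer` does. The function name `tfidf` and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

docs = ["detective hunts killer", "wedding comedy", "detective comedy wedding"]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first snippet, "hunts" and "killer" occur in only one document, so they receive a larger weight than "detective", which occurs in two.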


4. Encoding the Target labels

In [31]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_df['Genre'])
print(f"Unique genres in the training data: {label_encoder.classes_}")

Unique genres in the training data: ['action' 'adult' 'adventure' 'animation' 'biography' 'comedy' 'crime'
 'documentary' 'drama' 'family' 'fantasy' 'game-show' 'history' 'horror'
 'music' 'musical' 'mystery' 'news' 'reality-tv' 'romance' 'sci-fi'
 'short' 'sport' 'talk-show' 'thriller' 'war' 'western']
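
A short sketch of what the encoder does, using a made-up list of four labels: `LabelEncoder` sorts the distinct labels, maps each to its position in that sorted order, and `inverse_transform` maps the integers back to strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration.
genres = ["drama", "comedy", "drama", "horror"]

enc = LabelEncoder()
codes = enc.fit_transform(genres)          # integer code per label
decoded = enc.inverse_transform(codes)     # back to the original strings

print(list(enc.classes_))
print(list(codes))
print(list(decoded))
```

The same fitted encoder must be used for decoding predictions, since the integer codes only have meaning relative to its `classes_` ordering.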


5. Model Building - Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train_tfidf, y_train)

y_pred = lr_model.predict(X_test_tfidf)
predicted_genres = label_encoder.inverse_transform(y_pred)

test_df['Predicted_Genre'] = predicted_genres
test_df[['Title', 'Predicted_Genre']]


Unnamed: 0,Title,Predicted_Genre
0,Edgar's Lunch (1998),drama
1,La guerra de papá (1977),drama
2,Off the Beaten Track (2010),documentary
3,Meu Amigo Hindu (2015),drama
4,Er nu zhai (1955),drama
...,...,...
54195,"""Tales of Light & Dark"" (2013)",drama
54196,Der letzte Mohikaner (1965),drama
54197,Oliver Twink (2007),comedy
54198,Slipstream (1973),drama


In [33]:
test_df['Predicted_Genre'] = predicted_genres
merged_df = pd.merge(test_solution_df[['ID', 'Genre']], test_df[['ID', 'Predicted_Genre']], on='ID')
merged_df

Unnamed: 0,ID,Genre,Predicted_Genre
0,1,thriller,drama
1,2,comedy,drama
2,3,documentary,documentary
3,4,drama,drama
4,5,drama,drama
...,...,...,...
54195,54196,horror,drama
54196,54197,western,drama
54197,54198,adult,comedy
54198,54199,drama,drama
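
Because the merge joins on ID, duplicated or missing IDs on either side would silently add or drop rows. A small sketch with made-up rows shows one way to check this, assuming IDs are meant to be unique in both frames: pandas' `validate` argument raises an error if the join is not one-to-one.

```python
import pandas as pd

# Toy frames, invented for illustration; IDs are strings as in the loaded data.
truth = pd.DataFrame({'ID': ['1', '2', '3'],
                      'Genre': ['thriller', 'comedy', 'drama']})
preds = pd.DataFrame({'ID': ['1', '2', '3'],
                      'Predicted_Genre': ['drama', 'drama', 'drama']})

# validate='one_to_one' raises pandas.errors.MergeError on duplicate keys.
merged = pd.merge(truth, preds, on='ID', how='inner', validate='one_to_one')

# An inner join drops unmatched IDs, so compare row counts as a sanity check.
print(len(merged) == len(truth))
```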


6. Model Evaluation - Logistic Regression

In [34]:
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(merged_df['Genre'], merged_df['Predicted_Genre'])
print("\nClassification Report: ")
print(classification_report(merged_df['Genre'], merged_df['Predicted_Genre']))


Classification Report: 
              precision    recall  f1-score   support

      action       0.51      0.29      0.37      1314
       adult       0.65      0.24      0.35       590
   adventure       0.67      0.16      0.26       775
   animation       0.56      0.04      0.08       498
   biography       0.00      0.00      0.00       264
      comedy       0.54      0.60      0.57      7446
       crime       0.41      0.03      0.06       505
 documentary       0.68      0.87      0.76     13096
       drama       0.55      0.79      0.65     13612
      family       0.48      0.08      0.14       783
     fantasy       0.61      0.03      0.06       322
   game-show       0.90      0.49      0.64       193
     history       0.00      0.00      0.00       243
      horror       0.66      0.57      0.61      2204
       music       0.68      0.46      0.55       731
     musical       0.45      0.02      0.03       276
     mystery       0.25      0.00      0.01       318
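
The report shows strong recall for the large classes (documentary, drama) and near-zero recall for several small ones, so a single accuracy figure can hide weak performance on rare genres. The toy sketch below, with invented labels, compares accuracy with macro-averaged F1 (every class counts equally) and weighted F1 (classes count in proportion to their support).

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented, imbalanced toy labels: 8 drama, 2 horror.
y_true = ['drama'] * 8 + ['horror'] * 2
# A classifier that always predicts the majority class.
y_pred = ['drama'] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class with no predictions as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Here the always-drama classifier reaches 80% accuracy, while macro F1 drops well below that because the horror class is never predicted.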
  

7. Model Building - Naive Bayes

In [35]:
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

In [36]:
y_pred_nb = nb_model.predict(X_test_tfidf)
predicted_genres_nb = label_encoder.inverse_transform(y_pred_nb)
test_df['Predicted_Genre_NB'] = predicted_genres_nb
merged_df_nb = pd.merge(test_solution_df, test_df[['ID', 'Predicted_Genre_NB']], on='ID')

8. Model Evaluation - Naive Bayes

In [37]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_nb = accuracy_score(merged_df_nb['Genre'], merged_df_nb['Predicted_Genre_NB'])
print(f"Naive Bayes Accuracy: {accuracy_nb}")

print("Naive Bayes Classification Report: ")
print(classification_report(merged_df_nb['Genre'], merged_df_nb['Predicted_Genre_NB'], target_names=label_encoder.classes_))

Naive Bayes Accuracy: 0.5092066420664206
Naive Bayes Classification Report: 
              precision    recall  f1-score   support

      action       0.57      0.03      0.06      1314
       adult       0.46      0.02      0.04       590
   adventure       0.77      0.04      0.08       775
   animation       0.00      0.00      0.00       498
   biography       0.00      0.00      0.00       264
      comedy       0.53      0.40      0.46      7446
       crime       0.00      0.00      0.00       505
 documentary       0.56      0.89      0.69     13096
       drama       0.44      0.84      0.58     13612
      family       0.00      0.00      0.00       783
     fantasy       0.00      0.00      0.00       322
   game-show       1.00      0.02      0.04       193
     history       0.00      0.00      0.00       243
      horror       0.77      0.23      0.35      2204
       music       0.89      0.02      0.04       731
     musical       0.00      0.00      0.00       276
    

Test Case

In [38]:
# Assuming the models (lr_model, nb_model) have already been trained

zoner_Description = [
    'Explosive fight scenes in the city streets', #Action
    'A haunted mansion that traps its visitors',  #Horror
    'A brave adventurer in search of lost treasure',  #Adventure
    'A forbidden romance in the 1920s',  #Romance
    'A daring rescue mission with a love interest' #Action
]

# Step 1: vectorize the new test data using the same fitted vectorizer
test_data_tfidf = vectorizer.transform(zoner_Description)

# Step 2: predict the genres using each model
y_pred_lr = lr_model.predict(test_data_tfidf)
prd_genres_lr = label_encoder.inverse_transform(y_pred_lr)

y_pred_nb = nb_model.predict(test_data_tfidf)
prd_genres_nb = label_encoder.inverse_transform(y_pred_nb)

# Step 3: output the predicted genres
print("Predicted Genres using Logistic Regression :", prd_genres_lr)
print("Predicted Genres using Naive Bayes :", prd_genres_nb)
print()
for i, message in enumerate(zoner_Description):
    print(f"Description: {message}")
    print(f"Status :\tNaive Bayes: {prd_genres_nb[i]}")
    print(f"\tLogistic Regression Prediction: {prd_genres_lr[i]}")
    print("="*100)

Predicted Genres using Logistic Regression : ['documentary' 'horror' 'adventure' 'drama' 'comedy']
Predicted Genres using Naive Bayes : ['documentary' 'horror' 'documentary' 'drama' 'drama']

Description: Explosive fight scenes in the city streets
Status :	Naive Bayes: documentary
	Logistic Regression Prediction: documentary
Description: A haunted mansion that traps its visitors
Status :	Naive Bayes: horror
	Logistic Regression Prediction: horror
Description: A brave adventurer in search of lost treasure
Status :	Naive Bayes: documentary
	Logistic Regression Prediction: adventure
Description: A forbidden romance in the 1920s
Status :	Naive Bayes: drama
	Logistic Regression Prediction: drama
Description: A daring rescue mission with a love interest
Status :	Naive Bayes: drama
	Logistic Regression Prediction: comedy
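
The test case above repeats the same two steps for each model: transform the text with the fitted vectorizer, then predict. A possible way to bundle those steps is scikit-learn's `Pipeline`, sketched below on four invented training snippets. `LinearSVC` is swapped in here only because the task statement lists Support Vector Machines as an option; the notebook itself does not train one, and the toy data is far too small to say anything about real accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy training data, for illustration only.
train_texts = [
    "a haunted house full of ghosts",
    "a ghost terrorizes a haunted village",
    "a couple falls in love before their wedding",
    "a wedding planner finds love",
]
train_labels = ["horror", "horror", "romance", "romance"]

# The pipeline fits the vectorizer and the classifier together and
# reapplies the same vocabulary automatically at prediction time.
svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
svm_pipeline.fit(train_texts, train_labels)

new_texts = ["a haunted mansion that traps its visitors",
             "a forbidden love in the 1920s"]
svm_predictions = svm_pipeline.predict(new_texts)
for text, genre in zip(new_texts, svm_predictions):
    print(f"{text} -> {genre}")
```

String labels can be passed to the classifier directly in this form, so no separate label encoding or decoding step is needed.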
