# MOVIE GENRE CLASSIFICATION

Create a machine learning model that can predict the genre of a
movie based on its plot summary or other textual information. You
can use techniques like TF-IDF or word embeddings with classifiers
such as Naive Bayes, Logistic Regression, or Support Vector
Machines.
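
As a rough illustration of the pairing described above, the sketch below feeds TF-IDF features into each of the three classifier families named in the task. The toy plot summaries, their labels, and the `candidates` dictionary are invented for the example and are not part of the dataset. `LinearSVC` is used here as one common linear form of a Support Vector Machine; other kernels are equally valid choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy plot summaries and labels (illustrative only, not taken from the dataset)
toy_plots = [
    'a detective investigates a murder in a small town',
    'a killer stalks a detective through the city',
    'a murder trial exposes a dangerous conspiracy',
    'two friends plan a chaotic wedding party',
    'a clumsy waiter ruins a wedding dinner',
    'a family road trip turns into a silly party',
]
toy_genres = ['thriller', 'thriller', 'thriller', 'comedy', 'comedy', 'comedy']

# Hypothetical set of candidate classifiers to compare
candidates = {
    'naive_bayes': MultinomialNB(),
    'logistic_regression': LogisticRegression(max_iter=1000),
    'linear_svm': LinearSVC(),
}

toy_predictions = {}
for name, classifier in candidates.items():
    # Each pipeline learns its own TF-IDF vocabulary, then fits the classifier
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(toy_plots, toy_genres)
    toy_predictions[name] = model.predict(['a detective investigates a murder'])[0]

print(toy_predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary tied to the data each model was fitted on, which makes it easier to swap classifiers without touching the feature step.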

In [21]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report


In [22]:
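# Load the labeled training data. A multi-character separator such as ':::'
# is interpreted as a regular expression, which requires engine='python'.
train_df = pd.read_csv('train_data.txt', sep=':::', engine='python', names=['ID', 'TITLE', 'GENRE', 'DESCRIPTION'])

# Load the test data, which is read without a GENRE column
test_df = pd.read_csv('test_data.txt', sep=':::', engine='python', names=['ID', 'TITLE', 'DESCRIPTION'])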
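
The predicted genres printed at the end of this notebook carry a trailing space (`drama `), which suggests that the fields in the files are written with spaces around the `:::` separator. That is an inference from the output, not something the notebook states. Under that assumption, the sketch below shows on a small in-memory sample (invented rows, not real data) how a separator pattern that also consumes the surrounding whitespace yields clean labels.

```python
import io

import pandas as pd

# Two made-up rows in the assumed ' ::: ' layout (not real dataset content)
sample = (
    "1 ::: Movie A (1999) ::: drama ::: A quiet family story.\n"
    "2 ::: Movie B (2005) ::: comedy ::: Two friends open a bakery.\n"
)
columns = ['ID', 'TITLE', 'GENRE', 'DESCRIPTION']

# Plain ':::' keeps the spaces around every field
raw_df = pd.read_csv(io.StringIO(sample), sep=':::', engine='python', names=columns)

# r'\s*:::\s*' treats the spaces as part of the separator
clean_df = pd.read_csv(io.StringIO(sample), sep=r'\s*:::\s*', engine='python', names=columns)

print(repr(raw_df['GENRE'].iloc[0]))
print(repr(clean_df['GENRE'].iloc[0]))
```

If changing the separator is not desirable, calling `.str.strip()` on the text columns after loading is another way to reach the same result.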


In [23]:
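# Separate the features (plot descriptions) from the labels (genres).
# No split is performed here: train and test data come from separate files.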
X_train, y_train = train_df['DESCRIPTION'], train_df['GENRE']
X_test = test_df['DESCRIPTION']

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Adjust max_features as needed
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
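
To make the vectorization step more concrete, here is a minimal sketch on three invented sentences. It illustrates two points: `max_features` caps the number of columns in the resulting matrix, and `transform` reuses the vocabulary learned by `fit_transform`, so words that were never seen during fitting contribute nothing. The names `toy_docs` and `toy_vectorizer` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences standing in for plot descriptions
toy_docs = [
    'a detective hunts a killer',
    'a family comedy about a wedding',
    'the detective attends the wedding',
]

# Keep only the 4 most frequent terms across the toy corpus
toy_vectorizer = TfidfVectorizer(max_features=4)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per retained term
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# Text made only of unseen words maps to an all-zero row
unseen_row = toy_vectorizer.transform(['an unseen spaceship'])
print(unseen_row.nnz)
```

Fitting the vectorizer on the training text only, as the cell above does, keeps information from the test descriptions out of the learned vocabulary and IDF weights.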


In [20]:
# Train Naive Bayes Classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)

# Make Predictions
predictions = naive_bayes_classifier.predict(X_test_tfidf)

# If you want to see the predictions along with titles and IDs
predictions_df = pd.DataFrame({'ID': test_df['ID'], 'TITLE': test_df['TITLE'], 'PREDICTED_GENRE': predictions})

# Display Predictions (optional)
print(predictions_df)

          ID                             TITLE PREDICTED_GENRE
0          1             Edgar's Lunch (1998)           drama 
1          2         La guerra de papá (1977)           drama 
2          3      Off the Beaten Track (2010)     documentary 
3          4           Meu Amigo Hindu (2015)           drama 
4          5                Er nu zhai (1955)           drama 
...      ...                               ...             ...
54195  54196   "Tales of Light & Dark" (2013)           drama 
54196  54197      Der letzte Mohikaner (1965)           drama 
54197  54198              Oliver Twink (2007)          comedy 
54198  54199                Slipstream (1973)           drama 
54199  54200        Curitiba Zero Grau (2010)     documentary 

[54200 rows x 3 columns]
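
Because the test file is loaded without a `GENRE` column, the predictions above cannot be scored directly. One common option is to hold out part of the labeled training data for validation, which is also what the otherwise unused imports `train_test_split`, `accuracy_score`, and `classification_report` are suited for. The sketch below shows the idea on a small invented dataset; `toy_descriptions` and `toy_labels` are stand-ins for `train_df['DESCRIPTION']` and `train_df['GENRE']`, and the split ratio is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented stand-ins for train_df['DESCRIPTION'] and train_df['GENRE']
toy_descriptions = [
    'a detective investigates a murder',
    'a killer hides in the city',
    'a spy uncovers a deadly conspiracy',
    'a hostage crisis grips the bank',
    'two friends plan a chaotic wedding',
    'a clumsy waiter ruins a dinner party',
    'a family road trip goes hilariously wrong',
    'roommates compete in a silly contest',
]
toy_labels = ['thriller'] * 4 + ['comedy'] * 4

# Hold out 25% of the labeled rows, keeping the genre proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_descriptions, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

# Fit TF-IDF on the fitting portion only, then reuse it for validation
val_vectorizer = TfidfVectorizer(max_features=5000)
X_fit_tfidf = val_vectorizer.fit_transform(X_fit)
X_val_tfidf = val_vectorizer.transform(X_val)

val_classifier = MultinomialNB()
val_classifier.fit(X_fit_tfidf, y_fit)
val_predictions = val_classifier.predict(X_val_tfidf)

val_accuracy = accuracy_score(y_val, val_predictions)
val_report = classification_report(y_val, val_predictions, zero_division=0)
print(val_accuracy)
print(val_report)
```

On the real data, the per-genre rows of the classification report would show whether the model favors the most frequent genres, which overall accuracy alone can hide.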
