# CodeClause Internship
#### Project ID - #CC69848
#### Project Title - Movie Genre Prediction
#### Internship Domain - Data Science Intern
#### Project Level - Intermediate Level
#### Aim -
> Predict the genre of a movie based on its plot summary and other features.
#### Description - 
> Use natural language processing (NLP) techniques for text classification on a movie dataset.
#### Technologies - 
> Python, NLTK or SpaCy, Scikit-learn.

## 1. Import Necessary Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import nltk
from nltk.corpus import stopwords
import re

# Download the NLTK stop-word list (uncomment on the first run)
# nltk.download('stopwords')


pd.set_option('display.max_rows', 150)

## 2. Loading and Exploring the Dataset

In [2]:
df = pd.read_csv('~/anaconda3/.kaggle/mpst-movie/mpst_full_data.csv')
df.head()

| | imdb_id | title | plot_synopsis | tags | split | synopsis_source |
|---|---|---|---|---|---|---|
| 0 | tt0057603 | I tre volti della paura | Note: this synopsis is for the orginal Italian... | cult, horror, gothic, murder, atmospheric | train | imdb |
| 1 | tt1733125 | Dungeons & Dragons: The Book of Vile Darkness | Two thousand years ago, Nhagruul the Foul, a s... | violence | train | imdb |
| 2 | tt0033045 | The Shop Around the Corner | Matuschek's, a gift store in Budapest, is the ... | romantic | test | imdb |
| 3 | tt0113862 | Mr. Holland's Opus | Glenn Holland, not a morning person by anyone'... | inspiring, romantic, stupid, feel-good | train | imdb |
| 4 | tt0086250 | Scarface | In May 1980, a Cuban man named Tony Montana (A... | cruelty, murder, dramatic, cult, violence, atm... | val | imdb |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14828 entries, 0 to 14827
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   imdb_id          14828 non-null  object
 1   title            14828 non-null  object
 2   plot_synopsis    14828 non-null  object
 3   tags             14828 non-null  object
 4   split            14828 non-null  object
 5   synopsis_source  14828 non-null  object
dtypes: object(6)
memory usage: 695.2+ KB


In [4]:
df.describe()  # parentheses are required; a bare df.describe only echoes the bound method

In [5]:
df.shape

(14828, 6)

In [6]:
df.columns

Index(['imdb_id', 'title', 'plot_synopsis', 'tags', 'split',
       'synopsis_source'],
      dtype='object')

## 3. Data Preprocessing

### 3.1 Handle Missing Values

In [7]:
# check for missing values

df.isnull().sum()

imdb_id            0
title              0
plot_synopsis      0
tags               0
split              0
synopsis_source    0
dtype: int64

### 3.2 Preprocess the Text Data

In [None]:
# Build the stop-word set once: calling stopwords.words('english') for every
# word reloads the list each time and makes the cleaning step very slow
STOP_WORDS = set(stopwords.words('english'))

# Define a function to clean the plot_synopsis
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # Drop stop words
    return text

# Apply the function to clean the 'plot_synopsis' column
df['cleaned_synopsis'] = df['plot_synopsis'].apply(clean_text)
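
To see what the cleaning step produces, the sketch below applies the same transformations to the sample plot used later in this notebook. It assumes a small hard-coded stop-word set (`DEMO_STOP_WORDS`, a hypothetical stand-in for NLTK's much longer English list) so that it runs without downloading any corpus.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"a", "an", "and", "the", "to", "her", "from"}

def clean_text_demo(text):
    text = text.lower()                  # lowercase
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace
    text = re.sub(r'[^\w\s]', '', text)  # strip punctuation
    return ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = ("A young girl discovers her magical powers and fights to save "
          "her kingdom from an evil sorcerer.")
cleaned = clean_text_demo(sample)
print(cleaned)
# young girl discovers magical powers fights save kingdom evil sorcerer
```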


## 4. Feature Extraction using TF-IDF

In [None]:
# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Transform the text data
X = tfidf.fit_transform(df['cleaned_synopsis'])

# Target variable
y = df['tags']
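
Each entry in `tags` is a comma-separated list, so using the raw string as the target treats every distinct tag combination as its own class. A minimal sketch of one alternative, assuming a multi-label formulation is acceptable for this project: split each string into a list and encode it with scikit-learn's `MultiLabelBinarizer`. The strings in `sample_tags` are the untruncated `tags` values from the `df.head()` output; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Untruncated tag strings from the first rows of the dataset
sample_tags = [
    "cult, horror, gothic, murder, atmospheric",
    "violence",
    "romantic",
    "inspiring, romantic, stupid, feel-good",
]

# Split each comma-separated string into a list of individual tags
tag_lists = [[t.strip() for t in s.split(",")] for s in sample_tags]

# One binary column per distinct tag, one row per movie
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)

print(list(mlb.classes_))
print(Y.shape)  # (4, 10): 4 movies, 10 distinct tags
```

A binary matrix like `Y` can then be passed to a multi-label wrapper such as `OneVsRestClassifier`, which trains one binary classifier per tag.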


## 5. Split the Dataset

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
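
The dataset already carries a `split` column (`train`, `val`, `test`), and fitting the vectorizer on all rows before splitting lets statistics from the test rows leak into the features. A hedged sketch of an alternative ordering follows: select rows by `split` first, then fit TF-IDF on the training rows only. The rows in `toy` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy rows that mimic the notebook's column names
toy = pd.DataFrame({
    "cleaned_synopsis": [
        "detective investigates murders small town",
        "astronaut dangerous mission save humanity",
        "friends navigate challenges high school",
        "family moves haunted house paranormal events",
    ],
    "split": ["train", "train", "val", "test"],
})

train_df = toy[toy["split"] == "train"]
test_df = toy[toy["split"] == "test"]

# Fit the vocabulary and IDF weights on training text only,
# then reuse them unchanged for the test text
vec = TfidfVectorizer(max_features=5000)
X_train_toy = vec.fit_transform(train_df["cleaned_synopsis"])
X_test_toy = vec.transform(test_df["cleaned_synopsis"])

print(X_train_toy.shape, X_test_toy.shape)
```

Words that appear only in the test rows are ignored by `transform`, which mirrors what happens when the model meets unseen text.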


## 6. Train and Evaluate Models

### Multinomial Naive Bayes

In [None]:
# Initialize and train the Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict and evaluate the model
nb_predictions = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_f1 = f1_score(y_test, nb_predictions, average='weighted')

print("Multinomial Naive Bayes Accuracy:", nb_accuracy)
print("Multinomial Naive Bayes F1 Score:", nb_f1)
print(classification_report(y_test, nb_predictions))
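
The `average='weighted'` option averages the per-class F1 scores using each class's number of true samples as the weight. The sketch below checks this by hand on a few invented labels; `zero_division=0` is passed so that a class that is never predicted scores 0 without raising a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented labels for illustration only
y_true_toy = ["horror", "horror", "romantic", "violence"]
y_pred_toy = ["horror", "romantic", "romantic", "horror"]
labels = ["horror", "romantic", "violence"]

# Per-class F1, in the order given by `labels`
per_class = f1_score(y_true_toy, y_pred_toy, labels=labels,
                     average=None, zero_division=0)

# Support = number of true samples per class
support = np.array([y_true_toy.count(label) for label in labels])

# Support-weighted mean of the per-class scores
manual_weighted = float((per_class * support).sum() / support.sum())
sk_weighted = f1_score(y_true_toy, y_pred_toy, average='weighted',
                       zero_division=0)

print("accuracy:", accuracy_score(y_true_toy, y_pred_toy))
print("weighted F1 (manual): ", manual_weighted)
print("weighted F1 (sklearn):", sk_weighted)
```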


### Random Forest

In [None]:
# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate the model
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')

print("Random Forest Accuracy:", rf_accuracy)
print("Random Forest F1 Score:", rf_f1)
print(classification_report(y_test, rf_predictions))


## 7. Testing with Sample Input

In [None]:
def predict_genre(plot_summary):
    cleaned_summary = clean_text(plot_summary)
    vectorized_summary = tfidf.transform([cleaned_summary])
    nb_prediction = nb_model.predict(vectorized_summary)
    rf_prediction = rf_model.predict(vectorized_summary)
    
    print(f"Multinomial Naive Bayes Prediction: {nb_prediction[0]}")
    print(f"Random Forest Prediction: {rf_prediction[0]}")

# Test with a sample input
sample_plot = "A young girl discovers her magical powers and fights to save her kingdom from an evil sorcerer."
predict_genre(sample_plot)
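
`predict` returns only the single most likely class. Since a plot can plausibly fit several labels, it may help to rank the classes by probability instead. A minimal sketch, assuming a small toy pipeline trained on invented data: `toy_plots`, `toy_labels`, and `top_k_labels` are hypothetical names and are not part of the notebook above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data for illustration only
toy_plots = [
    "detective investigates mysterious murders small town",
    "family moves haunted house paranormal events",
    "two best friends navigate challenges high school",
    "warrior seeks revenge wronged",
]
toy_labels = ["murder", "horror", "comedy", "revenge"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_plots, toy_labels)

def top_k_labels(model, plot_summary, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = model.predict_proba([plot_summary])[0]
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(model.classes_[i], float(probs[i])) for i in order]

ranked = top_k_labels(
    toy_model, "a family experiences paranormal events in a haunted house", k=2
)
print(ranked)
```

The same helper would work with the fitted `nb_model` or `rf_model` if the summary is first passed through `clean_text` and `tfidf.transform`, since both classifiers expose `predict_proba`.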


### Top 5 Testing Suggestions

You can test the model with the following plots:

- "A detective investigates a series of mysterious murders in a small town."
- "An astronaut embarks on a dangerous mission to save humanity."
- "Two best friends navigate the challenges of high school."
- "A warrior seeks revenge against those who wronged him."
- "A family moves into a haunted house and experiences paranormal events."

This notebook predicts movie genres from plot summaries using Multinomial Naive Bayes and Random Forest classifiers. To improve accuracy, tune the TF-IDF settings (for example `max_features`) and the model hyperparameters, or try other classifiers.


