# Task: MOVIE GENRE CLASSIFICATION
Create a machine learning model that can predict the genre of a
movie based on its plot summary or other textual information. You
can use techniques like TF-IDF or word embeddings with classifiers
such as Naive Bayes, Logistic Regression, or Support Vector
Machines.

## Methodologies

###    1. Data Collection
###    2. Data Cleaning and Preprocessing
###    3. Data Visualization
###    4. Feature Engineering
###    5. Model Selection
###    6. Model Training and Evaluation

## Data Collection

Data was collected from https://www.kaggle.com/code/dhruvtibarewal/movie-genre-classification

## Data Cleaning and Preprocessing

In [65]:
# libraries
import pandas as pd

In [66]:
train_data = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/train_data.txt", delimiter=':::', names = ['Sno', 'Name', 'Genre', 'Description'] ,engine='python')
test_data = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/test_data.txt", delimiter = ':::', names = ['Sno', 'Name', 'Description'], engine='python')
test_data_solution = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/test_data_solution.txt", delimiter=':::', names = ['Sno', 'Name', 'Genre', 'Description'] ,engine='python')
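
The outputs further down suggest that fields keep the spaces around each `:::` (for example a leading space in `Description`). As a minimal sketch, assuming the files really use a ` ::: ` layout, a regex separator can strip that whitespace at load time. The two sample lines below are made up to mirror the file format; they are not read from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same ' ::: '-delimited layout as train_data.txt.
sample = io.StringIO(
    "1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation\n"
    "2 ::: Cupid (1997) ::: thriller ::: A brother and sister with a past\n"
)

# A regex separator (Python engine) consumes the spaces around ':::' while splitting.
df = pd.read_csv(
    sample,
    sep=r"\s*:::\s*",
    names=["Sno", "Name", "Genre", "Description"],
    engine="python",
)

print(df["Genre"].tolist())
```

Clean labels matter later because `' drama '` and `'drama'` would otherwise count as different genres.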

In [67]:
train_data.head()
test_data.head()
test_data_solution.tail()

Unnamed: 0,Sno,Name,Genre,Description
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),adult,A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard..."
54199,54200,Curitiba Zero Grau (2010),drama,"Curitiba is a city in movement, with rhythms ..."


#### Looking for null values

In [68]:
#looking for null values

train_data.info()
print('\n')
test_data.info()
print('\n')
test_data_solution.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54214 entries, 0 to 54213
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54214 non-null  int64 
 1   Name         54214 non-null  object
 2   Genre        54214 non-null  object
 3   Description  54214 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.7+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54200 non-null  int64 
 1   Name         54200 non-null  object
 2   Description  54200 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54200 non-null  int64

In [69]:
train_data.isna().sum()

test_data.isna().sum()

test_data_solution.isna().sum()

Sno            0
Name           0
Genre          0
Description    0
dtype: int64

#### Looking for duplicates

In [70]:
train_data.duplicated().sum()
test_data.duplicated().sum()
test_data_solution.duplicated().sum()

0
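
By default a notebook cell displays only its last bare expression, so the null and duplicate outputs above presumably cover only `test_data_solution`. A small helper can report every frame at once. This is a sketch: `summarize` and the toy frames are illustrative names, not part of the original notebook.

```python
import pandas as pd

def summarize(frames):
    """Return total null cells and duplicate rows for each named DataFrame."""
    return pd.DataFrame(
        {
            name: {
                "nulls": int(df.isna().sum().sum()),
                "duplicates": int(df.duplicated().sum()),
            }
            for name, df in frames.items()
        }
    ).T

# Toy frames standing in for train_data, test_data and test_data_solution.
toy_frames = {
    "clean": pd.DataFrame({"Name": ["a", "b"], "Genre": ["drama", "short"]}),
    "messy": pd.DataFrame({"Name": ["a", "a", None], "Genre": ["drama", "drama", "short"]}),
}
report = summarize(toy_frames)
print(report)
```

In the notebook this could be called as `summarize({"train": train_data, "test": test_data, "solution": test_data_solution})`.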

#### Data Cleaning is done.

## Further preprocessing
#### The dataset as provided is split almost 50% training and 50% testing (54,214 and 54,200 rows).
#### That leaves much less data for training than usual, which is likely to hurt model quality.
#### So we will re-split the dataset 70-15-15 for training, validation and testing respectively.
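
The arithmetic behind the re-split can be checked directly. This sketch uses the row counts reported by `info()` above and the 37,700 rows moved in the next cell; it assumes the enlarged training pool is later divided into training and validation parts.

```python
# Row counts reported by DataFrame.info() above.
train_rows = 54214
test_rows = 54200
moved_rows = 37700  # rows moved from the test set into the training pool

total = train_rows + test_rows
pool_rows = train_rows + moved_rows     # assumed to be split into train + validation later
held_out_rows = test_rows - moved_rows  # kept for final testing

print(f"pool: {pool_rows} rows ({pool_rows / total:.1%})")
print(f"test: {held_out_rows} rows ({held_out_rows / total:.1%})")

# Share of the pool to hold out so that validation is about 15% of all rows.
val_fraction_of_pool = 0.15 / (pool_rows / total)
print(f"validation share of pool: {val_fraction_of_pool:.3f}")
```

The pool comes to roughly 85% of all rows and the held-out test set to roughly 15%, which matches the 70-15-15 target once the pool is divided.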

In [71]:
# Move the first 37,700 rows of test_data_solution into train_data;
# the next cell removes the same rows from test_data and test_data_solution.

last_sno = train_data['Sno'].max()
print(last_sno)

rows_to_append = test_data_solution.head(37700).copy()  # copy so the original DataFrame is not modified
# Offset Sno so numbering continues after the training rows. Sno already starts at 1,
# so "+ 1" skips 54215; use `+= last_sno` instead for contiguous numbering.
rows_to_append.loc[:, 'Sno'] += last_sno + 1

print(rows_to_append)

# DataFrame.append was deprecated and removed in pandas 2.0; pd.concat is the replacement.
train_data = pd.concat([train_data, rows_to_append])

54214
         Sno                                    Name          Genre  \
0      54216                   Edgar's Lunch (1998)       thriller    
1      54217               La guerra de papá (1977)         comedy    
2      54218            Off the Beaten Track (2010)    documentary    
3      54219                 Meu Amigo Hindu (2015)          drama    
4      54220                      Er nu zhai (1955)          drama    
...      ...                                     ...            ...   
37695  91911                    Fully Loaded (2011)         comedy    
37696  91912                    Tenebrae Lux (2014)         sci-fi    
37697  91913                   Mexican Dance (1898)          short    
37698  91914   Das Lied von den zwei Pferden (2009)    documentary    
37699  91915                  Doin' It Again (2012)    documentary    

                                             Description  
0       L.R. Brane loves his life - his car, his apar...  
1       Spain, March 19



In [72]:
test_data = test_data.drop(rows_to_append.index)
test_data_solution = test_data_solution.drop(rows_to_append.index)
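
The move-then-drop pattern used in the last two cells can be seen on a toy example. This is a sketch with made-up frames and values; it also resets the index after each step, which the notebook does separately further down.

```python
import pandas as pd

# Toy stand-ins for train_data and test_data_solution (hypothetical values).
train = pd.DataFrame({"Sno": [1, 2, 3], "Genre": ["drama", "comedy", "short"]})
solution = pd.DataFrame({"Sno": [1, 2, 3, 4], "Genre": ["horror", "drama", "sci-fi", "adult"]})

n_move = 2
moved = solution.head(n_move).copy()
moved["Sno"] += train["Sno"].max()  # continue numbering without a gap

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
train = pd.concat([train, moved], ignore_index=True)
# Drop the moved rows by their original index labels, then renumber the index.
solution = solution.drop(moved.index).reset_index(drop=True)

print(train["Sno"].tolist())
print(solution["Genre"].tolist())
```

Dropping by `moved.index` works only because the index labels still match the source frame, which is also why the notebook drops from `test_data` before resetting any index.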

In [73]:
train_data

Unnamed: 0,Sno,Name,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...
...,...,...,...,...
37695,91911,Fully Loaded (2011),comedy,"On a rare evening out, two feisty single moms..."
37696,91912,Tenebrae Lux (2014),sci-fi,A lone traveler with the ability to cross bet...
37697,91913,Mexican Dance (1898),short,"""Another well-known dancer with a national re..."
37698,91914,Das Lied von den zwei Pferden (2009),documentary,"A promise, an old, destroyed horse head violi..."


In [74]:
test_data

Unnamed: 0,Sno,Name,Description
37700,37701,My Lips Betray (1933),"In a make-believe, mittleuropean kingdom, a v..."
37701,37702,The Koreas (2016),"At the end of World War II, Korea was divided..."
37702,37703,Come Together (2016),Colombia is coming out of a period in their h...
37703,37704,With Honors Denied (2003),Japanese bombs hit Pearl Harbor on a Sunday. ...
37704,37705,"""Connect with English"" (2007)",Connect with English is a series that brings ...
...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard..."


In [75]:
test_data_solution

Unnamed: 0,Sno,Name,Genre,Description
37700,37701,My Lips Betray (1933),musical,"In a make-believe, mittleuropean kingdom, a v..."
37701,37702,The Koreas (2016),documentary,"At the end of World War II, Korea was divided..."
37702,37703,Come Together (2016),documentary,Colombia is coming out of a period in their h...
37703,37704,With Honors Denied (2003),short,Japanese bombs hit Pearl Harbor on a Sunday. ...
37704,37705,"""Connect with English"" (2007)",drama,Connect with English is a series that brings ...
...,...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),adult,A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard..."


In [78]:
train_data.iloc[54201]

Sno                                                        54202
Name                                        Singing Guns (1950) 
Genre                                                   western 
Description     Rhiannon, an outlaw who regularly robs gold f...
Name: 54201, dtype: object

In [79]:
test_data_solution.reset_index(drop=True, inplace=True)
test_data_solution.index += 1

In [83]:
test_data_solution.columns

Index(['Sno', 'Name', 'Genre', 'Description'], dtype='object')

In [81]:
test_data.reset_index(drop=True, inplace=True)
test_data.index += 1

In [84]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [85]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\susha\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\susha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [86]:
# Build the stopword set and stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


# Define preprocessing function
def preprocess_text(text):
    """Strip punctuation, lowercase, tokenize, drop stopwords and stem one description."""
    # Remove special characters, punctuation, and symbols
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase and tokenize
    tokens = word_tokenize(text.lower())
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]
    # Stem each remaining token
    stemmed_tokens = [STEMMER.stem(word) for word in filtered_tokens]
    # Join tokens back into a single string
    return ' '.join(stemmed_tokens)


# Apply preprocessing to 'Description' column
train_data['Description'] = train_data['Description'].apply(preprocess_text)
test_data['Description'] = test_data['Description'].apply(preprocess_text)
test_data_solution['Description'] = test_data_solution['Description'].apply(preprocess_text)
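
To see what the first stages of this cleaning do to a description, here is a dependency-free sketch. It is a simplification, not the notebook's function: it splits on whitespace instead of calling `word_tokenize`, uses a tiny hand-written stopword set instead of NLTK's English list, and skips stemming. `clean_text` and `STOP_WORDS_DEMO` are illustrative names.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list instead.
STOP_WORDS_DEMO = {"a", "an", "and", "but", "the", "of", "in", "to", "is"}

def clean_text(text):
    """Strip punctuation, lowercase, split on whitespace, and drop stopwords."""
    # Same pattern as the notebook: keep only word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS_DEMO]

tokens = clean_text("Popular, but mysterious rock D.J Mike Mallard!")
print(tokens)
# ['popular', 'mysterious', 'rock', 'dj', 'mike', 'mallard']
```

Because punctuation is deleted rather than replaced with a space, `D.J` collapses into the single token `dj`.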
