# Task: MOVIE GENRE CLASSIFICATION
Create a machine learning model that can predict the genre of a
movie based on its plot summary or other textual information. You
can use techniques like TF-IDF or word embeddings with classifiers
such as Naive Bayes, Logistic Regression, or Support Vector
Machines.

## Methodologies

###    1. Data Collection
###    2. Data Cleaning and Preprocessing
###    3. Data Visualization
###    4. Feature Engineering
###    5. Model Selection
###    6. Model Training and Evaluation

## Data Collection

Data was collected from https://www.kaggle.com/code/dhruvtibarewal/movie-genre-classification

## Data Cleaning and Preprocessing

In [1]:
import pandas as pd

In [2]:
train_data = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/train_data.txt", delimiter=':::', names = ['Sno', 'Name', 'Genre', 'Description'] ,engine='python')
test_data = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/test_data.txt", delimiter = ':::', names = ['Sno', 'Name', 'Description'], engine='python')
test_data_solution = pd.read_csv("C:/Users/susha/Downloads/archive (7)/Genre Classification Dataset/test_data_solution.txt", delimiter=':::', names = ['Sno', 'Name', 'Genre', 'Description'] ,engine='python')

In [3]:
train_data.head()
test_data.head()
test_data_solution.tail()

Unnamed: 0,Sno,Name,Genre,Description
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),adult,A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard..."
54199,54200,Curitiba Zero Grau (2010),drama,"Curitiba is a city in movement, with rhythms ..."


#### Looking for null values

In [4]:
#looking for null values

train_data.info()
print('\n')
test_data.info()
print('\n')
test_data_solution.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54214 entries, 0 to 54213
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54214 non-null  int64 
 1   Name         54214 non-null  object
 2   Genre        54214 non-null  object
 3   Description  54214 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.7+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54200 non-null  int64 
 1   Name         54200 non-null  object
 2   Description  54200 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Sno          54200 non-null  int64

In [5]:
train_data.isna().sum()

test_data.isna().sum()

test_data_solution.isna().sum()

Sno            0
Name           0
Genre          0
Description    0
dtype: int64

#### Looking for duplicates

In [6]:
train_data.duplicated().sum()
test_data.duplicated().sum()
test_data_solution.duplicated().sum()

0

# Data Cleaning is done.

## Further preprocessing
#### The dataset as provided is split almost 50/50: 54,214 training rows and 54,200 test rows.
#### Holding back half of the data for testing leaves less to train on than necessary and is likely to hurt model quality.
#### So we re-split the data roughly 70-15-15 into training, validation, and test sets respectively.
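
As a quick check of what this re-split produces, the arithmetic can be worked through from the row counts reported by `info()` above. This sketch assumes the 37,700-row move and the 20% validation split used in the cells that follow, and it assumes the validation size is rounded up, as scikit-learn's `train_test_split` does for a fractional `test_size`.

```python
import math

n_train, n_test = 54214, 54200   # row counts reported by info() above
n_moved = 37700                  # rows moved from the test set into training
val_fraction = 0.2               # validation share taken from the training pool

total = n_train + n_test
pool = n_train + n_moved                      # enlarged training pool
test_rows = n_test - n_moved
val_rows = math.ceil(pool * val_fraction)     # assumed rounding: ceiling
train_rows = pool - val_rows

for name, rows in [("train", train_rows), ("validation", val_rows), ("test", test_rows)]:
    print(f"{name:<10} {rows:>6}  {rows / total:.1%}")
```

Under these assumptions the shares come out close to 68-17-15, so the 70-15-15 target is approximate rather than exact.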

In [7]:
# Move the first 37,700 rows of test_data_solution into the training set.
# The same rows are dropped from test_data and test_data_solution in the next cell.

last_sno = train_data['Sno'].max()
print(last_sno)

rows_to_append = test_data_solution.head(37700).copy()  # copy so test_data_solution itself is not modified
rows_to_append.loc[:, 'Sno'] += last_sno + 1  # shift Sno past the existing training rows

print(rows_to_append)


# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement
train_data = pd.concat([train_data, rows_to_append])

54214
         Sno                                    Name          Genre  \
0      54216                   Edgar's Lunch (1998)       thriller    
1      54217               La guerra de papá (1977)         comedy    
2      54218            Off the Beaten Track (2010)    documentary    
3      54219                 Meu Amigo Hindu (2015)          drama    
4      54220                      Er nu zhai (1955)          drama    
...      ...                                     ...            ...   
37695  91911                    Fully Loaded (2011)         comedy    
37696  91912                    Tenebrae Lux (2014)         sci-fi    
37697  91913                   Mexican Dance (1898)          short    
37698  91914   Das Lied von den zwei Pferden (2009)    documentary    
37699  91915                  Doin' It Again (2012)    documentary    

                                             Description  
0       L.R. Brane loves his life - his car, his apar...  
1       Spain, March 19



In [8]:
test_data = test_data.drop(rows_to_append.index)
test_data_solution = test_data_solution.drop(rows_to_append.index)

In [9]:
train_data

Unnamed: 0,Sno,Name,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...
...,...,...,...,...
37695,91911,Fully Loaded (2011),comedy,"On a rare evening out, two feisty single moms..."
37696,91912,Tenebrae Lux (2014),sci-fi,A lone traveler with the ability to cross bet...
37697,91913,Mexican Dance (1898),short,"""Another well-known dancer with a national re..."
37698,91914,Das Lied von den zwei Pferden (2009),documentary,"A promise, an old, destroyed horse head violi..."


In [10]:
test_data

Unnamed: 0,Sno,Name,Description
37700,37701,My Lips Betray (1933),"In a make-believe, mittleuropean kingdom, a v..."
37701,37702,The Koreas (2016),"At the end of World War II, Korea was divided..."
37702,37703,Come Together (2016),Colombia is coming out of a period in their h...
37703,37704,With Honors Denied (2003),Japanese bombs hit Pearl Harbor on a Sunday. ...
37704,37705,"""Connect with English"" (2007)",Connect with English is a series that brings ...
...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard..."


In [11]:
test_data_solution

Unnamed: 0,Sno,Name,Genre,Description
37700,37701,My Lips Betray (1933),musical,"In a make-believe, mittleuropean kingdom, a v..."
37701,37702,The Koreas (2016),documentary,"At the end of World War II, Korea was divided..."
37702,37703,Come Together (2016),documentary,Colombia is coming out of a period in their h...
37703,37704,With Honors Denied (2003),short,Japanese bombs hit Pearl Harbor on a Sunday. ...
37704,37705,"""Connect with English"" (2007)",drama,Connect with English is a series that brings ...
...,...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Da..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their...
54197,54198,Oliver Twink (2007),adult,A movie 169 years in the making. Oliver Twist...
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard..."


In [13]:
train_data.iloc[54201]

Sno                                                        54202
Name                                        Singing Guns (1950) 
Genre                                                   western 
Description     Rhiannon, an outlaw who regularly robs gold f...
Name: 54201, dtype: object

In [14]:
test_data_solution.reset_index(drop=True, inplace=True)
test_data_solution.index += 1

In [15]:
test_data_solution.columns

Index(['Sno', 'Name', 'Genre', 'Description'], dtype='object')

In [16]:
test_data.reset_index(drop=True, inplace=True)
test_data.index += 1

In [17]:
test_data

Unnamed: 0,Sno,Name,Description
1,37701,My Lips Betray (1933),"In a make-believe, mittleuropean kingdom, a v..."
2,37702,The Koreas (2016),"At the end of World War II, Korea was divided..."
3,37703,Come Together (2016),Colombia is coming out of a period in their h...
4,37704,With Honors Denied (2003),Japanese bombs hit Pearl Harbor on a Sunday. ...
5,37705,"""Connect with English"" (2007)",Connect with English is a series that brings ...
...,...,...,...
16496,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Da..."
16497,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their...
16498,54198,Oliver Twink (2007),A movie 169 years in the making. Oliver Twist...
16499,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard..."


In [18]:
test_data_solution

Unnamed: 0,Sno,Name,Genre,Description
1,37701,My Lips Betray (1933),musical,"In a make-believe, mittleuropean kingdom, a v..."
2,37702,The Koreas (2016),documentary,"At the end of World War II, Korea was divided..."
3,37703,Come Together (2016),documentary,Colombia is coming out of a period in their h...
4,37704,With Honors Denied (2003),short,Japanese bombs hit Pearl Harbor on a Sunday. ...
5,37705,"""Connect with English"" (2007)",drama,Connect with English is a series that brings ...
...,...,...,...,...
16496,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Da..."
16497,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their...
16498,54198,Oliver Twink (2007),adult,A movie 169 years in the making. Oliver Twist...
16499,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard..."


In [19]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [20]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\susha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\susha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
def preprocess_text(text):
    
    # Remove special characters, punctuation, and symbols
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    
    # Join tokens back into a string
    preprocessed_text = ' '.join(stemmed_tokens)
    
    return preprocessed_text


# Apply preprocessing to 'Description' column
train_data['Description'] = train_data['Description'].apply(preprocess_text)
test_data['Description'] = test_data['Description'].apply(preprocess_text)
test_data_solution['Description'] = test_data_solution['Description'].apply(preprocess_text)
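
To see what the first steps of `preprocess_text` do to a description, here is a stripped-down sketch that uses only the standard library. The tiny `STOP_WORDS` set and the `strip_and_filter` helper are illustrative assumptions: the notebook itself uses NLTK's English stop-word list and `word_tokenize`, and stemming is left out here.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "the", "but", "and", "of", "to", "in"}

def strip_and_filter(text):
    """Drop punctuation, lowercase, split on whitespace, remove stop words."""
    no_punct = re.sub(r"[^\w\s]", "", text)   # same pattern as preprocess_text
    tokens = no_punct.lower().split()          # whitespace split stands in for word_tokenize
    return [t for t in tokens if t not in STOP_WORDS]

print(strip_and_filter("Popular, but mysterious rock D.J Mike Mallard"))
# ['popular', 'mysterious', 'rock', 'dj', 'mike', 'mallard']
```

Note that removing punctuation outright merges `D.J` into the single token `dj`; replacing punctuation with a space instead would keep the two letters apart.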


In [22]:
train_data.iloc[90000]


Sno                                                        90002
Name                                  The Copper Village (2015) 
Genre                                               documentary 
Description    villag western nepal copper mine oper 400 year...
Name: 35786, dtype: object

In [23]:
X_train = train_data['Description']
y_train = train_data['Genre']
X_test = test_data['Description']
y_test_solution = test_data_solution['Genre']

In [24]:
vectorizer = TfidfVectorizer(max_features=1000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

### Using TF-IDF vectorization
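
For reference, the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed inverse document frequency, L2-normalised rows) can be sketched in plain Python. The `tfidf` helper and the three short documents are illustrative assumptions, and the sketch ignores `max_features`, which in scikit-learn keeps only the most frequent terms across the corpus.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows, one dict of term weights per document."""
    tokenised = [d.split() for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenised for term in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenised:
        weights = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["villag western nepal copper mine", "western outlaw rob gold", "copper mine oper"]
rows = tfidf(docs)
print({term: round(weight, 3) for term, weight in rows[0].items()})
```

Terms that occur in fewer documents (such as `villag` here) end up with a larger weight than terms shared across documents (such as `western`).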

In [25]:
vectorizer = TfidfVectorizer(max_features=1000)  
X_train_vectorized = vectorizer.fit_transform(X_train).toarray()
print(X_train_vectorized)
X_test_vectorized = vectorizer.transform(X_test)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.14971738 0.         ... 0.         0.         0.        ]]



#### MultiClass Perceptron Classifier

In [30]:
# from sklearn.model_selection import train_test_split
# import tensorflow as tf
# import numpy as np
# from keras.models import Sequential
# from keras.layers import Dense, Dropout
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.preprocessing import LabelEncoder
# from tensorflow.keras.optimizers import Adam

# # Convert sparse matrices to CSR format
# X_train_csr = X_train_vectorized.tocsr()
# X_test_csr = X_test_vectorized.tocsr()

# # Split the data into training and validation sets
# X_train, X_val, y_train, y_val = train_test_split(X_train_csr, y_train_encoded, test_size=0.2, random_state=42)

# # Convert sparse matrices to COO format for easier manipulation
# X_train_coo = X_train.tocoo()
# X_val_coo = X_val.tocoo()

# # Reshape data to ensure rank-2 tensors
# X_train_sparse = tf.sparse.reorder(tf.sparse.SparseTensor(
#     indices=np.transpose([X_train_coo.row, X_train_coo.col]),
#     values=X_train_coo.data,
#     dense_shape=X_train_coo.shape
# ))
# X_val_sparse = tf.sparse.reorder(tf.sparse.SparseTensor(
#     indices=np.transpose([X_val_coo.row, X_val_coo.col]),
#     values=X_val_coo.data,
#     dense_shape=X_val_coo.shape
# ))

# # Define and train the model
# model = Sequential([
#     Dense(256, activation='relu', input_dim=X_train_sparse.shape[1]),
#     Dropout(0.8),
#     #Dense(128, activation='relu'),

#     Dense(128, activation='relu'),
#     # Dropout(0.8),
#     Dense(64, activation='relu'),
#     Dropout(0.8),
#     Dense(64, activation='relu'),

#     Dense(32, activation='relu'),
    
#     # Dense(32, activation='relu'),
#     Dense(27, activation='softmax')  
# ])
# optimizer = Adam(learning_rate=0.002)

# model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# history = model.fit(X_train_sparse, y_train, epochs=30, batch_size=64, validation_data=(X_val_sparse, y_val))



import re
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Preprocessing function
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    preprocessed_text = ' '.join(stemmed_tokens)
    return preprocessed_text

# Apply preprocessing to 'Description' column
train_data['Description'] = train_data['Description'].apply(preprocess_text)
test_data['Description'] = test_data['Description'].apply(preprocess_text)
test_data_solution['Description'] = test_data_solution['Description'].apply(preprocess_text)

# Data preparation
X_train = train_data['Description']
y_train = train_data['Genre']
X_test = test_data['Description']
y_test_solution = test_data_solution['Genre']

vectorizer = TfidfVectorizer(max_features=1000)  
X_train_vectorized = vectorizer.fit_transform(X_train).toarray()
X_test_vectorized = vectorizer.transform(X_test).toarray()

# Convert target labels to numerical values
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_solution_encoded = label_encoder.transform(y_test_solution)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_vectorized, y_train_encoded, test_size=0.2, random_state=42)

# Define and compile the model


KeyboardInterrupt: 

In [43]:
model = Sequential([
    Dense(512, activation='relu', input_dim=X_train.shape[1]),
    Dropout(0.8),
    Dense(256, activation='relu'),
    # Dropout(0.8),
    Dense(128, activation='relu'),
    # Dropout(0.8),

    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(27, activation='softmax')
])



optimizer = Adam(learning_rate=0.0035)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=30, batch_size=64, validation_data=(X_val, y_val))


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30

KeyboardInterrupt: 
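
A note on the `Dropout(0.8)` layer in the model above: during training, dropout zeroes roughly that fraction of the incoming activations on every step and scales the survivors by 1 / (1 - rate) so the expected total is unchanged. The sketch below imitates that behaviour in plain Python; the `dropout` helper is a hypothetical stand-in, not Keras code.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)               # fixed seed so the run is repeatable
activations = [1.0] * 10_000
dropped = dropout(activations, rate=0.8, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"kept {kept_fraction:.1%} of units; mean activation {mean_activation:.2f}")
```

With a rate of 0.8 only about one unit in five is active on any step, which makes the training signal noisy; rates between 0.2 and 0.5 are more commonly tried first.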

In [28]:
model.summary()


NameError: name 'model' is not defined

In [81]:
model = MLPClassifier()

# Train the model
model.fit(X_train_vectorized, y_train_encoded)

# Predict on test data
y_pred = model.predict(X_test_vectorized)

# Calculate accuracy
accuracy = accuracy_score(y_test_solution_encoded, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5322424242424243
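
`accuracy_score` reports the share of test rows whose predicted genre equals the true one. To put such a figure in context, it can help to compare it with the accuracy of always predicting the most frequent genre. The sketch below shows both calculations in plain Python; the labels are hypothetical and the `accuracy` helper is a stand-in for the scikit-learn function.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels, for illustration only
y_true = ["drama", "drama", "comedy", "documentary", "drama", "short"]
y_pred = ["drama", "comedy", "comedy", "documentary", "drama", "drama"]

majority = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority] * len(y_true))

print(round(accuracy(y_true, y_pred), 3))   # 4 of 6 correct
print(baseline)                             # always predicting the most frequent genre
```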




In [37]:
max_tokens = train_data['Description'].apply(lambda x: len(word_tokenize(x))).max()
max_tokens


1479

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Embedding, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np


In [74]:
vectorizer = TfidfVectorizer(max_features=1000) 

# Fit and transform the training and test text data
X_train_vectorized = vectorizer.fit_transform(train_data['Description'])
X_test_vectorized = vectorizer.transform(test_data['Description'])

# Convert sparse matrices to dense arrays
X_train_dense = X_train_vectorized.toarray()
X_test_dense = X_test_vectorized.toarray()

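# Fixed input length for the network. The longest description has 1479 tokens (max_tokens above),
# so 800 is a cap rather than the true maximum.
max_length = 800  

# Pad or truncate every row to max_length.
# Caution: each row here is a 1000-value TF-IDF vector, not a token sequence, so pad_sequences
# drops 200 of its values and, with the default dtype='int32', casts the fractional weights to 0.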
X_train_padded = pad_sequences(X_train_dense, maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test_dense, maxlen=max_length, padding='post')
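
An `Embedding` layer looks up one vector per integer token id, so its usual input is a padded sequence of word ids rather than TF-IDF weights. The following is a minimal sketch of building such sequences in plain Python, assuming whitespace tokenisation; `build_vocab`, `to_padded_ids`, and the reserved ids are hypothetical names chosen for illustration, not part of Keras.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1   # hypothetical reserved ids: padding and out-of-vocabulary

def build_vocab(texts, max_words):
    """Map the most frequent words to integer ids starting at 2."""
    counts = Counter(tok for text in texts for tok in text.split())
    return {word: idx + 2 for idx, (word, _) in enumerate(counts.most_common(max_words))}

def to_padded_ids(texts, vocab, max_length):
    """Encode each text as word ids, truncated or right-padded to max_length."""
    rows = []
    for text in texts:
        ids = [vocab.get(tok, OOV_ID) for tok in text.split()][:max_length]
        rows.append(ids + [PAD_ID] * (max_length - len(ids)))
    return rows

train_texts = ["villag western nepal copper mine", "copper mine oper 400 year"]
vocab = build_vocab(train_texts, max_words=1000)
padded = to_padded_ids(train_texts + ["unseen word copper"], vocab, max_length=6)
print(padded[2])
```

With sequences built this way, the `Embedding` layer's `input_dim` would be the vocabulary size plus the two reserved ids, and `max_length` would match its expected sequence length.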

#### Bidirectional LSTM

In [73]:
# Define the model
model = Sequential([
    Embedding(input_dim=X_train_padded.shape[1], output_dim=128, input_length=max_length),
    Bidirectional(LSTM(128, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(128, activation='relu'),
    Dropout(0.8),
    Dense(32, activation='relu'),
    Dense(len(np.unique(y_train_encoded)), activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()

# Train the model
history = model.fit(X_train_padded, y_train_encoded, epochs=10, batch_size=64, validation_split=0.2)

# Evaluate the model on test data
loss, accuracy = model.evaluate(X_test_padded, y_test_solution_encoded)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)


Model: "sequential_33"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 800, 128)          102400    
                                                                 
 bidirectional_2 (Bidirectio  (None, 800, 256)         263168    
 nal)                                                            
                                                                 
 dense_163 (Dense)           (None, 800, 32)           8224      
                                                                 
 dense_164 (Dense)           (None, 800, 27)           891       
                                                                 
Total params: 374,683
Trainable params: 374,683
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10


InvalidArgumentError: Graph execution error:

Detected at node 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits' defined at (most recent call last):
    ...
    File "C:\Users\susha\AppData\Local\Temp\ipykernel_35816\3013874624.py", line 18, in <module>
      history = model.fit(X_train_padded, y_train_encoded, epochs=10, batch_size=64, validation_split=0.2)
    ...
Node: 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits'
logits and labels must have the same first dimension, got logits shape [51200,27] and labels shape [64]
	 [[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_train_function_2277178]
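
The two shapes in the error message can be reproduced by hand. The printed summary ends in a `(None, 800, 27)` output, which is what a model produces when its last recurrent layer keeps `return_sequences=True`; the loss then sees one prediction per time step instead of one per description. The sketch below is only shape arithmetic, with values taken from the cell and the summary above.

```python
batch_size = 64      # from model.fit(..., batch_size=64)
timesteps = 800      # max_length, the second dimension in the summary
n_classes = 27       # width of the final Dense layer

per_step_logits = (batch_size * timesteps, n_classes)   # one prediction per time step
per_sample_logits = (batch_size, n_classes)             # one prediction per description
labels = (batch_size,)

print(per_step_logits, labels)     # (51200, 27) (64,)
print(per_sample_logits, labels)   # (64, 27) (64,)
```

Only the second case has matching first dimensions, which is what `sparse_categorical_crossentropy` needs when there is a single genre label per description.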

In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_padded, y_train_encoded, epochs=10, batch_size=32, validation_split=0.1)


Epoch 1/10

### Using Word2Vec vectorization
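
One point to check before training: gensim's `Word2Vec` expects `sentences` to be an iterable of token lists. Iterating over a plain string yields single characters, so descriptions stored as strings need to be split into words first (for example with `.str.split()` on the column). The sketch below shows the difference with a hypothetical two-dimensional embedding table and a `mean_vector` helper that mirrors the document-vector averaging used in this section.

```python
# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"copper": [1.0, 0.0], "mine": [0.0, 1.0]}

def mean_vector(tokens, vectors, size=2):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(col) / len(known) for col in zip(*known)]

description = "copper mine oper"

as_characters = list(description)   # what iterating over a plain string yields
as_words = description.split()      # one sentence as a list of word tokens

print(as_characters[:6])                         # ['c', 'o', 'p', 'p', 'e', 'r']
print(mean_vector(as_characters, toy_vectors))   # [0.0, 0.0]
print(mean_vector(as_words, toy_vectors))        # [0.5, 0.5]
```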

In [115]:
from gensim.models import Word2Vec
import numpy as np

In [116]:
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=train_data['Description'], vector_size=100, window=5, min_count=1, workers=4)

# Function to calculate document vectors
def calculate_doc_vector(tokens):
    vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)


In [117]:
# Calculate document vectors for train and test data
X_train_vectors = np.array([calculate_doc_vector(tokens) for tokens in train_data['Description']])
X_test_vectors = np.array([calculate_doc_vector(tokens) for tokens in test_data['Description']])

# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(train_data['Genre'])
y_test_solution_encoded = label_encoder.transform(test_data_solution['Genre'])
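
What `LabelEncoder` does here can be sketched in a few lines: it assigns each distinct genre an integer according to the sorted order of the labels seen during fitting, and the same mapping is then reused for the test labels. The `fit_label_encoder` helper and the sample genres are illustrative assumptions.

```python
def fit_label_encoder(labels):
    """Return a genre -> integer mapping over the sorted unique labels."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

train_genres = ["drama", "thriller", "adult", "drama", "western"]
mapping = fit_label_encoder(train_genres)

encoded = [mapping[g] for g in ["western", "adult", "drama"]]
print(mapping)   # {'adult': 0, 'drama': 1, 'thriller': 2, 'western': 3}
print(encoded)   # [3, 0, 1]
```

A genre that appears only in the test labels has no entry in the mapping, which is why the encoder is fitted on the training labels and merely applied to the test labels.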


#### MultiClass Perceptron Classifier

In [33]:
model = MLPClassifier()

# Train the model
model.fit(X_train_vectors, y_train_encoded)

# Predict on test data
y_pred = model.predict(X_test_vectors)



In [34]:
accuracy = accuracy_score(y_test_solution_encoded, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.3626666666666667


####  Bidirectional LSTM

In [70]:
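# Add a trailing axis: (n_samples, 100) -> (n_samples, 100, 1), so the LSTM reads each
# document vector as a sequence of 100 time steps with one feature per step
X_train_vectors_reshaped = np.expand_dims(X_train_vectors, axis=-1)
X_test_vectors_reshaped = np.expand_dims(X_test_vectors, axis=-1)

# Define the bidirectional LSTM model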
model = Sequential([
    Bidirectional(LSTM(64)),
    Dense(len(np.unique(y_train_encoded)), activation='softmax')
])

from tensorflow.keras.optimizers import Adam

# Define the optimizer with a custom learning rate
optimizer = Adam(learning_rate=0.003)

# Compile the model
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_vectors_reshaped, y_train_encoded, epochs=15, batch_size=64, validation_split=0.1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1ad2686af10>

In [53]:

model = Sequential([
    Embedding(input_dim=len(word2vec_model.wv), output_dim=100, input_length=900),
    Bidirectional(LSTM(128, return_sequences=True)),  # Increase units to 128 and return_sequences=True
    Bidirectional(LSTM(64, return_sequences=True)),  # Increase units to 64
    Bidirectional(LSTM(32)),
    Dense(64, activation='relu'),  # Add an additional Dense layer with 64 units and ReLU activation
    Dense(32, activation='relu'),  # Add another Dense layer with 32 units and ReLU activation
    Dense(16, activation='relu'),
    Dense(len(np.unique(y_train_encoded)), activation='softmax')
])

In [71]:

# Determine the number of features after reshaping
# num_features = X_train_vectors.shape[1]  # Only one dimension beyond batch size

# Reshape input data to have shape (None, num_features)
X_train_vectors_reshaped = X_train_vectors.reshape(-1, 301)
X_test_vectors_reshaped = np.expand_dims(X_test_vectors, axis=-1)

from tensorflow.keras.optimizers import Adam

# Define the optimizer with a custom learning rate
optimizer = Adam(learning_rate=0.003)

# Compile the model
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_vectors_reshaped, y_train_encoded, epochs=15, batch_size=64, validation_split=0.1)

ValueError: cannot reshape array of size 9191400 into shape (301)

In [29]:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Define the model
model = Sequential([
    Embedding(input_dim=len(word2vec_model.wv), output_dim=100, input_length=900),
    Bidirectional(LSTM(128, return_sequences=True)),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(len(np.unique(y_train_encoded)), activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()


NameError: name 'y_train_encoded' is not defined

# In conclusion, TF-IDF vectorization gets the best out of the dataset so far: MLPClassifier reaches 0.532 accuracy on TF-IDF features versus 0.363 on Word2Vec features.

## Now let's look at some classical machine learning models to see whether they achieve better accuracy.

### SVM

In [25]:
from sklearn.svm import SVC


In [26]:
model = SVC()

# Train the model
model.fit(X_train_vectors, y_train_encoded)

# Predict on test data
y_pred = model.predict(X_test_vectors)

accuracy = accuracy_score(y_test_solution_encoded, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.34763636363636363
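
The three test accuracies that were actually printed in this notebook can be lined up for comparison. The sketch below only sorts the reported figures; the labels given to each run are shorthand introduced here.

```python
# Test-set accuracies copied from the cell outputs in this notebook
results = {
    "TF-IDF + MLPClassifier": 0.5322424242424243,
    "Word2Vec + MLPClassifier": 0.3626666666666667,
    "Word2Vec + SVC": 0.34763636363636363,
}

best = max(results, key=results.get)
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {acc:.3f}")
```

Since the SVC run used the Word2Vec document vectors, a like-for-like comparison with the best score would need the same classifier trained on the TF-IDF features as well.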
