# *Step 3: Model Creation*

## 1. Import Required Libraries for Model Creation
- `tensorflow`: For building and training the neural network.
- `pandas` & `numpy`: For data handling and numerical operations.
- `Tokenizer` & `pad_sequences`: For processing text data into numerical format.
- `train_test_split`: To divide the dataset into training and testing sets.
- `LabelEncoder`: To convert labels into numerical format.

We will use a **Bidirectional LSTM (Long Short-Term Memory) model** to classify whether two questions are duplicates.

## 2. Loading and Preprocessing Dataset
We load the dataset from `train.csv` and drop any rows with missing values in the columns:
- `question1`
- `question2`
- `is_duplicate`

To prepare the data for the neural network:
1. **Tokenization**: Convert text into sequences of numbers.
2. **Padding**: Ensure all sequences have the same length.
3. **Label Encoding**: Convert the `is_duplicate` column into binary values.
4. **Splitting Data**: Separate the dataset into training and test sets.

These steps give the model fixed-length integer sequences as input and binary labels as targets.
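
To make the tokenization and padding steps concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not the Keras implementation: the helper names `fit_vocab`, `texts_to_sequences`, and `pad_post` are hypothetical, and the sketch assumes whitespace splitting, id 0 reserved for padding, zeros appended at the end, and over-long sequences cut from the front.

```python
from collections import Counter

def fit_vocab(texts, max_words):
    """Map the most frequent words to integer ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Keep at most max_words - 1 words so every id stays below max_words.
    most_common = counts.most_common(max_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def texts_to_sequences(texts, vocab):
    """Replace each known word by its id; words outside the vocabulary are dropped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad_post(sequences, max_len):
    """Keep the last max_len ids of each sequence, then right-pad with zeros."""
    padded = []
    for seq in sequences:
        kept = seq[-max_len:]
        padded.append(kept + [0] * (max_len - len(kept)))
    return padded

# Toy questions (made up for illustration, not taken from train.csv).
questions = ["how do i learn python", "how do i learn java fast"]
vocab = fit_vocab(questions, max_words=20000)
padded = pad_post(texts_to_sequences(questions, vocab), max_len=8)
print(padded)
```

Every row has the same length, and words shared by both questions map to the same ids, which is what lets the network compare the two inputs.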

## 3. Neural Network Model Architecture
We use a **Bidirectional LSTM** model to process the text sequences.

### Model Components:
- **Embedding Layer**: Converts words into dense vector representations.
- **Bidirectional LSTM**: Captures contextual relationships from both past and future words.
- **Concatenation Layer**: Merges the encoded `question1` and `question2` representations.
- **Dense & Dropout Layers**: Add non-linearity and reduce overfitting.
- **Sigmoid Output Layer**: Outputs the probability that the two questions are duplicates (1) rather than distinct (0).

Both questions pass through the same embedding and LSTM layers, so they are encoded with shared weights before being compared.
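
As a rough sanity check on the architecture, the layer sizes can be worked out by hand. The sketch below uses the hyperparameters from the code cell in this notebook (`max_words=20000`, `embedding_dim=128`, `lstm_units=64`, a 64-unit dense layer) and assumes the usual LSTM parameterization with four gates and one bias vector per gate. Because the embedding and LSTM layers are shared between the two questions, their weights are counted once.

```python
# Hyperparameters taken from the notebook's code cell.
max_words, embedding_dim, lstm_units, dense_units = 20000, 128, 64, 64

# Embedding: one vector of size embedding_dim per vocabulary id.
embedding_params = max_words * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction  # forward + backward

encoding_size = 2 * lstm_units       # final forward and backward states per question
merged_size = 2 * encoding_size      # question1 and question2 encodings concatenated

dense_params = merged_size * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + bilstm_params + dense_params + output_params
print(f"merged vector size: {merged_size}")
print(f"total trainable parameters: {total_params:,}")
```

If the assumption about the LSTM parameterization holds, the total should agree with what `model.summary()` reports. Most of the parameters sit in the embedding table.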

## 4. Model Training and Evaluation
We train the model using:
- **Binary Crossentropy Loss**: Suitable for binary classification problems.
- **Adam Optimizer**: Efficient for training deep networks.
- **Accuracy Metric**: Evaluates the model's performance.
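
For intuition, binary cross-entropy and accuracy can be computed by hand on a few predictions. The following is a minimal sketch with made-up labels and probabilities. The clipping constant `eps` and the 0.5 decision threshold are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions whose thresholded value matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up labels and sigmoid outputs, for illustration only.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"loss={loss:.4f}, accuracy={acc:.2f}")
```

The two confident, correct predictions contribute little to the loss, while the two wrong ones dominate it. This is why loss and accuracy can move in different directions during training.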

The dataset is divided into:
- **Training Data** (80%): Used for learning.
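- **Test Data** (20%): Used to measure generalization.

After training, we evaluate the model on the test set. Note that the code in this notebook also passes the test set as `validation_data`, so the per-epoch validation metrics and the final evaluation are computed on the same data.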
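
Because the training code in this notebook reuses the test set as `validation_data`, one possible refinement is to carve a separate validation slice out of the data so that the test set stays untouched until the final evaluation. The sketch below is one way to do that with index lists. The function `three_way_split` and the 70/10/20 proportions are hypothetical choices, not part of the original notebook.

```python
import random

def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle sample indices once and cut them into train/validation/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The same index lists would be applied to the `question1` sequences, the `question2` sequences, and the labels, so that each pair stays aligned with its label.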

In [1]:
pip install tensorflow

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, Bidirectional, Concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
file_path = "train.csv"
df = pd.read_csv(file_path)
df = df.dropna(subset=['question1', 'question2', 'is_duplicate'])

# Tokenization & Padding
max_words = 20000  # Vocabulary size
max_len = 50  # Max sequence length
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['question1'].tolist() + df['question2'].tolist())

q1_sequences = tokenizer.texts_to_sequences(df['question1'].tolist())
q2_sequences = tokenizer.texts_to_sequences(df['question2'].tolist())

q1_padded = pad_sequences(q1_sequences, maxlen=max_len, padding='post')
q2_padded = pad_sequences(q2_sequences, maxlen=max_len, padding='post')

# Encode labels
label_encoder = LabelEncoder()
df['is_duplicate'] = label_encoder.fit_transform(df['is_duplicate'])
y = np.array(df['is_duplicate'])

# Split dataset
X_train_q1, X_test_q1, X_train_q2, X_test_q2, y_train, y_test = train_test_split(
    q1_padded, q2_padded, y, test_size=0.2, random_state=42)

# Model Architecture
embedding_dim = 128
lstm_units = 64

def build_lstm_model():
    input_q1 = Input(shape=(max_len,))
    input_q2 = Input(shape=(max_len,))
    
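    # Shared layers: both questions are embedded and encoded with the same weights.
    # `input_length` is omitted: recent Keras versions deprecate it and infer the length from the input.
    embedding = Embedding(input_dim=max_words, output_dim=embedding_dim)
    lstm = Bidirectional(LSTM(lstm_units, return_sequences=False))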
    
    encoded_q1 = lstm(embedding(input_q1))
    encoded_q2 = lstm(embedding(input_q2))
    
    merged = Concatenate()([encoded_q1, encoded_q2])
    dense = Dense(64, activation='relu')(merged)
    dropout = Dropout(0.5)(dense)
    output = Dense(1, activation='sigmoid')(dropout)
    
    model = Model(inputs=[input_q1, input_q2], outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Build and train the model (compilation happens inside build_lstm_model)
model = build_lstm_model()
model.summary()

model.fit([X_train_q1, X_train_q2], y_train, validation_data=([X_test_q1, X_test_q2], y_test),
          epochs=10, batch_size=64, verbose=1)

# Evaluate Model
loss, accuracy = model.evaluate([X_test_q1, X_test_q2], y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')

print("Neural Network Model for Question Similarity Classification Completed!")




Epoch 1/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 763s 149ms/step - accuracy: 0.7355 - loss: 0.5279 - val_accuracy: 0.7843 - val_loss: 0.4454
Epoch 2/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 1202s 238ms/step - accuracy: 0.8105 - loss: 0.4072 - val_accuracy: 0.8054 - val_loss: 0.4150
Epoch 3/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 1254s 248ms/step - accuracy: 0.8491 - loss: 0.3403 - val_accuracy: 0.8131 - val_loss: 0.4091
Epoch 4/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 1286s 254ms/step - accuracy: 0.8760 - loss: 0.2878 - val_accuracy: 0.8165 - val_loss: 0.4272
Epoch 5/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 1143s 226ms/step - accuracy: 0.8978 - loss: 0.2424 - val_accuracy: 0.8162 - val_loss: 0.4700
Epoch 6/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 1094s 216ms/step - accuracy: 0.9139 - loss: 0.2083 - val_accuracy: 0.8172 - val_

# Summary of Model Creation

### Key Steps:
1. **Data Preprocessing**:
   - Tokenized and padded text sequences.
   - Encoded labels for classification.
   - Split data into training and testing sets.

2. **LSTM Model Architecture**:
   - **Bidirectional LSTM** for capturing contextual meaning.
   - **Concatenation of the two question encodings** so the dense layers can compare them.
   - **Dense & Dropout layers** for robust learning.

3. **Training & Evaluation**:
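   - Configured for **10 epochs** using **binary cross-entropy loss** and the Adam optimizer.
   - The captured log ends partway through epoch 6, where validation accuracy is about 0.817. Validation loss is lowest at epoch 3 (0.4091) and rises afterwards while training accuracy keeps climbing, which points to overfitting.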
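
The validation losses in the training log suggest that later epochs may not be helping. The sketch below replays the five complete `val_loss` values from the log through a simple patience rule, similar in spirit to the Keras `EarlyStopping` callback. The function `best_epoch_with_patience` and the choice of `patience=2` are hypothetical and only illustrate the idea.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch): stop once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss for epochs 1-5, copied from the training log (epoch 6 is truncated there).
logged_val_loss = [0.4454, 0.4150, 0.4091, 0.4272, 0.4700]

best_epoch, stop_epoch = best_epoch_with_patience(logged_val_loss, patience=2)
print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

In an actual training run, the same effect could be obtained by passing an `EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`) to `model.fit`, ideally monitoring a validation set that is separate from the test set.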

### Final Outcome:
- The model is now ready to classify whether two questions are duplicates.