# *Step 5: Model Tuning and Hyperparameter Optimization*

## 1. Install and Import Required Libraries
- **TensorFlow & Keras**: Used for building and training the deep learning model.
- **Keras Tuner (`kt`)**: For hyperparameter tuning.
- **Pandas & NumPy**: For data handling and numerical computations.
- **TF-IDF Vectorizer**: To convert text data into numerical features.
- **Sparse Matrices**: Used to efficiently store high-dimensional TF-IDF features.
  
## 2. Loading and Preprocessing Data Efficiently
**Why use chunking?**  
- The dataset may be **large**, leading to memory issues.
- We use **`chunksize=10000`** to read the file in 10,000-row pieces. The pieces are concatenated straight away, so the full dataset still ends up in memory.
- `on_bad_lines='skip'` ensures problematic rows do not interrupt loading.
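
The chunked read can be sketched on a tiny in-memory CSV. The helper name `load_in_chunks` and the sample rows are hypothetical and only for illustration; the loop body is where per-chunk filtering or downcasting would go if memory were tight.

```python
import io
import pandas as pd

def load_in_chunks(source, chunk_size=10000):
    """Read a CSV in fixed-size chunks and return one combined DataFrame."""
    reader = pd.read_csv(source, chunksize=chunk_size, on_bad_lines="skip")
    chunks = []
    for chunk in reader:
        # Each chunk is an ordinary DataFrame with at most chunk_size rows.
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

# Hypothetical five-row file, read two rows at a time.
csv_text = "id,question1,question2\n0,a1,b1\n1,a2,b2\n2,a3,b3\n3,a4,b4\n4,a5,b5\n"
df = load_in_chunks(io.StringIO(csv_text), chunk_size=2)
print(len(df))
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each chunk's own index.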

**Handling Missing Values:**  
- Any missing data in `question1` or `question2` is replaced with an **empty string (`''`)**.

**Why concatenate questions?**  
- Instead of processing `question1` and `question2` separately, we **merge them into a single string** for each pair before applying TF-IDF.
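
A small sketch of the two steps together, using made-up question pairs. Without the `fillna('')` call, a missing question would make the concatenated value `NaN` as well, which the vectorizer cannot process.

```python
import numpy as np
import pandas as pd

# Hypothetical pairs; the second row has a missing question1.
pairs = pd.DataFrame({
    "question1": ["How do I learn Python?", np.nan],
    "question2": ["What is the best way to learn Python?", "Is this a duplicate?"],
})

pairs = pairs.fillna("")                                   # missing text -> empty string
combined = pairs["question1"] + " " + pairs["question2"]   # one string per pair
print(combined.tolist())
```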

**TF-IDF Feature Extraction:**  
- **`max_features=2000`** limits vocabulary size to save memory.

## 3. Splitting Data into Training & Testing Sets

- **80% training** and **20% testing** split.
- Helps evaluate model performance on unseen data.
- We also prepare `X_final_test` for making predictions on the actual test dataset.

## 4. Building a Tunable Deep Learning Model

We use **Keras Tuner** to optimize hyperparameters for our neural network.

**Model Architecture**:
- **1st Dense Layer**:  
  - Units: Tuned between **32 to 128**.
  - Activation: **ReLU**.
- **Dropout Layer**:
  - Dropout rate: Tuned between **0.2 to 0.5** (to prevent overfitting).
- **Final Output Layer**:
  - **1 neuron** with **Sigmoid Activation** for binary classification.

**Optimizers**:
- We test between **'adam'**, **'rmsprop'**, and **'sgd'**.

**Loss Function**:
- `binary_crossentropy`: Suitable for binary classification.

**Evaluation Metric**:
- **Accuracy**: To measure model performance.
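
For intuition, the forward pass of this architecture can be written out in plain NumPy. This is only a sketch of the arithmetic, not how Keras implements it: the weights are random placeholders, 64 hidden units is one arbitrary value from the tuned range, and the function names are our own.

```python
import numpy as np

def forward(x, w1, b1, w2, b2, dropout_rate=0.0, rng=None):
    """Dense(ReLU) -> Dropout -> Dense(sigmoid), written out in NumPy."""
    hidden = np.maximum(0.0, x @ w1 + b1)                # ReLU activation
    if rng is not None and dropout_rate > 0.0:           # dropout is applied in training only
        keep = rng.random(hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)    # rescale the surviving units
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid -> probability

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

rng = np.random.default_rng(0)
x = rng.random((4, 2000))                                # 4 samples, 2000 TF-IDF features
w1 = rng.normal(scale=0.01, size=(2000, 64))
b1 = np.zeros(64)
w2 = rng.normal(scale=0.01, size=(64, 1))
b2 = np.zeros(1)

probs = forward(x, w1, b1, w2, b2)                                       # inference
train_probs = forward(x, w1, b1, w2, b2, dropout_rate=0.3, rng=rng)      # training-style pass
print(probs.shape, binary_crossentropy(np.array([1.0]), np.array([0.5])))
```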

## 5. Hyperparameter Tuning with Keras Tuner

- We use **`RandomSearch`** to explore different model configurations efficiently.
- **Key Parameters Tuned**:
  - **Number of Neurons** (`dense_units`): Between **32 and 128**.
  - **Dropout Rate** (`dropout_1`): Between **0.2 and 0.5**.
  - **Optimizer**: Choice between **'adam'**, **'rmsprop'**, and **'sgd'**.
- **Max Trials = 5**: Limits excessive computation.
- **Executions per Trial = 1**: Runs each configuration once.
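
To get a feel for how small a sample five trials is, the search space can be enumerated in plain Python. This is an illustration of random sampling over the grid implied by the step sizes used in the code (32 for units, 0.1 for dropout), not Keras Tuner's internal implementation.

```python
import itertools
import random

# Discrete values implied by the tuned ranges and their step sizes.
search_space = {
    "dense_units": [32, 64, 96, 128],
    "dropout_1": [0.2, 0.3, 0.4, 0.5],
    "optimizer": ["adam", "rmsprop", "sgd"],
}

# Every possible combination of the three hyperparameters.
all_configs = [
    dict(zip(search_space, combo))
    for combo in itertools.product(*search_space.values())
]

rng = random.Random(42)
trials = rng.sample(all_configs, k=5)   # five distinct configurations

print(len(all_configs))
for config in trials:
    print(config)
```

Five trials cover 5 of the 48 combinations, so the "best" configuration is the best of a small sample rather than a guaranteed optimum.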

## 6. Efficient Sparse Data Handling in TensorFlow

**Why use Sparse Tensors?**  
- TF-IDF produces **high-dimensional** but **sparse** matrices.
- Converting to **SparseTensor** reduces memory usage.
- `tf.sparse.reorder(X_sparse_tensor)`: Ensures correct ordering of indices before training.

**Custom Function (`sparse_to_dataset`)**:
- Converts sparse matrices into **TensorFlow datasets**.
- Uses **batching** (`batch_size=64`) to speed up training.
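
The three pieces that `tf.sparse.SparseTensor` takes (`indices`, `values`, `dense_shape`) can be illustrated with NumPy alone. The small matrix below is made up; a real TF-IDF matrix would have 2000 columns and far more zeros.

```python
import numpy as np

# Toy "TF-IDF" matrix: 3 rows, 4 features, mostly zeros.
dense = np.array(
    [[0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.3, 0.0, 0.0, 0.9]],
    dtype=np.float32,
)

rows, cols = np.nonzero(dense)                           # positions of non-zero entries
indices = np.column_stack((rows, cols)).astype(np.int64) # one [row, col] pair per entry
values = dense[rows, cols]                               # the non-zero weights themselves
dense_shape = dense.shape

# Rebuild the dense matrix from the triple to confirm nothing was lost.
rebuilt = np.zeros(dense_shape, dtype=np.float32)
rebuilt[indices[:, 0], indices[:, 1]] = values

print(indices.tolist(), values.tolist(), dense_shape)
```

Only three of the twelve cells are stored. `np.nonzero` already returns positions in row-major order, which is the ordering `tf.sparse.reorder` enforces.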

## 7. Training the Best Tuned Model

- After tuning, we extract the **best hyperparameters**.
- The model is **rebuilt** with optimal settings.
- Trained for **10 epochs** on **efficiently loaded sparse data**.

**Expected Outcome**:  
- Improved accuracy with the best parameter combination.
- Reduced overfitting due to dropout.
- Faster training by using **TF datasets** and **sparse matrices**.

## 8. Making Predictions on the Test Set

**Final Step**:
- Convert test set to **sparse format**.
- Use the **best trained model** to predict duplicate questions.
- Apply a **0.5 probability threshold** for classification.
- Save results as `test_predictions_tuned.csv`.
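
The thresholding step on its own, with invented probabilities. Note that a value of exactly 0.5 is classified as 0 because the comparison is strict.

```python
import numpy as np

# Hypothetical model outputs with shape (n_samples, 1), as a sigmoid layer produces.
probabilities = np.array([[0.12], [0.50], [0.87]])

labels = (probabilities > 0.5).astype(int).ravel()  # 1-D array of 0/1 labels
print(labels)
```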

**Final Output**:  
- The model generates predictions and **saves them in a CSV file**.

In [2]:
pip install tensorflow keras-tuner


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



In [2]:
#Import Required Libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import keras_tuner as kt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

# Load and Preprocess Data
chunk_size = 10000  # Read file in chunks
train_df = pd.concat(pd.read_csv('train.csv', chunksize=chunk_size, encoding='utf-8', on_bad_lines='skip', low_memory=False), ignore_index=True).fillna('')
test_df = pd.concat(pd.read_csv('test.csv', chunksize=chunk_size, encoding='utf-8', on_bad_lines='skip', low_memory=False), ignore_index=True).fillna('')

# Convert Text to TF-IDF Features
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(train_df['question1'] + " " + train_df['question2'])
y = train_df['is_duplicate'].values  # Convert labels to NumPy array

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_final_test = vectorizer.transform(test_df['question1'] + " " + test_df['question2'])


# Define Tunable Model
def build_model(hp):
    model = Sequential()
    model.add(Dense(units=hp.Int('dense_units', min_value=32, max_value=128, step=32), activation='relu'))
    model.add(Dropout(rate=hp.Float('dropout_1', min_value=0.2, max_value=0.5, step=0.1)))
    model.add(Dense(1, activation='sigmoid'))

    optimizer = hp.Choice('optimizer', values=['adam', 'rmsprop', 'sgd'])
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model


# Hyperparameter Tuning
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=1,
    directory='tuner_results',
    project_name='quora_duplicate_tuning'
)


# Efficiently Load Sparse Data with Sorted Indices
def sparse_to_dataset(X, y, batch_size=64):
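    coo = X.tocoo()  # COO format keeps the row, col and data arrays aligned entry by entry
    indices = np.column_stack((coo.row, coo.col)).astype(np.int64)
    values = coo.data.astype(np.float32)  # Stored TF-IDF weights
    shape = coo.shape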

    X_sparse_tensor = tf.sparse.SparseTensor(indices=indices, values=values, dense_shape=shape)
    X_sparse_tensor = tf.sparse.reorder(X_sparse_tensor)  

    dataset = tf.data.Dataset.from_tensor_slices((X_sparse_tensor, y))
    return dataset.batch(batch_size)

# Search for the best hyperparameters
tuner.search(sparse_to_dataset(X_train, y_train), validation_data=sparse_to_dataset(X_test, y_test), epochs=5)

# Get the best model
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.hypermodel.build(best_hps)

# Train the best model
best_model.fit(sparse_to_dataset(X_train, y_train), validation_data=sparse_to_dataset(X_test, y_test), epochs=10)


# Predict on Test Data
num_test_samples = X_final_test.shape[0]
# Dummy zero labels let us reuse sparse_to_dataset; predict() ignores them
test_predictions = (best_model.predict(sparse_to_dataset(X_final_test, np.zeros(num_test_samples))) > 0.5).astype(int).ravel()

test_df['is_duplicate'] = test_predictions
test_df.to_csv('test_predictions_tuned.csv', index=False)
print("Test predictions saved to 'test_predictions_tuned.csv'")


Reloading Tuner from tuner_results\quora_duplicate_tuning\tuner0.json
Epoch 1/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 41s 8ms/step - accuracy: 0.7172 - loss: 0.5467 - val_accuracy: 0.7525 - val_loss: 0.4962
Epoch 2/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 36s 7ms/step - accuracy: 0.7570 - loss: 0.4904 - val_accuracy: 0.7646 - val_loss: 0.4777
Epoch 3/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 35s 7ms/step - accuracy: 0.7730 - loss: 0.4672 - val_accuracy: 0.7732 - val_loss: 0.4648
Epoch 4/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 36s 7ms/step - accuracy: 0.7848 - loss: 0.4486 - val_accuracy: 0.7779 - val_loss: 0.4568
Epoch 5/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 35s 7ms/step - accuracy: 0.7938 - loss: 0.4341 - val_accuracy: 0.7816 - val_loss: 0.4512
Epoch 6/10
5054/5054 ━━━━━━━━━━━━━━━━━━━━ 36s 7ms/step - accuracy: 0.80

## Function: Convert Sparse Data to Dataset
**Why use a generator?**
- The test dataset is **sparse**, so converting it to **dense format** would consume excessive memory.
- Instead, we **convert and yield one row at a time** to keep memory usage low.

**Batch Processing**
- The dataset is batched (`batch_size=64`) to enable efficient GPU/CPU processing.
- **`.repeat()` was removed** because test data should not be repeated.
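
The row-at-a-time idea can be sketched without TensorFlow. The helpers `row_generator` and `batched` are hypothetical stand-ins for what `tf.data.Dataset.from_generator(...).batch(...)` does for us, and a small dense array stands in for the sparse matrix.

```python
import numpy as np

def row_generator(matrix):
    """Yield one row at a time so only a single row is materialized per step."""
    for i in range(matrix.shape[0]):
        yield matrix[i]

def batched(rows, batch_size):
    """Group rows into arrays of batch_size; the last batch may be smaller."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:                      # leftover rows form a final partial batch
        yield np.stack(batch)

data = np.arange(30, dtype=np.float32).reshape(10, 3)   # 10 rows, 3 features
shapes = [b.shape for b in batched(row_generator(data), batch_size=4)]
print(shapes)
```

Ten rows with a batch size of 4 give two full batches and one partial batch of 2 rows. Because nothing is repeated, iteration ends once every row has been seen.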

**Goal**:  
Convert the **sparse TF-IDF matrix** into a format that the **TensorFlow model can process efficiently**.

## Load Test Data (Ensure X_final_test is available)
**Why check for `X_final_test`?**
- If the variable is **not defined**, the script will raise an error.
- This prevents runtime crashes due to missing data.

**Loading Test IDs**
- We read `test.csv` to extract `test_id` values.
- These are used in the final submission file.

**Goal**:  
Ensure that the test data is **loaded properly** before making predictions.

## Make Predictions using the Trained Model
**Why check for `best_model`?**
- Ensures the trained model is **loaded before prediction**.
- If `best_model` is missing, an error is raised.

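**Why `steps=test_steps`?**
- `steps` tells `predict` how many batches to draw from the dataset.
- The count is derived from the row count and the batch size of 64. Plain floor division (`.shape[0] // 64`) leaves out a final partial batch, so when the row count is not a multiple of 64 the last few rows get no prediction; rounding up, or omitting `steps` for a finite dataset, covers every row.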
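
The step count is easy to check with plain arithmetic. The row count of 1000 below is a made-up example, not the size of the actual test set.

```python
import math

def prediction_steps(num_rows, batch_size):
    """Number of batches needed to cover every row, including a partial last batch."""
    return math.ceil(num_rows / batch_size)

num_rows, batch_size = 1000, 64          # hypothetical dataset size

floor_steps = num_rows // batch_size     # full batches only
ceil_steps = prediction_steps(num_rows, batch_size)
rows_missed = num_rows - floor_steps * batch_size

print(floor_steps, ceil_steps, rows_missed)
```

With these numbers, floor division yields 15 steps and covers 960 rows, leaving 40 rows unpredicted; rounding up to 16 steps covers all 1000.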

**Converting Predictions to Binary Labels**
- The model outputs **probabilities** between `0` and `1`.
- We apply **thresholding** (`> 0.5`) to classify predictions as **0 (Not Duplicate) or 1 (Duplicate)**.

**Goal**:  
Use the trained model to **predict duplicate questions** in the test set.

## Prepare Submission File
**Ensure `test_id` Count Matches Predictions**
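- The number of predictions can be smaller than the number of test IDs, for example when a final partial batch is not predicted or rows were skipped while loading.
- We use `test_ids[:len(test_predictions)]` to make the two lengths equal. Slicing only trims the ID list; it does not verify that each ID lines up with its own row's prediction.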
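
A stricter alternative to trimming is to fail loudly when the counts differ. The helper below is a hypothetical sketch, not part of the notebook's code; it pairs IDs with predictions only when the lengths agree.

```python
def pair_ids_with_predictions(test_ids, predictions):
    """Return (test_id, prediction) pairs, refusing to proceed on a length mismatch."""
    if len(test_ids) != len(predictions):
        raise ValueError(
            f"{len(test_ids)} test IDs but {len(predictions)} predictions"
        )
    return list(zip(test_ids, predictions))

rows = pair_ids_with_predictions([0, 1, 2], [1, 0, 0])
print(rows)
```

A mismatch then surfaces as an error at submission time instead of a silently shortened file.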

**Save as `test_predictions_tuned.csv`**
- The file is saved in CSV format.
- This allows submission to **Kaggle or any evaluation platform**.

**Goal**:  
Generate a well-formatted submission file containing **test IDs and their corresponding predictions**.

In [11]:
import numpy as np
import pandas as pd
import tensorflow as tf
from scipy.sparse import csr_matrix

# Function: Convert Sparse Data to Dataset
def sparse_to_dataset(X, batch_size=64):
    """ Convert sparse matrix to a TensorFlow dataset without converting to dense. """
    def generator():
        for i in range(X.shape[0]): 
            yield X[i].toarray().flatten()  # Convert only one row at a time
    
    dataset = tf.data.Dataset.from_generator(
        generator,
        output_signature=tf.TensorSpec(shape=(X.shape[1],), dtype=tf.float32)
    )
    
    return dataset.batch(batch_size)  # Removed .repeat() for test data


# Load Test Data (Ensure X_final_test is available)
try:
    X_final_test.shape  # Check if it's defined
except NameError:
    raise ValueError("X_final_test is not defined. Please load the test data.")

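# Load test.csv to get test IDs (same parsing options as the first cell, so rows line up with X_final_test)
test_source_df = pd.read_csv("test.csv", encoding='utf-8', on_bad_lines='skip', low_memory=False)
test_ids = test_source_df['test_id'].values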


# Make Predictions using the Trained Model

test_dataset = sparse_to_dataset(X_final_test)  # Convert test data to dataset

# Ensure `best_model` is defined
try:
    best_model  # Check if the model exists
except NameError:
    raise ValueError("best_model is not defined. Please load the trained model.")

# Predict using the trained model
test_steps = -(-X_final_test.shape[0] // 64)  # Ceiling division so the final partial batch is included
test_predictions = best_model.predict(test_dataset, steps=test_steps)
test_predictions = (test_predictions > 0.5).astype(int)  # Convert to binary output
test_predictions = test_predictions.flatten()  # Ensure correct shape


# Prepare Submission File

submission_df = pd.DataFrame({
    "test_id": test_ids[:len(test_predictions)],  # Ensure ID count matches predictions
    "is_duplicate": test_predictions
})

# Save predictions to CSV
submission_csv_path = "test_predictions_tuned.csv"
submission_df.to_csv(submission_csv_path, index=False)
print(f"Test predictions saved to '{submission_csv_path}'")


36653/36653 ━━━━━━━━━━━━━━━━━━━━ 1259s 34ms/step
Test predictions saved to 'test_predictions_tuned.csv'


In this step, we convert the sparse test data into a format suitable for our trained model, make predictions, and generate a submission file.

## Key Steps
### 1. Convert Sparse Data to TensorFlow Dataset
- Uses a generator function to avoid memory issues.
- Batches the data for efficient processing.

### 2. Load and Verify Test Data
- Ensures `X_final_test` is available.
- Retrieves `test_id` values for submission.

### 3. Convert Test Data to TensorFlow Dataset
- Prepares the test data for model inference.

### 4. Make Predictions with the Trained Model
- Ensures `best_model` is available before inference.
- Converts probabilities to binary labels (0 or 1).

### 5. Prepare and Save Submission File
- Creates a DataFrame with `test_id` and `is_duplicate`.
- Saves predictions as `test_predictions_tuned.csv`.