# **Project Overview: Text Classification with TensorFlow**

## **Objective**

The goal of this project is to develop a text classification model using TensorFlow that categorizes text into predefined classes based on its content. Text classification has a wide range of applications, including sentiment analysis, spam detection, and topic categorization.


## **Dataset**

The dataset used for this project comprises questions collected from Quora. Each question is labeled to indicate whether it was identified as insincere. The objective is to build a text classifier that accurately predicts whether a question is sincere, using the provided labels. The dataset is already divided into a training file (`train.csv`) and test files (`test.csv` and `sample_submission.csv`).

**Key Details:**
- **Dataset Size**: The notebook trains on the first 150,000 questions of the training file and evaluates on 375,806 test questions.
- **Textual Data**: Questions asked on Quora.
- **Labels**: A binary `target` column, 1 for insincere questions and 0 otherwise.

## **Methodology**

1. **Data Preprocessing**:
   - **Tokenization**: Convert text into sequences of tokens (words or subwords).
   - **Padding**: Pad sequences to a common length so they can be batched into a single tensor.
   

2. **Model Building**:
   - **Architecture**: Develop a sequential neural network using TensorFlow’s Keras API.
     - **Embedding Layer**: Convert tokens into dense vectors.
     - **Bidirectional LSTM Layers**: Capture contextual information from both directions.
     - **Dropout Layers**: Apply regularization to prevent overfitting.
     - **Dense Output Layer**: Final classification layer for predicting text categories.

3. **Model Training**:
   - **Optimization**: Compile the model with an appropriate optimizer and loss function (Adam optimizer and binary cross-entropy for binary classification).
   - **Training Process**: Train the model on the training dataset.

4. **Model Evaluation**:
   - **Metrics**: The model is evaluated with accuracy, precision, recall, and AUC (the metrics passed to `model.compile`); the F1-score can be derived from precision and recall.

5. **Results**:
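   - **Performance Summary**: The model's accuracy on the test dataset is approximately 94.6% in the evaluation run recorded in this notebook; precision, recall, and AUC are reported as 0 in that run.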

6. **Deployment**:
   - **Integration**: The model can be used to filter out insincere comments on any platform with a comment option.
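
The filtering step described under deployment could look roughly like the sketch below. It assumes the model's sigmoid outputs are already available as a list of probabilities; the function name `filter_insincere`, the 0.5 threshold, and the example scores are all hypothetical and do not come from the notebook.

```python
def filter_insincere(questions, probabilities, threshold=0.5):
    """Keep only questions whose predicted insincerity probability is below threshold."""
    if len(questions) != len(probabilities):
        raise ValueError("questions and probabilities must have the same length")
    return [q for q, p in zip(questions, probabilities) if p < threshold]


# Illustrative inputs: the scores are made-up numbers, not real model output.
questions = [
    "Why does velocity affect time?",
    "An example question the model scored as likely insincere",
    "What are the treatments for pain when you move your eye?",
]
scores = [0.02, 0.91, 0.10]

kept = filter_insincere(questions, scores)
print(kept)
```

With a class as rare as the insincere one here, the threshold is a design choice: lowering it flags more questions at the cost of more false positives.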

## **Conclusion**

This project aims to build a robust text classification model that can effectively categorize textual data into predefined classes. By leveraging TensorFlow’s deep learning capabilities, the project will deliver a model with high accuracy and generalization ability. The findings and results will provide valuable insights into text classification and demonstrate practical applications in various domains.


In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("train.csv")  # Load the training data into a DataFrame
new_data = data.iloc[:150000]  # Keep only the first 150,000 rows (1.5 lakh)
new_data["target"].value_counts()  # Count how often each label value occurs


target,count
0,140733
1,9267
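
The counts above show a strong class imbalance, which matters when reading accuracy figures. The short calculation below uses only those two counts to work out the accuracy of a trivial classifier that always predicts class 0. It also computes "balanced" class weights with the common `total / (n_classes * count)` heuristic; the weighting is an illustration and was not used in the original notebook.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 140733, 1: 9267}
total = sum(counts.values())

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Balanced class weights: the rarer a class, the larger its weight.
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"class weights: {class_weight}")
```

An accuracy close to the baseline printed here does not by itself show that a model detects insincere questions, so precision and recall need to be read alongside it. A dictionary like `class_weight` could in principle be passed to `model.fit` through its `class_weight` argument.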


In [2]:
new_data.head() # Displaying first 5 rows

index,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [3]:
x_train_data = new_data["question_text"].values # Creating predictor variables
y_train_data = new_data["target"].values # Creating target variable
type(x_train_data)

numpy.ndarray

In [4]:
x_train_data

array(['How did Quebec nationalists see their province as a nation in the 1960s?',
       'Do you have an adopted dog, how would you encourage people to adopt and not shop?',
       'Why does velocity affect time? Does velocity affect space geometry?',
       ...,
       'Why does Xi jinping support genocidal leaders like Mao and Kim Jong-un?',
       "I don't mean to be rude but, I am tired of seeing Gordon Miller's answers to questions in my feed. How do I set my feed to hide his answers?",
       'What are the treatments for pain when you move your eye?'],
      dtype=object)

In [5]:
y_train_data

array([0, 0, 0, ..., 0, 0, 0])
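
The tokenization and padding steps listed in the methodology can be illustrated without TensorFlow. The sketch below is a simplified pure-Python stand-in, not the Keras implementation. The helper names `build_word_index`, `texts_to_ids`, and `pad_post` are hypothetical, and unlike the Keras `Tokenizer` this version only lowercases and splits on whitespace rather than also stripping punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace each known word with its id; words missing from the index are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen=None):
    """Right-pad with zeros to a common length; longer sequences keep their last maxlen ids."""
    if maxlen is None:
        maxlen = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:] if maxlen > 0 else []
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded


texts = ["Why does velocity affect time", "Does velocity affect space"]
word_index = build_word_index(texts)
sequences = texts_to_ids(texts, word_index)
padded = pad_post(sequences)
print(word_index)
print(padded)
```

Padding at the end of each sequence and truncating from the front mirrors `pad_sequences(..., padding='post')` with its default truncation setting.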

In [6]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Precision, Recall, AUC

text_data = x_train_data

# Fitting the tokenizer on the training questions to build the word index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)

# Converting each question to a sequence of integer token ids, then padding
# with trailing zeros to the length of the longest question
sequences = tokenizer.texts_to_sequences(text_data)
padded_sequences = pad_sequences(sequences, padding='post')

# Binary labels: 1D array of shape (150000,), one label per question
labels = y_train_data

# Defining the model (input_length is omitted: it is optional and deprecated in recent Keras)
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128),  # 128-dimensional token embeddings
    Bidirectional(LSTM(64, return_sequences=True)),  # Bidirectional LSTM that returns the full sequence
    Dropout(0.5),  # Dropout for regularization
    LSTM(64),  # Second (unidirectional) LSTM layer
    Dropout(0.5),  # Additional dropout for regularization
    Dense(1, activation='sigmoid')  # Single sigmoid unit for binary classification
])

# Compiling the model with the Adam optimizer and binary cross-entropy loss
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy', Precision(), AUC(), Recall()])

# Creating a TensorFlow dataset of (features, label) pairs in batches of 64
train_dataset = tf.data.Dataset.from_tensor_slices((padded_sequences, labels))
train_dataset = train_dataset.batch(64)

# Training the model for 10 epochs
model.fit(train_dataset, epochs=10)


Epoch 1/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 56s 19ms/step - accuracy: 0.9374 - auc: 0.5035 - loss: 0.2433 - precision: 0.0000e+00 - recall: 0.0000e+00
Epoch 2/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 75s 19ms/step - accuracy: 0.9375 - auc: 0.5080 - loss: 0.2371 - precision: 0.1151 - recall: 0.0016
Epoch 3/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 80s 18ms/step - accuracy: 0.9381 - auc: 0.4996 - loss: 0.2355 - precision: 0.0000e+00 - recall: 0.0000e+00
Epoch 4/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 43s 18ms/step - accuracy: 0.9381 - auc: 0.5071 - loss: 0.2343 - precision: 0.0030 - recall: 4.3414e-06
Epoch 5/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 83s 19ms/step - accuracy: 0.9430 - auc: 0.8889 - loss: 0.1520 - precision: 0.5750 - recall: 0.2504
Epoch 6/10
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 80s 18ms/step - accuracy: 0.9611 - auc:

<keras.src.callbacks.history.History at 0x7b1bb7def820>

In [7]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Loading the test data
test_data = pd.read_csv("test.csv")
test_data_labels = pd.read_csv("sample_submission.csv")

x_test = test_data["question_text"].values  # Test predictor variables
# Caution: sample_submission.csv appears to be a submission template, so its
# "prediction" column may hold placeholder values rather than true labels; the
# zero precision, recall, and AUC in the output below are consistent with that.
y_test = test_data_labels["prediction"].values  # Test target variable
# Ensuring all data is string type
x_test = [str(text) for text in x_test]

# Reusing the tokenizer fitted on the training data in the training cell,
# so test tokens map to the same ids the model was trained on
sequences = tokenizer.texts_to_sequences(x_test)

# Padding (or truncating) to the sequence length used during training
max_length = 63
padded_sequences = pad_sequences(sequences, padding='post', maxlen=max_length)

# Verifying the shape
print("Adjusted Shape of padded_sequences:", padded_sequences.shape)

# Creating a TensorFlow dataset from the test sequences and labels
test_dataset = tf.data.Dataset.from_tensor_slices((padded_sequences, y_test))
test_dataset = test_dataset.batch(64)

# Evaluating the model; returns [loss, accuracy, precision, auc, recall]
model.evaluate(test_dataset)




Adjusted Shape of padded_sequences: (375806, 63)
5872/5872 ━━━━━━━━━━━━━━━━━━━━ 38s 6ms/step - accuracy: 0.9456 - auc: 0.0000e+00 - loss: 0.1787 - precision: 0.0000e+00 - recall: 0.0000e+00


[0.17627988755702972, 0.9457432627677917, 0.0, 0.0, 0.0]
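
The methodology lists the F1-score, which the Keras metrics above do not report directly. It can be derived from precision and recall as their harmonic mean. The helper below is a minimal sketch (the name `f1_score` is chosen here for illustration), applied to the epoch-5 training values and to the test values from the logs in this notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; taken as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Epoch 5 of the training log: precision 0.5750, recall 0.2504.
train_f1 = f1_score(0.5750, 0.2504)

# Test evaluation above: precision 0.0, recall 0.0.
test_f1 = f1_score(0.0, 0.0)

print(f"epoch-5 training F1: {train_f1:.4f}")
print(f"test F1: {test_f1:.4f}")
```

The zero-division guard matters here because the test evaluation reports both precision and recall as 0, despite an accuracy above 94%.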