# ParaphraseAI: Building and Deploying a Text Paraphrasing Model with Docker and Kubernetes

# Objective
Develop a simple AI model and set up a deployment pipeline using Docker and Kubernetes. The candidate will also need to create a basic web service (in either Python or Go) that interacts with the AI model and stores its results in a MySQL database.

# Import Libraries

In [1]:
import pandas as pd
import re
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Concatenate
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy
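import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# The stop-word list and WordNet data must be available locally before they are used
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)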

import warnings
warnings.filterwarnings('ignore')

# Load Dataset

In [2]:
train_df = pd.read_csv(r"C:\Users\bodak\Downloads\quora-question-pairs\train.csv\train.csv")
test_df = pd.read_csv(r"C:\Users\bodak\Downloads\quora-question-pairs\test.csv\test.csv")

In [3]:
train_df.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [4]:
test_df.head(10)

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,2,What but is the best way to send money from Ch...,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?
5,5,How are the two wheeler insurance from Bharti ...,I admire I am considering of buying insurance ...
6,6,How can I reduce my belly fat through a diet?,How can I reduce my lower belly fat in one month?
7,7,"By scrapping the 500 and 1000 rupee notes, how...",How will the recent move to declare 500 and 10...
8,8,What are the how best books of all time?,What are some of the military history books of...
9,9,After 12th years old boy and I had sex with a ...,Can a 14 old guy date a 12 year old girl?


# Data Preprocessing 

In [5]:
train_df.drop(['id', 'qid1', 'qid2'], axis=1, inplace=True)
test_df.drop(['test_id'], axis=1, inplace=True)

In [6]:
train_df.isna().sum()

question1       1
question2       2
is_duplicate    0
dtype: int64

In [7]:
test_df.isna().sum()

question1    4
question2    6
dtype: int64

In [8]:
train_df = train_df.dropna()
test_df = test_df.dropna()

In [9]:
train_df.isna().sum()

question1       0
question2       0
is_duplicate    0
dtype: int64

In [10]:
test_df.isna().sum()

question1    0
question2    0
dtype: int64

In [11]:
train_df

Unnamed: 0,question1,question2,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
...,...,...,...
404285,How many keywords are there in the Racket prog...,How many keywords are there in PERL Programmin...,0
404286,Do you believe there is life after death?,Is it true that there is life after death?,1
404287,What is one coin?,What's this coin?,0
404288,What is the approx annual cost of living while...,I am having little hairfall problem but I want...,0


In [12]:
test_df.head()

Unnamed: 0,question1,question2
0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,What but is the best way to send money from Ch...,What you send money to China?
3,Which food not emulsifiers?,What foods fibre?
4,"How ""aberystwyth"" start reading?",How their can I start reading?


In [13]:
# Select a random subset of 5,000 rows
subset_size = 5000
train_df = train_df.sample(n=subset_size, random_state=42)

In [14]:
train_df

Unnamed: 0,question1,question2,is_duplicate
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0
224279,Will a breathing treatment help a cough?,How can I help someone that is unconscious but...,0
252452,Is Kellyanne Conway annoying in your opinion?,Did Kellyanne Conway really imply that we shou...,0
174039,How do you rate (1-10) and review Maruti Baleno?,What career options does one have after comple...,0
384863,What are some good books on marketing?,What are some of the best books ever written a...,1
...,...,...,...
222761,Why India is building Aircraft carrier's inste...,Is India looking for a fourth aircraft carrier?,0
369359,"If a die is rolled, what is the probability th...",If a die is rolled. what is the probability th...,0
46468,How do i stop thinking about someone?,How do I stop thinking about myself?,0
384046,What are some tips on making it through the jo...,Biology project the effect of sound on plant i...,0


# Model Building

In [15]:
# Build the stop-word set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Define the preprocessing function
def preprocess_text(text):
    # Lowercase the text and remove punctuation
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize on whitespace
    words = text.split()

    # Remove stop words
    words = [word for word in words if word not in STOP_WORDS]

    # Lemmatize each remaining word
    words = [LEMMATIZER.lemmatize(word) for word in words]

    # Remove short words (length less than 3)
    words = [word for word in words if len(word) > 2]

    # Join the processed words back into a single string and return it
    return ' '.join(words)



# Copy the sample so the preprocessing below does not modify train_df in place
df = train_df.copy()

# Preprocess the data
df['question1'] = df['question1'].apply(preprocess_text)
df['question2'] = df['question2'].apply(preprocess_text)

# Split the dataset into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df['question1'].tolist() + train_df['question2'].tolist())

train_sequences1 = tokenizer.texts_to_sequences(train_df['question1'].tolist())
train_sequences2 = tokenizer.texts_to_sequences(train_df['question2'].tolist())

val_sequences1 = tokenizer.texts_to_sequences(val_df['question1'].tolist())
val_sequences2 = tokenizer.texts_to_sequences(val_df['question2'].tolist())

# Pad the sequences to a fixed length
max_length = 100
train_sequences1 = pad_sequences(train_sequences1, maxlen=max_length)
train_sequences2 = pad_sequences(train_sequences2, maxlen=max_length)

val_sequences1 = pad_sequences(val_sequences1, maxlen=max_length)
val_sequences2 = pad_sequences(val_sequences2, maxlen=max_length)

# Calculate the vocabulary size
vocabulary_size = len(tokenizer.word_index) + 1

# Create the Siamese network model
input1 = Input(shape=(max_length,))
input2 = Input(shape=(max_length,))

embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=128)
lstm_layer = Bidirectional(LSTM(128))
dense_layer = Dense(128, activation='relu')
output_layer = Dense(1, activation='sigmoid')

encoded1 = lstm_layer(embedding_layer(input1))
encoded2 = lstm_layer(embedding_layer(input2))

merged = Concatenate(axis=-1)([encoded1, encoded2])
dense_output = dense_layer(merged)
output = output_layer(dense_output)

model = Model(inputs=[input1, input2], outputs=output)

# Compile the model
model.compile(optimizer=Adam(),
              loss='binary_crossentropy',
              metrics=[BinaryAccuracy()])

# Train the model
model.fit([train_sequences1, train_sequences2], train_df['is_duplicate'],
          epochs=10,
          batch_size=30,
          validation_data=([val_sequences1, val_sequences2], val_df['is_duplicate']))

# Save the model
model.save('quora_question_similarity_model.h5')

# Load the model for later use
loaded_model = tf.keras.models.load_model('quora_question_similarity_model.h5')

# Evaluate the model on the validation set
evaluation = loaded_model.evaluate([val_sequences1, val_sequences2], val_df['is_duplicate'])
print("Evaluation Loss: {:.4f}, Accuracy: {:.4f}".format(evaluation[0], evaluation[1]))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Evaluation Loss: 2.6535, Accuracy: 0.6730
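The validation accuracy of 0.6730 reported above is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. The sketch below shows one way to compute it. The helper name `majority_baseline` and the label list are made up for illustration; in the notebook the labels would come from `val_df['is_duplicate'].tolist()`.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels, NOT taken from the dataset
sample_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

label, accuracy = majority_baseline(sample_labels)
print(f"Always predicting {label} scores {accuracy:.2f}")
```

If the baseline on the real validation labels turns out to be close to the model's accuracy, the network has learned little beyond the class prior.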


# Model Testing

In [20]:
# Select a random subset of 5,000 rows
subset_size = 5000
test_df = test_df.sample(n=subset_size, random_state=42)

In [22]:
# Preprocess the test data
test_df['question1'] = test_df['question1'].apply(preprocess_text)
test_df['question2'] = test_df['question2'].apply(preprocess_text)

# Tokenize and pad the sequences
test_sequences1 = tokenizer.texts_to_sequences(test_df['question1'].tolist())
test_sequences2 = tokenizer.texts_to_sequences(test_df['question2'].tolist())

test_sequences1 = pad_sequences(test_sequences1, maxlen=max_length)
test_sequences2 = pad_sequences(test_sequences2, maxlen=max_length)

# Load the trained model
loaded_model = tf.keras.models.load_model('quora_question_similarity_model.h5')

# Predict labels for the test set
test_predictions = loaded_model.predict([test_sequences1, test_sequences2])

# Display the predicted duplicate probabilities (not yet thresholded into labels)
print(test_predictions)

[[2.5128532e-08]
 [1.7763105e-03]
 [1.2724269e-07]
 ...
 [1.5297836e-06]
 [1.6084478e-04]
 [1.0628735e-07]]
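The test questions pass through the same `preprocess_text` function as the training data, so it helps to see how much of a question survives it. The following is a simplified stand-in rather than the notebook's function: it uses a small hand-written stop-word set (an assumption, in place of the NLTK English list) and skips lemmatization, but keeps the lowercase, punctuation, stop-word, and short-word steps.

```python
import re

# Tiny stand-in for the NLTK stop-word list (assumption, for illustration only)
DEMO_STOP_WORDS = {"how", "can", "i", "be", "a", "what", "is", "do", "in", "the", "to"}


def simple_clean(text):
    """Lowercase, strip punctuation, drop stop words and words shorter than 3 characters."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    words = [w for w in words if len(w) > 2]
    return " ".join(words)


print(repr(simple_clean("How can I be a good geologist?")))
print(repr(simple_clean("What is AI?")))
```

The first call returns `'good geologist'`, while the second returns an empty string, because every word is either a stop word or shorter than three characters. Questions that collapse this way reach the model as all-padding sequences, which may be worth checking for before interpreting the predictions.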


In [25]:
# Set the threshold
threshold = 0.5

# Convert probabilities to binary predictions using the threshold
binary_predictions = (test_predictions >= threshold).astype(int)
binary_predictions

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])