
# Project 2

### Denish Kalariya
### DMK220001

# Dataset Description:
This dataset provides 140k questions and answers related to Python coding. It contains both simple topics, such as basic data types, and more complex problems dealing with object-oriented programming. In order to gain the most from this dataset it is important to understand how best to use it for learning or development.

For Learning
The questions and answers are formatted so that they can be used for study or practice in understanding the fundamentals of programming in Python. To get started, you could review the types of topics included in the questions (e.g., data types) by exploring several different examples or start by reading an answer related to a topic of interest including appointed code examples. Analysis of multiple examples can allow you gain a better understanding of how each topic works before attempting any coding exercises yourself!

To further cement your understanding, creating your own practice projects and writing sample code through trial-and-error guided from the given datasets is an effective way learn beyond just memorizing facts or syntax rules from books or web tutorials on a given language. Over time, patterns between problems will become easier to recognize and solve quicker over time!

### For Development: AI Model Training for Code Assistants

For developers, the Glaive code assistant dataset is invaluable resource when it comes to training machine learning models for creating AI natural language processing applications like automated coding assistances since each question has an associated answer direction written out clearly explained with kept succinctly with relevant code snippets available as needed depending on complexity level required . With enough training data points (questions/answer pairs), models can be trained that provide robust advice tailored towards whatever particular problem may arise based on user input queries parsed through model’s functions~

This file contains a dataset of Python code problems and solutions in a QA format for developing intelligent code assistants

Answer: Stores the answer strings associated with each question. (String)
Answer: Stores the answer strings associated with each question. (String)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
data_path = '/kaggle/input/glaive-python-code-qa-dataset/train.csv'
data = pd.read_csv(data_path)
# Display the first few rows of the dataset to understand its structure
print(data.head())
# Display summary statistics and information about the dataset
# Display basic descriptive statistics
print(data.describe())

# Display missing values per column
print(data.isnull().sum())


                                              answer  \
0  Yes, you can format the output text in Bash to...   
1  To install Python 3 on an AWS EC2 instance, yo...   
2  You can achieve the desired time format using ...   
3  Your current implementation is actually quite ...   
4  The use of 'self' in Python is quite different...   

                                            question  
0  How can I output bold text in Bash? I have a B...  
1  How can I install Python 3 on an AWS EC2 insta...  
2  How can I format the elapsed time from seconds...  
3  I am trying to create a matrix of random numbe...  
4  I am learning Python and have noticed extensiv...  
          answer                 question
count     136108                   136109
unique    136107                   135564
top     <Answer>  What does this code do?
freq           2                       19
answer      1
question    0
dtype: int64


In [16]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Concatenate



In [17]:
data = data[:5000]
data['question'] = data['question'].apply(lambda x: x.lower().replace('[^\w\s]', ''))
data['answer'] = data['answer'].apply(lambda x: x.lower().replace('[^\w\s]', ''))

In [18]:
# Prepare input and output pairs
input_texts = data['question'].values
target_texts = ['\t' + text + '\n' for text in data['answer'].values]  # Add start and end tokens

# Tokenization and sequence conversion
tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts + target_texts)
input_sequences = tokenizer.texts_to_sequences(input_texts)
target_sequences = tokenizer.texts_to_sequences(target_texts)

# Padding sequences
max_seq_length = max(max(len(seq) for seq in input_sequences), max(len(seq) for seq in target_sequences))
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_length, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_seq_length, padding='post')


In [19]:
# Model building
embedding_dim = 256
latent_dim = 1024  # Latent dimensionality of the encoding space.

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)  # Change here: return_sequences=True
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

attention = Attention()
attn_out = attention([decoder_outputs, encoder_outputs])  # Ensure both inputs are 3D
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attn_out])
decoder_dense = Dense(len(tokenizer.word_index) + 1, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model (ensure all components and imports are properly defined as earlier explained)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model with sparse_categorical_crossentropy
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [30]:
model.fit([input_sequences, target_sequences], target_sequences, batch_size=16, epochs=3)

Epoch 1/3
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m377s[0m 1s/step - accuracy: 0.7785 - loss: 1.5762
Epoch 2/3
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m377s[0m 1s/step - accuracy: 0.8060 - loss: 1.3513
Epoch 3/3
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m377s[0m 1s/step - accuracy: 0.8293 - loss: 1.2033


<keras.src.callbacks.history.History at 0x7b5f1476ae00>

Prediction with under trained model

In [31]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_response(input_text, max_len):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence to the expected length
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    
    # Initialize an empty sequence for the decoder input
    decoder_input_sequence = np.zeros((1, 1))  # Start with sequence of length 1
    
    # Prepare to collect the response
    decoded_sentence = ''
    while True:
        predictions = model.predict([input_sequence, decoder_input_sequence])
        predicted_id = np.argmax(predictions[0, -1, :])  # Get the last token in the sequence
        
        # Break if the predicted token is 0 (often used as padding)
        if predicted_id == 0:
            break
        
        # Append the predicted token to the decoder input sequence
        next_word = tokenizer.index_word.get(predicted_id, '')  # Default to '' if not found
        decoded_sentence += next_word + ' '
        
        # Update the decoder input sequence to include the predicted token
        decoder_input_sequence = np.pad(decoder_input_sequence[0], (0, 1), 'constant', constant_values=predicted_id)
        decoder_input_sequence = np.expand_dims(decoder_input_sequence, axis=0)
        
        # Optional: Break if the decoded sentence reaches a certain length to prevent overly long responses
        if len(decoded_sentence.split()) > max_len - 1:
            break

    return decoded_sentence.strip()

# Example usage
max_len = 20  # Define this based on your model's training configuration
input_text = "What is Python function ?"
response = generate_response(input_text, max_len)
print("Response:", response)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
Response: this is is is is is is is is is is is


In [None]:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load and preprocess data
data_path = '/kaggle/input/glaive-python-code-qa-dataset/train.csv'  # Modify this to your data file path
data = pd.read_csv(data_path)

# Convert all entries to strings and preprocess
data['question'] = data['question'].astype(str).apply(lambda x: x.lower().replace('[^\w\s]', ''))
data['answer'] = data['answer'].astype(str).apply(lambda x: x.lower().replace('[^\w\s]', ''))

data = data[:5000]
# Prepare input and output pairs
input_texts = data['question'].values
target_texts = ['\t' + text + '\n' for text in data['answer'].values]  # Add start and end tokens

# Tokenization and sequence conversion
tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts + target_texts)
input_sequences = tokenizer.texts_to_sequences(input_texts)
target_sequences = tokenizer.texts_to_sequences(target_texts)

# Padding sequences
max_seq_length = max(max(len(seq) for seq in input_sequences), max(len(seq) for seq in target_sequences))
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_length, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_seq_length, padding='post')

# Model building
embedding_dim = 256
latent_dim = 1024  # Latent dimensionality of the encoding space.

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
attention = Attention()
attn_out = attention([decoder_outputs, encoder_outputs])
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attn_out])
decoder_dense = Dense(len(tokenizer.word_index) + 1, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)

# Define and compile the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit([input_sequences, target_sequences], np.expand_dims(target_sequences, -1), batch_size=16, epochs=1)

# Function to generate response
def generate_response(input_text, max_len):
    input_sequence = tokenizer.texts_to_sequences([input_text])
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    decoder_input_sequence = np.zeros((1, 1))

    decoded_sentence = ''
    while True:
        predictions = model.predict([input_sequence, decoder_input_sequence])
        predicted_id = np.argmax(predictions[0, -1, :])
        if predicted_id == 0:
            break
        next_word = tokenizer.index_word.get(predicted_id, '')
        decoded_sentence += next_word + ' '
        decoder_input_sequence = np.pad(decoder_input_sequence[0], (0, 1), 'constant', constant_values=predicted_id)
        decoder_input_sequence = np.expand_dims(decoder_input_sequence, axis=0)
        if len(decoded_sentence.split()) > max_len - 1:
            break

    return decoded_sentence.strip()

# Example usage
input_text = "What is Python function ?"
response = generate_response(input_text, max_seq_length)
print("Response:", response)


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m383s[0m 1s/step - accuracy: 0.7301 - loss: 2.5486
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 241ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 246ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━

# ChatBot 1 : Attention based Encoder-Decoder Model

This model is designed to generate contextually appropriate responses based on user inputs, making use of sequence-to-sequence learning typically employed in machine translation and chatbot applications.
I have taken only first 10000 rows to train the model due resource constraints.

Key Components:
- Encoder: Processes the input text and converts it into a context vector.
- Decoder: Uses the context vector to generate output text step by step.
- Attention Mechanism: Enhances the model's ability to focus on relevant parts of the input during the decoding process, improving the relevance and specificity of responses.


Role: Acts as the primary response generation engine. When a user query is received, this model processes the text to generate a coherent and contextually relevant response.
Data Flow: User inputs are preprocessed, tokenized, and fed into the encoder. The decoder then constructs a response, guided by the attention mechanism, which is delivered back to the user.


Techniques:
- Text Preprocessing: Includes converting characters from Unicode to ASCII, removing non-alphabetic characters, and handling contractions to clean and standardize the text.
- Tokenization and Padding: Converts text to sequences of integers and ensures that sequences are padded to a consistent length for modelling.
- Embedding: Transforms tokenized text into dense vectors that capture semantic meanings.
- LSTM with Dropout: Enhances the model's generalization by randomly dropping units (dropout) during training to prevent overfitting.


In [6]:
from datasets import load_dataset, load_metric
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate, Dropout
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import unicodedata
import pandas as pd
import numpy as np

Text - Generation 

### Using first 10000 values of the dataset to train the model

In [8]:
dataset = pd.read_csv("/kaggle/input/glaive-python-code-qa-dataset/train.csv")
dataset = dataset[:10000]
dataset

Unnamed: 0,answer,question
0,"Yes, you can format the output text in Bash to...",How can I output bold text in Bash? I have a B...
1,"To install Python 3 on an AWS EC2 instance, yo...",How can I install Python 3 on an AWS EC2 insta...
2,You can achieve the desired time format using ...,How can I format the elapsed time from seconds...
3,Your current implementation is actually quite ...,I am trying to create a matrix of random numbe...
4,The use of 'self' in Python is quite different...,I am learning Python and have noticed extensiv...
...,...,...
9995,"Implementing a ""Did you mean?"" feature without...","How can I implement a ""Did you mean?"" feature,..."
9996,"Yes, you can open a website via a proxy in Pyt...","In Python, I am trying to open a website via a..."
9997,To extract a substring from a string after a s...,How can I extract a substring from a given str...
9998,"In Python, creating an 'empty if statement' is...",How can I create an 'empty if statement' in Py...


In [9]:
stop_words = set(stopwords.words('english'))
contractions = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have", "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

In [10]:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')

# Function for preprocessing text
def preprocess_text(text):
    # Convert text to lowercase
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub("(\\W)"," ",text) 
    text = re.sub('\S*\d\S*\s*','', text)
    text = ' '.join([contractions[word] if word in contractions else word for word in text.split()])
    text =  "<sos> " +  text + "<eos>"
    return text

# Apply preprocessing to question and answer columns
preprocessed_df = dataset.copy()
preprocessed_df['question'] = preprocessed_df['question'].apply(preprocess_text)
preprocessed_df['answer'] = preprocessed_df['answer'].apply(preprocess_text)

# Print the preprocessed DataFrame
preprocessed_df

Unnamed: 0,answer,question
0,<sos> yes you can format the output text in ba...,<sos> how can i output bold text in bash i hav...
1,<sos> to install python on an aws instance you...,<sos> how can i install python on an aws insta...
2,<sos> you can achieve the desired time format ...,<sos> how can i format the elapsed time from s...
3,<sos> your current implementation is actually ...,<sos> i am trying to create a matrix of random...
4,<sos> the use of self in python is quite diffe...,<sos> i am learning python and have noticed ex...
...,...,...
9995,<sos> implementing a did you mean feature with...,<sos> how can i implement a did you mean featu...
9996,<sos> yes you can open a website via a proxy i...,<sos> in python i am trying to open a website ...
9997,<sos> to extract a substring from a string aft...,<sos> how can i extract a substring from a giv...
9998,<sos> in python creating an empty if statement...,<sos> how can i create an empty if statement i...


In [11]:
# Preprocessing the data
questions = preprocessed_df['question'].values.tolist()
answers = preprocessed_df['answer'].values.tolist()

# Tokenizing the data
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(np.concatenate((questions, answers), axis=0))

vocab_size = len(tokenizer.word_index) + 1

# Convert text to sequences
question_seqs = tokenizer.texts_to_sequences(questions)
answer_seqs = tokenizer.texts_to_sequences(answers)

# Padding sequences for equal length
# Pad sequences for equal length
max_len_question = max(len(seq) for seq in question_seqs)
max_len_answer = max(len(seq) for seq in answer_seqs)
max_len = max(max_len_question, max_len_answer)
max_len = 60
print(max_len)
# Pad sequences separately for questions and answers
question_seqs = pad_sequences(question_seqs, maxlen=max_len, padding='post')
answer_seqs = pad_sequences(answer_seqs, maxlen=max_len, padding='post')

60


In [12]:
tokenizer.texts_to_sequences("<sos>")
tokenizer.word_index["<sos>"]

18

In [13]:
# Define the model architecture
latent_dim = 256  # Dimensionality of the encoding space

# Encoder
encoder_inputs = Input(shape=(max_len,))
encoder_embedding = Embedding(vocab_size, latent_dim, input_shape=(max_len,))
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding(encoder_inputs))
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(max_len-1,))
decoder_embedding = Embedding(vocab_size, latent_dim, input_shape=(max_len-1,))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding(decoder_inputs), initial_state=encoder_states)

# Attention mechanism
attention_layer = Attention()
attention_output = attention_layer([decoder_outputs, encoder_outputs])

# Concatenate attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attention_output])

# Add dropout layer for regularization
decoder_concat_input = Dropout(0.1)(decoder_concat_input)

# Output layer
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)

  super().__init__(**kwargs)


# The Model Architecture

In [14]:
# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Print model summary
model.summary()

In [15]:
# Train the model
model.fit([question_seqs, answer_seqs[:, :-1]], answer_seqs[:, 1:],
          batch_size=32,
          epochs=100,
          validation_split=0.2)

Epoch 1/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m54s[0m 188ms/step - loss: 7.6368 - val_loss: 6.3762
Epoch 2/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 184ms/step - loss: 6.1442 - val_loss: 5.9972
Epoch 3/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 182ms/step - loss: 5.7289 - val_loss: 5.7695
Epoch 4/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m45s[0m 182ms/step - loss: 5.4263 - val_loss: 5.6021
Epoch 5/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 183ms/step - loss: 5.1377 - val_loss: 5.4509
Epoch 6/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m45s[0m 182ms/step - loss: 4.8880 - val_loss: 5.3467
Epoch 7/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 182ms/step - loss: 4.6564 - val_loss: 5.2797
Epoch 8/100
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 184ms/step - loss: 4.4487 - val_loss: 5.2356
Epoch 9/

<keras.src.callbacks.history.History at 0x784ce76110f0>

In [16]:
#saving the model
model.save('mymodelEDA.h5')

In [17]:
loader = load_model('/kaggle/working/mymodelEDA.h5')

In [18]:
# To get the inference
def generate_response(input_text):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    
    # Initialize the decoder input sequence with start token
    decoder_input_sequence = np.zeros((1, max_len-1))
    decoder_input_sequence[0, 0] = tokenizer.word_index.get('<sos>', 0)  # Safely get '<sos>' index or default to 0
    
    # Generate response using the trained model
    for i in range(max_len - 2):
        predictions = loader.predict([input_sequence, decoder_input_sequence])
        predicted_id = np.argmax(predictions[0, i, :])
        if tokenizer.word_index.get('<eos>') == predicted_id:  # Safely check for '<eos>'
            break
        decoder_input_sequence[0, i+1] = predicted_id
    
    # Convert output sequence to text
    output_text = ''
    for token_index in decoder_input_sequence[0]:
        if token_index == tokenizer.word_index.get('<eos>', 0) or token_index == 0:  # Safely check for '<eos>'
            break
        output_text += tokenizer.index_word.get(token_index, '') + ' '  # Safely get word or default to empty string
    
    return output_text.strip()

# Test the function with input "how are you"
input_text = "How timestamp is used in python?"
response = generate_response(input_text)
print("Response:", response)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 400ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5

## ChatBot for Attention Based Model
## You can find the working chatbot conversation below

In [27]:
import os
import re
import spacy
import joblib
import json
import tensorflow as tf  # Import TensorFlow
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the spaCy model for NLP tasks
nlp = spacy.load("en_core_web_sm")

# Load a Keras model (.h5 file)
try:
    model = tf.keras.models.load_model('mymodelEDA.h5')
    print("Keras model loaded successfully.")
except Exception as e:
    print(f"Failed to load Keras model: {str(e)}")


# Check if the model has a predict method
if hasattr(model, 'predict'):
    print("Model loaded successfully and it can predict.")
else:
    print("Loaded object is not a model with a predict method.")

    ## Load or Create USer model from the directory
def load_or_create_user_data(name):
    filename = f"{name.lower()}_data.json"
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            return json.load(f)
    else:
        return {"name": name, "likes": [], "dislikes": [], "personal_info": {}}

    ## Save the user data in JSON Format
def save_user_data(user_data):
    filename = f"{user_data['name'].lower()}_data.json"
    with open(filename, 'w') as f:
        json.dump(user_data, f, indent=4)

## Extract the Features for User Model
def extract_and_update_user_data(text, user_data):
    response = ""
    doc = nlp(text)
    greeting_words = ['hello', 'hi', 'hey', 'hii', 'greetings']

    # Handle greetings
    if any(greeting in text.lower() for greeting in greeting_words):
        response += "Hello! How can I assist you today?\n"

    # Extract personal info from entities
    for ent in doc.ents:
        user_data['personal_info'][ent.label_.lower()] = ent.text
        if ent.label_ == "PERSON":
            response += f"Nice to learn more about you, {ent.text}!\n"
        elif ent.label_ == "ORG":
            response += f"Interesting to hear you are involved with {ent.text}.\n"
        elif ent.label_ == "LOC":
            response += f"Great to know you are from {ent.text}.\n"
        elif ent.label_ == "DATE":
            response += f"Noted, the date {ent.text} is important to you.\n"

    # Process likes and dislikes with refined patterns
    likes_dislikes_patterns = {
        'like': [r"\blike[s]? (\w+)", r"\benjoy[s]? (\w+)", r"\bam into (\w+)"],
        'dislike': [r"\bdislike[s]? (\w+)", r"\bhat[e]? (\w+)", r"\bcan't stand (\w+)"]
    }

    for key, patterns in likes_dislikes_patterns.items():
        for pattern in patterns:
            found_items = re.findall(pattern, text, re.I)
            for item in found_items:
                if item not in user_data[key+'s']:
                    user_data[key+'s'].append(item)
                    response += f"It's great that you {key} {item}.\n"
                else:
                    response += f"You still {key} {item}, good to know!\n"

    return response

## Generate the responses using the model
def generate_response(input_text,user_model):
    try:
        # Tokenize the input text
        input_sequence = tokenizer.texts_to_sequences([input_text])
        # Pad the input sequence
        input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
        
        # Initialize the decoder input sequence with start token
        decoder_input_sequence = np.zeros((1, max_len-1))
        decoder_input_sequence[0, 0] = tokenizer.word_index.get('<sos>', 0)  # Safely get '<sos>' index or default to 0
        
        # Generate response using the trained model
        for i in range(max_len - 2):
            predictions = model.predict([input_sequence, decoder_input_sequence])
            predicted_id = np.argmax(predictions[0, i, :])
            if tokenizer.word_index.get('<eos>') == predicted_id:  # Safely check for '<eos>'
                break
            decoder_input_sequence[0, i+1] = predicted_id
        
        # Convert output sequence to text
        output_text = ''
        for token_index in decoder_input_sequence[0]:
            if token_index == tokenizer.word_index.get('<eos>', 0) or token_index == 0:  # Safely check for '<eos>'
                break
            output_text += tokenizer.index_word.get(token_index, '') + ' '  # Safely get word or default to empty string
        
        return output_text.strip()
    except Exception as e:
        return f"Error processing your request: {str(e)}"

## Main Chat Function , Stream Lines the whole structure and functionality
def chat():
    print("Hi, I'm your Friendly ChatBot. What's your name?")
    name = input("Enter your name: ")
    user_data = load_or_create_user_data(name)
    print(f"Welcome back, {name}!" if os.path.exists(f"{name.lower()}_data.json") else f"Nice to meet you, {name}!")

    while True:
        user_input = input(f"{name}: ")
        if user_input.lower() == "quit":
            print("Bot: Goodbye!")
            break

        personal_response = extract_and_update_user_data(user_input, user_data)
        if personal_response:
            print(f"Bot: {personal_response}")
        else:
            generated_response = generate_response(user_input, user_data)
            print(f"Bot: {generated_response}")

        save_user_data(user_data)


if __name__ == "__main__":
    chat()


Keras model loaded successfully.
Model loaded successfully and it can predict.
Hi, I'm your Friendly ChatBot. What's your name?


Enter your name:  Denish


Welcome back, Denish!


Denish:  hello


Bot: Hello! How can I assist you today?



Denish:  what is python ?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 381ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5

Denish:  how to import tensorflow model?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50

Denish:  How to multiply array in python ?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52

Denish:  I like coffee 


Bot: It's great that you like coffee.



Denish:  i dont like tea


Bot: It's great that you like tea.



Denish:  i hate tea


Bot: It's great that you dislike tea.



Denish:  I am doing my master's at University of Texas at Dallas


Bot: Interesting to hear you are involved with University of Texas.



Denish:  Which library in python to visualise the data?


Bot: Hello! How can I assist you today?



Denish:  python library to visualize the data


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56

Denish:  models used for text classifications


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52

Denish:  quit


Bot: Goodbye!


in case if you cant see:

Above question was: enish:  python library to visualize the data

In [28]:
import json

# Path to your JSON file
file_path = '/kaggle/working/denish_data.json'

# Open the file and load the data
with open(file_path, 'r') as file:
    data = json.load(file)

# Print the data
print(json.dumps(data, indent=4))

{
    "name": "Denish",
    "likes": [
        "coffee",
        "tea"
    ],
    "dislikes": [
        "tea"
    ],
    "personal_info": {
        "org": "University of Texas",
        "gpe": "Dallas"
    }
}


# SKLearn Based training of The model

## Using NB

## Text Classification and Response Retrieval with Naive Bayes

This script demonstrates how to use a Naive Bayes classifier to build a text classification model using Python's scikit-learn library. The goal is to predict answers based on questions from a dataset, which can simulate a simple question-answer retrieval system.

### Data Loading and Preprocessing

1. **Data Loading**:
   - Data is loaded from a CSV file into a Pandas DataFrame. Adjust the `data_path` variable as necessary to point to the correct file location.
   - The dataset is limited to the first 20,000 rows for this example to manage memory and computational efficiency.

2. **Data Cleaning**:
   - Both 'question' and 'answer' columns are converted to lowercase to standardize the text.
   - Non-alphanumeric characters are removed to simplify the text. This is done using the `str.replace` method with a regex that filters out anything that's not a word character or whitespace.

### Model Training and Evaluation

1. **Data Splitting**:
   - The dataset is split into training and testing sets using a 80/20 split. This allows for model training on 80% of the data and evaluation on the remaining 20%.

2. **Pipeline Creation**:
   - A pipeline is created with `TfidfVectorizer` and `MultinomialNB`:
     - `TfidfVectorizer` converts text data into a format suitable for model training by computing the Term Frequency-Inverse Document Frequency (TF-IDF) of each word.
     - `MultinomialNB` is a Naive Bayes classifier that is suitable for classification with discrete features (like word counts for text classification).

3. **Model Training**:
   - The model is trained on the prepared training data (`X_train` and `y_train`).

4. **Model Evaluation**:
   - Predictions are made on the test data.
 

### Function for Response Retrieval

- A function `get_response` is defined to retrieve responses based on input questions:
  - It uses the trained model to predict the 'answer' directly from an input 'question'.
  - This simplistic approach assumes a direct mapping between questions and answers as found in the training data.

### Testing the System

- The system is tested with a sample query, and the response is printed out to verify the model's behavior.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


# Load data from the CSV file
data_path = '/kaggle/input/glaive-python-code-qa-dataset/train.csv'  # Adjust path if needed
data = pd.read_csv(data_path)
data = data[:20000]
# Basic cleaning and preprocessing
data['question'] = data['question'].astype(str).str.lower().str.replace('[^\w\s]', '')
data['answer'] = data['answer'].astype(str).str.lower().str.replace('[^\w\s]', '')

# Split data into training and testing to evaluate the model
X_train, X_test, y_train, y_test = train_test_split(data['question'], data['answer'], test_size=0.2, random_state=42)

# Create a text processing and classification pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Evaluate the model on the test set
predicted = model.predict(X_test)
# print(classification_report(y_test, predicted))

In [2]:
# Define a simple response retrieval function (assuming a direct mapping for simplicity)
def get_response(input_text):
    predicted_intent = model.predict([input_text])[0]
    # This example uses the predicted 'intent' as the response; adjust as needed
    return predicted_intent

# Test the system with a sample input
sample_query = "How timestamp work in python?"
response = get_response(sample_query)
print("Query:", sample_query)
print("Response:", response)

Query: How timestamp work in python?
Response: in python, you can convert each individual timestamp in a pandas series into a string using the `astype()` function. `astype()` function is used to cast a pandas object to a specified datatype. here, it's used to convert timestamps to strings.

here is a step-by-step guide on how you can achieve that:

1. import the necessary libraries:
```python
import pandas as pd
```

2. let's assume you have a dataframe called `df` with a column `timestamp` as follows:
```python
print(df)
```
output:
```
                     timestamp
0  2021-01-01 00:00:00.000000
1  2021-01-01 00:01:00.000000
2  2021-01-01 00:02:00.000000
```

3. to convert the `timestamp` column to string, you can apply `astype(str)` to the `timestamp` column:
```python
df['timestamp'] = df['timestamp'].astype(str)
```

4. now, if you print your dataframe, you'll see that the `timestamp` column has been converted to string:
```python
print(df)
```
output:
```
                   times

In [4]:
import joblib  # Import joblib
# Save the model pipeline
model_filename = 'chatbot_modelNB.pkl'  # Specify the path to save the model
joblib.dump(model, model_filename)  # Save the actual pipeline, not the function

['chatbot_modelNB.pkl']

In [4]:
# import joblib  # Import joblib
# model_filename = 'chatbot_modelNB.pkl'  \

# # Load the model from the file
# model = joblib.load(model_filename)


# You can find the working chatbot conversation below using NB

In [19]:
import os
import re
import spacy
import joblib
import json

# Load the spaCy model for NLP tasks
nlp = spacy.load("en_core_web_sm")

# Try to load the Naive Bayes model from a PKL file
model = joblib.load('chatbot_modelNB.pkl')

# Check if the model has a predict method
if hasattr(model, 'predict'):
    print("Model loaded successfully and it can predict.")
else:
    print("Loaded object is not a model with a predict method.")

def load_or_create_user_data(name):
    filename = f"{name.lower()}_data.json"
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            return json.load(f)
    else:
        return {"name": name, "likes": [], "dislikes": [], "personal_info": {}}

def save_user_data(user_data):
    filename = f"{user_data['name'].lower()}_data.json"
    with open(filename, 'w') as f:
        json.dump(user_data, f, indent=4)

def extract_and_update_user_data(text, user_data):
    response = ""
    doc = nlp(text)
    greeting_words = ['hello', 'hi', 'hey', 'hii', 'greetings']

    # Handle greetings
    if any(greeting in text.lower() for greeting in greeting_words):
        response += "Hello! How can I assist you today?\n"

    # Extract personal info from entities
    for ent in doc.ents:
        user_data['personal_info'][ent.label_.lower()] = ent.text
        if ent.label_ == "PERSON":
            response += f"Nice to learn more about you, {ent.text}!\n"
        elif ent.label_ == "ORG":
            response += f"Interesting to hear you are involved with {ent.text}.\n"
        elif ent.label_ == "LOC":
            response += f"Great to know you are from {ent.text}.\n"
        elif ent.label_ == "DATE":
            response += f"Noted, the date {ent.text} is important to you.\n"

    # Process likes and dislikes with refined patterns
    likes_dislikes_patterns = {
        'like': [r"\blike[s]? (\w+)", r"\benjoy[s]? (\w+)", r"\bam into (\w+)"],
        'dislike': [r"\bdislike[s]? (\w+)", r"\bhat[e]? (\w+)", r"\bcan't stand (\w+)"]
    }

    for key, patterns in likes_dislikes_patterns.items():
        for pattern in patterns:
            found_items = re.findall(pattern, text, re.I)
            for item in found_items:
                if item not in user_data[key+'s']:
                    user_data[key+'s'].append(item)
                    response += f"It's great that you {key} {item}.\n"
                else:
                    response += f"You still {key} {item}, good to know!\n"

    return response

def generate_response(input_text, user_data):
    try:
        processed_input = [input_text]  # Ensure your input matches the model's expected format
        model_response = model.predict(processed_input)
        return model_response[0]
    except Exception as e:
        return f"Error processing your request: {str(e)}"

def chat():
    print("Hi, I'm your Friendly ChatBot. What's your name?")
    name = input("Enter your name: ")
    user_data = load_or_create_user_data(name)
    print(f"Welcome back, {name}!" if os.path.exists(f"{name.lower()}_data.json") else f"Nice to meet you, {name}!")

    while True:
        user_input = input(f"{name}: ")
        if user_input.lower() == "quit":
            print("Bot: Goodbye!")
            break

        personal_response = extract_and_update_user_data(user_input, user_data)
        if personal_response:
            print(f"Bot: {personal_response}")
        else:
            generated_response = generate_response(user_input, user_data)
            print(f"Bot: {generated_response}")

        save_user_data(user_data)

if __name__ == "__main__":
    chat()


Model loaded successfully and it can predict.
Hi, I'm your Friendly ChatBot. What's your name?


Enter your name:  Denish


Welcome back, Denish!


Denish:  hello


Bot: Hello! How can I assist you today?



Denish:  i like coffee


Bot: You still like coffee, good to know!



Denish:  I am doing my master's at University of Texas at Dallas


Bot: Interesting to hear you are involved with University of Texas.



Denish:  how to use timestamp


Bot: in python, you can convert each individual timestamp in a pandas series into a string using the `astype()` function. `astype()` function is used to cast a pandas object to a specified datatype. here, it's used to convert timestamps to strings.

here is a step-by-step guide on how you can achieve that:

1. import the necessary libraries:
```python
import pandas as pd
```

2. let's assume you have a dataframe called `df` with a column `timestamp` as follows:
```python
print(df)
```
output:
```
                     timestamp
0  2021-01-01 00:00:00.000000
1  2021-01-01 00:01:00.000000
2  2021-01-01 00:02:00.000000
```

3. to convert the `timestamp` column to string, you can apply `astype(str)` to the `timestamp` column:
```python
df['timestamp'] = df['timestamp'].astype(str)
```

4. now, if you print your dataframe, you'll see that the `timestamp` column has been converted to string:
```python
print(df)
```
output:
```
                   timestamp
0  2021-01-01 00:00:00.000000
1  2021

Denish:  us of lstm keras model


Bot: this error is likely due to a version mismatch between keras and tensorflow. 'get_default_graph' is a method that was available in tensorflow 1.x, but has been removed in tensorflow 2.x. if you're using tensorflow 2.x, you might be using a version of keras that isn't compatible with it.

the best way to solve this problem is to use the keras api that comes bundled with tensorflow 2.x, instead of using standalone keras. you can import the necessary modules from `tensorflow.keras` instead of `keras`. here is how you can modify your code:

```python
from tensorflow.keras.models import sequential
from tensorflow.keras.layers import dense, activation, lstm

model = sequential()
model.add(dense(32, input_dim=784))
model.add(activation('relu'))
model.add(lstm(17))
model.add(dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

by using the keras api bundled with tensorflow, you can ensure that the versions are compatible an

Denish:  thank you 


Bot: yes, you can easily switch the x-axis with the y-axis in matplotlib. here is an example on how to do it:

assuming you have a plot 'p', you can get the x and y data like this:

```python
x, y = p.get_data()
```

if you want to switch x and y, you need to clear the current figure with `plt.clf()`, then plot y against x instead of x against y like this:

```python
plt.clf()
plt.plot(y, x)
```

this will give you a plot with switched axes. remember to call `plt.show()` at the end to actually display the plot.

here is a full example:

```python
import matplotlib.pyplot as plt
import numpy as np

# create some data
x = np.arange(0, 10, 0.1)
y = np.sin(x)

# plot x against y
plt.plot(x, y)
plt.show()

# get the current plot
p = plt.gca().get_lines()[0]

# get the x and y data
x, y = p.get_data()

# clear the current figure
plt.clf()

# plot y against x
plt.plot(y, x)
plt.show()
```

this code first creates a sine wave, plots it, gets the current plot, gets the x and y data, clears the 

Denish:  quit


Bot: Goodbye!


## Showcasing The User Model

In [20]:
import json

# Path to your JSON file
file_path = '/kaggle/working/denish_data.json'

# Open the file and load the data
with open(file_path, 'r') as file:
    data = json.load(file)

# Print the data
print(json.dumps(data, indent=4))


{
    "name": "Denish",
    "likes": [
        "java",
        "black",
        "coffee",
        "Java"
    ],
    "dislikes": [],
    "personal_info": {
        "person": "Java",
        "name": "java",
        "organization": "University of Texas",
        "gpe": "Dallas",
        "org": "University of Texas"
    }
}


# Text Classification System Using SVM and NLP

This Python script demonstrates building a text classification model using a Support Vector Machine (SVM), implemented within a natural language processing (NLP) context. It covers data preprocessing, feature extraction, model training, and preliminary evaluation.

## Libraries and Tools Used

- **pandas**: Used for loading and manipulating the dataset. It provides efficient data structures like DataFrame and functions for data manipulation.
- **numpy**: Supports handling large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions.
- **scikit-learn**: Provides tools for data mining and data analysis, including machine learning algorithms like SVM and utilities for text processing such as TF-IDF.
- **spacy**: An NLP library used for language processing tasks such as tokenization and lemmatization.
- **re**: Used for regex operations, enabling text cleaning by removing non-alphabetic characters and extra spaces.

## Workflow Overview

### Step 1: Data Loading
Data is loaded from a CSV file, specifically focusing on a subset of 10,000 question-answer pairs to manage performance and complexity.

### Step 2: Data Cleaning
A custom `clean_text` function is defined to normalize text by removing non-alphabetic characters and converting text to lowercase. This standardization is crucial for reducing model complexity and improving performance.

### Step 3: Text Processing
Using spacy, the script performs tokenization and lemmatization to process text into a more manageable form by reducing words to their base or root form (lemmas).

### Step 4: Feature Extraction
The TF-IDF vectorizer from scikit-learn converts text data into a numerical format that the machine learning model can process, weighing words based on their importance to document context.

### Step 5: Model Training
A machine learning pipeline that includes the TF-IDF vectorizer and SVM classifier is set up and used to train the model on the preprocessed text data.

### Step 6: Model Evaluation
The dataset is split into training and testing sets to evaluate the model's performance, setting up the framework for assessing metrics like accuracy and F1-score.

### Step 7: Chatbot Integration
Defines a function `get_response` that utilizes the trained model to predict responses to input queries, showcasing a practical application in a chatbot system.

## Execution Checkpoints
Print statements are used throughout the script as checkpoints to monitor the flow of data processing and to assist in debugging.

## Testing and Use Case
The system can be tested with specific queries to demonstrate its capability to automatically generate relevant responses based on learned data.



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import spacy
import re

# Load data
data_path = '/kaggle/input/glaive-python-code-qa-dataset/train.csv'  # Adjust path as necessary
data = pd.read_csv(data_path)
data = data[:10000]

print("1")
# Cleaning function to handle text and potential non-text data
def clean_text(text):
    if not pd.isnull(text) and not isinstance(text, float):
        text = re.sub(r'[^a-zA-Z\s]', '', text, re.I|re.A)
        text = text.lower().strip()
    else:
        text = ''
    return text

print("2")
# Apply cleaning function to both questions and answers
data['question'] = data['question'].apply(clean_text)
data['answer'] = data['answer'].apply(clean_text)

# Initialize SpaCy for NLP tasks
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

print("3")
# Tokenization and lemmatization using SpaCy
def lemmatize_text(text):
    return ' '.join([token.lemma_ for token in nlp(text) if not token.is_punct and not token.is_stop])

data['processed_text'] = data['question'].apply(lemmatize_text)

# Define a pipeline for TF-IDF Vectorization and SVM Classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', SVC(kernel='linear'))  # Using linear kernel for SVM
])

print("4")
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['processed_text'], data['answer'], test_size=0.1, random_state=42)

# Train the SVM model
pipeline.fit(X_train, y_train)

print("5")
# Model Evaluation
# predicted = pipeline.predict(X_test)
# print("Classification Report:\n", classification_report(y_test, predicted))

# Function to process and respond to queries
def get_response(input_text):
    input_text = clean_text(input_text)  # Clean the input text
    input_text = lemmatize_text(input_text)  # Lemmatize the input text
    prediction = pipeline.predict([input_text])[0]  # Predict the intent or response
    return prediction 

print("6")
# Testing the chatbot with sample queries
# sample_queries = ['how are you', 'tell me a joke', 'thank you']
query = "How datatime works ?"
# for query in sample_queries:
print("Query:", query)
print("Response:", get_response(query))


1
2
3
4
5
6
Query: How datatime works ?
Response: in python you can change the current working directory using the os modules chdir function heres how you can do it

python
import os

oschdirpathtoyourdirectory


in the above code replace pathtoyourdirectory with the path of the directory that you want to set as the current working directory this will change the current working directory to the specified path

you can verify if the current working directory was changed successfully by using the getcwd function of the os module which returns the current working directory heres how you can do it

python
import os

oschdirpathtoyourdirectory
printosgetcwd


in the above code after changing the current working directory we print the current working directory if the change was successful it will print pathtoyourdirectory


In [2]:
import joblib  # Import joblib
model_filename = 'chatbot_modelSVM.pkl'  # Specify the path to save the model
joblib.dump(pipeline,model_filename)

['chatbot_modelSVM.pkl']

In [None]:
import joblib  # Import joblib
model_filename = 'chatbot_modelNB.pkl'  # Specify the path of the model file

# Load the model from the file
model = joblib.load(model_filename)


Below code generates the chats and give sthe response

# You can find the working chatbot conversation below

In [3]:
import os
import re
import spacy
import joblib
import json

# Load the spaCy model for NLP tasks
nlp = spacy.load("en_core_web_sm")

# Try to load the Naive Bayes model from a PKL file
model = joblib.load('chatbot_modelSVM.pkl')

# Check if the model has a predict method
if hasattr(model, 'predict'):
    print("Model loaded successfully and it can predict.")
else:
    print("Loaded object is not a model with a predict method.")

def load_or_create_user_data(name):
    filename = f"{name.lower()}_data.json"
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            return json.load(f)
    else:
        return {"name": name, "likes": [], "dislikes": [], "personal_info": {}}

def save_user_data(user_data):
    filename = f"{user_data['name'].lower()}_data.json"
    with open(filename, 'w') as f:
        json.dump(user_data, f, indent=4)

def extract_and_update_user_data(text, user_data):
    response = ""
    doc = nlp(text)
    greeting_words = ['hello', 'hi', 'hey', 'hii', 'greetings']

    # Handle greetings
    if any(greeting in text.lower() for greeting in greeting_words):
        response += "Hello! How can I assist you today?\n"

    # Extract personal info from entities
    for ent in doc.ents:
        user_data['personal_info'][ent.label_.lower()] = ent.text
        if ent.label_ == "PERSON":
            response += f"Nice to learn more about you, {ent.text}!\n"
        elif ent.label_ == "ORG":
            response += f"Interesting to hear you are involved with {ent.text}.\n"
        elif ent.label_ == "LOC":
            response += f"Great to know you are from {ent.text}.\n"
        elif ent.label_ == "DATE":
            response += f"Noted, the date {ent.text} is important to you.\n"

    # Process likes and dislikes with refined patterns
    likes_dislikes_patterns = {
        'like': [r"\blike[s]? (\w+)", r"\benjoy[s]? (\w+)", r"\bam into (\w+)"],
        'dislike': [r"\bdislike[s]? (\w+)", r"\bhat[e]? (\w+)", r"\bcan't stand (\w+)"]
    }

    for key, patterns in likes_dislikes_patterns.items():
        for pattern in patterns:
            found_items = re.findall(pattern, text, re.I)
            for item in found_items:
                if item not in user_data[key+'s']:
                    user_data[key+'s'].append(item)
                    response += f"It's great that you {key} {item}.\n"
                else:
                    response += f"You still {key} {item}, good to know!\n"

    return response

def generate_response(input_text, user_data):
    try:
        processed_input = [input_text]  # Ensure your input matches the model's expected format
        model_response = model.predict(processed_input)
        return model_response[0]
    except Exception as e:
        return f"Error processing your request: {str(e)}"

def chat():
    print("Hi, I'm your Friendly ChatBot. What's your name?")
    name = input("Enter your name: ")
    user_data = load_or_create_user_data(name)
    print(f"Welcome back, {name}!" if os.path.exists(f"{name.lower()}_data.json") else f"Nice to meet you, {name}!")

    while True:
        user_input = input(f"{name}: ")
        if user_input.lower() == "quit":
            print("Bot: Goodbye!")
            break

        personal_response = extract_and_update_user_data(user_input, user_data)
        if personal_response:
            print(f"Bot: {personal_response}")
        else:
            generated_response = generate_response(user_input, user_data)
            print(f"Bot: {generated_response}")

        save_user_data(user_data)

if __name__ == "__main__":
    chat()


Model loaded successfully and it can predict.
Hi, I'm your Friendly ChatBot. What's your name?


Enter your name:  Denish 


Nice to meet you, Denish !


Denish :  hello


Bot: Hello! How can I assist you today?



Denish :  i like coffee


Bot: It's great that you like coffee.



Denish :  What is bilinear scaling ?


Bot: it is likely that the pillow and pytorch libraries implement bilinear interpolation differently leading to the difference in results you are seeing to get the same results with both libraries you will need to use the same scaling method during both training and inference

here is a stepbystep guide on how you can compare the difference

 start by importing the required libraries and defining the transformation from pil to torch and the reshape size

python
import numpy as np
from pil import image
import torch
import torchnnfunctional as f
from torchvision import transforms
import matplotlibpyplot as plt

piltotorch  transformstotensor
resshape   


 open the image and convert it to a torch tensor

python
pilimg  imageopenlennapng
torchimg  piltotorchpilimg


 scale the image using both pil and torch and then convert the pil image to a torch tensor

python
pilimagescaled  pilimgresizeresshape imagebilinear
torchimgscaled  finterpolatetorchimgunsqueeze resshape modebilinearsqueeze



Denish :  Describe python's opencv


Bot: the different python interfaces for opencv cater to different needs and have evolved over time 

 opencv this is the original opencv library for python developed by the opencv team itself it provides a comprehensive interface for most of the functionality provided by the opencv library you can find its documentation herehttpopencvwillowgaragecomdocumentationpythonintroductionhtml

 cv this is an older version of the opencv library for python it was also created by the opencv team and it provides a similar interface to opencv but it may not contain some of the newer features or improvements made in opencv you can find its documentation herehttpopencvwillowgaragecomdocumentationpythoncookbookhtml

 pyopencv this is a thirdparty interface for the opencv library developed independently of the opencv team it provides a slightly different interface to the opencv library and may have some additional features not present in opencv or cv its predecessor ctypesopencv was a similar interface

Denish :  quit


Bot: Goodbye!
