<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

In [1]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

In [2]:
paths = get_paths()

In [3]:
RANDOM_SEED = 7

In [4]:
SPLITRATIO = 0.8

---

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [5]:
questions_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Questions.csv", delimiter=",", encoding="latin-1")
tags_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Tags.csv", delimiter=",", encoding="latin-1")
answers_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Answers.csv", delimiter=",", encoding="latin-1")

### 4.2.1 Info

In [6]:
print(tags_df.info())
print(questions_df.info())
print(answers_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885078 entries, 0 to 1885077
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 28.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607282 entries, 0 to 607281
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            607282 non-null  int64  
 1   OwnerUserId   601070 non-null  float64
 2   CreationDate  607282 non-null  object 
 3   Score         607282 non-null  int64  
 4   Title         607282 non-null  object 
 5   Body          607282 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 27.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987122 entries, 0 to 987121
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            98712

In [7]:
# Merge the tags and questions dataframes



### 4.2.2 Describe

In [8]:
unique_tags = tags_df["Tag"].unique().tolist()
num_tags = len(unique_tags)

In [9]:
num_tags

16896

In [10]:
questions_df

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...
...,...,...,...,...,...,...
607277,40143190,333403.0,2016-10-19T23:36:01Z,1,How to execute multiline python code from a ba...,<p>I need to extend a shell script (bash). As ...
607278,40143228,6662462.0,2016-10-19T23:40:00Z,0,How to get google reCaptcha image source using...,<p>I understood that reCaptcha loads a new fra...
607279,40143267,4064680.0,2016-10-19T23:44:07Z,0,Updating an ManyToMany field with Django rest,<p>I'm trying to set up this API so I can use ...
607280,40143338,7044980.0,2016-10-19T23:52:27Z,2,Most possible pairs,"<p>Given a list of values, and information on ..."


In [11]:
tags_grouped = tags_df.groupby('Id')['Tag'].apply(list).reset_index(name='Tags')
questions_and_tags_df = questions_df.merge(tags_grouped,on="Id")
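
The grouping and merge above can be illustrated on a toy example. This is a minimal sketch, assuming two tiny made-up frames (`toy_questions`, `toy_tags`) that stand in for `Questions.csv` and `Tags.csv`; the values are invented for illustration only.

```python
import pandas as pd

# Toy stand-ins for the question and tag tables (values are made up)
toy_questions = pd.DataFrame({"Id": [1, 2], "Title": ["q one", "q two"]})
toy_tags = pd.DataFrame({"Id": [1, 1, 2], "Tag": ["python", "pandas", "python"]})

# Collapse the long tag table into one list of tags per question Id
toy_tags_grouped = toy_tags.groupby("Id")["Tag"].apply(list).reset_index(name="Tags")

# Default inner join on Id: a question without any tag row would be dropped
toy_merged = toy_questions.merge(toy_tags_grouped, on="Id")
print(toy_merged)
```

Because `merge` defaults to an inner join, questions that have no entry in the tag table do not appear in the result.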



In [12]:
# Merge each answer with its parent question (which already carries the Tags column)
answers_and_questions_df = answers_df.merge(questions_and_tags_df, left_on="ParentId", right_on="Id", suffixes=('_answer', '_question'))
# Merge the grouped tags in again; this duplicates the tag list as Tags_answer / Tags_question
answers_and_questions_df = answers_and_questions_df.merge(tags_grouped, left_on="ParentId", right_on="Id", suffixes=('_answer', '_question'))
answers_and_questions_df

Unnamed: 0,Id_answer,OwnerUserId_answer,CreationDate_answer,ParentId,Score_answer,Body_answer,Id_question,OwnerUserId_question,CreationDate_question,Score_question,Title,Body_question,Tags_answer,Id,Tags_question
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
2,3040,457.0,2008-08-06T03:01:23Z,469,12,<p>Unfortunately the only API that isn't depre...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
3,195170,745.0,2008-10-12T07:02:40Z,469,1,<p>There must be a method in Cocoa to get a li...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
4,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,"[python, windows, image, pdf]",502,"[python, windows, image, pdf]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
987117,40143239,6640099.0,2016-10-19T23:41:38Z,40142731,2,<p>Well there are many different ways to detec...,40142731,6875348.0,2016-10-19T22:46:59Z,0,Collision Between two sprites - Python 3.5.2,<p>I have an image of a ufo and a missile. I'm...,"[python, pygame, collision-detection]",40142731,"[python, pygame, collision-detection]"
987118,40143315,3125566.0,2016-10-19T23:49:43Z,40143166,2,"<p>First thing, you should use <code>if/elif</...",40143166,7044992.0,2016-10-19T23:33:31Z,1,finding cubed root using delta and epsilon in ...,<p>I am trying to write a program that finds c...,"[python, python-3.x]",40143166,"[python, python-3.x]"
987119,40143317,2350575.0,2016-10-19T23:50:04Z,40142194,0,<p>If you are using firefox ver >47.0.1 you ne...,40142194,7044759.0,2016-10-19T21:58:32Z,1,errors with webdriver.Firefox() with selenium,"<p>I am using python 3.5, firefox 45 (also tri...","[python, selenium, firefox]",40142194,"[python, selenium, firefox]"
987120,40143349,6934347.0,2016-10-19T23:54:02Z,40077010,0,<p>I solved my own problem defining the follow...,40077010,6934347.0,2016-10-17T00:33:51Z,2,Can't pass random variable to tf.image.central...,<p>In Tensorflow I am training from a set of P...,"[python, tensorflow]",40077010,"[python, tensorflow]"


In [13]:
# Keep only the columns we need: the answer body and the question body
answers_and_questions_df = answers_and_questions_df[["Body_answer", "Body_question"]]
answers_and_questions_df

Unnamed: 0,Body_answer,Body_question
0,<p>open up a terminal (Applications-&gt;Utilit...,<p>I am using the Photoshop's javascript API t...
1,<p>I haven't been able to find anything that d...,<p>I am using the Photoshop's javascript API t...
2,<p>Unfortunately the only API that isn't depre...,<p>I am using the Photoshop's javascript API t...
3,<p>There must be a method in Cocoa to get a li...,<p>I am using the Photoshop's javascript API t...
4,<p>You can use ImageMagick's convert utility f...,<p>I have a cross-platform (Python) applicatio...
...,...,...
987117,<p>Well there are many different ways to detec...,<p>I have an image of a ufo and a missile. I'm...
987118,"<p>First thing, you should use <code>if/elif</...",<p>I am trying to write a program that finds c...
987119,<p>If you are using firefox ver >47.0.1 you ne...,"<p>I am using python 3.5, firefox 45 (also tri..."
987120,<p>I solved my own problem defining the follow...,<p>In Tensorflow I am training from a set of P...


In [14]:

answers_and_questions_df = answers_and_questions_df[:20000].copy()  # copy so later column assignments do not hit a slice


In [15]:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # Remove the leading <p> tag
    text = re.sub(r'^<p>', '', text)
    # Lowercase the text
    text = text.lower()
    # Remove punctuation (this also strips the angle brackets of any remaining HTML tags)
    text = re.sub(r'[^\w\s]', '', text)
    # Split on whitespace and lemmatize each word (stop_words is not applied here)
    text = " ".join(lemmatizer.lemmatize(word) for word in text.split())
    return text

# Apply preprocessing to the question and answer bodies
answers_and_questions_df['Body_question'] = answers_and_questions_df['Body_question'].apply(preprocess_text)
answers_and_questions_df['Body_answer'] = answers_and_questions_df['Body_answer'].apply(preprocess_text)


In [16]:
# Drop duplicate rows, then rows with missing values
answers_and_questions_df = answers_and_questions_df.drop_duplicates()
answers_and_questions_df = answers_and_questions_df.dropna()
answers_and_questions_df

Unnamed: 0,Body_answer,Body_question
0,open up a terminal applicationsgtutilitiesgtte...,i am using the photoshops javascript api to fi...
1,i havent been able to find anything that doe t...,i am using the photoshops javascript api to fi...
2,unfortunately the only api that isnt deprecate...,i am using the photoshops javascript api to fi...
3,there must be a method in cocoa to get a list ...,i am using the photoshops javascript api to fi...
4,you can use imagemagicks convert utility for t...,i have a crossplatform python application whic...
...,...,...
19995,to find the load path of module already loaded...,how can i get the file path of a module import...
19996,i have been using this method which applies to...,how can i get the file path of a module import...
19997,in this simple case the easiest way is to just...,im trying to use scons to build a latex docume...
19998,something along these line should do p precode...,im trying to use scons to build a latex docume...
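
The preprocessed text above still contains fragments such as `applicationsgtutilitiesgtte...` and `precode`, which come from HTML entities and tags losing their punctuation. The following is a minimal sketch of a stricter cleaner, not the preprocessing used in this notebook. `strip_html` is a hypothetical helper, and it replaces punctuation with a space instead of deleting it, so its output differs from `preprocess_text`.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches any HTML tag, e.g. <p>, </code>, <a href="...">

def strip_html(text):
    """Hypothetical helper: drop tags, decode entities, then normalise the text."""
    text = TAG_RE.sub(" ", text)          # remove whole tags instead of keeping their names
    text = html.unescape(text)            # turn &gt; into > before punctuation removal
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with a space
    return " ".join(text.split())         # collapse repeated whitespace

print(strip_html("<p>Open <code>Applications-&gt;Utilities</code> first.</p>"))
# prints: open applications utilities first
```

Tags are removed before entities are decoded so that a decoded `<` or `>` cannot be mistaken for markup.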


In [17]:
from collections import Counter
from itertools import chain
from torch.nn.utils.rnn import pad_sequence
import torch

# Set threshold for minimum word frequency
min_freq = 1  # or a suitable value based on your data
max_length = 15  # Maximum sequence length

# Step 1: Tokenize and Build Vocabulary with a Frequency Filter
tokenized_texts = answers_and_questions_df['Body_answer'].apply(lambda x: x.split())
word_counts = Counter(chain(*tokenized_texts))

# Build vocabulary with words meeting the min frequency requirement.
# Word indices start at 2, so the first word shares index 2 with <EOS> below.
vocab = {word: idx + 2 for idx, (word, count) in enumerate(word_counts.items()) if count >= min_freq}
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1  # Unknown token for words missing from the vocabulary
vocab['<EOS>'] = 2

# Step 2: Encode Texts with Unknown Token Handling
answers_and_questions_df['encoded_text'] = tokenized_texts.apply(
    lambda x: [vocab.get(word, vocab['<UNK>']) for word in x]  # Use <UNK> for words not in vocab
)


# Step 3: Pad or Truncate Sequences
answers_and_questions_df['padded_text'] = answers_and_questions_df['encoded_text'].apply(
    lambda x: x[:max_length] + [vocab['<PAD>']] * (max_length - len(x)) if len(x) < max_length else x[:max_length]
)
answers_and_questions_df['padded_text'] = answers_and_questions_df['padded_text'].apply(lambda x: x + [vocab["<EOS>"]])
# Convert to tensor
y = torch.tensor(answers_and_questions_df['padded_text'].tolist())
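
One way to keep special tokens and words from sharing an index is to reserve the special indices first and give every word the next free index. This is a minimal sketch under that assumption, not the vocabulary code used above; `build_vocab`, `encode`, and `SPECIALS` are hypothetical names, and the toy sentences are made up.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<UNK>", "<EOS>"]  # reserved indices 0, 1, 2

def build_vocab(token_lists, min_freq=1):
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)  # next free index: no gaps, no clashes
    return vocab

def encode(tokens, vocab, max_length):
    # Map words to indices, truncate, pad, then append <EOS> (same layout as above)
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens][:max_length]
    ids += [vocab["<PAD>"]] * (max_length - len(ids))
    return ids + [vocab["<EOS>"]]

toy = [["use", "a", "list"], ["use", "a", "dict"]]
toy_vocab = build_vocab(toy, min_freq=2)
print(toy_vocab)                                        # {'<PAD>': 0, '<UNK>': 1, '<EOS>': 2, 'use': 3, 'a': 4}
print(encode(["use", "a", "set"], toy_vocab, max_length=5))  # [3, 4, 1, 0, 0, 2]
```

Assigning `len(vocab)` as the next index also keeps every index below `len(vocab)` when `min_freq` filters words out.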







In [18]:
# Step 1: Tokenize the questions (the vocabulary built from the answers is reused)
tokenized_texts = answers_and_questions_df['Body_question'].apply(lambda x: x.split())



# Step 2: Encode Texts with Unknown Token Handling
answers_and_questions_df['encoded_text'] = tokenized_texts.apply(
    lambda x: [vocab.get(word, vocab['<UNK>']) for word in x]  # Use <UNK> for words not in vocab
)

# Step 3: Pad or Truncate Sequences
answers_and_questions_df['padded_text'] = answers_and_questions_df['encoded_text'].apply(
    lambda x: x[:max_length] + [vocab['<PAD>']] * (max_length - len(x)) if len(x) < max_length else x[:max_length]
)

# add EOS token
answers_and_questions_df['padded_text'] = answers_and_questions_df['padded_text'].apply(lambda x: x + [vocab["<EOS>"]])
# Convert to tensor
x = torch.tensor(answers_and_questions_df['padded_text'].tolist())

In [19]:
# find max value in x
max_value = x.max().item()
max_value

invalid_indices = x >= len(vocab)
if invalid_indices.any():
    print(f"Found {invalid_indices.sum().item()} invalid indices, setting them to '<UNK>' token.")
    x[invalid_indices] = vocab['<UNK>']

max_value = x.max().item()
max_value

# find max value in y
max_value = y.max().item()
max_value

invalid_indices = y >= len(vocab)
if invalid_indices.any():
    print(f"Found {invalid_indices.sum().item()} invalid indices, setting them to '<UNK>' token.")
    y[invalid_indices] = vocab['<UNK>']

max_value = y.max().item()
max_value


141279

In [20]:
len(vocab)

141290

In [21]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=RANDOM_SEED)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

(torch.Size([15999, 16]),
 torch.Size([4000, 16]),
 torch.Size([15999, 16]),
 torch.Size([4000, 16]))
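
The 15999 / 4000 split sizes are consistent with rounding the test count up for 19999 rows and a test fraction of 0.2. The following is a minimal standard-library sketch of a seeded shuffle split. It is an illustration, not scikit-learn's implementation, and `split_indices` is a hypothetical helper that will not reproduce the same permutation as `train_test_split`.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=7):
    """Shuffle indices with a fixed seed and cut off a test fraction (illustrative only)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test set size is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(19999)
print(len(train_idx), len(test_idx))  # 15999 4000
```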

In [22]:
del x, y
del answers_and_questions_df
del tokenized_texts

del tags_df
del questions_df
del answers_df
del tags_grouped
del questions_and_tags_df




In [23]:
import sys

def print_large_items(locals_dict, size_threshold=1024):
    """
    Prints variables from the given dictionary that are above the specified size threshold.

    Parameters:
    - locals_dict: dict, the dictionary of variables to inspect, typically locals() or globals().
    - size_threshold: int, the size in bytes above which variables should be printed.
    """
    for var_name, var_value in locals_dict.items():
        var_size = sys.getsizeof(var_value)
        if var_size > size_threshold:
            print(f"Variable '{var_name}' is {var_size} bytes in memory.")

# Usage example
# Assuming you call this function in the scope where your variables are defined
print_large_items(locals(), size_threshold=5000)  # Adjust threshold as needed

Variable 'unique_tags' is 135224 bytes in memory.
Variable '_10' is 987135972 bytes in memory.
Variable '_12' is 2579456964 bytes in memory.
Variable '_13' is 2075577061 bytes in memory.
Variable 'stop_words' is 8408 bytes in memory.
Variable '_16' is 51960177 bytes in memory.
Variable 'word_counts' is 5242984 bytes in memory.
Variable 'vocab' is 5242968 bytes in memory.
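
For built-in containers such as `dict` and `list`, `sys.getsizeof` counts only the container itself, not the objects it references. A rough recursive estimate can be sketched as follows; `deep_getsizeof` is a hypothetical helper that handles only a few built-in container types, and the toy dictionary is made up.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Approximate size of obj including the objects it contains (built-in containers only)."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

toy_vocab = {f"word{i}": i for i in range(1000)}
print(sys.getsizeof(toy_vocab), deep_getsizeof(toy_vocab))
```

The second number is larger because it also includes the key strings and the integer values.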


In [24]:
train_dataset = torch.utils.data.TensorDataset(x_train, y_train)
test_dataset = torch.utils.data.TensorDataset(x_test, y_test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)


### 4.2.3 Head

## 4.3 Data Visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line removes the target column from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>

#### 4.5.1.3 Feature Scaling / Standardization / Z-score Normalization

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line standardizes the features in `X` by subtracting the mean of each feature and dividing by the standard deviation of that feature. This transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
Standardization is crucial when using machine learning algorithms that rely on distance calculations (like K-Nearest Neighbors, SVM, or Neural Networks). Without standardization, features with larger scales could dominate the distance calculation, leading to biased model behavior. By standardizing, all features contribute equally to the model, regardless of their original scale.
</details>
</details>
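
The standardization described above can be checked on a single feature with the standard library. This is a minimal sketch, assuming a made-up feature `heights_cm` and a hypothetical helper `standardize`. It uses the population standard deviation and does not guard against a constant feature, whose standard deviation is zero.

```python
import statistics

def standardize(values):
    """Z-score a single feature: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
print(z)
print(statistics.fmean(z), statistics.pstdev(z))  # approximately 0 and 1
```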

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

---

# 5. Model Development

In [25]:
# Seq2Seq Model

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim, n_layers,batch_first=True,
                          bidirectional=False)
        
        self.decoder = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        

    def forward(self, x, hidden):
        # Embed the token indices: (batch, seq_len) -> (batch, seq_len, embedding_dim)
        x = self.encoder(x)

        # Run the GRU; output has shape (batch, seq_len, hidden_dim)
        output, hidden = self.gru(x, hidden)

        output = self.dropout(output)

        # Project every time step onto the vocabulary: (batch, seq_len, output_dim)
        output = self.decoder(output)

        return output, hidden

    
    def init_hidden(self, batch_size):
        # Zero initial hidden state of shape (n_layers, batch_size, hidden_dim)
        return torch.zeros(self.gru.num_layers, batch_size, self.gru.hidden_size)
    



In [26]:
# Training the Seq2Seq Model
model = Seq2Seq(vocab_size=len(vocab), embedding_dim=100, hidden_dim=256, output_dim=len(vocab), n_layers=1, dropout=0.5)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

learning_rate = 0.01
num_epochs = 5
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    total_loss = 0
    for questions, answers in train_loader:  # train_loader yields batches of (question, answer) pairs
        # Send questions and answers to GPU
        questions = questions.to(device)
        answers = answers.to(device)

        optimizer.zero_grad()
        loss = 0

        # Forward pass for the whole sequence, using teacher forcing
        hidden = model.init_hidden(questions.size(0)).to(device)
        
        encoder_outputs, hidden = model(questions, hidden)  # forward pass through encoder
        hidden = hidden.to(device)

        # calculate loss for first token
       # encoder_outputs = encoder_outputs.view(-1, encoder_outputs.shape[-1])
        target = answers[:, 0]
        encoder_outputs = encoder_outputs[:, -1, :]

        loss += criterion(encoder_outputs, target)


        # Teacher-forcing loop over answer sequence
        for i in range(1,answers.size(1)-1):  # Iterating over answer length
            input_token = answers[:, i].unsqueeze(1)  # Extract and add a dimension for embedding
     

            # Forward pass for each time step
            output, hidden = model(input_token, hidden)  # input is answer token at each time step

            # Reshape output to match expected input for CrossEntropyLoss
            output = output.view(-1, output.shape[-1])  
            target = answers[:, i+1]

            # Calculate loss for this time step
            loss += criterion(output, target)

        loss.backward()  # Backpropagation
        optimizer.step()  # Optimization step
        total_loss += loss.item() / answers.size(1)  # Average over sequence length

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}")


Epoch 1, Loss: 6.39327209186554
Epoch 2, Loss: 5.600018615722656
Epoch 3, Loss: 5.244165034294128
Epoch 4, Loss: 4.972579601287841
Epoch 5, Loss: 4.736355346679687
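
Teacher forcing feeds the true token at each step and asks the model to predict the token that follows it. The standard shift-by-one pairing can be sketched on a plain list, independent of the model; `teacher_forcing_pairs` is a hypothetical helper and the toy answer is made up. In the training cell above, the loop index starts at 1, so the pair (`answers[:, 0]`, `answers[:, 1]`) is not among the steps it visits.

```python
def teacher_forcing_pairs(sequence):
    """Return (input_token, target_token) pairs for next-token prediction."""
    return list(zip(sequence[:-1], sequence[1:]))

toy_answer = ["you", "can", "use", "sorted", "<EOS>"]
for inp, tgt in teacher_forcing_pairs(toy_answer):
    print(f"input={inp!r:10} target={tgt!r}")
```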


In [27]:
# Evaluate the model on the test set
model.eval()
loss_total = 0
with torch.no_grad():
    for questions, answers in test_loader:
        questions = questions.to(device)
        answers = answers.to(device)
        hidden = model.init_hidden(questions.size(0)).to(device)
        output, hidden = model(questions, hidden)
        output = output.view(-1, output.shape[-1])
        target = answers.view(-1)
        loss = criterion(output, target)
        loss_total += loss.item()

# Average the accumulated loss over all batches (not just the last batch)
print(f"Test Loss: {loss_total / len(test_loader)}")


In [28]:
question_text = "How do I sort a list of dictionaries by a value of the dictionary?"
# Preprocess the question text
question_text = preprocess_text(question_text)
# Map each word to its vocabulary index (<UNK> for unseen words)
question_tokens = [vocab.get(word, vocab['<UNK>']) for word in question_text.split()]
# Truncate or pad to max_length
question_tokens = question_tokens[:max_length] + [vocab['<PAD>']] * (max_length - len(question_tokens))
# Append the EOS token
question_tokens = question_tokens + [vocab["<EOS>"]]

# Convert to a tensor of shape (1, max_length + 1)
question_tensor = torch.tensor(question_tokens).unsqueeze(0).to(device)
# Initialize hidden state
hidden = model.init_hidden(1)
question_tensor


tensor([[ 967,  454,   33, 1261,    4,   99,   89,  702,  213,    4,  548,   89,
           21,  702,    0,    2]])

In [29]:
# Testing with no gradient calculation
hidden = model.init_hidden(1).to(device)  # Initialize hidden state for batch size 1
question_tensor = question_tensor.to(device)  # Ensure question tensor is on the same device

output_sequence = []  # List to store output tokens

with torch.no_grad():
    for i in range(answers.size(1)):  # Iterating over the answer sequence length
        # Forward pass through the model
        output, hidden = model(question_tensor, hidden)
        
        # Reshape output and find the most likely token for this time step
        output = output.view(-1, output.shape[-1])
        predicted_token = torch.argmax(output, dim=1)
        
        # Print or store the predicted token for each batch element individually
        
        output_sequence.extend(predicted_token.cpu().numpy().tolist())  # Append tokens as a list
        
        # Optionally, update question_tensor to the predicted token for auto-regressive testing
        question_tensor = predicted_token.unsqueeze(0)  # Reshape for next time step if needed
        
        # Exit after the first time step for debugging purposes (remove break to run full sequence)
        break

# Final output after one time step or full sequence
final_output = torch.tensor(output_sequence)
final_output


tensor([30, 23,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 33])

In [30]:
predicted_indices = final_output.tolist()
# remove the <EOS> and <PAD> tokens
predicted_indices = [idx for idx in predicted_indices if idx not in [vocab["<EOS>"], vocab["<PAD>"]]]

predicted_answer = [word for idx in predicted_indices for word, word_idx in vocab.items() if word_idx == idx]
predicted_answer = " ".join(predicted_answer)
predicted_answer

'to you i'
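
Decoding can also be done with an inverse lookup table that cuts the sequence at the first stop token. This is a minimal sketch rather than the decoding used above: `decode` is a hypothetical helper and the toy vocabulary is made up. Unlike the cell above, which filters out every `<EOS>` and `<PAD>` but keeps later tokens, this version discards everything after the first stop token.

```python
def decode(indices, vocab, stop_tokens=("<EOS>", "<PAD>")):
    """Map indices back to words, stopping at the first stop token."""
    idx_to_word = {}
    for word, idx in vocab.items():
        idx_to_word.setdefault(idx, word)  # keep the first word if two share an index
    stop_ids = {vocab[t] for t in stop_tokens}
    words = []
    for idx in indices:
        if idx in stop_ids:
            break  # cut the sequence at the first stop token
        words.append(idx_to_word.get(idx, "<UNK>"))
    return " ".join(words)

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "<EOS>": 2, "use": 3, "sorted": 4}
print(decode([3, 4, 2, 3, 0], toy_vocab))  # use sorted
```

Building `idx_to_word` once avoids scanning the whole vocabulary for every predicted token.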

In [31]:
while True:
    question_text = input(predicted_answer)

    if question_text.lower() == 'exit':
        break
    # Preprocess the question text
    question_text = preprocess_text(question_text)
    # Map each word to its vocabulary index (<UNK> for unseen words)
    question_tokens = [vocab.get(word, vocab['<UNK>']) for word in question_text.split()]
    # Truncate or pad to max_length
    question_tokens = question_tokens[:max_length] + [vocab['<PAD>']] * (max_length - len(question_tokens))
    # Append the EOS token
    question_tokens = question_tokens + [vocab["<EOS>"]]

    # Convert to a tensor of shape (1, max_length + 1)
    question_tensor = torch.tensor(question_tokens).unsqueeze(0).to(device)

    # Initialize hidden state for batch size 1
    hidden = model.init_hidden(1).to(device)

    output_sequence = []  # Predicted token indices

    with torch.no_grad():
        for i in range(15):  # Generate 15 tokens
            # Ensure the input has shape (1, seq_len)
            question_tensor = question_tensor.view(1, -1)
            output, hidden = model(question_tensor, hidden)

            # Take the logits of the last time step and pick the most likely token
            output = output[0, -1, :]
            predicted_token = torch.argmax(output, dim=0)

            output_sequence.append(predicted_token.item())  # Store the token index

            # Feed the predicted token back in as the next input
            question_tensor = predicted_token

    # Collect the generated token indices
    final_output = torch.tensor(output_sequence)

    predicted_indices = final_output.tolist()
    # Remove the <EOS> and <PAD> tokens
    predicted_indices = [idx for idx in predicted_indices if idx not in [vocab["<EOS>"], vocab["<PAD>"]]]

    # Map the remaining indices back to words
    predicted_answer = [word for idx in predicted_indices for word, word_idx in vocab.items() if word_idx == idx]
    predicted_answer = " ".join(predicted_answer)
    print(predicted_answer)
    print("")


    

a hrefhttpdocsbinstarorgdraftexampleshtmlsubmitabuildfromgithub relnofollowbinstara for a list of the list of the list the

a look at a a answera of the a recent module a you

i have a good difference between the a function in python and i

you can use a a hrefhttpenwikipediaorgwikipython_28programming_language29pythona a a relnofollowrequestsa a a you want to



## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->