<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

In [28]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

In [29]:
paths = get_paths()

In [30]:
RANDOM_SEED = 7

In [31]:
SPLITRATIO = 0.8

---

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [32]:
questions_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Questions.csv", delimiter=",", encoding="latin-1")
tags_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Tags.csv", delimiter=",", encoding="latin-1")
answers_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Answers.csv", delimiter=",", encoding="latin-1")

### 4.2.1 Info

In [33]:
print(tags_df.info())
print(questions_df.info())
print(answers_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885078 entries, 0 to 1885077
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 28.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607282 entries, 0 to 607281
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            607282 non-null  int64  
 1   OwnerUserId   601070 non-null  float64
 2   CreationDate  607282 non-null  object 
 3   Score         607282 non-null  int64  
 4   Title         607282 non-null  object 
 5   Body          607282 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 27.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987122 entries, 0 to 987121
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            98712

In [34]:
# Merge the tags and questions dataframes



### 4.2.2 Describe

In [35]:
# Compute the unique tags once and derive the count from them
unique_tags = list(tags_df["Tag"].unique())
num_tags = len(unique_tags)

In [36]:
num_tags

16896

In [37]:
questions_df

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...
...,...,...,...,...,...,...
607277,40143190,333403.0,2016-10-19T23:36:01Z,1,How to execute multiline python code from a ba...,<p>I need to extend a shell script (bash). As ...
607278,40143228,6662462.0,2016-10-19T23:40:00Z,0,How to get google reCaptcha image source using...,<p>I understood that reCaptcha loads a new fra...
607279,40143267,4064680.0,2016-10-19T23:44:07Z,0,Updating an ManyToMany field with Django rest,<p>I'm trying to set up this API so I can use ...
607280,40143338,7044980.0,2016-10-19T23:52:27Z,2,Most possible pairs,"<p>Given a list of values, and information on ..."


In [38]:
tags_grouped = tags_df.groupby('Id')['Tag'].apply(list).reset_index(name='Tags')
questions_and_tags_df = questions_df.merge(tags_grouped,on="Id")
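
To make the effect of the grouping and merge concrete, here is a minimal plain-Python sketch on invented rows (the ids, titles, and tags below are illustrative, not taken from the dataset). It mirrors what `groupby('Id')['Tag'].apply(list)` followed by an inner merge on `Id` produces: one row per question that has at least one tag, with its tags collected into a list.

```python
from collections import defaultdict

# Toy stand-ins for Tags.csv and Questions.csv (hypothetical values).
tag_rows = [(469, "python"), (469, "osx"), (502, "python"), (502, "pdf")]
question_rows = [
    {"Id": 469, "Title": "font path"},
    {"Id": 502, "Title": "pdf preview"},
    {"Id": 999, "Title": "untagged question"},
]

# Equivalent of groupby('Id')['Tag'].apply(list): collect tags per question id.
tags_by_id = defaultdict(list)
for question_id, tag in tag_rows:
    tags_by_id[question_id].append(tag)

# Equivalent of an inner merge on "Id": questions without tags are dropped.
merged = [
    {**question, "Tags": tags_by_id[question["Id"]]}
    for question in question_rows
    if question["Id"] in tags_by_id
]

for row in merged:
    print(row)
```

Question 999 has no tag rows, so it does not appear in `merged`, just as an inner merge would drop it.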



In [39]:
# Attach each answer to its question (answers reference questions through ParentId)
answers_and_questions_df = answers_df.merge(questions_and_tags_df, left_on="ParentId", right_on="Id", suffixes=('_answer', '_question'))
# Merge the tag lists in a second time. questions_and_tags_df already carries a Tags column,
# so the suffixes turn the two identical copies into Tags_answer and Tags_question.
answers_and_questions_df = answers_and_questions_df.merge(tags_grouped, left_on="ParentId", right_on="Id", suffixes=('_answer', '_question'))
answers_and_questions_df

Unnamed: 0,Id_answer,OwnerUserId_answer,CreationDate_answer,ParentId,Score_answer,Body_answer,Id_question,OwnerUserId_question,CreationDate_question,Score_question,Title,Body_question,Tags_answer,Id,Tags_question
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
2,3040,457.0,2008-08-06T03:01:23Z,469,12,<p>Unfortunately the only API that isn't depre...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
3,195170,745.0,2008-10-12T07:02:40Z,469,1,<p>There must be a method in Cocoa to get a li...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,"[python, osx, fonts, photoshop]",469,"[python, osx, fonts, photoshop]"
4,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,"[python, windows, image, pdf]",502,"[python, windows, image, pdf]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
987117,40143239,6640099.0,2016-10-19T23:41:38Z,40142731,2,<p>Well there are many different ways to detec...,40142731,6875348.0,2016-10-19T22:46:59Z,0,Collision Between two sprites - Python 3.5.2,<p>I have an image of a ufo and a missile. I'm...,"[python, pygame, collision-detection]",40142731,"[python, pygame, collision-detection]"
987118,40143315,3125566.0,2016-10-19T23:49:43Z,40143166,2,"<p>First thing, you should use <code>if/elif</...",40143166,7044992.0,2016-10-19T23:33:31Z,1,finding cubed root using delta and epsilon in ...,<p>I am trying to write a program that finds c...,"[python, python-3.x]",40143166,"[python, python-3.x]"
987119,40143317,2350575.0,2016-10-19T23:50:04Z,40142194,0,<p>If you are using firefox ver >47.0.1 you ne...,40142194,7044759.0,2016-10-19T21:58:32Z,1,errors with webdriver.Firefox() with selenium,"<p>I am using python 3.5, firefox 45 (also tri...","[python, selenium, firefox]",40142194,"[python, selenium, firefox]"
987120,40143349,6934347.0,2016-10-19T23:54:02Z,40077010,0,<p>I solved my own problem defining the follow...,40077010,6934347.0,2016-10-17T00:33:51Z,2,Can't pass random variable to tf.image.central...,<p>In Tensorflow I am training from a set of P...,"[python, tensorflow]",40077010,"[python, tensorflow]"


In [40]:
# keep only the columns we need
# answers body and tag
answers_and_questions_df = answers_and_questions_df[["Body_answer", "Tags_question"]]
answers_and_questions_df

Unnamed: 0,Body_answer,Tags_question
0,<p>open up a terminal (Applications-&gt;Utilit...,"[python, osx, fonts, photoshop]"
1,<p>I haven't been able to find anything that d...,"[python, osx, fonts, photoshop]"
2,<p>Unfortunately the only API that isn't depre...,"[python, osx, fonts, photoshop]"
3,<p>There must be a method in Cocoa to get a li...,"[python, osx, fonts, photoshop]"
4,<p>You can use ImageMagick's convert utility f...,"[python, windows, image, pdf]"
...,...,...
987117,<p>Well there are many different ways to detec...,"[python, pygame, collision-detection]"
987118,"<p>First thing, you should use <code>if/elif</...","[python, python-3.x]"
987119,<p>If you are using firefox ver >47.0.1 you ne...,"[python, selenium, firefox]"
987120,<p>I solved my own problem defining the follow...,"[python, tensorflow]"


In [41]:

# Count how often each tag occurs across all questions
from collections import Counter

tag_count = Counter(tag for tags in questions_and_tags_df["Tags"] for tag in tags)
            



In [42]:
# Sort the tags by frequency, most common first
sorted_tags = sorted(tag_count.items(), key=lambda x: x[1], reverse=True)

# Number of classes to predict
class_count = 10
# Skip the 10 most frequent tags and keep the next class_count tags (ranks 11-20).
# Despite its name, top_100_tags therefore holds only class_count (10) tags.
top_100_tags = [tag for tag, count in sorted_tags[10:(10+class_count)]]
top_100_tags




['tkinter',
 'string',
 'flask',
 'google-app-engine',
 'csv',
 'arrays',
 'json',
 'mysql',
 'linux',
 'html']
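
The ranking and slicing steps above can also be expressed with `Counter.most_common`. The sketch below uses an invented tag list and a smaller window (`skip = 2` and `class_count = 2` are hypothetical values chosen for the toy data) to show the same idea: rank tags by frequency, skip the most frequent ones, and keep the next `class_count`.

```python
from collections import Counter

# Hypothetical per-question tag lists.
tag_lists = [
    ["python", "django"], ["python", "django"], ["python", "django"],
    ["python", "pandas"], ["python", "pandas"], ["python", "numpy"],
    ["django", "pandas"], ["numpy", "csv"],
]

tag_counts = Counter(tag for tags in tag_lists for tag in tags)

skip = 2         # number of top-ranked tags to leave out
class_count = 2  # number of tags to keep as classes

# most_common() returns (tag, count) pairs sorted by descending count.
ranked = [tag for tag, _ in tag_counts.most_common()]
selected_tags = ranked[skip:skip + class_count]
print(selected_tags)
```

Skipping the head of the ranking removes tags that appear on nearly every question (here `python`), which would otherwise carry little information as a class label.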

In [43]:
# Keep only the selected tags (top_100_tags) on every question
questions_and_tags_df["Tags"] = questions_and_tags_df["Tags"].apply(lambda tags: [tag for tag in tags if tag in top_100_tags])

# Do the same for the answers. The frame was created by column selection, so take an
# explicit copy first to avoid pandas' SettingWithCopyWarning.
answers_and_questions_df = answers_and_questions_df.copy()
answers_and_questions_df["Tags_question"] = answers_and_questions_df["Tags_question"].apply(lambda tags: [tag for tag in tags if tag in top_100_tags])
answers_and_questions_df


Unnamed: 0,Body_answer,Tags_question
0,<p>open up a terminal (Applications-&gt;Utilit...,[]
1,<p>I haven't been able to find anything that d...,[]
2,<p>Unfortunately the only API that isn't depre...,[]
3,<p>There must be a method in Cocoa to get a li...,[]
4,<p>You can use ImageMagick's convert utility f...,[]
...,...,...
987117,<p>Well there are many different ways to detec...,[]
987118,"<p>First thing, you should use <code>if/elif</...",[]
987119,<p>If you are using firefox ver >47.0.1 you ne...,[]
987120,<p>I solved my own problem defining the follow...,[]


In [44]:

# Keep only questions with exactly one of the selected tags
# (this drops untagged questions as well as questions with two or more selected tags)
questions_and_tags_df = questions_and_tags_df[questions_and_tags_df["Tags"].apply(len) == 1]

# Apply the same rule to the answers
answers_and_questions_df = answers_and_questions_df[answers_and_questions_df["Tags_question"].apply(len) == 1]
answers_and_questions_df





Unnamed: 0,Body_answer,Tags_question
17,"<p>No, you were not dreaming. Python has a pr...",[arrays]
18,<p>I think:</p>\r\n\r\n<pre><code>#!/bin/pytho...,[arrays]
19,<p>Are you looking to get a list of objects th...,[arrays]
20,<p>What I was thinking of can be achieved usin...,[arrays]
21,<p>you could always write one yourself:</p>\n\...,[arrays]
...,...,...
987094,<p>try this:</p>\n\n<pre><code>my_list.sort(ke...,[arrays]
987095,<p>For <strong><em>storing the sorted list as ...,[arrays]
987104,<p>That is actually how your data is:</p>\n\n<...,[json]
987105,<p>Editting based on the <em>edit</em> in the ...,[json]


In [45]:
# Drop the columns not needed for classification: Id, OwnerUserId, CreationDate, Score, Title
questions_and_tags_df = questions_and_tags_df.drop(columns=["Id","OwnerUserId","CreationDate","Score","Title"])
questions_and_tags_df

Unnamed: 0,Body,Tags
4,<p>I don't remember whether I was dreaming or ...,[arrays]
6,<p>I can get Python to work with Postgresql bu...,[mysql]
15,<p>Python works on multiple platforms and can ...,[tkinter]
17,<p>I have a Prolite LED sign that I like to se...,[linux]
26,<p>[I hope this isn't too obscure&hellip; I'll...,[mysql]
...,...,...
607252,<p>I would like make a website that shows cont...,[html]
607255,<p>So I'm trying to make a little battle scene...,[tkinter]
607261,<p>I have the following json:</p>\n\n<p>{</p>\...,[json]
607263,<p><strong>EDIT:</strong>\nAs @Alfe suggested ...,[json]


In [46]:
questions_and_tags_df = questions_and_tags_df.rename(columns={"Body":"Question"})
questions_and_tags_df = questions_and_tags_df.rename(columns={"Tags":"Class"})
questions_and_tags_df

Unnamed: 0,Question,Class
4,<p>I don't remember whether I was dreaming or ...,[arrays]
6,<p>I can get Python to work with Postgresql bu...,[mysql]
15,<p>Python works on multiple platforms and can ...,[tkinter]
17,<p>I have a Prolite LED sign that I like to se...,[linux]
26,<p>[I hope this isn't too obscure&hellip; I'll...,[mysql]
...,...,...
607252,<p>I would like make a website that shows cont...,[html]
607255,<p>So I'm trying to make a little battle scene...,[tkinter]
607261,<p>I have the following json:</p>\n\n<p>{</p>\...,[json]
607263,<p><strong>EDIT:</strong>\nAs @Alfe suggested ...,[json]


In [47]:
# Select the first 25000 rows of the questions
questions_and_tags_df = questions_and_tags_df[:25000]

# Select the first 25000 rows of the answers
answers_and_questions_df = answers_and_questions_df[:25000]
answers_and_questions_df


Unnamed: 0,Body_answer,Tags_question
17,"<p>No, you were not dreaming. Python has a pr...",[arrays]
18,<p>I think:</p>\r\n\r\n<pre><code>#!/bin/pytho...,[arrays]
19,<p>Are you looking to get a list of objects th...,[arrays]
20,<p>What I was thinking of can be achieved usin...,[arrays]
21,<p>you could always write one yourself:</p>\n\...,[arrays]
...,...,...
208787,<p>Let's assume your words are bounded at a re...,[string]
208788,<p>There are specialized index structures for ...,[string]
208789,<p>A very fast way to search for a lot of stri...,[string]
208802,<p>You can rely on neither of these two proper...,[string]


In [48]:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer and stopwords
# (requires the NLTK "stopwords" and "wordnet" corpora, e.g. via nltk.download)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # Remove leading <p> tags
    text = re.sub(r'^<p>', '', text)
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize, remove stopwords, then lemmatize words
    text = " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'Question' column. questions_and_tags_df is a slice of a
# larger frame, so copy it first to avoid pandas' SettingWithCopyWarning.
questions_and_tags_df = questions_and_tags_df.copy()
questions_and_tags_df['processed_text'] = questions_and_tags_df['Question'].apply(preprocess_text)


In [49]:
questions_and_tags_df

Unnamed: 0,Question,Class,processed_text
4,<p>I don't remember whether I was dreaming or ...,[arrays],dont remember whether dreaming seem recall fun...
6,<p>I can get Python to work with Postgresql bu...,[mysql],get python work postgresql cannot get work mys...
15,<p>Python works on multiple platforms and can ...,[tkinter],python work multiple platform used desktop web...
17,<p>I have a Prolite LED sign that I like to se...,[linux],prolite led sign like set show scrolling searc...
26,<p>[I hope this isn't too obscure&hellip; I'll...,[mysql],hope isnt obscurehellip ill ask newsgroup nobo...
...,...,...,...
195004,<p>I'm running a program which is processing 3...,[csv],im running program processing 30000 similar fi...
195009,<p>How do you use Scrapy to scrape web request...,[json],use scrapy scrape web request return json exam...
195015,<p>I would like to ask some guidelines on a sm...,[google-app-engine],would like ask guideline small task trying sol...
195051,<p>I have this simple program made with Python...,[tkinter],simple program made python code275code basical...
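
The `processed_text` column above still contains fused tokens such as `code275code` and `obscurehellip`, because `preprocess_text` removes only a leading `<p>` before deleting punctuation. One possible refinement, sketched here with the standard library only (the helper name `strip_html` is hypothetical and not part of the notebook), is to replace tags with spaces and decode entities before the punctuation step. A regular expression is a rough substitute for a real HTML parser, so treat this as an approximation.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Replace HTML tags with spaces and decode entities such as &hellip;."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(decoded.split())

sample = "<p>I hope this isn't too obscure&hellip; see <code>275</code></p>"
print(strip_html(sample))
```

Tags are removed before entities are decoded so that escaped code such as `a &lt; b` is not mistaken for markup.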


In [50]:
from collections import Counter
from itertools import chain
from torch.nn.utils.rnn import pad_sequence
import torch

# Set threshold for minimum word frequency
min_freq = 5  # or a suitable value based on your data
max_length = 50  # Maximum sequence length

# Step 1: Tokenize and Build Vocabulary with a Frequency Filter
tokenized_texts = questions_and_tags_df['processed_text'].apply(lambda x: x.split())
word_counts = Counter(chain(*tokenized_texts))

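# Build vocabulary with words meeting the min frequency requirement.
# Note: idx comes from enumerating ALL distinct words, so the kept indices are not
# contiguous and some exceed len(vocab); a later cell remaps those to <UNK>.
vocab = {word: idx + 2 for idx, (word, count) in enumerate(word_counts.items()) if count >= min_freq}  # Start at 2
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1  # Unknown token for rare words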

# Step 2: Encode Texts with Unknown Token Handling
# (copy first so the column assignments below do not trigger SettingWithCopyWarning)
questions_and_tags_df = questions_and_tags_df.copy()
questions_and_tags_df['encoded_text'] = tokenized_texts.apply(
    lambda x: [vocab.get(word, vocab['<UNK>']) for word in x]  # Use <UNK> for words not in vocab
)

# Step 3: Pad or Truncate Sequences to exactly max_length ids
# (a negative repeat count yields an empty list, so long sequences are only truncated)
questions_and_tags_df['padded_text'] = questions_and_tags_df['encoded_text'].apply(
    lambda x: x[:max_length] + [vocab['<PAD>']] * (max_length - len(x))
)

# Convert to tensor
x = torch.tensor(questions_and_tags_df['padded_text'].tolist())


In [51]:
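# Largest index in x before the check
max_value = x.max().item()

# Any index >= len(vocab) would be out of range for an embedding table of size len(vocab),
# so map those positions to <UNK>
invalid_indices = x >= len(vocab)
if invalid_indices.any():
    print(f"Found {invalid_indices.sum().item()} invalid indices, setting them to '<UNK>' token.")
    x[invalid_indices] = vocab['<UNK>']

# Largest index after the remapping
max_value = x.max().item()
max_value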


Found 58674 invalid indices, setting them to '<UNK>' token.


26056

In [52]:
len(vocab)

26072
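
The invalid indices reported above come from how the vocabulary is built: `enumerate` runs over every distinct word, but only words with `count >= min_freq` are kept, so the surviving indices are not contiguous and some are larger than `len(vocab)`. A minimal sketch of an alternative, on an invented token list, filters first and enumerates afterwards so that every kept word receives an index below the vocabulary size and no remapping is needed.

```python
from collections import Counter

# Hypothetical tokenized texts.
tokenized_texts = [
    ["read", "csv", "file"],
    ["parse", "json", "file"],
    ["read", "json", "string"],
]
min_freq = 2

word_counts = Counter(word for tokens in tokenized_texts for word in tokens)

# Reserve 0 for padding and 1 for unknown words, then number the kept words.
vocab = {"<PAD>": 0, "<UNK>": 1}
kept_words = [word for word, count in word_counts.items() if count >= min_freq]
for index, word in enumerate(kept_words, start=len(vocab)):
    vocab[word] = index

# Words below the frequency threshold fall back to <UNK>.
encoded = [[vocab.get(word, vocab["<UNK>"]) for word in tokens]
           for tokens in tokenized_texts]

print(vocab)
print(encoded)
```

With contiguous indices, rare words are the only tokens mapped to `<UNK>`; in the notebook's version, frequent words whose original index happens to exceed `len(vocab)` are also collapsed into `<UNK>`.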

In [53]:



y = questions_and_tags_df['Class']

In [54]:

y = y.apply(lambda x: x[0])
y = y.apply(lambda x: top_100_tags.index(x))
y = y.to_numpy()

np.mean(y)

3.95012

In [55]:
# Prepare the answers frame. Note that processed_text is a plain copy of Body_answer here;
# preprocess_text is not applied to the answers.
answers_and_questions_df = answers_and_questions_df.copy()  # avoid SettingWithCopyWarning
answers_and_questions_df["processed_text"] = answers_and_questions_df["Body_answer"]

# Map each answer's single remaining tag to its class index
answers_and_questions_df["Class"] = answers_and_questions_df["Tags_question"].apply(lambda tags: tags[0])
answers_and_questions_df["Class"] = answers_and_questions_df["Class"].apply(lambda x: top_100_tags.index(x))

In [56]:
answers_and_questions_df

Unnamed: 0,Body_answer,Tags_question,processed_text,Class
17,"<p>No, you were not dreaming. Python has a pr...",[arrays],"<p>No, you were not dreaming. Python has a pr...",5
18,<p>I think:</p>\r\n\r\n<pre><code>#!/bin/pytho...,[arrays],<p>I think:</p>\r\n\r\n<pre><code>#!/bin/pytho...,5
19,<p>Are you looking to get a list of objects th...,[arrays],<p>Are you looking to get a list of objects th...,5
20,<p>What I was thinking of can be achieved usin...,[arrays],<p>What I was thinking of can be achieved usin...,5
21,<p>you could always write one yourself:</p>\n\...,[arrays],<p>you could always write one yourself:</p>\n\...,5
...,...,...,...,...
208787,<p>Let's assume your words are bounded at a re...,[string],<p>Let's assume your words are bounded at a re...,1
208788,<p>There are specialized index structures for ...,[string],<p>There are specialized index structures for ...,1
208789,<p>A very fast way to search for a lot of stri...,[string],<p>A very fast way to search for a lot of stri...,1
208802,<p>You can rely on neither of these two proper...,[string],<p>You can rely on neither of these two proper...,1


In [57]:
# x is already a tensor (built from padded_text above), so only y needs converting
y = torch.tensor(y)


In [58]:
x.shape, y.shape

(torch.Size([25000, 50]), torch.Size([25000]))
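
The second dimension of `x` is 50 because every encoded question is cut or padded to `max_length`. A small stand-alone sketch of that step follows; the helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

short_example = pad_or_truncate([5, 8, 13], max_length=5)
long_example = pad_or_truncate([5, 8, 13, 21, 34, 55], max_length=5)
print(short_example)
print(long_example)
```

Padding on the right means that, for short questions, the final positions of each sequence hold only the padding id.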

In [59]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=RANDOM_SEED)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

(torch.Size([20000, 50]),
 torch.Size([5000, 50]),
 torch.Size([20000]),
 torch.Size([5000]))
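
The 20000/5000 shapes correspond to `test_size=0.2` on 25000 rows. As an illustration only (this is not scikit-learn's implementation, and `split_indices` is a hypothetical helper), a seeded shuffle-and-cut split can be sketched as:

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(25000, test_size=0.2, seed=7)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible across runs. Since the classes in this notebook are unevenly sized, passing `stratify=y` to `train_test_split` may be worth considering so that both sets keep similar class proportions.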

In [60]:
train_dataset = torch.utils.data.TensorDataset(x_train, y_train)
test_dataset = torch.utils.data.TensorDataset(x_test, y_test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
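
With 20000 training rows and `batch_size=32`, each epoch iterates over 625 batches. The sketch below shows the batching idea in plain Python; the generator `iter_batches` is hypothetical, and `DataLoader` does considerably more (tensor collation, optional worker processes).

```python
import random

def iter_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of row indices, one list per mini-batch."""
    order = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

batches = list(iter_batches(20000, batch_size=32, shuffle=True))
print(len(batches), len(batches[0]))
```

Shuffling is applied to the training data only; the test loader keeps its order, which makes evaluation results easier to compare between runs.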


### 4.2.3 Head

## 4.3 Data Visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line removes the target column from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>

#### 4.5.1.3 Feature Scaling / Standardization / Z-score Normalization

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line standardizes the features in `X` by subtracting the mean of each feature and dividing by the standard deviation of that feature. This transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
Standardization matters for algorithms that rely on distance calculations (such as K-Nearest Neighbors or SVMs) and for models trained with gradient descent (such as neural networks). Without it, features with larger scales dominate the distance calculation or the gradient updates, which can bias the model or slow convergence. After standardizing, all features are on a comparable scale regardless of their original units.
</details>
</details>
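
The transformation described above can be written out directly. This is a minimal sketch on made-up numbers using the population standard deviation, which is also what scikit-learn's `StandardScaler` uses; the function name `standardize` and the sample values are assumptions for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z_scores = standardize(heights_cm)
print([round(z, 3) for z in z_scores])
```

A constant feature has a standard deviation of zero, so a production version would need to guard against division by zero. In a train/test setting, the mean and standard deviation should be computed on the training data only and reused for the test data.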

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

In [61]:
# Sklearn train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1-SPLITRATIO), random_state=RANDOM_SEED)

---

# 5. Model Development

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

In [62]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationModel(nn.Module):
    def __init__(self, num_tags, vocab_size, embedding_dim=100):
        super(ClassificationModel, self).__init__()
        
        # Embedding layer with randomly initialized embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)  # Set padding_idx to 0 to ignore padding token
        
        # LSTM layer
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=512, num_layers=1, batch_first=True, bidirectional=True)
        
        # Fully connected layers
        self.fc1 = nn.Linear(512 * 2, 512)
        self.fc2 = nn.Linear(512, num_tags)
    
    def forward(self, input):
        # Pass input through embedding layer
        x = self.embedding(input)  # Shape: [batch_size, sequence_length, embedding_dim]
        
        # LSTM layer
        output, _ = self.lstm(x)  # output shape: [batch_size, sequence_length, hidden_size * 2]
        
        # Extract the output at the last time step
        x = self.fc1(output[:, -1, :])  # Shape: [batch_size, 512]
        
        # Apply ReLU activation
        x = F.relu(x)
        
        # Pass through the final layer to get class scores
        x = self.fc2(x)
        
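        # NOTE: this task is single-label, and nn.CrossEntropyLoss (used for training)
        # expects raw logits. The sigmoid bounds the scores to (0, 1); returning x
        # without it is the usual choice with that loss.
        x = torch.sigmoid(x)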
        
        return x
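
Because `forward` ends with a sigmoid and training uses `nn.CrossEntropyLoss`, which applies its own log-softmax to whatever it receives, the scores entering the softmax are confined to (0, 1). The worked calculation below (plain Python, no PyTorch needed) shows what that implies for 10 classes: even a perfect prediction cannot push the per-sample loss below roughly 1.46, while identical scores for all classes give ln(10), roughly 2.30.

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy for one sample: log-sum-exp minus the target score."""
    log_sum_exp = math.log(sum(math.exp(s) for s in scores))
    return log_sum_exp - scores[target]

num_classes = 10

# Best case after a sigmoid: target score near 1, all other scores near 0.
best_bounded = cross_entropy([1.0] + [0.0] * (num_classes - 1), target=0)

# Uninformative case: every class gets the same score.
uniform = cross_entropy([0.5] * num_classes, target=0)

# With raw logits the target score is unbounded, so the loss can approach 0.
raw_logits = cross_entropy([12.0] + [0.0] * (num_classes - 1), target=0)

print(round(best_bounded, 4), round(uniform, 4), round(raw_logits, 6))
```

Returning the output of `self.fc2` directly, without the sigmoid, is the usual arrangement with `nn.CrossEntropyLoss`; `torch.max` over raw logits selects the same predicted class as it would over softmax probabilities.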


## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

In [63]:
# Define the loss function and optimizer
import torch.optim as optim
vocab_size = len(vocab)  # Size of the vocabulary
embedding_dim = 100      # Dimension of the embedding vectors
num_tags = len(top_100_tags)  # Number of tags

model = ClassificationModel(num_tags, vocab_size, embedding_dim)




criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0003)

model

ClassificationModel(
  (embedding): Embedding(26072, 100, padding_idx=0)
  (lstm): LSTM(100, 512, batch_first=True, bidirectional=True)
  (fc1): Linear(in_features=1024, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=10, bias=True)
)

In [64]:
vocab_size

26072

In [65]:
# Train the model
num_epochs = 20

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    
    print(f"Epoch {epoch + 1}, Loss: {running_loss / len(train_loader)}")
           
    
print("Finished Training")


Epoch 1, Loss: 2.200465757751465
Epoch 2, Loss: 2.1753180652618407
Epoch 3, Loss: 2.1521447032928465
Epoch 4, Loss: 2.1160340604782104
Epoch 5, Loss: 2.0206469820022583
Epoch 6, Loss: 1.9620017696380616
Epoch 7, Loss: 1.9150386213302613
Epoch 8, Loss: 1.8536129451751708
Epoch 9, Loss: 1.8152563228607177
Epoch 10, Loss: 1.7970488634109496
Epoch 11, Loss: 1.7840364030838012
Epoch 12, Loss: 1.7746731422424316
Epoch 13, Loss: 1.7652589473724365
Epoch 14, Loss: 1.76026398563385
Epoch 15, Loss: 1.7547171394348144
Epoch 16, Loss: 1.6992574754714966
Epoch 17, Loss: 1.6821757265090942
Epoch 18, Loss: 1.6717980459213257
Epoch 19, Loss: 1.665675400352478
Epoch 20, Loss: 1.6534924655914307
Finished Training


In [66]:
# Evaluate the model on the test set
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

model.eval()  # switch off training-only behavior such as dropout

y_pred = []
y_true = []

with torch.no_grad():  # no gradients are needed for evaluation
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the highest-scoring class
        y_pred.extend(predicted.tolist())
        y_true.extend(labels.tolist())

# Macro averages weight every class equally, regardless of its support
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.83      0.87       597
           1       0.63      0.85      0.72       726
           2       0.55      0.41      0.47       342
           3       0.65      0.87      0.74      1166
           4       0.00      0.00      0.00       295
           5       0.65      0.77      0.70       297
           6       0.00      0.00      0.00       276
           7       0.76      0.80      0.78       460
           8       0.62      0.67      0.64       494
           9       0.70      0.55      0.62       347

    accuracy                           0.68      5000
   macro avg       0.55      0.58      0.55      5000
weighted avg       0.61      0.68      0.63      5000
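
Classes 4 and 6 score 0.00 on every metric, which is consistent with the model never predicting those labels at all. A minimal sketch for checking that with the standard library only; the helper name `never_predicted` and the toy label lists are illustrative assumptions, not part of the notebook:

```python
from collections import Counter


def never_predicted(y_true, y_pred):
    """Return the labels that occur in y_true but never in y_pred, sorted."""
    predicted_counts = Counter(y_pred)
    return sorted(label for label in set(y_true) if predicted_counts[label] == 0)


# Toy data (hypothetical): label 2 is in the ground truth but is never predicted.
toy_true = [0, 1, 2, 2, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]
print(never_predicted(toy_true, toy_pred))  # → [2]
```

Calling `never_predicted(y_true, y_pred)` on the lists built in the evaluation cell would show whether labels 4 and 6 fall into this case.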





In [67]:
# Create a chatbot which receives a question and returns the predicted tag

def predict_tag(question):
    """Return the index (into top_100_tags) of the tag predicted for a question."""
    # Preprocess the question
    processed_text = preprocess_text(question)

    # Encode each token as a vocabulary id, then truncate or pad to max_length
    encoded_text = [vocab.get(word, vocab['<UNK>']) for word in processed_text.split()]
    padded_text = encoded_text[:max_length] + [vocab['<PAD>']] * (max_length - len(encoded_text))
    x = torch.tensor(padded_text).unsqueeze(0)  # Add batch dimension

    # Get the model's prediction
    model.eval()  # make sure dropout and similar layers are in inference mode
    with torch.no_grad():
        output = model(x)
        _, predicted = torch.max(output, 1)

    return predicted.item()
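
The encode, truncate, and pad steps inside `predict_tag` can be tried in isolation. A small sketch with a made-up vocabulary: `encode_and_pad`, `toy_vocab`, and the `max_length` of 5 are assumptions for illustration, while the `<UNK>` and `<PAD>` keys mirror the ones used above.

```python
def encode_and_pad(text, vocab, max_length):
    """Map whitespace-separated tokens to ids, then truncate or right-pad to max_length."""
    ids = [vocab.get(word, vocab["<UNK>"]) for word in text.split()]
    ids = ids[:max_length]  # truncate inputs that are too long
    # Right-pad short inputs; for full-length inputs the repeat count is 0.
    return ids + [vocab["<PAD>"]] * (max_length - len(ids))


# Hypothetical vocabulary, much smaller than the notebook's.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "list": 2, "files": 3}

print(encode_and_pad("list files quickly", toy_vocab, 5))
# → [2, 3, 1, 0, 0]
print(encode_and_pad("list list files files list files", toy_vocab, 5))
# → [2, 2, 3, 3, 2]
```

Unknown words map to the `<UNK>` id, so every input becomes a fixed-length list of integers that can be turned into a tensor.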

In [71]:

# Test the chatbot
question = "linux command to list files in a directory"
tag = predict_tag(question)
print(f"Predicted tag: {top_100_tags[tag]}")

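# Get the first stored answer whose question has the predicted tag as its first tag
predicted_tag = top_100_tags[tag]
matches = answers_and_questions_df["Tags_question"].apply(lambda tags: tags[0] == predicted_tag)
answer = answers_and_questions_df[matches]["Body_answer"].iloc[0]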
print(answer)

Predicted tag: linux
<p>have you tried watching the traffic between the GUI and the serial port to see if there is some kind of special command being sent across?  Also just curious, Python is sending ASCII and not UTF-8 or something else right?  The reason I ask is because I noticed your quote changes for the strings and in some languages that actually is the difference between ASCII and UTF-8.</p>



In [None]:
while True:
    question = input("Enter a question (leave empty to quit): ")
    if not question.strip():
        break  # stop the loop instead of prompting forever
    tag = predict_tag(question)
    predicted_tag = top_100_tags[tag]
    print(f"Predicted tag: {predicted_tag}")
    # Pick one random answer among those whose question has the predicted tag first
    matches = answers_and_questions_df["Tags_question"].apply(lambda tags: tags[0] == predicted_tag)
    answer = answers_and_questions_df[matches]["Body_answer"].sample(n=1).iloc[0]
    print(answer)

Predicted tag: linux
<p>Fewer characters and guaranteed to work:</p>

<pre><code>sh -c 'echo $PPID'
</code></pre>

Predicted tag: linux
<p>Various widget toolkits (GTK+, Qt, etc.) can run on <a href="http://directfb.org/" rel="nofollow">DirectFB</a> instead of X11, which will allow you to have a GUI running on the Linux framebuffer device instead of requiring a full X server.</p>

Predicted tag: linux
<p><a href="http://manpages.debian.net/cgi-bin/man.cgi?query=udevadm&amp;sektion=8" rel="nofollow"><code>udevadm monitor</code></a> (the udev administration binary) or <a href="http://www.kernel.org/pub/linux/utils/kernel/hotplug/libudev/libudev-udev-monitor.html" rel="nofollow"><code>udev_monitor</code></a> (in libudev).</p>

<p>Alternately, if you're running in X11 with input hotplugging, you can listen for the XI extension event <code>DevicePresenceNotify</code>.</p>
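
The answer lookup above would raise an error if no stored question had the predicted tag as its first tag. One way to guard against that, sketched in plain Python over a list of records rather than the notebook's DataFrame; `pick_answer`, `toy_records`, and the record layout are hypothetical, with only the `Tags_question` and `Body_answer` field names taken from the notebook:

```python
import random


def pick_answer(records, tag, rng=random):
    """Return a random answer whose question's first tag equals `tag`, or None."""
    candidates = [
        record["Body_answer"]
        for record in records
        if record["Tags_question"] and record["Tags_question"][0] == tag
    ]
    return rng.choice(candidates) if candidates else None


# Hypothetical records standing in for rows of the answers/questions table.
toy_records = [
    {"Tags_question": ["linux", "bash"], "Body_answer": "Use ls."},
    {"Tags_question": ["python"], "Body_answer": "Use os.listdir."},
]

print(pick_answer(toy_records, "linux"))  # → Use ls.
print(pick_answer(toy_records, "java"))   # → None
```

Returning `None` lets the caller print a fallback message instead of failing when a tag has no answers.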



## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->