<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

In [1]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

In [2]:
paths = get_paths()

In [3]:
RANDOM_SEED = 7

In [4]:
SPLITRATIO = 0.8

---

## 4.1 Data Loading
<!--
- Load datasets from files or other sources.
-->

In [5]:
questions_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Questions.csv", delimiter=",", encoding="latin-1")
tags_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Tags.csv", delimiter=",", encoding="latin-1")
answers_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Answers.csv", delimiter=",", encoding="latin-1")

### 4.2.1 Info

In [6]:
print(tags_df.info())
print(questions_df.info())
print(answers_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885078 entries, 0 to 1885077
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 28.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607282 entries, 0 to 607281
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            607282 non-null  int64  
 1   OwnerUserId   601070 non-null  float64
 2   CreationDate  607282 non-null  object 
 3   Score         607282 non-null  int64  
 4   Title         607282 non-null  object 
 5   Body          607282 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 27.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987122 entries, 0 to 987121
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            98712

In [7]:
answers_df

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."
...,...,...,...,...,...,...
987117,40143290,3831.0,2016-10-19T23:46:58Z,40142906,0,<p>I am fairly certain your problem is your us...
987118,40143315,3125566.0,2016-10-19T23:49:43Z,40143166,2,"<p>First thing, you should use <code>if/elif</..."
987119,40143317,2350575.0,2016-10-19T23:50:04Z,40142194,0,<p>If you are using firefox ver >47.0.1 you ne...
987120,40143349,6934347.0,2016-10-19T23:54:02Z,40077010,0,<p>I solved my own problem defining the follow...


### 4.2.2 Describe

In [8]:
# Compute the unique tags once and reuse the list for the count.
unique_tags = list(tags_df["Tag"].unique())
num_tags = len(unique_tags)

In [9]:
tags_grouped = tags_df.groupby('Id')['Tag'].apply(list).reset_index(name='Tags')
questions_and_tags_df = questions_df.merge(tags_grouped,on="Id")

In [10]:
questions_and_tags_df = questions_and_tags_df.iloc[:10000]

In [11]:
questions_body = questions_and_tags_df["Body"]
questions_title = questions_and_tags_df["Title"]
tagsToPredict = list(questions_and_tags_df["Tags"])

# Multi-hot targets: one row per question, one column per tag.
# (Assigning tagsOneHot[index] = 1 would set an entire row to 1 for every question.)
tag_to_index = {tag: idx for idx, tag in enumerate(unique_tags)}
tagsOneHot = np.zeros((len(tagsToPredict), num_tags), dtype=np.float32)
for row, tags in enumerate(tagsToPredict):
    for tag in tags:
        tagsOneHot[row, tag_to_index[tag]] = 1

# Combine title and body into one lowercased string per question.
questionsData = []
for title, body in zip(questions_title, questions_body):
    postText = "Title: " + title + "\nBody: " + body
    questionsData.append(postText.lower())
questionsData

["title: how can i find the full path to a font from its display name on a mac?\nbody: <p>i am using the photoshop's javascript api to find the fonts in a given psd.</p>\n\n<p>given a font name returned by the api, i want to find the actual physical font file that that font name corresponds to on the disc.</p>\n\n<p>this is all happening in a python program running on osx so i guess i'm looking for one of:</p>\n\n<ul>\n<li>some photoshop javascript</li>\n<li>a python function</li>\n<li>an osx api that i can call from python</li>\n</ul>\n",
 'title: get a preview jpeg of a pdf on windows?\nbody: <p>i have a cross-platform (python) application which needs to generate a jpeg preview of the first page of a pdf.</p>\n\n<p>on the mac i am spawning <a href="http://developer.apple.com/documentation/darwin/reference/manpages/man1/sips.1.html">sips</a>.  is there something similarly simple i can do on windows?</p>\n',
 "title: continuous integration system for a python codebase\nbody: <p>i'm sta

In [12]:
import nltk

tokenizedQuestionsData = [nltk.word_tokenize(post) for post in questionsData]

In [16]:
tokenizedQuestionsData

[['title',
  ':',
  'how',
  'can',
  'i',
  'find',
  'the',
  'full',
  'path',
  'to',
  'a',
  'font',
  'from',
  'its',
  'display',
  'name',
  'on',
  'a',
  'mac',
  '?',
  'body',
  ':',
  '<',
  'p',
  '>',
  'i',
  'am',
  'using',
  'the',
  'photoshop',
  "'s",
  'javascript',
  'api',
  'to',
  'find',
  'the',
  'fonts',
  'in',
  'a',
  'given',
  'psd.',
  '<',
  '/p',
  '>',
  '<',
  'p',
  '>',
  'given',
  'a',
  'font',
  'name',
  'returned',
  'by',
  'the',
  'api',
  ',',
  'i',
  'want',
  'to',
  'find',
  'the',
  'actual',
  'physical',
  'font',
  'file',
  'that',
  'that',
  'font',
  'name',
  'corresponds',
  'to',
  'on',
  'the',
  'disc.',
  '<',
  '/p',
  '>',
  '<',
  'p',
  '>',
  'this',
  'is',
  'all',
  'happening',
  'in',
  'a',
  'python',
  'program',
  'running',
  'on',
  'osx',
  'so',
  'i',
  'guess',
  'i',
  "'m",
  'looking',
  'for',
  'one',
  'of',
  ':',
  '<',
  '/p',
  '>',
  '<',
  'ul',
  '>',
  '<',
  'li',
  '>',
  'som
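
The token list above still contains HTML markup split into pieces such as `<`, `p`, and `>`, which adds noise to the vocabulary. A minimal sketch of one way to clean a post before tokenizing, assuming a regex-based tag stripper is good enough for these post bodies (the helper name `strip_html` is hypothetical, and a real HTML parser would be more robust, for example when the text itself contains a literal `<`):

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")   # matches anything that looks like an HTML tag
SPACE_PATTERN = re.compile(r"\s+")     # runs of whitespace, including newlines

def strip_html(text):
    """Remove HTML tags, decode entities such as &gt;, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # Decode entities only after removing tags, so a decoded '>' is never mistaken for markup.
    decoded = html.unescape(without_tags)
    return SPACE_PATTERN.sub(" ", decoded).strip()

sample = "title: get a preview jpeg of a pdf on windows?\nbody: <p>i have a <code>pdf</code> file.</p>\n"
cleaned = strip_html(sample)
print(cleaned)
```

In this notebook, that would amount to calling `strip_html(post)` on each entry of `questionsData` before passing it to `nltk.word_tokenize`.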

### 4.2.3 Head

## 4.3 Data Visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line removes the target column from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>
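
As a sketch of the separation described above, assuming a DataFrame `df` with a hypothetical target column named `label` (neither the column name nor the toy data come from this notebook):

```python
import pandas as pd

# Toy data for illustration only; "label" is an assumed target column name.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "label": [0, 1, 0],
})

TARGET_COLUMN = "label"

X = df.drop(columns=[TARGET_COLUMN])  # input features: every column except the target
y = df[TARGET_COLUMN]                 # target variable

print(list(X.columns))
print(y.tolist())
```

`drop` returns a new DataFrame, so `df` itself keeps the target column and can still be inspected afterwards.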

#### 4.5.1.3 Feature Scaling / Standardization / Z-score Normalization

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line standardizes the features in `X` by subtracting the mean of each feature and dividing by the standard deviation of that feature. This transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
Standardization matters for algorithms that are sensitive to feature scale: distance-based methods such as K-Nearest Neighbors and SVMs, and gradient-trained models such as neural networks. Without it, features with larger numeric ranges dominate the distance calculation (or the gradient updates), so small-scale features are effectively ignored. After standardizing, every feature contributes on a comparable scale, regardless of its original units.
</details>
</details>
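
The z-score transformation described above can be sketched with NumPy as follows. The toy matrix `X` is an assumption for illustration, not data from this notebook:

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

mean = X.mean(axis=0)   # per-feature mean
std = X.std(axis=0)     # per-feature (population) standard deviation
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

In practice the mean and standard deviation would be computed on the training split only and reused for the validation and test splits. scikit-learn's `StandardScaler` applies the same formula through `fit` and `transform`. A constant feature has a standard deviation of 0 and would need special handling before the division.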

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

In [13]:
# Sklearn train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1-SPLITRATIO), random_state=RANDOM_SEED)

---

# 5. Model Development

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

In [14]:
import torch
import torch.nn as nn 
import torch.nn.functional as F

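class classification_model(nn.Module):
    # vocab_size is an assumed placeholder: set it to the size of the token
    # vocabulary once one is built (the notebook does not build one yet).
    def __init__(self, vocab_size=50_000, embedding_dim=128, hidden_size=512):
        super().__init__()
        # The embedding looks up tokens, so it is sized by the vocabulary, not by num_tags.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # input_size must match the embedding dimension.
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=3, batch_first=True, bidirectional=True)

        # A bidirectional LSTM concatenates both directions: 2 * hidden_size features.
        self.fc1 = nn.Linear(2 * hidden_size, 512)
        self.fc2 = nn.Linear(512, num_tags)

    def forward(self, input):
        # input: LongTensor of token indices with shape (batch, seq_len)
        e = self.embedding(input)
        output, hidden = self.lstm(e)

        x = self.fc1(output[:, -1, :])
        x = F.relu(x)

        x = self.fc2(x)
        x = torch.sigmoid(x)  # independent per-tag probabilities (multi-label)

        return x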

## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

In [15]:
mod1 = classification_model()
mod1.forward(questionsData)

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not list
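
The `TypeError` above is raised because `nn.Embedding` looks up rows by integer index, so it needs a tensor of token ids rather than a list of strings. A minimal sketch of the missing encoding step, assuming a simple vocabulary built from the tokenized posts with reserved ids for padding and unknown tokens (the names `build_vocab`, `encode`, `PAD_ID`, `UNK_ID`, and `max_len` are hypothetical, not part of the notebook):

```python
from collections import Counter

PAD_ID = 0  # assumed id used to pad short posts
UNK_ID = 1  # assumed id for tokens missing from the vocabulary

def build_vocab(tokenized_posts, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(token for post in tokenized_posts for token in post)
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for token, count in counts.most_common():
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Convert tokens to ids, then truncate or right-pad to exactly max_len."""
    ids = [vocab.get(token, UNK_ID) for token in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))

# Toy stand-in for tokenizedQuestionsData.
tokenized_posts = [["title", ":", "how", "can", "i"], ["title", ":", "get"]]
vocab = build_vocab(tokenized_posts)
batch = [encode(post, vocab, max_len=4) for post in tokenized_posts]
print(batch)
```

The resulting lists can be turned into model input with `torch.tensor(batch, dtype=torch.long)`, and the embedding layer then needs `num_embeddings=len(vocab)`. Calling the model as `mod1(inputs)` rather than `mod1.forward(inputs)` is the usual PyTorch convention.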

## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->