<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

In [1]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

In [2]:
paths = get_paths()

In [3]:
RANDOM_SEED = 7

In [4]:
SPLITRATIO = 0.8

---

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [5]:
questions_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Questions.csv", delimiter=",", encoding="latin-1")
tags_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Tags.csv", delimiter=",", encoding="latin-1")
answers_df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pythonQuestions/Answers.csv", delimiter=",", encoding="latin-1")

### 4.2.1 Info

In [6]:
# DataFrame.info() prints its summary and returns None, so no print() is needed
tags_df.info()
questions_df.info()
answers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885078 entries, 0 to 1885077
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 28.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607282 entries, 0 to 607281
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            607282 non-null  int64  
 1   OwnerUserId   601070 non-null  float64
 2   CreationDate  607282 non-null  object 
 3   Score         607282 non-null  int64  
 4   Title         607282 non-null  object 
 5   Body          607282 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 27.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987122 entries, 0 to 987121
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            98712

In [7]:
# Merge the tags and questions dataframes



### 4.2.2 Describe

In [8]:
# Compute the unique tags once and derive the count from that list
unique_tags = list(tags_df["Tag"].unique())
num_tags = len(unique_tags)

In [9]:
num_tags

16896

In [10]:
answers_df

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."
...,...,...,...,...,...,...
987117,40143290,3831.0,2016-10-19T23:46:58Z,40142906,0,<p>I am fairly certain your problem is your us...
987118,40143315,3125566.0,2016-10-19T23:49:43Z,40143166,2,"<p>First thing, you should use <code>if/elif</..."
987119,40143317,2350575.0,2016-10-19T23:50:04Z,40142194,0,<p>If you are using firefox ver >47.0.1 you ne...
987120,40143349,6934347.0,2016-10-19T23:54:02Z,40077010,0,<p>I solved my own problem defining the follow...


In [11]:
tags_grouped = tags_df.groupby('Id')['Tag'].apply(list).reset_index(name='Tags')
questions_and_tags_df = questions_df.merge(tags_grouped,on="Id")



In [12]:

# find the most common tags
tag_count = {}
for tags in questions_and_tags_df["Tags"]:
    for tag in tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
            



In [13]:
# Sort the tags by count, most frequent first
sorted_tags = sorted(tag_count.items(), key=lambda x: x[1], reverse=True)

class_count = 2  # not used later in this notebook

# Skip the 10 most frequent tags and keep the next three (ranks 11-13) as classes.
# Despite its name, top_100_tags holds only these three tags.
top_100_tags = [tag for tag, count in sorted_tags[10:13]]
top_100_tags


['tkinter', 'string', 'flask']
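
The counting loop and the sort above can also be written with `collections.Counter`. The sketch below uses a small made-up list of tag lists (`toy_tags` is an assumption for illustration, not data from the notebook) and selects ranks by slicing `most_common()`, in the same spirit as the `sorted_tags[10:13]` slice.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for questions_and_tags_df["Tags"]
toy_tags = [
    ["python", "django"],
    ["python", "flask"],
    ["python", "django", "string"],
    ["python", "flask"],
    ["python", "django"],
]

# Count every tag across all questions in one pass
toy_counts = Counter(chain.from_iterable(toy_tags))

# most_common() returns (tag, count) pairs sorted by descending count
ranked = toy_counts.most_common()

# Skip the single most frequent tag and keep the next two,
# analogous to sorted_tags[10:13] on the full data
selected = [tag for tag, _ in ranked[1:3]]
print(selected)
```

Skipping the most frequent tags is a design choice: a tag such as `python` appears on almost every question here and would not separate the classes.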

In [14]:
# Keep only the selected tags (the three in top_100_tags) on each question

questions_and_tags_df["Tags"] = questions_and_tags_df["Tags"].apply(lambda tags: [tag for tag in tags if tag in top_100_tags])

In [15]:

# remove questions with no tags 
questions_and_tags_df = questions_and_tags_df[questions_and_tags_df["Tags"].apply(len) > 0]
questions_and_tags_df

# remove questions with two or more tags
questions_and_tags_df = questions_and_tags_df[questions_and_tags_df["Tags"].apply(len) == 1]
questions_and_tags_df



Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Tags
15,2933,1384652.0,2008-08-05T22:26:00Z,171,How can I create a directly-executable cross-p...,<p>Python works on multiple platforms and can ...,[tkinter]
34,13454,,2008-08-17T01:23:50Z,9,Python version of PHP's stripslashes,<p>I wrote a piece of code to convert PHP's st...,[string]
69,28165,305.0,2008-08-26T14:20:48Z,10,Does PHP have an equivalent to this type of Py...,<p>Python has this wonderful way of handling s...,[string]
101,36139,3205.0,2008-08-30T17:03:09Z,211,How do I sort a list of strings in Python?,<p>What is the best way of creating an alphabe...,[string]
149,45540,4717.0,2008-09-05T11:07:06Z,5,How to know whether a window with a given titl...,<p>Iâve writen a little python script that j...,[tkinter]
...,...,...,...,...,...,...,...
607172,40140574,7043471.0,2016-10-19T20:05:57Z,0,Problems with unsing TKinter and RPi.GPIO toge...,<p>I'm trying to make a program with Python th...,[tkinter]
607176,40140817,7044354.0,2016-10-19T20:20:50Z,0,Tkinter open with 1Button the directory of 2 B...,<p>I want to make a code to search for file na...,[tkinter]
607204,40141540,3799576.0,2016-10-19T21:06:36Z,1,Easier way to check if a string contains only ...,<p>I have a string <code>'829383&amp;&amp;*&am...,[string]
607208,40141620,1356863.0,2016-10-19T21:12:54Z,-1,Application not picking up CSS file (python/fl...,"<p>I am rendering a template, that I am attemp...",[flask]


In [16]:
# Drop Id, OwnerUserId, CreationDate, Score and Title; only Body and Tags are kept
questions_and_tags_df = questions_and_tags_df.drop(columns=["Id","OwnerUserId","CreationDate","Score","Title"])
questions_and_tags_df

Unnamed: 0,Body,Tags
15,<p>Python works on multiple platforms and can ...,[tkinter]
34,<p>I wrote a piece of code to convert PHP's st...,[string]
69,<p>Python has this wonderful way of handling s...,[string]
101,<p>What is the best way of creating an alphabe...,[string]
149,<p>Iâve writen a little python script that j...,[tkinter]
...,...,...
607172,<p>I'm trying to make a program with Python th...,[tkinter]
607176,<p>I want to make a code to search for file na...,[tkinter]
607204,<p>I have a string <code>'829383&amp;&amp;*&am...,[string]
607208,"<p>I am rendering a template, that I am attemp...",[flask]


In [17]:
questions_and_tags_df = questions_and_tags_df.rename(columns={"Body":"Question"})
questions_and_tags_df = questions_and_tags_df.rename(columns={"Tags":"Class"})
questions_and_tags_df

Unnamed: 0,Question,Class
15,<p>Python works on multiple platforms and can ...,[tkinter]
34,<p>I wrote a piece of code to convert PHP's st...,[string]
69,<p>Python has this wonderful way of handling s...,[string]
101,<p>What is the best way of creating an alphabe...,[string]
149,<p>Iâve writen a little python script that j...,[tkinter]
...,...,...
607172,<p>I'm trying to make a program with Python th...,[tkinter]
607176,<p>I want to make a code to search for file na...,[tkinter]
607204,<p>I have a string <code>'829383&amp;&amp;*&am...,[string]
607208,"<p>I am rendering a template, that I am attemp...",[flask]


In [18]:
# Select the first 25000 rows
questions_and_tags_df = questions_and_tags_df[:25000]
questions_and_tags_df

Unnamed: 0,Question,Class
15,<p>Python works on multiple platforms and can ...,[tkinter]
34,<p>I wrote a piece of code to convert PHP's st...,[string]
69,<p>Python has this wonderful way of handling s...,[string]
101,<p>What is the best way of creating an alphabe...,[string]
149,<p>Iâve writen a little python script that j...,[tkinter]
...,...,...
473000,<p>I'm writing a custom Python class that simp...,[tkinter]
473008,<p>I was finishing up a simple little user log...,[flask]
473009,<p>I am creating a Flask Application that conn...,[flask]
473010,"<p>Just as it sounds, I have a basic functiona...",[flask]


In [19]:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
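    # Remove only a leading <p> tag; any other HTML tag loses its angle
    # brackets in the punctuation step below (e.g. <em>x</em> becomes emxem)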
    text = re.sub(r'^<p>', '', text)
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize, remove stopwords, then lemmatize words
    text = " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'Question' column
questions_and_tags_df['processed_text'] = questions_and_tags_df['Question'].apply(preprocess_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_and_tags_df['processed_text'] = questions_and_tags_df['Question'].apply(preprocess_text)


In [20]:
questions_and_tags_df

Unnamed: 0,Question,Class,processed_text
15,<p>Python works on multiple platforms and can ...,[tkinter],python work multiple platform used desktop web...
34,<p>I wrote a piece of code to convert PHP's st...,[string],wrote piece code convert phps striplashes vali...
69,<p>Python has this wonderful way of handling s...,[string],python wonderful way handling string substitut...
101,<p>What is the best way of creating an alphabe...,[string],best way creating alphabetically sorted list p...
149,<p>Iâve writen a little python script that j...,[tkinter],iâve writen little python script pop message b...
...,...,...,...
473000,<p>I'm writing a custom Python class that simp...,[tkinter],im writing custom python class simplifies inte...
473008,<p>I was finishing up a simple little user log...,[flask],finishing simple little user login emflaskem e...
473009,<p>I am creating a Flask Application that conn...,[flask],creating flask application connects locallyhos...
473010,"<p>Just as it sounds, I have a basic functiona...",[flask],sound basic functional test suite two test im ...
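
The token `emflaskem` in the output above appears to come from `<em>flask</em>`: `preprocess_text` strips only a leading `<p>`, so other tags lose their angle brackets in the punctuation step and fuse with the neighbouring word. A minimal sketch of one way to avoid that, assuming regex-based tag removal is acceptable for this data (the helper name `strip_html` is hypothetical, and a real HTML parser would be more robust):

```python
import html
import re

def strip_html(text: str) -> str:
    """Replace HTML tags with spaces and decode entities such as &gt;."""
    # Swap each tag for a space so adjacent words do not fuse together
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Turn entities like &amp; or &gt; back into the characters they encode
    decoded = html.unescape(no_tags)
    # Collapse runs of whitespace left behind by removed tags
    return re.sub(r"\s+", " ", decoded).strip()

sample = "<p>A simple login with <em>Flask</em> &amp; sessions</p>"
print(strip_html(sample))
```

Such a helper could run before lowercasing and punctuation removal, so that the rest of the preprocessing sees plain text.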


In [21]:
from collections import Counter
from itertools import chain
from torch.nn.utils.rnn import pad_sequence
import torch

# Set threshold for minimum word frequency
min_freq = 5  # or a suitable value based on your data
max_length = 50  # Maximum sequence length

# Step 1: Tokenize and Build Vocabulary with a Frequency Filter
tokenized_texts = questions_and_tags_df['processed_text'].apply(lambda x: x.split())
word_counts = Counter(chain(*tokenized_texts))

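# Build vocabulary with words meeting the min frequency requirement.
# Note: idx counts every word in word_counts, including the ones filtered out,
# so the assigned indices have gaps and can be >= len(vocab).
vocab = {word: idx + 2 for idx, (word, count) in enumerate(word_counts.items()) if count >= min_freq}  # Start at 2
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1  # Unknown token for rare words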

# Step 2: Encode Texts with Unknown Token Handling
questions_and_tags_df['encoded_text'] = tokenized_texts.apply(
    lambda x: [vocab.get(word, vocab['<UNK>']) for word in x]  # Use <UNK> for words not in vocab
)

# Step 3: Pad or Truncate Sequences to exactly max_length ids
questions_and_tags_df['padded_text'] = questions_and_tags_df['encoded_text'].apply(
    lambda x: (x + [vocab['<PAD>']] * max_length)[:max_length]
)

# Convert to tensor
x = torch.tensor(questions_and_tags_df['padded_text'].tolist())


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_and_tags_df['encoded_text'] = tokenized_texts.apply(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_and_tags_df['padded_text'] = questions_and_tags_df['encoded_text'].apply(
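
Step 3 pads short sequences and truncates long ones to a fixed length. A minimal sketch of that logic on plain lists (the helper name `pad_or_truncate` and the sample ids are hypothetical):

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

short = pad_or_truncate([5, 8, 13], max_length=5)
long = pad_or_truncate([5, 8, 13, 21, 34, 55], max_length=5)
print(short)
print(long)
```

Padding on the right means the last time steps of short sequences hold the padding id, which matters for any model that reads the final position of the sequence.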


In [22]:
# find max value in x
max_value = x.max().item()
max_value

invalid_indices = x >= len(vocab)
if invalid_indices.any():
    print(f"Found {invalid_indices.sum().item()} invalid indices, setting them to '<UNK>' token.")
    x[invalid_indices] = vocab['<UNK>']

max_value = x.max().item()
max_value


Found 66801 invalid indices, setting them to '<UNK>' token.


29120

In [23]:
len(vocab)

29128
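
The 66801 invalid indices reported above are consistent with how `vocab` is built: `enumerate` runs over every word in `word_counts`, so words that pass the frequency filter keep the position they had before filtering, and some indices end up at or above `len(vocab)`. A minimal sketch of a builder that assigns contiguous indices, shown on a toy corpus (the names `build_vocab` and `toy_texts` are hypothetical):

```python
from collections import Counter
from itertools import chain

def build_vocab(tokenized_texts, min_freq):
    """Map each sufficiently frequent word to a contiguous integer id."""
    counts = Counter(chain.from_iterable(tokenized_texts))
    # Reserve 0 for padding and 1 for unknown words
    vocab = {"<PAD>": 0, "<UNK>": 1}
    # Filter first, then number the survivors, so ids have no gaps
    kept = [word for word, count in counts.items() if count >= min_freq]
    for idx, word in enumerate(kept, start=2):
        vocab[word] = idx
    return vocab

toy_texts = [
    ["flask", "route", "error"],
    ["tkinter", "button", "error"],
    ["flask", "template", "error"],
]
toy_vocab = build_vocab(toy_texts, min_freq=2)
print(toy_vocab)

# Encode with <UNK> as the fallback for rare words
encoded = [[toy_vocab.get(w, toy_vocab["<UNK>"]) for w in text] for text in toy_texts]
print(encoded)
```

With contiguous ids every index is below `len(vocab)`, so frequent words no longer need to be remapped to `<UNK>` before they reach the embedding layer.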

In [24]:



y = questions_and_tags_df['Class']

In [25]:

y = y.apply(lambda x: x[0])
y = y.apply(lambda x: top_100_tags.index(x))
y = y.to_numpy()

np.mean(y)

0.98476
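
`np.mean(y)` averages the label indices, which says little about how many examples each class has. A small sketch of a per-class count instead, using a made-up label list (`toy_labels` is an assumption; in the notebook the labels would come from `y`, where 0, 1 and 2 are the positions of `tkinter`, `string` and `flask` in `top_100_tags`):

```python
from collections import Counter

# Hypothetical labels: 0 = tkinter, 1 = string, 2 = flask
toy_labels = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0]

label_counts = Counter(toy_labels)
total = len(toy_labels)

# Report each class's share of the data in index order
for label in sorted(label_counts):
    share = label_counts[label] / total
    print(f"class {label}: {label_counts[label]} examples ({share:.0%})")
```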

In [26]:
# x is already a tensor (created in cell 21), so only y needs converting
y = torch.tensor(y)


In [27]:
x.shape, y.shape

(torch.Size([25000, 50]), torch.Size([25000]))

In [28]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=RANDOM_SEED)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

(torch.Size([20000, 50]),
 torch.Size([5000, 50]),
 torch.Size([20000]),
 torch.Size([5000]))

In [29]:
train_dataset = torch.utils.data.TensorDataset(x_train, y_train)
test_dataset = torch.utils.data.TensorDataset(x_test, y_test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)


### 4.2.3 Head

## 4.3 Data Visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line removes the `` column from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>

#### 4.5.1.3 Feature Scaling / Standardization / Z-score Normalization

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line standardizes the features in `X` by subtracting the mean of each feature and dividing by the standard deviation of that feature. This transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
Standardization is crucial when using machine learning algorithms that rely on distance calculations (like K-Nearest Neighbors, SVM, or Neural Networks). Without standardization, features with larger scales could dominate the distance calculation, leading to biased model behavior. By standardizing, all features contribute equally to the model, regardless of their original scale.
</details>
</details>
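
As a worked illustration of the z-score transformation described above, here is a minimal sketch using only the standard library. The feature values are made up, and the population standard deviation is used, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature
z = standardize(heights_cm)
print([round(v, 3) for v in z])
```

After the transformation the values have mean 0 and standard deviation 1, so a feature measured in centimetres and one measured in kilograms would contribute on the same scale.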

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

In [30]:
# Sklearn train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1-SPLITRATIO), random_state=RANDOM_SEED)

---

# 5. Model Development

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

In [31]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationModel(nn.Module):
    def __init__(self, num_tags, vocab_size, embedding_dim=100):
        super(ClassificationModel, self).__init__()
        
        # Embedding layer with randomly initialized embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)  # Set padding_idx to 0 to ignore padding token
        
        # LSTM layer
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=512, num_layers=1, batch_first=True, bidirectional=True)
        
        # Fully connected layers
        self.fc1 = nn.Linear(512 * 2, 512)
        self.fc2 = nn.Linear(512, num_tags)
    
    def forward(self, input):
        # Pass input through embedding layer
        x = self.embedding(input)  # Shape: [batch_size, sequence_length, embedding_dim]
        
        # LSTM layer
        output, _ = self.lstm(x)  # output shape: [batch_size, sequence_length, hidden_size * 2]
        
        # Extract the output at the last time step
        x = self.fc1(output[:, -1, :])  # Shape: [batch_size, 512]
        
        # Apply ReLU activation
        x = F.relu(x)
        
        # Pass through the final layer to get class scores
        x = self.fc2(x)
        
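        # Squash each score into (0, 1) with a sigmoid.
        # Caution: nn.CrossEntropyLoss applies log-softmax internally and expects
        # raw logits, so for single-label training this step is normally omitted.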
        x = torch.sigmoid(x)
        
        return x


## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

In [35]:
# Define the loss function and optimizer
import torch.optim as optim
vocab_size = len(vocab)  # Size of the vocabulary
embedding_dim = 100      # Dimension of the embedding vectors
num_tags = len(top_100_tags)  # Number of tags

model = ClassificationModel(num_tags, vocab_size, embedding_dim)




criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

model

ClassificationModel(
  (embedding): Embedding(29128, 100, padding_idx=0)
  (lstm): LSTM(100, 512, batch_first=True, bidirectional=True)
  (fc1): Linear(in_features=1024, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=3, bias=True)
)

In [36]:
vocab_size

29128

In [37]:
# Train the model
num_epochs = 10

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    
    print(f"Epoch {epoch + 1}, Loss: {running_loss / len(train_loader)}")
           
    
print("Finished Training")


Epoch 1, Loss: 1.0636905636787415
Epoch 2, Loss: 0.9163144303321838
Epoch 3, Loss: 0.7771083378791809
Epoch 4, Loss: 0.6473542965888978
Epoch 5, Loss: 0.6069113124847412
Epoch 6, Loss: 0.5985156487464904
Epoch 7, Loss: 0.5895508230209351
Epoch 8, Loss: 0.583424378490448
Epoch 9, Loss: 0.5785200416564942
Epoch 10, Loss: 0.5792918829917908
Finished Training
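
The loss flattens out near 0.58 from epoch 6 onward. One possible contributor, offered as a hypothesis and not a diagnosis: `forward` applies a sigmoid, so the values passed to `nn.CrossEntropyLoss` lie in (0, 1), and the softmax inside the loss can never become fully confident. The sketch below computes the smallest cross-entropy reachable under that constraint for three classes, using plain Python instead of PyTorch (the function `cross_entropy` is a hand-written illustration, not the library call).

```python
import math

def cross_entropy(scores, target):
    """Cross-entropy of softmax(scores) against a single target index."""
    log_sum_exp = math.log(sum(math.exp(s) for s in scores))
    return log_sum_exp - scores[target]

# Best case when scores are confined to [0, 1]:
# the correct class sits at 1 and both other classes at 0
bounded_floor = cross_entropy([1.0, 0.0, 0.0], target=0)
print(f"floor with sigmoid-bounded scores: {bounded_floor:.4f}")

# Unbounded logits can push the loss much closer to zero
unbounded = cross_entropy([10.0, 0.0, 0.0], target=0)
print(f"loss with a logit of 10: {unbounded:.6f}")
```

The bounded floor is about 0.551, and the reported losses stay just above it, which is consistent with the hypothesis. Returning raw scores from `forward` and applying softmax only when probabilities are needed is the usual arrangement with `nn.CrossEntropyLoss`.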


In [38]:
# Evaluate the model
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

model.eval()

y_pred = []
y_true = []

with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend(predicted.tolist())
        y_true.extend(labels.tolist())

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.94      0.96      1690
           1       0.92      0.96      0.94      1786
           2       0.96      0.94      0.95      1524

    accuracy                           0.95      5000
   macro avg       0.95      0.95      0.95      5000
weighted avg       0.95      0.95      0.95      5000
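
The `macro avg` row is the unweighted mean of the per-class scores, while `weighted avg` weights each class by its support. A short sketch reproducing both from the rounded precision values printed above (the inputs are already rounded to two decimals, so the results are approximate):

```python
# Per-class precision and support copied from the report above
precision = [0.97, 0.92, 0.96]
support = [1690, 1786, 1524]

# Macro average: every class counts equally
macro_precision = sum(precision) / len(precision)

# Weighted average: classes contribute in proportion to their support
weighted_precision = sum(p * n for p, n in zip(precision, support)) / sum(support)

print(f"macro: {macro_precision:.2f}, weighted: {weighted_precision:.2f}")
```

The two averages agree here because the three classes have similar support; they would diverge on a strongly imbalanced test set.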



## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->