In [1]:
import os

# Get the current working directory
current_directory = os.getcwd()

# Get the parent directory
parent_directory = os.path.dirname(current_directory)

# Change the current working directory to the parent directory
os.chdir(parent_directory)



In [2]:
import pandas as pd
df = pd.read_csv("./data/medium.csv")
df.describe()

Unnamed: 0,Title,Text
count,1391,1391
unique,1390,1391
top,Autonomous Agents And Multi-Agent Systems 101:...,1. Introduction of Word2vec\r\n\r\nWord2vec is...
freq,2,1


In [3]:
df['ID'] = range(0, len(df))
df.set_index('ID')

Unnamed: 0_level_0,Title,Text
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\r\n\r\nWord2vec is...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\r\n\r\nThanks to its strict imple...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...
...,...,...
1386,Brain: A Mystery,“The most beautiful experience we can have is ...
1387,Machine Learning: Lincoln Was Ahead of His Time,Photo by Jp Valery on Unsplash\r\n\r\nIn the 4...
1388,AI and Us — an Opera Experience. In my previou...,EKHO COLLECTIVE: OPERA BEYOND SERIES\r\n\r\nIn...
1389,Digital Skills as a Service (DSaaS),Have you ever thought about what will be in th...


In [4]:
records = df.to_dict('records')

In [5]:
records[0]

{'Title': 'A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model',
 'Text': '1. Introduction of Word2vec\r\n\r\nWord2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.\r\n\r\nThere are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context. Generally, 

In [6]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/ 
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

In [7]:
for item in records:
    item["sentences"] = list(nlp(item["Text"].replace('\r', '').replace('\n', '')).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item["sentence_count_spacy"] = len(item["sentences"])

In [8]:
import random
random.sample(records, k=1)

[{'Title': 'Missing Data and Imputation',
  'Text': 'Missing data can skew findings, increase computational expense, and frustrate researchers. In recent years, dealing with missing data has become more prevalent in fields like biological and life sciences, as we are seeing very direct consequences of mismanaged null values¹. In response, there are more diverse methods for handling missing data emerging.\r\n\r\nThis is great for increasing the effectiveness of studies, and a bit tricky for aspiring and active data scientists keep up with. This blog post will introduce you to a few helpful concepts in dealing with missing data, and get you started with some tangible ways to clean up your data in Python that you can try out today.\r\n\r\nPhoto by Carlos Muza on Unsplash\r\n\r\nWhy do anything at all?\r\n\r\nYou may be asking yourself — why do I need to deal with missing data at all? Why not let sleeping dogs lie? Well, first of all, missing values (termed NaN, Null or NA) cause computati

In [9]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 5 

# Create a function that recursively splits a list into desired sizes
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in records:
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

In [10]:
random.sample(records, k=1)

[{'Title': 'Which U.S. States Have the Most Neighbors?',
  'Text': 'Photo by Joey Csunyo on Unsplash\r\n\r\nThis is going to be a pretty quick and dirty post on using python to determine whether one U.S. state (or any arbitrary geography) borders another U.S. state (or any other arbitrary geography).\r\n\r\nOur input will be a GeoJSON (just a JSON describing a complex shape) of U.S. States, which you can get from my GitHub here. And, our output will be a dictionary in python which maps each U.S. state to a single number indicating how many neighboring states it has.\r\n\r\nOur main tool for this post will be the python library shapely which helps us manipulate complex geographies in python.\r\n\r\nThe procedure will be pretty straightforward: for each U.S. state, we can loop over every other U.S. state and then check whether or not the two states touch. If they do, we can update a running list of neighboring states for the current state in question.\r\n\r\nFirst we will need to convert

In [11]:
import re

# Split each chunk into its own item
record_chunks = []
for item in records:
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["ID"] = item["ID"]
        chunk_dict["Title"] = item["Title"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        record_chunks.append(chunk_dict)

# How many chunks do we have?
len(record_chunks)

11856

In [12]:
random.sample(record_chunks, k=1)

[{'ID': 1017,
  'Title': 'Car Image Classification Using Features Extracted from Pre-trained Neural Networks',
  'sentence_chunk': 'Likewise, as seen in figure 7, Sedans that are misclassified as Convertibles/Coupes are more colorful than the Sedans that are correctly classified. This is an encouraging result in most cases as this will pick up colors specific to certain car Make/Model but the weight placed on color is higher than the weight placed on features that represent shape and size. ConclusionsThe primary conclusions from the above image classification analysis are:Prototyping a classification model using pretrained CNN features is quite effective and easier than fully building a deep neural network from scratch. Error analysis is quite useful, and provides insights on how models can be employed. ReferencesGitHub Repo:Feature Extraction: extract_features.pyModel Building: 3_FinalModelsRuns.ipynbError Analysis: 3_Results_Presentation.ipynb',
  'chunk_char_count': 827,
  'chunk_wo

In [13]:
df = pd.DataFrame(record_chunks)
df.describe().round(2)

Unnamed: 0,ID,chunk_char_count,chunk_word_count,chunk_token_count
count,11856.0,11856.0,11856.0,11856.0
mean,697.86,645.16,102.92,161.29
std,401.69,324.12,45.6,81.03
min,0.0,2.0,1.0,0.5
25%,345.0,460.0,76.0,115.0
50%,714.0,604.0,99.0,151.0
75%,1054.0,773.0,125.0,193.25
max,1390.0,5621.0,577.0,1405.25


In [14]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 16.25 | Text: The encoding and decoding process all happen within the data set.
Chunk token count: 20.5 | Text: Ralph JohnsonHere are a few all-time classics you should strive to read this year:
Chunk token count: 7.75 | Text: So why do I feel so much worse?
Chunk token count: 21.25 | Text: Checking original test imagesCurious about the images that I’ve picked?Here they are:
Chunk token count: 17.75 | Text: Lab. Syst.,1998, 44, 175 — to maximize the explained covariance on the…


In [15]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="cuda") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

  from .autonotebook import tqdm as notebook_tqdm


Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981374e-02  3.03164907e-02 -2.01217979e-02  6.86483532e-02
 -2.55255420e-02 -8.47688597e-03 -2.07085031e-04 -6.32377118e-02
  2.81606112e-02 -3.33353020e-02  3.02634854e-02  5.30720577e-02
 -5.03526479e-02  2.62288060e-02  3.33313867e-02 -4.51578647e-02
  3.63044403e-02 -1.37114024e-03 -1.20171197e-02  1.14946719e-02
  5.04510701e-02  4.70856950e-02  2.11912710e-02  5.14607392e-02
 -2.03746427e-02 -3.58889289e-02 -6.67863991e-04 -2.94393171e-02
  4.95858900e-02 -1.05639501e-02 -1.52013879e-02 -1.31752633e-03
  4.48197015e-02  1.56022999e-02  8.60379885e-07 -1.21391134e-03
 -2.37978678e-02 -9.09396447e-04  7.34481821e-03 -2.53929826e-03
  5.23369834e-02 -4.68043387e-02  1.66214649e-02  4.71579060e-02
 -4.15599607e-02  9.01933818e-04  3.60278711e-02  3.42214555e-02
  9.68227163e-02  5.94828948e-02 -1.64984651e-02 -3.51249278e-02
  5.92518551e-03 -7.07988336e-04 -2.4103

In [16]:
import tqdm
records = df.to_dict(orient="records")
records[:2]

[{'ID': 0,
  'Title': 'A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model',
  'sentence_chunk': '1. Introduction of Word2vecWord2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space.',
  'chunk_char_count': 468,
  'chunk_word_count': 75,
  'chunk_token_count': 117.0},
 {'ID': 0,
  'Title': 'A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model',
  'sentence_chunk': 'For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another. There are two main training algorithms for word2vec, one is the continuous bag of words(CBOW),

In [17]:
# Send the model to the GPU
embedding_model.to("cuda") # requires a GPU installed, for reference on my local machine, I'm using a NVIDIA RTX 4090

# Create embeddings one by one on the GPU
for item in records:
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

In [18]:
random.sample(records, k=1)

[{'ID': 1332,
  'Title': 'Machine Learning Cheat Sheet — Data Processing Techniques',
  'sentence_chunk': 'Disadvantage:Data normalization is sensitive to outliers. One-hot EncodingConvert categorical data into binary variables. For example, convert feature gender into two columns, male and female, with value 0 or 1. Imbalanced Data SetData is not well distributed among different classes. For example, only 0.1% of the transactions are fraud.',
  'chunk_char_count': 338,
  'chunk_word_count': 48,
  'chunk_token_count': 84.5,
  'embedding': array([-5.79277426e-02,  2.55570207e-02,  1.96328363e-03, -4.97448109e-02,
          1.17675355e-02,  7.16103911e-02,  6.12693615e-02, -1.25502900e-03,
         -5.49931414e-02, -9.77407489e-03,  3.99435274e-02,  2.05792878e-02,
          1.81785710e-02,  9.15679783e-02, -1.13642281e-02,  1.75265188e-03,
          1.67607684e-02, -9.53212287e-03, -2.00027451e-02,  1.79839488e-02,
          9.84726287e-03, -4.12900113e-02,  5.32371663e-02,  2.15621386e

In [19]:
# Save embeddings to file
chunks_and_embeddings_df = pd.DataFrame(records)
embeddings_df_save_path = "./data/chunks_and_embeddings_df.csv"
chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [20]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.set_index("ID")
text_chunks_and_embedding_df_load.head()

Unnamed: 0,ID,Title,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vecWord2vec is one of ...,468,75,117.0,[ 1.14987101e-02 4.61726263e-02 -7.97047652e-...
1,0,A Beginner’s Guide to Word Embedding with Gens...,"For instance, the words women, men, and human ...",670,112,167.5,[ 5.78960031e-02 -1.32116340e-02 4.08314820e-...
2,0,A Beginner’s Guide to Word Embedding with Gens...,"For more details about the word2vec algorithm,...",601,96,150.25,[ 3.93238813e-02 2.01313253e-02 1.00309318e-...
3,0,A Beginner’s Guide to Word Embedding with Gens...,Gensim depends on the following software:Pytho...,774,119,193.5,[ 7.78792519e-03 3.13875079e-02 -2.66925362e-...
4,0,A Beginner’s Guide to Word Embedding with Gens...,We will use these features to generate the wor...,725,120,181.25,[ 9.57979169e-03 1.54464487e-02 -1.20048057e-...


In [28]:
import hnswlib
import numpy as np

import random

import torch
import numpy as np 
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("./data/chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = np.array(text_chunks_and_embedding_df["embedding"].tolist())
embeddings.shape


(11856, 768)

In [40]:


# Create the HNSW index
dim = embeddings.shape[1]  # Dimensionality of the vectors

# Initializing an HNSW index
p = hnswlib.Index(space='cosine', dim=dim)  # or use 'cosine' if more appropriate

# Initialize the index
# Specify the maximum number of elements in the index
# Parameters can be adjusted based on dataset and requirements
p.init_index(max_elements=embeddings.shape[0], ef_construction=200, M=16)

# Add items to the index
# Here, we don't specify the ids, so HNSWLIB will generate them automatically
p.add_items(embeddings)

# Optional: Set ef parameter for controlling query time/accuracy trade-off
p.set_ef(50)  # Setting it higher leads to more accurate but slower searches

# Now, your index is ready for querying


In [41]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model to

In [42]:
query = "price prediction model"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query)

# # 3. Get similarity scores with the dot product (we'll time this for fun)
# from time import perf_counter as timer

# start_time = timer()
# dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
# end_time = timer()

# print(f"Time take to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# # 4. Get the top-k results (we'll keep this to 5)
# top_results_dot_product = torch.topk(dot_scores, k=5)
# top_results_dot_product

Query: price prediction model


In [46]:
# Searching for the 5 nearest neighbors
ids, distances = p.knn_query(query_embedding, k = 5)

print("Nearest neighbor ids:", ids)
print("Distances:", distances)

Nearest neighbor ids: [[10871 11770  3020  5589 11771 11663  5585 11590 10624 11649]]
Distances: [[0.31617194 0.3314786  0.39219975 0.40445375 0.41107804 0.4178326
  0.42525178 0.42620337 0.42673832 0.4293744 ]]


In [44]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [47]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indicies from torch.topk
for score, idx in zip(distances[0], ids[0]):
    print(f"Score: {score}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Title:")
    print_wrapped(pages_and_chunks[idx]["Title"])
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"ID: {pages_and_chunks[idx]['ID']}")
    print("\n")

Query: 'price prediction model'

Results:
Score: 0.31617194414138794
Title:
Tools/Tips Critical to Any Machine Learning Project
Text:
Although we found that we can predict the future prices with some accuracy, it’s
still not fit to capture the special cases (sudden price change based on other
market factors). The purpose here was not to solve the problem completely
(although we can if we figure a way to integrate other factors to our problem
statement), but to realise, what we can do with Machine Learning, seating at our
homes, with our mediocre laptops. That’s the power of Machine Learning. Keep
coding.
ID: 1263


Score: 0.3314785957336426
Title:
Forecasting Future Prices of Cryptocurrency using Historical Data
Text:
We will try to predict the future prices of Bitcoin by using its closing_price
feature. What Model to Use?To perform forecasting, we will need a machine
learning model. Most people think of multi-linear regression when they want to
predict values. But for Time-series data

In [None]:
import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm of each vector (removes the magnitude, keeps direction)
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)

# Example tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Calculate cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

Dot product between vector1 and vector2: tensor(14.)
Dot product between vector1 and vector3: tensor(32.)
Dot product between vector1 and vector4: tensor(-14.)
Cosine similarity between vector1 and vector2: tensor(1.0000)
Cosine similarity between vector1 and vector3: tensor(0.9746)
Cosine similarity between vector1 and vector4: tensor(-1.0000)


In [None]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indicies
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Title:")
        print_wrapped(pages_and_chunks[idx]["Title"])
        print("Text:")
        print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further (and check the results)
        print(f"ID: {pages_and_chunks[idx]['ID']}")
        print("\n")

In [None]:
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

[INFO] Time taken to get scores on 11856 embeddings: 0.00004 seconds.


In [None]:
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 11856 embeddings: 0.00003 seconds.
Query: price prediction model

Results:
Score: 0.6838
Title:
Forecasting Future Prices of Cryptocurrency using Historical Data
Text:
And as we know that in linear regression any sort of extrapolation is not
advisable. For time-series data, it is better to use the Auto Regressive
Integrated Moving Average, or ARIMA Models. ARIMAARIMA is actually a class of
models that ‘explains’ a given time series based on its own past values, that
is, its own lags and the lagged forecast errors, so that equation can be used to
forecast future values. Any ‘non-seasonal’ time series that exhibits patterns
and is not a random white noise can be modeled with ARIMA models. The hypothesis
testing performed as discussed below, shows the prices were not seasonal, hence
we can use an ARIMA model.
ID: 1376


Score: 0.6685
Title:
Forecasting Future Prices of Cryptocurrency using Historical Data
Text:
And as we know that in linear regression an