In [1]:
import numpy as np
import pandas as pd
import math
from openai import OpenAI
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

## Generate documents using the OpenAI API

In [2]:
def generate_document(prompt):
    client = OpenAI(
        api_key=api_key,
    )
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

In [3]:
doc1 = generate_document("Write a document about cars")
print(doc1)

Cars have become an essential part of modern society, providing individuals with a convenient and efficient mode of transportation. From commuting to work, running errands, or embarking on road trips, cars play a crucial role in our daily lives.

The history of cars dates back to the late 19th century, with the invention of the first gasoline-powered automobile by Karl Benz in 1885. Since then, cars have evolved significantly in terms of design, technology, and performance. Today, there are a wide variety of cars available on the market, ranging from compact sedans to luxury SUVs and electric vehicles.

One of the key benefits of owning a car is the freedom and flexibility it provides. With a car, individuals can travel to their desired destinations at their own pace and convenience, without having to rely on public transportation schedules. Cars also offer a sense of independence and autonomy, allowing individuals to explore new places and embark on spontaneous adventures.

In additio

In [4]:
doc2 = generate_document("Write a document about math and science")
print(doc2)

Math and science are two fundamental subjects that play a crucial role in our understanding of the world around us. Both disciplines are interconnected and rely on each other to explain and solve complex problems.

Mathematics is the language of science, providing the tools and techniques necessary to analyze and interpret data. It is a universal language that allows scientists to communicate and collaborate across different fields. From calculating the trajectory of a rocket to predicting the spread of a virus, math is essential in making informed decisions and solving real-world problems.

Science, on the other hand, is the systematic study of the natural world through observation, experimentation, and analysis. It encompasses a wide range of disciplines, including biology, chemistry, physics, and earth sciences. Science helps us understand the laws of nature and how they govern the universe, from the smallest particles to the largest galaxies.

The relationship between math and scie

In [5]:
doc3 = generate_document("Write a document about machine learning")
print(doc3)

Machine learning is a branch of artificial intelligence that focuses on the development of algorithms and models that allow computers to learn from and make predictions or decisions based on data. It is a rapidly growing field with applications in a wide range of industries, including healthcare, finance, marketing, and more.

One of the key concepts in machine learning is the idea of training a model on a dataset. This involves feeding the model with a set of input data, along with the correct output or label for each data point. The model then learns to make predictions or decisions based on this training data, and can be used to make predictions on new, unseen data.

There are several different types of machine learning algorithms, including supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the model is trained on labeled data, where the correct output is provided for each input. This type of learning is commonly used for tasks such as c

## Preprocessing

In [6]:
def preprocess(text):
    def convert_to_lower(text):
        copy_text = text
        return copy_text.lower()

    text = convert_to_lower(text)
    import string

    def remove_punctuations(text):
        copy_text = text
        punc = string.punctuation
        return copy_text.translate(str.maketrans("", "", punc))

    text = remove_punctuations(text)
    from nltk.corpus import stopwords

    STOPWORDS = stopwords.words("english")

    def remove_stopwords(text):
        copy_text = text
        copy_text = " ".join([word for word in text.split() if word not in STOPWORDS])
        return copy_text

    text = remove_stopwords(text)
    import re

    def remove_special_chars(text):
        copy_text = text
        copy_text = re.sub("[^a-zA-Z0-9]", " ", copy_text)
        copy_text = re.sub(r"\s+", " ", copy_text)
        return copy_text

    text = remove_special_chars(text)
    from nltk import pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    wordnet_map = {
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "J": wordnet.ADJ,
        "R": wordnet.ADV,
    }

    def lemmatize_words(text):
        copy_text = text
        # find pos tags
        pos_text = pos_tag(copy_text.split())
        return " ".join(
            [
                lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
                for word, pos in pos_text
            ]
        )

    text = lemmatize_words(text)
    import spacy

    def lemmatize_spacy(text):
        # Load the English language model
        nlp = spacy.load("en_core_web_sm")

        # Process the text with spaCy
        doc = nlp(text)

        # Lemmatize each token in the processed text
        lemmatized_text = " ".join([token.lemma_ for token in doc])
        return lemmatized_text

    # Apply spaCy lemmatization on top of the WordNet pass
    text = lemmatize_spacy(text)

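    # NOTE: punctuation was already stripped above, so "://", "." and "<...>"
    # no longer occur in the text and the URL and HTML patterns below cannot
    # match. On raw web text, run these two steps before remove_punctuations.
    def remove_url(text):
        copy_text = text
        return re.sub(r"https?://\S+|www\.\S+", "", copy_text)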

    text = remove_url(text)

    def remove_html_tags(text):
        copy_text = text
        return re.sub(r"<.*?>", "", copy_text)

    text = remove_html_tags(text)

    def remove_digits(text):
        copy_text = text
        return "".join([i for i in copy_text if not i.isdigit()])

    text = remove_digits(text)

    return text
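
The ordering of the cleaning steps matters: URL and HTML patterns only match while the punctuation they rely on is still present. The following is a minimal sketch of that ordering using only the standard library. The function name `strip_markup_then_punctuation` and the sample string are illustrative assumptions, not part of the notebook, and the stop-word and lemmatization steps are left out.

```python
import re
import string


def strip_markup_then_punctuation(text):
    """Remove URLs and HTML tags first, then lowercase and strip punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs, while "://" still exists
    text = re.sub(r"<.*?>", "", text)  # HTML tags, while "<" and ">" still exist
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


# Hypothetical sample input, used only for illustration.
sample = "See <b>Cars</b> at https://example.com/cars today!"
cleaned = strip_markup_then_punctuation(sample)

# Stripping punctuation first leaves URL and tag fragments fused into words.
punct_first = sample.lower().translate(str.maketrans("", "", string.punctuation))
print(cleaned)
print(punct_first)
```

With punctuation removed first, the tag and URL survive as ordinary-looking tokens, which would then enter the vocabulary.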

In [7]:
text1 = preprocess(doc1)
text1

'car become essential part modern society provide individual convenient efficient mode transportation commute work run errand embark road trip car play crucial role daily live history car date back late th century invention first gasolinepowere automobile karl benz  since car evolve significantly term design technology performance today wide variety car available market range compact sedan luxury suvs electric vehicle one key benefit own car freedom flexibility provide car individual travel desire destination pace convenience without rely public transportation schedule car also offer sense independence autonomy allow individual explore new place embark spontaneous adventure addition convenience car also play crucial role economy automotive industry major contributor global gdp provide million job drive innovation technology engineering production sale car also generate significant revenue government tax tariff however important acknowledge environmental impact car combustion fossil fue

In [8]:
text2 = preprocess(doc2)
text2

'math science two fundamental subject play crucial role understand world around u discipline interconnect rely explain solve complex problem mathematic language science provide tool technique necessary analyze interpret datum universal language allow scientist communicate collaborate across different field calculate trajectory rocket predict spread virus math essential make informed decision solve realworld problem science hand systematic study natural world observation experimentation analysis encompass wide range discipline include biology chemistry physic earth science science help u understand law nature govern universe small particle large galaxy relationship math science symbiotic discipline inform enhance example math use science model simulate complex system climate pattern genetic mutation turn science provide realworld data observation math analyze interpret together math science revolutionize understand world lead countless technological advancement development vaccine explo

In [9]:
text3 = preprocess(doc3)
text3

'machine learn branch artificial intelligence focus development algorithm model allow computer learn make prediction decision base datum rapidly grow field application wide range industry include healthcare finance market one key concept machine learn idea training model dataset involve feed model set input datum along correct output label datum point model learn make prediction decision base training datum use make prediction new unseen datum several different type machine learn algorithm include supervise learn unsupervised learning reinforcement learn supervise learn model train label datum correct output provide input type learn commonly use task classification regression unsupervise learning hand involve train model unlabeled datum goal find pattern structure datum type learning often use task cluster dimensionality reduction reinforcement learn type learn model learn make decision interact environment receive feedback form reward penalty type learn commonly use task game play rob

## Prepare the documents to calculate TF, IDF, and TF-IDF

In [10]:
all_words = []
all_words.extend(text1.split())
all_words.extend(text2.split())
all_words.extend(text3.split())
unique_words = set(all_words)
unique_words = list(unique_words)
unique_words = [word for word in unique_words if len(word) >= 2]
len(unique_words)

323

In [11]:
len(all_words)

543

In [12]:
len(unique_words)

323

In [13]:
all_docs = [
    text1,
    text2,
    text3,
]
# all_docs = [
#     "the cats are in the house",
#     "the dogs are in the house and outside",
#     "the cats and dogs are friends",
# ]

In [14]:
# unique_words = []
# for i in all_docs:
#     unique_words.extend(i.split())
#     unique_words = list(set(unique_words))
# len(unique_words)

In [15]:
def sort_df(df):
    sorted_matrix = df.sort_index(axis=1)
    return sorted_matrix

## Calculating TF (Term Frequency)

In [16]:
def get_tf(all_docs, unique_words):
    len_docs = len(all_docs)
    len_words = len(unique_words)
    tf = np.zeros((len_docs, len_words), dtype=int)
    for i in range(len_docs):
        for j in range(len_words):
            cur_word = unique_words[j]
            cur_doc = all_docs[i]
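            # NOTE: str.count matches substrings, so "world" is also counted
            # inside "realworld"; cur_doc.split().count(cur_word) would count
            # whole tokens only.
            freq = cur_doc.count(cur_word)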
            tf[i, j] = freq
    return tf
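
`get_tf` counts with `str.count`, which matches substrings rather than tokens. The sketch below shows whole-token counting with `collections.Counter`, assuming whitespace-separated documents like those returned by `preprocess`. The name `get_tf_tokens` and the toy documents are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

import numpy as np


def get_tf_tokens(all_docs, unique_words):
    """Count whole-token occurrences of each vocabulary word per document."""
    tf = np.zeros((len(all_docs), len(unique_words)), dtype=int)
    for i, doc in enumerate(all_docs):
        counts = Counter(doc.split())  # token -> number of occurrences
        for j, word in enumerate(unique_words):
            tf[i, j] = counts[word]  # Counter returns 0 for missing tokens
    return tf


# Hypothetical toy data chosen so that substring and token counts differ.
toy_docs = ["realworld data world", "use useful use"]
toy_vocab = ["world", "use"]

toy_tf = get_tf_tokens(toy_docs, toy_vocab)
substring_tf = [[d.count(w) for w in toy_vocab] for d in toy_docs]
print(toy_tf.tolist())
print(substring_tf)
```

Token counts of this kind are what `CountVectorizer` produces with its default tokenizer on lowercase alphabetic text, so a token-based TF is the appropriate baseline when comparing against scikit-learn.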

In [17]:
tf = get_tf(all_docs, unique_words)
tf.shape

(3, 323)

In [18]:
df_tf = pd.DataFrame(tf, columns=unique_words)
df_tf = sort_df(df_tf)
df_tf

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,1,0,1,0,0,1,1,0,1,0,...,1,0,1,2,0,1,1,1,2,0
1,0,1,0,1,1,0,0,0,1,0,...,1,1,0,0,1,0,1,0,0,6
2,0,0,0,1,0,0,0,2,1,1,...,5,0,0,0,0,1,2,0,0,0


## Calculating IDF (Inverse Document Frequency)

In [19]:
def get_freq_all(word, all_docs):
    counter = 0
    for i in all_docs:
        if word in i.split():
            counter += 1
    return counter

In [20]:
def get_idf(all_docs, unique_words):
    len_docs = len(all_docs)
    len_words = len(unique_words)
    idf = np.zeros((len_words))
    for i in range(len_words):
        freq = get_freq_all(unique_words[i], all_docs)
        idf[i] = math.log(float(len_docs + 1) / float(freq + 1)) + 1
    return idf
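
The formula in `get_idf` is the smoothed IDF, `ln((N + 1) / (df + 1)) + 1`, where `N` is the number of documents and `df` is the number of documents containing the word. As a small worked example, the sketch below evaluates it on the three toy sentences that appear commented out earlier in the notebook. The helper name `smoothed_idf` is an illustrative assumption.

```python
import math

toy_docs = [
    "the cats are in the house",
    "the dogs are in the house and outside",
    "the cats and dogs are friends",
]


def smoothed_idf(word, docs):
    """Smoothed IDF: ln((N + 1) / (df + 1)) + 1, with df counted over whole tokens."""
    df = sum(word in doc.split() for doc in docs)
    return math.log((len(docs) + 1) / (df + 1)) + 1


idf_the = smoothed_idf("the", toy_docs)          # appears in all 3 documents
idf_cats = smoothed_idf("cats", toy_docs)        # appears in 2 documents
idf_outside = smoothed_idf("outside", toy_docs)  # appears in 1 document
print(idf_the, idf_cats, idf_outside)
```

With three documents the formula can only yield three values for words that occur at all: 1.0, about 1.287682, and about 1.693147, which are the values visible in the IDF table.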

In [21]:
idf = get_idf(all_docs, unique_words)
idf.shape

(323,)

In [22]:
df_idf = pd.DataFrame(idf, index=unique_words)
df_idf = sort_df(df_idf.T)
df_idf

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.0,1.693147,...,1.287682,1.693147,1.693147,1.693147,1.693147,1.287682,1.0,1.693147,1.693147,1.693147


## Calculating TF-IDF

In [23]:
def get_tfidf(tf, idf):
    tf_idf = idf * tf
    normalization = np.sqrt((tf_idf * tf_idf).sum(axis=1))
    for i in range(len(normalization)):
        tf_idf[i] /= normalization[i]
    return tf_idf
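
The row-by-row loop in `get_tfidf` performs an L2 normalization of each document vector. A vectorized sketch of the same computation with NumPy broadcasting follows. The name `get_tfidf_vectorized`, the toy arrays, and the guard for all-zero rows are assumptions added for illustration; the original function does not handle empty documents.

```python
import numpy as np


def get_tfidf_vectorized(tf, idf):
    """Multiply TF by IDF and scale every row to unit L2 norm."""
    tf_idf = tf * idf  # idf of shape (n_words,) broadcasts across rows
    norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: leave all-zero rows unchanged
    return tf_idf / norms


# Hypothetical toy inputs: two documents, three vocabulary words.
toy_tf = np.array([[1, 0, 2], [0, 0, 0]])
toy_idf = np.array([1.0, 1.5, 2.0])

toy_tfidf = get_tfidf_vectorized(toy_tf, toy_idf)
print(toy_tfidf)
```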

In [24]:
tf_idf = get_tfidf(tf, idf)
tf_idf.shape

(3, 323)

In [25]:
df_tf_idf = pd.DataFrame(tf_idf, columns=unique_words)
df_tf_idf = sort_df(df_tf_idf)
df_tf_idf

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,0.04611,0.0,0.04611,0.0,0.0,0.04611,0.04611,0.0,0.027234,0.0,...,0.035068,0.0,0.04611,0.092221,0.0,0.035068,0.027234,0.04611,0.092221,0.0
1,0.0,0.042679,0.0,0.042679,0.042679,0.0,0.0,0.0,0.025207,0.0,...,0.032458,0.042679,0.0,0.0,0.042679,0.0,0.025207,0.0,0.0,0.256071
2,0.0,0.0,0.0,0.032755,0.0,0.0,0.0,0.065509,0.019345,0.032755,...,0.124553,0.0,0.0,0.0,0.0,0.024911,0.038691,0.0,0.0,0.0


## TF-IDF with scikit-learn (CountVectorizer and TfidfTransformer)

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

In [27]:
cv = CountVectorizer()
word_count_vector = cv.fit_transform(all_docs)

In [28]:
len(cv.get_feature_names_out())

323

In [29]:
word_count_vector.toarray()

array([[ 1,  0,  1,  0,  0,  1,  1,  0,  1,  0,  3,  0,  0,  0,  0,  0,
         0,  0,  1,  1,  1,  1,  1,  0,  2,  2,  1,  0,  0,  0,  0,  0,
        13,  1,  1,  0,  1,  0,  0,  1,  0,  0,  1,  0,  1,  0,  0,  1,
         1,  0,  0,  0,  0,  1,  0,  1,  0,  1,  3,  1,  0,  0,  0,  2,
         1,  0,  0,  1,  0,  0,  0,  1,  1,  1,  1,  0,  0,  0,  0,  1,
         0,  1,  1,  1,  2,  2,  2,  0,  0,  0,  1,  0,  1,  2,  1,  1,
         0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,
         0,  1,  1,  0,  0,  1,  1,  1,  0,  1,  0,  0,  1,  2,  1,  1,
         0,  1,  0,  0,  1,  1,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0,
         1,  1,  2,  0,  0,  1,  3,  1,  0,  0,  1,  0,  1,  0,  0,  0,
         0,  0,  1,  0,  1,  1,  1,  0,  0,  0,  0,  1,  0,  0,  0,  0,
         1,  1,  0,  1,  0,  0,  1,  0,  0,  0,  1,  1,  1,  0,  2,  0,
         0,  0,  0,  0,  1,  1,  0,  2,  0,  1,  1,  0,  0,  1,  1,  1,
         2,  0,  0,  0,  1,  0,  1,  2,  0,  1,  0,  0,  0,  0, 

## Compute TF

In [30]:
tf_vectorizer = word_count_vector.toarray()
df_count_vector = pd.DataFrame(
    tf_vectorizer,
    columns=cv.get_feature_names_out(),
)
df_count_vector = sort_df(df_count_vector)
df_count_vector

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,1,0,1,0,0,1,1,0,1,0,...,0,0,1,2,0,1,1,1,2,0
1,0,1,0,0,1,0,0,0,1,0,...,1,1,0,0,1,0,1,0,0,4
2,0,0,0,1,0,0,0,2,1,1,...,5,0,0,0,0,1,2,0,0,0


## Compute IDF

In [31]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
idf_vectorizer = tfidf_transformer.idf_
df_tfidf_transformer = pd.DataFrame(idf_vectorizer, index=cv.get_feature_names_out())
df_tfidf_transformer = sort_df(df_tfidf_transformer.T)
df_tfidf_transformer

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.0,1.693147,...,1.287682,1.693147,1.693147,1.693147,1.693147,1.287682,1.0,1.693147,1.693147,1.693147


## Compute TF-IDF by multiplying the scikit-learn TF and IDF

In [32]:
from sklearn.preprocessing import normalize
tf_idf_vectorizer = tf_vectorizer * idf_vectorizer
tf_idf_vectorizer = normalize(tf_idf_vectorizer)
df_tf_idf_vectorizer = pd.DataFrame(tf_idf_vectorizer, columns=cv.get_feature_names_out())
df_tf_idf_vectorizer = sort_df(df_tf_idf_vectorizer)
df_tf_idf_vectorizer

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,0.049806,0.0,0.049806,0.0,0.0,0.049806,0.049806,0.0,0.029416,0.0,...,0.0,0.0,0.049806,0.099612,0.0,0.037879,0.029416,0.049806,0.099612,0.0
1,0.0,0.051898,0.0,0.0,0.051898,0.0,0.0,0.0,0.030652,0.0,...,0.03947,0.051898,0.0,0.0,0.051898,0.0,0.030652,0.0,0.0,0.207593
2,0.0,0.0,0.0,0.03888,0.0,0.0,0.0,0.07776,0.022963,0.03888,...,0.147847,0.0,0.0,0.0,0.0,0.029569,0.045927,0.0,0.0,0.0


## Compute TF-IDF with the built-in TfidfTransformer

In [33]:
count_vector = cv.transform(all_docs)
tf_idf_vector = tfidf_transformer.transform(count_vector)
df_tf_idf_vector = pd.DataFrame(tf_idf_vector.todense(), columns=cv.get_feature_names_out())
df_tf_idf_vector = sort_df(df_tf_idf_vector)
df_tf_idf_vector

Unnamed: 0,acknowledge,across,addition,advance,advancement,adventure,air,algorithm,allow,along,...,use,vaccine,variety,vehicle,virus,way,wide,without,work,world
0,0.049806,0.0,0.049806,0.0,0.0,0.049806,0.049806,0.0,0.029416,0.0,...,0.0,0.0,0.049806,0.099612,0.0,0.037879,0.029416,0.049806,0.099612,0.0
1,0.0,0.051898,0.0,0.0,0.051898,0.0,0.0,0.0,0.030652,0.0,...,0.03947,0.051898,0.0,0.0,0.051898,0.0,0.030652,0.0,0.0,0.207593
2,0.0,0.0,0.0,0.03888,0.0,0.0,0.0,0.07776,0.022963,0.03888,...,0.147847,0.0,0.0,0.0,0.0,0.029569,0.045927,0.0,0.0,0.0
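
The two-step pipeline of `CountVectorizer` followed by `TfidfTransformer` can also be expressed in one step with scikit-learn's `TfidfVectorizer`. The sketch below checks that both routes agree on the toy sentences that appear commented out earlier in the notebook; the variable names are illustrative, and default settings (L2 normalization, smoothed IDF) are assumed on both sides.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

toy_docs = [
    "the cats are in the house",
    "the dogs are in the house and outside",
    "the cats and dogs are friends",
]

# Two steps: raw token counts, then IDF weighting and L2 normalization.
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# One step: TfidfVectorizer combines both stages.
one_step = TfidfVectorizer(smooth_idf=True, use_idf=True).fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, same)
```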


## Check the unique words from CountVectorizer and from scratch

In [34]:
s1 = cv.get_feature_names_out().copy()
s2 = unique_words.copy()
s1 = sorted(s1)
s2 = sorted(s2)
s1 == s2

True

In [35]:
def compare(df1, df2):
    arr1 = df1.to_numpy()
    arr2 = df2.to_numpy()
    element_wise_comparison = np.isclose(arr1, arr2)
    return element_wise_comparison

In [36]:
x = compare(df_tf_idf, df_tf_idf_vectorizer)
x

array([[False,  True, False,  True,  True, False, False,  True, False,
         True, False,  True,  True,  True,  True,  True,  True,  True,
        False, False, False, False, False,  True, False, False, False,
         True,  True,  True,  True,  True, False, False, False,  True,
        False,  True,  True, False,  True,  True, False,  True, False,
         True,  True, False, False,  True,  True,  True,  True, False,
         True, False,  True, False, False, False,  True,  True,  True,
        False, False,  True,  True, False,  True,  True,  True, False,
        False, False, False,  True,  True,  True,  True, False,  True,
        False, False, False, False, False, False,  True,  True,  True,
        False,  True, False, False, False, False, False,  True, False,
         True,  True,  True,  True,  True,  True,  True, False,  True,
         True,  True,  True,  True,  True, False, False,  True, False,
        False, False, False,  True, False,  True,  True, False, False,
      

In [37]:
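from sklearn.metrics.pairwise import cosine_similarity
def cos_similarity(df1, df2):
    array1 = df1.to_numpy()
    array2 = df2.to_numpy()
    # Flatten each matrix into a single row vector, so the result is one
    # overall similarity score rather than one score per document.
    similarity = cosine_similarity(array1.reshape(1, -1), array2.reshape(1, -1))
    return similarity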

In [38]:
y = cos_similarity(df_tf_idf, df_tf_idf_vectorizer)
y

array([[0.92420875]])
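
`cos_similarity` flattens both matrices into one vector each and returns a single score. To see which documents diverge, a per-row comparison can be more informative. The following is a minimal sketch, assuming both matrices list the same words in the same column order and contain no all-zero rows. The name `rowwise_cosine` and the toy matrices are illustrative assumptions.

```python
import numpy as np


def rowwise_cosine(a, b):
    """Cosine similarity between corresponding rows of two equally shaped matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = (a * b).sum(axis=1)  # one dot product per row pair
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


# Hypothetical toy matrices: row 0 is parallel, row 1 is 45 degrees apart.
m1 = np.array([[1.0, 0.0], [1.0, 1.0]])
m2 = np.array([[2.0, 0.0], [1.0, 0.0]])

scores = rowwise_cosine(m1, m2)
print(scores)
```

Applied to the notebook's data, this would be called as `rowwise_cosine(df_tf_idf.to_numpy(), df_tf_idf_vectorizer.to_numpy())`, giving one score per document.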