<a href="https://colab.research.google.com/github/DiaaEssam/Text-Generation-and-TF-IDF/blob/main/Text_Generation_and_TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading libraries

In [None]:
!pip install langchain

Collecting langchain
  Downloading langchain-0.1.13-py3-none-any.whl (810 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.29 (from langchain)
  Downloading langchain_community-0.0.29-py3-none-any.whl (1.8 MB)
Collecting langchain-core<0.2.0,>=0.1.33 (from langchain)
  Downloading langchain_core-0.1.33-py3-none-any.whl (269 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downl

# Importing libraries

In [None]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline
import torch
import os
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
import numpy as np
from tabulate import tabulate
import nltk
from nltk.stem import SnowballStemmer
import re
from collections import Counter
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
transformers.logging.set_verbosity_warning()

# Downloading the set of English stop words

In [None]:
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Inserting Token

In [None]:
from getpass import getpass  # avoid hard-coding the secret in the notebook
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face API token: ")

# Defining template for prompt

In [None]:
Template = """given the Topic below, generate text about this topic in details.

Current conversation:
{history}

Topic:
{input}

Response: """

PROMPT = PromptTemplate(
                        template=Template,
                        input_variables=["history", "input"])

# Defining the Hugging Face pipeline for use in chains

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Local model directory on Kaggle
    model_id = "/kaggle/input/vicuna/pytorch/7b-v1pt5-16k/1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # float16 weights with automatic device placement (intended for GPU use)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 temperature=1.0,
                                                 torch_dtype = torch.float16,
                                                 device_map = "auto",
                                                 offload_folder = "./cache")
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    # Vicuna is used here for the text-generation task;
    # max_new_tokens caps the length of each response and can be adjusted
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                    max_new_tokens=512, streamer=streamer, return_full_text=False)
    llm = HuggingFacePipeline(pipeline=pipe)


# Defining conversation chains

In [None]:
generation = ConversationChain(
    llm=llm,
    prompt=PROMPT,
    verbose=False,
    memory=ConversationBufferMemory(ai_prefix="Response")
)

# Generating the documents

In [None]:
doc1 = generation("sports")['response']





Sports are a vital part of our lives, and they have been for centuries. They provide us with entertainment, exercise, and a sense of community. Whether it's watching a game on TV or playing a sport ourselves, sports have the power to bring people together and create lasting memories.

One of the most popular sports in the world is football (soccer). It is played by millions of people across the globe and is the most watched sport in the world. Football is a team sport that requires skill, strategy, and teamwork. It is played on a large field with a ball and two teams of 11 players each. The objective of the game is to score goals by kicking the ball into the opposing team's net.

Another popular sport is basketball. It is a fast-paced game that is played on a rectangular court. The objective of the game is to score points by throwing the ball through a hoop, which is 10 feet off the ground. Basketball is a team sport that requires quick reflexes, agility, and coordination.

Sports al

In [None]:
doc2 = generation("weather")['response']





The weather is a constantly changing phenomenon that affects our daily lives in many ways. It can impact our mood, activities, and even our health. From sunny days to stormy nights, the weather can be unpredictable and ever-changing.

One of the most significant impacts of the weather is on our mood. A sunny day can put us in a good mood and make us feel more energetic, while a rainy day can make us feel sluggish and moody. The weather can also affect our outdoor activities, such as gardening, hiking, or playing sports.

The weather can also impact our health. Extreme temperatures, such as heat waves or cold snaps, can be dangerous for our bodies. High temperatures can cause heat exhaustion or heatstroke, while low temperatures can lead to hypothermia. It is essential to take precautions and dress appropriately for the weather to stay safe and healthy.

In addition to its impact on our daily lives, the weather also plays a significant role in shaping our environment. It can cause nat

In [None]:
doc3 = generation("history of cars")['response']





The history of cars is a fascinating journey that spans over a century. From the early days of horse-drawn carriages to the high-speed, technologically advanced vehicles of today, cars have come a long way.

The first cars were introduced in the late 19th century, and they were powered by steam engines. These early cars were slow and unreliable, but they paved the way for the development of the internal combustion engine, which would become the standard power source for cars for the next century.

In the early 20th century, cars became more popular, and the automobile industry began to grow rapidly. Henry Ford's assembly line production method revolutionized the industry, making cars more affordable and accessible to the masses.

During World War II, cars played a crucial role in the war effort, with factories producing tanks, trucks, and other military vehicles. After the war, the automobile industry continued to grow, and cars became an essential part of American culture.

In the 1

# Preparing documents as corpus

In [None]:
docs = [doc1, doc2, doc3]

# Function to print the documents after each change

In [None]:
def print_docs():
    for i,doc in enumerate(docs):
        print(f"Document {i+1}:\n",doc,"\n")

# Cleaning

In [None]:
for i in range(len(docs)):
    docs[i] = re.sub(r'[^a-zA-Z\s]', '', docs[i])
print_docs()

Document 1:
 
Sports are a vital part of our lives and they have been for centuries They provide us with entertainment exercise and a sense of community Whether its watching a game on TV or playing a sport ourselves sports have the power to bring people together and create lasting memories

One of the most popular sports in the world is football soccer It is played by millions of people across the globe and is the most watched sport in the world Football is a team sport that requires skill strategy and teamwork It is played on a large field with a ball and two teams of  players each The objective of the game is to score goals by kicking the ball into the opposing teams net

Another popular sport is basketball It is a fastpaced game that is played on a rectangular court The objective of the game is to score points by throwing the ball through a hoop which is  feet off the ground Basketball is a team sport that requires quick reflexes agility and coordination

Sports also provide us with
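
The output above shows two side effects of the cleaning pattern: digits vanish (leaving double spaces) and hyphenated or apostrophized words are fused ("fastpaced", "its"). The sketch below reproduces that on a made-up sentence and contrasts it with a hypothetical variant that substitutes a space instead; `sample` and the variant are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as the cleaning cell: drop everything except ASCII letters and whitespace
pattern = r'[^a-zA-Z\s]'

# Made-up sentence for illustration
sample = "It's a fast-paced game with 11 players; the hoop is 10 feet high."

# Behaviour of the notebook's approach: characters are deleted outright
cleaned = re.sub(pattern, '', sample)
print(cleaned)

# Hypothetical variant: replace with a space, then collapse runs of whitespace,
# so "fast-paced" becomes two tokens instead of one fused token
spaced = re.sub(r'\s+', ' ', re.sub(pattern, ' ', sample)).strip()
print(spaced)
```

Which behaviour is preferable depends on whether fused tokens such as "fastpaced" should count as terms of their own in the TF-IDF vocabulary.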

# Normalization

In [None]:
for i in range(len(docs)):
    docs[i] = docs[i].lower()
print_docs()

Document 1:
 
sports are a vital part of our lives and they have been for centuries they provide us with entertainment exercise and a sense of community whether its watching a game on tv or playing a sport ourselves sports have the power to bring people together and create lasting memories

one of the most popular sports in the world is football soccer it is played by millions of people across the globe and is the most watched sport in the world football is a team sport that requires skill strategy and teamwork it is played on a large field with a ball and two teams of  players each the objective of the game is to score goals by kicking the ball into the opposing teams net

another popular sport is basketball it is a fastpaced game that is played on a rectangular court the objective of the game is to score points by throwing the ball through a hoop which is  feet off the ground basketball is a team sport that requires quick reflexes agility and coordination

sports also provide us with

# Tokenization

In [None]:
for i in range(len(docs)):
    docs[i] = docs[i].split()
print_docs()

Document 1:
 ['sports', 'are', 'a', 'vital', 'part', 'of', 'our', 'lives', 'and', 'they', 'have', 'been', 'for', 'centuries', 'they', 'provide', 'us', 'with', 'entertainment', 'exercise', 'and', 'a', 'sense', 'of', 'community', 'whether', 'its', 'watching', 'a', 'game', 'on', 'tv', 'or', 'playing', 'a', 'sport', 'ourselves', 'sports', 'have', 'the', 'power', 'to', 'bring', 'people', 'together', 'and', 'create', 'lasting', 'memories', 'one', 'of', 'the', 'most', 'popular', 'sports', 'in', 'the', 'world', 'is', 'football', 'soccer', 'it', 'is', 'played', 'by', 'millions', 'of', 'people', 'across', 'the', 'globe', 'and', 'is', 'the', 'most', 'watched', 'sport', 'in', 'the', 'world', 'football', 'is', 'a', 'team', 'sport', 'that', 'requires', 'skill', 'strategy', 'and', 'teamwork', 'it', 'is', 'played', 'on', 'a', 'large', 'field', 'with', 'a', 'ball', 'and', 'two', 'teams', 'of', 'players', 'each', 'the', 'objective', 'of', 'the', 'game', 'is', 'to', 'score', 'goals', 'by', 'kicking', 'th
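
The double spaces and blank lines left behind by the cleaning step do not produce empty tokens, because `str.split()` with no argument splits on any run of whitespace. A minimal sketch on a made-up string (the variable names are illustrative):

```python
# Made-up text containing a double space and a blank line, as left by the cleaning step
text = "two teams of  players each\n\nanother popular sport"

# split() with no argument splits on runs of whitespace and never yields empty strings
tokens = text.split()
print(tokens)

# By contrast, split(" ") splits on every single space, so a double space yields ''
naive_tokens = "of  players".split(" ")
print(naive_tokens)
```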

# Stemming

In [None]:
snowball_stemmer = SnowballStemmer("english")

In [None]:
for i in range(len(docs)):
    docs[i] = [snowball_stemmer.stem(word) for word in docs[i]]
print_docs()

Document 1:
 ['sport', 'are', 'a', 'vital', 'part', 'of', 'our', 'live', 'and', 'they', 'have', 'been', 'for', 'centuri', 'they', 'provid', 'us', 'with', 'entertain', 'exercis', 'and', 'a', 'sens', 'of', 'communiti', 'whether', 'it', 'watch', 'a', 'game', 'on', 'tv', 'or', 'play', 'a', 'sport', 'ourselv', 'sport', 'have', 'the', 'power', 'to', 'bring', 'peopl', 'togeth', 'and', 'creat', 'last', 'memori', 'one', 'of', 'the', 'most', 'popular', 'sport', 'in', 'the', 'world', 'is', 'footbal', 'soccer', 'it', 'is', 'play', 'by', 'million', 'of', 'peopl', 'across', 'the', 'globe', 'and', 'is', 'the', 'most', 'watch', 'sport', 'in', 'the', 'world', 'footbal', 'is', 'a', 'team', 'sport', 'that', 'requir', 'skill', 'strategi', 'and', 'teamwork', 'it', 'is', 'play', 'on', 'a', 'larg', 'field', 'with', 'a', 'ball', 'and', 'two', 'team', 'of', 'player', 'each', 'the', 'object', 'of', 'the', 'game', 'is', 'to', 'score', 'goal', 'by', 'kick', 'the', 'ball', 'into', 'the', 'oppos', 'team', 'net', 'a

# Removing stop words

In [None]:
for i in range(len(docs)):
    # Drop English stop words and any token shorter than three characters
    docs[i] = [word for word in docs[i] if word not in stop_words and len(word) >= 3]
print_docs()

Document 1:
 ['sport', 'vital', 'part', 'live', 'centuri', 'provid', 'entertain', 'exercis', 'sens', 'communiti', 'whether', 'watch', 'game', 'play', 'sport', 'ourselv', 'sport', 'power', 'bring', 'peopl', 'togeth', 'creat', 'last', 'memori', 'one', 'popular', 'sport', 'world', 'footbal', 'soccer', 'play', 'million', 'peopl', 'across', 'globe', 'watch', 'sport', 'world', 'footbal', 'team', 'sport', 'requir', 'skill', 'strategi', 'teamwork', 'play', 'larg', 'field', 'ball', 'two', 'team', 'player', 'object', 'game', 'score', 'goal', 'kick', 'ball', 'oppos', 'team', 'net', 'anoth', 'popular', 'sport', 'basketbal', 'fastpac', 'game', 'play', 'rectangular', 'court', 'object', 'game', 'score', 'point', 'throw', 'ball', 'hoop', 'feet', 'ground', 'basketbal', 'team', 'sport', 'requir', 'quick', 'reflex', 'agil', 'coordin', 'sport', 'also', 'provid', 'opportun', 'stay', 'activ', 'healthi', 'exercis', 'essenti', 'physic', 'mental', 'health', 'sport', 'great', 'way', 'get', 'play', 'sport', 'imp

# Joining the words back into documents for the built-in TF-IDF

In [None]:
for i in range(len(docs)):
    docs[i] = " ".join(docs[i])
print_docs()

Document 1:
 sport vital part live centuri provid entertain exercis sens communiti whether watch game play sport ourselv sport power bring peopl togeth creat last memori one popular sport world footbal soccer play million peopl across globe watch sport world footbal team sport requir skill strategi teamwork play larg field ball two team player object game score goal kick ball oppos team net anoth popular sport basketbal fastpac game play rectangular court object game score point throw ball hoop feet ground basketbal team sport requir quick reflex agil coordin sport also provid opportun stay activ healthi exercis essenti physic mental health sport great way get play sport improv cardiovascular health build muscl strength increas flexibl addit provid physic benefit sport also posit impact mental health particip sport help reduc stress improv selfesteem boost mood also provid sens purpos belong essenti overal wellb sport also play signific role shape histori use tool social polit chang so

# Using the built-in TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the documents
tfidf_matrix = vectorizer.fit_transform(docs)

# Getting feature names (words/terms)
feature_names = vectorizer.get_feature_names_out()

# TF-IDF from scratch

In [None]:
for i in range(len(docs)):
    docs[i] = docs[i].split()
print_docs()

Document 1:
 ['sport', 'vital', 'part', 'live', 'centuri', 'provid', 'entertain', 'exercis', 'sens', 'communiti', 'whether', 'watch', 'game', 'play', 'sport', 'ourselv', 'sport', 'power', 'bring', 'peopl', 'togeth', 'creat', 'last', 'memori', 'one', 'popular', 'sport', 'world', 'footbal', 'soccer', 'play', 'million', 'peopl', 'across', 'globe', 'watch', 'sport', 'world', 'footbal', 'team', 'sport', 'requir', 'skill', 'strategi', 'teamwork', 'play', 'larg', 'field', 'ball', 'two', 'team', 'player', 'object', 'game', 'score', 'goal', 'kick', 'ball', 'oppos', 'team', 'net', 'anoth', 'popular', 'sport', 'basketbal', 'fastpac', 'game', 'play', 'rectangular', 'court', 'object', 'game', 'score', 'point', 'throw', 'ball', 'hoop', 'feet', 'ground', 'basketbal', 'team', 'sport', 'requir', 'quick', 'reflex', 'agil', 'coordin', 'sport', 'also', 'provid', 'opportun', 'stay', 'activ', 'healthi', 'exercis', 'essenti', 'physic', 'mental', 'health', 'sport', 'great', 'way', 'get', 'play', 'sport', 'imp

# Function to calculate TF

In [None]:
def get_tf(docs):
    tf = []
    for doc in docs:
        tf.append({word: count/len(doc) for word,count in Counter(doc).items()})
    return tf
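
As a quick check of what the term-frequency step computes, here is a minimal standalone sketch on a made-up four-token document. `term_frequency` and `toy_doc` are hypothetical names used only for illustration; the logic mirrors `get_tf` for a single document.

```python
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each token within one document."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

# Made-up document: "sport" appears twice out of four tokens
toy_doc = ["sport", "team", "sport", "ball"]
tf_example = term_frequency(toy_doc)
print(tf_example)  # {'sport': 0.5, 'team': 0.25, 'ball': 0.25}
```

Because each count is divided by the document length, the values for one document always sum to 1.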

# Function to count the documents that contain a given word

In [None]:
def get_count_doc_of_word(word):
    # Document frequency: number of documents in the global `docs` that contain `word`
    return sum(1 for doc in docs if word in doc)

# Function to calculate IDF

In [None]:
def get_idf(docs):
    idf = []
    for doc in docs:
        idf.append({word: (np.log((1+len(docs)) / ( 1 + get_count_doc_of_word(word))) + 1) for word in doc})
    return idf
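
The IDF expression above is the smoothed form log((1 + N) / (1 + df)) + 1, which, as far as I can tell, is also what scikit-learn's `TfidfVectorizer` uses with its default `smooth_idf=True`. A small worked example on a made-up corpus follows; `smoothed_idf` and `toy_corpus` are hypothetical names, not part of the notebook.

```python
import math

def smoothed_idf(word, corpus):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if word in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Made-up corpus of three tokenized documents
toy_corpus = [
    ["sport", "team", "ball"],
    ["weather", "sport"],
    ["car", "engine"],
]

idf_sport = smoothed_idf("sport", toy_corpus)  # in 2 of 3 docs: ln(4/3) + 1
idf_car = smoothed_idf("car", toy_corpus)      # in 1 of 3 docs: ln(4/2) + 1
print(round(idf_sport, 4), round(idf_car, 4))
```

The rarer word ("car") receives the larger weight, and the added 1 keeps a word that occurs in every document from being zeroed out.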

# Function to calculate TF-IDF

In [None]:
def get_tf_idf():
    # Compute TF and IDF once, then multiply per word by key lookup
    # (no reliance on both dicts sharing the same iteration order)
    tf, idf = get_tf(docs), get_idf(docs)
    tf_idf = []
    for i in range(len(docs)):
        tf_idf.append({word: tf_value * idf[i][word] for word, tf_value in tf[i].items()})
    return tf_idf

In [None]:
tf_idf = get_tf_idf()

# Normalizing TF-IDF

In [None]:
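def normalize_TF_IDF(TF_IDF):
    # L2-normalize each document vector in place, computing the norm once per document
    for i in range(len(TF_IDF)):
        norm = np.sqrt(np.sum(np.square(list(TF_IDF[i].values()))))
        TF_IDF[i] = {word: value / norm for word, value in TF_IDF[i].items()}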

In [None]:
normalize_TF_IDF(tf_idf)
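
One point worth making explicit: the from-scratch TF divides counts by the document length, whereas scikit-learn's vectorizer works from raw counts. After L2 normalization the two should coincide, since a constant per-document factor cancels out. The sketch below illustrates this with made-up numbers; `l2_normalize` and all values are assumptions for illustration only.

```python
import math

def l2_normalize(vec):
    """Scale a word -> weight dict so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(value * value for value in vec.values()))
    return {word: value / norm for word, value in vec.items()}

# Made-up counts and IDF weights for a single four-token document
counts = {"sport": 3, "team": 1}
idf = {"sport": 1.0, "team": 1.5}
doc_len = sum(counts.values())

# Raw-count TF times IDF versus relative-frequency TF times IDF
raw_weights = {w: counts[w] * idf[w] for w in counts}
rel_weights = {w: counts[w] / doc_len * idf[w] for w in counts}

normalized_raw = l2_normalize(raw_weights)
normalized_rel = l2_normalize(rel_weights)
print(normalized_raw)
print(normalized_rel)
```

Both printed vectors agree up to floating-point error, which is why the comparison table can show matching values for the two implementations.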

# Sorting each TF-IDF vector alphabetically by word

In [None]:
def sort_docs(tf_idf):
    for doc_id, vec in enumerate(tf_idf):
        tf_idf[doc_id] = dict(sorted(vec.items()))
    return tf_idf

In [None]:
tf_idf = sort_docs(tf_idf)

In [None]:
# Create a list to store the table data
table_data = []

# Iterate over each document
for doc_id in range(len(tf_idf)):
    # Create a row for the document
    row = [f"Document {doc_id + 1}"]

    # Create dictionaries to store the word-value pairs for each implementation
    sklearn_dict = {}
    custom_dict = {}

    # Iterate over scikit-learn's TF-IDF scores
    for word, value in zip(feature_names, tfidf_matrix.toarray()[doc_id]):
        if value > 0:
            sklearn_dict[word] = f"{value:.2f}"

    # Iterate over custom TF-IDF scores
    for word, value in tf_idf[doc_id].items():
        if value > 0:
            custom_dict[word] = f"{value:.2f}"

    # Combine the dictionaries, sorted by word so the column order is deterministic
    words = sorted(set(sklearn_dict) | set(custom_dict))
    combined_dict = {word: [sklearn_dict.get(word, ""), custom_dict.get(word, "")] for word in words}

    # Append the word-value pairs to the row
    for word, values in combined_dict.items():
        row.extend([word, values[0], values[1]])

    # Append the row to the table data
    table_data.append(row)

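# Create the table headers, sized to the widest row (rows differ in length per document)
max_terms = max((len(row) - 1) // 3 for row in table_data)
headers = ["Document"] + ["Word", "Scikit-learn TF-IDF", "Custom TF-IDF"] * max_terms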

# Print the table using tabulate
print(tabulate(table_data, headers, tablefmt="grid"))

+------------+---------+------+------+------------+------+------+-------+------+------+-------+------+------+-------+------+------+--------+------+------------+-----------+-----------------------+-----------------+--------+-----------------------+-----------------+--------+-----------------------+-----------------+---------+-----------------------+-----------------+--------+-----------------------+-----------------+---------+-----------------------+-----------------+---------+-----------------------+-----------------+--------+-----------------------+-----------------+------------+-----------------------+-----------------+--------+-----------------------+-----------------+--------+-----------------------+-----------------+-----------+-----------------------+-----------------+-------------+-----------------------+-----------------+--------+-----------------------+-----------------+----------+-----------------------+-----------------+---------+-----------------------+-----------------+---