<a href="https://colab.research.google.com/github/HimanshuRajput013/Gutenberg_dataset_text_generation_using_GPT-2_Model/blob/main/gutenberg_dataset_text_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [37]:
# Install LangChain, sentence-transformers, and faiss-cpu
!pip install langchain sentence-transformers faiss-cpu
!pip install langchain_huggingface
!pip install langchain_community
!pip install unstructured




In [38]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
import nltk
import os
nltk.download('punkt')
from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [40]:
import os
from langchain_community.document_loaders import TextLoader

# Path to the folder where your text files are stored
directory = "/content/drive/MyDrive/Gutenberg-d/Gutenberg"

# Initialize an empty list to hold all text documents
all_documents = []

# Loop through each file in the directory
for filename in os.listdir(directory):
    if filename.endswith(".txt"):  # Check for .txt files
        file_path = os.path.join(directory, filename)
        loader = TextLoader(file_path)
        text_documents = loader.load()
        all_documents.extend(text_documents)  # Add loaded documents to the list

# Print the total number of loaded documents
print(f"Loaded {len(all_documents)} documents.")


Loaded 192 documents.


In [41]:
!pip install langchain_text_splitters



In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split each document into chunks of about 1000 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
final_documents = text_splitter.split_documents(all_documents)
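
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, so its chunks will generally differ from these. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk begins with the last three characters of the previous one, which is the role `chunk_overlap=50` plays above: text near a chunk boundary still appears with some of its surrounding context.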


In [43]:
# Create a .env file with your Hugging Face token
with open('.env', 'w') as f:
    f.write('HF_TOKEN=API_KEY_HERE\n')  # Replace API_KEY_HERE with your actual token


In [44]:
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

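# load_dotenv() already exports HF_TOKEN; fail early if it is missing
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise ValueError("HF_TOKEN is not set; add it to the .env file")
os.environ['HF_TOKEN'] = hf_token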


In [54]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load the pre-trained SentenceTransformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Or any other suitable model

# Get the text content of each chunk produced by the text splitter
texts = [doc.page_content for doc in final_documents]

# Generate embeddings for the documents
embeddings = embedding_model.encode(texts, convert_to_tensor=True)

# The next cell embeds a query and ranks the chunks against it
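
The search in the next cell ranks chunks by cosine similarity between the query embedding and each chunk embedding. As a rough illustration of that ranking, here is a plain-Python sketch on toy three-dimensional vectors. The names `cosine_similarity` and `rank_by_similarity` and the toy vectors are hypothetical; real embeddings from `all-MiniLM-L6-v2` have far more dimensions, and `util.cos_sim` does this computation on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k):
    """Return (index, score) pairs for the top_k vectors most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # Highest score first
    return scored[:top_k]


# Toy stand-ins for chunk embeddings and a query embedding
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs, top_k=2)
for idx, score in ranking:
    print(f"Vector {idx}: {score:.3f}")
```

Cosine similarity depends only on the angle between vectors, not their length, so a long chunk and a short chunk pointing in the same direction score the same.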



In [60]:
query = "I will then give you one other dollar. By this, if you hire yourself at ten dollars a month, from me you will get ten more, making twenty dollars a month for your work. In this, I do not mean you shall go off to St. Louis, or the lead mines, or the gold mines, in California"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# Compute cosine similarities between the query embedding and document embeddings
cosine_similarities = util.cos_sim(query_embedding, embeddings)

# Rank chunks by similarity, highest first
top_k = 5  # Retrieve the 5 most similar chunks
top_k_indices = np.argsort(-cosine_similarities[0].cpu().numpy())[:top_k]

# Show top_k most relevant documents
for idx in top_k_indices:
    print(f"Document {idx} is relevant with similarity score {cosine_similarities[0][idx].item()}")

Document 1136 is relevant with similarity score 0.6764839887619019
Document 1760 is relevant with similarity score 0.675173819065094
Document 51116 is relevant with similarity score 0.5246419906616211
Document 40491 is relevant with similarity score 0.5191895961761475
Document 51127 is relevant with similarity score 0.48789501190185547
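
The generation cell that follows combines three sampling controls: `temperature`, `top_k`, and `top_p`. The sketch below shows, on a toy four-token vocabulary, one way these filters can be applied in sequence (temperature scaling, then top-k, then nucleus filtering). It is an illustrative approximation in plain Python, not the `transformers` implementation; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most probable tokens, then the smallest prefix whose
    renormalized cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    top_k_mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / top_k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {token_id: p / kept_mass for token_id, p in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # Hypothetical scores for a 4-token vocabulary
probs = softmax(toy_logits, temperature=0.7)
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.9)

# Sample one token id from the filtered, renormalized distribution
rng = random.Random(0)
token_ids = list(filtered)
sampled = rng.choices(token_ids, weights=[filtered[t] for t in token_ids], k=1)[0]
print(filtered, sampled)
```

In this toy case the least likely token is cut by top-k and the third by top-p, so sampling only ever chooses between the two most probable tokens.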


In [61]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 model and tokenizer
gpt2_model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(gpt2_model_name)
gpt2_model = GPT2LMHeadModel.from_pretrained(gpt2_model_name)

# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
gpt2_model.config.pad_token_id = gpt2_model.config.eos_token_id

# Use the most relevant chunk as a prompt for GPT-2
top_chunk = texts[top_k_indices[0]]  # Highest-scoring chunk

# Encode the prompt text; the tokenizer also returns the attention mask
inputs = tokenizer(top_chunk, return_tensors='pt')

# Generate a text continuation using GPT-2 with parameters to reduce repetition
output = gpt2_model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],  # Passing the mask avoids the attention-mask warning
    pad_token_id=tokenizer.eos_token_id,      # Passing the pad token avoids the pad-token warning
    max_new_tokens=100,  # Number of tokens to generate
    do_sample=True,      # Activate sampling
    top_p=0.95,          # Nucleus sampling (smallest token set with cumulative probability >= 0.95)
    top_k=50,            # Top-k sampling (consider only the 50 most probable tokens)
    temperature=0.7,     # Lower values make output less random; higher values add variety
    num_return_sequences=1
)

# Decode the output into readable text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


You are now in need of some money; and what I propose is, that you shall
go to work, "tooth and nail," for somebody who will give you money for
it. Let father and your boys take charge of your things at home,
prepare for a crop, and make the crop, and you go to work for the best
money wages, or in discharge of any debt you owe, that you can get; and,
to secure you a fair reward for your labour, I now promise you, that for
every dollar you will, between this and the first of May, get for your
own labour, either in money or as your own indebtedness, I will then
give you one other dollar. By this, if you hire yourself at ten dollars
a month, from me you will get ten more, making twenty dollars a month
for your work. In this I do not mean you shall go off to St. Louis, or
the lead mines, or the gold mines in California, but I mean for you to
go at it for the best wages you can get close to home in Coles County.
Now, if you will do this, you will be soon out of debt, and, what is worse, you