# Importing Necessary Libraries

In [5]:
import json
import pandas as pd
from random import sample
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.callbacks import get_openai_callback
from langchain_core.runnables import RunnableParallel,RunnablePassthrough
from langchain.output_parsers import PydanticOutputParser,OutputFixingParser
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_community.llms import OpenAI
from operator import itemgetter
import os
import re

# Defining Two Model Instances

We create two `ChatOpenAI` instances, called `llm` and `llm3`. Our API key gives access to both GPT-3.5 and GPT-4; as the code shows, we set the model to GPT-3.5 (`gpt-3.5-turbo-0125`). The only difference between the two instances is that `llm` also requests JSON output through `response_format`. The last parameter we need to set is the temperature. In simple terms, the temperature controls how creative or predictable a language model's responses are. When the temperature is high (closer to 1), the model generates more imaginative but less precise answers. When it is low (close to 0), the model becomes more focused and deterministic. The right balance depends on the task: a low temperature for factual answers, a high temperature for creative tasks such as storytelling.

In [6]:
llm=ChatOpenAI(openai_api_key="xyz",model="gpt-3.5-turbo-0125",temperature=0,
               model_kwargs={"response_format": {"type": "json_object"}})

llm3=ChatOpenAI(openai_api_key="xyz",model="gpt-3.5-turbo-0125",temperature=0)
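
To make the effect of temperature more concrete, here is a minimal sketch of temperature-scaled softmax sampling probabilities. It is an illustration only, not how the OpenAI API is implemented: the function name `softmax_with_temperature` and the three scores are hypothetical. A temperature of 0, as used above, can be read as the limiting case in which the top-scoring token is always chosen, so the sketch only accepts positive values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top token
high = softmax_with_temperature(logits, 1.0)  # mass spread across the candidates
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the low temperature almost all probability lands on the first token, which is why low values give repeatable output.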

Here we define the script number. It is used later to select a specific input file.

In [7]:
filenum=input("ENTER THE SCRIPT NUMBER: ")

ENTER THE SCRIPT NUMBER: 42


## Defining The Class
We use the pydantic library to define a class called `Description`, which inherits from `BaseModel`. We can then declare our variables inside this class.

As you can see, the first variable we define is `response`, which is of string type. Using `Field`, we can attach additional metadata (in simpler words, be more specific) about what we want from `response`. Here that metadata is the sentence "description which is rewritten", which is passed to the model as part of the format instructions. Further down the line, the output parser built from this class checks that the generated content has the structure we want: a JSON object with a string `response` field.


In [14]:
class Description(BaseModel):
    response: str = Field(description=" description which is rewritten")

description_parser=PydanticOutputParser(pydantic_object=Description)
description_output_parser=OutputFixingParser(parser=description_parser,llm=llm)
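
As a rough picture of what the parsing step checks, the sketch below uses only the standard library. It is a stand-in under stated assumptions, not LangChain's or pydantic's implementation: `parse_description` is a hypothetical helper, and the real `OutputFixingParser` additionally asks the LLM to repair output that fails to parse.

```python
import json

def parse_description(raw_text):
    """Parse model output and check it has a string 'response' field."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict) or not isinstance(data.get("response"), str):
        raise ValueError("expected a JSON object with a string 'response' field")
    return data["response"]

good = '{"response": "A quiet hotel close to the beach."}'
bad = '{"reply": "A quiet hotel close to the beach."}'

print(parse_description(good))
try:
    parse_description(bad)
except ValueError as err:
    print("rejected:", err)
```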

We have three levels of prompts here: first `description_writer_prompt`, then `anti_ai_detection_prompt`, and finally `seo_prompt`. Each prompt has its own unique purpose.

The first level asks the LLM to create generic descriptions of at most 250 words. The description is of a hotel whose details we feed in via the placeholder `{inp}`. The output format is given by `{format_instructions}`, which is filled in from the parser defined above.

The second level takes the output of the first level as its input and produces its own output. This prompt makes sure the generated content does not read as AI-generated. Here the input goes in the placeholder `{description}`.

The third level is the SEO prompt. It creates SEO (search engine optimised) descriptions of the hotel. It focuses mainly on the burstiness of the sentences to make them seem even less AI-generated, as illustrated by the input/output examples provided in the prompt.

In [9]:
description_writer_prompt="""
Your task is to create generic descriptions . Each description must be within 250 words. These descriptions should be versatile enough to apply to multiple hotels without mentioning specific names or locations. 

The target audience is a general group of potential guests. Focus on highlighting each hotel's unique features, amenities, and experiences to entice readers. 
Ensure the descriptions are clear, concise, and free from content that could lead to plagiarism claims. Put emphasis on any metrics like distance from airport or distance from key tourist locations and include them in your rewritten description. 
By capturing the essence of each hotel, the goal is to enable potential guests to envision their stay and make informed booking decisions.  

Be concise and stick to the point.  

The Description to be rewritten is {inp}.

JSON {format_instructions}
"""
description_writer_template=PromptTemplate(input_variables=["inp"],template=description_writer_prompt,partial_variables={'format_instructions':description_output_parser.get_format_instructions()})
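
The idea behind `partial_variables` is that some placeholders are filled once, when the template is built, while others such as `{inp}` are filled per call. A minimal sketch of that idea with plain string formatting follows. It does not use LangChain; the `render` helper, the shortened template and the format-instruction text are all hypothetical.

```python
from functools import partial

# Shortened, hypothetical version of the prompt above.
template = "Rewrite this description: {inp}\n\nJSON {format_instructions}"

def render(template_text, **variables):
    """Fill every placeholder in the template."""
    return template_text.format(**variables)

# Hypothetical stand-in for the text returned by get_format_instructions().
format_instructions = 'Return a JSON object with a single string field "response".'

# Pre-fill the format instructions once; only `inp` remains to be supplied.
render_description_prompt = partial(
    render, template, format_instructions=format_instructions
)

prompt_text = render_description_prompt(inp="Sea-view rooms, 5 km from the airport.")
print(prompt_text)
```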

In [10]:
anti_ai_detection_prompt="""
Compose an engaging and original piece of writing showcasing the remarkable capabilities of natural language generation. 
The word count should be 250 words at a maximum.
Craft the content with utmost creativity to ingeniously avoid predictable patterns commonly associated with AI-generated text, 
ensuring it maintains the authenticity of human composition in English UK.  
When it comes to writing content, two factors are crucial, "perplexity" and "burstiness." Perplexity measures the complexity of text.
Separately, burstiness compares the variations of sentences. Humans tend to write with greater burstiness, for example, with some longer or complex sentences alongside shorter ones. 
AI sentences tend to be more uniform. Your challenge is to create a captivating text, rich in perplexity and burstiness.
This limitation will test your ability to wield language proficiency effectively without triggering AI content detectors. 
Description to be rewritten is: {description} 

JSON {format_instructions}
"""
anti_ai_detection_template=PromptTemplate(input_variables=["description"],template=anti_ai_detection_prompt,partial_variables={'format_instructions':description_output_parser.get_format_instructions()})
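
The prompt describes burstiness as variation between sentences. One rough way to put a number on that is the spread of sentence lengths, sketched below. This is only a crude proxy chosen for illustration; it is not how any particular AI-content detector works, and the function names and sample texts are made up.

```python
import re
import statistics

def sentence_lengths(text):
    """Split on sentence-ending punctuation and count the words in each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Population standard deviation of sentence lengths (0.0 means uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths)

uniform = "The room is clean. The pool is large. The food is good."
varied = (
    "Stay here. The rooftop pool looks out over the old town "
    "and stays open until midnight. Breakfast is free."
)

print(sentence_lengths(uniform), burstiness(uniform))
print(sentence_lengths(varied), burstiness(varied))
```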

In [11]:
seo_prompt="""
Write SEO-friendly and optimized hotel descriptions for a series of hotels. Each description should include the exact phrases "{hotel_name} room prices" and "Book {hotel_name} Online". 
Highlight how each hotel offers affordable accommodations suitable for a variety of people and make sure to include the phrases "{hotel_name} room prices" and "Book {hotel_name} Online" word for word during this part.
Mention the hotel address which is {address} somewhere in the description so our customers can reach it easier.
Change at least 20 words to other words with the same meaning.
The SEO friendly text should have high perplexity and burstiness.Perplexity measures the complexity of text.
Separately, burstiness compares the variations of sentences. Humans tend to write with greater burstiness, for example, with some longer or complex sentences alongside shorter ones. 
AI sentences tend to be more uniform. 

Example input output pairs as below:
1)Input: The zebra ran away from the predators. It was fast and smart. It got to the water and swam. It was safe and happy.
Output: The zebra galloped across the savanna, dodging the lions and hyenas that were chasing it. It reached the river and jumped in, hoping to escape. But it was too late. The crocodiles were waiting for it, ready to snap their jaws.

2)Input: She saw him at the door with flowers. He said, “I love you.” She was happy and hugged him. She did not see the knife.
Output: She opened the door and saw him standing there, holding a bouquet of roses. He smiled and said, “Happy anniversary, my love.” She felt a surge of joy and hugged him. Then she noticed the blood on his shirt and the knife in his hand.

3)Input: He had a machine that could do things. He wanted to use it for good and bad. He pushed a button and went somewhere. He was gone.
Output: He was a genius, but also a madman. He had invented a device that could manipulate time and space. He wanted to use it to explore the mysteries of the universe, but also to change history. He pressed a button and disappeared, never to be seen again.

4)Input: She wanted to be a singer. She practiced a lot and tried hard. She got a chance to sing on TV. She sang well.
Output: She had always dreamed of becoming a singer. She practiced every day, sang in every audition, and worked hard to improve her skills. She finally got her big break when she was invited to perform on a popular TV show. She opened her mouth and nothing came out.


The description which has to be SEO optimized and whom's perplexity and burstiness needs to be changed is: {anti_ai}

JSON {format_instructions}
"""
seo_template=PromptTemplate(input_variables=["address","hotel_name","anti_ai"],template=seo_prompt,partial_variables={'format_instructions':description_output_parser.get_format_instructions()})

The input Excel file has three main inputs: the hotel address, the hotel name and the hotel description. The hotel description goes into the placeholder `{inp}`, and that is what we feed into the first prompt to create our basic description. This is then passed into the anti-AI prompt, and in the last step we insert the hotel name and hotel address for the hotel we are dealing with. This is the basic architecture of what is happening here.

In [12]:
highlights_prompt="""
Generate concise bullet point highlights for a hotel booking website's hotel descriptions, emphasizing amenities, distances (in kilometers) from the airport and railway station, and proximity to tourist attractions. Cater to both casual and business travelers. 
Consider a mix of hotels, including luxury, boutique, and budget options. 
Highlight key amenities like free Wi-Fi, workspaces, fitness centers, and restaurants. 
Specify distances from the nearest airport and railway station. If unable to find metrics in the description then do not mention them in the highlights.
Mention nearby tourist attractions. Identify unique selling points for each hotel. 
Keep the total word limit to 100 words.
Example of how the highlights is supposed to look:
• Located within a 10-minute drive of Baga Beach and Baga Night Market
• Offers 44 guestrooms with minibars, flat-screen TVs, and free Wi-Fi
• Full-service spa with massages, body treatments, and facials
Highlights need to be generated for this article: {inp}  
"""
highlights_template=PromptTemplate(input_variables=["inp"],template=highlights_prompt)
highlights_runnable = highlights_template | llm3

Here we run two chains. One chain is composed of three inner chains (`chain_1`, `chain_2`, `chain_3`) and the other runs in parallel to these. We use the model `llm3`, defined earlier, for the highlights prompt, and `llm` for the description-writer, anti-AI and SEO prompts. The same is coded below.

In [13]:
chain_1 = description_writer_template | llm | description_output_parser | {"description": RunnablePassthrough()}
chain_2 = {"description": chain_1} | anti_ai_detection_template | llm | {"anti_ai":RunnablePassthrough()}
chain_3 = {"anti_ai":chain_2, "hotel_name":itemgetter("hotel_name"), "address":itemgetter("address")} | seo_template | llm
overall_chain = RunnableParallel(seo=chain_3,highlights=highlights_runnable)
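
The data flow of these chains can be sketched without LangChain or any API calls. In the sketch below, plain functions stand in for each prompt-plus-model step; every function name and the sample row are hypothetical. It only shows how one input row feeds a three-step sequential branch and a parallel highlights branch, and it makes no claim about how `RunnableParallel` schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub steps: each one stands in for a prompt template piped into a model.
def write_description(row):
    return f"generic({row['inp']})"

def humanise(description):
    return f"humanised({description})"

def seo_optimise(anti_ai, hotel_name, address):
    return f"seo({anti_ai}) | Book {hotel_name} Online | {address}"

def highlights(row):
    return f"highlights({row['inp']})"

def seo_branch(row):
    """Sequential branch: each step consumes the previous step's output."""
    description = write_description(row)
    anti_ai = humanise(description)
    return seo_optimise(anti_ai, row["hotel_name"], row["address"])

def run_overall(row):
    """Both branches receive the same row; results are collected under two keys."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        seo_future = pool.submit(seo_branch, row)
        highlights_future = pool.submit(highlights, row)
        return {"seo": seo_future.result(), "highlights": highlights_future.result()}

row = {"inp": "beach hotel", "hotel_name": "Hotel Example", "address": "1 Example Road"}
result = run_overall(row)
print(result)
```

Note that the hotel name and address are only used in the final SEO step, which is why the earlier steps can stay generic.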

Here we define the input file path by joining a folder path and a file name. We read the Excel file and rename its three main columns to `inp`, `address` and `hotel_name`. The prompts generate generic descriptions, and the last step of the chain inserts the hotel name and hotel address for the corresponding hotel.

In [10]:
cwd = r"D:\Suresh\Hotel descriptions\Input"  # raw string so the backslashes are not read as escape sequences
relative_path = f"hoteldetails_{filenum}.xlsx"
absolute_path= os.path.join(cwd, relative_path)
temp_data=pd.read_excel(absolute_path)
temp_data.rename(columns={'basicInfo.description':'inp','locationInfo.address':'address','hotelName':'hotel_name'},inplace=True)
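
As a small illustration of the path handling, the sketch below builds the same kind of path using `ntpath`, the Windows flavour of `os.path`, so that it behaves identically on any operating system. The value of `filenum` is the one entered earlier; the variable name `input_dir` is an assumption made for this example.

```python
import ntpath  # Windows-style path functions, importable on any platform

filenum = "42"  # the script number typed at the prompt earlier
input_dir = r"D:\Suresh\Hotel descriptions\Input"  # raw string keeps backslashes literal

absolute_path = ntpath.join(input_dir, f"hoteldetails_{filenum}.xlsx")
print(absolute_path)
print(ntpath.basename(absolute_path))
```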

Each of our Excel files contains over 4,000 hotel details. Since we have a token limit of 10k tokens for gpt-4 and 60k tokens for gpt-3.5, we can't run in batches of 30 for gpt-4. A batch of 30 means the code generates descriptions for 30 hotels at once.
Remember that 1,000 tokens correspond to approximately 750 words. Here we use a batch size of 30, so `temp_data` is divided into `len(batches)` batches, and we run the chain over a chosen range of those batches.

In [11]:
batch_size=30
batches=[temp_data[i:i+batch_size] for i in range(0,temp_data.shape[0],batch_size)]
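
The slicing pattern above works on any sequence, not only on a DataFrame. The sketch below applies it to a plain list and adds a rough token estimate based on the rule of thumb of about 0.75 words per token. The helper names and the 100-row example are hypothetical, and the estimate is only a planning aid, not an exact tokenizer count.

```python
import math

def make_batches(rows, batch_size):
    """Split rows into consecutive chunks; the last chunk may be smaller."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def estimate_tokens(word_count):
    """Rough rule of thumb: about 0.75 words per token, rounded up."""
    return math.ceil(word_count / 0.75)

rows = list(range(100))  # stand-in for 100 hotel rows
example_batches = make_batches(rows, 30)

print(len(example_batches))
print([len(b) for b in example_batches])
print(estimate_tokens(250))  # one 250-word description
```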

In [48]:
start=588
end=len(batches)

The code snippet below formats each response before we append it to the final file.

In [49]:
def format_page(page,cityName,countryName,hotelId,hotelName):
    temp={}
    temp['cityName']=cityName
    temp['countryName']=countryName
    temp['hotelId']=hotelId
    temp['hotelName']=hotelName
    temp['new_description']=json.loads(page["seo"].content)['response']
    temp['highlights']=page['highlights'].content
    return temp
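
To see what `format_page` pulls out of a result, the sketch below builds a fake `page` whose values only carry a `.content` attribute, which is the one attribute the function reads. The `SimpleNamespace` objects and the sample strings are hypothetical stand-ins for the chat message objects returned by the chain.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-ins for the model responses; only `.content` is used.
page = {
    "seo": SimpleNamespace(content='{"response": "Book Hotel Example Online today."}'),
    "highlights": SimpleNamespace(content="- Free Wi-Fi\n- 5 km from the airport"),
}

# The SEO branch returns JSON text, so it is decoded before reading 'response'.
new_description = json.loads(page["seo"].content)["response"]
# The highlights branch returns plain text and is used as-is.
highlights_text = page["highlights"].content

print(new_description)
print(highlights_text)
```

If the SEO output is not valid JSON, `json.loads` raises an exception, so a production run may want to catch that case per row.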

Here we first import tqdm, a Python library that provides a simple and intuitive way to add progress bars to your code. It lets you visualize the progress of iterative processes, showing how much work has been completed and how much remains. We then define two new columns in the `temp_data` frame, called `new_description` and `highlights`. After this we define the folder path and file name again to build the absolute path to the output file.
If the output file does not already exist, the code creates it and prints a message saying that the file has been generated.

Next we iterate over the batches, and for each batch dataframe we create a list of dictionaries. Each dictionary corresponds to one data point, or row, of the batch. Then we use `overall_chain`, defined above, to create the descriptions of our hotels. After that we format the results and update the Excel file. That's it!
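
Because results are saved after every batch, an interrupted run can pick up where it stopped. The sketch below shows one possible variation of that idea, in which the resume point is derived from the number of rows already saved. It swaps in CSV files and the standard library in place of pandas and Excel so that it runs anywhere; the helper names and the tiny batches are hypothetical, and it assumes every finished batch was written in full.

```python
import csv
import os
import tempfile

def append_batch(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

def count_rows(path):
    """Number of data rows already saved (0 if the file does not exist)."""
    if not os.path.exists(path):
        return 0
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))

batch_size = 2
demo_batches = [
    [{"hotelId": 1}, {"hotelId": 2}],
    [{"hotelId": 3}, {"hotelId": 4}],
    [{"hotelId": 5}],
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "output.csv")
    append_batch(path, ["hotelId"], demo_batches[0])  # pretend the first run stopped here
    resume_from = count_rows(path) // batch_size      # batches already completed
    for batch in demo_batches[resume_from:]:
        append_batch(path, ["hotelId"], batch)
    total_rows = count_rows(path)

print(resume_from, total_rows)
```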

In [50]:
from tqdm.auto import tqdm

temp_data['new_description']=None
temp_data['highlights']=None

folder_path = r"D:\Suresh\Hotel descriptions\Output" # specify the output folder path here
file_name = f'hoteldetails_output_{filenum}.xlsx'
file_path = os.path.join(folder_path, file_name)

if not os.path.exists(file_path):
    df = pd.DataFrame(columns=temp_data.columns)
    df.to_excel(file_path, index=False)
    print(f"{file_name} created in {os.path.abspath(folder_path)}")

for i in tqdm(range(start,end)):
    answers=pd.read_excel(file_path)
    print(f"{i+1}th batch started")
    batch=batches[i]
    list_of_dicts=batch.to_dict('records')
    duty_free_pages=overall_chain.batch(list_of_dicts)
    for j in range(len(duty_free_pages)):
        # pull the identifying fields for this hotel from the input row
        city_name=list_of_dicts[j]['cityName']
        country_name=list_of_dicts[j]['countryName']
        hotel_id=list_of_dicts[j]['hotelId']
        hotel_name=list_of_dicts[j]['hotel_name']
        final_page=format_page(duty_free_pages[j],city_name,country_name,hotel_id,hotel_name)
        answers.loc[len(answers)]=final_page
    # save after every batch so finished work survives an interruption
    answers.to_excel(file_path,index=False)
    print(f"{i+1}th batch finished successfully")
    start+=1  # rerunning this cell then resumes after the last finished batch

  0%|          | 0/61 [00:00<?, ?it/s]

589th batch started
589th batch finished successfully
590th batch started
590th batch finished successfully
591th batch started
591th batch finished successfully
592th batch started
592th batch finished successfully
593th batch started
593th batch finished successfully
594th batch started
594th batch finished successfully
595th batch started
595th batch finished successfully
596th batch started
596th batch finished successfully
597th batch started
597th batch finished successfully
598th batch started
598th batch finished successfully
599th batch started
599th batch finished successfully
600th batch started
600th batch finished successfully
601th batch started
601th batch finished successfully
602th batch started
602th batch finished successfully
603th batch started
603th batch finished successfully
604th batch started
604th batch finished successfully
605th batch started
605th batch finished successfully
606th batch started
606th batch finished successfully
607th batch started
607th ba