# NLP Project


# Data exploration
We chose [this dataset](https://www.kaggle.com/datasets/ruthgn/new-orleans-airbnb-listings-and-reviews?select=new_orleans_airbnb_listings.csv) about AirBnB reviews in New Orleans. The dataset gathers together 345953 reviews of 6028 listings in the 67 districts of New Orleans.

For the purposes of this project we focused on the Uptown district, which has 6087 reviews over 98 listings.

A short data exploration section follows.

For the sake of simplicity, we saved the two CSV files in a publicly shared Google Drive folder and loaded them into Colab with ```gdown```.

In [None]:
# required libraries
!pip install lingua-language-detector   # language detection

!pip install langchain
!pip install langchain-community
!pip install sentence-transformers      # embedder
!pip install faiss-cpu                  # vector store

!pip install transformers
!pip install bitsandbytes               # model quantization
!pip install torch
!pip install accelerate==0.21.0
!pip install huggingface-hub -q

In [None]:
import pandas as pd
from pathlib import Path
from matplotlib import pyplot as plt
import matplotlib.colors as mcolors
from lingua import Language, LanguageDetectorBuilder

In [None]:
# define and create data path
data_path = Path('./data/')
data_path.mkdir(exist_ok=True)

In [None]:
# download the files in Colab filesystem
!gdown -O ./data/new_orleans_airbnb_reviews.csv   # ask me for the "password" or download the dataset from the link
!gdown -O ./data/new_orleans_airbnb_listings.csv  # ask me for the "password" or download the dataset from the link

In [None]:
# load the files
listings_df = pd.read_csv(data_path / 'new_orleans_airbnb_listings.csv')
reviews_df = pd.read_csv(data_path / 'new_orleans_airbnb_reviews.csv').rename(columns={'id': 'review_id'})

### Info about the data

In [None]:
print('No. of Listings:', len(listings_df))
print('No. of Reviews:', len(reviews_df))

In [None]:
listings_df[:5]

In [None]:
listings_df.info()

In [None]:
reviews_df[:5]

In [None]:
reviews_df.info()

## Preprocessing

As a first step, we want to remove reviews that are not written in the English language.
To do so, we use the ```lingua``` library.

We add a column to the reviews dataframe, then filter by ```Language.ENGLISH```.

We adopted this library because it is simpler to use than the alternatives we considered: it requires no API key (unlike, e.g., the Google Cloud Translation API) and it imposes no token limit (unlike, e.g., the free plan of the DeepL API). In short, everything was ready to use.

In [None]:
# we specified the most common EU languages for the detector
# non-informative text (missing reviews, reviews consisting only of periods or emojis) and non-Latin-alphabet languages
# will be marked as None (and if not None, hopefully not as English!)
language_detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH,
                                                             Language.FRENCH,
                                                             Language.SPANISH,
                                                             Language.GERMAN,
                                                             Language.ITALIAN).build()

In [None]:
reviews_df['review_language'] = reviews_df['comments'].apply(lambda rev: language_detector.detect_language_of(str(rev)))
reviews_df['review_language'] = reviews_df['review_language'].fillna('NOLANG')
reviews_df.groupby('review_language').size()

In [None]:
english_reviews = reviews_df[reviews_df['review_language'] == Language.ENGLISH]

Here we add the meaningful columns to the reviews dataframe, and we also show the reviews distribution per district.

We needed only a few thousand documents (in our case, reviews) for this project, so we picked one neighborhood at random from the ones displayed in red in the plot (the ones with >5k and \<10k reviews).

Our choice fell on the <mark>**Uptown** district</mark>.

Later, we will save the CSV file containing only the chosen reviews. This filtered file will then be used for the rest of the project.

In [None]:
# columns to add to the review dataframe
meaningful_columns = ['id',                     # needed for the merge, then dropped
                      'name',                   # name of the property
                      'neighbourhood_cleansed', # name of the neighbourhood, then dropped
                      'property_type',          # additional info on the property
                      'host_id']                # host id

# merge columns, drop id (it's a duplicate column)
# and rename neighbourhood to district for simplicity
review_full_df = pd.merge(english_reviews, listings_df[meaningful_columns], left_on='listing_id', right_on='id', how='left')
review_full_df.drop(columns=['id', 'review_language'], inplace=True)
review_full_df.rename(columns={'neighbourhood_cleansed': 'district'}, inplace=True)

In [None]:
# group by district to count reviews
reviews_per_district = pd.DataFrame(review_full_df.groupby('district').size(), columns=['num'])

In [None]:
chosen_district = 'Uptown'

colors = []
for idx, row in reviews_per_district.iterrows():
    if idx == chosen_district:
        colors.append(mcolors.TABLEAU_COLORS['tab:green'])
    elif 5000 <= row['num'] <= 10000:                   # eligible districts
                                                        # i.e., they have a few thousands documents
        colors.append(mcolors.TABLEAU_COLORS['tab:red'])
    else: # the district has too few or too many documents
        colors.append(mcolors.TABLEAU_COLORS['tab:gray'])

In [None]:
# plot of the reviews distribution among districts
fig, ax = plt.subplots(figsize=(14,12))
ax.grid(which = 'major', axis = 'x', color = 'gray', linestyle = 'dashed', linewidth = 0.3)
ax.set_axisbelow(True)
ax.barh(reviews_per_district.index, reviews_per_district['num'], color = colors)

# add values next to the bars
plt.xlim(0, plt.xlim()[1] + 1000)
for i, v in enumerate(reviews_per_district['num']):
    ax.text(v + 100, i, str(v), color = colors[i], fontweight = 'bold', fontsize = 9, ha = 'left', va = 'center')

# chart title
plt.title('Reviews per district')

# chart legend
color_legend = {
    'eligible district': mcolors.TABLEAU_COLORS['tab:red'],
    'chosen district': mcolors.TABLEAU_COLORS['tab:green'],
    'discarded district': mcolors.TABLEAU_COLORS['tab:gray']
    }
legend_labels = list(color_legend.keys())
legend_handles = [plt.Rectangle((0, 0), 1, 1, color = color_legend[label]) for label in legend_labels]
plt.legend(legend_handles, legend_labels)

plt.show()

In [None]:
uptown_reviews = pd.DataFrame(review_full_df[review_full_df['district'] == chosen_district])
uptown_reviews.drop(columns = ['district'], inplace = True) # remove district column since it adds no information:
                                                        # all the reviews come from the same district now

# additional post processing - without this step the csv didn't save properly resulting in a CSV with 16 more lines than expected
uptown_reviews = uptown_reviews.applymap(lambda s: s.replace('\n', '').replace('\r', '') if isinstance(s, str) else s) # these characters created the extra lines
uptown_reviews['comments'] = uptown_reviews['comments'].str.replace(r'<[^<>]*>', ' ', regex=True) # get rid of HTML tags

# IMPORTANT NOTE
# if you are using Pandas >=2.1, .applymap() function has been deprecated and renamed to .map()
# Google Colab uses Pandas 2.0, which still has .applymap()
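To see what the two cleaning steps do on a single value, here is a minimal sketch in plain Python (no pandas). The helper name `clean_comment` and the sample text are made up for illustration; the regular expression is the one used in the cell above.

```python
import re

def clean_comment(text):
    """Apply the two cleaning steps to one string (hypothetical helper)."""
    # step 1: drop newline and carriage-return characters,
    # which otherwise split one review across several CSV lines
    text = text.replace('\n', '').replace('\r', '')
    # step 2: replace HTML tags such as <br/> with a space
    return re.sub(r'<[^<>]*>', ' ', text)

sample = 'Great stay!<br/>Loved it.\r\nWould return.'
cleaned = clean_comment(sample)
print(cleaned)
```

Note that removing newlines without inserting a space glues the neighbouring sentences together, which is also what the cell above does.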

In [None]:
uptown_reviews

In [None]:
# save filtered reviews
# we will process this DataFrame using LangChain in the next sections

# given that LangChain provides a DataFrameLoader, this save operation is not really needed
# we save it anyway for convenience
uptown_reviews.to_csv(data_path / 'new_orleans_airbnb_reviews_uptown.csv')

# Documents' Setup

## Data Loading
We've already composed a Pandas DataFrame with everything we need, so the data loader that best suits our case is the [DataFrame loader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dataframe.DataFrameLoader.html#langchain-community-document-loaders-dataframe-dataframeloader).

This data loader imposes a rather strict limitation, that is, only one column is allowed to be the page content of the document, and all the other columns are automatically added as document's metadata.

With respect to the ```CSVLoader```, the ```DataFrameLoader``` provides much cleaner documents because it ignores the row index and it does not insert the column name before the corresponding value in the page content.

Here is a brief example to see the difference:
- CSVLoader page_content: ```: 16332 comments: Great residential area to get to know [...]```
    - we have an unnamed column for the row index (this is why the document starts with a colon)
    - and the ```comments: ``` prefix
- DataFrameLoader page_content: ```Great residential area to get to know [...]```
    - we only have the actual content!
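The difference can be mimicked in plain Python. The sketch below does not call LangChain: the two functions are hypothetical stand-ins that only reproduce the behaviour described above (a `column: value` line per column versus one content column plus metadata), and the row values are invented.

```python
# toy row: the DataFrame index plus two of the columns
index = 16332
row = {'listing_id': 101, 'comments': 'Great residential area'}

def csv_style_content(index, row):
    # the saved CSV has an unnamed index column, so its key is the empty string
    fields = {'': index, **row}
    return '\n'.join(f'{key}: {value}' for key, value in fields.items())

def dataframe_style_document(row, page_content_column):
    # one column becomes the content, every other column becomes metadata;
    # the index is not a column, so it is simply ignored
    metadata = {k: v for k, v in row.items() if k != page_content_column}
    return row[page_content_column], metadata

csv_content = csv_style_content(index, row)
content, metadata = dataframe_style_document(row, 'comments')

print(csv_content)
print(content)
print(metadata)
```

The first print starts with `: 16332`, while the second contains only the review text.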

In [None]:
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

In [None]:
# dataframe loader
loader = DataFrameLoader(uptown_reviews, page_content_column = 'comments')  # comments column is the page_content
                                                                            # the remaining columns are inserted automatically as document metadata
reviews = loader.load()
print('No. of Uptown reviews:', len(reviews))

## Text Splitter
We chose the [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain_text_splitters.character.RecursiveCharacterTextSplitter) with a chunk size of ```1000``` and a chunk overlap of ```100```, using the ```len``` function.

We tried other values maintaining the same ratio (```1:10``` for ```size:overlap```), such as ```2000:200``` and ```1500:150```, but ```1000:100``` seemed to work best in our trials.
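To make the role of the two parameters concrete, here is a simplified sketch of chunking with overlap. It is not the algorithm of ```RecursiveCharacterTextSplitter```, which also tries to split on separators such as paragraphs and sentences; it only shows how a fixed-size window with overlap behaves. The function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Fixed-size sliding window: each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

long_review = ''.join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(long_review)
print([len(c) for c in chunks])
```

Most reviews are shorter than 1000 characters, so under these settings they stay in a single chunk.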

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 100,
    length_function = len
)

reviews_documents = text_splitter.transform_documents(reviews)

## Embedding model
In the following block, we prepare an embedding model for sentence similarity search.

We tested three different embedding models provided by Hugging Face:
- [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [bert-base-nli-mean-tokens](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)

Our choice fell on ```all-mpnet-base-v2```, as it was the one that brought the best results in the experiments we did.

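The embedding model will be loaded with a cache to improve performance: it avoids re-computing embeddings for texts that have already been embedded (by default, ```CacheBackedEmbeddings``` caches document embeddings but not query embeddings).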

Finally, we will create an archive of vectors via [FAISS](https://github.com/facebookresearch/faiss/) on which we will perform similarity search.
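The two ideas above (a cache in front of the embedder, and nearest-neighbour search over the stored vectors) can be sketched in plain Python. Everything here is a toy: `toy_embed`, `CachedEmbedder` and the word list are invented for illustration, and the brute-force search only stands in for what FAISS does far more efficiently. We assume Euclidean (L2) distance as the similarity measure.

```python
import math
from collections import Counter

VOCAB = ['kids', 'family', 'quiet', 'bar', 'restaurant', 'tram']

def toy_embed(text):
    """Hypothetical stand-in embedder: counts of a few vocabulary words."""
    words = Counter(text.lower().replace(',', ' ').replace('.', ' ').split())
    return [float(words[w]) for w in VOCAB]

class CachedEmbedder:
    """Stores each document embedding the first time it is computed."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0

    def embed_document(self, text):
        if text not in self.cache:
            self.misses += 1
            self.cache[text] = self.embed_fn(text)
        return self.cache[text]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, docs, embedder, k=2):
    # the query is embedded directly (not cached), documents go through the cache
    query_vector = embedder.embed_fn(query)
    ranked = sorted(docs, key=lambda d: l2_distance(query_vector, embedder.embed_document(d)))
    return ranked[:k]

docs = ['Great for kids and family',
        'Loud bar next door',
        'Quiet street, close to the tram']
embedder = CachedEmbedder(toy_embed)
first = search('good for kids', docs, embedder, k=1)
second = search('is there a bar', docs, embedder, k=1)
print(first, second, embedder.misses)
```

The second search reuses the cached document vectors, so the miss counter stays at the number of documents.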



In [None]:
cache_path = LocalFileStore('./cache/')

# Hugging Face name of the desired model
embedding_model_id = 'sentence-transformers/all-mpnet-base-v2'

# creation of the embedding given the selected model
embedding_model_HF = HuggingFaceEmbeddings(model_name = embedding_model_id)

# We load an embedding model with a cache to improve performance
embedder = CacheBackedEmbeddings.from_bytes_store(embedding_model_HF, cache_path, namespace = embedding_model_id)

# We create an archive of vectors using FAISS from our reviews
vector_store = FAISS.from_documents(reviews_documents, embedder)

In [None]:
# here we perform a similarity search to get a glance of the performance of the embedder
query = 'Which one is good for kids?'

embedding_vector = embedding_model_HF.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 8)

for idx, doc in enumerate(docs):
  print(f'Document n. {idx+1}\nContent: {doc.page_content}\nMetadata: {doc.metadata}\n')

### Notebook Login
To be able to use the language model provided by Hugging Face we need to login and pass the token:

```put your token```

In [None]:
from huggingface_hub import notebook_login

# log in to the Hugging Face Hub: copy and paste the token inside the box
token = 'put your token'

notebook_login()

## Large Language Model

In this code block, we performed a series of operations to configure a text generation model.

We considered four different text generation models provided by Hugging Face:

- [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [openai-community/gpt2](https://huggingface.co/openai-community/gpt2)

Among these models, we chose **Mistral-7B** because it proved to be the most suitable in our tests.

Mistral 7B v0.2 [[paper](https://arxiv.org/abs/2310.06825)] is a 7-billion-parameter language model engineered for superior performance and efficiency.
According to its authors [[source](https://mistral.ai/news/announcing-mistral-7b/)], Mistral 7B outperforms:
- Llama 2 13B across all evaluated benchmarks
- Llama 1 34B on many benchmarks, including reasoning, mathematics, and code generation.




Next, we load the model for causal language modeling with 4-bit quantization, and then load its tokenizer.
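A rough back-of-the-envelope calculation shows why quantization matters on a free Colab GPU. The sketch assumes roughly 7 billion parameters and counts only the weights; it ignores quantization constants, activations and the key/value cache, so real usage is higher.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

NUM_PARAMETERS = 7e9  # assumption: "about 7 billion", not the exact count

memory_16bit = weight_memory_gb(NUM_PARAMETERS, 16)
memory_4bit = weight_memory_gb(NUM_PARAMETERS, 4)
print(f'16-bit weights: ~{memory_16bit:.1f} GB')
print(f' 4-bit weights: ~{memory_4bit:.1f} GB')
```

Under these assumptions the 4-bit weights take about a quarter of the memory of the 16-bit ones.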




In [None]:
import transformers
import torch
from langchain.llms import HuggingFacePipeline

# Hugging Face model name
model_id = 'mistralai/Mistral-7B-Instruct-v0.2'

# config for the 4-bit quantization
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16
)

# set the desired model
model_config = transformers.AutoConfig.from_pretrained(model_id)

# load the model for causal language modeling
model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code = True,
        config = model_config,
        quantization_config = bnb_config,
        torch_dtype = torch.bfloat16,
        device_map = "auto",
    )

# set the model to evaluation (inference) mode
model.eval()

# load the pre-trained tokenizer associated with the selected model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

## Pipeline, templates and chain

In this code block, we created a pipeline for text generation using the previously configured model and tokenizer.



In [None]:
# pipeline for text generation using our model and the tokenizer
generate_text_pipeline = transformers.pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    do_sample = True,
    temperature = 0.1,
    repetition_penalty = 1.1,
    return_full_text = True,
    max_new_tokens = 512
)

llm = HuggingFacePipeline(pipeline = generate_text_pipeline)

In [None]:
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

We provide a template for Mistral-7B Instruct, indicating its goal, the theme of the questions, the model behavior, and what we expect (a helpful answer).

In order to leverage instruction fine-tuning, the prompt should be surrounded by ```[INST]``` and ```[/INST]``` tokens. The very first instruction should begin with a beginning-of-sentence token id; the following instructions should not. The assistant's generation is terminated by the end-of-sentence token id, as stated on the [page on Hugging Face](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format).

We also define a template for the documents, in order to provide both page content and metadata.
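As a minimal sketch of the instruction format described above, the helper below assembles a multi-turn prompt string. The function name `build_mistral_prompt` and the sample turns are made up for illustration, and the exact whitespace follows our reading of the model card, so treat it as an approximation. Depending on the tokenizer settings, the beginning-of-sentence token may also be added automatically.

```python
def build_mistral_prompt(turns):
    """turns: list of (user_message, assistant_message or None) pairs."""
    prompt = '<s>'  # beginning-of-sentence marker, only once at the very start
    for user_message, assistant_message in turns:
        prompt += f'[INST] {user_message} [/INST]'
        if assistant_message is not None:
            # a completed assistant turn is closed by the end-of-sentence marker
            prompt += f' {assistant_message}</s>'
    return prompt

single_turn = build_mistral_prompt([('Which apartment is near restaurants?', None)])
two_turns = build_mistral_prompt([
    ('Which apartment is near restaurants?', 'Several reviews mention restaurants.'),
    ('Which one is the quietest?', None),
])
print(single_turn)
print(two_turns)
```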

In [None]:
# prompt
template = (
    """
    <s>[INST]
    You are an assistant for question-answering tasks.
    The theme of the questions is: AirBnB reviews in New Orleans.
    Use the following pieces of retrieved context to answer the question.
    Feel free to ignore some context if it is not useful for the answer.
    If you don't know the answer, apologize and say you don't know.
    Please ensure your responses are supported by the information from the reviews.

    </s>
    [INST]
    Context: {context}
    [/INST]
    Question: {question}
    Helpful Answer:
    """
)

# this document template allows us to pass metadata to the model
# otherwise only page_content is used
document_prompt = PromptTemplate(
    input_variables = ['page_content', 'listing_id', 'review_id', 'date', 'reviewer_id', 'name', 'property_type', 'host_id'],
    template = """
        listing ID: {listing_id};
        listing name: {name};
        host ID: {host_id};
        property type: {property_type};
        review ID: {review_id};
        review date: {date};
        reviewer ID: {reviewer_id};
        review: {page_content}"""
)

In [None]:
# retriever based on the vector store to perform sentence similarity searches
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs = {"k" : 30})

# callback for std out
stdout_handler = StdOutCallbackHandler()

# QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type = "stuff",
    retriever = retriever,
    callbacks=[stdout_handler],
    chain_type_kwargs = {
        "prompt": PromptTemplate(template=template, input_variables=['context', 'question']),
        "document_prompt": document_prompt
    },
    return_source_documents=True
)
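Conceptually, the ```stuff``` chain type formats every retrieved document with the document template, joins the results into one string, and places that string in the ```{context}``` slot of the prompt. The sketch below reproduces this idea with plain string formatting; it does not call LangChain. The shortened templates, the blank-line separator and the sample documents are assumptions made for illustration.

```python
DOCUMENT_TEMPLATE = 'listing name: {name};\nreview: {page_content}'
PROMPT_TEMPLATE = 'Context: {context}\nQuestion: {question}\nHelpful Answer:'

def stuff_documents(docs, question, separator='\n\n'):
    """Format each document, join them, and fill the prompt in a single pass."""
    formatted = [
        DOCUMENT_TEMPLATE.format(page_content=doc['page_content'], **doc['metadata'])
        for doc in docs
    ]
    context = separator.join(formatted)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# invented documents, standing in for the retriever output
retrieved = [
    {'page_content': 'Very quiet street.', 'metadata': {'name': 'Hypothetical Loft'}},
    {'page_content': 'Close to many restaurants.', 'metadata': {'name': 'Hypothetical Cottage'}},
]
final_prompt = stuff_documents(retrieved, 'Which one is quiet?')
print(final_prompt)
```

Because all documents end up in a single prompt, its length grows with ```k```, which is why the number of retrieved documents is bounded by the available memory.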

## Questions
We asked the following questions:
1. Which apartment is near restaurants?
2. Which bed and breakfast received the best reviews overall and what are they?
3. Which host turned out to be the most popular and why?
4. I have a son and a dog, which host do you recommend?
5. I'm going to have to move around a lot with transportation, where is it best for me to go?
6. Which one is recommended for an elderly couple?

In [None]:
# helper function to get a cleaner output
def parse_and_show_answer(answer):
    print(f'QUESTION: {answer["query"]}\n')
    print(f'MODEL ANSWER:\n{answer["result"].split("Helpful Answer:")[-1].strip()}\n')
    print('RETRIEVED DOCUMENTS:')
    for idx, doc in enumerate(answer['source_documents']):
        print(f'Document n. {idx+1}:\nContent: {doc.page_content}\nMetadata: {doc.metadata}\n')

In [None]:
answer = qa({'query' : "Which apartment is near restaurants?"})
parse_and_show_answer(answer)

In [None]:
answer = qa({"query" : "Which bed and breakfast received the best reviews overall and what are they?"})
parse_and_show_answer(answer)

In [None]:
answer = qa({"query" : "Which host turned out to be the most popular and why?"})
parse_and_show_answer(answer)

In [None]:
answer = qa({"query" : "I have a son and a dog, which host do you recommend?"})
parse_and_show_answer(answer)

In [None]:
answer = qa({"query" : "I'm going to have to move around a lot with transportation, where is it best for me to go?"})
parse_and_show_answer(answer)

In [None]:
answer = qa({"query" : "Which one is recommended for an elderly couple?"})
parse_and_show_answer(answer)

# General qualitative analysis of the results (over our trials)

To better understand the model behavior, we considered the following parameters:
- number of retrieved documents
- model temperature
- repetition penalty
- length of the answer

## Number of retrieved documents
We tried retrieving different numbers of documents: ```k = {8, 20, 30}```.

No substantial difference can be observed between ```k=8``` and ```k=20```: the recommendations vary very little, and the variation that does appear is mostly driven by high temperature values (see below).

With ```k=30```, instead, we can observe some changes in the recommendations. In most cases the alternative recommendations are equivalent to the ones provided with ```k=20``` documents, and when they are not equivalent they are improvements.

**Important**: we had to cap this parameter at ```k=30``` in order to fit in the memory Colab offers for free; we would have liked to try ```k=50``` as well.

As an example, consider Question 2:
- with ```k=20``` the model suggested "Uptown Trendy Neighborhood Apt permit# 17STR-06483", based on just one review (doc. n. 4)
    - review n. 4 describes the AirBnB as a comfortable place to stay, managed by very nice people
- with ```k=30``` the model suggested "Private Pool + Hot Tub!", which is more represented (docs. n. 7 and 20)
    - review n. 7 says that it was the best AirBnB they've ever experienced, as everything was perfect
    - review n. 20 calls the place superb and explicitly recommends the AirBnB

We find the second answer more suitable given the reviews, but we still asked ourselves why the suggestion changes, considering that the relevant documents are retrieved in both cases. Our hypothesis lies in the attention mechanism: in the second variant the model is fed more text, so the attention is distributed differently than before, possibly highlighting different pieces of the provided context.

## Model temperature
The temperature changes the model behavior by introducing a trade-off between coherence and diversity of the generated text.
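The trade-off comes from how temperature rescales the next-token scores (logits) before they are turned into probabilities. The sketch below applies a softmax with temperature to three invented logits; it is only an illustration of the mechanism, not the code path used by the pipeline.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - highest) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]

logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
probabilities_low = softmax_with_temperature(logits, 0.1)
probabilities_one = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in probabilities_low])
print([round(p, 3) for p in probabilities_one])
```

At a low temperature almost all probability mass goes to the top-scoring token, so sampling is nearly deterministic; at ```temperature=1``` the other tokens keep a noticeable share, which makes the output more varied.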

We tried three values for the model temperature: ```temperature = {0.1, 0.5, 1}```.

We noticed little-to-no difference between the first two values (```0.1``` and ```0.5```); the main change was a bulleted list being written as a series of sentences, with the same content. Most of the proposed AirBnBs were those whose reviews were ranked as most relevant by the retriever, i.e., the AirBnBs of the first retrieved reviews.

With ```temperature=1``` we observed a little more variety in the proposed AirBnBs, with the model also considering lower-ranking reviews. For instance, when we tried with ```k=20``` retrieved documents on Question 5, the model considered places appearing for the first time in documents n. 17 and 18, while with lower temperature values the model considered mainly the first 5 or 6 places appearing in the reviews.

On the other hand, with ```temperature=1``` and ```k=30``` the model can produce hallucinations and repetitions in the generated text:

- hallucinations happened on Question 2, where the model reported a piece of a review for an AirBnB which wasn't recommended
    - the recommended AirBnB was "Private Pool + Hot Tub!" but the list provided as evidence talked about "Avocado Ranch" or "Michael's place" - note that the documents recommending "Private Pool + Hot Tub!" are the ones reported as an example in the previous section of the analysis
- repetitions were observable in the last question, where the model kept repeating "for an elderly couple/person"
    - Based on the context provided, the review for listing ID: 1738168 mentions that the shower might be difficult ***for an elderly person***. Therefore, it would not be the best recommendation ***for an elderly couple***. Other listings that could potentially be suitable ***for an elderly couple*** include "New Orleans Uptown Tall Tales" (listing ID: 11675374), "Uptown Haven" (listing ID: 30505052), and "Charming uptown victorian!" (listing ID: 1466244). These listings do not mention any specific difficulties related to accessibility ***for the elderly***. However, it's essential to read additional reviews or contact the hosts directly to confirm if their properties meet the specific requirements ***of the elderly couple***.

## Repetition penalty
This parameter regulates how much the model repeats itself in the answer, by lowering the sampling probabilities of tokens that have already appeared in the text.
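As a sketch of the mechanism, the function below penalizes the logits of tokens that were already seen: positive logits are divided by the penalty and negative ones are multiplied by it, so both become less likely. To our understanding this matches the rule used by the ```transformers``` repetition penalty, but treat it as an illustration; the logits and token ids are invented.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Return a copy of the logits with already-seen tokens made less likely."""
    penalized = list(logits)
    for token_id in seen_token_ids:
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty   # shrink a positive score
        else:
            penalized[token_id] *= penalty   # push a negative score further down
    return penalized

logits = [2.0, -1.0, 0.5]   # invented scores for token ids 0, 1 and 2
seen = {0, 1}               # tokens 0 and 1 already appeared in the text
penalized = apply_repetition_penalty(logits, seen, penalty=1.1)
print([round(value, 3) for value in penalized])
```

A penalty of ```1.0``` leaves the logits unchanged, and larger values discourage repetition more strongly.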

We started from the value ```1.1``` and never changed it, because at our chosen temperature we noticed no repetition in the answers.

We did notice some repetitions with ```temperature=1```, as shown earlier, which would call for a higher penalty. However, we did not choose such a high temperature, so the current value of ```1.1``` fits our needs.


## Length of the answer
We tuned this parameter by modifying the value of ```max_new_tokens``` in the pipeline.

Our very first try was ```256```, but we observed that some answers were being truncated, with the last sentence of the answer left unfinished.

We eventually increased this value to ```512```, observing that this new value leaves the model enough room to properly compose the answer.

---
## Our final choice
- number of retrieved documents: ```k=30```
- model temperature: ```0.1```
- repetition penalty: ```1.1```
- length of the answer: ```max_new_tokens=512```


# Answer analysis
## Question 1
The answer is coherent with the retrieved reviews, as it reports all the BnBs whose reviews explicitly talk about restaurants.

## Question 2
There are many reviews praising many BnBs. We felt that the model focused on the sentiment expressed by the review, which we found stronger in docs. n. 7 and 20 compared to other ones (for instance, "superb" vs "beautiful").

## Question 3
The recommended host, Ernest, appeared in more reviews than most of the others (both Ernest and Luke appeared in 3 reviews out of 30). Ernest stands out because the reviews of his BnB provided more insight into him being a great host, while the reviews of Luke's place mentioned just "great/awesome host \<end of review\>".

## Question 4
The answer is correct as Joseph appeared in more than half of the retrieved reviews (17/30), and all of them were praising him and his dogs.

## Question 5
This answer is good but at the same time it has some flaws:
- recommendation n. 2 fits but we can understand it only by the BnB name, because the reviews don't mention explicitly nearness to public transportation
- recommendation n. 3 fits as well but the model reports the wrong review for it (doc n. 7 instead of doc n. 14)

## Question 6
The answer is not very precise and tends to point the user toward options that might be suitable. This is due to the lack of content about elderly people in the reviews. In some trials we also noticed that the model repeats itself, similar to what we observed earlier in the temperature analysis.

We also noticed that the model tells the user the wrong listing ID, and this error was not corrected even after regenerating the answer several times.
