**Important:** Each notebook can be executed independently and does not rely on running the others.

# 0. Installing and Importing the Necessary Libraries

General Installs and Imports that should be used in *Every Notebook*:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import re

#### For *Notebook 1. Data Exploration*:

In [None]:
#!pip install langdetect
#!pip install wordcloud

In [None]:
#from langdetect import detect
#from wordcloud import WordCloud

#### For *Notebook 2. Pre-Processing*:

In [None]:
#!pip install langdetect
#!pip install ipython
#!pip install deep_translator
#!pip install nltk
#!pip install autocorrect
#!pip install spacy
#!pip install timeout-decorator
#!pip install contractions

In [None]:
from langdetect import detect
from IPython import display
from deep_translator import GoogleTranslator
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from tqdm import tqdm
from autocorrect import Speller
import spacy
from sklearn.model_selection import train_test_split
from timeout_decorator import timeout, TimeoutError
import contractions

In [None]:
#nltk.download("stopwords")
#nltk.download("wordnet")
#nltk.download('omw-1.4')
#!python -m spacy download en_core_web_sm
#nltk.download('averaged_perceptron_tagger')

# 1. Multilingual Text Processing

As we saw in the previous section, our dataset contains many languages besides English. This can be a problem during pre-processing, since some of the methods we are going to use, such as stop word removal, lemmatization, stemming and spell checking, do not fully support multilingual datasets. We could probably find alternative libraries that handle several languages, but we decided to follow the easiest path: translating the whole dataset into English, the language of the majority of the textual fields. Below we illustrate with a short example how we did this. We do not include the whole process in the notebook, as it is not strictly required to follow the rest of the project.

In [None]:
# Function that we used to translate each one of the texts of the three different
# textual fields ("comments", "host_about", and "description").
def translate_text(text):

    # We had to add this try-except as empty strings and "." in the textual fields
    # would give us an error when trying to identify their language. If detection
    # fails, the text is returned untranslated.
    try:
        detect(text)
    except Exception:
        return text

    # As GoogleTranslator() only translates up to 5000 characters, we only apply it
    # to texts below that limit.
    if len(text) < 5000:
        # Choosing English as the target language. GoogleTranslator takes the
        # keyword arguments "source" and "target"; we let it detect the source
        # language itself with source="auto".
        target_lang = "en"
        translator = GoogleTranslator(source="auto", target=target_lang)
        translation = translator.translate(text)
        return translation

    # As we had a small number of textual fields with 5000 characters or more, we
    # decided to translate them manually, so that the whole dataset is in English.
    else:
        return text
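
The long texts were translated manually. One possible alternative is to split such a text into pieces below the limit and translate each piece separately. The sketch below covers only the splitting step. `split_into_chunks` and its `max_len` parameter are hypothetical names, not part of `deep_translator`, and cutting at sentence boundaries is an assumption about what keeps the translation coherent.

```python
import re

def split_into_chunks(text, max_len=4999):
    """Greedily pack sentences into chunks of at most max_len characters.

    Hypothetical helper (not part of deep_translator). A single sentence longer
    than max_len is hard-cut into max_len-sized pieces.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut sentences that exceed the limit on their own.
        while len(sentence) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

long_text = "This is a sentence. " * 600   # 12000 characters
chunks = split_into_chunks(long_text)
print(len(chunks), [len(c) for c in chunks])  # 3 [4999, 4999, 1999]
```

Each chunk could then be passed to `translate_text` and the translated pieces joined with a space.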

In [None]:
# By applying this function to the different texts of our textual fields we were
# able to translate all the textual fields from the training and test sets.
# Although the test set won't be used until the end of the project, we also have
# to apply the pre-processing techniques on it. Now we will just present a simple
# example to show that it is working.

text = "Este grupo é composto pelo Miguel, Duarte, Eduardo e José. São alunos do primeiro ano de mestrado."
translate_text(text)

"This group is composed by Miguel, Duarte, Eduardo and José. They are first-year master's students."

# 2. Importing the Translated Datasets

As explained previously, in this step we just import the translated train and test sets (the train set before splitting).

In [None]:
# Allowing access to our Google Drive where the original datasets are stored.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Importing our translated datasets.
df_train_reviews_translated = pd.read_excel('/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Translated/train_reviews_en_translated.xlsx')
df_train_translated = pd.read_excel('/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Translated/train_en_translated.xlsx')
df_test_reviews_translated = pd.read_excel('/content/drive/MyDrive/Text Mining Project/Test Dataset (Original, Translated, Pre-Processed)/Translated/test_reviews_en_translated.xlsx')
df_test_translated = pd.read_excel('/content/drive/MyDrive/Text Mining Project/Test Dataset (Original, Translated, Pre-Processed)/Translated/test_en_translated.xlsx')

# 3. Pre-Processing

The datasets imported above were translated using the method presented previously. In this section we test, one sub-section at a time, each of the pre-processing methods that we are considering for our final pre-processing pipeline: Lowercasing, Removal of Stop Words, Named Entity Recognition (NER), Stemming, Lemmatization, Spell Checking, Punctuation Removal, URL Links and HTML Tags Removal, Expansion of Contractions, and Removal of Overly Long Words. The order in which they are applied in the pipeline is very important, and we address it in sub-section 3.11.

## 3.1. Lowercasing

Lowercasing means changing all uppercase letters to their lowercase counterparts while leaving lowercase letters unchanged. It matters for text normalization, word frequency counts, and text comparisons, because it makes, for example, "House" and "house" count as the same word. We add a try-except clause so that values that are not strings are returned unchanged.

In [None]:
# Creating a function for lowercasing our textual fields.
def lowercasing(text):
    try:
        text = text.lower()
        return text
    except Exception:
        return text

In [None]:
# Testing our function.
text = "Hello my name is Miguel. Nice to meet you!"
lowercasing(text)

'hello my name is miguel. nice to meet you!'
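
The try-except clause matters because pandas reads empty cells as `NaN`, which is a float and has no `.lower()` method. The sketch below illustrates this, using `float("nan")` as a stand-in for an empty cell. The function is redefined only so that the snippet runs on its own, and narrowing the handler to `AttributeError` is a choice made for this sketch.

```python
import math

def lowercasing(text):
    try:
        return text.lower()
    except AttributeError:
        # Non-string values (such as NaN) have no .lower(); return them unchanged.
        return text

missing = float("nan")   # stand-in for an empty cell
result = lowercasing(missing)

print(lowercasing("Nice PLACE"))  # nice place
print(math.isnan(result))         # True
```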

## 3.2. Removal of Stop Words

Stop words are commonly used words in a language that carry little meaning on their own and contribute little to the understanding of the text. Examples of stop words in English include "a", "an", "the", "is", "and", "in", and "to". Removing them reduces noise and shrinks the vocabulary, which saves memory and storage. We add a try-except clause.

In [None]:
# Defining the set of stop words we want to eliminate.
stop = set(stopwords.words("english"))

# Creating a function to remove the stopwords in the textual field.
def removing_stopword(text):
  try:
    text = " ".join([word for word in text.split() if word not in stop])
    return text
  except Exception:
    return text

In [None]:
# Testing our function.
text = "Duarte and his friend Eduardo are going to play football together."
removing_stopword(text)

'Duarte friend Eduardo going play football together.'
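
The membership test `word not in stop` is case-sensitive, which is one reason to lowercase the text before removing stop words. The sketch below uses a small hand-written stop word set (an assumption made so that it runs without the NLTK data) to show the difference.

```python
# Small hand-written stop word set, used only so the sketch runs without NLTK data.
stop = {"the", "and", "his", "are", "to"}

def removing_stopword(text):
    return " ".join(word for word in text.split() if word not in stop)

sentence = "The host and the guests are happy"
print(removing_stopword(sentence))          # The host guests happy
print(removing_stopword(sentence.lower()))  # host guests happy
```

The capitalized "The" survives in the first call because only the lowercase form is in the set.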

## 3.3. Named Entity Recognition (NER)

NER is a subtask of information extraction that involves identifying and classifying named entities in text into predefined categories such as person names, locations, organizations, dates, etc. It is important in terms of information extraction, text understanding, entity linking and disambiguation. We add a try-except clause.


In [None]:
# Loading a pre-trained English language model from the spaCy library. It is
# loaded once, outside the function, so that it is not reloaded on every call.
nlp = spacy.load("en_core_web_sm")

# Mapping each spaCy entity type we want to replace to the placeholder we use for it.
entity_prefixes = {"PERSON": "NAME", "GPE": "LOCATION", "ORG": "ORG", "DATE": "DATE"}

# Creating a function to switch names, locations, organizations and dates by "NAME", "LOCATION", "ORG", and "DATE".
def named_entity_recognition(text):

    # Applies the loaded language model to the input text. Values that cannot be
    # processed (for example, non-string values) are returned unchanged.
    try:
        doc = nlp(text)
    except Exception:
        return text

    # A dictionary that stores, per entity type, the placeholder already given to
    # each entity text, so that a repeated entity gets the same placeholder.
    entity_mapping = {label: {} for label in entity_prefixes}

    # A list storing the treated text.
    preprocessed_text = []

    # Iterating over each token of our text and checking its entity type.
    # Switching the token by its placeholder in case it is one of the entity
    # types that we defined previously.
    for token in doc:
        label = token.ent_type_
        if label in entity_prefixes:
            seen = entity_mapping[label]
            if token.text not in seen:
                seen[token.text] = entity_prefixes[label] + str(len(seen) + 1)
            preprocessed_text.append(seen[token.text])
        else:
            preprocessed_text.append(token.text)

    return " ".join(preprocessed_text)

In [None]:
# Testing our function.
text = "José and Duarte work both in Lisbon at KPMG and Deloitte, respectively."
named_entity_recognition(text)

'NAME1 and NAME2 work both in LOCATION1 at ORG1 and ORG2 , respectively .'
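
To look at the placeholder numbering on its own, without loading a spaCy model, the same logic can be run on hand-labelled tokens. This is a minimal sketch: `mask_entities` is a hypothetical helper and the labels below are written by hand, not produced by spaCy.

```python
def mask_entities(tagged_tokens, prefixes):
    """Replace labelled tokens by numbered placeholders.

    tagged_tokens: list of (token_text, entity_label) pairs; the label is an
    empty string for tokens that are not part of an entity.
    """
    seen = {label: {} for label in prefixes}
    output = []
    for token_text, label in tagged_tokens:
        if label in prefixes:
            if token_text not in seen[label]:
                seen[label][token_text] = prefixes[label] + str(len(seen[label]) + 1)
            output.append(seen[label][token_text])
        else:
            output.append(token_text)
    return " ".join(output)

prefixes = {"PERSON": "NAME", "GPE": "LOCATION"}
# Hand-labelled tokens standing in for the model output (an assumption of this sketch).
tokens = [("José", "PERSON"), ("met", ""), ("Duarte", "PERSON"), ("in", ""),
          ("Lisbon", "GPE"), ("and", ""), ("José", "PERSON"), ("left", "")]
print(mask_entities(tokens, prefixes))
# NAME1 met NAME2 in LOCATION1 and NAME1 left
```

Because the replacement works token by token, an entity spanning several tokens receives one placeholder per distinct token rather than a single placeholder.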

## 3.4. Stemming

Stemming is a technique used in natural language processing (NLP) to reduce words to their base or root form, known as the "stem". The stem is not necessarily a valid word itself, but it represents the core of the word. The main goal of stemming is to consolidate variations of the same word into a common form, so that different tenses, plurals, or derived forms occurring in a text are grouped together, enabling more effective text analysis and retrieval. For example, the words "change", "changing", and "changes" are all reduced to the common stem "chang". Similarly, "dogs", "dog's", and "dog" are all stemmed to "dog". We add a try-except clause.

In [None]:
# Defining the language of the stemmer object.
stemmer_obj = SnowballStemmer("english")

# Defining the stemming function.
def stemming(text):
  try:
    text = " ".join(stemmer_obj.stem(word) for word in text.split())
    return text
  except Exception:
    return text

In [None]:
# Testing our function.
text = "While Miguel was running he had several dogs trying to catch him."
stemming(text)

'while miguel was run he had sever dog tri to catch him.'

## 3.5. Lemmatization

Lemmatization considers the context and meaning of the word to determine its lemma. It aims to transform the different inflected forms of a word into a single base form while preserving the correct part of speech. The main advantage of lemmatization over stemming is that it produces valid words that exist in the language, which helps maintain the interpretability and semantic integrity of the text. By reducing words to their lemmas, words with the same meaning but different inflections are grouped together, facilitating more accurate text analysis. For example, "walking", "walks", and "walked" are all converted to the common lemma "walk". Similarly, "change", "changing", and "changes" are lemmatized to "change" when they are tagged as verbs. We add a try-except clause.


In [None]:
# Defining the Lemmatizer.
lemma = WordNetLemmatizer()

# Map POS tag to WordNet POS tag.
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # If no specific mapping it assumes a noun.
        return wordnet.NOUN

# Defining the Lemmatizer function.
def lemmatizer(text):
  try:
    # Perform POS tagging on the text
    tagged_text = pos_tag(text.split())

    # Lemmatize each word with its corresponding POS tag. The loop variable is
    # called "tag" so that it does not shadow the imported pos_tag function.
    lemmatized_text = " ".join(lemma.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_text)

    return lemmatized_text
  except Exception:
    return text

In [None]:
# Testing our function.
text = "Eduardo was walking down the street while at the same thinking about the best route."
lemmatizer(text)

'Eduardo be walk down the street while at the same thinking about the best route.'

## 3.6. Spell Checking

Spell checking is the process of identifying and correcting spelling errors in textual data before further analysis. It automatically detects words that are not spelled correctly and applies the most likely correction. It improves data quality and standardizes the vocabulary. We add a timeout and a try-except clause, so that a text that takes more than 5 seconds to correct is returned unchanged.

In [None]:
# Defining the speller.
spell = Speller(lang="en")

# Defining the function with timeout.
@timeout(5)  # Set the timeout duration in seconds (e.g., 5 seconds).
def spell_checking(text):
    try:
        corrected_text = " ".join(spell(word) for word in text.split())
        return corrected_text
    except (TimeoutError, AttributeError):
        # Return the original text if the timeout is reached or if the input
        # is not a string (for example, an empty cell).
        return text

In [None]:
# Testing our function.
text = "Eduardo was waalking down the stret whille at the samme thinkiing abouut the beest rroute."
spell_checking(text)

'Eduardo was walking down the street while at the same thinking about the best route.'

## 3.7. Punctuation Removal


Punctuation removal helps reduce noise and simplify the analysis of text data. For word-level analyses, punctuation marks such as periods, commas, and question marks carry little meaning on their own. By removing them, the focus is placed on the words themselves rather than on the specific punctuation choices made by the author, and splitting the text into words becomes more consistent across texts. Our function replaces every character that is not an unaccented Latin letter or a digit with a space. We add a try-except clause.

In [None]:
# Defining the function.
def ponctuation_removal(text):
  try:
    # Substituting everything that is not a letter or a number by a space.
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    return text
  except Exception:
    return text

In [None]:
# Testing our function.
text = "</> Eduardo lives in South Margin. Jose lives in Famalicao. Miguel and Duarte live in Faro. </>"
ponctuation_removal(text)

'    Eduardo lives in South Margin  Jose lives in Famalicao  Miguel and Duarte live in Faro     '
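
The pattern `[^a-zA-Z0-9]` also removes accented letters, which can still appear in proper nouns after translation. The sketch below compares it with a Unicode-aware pattern. `unicode_aware_removal` is a hypothetical alternative shown for illustration, not the function used in our pipeline.

```python
import re

def ascii_only_removal(text):
    # Same pattern as above: anything outside a-z, A-Z, 0-9 becomes a space.
    return re.sub("[^a-zA-Z0-9]", " ", text)

def unicode_aware_removal(text):
    # Hypothetical alternative: replace characters that are neither word characters
    # nor whitespace (plus the underscore, which \w would otherwise keep).
    return re.sub(r"[^\w\s]|_", " ", text)

text = "José lives in Famalicão."
print(ascii_only_removal(text).split())
print(unicode_aware_removal(text).split())
```

With the first pattern, "José" is cut to "Jos" and "Famalicão" is split into "Famalic" and "o"; the second keeps both names intact.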

## 3.8. URL Links and HTML Tags Removal

URL links and HTML tags are not meaningful for textual analysis and can introduce biases or distractions. By removing them, the focus remains on the content itself, improving the accuracy of natural language processing tasks. Additionally, eliminating these elements enhances text cohesion, readability, and privacy, as well as ensures standardized processing across different texts.

In [None]:
# Defining the function.
def url_html_removal(text):
  try:
    # Substituting URL links by a space.
    text = re.sub(r'https?://\S+', ' ', text)

    # Substituting various HTML tags and entities by a space.
    text = re.sub(r'<.*?>', ' ', text)  # Remove all HTML tags
    text = re.sub(r'&amp;', ' ', text)  # Remove HTML entity for ampersand (&)
    text = re.sub(r'&lt;', ' ', text)   # Remove HTML entity for less than (<)
    text = re.sub(r'&gt;', ' ', text)   # Remove HTML entity for greater than (>)
    text = re.sub(r'&quot;', ' ', text) # Remove HTML entity for double quotation marks (")
    text = re.sub(r'&#39;', ' ', text)  # Remove HTML entity for single quotation marks (')
    text = re.sub(r'&nbsp;', ' ', text) # Remove HTML entity for non-breaking space

    # Substituting the string "_x000d_" (a carriage return escaped by Excel) by a space.
    text = re.sub(r'_x000d_', ' ', text)
    text = re.sub(r'_ x000d_', ' ', text)
    text = re.sub(r'_x000d _', ' ', text)

    # Substituting newline characters and collapsing repeated whitespace.
    text = re.sub(r'<br\s*/*>', ' ', text)  # Remove HTML line break tags (already covered by the generic tag pattern above)
    text = re.sub(r'\r\n|\r|\n', ' ', text) # Remove newline characters
    text = re.sub(r'\s+', ' ', text)        # Replace multiple spaces with a single space

    return text
  except Exception:
    return text

In [None]:
# Trying the function.
text = "Visit Duarte LinkedIn at https://www.linkedin.com/in/duartegirao/ for more information."
url_html_removal(text)

'Visit Duarte LinkedIn at for more information.'
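
The pattern `https?://\S+` only matches links that include the scheme. The sketch below shows a broader pattern that also catches links written with a bare `www.` prefix. The `www.` branch and the final `strip()` are additions made for this sketch and are not part of the function above.

```python
import re

def remove_urls(text):
    # Matches links starting with http://, https:// or a bare "www." prefix.
    text = re.sub(r"(?:https?://|\bwww\.)\S+", " ", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

print(remove_urls("Book at www.example.com or https://example.com/rooms today."))
# Book at or today.
```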

## 3.9. Contractions - Identify and Expand Contractions (Transforming Abbreviations)

By expanding contractions, you convert abbreviated or contracted words into their complete forms. For example, expanding "don't" to "do not" or "it's" to "it is." This transformation helps in standardizing the text and making it consistent for further analysis.

In [None]:
# Defining the function.
def contraction(text):
  try:
    text = contractions.fix(text)
    return text
  except Exception:
    return text

In [None]:
text = "Miguel doesn't know if he's goin' to make it."
contraction(text)

'Miguel does not know if he is going to make it.'

## 3.10. Removal of Words Longer Than 45 Characters

The longest word in the English dictionary is "Pneumonoultramicroscopicsilicovolcanoconiosis", with 45 characters. Any token longer than this is most likely several words joined together or some other kind of error, so we remove such tokens.

In [None]:
# Defining the function that removes these words.
def big_word_removal(text):
  try:
    # Matches words with more than 45 characters and replaces them with an empty string.
    text = re.sub(r'\b\w{46,}\b', '', text)
    return text
  except Exception:
    return text
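
The quantifier `{46,}` means that a word of exactly 45 characters is kept and only longer ones are removed. The sketch below checks this boundary; the function is redefined without the try-except only so that the snippet runs on its own.

```python
import re

def big_word_removal(text):
    return re.sub(r"\b\w{46,}\b", "", text)

longest = "pneumonoultramicroscopicsilicovolcanoconiosis"
too_long = longest + "x"

print(len(longest), len(too_long))          # 45 46
print(big_word_removal(f"a {longest} b"))   # the 45-character word is kept
print(big_word_removal(f"a {too_long} b"))  # the 46-character token is removed
```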

In [None]:
text = "I'm trying to use a very big word like qwertyuiopasdfghjklçzxcvbnmqwertyuiopasdfghjklçzxcvbnm."
big_word_removal(text)

"I'm trying to use a very big word like ."

## 3.11. Pre-Processing Pipeline Creation

The order of pre-processing techniques in a pipeline is crucial, and we followed a specific sequence. We apply lowercasing first, since it does not interfere with the other techniques. Next, we remove URLs and HTML tags and then expand contractions. Both steps must run before punctuation removal: they rely on punctuation marks such as "://", "<", ">" and apostrophes to recognise URLs, HTML tags, and contractions, which would be missed if those marks had already been deleted. We then proceed with punctuation removal, stop word removal, and big word removal, in this order. The remaining steps are spell checking and lemmatization. Since the lemmatizer can only reduce a word to its lemma when the word is spelled correctly, we apply spell checking first and conclude with lemmatization. We omitted Named Entity Recognition (NER) from the pipeline due to its computational demands; the function developed in sub-section 3.3 remains available.
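
The effect of the ordering can be seen on a single sentence. The sketch below uses simplified stand-ins for our URL and punctuation functions, plus a hypothetical `tidy` helper that collapses whitespace, and applies the two steps in both orders.

```python
import re

def url_removal(text):
    return re.sub(r"https?://\S+", " ", text)

def punctuation_removal(text):
    return re.sub("[^a-zA-Z0-9]", " ", text)

def tidy(text):
    # Hypothetical helper: collapse repeated whitespace for readable output.
    return " ".join(text.split())

text = "details at https://www.linkedin.com/in/duartegirao/ thanks"

urls_first = tidy(punctuation_removal(url_removal(text)))
punctuation_first = tidy(url_removal(punctuation_removal(text)))

print(urls_first)         # details at thanks
print(punctuation_first)  # details at https www linkedin com in duartegirao thanks
```

When punctuation is removed first, the "://" that the URL pattern depends on is gone, so the fragments of the link stay in the text.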


In [None]:
# Defining the function that will be our pipeline. This function will receive
# the previous functions that we defined and selected.
def pre_processing_pipeline(text_list):

  # List that will receive the different phrases pre-processed.
  new_text = []

  # Wrapping the iterable of the for-loop in tqdm() allows us to see the progress.
  for text in tqdm(text_list):

    # Applying the lowercasing function.
    text = lowercasing(text)

    # Applying the url and html tag removal function.
    text = url_html_removal(text)

    # Applying the transformation of abbreviations.
    text = contraction(text)

    # Applying the ponctuation removal function.
    text = ponctuation_removal(text)

    # Applying the function that removes stopwords.
    text = removing_stopword(text)

    # Applying the function that removes the big words.
    text = big_word_removal(text)

    # Applying the function that does the spell checking.
    text = spell_checking(text)

    # Applying the lemmatization function.
    text = lemmatizer(text)

    new_text.append(text)

  return new_text

In [None]:
# Testing the function.
testing_function = pre_processing_pipeline(df_train_reviews_translated["comments"][:100])
testing_function

100%|██████████| 100/100 [00:00<00:00, 686.64it/s]


['cozy comfortable house stay never worry safety host nice close parque metro station easy find',
 'good',
 'first hostel experience say pretty hard beat place book room 6 people end get room locker person room well know belonging safe since one key card room 5 euro deposit keycard lose bed comfortable mine home close curtain shade tune car drive cannot ask good location could see parque metro station window hostel first floor really pretty build metro convenient come every 4 5 minute could go many local cafe right outside walk 15 20 minute street get fancy clothing boutique free breakfast great kitchen look even well picture brand new arrive lisbon around 8am let store belonging even check kept even already check 100 safe bathroom constantly keep clean since reception person help 24 7 laundry facility linens super soft tv room great bathroom luxurious thanks zabed host go beyond',
 'hostel new everything work perfectly fast wifi sturdy bunk bed comfortable mattress new bathroom really

## 3.12. Pre-Processing our Translated Training and Test Set

We will use the previously created pipeline on both the training set and the test set. We can pre-process the test set with the same pipeline because these methods do not learn anything from the data, that is, applying them requires no knowledge of the training set. We first define the functions required to update the DataFrames and save them to our drive. Then we pre-process the training set, followed by the test set.

### 3.12.1. Defining Needed Functions

In [None]:
# Defining the function that takes as input the dataframe we want to modify,
# a list that represents the new updated column, and the name of the column
# whose values we want to replace.
def update_df(dataframe, list_updated, column_name):
  # Reusing the index of the dataframe keeps the new values aligned row by row.
  dataframe.update(pd.DataFrame({column_name: list_updated}, index=dataframe.index))

In [None]:
# This function takes the dataframe we want to save, the path to the folder,
# and the name that we want to give to the Excel file.
def save_dataframe_drive(dataframe, path, file_name):
  # Defining the file path.
  file_path = f'{path}/{file_name}.xlsx'

  # Exporting the DataFrame to Excel. The "with" block saves and closes the
  # writer on exit, so no explicit writer.save() call (deprecated and later
  # removed from pandas) is needed.
  with pd.ExcelWriter(file_path) as writer:
    dataframe.to_excel(writer, index=False)

### 3.12.2. Applying on the Training Set

In [None]:
#comments_train_pre_processed = pre_processing_pipeline(df_train_reviews_translated["comments"])

100%|██████████| 721402/721402 [25:59<00:00, 462.58it/s]


In [None]:
#df_train_reviews_pre_processed = df_train_reviews_translated.copy()

In [None]:
#update_df(df_train_reviews_pre_processed, comments_train_pre_processed, "comments")

In [None]:
#df_train_reviews_pre_processed.head()

Unnamed: 0,index,comments
0,1,cozy comfortable house stay never worry safety...
1,1,good
2,1,first hostel experience say pretty hard beat p...
3,1,hostel new everything work perfectly fast wifi...
4,1,fine dorm think people stay far less bathrooms...


In [None]:
# Finally using the previously created function to save the new pre-processed dataframe as an excel
# in our drive.

#save_dataframe_drive(df_train_reviews_pre_processed, "/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Pre-Processed", "train_reviews_pre_processed")



In [None]:
#description_train_pre_processed = pre_processing_pipeline(df_train_translated["description"])

100%|██████████| 12496/12496 [01:17<00:00, 160.96it/s]


In [None]:
#host_about_train_pre_processed = pre_processing_pipeline(df_train_translated["host_about"])

100%|██████████| 12496/12496 [00:40<00:00, 305.77it/s]


In [None]:
#df_train_pre_processed = df_train_translated.copy()

In [None]:
#update_df(df_train_pre_processed, description_train_pre_processed, "description")

In [None]:
#update_df(df_train_pre_processed, host_about_train_pre_processed, "host_about")

In [None]:
#df_train_pre_processed.head()

Unnamed: 0,index,description,host_about,unlisted
0,1,share mixed room hostel share bathroom locate ...,local accommodation registry 20835 al,0
1,2,space close parque eduardo vii saldanha estefa...,friendly host try always around need anything ...,1
2,3,trafaria house cozy familiar villa facility ne...,social person like communicate read travel lik...,1
3,4,charm apartment chiado largo carmo travessa tr...,hello portuguese love meet people around word ...,0
4,5,nice apartment sea 2 min walk beach magnificen...,family two child age 17 10 live several year p...,0


In [None]:
#save_dataframe_drive(df_train_pre_processed, "/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Pre-Processed", "train_pre_processed")



### 3.12.3. Applying on the Test Set

In [None]:
#comments_test_pre_processed = pre_processing_pipeline(df_test_reviews_translated["comments"])

100%|██████████| 80877/80877 [02:59<00:00, 449.34it/s]


In [None]:
#df_test_reviews_pre_processed = df_test_reviews_translated.copy()

In [None]:
#update_df(df_test_reviews_pre_processed, comments_test_pre_processed, "comments")

In [None]:
#df_test_reviews_pre_processed.head()

Unnamed: 0,index,comments
0,1,thank much antonio perfect stay appartment per...
1,1,nice appartment old town lissabon quite centra...
2,1,travel look kid friendly place stay antonios p...
3,1,lisbon march 2013 3 adult 3 child house big co...
4,1,host antonio helpful information lissabon pick...


In [None]:
#save_dataframe_drive(df_test_reviews_pre_processed, "/content/drive/MyDrive/Text Mining Project/Test Dataset (Original, Translated, Pre-Processed)/Pre-Processed", "test_reviews_pre_processed")



In [None]:
#description_test_pre_processed = pre_processing_pipeline(df_test_translated["description"])

100%|██████████| 1389/1389 [00:07<00:00, 178.61it/s]


In [None]:
#host_about_test_pre_processed = pre_processing_pipeline(df_test_translated["host_about"])

100%|██████████| 1389/1389 [00:05<00:00, 239.34it/s]


In [None]:
#df_test_pre_processed = df_test_translated.copy()

In [None]:
#update_df(df_test_pre_processed, description_test_pre_processed, "description")

In [None]:
#update_df(df_test_pre_processed, host_about_test_pre_processed, "host_about")

In [None]:
#df_test_pre_processed.head()

Unnamed: 0,index,description,host_about
0,1,space apartment locate historic center lisbon ...,like travel meet people like receive friend ho...
1,2,important response covid 19 property extend cl...,home team count u take care every single detai...
2,3,bright beautiful spacious four bedroom apartme...,hi guestready professional property management...
3,4,charm apartment close bay cascais 1 bedroom do...,
4,5,look holiday close beach casino tourist attrac...,welcome portugal love country also love get kn...


In [None]:
#save_dataframe_drive(df_test_pre_processed, "/content/drive/MyDrive/Text Mining Project/Test Dataset (Original, Translated, Pre-Processed)/Pre-Processed", "test_pre_processed")



# 4. Train-Validation Split

The train-validation split serves several purposes. First, it allows us to evaluate a trained model on unseen data, which tells us how well the model generalizes. Second, it lets us tune the hyperparameters of the model by comparing performance on the validation set. Third, it helps us detect overfitting, where a model performs well on the training data but poorly on new data. Finally, keeping the validation data out of training helps us avoid data leakage and obtain an unbiased performance estimate. We apply a stratified train-validation split because our dataset is highly imbalanced; stratifying keeps the proportion of each class the same in both splits. Later we will address the imbalance itself by giving more weight to the minority class when training the models, so that they generalize as well as possible.
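
To make the idea of stratification concrete, the sketch below splits a toy list of labels class by class, so that both parts keep the original class proportions. It is a hand-rolled illustration with made-up labels and a hypothetical `stratified_split` function; it is not how scikit-learn implements the `stratify` option.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction, seed):
    """Return (train_indices, val_indices) with similar class proportions."""
    rng = random.Random(seed)
    # Group the row positions by class label.
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    train_idx, val_idx = [], []
    for positions in by_class.values():
        # Shuffle within each class, then cut off the validation share.
        rng.shuffle(positions)
        n_val = round(len(positions) * val_fraction)
        val_idx.extend(positions[:n_val])
        train_idx.extend(positions[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy imbalanced labels (made up for this sketch): 80% class 0, 20% class 1.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.3, seed=1)

print(Counter(labels[i] for i in train_idx))  # 56 rows of class 0, 14 of class 1
print(Counter(labels[i] for i in val_idx))    # 24 rows of class 0, 6 of class 1
```

Both parts keep the 80/20 ratio of the full list, which a purely random split would only achieve approximately.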

## 4.1. Importing the Translated and Pre-Processed Training Datasets

In [None]:
# Importing the pre-processed and translated training set.
df_train_reviews_pre_processed = pd.read_excel("/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Pre-Processed/train_reviews_pre_processed.xlsx")
df_train_pre_processed = pd.read_excel("/content/drive/MyDrive/Text Mining Project/Original Train Dataset (Original, Translated, Pre-Processed))/Pre-Processed/train_pre_processed.xlsx")

## 4.2. Train-Validation Split of the Host About and Description Dataset

In [None]:
# Performing the stratified train-validation split.
df_train_split, df_val_split, df_train_labels_split, df_val_labels_split = train_test_split(
    df_train_pre_processed[["index", "description", "host_about"]],
    df_train_pre_processed["unlisted"],
    test_size = 0.3,
    random_state = 1,
    stratify = df_train_pre_processed["unlisted"])

In [None]:
# Checking if the split went smoothly.
df_train_split.head()

Unnamed: 0,index,description,host_about
2137,2138,comfortable apartment private terrace quiet ar...,work deck officer portuguese merchant marine s...
1966,1967,space lisbon hostel locate center lisbon offer...,lisbon whole work life dedicate tourism
4307,4308,spacious brand new guest house 4 independent r...,favourite travel destination far iceland india...
7125,7126,proud offer exclusive apartment enjoy best pos...,dear guest take pride help thousand busy host ...
4813,4814,modern cozy one bedroom apartment outdoor pati...,found travel enthusiast like bnbird want conne...


In [None]:
# Checking if the split went smoothly.
df_train_split.shape

(8747, 3)

In [None]:
# Saving "df_train_split" as it will be needed in the next sections.

#save_dataframe_drive(df_train_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Training", "train_split")



In [None]:
# Checking if the split went smoothly.
df_val_split.head()

Unnamed: 0,index,description,host_about
10848,10849,stay traditional quiet neighborhood book pecul...,homeful portuguese short term management servi...
7260,7261,apartment ericeira 1 bedroom capacity 4 people...,base root ericeira grupo da casas born 2019 co...
6141,6142,proud offer spacious elegant apartment quiet r...,dear guest take pride help thousand busy host ...
4323,4324,locate sao paulo area ideal home traveler wish...,hi guestready professional property management...
9371,9372,flat locate axis coolness include walk distanc...,born raise lose find alley lisbon kind person ...


In [None]:
# Checking if the split went smoothly.
df_val_split.shape

(3749, 3)

In [None]:
# Saving "df_val_split" as it will be needed in the next sections.

#save_dataframe_drive(df_val_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Validation", "val_split")



In [None]:
# Checking if the split went smoothly.
df_train_labels_split.head()

2137    0
1966    0
4307    0
7125    1
4813    1
Name: unlisted, dtype: int64

In [None]:
# Checking if the split went smoothly.
df_train_labels_split.shape

(8747,)

In [None]:
# Saving "df_train_labels_split" as it will be needed in the next sections.

#save_dataframe_drive(df_train_labels_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Training", "train_labels_split")



In [None]:
# Checking if the split went smoothly.
df_val_labels_split.head()

10848    0
7260     0
6141     0
4323     0
9371     0
Name: unlisted, dtype: int64

In [None]:
# Checking if the split went smoothly.
df_val_labels_split.shape

(3749,)

In [None]:
# Saving "df_val_labels_split" as it will be needed in the next sections.

#save_dataframe_drive(df_val_labels_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Validation", "val_labels_split")



## 4.3. Train-Validation Split of the Comments Dataset

The previous split was made on the dataset that holds the unique keys ("index" column) and the labels ("unlisted" column). Based on it, we now need to filter the comments dataset so that comments whose index belongs to the train split are separated from comments whose index belongs to the validation split. This way all the comments of a given listing end up in the same part of the split. That is what we are going to do in the next cells.

In [None]:
# Creating a list of the indexes present in each of the previously split datasets.
index_train_set = df_train_split["index"].tolist()
index_val_set = df_val_split["index"].tolist()

In [None]:
# From the comments/reviews dataset we keep only the rows whose index is in the lists
# previously obtained, both for the training set and for the validation set.
df_train_reviews_split = df_train_reviews_pre_processed[df_train_reviews_pre_processed["index"].isin(index_train_set)]
df_val_reviews_split = df_train_reviews_pre_processed[df_train_reviews_pre_processed["index"].isin(index_val_set)]
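
A quick sanity check of this filtering step could look like the sketch below. It uses hypothetical toy data (the `toy_*` names and values are illustrative only) and assumes, as in the cells above, that the "index" column is the key shared between the listings and the reviews.

```python
import pandas as pd

# Hypothetical toy data: listing keys of each split and a reviews table
# with several comments per listing.
toy_train_index = [1, 2, 4]
toy_val_index = [3, 5]
toy_reviews = pd.DataFrame({
    "index": [1, 1, 2, 3, 3, 3, 5],
    "comments": ["a", "b", "c", "d", "e", "f", "g"],
})

# Same filtering pattern as in the cell above.
toy_train_reviews = toy_reviews[toy_reviews["index"].isin(toy_train_index)]
toy_val_reviews = toy_reviews[toy_reviews["index"].isin(toy_val_index)]

# No key should appear in both splits, otherwise validation data would leak.
shared_keys = set(toy_train_index) & set(toy_val_index)

# Every review should be assigned to exactly one of the two parts.
all_assigned = len(toy_train_reviews) + len(toy_val_reviews) == len(toy_reviews)

# Listings without any review (key 4 in this toy data) do not show up
# in the filtered reviews at all.
keys_without_reviews = set(toy_train_index) - set(toy_train_reviews["index"])

print(shared_keys, all_assigned, keys_without_reviews)
```

Note that a listing with no comments produces no rows in the filtered reviews, so the number of distinct indexes in the reviews split can be smaller than the number of rows in the corresponding listings split.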

In [None]:
# Checking if the filtering went smoothly.
df_train_reviews_split

Unnamed: 0,index,comments
91,8,shani helpful throughout process thank answer ...
92,8,accommodation spectacular clean modern 2 bedro...
93,8,great place
94,8,excellent appartement host efficient communica...
95,8,shani apartment excellent well locate right re...
...,...,...
721397,12494,good time apartment great location close hustl...
721398,12494,great apartment central location host responsi...
721399,12494,airbnb super host trust liliana super super su...
721400,12494,lovely stay apartment sofia helpful offer chec...


In [None]:
# Saving "df_train_reviews_split" in our drive as we will need it in further sections.

#save_dataframe_drive(df_train_reviews_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Training", "train_reviews_split")



In [None]:
# Checking if the filtering went smoothly.
df_val_reviews_split

Unnamed: 0,index,comments
0,1,cozy comfortable house stay never worry safety...
1,1,good
2,1,first hostel experience say pretty hard beat p...
3,1,hostel new everything work perfectly fast wifi...
4,1,fine dorm think people stay far less bathrooms...
...,...,...
720830,12490,beautiful cozy flat everything need
720831,12490,simple clean
720832,12490,chose apartment due free parking could park ca...
720833,12490,good location nice apartment appear picture


In [None]:
# Saving "df_val_reviews_split" in our drive as we will need it in further sections.

#save_dataframe_drive(df_val_reviews_split, "/content/drive/MyDrive/Text Mining Project/Splitted Training Dataset (Pre-Processed)/Pre-Processed/Validation", "val_reviews_split")

