# *Extract data using Google **Places API***

The **Google Places API** is a programming interface that provides detailed information about places around the world, such as businesses, points of interest and venues. The data includes names, addresses, ratings, reviews and geographic coordinates.

## Libraries
Here we import the libraries we need to extract our data.

In [35]:
import requests
import csv
from dotenv import load_dotenv
import os
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from transformers.utils import logging
import re
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Extraction

In [36]:
viana_axis_hotel = 'AXIS VIANA BUSINESS & SPA HOTEL'
ponte_lima_axis_hotel = 'Axis Ponte de Lima Golf Resort Hotel'
ofir_axis_hotel = 'Axis Ofir Beach Resort Hotel'
braga_axis_hotel = 'Basic Braga by Axis'
vermar_axis_hotel = 'Hotel Axis Vermar Conference & Beach Hotel'
porto_axis_hotel = 'Axis Porto Business & SPA Hotel'
porto_club_axis_hotel = 'Axis Porto Club Hotel'


def get_hotel_reviews(hotel_name, api_key):
    """Fetches all reviews for a hotel using the Google Places API, handling pagination."""
    # The docstring must be the first statement in the function, so the import follows it
    import googlemaps
    gmaps = googlemaps.Client(key=api_key)

    place_id = gmaps.find_place(
        input=hotel_name,
        input_type='textquery',
        fields=['place_id']
    )['candidates'][0]['place_id']

    all_reviews = []
    next_page_token = None

    while True:
        place_details = gmaps.place(
            place_id,
            fields=['name', 'reviews']
        )

        # Use .get() so that a place without reviews yields an empty list instead of a KeyError
        reviews = place_details['result'].get('reviews', [])

        for review in reviews:
            all_reviews.append({
                'review_text': review.get('text', ''),
                'rating': review.get('rating', None)
            })

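        # Note: the token is only detected here, never sent back to the API, and
        # Place Details responses do not normally include one, so the loop ends
        # after a single pass.
        if 'next_page_token' in place_details['result']:
            next_page_token = place_details['result']['next_page_token']
        else:
            break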

    return all_reviews


def save_reviews_to_csv(reviews, file_name):
    """Saves hotel reviews to a CSV file."""
    with open(file_name, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Review', 'Classification'])
        for review in reviews:
            writer.writerow([review.get('review_text', ''), review.get('rating', '')])

In [37]:
load_dotenv()
API_KEY = os.getenv("MAPS_API_KEY")

viana_reviews = get_hotel_reviews(viana_axis_hotel, API_KEY)
save_reviews_to_csv(viana_reviews, 'viana_reviews.csv')

ponte_lima_reviews = get_hotel_reviews(ponte_lima_axis_hotel, API_KEY)
save_reviews_to_csv(ponte_lima_reviews, 'ponte_lima_reviews.csv')

ofir_reviews = get_hotel_reviews(ofir_axis_hotel, API_KEY)
save_reviews_to_csv(ofir_reviews, 'ofir_reviews.csv')

braga_reviews = get_hotel_reviews(braga_axis_hotel, API_KEY)
save_reviews_to_csv(braga_reviews, 'braga_reviews.csv')

vermar_reviews = get_hotel_reviews(vermar_axis_hotel, API_KEY)
save_reviews_to_csv(vermar_reviews, 'vermar_reviews.csv')

porto_business_reviews = get_hotel_reviews(porto_axis_hotel, API_KEY)
save_reviews_to_csv(porto_business_reviews, 'porto_business_reviews.csv')

porto_club_reviews = get_hotel_reviews(porto_club_axis_hotel, API_KEY)
save_reviews_to_csv(porto_club_reviews, 'porto_club_reviews.csv')

The code aims to extract hotel reviews and ratings and save the data to a CSV file.

In the first part, the `get_hotel_reviews` function is responsible for fetching the reviews. It takes the hotel name as input and looks up the associated place ID. It then collects the available reviews, keeping the review text and rating of each. The code also checks for a `next_page_token` so that further pages of results could be followed, although in practice a Place Details request returns at most 5 reviews per place.
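
The extraction step can be tried offline on a hand-written dictionary. The sketch below is a minimal illustration, assuming a response shaped like the one the notebook indexes into (`result` containing a `reviews` list); `extract_reviews` and `sample_response` are hypothetical names, not part of the notebook or of the `googlemaps` library.

```python
def extract_reviews(place_details: dict) -> list:
    """Pull review text and rating out of a Place Details style response."""
    # .get() with defaults avoids a KeyError when a place has no reviews.
    reviews = place_details.get("result", {}).get("reviews", [])
    return [
        {"review_text": r.get("text", ""), "rating": r.get("rating")}
        for r in reviews
    ]


# Hypothetical response, written by hand for illustration only.
sample_response = {
    "result": {
        "name": "Example Hotel",
        "reviews": [
            {"text": "Great breakfast and friendly staff.", "rating": 5},
            {"rating": 3},  # a rating left without any text
        ],
    }
}

extracted = extract_reviews(sample_response)
print(extracted)
```

Reading the fields with `.get()` keeps a review that has a rating but no text, storing an empty string for the missing part.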

In the second part, the `save_reviews_to_csv` function organises this data and saves it to a CSV file. The final table contains two columns: one for the text of the reviews (*Review*) and another for the numerical classifications (*Classification*).
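
As a small illustration of the two-column layout, the sketch below writes a couple of made-up reviews to an in-memory buffer instead of a file and reads them back. The review texts are invented for the example; the column names match the ones used in the notebook.

```python
import csv
import io

# Made-up reviews, in the same dictionary shape the notebook collects.
reviews = [
    {"review_text": 'Lovely view, "amazing" pool', "rating": 5},
    {"review_text": "Room was small", "rating": 2},
]

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Review", "Classification"])
for review in reviews:
    writer.writerow([review.get("review_text", ""), review.get("rating", "")])

# Read the rows back to confirm that commas and quotes survive the round trip.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

The `csv` module quotes fields that contain commas or quotation marks, which is why free-text reviews can be stored safely in a single column. Note that ratings come back as strings when read with `csv.reader`.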

With this process, the data is cleanly structured and can be analysed later in **Power BI**.

In [39]:
file_paths = ['viana_reviews.csv', 'ponte_lima_reviews.csv', 'ofir_reviews.csv', 
              'braga_reviews.csv', 'vermar_reviews.csv', 'porto_business_reviews.csv', 'porto_club_reviews.csv']

combined_df = pd.read_csv(file_paths[0])

for file_path in file_paths[1:]:
    temp_df = pd.read_csv(file_path, header=0) 
    combined_df = pd.concat([combined_df, temp_df], ignore_index=True)

output_path = 'combined.csv'
with open(output_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(combined_df.columns)
    writer.writerows(combined_df.values)

for file_path in file_paths:
    if os.path.exists(file_path):
        os.remove(file_path)

This code combines several datasets stored in CSV files into a single consolidated file called `combined.csv`, removing the original files at the end of the process.

First, it creates a list of paths to the individual files, each containing reviews of a different hotel. The script reads the first file in the list and stores its contents in a *Pandas DataFrame*, which is an efficient data structure for manipulating tables. Then, for each subsequent file, it loads the data into a temporary *DataFrame* and concatenates (joins) this data to the main *DataFrame*, using `ignore_index=True` so that the combined rows are renumbered consecutively.

Once the data from all the files has been combined, the code writes the consolidated content to a new CSV file called `combined.csv`, preserving the original columns and their respective rows. To do this, it uses the *csv* library and writes both the column names and the data in tabular format.

Finally, the script checks for the existence of the individual input files and deletes them, leaving only the combined file. The result is a single file containing all the hotel reviews, making it easier to analyse them later without having to deal with several separate files.
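
The concatenation step can be illustrated without touching the disk. The sketch below is a minimal example, assuming two small CSV texts with the same columns as the per-hotel files; the contents are invented, and all frames are passed to `pd.concat` in a single call rather than one at a time.

```python
import io

import pandas as pd

# Invented contents standing in for two of the per-hotel CSV files.
csv_a = "Review,Classification\nGreat stay,5\nToo noisy,2\n"
csv_b = "Review,Classification\nNice spa,4\n"

# Read each source into its own DataFrame, then concatenate them in one call.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

With `ignore_index=True` the result gets a fresh index running from 0 to 2 instead of repeating the index of each source.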

In [40]:
def removing_stop_words(text: str):
    # Ensure required NLTK resources are downloaded
    nltk.download('punkt')
    nltk.download('stopwords')
    
    # Initialize the stop word set
    stop_words = set(stopwords.words('english'))

    # Validate input type
    if not isinstance(text, str):
        return ''

    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs 
    text = re.sub(r'\W', ' ', text) # Remove non-alphanumeric characters (punctuation, special symbols, etc)
    text = re.sub(r'\d+', '', text) # Remove numeric values
    tokenize = word_tokenize(text) 
    # Remove stop words; NLTK's list is lowercase, so compare in lowercase to catch e.g. "The"
    words = [word for word in tokenize if word.lower() not in stop_words]

    # Join the processed words back into a single string with spaces
    return ' '.join(words)

The function `removing_stop_words` takes a string as input and performs several text cleaning steps. It first ensures the required NLTK resources for tokenization and stop words are downloaded. Then, it checks if the input is a string; if not, it returns an empty string.

The function proceeds by removing URLs, non-alphanumeric characters, and numeric values using regular expressions. It then tokenizes the cleaned text into individual words and filters out stop words (common, less meaningful words) using the NLTK stop word list.

Finally, the function joins the remaining words back into a single string and returns it.
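
To see the effect of each cleaning step, here is a simplified sketch that runs without NLTK. It assumes a tiny hand-picked stop word set (`DEMO_STOP_WORDS`) in place of NLTK's English list and uses `str.split` in place of `word_tokenize`, so its output will not match the notebook's function exactly. It also lowercases each word before the stop word comparison, which is a choice made for this sketch.

```python
import re

# Assumption: a tiny illustrative set, standing in for NLTK's English stop words.
DEMO_STOP_WORDS = {"the", "was", "and", "at", "a", "is", "it"}


def clean_text(text):
    """Simplified stand-in for the notebook's cleaning function."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"\W", " ", text)             # replace punctuation with spaces
    text = re.sub(r"\d+", "", text)             # drop digits
    # Whitespace split instead of word_tokenize; compare in lowercase.
    words = [w for w in text.split() if w.lower() not in DEMO_STOP_WORDS]
    return " ".join(words)


cleaned = clean_text("The room was clean and quiet! See https://example.com, 10/10.")
print(cleaned)  # room clean quiet See
```

The order of the steps matters: URLs are removed first, because once punctuation has been replaced by spaces a URL no longer matches the pattern.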

In [41]:
file_path = 'combined.csv'
df = pd.read_csv(file_path)

if 'Review' in df.columns:
    # Adding a new column with the original reviews
    df['Original_Review'] = df['Review']
    # Apply the preprocessing function in the data
    df['Review'] = df['Review'].apply(removing_stop_words)

    # Saving the result in a new csv file
    processed_file_path = 'Processed_Project.csv'
    with open(processed_file_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(df.columns) # Writing the headers
        writer.writerows(df.values) # Writing the DataFrame lines

os.remove(file_path) # Removing the original file

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

This Python code processes a CSV file containing reviews, applying a text preprocessing function to clean the *Review* column. First, it reads the data from a file called `combined.csv` into a pandas DataFrame. It then checks if the *Review* column exists in the DataFrame. If it does, the code creates a new column, *Original_Review*, to store the original review content. Next, the function `removing_stop_words` is applied to each review, removing stop words, special characters, numbers, and URLs.

After preprocessing the reviews, the updated DataFrame is saved to a new file, `Processed_Project.csv`. The code writes both the column headers and the processed review data to the new file. Finally, it deletes the original CSV file (`combined.csv`) to leave only the processed version.
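
The pattern of keeping the raw text next to the cleaned text can be sketched on a two-row DataFrame. `demo_clean` below is a hypothetical stand-in for the cleaning function, and the rows are invented; the second row has a missing review to show what happens to non-string values.

```python
import pandas as pd


def demo_clean(text):
    # Hypothetical stand-in for the notebook's cleaning function.
    return text.lower().strip() if isinstance(text, str) else ""


df = pd.DataFrame({"Review": ["  Great STAY ", None], "Classification": [5, 3]})

if "Review" in df.columns:
    # Keep the untouched text in its own column before overwriting "Review".
    df["Original_Review"] = df["Review"]
    df["Review"] = df["Review"].apply(demo_clean)

print(df)
```

Copying the column before calling `apply` is what preserves the original text, since the assignment to `Review` overwrites it in place.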

In [42]:
logging.set_verbosity_error()

file_path = 'Processed_Project.csv' 
df = pd.read_csv(file_path)

review_column = 'Review'

from transformers import pipeline

# Initialize the sentiment-analysis pipeline once, rather than reloading the model for every review
# The pipeline uses a pre-trained model to classify sentiment (e.g., POSITIVE or NEGATIVE)
# Note: device="cuda" requires a CUDA-capable GPU
sentiment_pipeline = pipeline(task='sentiment-analysis', device="cuda")

def analyze_sentiment(text: str):
    """
    Performs sentiment analysis on a given text using a pre-trained model from Hugging Face Transformers.

    Args:
        text (str): The input text to be analyzed

    Returns:
        str: The sentiment label predicted by the model ('POSITIVE' or 'NEGATIVE').
            Returns 'No analysis' if the input is invalid or null.
    """
    # Check if the input is a valid, non-null string
    if isinstance(text, str) and pd.notnull(text):
        # If the input is valid, call the sentiment-analysis pipeline
        # The pipeline returns a list of dictionaries; [0] accesses the first result, and ['label'] gets the sentiment label
        return sentiment_pipeline(text)[0]['label']
    else:
        # If the input is invalid (not a string or null), return "No analysis"
        return "No analysis"

df['Sentiment'] = df[review_column].apply(analyze_sentiment)
sentiment_map = {'POSITIVE': 1, 'NEGATIVE': 0}
df['Binary_Sentiment'] = df['Sentiment'].map(sentiment_map)

This Python code processes reviews from a CSV file by performing sentiment analysis on the text using a pre-trained model from *Hugging Face’s Transformers* library. It defines a function called `analyze_sentiment` that classifies each review as either *POSITIVE* or *NEGATIVE* based on the model’s prediction. This function is applied to the *Review* column of the DataFrame, and the sentiment results are stored in a new column called *Sentiment*.

Additionally, the code converts the sentiment labels into binary values, mapping *POSITIVE* to 1 and *NEGATIVE* to 0. These binary values are saved in a new column, *Binary_Sentiment*. The overall goal of the code is to categorize the sentiment of the review texts for further analysis.
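
The label-to-number mapping can be checked on its own, without loading a model. The sketch below uses a hand-written Series of labels (invented for the example) and the same dictionary as the notebook.

```python
import pandas as pd

# Hand-written labels, including the fallback value used for invalid input.
labels = pd.Series(["POSITIVE", "NEGATIVE", "No analysis"])

sentiment_map = {"POSITIVE": 1, "NEGATIVE": 0}
binary = labels.map(sentiment_map)

print(binary)
```

Any label that is missing from the dictionary, such as `No analysis`, becomes `NaN` after `map`, so the resulting column is stored as floats rather than integers whenever such rows are present.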

In [43]:
# Saving the results in a new CSV file
output_path = 'Sentiment_analysis.csv'

with open(output_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(df.columns)
    writer.writerows(df.values)

# Deleting the original file
if os.path.exists(file_path):
    os.remove(file_path)

This Python code saves the processed DataFrame, which contains the sentiment analysis results, into a new CSV file (`Sentiment_analysis.csv`). It opens the file in write mode and uses `csv.writer` to write the column headers and the data from the DataFrame into the new file.

After saving the results, the code checks if the original file (`file_path`) exists. If it does, the file is deleted using `os.remove(file_path)`, ensuring that only the new file with the sentiment analysis results remains.
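
The same save-then-delete pattern can be sketched with pandas' own `DataFrame.to_csv`, shown here as an alternative to the `csv.writer` approach and not as the notebook's method. The file names and the one-row DataFrame are invented, and everything happens inside a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"Review": ["great stay"], "Sentiment": ["POSITIVE"], "Binary_Sentiment": [1]}
)

with tempfile.TemporaryDirectory() as tmp:
    intermediate = os.path.join(tmp, "intermediate.csv")  # hypothetical name
    final = os.path.join(tmp, "final.csv")                # hypothetical name

    # Simulate the file left over from the previous step.
    df.to_csv(intermediate, index=False, encoding="utf-8")

    # Write the final file, then delete the intermediate one if it exists.
    df.to_csv(final, index=False, encoding="utf-8")
    if os.path.exists(intermediate):
        os.remove(intermediate)

    remaining = sorted(os.listdir(tmp))
    reloaded = pd.read_csv(final)

print(remaining)
```

Passing `index=False` keeps the row index out of the file, so the columns read back match the ones that were written.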

<img src="Dashboard.png" width="20%" align="center"/>
