![Cover Sheet Image](BP0289232_CoverSheet.jpg)

# <b>Project overview</b>
- To gain insight into customer opinion of the company using Trustpilot.
- To achieve this, we will scrape the company's pages on Trustpilot for *individual review text*, *individual review scores* and *date of review*.

---

### <b>Ensure that scrape.py, structure.py and sentiment_and_topics.py are placed in the same directory as main.ipynb</b>

---

<b>Outline of the process</b>

At a high level, the process is broken into three sections:
<ul>
<li>Scraping</li>
<li>Structuring</li>
<li>Sentiments and topics</li>
</ul>
When the main code is run, the user is asked to choose whether to scrape Trustpilot or use any cached files in the pages directory.

<ul>
<li>Parameters are hardcoded to scrape The AA's Trustpilot page for the last 30 days</li>
<li>Ask the user if they wish to scrape the site or use the cached pages in the pages folder</li>
<li>If the user wants to scrape</li>
    <ul>
    <li><b>Scraping process</b></li>
    <li>Check whether we are allowed to scrape the URL by reading the robots.txt file</li>
    <li>Start a while loop</li>
        <ul>
        <li>Build the URL for page to scrape</li>
        <li>Try to scrape that page</li>
        <li>If we get a 200 response</li>
            <ul>
            <li>If it's the first page, get the overall rating</li>
            <li>Save the page content to the pages folder</li>
            <li>Increment the page number</li>
            </ul>
        <li>If we get a rate limit message</li>
            <ul>
            <li>Stop fetching pages and exit the while loop</li>
            </ul>
        <li>If we receive no content</li>
            <ul>
            <li>Stop fetching pages and exit the while loop</li>
            </ul>
        </ul>
    </ul>

<li><b>Structuring process</b></li>
<li>For each html file in the pages directory</li>
    <ul>
    <li>find the reviews</li>
        <ul>
        <li>for each review</li>
            <li>get the review id</li>
            <li>get the review text</li>
            <li>get the review rating</li>
            <li>get the source of the review</li>
            <li>get the date of the experience</li>
            <li>get the date of publication</li>
            <li>add these to a dataframe</li>
        </ul>
    </ul>
<li>once the dataframe is complete, replace the source codes with more readable labels</li>
<li>save the dataframe as a csv file in the csv files directory</li>
<br>
<li><b>Sentiment and topic process</b></li>                
<li>Perform sentiment analysis on the reviews text</li>
<li>Save the new dataframe as a csv file in the csv files directory</li>
<li>Create metrics for the reviews, split by source and at an overall level</li>
    <ul>
    <li>Count of reviews</li>
    <li>Mean rating</li>
    <li>Modal (most common) rating</li>
    <li>Average polarity score</li>
    <li>Average subjectivity score</li>
    <li>Positive review count and percentage</li>
    <li>Negative review count and percentage</li>
    </ul>
<li>Metrics are saved in the outputs directory</li>
<li>Reviews are filtered to only include the negative reviews (sentiment less than -0.25 and rating less than 3)</li>
<li>The review text is then cleaned, and English stop words, as well as the words "AA" and "car", are removed</li>
<li>A word cloud image is then created of the negative reviews</li>
<li>Next the text is vectorised based on the frequency of each word</li>
<li>Topic modelling is then performed using LDA</li>
<li>The top 5 words for each topic will then be printed on screen</li>
<li>An interactive lda_visualisation.html file will be saved in the outputs directory for further topic analysis</li>
<li>A timeline graph of the daily mean average review score will be displayed and a copy saved in the outputs directory</li>
</ul>
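The scraping steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scrape.py`: the `?page=N` URL pattern, the use of HTTP 429 as the rate limit signal, and the `fetch` and `save` callables are all assumptions made so the loop can be shown without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def scrape_pages(base_url, fetch, save, max_pages=50):
    """Fetch numbered pages until a rate limit, an empty page or max_pages.

    fetch(url) -> (status_code, text) and save(page_number, text) are
    hypothetical callables supplied by the caller.
    Returns the number of pages saved.
    """
    page = 1
    while page <= max_pages:
        url = f"{base_url}?page={page}"  # assumed URL pattern
        status, text = fetch(url)
        if status == 429:  # assumed rate limit signal: stop fetching
            break
        if status != 200 or not text:  # no content (or another error): stop
            break
        # On the first page the real process also reads the overall rating
        save(page, text)
        page += 1
    return page - 1
```

Passing `fetch` and `save` in as arguments keeps the loop easy to exercise with canned responses; in practice `fetch` would wrap an HTTP request and `save` would write the HTML to the pages folder.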

---
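As a rough illustration of the metrics and the negative-review filter described in the outline, the sketch below uses only the standard library. It is not the code in `sentiment_and_topics.py`: the dictionary keys (`rating`, `polarity`, `source`) and the function names are hypothetical, and only the thresholds (polarity below -0.25 and rating below 3) come from the outline.

```python
from collections import defaultdict
from statistics import mean, mode

NEGATIVE_POLARITY = -0.25  # threshold taken from the outline
NEGATIVE_RATING = 3        # ratings below this count towards "negative"


def is_negative(review):
    """A review is negative when both polarity and rating are low."""
    return review["polarity"] < NEGATIVE_POLARITY and review["rating"] < NEGATIVE_RATING


def summarise(reviews):
    """Compute a few of the outlined metrics for a list of review dicts."""
    ratings = [r["rating"] for r in reviews]
    negatives = [r for r in reviews if is_negative(r)]
    return {
        "count": len(reviews),
        "mean_rating": mean(ratings),
        "mode_rating": mode(ratings),
        "mean_polarity": mean(r["polarity"] for r in reviews),
        "negative_count": len(negatives),
        "negative_pct": 100 * len(negatives) / len(reviews),
    }


def summarise_by_source(reviews):
    """Group reviews by their (hypothetical) 'source' key and summarise each group."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review["source"]].append(review)
    return {source: summarise(group) for source, group in groups.items()}
```

Calling `summarise(reviews)` gives the overall figures, while `summarise_by_source(reviews)` gives the same figures split by source.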


In [None]:
# main module


# Importing libraries
# Standard library imports
from datetime import datetime
from pathlib import Path
from urllib.robotparser import RobotFileParser
import json
import logging
import os
import random
import re
import sys
import time


# Third party imports
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from textblob import TextBlob
from wordcloud import WordCloud
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
import pyLDAvis
import pyLDAvis.lda_model
import requests


# Local application/library imports - assumes this script is in the same directory as the modules
import scrape
import structure
import sentiment_and_topics

# Function to clear a log file
def clear_this_log(logfile_name):
    """
    Clear the given log file.
    """
    with open(logfile_name, 'w'):
        pass

# Function to configure logging for the script
def configure_logging(logfile_name, clear_log):
    """
    Configures logging for the script.
    Args:
    logfile_name (str): The name of the log file.
    clear_log (bool): If True, clears the log file at the start of the script.

    Returns:
    logger (logging.Logger): Configured logger instance.
    """
    # Clear the log file before the handler is attached, so that nothing
    # logged during this run is truncated afterwards
    if clear_log:
        clear_this_log(logfile_name)

    logging.basicConfig(
        filename=logfile_name,
        level=logging.INFO,  # Log level
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'  # Log format
    )

    # Create a logger
    logger = logging.getLogger(__name__)

    if clear_log:
        logger.info("Log file cleared at the start of the script.")

    # Assert statement for testing purposes
    assert logger.hasHandlers(), "Logger should have handlers"

    return logger

# Function to get user input for scraping or using cached pages
def get_user_input():
    """
    Gets user input for whether to scrape or use cached pages.
    
    Returns:
    user_input (str): User input ('scrape' or 'cache').
    """

    while True:
        user_input = input("Do you want to scrape or use cached pages? (scrape/cache): ").strip().lower()
        if user_input in ["scrape", "cache"]:
            return user_input
        else:
            print("Invalid input. Please enter 'scrape' or 'cache'.")


# Main code
def main():
    """
    Main function to execute the script.
    It configures logging, gets user input, and calls the appropriate functions for scraping, structuring data,
    and modelling sentiment and topics.
    """
    # Set up the log file name and whether to clear it
    logfile_name = 'main.log'
    clear_log = True # True = clear the log file. False = keep the existing log file and append to it.

    # Configure logging
    logger = configure_logging(logfile_name, clear_log)
    
    logger.info("Script execution started.")    

    # Get user input for scraping or using cached pages
    scrape_or_cache = get_user_input()
    # Logged at INFO because the configured level (INFO) would suppress DEBUG messages
    logger.info(f"User input for scraping or caching: {scrape_or_cache}")

    # get_user_input() only ever returns "scrape" or "cache"
    if scrape_or_cache == "scrape":
        logger.info("Scraping pages")
        scrape.scrape()
    else:
        logger.info("Using cached pages")

    # Structuring and modelling run in both cases
    logger.info("Structuring data")
    structure.structure()
    logger.info("Modelling sentiment and identifying topics")
    sentiment_and_topics.sentiment_and_topics()

    # Record that the run has finished
    logger.info("Script execution completed.")

if __name__ == "__main__":
    # Check if the script is being run directly
    # If so, call the main function to start the program
    main()
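
As an alternative to truncating the log file by hand with `clear_this_log`, `logging.basicConfig` can open the file in write mode itself. The sketch below is one possible variant, not the notebook's own function: `configure_logging_sketch` is a hypothetical name, and `force=True` (available from Python 3.8) is included on the assumption that the notebook kernel may already have handlers attached to the root logger, in which case `basicConfig` would otherwise do nothing.

```python
import logging


def configure_logging_sketch(logfile_name, clear_log=True):
    """Configure file logging, optionally starting from an empty log file."""
    logging.basicConfig(
        filename=logfile_name,
        filemode="w" if clear_log else "a",  # "w" truncates, "a" appends
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        force=True,  # replace any handlers already on the root logger
    )
    return logging.getLogger(__name__)
```

With `clear_log=False` the existing file is kept and new messages are appended, which matches the behaviour described for the `clear_log` flag in `main()`.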