# Notebook 1: Text Mining from Annual Reports and News

## ADSS summer project: Sentiment Analysis on Financial Risk and Sustainability in UK Retail Industry.

Student Name: Shing Fai, Wong (Amos)  
Student Number: 720062083   
Programme: MSc Applied Data Science and Statistics

This notebook extracts text from annual reports and news, cleans the resulting sentences, and labels the sentiment of each sentence.

# Part A. Global functions

## 1. extract_pdf_text(pdf_file)
Function: Extracts text from a PDF file and returns a list of cleaned sentences.

In [1]:
def extract_pdf_text(pdf_file):
    
    """Extracts text from a PDF
    
    Args:
        pdf_file: the path to the pdf file.
    
    Steps:
    Part A. Extract sentences from the PDF
        1. Open the PDF document
        2. Create a PDF reader
        3. Get the total number of pages in the PDF document
        4. Extract the text from each page of the PDF document
        5. Split the text into sentences at each full stop

    Part B. Clean the sentences
        1. Remove empty sentences
        2. Replace newline characters with spaces
        3. Replace multiple consecutive spaces with a single space
        4. Replace doubled quotation marks "" with a single quotation mark

    Returns:
        A list of cleaned sentences from the PDF document
    """
    # libraries
    import re # regular expression library to handle sentence pattern
    import PyPDF2 # pdf library 
    
    # Create a list to store the extracted text from each page
    sentences = []
    
    ### Part A. Extract text from pdf
    with open(pdf_file, "rb") as f:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(f)
        num_pages = len(pdf_reader.pages)
        
        # Iteration over each page of the PDF and extract the text content
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text = page.extract_text()
        
            # split the text into sentences
            sentences.extend(text.split('.'))
    
    ### Part B. Clean the sentence
    # Remove any empty sentences
    sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
    
    # Replace newline characters with a space
    sentences = [sentence.replace('\n', ' ') for sentence in sentences]
    
    # Replace any run of consecutive whitespace (r'\s+') with a single space
    sentences = [re.sub(r'\s+', ' ', sentence) for sentence in sentences]
    
    # Replace any doubled quotation marks "" with a single one
    sentences = [re.sub(r'""', '"', sentence) for sentence in sentences]

    return sentences
    # End

Output:

In [2]:
pdf_file_path = "AR/sainsbury/sainsbury-ar2012.pdf"
extracted_sentences = extract_pdf_text(pdf_file_path)

# print the total number of sentences
print(f"Total number of sentences: {len(extracted_sentences)}")
print("")
# print the first 5 sentences
print(f'Print the first 5 sentences: {extracted_sentences[:5]}')
print(f"\nData type: {type(extracted_sentences)}")

Total number of sentences: 3295

Print the first 5 sentences: ['Annual Report and Financial Statements 2012 Annual Report and Financial Statements 2012', 'Welcome to our Annual Report and Financial Statements 2012 It’s an exciting time for Sainsbury’s', 'Our clear, long-term strategy continues to deliver for our customers, ensuring we are well positioned for future growth', 'In recent years we have transformed our business, while remaining true to our 143 year heritage', 'By offering universal appeal we are helping our customers Live Well for Less']

Data type: <class 'list'>
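
Splitting on every full stop also cuts decimal numbers in two, so a figure such as "£9.0bn" ends up spread over two sentences. The sketch below shows one possible alternative, not the method used in this notebook: it treats a full stop as a boundary unless it sits between two digits. The helper name `split_sentences` and the sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Split text on full stops, but keep decimals such as 9.0 intact."""
    # A full stop is a boundary unless a digit sits on BOTH sides of it
    parts = re.split(r'(?<!\d)\.|\.(?!\d)', text)
    # Strip whitespace and drop empty fragments
    return [part.strip() for part in parts if part.strip()]

sample = "Group revenue was £9.0bn in 2021. Loss per share was 9.8p."
print(split_sentences(sample))
```

This still splits abbreviations such as "e.g." and does not repair words that the PDF extraction has glued together, so it is only a partial improvement.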


## 2. standardize_sentences(extracted_sentences)
Function: Lowercases the text and removes punctuation, numbers, and stopwords.

In [3]:
def standardize_sentences(extracted_sentences):
    """
    Lowercases all texts, removes punctuation, numbers, and stopwords.

    Args:
        extracted_sentences: Extracted and cleaned sentences from the PDF.

    Steps:
        1. Convert text to lowercase
        2. Remove punctuation and numbers
        3. Remove stopwords such as "the," "and," "is," "a," which are commonly used
           without any sentiment.

    Return:
        A list of sentences with all lowercase text,
        without punctuation, numbers, and stopwords.
    """

    import re  # regular expression library to handle sentence pattern
    from nltk.corpus import stopwords  # natural language toolkit stopwords module

    clean_sentences = []  # storing the cleaned sentences
    stop_words = set(stopwords.words('english'))  # using English stopwords

    # iteration over each extracted sentence and then preprocess the sentence
    for sentence in extracted_sentences:
        # Convert text to lowercase
        lower_text = sentence.lower()

        # Remove punctuation and numbers
        # [^a-zA-Z] matches any character that is not an ASCII letter
        removed_punc_num = re.sub('[^a-zA-Z]', ' ', lower_text)

        # Remove stopwords
        # iteration over each word and check whether it matches with the stopwords
        # If the word does not match with the stopword, then join into a sentence using a single separator " "
        clean_sentence = ' '.join(word for word in removed_punc_num.split() if word not in stop_words)

        # Append the preprocessed sentence to the list
        clean_sentences.append(clean_sentence)

    return clean_sentences


Output:

In [4]:
clean_sentences = standardize_sentences(extracted_sentences)

# Print the preprocessed sentences
print(f"Total number of preprocessed sentences: {len(clean_sentences)}")
print("")
# print the first 5 sentences
print(f'Print the first 5 sentences: {clean_sentences[:5]}')

Total number of preprocessed sentences: 3295

Print the first 5 sentences: ['annual report financial statements annual report financial statements', 'welcome annual report financial statements exciting time sainsbury', 'clear long term strategy continues deliver customers ensuring well positioned future growth', 'recent years transformed business remaining true year heritage', 'offering universal appeal helping customers live well less']
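
To make the three steps concrete, here is a small self-contained sketch that applies them to one sentence from the output above. It uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) instead of the NLTK list, so it runs without the NLTK corpus; the helper name `standardize_one` is also an assumption. With NLTK itself, the stop-word corpus usually has to be fetched once with `nltk.download('stopwords')`.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"in", "we", "have", "our", "while", "to"}

def standardize_one(sentence, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep letters only, then drop stop words."""
    lower_text = sentence.lower()
    # Every non-letter (punctuation, digits, symbols) becomes a space
    letters_only = re.sub('[^a-z]', ' ', lower_text)
    return ' '.join(w for w in letters_only.split() if w not in stop_words)

raw = ("In recent years we have transformed our business, "
       "while remaining true to our 143 year heritage")
print(standardize_one(raw))
```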


## 3. label_sentences(sentences, word_list)
Function: labeling the sentences as negative, positive, or neutral based on the given word list.

In [5]:
def label_sentences(sentences, word_list):
    """
    Labels sentences as negative, positive, or neutral based on the given word list.

    Args:
        sentences: List of preprocessed sentences.
        word_list: DataFrame containing financial words and their sentiment.

    Returns:
        DataFrame with sentences and labels.
    """
    import pandas as pd
    
    labels = []
    positive_words = set(word_list[word_list['sentiment'] == 'Positive']['word'])
    negative_words = set(word_list[word_list['sentiment'] == 'Negative']['word'])

    for sentence in sentences:
        words = sentence.split()
        if any(word in positive_words for word in words):
            labels.append('positive')
        elif any(word in negative_words for word in words):
            labels.append('negative')
        else:
            labels.append('neutral')

    # Create a DataFrame with sentences and labels
    data = pd.DataFrame({'sentences': sentences, 'labels': labels})

    return data


Output:

In [6]:
import pandas as pd

financial_words = pd.read_csv("word_list/Loughran_McDonald_Sentiment_Word_List.csv")

# Apply label_sentences function to the clean_sentences
df = label_sentences(clean_sentences, financial_words)

# Print the selected rows of the dataset
print(df.loc[0:20,:])

                                            sentences    labels
0   annual report financial statements annual repo...   neutral
1   welcome annual report financial statements exc...  positive
2   clear long term strategy continues deliver cus...  positive
3   recent years transformed business remaining tr...  negative
4   offering universal appeal helping customers li...  positive
5   delivering business strategy five areas focus ...  positive
6   know colleagues culture values make us differe...  positive
7                                    find j sainsbury  positive
8                                                  co   neutral
9   uk cover picture taken tv advertising campaign...  positive
10  advertisement achieved record ratings likeabil...  positive
11  first store opened london drury lanevision goa...  positive
12  sainsbury founded john james sainsbury wife ma...  positive
13  make customers lives easier every day offering...  positive
14  years strong began recruiting women 
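
Because `label_sentences` checks the positive words first, a sentence that contains both positive and negative words is always labelled positive. One possible alternative, shown here only as a sketch and not as the approach used in this notebook, is to compare how many words of each kind occur. The function name `label_by_count` and the two small word sets are hypothetical.

```python
def label_by_count(sentence, positive_words, negative_words):
    """Label a sentence by whichever sentiment has more matching words."""
    words = sentence.split()
    n_pos = sum(word in positive_words for word in words)
    n_neg = sum(word in negative_words for word in words)
    if n_pos > n_neg:
        return 'positive'
    if n_neg > n_pos:
        return 'negative'
    return 'neutral'  # no matches, or a tie

# Hypothetical word sets for illustration only
demo_positive = {"strong", "growth"}
demo_negative = {"loss", "decline"}

print(label_by_count("strong growth despite loss", demo_positive, demo_negative))
print(label_by_count("loss decline offset growth", demo_positive, demo_negative))
```

Treating a tie as neutral is a design choice; other tie-breaking rules are equally defensible.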

## 4. convert_labels_to_numeric(data, label_mapping)

Function: Converting categorical labels into numeric values based on the provided label mapping.

In [7]:
def convert_labels_to_numeric(data, label_mapping):
    """
    Converts categorical labels into numeric values based on the provided label mapping.
    
    Args:
        data (pd.DataFrame): DataFrame containing sentences and labels.
        label_mapping (dict): Mapping of categorical labels to numeric values.
        
    Returns:
        pd.DataFrame: DataFrame with labels converted to numeric values.
    """
    # Create a copy of the DataFrame to avoid modifying the original data
    converted_data = data.copy()
    
    # Convert labels to numeric values
    converted_data['labels'] = converted_data['labels'].map(label_mapping)
    
    return converted_data


In [8]:
# Define label mapping
label_mapping = {'positive': 1, 'negative': -1, 'neutral': 0}

# Convert labels to numeric values
converted_df = convert_labels_to_numeric(df, label_mapping)

# Print the first few rows of the converted data
print(converted_df.iloc[0:20,:])

                                            sentences  labels
0   annual report financial statements annual repo...       0
1   welcome annual report financial statements exc...       1
2   clear long term strategy continues deliver cus...       1
3   recent years transformed business remaining tr...      -1
4   offering universal appeal helping customers li...       1
5   delivering business strategy five areas focus ...       1
6   know colleagues culture values make us differe...       1
7                                    find j sainsbury       1
8                                                  co       0
9   uk cover picture taken tv advertising campaign...       1
10  advertisement achieved record ratings likeabil...       1
11  first store opened london drury lanevision goa...       1
12  sainsbury founded john james sainsbury wife ma...       1
13  make customers lives easier every day offering...       1
14  years strong began recruiting women work store...       1
15  put 
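
Note that pandas `Series.map` returns `NaN` for any label that is missing from the mapping, so a misspelt label would pass through silently. The plain-Python sketch below illustrates a stricter check that could be run before the conversion; `map_labels_strict` is a hypothetical helper, not part of this notebook.

```python
def map_labels_strict(labels, label_mapping):
    """Map labels to numbers, failing loudly on labels without a mapping."""
    unknown = sorted(set(labels) - set(label_mapping))
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [label_mapping[label] for label in labels]

demo_mapping = {'positive': 1, 'negative': -1, 'neutral': 0}
print(map_labels_strict(['neutral', 'positive', 'negative'], demo_mapping))
```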

## 5. split_the_dataset(dataset, training_size)
Function: Splitting the dataset into training and testing sets.

In [9]:
def split_the_dataset(dataset, training_size):
    """
    Splits the dataset into training and testing sets.

    Args:
        dataset: The dataset to be split.
        training_size: The proportion of the dataset to be used for training.

    Returns:
        Two datasets: the training dataset and the testing dataset.
    """
    # library
    from sklearn.model_selection import train_test_split
    
    # Split the dataset into training and testing sets
    training_data, testing_data = train_test_split(dataset, 
                                                   train_size=training_size, 
                                                   random_state=42)
    
    return training_data, testing_data

Output:

In [10]:
training_size = 0.8
sentences = converted_df["sentences"]
labels = converted_df["labels"]

# Split the sentences
training_sentences, testing_sentences = split_the_dataset(sentences, training_size)

# Print the training sentences dataframe
print(f"There are {len(training_sentences)} in the training sentences data.")
print("The following are the training sentences: ")
print(training_sentences)

# Print the testing sentences
print("")
print("")
print(f"There are {len(testing_sentences)} in the testing sentences data.")
print("The following are the testing sentences: ")
print(testing_sentences)

There are 2636 in the training sentences data.
The following are the training sentences: 
2989    arose normal course business immaterial group ...
2882    details options march set date grant ate expir...
2643    group considers basis point increase reasonabl...
436     per cent year year facilitated part b overall ...
2215    property leases land building elements treated...
                              ...                        
1095                                                   co
1130      committee terms reference available website www
1294    key r isk facing group area relates reducing e...
860     ddition shareholders asked renew general autho...
3174    implications capital gains tax purposes gain l...
Name: sentences, Length: 2636, dtype: object


There are 659 in the testing sentences data.
The following are the testing sentences: 
3101                                                     
3164    telephone freephone quote sainsbury computersh...
3102                

In [11]:
# Split the labels
training_labels, testing_labels = split_the_dataset(labels, training_size)

# Print the training labels
print(f"There are {len(training_labels)} in the training label data.")
print("The following are the training labels: ")
print(training_labels)

# Print the testing labels
print("")
print("")
print(f"There are {len(testing_labels)} in the testing label data.")
print("The following are the testing labels: ")
print(testing_labels)

There are 2636 in the training label data.
The following are the training labels: 
2989   -1
2882    1
2643    1
436     1
2215   -1
       ..
1095    0
1130   -1
1294   -1
860     1
3174    1
Name: labels, Length: 2636, dtype: int64


There are 659 in the testing label data.
The following are the testing labels: 
3101    0
3164    1
3102    0
1947   -1
2400    0
       ..
802    -1
2308   -1
839    -1
817    -1
69      1
Name: labels, Length: 659, dtype: int64
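
The sentences and labels are split in two separate calls; they stay aligned only because both calls use the same `random_state` on inputs of the same length, which is why the row indices match in the two outputs. The sketch below illustrates the underlying idea in plain Python: shuffle the indices once and apply them to both lists. It is a standard-library illustration, not scikit-learn's implementation, and `split_aligned` is a hypothetical name. `train_test_split` also accepts several arrays in a single call, which gives the same guarantee.

```python
import random

def split_aligned(sentences, labels, training_size, seed=42):
    """Split two parallel lists with one shared shuffle so pairs stay together."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)           # one shuffle for both lists
    n_train = int(training_size * len(indices))    # e.g. 0.8 of 3295 gives 2636
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(sentences, train_idx), pick(sentences, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

demo_sentences = [f"s{i}" for i in range(10)]
demo_labels = list(range(10))
train_s, test_s, train_l, test_l = split_aligned(demo_sentences, demo_labels, 0.8)
print(len(train_s), len(test_s))
```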


## 6. tokenize_pad_sequences(sentences, num_words, oov_token, maxlen, padding, truncating)
Function: Tokenize the text from sentences and pad the sequences

In [12]:
def tokenize_pad_sequences(sentences, num_words, oov_token, maxlen, padding, truncating):
    """Tokenize the text from sentences and pad the sequences
    Args:
        sentences: A list of cleaned and preprocessed sentences.
        num_words: An integer specifying the maximum number of words
            to keep, based on word frequency.
        oov_token: A string specifying the out-of-vocabulary token
            used for words not present in the tokenizer's word index.
        maxlen: An integer specifying the maximum length of the sequences.
        padding: A string specifying the padding type, either 'pre' or 'post'.
        truncating: A string specifying the truncation type, either 'pre' or 'post'.

    Returns:
        A tuple (tokenizer, word_index, sequences, padded_sequences).
    """
    
    # library
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    
    # Initialize the Tokenizer class
    tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
    
    # Split each sentence into words
    tokenized_sentences = [sentence.split() for sentence in sentences]

    # Generate indices for each word in the corpus
    tokenizer.fit_on_texts(tokenized_sentences)

    # Get the indices
    word_index = tokenizer.word_index
    
    # Generate list of token sequences
    sequences = tokenizer.texts_to_sequences(tokenized_sentences)
    
    # Pad the sequences with the assigned length, padding, and truncating
    padded_sequences = pad_sequences(sequences, maxlen=maxlen, padding=padding, truncating=truncating)
    
    return tokenizer, word_index, sequences, padded_sequences


## 7. summary_tokpad(padded_sentences, start_index, end_index)
Function: Printing a summary of the tokenization and padding process for a given range of entries.

In [13]:
def summary_tokpad(padded_sentences, start_index, end_index):
    
    """
    Prints a summary of the tokenization and padding process for a given range of entries.

    Args:
        padded_sentences: The tuple returned by tokenize_pad_sequences():
            (tokenizer, word_index, sequences, padded_sequences).
        start_index: The starting index of the entries to be summarized.
        end_index: The ending index of the entries to be summarized.

    Returns:
        None (Prints the summary to the console).
    """
    
    # Retrieve the word index, sequences, and padded sequences
    word_index = padded_sentences[1]
    sequences = padded_sentences[2]
    padded_sequences = padded_sentences[3]

    # Print the assigned number of entries of the word index
    print(f"start index: {start_index}")
    print(f"end index: {end_index}")
    print("Summary of the tokenization and padding:")
    print("\nSelected Word Index:")
    for word, index in list(word_index.items())[start_index:end_index]:
        print(f"{word}: {index}")

    # Print the selected token sequences (one list of word indices per sentence)
    print("\nSelected Sentences:")
    for sentence in sequences[start_index:end_index]:
        print(sentence)

    # Print the selected padded sequences
    print("\nSelected Padded Sequences:")
    for padded_seq in padded_sequences[start_index:end_index]:
        print(padded_seq)


Output:

In [14]:
# Global parameters
num_words = 100
oov_token = '<OOV>'
maxlen = 5
padding = 'post'
truncating = 'post'

# sentences to be tokenized and padded
sentences = clean_sentences

In [15]:
# tokenize and pad the sentences
padded_sentences = tokenize_pad_sequences(sentences, num_words, oov_token, maxlen, padding, truncating)



In [16]:
# Using the global function summary_tokpad()
summary_tokpad(padded_sentences, 0, 5)

start index: 0
end index: 5
Summary of the tokenization and padding:

Selected Word Index:
<OOV>: 1
financial: 2
year: 3
million: 4
group: 5

Selected Sentences:
[34, 15, 2, 9, 34, 15, 2, 9]
[1, 34, 15, 2, 9, 1, 1, 6]
[1, 1, 66, 1, 1, 1, 67, 1, 1, 1, 1, 59]
[1, 55, 1, 26, 1, 1, 3, 1]
[1, 1, 1, 1, 67, 1, 1, 1]

Selected Padded Sequences:
[34 15  2  9 34]
[ 1 34 15  2  9]
[ 1  1 66  1  1]
[ 1 55  1 26  1]
[ 1  1  1  1 67]
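
The many 1s in the sequences above are out-of-vocabulary tokens: with `num_words = 100` only the most frequent words keep their own index, and `maxlen = 5` with 'post' truncation keeps just the first five tokens of each sentence. The following plain-Python sketch mimics that behaviour as I understand it, for illustration only; it is not the Keras implementation, and `tokenize_and_pad` and the toy sentences are hypothetical.

```python
from collections import Counter

def tokenize_and_pad(sentences, num_words, oov_token, maxlen):
    """Frequency-ranked word index, OOV replacement, post-padding/truncation."""
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent word first; ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    word_index = {oov_token: 1}                    # index 1 is reserved for OOV
    for rank, (word, _) in enumerate(ranked, start=2):
        word_index[word] = rank
    sequences = []
    for s in sentences:
        seq = []
        for word in s.split():
            idx = word_index.get(word, 1)
            # Words ranked at or beyond num_words are replaced by the OOV index
            seq.append(idx if idx < num_words else 1)
        sequences.append(seq)
    # 'post' truncation keeps the first maxlen tokens; 'post' padding appends zeros
    padded = [(seq + [0] * maxlen)[:maxlen] for seq in sequences]
    return word_index, sequences, padded

toy = ["annual report financial", "financial year", "financial report group"]
toy_index, toy_sequences, toy_padded = tokenize_and_pad(toy, 4, '<OOV>', 4)
print(toy_index)
print(toy_sequences)
print(toy_padded)
```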


## 8. scrap_yahoo_news(stock_sym, start_date, end_date)
Function: Scrapes Yahoo Finance news search results for a given query and date range.

In [17]:
def scrap_yahoo_news(stock_sym, start_date, end_date):
    """
    Scrapes Yahoo Finance news search results for a given query and date range.

    Args:
        stock_sym (str): The stock symbol
        start_date (str): "YYYY-MM-DD"
        end_date (str): "YYYY-MM-DD"

    Returns:
        list: A list of sentences extracted from the news articles.

    """
    # Library
    import requests
    from bs4 import BeautifulSoup
    import re

    # Format the query and date range in the URL
    formatted_query = stock_sym.replace(" ", "+")
    url = f"https://search.yahoo.com/search?p={formatted_query}+news&b={start_date}&bt={end_date}"

    # Send a GET request to the search results page
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the news articles on the search results page
    articles = soup.find_all("div", class_="algo-sr")

    # Create a list to store the extracted content
    contents = []

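    # Loop through the articles and extract the content
    for article in articles:
        paragraph = article.find("p")
        # Skip results without a <p> summary to avoid an AttributeError
        if paragraph is None:
            continue
        content = paragraph.text.strip()
        
        # Remove the leading date (e.g. "May 23, 2023 · ") using regex
        content_without_date = re.sub(r"[A-Za-z]+\s+\d{1,2},\s+\d{4}\s*·\s*", "", content)
        
        contents.append(content_without_date)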

    # Combine the content into a single string
    combined_content = ' '.join(contents)

    # Split the combined content into sentences
    sentences = re.split(r'(?<=[.!?])\s+', combined_content)

    return sentences
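
The two regular expressions in this function can be tried in isolation. The sketch below reuses them unchanged on an invented snippet (the sample text is made up for illustration and is not real scraped output); `clean_and_split` is a hypothetical helper name.

```python
import re

# Same patterns as in scrap_yahoo_news()
DATE_PREFIX = re.compile(r"[A-Za-z]+\s+\d{1,2},\s+\d{4}\s*·\s*")
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def clean_and_split(snippet):
    """Remove a 'Month DD, YYYY · ' prefix, then split after ., ! or ?."""
    without_date = DATE_PREFIX.sub("", snippet)
    return SENTENCE_BOUNDARY.split(without_date)

# Invented sample text, for illustration only
sample = "May 23, 2023 · Sainsbury's shares rose. Analysts were upbeat!"
print(clean_and_split(sample))
```

Unlike the full-stop split used for the PDFs, the lookbehind keeps the closing punctuation attached to each sentence.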



Output:

In [18]:
# Define the search query and date range
query = "SBRY.L"
start_date = "2022-05-01"
end_date = "2022-05-30"

# Extract the news content
example = scrap_yahoo_news(query, start_date, end_date)

# Print the result
print(f"Extracted sentences from the news {query}")
example[:5]

Extracted sentences from the news SBRY.L


['Get J Sainsbury PLC (SBRY.L) real-time stock quotes, news, price and financial information from Reuters to inform your trading and investments Find the latest J Sainsbury plc (SBRY.L) stock quote, history, news and other vital information to help you with your stock trading and investing.',
 'Get the latest J Sainsbury plc (SBRY.L) stock news and headlines to help you in your trading and investing decisions.',
 'Get the latest J Sainsbury plc (SBRY) real-time quote, historical performance, charts, and other financial information to help you make more informed trading and investment decisions.',
 'J Sainsbury PLC SBRY.L Latest Trade 278 GBp 0 0.00% As of May 23 2023.',
 "Values delayed up to 15 minutes Today's Range -- - -- 52 Week Range 168.70 - 291.00 Profile Charts Financials Key Metrics..."]

## 9. remove_pdf_encryption(input_path, output_path)
Function: Removing the security settings (encryption) from a PDF file.

In [19]:
def remove_pdf_encryption(input_path, output_path):
    """
    Function to remove security settings from a PDF file.
    
    Args:
        input_path (str): The path to the input PDF file.
        output_path (str): The path to save the output PDF file without security settings.
    """
    import PyPDF2
    
    with open(input_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        if pdf_reader.is_encrypted:
            pdf_reader.decrypt('')
        
        pdf_writer = PyPDF2.PdfWriter()
        for page in pdf_reader.pages:
            pdf_writer.add_page(page)
        
        with open(output_path, 'wb') as output_file:
            pdf_writer.write(output_file)

In [20]:
# Specify the input and output paths
# input_pdf_path = "AR/morrisons/morri-ar2017.pdf"
# output_pdf_path = "AR/morrisons/morri-ar2017_removed.pdf"

# Call the function to remove security settings
# remove_pdf_encryption(input_pdf_path, output_pdf_path)

# Part B. Annual Reports Extraction (pdf)

## B1. Sainsbury

First, use the global functions to extract the text from each annual report.

In [21]:
# Set the paths of the 11 Sainsbury annual reports (2012 to 2022)
sainsbury_2022_path = "AR/sainsbury/sainsbury-ar2022.pdf"
sainsbury_2021_path = "AR/sainsbury/sainsbury-ar2021.pdf"
sainsbury_2020_path = "AR/sainsbury/sainsbury-ar2020.pdf"
sainsbury_2019_path = "AR/sainsbury/sainsbury-ar2019.pdf"
sainsbury_2018_path = "AR/sainsbury/sainsbury-ar2018.pdf"
sainsbury_2017_path = "AR/sainsbury/sainsbury-ar2017.pdf"
sainsbury_2016_path = "AR/sainsbury/sainsbury-ar2016.pdf"
sainsbury_2015_path = "AR/sainsbury/sainsbury-ar2015.pdf"
sainsbury_2014_path = "AR/sainsbury/sainsbury-ar2014.pdf"
sainsbury_2013_path = "AR/sainsbury/sainsbury-ar2013.pdf"
sainsbury_2012_path = "AR/sainsbury/sainsbury-ar2012.pdf"
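
All of the report paths in this notebook follow the same `AR/<folder>/<prefix>-ar<year>.pdf` pattern, so they could also be generated rather than typed out. This is only a sketch of that idea; `build_report_paths` and the dictionary layout are assumptions, not part of the notebook.

```python
YEARS = range(2022, 2011, -1)  # 2022 down to 2012, newest first

def build_report_paths(folder, prefix, years=YEARS):
    """Return {year: path} following the AR/<folder>/<prefix>-ar<year>.pdf pattern."""
    return {year: f"AR/{folder}/{prefix}-ar{year}.pdf" for year in years}

sainsbury_paths = build_report_paths("sainsbury", "sainsbury")
ms_paths = build_report_paths("marks_and_spencer", "ms")
print(sainsbury_paths[2012])

# Possible usage with the global function defined in Part A:
# sain_sen = {year: extract_pdf_text(path) for year, path in sainsbury_paths.items()}
```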

In [22]:
# Extract text from sainsbury annual report
sain_sen_2022 = extract_pdf_text(sainsbury_2022_path)
sain_sen_2021 = extract_pdf_text(sainsbury_2021_path)
sain_sen_2020 = extract_pdf_text(sainsbury_2020_path)
sain_sen_2019 = extract_pdf_text(sainsbury_2019_path)
sain_sen_2018 = extract_pdf_text(sainsbury_2018_path)
sain_sen_2017 = extract_pdf_text(sainsbury_2017_path)
sain_sen_2016 = extract_pdf_text(sainsbury_2016_path)
sain_sen_2015 = extract_pdf_text(sainsbury_2015_path)
sain_sen_2014 = extract_pdf_text(sainsbury_2014_path)
sain_sen_2013 = extract_pdf_text(sainsbury_2013_path)
sain_sen_2012 = extract_pdf_text(sainsbury_2012_path)


Print the sentences extracted from the 2021 and 2014 annual reports.

In [23]:
# print the total number of sentences of 2021 sainsbury sentences
print(f"Total number of sentences: {len(sain_sen_2021)}")
print("")
# print the first 5 sentences
print(f'Print the first 5 sentences: {sain_sen_2021[:5]}')
print(f"\nData type: {type(sain_sen_2021)}")

Total number of sentences: 5308

Print the first 5 sentences: ['Annual Report and Financial Statements 2021 Driven by our passion for food, together we serve and help every customer', 'Offering delicious, great quality food at competitive prices has been at the heart of what we do since John James and Mary Ann Sainsbury opened our first store in 1869', 'Today, inspiring and delighting our customers with tasty food remains our priority', 'Our purpose is that driven by our passion for food, together we serve and help every customer', 'Our focus on great value food and convenient shopping, whether in-store or online is supported by our brands – Argos, Habitat, Tu, Nectar and Sainsbury’s Bank']

Data type: <class 'list'>


In [24]:
# print the total number of sentences of 2014 sainsbury sentences
print(f"Total number of sentences: {len(sain_sen_2014)}")
print("")
# print the first 5 sentences
print(f'Print the first 5 sentences: {sain_sen_2014[:5]}')
print(f"\nData type: {type(sain_sen_2014)}")

Total number of sentences: 4098

Print the first 5 sentences: ['Annual Report and Financial Statements 2014', 'Our business strategy: Our five areas of focus are underpinned by our values and operational excellence Note: this page forms part of our Strategic ReportOur values set us apart from other retailers', 'They underpin our strategy and the way we operate our business, and they govern the way we relate to customers, colleagues and stakeholders', 'See pages 8 to 23Read more about our strategy on pages 8 to 23 We have achieved around £120 million of operational cost savings over the year, totalling over £570 million over five years Great food Comp lementary channels and servicesDeveloping new busin essGrowing space and creating property value Compelling general merc handise and clothingOperational excellence Our values make us different', 'Supermarkets Convenience Online Logistics Central Bank22 depots cover 9']

Data type: <class 'list'>


Second, standardise the sentences of each annual report.

In [25]:
# Standardise all sentences from all Sainsbury annual reports
sain_standsent_2022 = standardize_sentences(sain_sen_2022)
sain_standsent_2021 = standardize_sentences(sain_sen_2021)
sain_standsent_2020 = standardize_sentences(sain_sen_2020)
sain_standsent_2019 = standardize_sentences(sain_sen_2019)
sain_standsent_2018 = standardize_sentences(sain_sen_2018)
sain_standsent_2017 = standardize_sentences(sain_sen_2017)
sain_standsent_2016 = standardize_sentences(sain_sen_2016)
sain_standsent_2015 = standardize_sentences(sain_sen_2015)
sain_standsent_2014 = standardize_sentences(sain_sen_2014)
sain_standsent_2013 = standardize_sentences(sain_sen_2013)
sain_standsent_2012 = standardize_sentences(sain_sen_2012)

Print a short summary of the standardised sentences from 2021 and 2014.

In [26]:
# print the standardized sentences summary of 2021 sainsbury annual report
print(f"Total number of standardized sentences of 2021: {len(sain_standsent_2021)}")
print("\nPrint the first 5 standardized sentences: ")
print(sain_standsent_2021[:5])
print(f"\nData type: {type(sain_standsent_2021)}")

Total number of standardized sentences of 2021: 5308

Print the first 5 standardized sentences: 
['annual report financial statements driven passion food together serve help every customer', 'offering delicious great quality food competitive prices heart since john james mary ann sainsbury opened first store', 'today inspiring delighting customers tasty food remains priority', 'purpose driven passion food together serve help every customer', 'focus great value food convenient shopping whether store online supported brands argos habitat tu nectar sainsbury bank']

Data type: <class 'list'>


In [27]:
# print the standardized sentences summary of 2014 sainsbury annual report
print(f"Total number of standardized sentences of 2014: {len(sain_standsent_2014)}")
print("\nPrint the first 5 standardized sentences: ")
print(sain_standsent_2014[:5])
print(f"\nData type: {type(sain_standsent_2014)}")

Total number of standardized sentences of 2014: 4098

Print the first 5 standardized sentences: 
['annual report financial statements', 'business strategy five areas focus underpinned values operational excellence note page forms part strategic reportour values set us apart retailers', 'underpin strategy way operate business govern way relate customers colleagues stakeholders', 'see pages read strategy pages achieved around million operational cost savings year totalling million five years great food comp lementary channels servicesdeveloping new busin essgrowing space creating property value compelling general merc handise clothingoperational excellence values make us different', 'supermarkets convenience online logistics central bank depots cover']

Data type: <class 'list'>


Next, combine all standardised sentences into a single list.

In [28]:
# Combine all standardised sentences into a single list
sain_standsent_all = (
    sain_standsent_2022 +
    sain_standsent_2021 +
    sain_standsent_2020 +
    sain_standsent_2019 +
    sain_standsent_2018 +
    sain_standsent_2017 +
    sain_standsent_2016 +
    sain_standsent_2015 +
    sain_standsent_2014 +
    sain_standsent_2013 +
    sain_standsent_2012
)

print(f'Total number of all standardised sentences: {len(sain_standsent_all)}')
print('The first five sentences of sainsbury standardised sentences:')
print(sain_standsent_all[:5])
print(f"\nData type: {type(sain_standsent_all)}")

Total number of all standardised sentences: 48742
The first five sentences of sainsbury standardised sentences:
['annual report financial statements helping everyone eat better', 'offering delicious great quality food competitive prices heart since john james mary ann sainsbury opened first store', 'today inspiring delighting customers tasty food remains priority', 'purpose driven passion food together serve help every customer', 'focus great value food convenient shopping whether store online supported brands argos habitat tu nectar sainsbury bank']

Data type: <class 'list'>
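
If the per-year sentence lists were kept in a dictionary keyed by year, the same newest-first concatenation could be written once for any retailer. The sketch below assumes that layout; `combine_years` and the toy data are hypothetical.

```python
from itertools import chain

def combine_years(sentences_by_year):
    """Concatenate per-year sentence lists, newest year first."""
    ordered_years = sorted(sentences_by_year, reverse=True)
    return list(chain.from_iterable(sentences_by_year[year] for year in ordered_years))

# Toy data standing in for the real per-year lists
toy_by_year = {2012: ["a"], 2022: ["b", "c"], 2017: []}
print(combine_years(toy_by_year))
```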


## B2. Marks and Spencer

In [29]:
# Set the paths of the 11 M&S annual reports (2012 to 2022)
ms_2022_path = "AR/marks_and_spencer/ms-ar2022.pdf"
ms_2021_path = "AR/marks_and_spencer/ms-ar2021.pdf"
ms_2020_path = "AR/marks_and_spencer/ms-ar2020.pdf"
ms_2019_path = "AR/marks_and_spencer/ms-ar2019.pdf"
ms_2018_path = "AR/marks_and_spencer/ms-ar2018.pdf"
ms_2017_path = "AR/marks_and_spencer/ms-ar2017.pdf"
ms_2016_path = "AR/marks_and_spencer/ms-ar2016.pdf"
ms_2015_path = "AR/marks_and_spencer/ms-ar2015.pdf"
ms_2014_path = "AR/marks_and_spencer/ms-ar2014.pdf"
ms_2013_path = "AR/marks_and_spencer/ms-ar2013.pdf"
ms_2012_path = "AR/marks_and_spencer/ms-ar2012.pdf"

In [30]:
# Extract text from M&S annual reports
ms_sen_2022 = extract_pdf_text(ms_2022_path)
ms_sen_2021 = extract_pdf_text(ms_2021_path)
ms_sen_2020 = extract_pdf_text(ms_2020_path)
ms_sen_2019 = extract_pdf_text(ms_2019_path)
ms_sen_2018 = extract_pdf_text(ms_2018_path)
ms_sen_2017 = extract_pdf_text(ms_2017_path)
ms_sen_2016 = extract_pdf_text(ms_2016_path)
ms_sen_2015 = extract_pdf_text(ms_2015_path)
ms_sen_2014 = extract_pdf_text(ms_2014_path)
ms_sen_2013 = extract_pdf_text(ms_2013_path)
ms_sen_2012 = extract_pdf_text(ms_2012_path)


Print the sentences extracted from the 2021 annual report.

In [31]:
# print the sentences summary of 2021 M&S annual report
print(f"Total number of sentences of 2021: {len(ms_sen_2021)}")
print("\nPrint the first 10 sentences from MS annual reports: ")
print(ms_sen_2021[:10])
print(f"\nData type: {type(ms_sen_2021)}")

Total number of sentences of 2021: 9052

Print the first 10 sentences from MS annual reports: 
['Marks and Spencer Group plc Annual Report & Financial Statements 2021 Never the Same Again Forging a reshaped M&S Marks and Spencer Group plc Annual Report & Financial Statements & Notice of Annual General Meeting 2021', 'GROUP OVERVIEW £9', '0bn Group revenue (9', '8)p Basic loss per share No dividend paid for 20/21-11', '8% ( 19/20: 1', '3p)50', '5% Percentage of UK Clothing & Home sales online 67% Food: Value for money perception 81 Stores: Net promoter score1 51 M&S', 'com: Net promoter score1(19/20: 22', '5%) (19/20: 63%) (19/20: 68) (19/20: 57)£(201', '2)m Group loss before tax £41']

Data type: <class 'list'>
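
The output above shows that splitting on every full stop also breaks decimal figures: "£9.0bn" becomes `'GROUP OVERVIEW £9'` and `'0bn Group revenue (9'`. One possible refinement, shown here only as a sketch and not as the method used in this notebook, is to split on a full stop unless it sits between two digits. The function name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text):
    """Split text on full stops, except those between two digits (e.g. 9.0).

    Assumption: a full stop with a digit on both sides is a decimal point.
    Abbreviations and web addresses (e.g. 'M&S.com') are still split.
    """
    # Split when the full stop is not preceded by a digit,
    # or when it is not followed by a digit.
    parts = re.split(r"(?<!\d)\.|\.(?!\d)", text)
    # Strip surrounding whitespace and drop empty fragments
    return [part.strip() for part in parts if part.strip()]


text = "Group revenue was £9.0bn. Basic loss per share was 9.8p."
print(split_sentences(text))
# ['Group revenue was £9.0bn', 'Basic loss per share was 9.8p']
```

This keeps figures intact before standardisation, although the numbers themselves are removed later by `standardize_sentences`.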


Second, standardise the sentences of each Marks and Spencer annual report.

In [32]:
# Standardise all sentences from all M&S annual reports
ms_standsent_2022 = standardize_sentences(ms_sen_2022)
ms_standsent_2021 = standardize_sentences(ms_sen_2021)
ms_standsent_2020 = standardize_sentences(ms_sen_2020)
ms_standsent_2019 = standardize_sentences(ms_sen_2019)
ms_standsent_2018 = standardize_sentences(ms_sen_2018)
ms_standsent_2017 = standardize_sentences(ms_sen_2017)
ms_standsent_2016 = standardize_sentences(ms_sen_2016)
ms_standsent_2015 = standardize_sentences(ms_sen_2015)
ms_standsent_2014 = standardize_sentences(ms_sen_2014)
ms_standsent_2013 = standardize_sentences(ms_sen_2013)
ms_standsent_2012 = standardize_sentences(ms_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [33]:
# print the standardized sentences summary of 2021 M&S annual report
print(f"Total number of standardized sentences of 2021: {len(ms_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from MS annual reports: ")
print(ms_standsent_2021[:10])
print(f"\nData type: {type(ms_standsent_2021)}")

Total number of standardized sentences of 2021: 9052

Print the first 10 standardized sentences from MS annual reports: 
['marks spencer group plc annual report financial statements never forging reshaped marks spencer group plc annual report financial statements notice annual general meeting', 'group overview', 'bn group revenue', 'p basic loss per share dividend paid', '', 'p', 'percentage uk clothing home sales online food value money perception stores net promoter score', 'com net promoter score', '', 'group loss tax']

Data type: <class 'list'>
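
The standardised output above still contains empty strings (`''`) and single-letter fragments such as `'p'`, which is why the count stays at 9052. A minimal sketch of an optional filtering step is given below; the function `drop_short_sentences` and its `min_words` threshold are assumptions for illustration and are not part of the original pipeline.

```python
def drop_short_sentences(sentences, min_words=1):
    """Keep only sentences with at least `min_words` whitespace-separated words.

    With the default min_words=1, only empty strings are removed.
    """
    return [s for s in sentences if len(s.split()) >= min_words]


sample = ["group overview", "", "p", "group loss tax"]
print(drop_short_sentences(sample))               # ['group overview', 'p', 'group loss tax']
print(drop_short_sentences(sample, min_words=2))  # ['group overview', 'group loss tax']
```

Whether to raise `min_words` above 1 is a judgement call, because short fragments can still be genuine headings.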


Next, combine all standardised sentences into a single list.

In [34]:
# Combine all standardised sentences into a list
ms_standsent_all = (
    ms_standsent_2022 +
    ms_standsent_2021 +
    ms_standsent_2020 +
    ms_standsent_2019 +
    ms_standsent_2018 +
    ms_standsent_2017 +
    ms_standsent_2016 +
    ms_standsent_2015 +
    ms_standsent_2014 +
    ms_standsent_2013 +
    ms_standsent_2012
)

print(f'Total number of all standardised sentences: {len(ms_standsent_all)}')
print('\nThe first 10 sentences of M&S standardised sentences:')
print(ms_standsent_all[:10])
print(f"\nData type: {type(ms_standsent_all)}")

Total number of all standardised sentences: 76174

The first 10 sentences of M&S standardised sentences:
['friend studio ltd file name cover v modification date may pm marks spencer group plc annual report financial statements next phase transformation stronger team stronger business stronger balance sheetshaping future marks spencer group plc annual report financial statements', 'friend studio ltd file name cover v modification date may pm friend studio ltd file name cover v modification date may pm food p marks spencer group plcm leading british retailer unique heritage strong brand values', 'operate family businesses selling high quality great value brand products services alongside carefully selected range third party brands', 'network stores websites globally together across stores support centres warehouses supply chain colleagues serve million customers year', 'clothing home p people culture p glance international p cover inside stevenage strong example modernising store estate 

## B3. Tesco

In [35]:
# Set the paths of the 11 Tesco annual reports (2012 to 2022)
tesco_2022_path = "AR/tesco/tesco-ar2022.pdf"
tesco_2021_path = "AR/tesco/tesco-ar2021.pdf"
tesco_2020_path = "AR/tesco/tesco-ar2020.pdf"
tesco_2019_path = "AR/tesco/tesco-ar2019.pdf"
tesco_2018_path = "AR/tesco/tesco-ar2018.pdf"
tesco_2017_path = "AR/tesco/tesco-ar2017.pdf"
tesco_2016_path = "AR/tesco/tesco-ar2016.pdf"
tesco_2015_path = "AR/tesco/tesco-ar2015.pdf"
tesco_2014_path = "AR/tesco/tesco-ar2014.pdf"
tesco_2013_path = "AR/tesco/tesco-ar2013.pdf"
tesco_2012_path = "AR/tesco/tesco-ar2012.pdf"

In [36]:
# Extract text from Tesco annual report
tesco_sen_2022 = extract_pdf_text(tesco_2022_path)
tesco_sen_2021 = extract_pdf_text(tesco_2021_path)
tesco_sen_2020 = extract_pdf_text(tesco_2020_path)
tesco_sen_2019 = extract_pdf_text(tesco_2019_path)
tesco_sen_2018 = extract_pdf_text(tesco_2018_path)
tesco_sen_2017 = extract_pdf_text(tesco_2017_path)
tesco_sen_2016 = extract_pdf_text(tesco_2016_path)
tesco_sen_2015 = extract_pdf_text(tesco_2015_path)
tesco_sen_2014 = extract_pdf_text(tesco_2014_path)
tesco_sen_2013 = extract_pdf_text(tesco_2013_path)
tesco_sen_2012 = extract_pdf_text(tesco_2012_path)


Print the sentences from the annual report from 2021.

In [37]:
# print the sentences summary of 2021 Tesco annual report
print(f"Total number of sentences of 2021: {len(tesco_sen_2021)}")
print("\nPrint the first 10 sentences from Tesco annual reports: ")
print(tesco_sen_2021[:10])
print(f"\nData type: {type(tesco_sen_2021)}")

Total number of sentences of 2021: 6916

Print the first 10 sentences from Tesco annual reports: 
['Serving shoppers a little better every day', 'Annual Report and Financial Statements 2021 Tesco PLC Annual Report and Financial Statements 2021', 'Strategic report 3 2021 highlights 4 Tesco at a glance 5 Chairman’s statement 6 Group Chief Executive’s review 8 Engaging with our stakeholders 10 Our business model 11 Key performance indicators 12 Little Helps Plan (LHP) 17 Diversity and inclusion 19 Financial review 26 Task Force on Climate-related Financial Disclosures 29 Non-financial reporting statement 30 Section 172 statement 31 Principal risks and uncertainties 38 Longer-term viability statement Corporate governance 40 Chairman’s letter 42 Board of Directors 47 Executive Committee 48 Compliance with the UK Corporate Governance Code 51 Board leadership and company purpose 58 Division of responsibilities 60 Composition, succession and evaluation 62 Nominations and Governance Committee 6

Second, standardise the sentences of each Tesco annual report.

In [38]:
# Standardise all sentences from all Tesco annual reports
tesco_standsent_2022 = standardize_sentences(tesco_sen_2022)
tesco_standsent_2021 = standardize_sentences(tesco_sen_2021)
tesco_standsent_2020 = standardize_sentences(tesco_sen_2020)
tesco_standsent_2019 = standardize_sentences(tesco_sen_2019)
tesco_standsent_2018 = standardize_sentences(tesco_sen_2018)
tesco_standsent_2017 = standardize_sentences(tesco_sen_2017)
tesco_standsent_2016 = standardize_sentences(tesco_sen_2016)
tesco_standsent_2015 = standardize_sentences(tesco_sen_2015)
tesco_standsent_2014 = standardize_sentences(tesco_sen_2014)
tesco_standsent_2013 = standardize_sentences(tesco_sen_2013)
tesco_standsent_2012 = standardize_sentences(tesco_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [39]:
# print the standardized sentences summary of 2021 Tesco annual report
print(f"Total number of standardized sentences of 2021: {len(tesco_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Tesco annual reports: ")
print(tesco_standsent_2021[:10])
print(f"\nData type: {type(tesco_standsent_2021)}")

Total number of standardized sentences of 2021: 6916

Print the first 10 standardized sentences from Tesco annual reports: 
['serving shoppers little better every day', 'annual report financial statements tesco plc annual report financial statements', 'strategic report highlights tesco glance chairman statement group chief executive review engaging stakeholders business model key performance indicators little helps plan lhp diversity inclusion financial review task force climate related financial disclosures non financial reporting statement section statement principal risks uncertainties longer term viability statement corporate governance chairman letter board directors executive committee compliance uk corporate governance code board leadership company purpose division responsibilities composition succession evaluation nominations governance committee corporate responsibility committee audit committee directors remuneration report directors report financial statements independent au

Next, combine all standardised sentences into a single list.

In [40]:
# Combine all standardised sentences into a list
tesco_standsent_all = (
    tesco_standsent_2022 +
    tesco_standsent_2021 +
    tesco_standsent_2020 +
    tesco_standsent_2019 +
    tesco_standsent_2018 +
    tesco_standsent_2017 +
    tesco_standsent_2016 +
    tesco_standsent_2015 +
    tesco_standsent_2014 +
    tesco_standsent_2013 +
    tesco_standsent_2012
)

print(f'Total number of all standardised sentences: {len(tesco_standsent_all)}')
print('\nThe first 10 Tesco standardised sentences:')
print(tesco_standsent_all[:10])
print(f"\nData type: {type(tesco_standsent_all)}")

Total number of all standardised sentences: 58917

The first 10 Tesco standardised sentences:
['serving customers communities planet little better every day', 'annual report financial statements', 'serving customers communities planet little better every day', 'contents strategic report introduction', 'highlights', 'tesco glance', 'purpose', 'purpose action', 'chairman statement', 'group chief executive review']

Data type: <class 'list'>
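
The combined Tesco output above contains the same cover line twice ('serving customers communities planet little better every day'), since annual reports repeat headers and slogans. If such repeats are unwanted, an order-preserving de-duplication could be applied. This is a sketch only; the notebook itself keeps the duplicates, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(sentences):
    """Remove repeated sentences while keeping the first occurrence of each.

    dict.fromkeys preserves insertion order (Python 3.7+), so the original
    sentence order is retained.
    """
    return list(dict.fromkeys(sentences))


sample = [
    "serving customers communities planet little better every day",
    "annual report financial statements",
    "serving customers communities planet little better every day",
]
print(deduplicate(sample))
# ['serving customers communities planet little better every day',
#  'annual report financial statements']
```

Removing duplicates changes the sentence counts, so it should be applied consistently across all companies if it is used at all.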


## B4. John Lewis & Partners

In [41]:
# Set the paths of the 11 John Lewis annual reports (2012 to 2022)
jl_2022_path = "AR/john_lewis/jl-ar2022.pdf"
jl_2021_path = "AR/john_lewis/jl-ar2021.pdf"
jl_2020_path = "AR/john_lewis/jl-ar2020.pdf"
jl_2019_path = "AR/john_lewis/jl-ar2019.pdf"
jl_2018_path = "AR/john_lewis/jl-ar2018.pdf"
jl_2017_path = "AR/john_lewis/jl-ar2017.pdf"
jl_2016_path = "AR/john_lewis/jl-ar2016.pdf"
jl_2015_path = "AR/john_lewis/jl-ar2015.pdf"
jl_2014_path = "AR/john_lewis/jl-ar2014.pdf"
jl_2013_path = "AR/john_lewis/jl-ar2013.pdf"
jl_2012_path = "AR/john_lewis/jl-ar2012.pdf"

In [42]:
# Extract text from John Lewis annual report
jl_sen_2022 = extract_pdf_text(jl_2022_path)
jl_sen_2021 = extract_pdf_text(jl_2021_path)
jl_sen_2020 = extract_pdf_text(jl_2020_path)
jl_sen_2019 = extract_pdf_text(jl_2019_path)
jl_sen_2018 = extract_pdf_text(jl_2018_path)
jl_sen_2017 = extract_pdf_text(jl_2017_path)
jl_sen_2016 = extract_pdf_text(jl_2016_path)
jl_sen_2015 = extract_pdf_text(jl_2015_path)
jl_sen_2014 = extract_pdf_text(jl_2014_path)
jl_sen_2013 = extract_pdf_text(jl_2013_path)
jl_sen_2012 = extract_pdf_text(jl_2012_path)


Print the sentences from the annual report from 2021.

In [43]:
# print the sentences summary of 2021 John Lewis annual report
print(f"Total number of sentences of 2021: {len(jl_sen_2021)}")
print("\nPrint the first 10 sentences from John Lewis annual reports: ")
print(jl_sen_2021[:10])
print(f"\nData type: {type(jl_sen_2021)}")

Total number of sentences of 2021: 5902

Print the first 10 sentences from John Lewis annual reports: 
['John Lewis Partnership plc Annual Report and Accounts 2021', 'CONTENTS STRATEGIC REPO RT Message from the Chairman - Emerging stronger 4 Who we are - Our purpose 8 At a glance - Our year 9 At a glance - Our financial performance 11 How we are different - Our Partnership business model 13 How we are different - Our culture 15 Partnership for good - Supporting our Partners and communities 17 Be Yourself', 'Always - Our diversity and inclusion plan 19 Working together - Our Partnership Plan 21 Market review - Market context and key trends shaping retail 24 Our Partnership Plan - Retail customers love 28 Our Partnership Plan - Inspirational new services 31 Our Partnership Plan - Partnerships for growth 32 Our Partnership Plan - Lean, simple, fast 34 Our Partnership Plan - Our Ethics and Sustainability Strategy 36 Tackling climate change - Task Force on Climate-related Financial Disclosu

Second, standardise the sentences of each John Lewis annual report.

In [44]:
# Standardise all sentences from all John Lewis annual reports
jl_standsent_2022 = standardize_sentences(jl_sen_2022)
jl_standsent_2021 = standardize_sentences(jl_sen_2021)
jl_standsent_2020 = standardize_sentences(jl_sen_2020)
jl_standsent_2019 = standardize_sentences(jl_sen_2019)
jl_standsent_2018 = standardize_sentences(jl_sen_2018)
jl_standsent_2017 = standardize_sentences(jl_sen_2017)
jl_standsent_2016 = standardize_sentences(jl_sen_2016)
jl_standsent_2015 = standardize_sentences(jl_sen_2015)
jl_standsent_2014 = standardize_sentences(jl_sen_2014)
jl_standsent_2013 = standardize_sentences(jl_sen_2013)
jl_standsent_2012 = standardize_sentences(jl_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [45]:
# print the standardized sentences summary of 2021 John Lewis annual report
print(f"Total number of standardized sentences of 2021: {len(jl_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from John Lewis annual reports: ")
print(jl_standsent_2021[:10])
print(f"\nData type: {type(jl_standsent_2021)}")

Total number of standardized sentences of 2021: 5902

Print the first 10 standardized sentences from John Lewis annual reports: 
['john lewis partnership plc annual report accounts', 'contents strategic repo rt message chairman emerging stronger purpose glance year glance financial performance different partnership business model different culture partnership good supporting partners communities', 'always diversity inclusion plan working together partnership plan market review market context key trends shaping retail partnership plan retail customers love partnership plan inspirational new services partnership plan partnerships growth partnership plan lean simple fast partnership plan ethics sustainability strategy tackling climate change task force climate related financial disclosures section statement promoting success partnership risks uncertainties managing risks governance repo rt governance governance shared partnership governance partnership purpose values governance chairman g

Next, combine all standardised sentences into a single list.

In [46]:
# Combine all standardised sentences into a list
jl_standsent_all = (
    jl_standsent_2022 +
    jl_standsent_2021 +
    jl_standsent_2020 +
    jl_standsent_2019 +
    jl_standsent_2018 +
    jl_standsent_2017 +
    jl_standsent_2016 +
    jl_standsent_2015 +
    jl_standsent_2014 +
    jl_standsent_2013 +
    jl_standsent_2012
)
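
The same concatenation can be written without listing every year by name. The sketch below assumes the per-year lists are held in a dictionary keyed by year, which is not how the notebook stores them; the function `combine_years` and the toy data are hypothetical.

```python
from itertools import chain


def combine_years(sentences_by_year):
    """Concatenate per-year sentence lists into one list, newest year first.

    Assumption: `sentences_by_year` maps a year (int) to a list of sentences.
    """
    years = sorted(sentences_by_year, reverse=True)
    return list(chain.from_iterable(sentences_by_year[year] for year in years))


# Toy data standing in for the real standardised sentences
toy = {
    2021: ["sentence from 2021"],
    2022: ["first sentence from 2022", "second sentence from 2022"],
}
print(combine_years(toy))
# ['first sentence from 2022', 'second sentence from 2022', 'sentence from 2021']
```

Sorting the years in descending order reproduces the 2022-to-2012 ordering used in the cells above.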


## B5. Morrisons

For Morrisons, the 2012 and 2013 annual reports were not available online, so only the reports from 2014 to 2022 are used.

In [47]:
# Setting the path of Morrisons annual reports from 2014 to 2022
morri_2022_path = "AR/morrisons/morri-ar2022.pdf"
morri_2021_path = "AR/morrisons/morri-ar2021.pdf"
morri_2020_path = "AR/morrisons/morri-ar2020.pdf"
morri_2019_path = "AR/morrisons/morri-ar2019.pdf"
morri_2018_path = "AR/morrisons/morri-ar2018.pdf"
morri_2017_path = "AR/morrisons/morri-ar2017_removed.pdf" # removed encryption
morri_2016_path = "AR/morrisons/morri-ar2016.pdf"
morri_2015_path = "AR/morrisons/morri-ar2015.pdf"
morri_2014_path = "AR/morrisons/morri-ar2014.pdf"
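
Because some years are missing for Morrisons, it may help to check which report files actually exist before extracting text. The following is a minimal sketch using only the standard library; the function `existing_reports` and the dictionary layout are assumptions, not part of the original notebook.

```python
from pathlib import Path


def existing_reports(paths_by_year):
    """Return only the {year: path} entries whose file exists on disk."""
    return {
        year: path
        for year, path in paths_by_year.items()
        if Path(path).is_file()
    }


# Candidate paths for 2012 to 2022; years without a file are skipped
candidates = {year: f"AR/morrisons/morri-ar{year}.pdf" for year in range(2012, 2023)}
available = existing_reports(candidates)
print(f"Reports found for years: {sorted(available)}")
```

Note that the 2017 report uses a different file name (`morri-ar2017_removed.pdf`) in the cell above, so it would need its own entry in `candidates`.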


In [48]:
# Extract text from Morrisons annual report
morri_sen_2022 = extract_pdf_text(morri_2022_path)
morri_sen_2021 = extract_pdf_text(morri_2021_path)
morri_sen_2020 = extract_pdf_text(morri_2020_path)
morri_sen_2019 = extract_pdf_text(morri_2019_path)
morri_sen_2018 = extract_pdf_text(morri_2018_path)
morri_sen_2017 = extract_pdf_text(morri_2017_path)
morri_sen_2016 = extract_pdf_text(morri_2016_path)
morri_sen_2015 = extract_pdf_text(morri_2015_path)
morri_sen_2014 = extract_pdf_text(morri_2014_path)


Print the sentences from the annual report from 2021.

In [49]:
# print the sentences summary of 2021 Morrisons annual report
print(f"Total number of sentences of 2021: {len(morri_sen_2021)}")
print("\nPrint the first 10 sentences from Morrisons annual reports: ")
print(morri_sen_2021[:10])
print(f"\nData type: {type(morri_sen_2021)}")

Total number of sentences of 2021: 4785

Print the first 10 sentences from Morrisons annual reports: 
['Wm Morrison Supermarkets PLC Annual Report and Financial Statements 2020/21We are responding NHS hourdoorstep deliveries to the vulnerable650,000+ WE ARE DONATING LUNCHBOXES DAILY OVER T H E S C H O O L HOLIDAYS Wm Morrison Supermarkets PLC An nual Report and Financial Statements 2020/21Wm Morrison Supermarkets PLCHilmore House, Gain Lane Bradford BD3 7DL Telephone: 0345 611 5000 Visit our website: www', 'morrisons', 'com 24629_Morrisons AR_Covers_2021_29-04-21', 'indd 1-324629_Morrisons AR_Covers_2021_29-04-21', 'indd 1-3 29/04/2021 18:0929/04/2021 18:09', '* Alternative Performance Measures as defined in the Glossary on pages 157 to 159', 'We are responding to the global crisis by playing our full part in feeding the nation', 'Our core purpose remains: to make and provide food we’re all proud of, where everyone’s effort is worthwhile, so more and more people can afford to enjoy eat

Second, standardise the sentences of each Morrisons annual report.

In [50]:
# Standardise all sentences from all Morrisons annual reports
morri_standsent_2022 = standardize_sentences(morri_sen_2022)
morri_standsent_2021 = standardize_sentences(morri_sen_2021)
morri_standsent_2020 = standardize_sentences(morri_sen_2020)
morri_standsent_2019 = standardize_sentences(morri_sen_2019)
morri_standsent_2018 = standardize_sentences(morri_sen_2018)
morri_standsent_2017 = standardize_sentences(morri_sen_2017)
morri_standsent_2016 = standardize_sentences(morri_sen_2016)
morri_standsent_2015 = standardize_sentences(morri_sen_2015)
morri_standsent_2014 = standardize_sentences(morri_sen_2014)


Print a short summary of the standardised sentences from the 2021 annual report.

In [51]:
# print the standardized sentences summary of 2021 Morrisons annual report
print(f"Total number of standardized sentences of 2021: {len(morri_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Morrisons annual reports: ")
print(morri_standsent_2021[:10])
print(f"\nData type: {type(morri_standsent_2021)}")

Total number of standardized sentences of 2021: 4785

Print the first 10 standardized sentences from Morrisons annual reports: 
['wm morrison supermarkets plc annual report financial statements responding nhs hourdoorstep deliveries vulnerable donating lunchboxes daily h e c h l holidays wm morrison supermarkets plc nual report financial statements wm morrison supermarkets plchilmore house gain lane bradford bd dl telephone visit website www', 'morrisons', 'com morrisons ar covers', 'indd morrisons ar covers', 'indd', 'alternative performance measures defined glossary pages', 'responding global crisis playing full part feeding nation', 'core purpose remains make provide food proud everyone effort worthwhile people afford enjoy eating well', 'feeding nation throughout directors report strategic report unless otherwise stated refers week period ended january refers week period ended february', 'refer calendar years']

Data type: <class 'list'>


Next, combine all standardised sentences into a single list.

In [52]:
# Combine all standardised sentences into a list
morri_standsent_all = (
    morri_standsent_2022 +
    morri_standsent_2021 +
    morri_standsent_2020 +
    morri_standsent_2019 +
    morri_standsent_2018 +
    morri_standsent_2017 +
    morri_standsent_2016 +
    morri_standsent_2015 +
    morri_standsent_2014
)

print(f'Total number of all standardised sentences: {len(morri_standsent_all)}')
print('\nThe first 10 Morrisons standardised sentences:')
print(morri_standsent_all[:10])
print(f"\nData type: {type(morri_standsent_all)}")

Total number of all standardised sentences: 31730

The first 10 Morrisons standardised sentences:
['company registration number wm morrison supermarkets limited annual report financial statements weeks ended october', 'contents company information strategic report principal activities business model financial results customers colleagues suppliers protecting environment supporting communities managing risks section governance report directors report statement directors responsibilities financial statements independent auditors report members wm morrison supermarkets limited consolidated income statement consolidated statement comprehensive income consolidated statement financial position consolidated statement cash flows consolidated statement changes equity general information notes group financial statements company statement financial position company statement changes equity company accounting policies notes company financial statements related undertakings supplementary infor

## B6. Co-op Group

In [53]:
# Setting the path of Co-op annual reports from 2012 to 2022
coop_2022_path = "AR/co_op/coop-ar2022.pdf"
coop_2021_path = "AR/co_op/coop-ar2021.pdf"
coop_2020_path = "AR/co_op/coop-ar2020.pdf"
coop_2019_path = "AR/co_op/coop-ar2019.pdf"
coop_2018_path = "AR/co_op/coop-ar2018.pdf"
coop_2017_path = "AR/co_op/coop-ar2017.pdf"
coop_2016_path = "AR/co_op/coop-ar2016.pdf"
coop_2015_path = "AR/co_op/coop-ar2015.pdf"
coop_2014_path = "AR/co_op/coop-ar2014.pdf"
coop_2013_path = "AR/co_op/coop-ar2013.pdf"
coop_2012_path = "AR/co_op/coop-ar2012.pdf"

In [54]:
# Extract text from Co-op annual report
coop_sen_2022 = extract_pdf_text(coop_2022_path)
coop_sen_2021 = extract_pdf_text(coop_2021_path)
coop_sen_2020 = extract_pdf_text(coop_2020_path)
coop_sen_2019 = extract_pdf_text(coop_2019_path)
coop_sen_2018 = extract_pdf_text(coop_2018_path)
coop_sen_2017 = extract_pdf_text(coop_2017_path)
coop_sen_2016 = extract_pdf_text(coop_2016_path)
coop_sen_2015 = extract_pdf_text(coop_2015_path)
coop_sen_2014 = extract_pdf_text(coop_2014_path)
coop_sen_2013 = extract_pdf_text(coop_2013_path)
coop_sen_2012 = extract_pdf_text(coop_2012_path)


Print the sentences from the annual report from 2021.

In [55]:
# print the sentences summary of 2021 Co-op annual report
print(f"Total number of sentences of 2021: {len(coop_sen_2021)}")
print("\nPrint the first 10 sentences from Co-op annual reports: ")
print(coop_sen_2021[:10])
print(f"\nData type: {type(coop_sen_2021)}")

Total number of sentences of 2021: 4329

Print the first 10 sentences from Co-op annual reports: 
['Co-op Annual Report & Accounts for 2021Co-operating for a Fairer World', 'Contents Strategic report 3 2021 in brief 4 Co-operating for a Fairer World 5 Chair’s introduction – Allan Leighton 7 Report from the President of the National Members’ Council – Denise Scott-McDonald 9 Chief Executive’s overview - Steve Murrells 12 Business unit updates 19 Fairer for our members and communities 25 Fairer for our colleagues 28 Fairer for our planet 30 Creating an even stronger and more agile Co-op 33 Our financial performance 41 Key performance indicators 43 Risk management Governance reports 54 Board biographies 57 Executive biographies 59 Governance review 72 The report of the Risk and Audit Committee83 The report of the Remuneration Committee 101 The report of the Nominations Committee 105 Directors’ report 113 Statement of Co-op Board 115 Co-op’s National Members’ Council: annual statement for 

Second, standardise the sentences of each Co-op annual report.

In [56]:
# Standardise all sentences from all Co-op annual reports
coop_standsent_2022 = standardize_sentences(coop_sen_2022)
coop_standsent_2021 = standardize_sentences(coop_sen_2021)
coop_standsent_2020 = standardize_sentences(coop_sen_2020)
coop_standsent_2019 = standardize_sentences(coop_sen_2019)
coop_standsent_2018 = standardize_sentences(coop_sen_2018)
coop_standsent_2017 = standardize_sentences(coop_sen_2017)
coop_standsent_2016 = standardize_sentences(coop_sen_2016)
coop_standsent_2015 = standardize_sentences(coop_sen_2015)
coop_standsent_2014 = standardize_sentences(coop_sen_2014)
coop_standsent_2013 = standardize_sentences(coop_sen_2013)
coop_standsent_2012 = standardize_sentences(coop_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [57]:
# print the standardized sentences summary of 2021 Co-op annual report
print(f"Total number of standardized sentences of 2021: {len(coop_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Co-op annual reports: ")
print(coop_standsent_2021[:10])
print(f"\nData type: {type(coop_standsent_2021)}")

Total number of standardized sentences of 2021: 4329

Print the first 10 standardized sentences from Co-op annual reports: 
['co op annual report accounts co operating fairer world', 'contents strategic report brief co operating fairer world chair introduction allan leighton report president national members council denise scott mcdonald chief executive overview steve murrells business unit updates fairer members communities fairer colleagues fairer planet creating even stronger agile co op financial performance key performance indicators risk management governance reports board biographies executive biographies governance review report risk audit committee report remuneration committee report nominations committee directors report statement co op board co op national members council annual statement report scrutiny committee promoting success co op financial statements consolidated income statement consolidated statement comprehensive income consolidated balance sheet consolidated sta

Next, combine all standardised sentences into a single list.

In [58]:
# Combine all standardised sentences into a list
coop_standsent_all = (
    coop_standsent_2022 +
    coop_standsent_2021 +
    coop_standsent_2020 +
    coop_standsent_2019 +
    coop_standsent_2018 +
    coop_standsent_2017 +
    coop_standsent_2016 +
    coop_standsent_2015 +
    coop_standsent_2014 +
    coop_standsent_2013 +
    coop_standsent_2012
)

print(f'Total number of all standardised sentences: {len(coop_standsent_all)}')
print('\nThe first 10 Co-op standardised sentences:')
print(coop_standsent_all[:10])
print(f"\nData type: {type(coop_standsent_all)}")

Total number of all standardised sentences: 44736

The first 10 Co-op standardised sentences:
['co op annual report accounts co operating fairer world', 'contents strategic report brief co operating fairer world chair introduction allan leighton report president national members council denise scott mcdonald chief executive overview shirine khoury haq financial overview membership update business unit updates vision update financial performance key performance indicators risk management governance reports board biographies executive biographies governance review report risk audit committee report remuneration committee report nominations committee directors report statement co op board co op national members council annual statement report scrutiny committee promoting success co op financial statements consolidated income statement consolidated statement comprehensive income consolidated balance sheet consolidated statement changes equity consolidated statement cash flows notes fi

## B7. Associated British Foods

In [59]:
# Setting the path of ABF annual reports from 2012 to 2022
abf_2022_path = "AR/abf/abf-ar2022.pdf"
abf_2021_path = "AR/abf/abf-ar2021.pdf"
abf_2020_path = "AR/abf/abf-ar2020.pdf"
abf_2019_path = "AR/abf/abf-ar2019.pdf"
abf_2018_path = "AR/abf/abf-ar2018.pdf"
abf_2017_path = "AR/abf/abf-ar2017.pdf"
abf_2016_path = "AR/abf/abf-ar2016.pdf"
abf_2015_path = "AR/abf/abf-ar2015.pdf"
abf_2014_path = "AR/abf/abf-ar2014.pdf"
abf_2013_path = "AR/abf/abf-ar2013.pdf"
abf_2012_path = "AR/abf/abf-ar2012.pdf"

In [60]:
# Extract text from ABF annual report
abf_sen_2022 = extract_pdf_text(abf_2022_path)
abf_sen_2021 = extract_pdf_text(abf_2021_path)
abf_sen_2020 = extract_pdf_text(abf_2020_path)
abf_sen_2019 = extract_pdf_text(abf_2019_path)
abf_sen_2018 = extract_pdf_text(abf_2018_path)
abf_sen_2017 = extract_pdf_text(abf_2017_path)
abf_sen_2016 = extract_pdf_text(abf_2016_path)
abf_sen_2015 = extract_pdf_text(abf_2015_path)
abf_sen_2014 = extract_pdf_text(abf_2014_path)
abf_sen_2013 = extract_pdf_text(abf_2013_path)
abf_sen_2012 = extract_pdf_text(abf_2012_path)


Print the sentences from the annual report from 2021.

In [61]:
# print the sentences summary of 2021 ABF annual report
print(f"Total number of sentences of 2021: {len(abf_sen_2021)}")
print("\nPrint the first 10 sentences from ABF annual reports: ")
print(abf_sen_2021[:10])
print(f"\nData type: {type(abf_sen_2021)}")

Total number of sentences of 2021: 4863

Print the first 10 sentences from ABF annual reports: 
['Associated British Foods plc Annual Report and Accounts 2021 Annual Report 2021 Creating value', 'Group revenue £13', '9bn (2020: £13', '9bn)Adjusted profit before tax £908m (2020: £914m) Dividends per share 26', '7p (2020: Nil)Special dividend per share 13', '8p Net cash before lease liabilities £1,901m (2020: £1,558m) Profit before tax £725m (2020: £686m)Adjusted operating profit £1,011m (2020: £1,024m) Adjusted earnings per share 80', '1p (2020: 81', '1p) Gross investment £721m (2020: £641m) Operating profit £808m (2020: £810m)Basic earnings per share 60', '5p (2020: 57', "6p)2021 GROUP FINANCIAL HIGHLIGHTS Strategic report IFC 2021 Group financial highlights IFC At a glance 1 Introduction 12 Chairman's statement 16 Chief Executive's statement 18 Our business model and strategy 20 Key performance indicators 22 Operating review 22 Grocery 32 Sugar 40 Agriculture 46 Ingredients 52 Retail 

Second, standardise the sentences of each ABF annual report.

In [62]:
# Standardise all sentences from all ABF annual reports
abf_standsent_2022 = standardize_sentences(abf_sen_2022)
abf_standsent_2021 = standardize_sentences(abf_sen_2021)
abf_standsent_2020 = standardize_sentences(abf_sen_2020)
abf_standsent_2019 = standardize_sentences(abf_sen_2019)
abf_standsent_2018 = standardize_sentences(abf_sen_2018)
abf_standsent_2017 = standardize_sentences(abf_sen_2017)
abf_standsent_2016 = standardize_sentences(abf_sen_2016)
abf_standsent_2015 = standardize_sentences(abf_sen_2015)
abf_standsent_2014 = standardize_sentences(abf_sen_2014)
abf_standsent_2013 = standardize_sentences(abf_sen_2013)
abf_standsent_2012 = standardize_sentences(abf_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [63]:
# print the standardized sentences summary of 2021 ABF annual report
print(f"Total number of standardized sentences of 2021: {len(abf_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from ABF annual reports: ")
print(abf_standsent_2021[:10])
print(f"\nData type: {type(abf_standsent_2021)}")

Total number of standardized sentences of 2021: 4863

Print the first 10 standardized sentences from ABF annual reports: 
['associated british foods plc annual report accounts annual report creating value', 'group revenue', 'bn', 'bn adjusted profit tax dividends per share', 'p nil special dividend per share', 'p net cash lease liabilities profit tax adjusted operating profit adjusted earnings per share', 'p', 'p gross investment operating profit basic earnings per share', 'p', 'p group financial highlights strategic report ifc group financial highlights ifc glance introduction chairman statement chief executive statement business model strategy key performance indicators operating review grocery sugar agriculture ingredients retail financial review section stakeholders responsibility climate related financial disclosures tcfd principal risks uncertainties viability statement going concerncontents governance chairman introduction board directors corporate governance directors remunerat

Next, combine all the standardised sentences into a single list.

In [64]:
# Combine all standardised sentences into a single list
abf_standsent_all = (
    abf_standsent_2022 +
    abf_standsent_2021 +
    abf_standsent_2020 +
    abf_standsent_2019 +
    abf_standsent_2018 +
    abf_standsent_2017 +
    abf_standsent_2016 +
    abf_standsent_2015 +
    abf_standsent_2014 +
    abf_standsent_2013 +
    abf_standsent_2012
)

print(f'Total number of all standardised sentences: {len(abf_standsent_all)}')
print('\nThe first 10 ABF standardised sentences:')
print(abf_standsent_all[:10])
print(f"\nData type: {type(abf_standsent_all)}")

Total number of all standardised sentences: 45229

The first 10 ABF standardised sentences:
['invested future annual report associated british foods plc annual report accounts', 'grocery ingredients retailsugar agricultureat glance operating businesses brandsrevenue adjusted operating profit one largest sugar producers world uk largest animal feed business uk households use grocery brands one leading suppliers specialty yeast ingredients globally one largest fashion retailers europeour grocery division employs people comprises brands occupy leading positions markets across globe', 'uk nine households use brands', 'twinings ovaltine brands enjoyed countries worldwide', 'ab sugar leading producer sugar sugar derived co products africa uk spain north east china', 'ab agri leading international agri food business operating across supply chain producing marketing animal feed nutrition technology based products services', 'ingredients businesses leaders yeast bakery ingredients well specialt
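The `+` concatenation above can also be written with `itertools.chain`, which may be easier to extend if more years are added. The sketch below is only an illustration: the helper name `combine_years` and the toy lists are assumptions standing in for the real `abf_standsent_<year>` lists.

```python
from itertools import chain

def combine_years(sentences_by_year):
    """Flatten per-year sentence lists into one list, newest year first."""
    years = sorted(sentences_by_year, reverse=True)
    return list(chain.from_iterable(sentences_by_year[y] for y in years))

# Toy data standing in for the real per-year lists (hypothetical values)
toy = {
    2021: ["group revenue", "operating profit"],
    2022: ["invested future annual report"],
}
combined = combine_years(toy)
print(len(combined))
print(combined[:2])
```

Keeping the lists in a dictionary keyed by year also preserves which report each sentence came from, should that be needed later.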

## B8. Next

In [65]:
# Setting the path of Next annual reports from 2012 to 2022
nxt_2022_path = "AR/next/nxt-ar2022.pdf"
nxt_2021_path = "AR/next/nxt-ar2021.pdf"
nxt_2020_path = "AR/next/nxt-ar2020.pdf"
nxt_2019_path = "AR/next/nxt-ar2019.pdf"
nxt_2018_path = "AR/next/nxt-ar2018.pdf"
nxt_2017_path = "AR/next/nxt-ar2017.pdf"
nxt_2016_path = "AR/next/nxt-ar2016.pdf"
nxt_2015_path = "AR/next/nxt-ar2015.pdf"
nxt_2014_path = "AR/next/nxt-ar2014.pdf"
nxt_2013_path = "AR/next/nxt-ar2013.pdf"
nxt_2012_path = "AR/next/nxt-ar2012.pdf"
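The eleven paths above follow one naming pattern, so they could be generated rather than typed out. This is a minimal sketch under that assumption; `build_report_paths` is a hypothetical helper, not part of the notebook.

```python
def build_report_paths(folder, prefix, first_year, last_year):
    """Return {year: path} following the AR/<folder>/<prefix>-ar<year>.pdf pattern."""
    return {
        year: f"AR/{folder}/{prefix}-ar{year}.pdf"
        for year in range(last_year, first_year - 1, -1)  # newest year first
    }

nxt_paths = build_report_paths("next", "nxt", 2012, 2022)
print(nxt_paths[2022])
print(len(nxt_paths))
```

The same call with a different folder, prefix and year range would cover the other companies, provided their files follow the same pattern.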

In [66]:
# Extract text from Next annual report
nxt_sen_2022 = extract_pdf_text(nxt_2022_path)
nxt_sen_2021 = extract_pdf_text(nxt_2021_path)
nxt_sen_2020 = extract_pdf_text(nxt_2020_path)
nxt_sen_2019 = extract_pdf_text(nxt_2019_path)
nxt_sen_2018 = extract_pdf_text(nxt_2018_path)
nxt_sen_2017 = extract_pdf_text(nxt_2017_path)
nxt_sen_2016 = extract_pdf_text(nxt_2016_path)
nxt_sen_2015 = extract_pdf_text(nxt_2015_path)
nxt_sen_2014 = extract_pdf_text(nxt_2014_path)
nxt_sen_2013 = extract_pdf_text(nxt_2013_path)
nxt_sen_2012 = extract_pdf_text(nxt_2012_path)


Print the sentences extracted from the 2021 annual report.

In [67]:
# print the sentences summary of 2021 Next annual report
print(f"Total number of sentences of 2021: {len(nxt_sen_2021)}")
print("\nPrint the first 10 sentences from Next annual reports: ")
print(nxt_sen_2021[:10])
print(f"\nData type: {type(nxt_sen_2021)}")

Total number of sentences of 2021: 8619

Print the first 10 sentences from Next annual reports: 
['ANNUAL REPORT & ACCOUNTS JANUARY 2022', 'CONTENTS Strategic Report 2 Chairman’s Statement 3 Chief Executive’s Review 74 Business Model 76 Key Performance Indicators 78 Risks and Uncertainties 87 Viability Assessment 89 Corporate Responsibility 110 Section 172 Statement 114 Non-Financial Information Statement Governance 116 Directors’ Biographies 118 Directors’ Responsibilities 119 Corporate Governance Report 126 Nomination Committee Report 127 Audit Committee Report 135 Remuneration Report 160 Directors’ Report 162 Independent Auditor’s Report Financial Statements Group Financial Statements 173 Consolidated Income Statement 174 Consolidated Statement of Comprehensive Income 175 Consolidated Balance Sheet 176 Consolidated Statement of Changes in Equity 177 Consolidated Cash Flow Statement 178 Group Accounting Policies 192 Notes to the Consolidated Financial Statements Parent Company Financ

Second, standardise the sentences from each annual report.

In [68]:
# Standardise all sentences from all Next annual reports
nxt_standsent_2022 = standardize_sentences(nxt_sen_2022)
nxt_standsent_2021 = standardize_sentences(nxt_sen_2021)
nxt_standsent_2020 = standardize_sentences(nxt_sen_2020)
nxt_standsent_2019 = standardize_sentences(nxt_sen_2019)
nxt_standsent_2018 = standardize_sentences(nxt_sen_2018)
nxt_standsent_2017 = standardize_sentences(nxt_sen_2017)
nxt_standsent_2016 = standardize_sentences(nxt_sen_2016)
nxt_standsent_2015 = standardize_sentences(nxt_sen_2015)
nxt_standsent_2014 = standardize_sentences(nxt_sen_2014)
nxt_standsent_2013 = standardize_sentences(nxt_sen_2013)
nxt_standsent_2012 = standardize_sentences(nxt_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [69]:
# print the standardized sentences summary of 2021 Next annual report
print(f"Total number of standardized sentences of 2021: {len(nxt_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Next annual reports: ")
print(nxt_standsent_2021[:10])
print(f"\nData type: {type(nxt_standsent_2021)}")

Total number of standardized sentences of 2021: 8619

Print the first 10 standardized sentences from Next annual reports: 
['annual report accounts january', 'contents strategic report chairman statement chief executive review business model key performance indicators risks uncertainties viability assessment corporate responsibility section statement non financial information statement governance directors biographies directors responsibilities corporate governance report nomination committee report audit committee report remuneration report directors report independent auditor report financial statements group financial statements consolidated income statement consolidated statement comprehensive income consolidated balance sheet consolidated statement changes equity consolidated cash flow statement group accounting policies notes consolidated financial statements parent company financial statements parent company balance sheet parent company statement changes equity notes parent comp

Next, combine all the standardised sentences into a single list.

In [70]:
# Combine all standardised sentences into a single list
nxt_standsent_all = (
    nxt_standsent_2022 +
    nxt_standsent_2021 +
    nxt_standsent_2020 +
    nxt_standsent_2019 +
    nxt_standsent_2018 +
    nxt_standsent_2017 +
    nxt_standsent_2016 +
    nxt_standsent_2015 +
    nxt_standsent_2014 +
    nxt_standsent_2013 +
    nxt_standsent_2012
)

print(f'Total number of all standardised sentences: {len(nxt_standsent_all)}')
print('\nThe first 10 Next standardised sentences:')
print(nxt_standsent_all[:10])
print(f"\nData type: {type(nxt_standsent_all)}")

Total number of all standardised sentences: 67838

The first 10 Next standardised sentences:
['annual report accounts january', 'contents strategic report chairman statement chief executive review business model key performance indicators risks uncertainties viability assessment corporate responsibility section statement non financial information statement governance directors biographies irectors responsibilities statement corporate governance report nomination committee report audit committee report remuneration report directors report independent auditors reportfinancial statements group financial statements consolidated income statement c onsolidated statement comprehensive income consolidated balance sheet c onsolidated statement changes equity consolidated cash flow statement group accounting policies n otes consolidated financial statements parent company financial statements parent company balance sheet p arent company statement changes equity n otes parent company financial st

## B9. Ted Baker

In [71]:
# Setting the path of Ted Baker annual reports from 2012 to 2022
tbk_2022_path = "AR/ted_baker/tbk-ar2022.pdf"
tbk_2021_path = "AR/ted_baker/tbk-ar2021.pdf"
tbk_2020_path = "AR/ted_baker/tbk-ar2020.pdf"
tbk_2019_path = "AR/ted_baker/tbk-ar2019.pdf"
tbk_2018_path = "AR/ted_baker/tbk-ar2018.pdf"
tbk_2017_path = "AR/ted_baker/tbk-ar2017.pdf"
tbk_2016_path = "AR/ted_baker/tbk-ar2016.pdf"
tbk_2015_path = "AR/ted_baker/tbk-ar2015.pdf"
tbk_2014_path = "AR/ted_baker/tbk-ar2014.pdf"
tbk_2013_path = "AR/ted_baker/tbk-ar2013.pdf"
tbk_2012_path = "AR/ted_baker/tbk-ar2012.pdf"


In [72]:
# Extract text from annual report
tbk_sen_2022 = extract_pdf_text(tbk_2022_path)
tbk_sen_2021 = extract_pdf_text(tbk_2021_path)
tbk_sen_2020 = extract_pdf_text(tbk_2020_path)
tbk_sen_2019 = extract_pdf_text(tbk_2019_path)
tbk_sen_2018 = extract_pdf_text(tbk_2018_path)
tbk_sen_2017 = extract_pdf_text(tbk_2017_path)
tbk_sen_2016 = extract_pdf_text(tbk_2016_path)
tbk_sen_2015 = extract_pdf_text(tbk_2015_path)
tbk_sen_2014 = extract_pdf_text(tbk_2014_path)
tbk_sen_2013 = extract_pdf_text(tbk_2013_path)
tbk_sen_2012 = extract_pdf_text(tbk_2012_path)


Print the sentences extracted from the 2021 annual report.

In [73]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(tbk_sen_2021)}")
print("\nPrint the first 10 sentences from Ted Baker annual reports: ")
print(tbk_sen_2021[:10])
print(f"\nData type: {type(tbk_sen_2021)}")

Total number of sentences of 2021: 3577

Print the first 10 sentences from Ted Baker annual reports: 
['ANNUAL REPORT— — ’2 1 REWRITING THE SCRIPT', 'TBAR–’2 1 STRATEGIC REPORT TED BAKER TODAY 2 Chief Executive’s review and introduction to Ted Baker 10 Our Chair, John Barton TAKING TED BAKER INTO THE FUTURE 12 Our business model 14 — Our customers 16 — Design, source and make 18 — Sell 20 Our strategy REVIEW OF THE YEAR 22 Chief Financial Officer’s introduction 24 Key performance indicators 26 Financial/operational review 34 Our sustainability stor y 35 — People 38 — Ethical sourcing programme 41 — Communities 42 — Planet 46 — Fashioning a better future 48 Risk report 54 Viability statement and going concernGOVERNANCE REPORT 58 Board of Directors 60 Executive Team 62 Chair’s introduction to governance 63 Corporate governance 72 Audit & Risk Committee Report 76 Nominations Committee Report 80 Remuneration Report 94 Directors’ Report 97 Statement of Directors’ responsibilities FINANCIAL 

Second, standardise the sentences from each annual report.

In [74]:
# Standardise all sentences from all annual reports
tbk_standsent_2022 = standardize_sentences(tbk_sen_2022)
tbk_standsent_2021 = standardize_sentences(tbk_sen_2021)
tbk_standsent_2020 = standardize_sentences(tbk_sen_2020)
tbk_standsent_2019 = standardize_sentences(tbk_sen_2019)
tbk_standsent_2018 = standardize_sentences(tbk_sen_2018)
tbk_standsent_2017 = standardize_sentences(tbk_sen_2017)
tbk_standsent_2016 = standardize_sentences(tbk_sen_2016)
tbk_standsent_2015 = standardize_sentences(tbk_sen_2015)
tbk_standsent_2014 = standardize_sentences(tbk_sen_2014)
tbk_standsent_2013 = standardize_sentences(tbk_sen_2013)
tbk_standsent_2012 = standardize_sentences(tbk_sen_2012)
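The extract-then-standardise pattern repeated for every year could be expressed as one loop. The sketch below is a hedged illustration: `process_reports`, `fake_extract` and `fake_standardise` are hypothetical names, and the two stand-in functions exist only so the example runs without the PDF files.

```python
def process_reports(paths_by_year, extract, standardise):
    """Apply extract, then standardise, to every report, keyed by year."""
    return {
        year: standardise(extract(path))
        for year, path in paths_by_year.items()
    }

# Stand-ins so the sketch runs without the PDF files (hypothetical)
def fake_extract(path):
    return [f"Sentence From {path}", "Another Sentence"]

def fake_standardise(sentences):
    return [s.lower() for s in sentences]

paths = {
    2022: "AR/ted_baker/tbk-ar2022.pdf",
    2021: "AR/ted_baker/tbk-ar2021.pdf",
}
standardised = process_reports(paths, fake_extract, fake_standardise)
print({year: len(sents) for year, sents in standardised.items()})
```

In the notebook itself, the two stand-ins would be replaced by `extract_pdf_text` and `standardize_sentences`.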


Print a short summary of the standardised sentences from the 2021 annual report.

In [75]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(tbk_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Ted Baker annual reports: ")
print(tbk_standsent_2021[:10])
print(f"\nData type: {type(tbk_standsent_2021)}")

Total number of standardized sentences of 2021: 3577

Print the first 10 standardized sentences from Ted Baker annual reports: 
['annual report rewriting script', 'tbar strategic report ted baker today chief executive review introduction ted baker chair john barton taking ted baker future business model customers design source make sell strategy review year chief financial officer introduction key performance indicators financial operational review sustainability stor people ethical sourcing programme communities planet fashioning better future risk report viability statement going concerngovernance report board directors executive team chair introduction governance corporate governance audit risk committee report nominations committee report remuneration report directors report statement directors responsibilities financial statements independent auditor report income statement statement comprehensive income statement changes equity balance sheet cash flow statement notes financial st

Next, combine all the standardised sentences into a single list.

In [76]:
# Combine all standardised sentences into a single list
tbk_standsent_all = (
    tbk_standsent_2022 +
    tbk_standsent_2021 +
    tbk_standsent_2020 +
    tbk_standsent_2019 +
    tbk_standsent_2018 +
    tbk_standsent_2017 +
    tbk_standsent_2016 +
    tbk_standsent_2015 +
    tbk_standsent_2014 +
    tbk_standsent_2013 +
    tbk_standsent_2012
)

print(f'Total number of all standardised sentences: {len(tbk_standsent_all)}')
print('\nThe first 10 Ted Baker standardised sentences:')
print(tbk_standsent_all[:10])
print(f"\nData type: {type(tbk_standsent_all)}")

Total number of all standardised sentences: 25602

The first 10 Ted Baker standardised sentences:
['attention detailannual report fy', 'one expected year ordinary previous year pandemic kept surprises sleeve', 'despite challenges however came year good progress transformation plan', 'foundations fixed black white began fill details bringing colour life ted baker journey', 'complete picture read', 'year year end th january', 'ttbb aar r tb ar contents strategic report ted baker today chief executive review introduction ted baker interim chair helena feltham taking ted baker future business model strategy brand customers product digital capital light growth priority markets review year chief financial officer introduction key performance indicators business financial review fashioning better future people communities charity partnerships ethical sourcing programme planet risk report going concern viability disclosure governance report board directors executive team interim chair introduction 

## B10. AO World PLC

In [77]:
# Setting the path of AO World PLC annual reports from 2014 to 2022
ao_2022_path = "AR/ao_world/ao-ar2022.pdf"
ao_2021_path = "AR/ao_world/ao-ar2021.pdf"
ao_2020_path = "AR/ao_world/ao-ar2020.pdf"
ao_2019_path = "AR/ao_world/ao-ar2019.pdf"
ao_2018_path = "AR/ao_world/ao-ar2018.pdf"
ao_2017_path = "AR/ao_world/ao-ar2017.pdf"
ao_2016_path = "AR/ao_world/ao-ar2016.pdf"
ao_2015_path = "AR/ao_world/ao-ar2015.pdf"
ao_2014_path = "AR/ao_world/ao-ar2014.pdf"


In [78]:
# Extract text from AO World PLC annual report
ao_sen_2022 = extract_pdf_text(ao_2022_path)
ao_sen_2021 = extract_pdf_text(ao_2021_path)
ao_sen_2020 = extract_pdf_text(ao_2020_path)
ao_sen_2019 = extract_pdf_text(ao_2019_path)
ao_sen_2018 = extract_pdf_text(ao_2018_path)
ao_sen_2017 = extract_pdf_text(ao_2017_path)
ao_sen_2016 = extract_pdf_text(ao_2016_path)
ao_sen_2015 = extract_pdf_text(ao_2015_path)
ao_sen_2014 = extract_pdf_text(ao_2014_path)


Print the sentences extracted from the 2021 annual report.

In [79]:
# print the sentences summary of 2021 AO World PLC annual report
print(f"Total number of sentences of 2021: {len(ao_sen_2021)}")
print("\nPrint the first 10 sentences from AO World PLC annual reports: ")
print(ao_sen_2021[:10])
print(f"\nData type: {type(ao_sen_2021)}")

Total number of sentences of 2021: 6204

Print the first 10 sentences from AO World PLC annual reports: 
['30287 2 July 2021 4:59 pm V9 AO World Plc Annual Report and Accounts 2021The destination for electricals AO World Plc Annual Report and Accounts 2021 30287-AO-World-AR2021', 'indd 330287-AO-World-AR2021', 'indd 3 02/07/2021 16:59:2402/07/2021 16:59:24', '30287 2 July 2021 4:59 pm V9 We make customers’ lives easier by helping them brilliantly Overview 02 Financial and operational highlights 04 Investment case Strategic Report 08 Chair’s statement 10 Chief Executive Officer’s strategic review 14 How we create value 16 Our culture 18 Our values 20 Our customers 22 Our suppliers 24 Our technology 26 UK Retail 30 Logistics 32 Recycling 36 Germany 38 Our markets 42 Our strategy 44 Key performance indicators 46 Chief Financial Officer’s review 54 Our risks 66 Engaging with our stakeholders 68 SustainabilityContentsWe are an online leading retailer , specialising in electronics', 'In 2000

Second, standardise the sentences from each annual report.

In [80]:
# Standardise all sentences from all AO annual reports
ao_standsent_2022 = standardize_sentences(ao_sen_2022)
ao_standsent_2021 = standardize_sentences(ao_sen_2021)
ao_standsent_2020 = standardize_sentences(ao_sen_2020)
ao_standsent_2019 = standardize_sentences(ao_sen_2019)
ao_standsent_2018 = standardize_sentences(ao_sen_2018)
ao_standsent_2017 = standardize_sentences(ao_sen_2017)
ao_standsent_2016 = standardize_sentences(ao_sen_2016)
ao_standsent_2015 = standardize_sentences(ao_sen_2015)
ao_standsent_2014 = standardize_sentences(ao_sen_2014)


Print a short summary of the standardised sentences from the 2021 annual report.

In [81]:
# print the standardized sentences summary of 2021 AO annual report
print(f"Total number of standardized sentences of 2021: {len(ao_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from AO annual reports: ")
print(ao_standsent_2021[:10])
print(f"\nData type: {type(ao_standsent_2021)}")

Total number of standardized sentences of 2021: 6204

Print the first 10 standardized sentences from AO annual reports: 
['july pm v ao world plc annual report accounts destination electricals ao world plc annual report accounts ao world ar', 'indd ao world ar', 'indd', 'july pm v make customers lives easier helping brilliantly overview financial operational highlights investment case strategic report chair statement chief executive officer strategic review create value culture values customers suppliers technology uk retail logistics recycling germany markets strategy key performance indicators chief financial officer review risks engaging stakeholders sustainabilitycontentswe online leading retailer specialising electronics', 'started selling white goods big items like fridge freezers cookers washing machines', 'sell kinds electricals major domestic appliances small domestic appliances audiovisual equipment computing mobile gaming smart home technology', 'sell different products ao',

Next, combine all the standardised sentences into a single list.

In [82]:
# Combine all standardised sentences into a single list
ao_standsent_all = (
    ao_standsent_2022 +
    ao_standsent_2021 +
    ao_standsent_2020 +
    ao_standsent_2019 +
    ao_standsent_2018 +
    ao_standsent_2017 +
    ao_standsent_2016 +
    ao_standsent_2015 +
    ao_standsent_2014
)

print(f'Total number of all standardised sentences: {len(ao_standsent_all)}')
print('\nThe first 10 AO standardised sentences:')
print(ao_standsent_all[:10])
print(f"\nData type: {type(ao_standsent_all)}")

Total number of all standardised sentences: 41035

The first 10 AO standardised sentences:
['ao world plc annual report accounts destination electricals ao world plc annual report accounts', 'contents overview year review performance investment case strategic report chair statement create value markets brand culture values customers technology uk retail germany suppliers logistics recycling strategy chief executive officer strategic review chief financial officer review risks engaging stakeholders sustainability material sustainability issues esg strategy pillars sustainable living fair equal responsible fit future governance chair letter introduction board directors corporate governance report nominations committee report audit committee report directors remuneration report directors report results independent auditor report consolidated income statement consolidated statement comprehensive income consolidated statement financial position consolidated statement changes equity consolid

## B11. ASOS plc

In [83]:
# Setting the path of Asos annual reports from 2012 to 2022
asos_2022_path = "AR/asos/asos-ar2022.pdf"
asos_2021_path = "AR/asos/asos-ar2021.pdf"
asos_2020_path = "AR/asos/asos-ar2020.pdf"
asos_2019_path = "AR/asos/asos-ar2019.pdf"
asos_2018_path = "AR/asos/asos-ar2018.pdf"
asos_2017_path = "AR/asos/asos-ar2017.pdf"
asos_2016_path = "AR/asos/asos-ar2016.pdf"
asos_2015_path = "AR/asos/asos-ar2015.pdf"
asos_2014_path = "AR/asos/asos-ar2014.pdf"
asos_2013_path = "AR/asos/asos-ar2013.pdf"
asos_2012_path = "AR/asos/asos-ar2012.pdf"

In [84]:
# Extract text from Asos annual report
asos_sen_2022 = extract_pdf_text(asos_2022_path)
asos_sen_2021 = extract_pdf_text(asos_2021_path)
asos_sen_2020 = extract_pdf_text(asos_2020_path)
asos_sen_2019 = extract_pdf_text(asos_2019_path)
asos_sen_2018 = extract_pdf_text(asos_2018_path)
asos_sen_2017 = extract_pdf_text(asos_2017_path)
asos_sen_2016 = extract_pdf_text(asos_2016_path)
asos_sen_2015 = extract_pdf_text(asos_2015_path)
asos_sen_2014 = extract_pdf_text(asos_2014_path)
asos_sen_2013 = extract_pdf_text(asos_2013_path)
asos_sen_2012 = extract_pdf_text(asos_2012_path)


Print the sentences extracted from the 2021 annual report.

In [85]:
# print the sentences summary of 2021 Asos annual report
print(f"Total number of sentences of 2021: {len(asos_sen_2021)}")
print("\nPrint the first 10 sentences from Asos annual reports: ")
print(asos_sen_2021[:10])
print(f"\nData type: {type(asos_sen_2021)}")

Total number of sentences of 2021: 4161

Print the first 10 sentences from Asos annual reports: 
['ASOS Plc Annual Report and Accounts 2021 ASOS REIMAGINED', 'This has been a strong year for ASOS', 'Despite challenging circumstances, the talent, passion, resilience and commitment of our Executive team and our ASOSers has shone through and helped us deliver the strong results we have published', 'Revenues rose to £3,910', '5m, delivering an adjusted pre-tax profit of £193', '6m, a rise of 22%¹ and 36% respectively', 'Over the last three years we have made significant progress, delivering 60% growth in revenues, improved profitability and a strengthened balance sheet', 'We have also bolstered the management team and improved ASOS’ operational capabilities and resilience', 'As a result, we enter the coming year well placed against the backdrop of difficult conditions that we and all businesses are facing', 'At the same time, however, we recognise that there is more to do to accelerate the
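The output above shows that splitting on every full stop also breaks decimal figures: 'Revenues rose to £3,910' and '5m, delivering an adjusted pre-tax profit of £193' were one sentence in the report. One possible alternative, sketched here as an assumption rather than the method used in this notebook, is to split only on a full stop followed by whitespace or the end of the text. The name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text):
    """Split on a full stop only when it is followed by whitespace or end of text."""
    parts = re.split(r"\.(?:\s+|$)", text)
    # Drop empty fragments left behind by the final full stop
    return [p.strip() for p in parts if p.strip()]

text = (
    "Revenues rose to £3,910.5m, delivering an adjusted pre-tax profit of £193.6m. "
    "This has been a strong year for ASOS."
)
sentences = split_sentences(text)
for s in sentences:
    print(s)
```

This keeps figures such as £3,910.5m intact, but it would miss sentence boundaries where the PDF extraction drops the space after a full stop, so it is a trade-off rather than a strict improvement.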

Second, standardise the sentences from each annual report.

In [86]:
# Standardise all sentences from all Asos annual reports
asos_standsent_2022 = standardize_sentences(asos_sen_2022)
asos_standsent_2021 = standardize_sentences(asos_sen_2021)
asos_standsent_2020 = standardize_sentences(asos_sen_2020)
asos_standsent_2019 = standardize_sentences(asos_sen_2019)
asos_standsent_2018 = standardize_sentences(asos_sen_2018)
asos_standsent_2017 = standardize_sentences(asos_sen_2017)
asos_standsent_2016 = standardize_sentences(asos_sen_2016)
asos_standsent_2015 = standardize_sentences(asos_sen_2015)
asos_standsent_2014 = standardize_sentences(asos_sen_2014)
asos_standsent_2013 = standardize_sentences(asos_sen_2013)
asos_standsent_2012 = standardize_sentences(asos_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [87]:
# print the standardized sentences summary of 2021 Asos annual report
print(f"Total number of standardized sentences of 2021: {len(asos_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Asos annual reports: ")
print(asos_standsent_2021[:10])
print(f"\nData type: {type(asos_standsent_2021)}")

Total number of standardized sentences of 2021: 4161

Print the first 10 standardized sentences from Asos annual reports: 
['asos plc annual report accounts asos reimagined', 'strong year asos', 'despite challenging circumstances talent passion resilience commitment executive team asosers shone helped us deliver strong results published', 'revenues rose', 'delivering adjusted pre tax profit', 'rise respectively', 'last three years made significant progress delivering growth revenues improved profitability strengthened balance sheet', 'also bolstered management team improved asos operational capabilities resilience', 'result enter coming year well placed backdrop difficult conditions businesses facing', 'time however recognise accelerate pace intensity commercial execution']

Data type: <class 'list'>


Next, combine all the standardised sentences into a single list.

In [88]:
# Combine all standardised sentences into a single list
asos_standsent_all = (
    asos_standsent_2022 +
    asos_standsent_2021 +
    asos_standsent_2020 +
    asos_standsent_2019 +
    asos_standsent_2018 +
    asos_standsent_2017 +
    asos_standsent_2016 +
    asos_standsent_2015 +
    asos_standsent_2014 +
    asos_standsent_2013 +
    asos_standsent_2012
)

print(f'Total number of all standardised sentences: {len(asos_standsent_all)}')
print('\nThe first 10 Asos standardised sentences:')
print(asos_standsent_all[:10])
print(f"\nData type: {type(asos_standsent_all)}")

Total number of all standardised sentences: 34706

The first 10 Asos standardised sentences:
['changedriving asos plc annual report accounts', 'asos destination fashion loving somethings around world purpose give customers confidence whoever want', 'strategic report chair statement chief executive officer statement values brands people key performance indicators year review business model stakeholder engagement chief executive officer operational review performance market financial review fashion integrity task force climate related financial disclosures non financial information statement managing risk asos principal risks opportunities long term viability statementgovernance report board directors corporate governance report audit committee report nomination committee report esg committee report directors remuneration report annual report remuneration remuneration policy directors report statement directors responsibilities financial statements independent auditors report members aso

## B12. B&M European Value Retail

Annual reports are available from 2015 to 2022.

In [89]:
# Setting the path of annual reports from 2015 to 2022
bme_2015to2022_path = "AR/bm_eu_retail/bme_ar-2015-2022.pdf"


In [90]:
# Extract text from the combined annual reports
# (the file covers 2015 to 2022, despite the "2012" in the variable name)
bme_sen_2012to2022 = extract_pdf_text(bme_2015to2022_path)

Print the sentences extracted from the combined 2015 to 2022 annual reports.

In [91]:
# print the sentences summary of annual report
print(f"Total number of sentences of 2015 to 2022: {len(bme_sen_2012to2022)}")
print("\nPrint the first 10 sentences from BME annual reports: ")
print(bme_sen_2012to2022[:10])
print(f"\nData type: {type(bme_sen_2012to2022)}")

Total number of sentences of 2015 to 2022: 28181

Print the first 10 sentences from BME annual reports: 
['B&M European Value Retail S', 'A', 'Annual Report and Accounts 2015Big Brands Big Savings B&M European Value Retail S', 'A', '\uf138 Annual Report and Accounts 2015', 'B&M is a fast-growing discount retailer, operating from over 425 high street and out of town stores across the UK, as well as 50 stores under the Jawoll brand in Germany', 'We offer customers a broad range of grocery and general merchandise products at sensational prices', 'Our aim is to provide customers with a fun and exciting shopping experience, offering them great products and fantastic value so that they return again and again to a B&M store', 'Our success is down to our customers and built on “word of mouth”', 'Last year we enjoyed an average 2']

Data type: <class 'list'>


Second, standardise the sentences from the combined annual reports.

In [92]:
# Standardise all sentences from all annual reports
bme_standsent_all = standardize_sentences(bme_sen_2012to2022)


Print a short summary of the standardised sentences from the combined annual reports.

In [93]:
# print the standardized sentences summary of annual report
print(f"Total number of standardized sentences of 2015 to 2022: {len(bme_standsent_all)}")
print("\nPrint the first 10 standardized sentences from BME annual reports: ")
print(bme_standsent_all[:10])
print(f"\nData type: {type(bme_standsent_all)}")

Total number of standardized sentences of 2015 to 2022: 28181

Print the first 10 standardized sentences from BME annual reports: 
['b european value retail', '', 'annual report accounts big brands big savings b european value retail', '', 'annual report accounts', 'b fast growing discount retailer operating high street town stores across uk well stores jawoll brand germany', 'offer customers broad range grocery general merchandise products sensational prices', 'aim provide customers fun exciting shopping experience offering great products fantastic value return b store', 'success customers built word mouth', 'last year enjoyed average']

Data type: <class 'list'>
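The standardised output above still contains empty strings (`''`), left where a fragment such as 'A' was reduced to nothing. If empty entries are not wanted downstream, they could be filtered out as in the sketch below. The helper name `drop_empty` is hypothetical, and the sample list reuses the first few entries shown above plus one whitespace-only entry added for illustration.

```python
def drop_empty(sentences):
    """Keep only sentences that contain at least one non-whitespace character."""
    return [s for s in sentences if s.strip()]

sample = ['b european value retail', '', 'annual report accounts', ' ']
cleaned = drop_empty(sample)
print(cleaned)
print(f"Removed {len(sample) - len(cleaned)} empty entries")
```

Applying such a filter would lower the sentence count, so the standardised total would no longer match the raw total.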


## B13. Currys

Currys annual reports are only available online from 2014 to 2022.

In [94]:
# Setting the path of Currys annual reports from 2014 to 2022
cury_2022_path = "AR/currys/curry-ar2022.pdf"
cury_2021_path = "AR/currys/curry-ar2021.pdf"
cury_2020_path = "AR/currys/curry-ar2020.pdf"
cury_2019_path = "AR/currys/curry-ar2019.pdf"
cury_2018_path = "AR/currys/curry-ar2018.pdf"
cury_2017_path = "AR/currys/curry-ar2017.pdf"
cury_2016_path = "AR/currys/curry-ar2016.pdf"
cury_2015_path = "AR/currys/curry-ar2015.pdf"
cury_2014_path = "AR/currys/curry-ar2014.pdf"


In [95]:
# Extract text from Currys annual report
cury_sen_2022 = extract_pdf_text(cury_2022_path)
cury_sen_2021 = extract_pdf_text(cury_2021_path)
cury_sen_2020 = extract_pdf_text(cury_2020_path)
cury_sen_2019 = extract_pdf_text(cury_2019_path)
cury_sen_2018 = extract_pdf_text(cury_2018_path)
cury_sen_2017 = extract_pdf_text(cury_2017_path)
cury_sen_2016 = extract_pdf_text(cury_2016_path)
cury_sen_2015 = extract_pdf_text(cury_2015_path)
cury_sen_2014 = extract_pdf_text(cury_2014_path)


Print the sentences extracted from the 2021 annual report.

In [96]:
# print the sentences summary of 2021 Currys annual report
print(f"Total number of sentences of 2021: {len(cury_sen_2021)}")
print("\nPrint the first 10 sentences from Currys annual reports: ")
print(cury_sen_2021[:10])
print(f"\nData type: {type(cury_sen_2021)}")

Total number of sentences of 2021: 4374

Print the first 10 sentences from Currys annual reports: 
['ANNUAL REPORT & ACCOUNTS 2020/21', 'DIXONS CARPHONE WHAT WE DO Dixons Carphone plc is a leading omnichannel retailer of technology products and services, operating through 829 stores and 16 websites in 7 countries', 'We Help Everyone Enjoy Amazing Technology, however they choose to shop with us', 'We are the market leader in the UK and Ireland, throughout the Nordics and in Greece, employing 35,000 capable and committed colleagues across the Group', 'By offering the best range of products, credit and services through digital-first omnichannel we are building customer relationships that are stickier and more valuable over time', 'This will benefit our customers, our colleagues, our shareholders and society', 'www', 'dixonscarphone', 'com/investors For the latest news visit our website', 'A LEADING OMNICHANNEL RETAILER OF TECHNOLOGY PRODUCTS AND SERVICESDixons Carphone plc Annual Report &

Second, standardise the sentences from each annual report.

In [97]:
# Standardise all sentences from all Currys annual reports
cury_standsent_2022 = standardize_sentences(cury_sen_2022)
cury_standsent_2021 = standardize_sentences(cury_sen_2021)
cury_standsent_2020 = standardize_sentences(cury_sen_2020)
cury_standsent_2019 = standardize_sentences(cury_sen_2019)
cury_standsent_2018 = standardize_sentences(cury_sen_2018)
cury_standsent_2017 = standardize_sentences(cury_sen_2017)
cury_standsent_2016 = standardize_sentences(cury_sen_2016)
cury_standsent_2015 = standardize_sentences(cury_sen_2015)
cury_standsent_2014 = standardize_sentences(cury_sen_2014)



Print a short summary of the standardised sentences from the 2021 annual report.

In [98]:
# print the standardized sentences summary of 2021 Currys annual report
print(f"Total number of standardized sentences of 2021: {len(cury_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Currys annual reports: ")
print(cury_standsent_2021[:10])
print(f"\nData type: {type(cury_standsent_2021)}")

Total number of standardized sentences of 2021: 4374

Print the first 10 standardized sentences from Currys annual reports: 
['annual report accounts', 'dixons carphone dixons carphone plc leading omnichannel retailer technology products services operating stores websites countries', 'help everyone enjoy amazing technology however choose shop us', 'market leader uk ireland throughout nordics greece employing capable committed colleagues across group', 'offering best range products credit services digital first omnichannel building customer relationships stickier valuable time', 'benefit customers colleagues shareholders society', 'www', 'dixonscarphone', 'com investors latest news visit website', 'leading omnichannel retailer technology products servicesdixons carphone plc annual report accounts']

Data type: <class 'list'>
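
Because `extract_pdf_text` splits on every full stop, web addresses break into fragments such as `'www'` and `'dixonscarphone'` in the output above. A minimal sketch of one way to drop such fragments is shown below. The helper name `drop_short_fragments` and the threshold `min_tokens=3` are assumptions for illustration, not part of the notebook's pipeline, and a suitable threshold would need checking against the data.

```python
def drop_short_fragments(sentences, min_tokens=3):
    """Keep only sentences with at least `min_tokens` whitespace-separated tokens.

    Hypothetical helper: the threshold is an assumption, not a tuned value.
    """
    return [s for s in sentences if len(s.split()) >= min_tokens]


# Sample taken from the standardised 2021 Currys output above
sample = [
    'benefit customers colleagues shareholders society',
    'www',
    'dixonscarphone',
    'com investors latest news visit website',
]
filtered = drop_short_fragments(sample)
print(filtered)
```

Filtering by token count also removes genuinely short sentences, so it trades some recall for cleaner input to the later labelling step.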


Next, combine the standardised sentences from all years into a single list.

In [99]:
# Combine all standardised sentence into a list
cury_standsent_all = (
    cury_standsent_2022 +
    cury_standsent_2021 +
    cury_standsent_2020 +
    cury_standsent_2019 +
    cury_standsent_2018 +
    cury_standsent_2017 +
    cury_standsent_2016 +
    cury_standsent_2015 +
    cury_standsent_2014 
)

print(f'Total number of all standardised sentences: {len(cury_standsent_all)}')
print('\nThe first 10 Currys standardised sentences:')
print(cury_standsent_all[:10])
print(f"\nData type: {type(cury_standsent_all)}")

Total number of all standardised sentences: 35968

The first 10 Currys standardised sentences:
['annual report accounts help everyone enjoy amazing technology currys plc annual report accounts', 'currys currys plc leading omnichannel retailer technology products services operating online stores countries', 'help everyone enjoy amazing technology however choose shop us', 'vision powerful social purpose heart', 'believe power technology improve lives help people stay connected productive healthy entertained', 'help everyone enjoy benefits scale expertise uniquely placed', 'leading omnichannel retailer technology quick links long live tech business model strategy action www', 'currysplc', 'com investors latest news visit website', 'governance financial statements investor informationstrategic report']

Data type: <class 'list'>


## B14. WH Smith

In [100]:
# Setting the path of annual reports from 2012 to 2022
smwh_2022_path = "AR/wh_smith/smwh-ar2022.pdf"
smwh_2021_path = "AR/wh_smith/smwh-ar2021.pdf"
smwh_2020_path = "AR/wh_smith/smwh-ar2020.pdf"
smwh_2019_path = "AR/wh_smith/smwh-ar2019.pdf"
smwh_2018_path = "AR/wh_smith/smwh-ar2018.pdf"
smwh_2017_path = "AR/wh_smith/smwh-ar2017.pdf"
smwh_2016_path = "AR/wh_smith/smwh-ar2016.pdf"
smwh_2015_path = "AR/wh_smith/smwh-ar2015.pdf"
smwh_2014_path = "AR/wh_smith/smwh-ar2014.pdf"
smwh_2013_path = "AR/wh_smith/smwh-ar2013.pdf"
smwh_2012_path = "AR/wh_smith/smwh-ar2012.pdf"

In [101]:
# Extract text from annual report
smwh_sen_2022 = extract_pdf_text(smwh_2022_path)
smwh_sen_2021 = extract_pdf_text(smwh_2021_path)
smwh_sen_2020 = extract_pdf_text(smwh_2020_path)
smwh_sen_2019 = extract_pdf_text(smwh_2019_path)
smwh_sen_2018 = extract_pdf_text(smwh_2018_path)
smwh_sen_2017 = extract_pdf_text(smwh_2017_path)
smwh_sen_2016 = extract_pdf_text(smwh_2016_path)
smwh_sen_2015 = extract_pdf_text(smwh_2015_path)
smwh_sen_2014 = extract_pdf_text(smwh_2014_path)
smwh_sen_2013 = extract_pdf_text(smwh_2013_path)
smwh_sen_2012 = extract_pdf_text(smwh_2012_path)


Print the sentences extracted from the 2021 annual report.

In [102]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(smwh_sen_2021)}")
print("\nPrint the first 10 sentences from WH Smith annual reports: ")
print(smwh_sen_2021[:10])
print(f"\nData type: {type(smwh_sen_2021)}")

Total number of sentences of 2021: 3938

Print the first 10 sentences from WH Smith annual reports: 
['Annual Report and Accounts 2021', 'WHSmith High Street is present on most of the significant high streets and shopping centres in the UK, mainly inAs WHSmith continues on its journey to be a better business, we have a strong commitment to the principles ofWH Smith PLC is a leading global Travel retailer for news, books, convenience and tech accessories with a smaller business on the UK High Street', 'At the heart of both our businesses are our people and our customers', 'We aim to deliver our goals through our strategic priorities and initiatives by: constantly innovating, expanding globally, improving our profitability and delivering sustainable returns', 'We are a leading global retailer Travel is in a wide range of locations including airports, hospitals, railway stations and motorway service areasAbout us 30 countries across the globe, mainly in airportsWHSmith Travel is a world-l

Second, standardise the sentences from each annual report.

In [103]:
# Standardise all sentences from all annual reports
smwh_standsent_2022 = standardize_sentences(smwh_sen_2022)
smwh_standsent_2021 = standardize_sentences(smwh_sen_2021)
smwh_standsent_2020 = standardize_sentences(smwh_sen_2020)
smwh_standsent_2019 = standardize_sentences(smwh_sen_2019)
smwh_standsent_2018 = standardize_sentences(smwh_sen_2018)
smwh_standsent_2017 = standardize_sentences(smwh_sen_2017)
smwh_standsent_2016 = standardize_sentences(smwh_sen_2016)
smwh_standsent_2015 = standardize_sentences(smwh_sen_2015)
smwh_standsent_2014 = standardize_sentences(smwh_sen_2014)
smwh_standsent_2013 = standardize_sentences(smwh_sen_2013)
smwh_standsent_2012 = standardize_sentences(smwh_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [104]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(smwh_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from WH Smith annual reports: ")
print(smwh_standsent_2021[:10])
print(f"\nData type: {type(smwh_standsent_2021)}")

Total number of standardized sentences of 2021: 3938

Print the first 10 standardized sentences from WH Smith annual reports: 
['annual report accounts', 'whsmith high street present significant high streets shopping centres uk mainly inas whsmith continues journey better business strong commitment principles ofwh smith plc leading global travel retailer news books convenience tech accessories smaller business uk high street', 'heart businesses people customers', 'aim deliver goals strategic priorities initiatives constantly innovating expanding globally improving profitability delivering sustainable returns', 'leading global retailer travel wide range locations including airports hospitals railway stations motorway service areasabout us countries across globe mainly airportswhsmith travel world leading travel retailer presence whsmithofficial whsmith youtube', 'com whsmithdirect linkedin', 'com company whsmithfind whsmith whsmithplc', 'co', 'uk disclaimer annual report prepared member

Next, combine the standardised sentences from all years into a single list.

In [105]:
# Combine all standardised sentence into a list
smwh_standsent_all = (
    smwh_standsent_2022 +
    smwh_standsent_2021 +
    smwh_standsent_2020 +
    smwh_standsent_2019 +
    smwh_standsent_2018 +
    smwh_standsent_2017 +
    smwh_standsent_2016 +
    smwh_standsent_2015 +
    smwh_standsent_2014 +
    smwh_standsent_2013 +
    smwh_standsent_2012
)

print(f'Total number of all standardised sentences: {len(smwh_standsent_all)}')
print('\nThe first 10 WH Smith standardised sentences:')
print(smwh_standsent_all[:10])
print(f"\nData type: {type(smwh_standsent_all)}")

Total number of all standardised sentences: 31443

The first 10 WH Smith standardised sentences:
['annual report accounts every journey', 'find whsmith whsmithplc', 'co', 'uk whsmithofficial whsmith youtube', 'com whsmith linkedin', 'com company whsmithhere whsmith purpose simple make every one life journeys better supporting customers journeys key since', 'celebrate years since company founded continue support many journeys colleagues customers shareholders make', 'supporting journey people top priority', 'diverse team colleagues across countries committed championing career journey us also promoting culture everyone best self', 'journey create better business']

Data type: <class 'list'>
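
The path, extract, standardise and combine cells above repeat the same statement once per year. The sketch below shows how the same steps could be expressed as a loop. `build_company_corpus` is a hypothetical helper, and the extraction and standardisation functions are passed in as arguments so the block stays self-contained. It assumes the `AR/<folder>/<prefix>-ar<year>.pdf` naming pattern used in this section.

```python
def build_company_corpus(prefix, folder, years, extract_fn, standardize_fn):
    """Extract and standardise one report per year, then concatenate them.

    prefix, folder : parts of the assumed "AR/<folder>/<prefix>-ar<year>.pdf" pattern
    years          : iterable of years, in the order they should be combined
    extract_fn     : stand-in for extract_pdf_text (path -> list of sentences)
    standardize_fn : stand-in for standardize_sentences (list -> list)
    """
    years = list(years)  # allow generators; we iterate twice
    per_year = {}
    for year in years:
        path = f"AR/{folder}/{prefix}-ar{year}.pdf"
        per_year[year] = standardize_fn(extract_fn(path))

    # Concatenate in the given year order, like the chained "+" above
    combined = [s for year in years for s in per_year[year]]
    return per_year, combined
```

In the notebook this could be called as `build_company_corpus("smwh", "wh_smith", range(2022, 2011, -1), extract_pdf_text, standardize_sentences)`. For companies with missing years, pass an explicit list of the available years instead of a range.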


## B15. The Stanley Gibbons Group plc

The 2012 and 2015 annual reports are not available online.

In [106]:
# Setting the path of annual reports
sgi_2022_path = "AR/stanley_gibbsons/sgi-ar2022.pdf"
sgi_2021_path = "AR/stanley_gibbsons/sgi-ar2021.pdf"
sgi_2020_path = "AR/stanley_gibbsons/sgi-ar2020.pdf"
sgi_2019_path = "AR/stanley_gibbsons/sgi-ar2019.pdf"
sgi_2018_path = "AR/stanley_gibbsons/sgi-ar2018.pdf"
sgi_2017_path = "AR/stanley_gibbsons/sgi-ar2017.pdf"
sgi_2016_path = "AR/stanley_gibbsons/sgi-ar2016.pdf"
sgi_2014_path = "AR/stanley_gibbsons/sgi-ar2014.pdf"
sgi_2013_path = "AR/stanley_gibbsons/sgi-ar2013.pdf"


In [107]:
# Extract text from annual report
sgi_sen_2022 = extract_pdf_text(sgi_2022_path)
sgi_sen_2021 = extract_pdf_text(sgi_2021_path)
sgi_sen_2020 = extract_pdf_text(sgi_2020_path)
sgi_sen_2019 = extract_pdf_text(sgi_2019_path)
sgi_sen_2018 = extract_pdf_text(sgi_2018_path)
sgi_sen_2017 = extract_pdf_text(sgi_2017_path)
sgi_sen_2016 = extract_pdf_text(sgi_2016_path)
sgi_sen_2014 = extract_pdf_text(sgi_2014_path)
sgi_sen_2013 = extract_pdf_text(sgi_2013_path)


Print the sentences extracted from the 2021 annual report.

In [108]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(sgi_sen_2021)}")
print("\nPrint the first 10 sentences from Stanley Gibbons annual reports: ")
print(sgi_sen_2021[:10])
print(f"\nData type: {type(sgi_sen_2021)}")

Total number of sentences of 2021: 464

Print the first 10 sentences from Stanley Gibbons annual reports: 
['The Stanley Gibbons Group plc Interim Report and Accounts for the six months ended 30 September 2021 262346 SG Interim cov-pp09', 'qxp 08/12/2021 18:47 Page 1', 'Page 2 Directors and Advisers 3 Chairman’s Statement 4 Chief Executive’s Report 10 Financial Statements and notes Contents The Stanley Gibbons Group plc 1262346 SG Interim cov-pp09', 'qxp 08/12/2021 18:47 Page 1', 'Directors H G Wilson Non-Executive Chairman G E Shircore Chief Executive Officer K Fitzpatrick Chief Finance Officer L E Castro Non-Executive Director* M West Non-Executive Director* * Independent Company Secretary K Fitzpatrick Registered Office 22 Grenville Street St', 'Helier Jersey JE4 8PX Tel: +44(0)20 7836 8444 Company Registration Registered in Jersey Number 13177 Legal Form Public Limited Company limited by shares Nominated Adviser and Broker Liberum Capital Limited 25 Ropemaker Street London EC2Y 9L

Second, standardise the sentences from each annual report.

In [109]:
# Standardise all sentences from all annual reports
sgi_standsent_2022 = standardize_sentences(sgi_sen_2022)
sgi_standsent_2021 = standardize_sentences(sgi_sen_2021)
sgi_standsent_2020 = standardize_sentences(sgi_sen_2020)
sgi_standsent_2019 = standardize_sentences(sgi_sen_2019)
sgi_standsent_2018 = standardize_sentences(sgi_sen_2018)
sgi_standsent_2017 = standardize_sentences(sgi_sen_2017)
sgi_standsent_2016 = standardize_sentences(sgi_sen_2016)
sgi_standsent_2014 = standardize_sentences(sgi_sen_2014)
sgi_standsent_2013 = standardize_sentences(sgi_sen_2013)

Print a short summary of the standardised sentences from the 2021 annual report.

In [110]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(sgi_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Stanley Gibbons annual reports: ")
print(sgi_standsent_2021[:10])
print(f"\nData type: {type(sgi_standsent_2021)}")

Total number of standardized sentences of 2021: 464

Print the first 10 standardized sentences from Stanley Gibbons annual reports: 
['stanley gibbons group plc interim report accounts six months ended september sg interim cov pp', 'qxp page', 'page directors advisers chairman statement chief executive report financial statements notes contents stanley gibbons group plc sg interim cov pp', 'qxp page', 'directors h g wilson non executive chairman g e shircore chief executive officer k fitzpatrick chief finance officer l e castro non executive director west non executive director independent company secretary k fitzpatrick registered office grenville street st', 'helier jersey je px tel company registration registered jersey number legal form public limited company limited shares nominated adviser broker liberum capital limited ropemaker street london ec ly auditors jeffreys henry llp finnsgate cranwood street london ec v ee legal advisers mourant ozannes grenville street st helier jers

Next, combine the standardised sentences from all years into a single list.

In [111]:
# Combine all standardised sentence into a list
sgi_standsent_all = (
    sgi_standsent_2022 +
    sgi_standsent_2021 +
    sgi_standsent_2020 +
    sgi_standsent_2019 +
    sgi_standsent_2018 +
    sgi_standsent_2017 +
    sgi_standsent_2016 +
    sgi_standsent_2014 +
    sgi_standsent_2013 
)

print(f'Total number of all standardised sentences: {len(sgi_standsent_all)}')
print('\nThe first 10 Stanley Gibbons standardised sentences:')
print(sgi_standsent_all[:10])
print(f"\nData type: {type(sgi_standsent_all)}")

Total number of all standardised sentences: 11488

The first 10 Stanley Gibbons standardised sentences:
['stanley gibbons group plc annual report accounts year ended march', 'financial highlights year ended year ended march march restated group turnover continuing operations', '', 'trading loss continuing operations', '', 'loss taxation continuing operations', '', 'adjusted loss profit taxation continuing operations', '', 'basic earnings per share continuing operations p']

Data type: <class 'list'>
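
The combined list above contains several empty strings (`''`), presumably fragments that held only numbers or punctuation before standardisation. The sentence counts before and after standardisation are also identical (464 for 2021), which is consistent with nothing being dropped at that step. A minimal sketch of removing the empty entries follows. `drop_empty` is a hypothetical helper, not a function defined in this notebook.

```python
def drop_empty(sentences):
    """Remove sentences that are empty or contain only whitespace."""
    return [s for s in sentences if s.strip()]


# Sample based on the combined Stanley Gibbons output above
sample = [
    'trading loss continuing operations',
    '',
    'loss taxation continuing operations',
    '  ',
]
cleaned = drop_empty(sample)
print(f"Before: {len(sample)}, after: {len(cleaned)}")
```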


## B16. Frasers Group

In [112]:
# Setting the path of annual reports from 2012 to 2022
fras_2022_path = "AR/frasers/fras-ar2022.pdf"
fras_2021_path = "AR/frasers/fras-ar2021.pdf"
fras_2020_path = "AR/frasers/fras-ar2020.pdf"
fras_2019_path = "AR/frasers/fras-ar2019.pdf"
fras_2018_path = "AR/frasers/fras-ar2018.pdf"
fras_2017_path = "AR/frasers/fras-ar2017.pdf"
fras_2016_path = "AR/frasers/fras-ar2016.pdf"
fras_2015_path = "AR/frasers/fras-ar2015.pdf"
fras_2014_path = "AR/frasers/fras-ar2014.pdf"
fras_2013_path = "AR/frasers/fras-ar2013.pdf"
fras_2012_path = "AR/frasers/fras-ar2012.pdf"

In [113]:
# Extract text from annual report
fras_sen_2022 = extract_pdf_text(fras_2022_path)
fras_sen_2021 = extract_pdf_text(fras_2021_path)
fras_sen_2020 = extract_pdf_text(fras_2020_path)
fras_sen_2019 = extract_pdf_text(fras_2019_path)
fras_sen_2018 = extract_pdf_text(fras_2018_path)
fras_sen_2017 = extract_pdf_text(fras_2017_path)
fras_sen_2016 = extract_pdf_text(fras_2016_path)
fras_sen_2015 = extract_pdf_text(fras_2015_path)
fras_sen_2014 = extract_pdf_text(fras_2014_path)
fras_sen_2013 = extract_pdf_text(fras_2013_path)
fras_sen_2012 = extract_pdf_text(fras_2012_path)


Print the sentences extracted from the 2021 annual report.

In [114]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(fras_sen_2021)}")
print("\nPrint the first 10 sentences from Frasers Group annual reports: ")
print(fras_sen_2021[:10])
print(f"\nData type: {type(fras_sen_2021)}")

Total number of sentences of 2021: 5526

Print the first 10 sentences from Frasers Group annual reports: 
['APTITUDE FORTITUDE ANNUAL REPORT 2021 A member of Frasers Property Group', 'Contents Overview 02 About Frasers Centrepoint Trust 03 Structure of FCT and Organisation Structure of The Manager 04 Business Objectives and Growth Strategies 05 FY2021 Highlights 06 Key Events 08 5-Year Performance at a Glance 10 Unit Price Performance 12 Letter to Unitholders 16 Board of Directors 20 Trust Management Team 22 Investor Relations Business Review 24 Operations Review 30 Financial Review 36 Capital Resources 38 Retail Property Market Overview Asset Portfolio 52 FCT Portfolio Overview 54 Causeway Point 56 Waterway Point 58 Tampines 1 60 Northpoint City North Wing and Yishun 10 Retail Podium 62 Tiong Bahru Plaza 64 Century Square 66 Changi City Point 68 Hougang Mall 70 White Sands 72 Central Plaza 74 Property Directory 75 Investment in Hektar REIT Risk Management, Sustainability Report & Corp

Second, standardise the sentences from each annual report.

In [115]:
# Standardise all sentences from all annual reports
fras_standsent_2022 = standardize_sentences(fras_sen_2022)
fras_standsent_2021 = standardize_sentences(fras_sen_2021)
fras_standsent_2020 = standardize_sentences(fras_sen_2020)
fras_standsent_2019 = standardize_sentences(fras_sen_2019)
fras_standsent_2018 = standardize_sentences(fras_sen_2018)
fras_standsent_2017 = standardize_sentences(fras_sen_2017)
fras_standsent_2016 = standardize_sentences(fras_sen_2016)
fras_standsent_2015 = standardize_sentences(fras_sen_2015)
fras_standsent_2014 = standardize_sentences(fras_sen_2014)
fras_standsent_2013 = standardize_sentences(fras_sen_2013)
fras_standsent_2012 = standardize_sentences(fras_sen_2012)


Print a short summary of the standardised sentences from the 2021 annual report.

In [116]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(fras_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Frasers Group annual reports: ")
print(fras_standsent_2021[:10])
print(f"\nData type: {type(fras_standsent_2021)}")

Total number of standardized sentences of 2021: 5526

Print the first 10 standardized sentences from Frasers Group annual reports: 
['aptitude fortitude annual report member frasers property group', 'contents overview frasers centrepoint trust structure fct organisation structure manager business objectives growth strategies fy highlights key events year performance glance unit price performance letter unitholders board directors trust management team investor relations business review operations review financial review capital resources retail property market overview asset portfolio fct portfolio overview causeway point waterway point tampines northpoint city north wing yishun retail podium tiong bahru plaza century square changi city point hougang mall white sands central plaza property directory investment hektar reit risk management sustainability report corporate governance report risk management sustainability report corporate governance report financial information financial st

Next, combine the standardised sentences from all years into a single list.

In [117]:
# Combine all standardised sentence into a list
fras_standsent_all = (
    fras_standsent_2022 +
    fras_standsent_2021 +
    fras_standsent_2020 +
    fras_standsent_2019 +
    fras_standsent_2018 +
    fras_standsent_2017 +
    fras_standsent_2016 +
    fras_standsent_2015 +
    fras_standsent_2014 +
    fras_standsent_2013 +
    fras_standsent_2012
)

print(f'Total number of all standardised sentences: {len(fras_standsent_all)}')
print('\nThe first 10 Frasers Group standardised sentences:')
print(fras_standsent_all[:10])
print(f"\nData type: {type(fras_standsent_all)}")

Total number of all standardised sentences: 44598

The first 10 Frasers Group standardised sentences:
['annual report accounts', 'frasers group plc', 'frasers group founded single store frasers group plc frasers group group business company today uk largest sporting goods retailer revenue', 'group operates diversified portfolio sports fitness premium lifestyle luxury fascias countries', 'colleagues across five business segments uk sports retail premium lifestyle european retail rest world retail wholesale licensing', 'strategy provide consumers access world best sports premium luxury brands providing world leading retail ecosystem', 'aligned vision defined group purpose elevate lives many giving access world best brands experiences', 'impact since became listed public company', 'years since floated group greatly contributed british economy', 'includes approx']

Data type: <class 'list'>
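
The 2021 output in this section mentions "Frasers Property Group" and "Frasers Centrepoint Trust", whereas the combined list (which starts with 2022) opens with "frasers group plc". This suggests the 2021 file may belong to a different company and is worth checking by hand. The sketch below shows one way to flag such files automatically. `flag_suspect_reports`, the keyword list and `n_head=20` are all assumptions for illustration, and the keywords would need adapting where a company has changed its name over the period.

```python
def flag_suspect_reports(reports_by_year, keywords, n_head=20):
    """Return the years whose first `n_head` sentences mention none of `keywords`.

    Matching is a case-insensitive substring test on the joined sentences.
    """
    suspects = []
    for year, sentences in reports_by_year.items():
        head = " ".join(sentences[:n_head]).lower()
        if not any(k.lower() in head for k in keywords):
            suspects.append(year)
    return suspects


# Sample built from the standardised outputs printed in this section
sample = {
    2022: ['annual report accounts', 'frasers group plc'],
    2021: [
        'aptitude fortitude annual report member frasers property group',
        'contents overview frasers centrepoint trust structure fct organisation',
    ],
}
print(flag_suspect_reports(sample, keywords=["frasers group plc"]))
```

A flagged year is only a prompt for manual inspection. A report can legitimately omit the exact phrase from its opening pages.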


## B17. Burberry

In [118]:
# Setting the path of annual reports from 2012 to 2022
brby_2022_path = "AR/burberry/brby-ar2022.pdf"
brby_2021_path = "AR/burberry/brby-ar2021.pdf"
brby_2020_path = "AR/burberry/brby-ar2020.pdf"
brby_2019_path = "AR/burberry/brby-ar2019.pdf"
brby_2018_path = "AR/burberry/brby-ar2018.pdf"
brby_2017_path = "AR/burberry/brby-ar2017.pdf"
brby_2016_path = "AR/burberry/brby-ar2016.pdf"
brby_2015_path = "AR/burberry/brby-ar2015.pdf"
brby_2014_path = "AR/burberry/brby-ar2014.pdf"
brby_2013_path = "AR/burberry/brby-ar2013.pdf"
brby_2012_path = "AR/burberry/brby-ar2012.pdf"

In [119]:
# Extract text from annual report
brby_sen_2022 = extract_pdf_text(brby_2022_path)
brby_sen_2021 = extract_pdf_text(brby_2021_path)
brby_sen_2020 = extract_pdf_text(brby_2020_path)
brby_sen_2019 = extract_pdf_text(brby_2019_path)
brby_sen_2018 = extract_pdf_text(brby_2018_path)
brby_sen_2017 = extract_pdf_text(brby_2017_path)
brby_sen_2016 = extract_pdf_text(brby_2016_path)
brby_sen_2015 = extract_pdf_text(brby_2015_path)
brby_sen_2014 = extract_pdf_text(brby_2014_path)
brby_sen_2013 = extract_pdf_text(brby_2013_path)
brby_sen_2012 = extract_pdf_text(brby_2012_path)


Print the sentences extracted from the 2021 annual report.

In [120]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(brby_sen_2021)}")
print("\nPrint the first 10 sentences from Burberry annual reports: ")
print(brby_sen_2021[:10])
print(f"\nData type: {type(brby_sen_2021)}")

Total number of sentences of 2021: 7153

Print the first 10 sentences from Burberry annual reports: 
['ANNUAL REPORT 2020/21 ANNUAL REPORT 2020/21', 'INHERENT IN EVERY BURBERRY GARMENT IS FREEDOM Thomas Burberry', 'CONTENTS Strategic Report 2 Highlights 6 Chairman’s Letter 10 Chief Executive Officer’s Letter 14 Purpose 16 Business Model 18 Investment Case 20 Luxury Market Environment 24 Strategy 45 Key Performance Indicators 48 Group Financial Highlights 55 Capital Allocation Framework 56 Supporting Our Stakeholders During COVID-19 60 Environmental, Social and Governance 67 Our People 74 Our Communities 83 The Environment 92 Sustainability Bond 94 Non-Financial Information Statement 96 Stakeholder Engagement 104 Board Engagement 106 Risk and Viability Report 133 Task force on Climate- Related Financial Disclosures (TCFD) 138 Risk Management Activities in FY 2020/21 140 Our Viability StatementCorporate Governance Statement 146 Board Leadership and Company Purpose 146 Chairman’s Introduc

Second, standardise the sentences from each annual report.

In [121]:
# Standardise all sentences from all annual reports
brby_standsent_2022 = standardize_sentences(brby_sen_2022)
brby_standsent_2021 = standardize_sentences(brby_sen_2021)
brby_standsent_2020 = standardize_sentences(brby_sen_2020)
brby_standsent_2019 = standardize_sentences(brby_sen_2019)
brby_standsent_2018 = standardize_sentences(brby_sen_2018)
brby_standsent_2017 = standardize_sentences(brby_sen_2017)
brby_standsent_2016 = standardize_sentences(brby_sen_2016)
brby_standsent_2015 = standardize_sentences(brby_sen_2015)
brby_standsent_2014 = standardize_sentences(brby_sen_2014)
brby_standsent_2013 = standardize_sentences(brby_sen_2013)
brby_standsent_2012 = standardize_sentences(brby_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [122]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(brby_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Burberry annual reports: ")
print(brby_standsent_2021[:10])
print(f"\nData type: {type(brby_standsent_2021)}")

Total number of standardized sentences of 2021: 7153

Print the first 10 standardized sentences from Burberry annual reports: 
['annual report annual report', 'inherent every burberry garment freedom thomas burberry', 'contents strategic report highlights chairman letter chief executive officer letter purpose business model investment case luxury market environment strategy key performance indicators group financial highlights capital allocation framework supporting stakeholders covid environmental social governance people communities environment sustainability bond non financial information statement stakeholder engagement board engagement risk viability report task force climate related financial disclosures tcfd risk management activities fy viability statementcorporate governance statement board leadership company purpose chairman introduction board directors executive committee corporate governance report principal areas focus board fy division responsibilities governance structu

Next, combine the standardised sentences from all years into a single list.

In [123]:
# Combine all standardised sentence into a list
brby_standsent_all = (
    brby_standsent_2022 +
    brby_standsent_2021 +
    brby_standsent_2020 +
    brby_standsent_2019 +
    brby_standsent_2018 +
    brby_standsent_2017 +
    brby_standsent_2016 +
    brby_standsent_2015 +
    brby_standsent_2014 +
    brby_standsent_2013 +
    brby_standsent_2012
)

print(f'Total number of all standardised sentences: {len(brby_standsent_all)}')
print('\nThe first 10 Burberry standardised sentences:')
print(brby_standsent_all[:10])
print(f"\nData type: {type(brby_standsent_all)}")

Total number of all standardised sentences: 66398

The first 10 Burberry standardised sentences:
['annual report annual report', 'table contents strategic report chair letter chief executive officer letter financial highlights business model investment case luxury market environment strategy overview key performance indicators financial measures financial review capital allocation framework non financial sustainability information statement environmental social responsibility product planet people communities responsibility approach task force climate related financial disclosures stakeholder engagement board engagement risk viability report risk management activities viability statement corporate governance statement chair introduction board directors executive committee corporate governance report division responsibilities governance structure division responsibilities composition succession evaluation board evaluation report nomination committee audit risk internal control report au
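
The summary cells in this notebook repeat the same print statements with a hand-typed company label. A small helper that takes the label once could reduce copy-and-paste slips. The sketch below is one possible shape. `summarise_sentences` and its parameters are hypothetical and not part of the notebook's global functions.

```python
def summarise_sentences(sentences, label, n=10):
    """Print the count, the first `n` items and the type of a sentence list.

    Returns the number of sentences so the value can be reused or checked.
    """
    print(f"Total number of sentences ({label}): {len(sentences)}")
    print(f"\nThe first {n} sentences ({label}):")
    print(sentences[:n])
    print(f"\nData type: {type(sentences)}")
    return len(sentences)


# Sample taken from the standardised 2021 Burberry output above
count = summarise_sentences(
    ['annual report annual report',
     'inherent every burberry garment freedom thomas burberry'],
    label="Burberry 2021 standardised",
    n=1,
)
```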

## B18. Dunelm Group plc

In [124]:
# Setting the path of annual reports from 2012 to 2022
dnlm_2022_path = "AR/dunelm/dnlm-ar2022.pdf"
dnlm_2021_path = "AR/dunelm/dnlm-ar2021.pdf"
dnlm_2020_path = "AR/dunelm/dnlm-ar2020.pdf"
dnlm_2019_path = "AR/dunelm/dnlm-ar2019.pdf"
dnlm_2018_path = "AR/dunelm/dnlm-ar2018.pdf"
dnlm_2017_path = "AR/dunelm/dnlm-ar2017.pdf"
dnlm_2016_path = "AR/dunelm/dnlm-ar2016.pdf"
dnlm_2015_path = "AR/dunelm/dnlm-ar2015.pdf"
dnlm_2014_path = "AR/dunelm/dnlm-ar2014.pdf"
dnlm_2013_path = "AR/dunelm/dnlm-ar2013.pdf"
dnlm_2012_path = "AR/dunelm/dnlm-ar2012.pdf"

In [125]:
# Extract text from annual report
dnlm_sen_2022 = extract_pdf_text(dnlm_2022_path)
dnlm_sen_2021 = extract_pdf_text(dnlm_2021_path)
dnlm_sen_2020 = extract_pdf_text(dnlm_2020_path)
dnlm_sen_2019 = extract_pdf_text(dnlm_2019_path)
dnlm_sen_2018 = extract_pdf_text(dnlm_2018_path)
dnlm_sen_2017 = extract_pdf_text(dnlm_2017_path)
dnlm_sen_2016 = extract_pdf_text(dnlm_2016_path)
dnlm_sen_2015 = extract_pdf_text(dnlm_2015_path)
dnlm_sen_2014 = extract_pdf_text(dnlm_2014_path)
dnlm_sen_2013 = extract_pdf_text(dnlm_2013_path)
dnlm_sen_2012 = extract_pdf_text(dnlm_2012_path)


Print the sentences extracted from the 2021 annual report.

In [126]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(dnlm_sen_2021)}")
print("\nPrint the first 10 sentences from Dunelm annual reports: ")
print(dnlm_sen_2021[:10])
print(f"\nData type: {type(dnlm_sen_2021)}")

Total number of sentences of 2021: 6495

Print the first 10 sentences from Dunelm annual reports: 
['purposeGrowing withDUNELM GROUP PLC ANNUAL REPORT AND ACCOUNTS 2022 DUNELM GROUP PLC ANNUAL REPORT AND ACCOUNTS 2022', 'To help create the joy of truly feeling at home, now and for generations to come', 'Our purpose NickIn last year’s annual report, I talked about our renewed purpose and its relevance to the broader role we play in the lives of our stakeholders', 'Our purpose influences our Board and our colleagues in our decision-making', 'It prompts us to question why we do what we do, and it improves our recognition of how we help create the joy of truly feeling at home for our stakeholders – our customers, colleagues, store communities, suppliers, shareholders and all other people we deal with', 'As we continue to grow it is even more important that we use our purpose to guide us to do the right thing', 'Our plan is to become our customers’ 1st Choice for Home, across all products, 

Second, standardise the sentences from each annual report.

In [127]:
# Standardise all sentences from all annual reports
dnlm_standsent_2022 = standardize_sentences(dnlm_sen_2022)
dnlm_standsent_2021 = standardize_sentences(dnlm_sen_2021)
dnlm_standsent_2020 = standardize_sentences(dnlm_sen_2020)
dnlm_standsent_2019 = standardize_sentences(dnlm_sen_2019)
dnlm_standsent_2018 = standardize_sentences(dnlm_sen_2018)
dnlm_standsent_2017 = standardize_sentences(dnlm_sen_2017)
dnlm_standsent_2016 = standardize_sentences(dnlm_sen_2016)
dnlm_standsent_2015 = standardize_sentences(dnlm_sen_2015)
dnlm_standsent_2014 = standardize_sentences(dnlm_sen_2014)
dnlm_standsent_2013 = standardize_sentences(dnlm_sen_2013)
dnlm_standsent_2012 = standardize_sentences(dnlm_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [128]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(dnlm_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Dunelm annual reports: ")
print(dnlm_standsent_2021[:10])
print(f"\nData type: {type(dnlm_standsent_2021)}")

Total number of standardized sentences of 2021: 6495

Print the first 10 standardized sentences from Dunelm annual reports: 
['purposegrowing withdunelm group plc annual report accounts dunelm group plc annual report accounts', 'help create joy truly feeling home generations come', 'purpose nickin last year annual report talked renewed purpose relevance broader role play lives stakeholders', 'purpose influences board colleagues decision making', 'prompts us question improves recognition help create joy truly feeling home stakeholders customers colleagues store communities suppliers shareholders people deal', 'continue grow even important use purpose guide us right thing', 'plan become customers st choice home across products services experiences offer', 'increasingly demonstrate achieving sustainable responsible way generations come', 'report share examples purpose used guide strategic thinking actions', 'growing purpose believe better counter significant macroeconomic uncertainties ah

Next, combine the standardised sentences from all years into a single list.

In [129]:
# Combine all standardised sentence into a list
dnlm_standsent_all = (
    dnlm_standsent_2022 +
    dnlm_standsent_2021 +
    dnlm_standsent_2020 +
    dnlm_standsent_2019 +
    dnlm_standsent_2018 +
    dnlm_standsent_2017 +
    dnlm_standsent_2016 +
    dnlm_standsent_2015 +
    dnlm_standsent_2014 +
    dnlm_standsent_2013 +
    dnlm_standsent_2012
)

print(f'Total number of all standardised sentences: {len(dnlm_standsent_all)}')
print('\nThe first 10 Dunelm standardised sentences:')
print(dnlm_standsent_all[:10])
print(f"\nData type: {type(dnlm_standsent_all)}")

Total number of all standardised sentences: 43790

The first 10 Dunelm standardised sentences:
['purposegrowing withdunelm group plc annual report accounts dunelm group plc annual report accounts', 'help create joy truly feeling home generations come', 'purpose nickin last year annual report talked renewed purpose relevance broader role play lives stakeholders', 'purpose influences board colleagues decision making', 'prompts us question improves recognition help create joy truly feeling home stakeholders customers colleagues store communities suppliers shareholders people deal', 'continue grow even important use purpose guide us right thing', 'plan become customers st choice home across products services experiences offer', 'increasingly demonstrate achieving sustainable responsible way generations come', 'report share examples purpose used guide strategic thinking actions', 'growing purpose believe better counter significant macroeconomic uncertainties ahead keep stakeholders engaged 
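The year-by-year concatenation above could also be written as a loop over a dictionary keyed by year. The following is a minimal sketch rather than the notebook's own code: `combine_by_year` and the toy `standsent_by_year` dictionary are hypothetical names introduced only for illustration.

```python
from itertools import chain

def combine_by_year(sentences_by_year):
    """Concatenate per-year sentence lists, newest year first."""
    years = sorted(sentences_by_year, reverse=True)
    return list(chain.from_iterable(sentences_by_year[y] for y in years))

# Toy data standing in for the real per-year standardised lists
standsent_by_year = {
    2021: ["b1", "b2"],
    2022: ["a1"],
    2020: ["c1"],
}
combined = combine_by_year(standsent_by_year)
print(f"Total number of all standardised sentences: {len(combined)}")
```

Keeping the lists in a dictionary also means a missing year simply has no key, so no line needs to be deleted by hand.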

## B19. Halfords Group plc

In [130]:
# Setting the path of annual reports from 2012 to 2022
hfd_2022_path = "AR/halfords/hfd-ar2022.pdf"
hfd_2021_path = "AR/halfords/hfd-ar2021.pdf"
hfd_2020_path = "AR/halfords/hfd-ar2020.pdf"
hfd_2019_path = "AR/halfords/hfd-ar2019.pdf"
hfd_2018_path = "AR/halfords/hfd-ar2018.pdf"
hfd_2017_path = "AR/halfords/hfd-ar2017.pdf"
hfd_2016_path = "AR/halfords/hfd-ar2016.pdf"
hfd_2015_path = "AR/halfords/hfd-ar2015.pdf"
hfd_2014_path = "AR/halfords/hfd-ar2014.pdf"
hfd_2013_path = "AR/halfords/hfd-ar2013.pdf"
hfd_2012_path = "AR/halfords/hfd-ar2012.pdf"

In [131]:
# Extract text from annual report
hfd_sen_2022 = extract_pdf_text(hfd_2022_path)
hfd_sen_2021 = extract_pdf_text(hfd_2021_path)
hfd_sen_2020 = extract_pdf_text(hfd_2020_path)
hfd_sen_2019 = extract_pdf_text(hfd_2019_path)
hfd_sen_2018 = extract_pdf_text(hfd_2018_path)
hfd_sen_2017 = extract_pdf_text(hfd_2017_path)
hfd_sen_2016 = extract_pdf_text(hfd_2016_path)
hfd_sen_2015 = extract_pdf_text(hfd_2015_path)
hfd_sen_2014 = extract_pdf_text(hfd_2014_path)
hfd_sen_2013 = extract_pdf_text(hfd_2013_path)
hfd_sen_2012 = extract_pdf_text(hfd_2012_path)
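The path assignments and extraction calls above follow a single naming pattern, so they could be generated in a loop. This is a sketch under assumptions: `build_report_paths` and `extract_all` are hypothetical helpers, and a stand-in function replaces `extract_pdf_text` so that the example runs without the PDF files.

```python
def build_report_paths(folder, prefix, years):
    """Return {year: path} following the AR/<folder>/<prefix>-ar<year>.pdf pattern."""
    return {year: f"AR/{folder}/{prefix}-ar{year}.pdf" for year in years}

def extract_all(paths, extract):
    """Apply an extraction function to every path, keeping the year as the key."""
    return {year: extract(path) for year, path in paths.items()}

# Years 2022 down to 2012, matching the cells above
hfd_paths = build_report_paths("halfords", "hfd", range(2022, 2011, -1))
print(hfd_paths[2021])

# In the notebook, `extract` would be extract_pdf_text; a stand-in is used here
demo = extract_all(hfd_paths, extract=lambda path: [path])
print(len(demo))
```

A company with a missing year would be handled by leaving that year out of `years`.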


Print a summary of the sentences extracted from the 2021 annual report.

In [132]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(hfd_sen_2021)}")
print("\nPrint the first 10 sentences from Halfords annual reports: ")
print(hfd_sen_2021[:10])
print(f"\nData type: {type(hfd_sen_2021)}")

Total number of sentences of 2021: 6593

Print the first 10 sentences from Halfords annual reports: 
['Job Number 15 July 2021 7:09 pm Proof Number Halfords Group plc Annual Report and Accounts for the period ended 2 April 2021 Halfords Group plc Annual Report and Accounts for the period ended 2 April 2021 Stock code: HFDTo Inspire and Support a Lifetime of motoring and cycling 30441-Halfords-Annual-Report-2021-Strategic', 'indd 330441-Halfords-Annual-Report-2021-Strategic', 'indd 3 15/07/2021 19:12:3615/07/2021 19:12:36', 'Job Number 15 July 2021 7:09 pm Proof Number Halfords is the UK’s leading provider of motoring and cycling products and services', 'Our purpose is to Inspire and Support a Lifetime of motoring and cycling', 'Our vision is to be the super-specialists in motoring and cycling, trusted by the nation', 'Evolving into a consumer and B2B services-led business, positioned for long-term success', 'Our unique market position means we can offer customers products and services 

Second, standardise the sentences from each annual report.

In [133]:
# Standardise all sentences from all annual reports
hfd_standsent_2022 = standardize_sentences(hfd_sen_2022)
hfd_standsent_2021 = standardize_sentences(hfd_sen_2021)
hfd_standsent_2020 = standardize_sentences(hfd_sen_2020)
hfd_standsent_2019 = standardize_sentences(hfd_sen_2019)
hfd_standsent_2018 = standardize_sentences(hfd_sen_2018)
hfd_standsent_2017 = standardize_sentences(hfd_sen_2017)
hfd_standsent_2016 = standardize_sentences(hfd_sen_2016)
hfd_standsent_2015 = standardize_sentences(hfd_sen_2015)
hfd_standsent_2014 = standardize_sentences(hfd_sen_2014)
hfd_standsent_2013 = standardize_sentences(hfd_sen_2013)
hfd_standsent_2012 = standardize_sentences(hfd_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [134]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(hfd_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Halfords annual reports: ")
print(hfd_standsent_2021[:10])
print(f"\nData type: {type(hfd_standsent_2021)}")

Total number of standardized sentences of 2021: 6593

Print the first 10 standardized sentences from Halfords annual reports: 
['job number july pm proof number halfords group plc annual report accounts period ended april halfords group plc annual report accounts period ended april stock code hfdto inspire support lifetime motoring cycling halfords annual report strategic', 'indd halfords annual report strategic', 'indd', 'job number july pm proof number halfords uk leading provider motoring cycling products services', 'purpose inspire support lifetime motoring cycling', 'vision super specialists motoring cycling trusted nation', 'evolving consumer b b services led business positioned long term success', 'unique market position means offer customers products services motoring cycling needs halfords brand', 'proven strategic direction right highly skilled colleagues strong culture well positioned deliver stakeholders', 'offer unique proposition market leader motoring cycling products se

Next, combine the standardised sentences from all years into a single list.

In [135]:
# Combine all standardised sentences into a list
hfd_standsent_all = (
    hfd_standsent_2022 +
    hfd_standsent_2021 +
    hfd_standsent_2020 +
    hfd_standsent_2019 +
    hfd_standsent_2018 +
    hfd_standsent_2017 +
    hfd_standsent_2016 +
    hfd_standsent_2015 +
    hfd_standsent_2014 +
    hfd_standsent_2013 +
    hfd_standsent_2012
)

print(f'Total number of all standardised sentences: {len(hfd_standsent_all)}')
print('\nThe first 10 Halfords standardised sentences:')
print(hfd_standsent_all[:10])
print(f"\nData type: {type(hfd_standsent_all)}")

Total number of all standardised sentences: 55614

The first 10 Halfords standardised sentences:
['halfords group plc annual report accounts period ended april halfords group plc annual report accounts period ended april inspire support lifetime motoring cycling', 'contents group overview scale change group highlights purpose values strategy culture group glance chair statement investment case strategic report chief executive officer statement marketplace engagement stakeholders section statement create value strategy environmental social governance key performance indicators chief financial officer review risk management climate related financial disclosure tcfd principal risks uncertainties viability statement governance board directors directors report corporate governance report nomination committee report esg committee report audit committee report remuneration committee report directors remuneration policy summary report directors remuneration report directors responsibilities fi

## B20. JD Sports Fashion PLC

In [136]:
# Setting the path of annual reports from 2012 to 2022
jd_2022_path = "AR/jd_sports/jd-ar2022.pdf"
jd_2021_path = "AR/jd_sports/jd-ar2021.pdf"
jd_2020_path = "AR/jd_sports/jd-ar2020.pdf"
jd_2019_path = "AR/jd_sports/jd-ar2019.pdf"
jd_2018_path = "AR/jd_sports/jd-ar2018.pdf"
jd_2017_path = "AR/jd_sports/jd-ar2017.pdf"
jd_2016_path = "AR/jd_sports/jd-ar2016.pdf"
jd_2015_path = "AR/jd_sports/jd-ar2015.pdf"
jd_2014_path = "AR/jd_sports/jd-ar2014.pdf"
jd_2013_path = "AR/jd_sports/jd-ar2013.pdf"
jd_2012_path = "AR/jd_sports/jd-ar2012.pdf"

In [137]:
# Extract text from annual report
jd_sen_2022 = extract_pdf_text(jd_2022_path)
jd_sen_2021 = extract_pdf_text(jd_2021_path)
jd_sen_2020 = extract_pdf_text(jd_2020_path)
jd_sen_2019 = extract_pdf_text(jd_2019_path)
jd_sen_2018 = extract_pdf_text(jd_2018_path)
jd_sen_2017 = extract_pdf_text(jd_2017_path)
jd_sen_2016 = extract_pdf_text(jd_2016_path)
jd_sen_2015 = extract_pdf_text(jd_2015_path)
jd_sen_2014 = extract_pdf_text(jd_2014_path)
jd_sen_2013 = extract_pdf_text(jd_2013_path)
jd_sen_2012 = extract_pdf_text(jd_2012_path)



Print a summary of the sentences extracted from the 2021 annual report.

In [138]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(jd_sen_2021)}")
print("\nPrint the first 10 sentences from JD Sports annual reports: ")
print(jd_sen_2021[:10])
print(f"\nData type: {type(jd_sen_2021)}")

Total number of sentences of 2021: 6958

Print the first 10 sentences from JD Sports annual reports: 
['ANNUAL REPORT AND ACCOUNTS', 'CONTENTS OVERVIEW 2 HIGHLIGHTS 6 OUR BRANDS 30 WHERE WE ARE 36 EXECUTIVE CHAIRMAN’S STATEMENT STRATEGIC REPORT 44 BUSINESS MODEL 46 OUR STRATEGY 51 PRINCIPAL RISKS 79 BUSINESS REVIEW 83 FINANCIAL REVIEW 88 PROPERTY AND STORES REVIEW 96 CORPORATE AND SOCIAL RESPONSIBILITY 140 THE JD FOUNDATION 154 SECTION 172 STATEMENT GOVERNANCE 160 THE BOARD 162 DIRECTORS’ REPORT 168 CORPORATE GOVERNANCE REPORT 176 AUDIT COMMITTEE REPORT 179 DIRECTORS’ REMUNERATION REPORT FINANCIAL STATEMENTS 210 STATEMENT OF DIRECTORS’ RESPONSIBILITIES 212 INDEPENDENT AUDITOR’S REPORT 226 CONSOLIDATED INCOME STATEMENT 226 CONSOLIDATED STATEMENT OF COMPREHENSIVE INCOME 227 CONSOLIDATED STATEMENT OF FINANCIAL POSITION 228 CONSOLIDATED STATEMENT OF CHANGES IN EQUITY 229 CONSOLIDATED STATEMENT OF CASH FLOWS 230 NOTES TO THE CONSOLIDATED FINANCIAL STATEMENTS 313 COMPANY BALANCE SHEET 314 CO

Second, standardise the sentences from each annual report.

In [139]:
# Standardise all sentences from all annual reports
jd_standsent_2022 = standardize_sentences(jd_sen_2022)
jd_standsent_2021 = standardize_sentences(jd_sen_2021)
jd_standsent_2020 = standardize_sentences(jd_sen_2020)
jd_standsent_2019 = standardize_sentences(jd_sen_2019)
jd_standsent_2018 = standardize_sentences(jd_sen_2018)
jd_standsent_2017 = standardize_sentences(jd_sen_2017)
jd_standsent_2016 = standardize_sentences(jd_sen_2016)
jd_standsent_2015 = standardize_sentences(jd_sen_2015)
jd_standsent_2014 = standardize_sentences(jd_sen_2014)
jd_standsent_2013 = standardize_sentences(jd_sen_2013)
jd_standsent_2012 = standardize_sentences(jd_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [140]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(jd_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from JD Sports annual reports: ")
print(jd_standsent_2021[:10])
print(f"\nData type: {type(jd_standsent_2021)}")

Total number of standardized sentences of 2021: 6958

Print the first 10 standardized sentences from JD Sports annual reports: 
['annual report accounts', 'contents overview highlights brands executive chairman statement strategic report business model strategy principal risks business review financial review property stores review corporate social responsibility jd foundation section statement governance board directors report corporate governance report audit committee report directors remuneration report financial statements statement directors responsibilities independent auditor report consolidated income statement consolidated statement comprehensive income consolidated statement financial position consolidated statement changes equity consolidated statement cash flows notes consolidated financial statements company balance sheet company statement changes equity notes company financial statements group information financial calendar shareholder information five year record altern

Next, combine the standardised sentences from all years into a single list.

In [141]:
# Combine all standardised sentences into a list
jd_standsent_all = (
    jd_standsent_2022 +
    jd_standsent_2021 +
    jd_standsent_2020 +
    jd_standsent_2019 +
    jd_standsent_2018 +
    jd_standsent_2017 +
    jd_standsent_2016 +
    jd_standsent_2015 +
    jd_standsent_2014 +
    jd_standsent_2013 +
    jd_standsent_2012
)

print(f'Total number of all standardised sentences: {len(jd_standsent_all)}')
print('\nThe first 10 JD Sports standardised sentences:')
print(jd_standsent_all[:10])
print(f"\nData type: {type(jd_standsent_all)}")

Total number of all standardised sentences: 43655

The first 10 JD Sports standardised sentences:
['jd sports fashion plc annual report accounts years ofstyle', 'strategic report statement board interim non executive chair statement investment proposition business model strategy strategy action principal risks business financial review alternative performance measures property stores review environmental overview governance environmental tcfd environmental climate change environmental sector emissions data environmental sustainability social ethical sourcing social people social health safety social jd foundation social global empowerment governance section statement governance stakeholder engagement governance board directors directors report corporate governance report nominations committee report audit risk committee report directors remuneration report financial statements statement directors responsibilities independent auditor report consolidated income statement consolidated sta

## B21. Inchcape

The 2014 annual report is not available online.

In [142]:
# Setting the path of annual reports from 2012 to 2022 (2014 is unavailable)
inch_2022_path = "AR/inchcape/inch-ar2022.pdf"
inch_2021_path = "AR/inchcape/inch-ar2021.pdf"
inch_2020_path = "AR/inchcape/inch-ar2020.pdf"
inch_2019_path = "AR/inchcape/inch-ar2019.pdf"
inch_2018_path = "AR/inchcape/inch-ar2018.pdf"
inch_2017_path = "AR/inchcape/inch-ar2017.pdf"
inch_2016_path = "AR/inchcape/inch-ar2016.pdf"
inch_2015_path = "AR/inchcape/inch-ar2015.pdf"
inch_2013_path = "AR/inchcape/inch-ar2013.pdf"
inch_2012_path = "AR/inchcape/inch-ar2012.pdf"

In [143]:
# Extract text from annual report
inch_sen_2022 = extract_pdf_text(inch_2022_path)
inch_sen_2021 = extract_pdf_text(inch_2021_path)
inch_sen_2020 = extract_pdf_text(inch_2020_path)
inch_sen_2019 = extract_pdf_text(inch_2019_path)
inch_sen_2018 = extract_pdf_text(inch_2018_path)
inch_sen_2017 = extract_pdf_text(inch_2017_path)
inch_sen_2016 = extract_pdf_text(inch_2016_path)
inch_sen_2015 = extract_pdf_text(inch_2015_path)
inch_sen_2013 = extract_pdf_text(inch_2013_path)
inch_sen_2012 = extract_pdf_text(inch_2012_path)


Print a summary of the sentences extracted from the 2021 annual report.

In [144]:
# print the sentences summary of 2021 annual report
print(f"Total number of sentences of 2021: {len(inch_sen_2021)}")
print("\nPrint the first 10 sentences from Inchcape annual reports: ")
print(inch_sen_2021[:10])
print(f"\nData type: {type(inch_sen_2021)}")

Total number of sentences of 2021: 7391

Print the first 10 sentences from Inchcape annual reports: 
['INCHCAPE ANNUAL REPORT AND ACCOUNTS 2021ANNUAL REPORT AND ACCOUNTS 2021Note: Spine set to 12', '5mm', 'Please adjust if necessary', 'INCHCAPE IS ON AN AMBITIOUS GROWTH JOURNEY As the leading automotive distributor in a highly fragmented global market, we have developed a ‘plug-and-play’ distribution platform and built our digital and data capability to create a significant competitive advantage', 'We are also uniquely positioned to capture more of a vehicle’s lifetime value', 'Our commitment to return shareholder value through organic growth, consolidation and cash returns will be delivered by our Accelerate strategy and is underpinned by our Responsible Business framework, ‘Driving What Matters’', 'STRATEGIC REPORT 2 Our business model 4 Our strategy 6 Chairman’s welcome 8 Chief Executive’s review 12 Facing into the future 14 Acquisition progress 16 Stakeholder engagement 20 Key perf

Second, standardise the sentences from each annual report.

In [145]:
# Standardise all sentences from all annual reports
inch_standsent_2022 = standardize_sentences(inch_sen_2022)
inch_standsent_2021 = standardize_sentences(inch_sen_2021)
inch_standsent_2020 = standardize_sentences(inch_sen_2020)
inch_standsent_2019 = standardize_sentences(inch_sen_2019)
inch_standsent_2018 = standardize_sentences(inch_sen_2018)
inch_standsent_2017 = standardize_sentences(inch_sen_2017)
inch_standsent_2016 = standardize_sentences(inch_sen_2016)
inch_standsent_2015 = standardize_sentences(inch_sen_2015)
inch_standsent_2013 = standardize_sentences(inch_sen_2013)
inch_standsent_2012 = standardize_sentences(inch_sen_2012)

Print a short summary of the standardised sentences from the 2021 annual report.

In [146]:
# print the standardized sentences summary of 2021 annual report
print(f"Total number of standardized sentences of 2021: {len(inch_standsent_2021)}")
print("\nPrint the first 10 standardized sentences from Inchcape annual reports: ")
print(inch_standsent_2021[:10])
print(f"\nData type: {type(inch_standsent_2021)}")

Total number of standardized sentences of 2021: 7391

Print the first 10 standardized sentences from Inchcape annual reports: 
['inchcape annual report accounts annual report accounts note spine set', 'mm', 'please adjust necessary', 'inchcape ambitious growth journey leading automotive distributor highly fragmented global market developed plug play distribution platform built digital data capability create significant competitive advantage', 'also uniquely positioned capture vehicle lifetime value', 'commitment return shareholder value organic growth consolidation cash returns delivered accelerate strategy underpinned responsible business framework driving matters', 'strategic report business model strategy chairman welcome chief executive review facing future acquisition progress stakeholder engagement key performance indicators investment case operating financial review responsible business task force climate related financial disclosures non financial information statement risk man

Next, combine the standardised sentences from all years into a single list.

In [147]:
# Combine all standardised sentences into a list
inch_standsent_all = (
    inch_standsent_2022 +
    inch_standsent_2021 +
    inch_standsent_2020 +
    inch_standsent_2019 +
    inch_standsent_2018 +
    inch_standsent_2017 +
    inch_standsent_2016 +
    inch_standsent_2015 +
    inch_standsent_2013 +
    inch_standsent_2012
)

print(f'Total number of all standardised sentences: {len(inch_standsent_all)}')
print('\nThe first 10 Inchcape standardised sentences:')
print(inch_standsent_all[:10])
print(f"\nData type: {type(inch_standsent_all)}")

Total number of all standardised sentences: 67233

The first 10 Inchcape standardised sentences:
['ipl annual report', 'incitec pivot limited annual report incitec pivot limited annual report incitec pivot limited abn level freshwater place southbank victoria australia telephone facsimile www', 'incitecpivot', 'com', 'au', 'incitec pivot limited annual report us key operations ipl strategy snapshot performance outlook year review chairman report managing director ceo report operating financial review sustainable business zero harm number one company value people sustainability overview climate change caring communities governance corporate governance board directors executive team financial statutory reports directors report remuneration report financial report independent auditor report additional information shareholder information five year financial statistics glossary corporate directory contents', 'incitec pivot limited annual report', 'us ipl leading technology supplier resource
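The first sentences printed above mention "incitec pivot limited" rather than Inchcape, which suggests that the file at `inch_2022_path` may not be the Inchcape report. A quick check of the following kind could catch such a mismatch before the lists are combined. This is only an illustrative sketch: `mentions_company` is a hypothetical helper, and the two sample lists are shortened copies of the outputs shown in this section.

```python
def mentions_company(sentences, keyword, n=10):
    """Return True if `keyword` occurs in any of the first `n` sentences."""
    keyword = keyword.lower()
    return any(keyword in s.lower() for s in sentences[:n])

# Shortened copies of the printed outputs (assumed representative)
inch_2021_head = [
    "inchcape annual report accounts annual report accounts note spine set",
    "mm",
]
inch_2022_head = [
    "ipl annual report",
    "incitec pivot limited annual report",
]

print(mentions_company(inch_2021_head, "inchcape"))
print(mentions_company(inch_2022_head, "inchcape"))
```

A failed check does not prove the file is wrong, since a cover page may omit the company name, but it flags the report for manual inspection.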

## B22. Lookers plc 


In [148]:
# Setting the path of the combined annual reports file (2012 to 2022)
look_2012to2022_path = "AR/lookers/look-ar2012-2022.pdf"


In [149]:
# Extract text from annual report
look_sen_2012to2022 = extract_pdf_text(look_2012to2022_path)


Print a summary of the sentences extracted from the combined annual reports.

In [150]:
# print the sentences summary of annual report
print(f"Total number of sentences across all annual reports: {len(look_sen_2012to2022)}")
print("\nPrint the first 10 sentences from Lookers annual reports: ")
print(look_sen_2012to2022[:10])
print(f"\nData type: {type(look_sen_2012to2022)}")

Total number of sentences across all annual reports: 51730

Print the first 10 sentences from Lookers annual reports: 
['Annual Report & Accounts 31st December 2012', 'Lookers plc Registered Ofﬁce: 776 Chester Road, Stretford, Manchester, M32 0QH', 'Registered Number: 111876Lomond Audi – Glasgow Group Net Assets 5 year history£204m £197m £182m £160m £83m2012 2011 2010 2009 2008 £ million2012 2011 2010 2009 2008', 'Lookers plc Annual Report & A ccounts 2012 1CONTENTS2012 REVIEW“ Together we will strive to be an outstanding company achieving customers for life', '” COMPANY MISSION STATEMENT FINANCIAL CALENDAR 6 March 2013 Announcement of the results for the full year 30 May 2013 Annual General MeetingFinancial Highlights', '02 Chairman’s Review', '05 Chief Executive’s Review', '09 Finance Director’s Review', '17 Board of Directors', '20 Directors’ Report']

Data type: <class 'list'>


Second, standardise the sentences from the combined annual reports.

In [151]:
# Standardise all sentences from all annual reports
look_standsent_all = standardize_sentences(look_sen_2012to2022)


Print a short summary of the standardised sentences from the combined annual reports.

In [152]:
# print the standardized sentences summary of annual report
print(f"Total number of standardized sentences across all annual reports: {len(look_standsent_all)}")
print("\nPrint the first 10 standardized sentences from Lookers annual reports: ")
print(look_standsent_all[:10])
print(f"\nData type: {type(look_standsent_all)}")

Total number of standardized sentences across all annual reports: 51730

Print the first 10 standardized sentences from Lookers annual reports: 
['annual report accounts st december', 'lookers plc registered ce chester road stretford manchester qh', 'registered number lomond audi glasgow group net assets year history million', 'lookers plc annual report ccounts contents review together strive outstanding company achieving customers life', 'company mission statement financial calendar march announcement results full year may annual general meetingfinancial highlights', 'chairman review', 'chief executive review', 'finance director review', 'board directors', 'directors report']

Data type: <class 'list'>
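The same four print statements recur for every company in this part, so the labels have to be edited by hand each time. One way to keep them consistent is a small helper that builds the summary text. The sketch below is an assumption, not part of the original notebook: `summarise` and its parameters are hypothetical names.

```python
def summarise(sentences, company, label, n=10):
    """Build a short text summary: count, first `n` items and container type."""
    lines = [
        f"Total number of {label} sentences: {len(sentences)}",
        f"First {n} {label} sentences from {company} annual reports:",
        repr(sentences[:n]),
        f"Data type: {type(sentences)}",
    ]
    return "\n".join(lines)

# Two items copied from the Lookers output above, used as toy input
sample = ["annual report accounts st december", "chairman review"]
summary = summarise(sample, "Lookers", "standardised", n=1)
print(summary)
```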


## B23. McColl's

Annual reports are only available from 2012 to 2020.

In [153]:
# Setting the path of annual reports from 2012 to 2020
mcls_2012to2022_path = "AR/mccoll/mcls-ar2012-2020.pdf"


In [154]:
# Extract text from annual report
mcls_sen_2012to2022 = extract_pdf_text(mcls_2012to2022_path)


Print a summary of the sentences extracted from the combined annual reports.

In [155]:
# print the sentences summary of annual report
print(f"Total number of sentences across all annual reports: {len(mcls_sen_2012to2022)}")
print("\nPrint the first 10 sentences from McColl's annual reports: ")
print(mcls_sen_2012to2022[:10])
print(f"\nData type: {type(mcls_sen_2012to2022)}")

Total number of sentences across all annual reports: 20154

Print the first 10 sentences from McColl's annual reports: 
['McColl’s Retail Group plc Annual Report and Accounts 2020Your favourite neighbourhood shop', 'Contents Strategic report IFC Financial KPIs 1 2020 highlights 2 Chairman’s statement 4 Where we operate 6 What we offer 8 Chief Executive’s review 20 Marketplace 24 Our business model 26 Resources and relationships 30 Our key performance indicators 31 Financial review 35 Our approach to stakeholders 36 Non-ﬁnancial information statement 37 Sustainability review 42 Principal risks 48 Viability Statement Governance 49 Chairman’s introduction 50 Leadership and Company purpose 54 Division of responsibilities 56 Composition, succession and evaluation 60 Audit, Risk and Internal control 64 Remuneration 84 Directors’ report 89 Directors’ responsibilities statement Financial statements 90 Independent Auditor’s report to the members of McColl’s Retail Group plc 97 Consolidated income statement 97 Co

Second, standardise the sentences from the combined annual reports.

In [156]:
# Standardise all sentences from all annual reports
mcls_standsent_all = standardize_sentences(mcls_sen_2012to2022)


Print a short summary of the standardised sentences from the combined annual reports.

In [157]:
# print the standardized sentences summary of annual report
print(f"Total number of standardized sentences across all annual reports: {len(mcls_standsent_all)}")
print("\nPrint the first 10 standardized sentences from McColl's annual reports: ")
print(mcls_standsent_all[:10])
print(f"\nData type: {type(mcls_standsent_all)}")

Total number of standardized sentences across all annual reports: 20154

Print the first 10 standardized sentences from McColl's annual reports: 
['mccoll retail group plc annual report accounts favourite neighbourhood shop', 'contents strategic report ifc financial kpis highlights chairman statement operate offer chief executive review marketplace business model resources relationships key performance indicators financial review approach stakeholders non nancial information statement sustainability review principal risks viability statement governance chairman introduction leadership company purpose division responsibilities composition succession evaluation audit risk internal control remuneration directors report directors responsibilities statement financial statements independent auditor report members mccoll retail group plc consolidated income statement consolidated statement comprehensive income consolidated statement nancial position consolidated statement changes equity consolidated statement 

## B24. Pets at Home Group

Annual reports are only available from 2014 to 2022.

In [158]:
# Setting the path of annual reports from 2014 to 2022
pets_2012to2022_path = "AR/pets_at_home/pets-ar2014-2022.pdf"


In [159]:
# Extract text from annual report
pets_sen_2012to2022 = extract_pdf_text(pets_2012to2022_path)


Print a summary of the sentences extracted from the combined annual reports.

In [160]:
# print the sentences summary of annual report
print(f"Total number of sentences across all annual reports: {len(pets_sen_2012to2022)}")
print("\nPrint the first 10 sentences from Pets at Home annual reports: ")
print(pets_sen_2012to2022[:10])
print(f"\nData type: {type(pets_sen_2012to2022)}")

Total number of sentences across all annual reports: 40359

Print the first 10 sentences from Pets at Home annual reports: 
['Annual Report & Accounts 2023Creating a be/t_ter world for pets and the people who love them', 'What we do We provide the best products, services and advice to guide pet owners through their pet care journey Our unique proposition of products, services and advice allows us to deliver complete pet care to consumers in a way competitors cannot easily replicate, and enables us to continue to take share across both our key markets of retail and veterinary', 'For more information please visit: https:/ /investors', 'petsathome', 'comPets at Home Group Plc Annual Report & Accounts 2023', '01Strategic Report Governance Financial Statements £1,404', '2 m £136', '4 m£122', '5 m 12', '8 p+6']

Data type: <class 'list'>


Second, standardise the sentences from the combined annual reports.

In [161]:
# Standardise all sentences from all annual reports
pets_standsent_all = standardize_sentences(pets_sen_2012to2022)


Print a short summary of the standardised sentences from the combined annual reports.

In [162]:
# print the standardized sentences summary of annual report
print(f"Total number of standardized sentences across all annual reports: {len(pets_standsent_all)}")
print("\nPrint the first 10 standardized sentences from Pets at Home annual reports: ")
print(pets_standsent_all[:10])
print(f"\nData type: {type(pets_standsent_all)}")

Total number of standardized sentences across all annual reports: 40359

Print the first 10 standardized sentences from Pets at Home annual reports: 
['annual report accounts creating ter world pets people love', 'provide best products services advice guide pet owners pet care journey unique proposition products services advice allows us deliver complete pet care consumers way competitors cannot easily replicate enables us continue take share across key markets retail veterinary', 'information please visit https investors', 'petsathome', 'compets home group plc annual report accounts', 'strategic report governance financial statements', '', '', '', 'p']

Data type: <class 'list'>
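The output above contains empty strings (`''`) where nothing remained of a fragment after cleaning, and the standardised count (40359) equals the raw count, so these entries are still in the list. If they are not wanted downstream, they could be filtered out as sketched below. `drop_empty` is a hypothetical helper, and whether to filter at this stage is a design choice the notebook does not state.

```python
def drop_empty(sentences):
    """Remove entries that are empty or contain only whitespace."""
    return [s for s in sentences if s.strip()]

# Items copied from the Pets at Home output above
sample = ["strategic report governance financial statements", "", "", "", "p"]
cleaned = drop_empty(sample)
print(cleaned)
print(f"Removed {len(sample) - len(cleaned)} empty entries")
```

Single-character leftovers such as `'p'` pass this filter; a minimum word count would be needed to remove them as well.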


## B25. SCS Group Plc

Annual reports are available online from 2015 to 2022.

In [163]:
# Setting the path of annual reports from 2015 to 2022
scs_2012to2022_path = "AR/scs/scs-ar2015-2022.pdf"


In [164]:
# Extract text from annual report
scs_sen_2012to2022 = extract_pdf_text(scs_2012to2022_path)


Print a summary of the sentences extracted from the combined annual reports.

In [165]:
# print the sentences summary of annual report
print(f"Total number of sentences across all annual reports: {len(scs_sen_2012to2022)}")
print("\nPrint the first 10 sentences from SCS Group annual reports: ")
print(scs_sen_2012to2022[:10])
print(f"\nData type: {type(scs_sen_2012to2022)}")

Total number of sentences across all annual reports: 19541

Print the first 10 sentences from SCS Group annual reports: 
['The Sofa Carpet Specialist Annual Report 2015 Sofa Carpet Specialist ScS Group plc Annual Report 2015', 'See our website for more information www', 'scsplc', 'co', 'uk The Sofa Carpet Specialist ScS is one of the UK’s leading furniture and flooring retailers, operating from 96 stores', 'Principally located in modern retail park locations and 30 House of Fraser concessions across the country – as far north as Dundee and as far south as Plymouth, offering a focused range of upholstered furniture and floorcoverings', 'ScS has over 100 years of furniture retailing experience and our specialist staff are highly trained in their fields so that we can offer our customers the best service when they choose new sofas and flooring for their homes', '1 ScS Group plc Annual Report 2015Strategic Report Corporate Governance Financial Statements Gross sales up 13', '2% to £292', '2 million (2014: £

Second, standardise the sentences from the combined annual reports.

In [166]:
# Standardise all sentences from all annual reports
scs_standsent_all = standardize_sentences(scs_sen_2012to2022)


Print a short summary of the standardised sentences from the combined annual reports.

In [167]:
# print the standardized sentences summary of annual report
print(f"Total number of standardized sentences across all annual reports: {len(scs_standsent_all)}")
print("\nPrint the first 10 standardized sentences from SCS Group annual reports: ")
print(scs_standsent_all[:10])
print(f"\nData type: {type(scs_standsent_all)}")

Total number of standardized sentences across all annual reports: 19541

Print the first 10 standardized sentences from SCS Group annual reports: 
['sofa carpet specialist annual report sofa carpet specialist scs group plc annual report', 'see website information www', 'scsplc', 'co', 'uk sofa carpet specialist scs one uk leading furniture flooring retailers operating stores', 'principally located modern retail park locations house fraser concessions across country far north dundee far south plymouth offering focused range upholstered furniture floorcoverings', 'scs years furniture retailing experience specialist staff highly trained fields offer customers best service choose new sofas flooring homes', 'scs group plc annual report strategic report corporate governance financial statements gross sales', '', 'million']

Data type: <class 'list'>
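Fragments such as `'www'`, `'scsplc'`, `'co'` and the broken figure `'13'`, `'2% to £292'` in the outputs above arise because `extract_pdf_text` splits the text on every full stop. One possible refinement is to split only on a full stop that is followed by whitespace or the end of the text. The sketch below is not the notebook's method: `split_sentences` is a hypothetical alternative and the sample text is made up from phrases in the SCS output.

```python
import re

def split_sentences(text):
    """Split on full stops followed by whitespace or end of text.

    Full stops inside decimals (13.2) and web addresses (www.scsplc.co.uk)
    are not followed by whitespace, so they are left intact.
    """
    parts = re.split(r"\.(?=\s|$)", text)
    return [p.strip() for p in parts if p.strip()]

text = (
    "Gross sales up 13.2% to £292.2 million. "
    "See our website www.scsplc.co.uk for more information."
)
for sentence in split_sentences(text):
    print(sentence)
```

This rule still splits after abbreviations such as "St. Helier", and it cannot separate sentences that the PDF extraction has fused without a space, so it reduces the fragmentation rather than removing it.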


## B26. MYSALE Group 

Annual reports are available from 2014 to 2021.

In [168]:
# Setting the path of annual reports from 2014 to 2021
mysl_2012to2022_path = "AR/mysale/mysl-ar2014to2021.pdf"


In [169]:
# Extract text from annual report
mysl_sen_2012to2022 = extract_pdf_text(mysl_2012to2022_path)


Print a summary of the sentences extracted from the combined annual reports.

In [170]:
# Print a summary of the sentences extracted from all annual reports
print(f"Total number of sentences: {len(mysl_sen_2012to2022)}")
print("\nPrint the first 10 sentences from Mysale Group annual reports: ")
print(mysl_sen_2012to2022[:10])
print(f"\nData type: {type(mysl_sen_2012to2022)}")

Total number of sentences: 11667

Print the first 10 sentences from Mysale Group annual reports: 
['MySale Group Plc Corporate directory 30 June 2014 1 Directors David Mortimer AO - Independent Non-Executive Chairman Jamie Jackson - Executive Director and Vice Chairman Carl Jackson - Executive Director and Chief Executive Officer Andrew Dingle - Executive Director and Chief Financial Officer Adrian MacKenzie - Independent Non-Executive Director Head office 5/111 Old Pittwater Rd, Brookvale, NSW 2100, Australia Company secretary Prism Cosec Limited, 10 Margaret Street, London, W1W 8RL Registered office Ogier House, The Esplanade, St', 'Helier, JE4 9WG, Jersey Principal place of business United Kingdom: 959 Fulham Rd, London SW6 6HY Australia: 5/111 Old Pittwater Rd, Brookvale, NSW 2100 United States: 1107 S', 'Boyle Avenue, Los Angeles, CA 90023 Auditor PricewaterhouseCoopers,1 Embankment Place, London WC2N 6RH Solicitors United Kingdom: Linklaters LLP, One Silk Street, London, 

Second, standardise the sentences from all annual reports.

In [171]:
# Standardise all sentences from all annual reports
mysl_standsent_all = standardize_sentences(mysl_sen_2012to2022)


Print a short summary of the standardised sentences from all annual reports.

In [172]:
# Print a summary of the standardized sentences from all annual reports
print(f"Total number of standardized sentences: {len(mysl_standsent_all)}")
print("\nPrint the first 10 standardized sentences from Mysale Group annual reports: ")
print(mysl_standsent_all[:10])
print(f"\nData type: {type(mysl_standsent_all)}")

Total number of standardized sentences: 11667

Print the first 10 standardized sentences from Mysale Group annual reports: 
['mysale group plc corporate directory june directors david mortimer ao independent non executive chairman jamie jackson executive director vice chairman carl jackson executive director chief executive officer andrew dingle executive director chief financial officer adrian mackenzie independent non executive director head office old pittwater rd brookvale nsw australia company secretary prism cosec limited margaret street london w w rl registered office ogier house esplanade st', 'helier je wg jersey principal place business united kingdom fulham rd london sw hy australia old pittwater rd brookvale nsw united states', 'boyle avenue los angeles ca auditor pricewaterhousecoopers embankment place london wc n rh solicitors united kingdom linklaters llp one silk street london ec hq australia clayton utz level bligh street sydney nsw jersey ogier ogier house esp

## B27. Card Factory PLC

Annual reports are available from 2015 to 2022.

In [173]:
# Setting the path of annual reports from 2015 to 2022
card_2012to2022_path = "AR/card_factory/card-ar2015-2022.pdf"


In [174]:
# Extract text from annual report
card_sen_2012to2022 = extract_pdf_text(card_2012to2022_path)


Print a summary of the sentences extracted from the annual reports.

In [175]:
# Print a summary of the sentences extracted from all annual reports
print(f"Total number of sentences: {len(card_sen_2012to2022)}")
print("\nPrint the first 10 sentences from Card Factory annual reports: ")
print(card_sen_2012to2022[:10])
print(f"\nData type: {type(card_sen_2012to2022)}")

Total number of sentences: 32229

Print the first 10 sentences from Card Factory annual reports: 
['Celebrate life’s moments Annual Report and Accounts 2022', 'Card Factory sells more greeting cards in the UK than/uni00A0anyone else and is ranked #1 by shoppers on/uni00A0“wide range of cards” and/uni00A0“value for money”', '/uni00B9 Vision: Card Factory aspires to be recognised as the world’s best greeting/uni00A0card retailer: everywhere, and/uni00A0for all occasions, the first choice for/uni00A0greeting/uni00A0cards', 'Mission: Card Factory’s mission is helping people celebrate life moments by making our products affordable and/uni00A0available for everyone', '1 Source: Dynata February 2022', 'Card Factory is the UK’s leading specialist retailer of greeting cards, gifts, wrap and bags', 'Strategic Report 01 FY22 highlights 02 Welcome to Card Factory 04 Investment case 06 Chair’s statement 08 Market overview 10 Business model 12 Chief Executive Officer’s review 16 Our strategy

Second, standardise the sentences from all annual reports.

In [176]:
# Standardise all sentences from all annual reports
card_standsent_all = standardize_sentences(card_sen_2012to2022)


Print a short summary of the standardised sentences from all annual reports.

In [177]:
# Print a summary of the standardized sentences from all annual reports
print(f"Total number of standardized sentences: {len(card_standsent_all)}")
print("\nPrint the first 10 standardized sentences from Card Factory annual reports: ")
print(card_standsent_all[:10])
print(f"\nData type: {type(card_standsent_all)}")

Total number of standardized sentences: 32229

Print the first 10 standardized sentences from Card Factory annual reports: 
['celebrate life moments annual report accounts', 'card factory sells greeting cards uk uni anyone else ranked shoppers uni wide range cards uni value money', 'uni b vision card factory aspires recognised world best greeting uni card retailer everywhere uni occasions first choice uni greeting uni cards', 'mission card factory mission helping people celebrate life moments making products affordable uni available everyone', 'source dynata february', 'card factory uk leading specialist retailer greeting cards gifts wrap bags', 'strategic report fy highlights welcome card factory investment case chair statement market overview business model chief executive officer review strategy stakeholders chief financial officer review risk management esg strategy non financial information statement governance board directors chair letter corporate governance corporate go
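
The Card Factory output above contains tokens such as `/uni00A0` and `/uni00B9`, which survive standardisation as the stray word `uni`. A minimal sketch of one way to handle them before standardising is shown below. It assumes these tokens are glyph names of the form `/uni` plus four hexadecimal digits emitted by the PDF extraction; the function name `decode_uni_tokens` is hypothetical and not part of the original notebook.

```python
import re

# Matches tokens such as "/uni00A0" (four hexadecimal digits after "/uni").
UNI_TOKEN = re.compile(r"/uni([0-9A-Fa-f]{4})")


def decode_uni_tokens(text):
    """Replace each "/uniXXXX" token with the Unicode character it names."""
    decoded = UNI_TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Turn non-breaking spaces (U+00A0) into ordinary spaces
    return decoded.replace("\u00a0", " ")


sample = "sells more greeting cards in the UK than/uni00A0anyone else"
print(decode_uni_tokens(sample))
```

Applying such a step to the raw text before `standardize_sentences` would keep the word `uni` out of the standardised sentences, provided the assumption about the token format holds.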

## B28. N Brown Group plc

In [178]:
# Setting the path of annual reports from 2012 to 2022
bwng_2012to2022_path = "AR/n_brown_group/bwng-ar2012-2022.pdf"


In [179]:
# Extract text from annual report
bwng_sen_2012to2022 = extract_pdf_text(bwng_2012to2022_path)


Print a summary of the sentences extracted from the annual reports.

In [180]:
# Print a summary of the sentences extracted from all annual reports
print(f"Total number of sentences: {len(bwng_sen_2012to2022)}")
print("\nPrint the first 10 sentences from N Brown Group annual reports: ")
print(bwng_sen_2012to2022[:10])
print(f"\nData type: {type(bwng_sen_2012to2022)}")

Total number of sentences: 45875

Print the first 10 sentences from N Brown Group annual reports: 
['N Brown Group plc Annual Report and Accounts 2012FREEDOM TO', 'Young (30-45) fashionworld', 'co', 'uk simplybe', 'co', 'uk simplybe', 'com simplybe', 'de simplybe', 'eu simplyyours', 'co']

Data type: <class 'list'>


Second, standardise the sentences from all annual reports.

In [181]:
# Standardise all sentences from all annual reports
bwng_standsent_all = standardize_sentences(bwng_sen_2012to2022)


Print a short summary of the standardised sentences from all annual reports.

In [182]:
# Print a summary of the standardized sentences from all annual reports
print(f"Total number of standardized sentences: {len(bwng_standsent_all)}")
print("\nPrint the first 10 standardized sentences from N Brown Group annual reports: ")
print(bwng_standsent_all[:10])
print(f"\nData type: {type(bwng_standsent_all)}")

Total number of standardized sentences: 45875

Print the first 10 standardized sentences from N Brown Group annual reports: 
['n brown group plc annual report accounts freedom', 'young fashionworld', 'co', 'uk simplybe', 'co', 'uk simplybe', 'com simplybe', 'de simplybe', 'eu simplyyours', 'co']

Data type: <class 'list'>
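
The fragments `'co'`, `'uk simplybe'` and `'com simplybe'` above arise because the text is split on every full stop, including those inside web addresses and decimal numbers. A minimal sketch of a stricter splitter is shown below: it splits only where sentence-ending punctuation is followed by whitespace and a capital letter. This is an alternative to the `text.split('.')` step, not the method used in this notebook, and the name `split_sentences` is hypothetical.

```python
import re

# Split only where ".", "!" or "?" is followed by whitespace and a capital letter
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences while keeping URLs and decimals intact."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


sample = (
    "Gross sales up 13.2% to £292.2 million. "
    "See our website www.scsplc.co.uk for more information."
)
print(split_sentences(sample))
```

This rule still splits abbreviations such as "St. Helier", so a dedicated sentence tokenizer may be preferable when such cases matter.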


## B29. The Hut Group

Annual reports are available from 2015 to 2022.

In [183]:
# Setting the path of annual reports from 2015 to 2022
thg_2012to2022_path = "AR/thg/thg-ar2015-2022.pdf"


In [184]:
# Extract text from annual report
thg_sen_2012to2022 = extract_pdf_text(thg_2012to2022_path)


Print a summary of the sentences extracted from the annual reports.

In [185]:
# Print a summary of the sentences extracted from all annual reports
print(f"Total number of sentences: {len(thg_sen_2012to2022)}")
print("\nPrint the first 10 sentences from THG annual reports: ")
print(thg_sen_2012to2022[:10])
print(f"\nData type: {type(thg_sen_2012to2022)}")

Total number of sentences: 14637

Print the first 10 sentences from THG annual reports: 
['Annual Report and Financial Statements The Hut Group Limited', 'Company Number: 06539496 Year Ended 31 December 2015', 'CONTENTS 02 Directors and advisors24 Consolidated statement of changes in equity 03 Strategic report25 Consolidated statement of cash flows 12 Directors’ report26 Notes to the financial statements 18 Independent auditor’s report54 Company balance sheet 21 Consolidated statement of comprehensive income56 Company statement of changes in equity 23 Consolidated balance sheet 57 Notes to the Company financial statements', '2 THE HUT GROUP ANNUAL REPORT AND FINANCIAL STATEMENTS COMPANY SECRETARYAUDITORS Ernst & Young LLP 100 Barbirolli Square Manchester M2 3EY BANKERS Barclays Bank Plc 1 Churchill Place, London E14 5HP HSBC 8 Canada Square, Canary Wharf, London E14 5HQ Santander UK Plc 2 Triton Square, Regent’s Place, London NW1 3AN REGISTERED OFFICE The Hut Group Meridian Hou

Second, standardise the sentences from all annual reports.

In [186]:
# Standardise all sentences from all annual reports
thg_standsent_all = standardize_sentences(thg_sen_2012to2022)


Print a short summary of the standardised sentences from all annual reports.

In [187]:
# Print a summary of the standardized sentences from all annual reports
print(f"Total number of standardized sentences: {len(thg_standsent_all)}")
print("\nPrint the first 10 standardized sentences from THG annual reports: ")
print(thg_standsent_all[:10])
print(f"\nData type: {type(thg_standsent_all)}")

Total number of standardized sentences: 14637

Print the first 10 standardized sentences from THG annual reports: 
['annual report financial statements hut group limited', 'company number year ended december', 'contents directors advisors consolidated statement changes equity strategic report consolidated statement cash flows directors report notes financial statements independent auditor report company balance sheet consolidated statement comprehensive income company statement changes equity consolidated balance sheet notes company financial statements', 'hut group annual report financial statements company secretaryauditors ernst young llp barbirolli square manchester ey bankers barclays bank plc churchill place london e hp hsbc canada square canary wharf london e hq santander uk plc triton square regent place london nw registered office hut group meridian house rudheath gadbrook park northwich cheshire cw ra j p pochindirectors lloyds bank plc gresham street london ec v hn m

## B30. Mothercare plc

In [188]:
# Setting the path of annual reports from 2012 to 2022
mtc_2012to2022_path = "AR/mothercare/mtc-ar2012-2022.pdf"


In [189]:
# Extract text from annual report
mtc_sen_2012to2022 = extract_pdf_text(mtc_2012to2022_path)


Print a summary of the sentences extracted from the annual reports.

In [190]:
# Print a summary of the sentences extracted from all annual reports
print(f"Total number of sentences: {len(mtc_sen_2012to2022)}")
print("\nPrint the first 10 sentences from Mothercare annual reports: ")
print(mtc_sen_2012to2022[:10])
print(f"\nData type: {type(mtc_sen_2012to2022)}")

Total number of sentences: 44905

Print the first 10 sentences from Mothercare annual reports: 
['Mothercare plc Annual report and accounts 2012 www', 'mothercareplc', 'com Transformation and growth', 'Financial highlights Worldwide network sales £1,232', '4m +6', '4% Group sales £812', '7m +2', '4% Operating proﬁt £1', '6m -94', '4%UK operating loss £24']

Data type: <class 'list'>


Second, standardise the sentences from all annual reports.

In [191]:
# Standardise all sentences from all annual reports
mtc_standsent_all = standardize_sentences(mtc_sen_2012to2022)


Print a short summary of the standardised sentences from all annual reports.

In [192]:
# Print a summary of the standardized sentences from all annual reports
print(f"Total number of standardized sentences: {len(mtc_standsent_all)}")
print("\nPrint the first 10 standardized sentences from Mothercare annual reports: ")
print(mtc_standsent_all[:10])
print(f"\nData type: {type(mtc_standsent_all)}")

Total number of standardized sentences: 44905

Print the first 10 standardized sentences from Mothercare annual reports: 
['mothercare plc annual report accounts www', 'mothercareplc', 'com transformation growth', 'financial highlights worldwide network sales', '', 'group sales', '', 'operating pro', '', 'uk operating loss']

Data type: <class 'list'>
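
In the output above, "Operating proﬁt" becomes `'operating pro'`, which suggests that the "ﬁ" ligature (a single Unicode character) is lost during standardisation. A minimal sketch of one possible fix is to apply Unicode NFKC normalisation to the raw text first, which expands the ligature into the letters "f" and "i". The function name `normalise_ligatures` is hypothetical, and placing this step before `standardize_sentences` is an assumption about where it would fit.

```python
import unicodedata


def normalise_ligatures(text):
    """Expand compatibility characters such as the "fi" ligature (U+FB01)."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-character "fi" ligature found in some PDF fonts
print(normalise_ligatures("Operating pro\ufb01t"))
```

NFKC also rewrites other compatibility characters (for example, non-breaking spaces and superscript digits), so its effect on the rest of the text should be checked before adopting it.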


# Part C. Combine all sentences

The following code combines all standardised sentences from the 30 companies above.

In [193]:
ar_standsent_all = (
    mtc_standsent_all + #30 
    thg_standsent_all + #29
    bwng_standsent_all + #28
    card_standsent_all + #27
    mysl_standsent_all + #26
    scs_standsent_all + #25
    pets_standsent_all + #24
    mcls_standsent_all + #23
    look_standsent_all + #22
    inch_standsent_all + #21
    jd_standsent_all + #20
    hfd_standsent_all + #19
    dnlm_standsent_2012 + #18
    brby_standsent_2012 + #17
    fras_standsent_all + #16
    sgi_standsent_all + #15
    smwh_standsent_all + #14
    cury_standsent_all + #13
    bme_standsent_all + #12
    asos_standsent_all + #11
    ao_standsent_all + #10
    tbk_standsent_all + #9
    nxt_standsent_all + #8
    abf_standsent_all + #7
    coop_standsent_all + #6
    morri_standsent_all + #5
    jl_standsent_all + #4
    tesco_standsent_all + #3
    ms_standsent_all + #2
    sain_standsent_all #1    
)

Print a summary after combining all standardised sentences from the above companies.

In [194]:
print(f'Total number of standardised sentences {len(ar_standsent_all)}')
print("\nThe first 5 standardised sentences:")
print(ar_standsent_all[:5])
print("\nThe last 5 standardised sentences:")
print(ar_standsent_all[-5:])
print(f"\nData type: {type(ar_standsent_all)}")

Total number of standardised sentences 1133413

The first 5 standardised sentences:
['mothercare plc annual report accounts www', 'mothercareplc', 'com transformation growth', 'financial highlights worldwide network sales', '']

The last 5 standardised sentences:
['sainsbury accolade reflects work store pharmacies', 'official partner london paralympic games proud first ever paralympics sponsor', 'sponsorship helping us promote healthier active lifestyle across ages abilities', 'celebrating majesty diamond jubilee sainsbury celebrating majesty queen diamond jubilee support thames diamond jubilee pageant diamond jubilee beacons jubilee family festival woodland trust jubilee woods project', 'winner superm arket year winner convenience chain year annual report financial statements']

Data type: <class 'list'>


Convert the combined list to a pandas DataFrame.

In [195]:
import pandas as pd

# Create a dictionary with the 'ar_standsent_all' data
text = {'standardised sentences': ar_standsent_all}

# Create a pandas DataFrame
AR_30companies = pd.DataFrame(text)

# Print the summary
print(f'Total number of standardised sentences: {len(AR_30companies)}')
print("\nThe first 5 standardised sentences:")
print(AR_30companies.head(5))
print("\nThe last 5 standardised sentences:")
print(AR_30companies.tail(5))
print(f"\nData type: {type(AR_30companies)}")


Total number of standardised sentences: 1133413

The first 5 standardised sentences:
                         standardised sentences
0     mothercare plc annual report accounts www
1                                 mothercareplc
2                     com transformation growth
3  financial highlights worldwide network sales
4                                              

The last 5 standardised sentences:
                                    standardised sentences
1133408  sainsbury accolade reflects work store pharmacies
1133409  official partner london paralympic games proud...
1133410  sponsorship helping us promote healthier activ...
1133411  celebrating majesty diamond jubilee sainsbury ...
1133412  winner superm arket year winner convenience ch...

Data type: <class 'pandas.core.frame.DataFrame'>
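
Concatenating the lists discards which company each sentence came from. If that information is needed later, a minimal sketch of an alternative is to build the DataFrame from a dictionary keyed by company. The dictionary `sentences_by_company`, the `company` column and the short example lists are hypothetical stand-ins for the per-company variables above.

```python
import pandas as pd

# Hypothetical stand-ins for the per-company lists (e.g. mtc_standsent_all)
sentences_by_company = {
    "mothercare": ["mothercare plc annual report accounts www", "mothercareplc"],
    "thg": ["annual report financial statements hut group limited"],
}

# One row per sentence, labelled with the company it came from
rows = [
    {"company": company, "standardised sentences": sentence}
    for company, sentences in sentences_by_company.items()
    for sentence in sentences
]
labelled = pd.DataFrame(rows)
print(labelled["company"].value_counts())
```

With a `company` column in place, per-company sentence counts or sentiment summaries can be obtained with a simple `groupby`.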


As shown above, row 4 contains no text. Empty rows like this are removed to reduce the computational cost of later steps.

In [196]:
# Remove rows with empty sentences and reset the index
AR_30companies_1 = AR_30companies[AR_30companies['standardised sentences'] != ''].reset_index(drop=True)

# Print the summary of the updated DataFrame
print(f'Total number of standardised sentences: {len(AR_30companies_1)}')
print("\nThe first 5 standardised sentences:")
print(AR_30companies_1.head(5))
print("\nThe last 5 standardised sentences:")
print(AR_30companies_1.tail(5))
print(f"\nData type: {type(AR_30companies_1)}")


Total number of standardised sentences: 904498

The first 5 standardised sentences:
                         standardised sentences
0     mothercare plc annual report accounts www
1                                 mothercareplc
2                     com transformation growth
3  financial highlights worldwide network sales
4                                   group sales

The last 5 standardised sentences:
                                   standardised sentences
904493  sainsbury accolade reflects work store pharmacies
904494  official partner london paralympic games proud...
904495  sponsorship helping us promote healthier activ...
904496  celebrating majesty diamond jubilee sainsbury ...
904497  winner superm arket year winner convenience ch...

Data type: <class 'pandas.core.frame.DataFrame'>
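
Besides empty strings, the data still contains one-word fragments such as `mothercareplc` and `co` that carry little sentiment. A minimal sketch of an optional extra filter is shown below; it keeps only rows with at least `MIN_WORDS` words. The threshold and the example rows are assumptions for illustration, not values used in this project.

```python
import pandas as pd

# Illustrative rows only; the real data is AR_30companies_1
df = pd.DataFrame({
    "standardised sentences": [
        "mothercareplc",
        "co",
        "   ",
        "group sales",
        "financial highlights worldwide network sales",
    ]
})

MIN_WORDS = 2  # hypothetical threshold

# Count whitespace-separated words in each row, then keep the longer rows
word_counts = df["standardised sentences"].str.split().str.len()
filtered = df[word_counts >= MIN_WORDS].reset_index(drop=True)
print(filtered)
```

Because whitespace-only rows split into zero words, this filter also removes them.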


Save the above DataFrame as a CSV file.

In [197]:
# Save the DataFrame to a CSV file
AR_30companies_1.to_csv('ar_30companies.csv', index=True)
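
Because the file is written with `index=True`, the row index is stored as the first column. A minimal sketch of reading such a file back is shown below, using `index_col=0` so the stored index is not loaded as an extra data column. The temporary file name and the example rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Illustrative rows only; the real data is AR_30companies_1
df = pd.DataFrame({"standardised sentences": ["group sales", "com transformation growth"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "ar_demo.csv")  # hypothetical file name
    df.to_csv(csv_path, index=True)
    # index_col=0 restores the saved index instead of adding an unnamed column
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded)
```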

In [198]:
import os

current_directory = os.getcwd()
print(current_directory)


/Users/amosmbp14/Jupyter notebook/Summer_proj
