### **Background**

In some situations, we want to know what changes have been made to an updated Master Service Agreement. Identifying the changes by eye is tedious, and a small change is easy to overlook.

This notebook addresses the problem by using Python to extract the text from two versions of a Master Service Agreement and compare them sentence by sentence. The sentences that differ are printed out.

**A differing sentence falls into one of two cases:**  

**1. The sentence appears in both files, but the wording is not exactly the same.**

**2. The sentence appears in one file but not in the other.**  



The MSA (Master Service Agreement) files used in this notebook are from the Salesforce MSA archive: https://www.salesforce.com/company/legal/MSA-archive/

### **0. Web Scraping to Automatically Download the MSA Files From the Website**

The code below downloads the MSA files from the Salesforce website to Google Colab and renames them.  
The latest MSA is named "Salesforce_MSA_1"; the larger the number, the older the version.

In [1]:
# import libraries for web scraping
import requests
import urllib.request
from bs4 import BeautifulSoup

In [2]:
# url of the website
url = 'https://www.salesforce.com/company/legal/MSA-archive/'

# create a response of the website
response = requests.get(url)

# parse the text of the website
soup = BeautifulSoup(response.text, 'html.parser')

# find all the links in the website
links = soup.find_all('a')

file_num = 0
for link in links:
  # use .get() so anchors without an href attribute are skipped instead of raising KeyError
  pdf_link = link.get('href', '')

  # find pdf links
  if pdf_link.endswith('.pdf'):
    file_num += 1

    # get the url of the pdf link
    pdf_url = 'https:' + pdf_link

    # get the name of the pdf link
    file_name = link.text[8:]

    # download the pdf
    urllib.request.urlretrieve(pdf_url, 'Salesforce_MSA_'+str(file_num)+'.pdf')
    print(file_name, ' downloaded, named as Salesforce_MSA_'+str(file_num))

print('Downloaded '+str(file_num)+' files')

 Salesforce MSA (December 19, 2022 - Present)  downloaded, named as Salesforce_MSA_1
 Salesforce MSA (November 18, 2022 - December 18, 2022)  downloaded, named as Salesforce_MSA_2
 Salesforce MSA (September 15, 2022 - November 17, 2022)  downloaded, named as Salesforce_MSA_3
 Salesforce MSA (August 15, 2022 - September 14, 2022)   downloaded, named as Salesforce_MSA_4
 Salesforce MSA (August 4, 2022 - August 14, 2022)   downloaded, named as Salesforce_MSA_5
 Salesforce MSA (May 16, 2022 - August 3, 2022)  downloaded, named as Salesforce_MSA_6
 Salesforce MSA (April 4, 2022 - May 15, 2022)  downloaded, named as Salesforce_MSA_7
 Salesforce MSA (February 18, 2022 - April 3, 2022)  downloaded, named as Salesforce_MSA_8
 Salesforce MSA (November 22, 2021 - February 17, 2022)  downloaded, named as Salesforce_MSA_9
 Salesforce MSA (September 27, 2021 - November 21, 2021)  downloaded, named as Salesforce_MSA_10
 Salesforce MSA (April 23, 2021 - September 26, 2021)  downloaded, named as Salesf
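
The loop above builds each download URL by prefixing `'https:'` to the `href`, which works only when the link is protocol-relative (starts with `//`). As a minimal sketch, assuming the page might also contain absolute or site-relative links, `urllib.parse.urljoin` from the standard library resolves all three forms against the page URL. The helper name `resolve_pdf_url` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# page the links were scraped from (same url as in the cell above)
BASE_URL = 'https://www.salesforce.com/company/legal/MSA-archive/'

def resolve_pdf_url(href, base_url=BASE_URL):
  '''
  Hypothetical helper: turn an href found on the page into a full download url
  '''
  return urljoin(base_url, href)

# made-up hrefs covering the three common forms
print(resolve_pdf_url('//example.com/docs/msa.pdf'))        # protocol-relative
print(resolve_pdf_url('/docs/msa.pdf'))                     # site-relative
print(resolve_pdf_url('https://example.com/docs/msa.pdf'))  # already absolute
```

With this helper, the download loop would not depend on how the site happens to write its links.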

### **1. Extracting Text From PDF Files**

The code below defines a function that extracts the text from a PDF file:

In [None]:
!pip install pdfminer.six

In [4]:
# import libraries for extracting the text
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

In [5]:
def extract_text_from_pdf(pdf_name):
  '''
  This function takes a PDF file name (without the ".pdf" extension) and returns the text content of the PDF file as a string

  '''

  with open(str(pdf_name)+'.pdf', 'rb') as file:
    # create a PDF resource manager object to store resources
    resource_manager = PDFResourceManager()

    # create a string buffer object to store the extracted text from the PDF file
    string_io = StringIO()

    # create a text converter object to convert the PDF content into text
    converter = TextConverter(resource_manager, string_io, codec='utf-8', laparams = LAParams(char_margin=5, all_texts=True))

    # create a PDF page interpreter object to process the PDF pages and extract the text content
    interpreter = PDFPageInterpreter(resource_manager, converter)

    # extract the text from each page
    for page in PDFPage.get_pages(file, caching=True, check_extractable=True):
      interpreter.process_page(page)

    # get the text content stored in the string_io buffer object
    text = string_io.getvalue()
  
  return text
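
pdfminer.six also ships a higher-level helper, `pdfminer.high_level.extract_text`, that wraps the resource manager, converter, and interpreter setup shown above. The following is a minimal sketch, not the notebook's method: the function names `with_pdf_suffix` and `extract_text_simple` are hypothetical, and the `LAParams` values are simply copied from the function above. The extracted text should be close to what `extract_text_from_pdf` returns, but that is an assumption, not something verified here.

```python
def with_pdf_suffix(pdf_name):
  '''
  Hypothetical helper: accept a name with or without the ".pdf" extension
  '''
  pdf_name = str(pdf_name)
  return pdf_name if pdf_name.lower().endswith('.pdf') else pdf_name + '.pdf'

def extract_text_simple(pdf_name):
  '''
  Sketch: return the text of a PDF using pdfminer.six's high-level API
  '''
  # imported here so the helper above can be used without pdfminer.six installed
  from pdfminer.high_level import extract_text
  from pdfminer.layout import LAParams

  laparams = LAParams(char_margin=5, all_texts=True)
  return extract_text(with_pdf_suffix(pdf_name), laparams=laparams)

print(with_pdf_suffix('Salesforce_MSA_1'))      # Salesforce_MSA_1.pdf
print(with_pdf_suffix('Salesforce_MSA_1.pdf'))  # Salesforce_MSA_1.pdf
```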

### **2. Cleaning Text Using Regular Expressions**

Extracting text from PDF files can leave messy formatting behind (for example, extra spaces between words). Cleaning it up helps the NLP model split the text into sentences more reliably, and it reduces the influence of formatting differences when two documents are compared.

The particular files we are dealing with may also have their own quirks (for example, a footer on every page). This unwanted text should be removed before the two documents are compared; otherwise the comparison output becomes redundant. Such cleaning rules have to be set up on a case-by-case basis, so some parts of the code below are customized for the Salesforce MSA files.

In [6]:
# import library for cleaning string
import re

In [74]:
def clean_text(text_string):
  '''
  This function takes a string and returns the cleaned string based on the regex rules

  '''

  # remove line breaks
  cleaned_text = text_string.replace('\n', '')

  # remove extra whitespaces
  cleaned_text = re.sub(r'\s+', ' ', cleaned_text)

  # clean extra whitespaces before a set of punctuations
  cleaned_text = re.sub(r'\s+([-.,;:)#/&*”>%@])', r'\1', cleaned_text)

  # clean extra whitespaces after a set of punctuations
  cleaned_text = re.sub(r'([-(#/&*“<@]+)(\s+)', r'\1', cleaned_text)

  # remove "Back to Top" navigator in the pdf file
  cleaned_text = cleaned_text.replace('Back to Top', '')

  # remove the footer of each page
  cleaned_text = re.sub(r'Page \d{1,2} of \d{1,2}', '', cleaned_text)
  cleaned_text = re.sub(r'SFDC-MSA, [A-Za-z]+ \d{1,2}, \d{4}', '', cleaned_text)
  cleaned_text = re.sub(r'SFDC-MSA [A-Za-z]+ \d{1,2}, \d{4}', '', cleaned_text)
  cleaned_text = re.sub(r'SFDC-MSA [A-Za-z]+ \d{4}', '', cleaned_text)
  cleaned_text = re.sub(r'SFDC-MSA [A-Za-z]+ \d{1,2} \d{4}', '', cleaned_text)
  
  # remove extra whitespaces
  cleaned_text = re.sub(r'\s+', ' ', cleaned_text)

  return cleaned_text
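
The four `SFDC-MSA` footer rules above differ only in an optional comma and an optional day of the month. As a hedged sketch, they can be folded into one pattern; note that the combined pattern also accepts a couple of variants the original rules do not list (for example a comma followed by a month and year only). The names `FOOTER_DATE`, `PAGE_NUMBER`, and `strip_footers` and the sample string are made up for illustration.

```python
import re

# one pattern covering the four "SFDC-MSA ..." footer variants handled above:
# optional comma after "SFDC-MSA", a month name, an optional day, a 4-digit year
FOOTER_DATE = re.compile(r'SFDC-MSA,? [A-Za-z]+ (?:\d{1,2},? )?\d{4}')
PAGE_NUMBER = re.compile(r'Page \d{1,2} of \d{1,2}')

def strip_footers(text):
  '''
  Sketch: remove page footers, then collapse the whitespace left behind
  '''
  text = PAGE_NUMBER.sub('', text)
  text = FOOTER_DATE.sub('', text)
  return re.sub(r'\s+', ' ', text).strip()

# made-up sample in the shape of the extracted text
sample = ('SFDC will not materially decrease Page 3 of 18 '
          'SFDC-MSA, December 19, 2022 the overall security of the Services.')
print(strip_footers(sample))
```

Running the sketch on the sample leaves a single continuous sentence, which is what the sentence tokenizer in the next step needs.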

### **3. Using A Pre-trained NLP Model to Tokenize Text into Sentences**

After cleaning up the text, the next step is to split it into sentences so we can compare the sentences from two MSA versions.  

spaCy is a popular NLP library whose pre-trained pipelines can split text into individual sentences. With the `en_core_web_sm` pipeline used below, sentence boundaries are predicted by a statistical model rather than by punctuation rules alone, which helps with abbreviations, numbered clauses, and the long, heavily punctuated sentences found in contracts. spaCy also provides pipelines for other languages; this notebook uses the English one.

In [None]:
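!pip install spacy
# download the English pipeline loaded by sentencize_text() if it is not already installed
!python -m spacy download en_core_web_sm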

In [None]:
import spacy

In [10]:
# define a function to use spaCy's pre-trained sentence tokenizer model
def sentencize_text(text):
  '''
  This function takes in a string of text and returns a list of sentences.
  
  '''
  # Load the English NLP model
  nlp = spacy.load("en_core_web_sm")
  
  # run the pipeline on the text and collect the sentences it finds
  doc = nlp(text)
  sentences_list = [sent.text.strip() for sent in doc.sents]

  return sentences_list
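
To see why a trained model is used here rather than a punctuation rule, the sketch below splits on sentence-ending punctuation followed by whitespace and an uppercase letter. This naive splitter is not what spaCy does; it is a stand-in for illustration, and `naive_sentences` and the sample text are made up.

```python
import re

def naive_sentences(text):
  # split after '.', '!' or '?' when followed by whitespace and an uppercase letter
  return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

sample = ('This Agreement is governed by U.S. Federal law. '
          'Customer will comply with it.')

for sentence in naive_sentences(sample):
  print(repr(sentence))
```

The sample contains two sentences, but the rule returns three pieces because it breaks after the abbreviation "U.S.". Legal text is full of such abbreviations and clause numbers, so a rule like this would produce many spurious differences in the comparison step.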

### **4. Comparing Two Files Sentence by Sentence**

The code below defines a function that compares the text of two PDF files sentence by sentence:

In [11]:
# import the library for comparing sequences of strings
import difflib

In [57]:
# a function to identify different sentences in 2 PDF files
def find_different_sentences():
  '''
  This function prompts the user for two PDF file names (without the ".pdf" extension) and prints the sentences that differ between the two files.
    
  '''

  # prompt the user to enter the names of two PDF files
  pdf_name_1 = input('Enter the first file name and press enter: ')
  pdf_name_2 = input('Enter the second file name and press enter: ')

  # extract the text from the two PDF files
  pdf1_text = extract_text_from_pdf(pdf_name_1)
  pdf2_text = extract_text_from_pdf(pdf_name_2)

  # clean the extracted text
  pdf1_cleaned_text = clean_text(pdf1_text)
  pdf2_cleaned_text = clean_text(pdf2_text)

  # split the cleaned text into sentences
  pdf1_sentences = sentencize_text(pdf1_cleaned_text)
  pdf2_sentences = sentencize_text(pdf2_cleaned_text)

  # compare the sentences of the two PDF files; ndiff returns a generator of prefixed lines
  diff = difflib.ndiff(pdf1_sentences, pdf2_sentences)

  # keep only the sentences unique to one file ('- ' first file, '+ ' second file)
  differences = [line for line in diff if (line.startswith('- ') or line.startswith('+ '))]

  # print the different sentences along with the file name they belong to
  if differences:
    print("Below are the differences:\n")

    for line in differences:
      if line.startswith('- '):
        line = 'Sentence in the ' + str(pdf_name_1) + ':' + line[1:]
        print(line)
      else:
        line = 'Sentence in the ' + str(pdf_name_2) + ':' + line[1:]
        print(line)

  else:
    print("The PDF files are identical")
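
The filtering step relies on the two-character prefixes that `difflib.ndiff` puts in front of every line. The sketch below runs the same filter on two short, made-up sentence lists (not taken from the MSA files) so the prefixes can be seen in isolation.

```python
import difflib

# made-up sentence lists standing in for two tokenized documents
old_sentences = ['Definitions.',
                 'This Agreement was last updated on November 18, 2022.',
                 'Fees are non-refundable.']
new_sentences = ['Definitions.',
                 'This Agreement was last updated on December 19, 2022.',
                 'Fees are non-refundable.',
                 'Notices may be sent by email.']

# ndiff yields one string per line, prefixed with a two-character code:
# '  ' unchanged, '- ' only in the first list, '+ ' only in the second,
# '? ' a hint line marking intraline differences
diff = list(difflib.ndiff(old_sentences, new_sentences))

# keep only the changed sentences, as find_different_sentences() does
differences = [line for line in diff if line.startswith(('- ', '+ '))]
for line in differences:
  print(line)
```

The reworded sentence shows up as a `- ` line followed by a `+ ` line, while the sentence that exists only in the second list shows up as a single `+ ` line. These correspond to the two cases listed in the Background section.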

### **5. Testing for Salesforce MSA Files**

The cells below test the function `find_different_sentences()` on the Salesforce MSA files:




In [22]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_1
Enter the second file name and press enter: Salesforce_MSA_2
Below are the differences:

Sentence in the Salesforce_MSA_1: This Agreement was last updated on December 19, 2022.
Sentence in the Salesforce_MSA_2: This Agreement was last updated on November 18, 2022.
Sentence in the Salesforce_MSA_1: With respect to Customers domiciled in the United Kingdom, Section 12.3 “Entire Agreement and Order of Precedence” of this Agreement is replaced with the following section: 12.3 Entire Agreement and Order of Precedence.
Sentence in the Salesforce_MSA_2: Section 12.3 “Entire Agreement and Order of Precedence” of this Agreement is replaced with the following section: 12.3 Entire Agreement and Order of Precedence.


In [26]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_2
Enter the second file name and press enter: Salesforce_MSA_3
Below are the differences:

Sentence in the Salesforce_MSA_2: This Agreement was last updated on November 18, 2022.
Sentence in the Salesforce_MSA_3: This Agreement was last updated on, September 15, 2022.
Sentence in the Salesforce_MSA_2: Customer will not permit any User to access or use any Service or Content in a U.S.-embargoed country or region (currently the Crimea, Luhansk or Donetsk regions, Cuba, Iran, North Korea, or Syria) or as may be updated from time to time at https://www.salesforce.com/company/legal/compliance/ or in violation of any U.S. export law or regulation.
Sentence in the Salesforce_MSA_3: Customer will not permit any User to access or use any Service or Content in a U.S.-embargoed country or region (currently the Crimea, Luhansk or Donetsk regions, Cuba, Iran, North Korea, Sudan or Syria) or as may be updated from time to time at https://www.

In the output above, a run of "Sentence in the Salesforce_MSA_2" lines has no corresponding "Sentence in the Salesforce_MSA_3" lines. These sentences appear only in Salesforce_MSA_2, which means they were added in the newer version of the MSA.
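
Spotting such unpaired runs by eye gets harder as the output grows. As a hedged sketch of one way to separate them automatically, `difflib.SequenceMatcher.get_opcodes()` labels each differing region as `replace`, `delete`, or `insert`. The function `classify_changes` and the sample sentences are hypothetical and are not part of the notebook's pipeline.

```python
import difflib

def classify_changes(sentences_1, sentences_2):
  '''
  Sketch: group differences into changed, removed, and added sentences
  '''
  matcher = difflib.SequenceMatcher(None, sentences_1, sentences_2, autojunk=False)
  changes = {'changed': [], 'only_in_first': [], 'only_in_second': []}

  for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'replace':
      # a run in the first list was replaced by a run in the second list
      changes['changed'].append((sentences_1[i1:i2], sentences_2[j1:j2]))
    elif tag == 'delete':
      changes['only_in_first'].extend(sentences_1[i1:i2])
    elif tag == 'insert':
      changes['only_in_second'].extend(sentences_2[j1:j2])

  return changes

# made-up sentences: the first list plays the role of the newer version
newer = ['A.', 'Updated on November 18, 2022.', 'B.', 'New clause.', 'C.']
older = ['A.', 'Updated on September 15, 2022.', 'B.', 'C.']

result = classify_changes(newer, older)
print(result['changed'])
print(result['only_in_first'])
print(result['only_in_second'])
```

When the newer file is passed first, sentences under `only_in_first` are the ones added in the newer version.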

In [52]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_3
Enter the second file name and press enter: Salesforce_MSA_4
Below are the differences:

Sentence in the Salesforce_MSA_3: This Agreement was last updated on, September 15, 2022.
Sentence in the Salesforce_MSA_4: This Agreement was last updated on August 15, 2022.
Sentence in the Salesforce_MSA_3: During an applicable subscription term (a) this Agreement, the Order Forms and the Documentation will accurately describe the applicable administrative, physical, and technical safeguards for protection of the security, confidentiality and integrity of Customer Data, (b) SFDC will not materially decrease the overall security of the Services, (c) the Services will perform materially in accordance with the applicable Documentation, and (d) subject to the “Integration with Non-SFDC Applications” section above, SFDC will not materially decrease the overall functionality of the Services.
Sentence in the Salesforce_MSA_4: Pendant une pério

In [54]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_4
Enter the second file name and press enter: Salesforce_MSA_5
Below are the differences:

Sentence in the Salesforce_MSA_4: This Agreement was last updated on August 15, 2022.
Sentence in the Salesforce_MSA_5: This Agreement was last updated on August 4, 2022.
Sentence in the Salesforce_MSA_4: Legal Department-Level 1, Block A, Nova Atria North, Sandyford Business District, Dublin 18, Ireland Salesforce.com Sarl, Route de la Longeraie 9, Morges, 1110, Switzerland, attn: Director, EMEA Sales Operations, with a copy to attn.:
Sentence in the Salesforce_MSA_5: Legal Department-3rd and 4th Floor, 1 Central Park Block G, Central Park, Leopardstown, Dublin 18, Ireland Salesforce.com Sarl, Route de la Longeraie 9, Morges, 1110, Switzerland, attn: Director, EMEA Sales Operations, with a copy to attn.:


In [60]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_5
Enter the second file name and press enter: Salesforce_MSA_6
Below are the differences:

Sentence in the Salesforce_MSA_5: This Agreement was last updated on August 4, 2022.
Sentence in the Salesforce_MSA_6: This Agreement was last updated on May 16, 2022.
Sentence in the Salesforce_MSA_5: Customer will defend SFDC and its Affiliates against any claim, demand, suit or proceeding made or brought against SFDC by a third party (a) alleging that the combination of a Non-SFDC Application or configuration provided by Customer and used with the Services, infringes or misappropriates such third party’s intellectual property rights, or (b) arising from (i) Customer’s use of the Services or Content in an unlawful manner or in violation of the Agreement, the Documentation, or Order Form, (ii) any Customer Data or Customer’s use of Customer Data with the Services, or (iii) a Non-SFDC Application provided by Customer(each a “Claim Against 

In [61]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_6
Enter the second file name and press enter: Salesforce_MSA_7
Below are the differences:

Sentence in the Salesforce_MSA_6: This Agreement was last updated on May 16, 2022.
Sentence in the Salesforce_MSA_7: This Agreement was last updated on April 4, 2022.
Sentence in the Salesforce_MSA_6: California and controlling United States federal law exclusive jurisdiction are: San Francisco, California, U.S.A. Brazil Canada France Germany
Sentence in the Salesforce_MSA_7: California and controlling United States federal law exclusive jurisdiction are: San Francisco, California, U.S.A. Brazil Canada France Germany Italy
Sentence in the Salesforce_MSA_6: India Italy Spain Salesforce Tecnologia Ltda.
Sentence in the Salesforce_MSA_7: Spain Salesforce Tecnologia Ltda.
Sentence in the Salesforce_MSA_6: Canada Corporation, a Nova Scotia corporation salesforce.com France, a French S.A.S company with a share capital of 37,000 €, registered wit

In [63]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_9
Enter the second file name and press enter: Salesforce_MSA_10
Below are the differences:

Sentence in the Salesforce_MSA_9: This Agreement was last updated on November 22, 2021.
Sentence in the Salesforce_MSA_10: This Agreement was last updated on September 27, 2021.
Sentence in the Salesforce_MSA_9: “Documentation” at https://trust.salesforce.com/en/trust-and-compliance-documentation/and its usage guides and policies, as updated from time to time, accessible via help.salesforce.com or login to the applicable Service.
Sentence in the Salesforce_MSA_10: “Documentation” means the applicable Service’s Trust and Compliance documentation at https://trust.salesforce.com/en/trust-and-compliance-documentation/and its usage guides and policies, as updated from time to time, accessible via help.salesforce.com or login to the applicable Service.
Sentence in the Salesforce_MSA_9: documentation Compliance applicable Service’s means Trust a

In [64]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_10
Enter the second file name and press enter: Salesforce_MSA_11
Below are the differences:

Sentence in the Salesforce_MSA_10: This Agreement was last updated on September 27, 2021.
Sentence in the Salesforce_MSA_11: This Agreement was last updated on April 23, 2021.
Sentence in the Salesforce_MSA_10: “Content” means information obtained by SFDC from publicly available sources or its third-party content providers and made available to Customer through the Services, Beta Services or pursuant to an Order Form, as more fully described in the Documentation.
Sentence in the Salesforce_MSA_11: “Content” means information obtained by SFDC from publicly available sources or its third party content providers and made available to Customer through the Services, Beta Services or pursuant to an Order Form, as more fully described in the Documentation.
Sentence in the Salesforce_MSA_10: “Documentation” means the applicable Service’s Trust a

In [65]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_11
Enter the second file name and press enter: Salesforce_MSA_12
Below are the differences:

Sentence in the Salesforce_MSA_11: This Agreement was last updated on April 23, 2021.
Sentence in the Salesforce_MSA_12: This Agreement was last updated on April 12, 2021.
Sentence in the Salesforce_MSA_11: It is effective between Customer and SFDC as of the date of Customer’s accepting this Agreement (the “Effective Date”).
Sentence in the Salesforce_MSA_12: It is effective between Customer and SFDC as of the date of Customer’s accepting this Agreement.


In [66]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_20
Enter the second file name and press enter: Salesforce_MSA_21
Below are the differences:

Sentence in the Salesforce_MSA_20: This Agreement was last updated on June 23rd, 2016.
Sentence in the Salesforce_MSA_21: This Agreement was last updated on January 9, 2016.
Sentence in the Salesforce_MSA_20: “Marketplace” means an online directory, catalog or marketplace of applications that interoperate with the Services, including, for example, the AppExchange located at http://www.salesforce.com/appexchange, ExactTarget’s HubExchange located at https://hubexchange.exacttarget.com/, or the Heroku add-ons catalog located at https://elements.heroku.com/, and any successor websites.
Sentence in the Salesforce_MSA_21: “Marketplace” means an online directory, catalog or marketplace of applications that interoperate with the Services, including, for example, the AppExchange located at http://www.salesforce.com/appexchange, ExactTarget’s Hub

In [69]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_30
Enter the second file name and press enter: Salesforce_MSA_31
Below are the differences:

Sentence in the Salesforce_MSA_30: Master Subscription Agreement THIS MASTER SUBSCRIPTION AGREEMENT (“AGREEMENT”) GOVERNS YOUR FREE TRIAL OF THE SERVICES.
Sentence in the Salesforce_MSA_31: http://www.salesforce.com/company/msa.jsp Master Subscription Agreement THIS MASTER SUBSCRIPTION AGREEMENT (“AGREEMENT”) GOVERNS YOUR 30-DAY FREE TRIAL OF THE SERVICES.
Sentence in the Salesforce_MSA_30: This Agreement was last updated on December 1, 2010.
Sentence in the Salesforce_MSA_31: This Agreement was last updated on January 31, 2009.
Sentence in the Salesforce_MSA_31: Definitions 2. 30-Day Free Trial 3.
Sentence in the Salesforce_MSA_30: Definitions 2.
Sentence in the Salesforce_MSA_30: Free Trial 3.
Sentence in the Salesforce_MSA_30: "Purchased Services" means Services that You or Your Affiliates purchase under an Order Form, as distinguishe

In [72]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_40
Enter the second file name and press enter: Salesforce_MSA_41
Below are the differences:

Sentence in the Salesforce_MSA_40: Master Subscription Agreement BY CLICKING THE "I ACCEPT" BUTTON DISPLAYED AS PART OF THE ORDERING PROCESS, YOU AGREE TO THE FOLLOWING TERMS AND CONDITIONS (THE "AGREEMENT") GOVERNING YOUR USE OF SALESFORCE.COM'S ONLINE SERVICE, INCLUDING OFFLINE COMPONENTS (COLLECTIVELY, THE "SERVICE").
Sentence in the Salesforce_MSA_41: Master Subscription Agreement Master Subscription Agreement BY CLICKING THE "I ACCEPT" BUTTON DISPLAYED AS PART OF THE ORDERING PROCESS, YOU AGREE TO THE FOLLOWING TERMS AND CONDITIONS (THE "AGREEMENT") GOVERNING YOUR USE OF SALESFORCE.COM'S ONLINE SERVICE (THE "SERVICE").
Sentence in the Salesforce_MSA_40: Intellectual Property Ownership Salesforce.com alone (and its licensors, where applicable) shall own all right, title and interest, including all related Intellectual Property Rights

In [73]:
find_different_sentences()

Enter the first file name and press enter: Salesforce_MSA_50
Enter the second file name and press enter: Salesforce_MSA_51
Below are the differences:

Sentence in the Salesforce_MSA_50: ONLINE SERVICE AGREEMENT 30-Apr-02 IMPORTANT.
Sentence in the Salesforce_MSA_51: ONLINE SERVICE AGREEMENT 10-Apr-02 IMPORTANT.
Sentence in the Salesforce_MSA_50: Welcome Customer’s registration for, or use of, the Service shall be taken to be Customer’s acceptance of, and agreement to, the terms of this Agreement including terms contained in any materials available on the salesforce.com website incorporated by reference herein, including but not limited to the Privacy and Security Statements.
Sentence in the Salesforce_MSA_51: Welcome As part of the Service, salesforce.com will provide Customer with use of the Service, including a browser interface and data encryption, transmission, access and storage.
Sentence in the Salesforce_MSA_51: Customer’s registration for, or use of, the Service shall be deemed