# "A look into the modern slavery statements released by companies towards the SDGs"
> "Modern slavery statements published by companies are summarized, scored for relevance with TF-IDF, and explored with knowledge graphs"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [natural language processing]
- image: images/slavery_img.png
- hide: false
- search_exclude: true

### About

The code is available here: https://github.com/IanLDias/Modern_slavery

Inspired by the work done by The Future Society (https://thefuturesociety.org/), this analysis aims to take their research further.

The UK Modern Slavery Act (MSA) requires companies above a certain annual turnover to report the steps they are taking to abolish modern slavery from their operations. Australia has now implemented a similar act. As a result, these companies publish thousands of PDF statements every year.

The aim of this project is to extract information from these documents and identify companies at risk of having modern slavery in their supply chains.


### Goals

The goal of this project is to implement an information extraction framework for these PDFs:

1) Text Summarization of the PDFs 
    - Extractive Approach: identifying key sentences and phrases and using these in a summary
    
2) Relevance Score
    - Calculates the TF-IDF of each document and compares document similarities
    - Makes a list of the most and least relevant documents (see the notebook `Relevance Score TF-IDF sklearn.ipynb`)
3) Identifying companies at risk
    - Testing knowledge graphs for information extraction and linkage
    

### Methods

- Google search results for the keyword 'modern slavery statements' were scraped, and every result that linked to a PDF was downloaded.

Tools used: BeautifulSoup, Requests, regex
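
The link-filtering part of this step can be sketched as follows. This is a minimal illustration and not the project's actual scraper: it uses only the Python standard library (`html.parser` and `re`) in place of BeautifulSoup and Requests, it works on an HTML string rather than a live search page, and the names `LinkCollector`, `extract_pdf_links`, and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches URLs whose path ends in .pdf, optionally followed by a query or fragment.
PDF_PATTERN = re.compile(r"\.pdf(?:[?#].*)?$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_pdf_links(html, base_url):
    """Return the absolute URLs of all links in `html` that point to a PDF."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if PDF_PATTERN.search(href):
            absolute = urljoin(base_url, href)  # resolve relative links
            if absolute not in links:           # skip duplicates, keep order
                links.append(absolute)
    return links


# Hypothetical page content, used only to demonstrate the function.
sample_html = """
<a href="/docs/statement-2020.pdf">Statement</a>
<a href="https://example.com/about.html">About</a>
<a href="https://example.com/files/MSA.PDF?download=1">Download</a>
"""
pdf_links = extract_pdf_links(sample_html, "https://example.com")
```

Each URL in `pdf_links` could then be downloaded and saved to disk for the text-extraction step.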

- The PDFs were then read using PyPDF, Textract, and OCR techniques. Textract was found to be the most accurate and was used for the rest of the project.

**1) Text summarization of the PDFs**

[Text summarization notebook](https://github.com/IanLDias/Modern_slavery/blob/main/notebooks/Extractive%20Text%20Summarization.ipynb)

Each PDF can be a dozen pages long, so a summarization method was needed. An extractive method was used; applied to the 10-page Coca-Cola PDF, it yielded the following summary:

"""*We prohibit the use of all forms of forced labour, including prison labour, indentured labour, bonded labour, military labour, slave labour and any form of human trafficking within our company and by any company that directly supplies or provides services to our business. It expressly prohibits the use of all forms of child labour and forced labour – including prison labour, indentured labour, bonded labour, military labour, slave labour and any form of human trafficking. Developed in partnership with The Coca-Cola Company, our SAGPs cover the Coca-Cola system's key agricultural ingredients, and define the standards we expect our agricultural ingredient suppliers to adhere to in terms of human and workplace rights – including prohibitions on modern slavery and child labour, the environment, and management systems.*"""

This summary is not yet as good as it needs to be. A possible extension is to use Named Entity Recognition to identify the important elements of the text and center the summary on them.
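
To make the idea of extractive summarization concrete, here is a minimal sketch of one common variant: each sentence is scored by the average corpus frequency of its words, and the top-scoring sentences are returned in their original order. This is an assumption about the general technique and not necessarily the method used in the notebook; the stop-word list, the function names, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "we",
    "our", "for", "on", "by", "that", "this", "it", "with", "any", "all",
}


def tokenize(text):
    """Lowercase the text and return its words without stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


# Hypothetical text, used only to demonstrate the function.
sample_text = (
    "We prohibit forced labour in our supply chain. "
    "Our head office is in London. "
    "Suppliers must report any forced labour risk in their supply chain. "
    "The weather was pleasant this year."
)
summary = summarize(sample_text, n_sentences=2)
```

Because the score rewards repeated vocabulary, a method like this tends to select sentences that restate the same theme, which is consistent with the overlap visible in the summary above.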

**2) Relevance Score**

Two approaches were used to calculate how relevant each document is in relation to all the other documents: a [manual approach](https://github.com/IanLDias/Modern_slavery/blob/main/notebooks/testing/Relevance%20Score%20-%20Manual%20approach.ipynb) and a more [optimized approach](https://github.com/IanLDias/Modern_slavery/blob/main/notebooks/Relevance%20Score%20TF-IDF%20sklearn.ipynb) built on scikit-learn.

- A spaCy language model was used to process the text
- The text was lemmatized and punctuation was removed
- Every word from every document was added to a shared vocabulary
- Each document was then described by the vocabulary words it contains
- The term frequency (the number of times a word appears in a particular document) and the inverse document frequency (a measure of how rare the word is across the corpus, which shrinks as more documents contain the word) were calculated
- Comparing the resulting TF-IDF scores gives a similarity score between documents

![image.png](attachment:image.png)
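
The TF-IDF steps listed above can be sketched in plain Python, in the spirit of the manual approach. This is a minimal illustration under stated assumptions and not the notebook's code: documents are assumed to be already lemmatized and tokenized, the unsmoothed formula `tf * log(N / df)` is used (scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its numbers will differ), and "relevance" is taken to be a document's mean cosine similarity to all other documents. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # A term present in every document gets weight log(1) = 0.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_relevance(tokenized_docs):
    """Rank documents by their mean similarity to all other documents."""
    vectors = tf_idf_vectors(tokenized_docs)
    n = len(vectors)
    scores = []
    for i in range(n):
        others = [cosine_similarity(vectors[i], vectors[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical, already-lemmatized documents.
docs = [
    ["supplier", "audit", "forced", "labour"],
    ["supplier", "audit", "child", "labour"],
    ["annual", "revenue", "growth", "report"],
]
order, scores = rank_by_relevance(docs)
```

Here the first two documents share vocabulary and score above zero, while the third shares no terms with the others and lands at the bottom of the ranking, which is the kind of "least relevant" document the relevance list is meant to surface.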