# 04 Company Scoring (i)

## 04.1 Cleaning

In [1]:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
import matplotlib.pyplot as plt
from collections import Counter
import re
import joblib

We first import the datasets.

In [2]:
downstream = pd.read_csv('Downstream.csv', encoding='latin1')
inputgoods = pd.read_csv('InputGoods.csv', encoding='latin1')
gov_docs = pd.read_csv('preprocessed_gov_docs.csv', encoding='latin1')

This analysis covers only goods produced with child labour, so we drop every good that has no recorded link to child labour.

In [3]:
downstream = downstream.dropna(subset=['Child Labor'])
downstream

Unnamed: 0,Country/Area,TVPRA Input Good,Child Labor,Forced Labor,Country/Area.1,TVPRA Downstream Good,Downstream Goods at Risk
0,Bolivia,Zinc,X,,South Korea,Indium,"Conductive Glass,Touchscreen Devices, Flatscre..."
10,Côte dIvoire,Cocoa,X,,Côte dIvoire,Chocolate,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
11,Côte dIvoire,Cocoa,X,,Côte dIvoire,Cocoa Butter,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
12,Côte dIvoire,Cocoa,X,,Côte dIvoire,Cocoa Paste,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
13,Côte dIvoire,Cocoa,X,,Côte dIvoire,Cocoa Powder,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
14,Côte dIvoire and Ghana,Cocoa,X,,Netherlands,Chocolate,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
15,Côte dIvoire and Ghana,Cocoa,X,,Netherlands,Cocoa Butter,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
16,Côte dIvoire and Ghana,Cocoa,X,,Netherlands,Cocoa Paste,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
17,Côte dIvoire and Ghana,Cocoa,X,,Netherlands,Cocoa Powder,"Candy, Baked Goods, Beverages, Ice Cream, Cosm..."
18,Democratic Republic of the Congo,Cobalt Ore (heterogenite),X,,China,Lithium-Ion Batteries,"Cell Phones, Electric Cars, Laptops, Medical I..."


In [4]:
inputgoods = inputgoods.dropna(subset=['Child Labor', 'Forced Child Labor'], how='all')
inputgoods

Unnamed: 0,Country/Area,Good,Child Labor,Forced Labor,Forced Child Labor
0,Afghanistan,Bricks,X,X,X
1,Afghanistan,Carpets,X,,
2,Afghanistan,Coal,X,,
3,Afghanistan,Poppies,X,,
4,Afghanistan,Salt,X,,
...,...,...,...,...,...
473,Zambia,Tobacco,X,,
474,Zimbabwe,Gold,X,,
475,Zimbabwe,Lithium,X,,
476,Zimbabwe,Sugarcane,X,,


Logically, every good associated with forced child labour is also associated with child labour.
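This claim can be checked rather than assumed. A minimal sketch on toy rows that mirror the `inputgoods` columns; the rows themselves are made up for illustration and only the column names come from the dataset:

```python
import pandas as pd

# Toy rows mirroring the inputgoods columns; the values are illustrative only.
toy = pd.DataFrame({
    "Good": ["Bricks", "Carpets", "Coal"],
    "Child Labor": ["X", "X", "X"],
    "Forced Child Labor": ["X", None, None],
})

# Rows flagged for forced child labour but NOT for child labour.
violations = toy[toy["Forced Child Labor"].notna() & toy["Child Labor"].isna()]

# An empty result means every forced-child-labour good is also a child-labour good.
print(len(violations))
```

Running the same filter on the real `inputgoods` frame would confirm the statement if it returns no rows.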

## 04.2 Synonym Dictionary

We now extract all goods from the datasets.
In the downstream dataset, the 'Downstream Goods at Risk' field holds several goods separated by commas, so these must be split into individual goods.

In [5]:
# Extract unique goods from the correct columns
downstream_goods = downstream['TVPRA Downstream Good'].dropna().unique().tolist()

# Split strings separated by a comma in 'Downstream Goods at Risk' and flatten the list
downstream_goods_at_risk = downstream['Downstream Goods at Risk'].dropna().str.split(',').explode().str.strip().unique().tolist()

input_goods_list = inputgoods['Good'].dropna().unique().tolist()

# Combine and deduplicate the goods list
all_goods = sorted(set(downstream_goods + downstream_goods_at_risk + input_goods_list))

# Create initial synonym dictionary with goods mapping to themselves
goods_synonyms = {good: [good] for good in all_goods}

# Display the synonym dictionary
goods_synonyms

{'Aircraft Engines': ['Aircraft Engines'],
 'Alcoholic Beverages': ['Alcoholic Beverages'],
 'Amber': ['Amber'],
 'Animal Feed': ['Animal Feed'],
 'Açaí Berries': ['Açaí Berries'],
 'Baked Goods': ['Baked Goods'],
 'Bakery Items': ['Bakery Items'],
 'Bamboo': ['Bamboo'],
 'Bananas': ['Bananas'],
 'Beans ': ['Beans '],
 'Beans (green beans)': ['Beans (green beans)'],
 'Beans (green, soy, yellow)': ['Beans (green, soy, yellow)'],
 'Beef': ['Beef'],
 'Beverages': ['Beverages'],
 'Bidis (hand-rolled cigarettes)': ['Bidis (hand-rolled cigarettes)'],
 'Biofuel': ['Biofuel'],
 'Biofuels': ['Biofuels'],
 'Bovines': ['Bovines'],
 'Brass': ['Brass'],
 'Brassware': ['Brassware'],
 'Brazil Nuts/Chestnuts': ['Brazil Nuts/Chestnuts'],
 'Bricks': ['Bricks'],
 'Bricks (clay)': ['Bricks (clay)'],
 'Broccoli': ['Broccoli'],
 'Cabbages': ['Cabbages'],
 'Candy': ['Candy'],
 'Carpets': ['Carpets'],
 'Carrots': ['Carrots'],
 'Cashews': ['Cashews'],
 'Cattle': ['Cattle'],
 'Cell Phones': ['Cell Phones'],
 'C
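The output above contains keys such as `'Beans '` with a trailing space, because only the 'Downstream Goods at Risk' values are stripped in the cell. A possible consequence, stated as an assumption rather than a verified result: a matched span's text carries no trailing whitespace, so a later lookup of `"beans"` would miss a key stored as `"beans "`. A minimal sketch of normalising whitespace before building the dictionary; `raw_goods` and `goods_synonyms_clean` are hypothetical names:

```python
# Illustrative raw values; 'Beans ' carries a trailing space as in the output above.
raw_goods = ["Beans ", "Beans", " Cocoa Paste ", "Bamboo"]

# Collapse inner runs of whitespace and trim both ends before deduplicating.
normalised = sorted({" ".join(g.split()) for g in raw_goods})

goods_synonyms_clean = {good: [good] for good in normalised}
print(goods_synonyms_clean)
```

This is an optional alternative, not what the notebook does: the removal list in the next cell refers to some keys by their untrimmed form, so adopting it would mean trimming those entries as well.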

The next step is to clean the goods synonym dictionary. We begin by removing US-localised terms, which will be replaced with UK-English forms as keys in our dictionary.

We then remove irregular forms of goods, e.g. "Cigarettes (Tobacco)", to be replaced with more regular forms such as "Cigarettes", to aid detection of goods in text.

In [6]:
# List of goods to remove from the synonym dictionary
goods_to_remove = [
    'and LEDs', 'Aluminum', 'Cigarettes (Tobacco)', 'Cigarettes (Tobacco) ', 'Cobalt Ore (heterogenite)', 'Chile Peppers',
    'Coca (stimulant plant)', 'Cocoa Paste ', 'Crude Palm Oil ', 'Eggplants',
    'Leather Goods/Accessories', 'Lithium-Ion Batteries ', 'Manioc/Cassava', 'Oleochemicals ',
    'Refined Palm Kernel Oil ', 'Refined Palm Oil ', 'Shrimp', 'Soccer Balls', 'Tilapia (fish)',
    'Tungsten Ore (wolframite)', 'Yerba Mate (stimulant plant)'
]

# Remove the specified goods from the synonym dictionary
for good in goods_to_remove:
    if good in goods_synonyms:
        del goods_synonyms[good]

# Display the updated synonym dictionary
goods_synonyms

{'Aircraft Engines': ['Aircraft Engines'],
 'Alcoholic Beverages': ['Alcoholic Beverages'],
 'Amber': ['Amber'],
 'Animal Feed': ['Animal Feed'],
 'Açaí Berries': ['Açaí Berries'],
 'Baked Goods': ['Baked Goods'],
 'Bakery Items': ['Bakery Items'],
 'Bamboo': ['Bamboo'],
 'Bananas': ['Bananas'],
 'Beans ': ['Beans '],
 'Beans (green beans)': ['Beans (green beans)'],
 'Beans (green, soy, yellow)': ['Beans (green, soy, yellow)'],
 'Beef': ['Beef'],
 'Beverages': ['Beverages'],
 'Bidis (hand-rolled cigarettes)': ['Bidis (hand-rolled cigarettes)'],
 'Biofuel': ['Biofuel'],
 'Biofuels': ['Biofuels'],
 'Bovines': ['Bovines'],
 'Brass': ['Brass'],
 'Brassware': ['Brassware'],
 'Brazil Nuts/Chestnuts': ['Brazil Nuts/Chestnuts'],
 'Bricks': ['Bricks'],
 'Bricks (clay)': ['Bricks (clay)'],
 'Broccoli': ['Broccoli'],
 'Cabbages': ['Cabbages'],
 'Candy': ['Candy'],
 'Carpets': ['Carpets'],
 'Carrots': ['Carrots'],
 'Cashews': ['Cashews'],
 'Cattle': ['Cattle'],
 'Cell Phones': ['Cell Phones'],
 'C

We now create a custom synonym dictionary listing the different ways of referring to each good.

We then enrich goods_synonyms with these synonyms so that every mention of a good is detected in the governance docs.

In [7]:
custom_synonyms = {
    "Açaí Berries": ["Açaí Berries", "Acai", 'Acai Berries'],
    "Aluminium": ["Aluminium", "Aluminum"],
    "Aubergine": ["Aubergine", "Aubergines", "Brinjals", "Eggplant", "Eggplants"],
    "Beetroot": ["Beetroot", "Beet"],
    "Carpets": ["Carpets", "Rugs"],
    "Cassava": ["Cassava", 'Manioc', 'Yuca', 'Manioc/Cassava'],
    "Chickpeas": ["Chickpeas", "Garbanzo Beans"],
    "Chili Peppers": ["Chili Peppers", "Chile Peppers", "Chilli Peppers"],
    "Cobalt Ore": ["Cobalt Ore", "Cobalt"],
    "Coca Leaf": ["Coca Leaf", 'Coca (Stimulant)'],
    "Cocoa": ["Cocoa", "Cacao"],
    "Cotton": ["Cotton", "Cottonseed", "Cottonseed (hybrid)", "Textiles (Cotton)", "Garments (Cotton)", "Thread/Yarn (Cotton)"],
    "Courgettes": ['Courgettes', 'Courgette', 'Zucchini', 'Zucchinis'],
    "Coriander": ["Coriander", "Cilantro"],
    "Footballs": ["Footballs", "Soccer Balls"],
    "Footwear": ["Footwear", "Shoes", "Sandals", "Boots"],
    "Garments": ["Garments", "Clothing", "Apparel"],
    "Groundnuts": ["Groundnuts", "Peanuts"],
    "Jewellery": ["Jewellery", "Jewelry"],
    "Leather": ["Leather", "Leather Goods", "Leather Accessories"],
    "LEDs": ["LEDs"],
    "Lithium-Ion Batteries": ["Lithium-Ion Batteries", "Lithium Batteries"],
    "Maize": ["Maize", "Corn"],
    "Petrol": ["Petrol", "Gasoline"],
    "Prawns": ["Prawns", "Shrimp"],
    "Rubber": ["Rubber", "Latex"],
    "Silk Fabric": ["Silk Fabric", "Silk"],
    "Soybeans": ["Soybeans", "Soya", "Soy"],
    "Sugarcane": ["Sugarcane", "Sugar Cane"],
    "Sweet Potatoes": ["Sweet Potatoes", "Yams", "Sweet Potato"],
    "Tea": ["Tea", "Chai"],
    "Tilapia": ["Tilapia", "Tilapia (fish)"],
    "Timber": ["Timber", "Lumber"],
    "Tomatoes": ["Tomatoes", "Tomato"],
    "Tungsten": ["Tungsten", "Tungsten Ore", "Tungsten Ore (wolframite)"],
    "Yerba Mate": ["Yerba Mate", "Mate", "Yerba", "Yerba Mate (stimulant plant)"]
}

# Enrich the synonym dictionary
for standard_term, synonyms in custom_synonyms.items():
    # If the standard term already exists, extend its synonym list
    if standard_term in goods_synonyms:
        current_synonyms = set(goods_synonyms[standard_term])
        goods_synonyms[standard_term] = sorted(current_synonyms.union(synonyms))
    else:
        # Try to map to an existing term in the original list
        for good in goods_synonyms.keys():
            if standard_term.lower() in good.lower() or good.lower() in standard_term.lower():
                current_synonyms = set(goods_synonyms[good])
                goods_synonyms[good] = sorted(current_synonyms.union(synonyms))
                break
        else:
            # If no close match found, just add the new term as a new entry
            goods_synonyms[standard_term] = sorted(set(synonyms))

# Display enriched synonym dictionary (partial sample for preview)
dict(list(goods_synonyms.items()))

{'Aircraft Engines': ['Aircraft Engines'],
 'Alcoholic Beverages': ['Alcoholic Beverages'],
 'Amber': ['Amber'],
 'Animal Feed': ['Animal Feed'],
 'Açaí Berries': ['Acai', 'Acai Berries', 'Açaí Berries'],
 'Baked Goods': ['Baked Goods'],
 'Bakery Items': ['Bakery Items'],
 'Bamboo': ['Bamboo'],
 'Bananas': ['Bananas'],
 'Beans ': ['Beans '],
 'Beans (green beans)': ['Beans (green beans)'],
 'Beans (green, soy, yellow)': ['Beans (green, soy, yellow)'],
 'Beef': ['Beef'],
 'Beverages': ['Beverages'],
 'Bidis (hand-rolled cigarettes)': ['Bidis (hand-rolled cigarettes)'],
 'Biofuel': ['Biofuel'],
 'Biofuels': ['Biofuels'],
 'Bovines': ['Bovines'],
 'Brass': ['Brass'],
 'Brassware': ['Brassware'],
 'Brazil Nuts/Chestnuts': ['Brazil Nuts/Chestnuts'],
 'Bricks': ['Bricks'],
 'Bricks (clay)': ['Bricks (clay)'],
 'Broccoli': ['Broccoli'],
 'Cabbages': ['Cabbages'],
 'Candy': ['Candy'],
 'Carpets': ['Carpets', 'Rugs'],
 'Carrots': ['Carrots'],
 'Cashews': ['Cashews'],
 'Cattle': ['Cattle'],
 'Ce
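The fallback branch of the enrichment loop matches on substrings, which is convenient but can attach synonyms to an unrelated key. A minimal sketch on a toy dictionary; the `"Steak"` key is hypothetical and chosen only because it contains the letters "tea":

```python
# Toy dictionary; the keys are illustrative, not the full goods list.
goods = {"Cobalt Ore (heterogenite)": ["Cobalt Ore (heterogenite)"], "Steak": ["Steak"]}
custom = {"Cobalt Ore": ["Cobalt Ore", "Cobalt"], "Tea": ["Tea", "Chai"]}

for term, synonyms in custom.items():
    if term in goods:
        goods[term] = sorted(set(goods[term]) | set(synonyms))
        continue
    # Fallback: first key that contains the term, or is contained in it.
    for key in goods:
        if term.lower() in key.lower() or key.lower() in term.lower():
            goods[key] = sorted(set(goods[key]) | set(synonyms))
            break
    else:
        # Runs only when the inner loop finished without hitting `break`.
        goods[term] = sorted(set(synonyms))

print(goods)
```

Here "Cobalt Ore" is merged into the intended key, while "Tea" and "Chai" end up under "Steak". Inspecting the enriched dictionary for unexpected groupings of this kind is a cheap safeguard.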

We now drop lead from the analysis, following findings from the supervised contextual classifier. The term was too ambiguous: nearly all mentions were the verb (to lead), not the noun (the metal).

In [8]:
del goods_synonyms['Lead']

# Display the updated synonym dictionary
goods_synonyms

{'Aircraft Engines': ['Aircraft Engines'],
 'Alcoholic Beverages': ['Alcoholic Beverages'],
 'Amber': ['Amber'],
 'Animal Feed': ['Animal Feed'],
 'Açaí Berries': ['Acai', 'Acai Berries', 'Açaí Berries'],
 'Baked Goods': ['Baked Goods'],
 'Bakery Items': ['Bakery Items'],
 'Bamboo': ['Bamboo'],
 'Bananas': ['Bananas'],
 'Beans ': ['Beans '],
 'Beans (green beans)': ['Beans (green beans)'],
 'Beans (green, soy, yellow)': ['Beans (green, soy, yellow)'],
 'Beef': ['Beef'],
 'Beverages': ['Beverages'],
 'Bidis (hand-rolled cigarettes)': ['Bidis (hand-rolled cigarettes)'],
 'Biofuel': ['Biofuel'],
 'Biofuels': ['Biofuels'],
 'Bovines': ['Bovines'],
 'Brass': ['Brass'],
 'Brassware': ['Brassware'],
 'Brazil Nuts/Chestnuts': ['Brazil Nuts/Chestnuts'],
 'Bricks': ['Bricks'],
 'Bricks (clay)': ['Bricks (clay)'],
 'Broccoli': ['Broccoli'],
 'Cabbages': ['Cabbages'],
 'Candy': ['Candy'],
 'Carpets': ['Carpets', 'Rugs'],
 'Carrots': ['Carrots'],
 'Cashews': ['Cashews'],
 'Cattle': ['Cattle'],
 'Ce
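One practical note on the deletion: `del` raises `KeyError` if the cell is run a second time in the same session. A small sketch of a re-runnable variant, using a toy dictionary in place of `goods_synonyms`:

```python
# Toy dictionary standing in for goods_synonyms.
goods_synonyms_demo = {"Lead": ["Lead"], "Gold": ["Gold"]}

# pop with a default removes the key if present and is a no-op otherwise,
# so running the cell a second time does not raise KeyError.
goods_synonyms_demo.pop("Lead", None)
goods_synonyms_demo.pop("Lead", None)

print(goods_synonyms_demo)
```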

## 04.3 Flatten Synonym Dictionary

In [9]:
# Flatten the synonym dictionary: maps all variations to the canonical UK term
reverse_synonyms = {}

for standard_good, synonyms in goods_synonyms.items():
    for synonym in synonyms:
        reverse_synonyms[synonym.lower()] = standard_good

# Check sample output
print("eggplant →", reverse_synonyms.get("eggplant"))
print("aubergine →", reverse_synonyms.get("aubergine"))
print("maize →", reverse_synonyms.get("maize"))

eggplant → Aubergine
aubergine → Aubergine
maize → Maize
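Because the reverse map is keyed on the lowercase synonym, a synonym listed under two canonical goods silently keeps only the last one written. A minimal sketch of a collision check on a toy dictionary; the duplicated "Corn" entry is invented for the example:

```python
from collections import defaultdict

# Toy dictionary; 'Corn' is deliberately listed under two keys to show a collision.
goods_synonyms_demo = {
    "Maize": ["Maize", "Corn"],
    "Animal Feed": ["Animal Feed", "Corn"],
}

owners = defaultdict(set)
for standard_good, synonyms in goods_synonyms_demo.items():
    for synonym in synonyms:
        owners[synonym.lower()].add(standard_good)

# Synonyms claimed by more than one canonical term.
collisions = {syn: sorted(keys) for syn, keys in owners.items() if len(keys) > 1}
print(collisions)
```

An empty `collisions` dictionary on the real data would indicate that every synonym maps to exactly one canonical good.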


## 04.4 PhraseMatcher

Drop irrelevant columns from the gov_docs dataset to speed up processing by the spaCy PhraseMatcher.

In [10]:
# Drop specified columns from gov_docs
columns_to_drop = ["extracted_text", "tokenized_text", "no_stopwords_text", "lemmatized_text", "mentions_child_labour", "mentions_forced_child_labour", "lemmatized_text_str", "mentions_forced_labour"]
gov_docs = gov_docs.drop(columns=columns_to_drop)

In [11]:
gov_docs.head(10)

Unnamed: 0,company_id,company_name,url,company_size,sectors,sub_industries,extracted_text_clean,doc_length
0,RMZRFEKAVSF42,Urban Logistics Reit PLC,https://www.urbanlogisticsreit.com/supply-chai...,,Energy;Real Estate,Coal & Consumable Fuels;Industrial REITs,supply chain code of conduct | urban logistics...,707
1,YYKCG6HLXOBWM,Abbott Laboratories,https://dam.abbott.com/en-us/documents/pdfs/tr...,LARGE,Health Care,Pharmaceuticals;Health Care Equipment,position statement on human rights abbott beli...,148
2,RPO43IPXA62O6,Fortive Corp,https://www.fortive.com/sites/default/files/fi...,LARGE,Information Technology;Industrials,Electronic Equipment & Instruments;Industrial ...,3 2019 accelerating progress toward a sustaina...,7148
3,4F44JZW6IC7JG,Akamai Technologies Inc,https://www.akamai.com/company/corporate-respo...,LARGE,Information Technology,Application Software;Internet Services & Infra...,gri sustainability reporting standards skip to...,543
4,F5JRKO67WN6RM,Keller Group PLC,https://www.keller.com/sites/keller-africa-za/...,SMALL,Industrials,Construction & Engineering,building foundations for a sustainable future ...,1636
5,L3LFWS7SMD6OM,Renewi PLC,https://www.renewi.com/-/media/pdf/reports-and...,SMALL,Industrials,Environmental & Facilities Services,renewi plc modern slavery statement 2023 intro...,1449
6,P3UJCTPH5DFOW,RENTOKIL INITIAL,https://www.rentokil-initial.com/~/media/Files...,LARGE,Industrials,Diversified Support Services;Environmental & F...,1 code of conduct code of conduct you are the ...,6386
7,WMAV6CIX6HJ3I,Amazon.com CDR,https://sustainability.aboutamazon.com/modern-...,LARGE,Consumer Discretionary,Broadline Retail;Internet Services & Infrastru...,modern slavery statement amazon (newline) (ne...,6492
8,Y4VP27VVPJGVG,ONEOK,https://www.oneok.com/about-us/ethics-compliance,LARGE,Energy,Oil & Gas Storage & Transportation,ethics and compliance skip to main content abo...,790
9,Y6U55BEVXRWN6,DANAHER,https://danaher.com/sites/default/files/2024-0...,LARGE,Health Care,Biotechnology;Health Care Equipment,supplier code of conduct revised may 2024 1. i...,1586


The following steps are the same as in the Contextual Classifier: synonyms are added to the PhraseMatcher object for detection. A function is then created that detects goods mentions and returns them in their canonical form.

In [12]:
# Load small English language model
nlp = spacy.load("en_core_web_sm")

# Initialize PhraseMatcher with lowercase matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Add patterns (all synonyms) to the matcher
patterns = [nlp.make_doc(synonym) for synonym in reverse_synonyms.keys()]
matcher.add("GOOD", patterns)

print(f"Total patterns added for matching: {len(patterns)}")

Total patterns added for matching: 268
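To see the matching and canonical mapping in isolation, here is a minimal sketch on a toy synonym map. It assumes a blank English pipeline (`spacy.blank("en")`) is sufficient, since `PhraseMatcher` with `attr="LOWER"` only needs tokenisation; the names `nlp_demo`, `matcher_demo` and `reverse_demo` are hypothetical:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs the tokenizer.
nlp_demo = spacy.blank("en")
matcher_demo = PhraseMatcher(nlp_demo.vocab, attr="LOWER")

# Toy reverse map: lowercase synonym -> canonical good.
reverse_demo = {"sugar cane": "Sugarcane", "sugarcane": "Sugarcane", "corn": "Maize"}
matcher_demo.add("GOOD", [nlp_demo.make_doc(s) for s in reverse_demo])

doc = nlp_demo("Our suppliers grow Sugar Cane and corn.")

# Each match is (match_id, start, end); the span text is mapped to its canonical good.
found = sorted({reverse_demo[doc[s:e].text.lower()] for _, s, e in matcher_demo(doc)})
print(found)
```

Matching on the `LOWER` attribute is what lets "Sugar Cane" in the text hit the lowercase pattern.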


In [13]:
# Function to detect goods from a given text
def detect_goods(text):
    doc = nlp(text)
    matches = matcher(doc)
    detected = set()
    
    for match_id, start, end in matches:
        matched_phrase = doc[start:end].text.lower()  # matched word from text
        standard_good = reverse_synonyms.get(matched_phrase, matched_phrase)  # map to standard good
        detected.add(standard_good)
        
    return list(detected)

The next cell builds snippets_df with four fields:
- index of the document from which the text snippet is extracted
- company_id associated with the text snippet
- detected good in flattened form, i.e. the standard UK form
- snippet of text surrounding the good in the governance document
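The snippet is a fixed window of up to five tokens on each side of the match, clipped at the document boundaries. A minimal sketch of that windowing, assuming whitespace tokens for simplicity (spaCy also splits off punctuation, so real snippets will differ slightly); `window` is a hypothetical helper:

```python
def window(tokens, start, end, width=5):
    """Return tokens[start:end] plus up to `width` tokens on each side."""
    left = max(0, start - width)
    right = min(len(tokens), end + width)
    return " ".join(tokens[left:right])

# Whitespace tokens are a simplification; spaCy also splits off punctuation.
tokens = "we document countries of origin for tin and gold purchases each year".split()
start = tokens.index("tin")

print(window(tokens, start, start + 1))
```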

In [14]:
# Reuse the spaCy pipeline loaded above so the matcher and the documents share one vocabulary

# Increase the maximum length limit
nlp.max_length = 3000000

# Function to detect goods and extract snippets
def detect_goods_with_snippets(text):
    doc = nlp(text)
    matches = matcher(doc)
    detected = set()
    snippets = []
    
    for match_id, start, end in matches:
        matched_phrase = doc[start:end].text.lower()  # matched word from text
        standard_good = reverse_synonyms.get(matched_phrase, matched_phrase)  # map to standard good
        detected.add(standard_good)
        snippet = doc[max(0, start-5):min(len(doc), end+5)].text  # extract snippet around the match
        snippets.append((standard_good, snippet))
        
    return list(detected), snippets

# Apply the function to your clean text column (preprocessed)
gov_docs['detected_goods'], gov_docs['snippets'] = zip(*gov_docs['extracted_text_clean'].apply(detect_goods_with_snippets))

# Create a new dataframe with snippets
snippets_df = gov_docs[['company_id', 'snippets']].explode('snippets').dropna(subset=['snippets'])
snippets_df[['detected_good', 'snippet']] = pd.DataFrame(snippets_df['snippets'].tolist(), index=snippets_df.index)
snippets_df = snippets_df.drop(columns=['snippets'])

In [15]:
snippets_df.head(100)

Unnamed: 0,company_id,detected_good,snippet
2,RPO43IPXA62O6,Coffee,local charity events to hosting coffee chats f...
2,RPO43IPXA62O6,Electric Vehicles,and from solar cells to electric vehicles. tes...
2,RPO43IPXA62O6,Solar Panels,solar tracker works to aim solar panels toward...
2,RPO43IPXA62O6,Ice Cream,the safety of scoops of ice cream and the tast...
2,RPO43IPXA62O6,Tin,"documenting countries of origin for tin, tanta..."
...,...,...,...
19,OLVIAOCW4JROG,Tobacco,labor rights issues in its tobacco growing sup...
19,OLVIAOCW4JROG,Tobacco,"in mexico, pmi sources tobacco from a third-party"
19,OLVIAOCW4JROG,Tobacco,s highlands and settle in tobacco-growing area...
19,OLVIAOCW4JROG,Tobacco,. the earnings during the tobacco season are t...


In [16]:
snippets_df.to_csv("04.1snippets.csv")

snippets_df is saved to CSV because the PhraseMatcher took a while to run, so the model does not have to be rerun every time we work in this notebook.

In [17]:
gov_docs.head(20)

Unnamed: 0,company_id,company_name,url,company_size,sectors,sub_industries,extracted_text_clean,doc_length,detected_goods,snippets
0,RMZRFEKAVSF42,Urban Logistics Reit PLC,https://www.urbanlogisticsreit.com/supply-chai...,,Energy;Real Estate,Coal & Consumable Fuels;Industrial REITs,supply chain code of conduct | urban logistics...,707,[],[]
1,YYKCG6HLXOBWM,Abbott Laboratories,https://dam.abbott.com/en-us/documents/pdfs/tr...,LARGE,Health Care,Pharmaceuticals;Health Care Equipment,position statement on human rights abbott beli...,148,[],[]
2,RPO43IPXA62O6,Fortive Corp,https://www.fortive.com/sites/default/files/fi...,LARGE,Information Technology;Industrials,Electronic Equipment & Instruments;Industrial ...,3 2019 accelerating progress toward a sustaina...,7148,"[Coffee, Tungsten, Gold, Electric Cars, Electr...","[(Coffee, local charity events to hosting coff..."
3,4F44JZW6IC7JG,Akamai Technologies Inc,https://www.akamai.com/company/corporate-respo...,LARGE,Information Technology,Application Software;Internet Services & Infra...,gri sustainability reporting standards skip to...,543,[Laptops],"[(Laptops, newsroom servers epyc business syst..."
4,F5JRKO67WN6RM,Keller Group PLC,https://www.keller.com/sites/keller-africa-za/...,SMALL,Industrials,Construction & Engineering,building foundations for a sustainable future ...,1636,[],[]
5,L3LFWS7SMD6OM,Renewi PLC,https://www.renewi.com/-/media/pdf/reports-and...,SMALL,Industrials,Environmental & Facilities Services,renewi plc modern slavery statement 2023 intro...,1449,[],[]
6,P3UJCTPH5DFOW,RENTOKIL INITIAL,https://www.rentokil-initial.com/~/media/Files...,LARGE,Industrials,Diversified Support Services;Environmental & F...,1 code of conduct code of conduct you are the ...,6386,"[Phones, Laptops, Garments]","[(Phones, use hand-held mobile phones/devices ..."
7,WMAV6CIX6HJ3I,Amazon.com CDR,https://sustainability.aboutamazon.com/modern-...,LARGE,Consumer Discretionary,Broadline Retail;Internet Services & Infrastru...,modern slavery statement amazon (newline) (ne...,6492,"[Tungsten, Gold, Tin, Cotton, Garments, Cobalt...","[(Garments, amazon-branded products are appare..."
8,Y4VP27VVPJGVG,ONEOK,https://www.oneok.com/about-us/ethics-compliance,LARGE,Energy,Oil & Gas Storage & Transportation,ethics and compliance skip to main content abo...,790,[Petrol],"[(Petrol, denatured fuel ethanol diesel fuels ..."
9,Y6U55BEVXRWN6,DANAHER,https://danaher.com/sites/default/files/2024-0...,LARGE,Health Care,Biotechnology;Health Care Equipment,supplier code of conduct revised may 2024 1. i...,1586,"[Tungsten, Gold, Tin]","[(Tin, countries of origin for the tin, tantal..."


In [18]:
gov_docs.to_csv("04_gov_docs.csv", index=False)

gov_docs, with the appended detected_goods and snippets fields, is also saved for the same reason.

The extracted snippets are held in snippets_df.

In [19]:
snippets_df.shape

(3529, 3)

There are 3529 mentions of input, downstream or downstream-at-risk goods across all 329 governance documents.

## 04.5 Contextual Filtering

Import gov_docs and the snippets from CSV so the model does not have to be run again.

In [20]:
gov_docs = pd.read_csv('04_gov_docs.csv')
snippets_df = pd.read_csv('04.1snippets.csv')

In [21]:
snippets_df.head(10)

Unnamed: 0.1,Unnamed: 0,company_id,detected_good,snippet
0,2,RPO43IPXA62O6,Coffee,local charity events to hosting coffee chats f...
1,2,RPO43IPXA62O6,Electric Vehicles,and from solar cells to electric vehicles. tes...
2,2,RPO43IPXA62O6,Solar Panels,solar tracker works to aim solar panels toward...
3,2,RPO43IPXA62O6,Ice Cream,the safety of scoops of ice cream and the tast...
4,2,RPO43IPXA62O6,Tin,"documenting countries of origin for tin, tanta..."
5,2,RPO43IPXA62O6,Tungsten,"for tin, tantalum, tungsten, and gold purchases."
6,2,RPO43IPXA62O6,Gold,"tantalum, tungsten, and gold purchases. respon..."
7,2,RPO43IPXA62O6,Electric Cars,. the team also purchased electric cars for th...
8,2,RPO43IPXA62O6,Solar Panels,nagase-landauer installed rooftop solar panels...
9,2,RPO43IPXA62O6,Cobalt Ore,"a lead-free, cobalt-based material that offers"


Rename the unnamed column in snippets_df to doc_id to allow smooth document matching against the index of gov_docs.

In [22]:
snippets_df = snippets_df.rename(columns={'Unnamed: 0':'doc_id'})

In [23]:
snippets_df.head()

Unnamed: 0,doc_id,company_id,detected_good,snippet
0,2,RPO43IPXA62O6,Coffee,local charity events to hosting coffee chats f...
1,2,RPO43IPXA62O6,Electric Vehicles,and from solar cells to electric vehicles. tes...
2,2,RPO43IPXA62O6,Solar Panels,solar tracker works to aim solar panels toward...
3,2,RPO43IPXA62O6,Ice Cream,the safety of scoops of ice cream and the tast...
4,2,RPO43IPXA62O6,Tin,"documenting countries of origin for tin, tanta..."


We now load the vectoriser and logistic regression model, trained in the model training stage of the supervised contextual classifier, and apply them to the text snippets of the goods in the governance docs.

The aim is to apply the model only to ambiguous goods mentions, so that we are left with the correct uses of contextually ambiguous goods.
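The idea of scoring only the ambiguous rows while keeping everything else by default can be sketched as follows. This is an illustration under stated assumptions: the tiny vectoriser and model are fitted on four made-up snippets purely as stand-ins for the trained objects loaded from the `.pkl` files, and `vec`, `clf` and `keep` are hypothetical names:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for the trained vectoriser and model (illustrative training data only).
train_text = ["gold mining supply chain", "conflict minerals tin tungsten gold",
              "gold medal award", "silver sponsor of the event"]
train_label = [1, 1, 0, 0]  # 1 = relevant mention, 0 = irrelevant
vec = TfidfVectorizer().fit(train_text)
clf = LogisticRegression().fit(vec.transform(train_text), train_label)

snippets = pd.DataFrame({
    "detected_good": ["Gold", "Coffee", "Gold"],
    "snippet": ["earning us a gold medal", "hosting coffee chats",
                "tin tungsten and gold purchases"],
})
is_ambiguous = snippets["detected_good"].isin(["Gold", "Silver"])

# Unambiguous goods are kept by default; only ambiguous rows are scored.
keep = pd.Series(True, index=snippets.index)
keep[is_ambiguous] = clf.predict(vec.transform(snippets.loc[is_ambiguous, "snippet"])) == 1

print(snippets[keep])
```

Whatever the toy model predicts for the two "Gold" rows, the "Coffee" row is never passed to it and is always retained.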

In [24]:
# Load the vectorizer and logistic regression model from files
vectorizer = joblib.load('tfidf_vectorizer.pkl')
model = joblib.load('logistic_regression_model.pkl')

In [25]:
# Define ambiguous goods
ambiguous_goods = [
    'Gold', 'Silver', 'Rubber', 'Timber',
    'Tin', 'Nickel', 'Diamonds', 'Iron'
]

# Filter snippets containing ambiguous goods
ambiguous_snippets = snippets_df[snippets_df['detected_good'].isin(ambiguous_goods)]

# Vectorize the text snippets
X_ambiguous = vectorizer.transform(ambiguous_snippets['snippet'])

# Apply the logistic regression model
predictions = model.predict(X_ambiguous)

# Filter out irrelevant mentions (assuming 0 is irrelevant and 1 is relevant)
filtered_out = ambiguous_snippets[predictions == 0]

# Get the list of snippets and companyIDs that were filtered out
filtered_out_list = filtered_out[['company_id', 'snippet']].values.tolist()

filtered_out

Unnamed: 0,doc_id,company_id,detected_good,snippet
70,15,YDXT5LV6YDY2O,Gold,"assessed, earning us a gold medal. the un sdgs"
176,29,YG6X2CCLIA3F4,Gold,"ice benchmark administration, world gold counc..."
179,29,YG6X2CCLIA3F4,Gold,jewellery (source: world gold council). in 2020
182,29,YG6X2CCLIA3F4,Gold,covid-19. source: world gold council our...
193,29,YG6X2CCLIA3F4,Silver,"features a striking, clean silver dial with co..."
...,...,...,...,...
3521,328,CE8561CBFC466FC84D01E653CEAD3808,Gold,) and the arnold p. gold foundation; • al...
3522,328,CE8561CBFC466FC84D01E653CEAD3808,Gold,care professionals: • drove gold corporat...
3523,328,CE8561CBFC466FC84D01E653CEAD3808,Gold,speaker at the arnold p. gold foundation...
3524,328,CE8561CBFC466FC84D01E653CEAD3808,Gold,of our leadership of the gold corporate counci...


In [26]:
filtered_out.shape

(216, 4)

216 irrelevant observations now need to be filtered out of our snippets dataframe. Recall from model training that there were 402 mentions of ambiguous goods, so more than 50% of them (216 of 402, about 54%) have been filtered out. Left in, these mentions would have led to erroneous conclusions about the number of detected goods.
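The arithmetic behind these figures, using only the counts reported in this notebook:

```python
ambiguous_mentions = 402   # ambiguous-goods mentions reported at model training
filtered_out_rows = 216    # rows the classifier labelled irrelevant
all_mentions = 3529        # rows in snippets_df

share_removed = filtered_out_rows / ambiguous_mentions
remaining = all_mentions - filtered_out_rows

print(f"{share_removed:.1%} of ambiguous mentions removed; {remaining} rows remain")
```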

In [27]:
# Remove all observations from snippets_df that have an exact match across all fields with rows in filtered_out
snippets_filtered = snippets_df.merge(filtered_out, on=['doc_id','company_id', 'detected_good', 'snippet'], how='left', indicator=True)
snippets_filtered = snippets_filtered[snippets_filtered['_merge'] == 'left_only'].drop(columns=['_merge'])

In [28]:
snippets_filtered.shape

(3313, 4)

The filtered dataset contains 3313 contextually relevant mentions of goods across input, downstream and downstream-at-risk goods.
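The merge with `indicator=True` used to produce this frame is an anti-join. Since `filtered_out` is a row subset of `snippets_df` and keeps its index labels, dropping by index should give the same rows. A minimal sketch on toy data, offered as an alternative rather than as the notebook's method:

```python
import pandas as pd

# Toy frame standing in for snippets_df; row labels play the role of the index.
df = pd.DataFrame({
    "doc_id": [2, 2, 15],
    "detected_good": ["Tin", "Gold", "Gold"],
    "snippet": ["origin for tin", "gold purchases", "a gold medal"],
})
rejected = df.loc[[2]]  # pretend the classifier rejected the last row

# Anti-join via merge + indicator, as in the cell above.
merged = df.merge(rejected, on=list(df.columns), how="left", indicator=True)
kept_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Equivalent here: drop the rejected rows by index label.
kept_drop = df.drop(index=rejected.index)

print(len(kept_merge), len(kept_drop))
```

The two differ when the frame contains fully identical rows: the merge treats them as the same record, whereas the index-based drop removes only the rows the classifier actually rejected.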

In [29]:
snippets_filtered.head(30)

Unnamed: 0,doc_id,company_id,detected_good,snippet
0,2,RPO43IPXA62O6,Coffee,local charity events to hosting coffee chats f...
1,2,RPO43IPXA62O6,Electric Vehicles,and from solar cells to electric vehicles. tes...
2,2,RPO43IPXA62O6,Solar Panels,solar tracker works to aim solar panels toward...
3,2,RPO43IPXA62O6,Ice Cream,the safety of scoops of ice cream and the tast...
4,2,RPO43IPXA62O6,Tin,"documenting countries of origin for tin, tanta..."
5,2,RPO43IPXA62O6,Tungsten,"for tin, tantalum, tungsten, and gold purchases."
6,2,RPO43IPXA62O6,Gold,"tantalum, tungsten, and gold purchases. respon..."
7,2,RPO43IPXA62O6,Electric Cars,. the team also purchased electric cars for th...
8,2,RPO43IPXA62O6,Solar Panels,nagase-landauer installed rooftop solar panels...
9,2,RPO43IPXA62O6,Cobalt Ore,"a lead-free, cobalt-based material that offers"


In [30]:
# Group by 'doc_id' and aggregate 'detected_good' into a list of unique values
unique_goods_per_doc = snippets_filtered.groupby('doc_id')['detected_good'].unique().reset_index()

# Rename the column to 'unique_detected_goods'
unique_goods_per_doc.rename(columns={'detected_good': 'unique_detected_goods'}, inplace=True)

# Add a new column for the count of unique goods
unique_goods_per_doc['unique_detected_goods_count'] = unique_goods_per_doc['unique_detected_goods'].apply(len)

# Merge this new dataframe back with the original snippets_filtered dataframe
snippets_filtered = snippets_filtered.merge(unique_goods_per_doc, on='doc_id', how='left')

In [31]:
snippets_filtered.head()

Unnamed: 0,doc_id,company_id,detected_good,snippet,unique_detected_goods,unique_detected_goods_count
0,2,RPO43IPXA62O6,Coffee,local charity events to hosting coffee chats f...,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9
1,2,RPO43IPXA62O6,Electric Vehicles,and from solar cells to electric vehicles. tes...,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9
2,2,RPO43IPXA62O6,Solar Panels,solar tracker works to aim solar panels toward...,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9
3,2,RPO43IPXA62O6,Ice Cream,the safety of scoops of ice cream and the tast...,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9
4,2,RPO43IPXA62O6,Tin,"documenting countries of origin for tin, tanta...","[Coffee, Electric Vehicles, Solar Panels, Ice ...",9


## 04.6 Merging relevant snippets data with gov_docs

We now count the total number of mentions of goods linked to child labour in each governance document.

To do this, we create an extra dataframe with the mention count and the unique risky goods (unique_detected_goods) per document, and subsequently merge it with gov_docs.
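The two counts measure different things: the number of mentions counts every snippet, while the unique count collapses repeated mentions of the same good. A minimal sketch on toy rows, using pandas named aggregation as one way to compute both in a single pass:

```python
import pandas as pd

# Toy snippets: document 2 mentions Tin twice and Gold once.
toy = pd.DataFrame({
    "doc_id": [2, 2, 2, 3],
    "detected_good": ["Tin", "Tin", "Gold", "Laptops"],
})

per_doc = toy.groupby("doc_id")["detected_good"].agg(
    detected_good_count="count",            # every mention
    unique_detected_goods_count="nunique",  # distinct goods only
).reset_index()

print(per_doc)
```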

In [32]:
# Count the number of detected_goods observations grouped by doc_id
detected_goods_counts = snippets_filtered.groupby('doc_id')['detected_good'].count().reset_index()

# Rename the columns for clarity
detected_goods_counts.columns = ['doc_id', 'detected_good_count']

# Merge with unique_detected_goods to include it for each doc_id
detected_goods_counts = detected_goods_counts.merge(unique_goods_per_doc, on='doc_id', how='left')

# Display the first few rows of the new dataframe
detected_goods_counts.head(25)

Unnamed: 0,doc_id,detected_good_count,unique_detected_goods,unique_detected_goods_count
0,2,10,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9
1,3,4,[Laptops],1
2,6,6,"[Phones, Garments, Laptops]",3
3,7,13,"[Garments, Electronics, Tin, Tungsten, Gold, C...",7
4,8,1,[Petrol],1
5,9,3,"[Tin, Tungsten, Gold]",3
6,10,4,"[Phones, Tablets]",2
7,13,22,"[Soybeans, Maize, Bananas, Fish, Rice, Solar P...",8
8,14,6,"[Cotton, Leather, Mica]",3
9,15,8,"[Footwear, Tablets, Biofuel, Electric Vehicles...",8


Now merge with gov_docs

In [33]:
# Merge the detected_goods_counts dataframe with gov_docs matching the index of gov_docs with doc_id in detected_goods_counts
gov_docs_merged = gov_docs.merge(detected_goods_counts, left_index=True, right_on='doc_id', how='left')

# Display the first few rows of the merged dataframe
gov_docs_merged.head(25)

Unnamed: 0,company_id,company_name,url,company_size,sectors,sub_industries,extracted_text_clean,doc_length,detected_goods,snippets,doc_id,detected_good_count,unique_detected_goods,unique_detected_goods_count
,RMZRFEKAVSF42,Urban Logistics Reit PLC,https://www.urbanlogisticsreit.com/supply-chai...,,Energy;Real Estate,Coal & Consumable Fuels;Industrial REITs,supply chain code of conduct | urban logistics...,707,[],[],0,,,
,YYKCG6HLXOBWM,Abbott Laboratories,https://dam.abbott.com/en-us/documents/pdfs/tr...,LARGE,Health Care,Pharmaceuticals;Health Care Equipment,position statement on human rights abbott beli...,148,[],[],1,,,
0.0,RPO43IPXA62O6,Fortive Corp,https://www.fortive.com/sites/default/files/fi...,LARGE,Information Technology;Industrials,Electronic Equipment & Instruments;Industrial ...,3 2019 accelerating progress toward a sustaina...,7148,"['Coffee', 'Tungsten', 'Gold', 'Electric Cars'...","[('Coffee', 'local charity events to hosting c...",2,10.0,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9.0
1.0,4F44JZW6IC7JG,Akamai Technologies Inc,https://www.akamai.com/company/corporate-respo...,LARGE,Information Technology,Application Software;Internet Services & Infra...,gri sustainability reporting standards skip to...,543,['Laptops'],"[('Laptops', 'newsroom servers epyc business s...",3,4.0,[Laptops],1.0
,F5JRKO67WN6RM,Keller Group PLC,https://www.keller.com/sites/keller-africa-za/...,SMALL,Industrials,Construction & Engineering,building foundations for a sustainable future ...,1636,[],[],4,,,
,L3LFWS7SMD6OM,Renewi PLC,https://www.renewi.com/-/media/pdf/reports-and...,SMALL,Industrials,Environmental & Facilities Services,renewi plc modern slavery statement 2023 intro...,1449,[],[],5,,,
2.0,P3UJCTPH5DFOW,RENTOKIL INITIAL,https://www.rentokil-initial.com/~/media/Files...,LARGE,Industrials,Diversified Support Services;Environmental & F...,1 code of conduct code of conduct you are the ...,6386,"['Phones', 'Laptops', 'Garments']","[('Phones', 'use hand-held mobile phones/devic...",6,6.0,"[Phones, Garments, Laptops]",3.0
3.0,WMAV6CIX6HJ3I,Amazon.com CDR,https://sustainability.aboutamazon.com/modern-...,LARGE,Consumer Discretionary,Broadline Retail;Internet Services & Infrastru...,modern slavery statement amazon (newline) (ne...,6492,"['Tungsten', 'Gold', 'Tin', 'Cotton', 'Garment...","[('Garments', 'amazon-branded products are app...",7,13.0,"[Garments, Electronics, Tin, Tungsten, Gold, C...",7.0
4.0,Y4VP27VVPJGVG,ONEOK,https://www.oneok.com/about-us/ethics-compliance,LARGE,Energy,Oil & Gas Storage & Transportation,ethics and compliance skip to main content abo...,790,['Petrol'],"[('Petrol', 'denatured fuel ethanol diesel fue...",8,1.0,[Petrol],1.0
5.0,Y6U55BEVXRWN6,DANAHER,https://danaher.com/sites/default/files/2024-0...,LARGE,Health Care,Biotechnology;Health Care Equipment,supplier code of conduct revised may 2024 1. i...,1586,"['Tungsten', 'Gold', 'Tin']","[('Tin', 'countries of origin for the tin, tan...",9,3.0,"[Tin, Tungsten, Gold]",3.0
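
The NaN rows above are documents that had no match in detected_goods_counts. A minimal sketch of how such a left merge could be checked, using toy frames that stand in for gov_docs and detected_goods_counts (the data and the names docs, counts and unmatched are hypothetical); validate and indicator are standard options of DataFrame.merge:

```python
import pandas as pd

# Toy stand-ins for gov_docs and detected_goods_counts (hypothetical data).
docs = pd.DataFrame({"company_name": ["A", "B", "C"]})
counts = pd.DataFrame({"doc_id": [1, 2], "detected_good_count": [4, 6]})

merged = docs.merge(
    counts,
    left_index=True,
    right_on="doc_id",
    how="left",
    validate="one_to_one",  # raises MergeError if a doc_id appears more than once
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Documents that found no partner in the counts table
unmatched = merged.loc[merged["_merge"] == "left_only", "company_name"].tolist()
print(unmatched)
```

Rows flagged left_only are the ones whose count columns come back as NaN.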


In [34]:
# Set 'doc_id' as the index of gov_docs_merged
gov_docs_merged = gov_docs_merged.set_index('doc_id')
gov_docs_merged.head(10)

doc_id,company_id,company_name,url,company_size,sectors,sub_industries,extracted_text_clean,doc_length,detected_goods,snippets,detected_good_count,unique_detected_goods,unique_detected_goods_count
0,RMZRFEKAVSF42,Urban Logistics Reit PLC,https://www.urbanlogisticsreit.com/supply-chai...,,Energy;Real Estate,Coal & Consumable Fuels;Industrial REITs,supply chain code of conduct | urban logistics...,707,[],[],,,
1,YYKCG6HLXOBWM,Abbott Laboratories,https://dam.abbott.com/en-us/documents/pdfs/tr...,LARGE,Health Care,Pharmaceuticals;Health Care Equipment,position statement on human rights abbott beli...,148,[],[],,,
2,RPO43IPXA62O6,Fortive Corp,https://www.fortive.com/sites/default/files/fi...,LARGE,Information Technology;Industrials,Electronic Equipment & Instruments;Industrial ...,3 2019 accelerating progress toward a sustaina...,7148,"['Coffee', 'Tungsten', 'Gold', 'Electric Cars'...","[('Coffee', 'local charity events to hosting c...",10.0,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9.0
3,4F44JZW6IC7JG,Akamai Technologies Inc,https://www.akamai.com/company/corporate-respo...,LARGE,Information Technology,Application Software;Internet Services & Infra...,gri sustainability reporting standards skip to...,543,['Laptops'],"[('Laptops', 'newsroom servers epyc business s...",4.0,[Laptops],1.0
4,F5JRKO67WN6RM,Keller Group PLC,https://www.keller.com/sites/keller-africa-za/...,SMALL,Industrials,Construction & Engineering,building foundations for a sustainable future ...,1636,[],[],,,
5,L3LFWS7SMD6OM,Renewi PLC,https://www.renewi.com/-/media/pdf/reports-and...,SMALL,Industrials,Environmental & Facilities Services,renewi plc modern slavery statement 2023 intro...,1449,[],[],,,
6,P3UJCTPH5DFOW,RENTOKIL INITIAL,https://www.rentokil-initial.com/~/media/Files...,LARGE,Industrials,Diversified Support Services;Environmental & F...,1 code of conduct code of conduct you are the ...,6386,"['Phones', 'Laptops', 'Garments']","[('Phones', 'use hand-held mobile phones/devic...",6.0,"[Phones, Garments, Laptops]",3.0
7,WMAV6CIX6HJ3I,Amazon.com CDR,https://sustainability.aboutamazon.com/modern-...,LARGE,Consumer Discretionary,Broadline Retail;Internet Services & Infrastru...,modern slavery statement amazon (newline) (ne...,6492,"['Tungsten', 'Gold', 'Tin', 'Cotton', 'Garment...","[('Garments', 'amazon-branded products are app...",13.0,"[Garments, Electronics, Tin, Tungsten, Gold, C...",7.0
8,Y4VP27VVPJGVG,ONEOK,https://www.oneok.com/about-us/ethics-compliance,LARGE,Energy,Oil & Gas Storage & Transportation,ethics and compliance skip to main content abo...,790,['Petrol'],"[('Petrol', 'denatured fuel ethanol diesel fue...",1.0,[Petrol],1.0
9,Y6U55BEVXRWN6,DANAHER,https://danaher.com/sites/default/files/2024-0...,LARGE,Health Care,Biotechnology;Health Care Equipment,supplier code of conduct revised may 2024 1. i...,1586,"['Tungsten', 'Gold', 'Tin']","[('Tin', 'countries of origin for the tin, tan...",3.0,"[Tin, Tungsten, Gold]",3.0


Before saving this dataframe to CSV for scoring, replace the missing counts with 0

In [35]:
# Replace null values in the two count columns with 0.
# Assigning the result back avoids chained inplace fillna, which newer pandas versions warn about.
count_cols = ['detected_good_count', 'unique_detected_goods_count']
gov_docs_merged[count_cols] = gov_docs_merged[count_cols].fillna(0)

# Display the first few rows of the updated dataframe to verify the changes
gov_docs_merged.head(25)

doc_id,company_id,company_name,url,company_size,sectors,sub_industries,extracted_text_clean,doc_length,detected_goods,snippets,detected_good_count,unique_detected_goods,unique_detected_goods_count
0,RMZRFEKAVSF42,Urban Logistics Reit PLC,https://www.urbanlogisticsreit.com/supply-chai...,,Energy;Real Estate,Coal & Consumable Fuels;Industrial REITs,supply chain code of conduct | urban logistics...,707,[],[],0.0,,0.0
1,YYKCG6HLXOBWM,Abbott Laboratories,https://dam.abbott.com/en-us/documents/pdfs/tr...,LARGE,Health Care,Pharmaceuticals;Health Care Equipment,position statement on human rights abbott beli...,148,[],[],0.0,,0.0
2,RPO43IPXA62O6,Fortive Corp,https://www.fortive.com/sites/default/files/fi...,LARGE,Information Technology;Industrials,Electronic Equipment & Instruments;Industrial ...,3 2019 accelerating progress toward a sustaina...,7148,"['Coffee', 'Tungsten', 'Gold', 'Electric Cars'...","[('Coffee', 'local charity events to hosting c...",10.0,"[Coffee, Electric Vehicles, Solar Panels, Ice ...",9.0
3,4F44JZW6IC7JG,Akamai Technologies Inc,https://www.akamai.com/company/corporate-respo...,LARGE,Information Technology,Application Software;Internet Services & Infra...,gri sustainability reporting standards skip to...,543,['Laptops'],"[('Laptops', 'newsroom servers epyc business s...",4.0,[Laptops],1.0
4,F5JRKO67WN6RM,Keller Group PLC,https://www.keller.com/sites/keller-africa-za/...,SMALL,Industrials,Construction & Engineering,building foundations for a sustainable future ...,1636,[],[],0.0,,0.0
5,L3LFWS7SMD6OM,Renewi PLC,https://www.renewi.com/-/media/pdf/reports-and...,SMALL,Industrials,Environmental & Facilities Services,renewi plc modern slavery statement 2023 intro...,1449,[],[],0.0,,0.0
6,P3UJCTPH5DFOW,RENTOKIL INITIAL,https://www.rentokil-initial.com/~/media/Files...,LARGE,Industrials,Diversified Support Services;Environmental & F...,1 code of conduct code of conduct you are the ...,6386,"['Phones', 'Laptops', 'Garments']","[('Phones', 'use hand-held mobile phones/devic...",6.0,"[Phones, Garments, Laptops]",3.0
7,WMAV6CIX6HJ3I,Amazon.com CDR,https://sustainability.aboutamazon.com/modern-...,LARGE,Consumer Discretionary,Broadline Retail;Internet Services & Infrastru...,modern slavery statement amazon (newline) (ne...,6492,"['Tungsten', 'Gold', 'Tin', 'Cotton', 'Garment...","[('Garments', 'amazon-branded products are app...",13.0,"[Garments, Electronics, Tin, Tungsten, Gold, C...",7.0
8,Y4VP27VVPJGVG,ONEOK,https://www.oneok.com/about-us/ethics-compliance,LARGE,Energy,Oil & Gas Storage & Transportation,ethics and compliance skip to main content abo...,790,['Petrol'],"[('Petrol', 'denatured fuel ethanol diesel fue...",1.0,[Petrol],1.0
9,Y6U55BEVXRWN6,DANAHER,https://danaher.com/sites/default/files/2024-0...,LARGE,Health Care,Biotechnology;Health Care Equipment,supplier code of conduct revised may 2024 1. i...,1586,"['Tungsten', 'Gold', 'Tin']","[('Tin', 'countries of origin for the tin, tan...",3.0,"[Tin, Tungsten, Gold]",3.0
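
The counts display as floats (10.0, 4.0, ...) because the left merge introduced NaN before the fill. A minimal sketch, on a toy frame standing in for gov_docs_merged, of filling the gaps and casting back to integers; whether integer counts are wanted for scoring is an assumption, and the frame df is hypothetical:

```python
import pandas as pd

# Toy stand-in for the two count columns of gov_docs_merged (hypothetical data).
df = pd.DataFrame({
    "detected_good_count": [None, 10.0, 4.0],
    "unique_detected_goods_count": [None, 9.0, 1.0],
})

count_cols = ["detected_good_count", "unique_detected_goods_count"]

# Fill missing counts with 0, then store them as integers
df[count_cols] = df[count_cols].fillna(0).astype(int)

# Sanity check: no missing values remain in the count columns
assert not df[count_cols].isna().any().any()
print(df)
```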


In [36]:
gov_docs_merged.to_csv('gov_docs_scoring.csv')
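
Note that list-valued columns such as unique_detected_goods are written to CSV as plain strings. A minimal sketch of how they could be restored when the file is read back for scoring, using an in-memory buffer instead of gov_docs_scoring.csv; the toy frame and the helper parse_list are hypothetical:

```python
import ast
import io

import pandas as pd

# Toy stand-in for gov_docs_merged with a list column and a missing entry.
df = pd.DataFrame(
    {
        "unique_detected_goods": [["Tin", "Gold"], None],
        "unique_detected_goods_count": [2, 0],
    },
    index=pd.Index([0, 1], name="doc_id"),
)

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)


def parse_list(cell):
    """Turn "['Tin', 'Gold']" back into a list; empty cells (NaN) become []."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []


reloaded = pd.read_csv(buffer, index_col="doc_id")
reloaded["unique_detected_goods"] = reloaded["unique_detected_goods"].apply(parse_list)
print(reloaded)
```

Passing index_col='doc_id' on reload keeps doc_id as the index instead of turning it into an ordinary column.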