#### Corporate Social Responsibility (CSR) Reports
A corporate social responsibility (CSR) report is a document, used both internally and externally, in which a company communicates its CSR efforts: the environmental, ethical, philanthropic, and economic impacts of its activities on the environment and the community.

Although they are in no way mandatory, more than 90% of S&P 500 companies publish one every year. Because there is no standard reporting process, the quantity and quality of the information disclosed depend on each company's public relations department. Reports can run from 30 to more than 200 pages.

#### Natural Language Processing (NLP)
Natural language processing (NLP) is a field at the intersection of linguistics and machine learning that deals with natural (that is, human) languages. The goal is to "understand" unstructured text data and produce something new from it. Examples of NLP tasks include machine translation, text summarization, and sentiment analysis.


#### Zero-Shot Learning (ZSL)
Human languages are highly complex, so it is impossible to train classifiers on every sentence. Zero-shot learning (ZSL) models make it possible to classify text into categories that the model never saw during training. These methods work by relating seen and unseen categories through auxiliary information that encodes the properties of the objects.

Images and videos are other common applications of ZSL models, and the range of uses keeps growing, for example activity recognition from sensor data.

In this project, I will use NLP and ZSL to analyze a CSR report and classify each sentence against a set of categories related to ESG (environmental, social, and governance).


---

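In [None]:
# macOS only (Homebrew): OCR and Java dependencies for tika.
# On Linux/Colab, brew is unavailable and these three commands fail, as the output below shows.
!brew install tesseract tesseract-lang
!brew install --cask adoptopenjdk
!brew install tika
!pip install tika

import nltk
nltk.download('punkt')

!pip install "transformers[sentencepiece]"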

/bin/bash: line 1: brew: command not found
/bin/bash: line 1: brew: command not found
/bin/bash: line 1: brew: command not found
Collecting tika
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-2.6.0-py3-none-any.whl size=32622 sha256=01629e41d704e99a21e3fa715bd7614b3845f47caed12aced3046becbf49b20e
  Stored in directory: /root/.cache/pip/wheels/5f/71/c7/b757709531121b1700cffda5b6b0d4aad095fb507ec84316d0
Successfully built tika
Installing collected packages: tika
Successfully installed tika-2.6.0


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
# Imports
import re
import string
from collections import defaultdict
import pandas as pd
from tika import parser
import nltk
import torch
from transformers import pipeline  # Hugging Face

pd.set_option("display.max_colwidth", None)




## Parsing CSR PDFs
A non-trivial part of classifying CSR reports is converting them into a machine-readable format. Companies publish their CSR reports as PDFs, which are notoriously difficult to parse. Our goal is to extract the text as a list of sentences.

We will perform a very simple parse of a PDF report, using the tika package to extract the text, regular expressions to filter and join the text, and NLTK to split the text into sentences.
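
To give a feel for the regular-expression step, here is a minimal sketch that applies a few of the same whitespace and punctuation fixes to a made-up line. The sample string and the `naive_sentences` helper are assumptions for illustration only; the notebook itself relies on `nltk.sent_tokenize`, which handles abbreviations and other edge cases that this toy splitter does not.

```python
import re

def clean_line(line):
    """Apply a few whitespace and punctuation fixes to one extracted line."""
    line = re.sub(r"\t+", " ", line)            # tabs -> spaces
    line = re.sub(r"\.+", ".", line)            # collapse runs of periods
    line = line.strip()                         # trim both ends
    line = re.sub(r"\s+", " ", line)            # collapse repeated whitespace
    line = re.sub(r"\s?([,:;.])", r"\1", line)  # drop a space before punctuation
    return line

def naive_sentences(text):
    """Toy splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Made-up sample line, not taken from a real report
raw = "We reduced   emissions ,\tand waste .. Our goal : net zero ."
cleaned = clean_line(raw)
print(cleaned)                   # We reduced emissions, and waste. Our goal: net zero.
print(naive_sentences(cleaned))  # ['We reduced emissions, and waste.', 'Our goal: net zero.']
```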



In [None]:
class parsePDF:
    def __init__(self, url):
        self.url = url

    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text


    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))

        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)

        # Aggregate lines where the sentence wraps
        # Also, a line in ALL CAPS is treated as a header
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                  or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)

        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods

            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words

            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences

### Example: McDonald's
Here we retrieve McDonald's most recent CSR report from [responsibilityreports.com](https://www.responsibilityreports.com/Company/mcdonalds-corporation). We extract and clean the text so that we can then classify it using zero-shot learning.

In [None]:
mcdonalds_url = "https://www.responsibilityreports.com/Click/2534"
pp = parsePDF(mcdonalds_url)
pp.extract_contents()
sentences = pp.clean_text()

print(f"The McDonalds CSR report has {len(sentences):,d} sentences")

2023-12-03 17:26:56,982 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/2534 to /tmp/click-2534.
INFO:tika.tika:Retrieving https://www.responsibilityreports.com/Click/2534 to /tmp/click-2534.
2023-12-03 17:26:58,784 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-12-03 17:26:59,205 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-12

The McDonalds CSR report has 288 sentences





## Zero-Shot Learning
Zero-shot learning models are extremely useful when you want to classify text against very specific labels and have no labeled data. Labeled data can be difficult, expensive, and tedious to acquire, so zero-shot learning offers a quick way to obtain a classification without specialized data or additional model training.

We will define industry-specific ESG categories and ask our model to classify each sentence of our CSR report. For each label we get a "score" between 0.0 and 1.0 that indicates how confident the model is that the label applies. A score close to 1.0 means the model is very confident that the sentence is about that topic; conversely, a score close to 0.0 means the model sees essentially no relation between the sentence and that topic.
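
To make the score concrete: as far as we understand the Hugging Face zero-shot pipeline, when `multi_label=True` each label is scored independently by applying a softmax over the model's contradiction and entailment logits and keeping the entailment probability. The sketch below reproduces only that arithmetic. The logit values are made up for illustration and are not model output.

```python
import math

def entailment_score(entail_logit, contra_logit):
    """Softmax over (contradiction, entailment); return the entailment probability."""
    m = max(entail_logit, contra_logit)  # subtract the max for numerical stability
    e = math.exp(entail_logit - m)
    c = math.exp(contra_logit - m)
    return e / (e + c)

# Hypothetical logits: strong entailment, then a tie
print(round(entailment_score(4.0, -2.0), 4))  # 0.9975
print(round(entailment_score(0.0, 0.0), 4))   # 0.5
```

Because every label is scored on its own, the scores of one sentence do not need to sum to 1, which is why a single sentence can score highly on several labels at once.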

The drawback of zero-shot learning is that it is extremely slow compared with models trained on specific labels. In essence, it has to work out "what it means to be this label" and then check whether your sentence "is this label".
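
One way to picture this: models used for zero-shot classification are typically natural language inference (NLI) models, so each candidate label is turned into a hypothesis sentence and paired with the text. The sketch below only builds those pairs and runs no model. `build_nli_pairs` is a hypothetical helper, and the template is the one used in the classifier cell below.

```python
def build_nli_pairs(sentence, labels, template="This text is about {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(sentence, template.format(label)) for label in labels]

pairs = build_nli_pairs(
    "we set an ambition to achieve net zero emissions by 2050.",
    ["emissions", "philanthropy"],
)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each pair requires its own pass through the model, which is where the cost comes from.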

In [None]:
class ZeroShotClassifier:

    def create_zsl_model(self, model_name):
        """ Create the zero-shot learning model. """
        self.model = pipeline("zero-shot-classification", model=model_name)


    def classify_text(self, text, categories):
        """
        Classify text(s) to the pre-defined categories using a
        zero-shot classification model and return the raw results.
        """
        # Classify text using the zero-shot transformers model
        hypothesis_template = "This text is about {}."
        result = self.model(text, categories, multi_label=True,
                            hypothesis_template=hypothesis_template)
        return result


    def text_labels(self, text, category_dict, cutoff=None):
        """
        Classify a text into the pre-defined categories. If cutoff
        is defined, return only those entries where the score > cutoff
        """
        # Run the model on our categories
        categories = list(category_dict.keys())
        result = (self.classify_text(text, categories))

        # Format as a pandas dataframe and add ESG label
        df = pd.DataFrame(result).explode(["labels", "scores"])
        df["ESG"] = df.labels.map(category_dict)

        # If a cutoff is provided, filter the dataframe
        if cutoff is not None:
            df = df[df.scores.gt(cutoff)].copy()
        return df.reset_index(drop=True)
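
The `explode` step in `text_labels` can be hard to visualize. As far as we know, the pipeline returns one dict per input text with `sequence`, `labels`, and `scores` keys, where the two lists are aligned. The plain-Python sketch below shows the same reshaping into one row per (sentence, label). `flatten_results` is a hypothetical helper, and the values are illustrative, not real model output.

```python
def flatten_results(results, category_dict):
    """Turn pipeline-style results into one row per (sentence, label)."""
    if isinstance(results, dict):  # a single input text yields a single dict
        results = [results]
    rows = []
    for res in results:
        for label, score in zip(res["labels"], res["scores"]):
            rows.append({"sequence": res["sequence"], "labels": label,
                         "scores": score, "ESG": category_dict[label]})
    return rows

# Illustrative values only, not real model output
fake_result = {"sequence": "we plant trees.",
               "labels": ["natural resources", "philanthropy"],
               "scores": [0.93, 0.41]}
rows = flatten_results(fake_result, {"natural resources": "E", "philanthropy": "S"})
for row in rows:
    print(row)
```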

### Pre-Define Labels
The labels chosen below are based on the categories and topics used by ESG rating agencies.
We define the plain-English version, which is what the zero-shot learning model will look for, along with the overall "ESG" label.

Because of how zero-shot learning works, inference time grows linearly with the number of labels you define. You therefore need to decide which labels you really want and how much classification time is acceptable.
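
A back-of-the-envelope sketch of this scaling: the 288 sentences come from the parsed report above and the 10 labels from the dictionary below, while the time per sentence-label pair is a purely hypothetical figure that depends on the model and hardware.

```python
def estimate_runtime(n_sentences, n_labels, seconds_per_pair):
    """Estimate the number of sentence-label pairs and the total seconds needed."""
    n_pairs = n_sentences * n_labels
    return n_pairs, n_pairs * seconds_per_pair

# seconds_per_pair=0.05 is an assumed value, not a measurement
pairs, seconds = estimate_runtime(n_sentences=288, n_labels=10, seconds_per_pair=0.05)
print(f"{pairs:,d} sentence-label pairs, about {seconds / 60:.1f} minutes")
```

Doubling the number of labels doubles the number of pairs, and therefore roughly doubles the inference time.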

In [None]:
# Define categories we want to classify
esg_categories = {
  "emissions": "E",
  "natural resources": "E",
  "pollution": "E",
  "diversity and inclusion": "S",
  "philanthropy": "S",
  "health and safety": "S",
  "training and education": "S",
  "transparancy": "G",  # sic: misspelling of "transparency", kept to match the recorded outputs below
  "corporate compliance": "G",
  "board accountability": "G"}

## Getting the Text Classification



All that remains is to define the model and make predictions. The model can be any checkpoint on [Hugging Face](https://huggingface.co/models) that is suitable for zero-shot classification, typically one fine-tuned on a natural language inference (NLI) task.

Here we use the base version of the DeBERTa model fine-tuned on MNLI, as maintained by Microsoft. A larger model (generally) gives better performance but is much slower.

In [None]:
# Define and create the zero-shot learning model
model_name = "microsoft/deberta-base-mnli"
    # a larger, slower alternative: "microsoft/deberta-large-mnli"
ZSC = ZeroShotClassifier()
ZSC.create_zsl_model(model_name)
    # Note: the warning is expected, so ignore it

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# Classify all the sentences in the report
classified = ZSC.text_labels(sentences, esg_categories)
classified.sample(n=20)  # display 20 random records

index,sequence,labels,scores,ESG
1641,"since the programs launch, mcdonalds has engaged thousands of suppliers and facilities on respecting human rights and mitigating risk.",training and education,0.511379,S
2588,"for purposes of mcdonalds reporting, including with respect to human capital metrics and equal pay, underrepresented groups is defined as people who identify as black, indigenous, asian or pacific islander, people of hispanic or latino/a/x descent, or people having a combination of these identities or attributes.",health and safety,0.000479,S
529,were aiming for a 90% reduction in virgin fossil fuel-based plastic used to make happy meal toys by 2025.,philanthropy,0.021804,S
2494,thesearenot guarantees of performance and speak only as ofthedate the statements are made.,diversity and inclusion,0.064289,S
2846,"these commitments apply to chicken raised for sale at mcdonalds restaurants in australia, canada, france, germany, italy, ireland, the netherlands, poland, russia, south korea, spain, switzerland, the u.k. and the u.s. russia is included for the purposes of progress reporting to the end of december 2021.",philanthropy,0.009659,S
2322,"throughout the covid-19 pandemic, and with guidance from partners like food donation connection, the global foodbanking network and feeding america, we have ensured millions of pounds of stranded food hasnt been wasted, insteaddonating it topeople globally who need it.",health and safety,0.473058,S
1881,"jobs, inclusion & empowerment check out our 2021 diversity snapshot to see our goals andprogress purpose & impact progress summary introduction food quality & sourcing our planet jobs, inclusion & empowerment community connection 12 attracting, retaining and rewarding talent providing a best-in-class employee experience is a business imperative because it directly impacts the customer experience.",natural resources,0.829499,E
455,as the transacted u.s. renewable energy projects come online they are expected to contribute to a 27% reduction from our global 2015 baseline.,philanthropy,0.074916,S
1514,"additionally, on average, roughly78% of our restaurants in mcdonalds largest european markets already provide guest packaging recycling.",natural resources,0.453747,E
2144,our employee assistance fund distributed donations among ukrainian employees impacted by the closure of mcdonalds restaurants and offices during this crisis.,board accountability,0.016058,G


Our model has scored every sentence of our CSR report against each ESG category. As described earlier, a score close to 1.0 means the model is very confident that the sentence is about the topic, while a score close to 0.0 means it sees essentially no relation.

In [None]:
# Look at an example of "E" classified sentences:
E_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("E")].copy()
E_sentences.head(10)

index,sequence,labels,scores,ESG
170,helping protect our planet earning the trust of our people and customers by doing what we say weregoing to do has always been key to building a strong brand and a lasting legacy.,natural resources,0.984418,E
200,"thats why, in 2021, we set an ambition to achieve net zero emissions by 2050.",pollution,0.957022,E
201,"thats why, in 2021, we set an ambition to achieve net zero emissions by 2050.",emissions,0.956338,E
210,"were prioritizing action on the largest elements of our carbon footprint from restaurant energy use to packaging and waste, and the sourcing of key ingredients for our menu.",emissions,0.991796,E
211,"were prioritizing action on the largest elements of our carbon footprint from restaurant energy use to packaging and waste, and the sourcing of key ingredients for our menu.",pollution,0.991387,E
220,meaningful change also requires us to find alternative and sustainable solutions to help protect the worlds natural resources and the communities that rely on them.,natural resources,0.998103,E
240,"weare committed to partnering with our suppliers around the world to scale innovative practices, fromresponsible sourcing and regenerative agriculture, towidespread reuse andrecycling programs.",natural resources,0.898137,E
260,"the actions we continue to take today across people, communities and our planet will ensure were building a better business and a more trusted brand forgenerations to come.",natural resources,0.972356,E
270,"further details about mcdonalds strategy, goals and progress can be found at a message from our ceo chris kempczinski president & ceo, mcdonalds corporation purpose & impact progress summary introduction food quality & sourcing our planet jobs, inclusion & empowerment community connection 2 food quality & sourcing at mcdonalds, our purpose is to feed and foster communities.",natural resources,0.930494,E
280,"to do this, we must help address some of the worlds most pressing social challenges and ensure the natural world is protected for future generations.",natural resources,0.994608,E


Here, for example, are the first 10 sentences classified under environmental topics with a score above 0.8.
The first sentence has a score of 0.98, so the model is highly confident that it concerns the environment.

In [None]:
# Look at an example of "S" classified sentences:
S_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("S")].copy()
S_sentences.head(10)

index,sequence,labels,scores,ESG
20,"with the strength of our full system, weve worked together to build a more diverse, equitable and inclusive business, source more food responsibly, adopt more sustainable practices, and implement innovative and credible solutions in our ongoing quest to be a good neighbor in the communities where we live, work and serve.",diversity and inclusion,0.99569,S
30,we are proud of the work we do to make a difference and will continue to help uphold this promise in all of the communities in which we operate.,diversity and inclusion,0.878567,S
31,we are proud of the work we do to make a difference and will continue to help uphold this promise in all of the communities in which we operate.,philanthropy,0.845291,S
40,"showing up for our communities ray kroc used to say, none of us is as good as all of us a phrase that serves as a constant reminder of mcdonalds impact on the world when we leverage thecollective strength of our system.",diversity and inclusion,0.95414,S
41,"showing up for our communities ray kroc used to say, none of us is as good as all of us a phrase that serves as a constant reminder of mcdonalds impact on the world when we leverage thecollective strength of our system.",philanthropy,0.831389,S
80,"in moments like these, our number one priority remains our people.",diversity and inclusion,0.803527,S
90,"i continue to be deeply moved by the offers of support from across our global system and all the generous contributions from colleagues opening their homes to refugees around the world, to the deployment of the ronald mcdonald house charities (rmhc) care mobile in poland and latvia.",philanthropy,0.986728,S
91,"i continue to be deeply moved by the offers of support from across our global system and all the generous contributions from colleagues opening their homes to refugees around the world, to the deployment of the ronald mcdonald house charities (rmhc) care mobile in poland and latvia.",diversity and inclusion,0.943042,S
110,"through our newly established mcdonalds community fund, we are now better prepared to respond when people need us the most whether by investing in chicago-based neighborhood organizations that are actively and effectively working to address the youth opportunity crises in our hometown, oracorporate contribution of $5 million toanemployee assistance fund to support our ukrainian colleagues.",philanthropy,0.935826,S
111,"through our newly established mcdonalds community fund, we are now better prepared to respond when people need us the most whether by investing in chicago-based neighborhood organizations that are actively and effectively working to address the youth opportunity crises in our hometown, oracorporate contribution of $5 million toanemployee assistance fund to support our ukrainian colleagues.",training and education,0.822521,S


Here are the first 10 sentences classified under social topics with a score above 0.8. The first sentence has a score above 0.99, so the model is highly confident that it concerns this topic.

In [None]:
# Look at an example of "G" classified sentences:
G_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("G")].copy()
G_sentences.head(10)

index,sequence,labels,scores,ESG
172,helping protect our planet earning the trust of our people and customers by doing what we say weregoing to do has always been key to building a strong brand and a lasting legacy.,corporate compliance,0.839229,G
180,"and in a world where people expect much more of the brands with which they do business, this becomes allthe more important.",corporate compliance,0.922523,G
481,"now, we are committed to eliminating deforestation from our global supply chain by the end of 2030.",corporate compliance,0.86396,G
550,"in our top 35 markets, on average 35% of mcdonalds restaurants offer guests the opportunity to recycle packaging items.",transparancy,0.838566,G
571,"for example, on average, roughly 78% of our restaurants in mcdonalds largest european markets already provide guest packaging recycling.",corporate compliance,0.885882,G
690,93% of audited suppliers met sqms standards.,corporate compliance,0.913047,G
772,"mcdonalds will continue to approach this responsibly, offering balanced options and promoting menu items that contribute to recommended food groups, such as fruits, vegetables and low-fat dairy.",corporate compliance,0.878814,G
773,"mcdonalds will continue to approach this responsibly, offering balanced options and promoting menu items that contribute to recommended food groups, such as fruits, vegetables and low-fat dairy.",board accountability,0.802943,G
1002,"set in 2018 and approved by the science based targets initiative (sbti), our current targets aim to reduce restaurant and office emissions by 36% by 2030 from a 2015 baseline, and supply chain emissions intensity by 31% over the same period.",corporate compliance,0.803865,G
1040,"tackling climate change for a stronger food system we are working to take climate action and transform our food systems for the better, embracing our unique opportunity to mobilize the entire mcdonalds system to act.",transparancy,0.962678,G


Here are the first 10 sentences classified under governance topics with a score above 0.8. The first sentence has a score of about 0.84, so the model considers it strongly related to this topic.
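
Several sentences above appear under more than one label, because each label is scored independently. A minimal sketch of one possible follow-up, assuming we want a single best label per sentence and a count per ESG pillar: the records are copied from the outputs above, and `top_label_per_sentence` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter

NET_ZERO = "thats why, in 2021, we set an ambition to achieve net zero emissions by 2050."

# (sentence, label, score, ESG pillar), copied from the outputs above
records = [
    (NET_ZERO, "pollution", 0.957022, "E"),
    (NET_ZERO, "emissions", 0.956338, "E"),
    ("in moments like these, our number one priority remains our people.",
     "diversity and inclusion", 0.803527, "S"),
    ("93% of audited suppliers met sqms standards.",
     "corporate compliance", 0.913047, "G"),
]

def top_label_per_sentence(records, cutoff=0.8):
    """Keep, for each sentence, only its highest-scoring label above the cutoff."""
    best = {}
    for sentence, label, score, pillar in records:
        if score > cutoff and (sentence not in best or score > best[sentence][1]):
            best[sentence] = (label, score, pillar)
    return best

best = top_label_per_sentence(records)
pillar_counts = Counter(pillar for _, _, pillar in best.values())
print(best[NET_ZERO])
print(dict(pillar_counts))
```

The same idea could be expressed on the `classified` DataFrame with a sort followed by dropping duplicate sentences; the plain-Python version is used here only to keep the sketch dependency-free.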