<a href="https://colab.research.google.com/github/Kensuzuki95/Corporate_AI_Ethics_Guideline_Analysis/blob/main/Dataset_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

** **
# Step 1: Load Package
** **

In [2]:
import numpy as np 
import pandas as pd 
import requests
import io

** **
# Step 2: Load Data
** **

In [3]:
# Downloading the csv file from your GitHub account

url_1 = ("https://raw.githubusercontent.com/Kensuzuki95/Corporate_AI_Ethics_Guideline_Analysis/main/Dataset/Dataset_Filtered.csv")
download = requests.get(url_1).content

dataset = pd.read_csv(io.StringIO(download.decode('utf-8')))

dataset.head()

Unnamed: 0,No.,Company Name,Country,Industry,Published Year,Last Revised,Link,Document Name,Main Text,Comment
0,1,Accenture,Ireland,Consulting,03-30-2021,03-30-2021,https://www.accenture.com/content/dam/accentur...,Responsible AI From principles to practice,Responsible AI\r\nFrom principles to practice\...,Addtional Details: https://www.accenture.com/u...
1,2,Adobe,United States of America,Software,,,https://www.adobe.com/content/dam/cc/en/ai-eth...,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o...",Addtional Details: https://www.adobe.com/conte...
2,3,Alphabet,United States of America,Software,,,https://ai.google/responsibilities/responsible...,Responsible AI practices,Responsible AI practices\r\nThe development of...,Addtional Information: https://ai.google/princ...
3,4,Amazon,United States of America,Software,,,https://d1.awsstatic.com/responsible-machine-l...,Responsible Use of Machine Learning,"Responsible Use of Machine Learning\r\nAt AWS,...",
4,5,Atos,France,Consulting,,,https://atos.net/en/lp/cybersecurity-magazine-...,The Atos Blueprint for Responsible AI,AI is a broad topic encompassing many differen...,


## Clean the Dataset Format

In [4]:
# Check for unnecessary columns
dataset.columns

Index(['No.', 'Company Name', 'Country', 'Industry', 'Published Year',
       'Last Revised', 'Link', 'Document Name', 'Main Text', 'Comment'],
      dtype='object')

In [5]:
# Keep only the company name, document name, and main text
text_data = dataset.drop(columns=['No.', 'Country', 'Industry', 'Published Year', 'Last Revised', 'Link', 'Comment'])
text_data.info()
text_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   49 non-null     object
 1   Document Name  49 non-null     object
 2   Main Text      49 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


Unnamed: 0,Company Name,Document Name,Main Text
0,Accenture,Responsible AI From principles to practice,Responsible AI\r\nFrom principles to practice\...
1,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o..."
2,Alphabet,Responsible AI practices,Responsible AI practices\r\nThe development of...
3,Amazon,Responsible Use of Machine Learning,"Responsible Use of Machine Learning\r\nAt AWS,..."
4,Atos,The Atos Blueprint for Responsible AI,AI is a broad topic encompassing many differen...


** **
# Step 3: Data Cleaning
** **

Since the goal of this analysis is to perform topic modeling, we focus solely on the text of each document and drop the other metadata columns.

## Remove punctuation/lower casing

Next, let’s apply some simple preprocessing to the `Main Text` column to make it more amenable to analysis and to produce more reliable results. We use a regular expression to remove a small set of punctuation characters (`, . ! ? ( )`), lowercase the text, and then tokenize it with NLTK.

In [5]:
# Load the regular expression library
import re
import nltk


# Remove punctuation
# (raw string, so the pattern contains no invalid escape sequences)
text_data['Main_Text_Processed'] = text_data['Main Text'].map(lambda x: re.sub(r'[,.!?()]', '', x))

# Convert the text to lowercase
text_data['Main_Text_Processed'] = text_data['Main_Text_Processed'].map(lambda x: x.lower())

# Applying Tokenization
nltk.download('punkt')
text_data['Main_Text_Tokenized'] = text_data.apply(lambda row: nltk.word_tokenize(row['Main_Text_Processed']), axis=1)

# Print out the first rows of the processed documents
#text_data.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
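
To see what the cleaning step does and does not remove, it can help to run it on a single string. The sketch below reuses the same character class as the cell above; the whitespace-normalization step and the `clean_text` helper are assumptions added for illustration and are not part of the notebook. Note that characters outside the class, such as the curly apostrophe in "Adobe’s", are left in place.

```python
import re

# Same character class as the notebook cell, written as a raw string.
PUNCTUATION = re.compile(r"[,.!?()]")
# Hypothetical extra step: collapse runs of whitespace, including "\r\n".
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip the listed punctuation, lowercase, and normalize whitespace."""
    no_punct = PUNCTUATION.sub("", text)
    return WHITESPACE.sub(" ", no_punct.lower()).strip()


sample = "Responsible AI\r\nFrom principles to practice (2021)."
cleaned = clean_text(sample)
print(cleaned)  # responsible ai from principles to practice 2021

# The curly apostrophe is not in the character class, so it survives.
print(clean_text("Adobe’s Commitment"))  # adobe’s commitment
```

Whether to normalize `\r\n` before tokenizing is a judgment call; the tokenizer already splits on whitespace, so the step mainly makes the processed column easier to read.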


## Tokenize words and further clean-up text

The previous cell already split each document into a list of word tokens. Below we inspect the result, then remove stopwords and stem the remaining tokens.

In [6]:
text_data.head()

Unnamed: 0,Company Name,Document Name,Main Text,Main_Text_Processed,Main_Text_Tokenized
0,Accenture,Responsible AI From principles to practice,Responsible AI\r\nFrom principles to practice\...,responsible ai\r\nfrom principles to practice\...,"[responsible, ai, from, principles, to, practi..."
1,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o...",adobe’s commitment to ai ethics\r\nat adobe ou...,"[adobe, ’, s, commitment, to, ai, ethics, at, ..."
2,Alphabet,Responsible AI practices,Responsible AI practices\r\nThe development of...,responsible ai practices\r\nthe development of...,"[responsible, ai, practices, the, development,..."
3,Amazon,Responsible Use of Machine Learning,"Responsible Use of Machine Learning\r\nAt AWS,...",responsible use of machine learning\r\nat aws ...,"[responsible, use, of, machine, learning, at, ..."
4,Atos,The Atos Blueprint for Responsible AI,AI is a broad topic encompassing many differen...,ai is a broad topic encompassing many differen...,"[ai, is, a, broad, topic, encompassing, many, ..."


In [7]:
# Define a function to remove stopwords from tokenized text
nltk.download('stopwords')
# Store the stopwords in a set for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output

# Apply the function to every tokenized document
text_data['Main_text_without_stopwords'] = text_data['Main_Text_Tokenized'].apply(remove_stopwords)
#text_data.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer

#defining the object for stemming
porter_stemmer = PorterStemmer()

#defining a function for stemming
def stemming(text):
  stem_text = [porter_stemmer.stem(word) for word in text]
  return stem_text

text_data['Main_text_stemmed'] = text_data['Main_text_without_stopwords'].apply(lambda x: stemming(x))
text_data.head()

Unnamed: 0,Company Name,Document Name,Main Text,Main_Text_Processed,Main_Text_Tokenized,Main_text_without_stopwords,Main_text_stemmed
0,Accenture,Responsible AI From principles to practice,Responsible AI\r\nFrom principles to practice\...,responsible ai\r\nfrom principles to practice\...,"[responsible, ai, from, principles, to, practi...","[responsible, ai, principles, practice, conten...","[respons, ai, principl, practic, content, resp..."
1,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o...",adobe’s commitment to ai ethics\r\nat adobe ou...,"[adobe, ’, s, commitment, to, ai, ethics, at, ...","[adobe, ’, commitment, ai, ethics, adobe, purp...","[adob, ’, commit, ai, ethic, adob, purpos, ser..."
2,Alphabet,Responsible AI practices,Responsible AI practices\r\nThe development of...,responsible ai practices\r\nthe development of...,"[responsible, ai, practices, the, development,...","[responsible, ai, practices, development, ai, ...","[respons, ai, practic, develop, ai, creat, new..."
3,Amazon,Responsible Use of Machine Learning,"Responsible Use of Machine Learning\r\nAt AWS,...",responsible use of machine learning\r\nat aws ...,"[responsible, use, of, machine, learning, at, ...","[responsible, use, machine, learning, aws, pro...","[respons, use, machin, learn, aw, proud, suppo..."
4,Atos,The Atos Blueprint for Responsible AI,AI is a broad topic encompassing many differen...,ai is a broad topic encompassing many differen...,"[ai, is, a, broad, topic, encompassing, many, ...","[ai, broad, topic, encompassing, many, differe...","[ai, broad, topic, encompass, mani, differ, fa..."


** **
# Step 4: Measure Text Similarity
** **



## Create BERT-based Text Similarity Scoring Model

In [6]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)

In [52]:
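# Create a dataset with only the Company Name and Main Text of each AI ethics document.
# .copy() makes this an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
similarity_data = text_data[['Company Name','Main Text']].copy()
sentences = similarity_data['Main Text'].values.tolist()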

In [50]:
# Load UNESCO's AI Ethics Principles Dataset
url_2 = ("https://raw.githubusercontent.com/Kensuzuki95/Corporate_AI_Ethics_Guideline_Analysis/main/Dataset/UNESCO_AI_Ethics_Principles.csv")
download = requests.get(url_2).content

principles = pd.read_csv(io.StringIO(download.decode('utf-8')))
principles.head()

Unnamed: 0,No.,Principle Name,Content
0,1,Proportionality and Do No Harm,It should be recognized that AI technologies d...
1,2,Safety and security,"Unwanted harms (safety risks), as well as vuln..."
2,3,Fairness and non-discrimination,AI actors should promote social justice and sa...
3,4,Sustainability,The development of sustainable societies relie...
4,5,"Right to Privacy, and Data Protection","Privacy, a right essential to the protection o..."


In [59]:
# Score every guideline document against one UNESCO principle
from sklearn.metrics.pairwise import cosine_similarity

def similarity_score(principle_number):
  # Principles are numbered from 1, while DataFrame rows are indexed from 0
  principle_text = principles.iloc[principle_number - 1]['Content']
  # Build a new list so the shared `sentences` list is left unchanged
  texts = [principle_text] + sentences
  embeddings = model.encode(texts)
  # Compare the principle (row 0) with every document (rows 1 onward)
  scores = cosine_similarity([embeddings[0]], embeddings[1:])[0].tolist()
  print("UNESCO AI Ethics Principle #" + str(principle_number) + ":" + '\n' + str(principle_text))
  return scores

## Principle 1

In [None]:
# Compute the similarity score of every document against Principle #1
results_1 = similarity_score(principle_number = 1)
#results_1

In [9]:
# Extract UNESCO's AI Ethics Principle #1
principle_1 = principles.iloc[0]['Content']
principle_1

'It should be recognized that AI technologies do not necessarily, per se, ensure human and environmental and ecosystem flourishing. Furthermore, none of the processes related to the AI system life cycle shall exceed what is necessary to achieve legitimate aims or objectives and should be appropriate to the context. In the event of possible occurrence of any harm to human beings, human rights and fundamental freedoms, communities and society at large or the environment and ecosystems, the implementation of procedures for risk assessment and the adoption of measures in order to preclude the occurrence of such harm should be ensured.\nThe choice to use AI systems and which AI method to use should be justified in the following ways: (a) the AI method chosen should be appropriate and proportional to achieve a given legitimate aim; (b) the AI method chosen should not infringe upon the foundational values captured in this document, in particular, its use must not violate or abuse human rights

In [30]:
# Insert UNESCO's AI Ethics Principle #1 into a copy of the AI Ethics Guideline texts

sentences_1 = sentences.copy()
sentences_1.insert(0, principle_1) 
#sentences_1
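
Because each principle is inserted at index 0 of a list of guideline texts, it matters whether that list is a copy of `sentences` or just another name for it: the scores can only be stored as a column when their count matches the number of rows in `similarity_data`. A minimal sketch of the difference, using hypothetical placeholder strings rather than the real documents:

```python
# Hypothetical placeholder texts standing in for the guideline documents.
documents = ["doc a", "doc b", "doc c"]

# Alias: both names refer to the same list, so insert() also grows `documents`.
aliased = documents
aliased.insert(0, "principle text")
length_after_alias = len(documents)
print(length_after_alias)  # 4

# Copy: the original list keeps its length.
fresh = ["doc a", "doc b", "doc c"]
copied = fresh.copy()
copied.insert(0, "principle text")
print(len(fresh), len(copied))  # 3 4
```

With an alias, every further insert adds one more entry to the shared list, so the number of scores drifts away from the number of documents.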

In [15]:
sentence_embeddings = model.encode(sentences_1)
sentence_embeddings.shape

(50, 768)

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
# Save the similarity results to a list object
results_1 = cosine_similarity([sentence_embeddings[0]], sentence_embeddings[1:])
results_1 = results_1.tolist()
results_1 = results_1[0]

# Confirm that the list is compatible with the original dataset (one score per document)
len(results_1)

49
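
For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below computes it by hand on toy 2-dimensional vectors; the vectors and the `cosine` helper are illustrative assumptions, not the notebook's 768-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins: one "principle" vector compared against three "documents".
principle_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = [cosine(principle_vec, d) for d in doc_vecs]
print([round(s, 3) for s in scores])  # [1.0, 0.0, 0.707]
```

The first document points in the same direction as the principle (score 1.0), the second is orthogonal (0.0), and the third lies in between.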

In [56]:
# Create a new column storing the results in the Dataset
similarity_data['Principle_1'] = results_1
similarity_data.head()
#similarity_data.info()

Unnamed: 0,Company Name,Main Text,Principle_1
0,Accenture,Responsible AI\r\nFrom principles to practice\...,0.717751
1,Adobe,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o...",0.566864
2,Alphabet,Responsible AI practices\r\nThe development of...,0.630594
3,Amazon,"Responsible Use of Machine Learning\r\nAt AWS,...",0.596675
4,Atos,AI is a broad topic encompassing many differen...,0.709803


## Principle 2

In [19]:
# Extract UNESCO's AI Ethics Principle #2
principle_2 = principles.iloc[1]['Content']
principle_2

'Unwanted harms (safety risks), as well as vulnerabilities to attack (security risks) should be avoided and should be addressed, prevented and eliminated throughout the life cycle of AI systems to ensure human, environmental and ecosystem safety and security. Safe and secure AI will be enabled by the development of sustainable, privacyprotective data access frameworks that foster better training and validation of AI models utilizing quality data.'

In [20]:
# Insert UNESCO's AI Ethics Principle #2 into a copy of the AI Ethics Guideline texts

sentences_2 = sentences.copy()
sentences_2.insert(0, principle_2) 
#sentences_2

In [21]:
sentence_embeddings = model.encode(sentences_2)
sentence_embeddings.shape

(50, 768)

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
# Save the similarity results to a list object
results_2 = cosine_similarity([sentence_embeddings[0]], sentence_embeddings[1:])
results_2 = results_2.tolist()
results_2 = results_2[0]

# Confirm that the list is compatible with the original dataset (one score per document)
len(results_2)

49

In [24]:
# Create a new column storing the results in the Dataset
similarity_data['Principle_2'] = results_2
similarity_data.head()
#similarity_data.info()

## Principle 3

In [None]:
# Extract UNESCO's AI Ethics Principle #3
principle_3 = principles.iloc[2]['Content']
principle_3

In [None]:
# Insert UNESCO's AI Ethics Principle #3 into a copy of the AI Ethics Guideline texts
sentences_3 = sentences.copy()
sentences_3.insert(0, principle_3) 
#sentences_3

In [None]:
sentence_embeddings = model.encode(sentences_3)
sentence_embeddings.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Save the similarity results to a list object
results_3 = cosine_similarity([sentence_embeddings[0]], sentence_embeddings[1:])
results_3 = results_3.tolist()
results_3 = results_3[0]

# Confirm that the list is compatible with the original dataset (one score per document)
len(results_3)

In [None]:
# Create a new column storing the results in the Dataset
similarity_data['Principle_3'] = results_3
similarity_data.head()
#similarity_data.info()

## Principle 4

In [None]:
# Extract UNESCO's AI Ethics Principle #4
principle_4 = principles.iloc[3]['Content']
principle_4

In [None]:
# Insert UNESCO's AI Ethics Principle #4 into a copy of the AI Ethics Guideline texts
sentences_4 = sentences.copy()
sentences_4.insert(0, principle_4) 
#sentences_4

In [None]:
sentence_embeddings = model.encode(sentences_4)
sentence_embeddings.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Save the similarity results to a list object
results_4 = cosine_similarity([sentence_embeddings[0]], sentence_embeddings[1:])
results_4 = results_4.tolist()
results_4 = results_4[0]

# Confirm that the list is compatible with the original dataset (one score per document)
len(results_4)

In [None]:
# Create a new column storing the results in the Dataset
similarity_data['Principle_4'] = results_4
similarity_data.head()
#similarity_data.info()
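
The per-principle cells above repeat the same four steps, so they could be collapsed into a loop. The sketch below is one possible shape for that loop, not the notebook's own code: `score_all_principles`, `toy_encode`, and the sample strings are hypothetical, and `toy_encode` is a bag-of-words stand-in for `model.encode` so the example runs without downloading a model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_encode(texts):
    """Hypothetical stand-in for model.encode: word counts over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[Counter(t.lower().split())[w] for w in vocab] for t in texts]


def score_all_principles(principle_texts, documents, encode):
    """Return {column_name: [one score per document]} for every principle."""
    columns = {}
    for number, principle in enumerate(principle_texts, start=1):
        # Build a fresh list on each pass so `documents` is never mutated.
        embeddings = encode([principle] + list(documents))
        columns[f"Principle_{number}"] = [
            cosine(embeddings[0], doc_vec) for doc_vec in embeddings[1:]
        ]
    return columns


docs = ["safety and security of ai systems", "fairness in hiring"]
principles_toy = ["safety and security", "fairness and non-discrimination"]
scores = score_all_principles(principles_toy, docs, toy_encode)
for name, values in scores.items():
    print(name, [round(v, 3) for v in values])
```

In the notebook itself, passing `model.encode` as `encode`, `principles['Content'].tolist()` as the principle texts, and `sentences` as the documents should yield one list per principle, which could then be attached with `similarity_data.assign(**scores)`. That wiring is an untested assumption here.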