<a href="https://colab.research.google.com/github/Kensuzuki95/Corporate_AI_Ethics_Guideline_Analysis/blob/main/Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

** **
# Step 1: Load Package
** **

In [20]:
import numpy as np 
import pandas as pd 
import requests
import io

** **
# Step 2: Load Data
** **

In [22]:
# Downloading the csv file from your GitHub account

url = ("https://raw.githubusercontent.com/Kensuzuki95/Corporate_AI_Ethics_Guideline_Analysis/main/Dataset/Dataset_Filtered.csv")
download = requests.get(url).content

dataset = pd.read_csv(io.StringIO(download.decode('utf-8')))

dataset.head()

Unnamed: 0,No.,Company Name,Country,Industry,Published Year,Last Revised,Link,Document Name,Main Text,Comment
0,1,Accenture,Ireland,Consulting,03-30-2021,03-30-2021,https://www.accenture.com/content/dam/accentur...,Responsible AI From principles to practice,Responsible AI\r\nFrom principles to practice\...,Addtional Details: https://www.accenture.com/u...
1,2,Adobe,United States of America,Software,,,https://www.adobe.com/content/dam/cc/en/ai-eth...,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\r\nAt Adobe, o...",Addtional Details: https://www.adobe.com/conte...
2,3,Alphabet,United States of America,Software,,,https://ai.google/responsibilities/responsible...,Responsible AI practices,Responsible AI practices\r\nThe development of...,Addtional Information: https://ai.google/princ...
3,4,Amazon,United States of America,Software,,,https://d1.awsstatic.com/responsible-machine-l...,Responsible Use of Machine Learning,"Responsible Use of Machine Learning\r\nAt AWS,...",
4,5,Atos,France,Consulting,,,https://atos.net/en/lp/cybersecurity-magazine-...,The Atos Blueprint for Responsible AI,AI is a broad topic encompassing many differen...,


** **
## Clean the Dataset Format
** **

In [None]:
text_data = data.drop(columns=['No.','Country', 'Industry', 'Published Year', 'Last Revised', 'Link', 'Comment'], axis=1)
text_data.head()

Unnamed: 0,Company Name,Document Name,Main Text
0,Sony Group,Sony Group AI Ethics Guidelines,AI Engagement within Sony Group\nThrough the u...
1,Samsung,AI Ethics Principles,AI Ethics Principles\nAI is a rapidly developi...
2,Accenture,Responsible AI From principles to practice,Responsible AI\nFrom principles to practice\nC...
3,Acer,,
4,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\nAt Adobe, our..."


In [None]:
#Exclude firms without AI Ethics or Responsible AI Guideline Document
text_data = text_data.dropna()
text_data

Unnamed: 0,Company Name,Document Name,Main Text
0,Sony Group,Sony Group AI Ethics Guidelines,AI Engagement within Sony Group\nThrough the u...
1,Samsung,AI Ethics Principles,AI Ethics Principles\nAI is a rapidly developi...
2,Accenture,Responsible AI From principles to practice,Responsible AI\nFrom principles to practice\nC...
4,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\nAt Adobe, our..."
8,Alphabet,Responsible AI practices,Responsible AI practices\nThe development of A...


** **
#Step 3: Data Cleaning
** **

Since the goal of this analysis is to perform topic modeling, we will solely focus on the text data from each paper, and drop other metadata columns

## Remove punctuation/lower casing

Next, let’s perform a simple preprocessing on the content of paper_text column to make them more amenable for analysis, and reliable results. To do that, we’ll use a regular expression to remove any punctuation, and then lowercase the text

In [None]:
# Load the regular expression library 
import re

# Remove punctuation
text_data['main_text_processed'] = text_data['Main Text'].map(lambda x: re.sub('[,\.!?]', '', x))

# Convert the text to lowercase
text_data['main_text_processed'] = text_data['main_text_processed'].map(lambda x: x.lower())

# Print out the first rows of papers
text_data['main_text_processed'].head()

0    ai engagement within sony group\nthrough the u...
1    ai ethics principles\nai is a rapidly developi...
2    responsible ai\nfrom principles to practice\nc...
4    adobe’s commitment to ai ethics\nat adobe our ...
8    responsible ai practices\nthe development of a...
Name: main_text_processed, dtype: object

## Tokenize words and further clean-up text

Let’s tokenize each sentence into a list of words, removing punctuations and unnecessary characters altogether.

In [None]:
text_data.head()

Unnamed: 0,Company Name,Document Name,Main Text,main_text_processed
0,Sony Group,Sony Group AI Ethics Guidelines,AI Engagement within Sony Group\nThrough the u...,ai engagement within sony group\nthrough the u...
1,Samsung,AI Ethics Principles,AI Ethics Principles\nAI is a rapidly developi...,ai ethics principles\nai is a rapidly developi...
2,Accenture,Responsible AI From principles to practice,Responsible AI\nFrom principles to practice\nC...,responsible ai\nfrom principles to practice\nc...
4,Adobe,Adobe’s Commitment to AI Ethics,"Adobe’s Commitment to AI Ethics\nAt Adobe, our...",adobe’s commitment to ai ethics\nat adobe our ...
8,Alphabet,Responsible AI practices,Responsible AI practices\nThe development of A...,responsible ai practices\nthe development of a...


In [None]:
import gensim
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

data = text_data.main_text_processed.values.tolist()
data_words = list(sent_to_words(data))

print(data_words[:1][0][:30])

['ai', 'engagement', 'within', 'sony', 'group', 'through', 'the', 'utilization', 'of', 'artificial', 'intelligence', 'ai', 'sony', 'aims', 'to', 'contribute', 'to', 'the', 'development', 'of', 'peaceful', 'and', 'sustainable', 'society', 'while', 'delivering', 'kando', 'sense', 'of', 'excitement']
