---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---


# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

Remember, this page is a technical narrative, NOT just a notebook with a collection of code cells. Include inline prose to describe what is going on.

In [1]:
import pandas as pd
import numpy as np

In [None]:
# Load the raw company gender data and the company CIK list
df_gender = pd.read_csv("../../data/raw-data/company_gender_data.csv")
df_cik = pd.read_csv("../../data/raw-data/company_cik_list.csv")

In [5]:
df_gender.head()

Unnamed: 0,WBA_ID,Company-name,ISIN,SEDOL Code,Region,Country,Industry,Total,Percentage of Total Possible Score \n (out of 52.3),CEO Gender,...,VHR-E02.EA-Explanation,VHR-E02.EA-Evidence,VHR-E02.EA-Source,VHR-E02.EA-Link,VHR-E02.EA-Score.1,VHR-E02.EA-Assessment.1,VHR-E02.EA-Explanation.1,VHR-E02.EA-Evidence.1,VHR-E02.EA-Source.1,VHR-E02.EA-Link.1
0,PT_00001,3M Company,US88579Y1010,2595708,North America,United States,Chemicals,11.3,21.6,Male,...,No evidence was found regarding whether the co...,,Sustainability Report_CY-2022,https://multimedia.3m.com/mws/media/2292786O/3...,0.0,Unmet,No evidence was found regarding whether the co...,,Sustainability Report_CY-2022,https://multimedia.3m.com/mws/media/2292786O/3...
1,PT_00006,AbbVie,US00287Y1091,B92SR70,North America,United States,Pharmaceuticals & Biotechnology,15.4,29.5,Male,...,No evidence was found regarding whether the co...,,The AbbVie Code of Business Conduct,https://investors.abbvie.com/static-files/09fd...,0.0,Unmet,No evidence was found regarding whether the co...,,The AbbVie Code of Business Conduct,https://investors.abbvie.com/static-files/09fd...
2,PT_00007,Abercrombie & Fitch,US0028962076,2004185,North America,United States,Apparel & Footwear,10.0,19.1,Female,...,No evidence was found regarding whether the co...,,,,0.0,Unmet,No evidence was found regarding whether the co...,,Form 10-K_2022-2023,https://abercrombieandfitchcompany.gcs-web.com...
3,PT_00024,Adobe,US00724F1012,2008154,North America,United States,Digital,16.5,31.6,Male,...,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://www.adobe.com/content/dam/cc/en/corpor...,0.0,Unmet,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://www.adobe.com/content/dam/cc/en/corpor...
4,PT_00027,AMD,US0079031078,2007849,North America,United States,Digital,11.2,21.5,Female,...,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://d1io3yog0oux5.cloudfront.net/_ebdf5d9e...,0.0,Unmet,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://d1io3yog0oux5.cloudfront.net/_ebdf5d9e...


In [7]:
df_cik.head()

Unnamed: 0,Company-name,CIK-code
0,3M Company,CIK0000066740
1,AbbVie,CIK0001551152
2,Abercrombie & Fitch,CIK0001018840
3,Adobe,CIK0000008680
4,AMD,CIK0000002488


In [None]:
# Inner-join the CIK list and the gender data on the shared "Company-name" column
df = pd.merge(df_cik, df_gender, on="Company-name", how="inner")
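
An inner join on a free-text key such as a company name silently drops rows whose names do not match exactly, and it multiplies rows if a name is duplicated on either side. The sketch below shows one way to audit such a join with the `indicator` and `validate` options of `pd.merge`. It runs on small stand-in frames, not the project data, and the "Example Corp" row is hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_cik and df_gender (illustrative rows, not the project data)
cik = pd.DataFrame({
    "Company-name": ["3M Company", "AbbVie", "Example Corp"],  # "Example Corp" is hypothetical
    "CIK-code": ["CIK0000066740", "CIK0001551152", None],
})
gender = pd.DataFrame({
    "Company-name": ["3M Company", "AbbVie", "Adobe"],
    "CEO Gender": ["Male", "Male", "Male"],
})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each key was found in both frames or in only one of them
audit = pd.merge(cik, gender, on="Company-name", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()
print(match_counts)

# validate="one_to_one" raises pandas.errors.MergeError if a company name
# appears more than once in either frame
merged = pd.merge(cik, gender, on="Company-name", how="inner", validate="one_to_one")
print(merged.shape)
```

Rows labeled `left_only` or `right_only` in the audit are the ones an inner join would discard, so they are worth inspecting for spelling differences before the merge is accepted.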

In [15]:
# Check "CEO Gender" and "CIK-code" for NA values
missing_ceo_gender = df['CEO Gender'].isna().sum()
missing_cik_code = df['CIK-code'].isna().sum()

print(f"Number of missing values in 'CEO Gender': {missing_ceo_gender}")
print(f"Number of missing values in 'CIK-code': {missing_cik_code}")

Number of missing values in 'CEO Gender': 4
Number of missing values in 'CIK-code': 35
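
The two per-column counts can overlap, because a single row may be missing both fields. The number of rows that a later drop removes is therefore not necessarily the sum of the two counts. A minimal sketch of counting affected rows directly, using a hypothetical toy frame with the same two column names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the two column names mirror the project data
toy = pd.DataFrame({
    "Company-name": ["A", "B", "C", "D"],
    "CEO Gender": ["Male", np.nan, "Female", np.nan],
    "CIK-code": ["CIK0000000001", "CIK0000000002", np.nan, np.nan],
})

key_columns = ["CEO Gender", "CIK-code"]

# Per-column missing counts, the same check as above but for both columns at once
missing_summary = toy[key_columns].isna().sum()

# Rows missing at least one key column; row "D" is counted once, not twice
rows_with_any_missing = int(toy[key_columns].isna().any(axis=1).sum())

print(missing_summary)
print(f"Rows missing at least one key column: {rows_with_any_missing}")
```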


In [19]:
# Drop rows with an NA value in either 'CEO Gender' or 'CIK-code'
df = df.dropna(subset=['CEO Gender', 'CIK-code'])
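
With `subset`, `dropna` only looks at the listed columns, so missing values elsewhere in the frame do not cause a row to be removed. The sketch below illustrates that behavior on a hypothetical toy frame and reports how many rows were dropped. The `Other` column and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "Other" stands in for any non-key column
toy = pd.DataFrame({
    "Company-name": ["A", "B", "C", "D"],
    "CEO Gender": ["Male", np.nan, "Female", "Male"],
    "CIK-code": ["CIK0000000001", "CIK0000000002", np.nan, "CIK0000000004"],
    "Other": [np.nan, 1.0, 2.0, 3.0],
})

rows_before = len(toy)

# Row "A" survives even though "Other" is missing, because "Other" is not in subset
cleaned = toy.dropna(subset=["CEO Gender", "CIK-code"])

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Logging the dropped-row count next to the drop makes it easier to confirm that the final shape is what the missing-value check predicted.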

In [21]:
df.head()

Unnamed: 0,Company-name,CIK-code,WBA_ID,ISIN,SEDOL Code,Region,Country,Industry,Total,Percentage of Total Possible Score \n (out of 52.3),...,VHR-E02.EA-Explanation,VHR-E02.EA-Evidence,VHR-E02.EA-Source,VHR-E02.EA-Link,VHR-E02.EA-Score.1,VHR-E02.EA-Assessment.1,VHR-E02.EA-Explanation.1,VHR-E02.EA-Evidence.1,VHR-E02.EA-Source.1,VHR-E02.EA-Link.1
0,3M Company,CIK0000066740,PT_00001,US88579Y1010,2595708,North America,United States,Chemicals,11.3,21.6,...,No evidence was found regarding whether the co...,,Sustainability Report_CY-2022,https://multimedia.3m.com/mws/media/2292786O/3...,0.0,Unmet,No evidence was found regarding whether the co...,,Sustainability Report_CY-2022,https://multimedia.3m.com/mws/media/2292786O/3...
1,AbbVie,CIK0001551152,PT_00006,US00287Y1091,B92SR70,North America,United States,Pharmaceuticals & Biotechnology,15.4,29.5,...,No evidence was found regarding whether the co...,,The AbbVie Code of Business Conduct,https://investors.abbvie.com/static-files/09fd...,0.0,Unmet,No evidence was found regarding whether the co...,,The AbbVie Code of Business Conduct,https://investors.abbvie.com/static-files/09fd...
2,Abercrombie & Fitch,CIK0001018840,PT_00007,US0028962076,2004185,North America,United States,Apparel & Footwear,10.0,19.1,...,No evidence was found regarding whether the co...,,,,0.0,Unmet,No evidence was found regarding whether the co...,,Form 10-K_2022-2023,https://abercrombieandfitchcompany.gcs-web.com...
3,Adobe,CIK0000008680,PT_00024,US00724F1012,2008154,North America,United States,Digital,16.5,31.6,...,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://www.adobe.com/content/dam/cc/en/corpor...,0.0,Unmet,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://www.adobe.com/content/dam/cc/en/corpor...
4,AMD,CIK0000002488,PT_00027,US0079031078,2007849,North America,United States,Digital,11.2,21.5,...,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://d1io3yog0oux5.cloudfront.net/_ebdf5d9e...,0.0,Unmet,No evidence was found regarding whether the co...,,Code of Conduct / Code of Ethics_2022-2023,https://d1io3yog0oux5.cloudfront.net/_ebdf5d9e...


In [22]:
df.shape

(211, 303)

In [24]:
# Define the path
save_path = "../../data/processed-data/merged_data.csv"

# Save csv
df.to_csv(save_path, index=False)
print(f"Cleaned data saved to: {save_path}")

Cleaned data saved to: ../../data/processed-data/merged_data.csv
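
Identifier columns such as `SEDOL Code` mix purely numeric values with alphanumeric ones, so a later `read_csv` may infer a numeric type when a sample happens to contain only digits. A minimal sketch of a round trip that pins those columns to strings follows. It writes to an in-memory buffer instead of the project path, and whether downstream code needs this is an assumption.

```python
import io

import pandas as pd

# Two rows taken from the previews above, limited to the identifier columns
toy = pd.DataFrame({
    "Company-name": ["3M Company", "Abercrombie & Fitch"],
    "CIK-code": ["CIK0000066740", "CIK0001018840"],
    "SEDOL Code": ["2595708", "2004185"],
})

# Write to an in-memory buffer in place of merged_data.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a dtype hint, the all-digit SEDOL values are parsed as integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# With an explicit dtype, the identifiers come back as strings
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"CIK-code": str, "SEDOL Code": str})

print(inferred["SEDOL Code"].tolist())
print(restored["SEDOL Code"].tolist())
```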
