<a href="https://colab.research.google.com/github/AI4ALL-ESG-Investing/esg-financial-assistant/blob/main/data-cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading A Kaggle Dataset into your Google Drive

In this tutorial, we will download a dataset from Kaggle into Google Colab and Google Drive. To begin, create an account on [Kaggle](https://www.kaggle.com/). Next, find a dataset to explore further. In this example, we will be using the [salaries of data science professionals](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023) dataset.

## Mount Google Drive to Colab

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Connect to Kaggle using Token
To download data from Kaggle, a JSON token file must be in an accessible directory. To get this file:

1. Go to [Kaggle Settings](https://www.kaggle.com/settings) in your account.
2. Scroll down to **API**.
3. Click **Create New Token**. This should automatically download a `kaggle.json` file.
4. Upload this file to the Google Drive folder where you will be downloading the Kaggle data. In this example, I have created a folder called "Kaggle".
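
Before pointing Colab at the token, it can help to confirm that the file actually landed in the folder. The helper below is a minimal sketch, not part of the Kaggle package: `check_kaggle_token` is a hypothetical name, and it assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle's downloaded `kaggle.json` normally has.

```python
import json
from pathlib import Path


def check_kaggle_token(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key' fields.

    Hypothetical helper for illustration; it only inspects the file and never
    contacts Kaggle.
    """
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        # The file exists but is not valid JSON
        return False
    return isinstance(token, dict) and {"username", "key"} <= set(token)
```

In this tutorial's layout, the call would be `check_kaggle_token('/content/drive/MyDrive/Kaggle')` after Drive has been mounted.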

In [None]:
# Import OS for navigation and environment set up
import os
# Check current location
#os.getcwd()
# Point the Kaggle client at the directory that contains your kaggle.json file
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/Kaggle'

## Install Kaggle Package
The Kaggle package is not available by default on Colab and must be installed using pip.

In [None]:
!pip install kaggle



## Download Kaggle Data into Google Drive


Now that Kaggle is installed and can authenticate with the API through the `kaggle.json` file, we can download a dataset. Follow the steps below to find the download command, which adds the data directly to your Google Drive.

 1. Find a dataset on Kaggle, e.g. [Data Science Salaries 2023](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023).
 2. Click the **Download** button in the upper right-hand corner.
 3. Select “Kaggle CLI”.
 4. Paste that command into the code block below. (Note: add an exclamation point before the command so that Colab runs it in the terminal.)
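
As an alternative to changing directories before downloading, the command can name its destination explicitly. The sketch below only builds the argument list; `build_download_command` is a hypothetical helper, and it assumes the installed Kaggle CLI accepts the `-p` (destination path) and `--unzip` options, which is worth confirming with `kaggle datasets download -h`.

```python
import shlex


def build_download_command(dataset, dest_dir, unzip=True):
    """Build the argument list for a `kaggle datasets download` call.

    Hypothetical helper: it assembles the command but does not run it.
    """
    cmd = ["kaggle", "datasets", "download", dataset, "-p", dest_dir]
    if unzip:
        # Assumed CLI option: extract the archive right after downloading
        cmd.append("--unzip")
    return cmd


cmd = build_download_command(
    "alistairking/public-company-esg-ratings-dataset",
    "/content/drive/MyDrive/Kaggle",
)
# Show the command as it would be typed in a terminal
print(shlex.join(cmd))
```

In a Colab cell, the printed command can be run with a leading `!`, or the list can be passed to `subprocess.run(cmd, check=True)`.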


In [None]:
import os
# Navigate into Drive where you want to store your Kaggle data
os.chdir('/content/drive/MyDrive/Kaggle')
# Paste and run the copied API command; the data will download to the current directory
!kaggle datasets download alistairking/public-company-esg-ratings-dataset
# Check the contents of the directory; you should see the dataset's .zip file in your Drive
os.listdir()

Dataset URL: https://www.kaggle.com/datasets/alistairking/public-company-esg-ratings-dataset
License(s): CC-BY-NC-SA-4.0
public-company-esg-ratings-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


['data.csv',
 'public-company-esg-ratings-dataset.zip',
 'investment_survey.csv',
 'investment-survey-dataset.zip',
 'kaggle.json']

In [None]:
# Use absolute paths so the command works regardless of the current working directory
!unzip -o /content/drive/MyDrive/Kaggle/public-company-esg-ratings-dataset.zip -d /content/drive/MyDrive/Kaggle


In [None]:
!ls

drive  sample_data
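
The archive can also be extracted from Python, which avoids depending on the shell's working directory. This is a minimal sketch using the standard-library `zipfile` module; `extract_archive` is a hypothetical helper name, and the Drive paths in the usage note are the ones assumed throughout this tutorial.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir=None):
    """Extract zip_path into dest_dir (default: the archive's own folder).

    Returns the list of member names contained in the archive.
    """
    zip_path = Path(zip_path)
    dest = Path(dest_dir) if dest_dir is not None else zip_path.parent
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_archive('/content/drive/MyDrive/Kaggle/public-company-esg-ratings-dataset.zip')` would place the archive's files next to the `.zip` file in Drive.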


## Load Data into Colab

In [None]:
import pandas as pd

# Load the ESG dataset from the Drive folder it was downloaded to
df = pd.read_csv("/content/drive/MyDrive/Kaggle/data.csv")

In [None]:
df.head()

Unnamed: 0,ticker,name,currency,exchange,industry,logo,weburl,environment_grade,environment_level,social_grade,...,governance_grade,governance_level,environment_score,social_score,governance_score,total_score,last_processing_date,total_grade,total_level,cik
0,dis,Walt Disney Co,USD,"NEW YORK STOCK EXCHANGE, INC.",Media,https://static.finnhub.io/logo/ef50b4a2b263c84...,https://thewaltdisneycompany.com/,A,High,BB,...,BB,Medium,510,316,321,1147,19-04-2022,BBB,High,1744489
1,gm,General Motors Co,USD,"NEW YORK STOCK EXCHANGE, INC.",Automobiles,https://static.finnhub.io/logo/9253db78-80c9-1...,https://www.gm.com/,A,High,BB,...,B,Medium,510,303,255,1068,17-04-2022,BBB,High,1467858
2,gww,WW Grainger Inc,USD,"NEW YORK STOCK EXCHANGE, INC.",Trading Companies and Distributors,https://static.finnhub.io/logo/f153dcda-80eb-1...,https://www.grainger.com/,B,Medium,BB,...,B,Medium,255,385,240,880,19-04-2022,BB,Medium,277135
3,mhk,Mohawk Industries Inc,USD,"NEW YORK STOCK EXCHANGE, INC.",Consumer products,https://static.finnhub.io/logo/26868a62-80ec-1...,https://mohawkind.com/,A,High,B,...,BB,Medium,570,298,303,1171,18-04-2022,BBB,High,851968
4,lyv,Live Nation Entertainment Inc,USD,"NEW YORK STOCK EXCHANGE, INC.",Media,https://static.finnhub.io/logo/1cd144d2-80ec-1...,https://www.livenationentertainment.com/,BBB,High,BB,...,B,Medium,492,310,250,1052,18-04-2022,BBB,High,1335258


In [None]:
# Drop columns that are not needed for the analysis
df.drop(
    columns=[
        "logo", "weburl",
        "environment_grade", "environment_level",
        "social_grade", "social_level",
        "governance_grade", "governance_level",
        "total_grade", "total_level",
        "cik", "last_processing_date",
    ],
    inplace=True,
)

In [None]:
df.head(3)

Unnamed: 0,ticker,name,currency,exchange,industry,environment_score,social_score,governance_score,total_score
0,dis,Walt Disney Co,USD,"NEW YORK STOCK EXCHANGE, INC.",Media,510,316,321,1147
1,gm,General Motors Co,USD,"NEW YORK STOCK EXCHANGE, INC.",Automobiles,510,303,255,1068
2,gww,WW Grainger Inc,USD,"NEW YORK STOCK EXCHANGE, INC.",Trading Companies and Distributors,255,385,240,880


In [None]:
df.isna().any()

Unnamed: 0,0
ticker,False
name,False
currency,False
exchange,False
industry,True
environment_score,False
social_score,False
governance_score,False
total_score,False
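
`isna().any()` only reports whether a column has gaps. To see how many values are missing, `isna().sum()` gives a count per column. The sketch below runs on a small made-up frame (the rows are illustrative, not taken from the ESG dataset).

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "name": ["A Corp", "B Corp", "C Corp"],
    "industry": ["Media", None, None],
    "total_score": [1147, 1068, 880],
})

# Count missing values in every column
missing_counts = toy.isna().sum()

# Keep only the columns that actually have gaps
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

On the real dataset, the same two lines applied to `df` would show how many rows need an industry filled in.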


In [None]:
row_na = df[df["industry"].isna()]

In [None]:
print(row_na["name"])

15                Armada Acquisition Corp I
27            Acri Capital Acquisition Corp
32         ACE Convergence Acquisition Corp
57                    Edoc Acquisition Corp
76                      AF Acquisition Corp
97                     AIB Acquisition Corp
101        Sports Ventures Acquisition Corp
123                Alignment Healthcare LLC
630       Health Assurance Acquisition Corp
646    Healthcare Services Acquisition Corp
669                Artisan Acquisition Corp
675                          Powered Brands
696                Concord Acquisition Corp
Name: name, dtype: object


In [None]:
industry_map = {
    'Armada Acquisition Corp I': 'Financial Services',
    'Acri Capital Acquisition Corp': 'Financial Services',
    'ACE Convergence Acquisition Corp': 'Technology',
    'Edoc Acquisition Corp': 'Healthcare',
    'AF Acquisition Corp': 'Financial Services',
    'AIB Acquisition Corp': 'Financial Services',
    'Sports Ventures Acquisition Corp': 'Media & Entertainment',
    'Alignment Healthcare LLC': 'Healthcare',
    'Health Assurance Acquisition Corp': 'Healthcare',
    'Healthcare Services Acquisition Corp': 'Healthcare',
    'Artisan Acquisition Corp': 'Financial Services',
    'Powered Brands': 'Consumer Goods',
    'Concord Acquisition Corp': 'Financial Services'
}
# Fill only the missing industries, looking each one up by company name
df['industry'] = df['industry'].fillna(df['name'].map(industry_map))


In [None]:
df.isna().any()
# ESG Dataset cleaned

Unnamed: 0,0
ticker,False
name,False
currency,False
exchange,False
industry,False
environment_score,False
social_score,False
governance_score,False
total_score,False


In [None]:
import yfinance as yf

df["Ticker"] = df["ticker"].str.replace('$', '', regex=False).str.upper()
tickers = df["Ticker"].unique()  # Avoid redundant API calls for repeated tickers

latest_prices = {}

for ticker in tickers:
    try:
        stock = yf.Ticker(ticker)
        hist = stock.history(period="1d")
        if not hist.empty:
            latest_prices[ticker] = hist["Close"].iloc[-1]
        else:
            print(f"{ticker} has no data.")
            latest_prices[ticker] = None
    except Exception as e:
        print(f"{ticker} failed: {e}")
        latest_prices[ticker] = None

# Convert to DataFrame
latest_price_df = pd.DataFrame(list(latest_prices.items()), columns=["Ticker", "Latest_Price"])

# Merge on cleaned "Ticker" column
stock_merged = pd.merge(df, latest_price_df, on="Ticker", how="inner")

print(stock_merged.head())

ERROR:yfinance:$AAWW: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$AACI: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AAWW has no data.
AACI has no data.


ERROR:yfinance:$AADI: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AADI has no data.


ERROR:yfinance:$ABIO: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$ABMD: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ABIO has no data.
ABMD has no data.


ERROR:yfinance:$ABTX: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ABTX has no data.


ERROR:yfinance:$ACAC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACAC has no data.


ERROR:yfinance:$ACCD: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACCD has no data.


ERROR:yfinance:$ACEV: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$ACHL: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACEV has no data.
ACHL has no data.


ERROR:yfinance:$ACOR: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACOR has no data.


ERROR:yfinance:$ACRX: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACRX has no data.


ERROR:yfinance:$ADES: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ADES has no data.


ERROR:yfinance:$ADMP: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ADMP has no data.


ERROR:yfinance:$ACER: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$ADOC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ACER has no data.
ADOC has no data.


ERROR:yfinance:$AESE: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AESE has no data.


ERROR:yfinance:$AERI: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$AEY: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AERI has no data.
AEY has no data.


ERROR:yfinance:$AFAQ: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AFAQ has no data.


ERROR:yfinance:$AGFS: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AGFS has no data.


ERROR:yfinance:$AGIL: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AGIL has no data.


ERROR:yfinance:$AGLE: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AGLE has no data.


ERROR:yfinance:$AGRX: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AGRX has no data.


ERROR:yfinance:$AGTC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$AHPI: possibly delisted; no price data found  (period=1d)
ERROR:yfinance:$AIB: possibly delisted; no price data found  (period=1d)


AGTC has no data.
AHPI has no data.
AIB has no data.


ERROR:yfinance:$AIKI: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AIKI has no data.


ERROR:yfinance:$AKIC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AKIC has no data.


ERROR:yfinance:$AIMC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AIMC has no data.


ERROR:yfinance:$AKU: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")
ERROR:yfinance:$AKUS: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AKU has no data.
AKUS has no data.


ERROR:yfinance:$ALBO: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ALBO has no data.


ERROR:yfinance:$ALIM: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ALIM has no data.


ERROR:yfinance:$AIH: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


AIH has no data.


ERROR:yfinance:$ATVI: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ATVI has no data.


ERROR:yfinance:$CTXS: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


CTXS has no data.


ERROR:yfinance:$DISH: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


DISH has no data.


ERROR:yfinance:$NLOK: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


NLOK has no data.


ERROR:yfinance:$ALLK: possibly delisted; no price data found  (period=1d)


ALLK has no data.


ERROR:yfinance:$ABC: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


ABC has no data.


ERROR:yfinance:$SIVB: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


SIVB has no data.


ERROR:yfinance:$CDAY: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


CDAY has no data.


ERROR:yfinance:$DFS: possibly delisted; no price data found  (period=1d)


DFS has no data.


ERROR:yfinance:$DRE: possibly delisted; no price data found  (period=1d) (Yahoo error = "No data found, symbol may be delisted")


DRE has no data.
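
Tickers that return no data end up with a missing `Latest_Price` after the merge. One way to handle them, sketched here on made-up tickers and prices rather than real market data, is to record which tickers lack a price and then drop those rows.

```python
import pandas as pd

# Toy frames for illustration only; tickers, scores, and prices are invented
esg = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "total_score": [1100, 950, 870],
})
prices = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "Latest_Price": [10.5, None, 42.0],
})

merged = esg.merge(prices, on="Ticker", how="left")

# Tickers with no price (for example, delisted symbols)
missing = merged.loc[merged["Latest_Price"].isna(), "Ticker"].tolist()

# Keep only the rows that have a usable price
priced = merged.dropna(subset=["Latest_Price"]).reset_index(drop=True)

print(missing)
print(priced)
```

Applied to `stock_merged`, the same `dropna(subset=["Latest_Price"])` call would leave only the companies with a current price.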
