# Accelerating Cleantech Advancements through NLP-Powered Text Mining and Knowledge Extraction

Group: Marusa Storman, Vignesh Govindaraj, Pradip Ravichandran

## Stage 2: Advanced Embedding Models Training and Analysis

### Data Preparation for Embeddings

In [2]:
import sys
import os

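# Use the current working directory, assumed to be the folder that contains
# this notebook, so that the relative "Data/..." paths below resolve
notebook_dir = os.getcwd()

# Note: this call is a no-op unless notebook_dir is pointed somewhere else
os.chdir(notebook_dir)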

# List of required libraries
required_libraries = [
    'numpy',
    'pandas'
]

# Check if each library is installed, if not, install it
for lib in required_libraries:
    try:
        __import__(lib)
    except ImportError:
        print(f"Installing {lib}...")
        !"{sys.executable}" -m pip install {lib}

In [3]:
import csv
import numpy as np
import pandas as pd


# Jupyter config
%config InteractiveShell.ast_node_interactivity = 'all'

# Additional setup for seaborn
# sns.set(color_codes=True)
# sns.set_style("whitegrid")

# Download needed NLTK's resources
# nltk.download('punkt')
# nltk.download('stopwords')

In [5]:
# Read the patent CSV file from https://www.kaggle.com/datasets/prakharbhandari20/cleantech-google-patent-dataset
google_patent_original = pd.read_csv("Data/lang/google_patent_en.csv")

# Read the first CSV file from https://www.kaggle.com/datasets/jannalipenkova/cleantech-media-dataset
media_original = pd.read_csv("Data/ct_media.csv")

# Read the second CSV file from https://www.kaggle.com/datasets/jannalipenkova/cleantech-media-dataset
media_evaluation_original = pd.read_csv("Data/ct_evaluation.csv")

In [6]:
# Summarize each column: data type, non-null count, missing count, and number of unique values
def analyze_column(df, has_list=False):
    info = pd.DataFrame({
        'Data Type': df.dtypes,
        'Number of Entries': df.count(),
        'Missing/None Count': df.isna().sum(),
        'Uniqueness': df.nunique()
    })
    
    return info

print("Google Patent Dataset:")
google_patent_original.head()
analyze_column(google_patent_original)
print("\nNumber of duplicate rows:", google_patent_original.duplicated().sum())

Google Patent Dataset:


|  | publication_number | country_code | publication_date | inventor | title_localized_text | title_localized_language | abstract_localized_text | abstract_localized_language | cpc_code |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US-2022239235-A1 | US | 2022-07-28 |  | Adaptable DC-AC Inverter Drive System and Oper... | en | Disclosed is an adaptable DC-AC inverter syste... | en | H02M7/5395 |
| 1 | US-2022239235-A1 | US | 2022-07-28 |  | Adaptable DC-AC Inverter Drive System and Oper... | en | Disclosed is an adaptable DC-AC inverter syste... | en | H02J3/32 |
| 2 | US-2022239235-A1 | US | 2022-07-28 |  | Adaptable DC-AC Inverter Drive System and Oper... | en | Disclosed is an adaptable DC-AC inverter syste... | en | H02M1/32 |
| 3 | US-2022239235-A1 | US | 2022-07-28 |  | Adaptable DC-AC Inverter Drive System and Oper... | en | Disclosed is an adaptable DC-AC inverter syste... | en | H02J1/10 |
| 4 | US-2022239235-A1 | US | 2022-07-28 |  | Adaptable DC-AC Inverter Drive System and Oper... | en | Disclosed is an adaptable DC-AC inverter syste... | en | H02J3/381 |


| Column | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| publication_number | object | 337021 | 0 | 13408 |
| country_code | object | 337021 | 0 | 30 |
| publication_date | object | 337021 | 0 | 165 |
| inventor | object | 299150 | 37871 | 20287 |
| title_localized_text | object | 337021 | 0 | 24873 |
| title_localized_language | object | 337021 | 0 | 11 |
| abstract_localized_text | object | 337021 | 0 | 26315 |
| abstract_localized_language | object | 337021 | 0 | 11 |
| cpc_code | object | 304158 | 32863 | 7097 |



Number of duplicate rows: 0
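
The summary above lists 337,021 rows but only 13,408 unique `publication_number` values, and the first five rows share one publication while differing in `cpc_code`. That pattern suggests the table repeats a patent once per CPC code (and possibly per language, since `title_localized_language` has 11 values). A minimal sketch of collapsing to one row per publication before embedding, assuming that reading is right and that only English text is wanted; the helper name `collapse_patents` and the toy frame are hypothetical, not part of the original notebook:

```python
import pandas as pd


def collapse_patents(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per publication_number, with CPC codes gathered into a list."""
    # Assumption: keep only rows whose title and abstract are both English.
    en = df[
        (df["title_localized_language"] == "en")
        & (df["abstract_localized_language"] == "en")
    ]
    return en.groupby("publication_number", as_index=False, sort=False).agg(
        title=("title_localized_text", "first"),
        abstract=("abstract_localized_text", "first"),
        # Missing CPC codes are dropped, so a patent without codes gets an empty list.
        cpc_codes=("cpc_code", lambda s: sorted(set(s.dropna()))),
    )


# Toy data standing in for google_patent_original (hypothetical values).
toy = pd.DataFrame({
    "publication_number": ["A-1", "A-1", "B-2"],
    "title_localized_text": ["Inverter", "Inverter", "Turbine"],
    "title_localized_language": ["en", "en", "en"],
    "abstract_localized_text": ["An inverter.", "An inverter.", "A turbine."],
    "abstract_localized_language": ["en", "en", "en"],
    "cpc_code": ["H02M1/32", "H02J3/32", None],
})
patents = collapse_patents(toy)
```

Embedding each abstract once, rather than once per CPC code, would avoid over-weighting patents that carry many classification codes. Whether that is desirable depends on the downstream task.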


In [7]:
print("Media Dataset:")
media_original.head()
analyze_column(media_original)
print("\nNumber of duplicate rows:", media_original.duplicated().sum())

Media Dataset:


|  | title | date | author | content | domain | url |
|---|---|---|---|---|---|---|
| 0 | Qatar to Slash Emissions as LNG Expansion Adva... | 2021-01-13 |  | Qatar Petroleum ( QP) is targeting aggressive ... | energyintel | 0000017b-a7dc-de4c-a17b-e7de685b0000 |
| 1 | India Launches Its First 700 MW PHWR | 2021-01-15 |  | • Nuclear Power Corp. of India Ltd. ( NPCIL) s... | energyintel | 0000017b-a7dc-de4c-a17b-e7de6c710001 |
| 2 | New Chapter for US-China Energy Trade | 2021-01-20 |  | New US President Joe Biden took office this we... | energyintel | 0000017b-a7dc-de4c-a17b-e7de735a0000 |
| 3 | Japan: Slow Restarts Cast Doubt on 2030 Energy... | 2021-01-22 |  | The slow pace of Japanese reactor restarts con... | energyintel | 0000017b-a7dc-de4c-a17b-e7de79160000 |
| 4 | NYC Pension Funds to Divest Fossil Fuel Shares | 2021-01-25 |  | Two of New York City's largest pension funds s... | energyintel | 0000017b-a7dc-de4c-a17b-e7de7d9e0000 |


| Column | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| title | object | 9593 | 0 | 9569 |
| date | object | 9593 | 0 | 967 |
| author | object | 31 | 9562 | 7 |
| content | object | 9593 | 0 | 9588 |
| domain | object | 9593 | 0 | 19 |
| url | object | 9591 | 2 | 9580 |



Number of duplicate rows: 0
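
Although no full row is duplicated, the summary shows 9,593 articles but only 9,588 unique `content` values, and `author` is missing in 9,562 rows. A minimal sketch of one possible cleanup before embedding, assuming repeated article bodies should be embedded once and that the sparse `author` column is not needed; the helper name `prepare_media` and the toy frame are hypothetical:

```python
import pandas as pd


def prepare_media(df: pd.DataFrame, text_col: str = "content") -> pd.DataFrame:
    """Drop the mostly-empty author column and keep the first copy of each article body."""
    # errors="ignore" lets the helper run even if the column is already gone.
    out = df.drop(columns=["author"], errors="ignore")
    # duplicated() on whole rows misses these, so compare on the text column only.
    out = out.drop_duplicates(subset=text_col, keep="first")
    return out.reset_index(drop=True)


# Toy data standing in for media_original (hypothetical values).
toy_media = pd.DataFrame({
    "title": ["A", "A (repost)", "B"],
    "author": [None, None, None],
    "content": ["same body", "same body", "other body"],
})
media_clean = prepare_media(toy_media)
```

Applied to the real data, the call would be `prepare_media(media_original)`. Keeping the first occurrence is an arbitrary choice; keeping the most recent by `date` would be an equally reasonable alternative.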


In [8]:
print("Media Evaluation Dataset:")
media_evaluation_original.head()
analyze_column(media_evaluation_original)
print("\nNumber of duplicate rows:", media_evaluation_original.duplicated().sum())

Media Evaluation Dataset:


|  | example_id | question_id | question | relevant_chunk | article_url | domain |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... | sgvoice.net |
| 1 | 2 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... | sgvoice.net |
| 2 | 3 | 2 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | https://www.pv-magazine.com/2023/02/02/europea... | pv-magazine.com |
| 3 | 4 | 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... | sgvoice.net |
| 4 | 5 | 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | https://cleantechnica.com/2023/05/08/general-m... | cleantechnica.com |


| Column | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| example_id | int64 | 23 | 0 | 23 |
| question_id | int64 | 23 | 0 | 21 |
| question | object | 23 | 0 | 21 |
| relevant_chunk | object | 23 | 0 | 23 |
| article_url | object | 23 | 0 | 21 |
| domain | object | 23 | 0 | 6 |



Number of duplicate rows: 0


### Word Embedding Training

### Sentence Embedding Training

### Embedding Model Evaluation