# Accelerating Cleantech Advancements through NLP-Powered Text Mining and Knowledge Extraction

Group: Marusa Storman, Vignesh Govindaraj, Pradip Ravichandran

## Stage 2: Advanced Embedding Models Training and Analysis

### Data Preparation for Embeddings

In [1]:
import sys
import os

# Get the directory of the current notebook
notebook_dir = os.getcwd()

# Change current working directory to where the notebook resides
os.chdir(notebook_dir)

# List of required libraries
required_libraries = [
    'gensim',
    'scipy==1.12',
    'transformers',
    'torch'
]

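# Check if each library is installed, if not, install it.
# Note: a pinned entry such as 'scipy==1.12' is not an importable name,
# so it always raises ImportError and is (re)installed on every run.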
for lib in required_libraries:
    try:
        __import__(lib)
    except ImportError:
        print(f"Installing {lib}...")
        !"{sys.executable}" -m pip install {lib}

Installing scipy==1.12...
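
A side note on the check above: `__import__` expects an importable module name, so a pinned requirement such as `scipy==1.12` fails the import on every run and is installed again each time. Below is a minimal sketch of a check that strips the version pin first. The helper names `module_name` and `is_installed` are hypothetical, and the sketch assumes that the pip name matches the import name, which does not hold for every package.

```python
import importlib.util
import re


def module_name(requirement: str) -> str:
    """Return the part of a pip requirement string before any version specifier."""
    return re.split(r"[=<>~!]", requirement, maxsplit=1)[0].strip()


def is_installed(requirement: str) -> bool:
    """Check whether the top-level module named in the requirement can be found."""
    return importlib.util.find_spec(module_name(requirement)) is not None


for requirement in ["json", "surely_missing_module_xyz==1.0"]:
    status = "found" if is_installed(requirement) else "missing"
    print(f"{requirement}: {status}")
```

This only checks whether the module is present. It does not verify that the installed version matches the pin.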


In [2]:
import ast
import gc
import numpy as np
import pandas as pd
import time
import torch

from collections import Counter
from gensim.models import Word2Vec
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
# from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel
from wordcloud import WordCloud


# Jupyter config
%config InteractiveShell.ast_node_interactivity = 'all'

# Load pre-trained BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

The dataset has already been cleaned and prepared for embedding training in the initial task. To streamline processing and avoid redundant code, we'll import the preprocessed data. Before splitting it, let's quickly review the data again as a reminder.

In [3]:
# Get the preprocessed data from stage 1
google_patent_original = pd.read_csv("Data/google_patent_en_preprocessed.csv")
media_original = pd.read_csv("Data/ct_media_preprocessed.csv")
media_evaluation_original = pd.read_csv("Data/ct_evaluation_preprocessed.csv")

In [4]:
# Summarize each column: data type, non-null count, missing count and number of unique values
def analyze_column(df, has_list=False):
    info = pd.DataFrame({
        'Data Type': df.dtypes,
        'Number of Entries': df.count(),
        'Missing/None Count': df.isna().sum(),
        'Uniqueness': df.nunique()
    })
    
    return info

print("Google Patent Dataset:")
google_patent_original['publication_date'] = pd.to_datetime(google_patent_original['publication_date'])
google_patent_original.head()
analyze_column(google_patent_original)
print("\nNumber of duplicate rows:", google_patent_original.duplicated().sum())

Google Patent Dataset:


|  | publication_number | country_code | publication_date | title_localized_text | abstract_localized_text | title_tokens | abstract_tokens | title_token_count | abstract_token_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US-2022239235-A1 | US | 2022-07-28 | adaptable dcac inverter drive system and opera... | disclosed is an adaptable dcac inverter system... | ['adapt', 'dcac', 'invert', 'drive', 'system',... | ['disclos', 'adapt', 'dcac', 'invert', 'system... | 7 | 64 |
| 1 | US-2022239251-A1 | US | 2022-07-28 | system for providing the energy from a single ... | in accordance with an example embodiment a sol... | ['system', 'provid', 'energi', 'singl', 'conti... | ['accord', 'exampl', 'embodi', 'solar', 'energ... | 18 | 92 |
| 2 | US-11396827-B2 | US | 2022-07-26 | control method for optimizing solartopower eff... | a control method for optimizing a solartopower... | ['control', 'method', 'optim', 'solartopow', '... | ['control', 'method', 'optim', 'solartopow', '... | 15 | 149 |
| 3 | CN-114772674-A | CN | 2022-07-22 | lowcarbon running saline wastewater treatment ... | the invention discloses a system and a method ... | ['lowcarbon', 'run', 'salin', 'wastewat', 'tre... | ['invent', 'disclos', 'system', 'method', 'tre... | 15 | 226 |
| 4 | CN-217026795-U | CN | 2022-07-22 | water ecological remediation device convenient... | the utility model discloses a water ecological... | ['water', 'ecolog', 'remedi', 'devic', 'conven... | ['util', 'model', 'disclos', 'water', 'ecolog'... | 7 | 252 |


|  | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| publication_number | object | 13412 | 0 | 13351 |
| country_code | object | 13412 | 0 | 29 |
| publication_date | datetime64[ns] | 13412 | 0 | 158 |
| title_localized_text | object | 13412 | 0 | 12441 |
| abstract_localized_text | object | 13412 | 0 | 13250 |
| title_tokens | object | 13412 | 0 | 12424 |
| abstract_tokens | object | 13412 | 0 | 13235 |
| title_token_count | int64 | 13412 | 0 | 30 |
| abstract_token_count | int64 | 13412 | 0 | 282 |



Number of duplicate rows: 0


In [5]:
print("Media Dataset:")
media_original['date'] = pd.to_datetime(media_original['date'])
media_original.head()
analyze_column(media_original)
print("\nNumber of duplicate rows:", media_original.duplicated().sum())

Media Dataset:


|  | title | date | content | domain | title_tokens | content_tokens | title_token_count | content_token_count |
|---|---|---|---|---|---|---|---|---|
| 0 | qatar to slash emissions as lng expansion adva... | 2021-01-13 | qatar petroleum qp is targeting aggressive cut... | energyintel | ['qatar', 'slash', 'emiss', 'lng', 'expans', '... | ['qatar', 'petroleum', 'qp', 'target', 'aggres... | 8 | 442 |
| 1 | india launches its first 700 mw phwr | 2021-01-15 | nuclear power corp of india ltd npcil synchro... | energyintel | ['india', 'launch', 'first', '700', 'mw', 'phwr'] | ['nuclear', 'power', 'corp', 'india', 'ltd', '... | 7 | 538 |
| 2 | new chapter for uschina energy trade | 2021-01-20 | new us president joe biden took office this we... | energyintel | ['new', 'chapter', 'uschina', 'energi', 'trade'] | ['new', 'presid', 'joe', 'biden', 'took', 'off... | 6 | 706 |
| 3 | japan slow restarts cast doubt on 2030 energy ... | 2021-01-22 | the slow pace of japanese reactor restarts con... | energyintel | ['japan', 'slow', 'restart', 'cast', 'doubt', ... | ['slow', 'pace', 'japanes', 'reactor', 'restar... | 9 | 687 |
| 4 | nyc pension funds to divest fossil fuel shares | 2021-01-25 | two of new york citys largest pension funds sa... | energyintel | ['nyc', 'pension', 'fund', 'divest', 'fossil',... | ['two', 'new', 'york', 'citi', 'largest', 'pen... | 8 | 394 |


|  | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| title | object | 9593 | 0 | 9565 |
| date | datetime64[ns] | 9593 | 0 | 967 |
| content | object | 9593 | 0 | 9588 |
| domain | object | 9593 | 0 | 19 |
| title_tokens | object | 9593 | 0 | 9563 |
| content_tokens | object | 9593 | 0 | 9587 |
| title_token_count | int64 | 9593 | 0 | 25 |
| content_token_count | int64 | 9593 | 0 | 1782 |



Number of duplicate rows: 0


In [6]:
print("Media Evaluation Dataset:")
media_evaluation_original.head()
analyze_column(media_evaluation_original)
print("\nNumber of duplicate rows:", media_evaluation_original.duplicated().sum())

Media Evaluation Dataset:


|  | example_id | question_id | question | relevant_chunk | domain | question_tokens | relevant_chunk_tokens | question_token_count | relevant_chunk_token_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | what is the innovation behind leclanches new m... | leclanche said it has developed an environment... | sgvoice.net | ['innov', 'behind', 'leclanch', 'new', 'method... | ['leclanch', 'said', 'develop', 'environment',... | 12 | 36 |
| 1 | 2 | 2 | what is the eus green deal industrial plan | the green deal industrial plan is a bid by the... | sgvoice.net | ['eu', 'green', 'deal', 'industri', 'plan'] | ['green', 'deal', 'industri', 'plan', 'bid', '... | 8 | 47 |
| 2 | 3 | 2 | what is the eus green deal industrial plan | the european counterpart to the us inflation r... | pv-magazine.com | ['eu', 'green', 'deal', 'industri', 'plan'] | ['european', 'counterpart', 'inflat', 'reduct'... | 8 | 35 |
| 3 | 4 | 3 | what are the four focus areas of the eus green... | the new plan is fundamentally focused on four ... | sgvoice.net | ['four', 'focu', 'area', 'eu', 'green', 'deal'... | ['new', 'plan', 'fundament', 'focus', 'four', ... | 13 | 42 |
| 4 | 5 | 4 | when did the cooperation between gm and honda ... | what caught our eye was a new hookup between g... | cleantechnica.com | ['cooper', 'gm', 'honda', 'fuel', 'cell', 'veh... | ['caught', 'eye', 'new', 'hookup', 'gm', 'hond... | 13 | 60 |


|  | Data Type | Number of Entries | Missing/None Count | Uniqueness |
|---|---|---|---|---|
| example_id | int64 | 23 | 0 | 23 |
| question_id | int64 | 23 | 0 | 21 |
| question | object | 23 | 0 | 21 |
| relevant_chunk | object | 23 | 0 | 23 |
| domain | object | 23 | 0 | 6 |
| question_tokens | object | 23 | 0 | 21 |
| relevant_chunk_tokens | object | 23 | 0 | 23 |
| question_token_count | int64 | 23 | 0 | 12 |
| relevant_chunk_token_count | int64 | 23 | 0 | 18 |



Number of duplicate rows: 0


To keep the split fair, we stratify the patent data by the "country_code" column, so that each country is proportionally represented in both the training and validation datasets. We chose this column because it is the only low-cardinality categorical attribute in the dataset (29 distinct values), and the type of patent may be influenced by its country of origin.

A country with only one patent cannot appear in both sets. Therefore, we relabel all such countries with a new country code, "OT" (short for "other"), so that they can be split as one group.

For the media dataset, we stratify by the "domain" column. Similarly, a domain may report on specific topics or emphasize certain aspects, which makes it an important characteristic of the dataset.
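
The relabel-then-stratify idea can be sketched on toy data as follows. The frame `toy`, its values, and the codes "XX" and "YY" are made up for illustration; only the relabeling rule and the `train_test_split` call mirror what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 'XX' and 'YY' each occur exactly once
toy = pd.DataFrame({
    "country_code": ["US"] * 10 + ["CN"] * 8 + ["XX", "YY"],
    "doc_id": list(range(20)),
})

# Relabel every country code that occurs exactly once as 'OT' (other)
counts = toy["country_code"].value_counts()
singletons = counts[counts == 1].index
toy.loc[toy["country_code"].isin(singletons), "country_code"] = "OT"

# Stratified 80/20 split on the relabeled column
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy["country_code"], random_state=42
)

print(toy_train["country_code"].value_counts().to_dict())
print(toy_val["country_code"].value_counts().to_dict())
```

Stratification needs at least two rows per class, so the grouping only helps when at least two countries are singletons. A very small group may still end up entirely in one of the two sets.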

In [7]:
# Find the country codes that occur only once
class_counts = google_patent_original['country_code'].value_counts()
single_instances = class_counts[class_counts == 1].index.tolist()

# Update country_code for single-instance classes
google_patent_original.loc[google_patent_original['country_code'].isin(single_instances), 'country_code'] = 'OT'  # OT = Other

In [8]:
# Split patent data into training and validation sets, stratified by country code
patent_train, patent_val = train_test_split(google_patent_original, test_size=0.2, stratify=google_patent_original['country_code'], random_state=42)

# Split media data into training and validation sets, stratified by domain
media_train, media_val = train_test_split(media_original, test_size=0.2, stratify=media_original['domain'], random_state=42)

### Word Embedding Training

#### Word2Vec

In [9]:
def clean_tokenized_data(data):
    return [ast.literal_eval(sentence) for sentence in data]

# Function to train Word2Vec model
def train_word2vec_model(data, vector_size=100, window=5, epochs=10):
    model = Word2Vec(sentences=data, vector_size=vector_size, window=window, epochs=epochs)
    return model

# Define parameters for training and evaluation
parameters = {
    'vector_size': [100, 200],
    'window': [5, 10],
    'epochs': [10, 20]
}

def hypertrain(model_basename, train_data, sub_folder):
    models: dict[str, Word2Vec] = {}
    for vector_size in parameters['vector_size']:
        for window in parameters['window']:
            for epochs in parameters['epochs']:
                # Train Word2Vec model
                model_name = f'{model_basename}_{vector_size}_{window}_{epochs}'
                print(f'Training Word2Vec model {model_name} ...')
                model = train_word2vec_model(train_data, vector_size=vector_size, window=window, epochs=epochs)

                model.save(f'Data/Word/{sub_folder}/{model_name}.model')
                models[model_name] = model
    return models
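
For orientation, the three nested loops in `hypertrain` walk a 2 x 2 x 2 grid, so each call trains eight models. The sketch below enumerates the same grid with `itertools.product` without training anything; `grid_model_names` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

parameters = {
    'vector_size': [100, 200],
    'window': [5, 10],
    'epochs': [10, 20],
}


def grid_model_names(model_basename, grid):
    """List the model names in the same order as the nested loops produce them."""
    names = []
    for vector_size, window, epochs in product(
        grid['vector_size'], grid['window'], grid['epochs']
    ):
        names.append(f'{model_basename}_{vector_size}_{window}_{epochs}')
    return names


names = grid_model_names('patent_title', parameters)
print(len(names))            # 8
print(names[0], names[-1])   # patent_title_100_5_10 patent_title_200_10_20
```

Because `product` varies its last argument fastest, the order matches the training log: epochs change first, then window, then vector size.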

We attempted several evaluation methods, but since they all yielded the same results, we have decided to discontinue them to save execution time.

##### Google patent

###### Title

In [10]:
# Define train and validation data
patent_word_title_train_data = clean_tokenized_data(patent_train['title_tokens'].tolist())
patent_word_title_validation_data = patent_val['title_tokens']

# Train Word2Vec models over the hyperparameter grid
patent_word_title_models = hypertrain("patent_title", patent_word_title_train_data, "Patent/Title")

Training Word2Vec model patent_title_100_5_10 ...
Training Word2Vec model patent_title_100_5_20 ...
Training Word2Vec model patent_title_100_10_10 ...
Training Word2Vec model patent_title_100_10_20 ...
Training Word2Vec model patent_title_200_5_10 ...
Training Word2Vec model patent_title_200_5_20 ...
Training Word2Vec model patent_title_200_10_10 ...
Training Word2Vec model patent_title_200_10_20 ...


###### Text

In [11]:
# Define train and validation data
patent_word_abstract_train_data = clean_tokenized_data(patent_train['abstract_tokens'].tolist())
patent_word_abstract_validation_data = patent_val['abstract_tokens']

# Train Word2Vec models over the hyperparameter grid
patent_word_abstract_models = hypertrain("patent_abstract", patent_word_abstract_train_data, "Patent/Text")

Training Word2Vec model patent_abstract_100_5_10 ...
Training Word2Vec model patent_abstract_100_5_20 ...
Training Word2Vec model patent_abstract_100_10_10 ...
Training Word2Vec model patent_abstract_100_10_20 ...
Training Word2Vec model patent_abstract_200_5_10 ...
Training Word2Vec model patent_abstract_200_5_20 ...
Training Word2Vec model patent_abstract_200_10_10 ...
Training Word2Vec model patent_abstract_200_10_20 ...


##### Cleantech Media

###### Title

In [12]:
# Define train and validation data
train_data = clean_tokenized_data(media_train['title_tokens'].tolist())
validation_data = media_val['title_tokens']

# Train Word2Vec models over the hyperparameter grid
media_word_title_models = hypertrain("media_title", train_data, "Media/Title")

Training Word2Vec model media_title_100_5_10 ...
Training Word2Vec model media_title_100_5_20 ...
Training Word2Vec model media_title_100_10_10 ...
Training Word2Vec model media_title_100_10_20 ...
Training Word2Vec model media_title_200_5_10 ...
Training Word2Vec model media_title_200_5_20 ...
Training Word2Vec model media_title_200_10_10 ...
Training Word2Vec model media_title_200_10_20 ...


###### Content

In [13]:
# Define train and validation data
train_data = clean_tokenized_data(media_train['content_tokens'].tolist())
validation_data = media_val['content_tokens']

# Train Word2Vec models over the hyperparameter grid
media_word_content_models = hypertrain("media_content", train_data, "Media/Text")

Training Word2Vec model media_content_100_5_10 ...
Training Word2Vec model media_content_100_5_20 ...
Training Word2Vec model media_content_100_10_10 ...
Training Word2Vec model media_content_100_10_20 ...
Training Word2Vec model media_content_200_5_10 ...
Training Word2Vec model media_content_200_5_20 ...
Training Word2Vec model media_content_200_10_10 ...
Training Word2Vec model media_content_200_10_20 ...


### Sentence Embedding Training

Here again we tried several approaches; this one is the best and most optimized one, b

In [14]:
def encode_sentences(sentences, tokenizer, model, device='cuda', max_length=512, batch_size=128):
    device = torch.device(device if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    embeddings = []
    num_batches = len(sentences) // batch_size + (1 if len(sentences) % batch_size != 0 else 0)
    start_time = time.time()

    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            # Get the embeddings from the [CLS] token (first token)
            cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            embeddings.extend(cls_embeddings)
        torch.cuda.empty_cache()
        gc.collect()

        if (i // batch_size) % 10 == 0:  # Print progress every 10 batches
            elapsed_time = time.time() - start_time
            print(f"Processed batch {i // batch_size + 1}/{num_batches}, elapsed time: {elapsed_time:.2f} seconds")

    return embeddings
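
The batch count in `encode_sentences` is a ceiling division. As a small worked check, the sketch below reproduces the "60" and "15" that appear in the progress logs of the media cells, assuming the 9,593 media rows are split 80/20 with the validation size rounded up (which is how `train_test_split` treats a fractional `test_size`). The helper name `num_batches` is hypothetical.

```python
import math


def num_batches(n_sentences: int, batch_size: int = 128) -> int:
    """Ceiling division: the last batch may be smaller than batch_size."""
    return math.ceil(n_sentences / batch_size)


n_media = 9593                    # rows in the media dataset
n_val = math.ceil(0.2 * n_media)  # validation rows
n_train = n_media - n_val         # training rows

print(n_train, num_batches(n_train))  # 7674 60
print(n_val, num_batches(n_val))      # 1919 15
```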

#### Google Patent

Because encoding takes a long time, we ran it only once (re-running it only after changes) and load the saved embeddings from disk.

In [16]:
# Load the embeddings from CSV files into DataFrames
patent_train_titles_df = pd.read_csv('Data/Sentence/google_patent_en_train_titles_embeddings.csv')
patent_val_titles_df = pd.read_csv('Data/Sentence/google_patent_en_val_titles_embeddings.csv')
patent_train_abstracts_df = pd.read_csv('Data/Sentence/google_patent_en_train_abstracts_embeddings.csv')
patent_val_abstracts_df = pd.read_csv('Data/Sentence/google_patent_en_val_abstracts_embeddings.csv')

# Merge the embeddings with the patent_train and patent_val DataFrames.
# The index is reset first because pd.concat(axis=1) aligns rows by index label,
# and the split DataFrames keep their original (shuffled) labels.
patent_train = pd.concat([patent_train.reset_index(drop=True), patent_train_titles_df], axis=1)
patent_train.rename(columns=lambda x: 'title_embedding' if 'Unnamed' in x else x, inplace=True)
patent_train = pd.concat([patent_train, patent_train_abstracts_df], axis=1)
patent_train.rename(columns=lambda x: 'abstract_embedding' if 'Unnamed' in x else x, inplace=True)

patent_val = pd.concat([patent_val.reset_index(drop=True), patent_val_titles_df], axis=1)
patent_val.rename(columns=lambda x: 'title_embedding' if 'Unnamed' in x else x, inplace=True)
patent_val = pd.concat([patent_val, patent_val_abstracts_df], axis=1)
patent_val.rename(columns=lambda x: 'abstract_embedding' if 'Unnamed' in x else x, inplace=True)
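
One thing to watch when concatenating column-wise: `pd.concat(..., axis=1)` aligns rows by index label, and the frames returned by `train_test_split` keep the original, shuffled labels, while embeddings loaded from CSV are labeled 0 to n-1. The toy sketch below (made-up data) shows the effect and how resetting the index avoids it, assuming the embedding rows are stored in the same order as the rows of the split.

```python
import pandas as pd

# Rows as they might come out of a shuffled split (hypothetical labels 7, 2, 5)
rows = pd.DataFrame({"title": ["a", "b", "c"]}, index=[7, 2, 5])
# Embeddings loaded from disk get a fresh 0..n-1 index
emb = pd.DataFrame({"e0": [0.1, 0.2, 0.3]})

# Aligned by label: the result has the union of both indexes and contains NaNs
misaligned = pd.concat([rows, emb], axis=1)

# Aligned by position: drop the old labels first
aligned = pd.concat([rows.reset_index(drop=True), emb], axis=1)

print(misaligned.shape, bool(misaligned.isna().any().any()))
print(aligned.shape, bool(aligned.isna().any().any()))
```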

#### Cleantech Media

In [29]:
# Define the columns to be used
title_column = 'title'

# Encode titles for the training and validation sets
media_train_titles = encode_sentences(media_train[title_column].tolist(), bert_tokenizer, bert_model)
media_val_titles = encode_sentences(media_val[title_column].tolist(), bert_tokenizer, bert_model)

# Convert lists to DataFrames for easier handling
media_train_titles_df = pd.DataFrame(media_train_titles)
media_val_titles_df = pd.DataFrame(media_val_titles)

# Save embeddings to files if needed
media_train_titles_df.to_csv('Data/Sentence/media_train_titles_embeddings.csv', index=False)
media_val_titles_df.to_csv('Data/Sentence/media_val_titles_embeddings.csv', index=False)

Processed batch 1/60, elapsed time: 4.63 seconds
Processed batch 11/60, elapsed time: 41.52 seconds
Processed batch 21/60, elapsed time: 83.83 seconds
Processed batch 31/60, elapsed time: 132.45 seconds
Processed batch 41/60, elapsed time: 176.25 seconds
Processed batch 51/60, elapsed time: 220.50 seconds
Processed batch 1/15, elapsed time: 4.68 seconds
Processed batch 11/15, elapsed time: 50.05 seconds


In [30]:
# Define the column to be used
content_column = 'content'

# Encode article content for the training and validation sets
media_train_content = encode_sentences(media_train[content_column].tolist(), bert_tokenizer, bert_model)
media_val_content = encode_sentences(media_val[content_column].tolist(), bert_tokenizer, bert_model)

# Convert lists to DataFrames for easier handling
media_train_content_df = pd.DataFrame(media_train_content)
media_val_content_df = pd.DataFrame(media_val_content)

# Save embeddings to files if needed
media_train_content_df.to_csv('Data/Sentence/media_train_content_embeddings.csv', index=False)
media_val_content_df.to_csv('Data/Sentence/media_val_content_embeddings.csv', index=False)

Processed batch 1/60, elapsed time: 96.20 seconds
Processed batch 11/60, elapsed time: 1092.52 seconds
Processed batch 21/60, elapsed time: 2336.34 seconds
Processed batch 31/60, elapsed time: 3528.26 seconds
Processed batch 41/60, elapsed time: 35542.99 seconds
Processed batch 51/60, elapsed time: 38919.86 seconds
Processed batch 1/15, elapsed time: 52317.23 seconds
Processed batch 11/15, elapsed time: 53533.41 seconds


In [31]:
# Load the embeddings from CSV files into DataFrames
media_train_titles_df = pd.read_csv('Data/Sentence/media_train_titles_embeddings.csv')
media_val_titles_df = pd.read_csv('Data/Sentence/media_val_titles_embeddings.csv')
media_train_abstracts_df = pd.read_csv('Data/Sentence/media_train_content_embeddings.csv')
media_val_abstracts_df = pd.read_csv('Data/Sentence/media_val_content_embeddings.csv')

# Merge the embeddings with the media_train and media_val DataFrames.
# The index is reset first because pd.concat(axis=1) aligns rows by index label,
# and the split DataFrames keep their original (shuffled) labels.
media_train = pd.concat([media_train.reset_index(drop=True), media_train_titles_df], axis=1)
media_train.rename(columns=lambda x: 'title_embedding' if 'Unnamed' in x else x, inplace=True)
media_train = pd.concat([media_train, media_train_abstracts_df], axis=1)
media_train.rename(columns=lambda x: 'content_embedding' if 'Unnamed' in x else x, inplace=True)

media_val = pd.concat([media_val.reset_index(drop=True), media_val_titles_df], axis=1)
media_val.rename(columns=lambda x: 'title_embedding' if 'Unnamed' in x else x, inplace=True)
media_val = pd.concat([media_val, media_val_abstracts_df], axis=1)
media_val.rename(columns=lambda x: 'content_embedding' if 'Unnamed' in x else x, inplace=True)
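
A note on the `rename` calls: a list of embedding vectors written with `to_csv(index=False)` comes back from `read_csv` with one column per dimension, named `'0'`, `'1'`, and so on, so a rename that looks for `'Unnamed'` finds nothing to rename. The sketch below uses an in-memory buffer and made-up 3-dimensional vectors to show this, and one possible way to collapse the dimensions into a single array-valued column. It reflects how the media files are written above; the cell that wrote the patent files is not shown, so those files may be laid out differently.

```python
import io

import numpy as np
import pandas as pd

# Two fake 3-dimensional embeddings (hypothetical values)
embeddings = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]

# Round-trip through CSV, as the notebook does with files on disk
buffer = io.StringIO()
pd.DataFrame(embeddings).to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(list(loaded.columns))  # ['0', '1', '2']

# One way to keep each vector in a single column: one NumPy array per row
single = pd.DataFrame({"title_embedding": list(loaded.to_numpy())})
print(single.shape)          # (2, 1)
```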