# Stage 2: Advanced Embedding Models Training and Analysis

## Objective

The primary goal of Stage 2 is to train embedding models that represent the content of the Cleantech Media and Google Patent datasets. Comparing the domain-specific embeddings from the two sources should reveal similarities, differences, and emerging patterns in the cleantech domain.


## Data Preparation for Embeddings

We use the columns 'title_preprocessed' and 'abstract_preprocessed' of the 'patent_preprocessed.csv' file, which was saved as an output of Stage 1.

1. Word2Vec requires a list of sentences, where each sentence is a list of words,
   for example: [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence']]
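
As a minimal sketch of that input shape, the snippet below turns two toy strings (hypothetical stand-ins for preprocessed titles, not rows from the dataset) into one token list per text using plain whitespace splitting.

```python
# Toy strings standing in for preprocessed titles (hypothetical examples).
texts = ["This is a sentence", "This is another sentence"]

# One token list per text: the list-of-lists shape described above.
sentences = [text.split() for text in texts]

print(sentences)
```

Because the Stage 1 columns are already lowercased and lemmatized, whitespace splitting is assumed to be sufficient here; raw text would need a real tokenizer.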


In [4]:
# Load the libraries
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

In [5]:
# Load the data

patent_data = pd.read_csv("patent_preprocessed.csv", encoding="utf-8")

# Preview the first rows of the preprocessed title and abstract columns
patent_data[["title_preprocessed", "abstract_preprocessed"]].head(50)

|   | title_preprocessed | abstract_preprocessed |
|---|---|---|
| 0 | adaptable dc ac inverter drive system operation | disclose adaptable dc ac inverter system opera... |
| 1 | system provide energy single contiguous solar ... | accordance example embodiment solar energy sys... |
| 2 | verfahren zum steuern einer windenergieanlage | verfahren zum steuern einer windenergieanlage ... |
| 3 | control method optimize solar power efficiency... | control method optimize solar power efficiency... |
| 4 | mutually support hydropower system | mutually support hydropower system include hyd... |
| 5 | system method drive away geese | system method drive geese away area employ pre... |
| 6 | clad sheet | clad sheet roof wall clad sheet include mount ... |
| 7 | harden solar energy collector system | harden solar thermal energy collector stec sys... |
| 8 | system method hydro base electric power genera... | hydrodynamic power generation assembly method ... |
| 9 | system method remove dust solar panel surface ... | present system method waterless contactless sy... |


In [6]:
patent_data[["title_preprocessed", "abstract_preprocessed"]].isna().sum()

title_preprocessed        150
abstract_preprocessed    3993
dtype: int64

In [7]:
# Drop rows with a missing value in abstract_preprocessed or title_preprocessed
# See Stage_1_patent_preprocessing.ipynb for why these values are missing

patent_data = patent_data.dropna(
    subset=["abstract_preprocessed", "title_preprocessed"])

# Checking the shape of the dataframe

print(patent_data.shape)

(4470, 12)


In [8]:
# Convert each preprocessed title and abstract into its own list of tokens
sentences = [
    sentence.split(" ")
    for row in patent_data[["title_preprocessed", "abstract_preprocessed"]].values
    for sentence in row
]

# Checking the first 5 sentences

print(sentences[0:5])

# Check the type and length of sentences

print(type(sentences))

print(len(sentences))

[['adaptable', 'dc', 'ac', 'inverter', 'drive', 'system', 'operation'], ['disclose', 'adaptable', 'dc', 'ac', 'inverter', 'system', 'operation', 'system', 'include', 'multiple', 'dc', 'input', 'source', 'input', 'provide', 'stable', 'operation', 'condition', 'dc', 'input', 'source', 'add', 'system', 'remove', 'system', 'impact', 'functionality', 'system', 'disclose', 'system', 'suit', 'solar', 'energy', 'harvesting', 'grid', 'connect', 'grid', 'mode', 'operation'], ['system', 'provide', 'energy', 'single', 'contiguous', 'solar', 'energy', 'structure', 'different', 'meter'], ['accordance', 'example', 'embodiment', 'solar', 'energy', 'system', 'comprise', 'solar', 'energy', 'structure', 'comprise', 'photovoltaic', 'solar', 'panel', 'contiguously', 'cover', 'area', 'inverter', 'configure', 'receive', 'power', 'string', 'photovoltaic', 'solar', 'panel', 'inverter', 'configure', 'provide', 'power', 'receive', 'inverter', 'meter', 'second', 'inverter', 'configure', 'receive', 'power', 'secon
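
The conversion above relies on every cell being a string and on single spaces between tokens. A more defensive variant is sketched below in plain Python; the helper name `to_token_lists` and the toy rows are hypothetical and not part of the notebook.

```python
import math


def to_token_lists(rows):
    """Flatten (title, abstract) rows into one token list per non-empty text."""
    token_lists = []
    for row in rows:
        for text in row:
            # Skip missing values: None, or the float NaN pandas uses for gaps.
            if text is None or (isinstance(text, float) and math.isnan(text)):
                continue
            # split() without an argument collapses repeated whitespace,
            # so no empty-string tokens are produced.
            tokens = str(text).split()
            if tokens:
                token_lists.append(tokens)
    return token_lists


# Toy rows (hypothetical); the double space and the NaN are deliberate.
rows = [
    ("clad sheet", "clad sheet roof  wall"),
    ("mutually support hydropower system", float("nan")),
]
sentences = to_token_lists(rows)
print(sentences)
```

With `sentence.split(" ")`, the double space would yield an empty-string token, and a NaN cell would raise an `AttributeError`; dropping missing rows first, as the notebook does, avoids the second problem.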

In [9]:
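# Set the Word2Vec hyperparameters (sg=1 selects skip-gram). Passing `sentences`
# to the constructor also builds the vocabulary and trains for `epochs` passes.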

model = Word2Vec(
    sentences=sentences, vector_size=100, window=5, min_count=2, sg=1, epochs=10
)
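
To make `min_count`, `window`, and `sg=1` more concrete, here is a simplified pure-Python sketch of how skip-gram training examples can be derived from token lists. It is an illustration under stated assumptions, not gensim's implementation: the helper names are hypothetical, and gensim additionally downsamples frequent words and, by default, randomly shrinks the window per position.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


def skipgram_pairs(sentences, vocab, window=5):
    """Emit (center, context) pairs for every in-vocabulary token."""
    pairs = []
    for sentence in sentences:
        # Assumption: rare tokens are removed before windows are computed.
        kept = [token for token in sentence if token in vocab]
        for i, center in enumerate(kept):
            start = max(0, i - window)
            stop = min(len(kept), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


# Toy corpus (hypothetical): "energy" and "panel" occur once and are dropped.
toy = [["solar", "energy", "system"], ["solar", "panel", "system"]]
vocab = build_vocab(toy, min_count=2)
pairs = skipgram_pairs(toy, vocab, window=1)
print(sorted(vocab), pairs)
```

In skip-gram mode the model learns to predict each context word from its center word, so a larger `window` produces more pairs per token and tends to capture broader topical similarity.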

In [10]:
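# Train the model for a further model.epochs passes. Note that the constructor
# call has already trained once, because `sentences` was passed to it.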

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)


(4009008, 4890530)
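
The tuple printed by `train` can be read with a little arithmetic. The sketch below assumes, as I understand gensim's documentation, that the first value is the effective number of words used for training (after unknown words and downsampling are accounted for) and the second is the raw word count summed over all epochs.

```python
# Values copied from the output of model.train above.
trained_words, raw_words = 4009008, 4890530
epochs = 10  # matches the epochs value passed to Word2Vec

# Raw tokens seen in one pass over the corpus.
raw_per_epoch = raw_words / epochs

# Share of raw tokens that actually contributed to training.
kept_fraction = trained_words / raw_words

print(f"{raw_per_epoch:.0f} raw tokens per epoch")
print(f"{kept_fraction:.1%} of raw tokens used for training")
```

Under that reading, roughly one token in five was skipped, which is consistent with `min_count=2` removing rare words and the default downsampling of very frequent ones.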

In [11]:
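# Save the model (create the target directory first if it does not exist)
import os

model_save_path = "trained-models/word2vec_model.model"
os.makedirs(os.path.dirname(model_save_path), exist_ok=True)
model.save(model_save_path)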

In [12]:
# Load the saved model for evaluation (same path it was saved to)

model = Word2Vec.load("trained-models/word2vec_model.model")



[('onboard', 0.6985699534416199),
 ('thermophotovoltaic', 0.6764763593673706),
 ('decommissioning', 0.6753373742103577),
 ('energy', 0.6737689971923828),
 ('predictive', 0.6716766357421875),
 ('power', 0.656843364238739),
 ('multisource', 0.6539903879165649),
 ('kalina', 0.6531026363372803),
 ('dcs', 0.6501502990722656),
 ('retired', 0.6450400352478027)]

In [14]:
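# Query the nearest neighbours of two probe words.
# A notebook cell displays only the value of its last expression,
# so the output below is for "software".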

model.wv.most_similar("system")

model.wv.most_similar("software")


[('bentley', 0.7667834162712097),
 ('detail', 0.7485581040382385),
 ('rationality', 0.718150794506073),
 ('rho', 0.717994749546051),
 ('poa', 0.7169286608695984),
 ('scuc', 0.7100946307182312),
 ('earthwork', 0.7019782662391663),
 ('ex', 0.6986589431762695),
 ('organization', 0.6983261108398438),
 ('transient', 0.6949236392974854)]
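
The scores in these outputs are cosine similarities between word vectors. The following pure-Python sketch mimics that ranking on hypothetical two-dimensional vectors; the function names and values are illustrative only, whereas the real model uses the 100-dimensional vectors learned above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, not taken from the trained model.
vectors = {
    "software": [1.0, 0.0],
    "program": [0.9, 0.1],
    "turbine": [0.0, 1.0],
}
ranked = most_similar("software", vectors, topn=2)
print(ranked)
```

A score close to 1 means two words appear in similar contexts in the training corpus. With only a few thousand patents, neighbours of infrequent words can be noisy, so results such as those above are best read qualitatively.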