## Load the data 

In [5]:
import boto3
import logging 
from botocore.exceptions import ClientError
import pandas as pd 
import io
import os 


In [3]:
# Function to read the parquet file as a pandas dataframe
def open_S3_file_as_df(bucket_name, file_name):
    """Open an S3 parquet file and return its contents as a pandas dataframe
    :param bucket_name: Bucket name
    :param file_name: Specific file name (object key) to open
    :return: pandas dataframe, or None if the S3 request fails
    """
    try:
        s3 = boto3.resource('s3')
        s3_object = s3.Object(bucket_name, file_name)  # avoid shadowing the builtin `object`
        body = s3_object.get()['Body'].read()
        df = pd.read_parquet(io.BytesIO(body))
        print(f'Loading {file_name} from {bucket_name} to pandas dataframe')
        return df
    except ClientError as e:
        logging.error(e)
        return None
file_name = "Processed_records.parquet"
bucket_name_nlp = "nlp-data-preprocessing"
df_en = open_S3_file_as_df(bucket_name_nlp, file_name)

Loading Processed_records.parquet from nlp-data-preprocessing to pandas dataframe


Check for duplicates before training the models, and slice the data for training.
We will use all the data for training the Word2Vec model in this case, as our dataset is not large.

In [4]:
# Get the shape of the dataframe
shape_training = df_en.shape
# Count the number of missing values in each column
missing_values_training = df_en.isnull().sum()
# Count the number of unique values in each column
unique_values_training = df_en.nunique()

shape_training, missing_values_training, unique_values_training

((7153, 6),
 features_properties_id                0
 features_properties_title_en          0
 features_properties_description_en    0
 features_properties_keywords_en       0
 metadata_en                           0
 metadata_en_processed                 0
 metadata_en_preprocessed_token        0
 dtype: int64,
 features_properties_id                7153
 features_properties_title_en          6930
 features_properties_description_en    5604
 features_properties_keywords_en       3779
 metadata_en                           7153
 metadata_en_processed                 6847
 dtype: int64)

In [5]:
#TODO: Remove duplicates based on 'metadata_en_preprocessed_token'
"""
If a uuid is removed at this step, we will need to think about how to merge with the other parquet files
"""
df_en_deduplicated = df_en.drop_duplicates(subset='metadata_en_processed')
# Display the first few rows of the deduplicated dataframe
df_en_deduplicated.head()
# Check the shape of the deduplicated dataframe
shape_deduplicated = df_en_deduplicated.shape
shape_deduplicated

(6847, 6)
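
The note in the cell above raises the question of which ids disappear during deduplication. A minimal sketch of one way to track them, on a toy frame that only borrows the notebook's column names (the rows are invented): list the ids that `drop_duplicates` would remove, and map every id to the first id carrying the same text.

```python
import pandas as pd

# Toy stand-in for df_en; the rows are invented for illustration.
toy = pd.DataFrame({
    "features_properties_id": ["a", "b", "c", "d"],
    "metadata_en_processed": ["lake depth", "road network", "lake depth", "lake depth"],
})

# Rows that drop_duplicates(subset=...) would remove (every repeat after the first).
is_repeat = toy.duplicated(subset="metadata_en_processed", keep="first")
dropped_ids = toy.loc[is_repeat, "features_properties_id"].tolist()

# Map every id to the id of the first row with the same text, so results
# computed for the kept row could be copied back to the dropped ones later.
first_id = toy.groupby("metadata_en_processed")["features_properties_id"].transform("first")
id_to_kept = dict(zip(toy["features_properties_id"], first_id))

print(dropped_ids)
print(id_to_kept)
```

Keeping such a mapping is only one option; it would let a later merge restore similarity results for records that were dropped here.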

In [6]:
# Select the columns needed for training
df = df_en[['features_properties_id', 'features_properties_title_en', 'metadata_en_processed']]
# Optionally train on a random sample of 500 rows instead:
#df = df.sample(n=500, random_state=1)
# Use all data to train the model
df.head()
print(df.shape)

# write out training data to csv
df.to_csv('df_training_full.csv', index=False)

(7153, 3)


## Word2Vec using the Gensim and spaCy libraries
Gensim provides an easy-to-use interface for training Word2Vec models on custom corpora. You can train models from scratch or continue training existing models on new data. Gensim also offers pretrained Word2Vec models for various languages.

Word2Vec generates word embeddings, not sentence or document embeddings. To compute the similarity between entire texts, we typically average the word vectors of all words in a text to get a single vector that represents it. This approach has its limitations, because it ignores word order and therefore part of the meaning of the text, but it can still provide useful results.
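
To make the averaging idea concrete, here is a small sketch with hypothetical three-dimensional word vectors (a trained model would supply real ones; `toy_vectors` and `mean_pool` are illustrative names, not part of the notebook). It shows that two texts with the same words in a different order get the same vector.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
toy_vectors = {
    "lake":  np.array([1.0, 0.0, 0.0]),
    "depth": np.array([0.0, 1.0, 0.0]),
    "map":   np.array([0.0, 0.0, 1.0]),
}

def mean_pool(text, vectors, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

a = mean_pool("lake depth map", toy_vectors)
b = mean_pool("map depth lake", toy_vectors)   # same words, different order
c = mean_pool("unknown words", toy_vectors)    # nothing in the vocabulary

print(np.allclose(a, b))  # word order is lost by averaging
print(c)
```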

In this case, we train a Word2Vec model on our own data. The steps are:
1. Preprocess the text: tokenize the 'metadata_en_processed' texts (split them into individual words), because Word2Vec expects a list of sentences, where each sentence is represented as a list of words.
2. Train the Word2Vec model on the tokenized texts. We can use the Gensim library's Word2Vec implementation for this.
3. Use the trained model to convert each text in 'metadata_en_processed' into a vector.
4. Calculate the similarity between each vector and all others.
5. For each row, find the top 5 rows with the most similar vectors.
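
The steps above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions: the texts are invented, and step 2 is stubbed with random vectors so the sketch runs without Gensim, which means the resulting neighbours carry no meaning.

```python
import numpy as np

texts = ["lake depth survey", "lake depth map", "road network map", "road traffic survey"]

# Step 1: tokenize each text into a list of words.
sentences = [t.split() for t in texts]

# Step 2 (stubbed): a trained Word2Vec model would provide one vector per word.
# Random vectors stand in here so the sketch runs without gensim.
rng = np.random.default_rng(0)
vocab = sorted({w for s in sentences for w in s})
word_vec = {w: rng.normal(size=8) for w in vocab}

# Step 3: one vector per text, the mean of its word vectors.
doc_vecs = np.array([np.mean([word_vec[w] for w in s], axis=0) for s in sentences])

# Step 4: cosine similarity between every pair of texts.
unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
sim = unit @ unit.T

# Step 5: for each row, the k most similar other rows (k=2 here, 5 in the notebook).
k = 2
np.fill_diagonal(sim, -np.inf)          # a text must not match itself
top_k = np.argsort(-sim, axis=1)[:, :k]
print(top_k)
```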


Please note that Word2Vec models typically need a lot of data to train well. With only 7,153 records (6,847 unique processed texts), the results may not be very reliable. Averaging the word vectors of a sentence, as described above, also might not always capture its semantic meaning very well, but it is a common approach when using word embeddings to represent longer texts.

In [7]:
# Import necessary libraries
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
from gensim import matutils
import numpy as np
from tqdm import tqdm

### Train Word2Vec model with Gensim

In [8]:
# Prepare the input for the Word2Vec model
sentences = df['metadata_en_processed'].apply(lambda x: x.split(' ')).tolist()
print(type(sentences))
#sentences[0:2]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Precompute L2-normalized vectors for better performance
# (init_sims is deprecated in Gensim 4 and emits the warning shown in the output)
model.init_sims(replace=True)

# Compute the vector representation of a text by averaging its word vectors
def sentence_to_vector(sentence, model):
    words = [word for word in str(sentence).split() if word in model.wv.key_to_index]
    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in words], axis=0)

# Convert each text in 'metadata_en_processed' into a vector
vectors = df['metadata_en_processed'].apply(sentence_to_vector, model=model)
# Replace missing values in the 'features_properties_title_en' column with an empty string
df['features_properties_title_en'] = df['features_properties_title_en'].fillna('')
print(type(vectors))
vectors

<class 'list'>


  model.init_sims(replace=True)


<class 'pandas.core.series.Series'>


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['features_properties_title_en'].fillna('', inplace=True)


0       [-0.074049026, 0.027289702, 0.016700882, 0.073...
1       [-0.02858037, -0.02957781, 0.011721299, 0.0860...
2       [-0.09283519, -0.016993595, 0.017842524, 0.060...
3       [-0.083689615, -0.00096761726, 0.019858327, 0....
4       [-0.10750779, 0.0075561353, 0.029387811, 0.160...
                              ...                        
7338    [-0.07207868, 0.07566401, 0.042431034, -0.0114...
7339    [-0.06976483, 0.076250896, 0.044135798, -0.012...
7340    [-0.07208346, 0.06931207, 0.040898647, -0.0150...
7341    [-0.07353419, 0.06887483, 0.04055026, -0.01561...
7342    [-0.02755257, -0.0038945249, 0.011606601, 0.03...
Name: metadata_en_processed, Length: 7153, dtype: object

### Use the pretrained Google News Word2Vec model with Gensim
Gensim provides various pre-trained models for word embeddings like Word2Vec, FastText, GloVe, etc. However, these models are trained on specific corpora (like Google News, Wikipedia, etc.) and might not provide the best embeddings for your specific use case, especially if your text data is domain-specific or uses a specialized vocabulary that is not well-represented in the pre-trained models.
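
One way to judge whether a pretrained vocabulary fits the data is to measure how many tokens it covers. The sketch below is a hedged illustration: `vocabulary_coverage` is a hypothetical helper, and `known_words` is a toy set standing in for a pretrained model's vocabulary (with Gensim `KeyedVectors`, `model.key_to_index` could play that role). Stemmed tokens such as "earthquak" are the kind of word a general-purpose vocabulary may lack.

```python
from collections import Counter

def vocabulary_coverage(texts, known_words):
    """Return (share of tokens found in known_words, most common missing tokens)."""
    counts = Counter(word for text in texts for word in str(text).split())
    total = sum(counts.values())
    missing = Counter({w: n for w, n in counts.items() if w not in known_words})
    covered = 1.0 - sum(missing.values()) / total if total else 0.0
    return covered, missing.most_common(5)

# Toy data: the texts mimic stemmed metadata, the vocabulary is invented.
texts = ["canada nation earthquak scenario", "canada digit elev model"]
known_words = {"canada", "nation", "scenario", "model"}

coverage, top_missing = vocabulary_coverage(texts, known_words)
print(round(coverage, 3), top_missing)
```

A low coverage value would suggest that many texts are averaged over only a few known words, or fall back to an empty result.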

In [None]:
from gensim.models import KeyedVectors

# Load pretrained model (since intermediate data is not included, the model cannot be refined with additional data)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

# Preprocess the 'metadata_en_processed' texts
sentences = df['metadata_en_processed'].apply(lambda x: str(x).split())

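# Compute the vector for each 'metadata_en_processed' text
# (falls back to a zero vector when no word is in the pretrained vocabulary)
vectors = sentences.apply(
    lambda words: np.mean([model[word] for word in words if word in model.key_to_index] or [np.zeros(model.vector_size)], axis=0)
)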

### Use pretrained word vectors from spaCy
The en_core_web_lg model from spaCy includes word vectors trained on the Common Crawl corpus using the GloVe algorithm by Stanford. These vectors can be used for tasks similar to those of Word2Vec vectors.
Note: AWS Lambda has a deployment package size limit of 250 MB (unzipped), including layers. The en_core_web_lg model is quite large (about 800 MB), so it might not be suitable for deployment on AWS Lambda.


In [None]:
# In terminal, run python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load('en_core_web_lg')

# Vectorization 
def sentence_to_vector(sentence):
    # Process the sentence
    doc = nlp(sentence)
    # Return the average of the word vectors
    return np.mean(np.array([token.vector for token in doc]), axis=0)

# Named `vectors` so that the similarity cells can use the result
vectors = df['metadata_en_processed'].apply(sentence_to_vector)


### Calculate Similarity 

In [9]:
# Calculate similarity between each vector and all others
similarity_matrix = cosine_similarity(np.array(vectors.tolist()))
# Initialize new columns for the top 5 similar texts
df['sim1'], df['sim2'], df['sim3'], df['sim4'], df['sim5'] = "", "", "", "", ""
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['sim1'], df['sim2'], df['sim3'], df['sim4'], df['sim5'] = "", "", "", "", ""


Unnamed: 0,features_properties_id,features_properties_title_en,metadata_en_processed,sim1,sim2,sim3,sim4,sim5
0,000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...",princip miner area produc mine oil ga field 90...,,,,,
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011",canadian digit elev model collect legaci produ...,,,,,
2,085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,canada nation earthquak scenario catalogu nati...,,,,,
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,tempor seri nation air photo librari napl regi...,,,,,
4,488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,indigen mine agreement indigen mine agreement ...,,,,,


In [10]:
# For each text, find the top 5 most similar texts and store their 'features_properties_id' in the sim columns
df.reset_index(drop=True, inplace=True)
for i in tqdm(range(similarity_matrix.shape[0])):
    top_5_similar = np.argsort(-similarity_matrix[i, :])[1:6]  # Exclude the text itself
    df.loc[i, ['sim1', 'sim2', 'sim3', 'sim4', 'sim5']] = df.loc[top_5_similar, 'features_properties_id'].values
df.head()


100%|██████████| 7153/7153 [00:13<00:00, 546.56it/s]


Unnamed: 0,features_properties_id,features_properties_title_en,metadata_en_processed,sim1,sim2,sim3,sim4,sim5
0,000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...",princip miner area produc mine oil ga field 90...,b64179f3-ea0f-4abb-9cc5-85432fc958a0,22b2db8a-dc12-47f2-9737-99d3da921751,8db08fa3-a181-4d9b-b091-0f65270ff18b,7ce2ed0c-cb87-463d-83aa-7ed62a672792,e2b6e799-9f29-87c4-0143-6c7505978508
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011",canadian digit elev model collect legaci produ...,768570f8-5761-498a-bd6a-315eb6cc023d,0fe65119-e96e-4a57-8bfe-9d9245fba06b,f5c4e4af-ddc5-4f54-a70b-8390e4a4268e,ff383f5c-0772-46f3-84df-c2cb860d0da2,4ff9312f-a200-4fe6-aac3-00a803afa5d9
2,085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,canada nation earthquak scenario catalogu nati...,79fdad93-9025-49ad-ba16-c26d718cc070,f2d6263a-8b65-4350-9515-345875c6bebf,2364749d-70a0-4874-956d-a636401ac5a6,ee421d50-dffd-41fb-976c-5bbfec04b2dd,4cedd37e-0023-41fe-8eff-bea45385e469
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,tempor seri nation air photo librari napl regi...,230f1f6d-353e-4d02-800b-368f4c48dc86,d8627209-bda2-436f-b22b-0eb19fdc6660,4e8e3c6a-c961-4def-bdc7-f24823462818,f129611d-7ca1-418b-8390-ebac5adf958e,f498bb69-3982-4b62-94db-4c0e0065bc17
4,488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,indigen mine agreement indigen mine agreement ...,CGDIWH-118602,82cad281-ff7d-47b3-b2ce-9f794257e86d,CGDIWH-150333,498a6086-e7c3-440c-b4d3-22fc6c5599a9,fa542137-a976-49a6-856d-f1201adb2243
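
The `[1:6]` slice assumes each row's own index sorts first. When several records share the same processed text (6,847 unique texts among 7,153 rows here), another row can tie at similarity 1.0, so the slice may drop a duplicate and keep the row itself. A minimal sketch of a more defensive variant, on toy vectors (`top_k_excluding_self` is an illustrative name, not part of the notebook), masks the diagonal before sorting:

```python
import numpy as np

# Rows 0 and 1 are identical, as happens when two records share the same processed text.
vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

def top_k_excluding_self(sim, k):
    """Indices of the k most similar rows for each row, never the row itself."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)   # push each row's self-similarity to the end
    return np.argsort(-masked, axis=1, kind="stable")[:, :k]

top = top_k_excluding_self(sim, k=2)
print(top)
```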


In [36]:
#df.to_csv('df_training_full_sim.csv', index=False)

### Merge and upload

In [11]:
# Read the original parquet file and merge by features_properties_id
file_name_original = "records.parquet"
bucket_name = "webpresence-geocore-geojson-to-parquet-dev"
df_original = open_S3_file_as_df(bucket_name, file_name_original)

Loading records.parquet from webpresence-geocore-geojson-to-parquet-dev to pandas dataframe


In [None]:
df_original.head()
print(df_original.shape)

In [12]:
merged_df = df_original.merge(df[['features_properties_id', 'sim1', 'sim2', 'sim3', 'sim4', 'sim5']], on='features_properties_id', how='left')
print(merged_df.shape)
merged_df.head()

(7343, 73)


Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_plugins,features_properties_sourceSystemName,features_popularity,sim1,sim2,sim3,sim4,sim5
0,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...","Principales régions minières, principales mine...",This dataset is produced and published annuall...,Ce jeu de données est produit et publié annuel...,"mineralization, mineral occurrences, mines, hy...","minéralisation, indices minéralisés, mines, hy...",...,,,[],,1250806,b64179f3-ea0f-4abb-9cc5-85432fc958a0,22b2db8a-dc12-47f2-9737-99d3da921751,8db08fa3-a181-4d9b-b091-0f65270ff18b,7ce2ed0c-cb87-463d-83aa-7ed62a672792,e2b6e799-9f29-87c4-0143-6c7505978508
1,Feature,Polygon,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...",7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011","Modèle numérique d'élévation du Canada, 1945-2011",This collection is a legacy product that is no...,Ce produit fait maintenant partie du patrimoin...,"Canada, Earth Sciences, elevation, relief, geo...","Canada, Sciences de la Terre, élévation, relie...",...,,,[],,210798,768570f8-5761-498a-bd6a-315eb6cc023d,0fe65119-e96e-4a57-8bfe-9d9245fba06b,f5c4e4af-ddc5-4f54-a70b-8390e4a4268e,ff383f5c-0772-46f3-84df-c2cb860d0da2,4ff9312f-a200-4fe6-aac3-00a803afa5d9
2,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,Catalogue national de scénarios de tremblement...,"The National Earthquake Scenario Catalogue, pr...",Le dépôt est utilisé pour l’élaboration du cat...,"Emergency preparedness, Earth sciences, Earthq...","Protection civile, Sciences de la terre, Tremb...",...,,,[],,140088,79fdad93-9025-49ad-ba16-c26d718cc070,f2d6263a-8b65-4350-9515-345875c6bebf,2364749d-70a0-4874-956d-a636401ac5a6,ee421d50-dffd-41fb-976c-5bbfec04b2dd,4cedd37e-0023-41fe-8eff-bea45385e469
3,Feature,Polygon,"[[[-104.75571511, 50.42392886], [-104.56356008...",03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,Série temporelle de la photothèque nationale d...,"Note: To visualize the data in the viewer, zoo...",Note: Pour visualiser les données dans l’outil...,"Mosaic, Aerial photography, Access to informat...","Mosaïque, Photographie aérienne, Accès à l'inf...",...,,,[],,120162,230f1f6d-353e-4d02-800b-368f4c48dc86,d8627209-bda2-436f-b22b-0eb19fdc6660,4e8e3c6a-c961-4def-bdc7-f24823462818,f129611d-7ca1-418b-8390-ebac5adf958e,f498bb69-3982-4b62-94db-4c0e0065bc17
4,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,Ententes minières autochtones,The Indigenous Mining Agreements dataset provi...,Les données des ententes minières autochtones ...,"Indigenous, First Nations, Métis, Indigenous a...","Autochtones, Premières nations, Métis, Affaire...",...,,,[],,111036,CGDIWH-118602,82cad281-ff7d-47b3-b2ce-9f794257e86d,CGDIWH-150333,498a6086-e7c3-440c-b4d3-22fc6c5599a9,fa542137-a976-49a6-856d-f1201adb2243
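
The merged frame has 7,343 rows while the training frame had 7,153, so some original records have no similarity columns after the left merge. A hedged sketch of how such rows could be counted, using toy frames with invented ids and the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy frames with invented ids; only the key column name follows the notebook.
original = pd.DataFrame({"features_properties_id": ["a", "b", "c"], "title": ["A", "B", "C"]})
sims = pd.DataFrame({"features_properties_id": ["a", "c"], "sim1": ["c", "a"]})

# validate= raises if an id is repeated on either side;
# indicator= adds a _merge column recording where each row came from.
merged = original.merge(
    sims, on="features_properties_id", how="left",
    validate="one_to_one", indicator=True,
)

unmatched_ids = merged.loc[merged["_merge"] == "left_only", "features_properties_id"].tolist()
print(len(merged), unmatched_ids)
```

The `_merge` column would need to be dropped again before writing the result out.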


In [13]:
# write out merged data to csv
#merged_df.to_csv('sim_word2vec_records.csv', index=False)


In [19]:
# Upload a dataframe to S3 as a parquet file
def upload_dataframe_to_s3_as_parquet(df, bucket_name, file_key):
    # Save DataFrame as a Parquet file locally
    parquet_file_path = 'temp.parquet'
    df.to_parquet(parquet_file_path, index=False)  # Set index to False

    # Create an S3 client
    s3_client = boto3.client('s3')

    # Upload the Parquet file to S3 bucket
    try:
        s3_client.upload_file(parquet_file_path, bucket_name, file_key)
        print(f'Uploading {file_key} to {bucket_name} as parquet file')
        return True
    except ClientError as e:
        logging.error(e)
        return False
    finally:
        # Delete the local Parquet file whether or not the upload succeeded
        os.remove(parquet_file_path)
upload_dataframe_to_s3_as_parquet(df=merged_df, bucket_name=bucket_name_nlp, file_key='sim_word2vec_records.parquet')

Uploading sim_word2vec_records.parquet to nlp-data-preprocessing as parquet file


True

### Visualize similarity scores 

In [None]:
# Visualize the similarity matrix for selected records
import matplotlib.pyplot as plt
import seaborn as sns
# Select the first 5 rows and stack them into a 2-D array, as cosine_similarity expects
selected_vectors = np.array(vectors[:5].tolist())

# Calculate similarity between selected vectors
selected_similarity_matrix = cosine_similarity(selected_vectors)

# Create a DataFrame for better visualization
similarity_df = pd.DataFrame(selected_similarity_matrix, 
                             columns=df['features_properties_title_en'][:5],
                             index=df['features_properties_title_en'][:5])

# Visualize the similarity scores in a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_df, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Similarity Scores for Selected Rows")
plt.show()

## BERT from the transformers library
The BERT (Bidirectional Encoder Representations from Transformers) model is a transformer-based machine learning technique for natural language processing (NLP) tasks. BERT is a deep-learning model.

We can use the transformers library developed by Hugging Face to work with BERT and other transformer models. This library provides pre-trained models for BERT and many other popular transformer models, and it's straightforward to use for tasks such as text classification, named entity recognition, and others.

In [None]:
# pip install transformers
#pip install torch
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load the BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Function to calculate embeddings


In [None]:
def calculate_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()  # we take the embedding of the [CLS] token
    return embeddings

# Calculate embeddings for each text
df['embeddings'] = df['metadata_en_processed'].apply(calculate_embedding)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(np.vstack(df['embeddings']))

# Find top 5 most similar texts for each text
df['top_5_similar'] = [list(df.iloc[np.argsort(-row)][1:6].index) for row in similarity_matrix]
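
The cell above represents each text by its [CLS] embedding. A commonly used alternative, shown here only as a sketch and not as the notebook's method, is to average the token embeddings while ignoring padded positions. The arrays below are toy values; with the transformers library they would correspond to `outputs.last_hidden_state` and the tokenizer's `attention_mask`, converted to NumPy.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    token_embeddings: array of shape (batch, tokens, hidden)
    attention_mask:   array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Toy batch: 2 texts, 3 token positions, hidden size 2; the second text has one padded slot.
emb = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                [[2.0, 0.0], [4.0, 0.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

pooled = masked_mean_pool(emb, mask)
print(pooled)
```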

In [25]:
# Import CustomModel and load finetuned model from path.
import sys
import argparse
import torch
sys.path.append("/home/rsaha/projects/similarity-engine/src/")
sys.path.append("/home/rsaha/projects/similarity-engine/src/models/")
from custom_model import CustomModel

In [26]:

# Use a separate name for the parser so the argparse module is not shadowed
parser = argparse.ArgumentParser(description='arguments in ipynb')

# Add arguments to the parser.
parser.add_argument('--model_name', type=str, default='bert-base-uncased')
parser.add_argument('--model_type', type=str, default='bert')
parser.add_argument('--batch_size', type=int, default=8)
parser.add_argument('--epochs', type=int, default=3)
parser.add_argument('--lr', type=float, default=2e-5)
parser.add_argument('--load_model_path', type=str, default='')


# An empty list makes argparse use the defaults instead of the kernel's sys.argv
args = parser.parse_args([])
args.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
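
Inside a notebook, `parse_args` is best given an explicit list so that it does not read the kernel's own command line. The sketch below uses a subset of the arguments above to show both the defaults and an override; the override values are arbitrary examples.

```python
import argparse

# A separate name for the parser keeps the argparse module itself usable afterwards.
parser = argparse.ArgumentParser(description="arguments in ipynb")
parser.add_argument("--model_name", type=str, default="bert-base-uncased")
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--lr", type=float, default=2e-5)

# An explicit list stops argparse from reading the notebook kernel's own sys.argv.
defaults = parser.parse_args([])
override = parser.parse_args(["--batch_size", "16", "--lr", "1e-4"])

print(defaults.batch_size, override.batch_size, override.lr)
```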

In [27]:
args.load_model_path = "/home/rsaha/projects/similarity-engine/saved_models/trainer_bert_fine_tune/checkpoint-4000/"


In [28]:
model = CustomModel(args, load_model_from_path=False, model_path=args.load_model_path)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [29]:
model.model.device

device(type='cpu')