# How to create your own Word2Vec model for your domain 🧠
This tutorial will cover:
- Applications for word embeddings.
- How you can train your own word embedding model (Word2Vec) using the Python library Gensim.
- Text cleaning methods & considering your domain.
- A wee tiny bit of linguistics.

I am assuming you have prior... We will be training a Word2Vec model on scientific abstracts taken from the Semantic Scholar Graph API! This process is very similar to the work I am doing in building [ClimateScholar](https://github.com/EarthNLP/ClimateScholar/) (an open-source climate science literature search engine).



## What is Gensim? 🤔

[Gensim](https://radimrehurek.com/gensim/) is a Python library developed by [RaRe Technologies](https://rare-technologies.com/) that makes training a wide variety of topic models very easy. 

## Tutorial versions 🔢
For reproducibility, here are the versions of Python and the packages used (in case you run into issues with future versions of Gensim):
- Python: 3.8.10
- Gensim: 4.2.0

# Preparing our data 👩🏽‍🍳

In [6]:
# All of our imports for the tutorial
# The core import for Word2Vec
import gensim.models

# We'll be using these imports during our text cleaning
from gensim.parsing.preprocessing import remove_stopwords, strip_multiple_whitespaces, strip_punctuation 
import re

# Our data is stored in jsonl format
import json

# Used later to show off words from our trained model
import random

from typing import List

In [9]:
papers = []
root_path = "./data"
file_names = ["weather_CO2.jsonl", "paleoclimate.jsonl", "rewilding.jsonl", "rockfish.jsonl", "arctic.jsonl", "climate.jsonl", "shark_climate.jsonl"]

for file_path in file_names:
    with open(f'{root_path}/{file_path}', 'r') as json_file:
        json_list = list(json_file)

    result = json.loads(json_list[0])

    for result_dict in result["data"]:
        papers.append(result_dict)

len(papers)

350

In [None]:
# Let's take a peek at how our data looks!
papers[0]

In [11]:
# There are a lot of fields we don't need for this tutorial. We just want the abstract.
# So we drop every entry with a null abstract, then build a list of abstracts for easy processing.
# You could condense this into a single list comprehension, but for readability I kept the two steps separate.
data = [paper for paper in papers if paper["abstract"] is not None]
abstracts = [item['abstract'] for item in data]
len(abstracts)

222
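If you want to sanity check the loading and filtering logic without the real files, here is a minimal sketch on a made-up payload. The record fields and values below are hypothetical; the only thing taken from the cells above is the shape (a JSON object with a `"data"` list of paper dicts, some with a null abstract). I also use `.get()` here, which is my own choice so that records missing the key entirely don't raise a `KeyError`.

```python
import json

# Hypothetical one-line payload mimicking the shape the loading code expects.
sample_line = json.dumps({
    "data": [
        {"paperId": "a1", "title": "Sea ice trends", "abstract": "Arctic sea ice is declining."},
        {"paperId": "b2", "title": "No abstract here", "abstract": None},
        {"paperId": "c3", "title": "Missing key"},
    ]
})

# Same idea as the tutorial: parse the line, then walk the "data" list.
sample_papers = json.loads(sample_line)["data"]

# .get() returns None for a missing key, so both problem records are skipped.
sample_abstracts = [p["abstract"] for p in sample_papers if p.get("abstract")]

print(len(sample_papers), len(sample_abstracts))
```

Note that `if p.get("abstract")` also drops empty strings, which is slightly stricter than the `is not None` check used above.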

In [12]:
# We should just see the abstract text now after the above processing.
abstracts[0]

'Marine carbon dioxide (CO2) system data has been collected from December 2014 to June 2018 in the northern Salish Sea (NSS; British Columbia, Canada) and consisted of continuous measurements at two sites as well as spatially- and seasonally-distributed discrete seawater samples. The array of CO2 observing activities included high-resolution CO2 partial pressure (pCO2) and pHT (total scale) measurements made at the Hakai Institute’s Quadra Island Field Station (QIFS) and from an Environment Canada weather buoy, respectively, as well as discrete seawater measurements of pCO2 and total dissolved inorganic carbon (TCO2) obtained during a number of field campaigns. A relationship between NSS alkalinity and salinity was developed with the discrete datasets and used with the continuous measurements to highly resolve the marine CO2 system. Collectively, these datasets provided insights into the seasonality in this historically under-sampled region and detail the area’s tendency for aragonite 

Now that we've loaded our data, we can move on to the most important part: cleaning it.

### ✨Preprocessing✨

This is where your domain-specific knowledge comes in. Do not be fooled by how easy this looks! Tutorials can make it seem trivial, but it is a process that takes some time to get right.

We're working with scientific text, so let us consider the goals of this process:
- We want to keep scientific terms, so avoid removing any rare or uncommon words.
- Stopwords (the, we, and, etc.) can safely be removed because their inclusion doesn't impact the meaning of our data. It also means fewer words for our model to be trained on.
- Authors can be inconsistent when capitalizing words (co2 vs CO2), so let's lowercase everything so the same words aren't treated differently.
- The punctuation looked at me funny. So let's remove it because we don't need it.
- I am from the future, so I know that after performing some of the operations above, we end up with a lot of whitespace. Let's make sure to remove any extra!
- The last step will be to remove years from the data. We can achieve this with some simple regex below. For our use case, years just add noise to the embedding space.
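To make the goals above concrete, here is a minimal sketch of the same steps using only the standard library. The function name and the tiny stopword set are hypothetical stand-ins (Gensim's real stopword list is much longer). Two things differ from the Gensim version on purpose, and both are my assumptions rather than part of the tutorial's pipeline: it lowercases *before* removing stopwords so that "The" is caught, and it puts word boundaries around the year pattern so digits inside longer numbers are left alone.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stopword list for illustration only.
TOY_STOPWORDS = {"the", "a", "and", "we", "in", "of", "to", "from", "has", "been"}

# Word boundaries keep the pattern from eating digits inside longer numbers.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_sketch(text: str) -> List[str]:
    lowered = text.lower()                      # co2 and CO2 become the same token
    no_punct = lowered.translate(PUNCT_TABLE)   # punctuation -> spaces
    no_years = YEAR_PATTERN.sub(" ", no_punct)  # drop standalone years
    # split() with no arguments also collapses any extra whitespace.
    return [tok for tok in no_years.split() if tok not in TOY_STOPWORDS]


example_tokens = clean_text_sketch(
    "The CO2 data has been collected from December 2014 to June 2018."
)
print(example_tokens)
```

Whether you want to drop years from the 1900s as well depends on your corpus, so treat the pattern as a starting point.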


### Exploring other ideas 🚀
There are a few other ways we can improve this process. We won't show them in the tutorial, but here are some ideas:

- Phrase Matching: Climate Change is treated as two separate words (climate & change), but we often think of this phrase as a single "concept". Most NLP methods define a word as a string of letters separated by whitespace. That definition doesn't work for every language, and it doesn't work for every domain either.
    - Linguistics also has another definition for a word: anything that captures a single semantic concept.
        - So in our case: Climate Change or Sebastes ruberrimus (the scientific name for the yelloweye rockfish). You can combine them into a single word for processing by replacing the space with a hyphen or by concatenating them:
            - Climate change becomes climate-change or climateChange, etc.
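Here is a rough sketch of the phrase matching idea on an already tokenized sentence. The function name and the phrase set are hypothetical, and this only handles two-word phrases. One thing to watch out for: `strip_punctuation` replaces hyphens with spaces, so a merge like this would need to run *after* the punctuation step, otherwise `climate-change` gets split right back apart. (Gensim also ships a `Phrases` model that learns frequent word pairs statistically, which may be a better fit than a hand-written list.)

```python
from typing import List, Set, Tuple


def merge_bigrams(
    tokens: List[str],
    phrases: Set[Tuple[str, str]],
    joiner: str = "-",
) -> List[str]:
    """Join adjacent token pairs that appear in `phrases` into one token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(joiner.join(pair))
            i += 2  # skip both halves of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical domain phrases; in practice these come from your own knowledge of the field.
DOMAIN_PHRASES = {("climate", "change"), ("sebastes", "ruberrimus")}

merged_tokens = merge_bigrams(
    ["climate", "change", "threatens", "sebastes", "ruberrimus", "habitat"],
    DOMAIN_PHRASES,
)
print(merged_tokens)
```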

- You can use a statistical tokenizer (like the ones used by Transformer models) along with minor cleaning to prepare your text. (Not sure *why* you would do this, but you could!)

Text cleaning is also never really a "finished" step. You can forever tinker here to better capture your domain in the embedding space. We're going to consider this good enough for now though 😉.

In [13]:
year_pattern = r'20[0-9][0-9]'
def clean_sent(sent: str) -> List[str]:
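    # You can see the steps here are the ones we covered above!
    # Note: remove_stopwords is case-sensitive, so capitalized stopwords like "The" slip through
    # (you can spot them in the output below). Lowercasing before this step would catch them.
    removed_stopwords = remove_stopwords(sent)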
    lowered_string = removed_stopwords.lower()
    punc_removed = strip_punctuation(lowered_string)
    remove_whitespace = strip_multiple_whitespaces(punc_removed)
    cleaned_string = re.sub(year_pattern, '', remove_whitespace)
    # Gensim wants the sentences in list of strings ["word1", "word2", "word3", etc] format, so we can do that when returning!
    return cleaned_string.split()

In [14]:
cleaned_sentences = [clean_sent(sent) for sent in abstracts]
# Let's take a peek at the cleaned data
cleaned_sentences[0]

['marine',
 'carbon',
 'dioxide',
 'co2',
 'data',
 'collected',
 'december',
 'june',
 'northern',
 'salish',
 'sea',
 'nss',
 'british',
 'columbia',
 'canada',
 'consisted',
 'continuous',
 'measurements',
 'sites',
 'spatially',
 'seasonally',
 'distributed',
 'discrete',
 'seawater',
 'samples',
 'the',
 'array',
 'co2',
 'observing',
 'activities',
 'included',
 'high',
 'resolution',
 'co2',
 'partial',
 'pressure',
 'pco2',
 'pht',
 'total',
 'scale',
 'measurements',
 'hakai',
 'institute’s',
 'quadra',
 'island',
 'field',
 'station',
 'qifs',
 'environment',
 'canada',
 'weather',
 'buoy',
 'respectively',
 'discrete',
 'seawater',
 'measurements',
 'pco2',
 'total',
 'dissolved',
 'inorganic',
 'carbon',
 'tco2',
 'obtained',
 'number',
 'field',
 'campaigns',
 'a',
 'relationship',
 'nss',
 'alkalinity',
 'salinity',
 'developed',
 'discrete',
 'datasets',
 'continuous',
 'measurements',
 'highly',
 'resolve',
 'marine',
 'co2',
 'system',
 'collectively',
 'datasets',
 'p

# Training & saving our model 🏋🏽

In [15]:
# The only required parameter here is your text data. 
# The defaults for workers, epochs, min_count, etc are quite good. I just wanted to show what modifying them might look like.
model = gensim.models.Word2Vec(sentences=cleaned_sentences, workers=6, epochs=1000, min_count=2, vector_size=500)

Training our model is that easy! Now let's see what words are in the model vocabulary.

In [21]:
random_word = random.choice(model.wv.index_to_key)
random_word

'glacial'

Saving is pretty easy too.

In [17]:
# Note: the models/ directory needs to exist before saving.
model.save('models/word2vec_tutorial.pkl')

We can load our newly trained model to see if it still works.

In [24]:
loaded_model = gensim.models.Word2Vec.load('models/word2vec_tutorial.pkl')
random_word = random.choice(loaded_model.wv.index_to_key)
random_word

'issues'

We can also take a peek at what words are in our embedding.

In [25]:
loaded_model.wv.index_to_key

['climate',
 'the',
 'co2',
 'species',
 'change',
 'data',
 'model',
 'rewilding',
 'we',
 'ice',
 'this',
 'rockfish',
 'arctic',
 'sharks',
 'global',
 'changes',
 'shark',
 'ocean',
 'sea',
 'in',
 'atmospheric',
 '2',
 '1',
 'temperature',
 '0',
 'study',
 'high',
 'conservation',
 'results',
 'conditions',
 'carbon',
 'models',
 'time',
 'weather',
 'large',
 'fish',
 'analysis',
 'water',
 'potential',
 'years',
 'based',
 '3',
 'research',
 'soil',
 'different',
 'important',
 's',
 'habitat',
 'surface',
 '5',
 'year',
 'ecosystem',
 'provide',
 'reef',
 'future',
 'however',
 'variability',
 'pacific',
 'these',
 'management',
 'higher',
 'paleoclimate',
 'effects',
 'marine',
 'new',
 'human',
 '4',
 'increase',
 'observations',
 'observed',
 'emissions',
 'processes',
 'range',
 'area',
 'elephant',
 'level',
 'increased',
 'areas',
 'la',
 'ecological',
 'regional',
 'understanding',
 'assessment',
 'environmental',
 'e',
 'present',
 'levels',
 'region',
 'precipitation',

In [26]:
# And get an idea of its length too!
len(loaded_model.wv.index_to_key)

3805
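That vocabulary size depends heavily on `min_count`. Since one of our goals was to keep rare scientific terms, it can be worth checking how many words each threshold would keep before training. Here is a minimal sketch with a hypothetical helper and a made-up corpus. It only counts tokens and does not touch the trained model.

```python
from collections import Counter
from typing import Dict, List


def vocab_size_by_min_count(
    sentences: List[List[str]],
    thresholds: List[int],
) -> Dict[int, int]:
    """Count how many distinct words survive each min_count threshold."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {
        t: sum(1 for c in counts.values() if c >= t)
        for t in thresholds
    }


# Hypothetical toy corpus, already cleaned and tokenized.
toy_sentences = [
    ["arctic", "sea", "ice", "decline"],
    ["sea", "ice", "extent", "arctic"],
    ["rockfish", "habitat", "pacific"],
]

sizes = vocab_size_by_min_count(toy_sentences, [1, 2, 3])
print(sizes)
```

You could run the same helper on `cleaned_sentences` to see how many words `min_count=2` leaves out of your own data.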

Perfect! In the next tutorial we'll explore techniques to visualize our embedding space and better understand it. 

To recap, we covered:
- Applications for word embeddings.
- How you can train your own word embedding model (Word2Vec) using the Python library Gensim.
- Text cleaning methods & considering your domain.
- A wee tiny bit of linguistics.


Thanks for reading! 😁