# How to create your own Word2Vec model for your domain
This tutorial will cover:
- What are word embeddings
- Applications for word embeddings
- How you can train your own word embedding model (Word2Vec) using the Python library Gensim
- Text cleaning methods & considering your domain
- Visualizing your embedding space

We will train a Word2Vec model on scientific abstracts taken from the Semantic Scholar Graph API! The process closely mirrors the work I am doing to build ClimateScholar, an open-source search engine for climate science literature.



## What are word embeddings?


## Why should you care?

## What is Gensim?

## Tutorial versions
So that you can follow along and reproduce the results, here are the versions of Python and the packages used:
- Python:
- Gensim:
- Jupyter:
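
One way to fill in a list like this for your own environment is to query the installed versions directly. This is a minimal sketch, assuming the packages are installed under the distribution names `gensim` and `notebook`. The second name is a guess; a JupyterLab setup would use `jupyterlab` instead. The helper `installed_version` is made up for illustration.

```python
import platform
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("Python:", platform.python_version())
# Distribution names are assumptions; adjust them to match your setup.
for name in ("gensim", "notebook"):
    print(f"{name}:", installed_version(name) or "not installed")
```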

# Preparing our data

In [None]:
import gensim.models
from gensim.parsing.preprocessing import remove_stopwords, strip_multiple_whitespaces, strip_punctuation
import json
import re
import random

In [None]:
papers = []
root_path = "./data"
sample_data = ["weather_CO2.jsonl", "paleoclimate.jsonl", "rewilding.jsonl", "rockfish.jsonl", "arctic.jsonl", "climate.jsonl", "shark_climate.jsonl"]

for data_path in sample_data:
    # Only the first line of each file is parsed; it is expected to hold one API response as JSON.
    with open(f"{root_path}/{data_path}", "r", encoding="utf-8") as json_file:
        result = json.loads(json_file.readline())

    # The paper records live under the "data" key of the response.
    papers.extend(result["data"])

len(papers)
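
The loop above reads only the first line of each file, which is enough when a file holds a single API response. If a file held several responses, one per line, a variant along these lines could collect them all. This is a sketch on an in-memory sample: `load_papers` and the sample records are made up for illustration, and the only assumption carried over is that each line is a JSON object with a `data` list.

```python
import io
import json


def load_papers(lines):
    """Collect the records under "data" from every non-empty JSON line."""
    papers = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        response = json.loads(line)
        papers.extend(response.get("data", []))
    return papers


# Two fake response lines standing in for a .jsonl file on disk.
sample_file = io.StringIO(
    '{"data": [{"title": "Paper A", "abstract": "Arctic sea ice is declining."}]}\n'
    '{"data": [{"title": "Paper B", "abstract": null}]}\n'
)
sample_papers = load_papers(sample_file)
print(len(sample_papers))
```

Because an open file object is itself an iterable of lines, the same function could be called as `load_papers(json_file)` inside a `with open(...)` block.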

In [None]:
# Keep only the papers that actually have an abstract.
data = [paper for paper in papers if paper["abstract"] is not None]
abstracts = [paper["abstract"] for paper in data]
len(abstracts)

In [None]:
abstracts[0]

In [None]:
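# Matches whole four-digit years from 2000 to 2099.
year_pattern = r'\b20[0-9]{2}\b'

def clean_sent(sent):
    """Lowercase, strip punctuation, years and stopwords, then split into tokens."""
    lowered_string = sent.lower()
    punc_removed = strip_punctuation(lowered_string)
    years_removed = re.sub(year_pattern, ' ', punc_removed)
    # Stopwords are removed after lowercasing and punctuation stripping so that
    # capitalized or punctuation-attached forms ("The", "and,") are matched too.
    stopwords_removed = remove_stopwords(years_removed)
    cleaned_string = strip_multiple_whitespaces(stopwords_removed)
    return cleaned_string.split()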
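
When stripping years with a regular expression, it is worth checking how many characters the pattern actually consumes. The sketch below compares a three-character pattern with a four-digit one on a made-up sentence, using only the standard `re` module.

```python
import re

text = "sea ice extent in 2019 was lower than in 2005"

short_pattern = r'20[0-9]'        # matches "20" plus ONE more digit
full_pattern = r'\b20[0-9]{2}\b'  # matches a whole four-digit year

short_result = re.sub(short_pattern, '', text).split()
full_result = re.sub(full_pattern, '', text).split()

print(short_result)  # the trailing digits "9" and "5" survive as tokens
print(full_result)   # both years are removed completely
```

Stray single-digit tokens like these would otherwise end up in the vocabulary and get their own vectors.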

In [None]:
cleaned_sentences = [clean_sent(sent) for sent in abstracts]
cleaned_sentences[0]

# Training & saving our model

In [None]:
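# 500-dimensional vectors, 1000 passes over the corpus, 6 worker threads; words seen fewer than 2 times are dropped.
model = gensim.models.Word2Vec(sentences=cleaned_sentences, workers=6, epochs=1000, min_count=2, vector_size=500)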
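
To get a feel for what `min_count=2` does, here is a plain-Python sketch of the idea: words whose total frequency falls below the threshold are left out of the vocabulary. It uses a tiny invented corpus and does not call Gensim, so treat it as an illustration of the idea rather than Gensim's actual implementation.

```python
from collections import Counter

# Invented, already-tokenized sentences standing in for cleaned abstracts.
toy_sentences = [
    ["arctic", "sea", "ice", "loss"],
    ["sea", "ice", "extent"],
    ["rockfish", "habitat", "loss"],
]
min_count = 2

# Count every word across the whole corpus, then keep the frequent ones.
word_counts = Counter(word for sentence in toy_sentences for word in sentence)
kept_words = sorted(word for word, count in word_counts.items() if count >= min_count)
print(kept_words)
```

On a small domain corpus this threshold can remove a lot of rare technical terms, so it is worth checking the vocabulary size after training.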

In [None]:
random_word = random.choice(model.wv.index_to_key)
random_word

In [None]:
# The models/ directory must already exist before saving.
model.save('models/word2vec_first_pass.pkl')

In [None]:
loaded_model = gensim.models.Word2Vec.load('models/word2vec_first_pass.pkl')
random_word = random.choice(loaded_model.wv.index_to_key)
random_word
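
A random word becomes more informative once you look at its nearest neighbours, which in Gensim is usually done with `loaded_model.wv.most_similar(random_word)`. That ranking is based on cosine similarity between word vectors. The sketch below shows the calculation in plain Python on invented three-dimensional vectors; the words and numbers are made up and are not output from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "glacier": [0.9, 0.1, 0.0],
    "ice": [0.8, 0.2, 0.1],
    "shark": [0.0, 0.2, 0.9],
}

query = "glacier"
# Rank every other word by its similarity to the query, most similar first.
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])
```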

# Visualizing our embeddings