## **Custom Embeddings**

This notebook is best run on Google Colab. 

For Google Colab:
1. Make sure you have added the shared folder "digital-forest". 
2. Mount Google Drive in the Colab environment:
    1. Go to the folder icon on the left.
    2. Click the folder icon with the Google Drive logo.
    3. This should mount the drive.
    4. All files in your drive are now directly accessible in your Colab environment.

For running in a local environment:
1. Make sure to change the root path to the local directory.
2. If you run into errors, double-check the file directory.
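
The switch between the Colab and local root paths can be sketched as follows. This is a minimal sketch rather than the notebook's own code: the `get_root_path` helper and both directory paths are hypothetical placeholders, and it assumes that a Colab kernel already has the `google.colab` package loaded.

```python
import sys
from pathlib import Path

# Hypothetical locations; replace both with your own directories.
COLAB_ROOT = Path("/content/drive/MyDrive/digital-forest")
LOCAL_ROOT = Path("./digital-forest")


def get_root_path(modules=None):
    """Return the Colab root when running on Colab, else the local root."""
    modules = sys.modules if modules is None else modules
    # Assumption: a Colab kernel has the google.colab package loaded.
    return COLAB_ROOT if "google.colab" in modules else LOCAL_ROOT


root_path = get_root_path()
print(root_path)
```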



## 1. Load the data

**Note:** The directory here should match the directory in your Google Colab drive. 
To get the path:
1. Explore the folders in the Files section.
2. Right-click the folder whose path you would like to use.
3. Click "Copy path" in the dropdown.

In [1]:
article_dir = 'to the xml folder'

In [2]:
# Get a list of all files in given directory
from os import walk
filenames = next(walk(article_dir), (None, None, []))[2]  # [] if no file

In [4]:
filenames[:3]

['10.3390_rs13153009.html',
 '10.3390_rs13152956.html',
 '10.3390_rs13152892.html']
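
The notebook works with both `.html` and `.xml` files, so it can help to keep only one file type before parsing. A minimal sketch using `pathlib`; the `list_files` helper is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_files(directory, suffix=".html"):
    """Return sorted names of the files in `directory` that end with `suffix`."""
    folder = Path(directory)
    if not folder.is_dir():
        # Same fallback as the walk() call above: an empty list
        return []
    return sorted(p.name for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() == suffix)
```

Calling `list_files(article_dir, ".xml")` would then return only the XML articles.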

In [1]:
from bs4 import BeautifulSoup
import re

In [2]:
file_name="10.1016_j.foreco.2016.05.018.xml"

In [3]:
with open(file_name, "r", encoding='utf-8') as f:
    xml_file = f.read()

In [4]:
soup=BeautifulSoup(xml_file,'xml')

In [22]:
soup.title

<dc:title>Emergent crowns and light-use complementarity lead to global maximum biomass and leaf area in Sequoia sempervirens forests </dc:title>

In [27]:
#print article title
print(soup.title.string)

Emergent crowns and light-use complementarity lead to global maximum biomass and leaf area in Sequoia sempervirens forests 


In [29]:
#print article doi
print(soup.identifier.string)

doi:10.1016/j.foreco.2016.05.018


In [48]:
# print the cover date (coverDate)
print(soup.coverDate.string)
# print the original load date (orig-load-date), used here as the publication date
date=soup.find_all('orig-load-date')
print(date[0].string)

2016-09-01
2016-06-15
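
Both values are ISO 8601 date strings, so the standard library can parse them when date arithmetic is needed. A small sketch using the two values printed above; the variable names are illustrative.

```python
from datetime import date

cover_date = date.fromisoformat("2016-09-01")
load_date = date.fromisoformat("2016-06-15")

# Number of days between the load date and the cover date
lag_days = (cover_date - load_date).days
print(lag_days)
```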


In [55]:
#print journal name
journal_name=soup.find(name='srctitle',string=True)
norm_journal_name=soup.find(name='normalized-srctitle',string=True)
print(journal_name)
print(norm_journal_name)

<xocs:srctitle>Forest Ecology and Management</xocs:srctitle>
<xocs:normalized-srctitle>FORESTECOLOGYMANAGEMENT</xocs:normalized-srctitle>


In [59]:
#open access details
OA=soup.find('openaccess')
OA_txt=soup.find('openaccessArticle')
OA_type=soup.find('openaccessType')
OA_sponsor_type=soup.find('openaccessSponsorType')
print(OA)
print(OA_txt)
print(OA_type)
print(OA_sponsor_type)

<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openaccessSponsorType>Author</openaccessSponsorType>


In [37]:
#print article authors
authors=soup.find_all('creator')
print(authors)
print(type(authors))
print(len(authors))
author_list=[]
for author in authors:
    author_list.append(author.string)
print(author_list)

[<dc:creator>Van Pelt, Robert</dc:creator>, <dc:creator>Sillett, Stephen C.</dc:creator>, <dc:creator>Kruse, William A.</dc:creator>, <dc:creator>Freund, James A.</dc:creator>, <dc:creator>Kramer, Russell D.</dc:creator>]
<class 'bs4.element.ResultSet'>
5
['Van Pelt, Robert', 'Sillett, Stephen C.', 'Kruse, William A.', 'Freund, James A.', 'Kramer, Russell D.']
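
The `dc:creator` strings are in "Surname, Given" order. If display order is preferred, they can be flipped with plain string handling. A minimal sketch; `to_display_name` is a hypothetical helper, not part of the original notebook.

```python
def to_display_name(creator):
    """Turn 'Surname, Given' into 'Given Surname'; leave other strings as-is."""
    surname, sep, given = creator.partition(",")
    if not sep:
        # No comma found, so there is nothing to reorder
        return creator.strip()
    return "{} {}".format(given.strip(), surname.strip())


sample_authors = ['Van Pelt, Robert', 'Sillett, Stephen C.']
print([to_display_name(a) for a in sample_authors])
```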


In [110]:
#print keywords from bibliometric
keywords=soup.find_all('subject')
print(len(keywords))
keyword_list=[]
for keyword in keywords:
    # strip() removes the extra whitespace around the first keyword
    keyword_list.append(keyword.string.strip())
print(keyword_list)

12
['Sequoia sempervirens', 'Old growth', 'Forest structure', 'Biomass', 'LAI', 'Leaf area', 'Carbon sequestration', 'LiDAR', 'Emergent trees', 'Heartwood', 'Allometric equations', 'Light-use complementarity']


In [126]:
# print keywords from the full text
# note: the first keyword is italicized, which adds an extra step to text extraction
# use stripped_strings on each keyword element to drop the formatting tags
keywords=soup.find_all('keyword')
print(len(keywords))
#print(keywords[3])
#print(keywords[1].find('italic'))
keyword_list=[]
for keyword in keywords:
    for kw in keyword.stripped_strings:
        keyword_list.append(kw)
print(keyword_list)

12
['Sequoia sempervirens', 'Old growth', 'Forest structure', 'Biomass', 'LAI', 'Leaf area', 'Carbon sequestration', 'LiDAR', 'Emergent trees', 'Heartwood', 'Allometric equations', 'Light-use complementarity']


In [127]:
# find funding information
# needs further cleaning later
funding=soup.find_all('funding-list')
print(funding)

[<xocs:funding-list has-funding-info="1">
<xocs:funding-addon-generated-timestamp>2019-03-28T03:46:39.516Z</xocs:funding-addon-generated-timestamp>
<xocs:funding-addon-type>http://vtw.elsevier.com/data/voc/AddOnTypes/50.7/nlp</xocs:funding-addon-type>
<xocs:funding-source-document source-document-type="pii">S0378112716302584</xocs:funding-source-document>
<xocs:funding>
<xocs:funding-agency-matched-string>National Science Foundation</xocs:funding-agency-matched-string>
<xocs:funding-id>IOB-0445277</xocs:funding-id>
<xocs:funding-agency>National Science Foundation</xocs:funding-agency>
<xocs:funding-agency-id>http://data.elsevier.com/vocabulary/SciValFunders/100000001</xocs:funding-agency-id>
<xocs:funding-agency-country>http://sws.geonames.org/6252001/</xocs:funding-agency-country>
</xocs:funding>
<xocs:funding>
<xocs:funding-agency-matched-string>Forest Ecology at Humboldt State University</xocs:funding-agency-matched-string>
</xocs:funding>
<xocs:funding-text>This research was suppor

In [155]:
#find authors and affiliation
authors=soup.find_all('ce:author')
print(len(authors))
##find author id
print(authors[0].get('id'))
## find author name
print(authors[0].find('ce:given-name').string,' ',authors[0].surname.string)

## find author affiliation

## extract all of the author's affiliation references (cross-refs with a refid)
affl=authors[0].find_all('ce:cross-ref')
print(len(affl))
print(affl)

## get the refid and look up the affiliation using it
print(affl[0].get('refid'))
af_id=affl[0].get('refid')

print(soup.find(name='ce:affiliation',id=af_id))

5
au005
Robert   Van Pelt
3
[<ce:cross-ref id="c0130" refid="af005">
<ce:sup loc="post">a</ce:sup>
</ce:cross-ref>, <ce:cross-ref id="c0135" refid="af010">
<ce:sup loc="post">b</ce:sup>
</ce:cross-ref>, <ce:cross-ref id="c0140" refid="cor1">
<ce:sup loc="post">⁎</ce:sup>
</ce:cross-ref>]
af005
<ce:affiliation id="af005">
<ce:label>a</ce:label>
<ce:textfn>Department of Forestry and Wildland Resources, Humboldt State University, Arcata, CA 95521, USA</ce:textfn>
<sa:affiliation>
<sa:organization>Department of Forestry and Wildland Resources</sa:organization>
<sa:organization>Humboldt State University</sa:organization>
<sa:city>Arcata</sa:city>
<sa:state>CA</sa:state>
<sa:postal-code>95521</sa:postal-code>
<sa:country>USA</sa:country>
</sa:affiliation>
</ce:affiliation>


In [163]:
# find corresponding authors
cor_ids=soup.find_all('ce:correspondence')
#print(cor_ids)
cor_id_list=[]
for cor_id in cor_ids:
    cor_id_list.append(cor_id.get('id'))
print(cor_id_list)

cor_aut_list=[]
authors=soup.find_all('ce:author')
for cor_id in cor_id_list:
    for author in authors:
        cor_aff=author.find('ce:cross-ref',refid=cor_id)
        if cor_aff is not None:
            cor_aut_list.append(author)
print(cor_aut_list)
print(len(cor_aut_list))
        

['cor1']
[<ce:author id="au005" orcid="0000-0002-2424-5040">
<ce:given-name>Robert</ce:given-name>
<ce:surname>Van Pelt</ce:surname>
<ce:cross-ref id="c0130" refid="af005">
<ce:sup loc="post">a</ce:sup>
</ce:cross-ref>
<ce:cross-ref id="c0135" refid="af010">
<ce:sup loc="post">b</ce:sup>
</ce:cross-ref>
<ce:cross-ref id="c0140" refid="cor1">
<ce:sup loc="post">⁎</ce:sup>
</ce:cross-ref>
</ce:author>]
1


In [177]:
# further cleaning is needed to extract the abstract from the formatted full text
abstract=soup.find_all('ce:abstract')
print(len(abstract))
#print(abstract[-1])
for ab in abstract[-1].stripped_strings:
    print(ab)


2
Forests >80
m tall have the highest biomass, and individual trees in these forests are Earth’s largest with deep crowns emerging above neighboring vegetation, but it is unclear to what degree these maxima depend on the emergent trees themselves or a broader-scale forest structure. Here we advance the concept of
emergent facilitation
, whereby emergent trees benefit co-occurring species. Trees reorganize foliage within crowns to optimize available light and, if long-lived, can reiterate after crown damage to become emergent. The height, depth, and spacing of emergent trees in turn allows for abundant light to pass through the canopy, leading to light-use complementarity as well as elevated biomass, leaf area, and species diversity of the forest as a whole. We chose
Sequoia sempervirens
to develop this concept and installed eleven 1-ha plots in old-growth forests spanning nearly six degrees of latitude in California. Each plot was based off a 316-m-long centerline where biomass and lea

In [26]:
print(soup.section)

<ce:section id="s0005" view="all">
<ce:label>1</ce:label>
<ce:section-title id="st025">Introduction</ce:section-title>
<ce:para id="p0030" view="all">Globally, the tallest forests also have the highest biomass (<ce:cross-refs id="c0170" refid="b0805 b0400">Waring and Franklin, 1979; Keith et al., 2009</ce:cross-refs>). Forests with trees &gt;80<ce:hsp sp="0.25"/>m tall all have abundant precipitation and most occur at low elevations, but the same holds true for many other forests that do not produce tall trees (<ce:cross-ref id="c0175" refid="b0725">Tng et al., 2012</ce:cross-ref>). Beyond elevation and precipitation, an optimal temperature regime – specifically low seasonal variation in temperature – is a global determinant of maximum tree height (<ce:cross-ref id="c0180" refid="b0460">Larjavaara and Muller-Landau, 2011</ce:cross-ref>). Wet, low elevation forests with mild and stable temperature regimes only occur in coastal environments, and accordingly nearly all trees &gt;80<ce:hsp

In [5]:
abstract=soup.find('description')
print(abstract)

<dc:description>
                  Forests &gt;80m tall have the highest biomass, and individual trees in these forests are Earth’s largest with deep crowns emerging above neighboring vegetation, but it is unclear to what degree these maxima depend on the emergent trees themselves or a broader-scale forest structure. Here we advance the concept of emergent facilitation, whereby emergent trees benefit co-occurring species. Trees reorganize foliage within crowns to optimize available light and, if long-lived, can reiterate after crown damage to become emergent. The height, depth, and spacing of emergent trees in turn allows for abundant light to pass through the canopy, leading to light-use complementarity as well as elevated biomass, leaf area, and species diversity of the forest as a whole. We chose Sequoia sempervirens to develop this concept and installed eleven 1-ha plots in old-growth forests spanning nearly six degrees of latitude in California. Each plot was based off a 316-m-lon

In [6]:
abstract=soup.find('abstract')
print(abstract)

<ce:abstract class="graphical" id="ab005" view="all" xml:lang="en">
<ce:section-title id="st005">Graphical abstract</ce:section-title>
<ce:abstract-sec id="as005" view="all">
<ce:simple-para id="sp0005" view="all">
<ce:display>
<ce:figure id="f0070">
<ce:link locator="fx1"/>
</ce:figure>
</ce:display>
</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>


In [18]:
keywords=soup.find('keywords')
print(keywords)
print(len(keywords))

<ce:keywords class="keyword" id="kg005" view="all">
<ce:section-title id="st020">Keywords</ce:section-title>
<ce:keyword id="k0005">
<ce:text>
<ce:italic>Sequoia sempervirens</ce:italic>
</ce:text>
</ce:keyword>
<ce:keyword id="k0010">
<ce:text>Old growth</ce:text>
</ce:keyword>
<ce:keyword id="k0015">
<ce:text>Forest structure</ce:text>
</ce:keyword>
<ce:keyword id="k0020">
<ce:text>Biomass</ce:text>
</ce:keyword>
<ce:keyword id="k0025">
<ce:text>LAI</ce:text>
</ce:keyword>
<ce:keyword id="k0030">
<ce:text>Leaf area</ce:text>
</ce:keyword>
<ce:keyword id="k0035">
<ce:text>Carbon sequestration</ce:text>
</ce:keyword>
<ce:keyword id="k0040">
<ce:text>LiDAR</ce:text>
</ce:keyword>
<ce:keyword id="k0045">
<ce:text>Emergent trees</ce:text>
</ce:keyword>
<ce:keyword id="k0050">
<ce:text>Heartwood</ce:text>
</ce:keyword>
<ce:keyword id="k0055">
<ce:text>Allometric equations</ce:text>
</ce:keyword>
<ce:keyword id="k0060">
<ce:text>Light-use complementarity</ce:text>
</ce:keyword>
</ce:keyword

In [11]:
soup.find_all("keywords", class_="keyword")

[<ce:keywords class="keyword" id="kg005" view="all">
 <ce:section-title id="st020">Keywords</ce:section-title>
 <ce:keyword id="k0005">
 <ce:text>
 <ce:italic>Sequoia sempervirens</ce:italic>
 </ce:text>
 </ce:keyword>
 <ce:keyword id="k0010">
 <ce:text>Old growth</ce:text>
 </ce:keyword>
 <ce:keyword id="k0015">
 <ce:text>Forest structure</ce:text>
 </ce:keyword>
 <ce:keyword id="k0020">
 <ce:text>Biomass</ce:text>
 </ce:keyword>
 <ce:keyword id="k0025">
 <ce:text>LAI</ce:text>
 </ce:keyword>
 <ce:keyword id="k0030">
 <ce:text>Leaf area</ce:text>
 </ce:keyword>
 <ce:keyword id="k0035">
 <ce:text>Carbon sequestration</ce:text>
 </ce:keyword>
 <ce:keyword id="k0040">
 <ce:text>LiDAR</ce:text>
 </ce:keyword>
 <ce:keyword id="k0045">
 <ce:text>Emergent trees</ce:text>
 </ce:keyword>
 <ce:keyword id="k0050">
 <ce:text>Heartwood</ce:text>
 </ce:keyword>
 <ce:keyword id="k0055">
 <ce:text>Allometric equations</ce:text>
 </ce:keyword>
 <ce:keyword id="k0060">
 <ce:text>Light-use complementari

In [21]:
full_article=soup.find('sections')
print(full_article[1])
print(len(full_article))

KeyError: 1
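
The `KeyError` arises because `find` returns a single `Tag`, and indexing a `Tag` looks up an attribute rather than a child element. A minimal sketch on a toy document; the markup below is made up, and it uses the built-in `html.parser` instead of the `xml` parser so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

toy = "<sections><para>first</para><para>second</para></sections>"
toy_soup = BeautifulSoup(toy, "html.parser")

sections = toy_soup.find("sections")  # a single Tag, not a list
try:
    sections[1]                        # Tag[...] reads an attribute, so this fails
    raised = False
except KeyError:
    raised = True

paras = sections.find_all("para")      # a ResultSet supports integer indexing
second_text = paras[1].get_text()
print(raised, second_text)
```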

In [17]:
full_article=soup.find_all('para')
print(full_article[1])
print(len(full_article))

<ce:para id="p0035" view="all">Biomass and leaf area of the tallest forests are often more than an order of magnitude higher than shorter forests, including those in tropical Central and South America as well as those in temperate Europe and eastern North America (<ce:cross-refs id="r0010" refid="b0205 b0400">Franklin and Waring, 1980; Keith et al., 2009</ce:cross-refs>). The largest trees consistently occur in forests with near-maximum biomass and leaf area (<ce:cross-refs id="r0015" refid="b0640 b0655">Sillett and Van Pelt, 2007; Sillett et al., 2015b</ce:cross-refs>). Linkages between large trees and forest structure are so strong that aboveground biomass in Central African forests is predictable with information from only the largest 5% of their trees (<ce:cross-ref id="c0190" refid="b0050">Bastin et al., 2015</ce:cross-ref>). Large trees are a critical element of forest structure, but it is unclear to what degree maximum biomass and leaf area are attributable to the presence of la

## 2. Get text data from all files

In [35]:
from bs4 import BeautifulSoup
import re

def extract_text_from_html(mdpi_dir, mdpi_file_name):
    with open(mdpi_dir + '/' + mdpi_file_name, "r", encoding='utf-8') as f:
        html_file = f.read()
    soup = BeautifulSoup(html_file, 'html.parser')
    
    article = soup.find('article')
    # `string=True` is the current name of the older `text=True` argument
    text_list = article.find_all(string=True)
    article_text = " ".join(text_list)
    
    # Remove \n characters
    clean_text = article_text.replace('\n', ' ')
    # Remove special characters and numbers
    clean_text = re.sub('[^.,A-Za-z]+', ' ', clean_text)
    # Convert all text to lower
    clean_text = clean_text.lower()
    
    return clean_text
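
To see what the cleaning steps in `extract_text_from_html` keep and drop, they can be applied to a short string. A minimal sketch; `clean` is a hypothetical helper that repeats the same three steps, and the sample sentence is made up.

```python
import re


def clean(text):
    """Apply the same three cleaning steps to a plain string."""
    text = text.replace('\n', ' ')
    # Every run of characters other than letters, '.' and ',' becomes one space
    text = re.sub('[^.,A-Za-z]+', ' ', text)
    return text.lower()


sample = "Tree height: 80 m\n(LiDAR, 2016)."
print(clean(sample))
```

Digits and most punctuation are removed, so quantities such as `80` do not survive into the corpus.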

In [36]:
# Get all the text data from the articles
mdpi_corpus = []
failed_files = []
for file_name in filenames:
    # There might be possible exceptions from extracting text. 
    # This will catch the exceptions and we can analyze why it failed for some files
    try:
        # article_dir is the directory defined in section 1
        extracted_text = extract_text_from_html(article_dir, file_name)
        mdpi_corpus.append(extracted_text)
    except Exception as e:
        failed_files.append(file_name)
        print("Error while extracting text for {}".format(file_name), e)

Error while extracting text for 10.3390_rs90201023.html 'NoneType' object has no attribute 'find_all'
Error while extracting text for 10.3390_rs90201024.html 'NoneType' object has no attribute 'find_all'
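
The two failures above occur when `soup.find('article')` returns `None`, that is, when a file has no `<article>` element. One option is to detect that case explicitly so the reason for skipping a file is clear. A minimal sketch on made-up HTML strings; `get_article_text` is a hypothetical helper, not the notebook's function.

```python
from bs4 import BeautifulSoup


def get_article_text(html):
    """Return the text inside <article>, or None when the element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:
        return None
    # Join all text fragments with single spaces
    return article.get_text(" ", strip=True)


with_article = "<html><body><article><p>Some text</p></article></body></html>"
without_article = "<html><body><p>No article element</p></body></html>"
print(get_article_text(with_article))
print(get_article_text(without_article))
```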


In [37]:
print("Successfully processed {} records".format(len(mdpi_corpus)))

Successfully processed 402 records


## 3. Set up GloVe embeddings

**This step is required on Google Colab because data is not stored permanently.**

In [10]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2022-04-15 15:39:05--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-15 15:39:05--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-04-15 15:41:46 (5.12 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [11]:
!unzip "glove.6B.zip"

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


## 4. Set up the pipeline to make custom embeddings

In [12]:
import gensim
from gensim.test.utils import get_tmpfile, datapath
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import Word2Vec

In [13]:
glove_file = datapath('/content/glove.6B.50d.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
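
The conversion step is needed because the GloVe text file has no header, while the word2vec text format starts with a line giving the vocabulary size and the vector dimension. The idea can be sketched in plain Python; `add_word2vec_header` is a hypothetical helper written for illustration, not gensim's implementation, and it reads the whole file into memory for simplicity.

```python
def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with '<n_words> <dim>'."""
    with open(glove_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    n_words = len(lines)
    # Each line is a word followed by its vector components
    dim = len(lines[0].split()) - 1 if lines else 0
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(n_words, dim))
        dst.writelines(lines)
    return n_words, dim
```

Newer gensim releases (4.x) can also read the GloVe file directly by passing `no_header=True` to `KeyedVectors.load_word2vec_format`.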

### 4.1 Process the corpus into the input format required by the Word2Vec algorithm

In [15]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
# First we combine all the records into one single string
full_text = " ".join(mdpi_corpus)

In [39]:
sentences = []
for document in mdpi_corpus:
    # Break each document in the corpus into a list of sentences
    sent_list = sent_tokenize(document)
    # Break each sentence into a list of words
    for sent in sent_list:
        word_list = word_tokenize(sent)
        sentences.append(word_list)

In [41]:
print("We have {} sentences in the corpus".format(len(sentences)))

We have 377179 sentences in the corpus


### 4.2 Set up the Word2Vec model

In [46]:
# build a word2vec model on your dataset
# (`size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`)
base_model = Word2Vec(size=50, window=5, min_count=3, workers=4)
base_model.build_vocab(sentences)

In [52]:
total_examples = base_model.corpus_count

In [59]:
# Unique words in the vocabulary
# (`wv.vocab` exists in gensim 3.x; gensim 4 replaced it with `wv.key_to_index`)
len(base_model.wv.vocab)

26528

In [71]:
# Statistics of our vocabulary
unique_words = set(base_model.wv.vocab.keys()) - set(glove_vectors.vocab.keys())
common_words = set(base_model.wv.vocab.keys()).intersection(set(glove_vectors.vocab.keys()))

print("Unique words to our corpus {}".format(len(unique_words)))
print("Common words between corpus and glove {}".format(len(common_words)))

Unique words to our corpus 6268
Common words between corpus and glove 20260
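
The two counts come from a set difference and a set intersection over the vocabularies. A toy illustration with made-up word sets:

```python
corpus_vocab = {"forest", "lidar", "biomass", "forclime"}
glove_vocab = {"forest", "lidar", "biomass", "river"}

only_in_corpus = corpus_vocab - glove_vocab   # words GloVe has no vector for
in_both = corpus_vocab & glove_vocab          # words shared by both vocabularies

print(sorted(only_in_corpus))
print(sorted(in_both))
```

Because every corpus word falls into exactly one of the two sets, the counts add up to the corpus vocabulary size (6268 + 20260 = 26528 above).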


### 4.3 Train Word2Vec model

In [72]:
# Add GloVe's vocabulary to the model. This call only registers words;
# it does not copy GloVe's vectors, and words below min_count may be discarded.
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

In [None]:
# train on your data
base_model.train(sentences, total_examples=total_examples, epochs=100)
base_model_wv = base_model.wv

### 4.4 Analyze our embeddings

In [77]:
list(unique_words)[:10]

['qair',
 'k.l',
 'singto',
 'forclime',
 'tection',
 'channan',
 'logsig',
 'mizoue',
 'y.o',
 'g.o']

In [78]:
'geoinform' in common_words

False

In [74]:
base_model_wv.most_similar('geoinform')

[('energy', 0.5695731043815613),
 ('topography', 0.49855130910873413),
 ('poulin', 0.4958604872226715),
 ('levin', 0.4894944727420807),
 ('sun', 0.48915255069732666),
 ('vicarious', 0.48822200298309326),
 ('quantification', 0.48101136088371277),
 ('gasparini', 0.470840185880661),
 ('cools', 0.46899086236953735),
 ('ahola', 0.4679374396800995)]
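
`most_similar` ranks words by the cosine similarity between their vectors. The scoring can be sketched in plain Python on made-up 3-dimensional vectors (the vectors in this notebook have 50 dimensions); the `cosine` and `rank_similar` helpers are hypothetical and only illustrate the idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(query, vectors, topn=2):
    """Return the topn (word, score) pairs most similar to `query`."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {
    "forest": [1.0, 0.0, 0.0],
    "woodland": [0.9, 0.1, 0.0],
    "energy": [0.0, 1.0, 0.0],
}
print(rank_similar("forest", toy_vectors))
```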