CWPK \#63: Staging Data Sci Resources and Preprocessing
=======================================

Clean Corpora and Datasets are a Major Part of the Effort
--------------------------

<div style="float: left; width: 305px; margin-right: 10px;">

<img src="http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png" title="Cooking with KBpedia" width="305" />

</div>

With our discussions of [network analysis](https://en.wikipedia.org/wiki/Network_theory) and knowledge extractions from our [knowledge graph](https://en.wikipedia.org/wiki/Knowledge_graph) now behind us, we are ready to tackle the questions of analytic applications and [machine learning](https://en.wikipedia.org/wiki/Machine_learning) in earnest for our [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series. We will be devoting our next nine installments to this area. We devote two installments to data sources and input preparations, largely based on [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) (natural language processing) applications. Then we devote two installments to 'standard' machine learning (largely) using the [scikit-learn](https://github.com/scikit-learn/scikit-learn) packages. We next devote four installments to [deep learning](https://en.wikipedia.org/wiki/Deep_learning), split equally between the Deep Graph Library ([DGL](https://www.dgl.ai/)) and PyTorch Geometric ([PyG](https://github.com/rusty1s/pytorch_geometric)) frameworks. We then conclude this **Part VI** with a summary and comparison of results across these installments based on the task of node classification.

In this particular installment we flesh out the plan for completing these installments and discuss data sources and completing data prep needed for the plan. We provide particular attention to the architecture and data flows within the [PyTorch](https://en.wikipedia.org/wiki/PyTorch) framework. We describe the additional Python packages we need for this work, and install and configure the first ones. We discuss general sources of data and corpora useful for machine learning purposes. Our coding efforts in this installment will obtain and clean the Wikipedia pages that supplement the two structural and annotation sources based on KBpedia that were covered in the prior installment. These three sources of **structure**, **annotations** and **pages** are the input basis to creating our own embeddings to be used in many of the machine learning tests.

### Plan for Completion of Part VI
The broad ecosystem of Python packages I was considering looked, generally, to be good choices to work together, as first outlined in [**CWPK #61**](https://www.mkbergman.com/2411/cwpk-61-nlp-machine-learning-and-analysis/). I had done an adequate initial diligence. But, how all of this was to unfold, what my plan of attack should be, became driving factors I had to solve to shorten my development and coding efforts. So, with an understanding of how we could extract general information from KBpedia useful to analysis and machine learning, I needed to project out over the entire anticipated scope to see if, indeed, these initial sources looked to be the right ones for our purposes. And, if so, how shall the efforts be sequenced and what is the flow of data?

Much reading and research went into this effort. It is true, for example, that we had already prepared a pretty robust series of analytic and machine learning case studies in [Clojure](https://en.wikipedia.org/wiki/Clojure), available from the [KBpedia Web site](https://kbpedia.org/use-cases/). I revisited each of these use cases and got some ideas of what made sense for us to attempt with Python. But I needed to understand the capabilities now available to us with Python, so I also studied each of the candidate keystone packages in some detail.

I will weave the results of this research as the next installments unfold, providing background discussion in context and as appropriate. But, in total, I formulated about 30 tasks going forward that appeared necessary to cover the defined scope. The listing below summarizes these steps, and keys the transition point (as indicated by **CWPK** installment number) for proceeding to each next new installment:

1. Formulate Part VI plan
1. Extract two source files from KBpedia
  - structure
  - annotations
3. Set environment up (not doing virtual)
1. Obtain Wikipedia articles for matching RCs
1. Set up gensim
1. Clean Wikipedia articles, all KB annotations
1. Set up spaCy
1. ID, extract phrases
1. Finish embeddings prep **#64**
  - remove stoplist
  - create numeric??
1.  Create embedding models:
  - word2vec and doc2vec
1. Text summarization for short articles (gensim)   
1. Named entity recognition
1. Set up scikit-learn **#65**
1. Create master pandas file
1. Do event/action extraction
1. Do scikit-learn classifier **#66**
  - SVM
  - k-nearest neighbors
  - random forests
1. Introduce the sklearn.metrics module and confusion matrix, etc. The standard for reporting
1. Discuss basic test parameters/'gold standards'
1. Knowledge graph embeddings **#67**
1. Create embedding models -2
  - KB-struct
  - KB-annot
  - KB-annot-full: what is above + below
  - KB-annot-page
1. Set up PyTorch/DGL-KE  **#68**
1. Set up PyTorch/PyG 
1. Formulate learning pathway/code
1. Do standard DL classifiers: **#69**
  - TransE
  - TransR
  - RESCAL
  - DistMult
  - ComplEx
  - RotatE
1. Do research DL classifiers: **#70**
  - VAE
  - GGSNN
  - MPNN
  - ChebyNet
  - GCN
  - SAGE
  - GAT
1. Choose a model evaluator: **#71**
  - scikit-learn
  - PyTorch
  - other?
1. Collate prior results
1. Evaluate prior results
1. Present comparative results

Some of these steps also needed some preliminary research before proceeding. For example, knowing I wanted to compare results across algorithms meant I needed to have a good understanding of testing and analysis requirements before starting any of the tests. 

### PyTorch Architecture

A critical question in contemplating this plan was how exactly data needed to be produced, staged, and then fed into the analysis portions. From the earlier investigations I had identified the three categories of knowledge grounded in KBpedia that could act as bases or features to machine learning; namely, **structure**, **annotations** and **pages**. I also had identified PyTorch as a shared abstraction layer for deep and machine learning. 

I was particularly focused on the question of data formats and representations such that information could be readily passed from one step to the next in the analysis pipeline. *Figure 1* is the resulting data flow chart and architecture that arose from these investigations.

First, the green block labeled 'owlready2' represents that Python package, but also the location where the intact knowledge graph of KBpedia is stored and accessed. As early installments covered, we can use either [owlready2](https://owlready2.readthedocs.io/en/latest/) or [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)) to manage this knowledge graph, though owlready2 is the point at which the KBpedia information is exported or extracted for downstream uses, importantly machine learning. As our owlready2 discussions also indicated, there is a close relationship between it and [RDFLib](https://rdflib.readthedocs.io/en/stable/) (which is also the [SPARQL](https://en.wikipedia.org/wiki/SPARQL) access point). RDFLib can provide direct input into [NetworkX](https://en.wikipedia.org/wiki/NetworkX), but that is limited to **structure** only.

The clearest common denominator format for entry into the machine learning pipeline is [pandas](https://en.wikipedia.org/wiki/Pandas_(software)) via [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) files. This centrality is fortunate given that all of our prior KBpedia extract-and-build routines have been designed around this format. This format is also one of the direct feeds possible into the PyTorch datasets format, as the figure shows:

<div style="margin: 10px auto; display: table;">

<img src="files/ml-data-flow.png" title="Data Flows in Machine Learning and KG Analysis" width="800" alt="Data Flows in Machine Learning and KG Analysis" />

</div>

<div style="margin: 10px auto; display: table; font-style: italic;">

Figure 1: Data Flows in Machine Learning and Knowledge Graph Analysis

</div>

An important block on the figure is for 'embeddings'. If you recall, all text needs to first be encoded to a numeric form to be understood by the computer. This process can also undertake dimensionality reduction, important for a sparse matrix data form like language. This same ability can be applied to graph structure and interactions. Thus, the 'embedding' block is a pivotal point at which we can represent words, sentences, paragraphs, documents, nodes, or entire graphs. We will focus much on embeddings throughout this **Part VI**.
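To make the idea of numeric encoding concrete, here is a minimal sketch using only the standard library. It assigns each token an integer index and then counts occurrences per document. This is a sparse bag-of-words count, not a learned dense embedding such as word2vec produces; the function names and the two sample strings are illustrative assumptions.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Return one count per vocabulary entry (a sparse, high-dimensional vector)."""
    counts = Counter(doc.split())
    return [counts.get(token, 0) for token in vocab]


# Two toy 'documents' (illustrative only)
docs = ["mass spectrometry matrix", "ion mass electrode"]
vocab = build_vocab(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

With a real vocabulary of tens of thousands of tokens, most entries in such vectors are zero, which is the sparsity that the dimensionality reduction of embeddings is meant to address.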

For training purposes we can also feed pre-trained corpora or embeddings into the system. We address this topic in the next main section.

*Figure 1* is not meant to be a comprehensive view of PyTorch, but it is one useful to understand data flows with respect to our use of the KBpedia knowledge graph. Over the course of this research, I also encountered many PyTorch-related extensions that, when warranted, I include in the discussion.
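As a rough illustration of the CSV entry point into this data flow, the sketch below reads CSV rows into a small container that supports <code>len()</code> and integer indexing. Those are the two methods a map-style PyTorch dataset is expected to provide, but the sketch deliberately avoids importing PyTorch or pandas. The class name, the column names, and the sample rows are hypothetical and do not reflect the actual KBpedia extraction files.

```python
import csv
import io


class CSVRows:
    """Container exposing __len__ and __getitem__ over CSV rows (hypothetical helper)."""

    def __init__(self, handle, label_field):
        self.rows = list(csv.DictReader(handle))
        self.label_field = label_field  # assumed name of the column holding the label

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = dict(self.rows[idx])
        label = row.pop(self.label_field)
        return row, label


# Hypothetical three-column extract; the real files have different columns
sample = io.StringIO(
    "id,definition,supertype\n"
    "rc.Mammal,warm-blooded vertebrate,Animals\n"
    "rc.Oak,deciduous tree,Plants\n"
)
data = CSVRows(sample, label_field="supertype")
print(len(data), data[1])
```

In practice the features would still need to be converted to numeric tensors before training, which is where the embeddings come in.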

#### Possible Extensions
There are some extensions to the PyTorch ecosystem that we will not be using or testing in this **CWPK** series. Here are some of the ones that seem closest in capabilities to what we are doing with KBpedia:

- [PyCaret](https://pycaret.org/) is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy, and many more
- [PiePline](https://github.com/PiePline/piepline) is a neural networks training pipeline based on PyTorch. Designed to standardize training process and accelerate experiments
- [Catalyst](https://github.com/catalyst-team/catalyst) helps to write full-featured deep learning pipelines in a few lines of code
- [Poutyne](https://github.com/GRAAL-Research/poutyne) is a Keras-like framework for PyTorch and handles much of the boilerplating code needed to train neural networks  
- [torchtext](https://pytorch.org/text/) has some capabilities in language modeling, sentiment analysis, text classification, question classification, entailment, machine translation, sequence tagging, question answering, and unsupervised learning
- [Spotlight](https://github.com/maciejkula/spotlight) uses PyTorch to build both deep and shallow recommender models.

### Corpora and Datasets

There are many off-the-shelf resources that can be of use when doing machine learning involving text and language. (There are as well for images, but that is out of scope to our current interests.) These resources fall into three main areas:

- [corpora](https://en.wikipedia.org/wiki/Text_corpus) - are language resources of either a general or domain nature, with vetted relationships or annotations between terms and concepts or other pre-processing useful to [computational linguistics](https://en.wikipedia.org/wiki/Corpus_linguistics)
- [pre-trained models](https://en.wikipedia.org/wiki/Language_model) - are pre-calculated language models, often expressing probability distributions over words or text. Some embeddings can act in this manner. Transformers use deep learning to train their representations, with [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) being a notable example 
- embeddings - are vector representations of chunks of text, ranging from individual words up to entire documents or languages. The numeric representation either represents a pooled statistical representation across all tokens (the so-called [CBOW](https://en.wikipedia.org/wiki/Bag-of-words_model#CBOW) approach) or context and adjacency using the [skip-gram](https://en.wikipedia.org/wiki/N-gram#Skip-gram) or similar method. [GloVe](https://nlp.stanford.edu/projects/glove/), [word2vec](https://en.wikipedia.org/wiki/Word2vec) and [fastText](https://en.wikipedia.org/wiki/FastText) are example methodologies for producing word embeddings.

Example corpora include Wikipedia (in multiple languages), news articles, Web crawls, and many others. Such corpora can be used as the language input basis for training various models, or may be a reference vocabulary for scoring and ranking input text. Various pre-trained language models are available, and embedding methods are available in a number of Python packages, including scikit-learn, [gensim](https://radimrehurek.com/gensim/) and [spaCy](https://spacy.io/) used in *cowpoke*.
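The difference between the two word2vec training styles mentioned above can be sketched in a few lines. In the usual formulation, CBOW predicts a target word from its pooled context window, while skip-gram predicts each context word from the target. The code below only generates the training examples for each style; it does no learning. The function names and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_list, target) examples, one per token."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = "fast atom bombardment mass spectrometry".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Packages such as gensim handle this windowing internally, so the sketch is only meant to show what 'context and adjacency' means in practice.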
  
#### Pre-trained Resources
There are a number of free or open-source resources for these corpora or datasets. Some include:

- [Transformers](https://github.com/huggingface/transformers) provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages
- [HuggingFace datasets](https://github.com/huggingface/datasets)
- [English Wikipedia dump](http://vectors.nlpl.eu/explore/embeddings/en/models/)
- [Wikipedia2Vec pre-trained embeddings](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/)
- [gensim datasets](https://rare-technologies.com/new-download-api-for-pretrained-nlp-models-and-datasets-in-gensim) contain links to 8 options
- [word2vec pre-trained models](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models) lists 16 or so datasets
- A comprehensive list of available [gensim datasets and models](https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json)
- [11 pre-trained word embedding models](https://stackoverflow.com/questions/45310409/using-a-word2vec-model-pre-trained-on-wikipedia) in various embedding formats
- Some older [GloVe embeddings](https://nlp.stanford.edu/projects/glove/)
- [Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)
- [DBpedia entity typing + word embeddings](https://github.com/ISE-FIZKarlsruhe/Entity-Typing-with-Word-Embeddings).

### Setting Up the Environment
In doing this research, I also assembled the list of Python packages needed to add these capabilities to *cowpoke*. Had I not just updated the conda packages, I would do so now:

<code>conda update --all</code>

Next, the general recommendation when installing multiple new packages in Python is to do them in one batch, which allows the package manager (<code>conda</code> in our circumstance) to check on version conflicts and compatibility during the install process. However, some of the packages involved in the current expansion require other settings that obviate this standard 'batch' install recommendation.

Another note is important here. In an enterprise environment with many Python projects, it is also best to install these machine learning extensions into their own virtual environment. (I covered this topic a bit in [**CWPK #58**](https://www.mkbergman.com/2407/cwpk-58-setting-up-a-remote-instance-and-web-page-server/).) However, since we are keeping this entire series in its own environment, we will skip that step here. You may prefer the virtual option.

So, we will begin with those Python packages and frameworks that pose their own unique set-up and install challenges. We begin with PyTorch. We need to first appreciate that the rationale for PyTorch was to abstract machine learning constructs while taking advantage of graphics processing units ([GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit)) (specifically, [Nvidia](https://en.wikipedia.org/wiki/Nvidia) via the [CUDA](https://en.wikipedia.org/wiki/CUDA) interface). The CUDA architecture provides one or two orders of magnitude speed up on a local machine. Unfortunately, my local Windows machine does not have the separate Nvidia GPU, so I want to install the no CUDA option. For the PyTorch install options, visit https://pytorch.org/get-started/locally/. This figure shows my selections prior to download (yours may vary): 

<div style="margin: 10px auto; display: table;">

<img src="files/pytorch-install.png" title="PyTorch Download Screen" width="800" alt="PyTorch Download Screen" />

</div>

<div style="margin: 10px auto; display: table; font-style: italic;">

Figure 2: PyTorch Download Screen

</div>

In my circumstance, my local machine does not have a separate graphics processor, so I set the CUDA requirement to 'None' **(1)**. I also removed the 'torchvision' command line specification **(2)** since that is an image-related package. (We may later need some libraries from this package, in which case we will then install it.) The PyTorch package is rather large, so install takes a few minutes. Here is the actual install command:

<code>conda install pytorch cpuonly -c pytorch</code>

Since we were not able to batch all new packages, I decided to continue with some of the other major additions in a sequential manner, with spaCy and its [installation](https://spacy.io/usage) next:

<code>conda install -c conda-forge spacy</code>

and then gensim and its [installation](https://radimrehurek.com/gensim/):

<code>conda install -c conda-forge gensim</code>

and then DGL, which has an [installation](https://www.dgl.ai/pages/start.html) screen similar to PyTorch in *Figure 2* with the same picked options:

<code>conda install -c dglteam dgl</code>

The DGL-KE extension needs to be built from source for Windows, so we will hold off on that now until we need it. We next [install](https://github.com/rusty1s/pytorch_geometric) PyTorch Geometric, which needs to be installed from a series of binaries, with CPU or GPU individually specified:

<pre>
pip install torch-scatter==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-sparse==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-cluster==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-spline-conv==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-geometric
</pre>

These new packages join these that are already a part of my local <code>conda</code> packages, and which will arise in the coming installments:

<code>scikit-learn</code> and <code>tqdm</code>.
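After a sequence of separate installs like this, a quick check that each package can actually be found is useful. The sketch below uses only the standard library and looks up each package by its import name (note that scikit-learn imports as <code>sklearn</code> and PyTorch Geometric as <code>torch_geometric</code>). The helper name is my own; the check only confirms the package is discoverable, not that it is correctly configured.

```python
import importlib.util

# Import names for the packages discussed in this installment
PACKAGES = ["torch", "spacy", "gensim", "dgl", "torch_geometric", "sklearn", "tqdm"]


def check_packages(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, found in status.items():
    print(f"{name:16s} {'ok' if found else 'MISSING'}")
```

Anything reported as missing can then be installed individually with the commands shown above.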

### Getting Wikipedia Pages

With these preliminaries complete, we are now ready to resume our data preparation tasks for our embedding and machine learning experiments. In the prior installment, we discussed two of the three source files we had identified for these efforts, the KBpedia **structure** (<code>kbpedia/v300/extractions/data/graph_specs.csv</code>) and the KBpedia **annotations** (<code>kbpedia/v300/extractions/classes/Generals_annot_out.csv</code>) files. In this specific section we obtain the third source file of **pages** from Wikipedia.

Of the 58,000 reference concepts presently contained in KBpedia, about 45,000 have a directly corresponding Wikipedia article or listing of category articles. These provide a potentially rich source of content for language models and embeddings. The challenge is how to obtain this content in a way that can be readily processed for our purposes.

We have been working with Wikipedia since its inception, so we knew that there are data sources for downloads or dumps. For example, the periodic language dumps such as https://dumps.wikimedia.org/enwiki/20200920/ may be accessed to obtain full-text versions of articles. Such dumps have been used scores of times to produce Wikipedia corpora in many different languages and for many different purposes. But, our own mappings are a mere subset, about 1% of the nearly 6 million articles in the English Wikipedia alone. So, even if we grabbed the current dump or one of the corpora so derived, we would need to process much content to obtain the subset of interest.

Unfortunately, Wikipedia does not have a direct query or SPARQL form as exists for Wikidata (which also does not have full-text articles). We could obtain the so-called 'long abstracts' of Wikipedia pages from DBpedia (see, for example, https://wiki.dbpedia.org/downloads-2016-10), but this source is dated and each abstract is limited to about 220 words; further, a full download of the specific file in English is about 15 GB!

The basic approach, then, appeared to be that I would need to download the full Wikipedia article file, figure out how to split it into parts, and then match identifiers between KBpedia mappings and the full dataset to obtain the articles of interest. This approach is not technically difficult, but it is a real pain in the ass.

So, shortly before I committed to this work effort, I challenged myself to find another way that was perhaps less onerous. Fortunately, I found the online Wikipedia service, https://en.wikipedia.org/wiki/Special:Export, that allows one to submit article names to a text box and then get the full page article back in XML format. I tested this online service with a few articles, then 100, and then ramped up to a listing of 5 K at a time. (Similar services often have governors that limit the frequency or amounts of individual requests.) This approach worked, and within 30 min I had full articles in nine separate batches for all 45 K items in KBpedia.
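Preparing the batches for the export text box is easy to script. The sketch below splits a list of article titles into groups of a given size and joins each group with newlines, ready to paste into the form. It makes no network requests. The function name and the placeholder titles are assumptions; the real titles would come from the KBpedia-to-Wikipedia mappings.

```python
def batch_titles(titles, size=5000):
    """Yield newline-joined blocks of at most `size` titles each."""
    for start in range(0, len(titles), size):
        yield "\n".join(titles[start:start + size])


# Placeholder titles (illustrative only)
titles = [f"Article_{n}" for n in range(12)]
batches = list(batch_titles(titles, size=5))
print(len(batches))
print(batches[2])
```

At the default size of 5,000, a list of 45,000 titles yields the nine batches described above.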

### Clean All Input Text
This file is a single article from the Wikipedia English dump for <code>1-(2-Nitrophenoxy)octane</code>:

In [None]:
<page>
    <title>1-(2-Nitrophenoxy)octane</title>
    <ns>0</ns>
    <id>11793192</id>
    <revision>
      <id>891140188</id>
      <parentid>802024542</parentid>
      <timestamp>2019-04-05T23:04:47Z</timestamp>
      <contributor>
        <username>Koavf</username>
        <id>205121</id>
      </contributor>
      <minor/>
      <comment>/* top */Replace HTML with MediaWiki markup or templates, replaced: &lt;sub&gt; → {{sub| (3), &lt;/sub&gt; → }} (3)</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="2029" xml:space="preserve">{{chembox
| Watchedfields = changed
| verifiedrevid = 477206849
| ImageFile =Nitrophenoxyoctane.png
| ImageSize =240px
| ImageFile1 = 1-(2-Nitrophenoxy)octane-3D-spacefill.png
| ImageSize1 = 220
| ImageAlt1 = NPOE molecule
| PIN = 1-Nitro-2-(octyloxy)benzene
| OtherNames = 1-(2-Nitrophenoxy)octane&lt;br /&gt;2-Nitrophenyl octyl ether&lt;br /&gt;1-Nitro-2-octoxy-benzene&lt;br /&gt;2-(Octyloxy)nitrobenzene&lt;br /&gt;Octyl o-nitrophenyl ether
|Section1={{Chembox Identifiers
| Abbreviations =NPOE
| ChemSpiderID_Ref = {{chemspidercite|correct|chemspider}}
| ChemSpiderID = 148623
| InChI = 1/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| InChIKey = CXVOIIMJZFREMM-UHFFFAOYAD
| StdInChI_Ref = {{stdinchicite|correct|chemspider}}
| StdInChI = 1S/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| StdInChIKey_Ref = {{stdinchicite|correct|chemspider}}
| StdInChIKey = CXVOIIMJZFREMM-UHFFFAOYSA-N
| CASNo_Ref = {{cascite|correct|CAS}}
| CASNo =37682-29-4
| PubChem =169952
| SMILES = [O-][N+](=O)c1ccccc1OCCCCCCCC
}}
|Section2={{Chembox Properties
| Formula =C{{sub|14}}H{{sub|21}}NO{{sub|3}}
| MolarMass =251.321
| Appearance =
| Density =1.04 g/mL
| MeltingPt =
| BoilingPtC = 197 to 198
| BoilingPt_notes = (11 mm Hg)
| Solubility =
  }}
|Section3={{Chembox Hazards
| MainHazards =
| FlashPt =
| AutoignitionPt = 
 }}
}}

'''1-(2-Nitrophenoxy)octane''', also known as '''nitrophenyl octyl ether''' and abbreviated '''NPOE''', is a 
[[chemical compound]] that is used as a matrix in [[fast atom bombardment]] [[mass spectrometry]], liquid 
[[secondary ion mass spectrometry]], and as a highly [[lipophilic]] [[plasticizer]] in [[polymer]] 
[[Polymeric membrane|membranes]] used in [[ion selective electrode]]s.

== See also ==

* [[Glycerol]]
* [[3-Mercaptopropane-1,2-diol]]
* [[3-Nitrobenzyl alcohol]]
* [[18-Crown-6]]
* [[Sulfolane]]
* [[Diethanolamine]]
* [[Triethanolamine]]

{{DEFAULTSORT:Nitrophenoxy)octane, 1-(2-}}
[[Category:Nitrobenzenes]]
[[Category:Phenol ethers]]</text>
      <sha1>0n15t2w0sp7a50fjptoytuyus0vsrww</sha1>
    </revision>
  </page>

We want to extract out the specific article text (denoted by the <code>&lt;text></code> field), perhaps capture some other specific fields, remove internal tags, and then create a clean text representation that we can further process. This additional processing includes removing stoplist words, finding and identifying phrases (multiple token chunks), and then tokenizing the text suitable for processing as computer input. 
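Before turning to a dedicated package, it may help to see what pulling the title and <code>&lt;text></code> fields out of a single page record looks like. This is a minimal sketch with the standard library's <code>xml.etree.ElementTree</code>, not the approach used in the rest of this installment. The helper names are my own, and the sample record is a shortened version of the one above. Full exports declare an XML namespace on the root element, which is why the sketch strips any namespace prefix before comparing tag names.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def title_and_text(page_xml):
    """Return the first title and first text found in one <page> record."""
    root = ET.fromstring(page_xml)
    found = {}
    for elem in root.iter():
        name = local(elem.tag)
        if name in ("title", "text") and name not in found:
            found[name] = elem.text or ""
    return found.get("title", ""), found.get("text", "")


# Shortened sample record (illustrative only)
sample = """<page>
  <title>1-(2-Nitrophenoxy)octane</title>
  <revision>
    <text bytes="20" xml:space="preserve">'''NPOE''' is a [[chemical compound]].</text>
  </revision>
</page>"""
title, text = title_and_text(sample)
print(title)
print(text)
```

The returned text still contains wiki markup such as bold quotes and double-bracket links, which is the part that needs a purpose-built cleaner.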

There are multiple methods available for this kind of processing. One approach, for example, uses [XML parsing and specific code](https://www.heatonresearch.com/2017/03/03/python-basic-wikipedia-parsing.html) geared to the Wikipedia dump. Another approach uses a [dedicated Wikipedia extractor](https://www.heatonresearch.com/2017/03/03/python-basic-wikipedia-parsing.html). There are actually a few variants of dedicated extractors.

However, one particular Python package, gensim, provides multiple utilities and [Wikipedia services](https://radimrehurek.com/gensim/corpora/wikicorpus.html). Since I had already identified gensim to provide services like sentiment analysis and some other NLP capabilities, I chose to focus on using this package for the needed Wikipedia cleaning tasks.

Gensim has a <code>gensim.corpora.wikicorpus.WikiCorpus</code> class designed specifically for processing the Wikipedia article dump file. Fortunately, I was able to find some example code online that showed how to process this file:
- https://stackoverflow.com/questions/56715394/how-do-i-use-the-wikipedia-dump-as-a-gensim-model (doc2vec example) and
- https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html as another.

However, prior to using gensim, I needed to combine the batch outputs from my Wikipedia page retrievals into a single <code>xml</code> file, which I could then bzip for direct ingest by gensim. (Most gensim models and capabilities can read either bzip or text files.)

Each 5 K <code>xml</code> page retrieval from Wikipedia comes with its own header and closing tags. These need to be manually snipped out of the group retrieval files before combining. We prepared these into nine blocks that corresponded to each of the batch Wikipedia retrievals, and retained the header and closing tags in the first and last files respectively:

<div style="background-color:#ffecec; border:1px dotted #f5aca6; vertical-align:middle; margin:15px 60px; padding:8px;"><span style="font-weight: bold;">NOTE:</span> Due to GitHub's file size limits (of 100 MB max), the nine text files listed in the next routine have been zipped and uploaded to <a href="kbpedia.org/cwpk-text/Wikipedia-pages-1.zip">kbpedia.org/cwpk-text/Wikipedia-pages-1.zip</a>. To use these files, you will need to download to your local system and unzip. Increment the zip files as shown in the link through #9. Then, all following routines below must be repeated locally in order to progress through the various cleaning and preparation steps.</div>

In [None]:
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml'
filenames = [r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-1.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-2.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-3.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-4.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-5.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-6.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-7.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-8.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-9.txt']
with open(out_f, 'w', encoding='utf-8') as outfile:
    for fname in filenames:
        with open(fname, encoding='utf-8', errors='ignore') as infile:
            # enumerate tracks the line number for error reporting
            for i, line in enumerate(infile, start=1):
                try:
                    outfile.write(line)
                except Exception as e:
                    print('Error at line ' + str(i) + ': ' + str(e))
            print('Now combined:' + fname)
# The 'with' block closes outfile automatically
print('Now all files combined!')
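
The manual snipping of headers and closing tags could, in principle, also be scripted. The sketch below is an alternative I did not use here: it keeps only the <code>&lt;page></code> blocks from each batch and wraps them in a single root element. The function names are hypothetical, and the bare <code>&lt;mediawiki></code> header is a placeholder; a real export header carries attributes and a site information block that would need to be preserved from the first file. Because the regular expression runs over a whole batch held in memory, this is only a sketch for modest file sizes.

```python
import re

# Non-greedy match of each complete page record, across line breaks
PAGE_RE = re.compile(r"<page>.*?</page>", re.DOTALL)


def pages_only(batch_xml):
    """Return the list of <page>...</page> blocks found in one batch."""
    return PAGE_RE.findall(batch_xml)


def combine(batches, header="<mediawiki>", footer="</mediawiki>"):
    """Join page blocks from all batches under one header and one footer."""
    parts = [header]
    for batch in batches:
        parts.extend(pages_only(batch))
    parts.append(footer)
    return "\n".join(parts)


# Two tiny stand-in batches (illustrative only)
batch_1 = "<mediawiki><siteinfo/><page><title>A</title></page></mediawiki>"
batch_2 = "<mediawiki><siteinfo/><page><title>B</title></page></mediawiki>"
combined = combine([batch_1, batch_2])
print(combined)
```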

The output of this routine is then bzipped offline, and then used as the submission to the gensim <code>WikiCorpus</code> function that processes the standard <code>xml</code> output:

In [None]:
"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml.bz2'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'

def make_corpus(in_f, out_f):

    """Convert Wikipedia xml dump file to text corpus"""

    wiki = WikiCorpus(in_f)
    i = 0
    # The 'with' block closes the output file even if the loop is interrupted
    with open(out_f, 'w', encoding='utf-8') as output:
        for text in wiki.get_texts():
            try:
                # Tokens may be bytes or str depending on the gensim release
                tokens = [t.decode('utf-8') if isinstance(t, bytes) else t for t in text]
                output.write(' '.join(tokens) + '\n')
            except Exception as e:
                print('Exception error: ' + str(e))
            i = i + 1
            if (i % 10000 == 0):
                print('Processed ' + str(i) + ' articles')
    print('Processed ' + str(i) + ' articles;')
    print('Processing complete!')

make_corpus(in_f, out_f)

We further make a smaller input file, <code>enwiki-test-corpus.xml.bz2</code>, with only a few records from the Wikipedia XML dump in order to speed testing of the above code.
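Although I bzipped the files offline, Python's standard <code>bz2</code> module can do the same job, which is handy when producing small test files. The sketch below writes a short XML string to a <code>.bz2</code> file in a temporary directory and reads it back. The helper names and the stand-in XML content are assumptions for illustration.

```python
import bz2
import os
import tempfile


def write_bz2(xml_text, path):
    """Compress a text string to a bzip2 file."""
    with bz2.open(path, "wt", encoding="utf-8") as handle:
        handle.write(xml_text)


def read_bz2(path):
    """Read a bzip2 file back as text."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


# Stand-in content for a small test corpus (illustrative only)
sample = "<mediawiki>\n<page><title>A</title></page>\n</mediawiki>\n"
path = os.path.join(tempfile.mkdtemp(), "enwiki-test-corpus.xml.bz2")
write_bz2(sample, path)
print(read_bz2(path) == sample)
```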

### Initial Results

Here is what the sample program produced for the entry for <code>1-(2-Nitrophenoxy)octane</code> listed above:

<code>nitrophenoxy octane also known as nitrophenyl octyl ether and abbreviated npoe is chemical compound that is used as matrix in fast atom bombardment mass spectrometry liquid secondary ion mass spectrometry and as highly lipophilic plasticizer in polymer membranes used in ion selective electrodes see also glycerol mercaptopropane diol nitrobenzyl alcohol crown sulfolane diethanolamine triethanolamine</code>

We see a couple of things that are perhaps not in keeping with the extracted information we desire:

1. No title
1. No sentence boundaries
1. No internal category links
1. No infobox specifications

On the other hand, we do get the content from the 'See Also' section.

We want sentence boundaries for cleaner training purposes for word embedding models like word2vec. We want the other items so as to improve the lexical richness and context for the given concept. Further, we want two versions: one with titles as a separate field and one for learning purposes that includes the title in the lexicon (titles, after all, are preferred labels and deserve an additional frequency boost).
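To illustrate the two desired outputs, here is a deliberately naive sketch that splits text into sentences on terminal punctuation and then either keeps the title as a separate field or places it at the front of the sentence list. It is not the modification made to gensim later in this installment. The regular expression will mis-split on abbreviations such as 'e.g.', and the function names and sample text are my own assumptions.

```python
import re

# Split on whitespace that follows a period, exclamation mark or question mark
SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive sentence splitter; abbreviations will cause false splits."""
    return [s.strip() for s in SENT_END.split(text) if s.strip()]


def with_title(title, text, separate=True):
    """Return the title as its own field, or prepended to the sentences."""
    sentences = split_sentences(text)
    if separate:
        return {"title": title, "sentences": sentences}
    return [title] + sentences


text = "NPOE is a chemical compound. It is used as a matrix in mass spectrometry."
print(with_title("1-(2-Nitrophenoxy)octane", text))
print(with_title("1-(2-Nitrophenoxy)octane", text, separate=False))
```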

OK, so how does one make these modifications? My first hope was that arguments to these functions (<code>args</code>) might provide the specification latitude to deal with these changes. Unfortunately, none of the specified items fell into this category, though there is much latitude to modify underlying procedures. The second option was to find some third-party modification or override. Indeed, I did [find one](https://github.com/RaRe-Technologies/gensim/issues/552), that I found quite intriguing as a way to at least deal with sentence boundaries and possibly other areas. I spent nearly a full day trying to adapt this script, never succeeding. One fix would lead to another need for a fix, research on that problem, and then a fix and more problems. I'm sure most all of this is due to my amateur programming skills.

Still, the effort was frustrating. The good thing, however, is that in trying to work out a third-party fix, I was learning the underlying module. Eventually, it became clear that, if I was to address all desired areas, it was smartest to modify the source directly. The three key pieces that emerged as needing attention were <code>tokenize</code>, <code>process_article</code> and the <code>class WikiCorpus(TextCorpus)</code> code. In fact, it was the text processing heart of the last class that was the focus for changes, but the other two functions got involved because of their supporting roles. As I attempted to sub-class this basis with my own parallel approach (<code>class KBWikiCorpus(WikiCorpus)</code>), I kept finding the need to bring more supporting functions into the picture. Some of this may have been due to nuances in how to specify imported functions and modules, which I am still learning about (see concluding installments). But sub-setting or modifying any code is also difficult.

The real impact of these investigations was to help me understand the underlying module. What at first blush looked too intimidating, now was becoming understandable. I could also see other portions of the underlying module that addressed **ALL** aspects of my earlier desires. Third-party modifications choose their own scope; direct modification of the underlying module provides more aspects to tweak. So, I switched emphasis from modifying a third-party overlay to directly changing the core underlying module.

### Modifying WikiCorpus
We already knew the key functions needing focus. All changes to be made occur in the <code>wikicorpus.py</code> file that resides in your gensim package directory under Python packages. So, I make a copy of the original and name it as such, then proceed to modify the base file. Though we will substitute this modified <code>wikicorpus_kb.py</code> file for the original, I also keep a backup of it so that we have copies of both the original and the modified files.
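
The backup step can be sketched as follows. This is only an illustration: the helper name and the <code>_original</code> suffix are my own assumptions, and the commented lines show one way to locate the installed file if gensim is available.

```python
import shutil
from pathlib import Path

def backup_module_file(original, suffix='_original'):
    """Copy a module file to a sibling such as wikicorpus_original.py and return that path.

    Hypothetical helper; the suffix is an assumed naming convention.
    """
    original = Path(original)
    backup = original.with_name(original.stem + suffix + original.suffix)
    if not backup.exists():             # never overwrite an earlier backup
        shutil.copy2(original, backup)
    return backup

# One way to find the installed file (requires gensim, so left as comments):
#   import gensim.corpora.wikicorpus as wc
#   print(backup_module_file(wc.__file__))
```

Refusing to overwrite an existing backup means a second run, made after the base file has been edited, cannot clobber the pristine copy.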

Here is the resulting modified code, with notes about key changes following the listing:

In [1]:
with open('files/wikicorpus_kb.py', 'r') as f:
    print(f.read())

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Copyright (C) 2018 Emmanouil Stergiadis <em.stergiadis@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes
-----
If you have the `pattern <https://github.com/clips/pattern>`_ package installed,
this module will use a fancy lemmatization to get a lemma of each token (instead of plain alphabetic tokenizer).

See :mod:`gensim.scripts.make_wiki` for a canned (example) command-line script based on this module.

"""

import bz2
import logging
import multiprocessing
import re
import signal
from pickle import PicklingError
# LXML isn't faster, so let's go with the built-in solution
try:
    from xml.etree.cElementTree i

Gensim provides well documented code that is written in an understandable way.

Most of the modifications I made occurred at the bottom of the code listing. However, the text routine at the top of the file allows us to tailor what page 'sections' are kept or not in each Wikipedia article. Because of their substantive lexical content, I add the page templates and category names to be retained with the text body.

Assuming I will want to retain these modifications and understand them at a later date, I block off all modified sections with <code>ORIGINAL VERSION</code> and <code>NEW VERSION</code> tags. One change was to remove punctuation. Another was to grab and capture the article title. 

This file, then, becomes a replacement for the original <code>wikicorpus.py</code> code. I am cognizant that changing underlying source code for local purposes is generally considered to be a **BAD** idea. It very well may be so in this case. However, with the backups, and being attentive to version updates and keeping working code in sync, I do not see where keeping track of a direct modification is any less sustainable than needing to keep a third-party modification up to date. Both require inspection and effort. If I run a diff on the changed underlying module, I suspect that updating it takes equivalent or lesser effort than changing a third-party interface modification.
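
As a sketch of what such a diff might look like, the standard library's <code>difflib</code> can compare the backed-up original against the modified file. The wrapper function and the default file names here are assumptions for illustration only.

```python
import difflib

def module_diff(original_text, modified_text,
                original_name='wikicorpus_original.py',
                modified_name='wikicorpus_kb.py'):
    """Return the unified diff between two versions of a module as a list of lines."""
    return list(difflib.unified_diff(
        original_text.splitlines(),
        modified_text.splitlines(),
        fromfile=original_name,
        tofile=modified_name,
        lineterm=''))

# Toy inputs standing in for the two files' contents
for row in module_diff('a = 1\nb = 2\n', 'a = 1\nb = 3\n'):
    print(row)
```

Run after each gensim upgrade against the new upstream file, a diff like this shows which of the tagged changes need to be carried forward.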

The net result is that I am now capturing the substantive content of these articles in a form I want to process.

### Remove Stoplist

In my initial workflow, I had the step of stoplist removal later in the process because I thought it might be helpful to have all text prior to phrase identification. A stoplist (also known as '[stop words](https://en.wikipedia.org/wiki/Stop_word)'), by the way, is a listing of very common words (mostly conjunctions, common verb tenses, articles and prepositions) that can be removed from a block of text without adversely affecting its meaning or readability.

Since it proved better not to retain these stop words when forming [n-grams](https://en.wikipedia.org/wiki/N-gram) (see next section), I moved the routine up to be the next processing step for the Wikipedia pages. Here is the relevant code:

In [1]:
from gensim.parsing.preprocessing import remove_stopwords  # Key line for stoplist
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'

# Additional stop words, applied after gensim's built-in stoplist
more_stops = ['b', 'c', 'category', 'com', 'd', 'f', 'formatnum', 'g', 'gave', 'gov', 'h', 
              'htm', 'html', 'http', 'https', 'id', 'isbn', 'j', 'k', 'l', 'loc', 'm', 'n', 
              'need', 'needed', 'org', 'p', 'properties', 'q', 'r', 's', 'took', 'url', 'use', 
              'v', 'w', 'www', 'y', 'z']

# Both files are closed automatically when the 'with' block ends
with smart_open(in_f, 'r', encoding='utf-8') as documents, \
     open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in documents:                                 # one article per line
        try:
            line = remove_stopwords(line)                  # gensim's built-in stoplist
            querywords = line.split()
            resultwords = [word for word in querywords if word.lower() not in more_stops]
            output.write(' '.join(resultwords) + '\n')
        except Exception as e:
            print('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('Stopwords applied to ' + str(i) + ' articles')
    print('Stopwords applied to ' + str(i) + ' articles;')
    print('Processing complete!')

Stopwords applied to 10000 articles
Stopwords applied to 20000 articles
Stopwords applied to 30000 articles
Stopwords applied to 31157 articles;
Processing complete!


Gensim comes with its own stoplist, to which I added a few of my own, including removal of the <code>category</code> keyword that arose from adding that grouping. The output of this routine is the next file in the pipeline, <code>wikipedia-output-full-stopped.txt</code>.
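
The two-stage filtering can also be pictured as a single lookup against a merged set. The sketch below uses only plain Python, and its tiny <code>BASE_STOPS</code> set is a made-up stand-in for gensim's much longer built-in list.

```python
# Tiny, hypothetical stand-in for gensim's built-in stoplist
BASE_STOPS = frozenset({'a', 'an', 'and', 'as', 'in', 'is', 'of', 'that', 'the', 'to'})
# A few of the extra terms from the routine above
EXTRA_STOPS = frozenset({'category', 'http', 'isbn', 'www'})

ALL_STOPS = BASE_STOPS | EXTRA_STOPS    # set union: one membership test per word

def strip_stops(line, stops=ALL_STOPS):
    """Remove every whitespace-delimited word whose lowercase form is in `stops`."""
    return ' '.join(word for word in line.split() if word.lower() not in stops)

print(strip_stops('category chemical compound that is used as matrix in the www'))
```

A set gives constant-time membership tests, which can matter when the same check is applied to every word of tens of thousands of articles.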

### Phrase Identification and Extraction
Phrases are n-grams, generally composed of two or three paired words, which are known as 'bigrams' and 'trigrams', respectively. Phrases are one of the most powerful ways to capture domain or technical language, since these compounded terms arise through the use and consensus of their users. Some phrases help disambiguate specific entities or places, as for example when 'river', 'state', 'university' or 'buckeyes' is combined with the term 'ohio'.

Generally, most embeddings or corpora do not include n-grams in their initial preparation. But, for the reasons above, and from our experience with the usefulness of n-grams in text retrieval, we decided to include phrase identification and extraction as part of our preprocessing.

Again, gensim comes with a phrase identifier, which learns its phrases from the text supplied to it (as with all gensim models, you can re-train and tune these models as you gain experience and want them to perform differently). The main work of this routine is the <code>ngram</code> call, wherein term adjacency is used to construct paired term identifications. Here is the code and settings for our first pass with this function to create our initial bigrams from the stopped input text:

In [2]:
from gensim.models.phrases import Phraser, Phrases
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-bigram.txt'

# Read all articles into memory (one per line); the stream is iterated twice,
# once to train the phrase model and once to apply it
with smart_open(in_f, 'r', encoding='utf-8') as documents:
    sentence_stream = [doc.split(' ') for doc in documents]
common_terms = ['aka']
# min_count and threshold control how readily adjacent terms are joined
ngram = Phrases(sentence_stream, min_count=3, threshold=10, max_vocab_size=80000000, 
                delimiter=b'_', common_terms=common_terms)
ngram = Phraser(ngram)
content = list(ngram[sentence_stream])
with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in content:
        try:
            line = ' '.join(line)             # the last token still carries the newline
            line = line.replace(' s ', ' ')   # drop a stray 's' without fusing its neighbors
            output.write(line)
        except Exception as e:
            print('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('ngrams calculated for ' + str(i) + ' articles')
    print('Calculated ngrams for ' + str(i) + ' articles;')
    print('Processing complete!')

ngrams calculated for 10000 articles
ngrams calculated for 20000 articles
ngrams calculated for 30000 articles
Calculated ngrams for 31157 articles;
Processing complete!


This routine takes about 14 minutes to run on my laptop, with the settings as shown. Note in the routine where we set the delimiter to be the underscore character; this joining character is how we recognize a bigram in the output.
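
To get a feel for how <code>min_count</code> and <code>threshold</code> interact, here is a rough sketch of the scoring formula that gensim's documentation gives for its default phrase scorer (after Mikolov et al.). This is my own paraphrase in plain Python, not gensim's code; the function names and the counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram: (count_ab - min_count) / count_a / count_b * vocab_size."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def accept(count_a, count_b, count_ab, vocab_size, min_count, threshold):
    """Assumed acceptance rule: keep the bigram only if its score exceeds the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold

# Hypothetical counts: word A seen 100 times, word B 50 times, the pair 20 times,
# in a vocabulary of 10,000 terms
print(phrase_score(100, 50, 20, 10000, min_count=3))          # settings from the routine above
print(accept(100, 50, 20, 10000, min_count=3, threshold=10))
```

Under this formula, a pair that occurs fewer than <code>min_count</code> times scores below zero, and raising either setting makes the routine more selective about which pairs it joins.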

Once this routine finishes, we can take its output and re-use it as input to a subsequent run. This time we will be producing trigrams, wherever a new term can be matched to an existing bigram. Generally, we set our thresholds and minimum counts higher for this pass. In our case, the new settings are <code>min_count=8, threshold=50</code>. The trigram analysis takes 19 minutes to run.
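
The reason a second pass yields trigrams is that the first pass leaves each bigram behind as a single token. A toy sketch of that idea follows; the greedy joining function and the hand-picked phrase sets are hypothetical stand-ins for what the trained model decides.

```python
def join_phrases(tokens, phrases, delimiter='_'):
    """Greedy left-to-right join of adjacent token pairs that appear in `phrases`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2                      # both tokens were consumed by the join
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = 'fast atom bombardment mass spectrometry'.split()

# Pass 1: pairs of plain words become bigrams
pass_1 = join_phrases(tokens, {('fast', 'atom'), ('mass', 'spectrometry')})
# Pass 2: a bigram token paired with a plain word becomes a trigram
pass_2 = join_phrases(pass_1, {('fast_atom', 'bombardment')})

print(pass_1)
print(pass_2)
```

The same mechanism means a second pass can also join two adjacent bigrams into a four-word phrase.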

We have now completed our preprocessing steps for the embedding models we introduce in the next installment.

### Additional Documentation
Here are many supplementary resources useful to the environment and natural language processing capabilities introduced in this installment. 

- [Transformers: State-of-the-art Natural Language Processing](http://arxiv.org/abs/1910.03771)
- [Corpus from a Wikipedia Dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html)
- [Building a Wikipedia Text Corpus for Natural Language Processing](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).

#### PyTorch and pandas
- [Convert Pandas Dataframe to PyTorch Tensor](https://stackoverflow.com/questions/50307707/convert-pandas-dataframe-to-pytorch-tensor) - pandas &rarr; numpy &rarr; torch
- [PyTorch Dataset: Reading Data Using Pandas vs. NumPy](https://jamesmccaffrey.wordpress.com/2020/08/31/pytorch-dataset-reading-data-using-pandas-vs-numpy/)
- See [**CWPK #56**](https://www.mkbergman.com/2404/cwpk-56-graph-visualization-and-extraction/) for pandas/networkx reads and imports.

#### PyTorch Resources and Tutorials
- [The Most Complete Guide to PyTorch for Data Scientists](https://www.kdnuggets.com/2020/09/most-complete-guide-pytorch-data-scientists.html) provides the basics of tensors and tensor operations and then provides a high-level overview of the PyTorch capabilities
- [Awesome-Pytorch-list](https://awesomeopensource.com/project/bharathgs/Awesome-pytorch-list)  provides multiple resource categories, with 230 in related packages alone
- The official [PyTorch documentation](https://pytorch.org/docs/stable/index.html)
- [Building Efficient Custom Datasets in PyTorch](https://towardsdatascience.com/building-efficient-custom-datasets-in-pytorch-2563b946fd9f)
- [Getting Started with PyTorch: A Deep Learning Tutorial](https://adatis.co.uk/getting-started-with-pytorch-a-deep-learning-tutorial/)
- [Incredible PyTorch](https://www.ritchieng.com/the-incredible-pytorch/) is a curated list of tutorials, papers, projects, communities and more relating to PyTorch
- [Learning PyTorch with Examples](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html).

#### spaCy and gensim
- [Natural Language in Python using spaCy: An Introduction](https://blog.dominodatalab.com/natural-language-in-python-using-spacy/)
- [Gensim Tutorial – A Complete Beginners Guide](https://www.machinelearningplus.com/nlp/gensim-tutorial/) is excellent and comprehensive.

 <div style="background-color:#ffecec; border:1px dotted #f5aca6; vertical-align:middle; margin:15px 60px; padding:8px;"> 
  <span style="font-weight: bold;">NOTE:</span> This article is part of the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/" style="font-style: italic;">Cooking with Python and KBpedia</a> series. See the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/"><strong>CWPK</strong> listing</a> for other articles in the series. <a href="http://kbpedia.org/">KBpedia</a> has its own Web site. The <em>cowpoke</em> Python <a href="https://github.com/Cognonto/cowpoke">code listing covering the series</a> is also available from GitHub.
  </div>

<div style="background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;"> 

<span style="font-weight: bold;">NOTE:</span> This <strong>CWPK 
installment</strong> is available both as an online interactive
file <a href="https://mybinder.org/v2/gh/Cognonto/CWPK/master" ><img src="https://mybinder.org/badge_logo.svg" style="display:inline-block; vertical-align: middle;" /></a> or as a <a href="https://github.com/Cognonto/CWPK" title="CWPK notebook" alt="CWPK notebook">direct download</a> to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>

<div style="background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;"> 
<div style="float: left; margin-right: 5px;"><img src="http://kbpedia.org/cwpk-files/warning.png" title="Caution!" width="32" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href="mailto:mike@mkbergman.com">notify me</a> should you make improvements.    

</div>