# Let's build and initialise a MedCAT model!

### First we need to install MedCAT


In [1]:
# Install MedCAT
! pip install medcat==1.2.7
# Get the spaCy model
! python -m spacy download en_core_web_md

Collecting medcat==1.2.7
  Downloading medcat-1.2.7-py3-none-any.whl (141 kB)
Collecting psutil<6.0.0,>=5.8.0
  Downloading psutil-5.9.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (280 kB)
Collecting gensim~=4.1.2
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting spacy<3.1.4,>=3.1.0
  Downloading spacy-3.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting elasticsearch>=7.10
  Downloading elasticsearch-8.0.0-py3-none-any.whl (369 kB)
Collecting numpy<1.22.9,>=1.21.4
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.ma

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


**If you are on Colab, restart the runtime now. This is sometimes necessary after installing models.**

In [2]:
import pandas as pd
import numpy as np

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.config import Config
from medcat.cdb_maker import CDBMaker
from medcat.cat import CAT

In [3]:
DATA_DIR = "./data/"

In [4]:
# Download the data files if you are on Google Colab, otherwise skip this step
!wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_simple.csv -P ./data/
!wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_advanced.csv -P ./data/
!wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/vocab_data.txt -P ./data/

--2022-02-15 14:55:02--  https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_simple.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50 [text/plain]
Saving to: ‘./data/cdb_simple.csv’


2022-02-15 14:55:02 (2.10 MB/s) - ‘./data/cdb_simple.csv’ saved [50/50]

--2022-02-15 14:55:02--  https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_advanced.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 150 [text/plain]
Saving to: ‘./data/cdb_advanced

# MedCAT Components
A MedCAT model requires three components to run:
1. Vocab
2. CDB
3. Config (the CDB configuration)

## Building a Vocabulary

The first of the two required models when running MedCAT is a Vocabulary model (Vocab). It is used for two things: (1) spell checking and (2) word embeddings.

The Vocab is very simple and you can easily build it from a file that is structured as below:
```
<token>\t<word_count>\t<vector_embedding_separated_by_spaces>
```
`token` - Usually a word or subword if you are using Byte Pair Encoding or something similar.

`word_count` - The count for this word in your dataset or in any large dataset (wikipedia also works nicely).

`vector_embedding_separated_by_spaces` - A precalculated vector embedding, e.g. from Word2Vec or BERT.

---
An example with 3-dimension embedding would be:
```
house	34444	 0.3232 0.123213 1.231231
dog	14444	0.76762 0.76767 1.45454
.
.
.
```
The file is essentially a TSV, but it must not have a header row.

---

**NOTE**: If spelling is important for your use-case, take care that there are no misspelt words in the Vocab.
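
To make the file layout concrete, here is a minimal sketch in plain Python that writes the two example rows to a temporary file and parses them back. It does not use MedCAT at all; the helper `parse_vocab_line` and the file name `vocab_example.txt` are hypothetical and only illustrate the three tab-separated fields described above.

```python
import os
import tempfile


def parse_vocab_line(line):
    """Split one tab-separated vocab row into (token, count, vector). Hypothetical helper."""
    token, count, vector = line.rstrip("\n").split("\t")
    # The embedding values are separated by single spaces; strip() tolerates stray padding.
    return token, int(count), [float(x) for x in vector.strip().split(" ")]


rows = [
    "house\t34444\t0.3232 0.123213 1.231231",
    "dog\t14444\t0.76762 0.76767 1.45454",
]

# Write the rows without a header line, then read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab_example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(rows) + "\n")
    with open(path, encoding="utf-8") as f:
        parsed = [parse_vocab_line(line) for line in f if line.strip()]

for token, count, vector in parsed:
    print(token, count, vector)
```

Every row must carry a vector of the same length, since all words share one embedding space.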

In [5]:
# Let's have a look at an example. I've created a small vocabulary file with only 2 words (the ones from above).
# Let's create a vocabulary from these two words.

vocab = Vocab()
vocab.add_words(DATA_DIR +'vocab_data.txt', replace=True)

**And that is everything. With this we have built our vocab, and no further training is necessary.**

---

A couple of useful functions of the vocab are presented below

In [6]:
# To see the words in the vocab
vocab.vocab.keys()

dict_keys(['house', 'dog'])

In [7]:
vocab.vocab

{'dog': {'cnt': 14444, 'ind': 1, 'vec': array([0.76762, 0.76767, 1.45454])},
 'house': {'cnt': 34444,
  'ind': 0,
  'vec': array([0.3232  , 0.123213, 1.231231])}}

In [8]:
# If you want to add words manually (one by one) use:
vocab.add_word("test", cnt=31, vec=[1.42, 1.44, 1.55], replace=True)
vocab.vocab.keys()

dict_keys(['house', 'dog', 'test'])

In [9]:
# To get the vector of a word use:
vocab.vec("house")

array([0.3232  , 0.123213, 1.231231])

In [10]:
# Or to get the count
vocab['house']

34444

In [11]:
# To check if a word is in the vocab:
"house" in vocab

True

### Before we save the vocab model, we need to create the unigram table for negative sampling

In [12]:
# This is necessary after each change of the vocabulary (when we add new words)
vocab.make_unigram_table()
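
For intuition, the sketch below shows how a unigram table for negative sampling is commonly built in word2vec-style training. It is a plain-Python illustration and not MedCAT's code: the function `build_unigram_table`, the table size of 1000, and the smoothing exponent of 0.75 are assumptions chosen for the example.

```python
import random


def build_unigram_table(counts, table_size=1000, power=0.75):
    """Return a list in which each word appears in proportion to count ** power."""
    weights = {word: cnt ** power for word, cnt in counts.items()}
    total = sum(weights.values())
    table = []
    for word, weight in weights.items():
        # Frequent words get more slots; the exponent dampens the gap to rare words.
        table.extend([word] * int(weight / total * table_size))
    return table


# Word counts taken from the vocabulary built above.
counts = {"house": 34444, "dog": 14444, "test": 31}
table = build_unigram_table(counts)

# Drawing a negative sample is then a uniform pick from the table.
rng = random.Random(0)
negatives = [rng.choice(table) for _ in range(5)]
print(len(table), negatives)
```

Because the table is derived from the word counts, it goes stale whenever words are added, which is why it has to be rebuilt after each change of the vocabulary.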

### Save the Vocab model

In [13]:
vocab.save(DATA_DIR + "vocab.dat")

### Load the Vocab model

In [14]:
vocab = Vocab.load(DATA_DIR + "vocab.dat")

## Building the Concept Database (CDB)

The second model we are going to need when using MedCAT is the Concept Database (CDB). This database holds a list of all concepts that we would like to detect and link to. For many medical use cases we would use large databases like the UMLS or SNOMED CT. However, MedCAT can be used with any database, no matter how large or small it is.

To prepare the CDB we start off with a CSV with the following structure:
```
cui,name
1,kidney failure
7,CoVid 2
7,coronavirus
```
This is the most basic version of the CSV file. It has only two columns:

`cui` - The concept unique identifier, this is simply an `ID` in your database.

`name` - String/Name of that concept. It is important to write all possible names and abbreviations for a concept of interest.

If you have a concept that can be recognised through multiple different names (like the one above with cui=7), you can simply add multiple rows with the same concept ID and MedCAT will merge that during the build phase.
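
The merging of rows that share a concept ID can be pictured with the standard library alone. This is a minimal sketch of the idea, not how `CDBMaker` is implemented; the variable names are illustrative.

```python
import csv
import io
from collections import defaultdict

# The basic CSV from above, held in memory instead of a file.
csv_text = """cui,name
1,kidney failure
7,CoVid 2
7,coronavirus
"""

# Collect every name under its concept ID; repeated IDs end up in one entry.
cui_to_names = defaultdict(set)
for row in csv.DictReader(io.StringIO(csv_text)):
    cui_to_names[row["cui"]].add(row["name"])

print(dict(cui_to_names))
```

Both rows with `cui` 7 land in a single set, so the concept ends up with two names.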

## The Full CSV Specification
```
cui,name,ontologies,name_status,type_ids,description
1,Kidney Failure,SNOMED,P,T047,kidneys stop working
.
.
.
```
The rest of the fields are optional; each can be included or left out of your CSV:

`ontologies` - Source ontology, e.g. HPO, SNOMED, HPC,...

`name_status` - Term type, e.g. P for Primary Name. Primary names are important, and I would always recommend adding this field when creating your CDB. It helps distinguish between synonyms.

`type_ids` - Type IDs are the broad categories that a concept falls under. They are used to quickly filter for concepts in a specific category. In UMLS this could be the semantic type identifier, e.g. T047. A list of all semantic types can be found [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt).
In SNOMED one could use the semantic tags, e.g. (Disease). A list of all SNOMED semantic tags can be found [here](https://confluence.ihtsdotools.org/display/DOCGLOSS/semantic+tag).


`description` - Description of this concept

***Note***: If one concept has multiple names, you can also separate the different names by a "|" - pipe - symbol 
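
As a rough illustration of the pipe convention, the sketch below expands one row of the full specification into one `(cui, name)` pair per name. The row itself is hypothetical: the extra names after the pipes are invented for the example, and the parsing is plain Python rather than MedCAT's own code.

```python
import csv
import io

# Hypothetical row: the names after the pipes are made up for illustration.
csv_text = """cui,name,ontologies,name_status,type_ids,description
1,Kidney Failure|failure of kidneys|KF,SNOMED,P,T047,kidneys stop working
"""

expanded = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # One cell can hold several names separated by "|".
    for name in row["name"].split("|"):
        expanded.append((row["cui"], name.strip()))

print(expanded)
```

Writing the names this way is equivalent to adding one row per name with the same `cui`.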

In [15]:
cdb_simple = pd.read_csv(DATA_DIR + 'cdb_simple.csv')


In [16]:
cdb_simple

|   | cui | name |
|---|-----|------|
| 0 | 1 | kidney failure |
| 1 | 7 | CoVid 2 |
| 2 | 7 | coronavirus |


Let's try building our own concept database from a simple CSV.

In [17]:
# First initialise the default configuration
config = Config()
config.general['spacy_model'] = 'en_core_web_md'
maker = CDBMaker(config)

In [18]:
# Create a list of the CSV files that will be used to build our CDB
csv_path = [DATA_DIR + 'cdb_advanced.csv', DATA_DIR + 'cdb_simple.csv']

# Create your CDB
cdb = maker.prepare_csvs(csv_path, full_build=True)

Started importing concepts from: ./data/cdb_advanced.csv
Current progress: 0% at 0.000s per 0 rows
Current progress: 50% at 0.024s per 0 rows
Started importing concepts from: ./data/cdb_simple.csv
Current progress: 0% at 0.000s per 0 rows
Current progress: 33% at 0.009s per 0 rows
Current progress: 67% at 0.008s per 0 rows


**That is all, nothing else is necessary to build the CDB**

---

Some useful functions of the cdb are below

In [19]:
# To display all names and their CUIs in the CDB
print(cdb.name2cuis)

{'kidney~failure': ['1'], 'failure~of~kidneys': ['1'], 'failure~of~kidney': ['1'], 'kf': ['1'], 'k~.~failure': ['1'], 'covid~2': ['7'], 'coronavirus': ['7']}


In [20]:
# To display all unique cuis and corresponding names in the db 
print(cdb.cui2names)

{'1': {'failure~of~kidney', 'failure~of~kidneys', 'kidney~failure', 'kf', 'k~.~failure'}, '7': {'covid~2', 'coronavirus'}}
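
The two mappings printed above hold the same information in opposite directions. As a small sketch in plain Python (no MedCAT needed), the name-to-CUI dictionary copied from the output above can be inverted to recover the CUI-to-names view. Note that in the stored names the tokens are lower-cased and joined with `~`.

```python
from collections import defaultdict

# Copied from the printed output of cdb.name2cuis above.
name2cuis = {
    'kidney~failure': ['1'], 'failure~of~kidneys': ['1'], 'failure~of~kidney': ['1'],
    'kf': ['1'], 'k~.~failure': ['1'], 'covid~2': ['7'], 'coronavirus': ['7'],
}

# A name can in principle map to several CUIs, so each name holds a list.
cui2names = defaultdict(set)
for name, cuis in name2cuis.items():
    for cui in cuis:
        cui2names[cui].add(name)

print(dict(cui2names))
```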


In [21]:
# To display cui to preferred name
print(cdb.cui2preferred_name)


{'1': 'Kidney Failure'}


In [22]:
# We have a link from cui to type ids
print(cdb.cui2type_ids)


{'1': {'T047'}, '7': set()}


### Save the Concept Database model

In [23]:
cdb.save(DATA_DIR + "cdb.dat")

### Load the Concept Database model

In [24]:
cdb = CDB.load(DATA_DIR + "cdb.dat")

## Setting the CDB configuration

The CDB config sets the model parameters.
This allows you to tailor the model to your own specific use case, although the default configuration will suit the majority of use cases.

In [25]:
# Inspect the CDB configuration to see the available options and their default values
??cdb.config

In [26]:
# Set a couple of parameters. They are usually set via environments, but
# here we will do it explicitly. You can read more about each option in the
# medcat repository: https://github.com/CogStack/MedCAT

cdb.config.ner['min_name_len'] = 2
cdb.config.ner['upper_case_limit_len'] = 3
cdb.config.general['spell_check'] = True
cdb.config.linking['train_count_threshold'] = 10
cdb.config.linking['similarity_threshold'] = 0.3
cdb.config.linking['train'] = True
cdb.config.linking['disamb_length_limit'] = 5
cdb.config.general['full_unlink'] = True

Note: Don't forget to save the cdb with the new configurations if you want to reuse them!

# Create a MedCAT model pack

A MedCAT model pack is an easy way to store all the various components of a MedCAT model in one place.

This includes the CDB, CDB configurations, Vocab, and even various MetaCAT models! We will learn more about the latter in the following tutorials.

In [27]:
# Initialise the model
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

In [28]:
# Create and save a model pack
cat.create_model_pack(DATA_DIR + "my_first_medcat_modelpack")

Please consider populating the version information [description, performance, location, ontology] in cat.config.version
  avg = a.mean(axis)
  
Please consider updating [description, performance, location, ontology] in cat.config.version
This will save all models into a zip file, can take some time and require quite a bit of disk space.
{
  "Model ID": "1c110382a5935e53",
  "Last Modifed On": "15 February 2022",
  "History (from least to most recent)": [],
  "Description": "No description",
  "Source Ontology": null,
  "Location": null,
  "MetaCAT models": {},
  "Basic CDB Stats": {
    "Number of concepts": 2,
    "Number of names": 7,
    "Number of concepts that received training": 0,
    "Number of seen training examples in total": 0,
    "Average training examples per concept": NaN
  },
  "Performance": {
    "ner": {},
    "meta": {}
  },
  "Important Parameters (Partial view, all available in cat.config)": {
    "config.ner['min_name_len']": {
      "value": 2,
      "descriptio

'medcat_model_pack_1c110382a5935e53'

# End

This is everything you need to create your own MedCAT models. In the following tutorials you will learn how to use model packs to train models and annotate documents.