# RaJoLink workflow notebook tutorial based on Python script modules for LBD

RaJoLink is a method developed for **open** literature-based discovery (LBD). In contrast to Swanson's `ABC` model, RaJoLink focuses on a semi-automatic identification of candidates ($a$) that might be related to an investigated phenomenon ($c$). This selection is based on the identification of **rare terms** from the literature on $c$. At the heart of the RaJoLink strategy is the rational assumption that if literatures on several rare terms have a term in common, this term is a candidate for the term $a$.

The **RaJoLink** method comprises three main steps: **Ra**, **Jo**, and **Link**, which focus on rare terms, joint terms and linking terms, respectively. The Ra step searches the literature on phenomenon $c$ for unique or rare terms. The Jo step reviews articles related to these rare terms and identifies joint terms (candidates for $a$) that appear in them, suggesting the hypothesis that $c$ is related to $a$. The Link step then looks for $b$-terms that bridge the literature on a selected $a$-term and the $c$-term; $b$-terms are the candidates that can possibly explain the link.

The identification of rare terms in the Ra step is based on the statistical principle of outliers. Just as outliers in data can lead to significant discoveries, rare terms in the literature can pave the way for innovative connections. A term is considered rare if it occurs in $n$ or fewer data sets, where $n$ is adjustable depending on the experiment or context.
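
The rarity criterion can be illustrated with a minimal, self-contained sketch. The helper `rare_terms` and the toy documents are hypothetical and are not part of the LBD framework modules; the sketch assumes that "data sets" means documents and that each document is already tokenized.

```python
from collections import Counter

def rare_terms(docs, n=1):
    """Return the terms that occur in at most n documents (toy illustration)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df <= n)

# Invented toy corpus: each inner list stands for one tokenized document.
toy_docs = [
    ["autism", "synaptophysin", "brain"],
    ["autism", "brain", "serotonin"],
    ["autism", "serotonin", "calcium"],
]

rare = rare_terms(toy_docs, n=1)
print(rare)  # ['calcium', 'synaptophysin']
```

Raising `n` widens the set of candidates, at the cost of more terms for the expert to review.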

While Swanson's ABC model connects two disjoint literatures with the term $b$, RaJoLink uses rare terms to find the term $a$, which bridges the literatures with selected rare terms.
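
The idea behind the Jo step can be sketched as a set intersection over the vocabularies of the rare-term literatures. This is only an illustration under stated assumptions: `joint_terms` is a hypothetical helper, the documents are invented toy data, and the rare-term names are borrowed from the experiment described later in this tutorial.

```python
def joint_terms(literatures):
    """Return the terms shared by the literatures of all rare terms."""
    # One vocabulary (set of terms) per rare-term literature.
    vocabularies = [set().union(*docs) for docs in literatures.values()]
    return sorted(set.intersection(*vocabularies))

# Invented toy literatures: rare term -> list of tokenized documents.
toy_literatures = {
    "lactoylglutathione": [["calcineurin", "glyoxalase"], ["oxidative", "stress"]],
    "synaptophysin": [["calcineurin", "vesicle"], ["synapse"]],
    "calcium channels": [["calcineurin", "neuron"], ["stress", "signal"]],
}

joint = joint_terms(toy_literatures)
print(joint)  # ['calcineurin']
```

In practice the expert still decides which of the joint terms is a plausible candidate for $a$; the intersection only narrows the search.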

In this particular implementation, the search for b-terms is limited to the expert-selected [MeSH](https://www.nlm.nih.gov/mesh/meshhome.html) words for Enzymes and Coenzymes [D08] and Amino Acids, Peptides, and Proteins [D12]. The main purpose of MeSH filtering is to reduce the vocabulary size, which in turn improves the time complexity of the LBD algorithms used. Considering only the words from the two MeSH categories also reduces the effort for the human expert in guiding and evaluating the results.

<hr>

[1] Petrič, I., Urbančič, T., Cestnik, B., & Macedoni-Lukšič, M. (2009). Literature mining method RaJoLink for uncovering relations between biomedical concepts. Journal of Biomedical Informatics, 42(2), 219–227.

Import and initialize the `logging` library to track the execution of the scripts.

In [None]:
import logging

# Initialize logging with a basic configuration
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s: %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

Import LBD components from the framework notebooks. The description of the individual components from the framework notebooks can be found in the respective notebooks. 

The purpose of the **import_ipynb** library is to enable the direct import of Jupyter notebooks as modules so that code, functions, and classes defined in one notebook can be easily reused in other notebooks or Python scripts.

In [None]:
import import_ipynb
import LBD_01_data_acquisition
import LBD_02_data_preprocessing
import LBD_03_feature_extraction
import LBD_04_text_mining
import LBD_05_results_analysis
import LBD_06_visualization

Import additional Python libraries.

In [None]:
import nltk
import numpy as np
import itertools
import pandas as pd
# import pickle
# import json
import spacy
from typing import List, Dict

When you run the script for the first time, `en_core_web_md`, a medium-sized English model trained on written web text (blogs, news, comments), must be downloaded with the command:

```python
!python -m spacy download en_core_web_md
```

For the first run, you must therefore uncomment the first line in the next cell.

In [None]:
#!python -m spacy download en_core_web_md 

nlpr = spacy.load("en_core_web_md")

# Step Ra

Define the name of the domain $c$ and load the corresponding text from the file. The expected file format is as follows:

1. The file is encoded in ASCII (if it is in UTF-8 or another encoding, it should be converted to ASCII).
2. Each line in the file represents one document. The words in each document are separated by spaces. The length of the individual documents may vary.
3. The first word in each line is the **unique id**, followed by a semicolon. Normally the **pmid** (PubMed ID) can be used for this purpose.
4. The second word in each line can optionally stand for a predefined domain (or class) of the document. In this case, the second word is preceded by **!**. For example, if the file contains documents that originate from two domains, e.g. *migraine* and *magnesium*, the second word in each line is either **!migraine** or **!magnesium**. If the file contains documents that originate from *autism* and *calcineurin*, the second word in each line will be either **!autism** or **!calcineurin**.
5. If the second word is not preceded by **!**, it will be considered the first word of the document. In this case, the document will be given the domain **!NA** (**not applicable** or **not available**).
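
The format rules above can be sketched as a small parser. The function `parse_line` is hypothetical and is not the loader used by `LBD_01_data_acquisition`; the ids and texts in the examples are invented.

```python
def parse_line(line):
    """Split one input line into (doc_id, domain, text) following the rules above."""
    first, _, rest = line.strip().partition(" ")
    doc_id = first.rstrip(";")  # the unique id is followed by a semicolon
    second, _, remainder = rest.partition(" ")
    if second.startswith("!"):
        # The second word names the predefined domain of the document.
        return doc_id, second, remainder
    # No domain given: the second word already belongs to the document text.
    return doc_id, "!NA", rest

example_a = parse_line("12345; !autism children with autism spectrum disorder")
example_b = parse_line("67890; calcineurin is a protein phosphatase")
print(example_a)
print(example_b)
```

The first call yields the domain `!autism`; the second falls back to `!NA` because its second word does not start with `!`.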

**A background story for this experiment**

First, we selected *Autism* as our domain of interest. Then we searched PubMed and collected 214 full-text documents on autism from the decade before 2006 in PubMed Central. After collecting the documents, we converted them from HTML and PDF format to plain text and made sure that each document was formatted consistently for further analysis. The 214 full text documents are stored in the file `input/214Texts.txt`.

We extracted around 2000 unique terms, focusing particularly on rare terms from the fields of amino acids, peptides and proteins to assess their potential relevance to autism research. Notable rare terms such as *lactoylglutathione*, *synaptophysin* and *calcium channels* appeared in our dataset.

The selected rare terms *lactoylglutathione*, *synaptophysin* and *calcium channels* prompted our team's autism expert to specifically investigate their associations with *calcineurin* (as it appeared as a joint term in all literatures of the selected rare terms). *Calcineurin* is a protein phosphatase with a high prevalence in the brain.

In [None]:
domainName = 'Autism'
fileName = 'input/214Texts.txt'
lines = LBD_01_data_acquisition.load_data_from_file(fileName)
# display the first 7 lines of the document
lines[:7]

The next script is part of a pipeline that pre-processes medical literature data and focuses on terms from specific MeSH (Medical Subject Headings) categories. Here it loads and preprocesses the MeSH terms for Enzymes and Coenzymes [D08] and Amino Acids, Peptides, and Proteins [D12]. The input file `MESH_D08_D12.txt` was created by selecting the relevant [D08] and [D12] terms from the XML file `desc2024.xml`, which was downloaded from <a href="https://www.nlm.nih.gov/databases/download/mesh.html">the MeSH website</a>. After preprocessing, the input file contains 3534 words, which are used as a filter in the further preprocessing of the autism-related files.

**Functionality**

1. *Load data*:
 The script starts by loading MeSH data from a specified file:
 ```python
 mesh_lines = LBD_01_data_acquisition.load_data_from_file("input/MESH_D08_D12.txt")
 ```
 The file `MESH_D08_D12.txt` contains words that refer to certain MeSH categories (D08 and D12).

2. *Dictionary construction*:
 In the next step, a dictionary is constructed from the loaded lines:
 ```python
 mesh_docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(mesh_lines)
 ```
 This converts the list of lines into a dictionary (`mesh_docs_dict`) in which the keys represent document identifiers and the values contain the corresponding text. This structure is more efficient for subsequent text processing tasks.

3. *Pre-processing of documents*:
 In the pre-processing phase, the text data is cleaned up and standardized:
 ```python
 mesh_prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
 mesh_docs_dict, keep_list=[], remove_list=[], mesh_word_list=[], \
 cleaning=True, remove_stopwords=True, lemmatization=True, min_word_length=5)
 ```
 Various pre-processing methods are used here:
 - *Cleaning*: General cleaning of the text.
 - *Stopword Removal*: Removal of frequent words that do not provide meaningful information (e.g. "the", "and").
 - *Lemmatization*: Reduction of words to their basic or root form (e.g. "running" becomes "run").
 - *Minimum Word Length*: Filtering out words with fewer than five characters.
 These steps prepare the text data for further analysis.

4. *Extract pre-processed documents*:
 After preprocessing, the script extracts the cleaned text back into a list:
 ```python
 mesh_prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(mesh_prep_docs_dict)
 ```
 This conversion is necessary for feature extraction, where the text must be in a list format.

5. *Feature extraction (Bag of Words)*:
 The last part of the script creates a Bag of Words (BoW) model:
 ```python
 mesh_word_list, mesh_bow_matrix = LBD_03_feature_extraction.create_bag_of_words(mesh_prep_docs_list, 1, 1)
 ```
 The BoW model is a text representation technique in which:
 - *`mesh_word_list`* contains the unique words identified in the documents.
 - *`mesh_bow_matrix`* is a matrix in which each row corresponds to a document and each column represents a word, with the matrix values indicating the frequency of words in the documents.
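
The structure of a BoW representation can be sketched in plain Python. This is a simplified stand-in for illustration only: `bag_of_words` is a hypothetical helper that handles single words, and the actual `create_bag_of_words` may order the vocabulary differently or return another matrix type.

```python
def bag_of_words(docs):
    """Build a sorted vocabulary and a document-by-word count matrix."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            matrix[i][index[word]] += 1  # row = document, column = word
    return vocabulary, matrix

# Invented toy documents, already preprocessed.
toy_docs = ["kinase protein kinase", "protein peptide"]
vocab, bow = bag_of_words(toy_docs)
print(vocab)  # ['kinase', 'peptide', 'protein']
print(bow)    # [[2, 0, 1], [0, 1, 1]]
```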

**Practical applications**

- *Biomedical research*: Researchers can use this script to pre-process and analyze large datasets of medical literature to identify new links between diseases, drugs and biological processes.
- *Text mining and NLP*: The script can be customized for more comprehensive text mining tasks, such as sentiment analysis, topic modeling or other areas that require structured text representation.

**Use**

To use this script in your workflow, you need to make sure you have the appropriate data file (`MESH_D08_D12.txt`) and the modules for data collection, preprocessing and feature extraction. After running the script, you will receive a vocabulary list and a corresponding BoW matrix that you can analyze further.

In [None]:
mesh_lines = LBD_01_data_acquisition.load_data_from_file("input/MESH_D08_D12.txt")

mesh_docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(mesh_lines)

keep_list = []
remove_list = []
mesh_prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
    mesh_docs_dict, keep_list = keep_list, remove_list = remove_list, mesh_word_list = [], \
    cleaning = True, remove_stopwords = True, lemmatization = True, min_word_length = 5)

mesh_prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(mesh_prep_docs_dict)

mesh_word_list, mesh_bow_matrix = LBD_03_feature_extraction.create_bag_of_words(mesh_prep_docs_list, 1, 1)
print('Number of terms in MESH D08 and D12 vocabulary: ', len(mesh_word_list))
print('First 7 words in the mesh_word_list:', mesh_word_list[:7])

The script in the next cell is used to prepare text data for further analysis in Literature-Based Discovery (LBD). The aim is to clean, standardize and structure the documents so that they are suitable for further tasks such as feature extraction, topic modeling and the discovery of hidden relationships in the literature. The script prepares the documents stored in `lines` in a dictionary and then processes the documents with the obtained MeSH word list of Enzymes and Coenzymes [D08] and Amino Acids, Peptides and Proteins [D12].

**Functionality**

1. *Creating a dictionary from raw data*:
 The script starts by converting a list of rows into a structured dictionary:
 ```python
 docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(lines)
 ```
 - *`construct_dict_from_list`*: This function takes the raw list of text lines (`lines`) and creates a dictionary (`docs_dict`) in which each entry typically represents a document, with a unique identifier as the key and the text of the document as the value.
 - This conversion is important because it puts the text data into a more manageable format that allows efficient processing and retrieval.

2. *Preprocessing of documents*:
 The script then applies various pre-processing steps to the documents:
 ```python
 keep_list = []
 remove_list = []
 prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
 docs_dict, keep_list=keep_list, remove_list=remove_list, mesh_word_list=mesh_word_list, \
 cleaning=True, remove_stopwords=True, lemmatization=True, \
 min_word_length=5, keep_only_nouns=False, keep_only_mesh=True, stemming=False, stem_type=None)
 ```
 - *Cleaning*: The text is cleaned to remove unwanted characters, punctuation and other errors.
 - *Remove stop words*: Frequent words that do not provide meaningful information (e.g. "the", "and") are removed.
 - *Lemmatization*: Words are reduced to their base or root form (e.g. "running" becomes "run") to ensure consistency.
 - *Minimum word length*: Words shorter than five characters are filtered out.
 - *MeSH-specific filtering*: The parameter `keep_only_mesh=True` ensures that only terms from the Medical Subject Headings (MeSH) vocabulary are considered in order to focus the analysis on relevant biomedical terminology.

This pre-processing step is important to reduce noise and focus on the most important terms, which improves the quality of subsequent analyses.

3. *Extract document IDs and processed text*:
 The script then extracts lists of document IDs and the corresponding preprocessed text:
 ```python
 ids_list = LBD_02_data_preprocessing.extract_ids_list(prep_docs_dict)
 prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(prep_docs_dict)
 ```
 - *`extract_ids_list`*: Returns a list of document IDs from the preprocessed dictionary to facilitate document lookup and management.
 - *`extract_preprocessed_documents_list`*: Extracts the cleaned and processed text for each document to prepare it for feature extraction or other analysis.

 By extracting these lists, the script organizes the data in a format that is easy to manipulate in subsequent steps, such as creating a Bag of Words (BoW) model or calculating TF-IDF scores.
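
The effect of `min_word_length` and `keep_only_mesh=True` can be illustrated with a minimal sketch. The helper `keep_mesh_tokens` and the toy vocabulary are hypothetical; the sketch skips cleaning, stop-word removal and lemmatization, so the real `preprocess_docs_dict` may produce different output.

```python
def keep_mesh_tokens(text, mesh_vocabulary, min_word_length=5):
    """Keep only tokens that are long enough and present in the MeSH vocabulary."""
    mesh = set(mesh_vocabulary)  # set lookup is fast for large vocabularies
    tokens = text.lower().split()
    kept = [t for t in tokens if len(t) >= min_word_length and t in mesh]
    return " ".join(kept)

# Invented toy vocabulary standing in for mesh_word_list.
toy_mesh = ["calcineurin", "synaptophysin", "kinase"]
filtered = keep_mesh_tokens("Calcineurin levels and kinase activity in autism", toy_mesh)
print(filtered)  # calcineurin kinase
```

Filtering against a restricted vocabulary is what shrinks each document to a short list of biomedical terms before the BoW matrix is built.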

**Applications**

- *Biomedical text mining*: This pre-processing approach is valuable in the biomedical field, where ensuring the relevance and accuracy of terms is critical to discovering new relationships between diseases, drugs and other biological concepts.
- *Data preparation for machine learning*: The cleaned and structured data generated by this script can be fed directly into machine learning models for tasks such as document classification or clustering.
- *Research and hypothesis generation*: By focusing on specific vocabularies such as MeSH, researchers can more effectively search the literature for new hypotheses or overlooked relationships.

**Use**

To use this script effectively:
1. *Prepare the data*: Make sure you have a list of raw text lines (`lines`) and a corresponding vocabulary list (e.g. `mesh_word_list`).
2. *Execute the preprocessing steps*: Run the script to clean, filter and structure the text data.
3. *Extract and analyze*: Use the extracted IDs and processed text for further analysis, e.g. to create models and visualizations or for exploratory research.

In [None]:
docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(lines)

keep_list = []
remove_list = []
prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
    docs_dict, keep_list = keep_list, remove_list = remove_list, mesh_word_list = mesh_word_list, \
    cleaning = True, remove_stopwords = True, lemmatization = True, \
    min_word_length = 5, keep_only_nouns = False, keep_only_mesh = True, stemming = False, stem_type = None)

ids_list = LBD_02_data_preprocessing.extract_ids_list(prep_docs_dict)
prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(prep_docs_dict)

The next three cells show the first dictionary entries, the document IDs (PubMed) and the pre-processed documents.

When displaying the dictionary entries, we can see the difference between the original and the pre-processed documents.

In [None]:
# display the first 7 dictionary items
dict(itertools.islice(prep_docs_dict.items(), 7))

In [None]:
# display the ids of the first 7 documents
ids_list[:7]

In [None]:
# display the preprocessed text for the first 7 documents
prep_docs_list[:7]

The next script continues the feature extraction process and refines the Bag of Words (BoW) model by filtering out less important terms and n-grams. It creates a BoW matrix from the list of pre-processed documents. It then removes n-grams that occur fewer than `min_count_ngram` times (in our case 3) in the entire document corpus. Words that are not contained in the MeSH list `mesh_word_list` are also removed. Reducing the vocabulary in this way improves the quality and relevance of the text representation and lets the following steps run faster.

**Functionality**

1. *Set parameters*:
    The script starts by setting the parameters for the n-gram size and the minimum document frequency:
    ```python
    ngram_size = 2
    min_df = 1
    ```
    - *`ngram_size`*: Specifies the maximum n-gram length. With a value of 2, the vocabulary contains single words as well as pairs of consecutive words (bigrams).
    - *`min_df`*: Specifies the minimum number of documents in which a word or n-gram must occur in order to be included in the initial vocabulary.

2. *Create Bag of Words representation*:
    The next step is to create the BoW model using the specified n-gram size:
    ```python
    word_list, bow_matrix = LBD_03_feature_extraction.create_bag_of_words(prep_docs_list, ngram_size, min_df)
    print('Number of terms in initial vocabulary with all n-grams: ', len(word_list))
    ```
    This function creates a vocabulary (`word_list`) from all terms and n-grams found in the preprocessed documents (`prep_docs_list`), together with the corresponding frequency matrix (`bow_matrix`). The output vocabulary includes all n-grams without filtering.

3. *Filtering low-frequency n-grams*:
    The script then filters out n-grams that occur less frequently than a certain threshold:
    ```python
    min_count_ngram = 3
    keep_only_mesh_II = True
    tmp_sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)
    tmp_sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)
    ```
    - *`min_count_ngram`*: Specifies the minimum number of occurrences of n-grams to keep.
    - *`keep_only_mesh_II`*: If true, only the n-grams with each word in MeSH are kept.
    - The script calculates two important metrics:
        - *document frequency*: How many documents contain each word or n-gram.
        - *Total frequency*: How often each word or n-gram appears in all documents.

4. *Filtering Based on Specific Criteria*:  
   The script applies a more sophisticated filtering process to refine the vocabulary:
   ```python
   tmp_filter_columns = []
   for i, word in enumerate(word_list):
       if not LBD_03_feature_extraction.word_is_nterm(word):
           if (not keep_only_mesh_II) or (word in mesh_word_list):
               tmp_filter_columns.append(i)
       else:
           if tmp_sum_count_word_in_docs[word] >= min_count_ngram:
               check_ngram = word.split()
               passed = True
               for check_word in check_ngram:
                   if keep_only_mesh_II:
                       if check_word not in mesh_word_list:
                           passed = False
               if check_ngram[0] == check_ngram[1]:
                   passed = False
               if passed:
                   tmp_filter_columns.append(i)
   ```
   This loop evaluates each term or n-gram in the vocabulary:
   - *Non-n-grams*: When `keep_only_mesh_II` is true, single words are retained only if they are in `mesh_word_list`; otherwise all single words are retained.
   - *n-grams*: Are retained if:
       - They occur at least `min_count_ngram` times in the corpus.
       - All component words are contained in `mesh_word_list` (checked only when `keep_only_mesh_II` is true).
       - The n-gram does not consist of repeated words (e.g. "word word").


5. *Applying the filters*:  
   The script then keeps all rows (documents) and only the selected columns (terms) of the BoW matrix:
   ```python
   tmp_filter_rows = list(range(len(ids_list)))  # keep all rows (documents)

   tmp_filtered_word_list, tmp_filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix_columns(
       word_list, bow_matrix, tmp_filter_rows, tmp_filter_columns)

   word_list = tmp_filtered_word_list
   bow_matrix = tmp_filtered_bow_matrix
   print('Number of terms in the preprocessed vocabulary after removing infrequent n-grams and non-MESH words: ', len(word_list))
   ```
   - *`filter_matrix_columns`*: Refines the BoW matrix by retaining only the selected words or n-grams that meet the filter criteria.
   - The updated vocabulary and matrix are then stored in `word_list` and `bow_matrix`, respectively.
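
The selection rules above can be condensed into a self-contained sketch. It is an illustration, not the framework code: `select_columns` is a hypothetical helper, it assumes that an n-gram is any term containing a space, it always applies the MeSH check, and it rejects any repeated word rather than comparing only the first two. The counts are invented.

```python
def select_columns(word_list, total_counts, mesh_vocabulary, min_count_ngram=3):
    """Return the indices of the terms that pass the filtering rules above."""
    mesh = set(mesh_vocabulary)
    keep = []
    for i, term in enumerate(word_list):
        parts = term.split()
        if len(parts) == 1:
            # Single word: keep only MeSH words.
            if term in mesh:
                keep.append(i)
        elif (total_counts[term] >= min_count_ngram      # frequent enough
              and all(part in mesh for part in parts)    # every word in MeSH
              and len(set(parts)) == len(parts)):        # no repeated words
            keep.append(i)
    return keep

# Invented toy vocabulary with total occurrence counts.
toy_words = ["kinase", "kinase kinase", "protein", "protein kinase",
             "kinase protein", "kinase autism", "autism"]
toy_counts = {"kinase": 9, "kinase kinase": 4, "protein": 7, "protein kinase": 5,
              "kinase protein": 2, "kinase autism": 6, "autism": 12}
kept = select_columns(toy_words, toy_counts, ["kinase", "protein"])
print([toy_words[i] for i in kept])  # ['kinase', 'protein', 'protein kinase']
```

Here "kinase kinase" is dropped as a repetition, "kinase protein" for being too infrequent, and the two terms containing "autism" because that word is not in the toy MeSH list.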

**Applications**

- *Medical text mining*: This filtering method is particularly useful in medical research, where the focus is on extracting and analyzing relevant biomedical terms and concepts.
- *Document classification*: By refining the feature set, this script can improve the performance of classifiers used in the categorization of scientific literature or other text corpora.
- *Network analysis*: The terms of the filtered vocabulary can serve as nodes in a network graph whose edges represent co-occurrence, which can be analyzed to detect hidden connections.


**Use**

To use this script effectively, you need to make sure you have a preprocessed document list (`prep_docs_list`) and a list of MeSH terms (`mesh_word_list`). Adjust the parameters like `ngram_size`, `min_df` and `min_count_ngram` to your specific needs. After running the script, you will get a filtered vocabulary and a corresponding BoW matrix, which is more suitable for further analysis such as clustering, topic modeling or discovering new hypotheses in biomedical research.

In [None]:
ngram_size = 2
min_df = 1

# BOW representation
word_list, bow_matrix = LBD_03_feature_extraction.create_bag_of_words(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary with all n-grams: ', len(word_list))

# remove nterms with frequency count less than min_count_ngram from vocabulary word_list and bow_matrix
min_count_ngram = 3
keep_only_mesh_II = True

tmp_sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

tmp_sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

tmp_filter_columns = []
for i, word in enumerate(word_list):
    if not LBD_03_feature_extraction.word_is_nterm(word):
        if (not keep_only_mesh_II) or (word in mesh_word_list):
            tmp_filter_columns.append(i)
    else:
        if tmp_sum_count_word_in_docs[word] >= min_count_ngram:
            check_ngram = word.split()
            passed = True
            for check_word in check_ngram:
                if keep_only_mesh_II:
                    if check_word not in mesh_word_list:
                        passed = False
            if check_ngram[0] == check_ngram[1]:
                passed = False
            if passed:
                tmp_filter_columns.append(i)

# keep all rows (documents); only the columns are filtered
tmp_filter_rows = list(range(len(ids_list)))

tmp_filtered_word_list, tmp_filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, bow_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
bow_matrix = tmp_filtered_bow_matrix

kom_text = ''
if keep_only_mesh_II:
    kom_text = ' and non-MeSH words'
print('Number of terms in the preprocessed vocabulary after removing infrequent n-grams', kom_text, ': ', len(word_list), sep='')

LBD_02_data_preprocessing.save_list_to_file(word_list, "output/_list.txt")
LBD_02_data_preprocessing.save_list_to_file(prep_docs_list, "output/_prep_list.txt")


The script in the next cell is a continuation of the text preprocessing pipeline that calculates the margins for the Bag of Words (BoW) matrix and optimizes the BoW matrix for better interpretability and analysis. By arranging the matrix to highlight the most important terms and documents, this script helps to recognize patterns in the data, which is a crucial step in LBD.

**Functionality**

1. *Counting word frequencies*:
   The script begins by calculating various frequency counts that provide insight into how words are distributed across documents:
   ```python
   sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)
   sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)
   sum_count_words_in_doc = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, bow_matrix)
   ```
   - *`sum_count_docs_containing_word`*: Counts how many documents each word appears in.
   - *`sum_count_word_in_docs`*: Totals the occurrences of each word across all documents.
   - *`sum_count_words_in_doc`*: Tallies the total number of words in each document.

   These metrics are essential for understanding the significance and distribution of terms within the corpus, which can guide further analysis.

2. *Displaying frequency counts*:
   The script then prints a subset of these frequency counts to give an overview of the data:
   ```python
   print('Number of documents in which each word is present: ', dict(itertools.islice(sum_count_docs_containing_word.items(), 7)))
   print('Number of occurrences of each word in all documents: ', dict(itertools.islice(sum_count_word_in_docs.items(), 7)))
   print('Number of words in each document: ', dict(itertools.islice(sum_count_words_in_doc.items(), 7)))
   ```
   - *`islice`* from `itertools` is used to print just the first few items, making it easier to inspect the data without overwhelming output.
   - These print statements help users quickly assess the distribution and frequency of words and documents in the BoW model.

3. *Optimizing the BoW matrix*:
   The script proceeds to rearrange the BoW matrix so that the most frequent words and documents are positioned at the top-left corner of the matrix:
   ```python
   filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
       LBD_02_data_preprocessing.sort_dict_by_value(sum_count_word_in_docs, reverse=True), word_list)
   filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
       LBD_02_data_preprocessing.sort_dict_by_value(sum_count_words_in_doc, reverse=True), ids_list)
   ```
   - *sorting*: The words and documents are sorted by their frequencies in descending order.
   - *filtering*: The indices of these sorted words and documents are then used to rearrange the BoW matrix.

   This step ensures that the most significant terms and documents are easily accessible, facilitating further analysis such as clustering, topic modeling, or visualization.

4. *Rearranging the matrix*:
   Finally, the script filters the matrix according to the computed order:
   ```python
   filtered_ids_list, filtered_word_list, filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix(
       ids_list, word_list, bow_matrix, filter_rows, filter_columns)
   ```
   - *`filter_matrix`*: This function reorders the BoW matrix based on the sorted indices, ensuring that the most relevant terms and documents are emphasized.

   The script then prints out the first few items in the reordered lists:
   ```python
   print('The first few documents in the rows of the filtered bow matrix: ', filtered_ids_list[:7])
   print('The first few words in the columns of the filtered bow matrix: ', filtered_word_list[:7])
   ```
   - This output allows users to verify that the matrix has been rearranged as intended, highlighting the most important elements of the dataset.
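
The margin computation and the reordering can be sketched with plain lists. The helpers `margins` and `reorder` are hypothetical stand-ins for the framework functions, which work on the framework's own matrix type and return dictionaries; the toy matrix is invented.

```python
def margins(matrix):
    """Return (documents containing each word, total count per word, total count per document)."""
    columns = list(zip(*matrix))
    docs_containing_word = [sum(1 for v in col if v > 0) for col in columns]
    word_totals = [sum(col) for col in columns]
    doc_totals = [sum(row) for row in matrix]
    return docs_containing_word, word_totals, doc_totals

def reorder(ids, words, matrix):
    """Sort rows and columns by descending totals so large counts end up top-left."""
    _, word_totals, doc_totals = margins(matrix)
    row_order = sorted(range(len(ids)), key=lambda i: doc_totals[i], reverse=True)
    col_order = sorted(range(len(words)), key=lambda j: word_totals[j], reverse=True)
    reordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return [ids[i] for i in row_order], [words[j] for j in col_order], reordered

toy_ids = ["d1", "d2", "d3"]
toy_words = ["kinase", "peptide", "protein"]
toy_matrix = [[1, 0, 1], [0, 1, 4], [2, 0, 5]]

new_ids, new_words, new_matrix = reorder(toy_ids, toy_words, toy_matrix)
print(new_ids)     # ['d3', 'd2', 'd1']
print(new_words)   # ['protein', 'kinase', 'peptide']
print(new_matrix)  # [[5, 2, 0], [4, 0, 1], [1, 1, 0]]
```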

**Use**

To use this script, you must have a BoW matrix (`bow_matrix`) and the corresponding lists of words (`word_list`) and document IDs (`ids_list`). The script processes these inputs to calculate the frequency counts, reorder the matrix and output the reordered BoW matrix. This optimized matrix can be used for various downstream tasks, e.g. for creating visualizations, for deeper statistical analysis or as a basis for machine learning models for predictions.

In [None]:
sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

sum_count_words_in_doc = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, bow_matrix)

print('Number of documents in which each word is present: ', dict(itertools.islice(sum_count_docs_containing_word.items(), 7)))
print('Number of occurrences of each word in all documents: ', dict(itertools.islice(sum_count_word_in_docs.items(), 7)))
print('Number of words in each document: ', dict(itertools.islice(sum_count_words_in_doc.items(), 7)))

# Compute the order of rows (documents) and columns (words) in the bow matrix so that the most frequent words are in the top-left corner. 
filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_word_in_docs, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_words_in_doc, reverse=True), ids_list) 

# Rearrange (filter) the bow matrix according to the previously computed order.
filtered_ids_list, filtered_word_list, filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, bow_matrix, filter_rows, filter_columns)
print('The first few documents in the rows of the filtered bow matrix: ', filtered_ids_list[:7])
print('The first few words in the columns of the filtered bow matrix: ', filtered_word_list[:7])

Visualize the upper-left part of the Bag of Words matrix.

In [None]:
first_row = 0
last_row = 20
first_column = 0
last_column = 15
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered bag of words', \
                                           filtered_bow_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], \
                                           filtered_word_list[first_column:last_column], as_int = True)

The next script is designed to create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix from a set of preprocessed documents and then refine this matrix by filtering out less relevant terms.

**Functionality**

1. *Creating the TF-IDF matrix*:<br>
   The script begins by generating a TF-IDF matrix using a list of preprocessed documents:
   ```python
   word_list, tfidf_matrix = LBD_03_feature_extraction.create_tfidf(prep_docs_list, ngram_size, min_df)
   ```
   - *TF-IDF matrix*: This matrix represents the importance of each word (or n-gram) across all documents in the corpus.
   - *`ngram_size`*: Specifies the size of word sequences to consider (e.g., unigrams, bigrams).
   - *`min_df`*: Filters out terms that appear in fewer than a specified number of documents, reducing noise in the analysis.

   This step is essential for transforming raw text data into a structured format that highlights important terms.

2. *Rearranging the TF-IDF matrix*:
   The script then refines the TF-IDF matrix by rearranging and filtering the terms:
   ```python
   tmp_filtered_word_list, tmp_filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix_columns(
       word_list, tfidf_matrix, tmp_filter_rows, tmp_filter_columns)
   word_list = tmp_filtered_word_list
   tfidf_matrix = tmp_filtered_tfidf_matrix
   ```
   - *filtering*: Only the columns listed in `tmp_filter_columns` are kept. This list was computed earlier for the bag-of-words matrix and removes infrequent n-grams and non-MeSH words from the vocabulary.
   - *rearranging*: Rows and columns follow the order given by `tmp_filter_rows` and `tmp_filter_columns`, so the TF-IDF matrix has the same layout as the filtered bag-of-words matrix.

   Applying the same filter to both matrices ensures that word counts and TF-IDF weights refer to the same vocabulary in the later steps.
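
As a rough illustration of the two steps above, the following sketch builds a small TF-IDF matrix in plain Python and then keeps a subset of its columns. It does not reproduce `LBD_03_feature_extraction`; the helper names `toy_tfidf` and `keep_columns`, the smoothed IDF formula, and the L2 row normalization are assumptions made for this example.

```python
import math
from collections import Counter

def toy_tfidf(docs, min_df=1):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Drop words that occur in fewer than min_df documents.
    vocab = sorted(word for word, count in df.items() if count >= min_df)
    n_docs = len(docs)
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant; other definitions exist).
        row = [tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
        # L2-normalize each document row (also an assumption).
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix

def keep_columns(vocab, matrix, column_indices):
    """Keep only the listed columns, in the listed order."""
    new_vocab = [vocab[j] for j in column_indices]
    new_matrix = [[row[j] for j in column_indices] for row in matrix]
    return new_vocab, new_matrix

docs = ["calcium channel autism", "autism synaptophysin", "autism calcium"]
vocab, tfidf = toy_tfidf(docs, min_df=1)

# Reuse a column selection computed elsewhere (here: drop the word "autism").
columns = [j for j, w in enumerate(vocab) if w != "autism"]
vocab_f, tfidf_f = keep_columns(vocab, tfidf, columns)
print(vocab_f)
```

The point of the second helper is that the column selection is computed once and can then be applied to any matrix that shares the same vocabulary order.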

**Use**

This script can be used as one step of a larger text mining workflow: the TF-IDF matrix assigns each term a weight in each document, and the filtering restricts the analysis to the vocabulary selected for the experiment. In the context of LBD, the resulting matrix is the input for the rare-term analysis that follows.

In [None]:
# TF-IDF representation
word_list, tfidf_matrix = LBD_03_feature_extraction.create_tfidf(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary with all n-grams: ', len(word_list))

# Rearrange (filter) the tfidf matrix according to the previously computed order from bow matrix.
tmp_filtered_word_list, tmp_filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, tfidf_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
tfidf_matrix = tmp_filtered_tfidf_matrix
print('Number of terms in preprocessed vocabulary after removing infrequent n-grams and non-MeSH words: ', len(word_list))

Compute margins for TF-IDF matrix.

In [None]:
sum_word_tfidf = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, tfidf_matrix)
max_word_tfidf = LBD_03_feature_extraction.max_tfidf_each_word_in_all_documents(word_list, tfidf_matrix)

sum_doc_tfidf = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, tfidf_matrix)
max_doc_tfidf = LBD_03_feature_extraction.max_tfidf_all_words_in_each_document(ids_list, tfidf_matrix)

print('Sum of tfidf for each word: ', dict(itertools.islice(sum_word_tfidf.items(), 7)))
print('Max of tfidf for each word: ', dict(itertools.islice(max_word_tfidf.items(), 7)))

print('Sum of tfidf for each document: ', dict(itertools.islice(sum_doc_tfidf.items(), 7)))
print('Max of tfidf for each document: ', dict(itertools.islice(max_doc_tfidf.items(), 7)))

# Compute the order of rows (documents) and columns (words) in the tfidf matrix so that the most important words are in the top-left corner. 
filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_doc_tfidf, reverse=True), ids_list) 

# Rearrange (filter) the tfidf matrix according to the previously computed order.
filtered_ids_list, filtered_word_list, filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, tfidf_matrix, filter_rows, filter_columns)
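
The reordering in the cell above can be sketched on a toy matrix as follows. This is a minimal illustration in plain Python, not the library code; the helper names `column_max`, `row_max`, `order_desc` and `reorder` are hypothetical, and the toy values are made up.

```python
def column_max(matrix):
    """Maximum value in each column (one value per word)."""
    return [max(col) for col in zip(*matrix)]

def row_max(matrix):
    """Maximum value in each row (one value per document)."""
    return [max(row) for row in matrix]

def order_desc(values):
    """Indices that sort `values` from largest to smallest."""
    return sorted(range(len(values)), key=lambda i: -values[i])

def reorder(ids, words, matrix, row_order, col_order):
    """Apply a row order and a column order to the labels and the matrix."""
    new_ids = [ids[i] for i in row_order]
    new_words = [words[j] for j in col_order]
    new_matrix = [[matrix[i][j] for j in col_order] for i in row_order]
    return new_ids, new_words, new_matrix

ids = ["d1", "d2", "d3"]
words = ["enzyme", "kinase", "protein"]
tfidf = [[0.1, 0.0, 0.3],
         [0.0, 0.9, 0.2],
         [0.5, 0.0, 0.0]]

col_order = order_desc(column_max(tfidf))
row_order = order_desc(row_max(tfidf))
new_ids, new_words, new_tfidf = reorder(ids, words, tfidf, row_order, col_order)
print(new_ids, new_words, new_tfidf[0][0])
```

After reordering, the largest weight sits in the top-left cell, which is why a plot of the first rows and columns shows the most prominent documents and words.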

Visualize the upper-left part of the TF-IDF matrix.

In [None]:
first_row = 0
last_row = 20
first_column = 0
last_column = 25
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered TF-IDF', filtered_tfidf_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], filtered_word_list[first_column:last_column], as_int = False)

Create a list of domain names of all documents (from the dictionary containing the documents) and a list of unique domain names. Since all the documents are from the autism domain and there is no specific domain name for each document, we expect the only domain to be *NA*.

In [None]:
domains_list = LBD_02_data_preprocessing.extract_domain_names_list(docs_dict)
print('Domain names for the first few documents: ', domains_list[:7])
unique_domains_list = LBD_02_data_preprocessing.extract_unique_domain_names_list(prep_docs_dict)
print('A list of all unique domain names in all the documents: ', unique_domains_list)

Visualize the documents in a 2D graph by reducing the dimensionality of the TF-IDF matrix with PCA. Visualizing TF-IDF data with PCA helps to understand complex, high-dimensional data by projecting it into a more interpretable form. This is crucial in LBD, as understanding the relationships between documents can lead to the discovery of new knowledge or the identification of new connections between concepts.

In [None]:
LBD_06_visualization.visualize_tfidf_pca_interactive(ids_list, domains_list, tfidf_matrix, transpose = False)

Transpose the TF-IDF matrix to display the similarity of the words (instead of the documents) in the graph.

In [None]:
domains_list = [LBD_02_data_preprocessing.strDomainDefault]*len(word_list)
LBD_06_visualization.visualize_tfidf_pca_interactive(word_list, domains_list, tfidf_matrix, transpose = True)
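
To show what the PCA projection does numerically, here is a minimal sketch that reduces a matrix to two dimensions with NumPy's singular value decomposition. It is not the code behind `visualize_tfidf_pca_interactive`; the function name `pca_2d` and the random toy matrix are assumptions made for the example.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    X = np.asarray(matrix, dtype=float)
    # PCA requires centered data: subtract the mean of every column.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
toy_tfidf = rng.random((6, 5))        # 6 documents, 5 words

doc_coords = pca_2d(toy_tfidf)        # one 2D point per document
word_coords = pca_2d(toy_tfidf.T)     # transposed: one 2D point per word
print(doc_coords.shape, word_coords.shape)
```

Transposing the matrix swaps the roles of rows and columns, so the same projection places words rather than documents in the plane.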

The next script filters, sorts and analyzes *rare* words within a corpus based on their maximum TF-IDF values.

A word or n-gram is rare in the input documents if it occurs in only a relatively small portion of them. In this script, we assume that a term is rare if it occurs in a single document only. Note, however, that this criterion is sensitive to changes in the corpus: adding new documents can change the frequency of a term, so a rare term stops being rare as soon as an added document contains it.

**Functionality**

1. *Filtering rare words*:
   The script first selects words that appear in only one document:
   ```python
   max_word_tfidf_selected = {}
   for word in max_word_tfidf.keys():
       if sum_count_docs_containing_word[word] <= 1:
           max_word_tfidf_selected[word] = max_word_tfidf[word]
   ```
   - *filtering criteria*: Words appearing in only one document are considered rare and are selected for further analysis.

2. *Displaying and sorting words*:
   The script then prints and sorts these rare words by their maximum TF-IDF value:
   ```python
   print('Selected rare words: ', len(max_word_tfidf_selected), ' ', dict(itertools.islice(max_word_tfidf_selected.items(), 30)))
   max_word_tfidf_selected_sorted = LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf_selected, True)
   print('Sorted rare words: ', len(max_word_tfidf_selected_sorted), ' ', dict(itertools.islice(max_word_tfidf_selected_sorted.items(), 30)))
   ```
   - *sorting*: Rare words are sorted in descending order by their TF-IDF scores, highlighting the most important terms.

3. *Analyzing the results*:
   The script calculates the mean of the maximum TF-IDF values of the rare words:
   ```python
   print('Mean value of max TF-IDF values: ', np.array(list(max_word_tfidf_selected_sorted.values())).mean())
   ```
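
The three steps above can be condensed into a single function with an adjustable rarity threshold $n$. This is a sketch on made-up data, not part of the LBD modules; the function name `select_rare_terms` and all counts and scores are hypothetical.

```python
def select_rare_terms(doc_count, max_tfidf, n=1):
    """Return terms that occur in at most n documents, sorted by max TF-IDF (descending)."""
    rare = {w: max_tfidf[w] for w in max_tfidf if doc_count[w] <= n}
    return dict(sorted(rare.items(), key=lambda kv: kv[1], reverse=True))

# Toy data: number of documents containing each term, and its maximum TF-IDF.
doc_count = {"autism": 40, "synaptophysin": 1, "calcium channel": 1, "serotonin": 2}
max_tfidf = {"autism": 0.21, "synaptophysin": 0.58, "calcium channel": 0.64, "serotonin": 0.33}

rare_1 = select_rare_terms(doc_count, max_tfidf, n=1)
rare_2 = select_rare_terms(doc_count, max_tfidf, n=2)
mean_1 = sum(rare_1.values()) / len(rare_1)

print(list(rare_1))
print(list(rare_2))
print(round(mean_1, 3))
```

Raising `n` from 1 to 2 admits *serotonin* as a rare term, which illustrates how sensitive the candidate list is to the chosen threshold.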

In [None]:
import itertools

print("Dictionary of words, count and max(tfidf):")

# Select the words that occur in at most one document.
max_word_tfidf_selected = {}
for word in max_word_tfidf.keys():
    if sum_count_docs_containing_word[word] <= 1:
        max_word_tfidf_selected[word] = max_word_tfidf[word]

print('All the words in vocabulary: ', len(max_word_tfidf))
print('Selected rare words: ', len(max_word_tfidf_selected), ' ', dict(itertools.islice(max_word_tfidf_selected.items(), 30)))

max_word_tfidf_selected_sorted = LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf_selected, True)

print('Sorted rare words: ', len(max_word_tfidf_selected_sorted), ' ', dict(itertools.islice(max_word_tfidf_selected_sorted.items(), 30)))
print('First and last sorted rare word: ', list(max_word_tfidf_selected_sorted.items())[0], ' ', list(max_word_tfidf_selected_sorted.items())[-1])
print('Mean value of max TF-IDF values: ', np.array(list(max_word_tfidf_selected_sorted.values())).mean())

Identify a few rare terms for further analysis. In our experiment [1], the autism expert identified three rare terms in our dataset, *calcium channel*, *synaptophysin*, and *lactoylglutathione*, and went on to search specifically for their similarities and associations with autism.

In [None]:
rare_terms_list = list(max_word_tfidf_selected_sorted.keys())
rare_terms_list_length = len(rare_terms_list)

df = pd.DataFrame({'Rare term': rare_terms_list, 'max TF-IDF': list(max_word_tfidf_selected_sorted.values())})
# display the first 25 rare terms
df[0:25]

In [None]:
# Save the list to a file
with open('B_strings_list.txt', 'w') as file:
    for string in rare_terms_list:
        file.write(string + '\n')

Let's check the presence and the position of the expert-selected rare terms in the candidate list:

In [None]:
name = 'calcium channel'
print(name, ': ', 'position in the list of rare terms ', list(max_word_tfidf_selected_sorted.keys()).index(name), ' (', len(max_word_tfidf_selected_sorted), \
      '), max tfidf: ', format(max_word_tfidf_selected_sorted[name], '.3f'), sep='')

In [None]:
name = 'synaptophysin'
print(name, ': ', 'position in the list of rare terms ', list(max_word_tfidf_selected_sorted.keys()).index(name), ' (', len(max_word_tfidf_selected_sorted), \
      '), max tfidf: ', format(max_word_tfidf_selected_sorted[name], '.3f'), sep='')

In [None]:

name = 'lactoylglutathione'
print(name, ': ', 'position in the list of rare terms ', list(max_word_tfidf_selected_sorted.keys()).index(name), ' (', len(max_word_tfidf_selected_sorted), \
      '), max tfidf: ', format(max_word_tfidf_selected_sorted[name], '.3f'), sep='')
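
The three lookups above repeat the same pattern, and `list.index` raises a `ValueError` when a term is missing. A small helper, sketched here on toy data, avoids both issues. The name `position_of` and the example scores are hypothetical; note that the reported position is zero-based, as in the cells above.

```python
def position_of(term, ranked_terms):
    """Return (zero-based index, list length), or (None, list length) if the term is absent."""
    keys = list(ranked_terms)
    if term not in ranked_terms:
        return None, len(keys)
    return keys.index(term), len(keys)

# Toy ranking: terms already sorted by descending max TF-IDF.
ranked = {"calcium channel": 0.64, "synaptophysin": 0.58, "lactoylglutathione": 0.12}

for name in ["synaptophysin", "lactoylglutathione", "oxytocin"]:
    index, total = position_of(name, ranked)
    if index is None:
        print(f"{name}: not in the list of {total} rare terms")
    else:
        print(f"{name}: position in the list of rare terms {index} ({total}), "
              f"max tfidf: {ranked[name]:.3f}")
```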

The expert searched through the list of 495 rare term candidates. In the last part of step Ra, the expert identified three rare terms for further exploration:

* *calcium channel* (position 38/495), 
* *synaptophysin* (position 37/495), 
* and *lactoylglutathione* (position 377/495).

# Step Jo

How were the input files prepared?

In [None]:
fileName = 'input/f_calcium_channels.txt'
lines = LBD_01_data_acquisition.load_data_from_file(fileName)
lines[:7]

In [None]:
fileName = 'input/f_synaptophysin.txt'
lines2 = LBD_01_data_acquisition.load_data_from_file(fileName)
lines2[:7]

In [None]:
fileName = 'input/f_lactoylglutathione.txt'
lines3 = LBD_01_data_acquisition.load_data_from_file(fileName)
lines3[:7]

Combine all three input texts into a single list.

In [None]:
lines.extend(lines2)
lines.extend(lines3)
len(lines)

A total of 11666 documents were collected from the three domains. The next step is to preprocess the list of input documents.

In [None]:
docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(lines)

keep_list = []
remove_list = []
prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
    docs_dict, keep_list = keep_list, remove_list = remove_list, mesh_word_list = mesh_word_list, \
    cleaning = True, remove_stopwords = True, lemmatization = True, \
    min_word_length = 5, keep_only_nouns = False, keep_only_mesh = False, stemming = False, stem_type = None)

ids_list = LBD_02_data_preprocessing.extract_ids_list(prep_docs_dict)
domains_list = LBD_02_data_preprocessing.extract_domain_names_list(prep_docs_dict)
unique_domains_list = LBD_02_data_preprocessing.extract_unique_domain_names_list(prep_docs_dict)
prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(prep_docs_dict)

Display the first 7 dictionary items.

In [None]:
dict(itertools.islice(prep_docs_dict.items(), 7))

Generate Bag of Words matrix.

In [None]:
ngram_size = 1
min_df = 1

# BOW representation
word_list, bow_matrix = LBD_03_feature_extraction.create_bag_of_words(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary: ', len(word_list))

# Remove n-grams whose total count is below min_count_ngram from word_list and bow_matrix;
# if keep_only_mesh_II is set, also remove words that are not in mesh_word_list.
min_count_ngram = 3
keep_only_mesh_II = True

tmp_sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

tmp_sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

tmp_filter_columns = []
for i, word in enumerate(word_list):
    if not LBD_03_feature_extraction.word_is_nterm(word):
        if (not keep_only_mesh_II) or (word in mesh_word_list):
            tmp_filter_columns.append(i)
    else:
        if tmp_sum_count_word_in_docs[word] >= min_count_ngram:
            check_ngram = word.split()
            passed = True
            for check_word in check_ngram:
                if keep_only_mesh_II:
                    if check_word not in mesh_word_list:
                        passed = False
            if check_ngram[0] == check_ngram[1]:
                passed = False
            if passed:
                tmp_filter_columns.append(i)

# keep the original order of rows
tmp_filter_rows = list(range(len(ids_list)))

tmp_filtered_word_list, tmp_filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, bow_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
bow_matrix = tmp_filtered_bow_matrix
print('Number of terms in preprocessed vocabulary: ', len(word_list))
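
The column selection performed in the cell above can be sketched in a compact form. This is an illustration under assumptions, not the library code: the function name `select_columns` is hypothetical, a term is treated as an n-gram when it contains whitespace (assumed to match what `word_is_nterm` checks), and the toy vocabulary and counts are made up.

```python
def select_columns(word_list, total_counts, mesh_words, min_count_ngram=3, keep_only_mesh=True):
    """Return the indices of the vocabulary terms that pass the filter."""
    selected = []
    for j, term in enumerate(word_list):
        parts = term.split()
        # With MeSH filtering enabled, every word of the term must be a MeSH word.
        if keep_only_mesh and not all(p in mesh_words for p in parts):
            continue
        if len(parts) > 1:
            # n-grams must be frequent enough ...
            if total_counts[term] < min_count_ngram:
                continue
            # ... and must not start with a repeated word.
            if parts[0] == parts[1]:
                continue
        selected.append(j)
    return selected

word_list = ["calcineurin", "patient", "protein kinase", "kinase kinase", "calcium channel"]
total_counts = {"calcineurin": 5, "patient": 50, "protein kinase": 4,
                "kinase kinase": 7, "calcium channel": 2}
mesh_words = {"calcineurin", "protein", "kinase", "calcium", "channel"}

mesh_only = select_columns(word_list, total_counts, mesh_words)
all_words = select_columns(word_list, total_counts, mesh_words, keep_only_mesh=False)
print([word_list[j] for j in mesh_only])
print([word_list[j] for j in all_words])
```

In this toy vocabulary, *patient* is dropped because it is not a MeSH word, *kinase kinase* because of the repeated word, and *calcium channel* because it occurs fewer than three times.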

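Aggregate the bag-of-words matrix by domain: for each unique domain name, the rows of all documents that belong to that domain are summed into a single row. The resulting `domains_bow_matrix` has one row per domain and one column per word, so each cell holds the total count of a word in one domain.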

In [None]:
# Build domains_bow_matrix from bow_matrix: sum the rows of all documents
# that share a domain name into a single row per unique domain.
domains_bow_matrix = np.empty((0, bow_matrix.shape[1]))
for i, domain_name in enumerate(unique_domains_list):
    # Row indices of the documents that belong to the current domain
    domain_docs_indices = [j for j, label in enumerate(domains_list) if label == domain_name]
    print(domain_docs_indices[:7])
    tmp = (bow_matrix[domain_docs_indices,:]).sum(axis=0)
    print(i, tmp)
    domains_bow_matrix = np.vstack((domains_bow_matrix, tmp))
print(domains_bow_matrix)
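
The aggregation of document rows into domain rows can be checked by hand on a small example. The sketch below uses plain Python lists instead of NumPy arrays; the function name `aggregate_by_domain`, the domain labels and the counts are made up for illustration.

```python
def aggregate_by_domain(bow_rows, domains, unique_domains):
    """Sum the rows of all documents with the same domain label into one row per domain."""
    n_cols = len(bow_rows[0])
    result = []
    for domain in unique_domains:
        totals = [0] * n_cols
        for row, label in zip(bow_rows, domains):
            if label == domain:
                totals = [t + v for t, v in zip(totals, row)]
        result.append(totals)
    return result

# Four documents, three words.
bow_rows = [[1, 0, 2],
            [0, 1, 1],
            [3, 0, 0],
            [0, 0, 1]]
domains = ["calcium", "calcium", "synaptophysin", "lactoyl"]
unique_domains = ["calcium", "synaptophysin", "lactoyl"]

domain_rows = aggregate_by_domain(bow_rows, domains, unique_domains)
print(domain_rows)
```

A zero in a domain row means that the word never occurs in that domain, which is the information needed later to test whether a word is shared by all domains.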

Define a helper function that looks up the count of a given word in a given domain, i.e. a single cell of `domains_bow_matrix`.

In [None]:
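def cell_value_in_bow_matrix(bow_matrix, domain_name, word):
    """
    Return the cell of a domain-level matrix for the given domain and word.

    Relies on the global lists unique_domains_list (row order) and
    word_list (column order).
    """
    line_idx = unique_domains_list.index(domain_name)
    column_idx = word_list.index(word)
    return bow_matrix[line_idx, column_idx]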

cell_value_in_bow_matrix(domains_bow_matrix, unique_domains_list[0], word_list[0])


In [None]:
sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

sum_count_words_in_doc = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, bow_matrix)

print(dict(itertools.islice(sum_count_docs_containing_word.items(), 7)))
print(dict(itertools.islice(sum_count_word_in_docs.items(), 7)))
print(dict(itertools.islice(sum_count_words_in_doc.items(), 7)))

filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_word_in_docs, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_words_in_doc, reverse=True), ids_list) 

filtered_ids_list, filtered_word_list, filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, bow_matrix, filter_rows, filter_columns)
print(filtered_ids_list[:7])

In [None]:
first_row = 0
last_row = 20
first_column = 0
last_column = 15
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered bag of words', \
                                           filtered_bow_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], \
                                           filtered_word_list[first_column:last_column], as_int = True)

In [None]:
# TF-IDF representation
word_list, tfidf_matrix = LBD_03_feature_extraction.create_tfidf(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary: ', len(word_list))
# print(word_list)
# print(tfidf_matrix)

tmp_filtered_word_list, tmp_filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, tfidf_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
tfidf_matrix = tmp_filtered_tfidf_matrix
print('Number of terms in preprocessed vocabulary: ',len(word_list))

In [None]:
sum_word_tfidf = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, tfidf_matrix)
max_word_tfidf = LBD_03_feature_extraction.max_tfidf_each_word_in_all_documents(word_list, tfidf_matrix)

sum_doc_tfidf = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, tfidf_matrix)
max_doc_tfidf = LBD_03_feature_extraction.max_tfidf_all_words_in_each_document(ids_list, tfidf_matrix)

print(dict(itertools.islice(sum_word_tfidf.items(), 7)))
print(dict(itertools.islice(max_word_tfidf.items(), 7)))

print(dict(itertools.islice(sum_doc_tfidf.items(), 7)))
print(dict(itertools.islice(max_doc_tfidf.items(), 7)))

filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_doc_tfidf, reverse=True), ids_list) 

filtered_ids_list, filtered_word_list, filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, tfidf_matrix, filter_rows, filter_columns)

In [None]:
first_row = 0
last_row = 20
first_column = 0
last_column = 25
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered TF-IDF', filtered_tfidf_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], filtered_word_list[first_column:last_column], as_int = False)

In [None]:
domains_list = LBD_02_data_preprocessing.extract_domain_names_list(docs_dict)
print('Domain names for the first few documents: ', domains_list[:7])
unique_domains_list = LBD_02_data_preprocessing.extract_unique_domain_names_list(prep_docs_dict)
print('Unique domain names: ', unique_domains_list)
print('Number of documents in each unique domain:')
for unique_domain in unique_domains_list:
    print('   ', unique_domain, ': ', domains_list.count(unique_domain), sep='')


In [None]:
LBD_06_visualization.visualize_tfidf_pca_interactive(ids_list, domains_list, tfidf_matrix, transpose = False)

In [None]:
domains_list = ['default']*len(word_list)
LBD_06_visualization.visualize_tfidf_pca_interactive(word_list, domains_list, tfidf_matrix, transpose = True)

In [None]:
import itertools

print("Dictionary of words, count and max(tfidf):")

# Select joint terms: words that occur in at least 10 documents
# and have a non-zero count in every domain.
max_word_tfidf_selected = {}
for word in max_word_tfidf.keys():
    if sum_count_docs_containing_word[word] >= 10:
        passed = True
        for domain_name in unique_domains_list:
            if cell_value_in_bow_matrix(domains_bow_matrix, domain_name, word) <= 0:
                passed = False
        if passed:
            max_word_tfidf_selected[word] = max_word_tfidf[word]

print('All the words in vocabulary: ', len(max_word_tfidf))
print('Selected joint words: ', len(max_word_tfidf_selected), ' ', dict(itertools.islice(max_word_tfidf_selected.items(), 30)))

max_word_tfidf_selected_sorted = LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf_selected, True)

print('Sorted joint words: ', len(max_word_tfidf_selected_sorted), ' ', dict(itertools.islice(max_word_tfidf_selected_sorted.items(), 30)))
print('First and last sorted joint word: ', list(max_word_tfidf_selected_sorted.items())[0], ' ', list(max_word_tfidf_selected_sorted.items())[-1])
print('Mean value of max tfidf values: ', np.array(list(max_word_tfidf_selected_sorted.values())).mean())
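
The selection rule used in the cell above (a term must occur in enough documents and in every one of the rare-term literatures) can be sketched on toy data as follows. The function name `select_joint_terms` and all counts and scores are hypothetical and only serve to illustrate the rule.

```python
def select_joint_terms(doc_count, domain_counts, max_tfidf, min_docs=10):
    """Return terms found in at least min_docs documents and in every domain,
    sorted by max TF-IDF (descending)."""
    joint = {}
    for term, score in max_tfidf.items():
        if doc_count[term] < min_docs:
            continue
        # The term must have a non-zero count in each domain.
        if all(counts.get(term, 0) > 0 for counts in domain_counts.values()):
            joint[term] = score
    return dict(sorted(joint.items(), key=lambda kv: kv[1], reverse=True))

doc_count = {"calcineurin": 25, "glyoxalase": 30, "vesicle": 8, "receptor": 40}
domain_counts = {
    "calcium channel": {"calcineurin": 12, "glyoxalase": 0, "vesicle": 3, "receptor": 20},
    "synaptophysin": {"calcineurin": 9, "vesicle": 5, "receptor": 15},
    "lactoylglutathione": {"calcineurin": 4, "glyoxalase": 30, "receptor": 5},
}
max_tfidf = {"calcineurin": 0.71, "glyoxalase": 0.80, "vesicle": 0.60, "receptor": 0.40}

joint_terms = select_joint_terms(doc_count, domain_counts, max_tfidf)
print(list(joint_terms))
```

In this example *glyoxalase* has the highest score but is rejected because it is missing from two of the three domains, and *vesicle* is rejected because it occurs in fewer than ten documents.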

In [None]:
joint_terms_list = list(max_word_tfidf_selected_sorted.keys())
joint_terms_list_length = len(joint_terms_list)

df = pd.DataFrame({'Joint term': joint_terms_list, 'max TF-IDF': list(max_word_tfidf_selected_sorted.values())})
df[0:25]

In [None]:
name = 'calcineurin'
print(name, ': ', 'position in the list of joint terms ', list(max_word_tfidf_selected_sorted.keys()).index(name), ' (', len(max_word_tfidf_selected_sorted), \
      '), max tfidf: ', format(max_word_tfidf_selected_sorted[name], '.3f'), sep='')

In the last part of step Jo, we identified a joint term for further exploration: *calcineurin*.
Thus, the literature *C* is *autism* and the literature *A* is *calcineurin*. In step Link, the task is to search for linking b-terms that connect the two domains *C* and *A*.

# Step Link

Step Link implements the closed discovery principle between two domains, in our case `autism` and `calcineurin`. It is implemented in the `LBD_mini_CrossBee.ipynb` notebook.