Link to download the document set:
https://drive.google.com/drive/folders/16vVgaRMY_ZhmrkAWHQTF_aXXYvBulNCN?usp=sharing


A few examples of the documents:

"NCT00000102": "This study will test the ability of extended release nifedipine (Procardia XL), a blood\r\n      pressure medication, to permit a decrease in the dose of glucocorticoid medication children\r\n      take to treat congenital adrenal hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel\r\n      antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with\r\n      congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will\r\n      involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine\r\n      the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels,\r\n      as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II\r\n      is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release\r\n      by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA\r\n      axis? Such a decrease would, in turn, reduce the deleterious effects of glucocorticoid\r\n      treatment in CAH.",

"NCT00000104": "Inner city children are at an increased risk for lead overburden. This in turn affects\r\n      cognitive functioning. However, the underlying neuropsychological effects of lead overburden\r\n      and its age-specific effects have not been well delineated. This study is part of a larger\r\n      study on the effects of lead overburden on the development of attention and memory. The\r\n      larger study is using a multi-model approach to study the effects of lead overburden on these\r\n      effects including the event-related potential (ERP), electrophysiologic measures of attention\r\n      and memory are studied. Every eight months, for a total of three sessions the subjects will\r\n      complete ERP measures of attention and memory which require them to watch various computer\r\n      images while wearing scalp electrodes recording from 11 sites. It is this test that we are\r\n      going to be doing on CRC. There will be 30 lead overburdened children recruited from the\r\n      larger study for participation in the ERP studies on CRC. These 30 children will be matched\r\n      with 30 children without lead overburden. This portion of the study is important in providing\r\n      an index of physiological functioning to be used along with behaviorally based measures of\r\n      attention and memory, and for providing information about the different measures."

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import json
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

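def preprocess_text(text):
    """Lowercase and tokenize the text, drop stop words and non-alphabetic tokens, then lemmatize."""
    # Split the lowercased text into word and punctuation tokens
    tokens = word_tokenize(text.lower())

    # Keep purely alphabetic tokens that are not English stop words
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

    # Lemmatize each token (WordNetLemmatizer treats words as nouns by default)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    return " ".join(lemmatized_tokens)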

json_file_path = "C:\\Users\\Timch\\nct_summaries_and_descriptions.json"  # raw documents JSON file; change to your own path
with open(json_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

preprocessed_data = {doc_id: preprocess_text(content) for doc_id, content in data.items()}
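
The `word.isalpha()` filter in `preprocess_text` removes more than punctuation. The sketch below uses a hand-written token list (not produced by NLTK) and a hypothetical three-word stop list to show the effect. It assumes the tokenizer keeps hyphenated words such as "double-blind" as single tokens, in which case they are dropped along with numbers.

```python
# Hand-written tokens standing in for tokenizer output (an assumption, not real NLTK output)
sample_tokens = ["a", "double-blind", "placebo-controlled", "trial", "of",
                 "nifedipine", "in", "30", "children", "."]
toy_stop_words = {"a", "of", "in"}  # hypothetical tiny stop-word list

# Same filter condition as in preprocess_text
kept = [t for t in sample_tokens if t.isalpha() and t not in toy_stop_words]
# Tokens rejected by isalpha() alone
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)     # ['trial', 'nifedipine', 'children']
print(dropped)  # ['double-blind', 'placebo-controlled', '30', '.']
```

If hyphenated terms matter for the downstream task, the filter condition would need to be relaxed.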



In [None]:
from collections import Counter

all_words = " ".join(preprocessed_data.values()).split()

word_counts = Counter(all_words)

top_10_words = word_counts.most_common(10)

top_10_words

[('patient', 265917),
 ('study', 230985),
 ('treatment', 117959),
 ('day', 69748),
 ('cell', 58979),
 ('week', 57766),
 ('disease', 54928),
 ('therapy', 53478),
 ('blood', 53379),
 ('may', 50796)]
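
The counts above are total occurrences across the corpus. As a rough sketch on a hypothetical toy corpus, the same `Counter` can be updated one document at a time, which avoids joining every document into one large string, and a second counter can track how many documents contain each word. The names `toy_docs`, `term_counts`, and `doc_counts` are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus in the same {doc_id: text} shape as preprocessed_data
toy_docs = {
    "doc1": "patient study patient cell",
    "doc2": "study blood",
    "doc3": "patient week",
}

term_counts = Counter()  # total occurrences of each word
doc_counts = Counter()   # number of documents containing each word

for text in toy_docs.values():
    tokens = text.split()
    term_counts.update(tokens)
    doc_counts.update(set(tokens))  # count each word once per document

print(term_counts.most_common(2))  # [('patient', 3), ('study', 2)]
print(doc_counts["patient"])       # 2
```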

In [None]:
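# The three most frequent words in the corpus ('patient', 'study', 'treatment')
words_to_remove = {word for word, _ in word_counts.most_common(3)}

def remove_frequent_words(text, words_to_remove):
    """Return the text with every token found in words_to_remove dropped."""
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word not in words_to_remove]
    return " ".join(filtered_tokens)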

adjusted_preprocessed_data = {doc_id: remove_frequent_words(content, words_to_remove) for doc_id, content in preprocessed_data.items()}
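
A minimal sketch of the same removal step on a hypothetical toy corpus. It also checks for documents that end up empty once the most frequent words are gone, which may be worth verifying on the real data. The helper `drop_words` and the toy documents are illustrative, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus; the real input would be preprocessed_data
toy_docs = {
    "d1": "patient study treatment cell",
    "d2": "patient study blood",
    "d3": "patient treatment week study",
    "d4": "treatment cell",
    "d5": "patient study",
}

counts = Counter(" ".join(toy_docs.values()).split())
top_words = {word for word, _ in counts.most_common(3)}

def drop_words(text, words):
    """Remove every whitespace-separated token that appears in words."""
    return " ".join(tok for tok in text.split() if tok not in words)

adjusted = {doc_id: drop_words(text, top_words) for doc_id, text in toy_docs.items()}
empty_ids = [doc_id for doc_id, text in adjusted.items() if not text]

print(sorted(top_words))  # ['patient', 'study', 'treatment']
print(adjusted["d1"])     # cell
print(empty_ids)          # ['d5']
```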

In [None]:
output_file_path = 'preprocessed_data.json'
with open(output_file_path, 'w') as file:
    json.dump(adjusted_preprocessed_data, file, indent=4)

print("Preprocessing complete. Preprocessed data saved to:", output_file_path)

Preprocessing complete. Preprocessed data saved to: preprocessed_data.json


In [None]:
import json

json_file_path = "C:\\Users\\Timch\\preprocessed_data.json"  # preprocessed JSON file written by the previous cell; change to your own path

with open(json_file_path, 'r') as file:
    data = json.load(file)

num_documents = len(data)

print(f"The JSON file contains {num_documents} documents.")


The JSON file contains 79628 documents.
