# **TEXT PRE-PROCESSING**
- REMOVING URLS AND PUNCTUATION, CONVERTING TO LOWER CASE
- SEGMENTATION
- TOKENIZATION
- REMOVING STOP WORDS
- STEMMING
- LEMMATIZATION
- PART OF SPEECH TAGGING (POS)
- NAMED ENTITY TAGGING


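DATASET USED - **PubMed 200K RCT** (this notebook loads the 20k subset, `PubMed_20k_RCT_numbers_replaced_with_at_sign`, in which numbers are replaced by `@`)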

### GETTING DATA

In [None]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
!ls pubmed-rct

fatal: destination path 'pubmed-rct' already exists and is not an empty directory.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


In [None]:
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [None]:
def get_lines(filename):
    """Read a text file and return its lines, newline characters included."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.readlines()

In [None]:
train_lines = get_lines(data_dir+"train.txt")
train_lines[:20]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 
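
The listing above shows the file layout: a `###<id>` line opens each abstract, and every following line holds a section label and a sentence separated by a tab. A minimal sketch of turning such lines into records, assuming that layout holds throughout the file (the function name `parse_abstract_lines` and the dictionary keys are illustrative, not part of the dataset tooling):

```python
def parse_abstract_lines(lines):
    """Group raw lines into one dict per sentence: abstract id, label, text."""
    records = []
    abstract_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("###"):
            # A "###<id>" line starts a new abstract
            abstract_id = line[3:]
        elif "\t" in line:
            # Sentence lines look like "LABEL<TAB>text"
            label, text = line.split("\t", 1)
            records.append({"abstract_id": abstract_id, "label": label, "text": text})
        # Any other line (for example an empty separator line) is skipped
    return records


sample = [
    "###24293578\n",
    "METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n",
    "\n",
]
for record in parse_abstract_lines(sample):
    print(record)
```

Keeping the label separate from the text would let the later steps process only the sentence, rather than lowercasing and tokenizing the label along with it.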

### 1) REMOVING URL

In [None]:
import re

# Matches anything starting with "http" (which covers "https" too) or "www"
URL_PATTERN = re.compile(r'http\S+|www\S+')

def remove_urls(text):
    return URL_PATTERN.sub('', text)

cleaned_lines_urls = [remove_urls(line) for line in train_lines]

# Show the first line that contains a URL, before and after cleaning
for line in train_lines:
    if URL_PATTERN.search(line):
        print("Original Line with URL:")
        print(line)

        cleaned_line = remove_urls(line)

        print("\nCleaned Line without URL:")
        print(cleaned_line)
        break

with open("cleaned_train_urls.txt", "w") as f:
    f.writelines(cleaned_lines_urls)


Original Line with URL:
BACKGROUND	http://www.clinicaltrials.gov number NCT@ .


Cleaned Line without URL:
BACKGROUND	 number NCT@ .



### 2) REMOVING PUNCTUATION

In [None]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

cleaned_lines_punctuation = [remove_punctuation(line) for line in cleaned_lines_urls]

for line in cleaned_lines_urls:
    if re.search(r'[^\w\s]', line):
        print("Original Line with Punctuation:")
        print(line)

        cleaned_line = remove_punctuation(line)

        print("\nCleaned Line without Punctuation:")
        print(cleaned_line)
        break

with open("cleaned_train_punctuation.txt", "w") as f:
    f.writelines(cleaned_lines_punctuation)


Original Line with Punctuation:
###24293578


Cleaned Line without Punctuation:
24293578
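
The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace, so it also strips the `@` number placeholder and the hyphen in terms like `low-dose`. A minimal sketch of a variant that keeps a chosen set of characters, assuming those two should survive (the helper name `remove_punctuation_keep` and its `keep` parameter are hypothetical):

```python
import re


def remove_punctuation_keep(text, keep="@-"):
    """Drop punctuation except the characters listed in `keep`."""
    # re.escape protects characters such as "-" that are special inside [...]
    pattern = rf"[^\w\s{re.escape(keep)}]"
    return re.sub(pattern, "", text)


line = "daily low-dose oral prednisolone for @ weeks ( OA ) ."
print(remove_punctuation_keep(line))
print(re.sub(r"[^\w\s]", "", line))  # original behaviour, for comparison
```

Removed characters leave their surrounding spaces behind, so runs of spaces can appear where a bracket or period used to be.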



### 3) REPLACING UPPER CASE WITH LOWER CASE

In [None]:
def convert_to_lowercase(text):
    return text.lower()

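# Note: this reloads the URL-cleaned file, not cleaned_train_punctuation.txt,
# so the punctuation removed in step 2 is still present from here on.
with open("cleaned_train_urls.txt", "r") as f:
    cleaned_lines_urls = f.readlines()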

cleaned_lines_lowercase = [convert_to_lowercase(line) for line in cleaned_lines_urls]

for line in cleaned_lines_urls:
    if any(char.isupper() for char in line):
        print("Original Line with Uppercase Characters:")
        print(line)

        cleaned_line = convert_to_lowercase(line)

        print("\nCleaned Line with Lowercase Characters:")
        print(cleaned_line)
        break

with open("cleaned_train_lowercase.txt", "w") as f:
    f.writelines(cleaned_lines_lowercase)


Original Line with Uppercase Characters:
OBJECTIVE	To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .


Cleaned Line with Lowercase Characters:
objective	to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .



### 4) SEGMENTATION

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

In [None]:
nltk.download('punkt')

with open("cleaned_train_lowercase.txt", "r") as f:
    cleaned_lines_lowercase = f.readlines()

def segment_sentences(text):
    return sent_tokenize(text)

segmented_lines = [segment_sentences(line) for line in cleaned_lines_lowercase]

for line in cleaned_lines_lowercase:
    if '.' in line:
        print("Original Line:")
        print(line)

        segmented_line = segment_sentences(line)

        print("\nSegmented Sentences:")
        for sentence in segmented_line:
            print(sentence)
        break

with open("segmented_train.txt", "w") as f:
    for sentences in segmented_lines:
        f.write('\n'.join(sentences) + '\n')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Original Line:
objective	to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .


Segmented Sentences:
objective	to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .


### 5) TOKENIZATION

In [None]:
from nltk.tokenize import word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
with open("segmented_train.txt", "r") as f:
    segmented_lines = f.readlines()

def tokenize(text):
    return word_tokenize(text)

tokenized_lines = [tokenize(sentence) for line in segmented_lines for sentence in line.split('\n') if sentence.strip()]

sentence_count = 0
for line in segmented_lines:
    sentences = line.split('\n')
    for sentence in sentences:
        if sentence.strip():
            sentence_count += 1
            if sentence_count == 2:
                print("Original Sentence:")
                print(sentence)

                tokens = tokenize(sentence)

                print("\nTokenized Sentence:")
                print(tokens)
                print("\n" + "-"*40)
                break
    if sentence_count == 2:
        break

with open("tokenized_train.txt", "w") as f:
    for tokens in tokenized_lines:
        f.write(' '.join(tokens) + '\n')


Original Sentence:
objective	to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .

Tokenized Sentence:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'weeks', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'weeks', 'in', 'older', 'adults', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

----------------------------------------
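
The dataset text is already space-separated, punctuation included, so for the sentence above plain whitespace splitting gives the same tokens that `word_tokenize` printed. A small sketch using the opening words of that sentence (this is an observation about this sample, not a claim that the two methods always agree):

```python
sentence = "objective\tto investigate the efficacy of @ weeks of daily low-dose oral prednisolone"

# str.split() with no arguments splits on any run of whitespace, including the tab
tokens = sentence.split()
print(tokens)
```

On ordinary prose, where punctuation is attached to words, `word_tokenize` separates it and whitespace splitting does not.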


### 6) REMOVING STOP WORDS

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
with open("tokenized_train.txt", "r") as f:
    tokenized_lines = [line.split() for line in f.readlines()]

stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

cleaned_lines = [remove_stop_words(tokens) for tokens in tokenized_lines]

sentence_count = 0
for tokens in tokenized_lines:
    if tokens:
        sentence_count += 1
        if sentence_count == 2:
            original_tokens = tokens
            cleaned_tokens = remove_stop_words(tokens)

            print("Original Tokens:")
            print(original_tokens)

            print("\nTokens after Removing Stop Words:")
            print(cleaned_tokens)
            print("\n" + "-"*40)
            break

with open("cleaned_stopwords_train.txt", "w") as f:
    for tokens in cleaned_lines:
        f.write(' '.join(tokens) + '\n')


Original Tokens:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'weeks', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'weeks', 'in', 'older', 'adults', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

Tokens after Removing Stop Words:
['objective', 'investigate', 'efficacy', '@', 'weeks', 'daily', 'low-dose', 'oral', 'prednisolone', 'improving', 'pain', ',', 'mobility', ',', 'systemic', 'low-grade', 'inflammation', 'short', 'term', 'whether', 'effect', 'would', 'sustained', '@', 'weeks', 'older', 'adults', 'moderate', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

----------------------------------------
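
The output above still contains punctuation tokens such as `,`, `(` and `.`, because the stop-word list covers words only. A minimal sketch of dropping punctuation-only tokens in the same pass, using a small hand-written stop-word set as a stand-in for `stopwords.words('english')` so that the example runs without NLTK data (`SAMPLE_STOP_WORDS` and `remove_stop_words_and_punct` are illustrative names):

```python
import string

# Tiny stand-in for the NLTK English stop-word list (assumption for this sketch)
SAMPLE_STOP_WORDS = {"to", "the", "of", "in", "and", "be", "at", "with"}
PUNCT_CHARS = set(string.punctuation)


def remove_stop_words_and_punct(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop stop words and tokens made up entirely of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        if all(char in PUNCT_CHARS for char in token):
            continue  # e.g. ",", "(", ".", and also the "@" placeholder
        kept.append(token)
    return kept


tokens = ["improving", "pain", ",", "mobility", ",", "and", "systemic",
          "low-grade", "inflammation", "(", "oa", ")", "."]
print(remove_stop_words_and_punct(tokens))
```

Hyphenated terms such as `low-grade` are kept because they are not made up of punctuation alone. Whether the `@` placeholder should be dropped depends on the downstream task.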


### 7) STEMMING

In [None]:
from nltk.stem import PorterStemmer

In [None]:
with open("tokenized_train.txt", "r") as f:
    tokenized_lines = [line.split() for line in f.readlines()]

stemmer = PorterStemmer()

def apply_stemming(tokens):
    return [stemmer.stem(word) for word in tokens]

stemmed_lines = [apply_stemming(tokens) for tokens in tokenized_lines]

sentence_count = 0
for tokens in tokenized_lines:
    if tokens:
        sentence_count += 1
        if sentence_count == 2:
            original_tokens = tokens
            stemmed_tokens = apply_stemming(tokens)

            print("Original Tokens:")
            print(original_tokens)

            print("\nTokens after Stemming:")
            print(stemmed_tokens)
            print("\n" + "-"*40)
            break

with open("stemmed_train.txt", "w") as f:
    for tokens in stemmed_lines:
        f.write(' '.join(tokens) + '\n')


Original Tokens:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'weeks', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'weeks', 'in', 'older', 'adults', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

Tokens after Stemming:
['object', 'to', 'investig', 'the', 'efficaci', 'of', '@', 'week', 'of', 'daili', 'low-dos', 'oral', 'prednisolon', 'in', 'improv', 'pain', ',', 'mobil', ',', 'and', 'system', 'low-grad', 'inflamm', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustain', 'at', '@', 'week', 'in', 'older', 'adult', 'with', 'moder', 'to', 'sever', 'knee', 'osteoarthr', '(', 'oa', ')', '.']

----------------------------------------


### 8) LEMMATIZATION

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

In [None]:
with open("tokenized_train.txt", "r") as f:
    tokenized_lines = [line.split() for line in f.readlines()]

lemmatizer = WordNetLemmatizer()

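def apply_lemmatization(tokens):
    # lemmatize() treats every word as a noun unless a pos argument is given,
    # so plurals are reduced ('weeks' -> 'week') but verb forms such as
    # 'improving' pass through unchanged.
    return [lemmatizer.lemmatize(word) for word in tokens]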

lemmatized_lines = [apply_lemmatization(tokens) for tokens in tokenized_lines]

sentence_count = 0
for tokens in tokenized_lines:
    if tokens:
        sentence_count += 1
        if sentence_count == 2:
            original_tokens = tokens
            lemmatized_tokens = apply_lemmatization(tokens)

            print("Original Tokens:")
            print(original_tokens)

            print("\nTokens after Lemmatization:")
            print(lemmatized_tokens)
            print("\n" + "-"*40)
            break

with open("lemmatized_train.txt", "w") as f:
    for tokens in lemmatized_lines:
        f.write(' '.join(tokens) + '\n')


Original Tokens:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'weeks', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'weeks', 'in', 'older', 'adults', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

Tokens after Lemmatization:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'week', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'week', 'in', 'older', 'adult', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

----------------------------------------
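
Comparing the stemming and lemmatization outputs shows the difference between the two steps: the Porter stemmer clips suffixes and can yield non-words (`efficaci`, `sever`), while the lemmatizer returns dictionary forms and, as called here, changes only plurals such as `weeks`. A short sketch that lines up a few tokens copied from the outputs above (the helper `changed_tokens` is illustrative):

```python
original = ["investigate", "efficacy", "weeks", "daily", "improving", "adults", "severe"]
stemmed = ["investig", "efficaci", "week", "daili", "improv", "adult", "sever"]
lemmatized = ["investigate", "efficacy", "week", "daily", "improving", "adult", "severe"]


def changed_tokens(before, after):
    """Return (before, after) pairs where processing altered the token."""
    return [(b, a) for b, a in zip(before, after) if b != a]


print("Changed by stemming:     ", changed_tokens(original, stemmed))
print("Changed by lemmatization:", changed_tokens(original, lemmatized))
```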


### 9) PART OF SPEECH TAGGING

In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

In [None]:
with open("lemmatized_train.txt", "r") as f:
    lemmatized_lines = [line.split() for line in f.readlines()]

def pos_tag_tokens(tokens):
    return pos_tag(tokens)

tagged_lines = [pos_tag_tokens(tokens) for tokens in lemmatized_lines]

sentence_count = 0
for tokens in lemmatized_lines:
    if tokens:
        sentence_count += 1
        if sentence_count == 2:
            original_tokens = tokens
            tagged_tokens = pos_tag_tokens(tokens)

            print("Original Tokens:")
            print(original_tokens)

            print("\nPOS Tagged Tokens:")
            print(tagged_tokens)
            print("\n" + "-"*40)
            break

with open("tagged_train.txt", "w") as f:
    for tagged_tokens in tagged_lines:
        f.write(' '.join([f'{word}/{tag}' for word, tag in tagged_tokens]) + '\n')


Original Tokens:
['objective', 'to', 'investigate', 'the', 'efficacy', 'of', '@', 'week', 'of', 'daily', 'low-dose', 'oral', 'prednisolone', 'in', 'improving', 'pain', ',', 'mobility', ',', 'and', 'systemic', 'low-grade', 'inflammation', 'in', 'the', 'short', 'term', 'and', 'whether', 'the', 'effect', 'would', 'be', 'sustained', 'at', '@', 'week', 'in', 'older', 'adult', 'with', 'moderate', 'to', 'severe', 'knee', 'osteoarthritis', '(', 'oa', ')', '.']

POS Tagged Tokens:
[('objective', 'JJ'), ('to', 'TO'), ('investigate', 'VB'), ('the', 'DT'), ('efficacy', 'NN'), ('of', 'IN'), ('@', 'JJ'), ('week', 'NN'), ('of', 'IN'), ('daily', 'JJ'), ('low-dose', 'JJ'), ('oral', 'JJ'), ('prednisolone', 'NN'), ('in', 'IN'), ('improving', 'VBG'), ('pain', 'NN'), (',', ','), ('mobility', 'NN'), (',', ','), ('and', 'CC'), ('systemic', 'JJ'), ('low-grade', 'JJ'), ('inflammation', 'NN'), ('in', 'IN'), ('the', 'DT'), ('short', 'JJ'), ('term', 'NN'), ('and', 'CC'), ('whether', 'IN'), ('the', 'DT'), ('effect
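
The Penn Treebank tags produced here can also feed back into lemmatization: `WordNetLemmatizer.lemmatize` accepts a `pos` argument, and without it every token is treated as a noun, which is why `improving` was left unchanged in step 8. A minimal sketch of mapping tags to WordNet's single-letter codes, assuming tagging were run before lemmatization rather than after it as in this notebook (the function name `penn_to_wordnet` is an illustrative choice; the NLTK call appears only in a comment so the block runs without NLTK data):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the part-of-speech letter WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, lemmatize()'s default


tagged = [("improving", "VBG"), ("older", "JJR"), ("week", "NN"), ("would", "MD")]
for word, tag in tagged:
    # With NLTK available: lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
    print(word, tag, penn_to_wordnet(tag))
```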

### 10) NAMED ENTITY TAGGING

In [None]:
from nltk import pos_tag, word_tokenize, ne_chunk
from nltk.tree import Tree

nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
def load_subset(filename, num_lines=5):
    with open(filename, "r") as f:
        lines = f.readlines()
    return [line.split() for line in lines[:num_lines]]

tagged_subset = load_subset("tagged_train.txt", num_lines=5)

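def named_entity_recognition(tokens):
    # Note: tokens loaded from tagged_train.txt are 'word/TAG' strings, so
    # pos_tag() tags each combined string again (hence 'objective/JJ/JJ' below).
    pos_tags = pos_tag(tokens)
    return ne_chunk(pos_tags)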

ner_subset = [named_entity_recognition(tokens) for tokens in tagged_subset]

def format_ner_tree(tree):
    result = []
    for subtree in tree:
        if isinstance(subtree, Tree):
            result.append(f'{" ".join(word for word, tag in subtree.leaves())}/{subtree.label()}')
        else:
            result.append(f'{subtree[0]}/{subtree[1]}')
    return ' '.join(result)

if len(tagged_subset) >= 2:
    tokens = tagged_subset[1]
    ner_tree = ner_subset[1]

    print("Original Tokens:")
    print(tokens)

    print("\nNamed Entity Tagged Tokens:")
    print(format_ner_tree(ner_tree))
else:
    print("The subset does not contain enough sentences.")


Original Tokens:
['objective/JJ', 'to/TO', 'investigate/VB', 'the/DT', 'efficacy/NN', 'of/IN', '@/JJ', 'week/NN', 'of/IN', 'daily/JJ', 'low-dose/JJ', 'oral/JJ', 'prednisolone/NN', 'in/IN', 'improving/VBG', 'pain/NN', ',/,', 'mobility/NN', ',/,', 'and/CC', 'systemic/JJ', 'low-grade/JJ', 'inflammation/NN', 'in/IN', 'the/DT', 'short/JJ', 'term/NN', 'and/CC', 'whether/IN', 'the/DT', 'effect/NN', 'would/MD', 'be/VB', 'sustained/VBN', 'at/IN', '@/NNP', 'week/NN', 'in/IN', 'older/JJR', 'adult/NN', 'with/IN', 'moderate/JJ', 'to/TO', 'severe/VB', 'knee/NN', 'osteoarthritis/NN', '(/(', 'oa/NN', ')/)', './.']

Named Entity Tagged Tokens:
objective/JJ/JJ to/TO/NN investigate/VB/NN the/DT/NN efficacy/NN/NN of/IN/NN @/JJ/NNP week/NN/NN of/IN/NN daily/JJ/VBZ low-dose/JJ/JJ oral/JJ/NN prednisolone/NN/NN in/IN/NN improving/VBG/NN pain/NN/NN ,/,/NNP mobility/NN/NN ,/,/NNP and/CC/VBZ systemic/JJ/JJ low-grade/JJ/JJ inflammation/NN/NN in/IN/NN the/DT/NN short/JJ/NN term/NN/NN and/CC/NN whether/IN/NN the/DT
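
The doubled tags in the output above (for example `objective/JJ/JJ`) arise because the `word/TAG` strings read back from `tagged_train.txt` are passed to `pos_tag` again as if they were plain words. A minimal sketch of recovering `(word, tag)` pairs from such a line first; the resulting list has the shape that `ne_chunk` takes as input (`parse_tagged_line` is a hypothetical helper name):

```python
def parse_tagged_line(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash so tokens that contain "/" keep their text
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


line = "prednisolone/NN in/IN improving/VBG pain/NN ,/, mobility/NN ./."
print(parse_tagged_line(line))
```

Because the text was lowercased in step 3, a chunker that relies on capitalization may still find few entities even with correct tags; running it on the original-case text is one option to try.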