Q3: Clean and preprocess a research paper abstract for computational analysis to enhance searchability
and data retrieval.
Sample Text: "Abstract—Genomic Sequence Data Analysis: With the advent of high-throughput
sequencing technologies, vast amounts of genomic data have been generated. This paper discusses
computational methods for sequence data analysis, focusing on algorithms and machine learning
techniques. Visit our research portal for more information: www.genomeanalysis.net. Keywords:
Genomics, Bioinformatics, Machine Learning, Sequencing. @Genome_Research"
Tasks:
1. Cleaning Text Data:
   • Remove URLs, email addresses, hashtags, mentions, and the keywords section.
   • Eliminate all punctuation and special characters.
2. Lowercasing and Handling Non-Alphanumeric Characters:
   • Convert all text to lowercase.
   • Ensure that only spaces and alphanumeric characters remain.
3. Tokenization:
   • Perform word and sentence tokenization using NLTK.
4. Normalization Techniques:
   • Apply both the Porter and Snowball stemmers to the tokenized words.
5. POS Tagging:
   • Conduct POS tagging on the cleaned and tokenized text using NLTK's default POS tagger.
Expected Outputs:
• Cleaned Text: "abstract genomic sequence data analysis with the advent of high throughput
sequencing technologies vast amounts of genomic data have been generated this paper
discusses computational methods for sequence data analysis focusing on algorithms and
machine learning techniques"
• Word Tokenization (NLTK): ['abstract', 'genomic', 'sequence', 'data', 'analysis', 'with', 'the',
'advent', 'of', 'high', 'throughput', 'sequencing', 'technologies', 'vast', 'amounts', 'of', 'genomic',
'data', 'have', 'been', 'generated', 'this', 'paper', 'discusses', 'computational', 'methods', 'for',
'sequence', 'data', 'analysis', 'focusing', 'on', 'algorithms', 'and', 'machine', 'learning', 'techniques']
• Sentence Tokenization (NLTK): ["abstract genomic sequence data analysis with the advent
of high throughput sequencing technologies vast amounts of genomic data have been
generated", "this paper discusses computational methods for sequence data analysis focusing
on algorithms and machine learning techniques"]
• Stemming Output (Porter and Snowball): The two stemmers produce largely the same stems
for this text. Among the stems shown in the recorded run below, they differ only on
'generated' (Porter 'gener', Snowball 'generat') and 'this' (Porter 'thi', Snowball 'this').
• POS Tagging (NLTK): [('abstract', 'NN'), ('genomic', 'JJ'), ('sequence', 'NN'), ('data', 'NNS'),
('analysis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('advent', 'NN'), ('of', 'IN'), ('high', 'JJ'), ('throughput',
'NN'), ('sequencing', 'NN'), ('technologies', 'NNS'), ('vast', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'),
('genomic', 'JJ'), ('data', 'NNS'), ('have', 'VBP'), ('been', 'VBN'), ('generated', 'VBN'), ('this',
'DT'), ('paper', 'NN'), ('discusses', 'VBZ'), ('computational', 'JJ'), ('methods', 'NNS'), ('for', 'IN'),
('sequence', 'NN'), ('data', 'NNS'), ('analysis', 'NN'), ('focusing', 'VBG'), ('on', 'IN'), ('algorithms',
'NNS'), ('and', 'CC'), ('machine', 'NN'), ('learning', 'VBG'), ('techniques', 'NNS')]


In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tag import pos_tag

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [16]:
# Sample abstract
text = """Abstract-Genomic Sequence Data Analysis: With the advent of high-throughput sequencing technologies,
vast amounts of genomic data have been generated. This paper discusses computational methods for sequence data analysis,
focusing on algorithms and machine learning techniques. Visit our research portal for more information: www.genomeanalysis.net.
Keywords: Genomics, Bioinformatics, Machine Learning, Sequencing @Genome_Research"""

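#1 Cleaning the text
text = text.lower()
text = re.sub(r"http\S+|www\S+|@\S+|#\S+", "", text)  # Remove URLs, mentions, hashtags
# Intended to remove the keywords section, but the text is already lowercase,
# so the capitalized pattern never matches and the keywords stay in the output
text = re.sub(r"Keywords:.*", "", text)
# Remove special characters and punctuation; hyphens are deleted rather than
# replaced by a space, which fuses "high-throughput" into "highthroughput"
text = re.sub(r"[^a-zA-Z0-9\s]", "", text)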
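The cell above lowercases the text before applying the case-sensitive `Keywords:` pattern and deletes punctuation outright, so the keywords survive and hyphenated words are fused. A minimal sketch of a cleaning step that avoids both problems might look like the following. The function name `clean_abstract`, the exact patterns, and the `www.example.org` URL in the usage line are illustrative assumptions, not part of the original notebook. Note that it still keeps the "visit our research portal" sentence, which the expected output omits.

```python
import re

def clean_abstract(raw: str) -> str:
    """Lowercase `raw` and keep only letters, digits and single spaces."""
    text = raw.lower()
    # Drop the keywords section; DOTALL lets '.' cross line breaks to the end.
    text = re.sub(r"keywords:.*", " ", text, flags=re.DOTALL)
    # Drop URLs, email addresses, mentions and hashtags.
    text = re.sub(r"https?://\S+|www\.\S+|\S+@\S+|[@#]\w+", " ", text)
    # Turn every remaining non-alphanumeric character into a space so that
    # "high-throughput" becomes "high throughput" instead of "highthroughput".
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())


print(clean_abstract("High-throughput sequencing, see www.example.org. Keywords: Genomics"))
# high throughput sequencing see
```

Removing the keywords section before anything else also disposes of the trailing mention, since it sits after `Keywords:` in the sample text.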


In [17]:
#2 Sentence Tokenization
print("Sentence Tokenization:")
sentences = sent_tokenize(text)
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")


Sentence Tokenization:
Sentence 1: abstractgenomic sequence data analysis with the advent of highthroughput sequencing technologies 
vast amounts of genomic data have been generated this paper discusses computational methods for sequence data analysis 
focusing on algorithms and machine learning techniques visit our research portal for more information  
keywords genomics bioinformatics machine learning sequencing
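Only one sentence comes back because the periods were stripped before `sent_tokenize` ran, leaving it no boundaries to find. One way to get per-sentence output is to split first and clean each sentence afterwards. The sketch below assumes that order. `regex_sentences` is a naive stand-in splitter so the block runs without NLTK data, and `split_then_clean` and its `split` parameter are hypothetical names. Dropping any sentence that contains a URL is also an assumption, made because the expected output omits the "research portal" sentence.

```python
import re
from typing import Callable, List

def regex_sentences(text: str) -> List[str]:
    """Naive stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_then_clean(raw: str,
                     split: Callable[[str], List[str]] = regex_sentences) -> List[str]:
    """Split into sentences while punctuation is intact, then clean each one."""
    # Remove the keywords section (and anything after it) before splitting.
    body = re.sub(r"keywords:.*", "", raw, flags=re.IGNORECASE | re.DOTALL)
    cleaned = []
    for sentence in split(body):
        # Assumption: a sentence that carries a URL is dropped as a whole.
        if re.search(r"https?://|www\.", sentence, flags=re.IGNORECASE):
            continue
        # Lowercase, replace non-alphanumerics with spaces, normalize whitespace.
        words = re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).split()
        if words:
            cleaned.append(" ".join(words))
    return cleaned
```

Passing `split=sent_tokenize` (from `nltk.tokenize`) swaps in NLTK's tokenizer once the `punkt_tab` data has been downloaded.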


In [18]:
#3 Word Tokenization
print("\nWord Tokenization:")
# Word Tokenization for each sentence
for i, sentence in enumerate(sentences, 1):
    words = word_tokenize(sentence)
    print(f"Words in Sentence {i}: {words}")


Word Tokenization:
Words in Sentence 1: ['abstractgenomic', 'sequence', 'data', 'analysis', 'with', 'the', 'advent', 'of', 'highthroughput', 'sequencing', 'technologies', 'vast', 'amounts', 'of', 'genomic', 'data', 'have', 'been', 'generated', 'this', 'paper', 'discusses', 'computational', 'methods', 'for', 'sequence', 'data', 'analysis', 'focusing', 'on', 'algorithms', 'and', 'machine', 'learning', 'techniques', 'visit', 'our', 'research', 'portal', 'for', 'more', 'information', 'keywords', 'genomics', 'bioinformatics', 'machine', 'learning', 'sequencing']


In [19]:
#4 PorterStemmer
ps = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [ps.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)

Original Words: ['abstractgenomic', 'sequence', 'data', 'analysis', 'with', 'the', 'advent', 'of', 'highthroughput', 'sequencing', 'technologies', 'vast', 'amounts', 'of', 'genomic', 'data', 'have', 'been', 'generated', 'this', 'paper', 'discusses', 'computational', 'methods', 'for', 'sequence', 'data', 'analysis', 'focusing', 'on', 'algorithms', 'and', 'machine', 'learning', 'techniques', 'visit', 'our', 'research', 'portal', 'for', 'more', 'information', 'keywords', 'genomics', 'bioinformatics', 'machine', 'learning', 'sequencing']
Stemmed Words: ['abstractgenom', 'sequenc', 'data', 'analysi', 'with', 'the', 'advent', 'of', 'highthroughput', 'sequenc', 'technolog', 'vast', 'amount', 'of', 'genom', 'data', 'have', 'been', 'gener', 'thi', 'paper', 'discuss', 'comput', 'method', 'for', 'sequenc', 'data', 'analysi', 'focus', 'on', 'algorithm', 'and', 'machin', 'learn', 'techniqu', 'visit', 'our', 'research', 'portal', 'for', 'more', 'inform', 'keyword', 'genom', 'bioinformat', 'machin', 

In [20]:
#4 SnowballStemmer
stemmer = SnowballStemmer("english")
stemmed_words = [stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)

Original Words: ['abstractgenomic', 'sequence', 'data', 'analysis', 'with', 'the', 'advent', 'of', 'highthroughput', 'sequencing', 'technologies', 'vast', 'amounts', 'of', 'genomic', 'data', 'have', 'been', 'generated', 'this', 'paper', 'discusses', 'computational', 'methods', 'for', 'sequence', 'data', 'analysis', 'focusing', 'on', 'algorithms', 'and', 'machine', 'learning', 'techniques', 'visit', 'our', 'research', 'portal', 'for', 'more', 'information', 'keywords', 'genomics', 'bioinformatics', 'machine', 'learning', 'sequencing']
Stemmed Words: ['abstractgenom', 'sequenc', 'data', 'analysi', 'with', 'the', 'advent', 'of', 'highthroughput', 'sequenc', 'technolog', 'vast', 'amount', 'of', 'genom', 'data', 'have', 'been', 'generat', 'this', 'paper', 'discuss', 'comput', 'method', 'for', 'sequenc', 'data', 'analysi', 'focus', 'on', 'algorithm', 'and', 'machin', 'learn', 'techniqu', 'visit', 'our', 'research', 'portal', 'for', 'more', 'inform', 'keyword', 'genom', 'bioinformat', 'machin
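The two printed lists are long, and the places where they disagree are easy to miss by eye. A small helper can surface them. `stem_differences` below is a hypothetical name, and the three lists are a four-token excerpt copied from the cell outputs above rather than a fresh run.

```python
from typing import List, Tuple

def stem_differences(words: List[str],
                     stems_a: List[str],
                     stems_b: List[str]) -> List[Tuple[str, str, str]]:
    """Return (word, stem_a, stem_b) wherever the two stem lists disagree."""
    return [(w, a, b) for w, a, b in zip(words, stems_a, stems_b) if a != b]

# Excerpt from the recorded outputs: original tokens, Porter stems, Snowball stems.
words = ["generated", "this", "paper", "discusses"]
porter = ["gener", "thi", "paper", "discuss"]
snowball = ["generat", "this", "paper", "discuss"]

print(stem_differences(words, porter, snowball))
# [('generated', 'gener', 'generat'), ('this', 'thi', 'this')]
```

To run this on the full lists, the Porter result needs its own variable first, because the Snowball cell reuses the name `stemmed_words` and overwrites it.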

In [21]:
#5 POS Tagging
nltk.download('averaged_perceptron_tagger_eng')
# Tag the full cleaned text rather than the leftover loop variable `sentence`
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)

[('abstractgenomic', 'JJ'), ('sequence', 'NN'), ('data', 'NNS'), ('analysis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('advent', 'NN'), ('of', 'IN'), ('highthroughput', 'NN'), ('sequencing', 'VBG'), ('technologies', 'NNS'), ('vast', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('genomic', 'JJ'), ('data', 'NNS'), ('have', 'VBP'), ('been', 'VBN'), ('generated', 'VBN'), ('this', 'DT'), ('paper', 'NN'), ('discusses', 'VBZ'), ('computational', 'JJ'), ('methods', 'NNS'), ('for', 'IN'), ('sequence', 'NN'), ('data', 'NNS'), ('analysis', 'NN'), ('focusing', 'VBG'), ('on', 'IN'), ('algorithms', 'NN'), ('and', 'CC'), ('machine', 'NN'), ('learning', 'VBG'), ('techniques', 'NNS'), ('visit', 'VB'), ('our', 'PRP$'), ('research', 'NN'), ('portal', 'NN'), ('for', 'IN'), ('more', 'JJR'), ('information', 'NN'), ('keywords', 'NNS'), ('genomics', 'NNS'), ('bioinformatics', 'NNS'), ('machine', 'NN'), ('learning', 'VBG'), ('sequencing', 'VBG')]


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
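The tags printed above differ from the expected output in a few places. Some differences come from the fused tokens ('abstractgenomic', 'highthroughput'), which change the input, and some from the tagger assigning a different tag to the same word. A sketch for listing the same-token disagreements follows. `tag_mismatches` is a hypothetical helper, and both lists are selected pairs copied from the expected output and the recorded run.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]

def tag_mismatches(expected: Tagged, actual: Tagged) -> List[Tuple[str, str, str]]:
    """Return (token, expected_tag, actual_tag) for aligned tokens whose tags differ."""
    mismatches = []
    for (exp_tok, exp_tag), (act_tok, act_tag) in zip(expected, actual):
        # Only compare positions where both lists hold the same token.
        if exp_tok == act_tok and exp_tag != act_tag:
            mismatches.append((exp_tok, exp_tag, act_tag))
    return mismatches

# Selected pairs from the expected output and from the recorded run.
expected = [("sequencing", "NN"), ("technologies", "NNS"), ("on", "IN"), ("algorithms", "NNS")]
actual = [("sequencing", "VBG"), ("technologies", "NNS"), ("on", "IN"), ("algorithms", "NN")]

print(tag_mismatches(expected, actual))
# [('sequencing', 'NN', 'VBG'), ('algorithms', 'NNS', 'NN')]
```

The comparison is positional, so it only makes sense once both lists hold the same tokens in the same order, which in this notebook requires fixing the cleaning step first.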
