# Week 4 Exercise 4.2
Author: Rex Gayas
Course & Section: DSC360-T301 Data Mining: Text Analytics an (2243-1)
Date: 07JAN2024

The text builds a text normalizer. Your assignment is to re-create that normalizer as a reusable Python class (within a .py file). Unlike the book author's version, pass a Pandas Series (e.g., dataframe['column']) to your normalize_corpus function and use apply/lambda for each cleaning function.

In [27]:
import re
import unicodedata
import pandas as pd

# Define a class TextNormalizer
class TextNormalizer:
    def __init__(self):
        # Define a customized set of stopwords as an example
        self.stop_words = set(['a', 'an', 'the', 'and', 'or', 'for', 'to', 'of', 'in', 'on', 'at', 'by', 'up', 'out', 'as'])

    def strip_html_tags(self, text):
        # Define a function to strip HTML tags
        pattern = re.compile('<.*?>')
        return re.sub(pattern, '', text)

    def remove_accented_chars(self, text):
        # Define a function to remove accents: decompose each character (NFKD) so the
        # base letter is kept, then drop the non-ASCII combining marks
        text = unicodedata.normalize('NFKD', text)
        return text.encode('ascii', 'ignore').decode('ascii')

    def text_to_lower(self, text):
        # Define a function to convert text to lowercase
        return text.lower()

    def remove_extra_newlines(self, text):
        # Define a function to remove extra newlines
        return re.sub(r'\r\n|\r|\n', ' ', text)

    def remove_special_characters_and_digits(self, text, remove_digits=False):
        # Define a function to remove special characters; digits are removed too
        # only when remove_digits is True (the default keeps them)
        pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
        return re.sub(pattern, '', text)

    def remove_extra_whitespace(self, text):
        # Define a function to remove extra whitespaces
        return re.sub(' +', ' ', text)

    def remove_stopwords(self, text):
        # Define a function to remove stopwords
        return ' '.join([word for word in text.split() if word not in self.stop_words])

    def normalize_corpus(self, corpus):
        # Apply the preprocessing methods to the dataframe series using apply and lambda for each cleaning function
        corpus = corpus.apply(lambda x: self.strip_html_tags(x))
        corpus = corpus.apply(lambda x: self.remove_accented_chars(x))
        corpus = corpus.apply(lambda x: self.text_to_lower(x))
        corpus = corpus.apply(lambda x: self.remove_extra_newlines(x))
        corpus = corpus.apply(lambda x: self.remove_special_characters_and_digits(x))
        corpus = corpus.apply(lambda x: self.remove_extra_whitespace(x))
        corpus = corpus.apply(lambda x: self.remove_stopwords(x))
        return corpus
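
Two of the character-level steps are easy to check in isolation on single strings. The sketch below is only illustrative: the helper names `strip_accents` and `strip_special` and both sample strings are assumptions, not part of the notebook. It compares encoding straight to ASCII with decomposing the text via `unicodedata.normalize('NFKD', ...)` first, and shows a special-character filter whose flag drops digits only when it is set.

```python
import re
import unicodedata


def strip_accents(text):
    # Hypothetical helper: decompose accented characters (NFKD) so the base
    # letter survives, then discard the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_special(text, remove_digits=False):
    # Hypothetical helper: keep letters and whitespace; keep digits as well
    # unless remove_digits is True.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)


# Sample with precomposed accented characters (written as escapes to be explicit)
accent_sample = "Caf\u00e9 d\u00e9j\u00e0 vu"
plain_encode = accent_sample.encode("ascii", "ignore").decode("ascii")
normalized = strip_accents(accent_sample)
print(repr(plain_encode))  # accented letters are dropped entirely
print(repr(normalized))    # accented letters are reduced to their base letters

special_sample = "Edition 12, (#15 in our series)!"
keep_digits = strip_special(special_sample)
drop_digits = strip_special(special_sample, remove_digits=True)
print(repr(keep_digits))
print(repr(drop_digits))
```

Without the NFKD step, an accented letter disappears instead of being mapped to its unaccented form, which can merge or truncate words before stopword removal runs.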



Using your new text normalizer, create a Jupyter Notebook that uses this class to clean up the text found in the file big.txt (that text file is in the GitHub for Week 4 repository). Your resulting text should be a (long) single stream of text.

In [29]:
import pandas as pd
import requests
from io import StringIO

# GitHub link to the big.txt file
file_url = 'https://raw.githubusercontent.com/bellevue-university/dsc360/main/12%20Week/week_4/big.txt'

# Use requests to get the content of the file
response = requests.get(file_url)

# Check the request was successful
if response.status_code == 200:
    # Use StringIO to simulate a file-like object
    file_content = StringIO(response.text)
    
    # Read the content line by line
    lines = file_content.readlines()
    
    # Create a DataFrame from the lines
    df = pd.DataFrame(lines, columns=['text'])
    
    # Instantiate the TextNormalizer
    normalizer = TextNormalizer()
    
    # Normalize the corpus
    df['normalized'] = normalizer.normalize_corpus(df['text'])
    
    # Combine into a single stream of text
    single_stream_text = df['normalized'].str.cat(sep=' ')
    
    # Print the first 5000 characters of the normalized text to check the output
    print(single_stream_text[:5000])
else:
    print(f"Failed to download the file: {response.status_code}")


project gutenberg ebook adventures sherlock holmes sir arthur conan doyle 15 our series sir arthur conan doyle  copyright laws are changing all over world be sure check copyright laws your country before downloading redistributing this any other project gutenberg ebook  this header should be first thing seen when viewing this project gutenberg file please do not remove it do not change edit header without written permission  please read legal small print other information about ebook project gutenberg bottom this file included is important information about your specific rights restrictions how file may be used you can also find about how make donation project gutenberg how get involved   welcome world free plain vanilla electronic texts  ebooks readable both humans computers since 1971  these ebooks were prepared thousands volunteers   title adventures sherlock holmes  author sir arthur conan doyle  release date march 1999 ebook 1661 most recently updated november 29 2002  edition 12 
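
The stream above still shows doubled spaces where blank lines in big.txt became empty strings before the join. One possible way to tighten it, sketched here with a hypothetical helper `to_single_stream` and made-up sample lines rather than the real file, is to split on any whitespace after joining and re-join with single spaces.

```python
def to_single_stream(lines):
    # Hypothetical helper: join every line, split on runs of whitespace
    # (which also discards empty entries), and re-join with single spaces.
    return " ".join(" ".join(lines).split())


# Made-up cleaned lines, including the empty strings that blank lines leave behind
cleaned_lines = [
    "project gutenberg ebook",
    "",
    "copyright laws are changing",
    " ",
    "welcome world",
]

stream = to_single_stream(cleaned_lines)
print(stream)
```

With a Pandas Series, the same idea could be applied to the result of `str.cat(sep=' ')`, since that call returns an ordinary Python string.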

Using spaCy and NLTK, show the tokens, lemmas, parts of speech, and dependencies in the first 1,021 characters of big.txt.

In [30]:
import spacy
import requests
import nltk

# Utilize spaCy's English model
nlp = spacy.load("en_core_web_sm")

# GitHub link to the big.txt file
file_url = "https://raw.githubusercontent.com/bellevue-university/dsc360/main/12%20Week/week_4/big.txt"

# Download the file (the full text is fetched; it is truncated below)
response = requests.get(file_url)
if response.status_code != 200:
    raise Exception(f"Failed to download the file: {response.status_code}")

# Keep only the first 1,021 characters, as the exercise requests
text = response.text[:1021]

# Process text with spaCy
doc = nlp(text)

# Print each token with its lemma, part of speech, and dependency label
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Dependency: {token.dep_}")



Token: The, Lemma: the, POS: DET, Dependency: det
Token: Project, Lemma: Project, POS: PROPN, Dependency: compound
Token: Gutenberg, Lemma: Gutenberg, POS: PROPN, Dependency: compound
Token: EBook, Lemma: EBook, POS: PROPN, Dependency: nmod
Token: of, Lemma: of, POS: ADP, Dependency: prep
Token: The, Lemma: the, POS: DET, Dependency: det
Token: Adventures, Lemma: Adventures, POS: PROPN, Dependency: pobj
Token: of, Lemma: of, POS: ADP, Dependency: prep
Token: Sherlock, Lemma: Sherlock, POS: PROPN, Dependency: compound
Token: Holmes, Lemma: Holmes, POS: PROPN, Dependency: pobj
Token: 
, Lemma: 
, POS: SPACE, Dependency: dep
Token: by, Lemma: by, POS: ADP, Dependency: prep
Token: Sir, Lemma: Sir, POS: PROPN, Dependency: compound
Token: Arthur, Lemma: Arthur, POS: PROPN, Dependency: compound
Token: Conan, Lemma: Conan, POS: PROPN, Dependency: compound
Token: Doyle, Lemma: Doyle, POS: PROPN, Dependency: pobj
Token: 
, Lemma: 
, POS: SPACE, Dependency: dep
Token: (, Lemma: (, POS: PUNCT, Dep