1. The text builds a text normalizer. Your assignment is to re-create that normalizer as a reusable Python class (in a .py file). Unlike the book author’s version, pass a Pandas Series (e.g., dataframe['column']) to your normalize_corpus function and use apply/lambda for each cleaning function. (Ask questions in Teams if that’s unclear.)

In [1]:
import requests
import pandas as pd

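data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
corpus = data.content        # raw page as bytes
content = corpus[1163:2200]  # byte slice of the page; not used below
# Note: iterating a requests Response yields raw byte chunks,
# so each DataFrame row holds a chunk of bytes, not a line of text.
df = pd.DataFrame(data, columns=['text'])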
df

Unnamed: 0,text
0,"b'<!DOCTYPE html>\r\n<html lang=""en""><head><me..."
1,b'isplay: block;\r\n margin-top: 1em;\r\n ...
2,b'90%;\r\n margin-top: 0;\r\n margin-bot...
3,b'ock;\r\n margin-top: 1em;\r\n margin-b...
4,b';\r\n font-weight: bold;\r\n}\r\n#pg-foot...
...,...
2783,b'med as not protected by copyright in\r\nthe ...
2784,b'n compliance with any particular paper\r\ned...
2785,"b'cility: <a href=""https://www.gutenberg.org"">..."
2786,"b' Gutenberg\xe2\x84\xa2,\r\nincluding how to ..."
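
The rows above are raw byte chunks rather than lines of text, which appears to come from passing the Response object itself to pd.DataFrame. One possible alternative, sketched here, decodes the bytes first and stores one non-empty line per row. The helper name bytes_to_frame and the sample bytes are hypothetical, used so the snippet runs without a network call.

```python
import pandas as pd


def bytes_to_frame(raw: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode raw bytes and return a DataFrame with one non-empty line per row."""
    # errors="replace" keeps decoding from failing on unexpected bytes
    text = raw.decode(encoding, errors="replace")
    # splitlines() handles \r\n, \n, and \r line endings
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return pd.DataFrame({"text": lines})


# Hypothetical stand-in for data.content from the request above
sample = b"01:001:001 In the beginning\r\n\r\n01:001:002 And the earth\r\n"
df_lines = bytes_to_frame(sample)
print(df_lines)
```

With the real download, the same call would be bytes_to_frame(data.content), giving a 'text' column of str values that the cleaning functions can work on directly.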


In [34]:
import re
from bs4 import BeautifulSoup

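def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # Drop embedded iframe and script elements before extracting text
    for s in soup(['iframe', 'script']):
        s.extract()
    stripped_text = soup.get_text()
    # Collapse runs of CR/LF characters into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text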

clean_content = strip_html_tags(corpus)

print(clean_content[1163:2045])

form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.
01:001:006 And God said, Let there be a firmament in the midst of the
           waters, and let it divide the waters from the waters.
01:001:007 And God made the firmament, and divided the waters which were
           under the firmament from the waters which were above the
           firmament: and it was so.
01:001:008 And God called the firmament Heaven. And the evening and the
           morning were the second day.
01:001
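
A note on the newline pattern used when stripping HTML: inside a regex character class, `|` is a literal pipe rather than alternation, so a class that lists `\r`, `|`, and `\n` also matches pipe characters. The short comparison below illustrates the difference on a made-up string; the variable names are only for illustration.

```python
import re

text = "a|b\r\n\r\nc"

# A character class containing '|' treats the pipe as a literal character
with_pipe = re.sub(r'[\r|\n]+', '\n', text)

# A class with only CR and LF leaves pipes untouched
newline_only = re.sub(r'[\r\n]+', '\n', text)

print(repr(with_pipe))
print(repr(newline_only))
```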


In [18]:
import re
import unicodedata

import contractions
import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Fetch the NLTK resources used by the tokenizer, lemmatizer, and stopword filter
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

class TextNormalizer:
    @staticmethod
    def strip_html_tags(text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    @staticmethod
    def expand_contractions(text):
        return contractions.fix(text)

    @staticmethod
    def remove_accented_chars(text):
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text

    @staticmethod
    def to_lowercase(text):
        return text.lower()

    @staticmethod
    def lemmatize_text(text):
        lemmatizer = WordNetLemmatizer()
        tokens = word_tokenize(text)
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
        return ' '.join(lemmatized_tokens)

    @staticmethod
    def remove_special_characters(text):
        return re.sub(r'[^a-zA-Z\s]', '', text)

    @staticmethod
    def remove_extra_whitespaces(text):
        return ' '.join(text.split())

    @staticmethod
    def remove_stopwords(text):
        stop_words = set(stopwords.words('english'))
        tokens = word_tokenize(text)
        filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
        return ' '.join(filtered_tokens)

    @staticmethod
    def remove_digits(text):
        return re.sub(r'\d+', '', text)

    @staticmethod
    def normalize_text(text, html_stripping=True, contraction_expansion=False,
                       accented_char_removal=False, text_lower_case=False,
                       text_lemmatization=False, special_char_removal=False,
                       stopword_removal=False, remove_digits=False,
                       remove_extra_whitespaces=False):

        if html_stripping:
            text = TextNormalizer.strip_html_tags(text)

        if contraction_expansion:
            text = TextNormalizer.expand_contractions(text)

        if accented_char_removal:
            text = TextNormalizer.remove_accented_chars(text)

        if text_lower_case:
            text = TextNormalizer.to_lowercase(text)

        if text_lemmatization:
            text = TextNormalizer.lemmatize_text(text)

        if special_char_removal:
            text = TextNormalizer.remove_special_characters(text)

        if remove_digits:
            text = TextNormalizer.remove_digits(text)

        if stopword_removal:
            text = TextNormalizer.remove_stopwords(text)

        if remove_extra_whitespaces:
            text = TextNormalizer.remove_extra_whitespaces(text)

        return text


if __name__ == '__main__':
    # Download the page and build a DataFrame with a 'text' column.
    # Note: iterating the Response yields raw byte chunks, one per row.
    data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
    corpus = data.content
    content = corpus[1163:2200]
    df = pd.DataFrame(data, columns=['text'])

    # Create an instance of TextNormalizer
    normalizer = TextNormalizer()

    # Apply the normalization process to the 'text' column
    df['normalized_text'] = df['text'].apply(
        lambda x: normalizer.normalize_text(x, html_stripping=True,
                                            remove_extra_whitespaces=True))
    print(df.head())


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bharo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bharo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bharo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  soup = BeautifulSoup(text, "html.parser")


                                                text  \
0  b'<!DOCTYPE html>\r\n<html lang="en"><head><me...   
1  b'isplay: block;\r\n    margin-top: 1em;\r\n  ...   
2  b'90%;\r\n    margin-top: 0;\r\n    margin-bot...   
3  b'ock;\r\n    margin-top: 1em;\r\n    margin-b...   
4  b';\r\n    font-weight: bold;\r\n}\r\n#pg-foot...   

                                     normalized_text  
0                                                     
1  isplay: block; margin-top: 1em; margin-bottom:...  
2  90%; margin-top: 0; margin-bottom: 0; text-ali...  
3  ock; margin-top: 1em; margin-bottom: 1em; text...  
4  ; font-weight: bold; } #pg-footer #project-gut...  
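
The assignment asks for a normalize_corpus function that takes a Pandas Series and applies each cleaning step with apply/lambda. A minimal sketch of that shape follows. It assumes the Series holds str values (not bytes) and uses only three of the cleaning steps, re-implemented inline so the snippet is self-contained; the flag names mirror those of normalize_text, and the demo Series is made up for illustration.

```python
import re

import pandas as pd


def normalize_corpus(series: pd.Series,
                     text_lower_case: bool = True,
                     special_char_removal: bool = True,
                     remove_extra_whitespaces: bool = True) -> pd.Series:
    """Apply the selected cleaning steps to every element of a Series of str."""
    result = series
    if text_lower_case:
        result = result.apply(lambda x: x.lower())
    if special_char_removal:
        # Keep only ASCII letters and whitespace (digits are removed too)
        result = result.apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
    if remove_extra_whitespaces:
        # Collapse any run of whitespace to a single space and trim the ends
        result = result.apply(lambda x: ' '.join(x.split()))
    return result


# Made-up sample rows in the style of the corpus
demo = pd.Series(["And God said,  Let there be light:",
                  "01:001:004 And God saw the light"])
cleaned = normalize_corpus(demo)
print(cleaned.tolist())
```

The remaining TextNormalizer steps (HTML stripping, contraction expansion, lemmatization, stopword removal) could be chained in the same way, for example `result = result.apply(lambda x: TextNormalizer.strip_html_tags(x))`, with each step returning a new Series rather than modifying the input.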
