In [36]:
#Libraries for requesting the web page and parsing its HTML.
import requests
from bs4 import BeautifulSoup

#The Natural Language Toolkit (NLTK) is used here for stop words and tokenizing; it also supports classification, stemming, tagging, parsing, etc.
import nltk

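#nltk.corpus and nltk.tokenize are modules in nltk.
#First-time setup (run once): nltk.download('stopwords') and nltk.download('punkt'); newer NLTK releases may ask for 'punkt_tab' instead.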
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [37]:
#Fetching the HTML content from the website.
html_text = requests.get("https://www.britannica.com/sports/sports").text

#soup object helps search and navigate through the html structure.
soup = BeautifulSoup(html_text, 'lxml')

#Capturing the text of the first <p> element with the class "topic-paragraph".
blog = soup.find('p', class_="topic-paragraph").text

#Defining a function that loads the stop words, tokenizes the text, and keeps only the words that are not stop words.
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)

    #"word.lower()" lowercases each word for the comparison only; the original casing is kept in the output.
    #This is needed because the stop word list is lowercase and string comparison is case-sensitive: ord('a') = 97 and ord('A') = 65, so 'The' != 'the'.
    filtered_text = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_text)

#Removing the stopwords from the blog by calling the function above.
cleaned_blog = remove_stopwords(blog)

#Printing the original blog.
print("\033[1m\033[4mOriginal Blog:\033[0m \n")
print(blog)

print('')

#Printing the cleaned blog.
print("\033[1m\033[4mBlog without stop words:\033[0m \n")
print(cleaned_blog)

[1m[4mOriginal Blog:[0m 

sports,  physical contests pursued for the goals and challenges they entail. Sports are part of every culture past and present, but each culture has its own definition of sports. The most useful definitions are those that clarify the relationship of sports to play, games, and contests. “Play,” wrote the German theorist Carl Diem, “is purposeless activity, for its own sake, the opposite of work.” Humans work because they have to; they play because they want to. Play is autotelic—that is, it has its own goals. It is voluntary and uncoerced. Recalcitrant children compelled by their parents or teachers to compete in a game of football (soccer) are not really engaged in a sport. Neither are professional athletes if their only motivation is their paycheck. In the real world, as a practical matter, motives are frequently mixed and often quite impossible to determine. Unambiguous definition is nonetheless a prerequisite to practical determinations about what is and
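
The print calls in the cell above rely on ANSI escape codes: `\033[1m` switches on bold, `\033[4m` switches on underline, and `\033[0m` resets the formatting. A minimal sketch that keeps the codes in one place; `heading` is a hypothetical helper name, and whether the styling renders depends on the terminal or notebook front end.

```python
# ANSI escape sequences used by the print calls in the notebook.
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
RESET = "\033[0m"

def heading(text):
    """Wrap text in bold + underline codes and reset the style afterwards."""
    return f"{BOLD}{UNDERLINE}{text}{RESET}"

print(heading("Original Blog:"), "\n")
```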

In [30]:
len(blog)

998

In [32]:
len(cleaned_blog)

711
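
Both values above are character counts, so the drop from 998 to 711 can be expressed as a percentage. A minimal sketch using the two numbers printed above; `reduction_pct` is a hypothetical helper, not part of the notebook.

```python
def reduction_pct(original_len, cleaned_len):
    """Return the percentage of characters removed, rounded to one decimal."""
    if original_len == 0:
        return 0.0
    return round(100 * (original_len - cleaned_len) / original_len, 1)

# Character counts reported by len(blog) and len(cleaned_blog).
pct = reduction_pct(998, 711)
print(f"Stop word removal shortened the text by {pct}%")  # 28.8%
```

Because the cleaned text is rebuilt with `' '.join`, punctuation tokens gain a leading space, so this figure slightly understates the share of characters that belonged to stop words.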

In [34]:
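# Stop word lookups are case-sensitive: "an", "the" and "for" are in the list,
# but "FOR" and "THE" are not.

#ASCII codes differ between lower-case and upper-case letters.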
ord('a')

97

In [35]:
ord('A')

65
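
The differing `ord` values above are the reason the filter lowercases each word before the lookup: membership tests on a set of strings are case-sensitive. A minimal sketch of the effect, assuming a small hand-written stop word set in place of NLTK's list (so it runs without any downloads) and `str.split` in place of `word_tokenize`; `filter_words` is a hypothetical name.

```python
# Hypothetical stand-in for stopwords.words('english'); NLTK's list is also lowercase.
STOP_WORDS = {"the", "for", "an", "are", "of"}

def filter_words(text, normalize):
    """Drop stop words; lowercase each word for the lookup only if normalize is True."""
    kept = []
    for word in text.split():
        key = word.lower() if normalize else word
        if key not in STOP_WORDS:
            kept.append(word)  # the original casing is preserved in the output
    return " ".join(kept)

sample = "The goals FOR an athlete are part of THE sport"
print(filter_words(sample, normalize=False))  # The goals FOR athlete part THE sport
print(filter_words(sample, normalize=True))   # goals athlete part sport
```

Without normalization, "The", "FOR" and "THE" slip through because only their lowercase forms are in the set.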