# Text Preprocessing for Sports Commentary Dataset

In this lab, we perform standard NLP preprocessing steps on a sports commentary dataset:

1. Tokenization  
2. Case folding  
3. Stopword removal  
4. Stemming  
5. Lemmatization

We use a small, hand-written sample of five cricket commentary sentences to demonstrate these techniques.


In [8]:
import pandas as pd

# Create a small sports commentary dataset
data = {
    "commentary": [
        "The batsman hit a magnificent six over the boundary.",
        "The bowler delivered a perfect yorker.",
        "A brilliant catch by the fielder saved the match.",
        "The team celebrated their victory with cheers.",
        "Fans enjoyed the thrilling last over of the game."
    ],
    "sentiment": ["positive", "positive", "positive", "positive", "positive"]
}

df = pd.DataFrame(data)
print("Number of rows:", len(df))
df.head()


Number of rows: 5


Unnamed: 0,commentary,sentiment
0,The batsman hit a magnificent six over the bou...,positive
1,The bowler delivered a perfect yorker.,positive
2,A brilliant catch by the fielder saved the match.,positive
3,The team celebrated their victory with cheers.,positive
4,Fans enjoyed the thrilling last over of the game.,positive


## Tokenization and Cleaning

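We strip HTML line-break tags (`<br>`) from the text and split each commentary into individual tokens with NLTK's `word_tokenize`.  
The sample contains no such tags, so the cleaning step changes nothing here. Punctuation is kept as separate tokens (for example the final "."), and it is filtered out later during stopword removal.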
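
To make the splitting step concrete, here is a minimal sketch of a regex-based tokenizer that uses only the standard library. It is not how `word_tokenize` works internally, since NLTK's tokenizer also handles contractions, abbreviations, and other cases. The names `simple_tokenize` and `TOKEN_PATTERN` are hypothetical and introduced only for illustration.

```python
import re

# Hypothetical helper: a rough stand-in for word_tokenize, for illustration only.
# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("The bowler delivered a perfect yorker.")
print(tokens)
# ['The', 'bowler', 'delivered', 'a', 'perfect', 'yorker', '.']
```

On a simple sentence like this one, the sketch gives the same tokens as the `tokens` column shown in the output of this section. It would diverge on text with contractions such as "didn't".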


In [9]:
import nltk
from nltk.tokenize import word_tokenize

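# Tokenizer models for word_tokenize; newer NLTK releases may need 'punkt_tab' instead
nltk.download('punkt')

# Strip HTML line-break tags such as <br> or <br/> (the sample contains none)
df['commentary'] = df['commentary'].str.replace(r'<br\s*/?>', '', regex=True)

# Tokenization: split each commentary into word and punctuation tokens
df['tokens'] = df['commentary'].apply(word_tokenize)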
df.head()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,commentary,sentiment,tokens
0,The batsman hit a magnificent six over the bou...,positive,"[The, batsman, hit, a, magnificent, six, over,..."
1,The bowler delivered a perfect yorker.,positive,"[The, bowler, delivered, a, perfect, yorker, .]"
2,A brilliant catch by the fielder saved the match.,positive,"[A, brilliant, catch, by, the, fielder, saved,..."
3,The team celebrated their victory with cheers.,positive,"[The, team, celebrated, their, victory, with, ..."
4,Fans enjoyed the thrilling last over of the game.,positive,"[Fans, enjoyed, the, thrilling, last, over, of..."


## Case Folding

We convert all tokens to lowercase so that variants such as "The" and "the" are treated as the same word.  
This matters for the next step, because the NLTK stopword list is lowercase.
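
As a small aside, the sketch below contrasts `str.lower()`, which this lab uses, with `str.casefold()`, which applies the more aggressive Unicode case folding. For plain English text like this dataset the two should give the same result. The German example word is an assumption added purely to show where they differ.

```python
# str.lower() is sufficient for plain English text.
tokens = ["The", "Bowler", "DELIVERED"]
lowered = [t.lower() for t in tokens]
print(lowered)  # ['the', 'bowler', 'delivered']

# str.casefold() applies Unicode case folding, which also maps "ß" to "ss".
word = "Straße"  # German for "street", used only as an illustration
print(word.lower())     # straße
print(word.casefold())  # strasse

# Caseless comparison only succeeds with casefold() here.
print("STRASSE".lower() == word.lower())        # False
print("STRASSE".casefold() == word.casefold())  # True
```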


In [10]:
df['tokens_lower'] = df['tokens'].apply(lambda x: [word.lower() for word in x])
df.head()


Unnamed: 0,commentary,sentiment,tokens,tokens_lower
0,The batsman hit a magnificent six over the bou...,positive,"[The, batsman, hit, a, magnificent, six, over,...","[the, batsman, hit, a, magnificent, six, over,..."
1,The bowler delivered a perfect yorker.,positive,"[The, bowler, delivered, a, perfect, yorker, .]","[the, bowler, delivered, a, perfect, yorker, .]"
2,A brilliant catch by the fielder saved the match.,positive,"[A, brilliant, catch, by, the, fielder, saved,...","[a, brilliant, catch, by, the, fielder, saved,..."
3,The team celebrated their victory with cheers.,positive,"[The, team, celebrated, their, victory, with, ...","[the, team, celebrated, their, victory, with, ..."
4,Fans enjoyed the thrilling last over of the game.,positive,"[Fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, the, thrilling, last, over, of..."


## Stopword Removal

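We remove common English stopwords (like "the", "a", "of"), which carry little meaning on their own, and we also drop non-alphabetic tokens such as punctuation.  
This step reduces noise in the text. Note that the NLTK list includes "over", so the cricket term is removed from the first and last sentences along with the function words.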
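
A general-purpose stopword list can remove words that matter in a specific domain. In the output of this section, "over" disappears from the first and last sentences even though it is a cricket term in the last one. The sketch below shows one way to keep such terms by subtracting them from the stopword set. The tiny `base_stop_words` set is a hand-picked subset for illustration, not the NLTK list, and `domain_terms` and `remove_stopwords` are hypothetical names.

```python
# Illustrative subset of English stopwords; NOT the full NLTK list.
base_stop_words = {"the", "a", "of", "over", "by", "their", "with"}

# Assumption: in cricket commentary "over" is a meaningful term, so keep it.
domain_terms = {"over"}
custom_stop_words = base_stop_words - domain_terms

def remove_stopwords(tokens, stop_words):
    """Keep alphabetic tokens that are not in stop_words."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

tokens = ["fans", "enjoyed", "the", "thrilling", "last", "over", "of", "the", "game", "."]

print(remove_stopwords(tokens, base_stop_words))
# ['fans', 'enjoyed', 'thrilling', 'last', 'game']

print(remove_stopwords(tokens, custom_stop_words))
# ['fans', 'enjoyed', 'thrilling', 'last', 'over', 'game']
```

With the real NLTK list, the same idea would be `set(stopwords.words('english')) - {"over"}`. Whether to keep "over" depends on the downstream task, since the word is also an ordinary preposition in the first sentence.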


In [11]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
print("Number of stopwords:", len(stop_words))

# Keep alphabetic tokens that are not stopwords (isalpha() also drops punctuation)
df['tokens_nostop'] = df['tokens_lower'].apply(
    lambda x: [word for word in x if word.isalpha() and word not in stop_words]
)
df.head()


Number of stopwords: 198


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,commentary,sentiment,tokens,tokens_lower,tokens_nostop
0,The batsman hit a magnificent six over the bou...,positive,"[The, batsman, hit, a, magnificent, six, over,...","[the, batsman, hit, a, magnificent, six, over,...","[batsman, hit, magnificent, six, boundary]"
1,The bowler delivered a perfect yorker.,positive,"[The, bowler, delivered, a, perfect, yorker, .]","[the, bowler, delivered, a, perfect, yorker, .]","[bowler, delivered, perfect, yorker]"
2,A brilliant catch by the fielder saved the match.,positive,"[A, brilliant, catch, by, the, fielder, saved,...","[a, brilliant, catch, by, the, fielder, saved,...","[brilliant, catch, fielder, saved, match]"
3,The team celebrated their victory with cheers.,positive,"[The, team, celebrated, their, victory, with, ...","[the, team, celebrated, their, victory, with, ...","[team, celebrated, victory, cheers]"
4,Fans enjoyed the thrilling last over of the game.,positive,"[Fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, thrilling, last, game]"


## Stemming

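Stemming strips suffixes by rule to reduce related word forms to a common stem, which need not be a dictionary word.  
For example, the Porter stemmer turns "delivered" into "deliv" and "boundary" into "boundari".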
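
To illustrate the idea of rule-based suffix stripping, here is a deliberately naive sketch. It is far simpler than the Porter algorithm used in this lab and will not reproduce its output. `naive_stem` and `SUFFIXES` are hypothetical names, and the suffix list is an arbitrary choice made for the example.

```python
# Hypothetical, ordered suffix list: longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["celebrated", "celebrating", "celebrates", "cheers", "match"]:
    print(w, "->", naive_stem(w))
# celebrated -> celebrat
# celebrating -> celebrat
# celebrates -> celebrat
# cheers -> cheer
# match -> match
```

The three forms of "celebrate" collapse to the same stem, "celebrat", which is not itself a word. This is the trade-off stemming makes: it groups related forms cheaply but gives up readable output.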


In [12]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
df['tokens_stemmed'] = df['tokens_nostop'].apply(lambda x: [stemmer.stem(word) for word in x])
df.head()


Unnamed: 0,commentary,sentiment,tokens,tokens_lower,tokens_nostop,tokens_stemmed
0,The batsman hit a magnificent six over the bou...,positive,"[The, batsman, hit, a, magnificent, six, over,...","[the, batsman, hit, a, magnificent, six, over,...","[batsman, hit, magnificent, six, boundary]","[batsman, hit, magnific, six, boundari]"
1,The bowler delivered a perfect yorker.,positive,"[The, bowler, delivered, a, perfect, yorker, .]","[the, bowler, delivered, a, perfect, yorker, .]","[bowler, delivered, perfect, yorker]","[bowler, deliv, perfect, yorker]"
2,A brilliant catch by the fielder saved the match.,positive,"[A, brilliant, catch, by, the, fielder, saved,...","[a, brilliant, catch, by, the, fielder, saved,...","[brilliant, catch, fielder, saved, match]","[brilliant, catch, fielder, save, match]"
3,The team celebrated their victory with cheers.,positive,"[The, team, celebrated, their, victory, with, ...","[the, team, celebrated, their, victory, with, ...","[team, celebrated, victory, cheers]","[team, celebr, victori, cheer]"
4,Fans enjoyed the thrilling last over of the game.,positive,"[Fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, thrilling, last, game]","[fan, enjoy, thrill, last, game]"


## Lemmatization

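Lemmatization also reduces words to a base form (the lemma), but it looks each word up in a dictionary, here WordNet, so the result is a dictionary form rather than a truncated stem.  
NLTK's `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is passed, which is why "fans" becomes "fan" in the output while verb forms such as "delivered" and "enjoyed" are left unchanged.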
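
The sketch below shows why the part of speech matters for lemmatization, using a toy lookup table instead of WordNet. The `LEMMAS` dictionary and `toy_lemmatize` function are hypothetical and cover only a handful of words from this dataset. The default of `pos="n"` mirrors the default of `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
# "n" = noun, "v" = verb; real lemmatizers use a full lexicon such as WordNet.
LEMMAS = {
    ("saved", "v"): "save",
    ("delivered", "v"): "deliver",
    ("fans", "n"): "fan",
    ("cheers", "n"): "cheer",
    ("cheers", "v"): "cheer",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("fans"))            # fan
print(toy_lemmatize("saved"))           # saved  (looked up as a noun: no entry)
print(toy_lemmatize("saved", pos="v"))  # save
```

With NLTK, the equivalent adjustment would be to pass a tag, as in `lemmatizer.lemmatize(word, pos="v")` for verbs. Choosing the tag for each token automatically requires a part-of-speech tagger, which is outside the scope of this lab.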


In [13]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
df['tokens_lemmatized'] = df['tokens_nostop'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df.head()


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,commentary,sentiment,tokens,tokens_lower,tokens_nostop,tokens_stemmed,tokens_lemmatized
0,The batsman hit a magnificent six over the bou...,positive,"[The, batsman, hit, a, magnificent, six, over,...","[the, batsman, hit, a, magnificent, six, over,...","[batsman, hit, magnificent, six, boundary]","[batsman, hit, magnific, six, boundari]","[batsman, hit, magnificent, six, boundary]"
1,The bowler delivered a perfect yorker.,positive,"[The, bowler, delivered, a, perfect, yorker, .]","[the, bowler, delivered, a, perfect, yorker, .]","[bowler, delivered, perfect, yorker]","[bowler, deliv, perfect, yorker]","[bowler, delivered, perfect, yorker]"
2,A brilliant catch by the fielder saved the match.,positive,"[A, brilliant, catch, by, the, fielder, saved,...","[a, brilliant, catch, by, the, fielder, saved,...","[brilliant, catch, fielder, saved, match]","[brilliant, catch, fielder, save, match]","[brilliant, catch, fielder, saved, match]"
3,The team celebrated their victory with cheers.,positive,"[The, team, celebrated, their, victory, with, ...","[the, team, celebrated, their, victory, with, ...","[team, celebrated, victory, cheers]","[team, celebr, victori, cheer]","[team, celebrated, victory, cheer]"
4,Fans enjoyed the thrilling last over of the game.,positive,"[Fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, the, thrilling, last, over, of...","[fans, enjoyed, thrilling, last, game]","[fan, enjoy, thrill, last, game]","[fan, enjoyed, thrilling, last, game]"
