In [1]:
import re

In [2]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [None]:
paragraph = """
The 2025 FIFA Club World Cup, also marketed as FIFA Club World Cup 25, was the 21st edition and the first of the expanded FIFA Club World Cup, an international club soccer competition organized by FIFA. The tournament was held in the United States from June 14 to July 13, 2025, and featured 32 teams. The expanded format included the continental champions from the past four years as well as additional qualified teams. Chelsea won the tournament, defeating Paris Saint-Germain 3–0 in the final and becoming the inaugural world champions under the expanded format.

The revised structure was modeled more closely on the FIFA World Cup as a quadrennial world championship, replacing the annual seven-team format used between 2000 and 2023. It featured the winners of each continent's top club competition from 2021 to 2024, except for a single entry from Oceania. Additional slots were awarded to clubs from Europe and South America based on rankings across the same four-year period. Manchester City, who won the final edition under the previous format in 2023, entered as the technical title holders but were eliminated in the round of 16 by Al-Hilal.

FIFA first announced the expanded format in March 2019, originally selecting China to host the inaugural edition in 2021. This was later postponed due to the global COVID-19 pandemic. In February 2023, FIFA confirmed the allocation of qualification slots among confederations, and four months later announced the United States as the new host nation. Alongside this expansion, FIFA also introduced the FIFA Intercontinental Cup, an annual tournament based on the previous Club World Cup format.

The expansion of the tournament drew varied responses, with some concerns raised by the players' union FIFPRO and the World Leagues Forum regarding potential effects on fixture schedules and player welfare. Ticket sales were managed using dynamic pricing, which was later adjusted for several matches to boost attendance. International broadcasting rights were secured by streaming service DAZN, which sublicensed coverage to other networks. A total of $1 billion in prize money was distributed among the 32 clubs, including solidarity payments and allocations by confederation.

It was the first major FIFA tournament since the 1978 FIFA World Cup not to feature a penalty shootout.

Background and format
Since its return from hiatus in 2005, the FIFA Club World Cup had been held annually in December and was limited to the winners of continental club competitions.[1] As early as late 2016, FIFA president Gianni Infantino suggested expanding the Club World Cup to 32 teams beginning in 2019 and rescheduling it to June/July to be more balanced and attractive to broadcasters and sponsors.[2] In late 2017, FIFA discussed proposals to expand the competition to 24 teams and have it be played every four years starting in 2021, replacing the FIFA Confederations Cup.[3] The expanded format and schedule of Club World Cup, to be played in June and July 2021, was confirmed at the March 2019 FIFA Council meeting in Miami.[4][5] China was appointed as host in October 2019,[6] but the 2021 event was canceled due to the COVID-19 pandemic.[7][8][9]

On June 23, 2023, FIFA confirmed that the United States would host the 2025 tournament as a prelude to the 2026 FIFA World Cup.[10] The 32 teams were divided into eight groups of four teams, with the top two teams in each group qualifying to the knockout stage.[11] However, the only difference from the format used in the FIFA World Cup between 1998 and 2022 was that there was no third place playoff.[12]

In January 2024, it was reported that the tournament would mainly take place on the East Coast to be closer to European broadcasters and viewers while also avoiding conflicts with the 2025 CONCACAF Gold Cup, which also took place primarily in the United States around the same time, but mainly in the Western part of the country.[13]

Trophy
FIFA unveiled a newly designed trophy created by Tiffany & Co. for the 2025 FIFA Club World Cup. Made from pure 24-karat gold, the trophy's design drew inspiration from pioneering maps, the periodic table, astronomy, and the Voyager Golden Record. It featured laser-engraved details including a world map, the names of all 211 FIFA member associations, descriptions of football, and inscriptions in 13 languages, including braille.[14] The trophy weighs approximately 5 kilograms (11 lb) and is valued between €200,000 or US$230,000.[15] The original trophy was kept by United States president Donald Trump in the Oval Office, while an identical replica was awarded to Chelsea, the first winners of the expanded tournament.
"""


### Converting paragraph into sentences and words

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Amreet\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [17]:
from nltk.tokenize import sent_tokenize

text = "This is a sentence. Here is another one."
print(sent_tokenize(text))

['This is a sentence.', 'Here is another one.']


In [18]:
sentence = nltk.sent_tokenize(paragraph)
print(sentence)

['\nThe 2025 FIFA Club World Cup, also marketed as FIFA Club World Cup 25, was the 21st edition and the first of the expanded FIFA Club World Cup, an international club soccer competition organized by FIFA.', 'The tournament was held in the United States from June 14 to July 13, 2025, and featured 32 teams.', 'The expanded format included the continental champions from the past four years as well as additional qualified teams.', 'Chelsea won the tournament, defeating Paris Saint-Germain 3–0 in the final and becoming the inaugural world champions under the expanded format.', 'The revised structure was modeled more closely on the FIFA World Cup as a quadrennial world championship, replacing the annual seven-team format used between 2000 and 2023.', "It featured the winners of each continent's top club competition from 2021 to 2024, except for a single entry from Oceania.", 'Additional slots were awarded to clubs from Europe and South America based on rankings across the same four-year pe

In [15]:
stemmer = PorterStemmer()

# Checking for stemming
print(stemmer.stem('history'))
print(stemmer.stem('drinking'))
print(stemmer.stem('going'))

histori
drink
go


### Lemmatization 

In [23]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('history'))
print(lemmatizer.lemmatize('drinking'))
print(lemmatizer.lemmatize('goes'))

history
drinking
go


### Cleaning the paragraph for stemming and lemmatization

In [19]:
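import re

corpus = []
for sent in sentence:
    # Delete every character that is not a letter. Because the replacement
    # string is '' (empty) rather than ' ', spaces are deleted as well and
    # the words of each sentence run together.
    review = re.sub('[^a-zA-Z]', '', sent)
    review = review.lower()  # convert to lower case
    corpus.append(review)    # store the result in the corpus list
# Note: corpus does not yet hold usable clean text; each entry is a single
# unbroken string, so it cannot be split back into words.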


In [21]:
print(corpus[0])

thefifaclubworldcupalsomarketedasfifaclubworldcupwasthesteditionandthefirstoftheexpandedfifaclubworldcupaninternationalclubsoccercompetitionorganizedbyfifa
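
The output above has no word boundaries because the non-letter characters, spaces included, were replaced with an empty string. A minimal sketch of one way to keep the words apart is to substitute a space instead and then collapse the resulting runs of whitespace. The helper name `clean_sentence` and the sample sentence (taken from the paragraph) are illustrative choices, not part of the original notebook.

```python
import re

def clean_sentence(text):
    """Replace non-letters with spaces, collapse whitespace runs, and lowercase."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    # split() with no arguments drops leading, trailing and repeated whitespace
    return ' '.join(letters_only.split()).lower()

sample = ("The tournament was held in the United States from June 14 "
          "to July 13, 2025, and featured 32 teams.")
cleaned = clean_sentence(sample)
print(cleaned)
# the tournament was held in the united states from june to july and featured teams
```

Text cleaned this way can be split into individual words again, which stop-word filtering, stemming and lemmatization all rely on.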


### Applying Stemming to the corpus

In [25]:
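stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for sent in corpus:
    words = nltk.word_tokenize(sent)  # each entry has no spaces, so this yields a single token
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))  # only the tail of the long token can change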

thefifaclubworldcupalsomarketedasfifaclubworldcupwasthesteditionandthefirstoftheexpandedfifaclubworldcupaninternationalclubsoccercompetitionorganizedbyfifa
thetournamentwasheldintheunitedstatesfromjunetojulyandfeaturedteam
theexpandedformatincludedthecontinentalchampionsfromthepastfouryearsaswellasadditionalqualifiedteam
chelseawonthetournamentdefeatingparissaintgermaininthefinalandbecomingtheinauguralworldchampionsundertheexpandedformat
therevisedstructurewasmodeledmorecloselyonthefifaworldcupasaquadrennialworldchampionshipreplacingtheannualseventeamformatusedbetweenand
itfeaturedthewinnersofeachcontinentstopclubcompetitionfromtoexceptforasingleentryfromoceania
additionalslotswereawardedtoclubsfromeuropeandsouthamericabasedonrankingsacrossthesamefouryearperiod
manchestercitywhowonthefinaleditionunderthepreviousformatinenteredasthetechnicaltitleholdersbutwereeliminatedintheroundofbyalhil
fifafirstannouncedtheexpandedformatinmarchoriginallyselectingchinatohosttheinauguraleditionin
thisw

### Applying Lemmatization to the Corpus

In [26]:
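stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for sent in corpus:
    words = nltk.word_tokenize(sent)  # each entry has no spaces, so this yields a single token
    for word in words:
        if word not in stop_words:
            print(lemmatizer.lemmatize(word))  # the long token is not a WordNet entry, so it comes back unchanged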

thefifaclubworldcupalsomarketedasfifaclubworldcupwasthesteditionandthefirstoftheexpandedfifaclubworldcupaninternationalclubsoccercompetitionorganizedbyfifa
thetournamentwasheldintheunitedstatesfromjunetojulyandfeaturedteams
theexpandedformatincludedthecontinentalchampionsfromthepastfouryearsaswellasadditionalqualifiedteams
chelseawonthetournamentdefeatingparissaintgermaininthefinalandbecomingtheinauguralworldchampionsundertheexpandedformat
therevisedstructurewasmodeledmorecloselyonthefifaworldcupasaquadrennialworldchampionshipreplacingtheannualseventeamformatusedbetweenand
itfeaturedthewinnersofeachcontinentstopclubcompetitionfromtoexceptforasingleentryfromoceania
additionalslotswereawardedtoclubsfromeuropeandsouthamericabasedonrankingsacrossthesamefouryearperiod
manchestercitywhowonthefinaleditionunderthepreviousformatinenteredasthetechnicaltitleholdersbutwereeliminatedintheroundofbyalhilal
fifafirstannouncedtheexpandedformatinmarchoriginallyselectingchinatohosttheinauguraleditionin
t