## Imports

In [1]:
!pip install plotly --upgrade

Collecting plotly
  Downloading plotly-5.4.0-py2.py3-none-any.whl (25.3 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 4.4.1
    Uninstalling plotly-4.4.1:
      Successfully uninstalled plotly-4.4.1
Successfully installed plotly-5.4.0 tenacity-8.0.1


In [2]:
import nltk
import pandas as pd
import plotly.express as px
import string
import spacy

In [3]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

In [4]:
## Source article: https://www.reuters.com/world/africa/safrica-says-it-is-being-punished-early-covid-variant-detection-2021-11-27/

article_text = '''
JOHANNESBURG, Nov 27 (Reuters) - South Africa said on Saturday it was being punished for its advanced ability to detect new COVID-19 variants early, as travel bans and restrictions imposed because of the new Omicron variant threaten to harm tourism and other sectors of the economy.

South Africa has some of the world's top epidemiologists and scientists, who have managed to detect emerging coronavirus variants and their mutations early on in their life cycle. The Omicron variant was first discovered in South Africa and has since been detected in Belgium, Botswana, Israel and Hong Kong.

"This latest round of travel bans is akin to punishing South Africa for its advanced genomic sequencing and the ability to detect new variants quicker," the Ministry of International Relations and Cooperation said.

"Excellent science should be applauded and not punished," it said in a statement.

Many nations rushed on Friday and Saturday to announce travel curbs to South Africa and other countries in the region.

The foreign ministry noted that while the new variant was also detected in other countries, the global reaction to those countries have been "starkly different" to cases in southern Africa.

The new variant was first announced on Wednesday by a team of scientists in South Africa who said they had detected a variant that could possibly evade the body's immune response and make it more transmissible.

On Friday the World Health Organization named it Omicron and designated it as a "variant of concern" - its most serious level - saying preliminary evidence suggests an increased risk of re-infection. read more

"Our immediate concern is the damage that these restrictions are causing to families, the travel and tourism industries and business," South African Foreign Minister Naledi Pandor said in the statement.

The government was engaging with countries that have imposed travel bans to persuade them to reconsider, it added.

On Friday, the WHO cautioned countries against hastily imposing travel restrictions linked to the variant, saying they should take a "risk-based and scientific approach".'''

## Tokenization

In [5]:
sentence_tokens = nltk.sent_tokenize(article_text)
sentence_tokens

['\nJOHANNESBURG, Nov 27 (Reuters) - South Africa said on Saturday it was being punished for its advanced ability to detect new COVID-19 variants early, as travel bans and restrictions imposed because of the new Omicron variant threaten to harm tourism and other sectors of the economy.',
 "South Africa has some of the world's top epidemiologists and scientists, who have managed to detect emerging coronavirus variants and their mutations early on in their life cycle.",
 'The Omicron variant was first discovered in South Africa and has since been detected in Belgium, Botswana, Israel and Hong Kong.',
 '"This latest round of travel bans is akin to punishing South Africa for its advanced genomic sequencing and the ability to detect new variants quicker," the Ministry of International Relations and Cooperation said.',
 '"Excellent science should be applauded and not punished," it said in a statement.',
 'Many nations rushed on Friday and Saturday to announce travel curbs to South Africa and

In [6]:
word_tokens = nltk.word_tokenize(article_text.lower())
word_tokens

['johannesburg',
 ',',
 'nov',
 '27',
 '(',
 'reuters',
 ')',
 '-',
 'south',
 'africa',
 'said',
 'on',
 'saturday',
 'it',
 'was',
 'being',
 'punished',
 'for',
 'its',
 'advanced',
 'ability',
 'to',
 'detect',
 'new',
 'covid-19',
 'variants',
 'early',
 ',',
 'as',
 'travel',
 'bans',
 'and',
 'restrictions',
 'imposed',
 'because',
 'of',
 'the',
 'new',
 'omicron',
 'variant',
 'threaten',
 'to',
 'harm',
 'tourism',
 'and',
 'other',
 'sectors',
 'of',
 'the',
 'economy',
 '.',
 'south',
 'africa',
 'has',
 'some',
 'of',
 'the',
 'world',
 "'s",
 'top',
 'epidemiologists',
 'and',
 'scientists',
 ',',
 'who',
 'have',
 'managed',
 'to',
 'detect',
 'emerging',
 'coronavirus',
 'variants',
 'and',
 'their',
 'mutations',
 'early',
 'on',
 'in',
 'their',
 'life',
 'cycle',
 '.',
 'the',
 'omicron',
 'variant',
 'was',
 'first',
 'discovered',
 'in',
 'south',
 'africa',
 'and',
 'has',
 'since',
 'been',
 'detected',
 'in',
 'belgium',
 ',',
 'botswana',
 ',',
 'israel',
 'and

In [7]:
freq_dist_words = nltk.FreqDist(word_tokens)
freq_dist_words

FreqDist({"''": 7,
          "'s": 2,
          '(': 1,
          ')': 1,
          ',': 13,
          '-': 3,
          '.': 12,
          '27': 1,
          '``': 5,
          'a': 5,
          'ability': 2,
          'added': 1,
          'advanced': 2,
          'africa': 7,
          'african': 1,
          'against': 1,
          'akin': 1,
          'also': 1,
          'an': 1,
          'and': 16,
          'announce': 1,
          'announced': 1,
          'applauded': 1,
          'approach': 1,
          'are': 1,
          'as': 2,
          'bans': 3,
          'be': 1,
          'because': 1,
          'been': 2,
          'being': 1,
          'belgium': 1,
          'body': 1,
          'botswana': 1,
          'business': 1,
          'by': 1,
          'cases': 1,
          'causing': 1,
          'cautioned': 1,
          'concern': 2,
          'cooperation': 1,
          'coronavirus': 1,
          'could': 1,
          'countries': 5,
          'covid-19': 1,
   

In [8]:
df_freq = pd.DataFrame.from_dict(freq_dist_words, orient='index', columns=['Frequency']).reset_index()
df_freq.sort_values('Frequency', ascending=False, inplace=True)
top_20_words = df_freq.head(20)
top_20_words = top_20_words.rename(columns={'index': 'Word'})
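
The same steps for turning a frequency distribution into a top-20 table recur in later cells. Because `nltk.FreqDist` is a subclass of `collections.Counter`, its `most_common` method gives a comparable ranking directly. The following is a minimal sketch that uses a plain `Counter` as a stand-in for `freq_dist_words`; the helper name `top_n_words` and the sample tokens are made up for illustration, and ties may come out in a different order than with `sort_values`.

```python
from collections import Counter


def top_n_words(freq_dist, n=20):
    """Return the n most frequent entries as a list of {'Word', 'Frequency'} rows.

    Hypothetical helper: works with any Counter, including nltk.FreqDist.
    """
    return [
        {"Word": word, "Frequency": count}
        for word, count in freq_dist.most_common(n)
    ]


# Stand-in data; in the notebook this would be freq_dist_words
sample_tokens = ["south", "africa", "and", "the", "and", "africa", "and"]
sample_freq = Counter(sample_tokens)

top_rows = top_n_words(sample_freq, n=2)
print(top_rows)
```

The resulting list of rows can be passed to `pd.DataFrame(...)` and then to `px.bar` as in the cells here.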

In [9]:
fig = px.bar(top_20_words, x='Word', y='Frequency', color='Frequency')

fig.update_layout(height=800, width=1200, template='plotly_dark', title='Distribution of Top 20 Words From Article')
fig.show()

## Stopwords Removal

In [10]:
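word_tokens_clean = []
my_punct = ["''", "``", '""', "'s", '-']
# Build the stopword set once instead of reloading the list for every token
stop_words = set(nltk.corpus.stopwords.words('english'))

for word in word_tokens:
  # Keep only tokens that are neither stopwords nor punctuation
  if word not in stop_words and word not in string.punctuation and word not in my_punct:
    word_tokens_clean.append(word)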

print(word_tokens_clean)

['johannesburg', 'nov', '27', 'reuters', 'south', 'africa', 'said', 'saturday', 'punished', 'advanced', 'ability', 'detect', 'new', 'covid-19', 'variants', 'early', 'travel', 'bans', 'restrictions', 'imposed', 'new', 'omicron', 'variant', 'threaten', 'harm', 'tourism', 'sectors', 'economy', 'south', 'africa', 'world', 'top', 'epidemiologists', 'scientists', 'managed', 'detect', 'emerging', 'coronavirus', 'variants', 'mutations', 'early', 'life', 'cycle', 'omicron', 'variant', 'first', 'discovered', 'south', 'africa', 'since', 'detected', 'belgium', 'botswana', 'israel', 'hong', 'kong', 'latest', 'round', 'travel', 'bans', 'akin', 'punishing', 'south', 'africa', 'advanced', 'genomic', 'sequencing', 'ability', 'detect', 'new', 'variants', 'quicker', 'ministry', 'international', 'relations', 'cooperation', 'said', 'excellent', 'science', 'applauded', 'punished', 'said', 'statement', 'many', 'nations', 'rushed', 'friday', 'saturday', 'announce', 'travel', 'curbs', 'south', 'africa', 'count
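
One subtlety in the filter above: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test rather than a per-character check. That is why two-character quote tokens such as `''` and ``` `` ``` need the extra `my_punct` list. A hedged alternative sketch drops any token made up entirely of punctuation characters; the name `is_punct_only` and the sample tokens are illustrative and not part of the notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punct_only(token):
    """True when every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


sample_tokens = ["south", "''", "``", "-", "covid-19", "'s", ","]
kept = [tok for tok in sample_tokens if not is_punct_only(tok)]
print(kept)
```

Tokens that mix letters and punctuation, such as `'s`, still pass this check and would need to be listed explicitly, as the notebook does.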

In [11]:
freq_dist_words_clean = nltk.FreqDist(word_tokens_clean)

In [12]:
df_freq_clean = pd.DataFrame.from_dict(freq_dist_words_clean, orient='index', columns=['Frequency']).reset_index()
df_freq_clean.sort_values('Frequency', ascending=False, inplace=True)
top_20_words_clean = df_freq_clean.head(20)
top_20_words_clean = top_20_words_clean.rename(columns={'index': 'Word'})

In [13]:
fig = px.bar(top_20_words_clean, x='Word', y='Frequency', color='Frequency')

fig.update_layout(height=800, width=1200, template='plotly_dark', title='Distribution of Top 20 Words From Article (stopwords removed)')
fig.show()


## Stemming

In [14]:
stem_en = nltk.stem.SnowballStemmer("english")

In [15]:
stemmed_words = [stem_en.stem(word) for word in word_tokens_clean]
stemmed_words

['johannesburg',
 'nov',
 '27',
 'reuter',
 'south',
 'africa',
 'said',
 'saturday',
 'punish',
 'advanc',
 'abil',
 'detect',
 'new',
 'covid-19',
 'variant',
 'earli',
 'travel',
 'ban',
 'restrict',
 'impos',
 'new',
 'omicron',
 'variant',
 'threaten',
 'harm',
 'tourism',
 'sector',
 'economi',
 'south',
 'africa',
 'world',
 'top',
 'epidemiologist',
 'scientist',
 'manag',
 'detect',
 'emerg',
 'coronavirus',
 'variant',
 'mutat',
 'earli',
 'life',
 'cycl',
 'omicron',
 'variant',
 'first',
 'discov',
 'south',
 'africa',
 'sinc',
 'detect',
 'belgium',
 'botswana',
 'israel',
 'hong',
 'kong',
 'latest',
 'round',
 'travel',
 'ban',
 'akin',
 'punish',
 'south',
 'africa',
 'advanc',
 'genom',
 'sequenc',
 'abil',
 'detect',
 'new',
 'variant',
 'quicker',
 'ministri',
 'intern',
 'relat',
 'cooper',
 'said',
 'excel',
 'scienc',
 'applaud',
 'punish',
 'said',
 'statement',
 'mani',
 'nation',
 'rush',
 'friday',
 'saturday',
 'announc',
 'travel',
 'curb',
 'south',
 'afric

In [16]:
freq_dist_words_stem = nltk.FreqDist(stemmed_words)
freq_dist_words_stem

FreqDist({'27': 1,
          'abil': 2,
          'ad': 1,
          'advanc': 2,
          'africa': 7,
          'african': 1,
          'akin': 1,
          'also': 1,
          'announc': 2,
          'applaud': 1,
          'approach': 1,
          'ban': 3,
          'belgium': 1,
          'bodi': 1,
          'botswana': 1,
          'busi': 1,
          'case': 1,
          'caus': 1,
          'caution': 1,
          'concern': 2,
          'cooper': 1,
          'coronavirus': 1,
          'could': 1,
          'countri': 5,
          'covid-19': 1,
          'curb': 1,
          'cycl': 1,
          'damag': 1,
          'design': 1,
          'detect': 6,
          'differ': 1,
          'discov': 1,
          'earli': 2,
          'economi': 1,
          'emerg': 1,
          'engag': 1,
          'epidemiologist': 1,
          'evad': 1,
          'evid': 1,
          'excel': 1,
          'famili': 1,
          'first': 2,
          'foreign': 2,
          'friday': 3,


In [17]:
df_freq_stem = pd.DataFrame.from_dict(freq_dist_words_stem, orient='index', columns=['Frequency']).reset_index()
df_freq_stem.sort_values('Frequency', ascending=False, inplace=True)
top_20_words_stem = df_freq_stem.head(20)
top_20_words_stem = top_20_words_stem.rename(columns={'index': 'Word'})

In [18]:
fig = px.bar(top_20_words_stem, x='Word', y='Frequency', color='Frequency')

fig.update_layout(height=800, width=1200, template='plotly_dark', title='Distribution of Top 20 Words From Article (stemmed)')
fig.show()
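
Stemming changes the frequency table because distinct surface forms collapse onto one stem. The sketch below replays that merge on a few word and stem pairs copied from the outputs above (`announce` and `announced` both become `announc`). The dictionary is only a stand-in for `stem_en.stem`, and `merge_counts` is a hypothetical helper name.

```python
from collections import Counter

# Surface-form counts copied from the unstemmed FreqDist output
surface_counts = Counter({"announce": 1, "announced": 1, "ability": 2, "countries": 5})

# Word -> stem pairs as seen in the Snowball output (stand-in for stem_en.stem)
stem_of = {
    "announce": "announc",
    "announced": "announc",
    "ability": "abil",
    "countries": "countri",
}


def merge_counts(counts, mapping):
    """Sum the counts of all words that share the same mapped form."""
    merged = Counter()
    for word, count in counts.items():
        merged[mapping.get(word, word)] += count
    return merged


stem_counts = merge_counts(surface_counts, stem_of)
print(stem_counts)
```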

## Lemmatizing

In [19]:
nlp = spacy.load("en_core_web_sm")

string_words = ' '.join([str(word) for word in word_tokens_clean])
nlp_tokens = nlp(string_words)

word_tokens_lemma = [word.lemma_ for word in nlp_tokens]
word_tokens_lemma

['johannesburg',
 'nov',
 '27',
 'reuters',
 'south',
 'africa',
 'say',
 'saturday',
 'punish',
 'advanced',
 'ability',
 'detect',
 'new',
 'covid-19',
 'variant',
 'early',
 'travel',
 'ban',
 'restriction',
 'impose',
 'new',
 'omicron',
 'variant',
 'threaten',
 'harm',
 'tourism',
 'sector',
 'economy',
 'south',
 'africa',
 'world',
 'top',
 'epidemiologist',
 'scientist',
 'manage',
 'detect',
 'emerge',
 'coronavirus',
 'variant',
 'mutation',
 'early',
 'life',
 'cycle',
 'omicron',
 'variant',
 'first',
 'discover',
 'south',
 'africa',
 'since',
 'detect',
 'belgium',
 'botswana',
 'israel',
 'hong',
 'kong',
 'late',
 'round',
 'travel',
 'ban',
 'akin',
 'punish',
 'south',
 'africa',
 'advance',
 'genomic',
 'sequencing',
 'ability',
 'detect',
 'new',
 'variant',
 'quick',
 'ministry',
 'international',
 'relations',
 'cooperation',
 'say',
 'excellent',
 'science',
 'applaud',
 'punish',
 'say',
 'statement',
 'many',
 'nation',
 'rush',
 'friday',
 'saturday',
 'annou

In [20]:
# spaCy re-tokenizes the joined string and appears to split some hyphenated
# tokens (e.g. 'risk-based'), so some punctuation came back; remove it again

word_tokens_lemma_2 = []

for word in word_tokens_lemma:
  if word not in string.punctuation:
    word_tokens_lemma_2.append(word)

In [21]:
freq_dist_words_lemm = nltk.FreqDist(word_tokens_lemma_2)
freq_dist_words_lemm

FreqDist({'27': 1,
          'ability': 2,
          'add': 1,
          'advance': 1,
          'advanced': 1,
          'africa': 7,
          'african': 1,
          'akin': 1,
          'also': 1,
          'announce': 2,
          'applaud': 1,
          'approach': 1,
          'ban': 3,
          'base': 1,
          'belgium': 1,
          'body': 1,
          'botswana': 1,
          'business': 1,
          'case': 1,
          'cause': 1,
          'caution': 1,
          'concern': 2,
          'cooperation': 1,
          'coronavirus': 1,
          'could': 1,
          'country': 5,
          'covid-19': 1,
          'curbs': 1,
          'cycle': 1,
          'damage': 1,
          'designate': 1,
          'detect': 6,
          'different': 1,
          'discover': 1,
          'early': 2,
          'economy': 1,
          'emerge': 1,
          'engage': 1,
          'epidemiologist': 1,
          'evade': 1,
          'evidence': 1,
          'excellent': 1,
        

In [22]:
df_freq_lemm = pd.DataFrame.from_dict(freq_dist_words_lemm, orient='index', columns=['Frequency']).reset_index()
df_freq_lemm.sort_values('Frequency', ascending=False, inplace=True)
top_20_words_lemm = df_freq_lemm.head(20)
top_20_words_lemm = top_20_words_lemm.rename(columns={'index': 'Word'})

In [23]:
# Showing the 2 charts one after the other to compare

fig = px.bar(top_20_words_stem, x='Word', y='Frequency', color='Frequency', labels={'Word': 'Stemmed Words'})

fig.update_layout(height=450, width=1100, template='plotly_dark', title='Distribution of Top 20 Words From Article (stemmed)')
fig.show()

fig = px.bar(top_20_words_lemm, x='Word', y='Frequency', color='Frequency', labels={'Word': 'Lemmatized Words'})

fig.update_layout(height=450, width=1100, template='plotly_dark', title='Distribution of Top 20 Words From Article (lemmatized)')
fig.show()

Both frequency lists look roughly the same. Some words are better captured in the lemmatized list (for example, the verb 'say' is counted once instead of being split between 'say' and 'said'), but the overall distribution is comparable. The lemmatized version is also more readable, because its entries are actual words rather than truncated stems such as 'abil' or 'countri'.
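
To make this comparison concrete rather than visual, the two word lists can be compared as sets. This is a minimal sketch on a handful of entries copied from the stemmed and lemmatized outputs above. `compare_word_lists` is a hypothetical helper; in the notebook the inputs would be the `Word` columns of `top_20_words_stem` and `top_20_words_lemm`.

```python
def compare_word_lists(stemmed, lemmatized):
    """Split two word lists into shared entries and entries unique to each."""
    stem_set, lemma_set = set(stemmed), set(lemmatized)
    return {
        "shared": sorted(stem_set & lemma_set),
        "stem_only": sorted(stem_set - lemma_set),
        "lemma_only": sorted(lemma_set - stem_set),
    }


# A few entries taken from the two FreqDist outputs
stem_sample = ["africa", "detect", "countri", "ban", "abil"]
lemma_sample = ["africa", "detect", "country", "ban", "ability"]

result = compare_word_lists(stem_sample, lemma_sample)
print(result)
```

Entries that appear only on one side are typically the same word in two spellings, which matches the readability point made above.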