In [1]:
with open("wiki.txt") as f:
  wiki = f.read()
wiki

'The history of NLP generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n\nThe Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nSome notably successful NLP systems developed in the 1960s were SHRDLU, a natural-language system working in restricted "blocks wo

In [2]:
wiki = wiki.lower()

In [3]:
import re

### Extracting Year Mentions

*Note*. A bare "\d{4}" pattern also matches the first four digits of longer numbers such as 123456. See the code below.

In [4]:
re.compile(r"\d{4}").findall("The total cost in 2022 was 250072$.")

['2022', '2500']

Extracting 4-digit numbers surrounded by non-digit characters first, and then extracting the four digits from those matches, ensures that a match is never part of a longer number. It still cannot guarantee that the four-digit number is a year, and a number at the very start or end of the text is missed because it lacks a surrounding non-digit character on one side.

In [5]:
year_context_pattern = re.compile(r"\D\d{4}\D") # pattern to identify a year - 4 consecutive digits surrounded by non-digit characters
extracted_year_contexts = " ".join(year_context_pattern.findall(wiki))
print("Years with the surrounding characters: ", extracted_year_contexts)
year_pattern = re.compile(r"\d{4}") 
extracted_years = year_pattern.findall(extracted_year_contexts)
print("Years extracted: ", extracted_years)

Years with the surrounding characters:   1950s  1950,  1954   1966,  1980s  1960s  1964   1966.
Years extracted:  ['1950', '1950', '1954', '1966', '1980', '1960', '1964', '1966']
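
The two-step extraction above can also be written as a single pattern using lookarounds, which check the neighbouring characters without consuming them. The following is only a sketch: the sample string and the 1000 to 2100 plausibility bounds are assumptions made for illustration and are not part of the original notebook.

```python
import re

# Sample text (made up for illustration): it contains back-to-back years,
# a year at the very end, a longer number, and a 4-digit non-year.
sample = "1950,1954 and the 1960s; room 0042; the total cost in 2022 was 250072$; 1966"

# Two-step approach from the cell above, for comparison.
contexts = " ".join(re.findall(r"\D\d{4}\D", sample))
two_step = re.findall(r"\d{4}", contexts)
print(two_step)  # ['1954', '1960', '0042', '2022']

# Lookarounds: exactly four digits, not preceded or followed by another digit.
standalone_4_digits = re.compile(r"(?<!\d)\d{4}(?!\d)")
candidates = standalone_4_digits.findall(sample)
print(candidates)  # ['1950', '1954', '1960', '0042', '2022', '1966']

# Hypothetical plausibility filter; the bounds are an assumption.
years = [y for y in candidates if 1000 <= int(y) <= 2100]
print(years)  # ['1950', '1954', '1960', '2022', '1966']
```

The lookaround version picks up '1950' at the start and '1966' at the end of the sample, which the two-step version misses. Neither version can tell a year from any other four-digit number without an extra check such as the range filter.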


### Punctuation removal

In [7]:
# "-" shall be replaced with single space character and not empty string to avoid words like naturallanguage & tenyearlong
wiki = re.sub(r"-", " ", wiki)
wiki

'the history of nlp generally started in the 1950s, although work can be found from earlier periods. in 1950, alan turing published an article titled "computing machinery and intelligence" which proposed what is now called the turing test as a criterion of intelligence.\n\nthe georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the authors claimed that within three or five years, machine translation would be a solved problem.[2] however, real progress was much slower, and after the alpac report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nsome notably successful nlp systems developed in the 1960s were shrdlu, a natural language system working in restricted "blocks wo

In [8]:
# removing citation markers such as [2] (one or more digits inside square brackets)
wiki = re.sub(r"\[\d+\]", "", wiki)
wiki

'the history of nlp generally started in the 1950s, although work can be found from earlier periods. in 1950, alan turing published an article titled "computing machinery and intelligence" which proposed what is now called the turing test as a criterion of intelligence.\n\nthe georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the authors claimed that within three or five years, machine translation would be a solved problem. however, real progress was much slower, and after the alpac report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nsome notably successful nlp systems developed in the 1960s were shrdlu, a natural language system working in restricted "blocks world

In [9]:
# replacing the new-line character \n with " " (any extra spaces this creates are dropped during tokenization)
wiki = re.sub(r"[\n]", " ", wiki)
wiki

'the history of nlp generally started in the 1950s, although work can be found from earlier periods. in 1950, alan turing published an article titled "computing machinery and intelligence" which proposed what is now called the turing test as a criterion of intelligence.  the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the authors claimed that within three or five years, machine translation would be a solved problem. however, real progress was much slower, and after the alpac report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.  some notably successful nlp systems developed in the 1960s were shrdlu, a natural language system working in restricted "blocks worlds" w

In [10]:
# replacing any character that is neither a word character (letter, digit, underscore) nor whitespace with an empty string
wiki = re.sub(r"[^\w\s]", "", wiki)
wiki

'the history of nlp generally started in the 1950s although work can be found from earlier periods in 1950 alan turing published an article titled computing machinery and intelligence which proposed what is now called the turing test as a criterion of intelligence  the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english the authors claimed that within three or five years machine translation would be a solved problem however real progress was much slower and after the alpac report in 1966 which found that ten year long research had failed to fulfill the expectations funding for machine translation was dramatically reduced little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed  some notably successful nlp systems developed in the 1960s were shrdlu a natural language system working in restricted blocks worlds with restricted voca

### Tokenization

In [11]:
from nltk.tokenize import RegexpTokenizer, word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Meenakshi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
regexpTokenized = RegexpTokenizer(pattern=r'\w+|\$[\d\.]+|\S+').tokenize(wiki)
print(regexpTokenized)

['the', 'history', 'of', 'nlp', 'generally', 'started', 'in', 'the', '1950s', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', 'in', '1950', 'alan', 'turing', 'published', 'an', 'article', 'titled', 'computing', 'machinery', 'and', 'intelligence', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing', 'test', 'as', 'a', 'criterion', 'of', 'intelligence', 'the', 'georgetown', 'experiment', 'in', '1954', 'involved', 'fully', 'automatic', 'translation', 'of', 'more', 'than', 'sixty', 'russian', 'sentences', 'into', 'english', 'the', 'authors', 'claimed', 'that', 'within', 'three', 'or', 'five', 'years', 'machine', 'translation', 'would', 'be', 'a', 'solved', 'problem', 'however', 'real', 'progress', 'was', 'much', 'slower', 'and', 'after', 'the', 'alpac', 'report', 'in', '1966', 'which', 'found', 'that', 'ten', 'year', 'long', 'research', 'had', 'failed', 'to', 'fulfill', 'the', 'expectations', 'funding', 'for', 'machine', 'translation', 'was', 'dramat

In [13]:
wordTokenized = word_tokenize(wiki)
print(wordTokenized)

['the', 'history', 'of', 'nlp', 'generally', 'started', 'in', 'the', '1950s', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', 'in', '1950', 'alan', 'turing', 'published', 'an', 'article', 'titled', 'computing', 'machinery', 'and', 'intelligence', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing', 'test', 'as', 'a', 'criterion', 'of', 'intelligence', 'the', 'georgetown', 'experiment', 'in', '1954', 'involved', 'fully', 'automatic', 'translation', 'of', 'more', 'than', 'sixty', 'russian', 'sentences', 'into', 'english', 'the', 'authors', 'claimed', 'that', 'within', 'three', 'or', 'five', 'years', 'machine', 'translation', 'would', 'be', 'a', 'solved', 'problem', 'however', 'real', 'progress', 'was', 'much', 'slower', 'and', 'after', 'the', 'alpac', 'report', 'in', '1966', 'which', 'found', 'that', 'ten', 'year', 'long', 'research', 'had', 'failed', 'to', 'fulfill', 'the', 'expectations', 'funding', 'for', 'machine', 'translation', 'was', 'dramat

In [14]:
regexpTokenized == wordTokenized

True

At this stage, both tokenising methods produce the same list of tokens.

One advantage of RegexpTokenizer() is that a regular expression can specify either the pattern of the tokens or the pattern of the delimiters, so the tokenisation process can be customised to the domain of the text or the requirement. This cannot be done with word_tokenize().

*Note*. If the two tokenisations above are run without removing punctuation first, the following differences can be observed.
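* With the pattern above, RegexpTokenizer() left an opening double quotation mark attached to the following word (e.g. '"computing') and kept the closing one as a separate token, while word_tokenize() took both as separate tokens.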
* RegexpTokenizer() kept a period and the non-space characters immediately following it as one token, while word_tokenize() split the period and the following characters into separate tokens.
> * E.g. RegexpTokenizer() kept '.[2]' as '.[2]' while word_tokenize() split it into '.', '[', '2', ']'
* Hyphenated words were tokenised as follows.
> * Example 1: RegexpTokenizer() tokenised 'ten-year-long' as 'ten' & '-year-long' while word_tokenize() kept it as 'ten-year-long'.
> * Example 2: RegexpTokenizer() tokenised 'natural-language' as 'natural' & '-language' while word_tokenize() kept it as 'natural-language'.

However, since the token/delimiter pattern of RegexpTokenizer() can be customised, these differences can be removed.
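
As an illustration of such a customisation, the sketch below compares the token pattern used above with a pattern that keeps hyphenated words whole. It uses `re.findall` as a stand-in, on the assumption that RegexpTokenizer() with its default `gaps=False` returns the same matches for these patterns. The sample sentence and the name `custom_pattern` are made up for this example.

```python
import re

# Raw (lower-cased, uncleaned) sample made up for illustration.
raw = "a solved problem.[2] however, ten-year-long research failed."

# Token pattern used in the notebook cell above.
original_pattern = r"\w+|\$[\d\.]+|\S+"
original_tokens = re.findall(original_pattern, raw)
print(original_tokens)
# ['a', 'solved', 'problem', '.[2]', 'however', ',', 'ten', '-year-long', 'research', 'failed', '.']

# Hypothetical custom pattern: a word, optionally followed by "-word" parts.
# Punctuation matches nothing here, so it is dropped from the output.
custom_pattern = r"\w+(?:-\w+)*"
custom_tokens = re.findall(custom_pattern, raw)
print(custom_tokens)
# ['a', 'solved', 'problem', '2', 'however', 'ten-year-long', 'research', 'failed']
```

The custom pattern keeps 'ten-year-long' as one token, but the citation number '2' still survives, so citation markers would still need to be removed beforehand or excluded by a stricter pattern.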


In [15]:
wiki_words = wordTokenized

### Stop words removal

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwds = set(stopwords.words('english'))
print(stopwds)

{"mightn't", 'between', 'yours', 'mightn', 'by', 't', 'before', 'shouldn', 'as', 'those', 'but', 'about', "aren't", 'whom', 'll', 'weren', 'under', "wouldn't", 'very', 'for', 'hasn', 'ain', 'down', 'only', 'now', 'their', "she's", 'there', 'don', 'of', 'when', 'you', 'm', 'has', 'am', 'just', 'we', 'where', 'ours', "hadn't", 'were', 'they', 'other', "doesn't", 'that', "that'll", 'aren', 'on', 'o', 'all', 's', 'was', 'will', 'until', 'it', "hasn't", 'my', 'from', 'are', 'themselves', 'the', "you'll", 'against', 'doing', "needn't", 'himself', 'them', 'again', 're', 'both', "won't", 'such', 'which', "haven't", 'couldn', 'didn', 'above', 'does', 'not', 'i', "wasn't", 'these', 'up', 'hadn', 'me', 'is', 'so', 'your', 'theirs', 'do', 'be', 'have', 'through', "you've", 'haven', 'because', 'and', 'ma', 'our', "isn't", "it's", 'or', 'ourselves', 'this', 'having', 'further', 'won', 'being', 'shan', 'what', 'itself', 'yourselves', 've', 'below', 'into', 'herself', 'had', 'here', 'more', 'a', 'afte

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Meenakshi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# checking which words in wiki would be removed as stopwords
import numpy as np
np.unique([word for word in wiki_words if word in stopwds])

array(['a', 'about', 'after', 'an', 'and', 'as', 'be', 'between', 'by',
       'can', 'do', 'for', 'from', 'further', 'had', 'in', 'into', 'is',
       'more', 'my', 'no', 'now', 'of', 'or', 'some', 'than', 'that',
       'the', 'to', 'until', 'very', 'was', 'were', 'what', 'when',
       'which', 'why', 'with', 'you', 'your'], dtype='<U7')

In [19]:
# checking the tokens for any additional stopwords
print(np.unique([word for word in wiki_words if word not in stopwds]))
' '.join([word for word in wiki_words if word not in stopwds])

['1950' '1950s' '1954' '1960s' '1964' '1966' '1980s' 'alan' 'almost'
 'alpac' 'although' 'article' 'authors' 'automatic' 'base' 'blocks'
 'called' 'claimed' 'computing' 'conducted' 'criterion' 'developed'
 'dramatically' 'earlier' 'eliza' 'emotion' 'english' 'example' 'exceeded'
 'expectations' 'experiment' 'failed' 'first' 'five' 'found' 'fulfill'
 'fully' 'funding' 'generally' 'generic' 'georgetown' 'head' 'history'
 'however' 'human' 'hurts' 'information' 'intelligence' 'interaction'
 'involved' 'joseph' 'knowledge' 'language' 'late' 'like' 'little' 'long'
 'machine' 'machinery' 'might' 'much' 'natural' 'nlp' 'notably' 'patient'
 'periods' 'problem' 'progress' 'proposed' 'provide' 'provided'
 'psychotherapist' 'published' 'real' 'reduced' 'report' 'research'
 'responding' 'response' 'restricted' 'rogerian' 'russian' 'say'
 'sentences' 'shrdlu' 'simulation' 'sixty' 'slower' 'small' 'solved'
 'sometimes' 'started' 'startlingly' 'statistical' 'successful' 'system'
 'systems' 'ten' 'tes

'history nlp generally started 1950s although work found earlier periods 1950 alan turing published article titled computing machinery intelligence proposed called turing test criterion intelligence georgetown experiment 1954 involved fully automatic translation sixty russian sentences english authors claimed within three five years machine translation would solved problem however real progress much slower alpac report 1966 found ten year long research failed fulfill expectations funding machine translation dramatically reduced little research machine translation conducted late 1980s first statistical machine translation systems developed notably successful nlp systems developed 1960s shrdlu natural language system working restricted blocks worlds restricted vocabularies eliza simulation rogerian psychotherapist written joseph weizenbaum 1964 1966 using almost information human thought emotion eliza sometimes provided startlingly human like interaction patient exceeded small knowledge 

Remove 'no' from the stopwords list since "using almost no information" and "using almost information" have different meanings.

Add 'much' and 'however' to the stopwords list since "*however* progress was *much* slower" means almost the same without these words.

In [20]:
stopwds.remove('no')
stopwds.update({'much', 'however'})

In [21]:
# removing stopwords
wiki_words = [word for word in wiki_words if word not in stopwds]
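
The effect of these two adjustments can be checked on a single sentence. The sketch below is self-contained and uses a small hand-written stop-word set as a stand-in for NLTK's English list; the set contents, the sample sentence, and the variable names are assumptions made for illustration.

```python
# Small stand-in for NLTK's English stop-word list (assumption: only a few
# entries that matter for the sample sentence are included).
default_stop_words = {"the", "was", "no", "of", "and"}

sentence = "however real progress was much slower using almost no information"
tokens = sentence.split()

# Default list: 'no' is dropped, 'however' and 'much' are kept.
with_default = [w for w in tokens if w not in default_stop_words]
print(with_default)
# ['however', 'real', 'progress', 'much', 'slower', 'using', 'almost', 'information']

# Customised list: keep 'no', drop 'much' and 'however'.
custom_stop_words = (default_stop_words - {"no"}) | {"much", "however"}
with_custom = [w for w in tokens if w not in custom_stop_words]
print(with_custom)
# ['real', 'progress', 'slower', 'using', 'almost', 'no', 'information']
```

Building a new set instead of mutating the original keeps the default list available for comparison, whereas `remove()` and `update()` as used above change the set in place.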

### Final tokens

In [22]:
print(wiki_words)

['history', 'nlp', 'generally', 'started', '1950s', 'although', 'work', 'found', 'earlier', 'periods', '1950', 'alan', 'turing', 'published', 'article', 'titled', 'computing', 'machinery', 'intelligence', 'proposed', 'called', 'turing', 'test', 'criterion', 'intelligence', 'georgetown', 'experiment', '1954', 'involved', 'fully', 'automatic', 'translation', 'sixty', 'russian', 'sentences', 'english', 'authors', 'claimed', 'within', 'three', 'five', 'years', 'machine', 'translation', 'would', 'solved', 'problem', 'real', 'progress', 'slower', 'alpac', 'report', '1966', 'found', 'ten', 'year', 'long', 'research', 'failed', 'fulfill', 'expectations', 'funding', 'machine', 'translation', 'dramatically', 'reduced', 'little', 'research', 'machine', 'translation', 'conducted', 'late', '1980s', 'first', 'statistical', 'machine', 'translation', 'systems', 'developed', 'notably', 'successful', 'nlp', 'systems', 'developed', '1960s', 'shrdlu', 'natural', 'language', 'system', 'working', 'restricte