## Pre-processing
Learning goals:
- spaCy: Stop word removal
- spaCy: Lemmatization
- spaCy: Part-of-speech (POS) tagging
- NLTK: Tokenization
- NLTK: Stop word removal
- NLTK: Lemmatization
- Gensim: Pre-processing
- Gensim: Stop word removal
- Gensim: Stemming
- Gensim: Lemmatization
- spaCy for a corpus




In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### (A) spaCy

#### (A.1) Stop word removal


In [2]:
import spacy
text = "The US economy has been looking solid lately that Federal Reserve officials will probably double their projection for growth next year."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

for token in doc:
    print(token.text,"-->", token.is_stop)

The --> True
US --> True
economy --> False
has --> True
been --> True
looking --> False
solid --> False
lately --> False
that --> True
Federal --> False
Reserve --> False
officials --> False
will --> True
probably --> False
double --> False
their --> True
projection --> False
for --> True
growth --> False
next --> True
year --> False
. --> False


In [3]:
import spacy
# Load spaCy's small English model and get its default stop word set
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

lst = []
for token in text.split():
    # Keep the token only if its lowercase form is not a stop word
    if token.lower() not in stopwords:
        lst.append(token)

print(' '.join(lst))

economy looking solid lately Federal Reserve officials probably double projection growth year.


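spaCy lets you add your own stop words. Suppose we treat "year" as a stop word in our context. Below we add it to the stop word list and filter the sentence again. Because `text.split()` splits on whitespace only, the last token is "year." with the period attached, so it does not match the stop word "year" and stays in the output.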

In [4]:
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("year")

lst = []
for token in text.split():
    # Keep the token only if its lowercase form is not a stop word.
    # 'year.' keeps its trailing period, so it does not match 'year'.
    if token.lower() not in stopwords:
        lst.append(token)

print(' '.join(lst))

economy looking solid lately Federal Reserve officials probably double projection growth year.
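The output above still ends in "year." because whitespace splitting leaves the period attached. Here is a minimal sketch of one way around this, assuming it is acceptable to strip leading and trailing punctuation before the lookup. The function `remove_stop_words` and the small `demo_stop_words` set are hypothetical and used only for illustration; in this notebook you would pass `nlp.Defaults.stop_words` instead.

```python
import string

def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose punctuation-stripped,
    lowercased form is in stop_words."""
    kept = []
    for token in text.split():
        # Compare on a normalized copy; keep the original token for output
        core = token.strip(string.punctuation).lower()
        if core not in stop_words:
            kept.append(token)
    return " ".join(kept)

# Hypothetical stop word set, a stand-in for nlp.Defaults.stop_words
demo_stop_words = {"the", "has", "been", "year"}
demo_text = "The economy has been solid this year."

result = remove_stop_words(demo_text, demo_stop_words)
print(result)  # economy solid this
```

Iterating over `nlp(text)` and checking `token.is_stop` avoids the problem in a different way, since spaCy's tokenizer separates the final period into its own token.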


#### (A.2) spaCy for Lemmatization

In [5]:
for token in doc:
    print(token.text,"-->", token.lemma_)

The --> the
US --> US
economy --> economy
has --> have
been --> be
looking --> look
solid --> solid
lately --> lately
that --> that
Federal --> Federal
Reserve --> Reserve
officials --> official
will --> will
probably --> probably
double --> double
their --> their
projection --> projection
for --> for
growth --> growth
next --> next
year --> year
. --> .


#### (A.3) spaCy for Part-of-speech (POS) tagging

In [6]:
for token in doc:
    print(token.text, "-->", token.pos_)

The --> DET
US --> PROPN
economy --> NOUN
has --> AUX
been --> AUX
looking --> VERB
solid --> ADJ
lately --> ADV
that --> SCONJ
Federal --> PROPN
Reserve --> PROPN
officials --> NOUN
will --> AUX
probably --> ADV
double --> VERB
their --> PRON
projection --> NOUN
for --> ADP
growth --> NOUN
next --> ADP
year --> NOUN
. --> PUNCT


In [7]:
nlp = spacy.load("en_core_web_sm")
for token in nlp(text):
    print(token.text)

The
US
economy
has
been
looking
solid
lately
that
Federal
Reserve
officials
will
probably
double
their
projection
for
growth
next
year
.


### (B) NLTK

In [8]:
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...

True

#### (B.1) Tokenize

In [9]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text_tokenized = word_tokenize(text)
print(text_tokenized)

['The', 'US', 'economy', 'has', 'been', 'looking', 'solid', 'lately', 'that', 'Federal', 'Reserve', 'officials', 'will', 'probably', 'double', 'their', 'projection', 'for', 'growth', 'next', 'year', '.']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### (B.2) Stop word removal

In [10]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Case-insensitive filter (this result is overwritten below)
filtered_sentence = [w for w in text_tokenized if w.lower() not in stop_words]

# Case-sensitive filter: no lowercase conversion, so 'The' is kept
filtered_sentence = []
for w in text_tokenized:
    if w not in stop_words:
        filtered_sentence.append(w)
print(text_tokenized)
print(filtered_sentence)

['The', 'US', 'economy', 'has', 'been', 'looking', 'solid', 'lately', 'that', 'Federal', 'Reserve', 'officials', 'will', 'probably', 'double', 'their', 'projection', 'for', 'growth', 'next', 'year', '.']
['The', 'US', 'economy', 'looking', 'solid', 'lately', 'Federal', 'Reserve', 'officials', 'probably', 'double', 'projection', 'growth', 'next', 'year', '.']


In [11]:
removed_words = []
for w in text_tokenized:
    if w in stop_words:
        removed_words.append(w)
print(removed_words)

['has', 'been', 'that', 'will', 'their', 'for']
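The removed list above does not contain 'The' because the comparison is case-sensitive, while the stop word list holds lowercase entries. The sketch below collects kept and removed tokens in one pass and lowercases each token before the lookup. `partition_stop_words` and the small `demo_stop_words` set are hypothetical names for illustration; in the notebook you would pass the NLTK `stop_words` set.

```python
def partition_stop_words(tokens, stop_words):
    """Split tokens into (kept, removed), comparing case-insensitively."""
    kept, removed = [], []
    for tok in tokens:
        if tok.lower() in stop_words:
            removed.append(tok)
        else:
            kept.append(tok)
    return kept, removed

# Hypothetical stop word set, a stand-in for set(stopwords.words('english'))
demo_stop_words = {"the", "has", "been"}
demo_tokens = ["The", "US", "economy", "has", "been", "looking", "solid"]

kept, removed = partition_stop_words(demo_tokens, demo_stop_words)
print(kept)     # ['US', 'economy', 'looking', 'solid']
print(removed)  # ['The', 'has', 'been']
```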


#### (B.3) Lemmatization

In [12]:
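# Create a WordNetLemmatizer object
wnl = WordNetLemmatizer()
# lemmatize() treats every word as a noun by default (pos='n'),
# which is why verbs such as 'looking' are left unchanged below
for word in text_tokenized:
    print(word + " --> " + wnl.lemmatize(word))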

The --> The
US --> US
economy --> economy
has --> ha
been --> been
looking --> looking
solid --> solid
lately --> lately
that --> that
Federal --> Federal
Reserve --> Reserve
officials --> official
will --> will
probably --> probably
double --> double
their --> their
projection --> projection
for --> for
growth --> growth
next --> next
year --> year
. --> .
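Results such as "has --> ha" and "looking --> looking" arise because `WordNetLemmatizer.lemmatize` assumes a noun unless told otherwise. One common workaround is to tag the tokens first and pass a WordNet part-of-speech letter to the lemmatizer. The sketch below assumes the NLTK tagger data is installed (the resource name differs between NLTK versions); `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag)
    to a WordNet POS letter; default to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

def lemmatize_with_pos(tokens):
    """Lemmatize tokens using the POS tag predicted for each one."""
    # Imported here so the mapping helper works without NLTK installed
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(tok, pos=penn_to_wordnet(tag))
            for tok, tag in pos_tag(tokens)]

mapped = [penn_to_wordnet(t) for t in ["VBZ", "JJ", "RB", "NNS"]]
print(mapped)  # ['v', 'a', 'r', 'n']
```

In the notebook, `lemmatize_with_pos(text_tokenized)` would be the call to try; the lemmas it returns depend on the tags the tagger assigns.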


### (C) Gensim
#### (C.1) Pre-processing

In [13]:
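from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
remove_stopwords(text)   # not displayed: a notebook cell shows only its last expression
preprocess_string(text)  # lowercases, strips punctuation, numbers and short tokens, removes stop words, stems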

['economi',
 'look',
 'solid',
 'late',
 'feder',
 'reserv',
 'offici',
 'probabl',
 'doubl',
 'project',
 'growth',
 'year']
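The stemmed, lowercase tokens above are the result of several text filters applied one after another. The sketch below illustrates that idea in plain Python; it is not Gensim's implementation, and `apply_filters` and `replace_punctuation` are hypothetical names. Each filter is assumed to take a string and return a string.

```python
import string

def apply_filters(text, filters):
    """Apply each str -> str filter in order, then split on whitespace."""
    for f in filters:
        text = f(text)
    return text.split()

def replace_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

tokens = apply_filters("Growth, next year!", [str.lower, replace_punctuation])
print(tokens)  # ['growth', 'next', 'year']
```

To the best of my knowledge, `preprocess_string` accepts a `filters` argument in the same spirit, so passing a shorter list (for example one without `stem_text`) should keep the words unstemmed; check the Gensim documentation for your version.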

#### (C.2) Stop word removal

In [14]:
stopword_removed = remove_stopwords(text)
stopword_removed

'The US economy looking solid lately Federal Reserve officials probably double projection growth year.'

#### (C.3) Stemming

In [15]:
from gensim.parsing.preprocessing import stem_text
stem_text(text)


'the us economi ha been look solid late that feder reserv offici will probabl doubl their project for growth next year.'

#### (C.4) Lemmatization

#### Error message when using Gensim for lemmatization
* Gensim never implemented lemmatization itself; it only wrapped the lemmatization routines of another library (Pattern). Pattern was not a particularly modern or well-maintained option, so the wrapper was removed in Gensim 4.0.
* Below I purposely show you the error.

### (D) spaCy for a corpus
- spaCy processes text through a pipeline of components and can apply that pipeline to an entire corpus


The table lists spaCy's built-in pipeline components. Calling nlp() on a text runs, in order, the components that the loaded pipeline includes:

| Component | Description |
| :--- | :--- |
| tagger | Assign part-of-speech tags |
| parser | Assign dependency labels |
| ner | Assign named entities |
| entity_linker | Assign knowledge base IDs to named entities. Should be added after the entity recognizer |
| entity_ruler | Assign named entities based on pattern rules and dictionaries |
| textcat | Assign text categories: exactly one category is predicted per document |
| textcat_multilabel | Assign text categories in a multi-label setting: zero, one or more labels per document |
| lemmatizer | Assign base forms to words using rules and lookups |
| trainable_lemmatizer | Assign base forms to words |
| morphologizer | Assign morphological features and coarse-grained POS tags |
| attribute_ruler | Assign token attribute mappings and rule-based exceptions |
| senter | Assign sentence boundaries |
| sentencizer | Add rule-based sentence segmentation without the dependency parse |
| tok2vec | Assign token-to-vector embeddings |
| transformer | Assign the tokens and outputs of a transformer model |

Components that a task does not need can be disabled to speed up processing.

In [16]:
import pandas as pd
# None removes the column-width limit; the older value -1 is deprecated
pd.set_option('display.max_colwidth', None)
train = pd.read_csv("/content/gdrive/My Drive/data/gensim/ag_news_train.csv")


In [17]:
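import spacy
en = spacy.load('en_core_web_sm')
nlp = spacy.load("en_core_web_sm")
stopwords = en.Defaults.stop_words
# Components to skip for speed. Caveat: the lemmatizer relies on POS tags, so with
# 'tagger' and 'attribute_ruler' disabled, token.lemma_ falls back to the lowercased
# token text (see the output below).
to_be_disabled = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'transformer']
text_tokenized = []
for doc in nlp.pipe(train['Description'], disable=to_be_disabled):
    k = [token.lemma_ for token in doc if not token.is_stop]
    text_tokenized.append(k)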
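If real lemmas are wanted rather than lowercased tokens, one option is to disable only the components that lemmatization does not depend on. The sketch below rests on two assumptions that should be checked against `nlp.pipe_names` on your installation: that `en_core_web_sm` contains the components listed in `assumed_pipe_names`, and that its lemmatizer needs `tok2vec`, `tagger` and `attribute_ruler`. `components_to_disable` and `lemmatize_corpus` are hypothetical helper names.

```python
def components_to_disable(pipe_names, needed):
    """Return the components that are not needed, in pipeline order."""
    return [name for name in pipe_names if name not in needed]

# Assumed component names for en_core_web_sm (verify with nlp.pipe_names)
assumed_pipe_names = ["tok2vec", "tagger", "parser",
                      "attribute_ruler", "lemmatizer", "ner"]
# Assumption: these are the components the lemmatizer depends on
needed_for_lemmas = {"tok2vec", "tagger", "attribute_ruler", "lemmatizer"}

disabled = components_to_disable(assumed_pipe_names, needed_for_lemmas)
print(disabled)  # ['parser', 'ner']

def lemmatize_corpus(texts, model="en_core_web_sm"):
    """Return one list of non-stop-word lemmas per input text."""
    import spacy  # imported here so the helper above works without spaCy
    nlp = spacy.load(model)
    disable = components_to_disable(nlp.pipe_names, needed_for_lemmas)
    return [[tok.lemma_ for tok in doc if not tok.is_stop]
            for doc in nlp.pipe(texts, disable=disable)]
```

With this selection the parser and the named entity recognizer are skipped, which keeps some of the speed-up while leaving the tagger available to the lemmatizer. In the notebook, the call would be `lemmatize_corpus(train['Description'])`.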



In [18]:
train['Tokenized'] = text_tokenized
train.to_csv("/content/gdrive/My Drive/data/ag_news_train_samples.csv")
train.head()

Unnamed: 0,Class Index,Title,Description,Tokenized
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.","[reuters, -, short, -, sellers, ,, wall, street, dwindling\band, ultra, -, cynics, ,, seeing, green, .]"
1,3,Carlyle Looks Toward Commercial Aerospace (Reuters),"Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.","[reuters, -, private, investment, firm, carlyle, group,\which, reputation, making, -, timed, occasionally\controversial, plays, defense, industry, ,, quietly, placed\its, bets, market, .]"
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.,"[reuters, -, soaring, crude, prices, plus, worries\about, economy, outlook, earnings, expected, to\hang, stock, market, week, depth, the\summer, doldrums, .]"
3,3,Iraq Halts Oil Exports from Main Southern Pipeline (Reuters),"Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.","[reuters, -, authorities, halted, oil, export\flows, main, pipeline, southern, iraq, after\intelligence, showed, rebel, militia, strike\infrastructure, ,, oil, official, said, saturday, .]"
4,3,"Oil prices soar to all-time record, posing new menace to US economy (AFP)","AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.","[afp, -, tearaway, world, oil, prices, ,, toppling, records, straining, wallets, ,, present, new, economic, menace, barely, months, presidential, elections, .]"


In [19]:
train[['Description','Tokenized']].head(1)

Unnamed: 0,Description,Tokenized
0,"Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.","[reuters, -, short, -, sellers, ,, wall, street, dwindling\band, ultra, -, cynics, ,, seeing, green, .]"


In [20]:
text_tokenized

[['reuters',
  '-',
  'short',
  '-',
  'sellers',
  ',',
  'wall',
  'street',
  'dwindling\\band',
  'ultra',
  '-',
  'cynics',
  ',',
  'seeing',
  'green',
  '.'],
 ['reuters',
  '-',
  'private',
  'investment',
  'firm',
  'carlyle',
  'group,\\which',
  'reputation',
  'making',
  '-',
  'timed',
  'occasionally\\controversial',
  'plays',
  'defense',
  'industry',
  ',',
  'quietly',
  'placed\\its',
  'bets',
  'market',
  '.'],
 ['reuters',
  '-',
  'soaring',
  'crude',
  'prices',
  'plus',
  'worries\\about',
  'economy',
  'outlook',
  'earnings',
  'expected',
  'to\\hang',
  'stock',
  'market',
  'week',
  'depth',
  'the\\summer',
  'doldrums',
  '.'],
 ['reuters',
  '-',
  'authorities',
  'halted',
  'oil',
  'export\\flows',
  'main',
  'pipeline',
  'southern',
  'iraq',
  'after\\intelligence',
  'showed',
  'rebel',
  'militia',
  'strike\\infrastructure',
  ',',
  'oil',
  'official',
  'said',
  'saturday',
  '.'],
 ['afp',
  '-',
  'tearaway',
  'world',
  