# Basic Text Preprocessing
1. **Text Cleaning**
    - Removing digits and words containing digits
    - Removing newline characters and extra spaces
    - Removing HTML tags
    - Removing URLs
    - Removing punctuations
    

2. **Basic Text Preprocessing**
    - Case folding
    - Expand contractions
    - Chat word treatment
    - Handle emojis
    - Spelling correction
    - Tokenization
    - Creating N-grams
    - Stop words Removal
 
 
3. **Advanced Preprocessing**
    - Stemming
    - Lemmatization
    - POS tagging
    - NER
    - Parsing
    - Coreference Resolution

#### Download and Install Required Libraries

In [27]:
import sys
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q numpy pandas scikit-learn
!{sys.executable} -m pip install -q nltk spacy gensim wordcloud textblob contractions text-clean unicode

## Text Cleaning
#### 1) Removing Digits and Words Containing Digits  
- The **`re.sub(pattern, replacement_string, str)`** function returns the string obtained by replacing the occurrences of `pattern` in `str` with `replacement_string`. If the pattern isn't found, the string is returned unchanged. The pattern `\w*\d\w*` used below matches any word that contains at least one digit.

In [5]:
import re
mystr = "This is abc32 a abc32xyz string containing 32abc words  32 having digits"
re.sub(r'\w*\d\w*', '', mystr)

'This is  a  string containing  words   having digits'

#### 2) Removing Newline Characters and Extra Spaces 
- Text data often contains extra spaces, and removing digits can leave more than one space between words.
- We can use Python's string methods and the `re` module to perform this pre-processing task.

In [11]:
import re
mystr = "      This         is a       string  with   lots of   extra spaces      in beteween    words     ."
re.sub(' +', ' ', mystr)

' This is a string with lots of extra spaces in beteween words .'

In [13]:
mystr = "This is\na string\nwith lots of new\nline characters."
print('Original String:\n', mystr)
print('Preprocessed String:', re.sub('\n',' ', mystr))

Original String:
 This is
a string
with lots of new
line characters.
Preprocessed String: This is a string with lots of new line characters.
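The two steps above (collapsing repeated spaces and replacing newlines) can also be combined in one pass without a regular expression. The following is a minimal sketch; the helper name `normalize_whitespace` is hypothetical and not part of the original notebook.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace
    return ' '.join(text.split())

print(normalize_whitespace("   This   is\na string\twith   extra   spaces   "))
```

Unlike `re.sub(' +', ' ', mystr)`, this variant also strips the space left at the very beginning and end of the string.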


#### 3) Removing HTML Tags
- Data obtained by scraping websites often contains HTML tags, which carry no useful information for text analysis, so we need to remove them.

In [20]:
import re
mystr = "<html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>"
print('Original String:\n', mystr)
print('Preprocessed Strings:', re.sub('<.+?>', ' ', mystr))

Original String:
 <html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>
Preprocessed Strings:     An empty head.     This is so simple and fun.      
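A regular expression works for simple markup like the example above, but it can break on attributes containing `>` and it leaves entities such as `&amp;` untouched. As an alternative, here is a hedged sketch built on the standard library's `html.parser`; the class name `TagStripper` and the helper `strip_html` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every piece of text found between tags
        self.chunks.append(data)

def strip_html(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse the whitespace left behind by the tags
    return ' '.join(' '.join(parser.chunks).split())

mystr = "<html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>"
print(strip_html(mystr))
```

For heavily broken real-world HTML, a dedicated parsing library may still be the more robust choice.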


#### 4) Removing URLS
- At times the text data contains URLs, which are usually not helpful for tasks such as sentiment analysis, so it is better to remove them from your dataset
- Once again, we can use Python's `re` module to remove the URLs.

In [22]:
import re
mystr = "Good youTube lectures by Arif are available at http://www.youtube.com/c/LearnWithArif/playlists"
re.sub(r'https?://(www\.)?\w+\S+', ' ', mystr)

'Good youTube lectures by Arif are available at  '
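Real text often contains addresses written without the `http://` prefix as well. The sketch below extends the idea to bare `www.` addresses; the pattern and the helper name `remove_urls` are assumptions made for illustration, and what counts as a URL should be adapted to your data.

```python
import re

# Matches http:// and https:// URLs as well as bare "www." addresses
URL_RE = re.compile(r'(?:https?://|www\.)\S+')

def remove_urls(text):
    # Replace each URL with a space, then collapse the leftover whitespace
    return ' '.join(URL_RE.sub(' ', text).split())

print(remove_urls("Lectures at http://www.youtube.com/c/LearnWithArif/playlists and www.example.com today"))
```

Note that `\S+` also swallows punctuation attached to the end of a URL, such as a trailing comma or full stop.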

#### 5) Removing Punctuations
- Punctuation marks are symbols used to divide written text into sentences and clauses
- Once you tokenize your text, these punctuation symbols may become part of a token, or become tokens by themselves, which is not desirable in most cases
- We can use Python's `string.punctuation` constant to filter every punctuation character out of the text

In [23]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [27]:
mystr = 'A {text} ^having$ "lot" of #s and [puncutations]!.;%..'
newstr = ''.join([i for i in mystr if i not in string.punctuation])
print(newstr)

A text having lot of s and puncutations
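The same filtering can be expressed with `str.translate`, which avoids the character-by-character Python loop and is usually faster on long texts. This is a small sketch; the helper name `remove_punctuation` is hypothetical.

```python
import string

# Translation table that maps every punctuation character to "delete"
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)

print(remove_punctuation('A {text} ^having$ "lot" of #s!'))
```

Keep in mind that `string.punctuation` covers ASCII punctuation only, so typographic quotes such as `“` and `”` pass through unchanged.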


## Basic Text Preprocessing
#### 1) Case Folding 
- The text we need to process may come in lower, upper, sentence, or camel case
- A machine treats lower-case and upper-case letters as different characters, so `Book` and `book` count as two different words unless the text is converted to a single case
- In applications like Information Retrieval, we usually reduce all letters to lower case
- In applications like sentiment analysis, machine translation and information extraction, keeping the case might be helpful. For example, `US` (the country) vs `us` (the pronoun).

In [29]:
mystr = "This IS GREAT series of Lectures by Arif at the Deaprtment of DS"
mystr.lower()

'this is great series of lectures by arif at the deaprtment of ds'
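For English text `lower()` is normally sufficient. For text in other languages, Python's built-in `str.casefold()` applies a more aggressive, comparison-oriented folding. The short sketch below shows the difference on a German word; the variable names are purely illustrative.

```python
a = "Straße"
b = "STRASSE"

# lower() keeps the German sharp s, so the two spellings still differ
print(a.lower() == b.lower())

# casefold() maps "ß" to "ss", so both spellings compare equal
print(a.casefold() == b.casefold())
```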

#### 2) Expand Contractions
- Contractions are words or combinations of words that are shortened by dropping letters and replacing them by an apostrophe.
- Examples:
    - you're ---> you are
    - ain't ---> am not / are not / is not / has not / have not
    - you'll ---> you shall / you will
    - wouldn't've ---> would not have
- In order to expand contractions, you can install and use the `contractions` module or can create your own dictionary to expand contractions

In [30]:
import sys
!{sys.executable} -m pip install -q contractions

In [33]:
import contractions
print(contractions.fix("you're"))
print(contractions.fix("ain't"))
print(contractions.fix("you'll"))
print(contractions.fix("wouldn't've"))

you are
are not
you will
would not have


In [34]:
mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. 
It's awesome to meet new friends. We've been waiting for this day for so long.'''
mystr

"I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \nIt's awesome to meet new friends. We've been waiting for this day for so long."

In [36]:
# using loop
mylist = []
for i in mystr.split(sep = ' '):
    mylist.append(contractions.fix(i))
mystr = ' '.join(mylist)
print(mystr)

I will be there within 5 min. Should not you be there too? I would love to see you there my dear. 
It is awesome to meet new friends. We have been waiting for this day for so long.


In [38]:
# Using List Comprehensions

mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. 
It's awesome to meet new friends. We've been waiting for this day for so long.'''
expanded_text = ' '.join([contractions.fix(i) for i in mystr.split()])
print(expanded_text)

I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.
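The second option mentioned earlier, a hand-made dictionary, can be sketched with the standard library alone. The mapping `CONTRACTION_MAP` below is a deliberately tiny, hypothetical example and the helper name `expand_contractions` is an assumption; a usable mapping would need many more entries and would have to settle ambiguous cases such as `I'd` (would / had).

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration
CONTRACTION_MAP = {
    "i'll": "I will",
    "i'd": "I would",
    "it's": "it is",
    "we've": "we have",
    "you're": "you are",
    "shouldn't": "should not",
}

# Longest keys first, so a longer contraction is tried before a shorter one
_keys = sorted(CONTRACTION_MAP, key=len, reverse=True)
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep the capitalisation of the first letter of the original word
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_RE.sub(_replace, text)

print(expand_contractions("I'll be there. Shouldn't you be there too? It's fun."))
```

This sketch only recognises the straight apostrophe `'`; text with typographic apostrophes would need to be normalised first.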


#### 3) Chat Word Treatment
- Some abbreviated chat words commonly used on social media these days are:
    - gn ---> good night
    - fyi ---> for your information
    - asap ---> as soon as possible
    - yolo ---> you only live once
    - rofl ---> rolling on floor laughing
    - nvm ---> never mind
    - ofc ---> of course

- To pre-process text containing such abbreviations, we can look for a ready-made dictionary online or create a dictionary of our own

In [3]:
dict1 = { 
    'ack': 'acknowledge',
    'omg': 'oh my God',
    'aisi': 'as i see it',
    'bi5': 'back in 5 minutes',
    'lmk': 'let me know',
    'gn' : 'good night',
    'fyi': 'for your information',
    'asap': 'as soon as possible',
    'yolo': 'you only live once',
    'rofl': 'rolling on floor laughing',
    'nvm': 'never mind',
    'ofc': 'of course',
    'blv' : 'boulevard',
    'cir' : 'circle',
    'hwy' : 'highway',
    'ln' : 'lane',
    'pt' : 'point',
    'rd' : 'road',
    'sq' : 'square',
    'st' : 'street'
    }

In [2]:
mystr = "omg this is aisi I ack your work and will be bi5"
mystr

'omg this is aisi I ack your work and will be bi5'

In [7]:
# Look up each word in the dictionary; words without an entry are kept unchanged
newList = [dict1.get(word, word) for word in mystr.split()]
newString = ' '.join(newList)
print(newString)

oh my God this is as i see it I acknowledge your work and will be back in 5 minutes
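Splitting on spaces misses abbreviations that are capitalised or followed by punctuation, such as `OMG,` or `asap!`. The sketch below handles both cases with a regular expression; the dictionary `chat_words` is a hypothetical three-entry subset and the helper name `expand_chat_words` is an assumption.

```python
import re

# Hypothetical subset of a chat-word dictionary, for illustration only
chat_words = {
    'omg': 'oh my God',
    'lmk': 'let me know',
    'asap': 'as soon as possible',
}

def expand_chat_words(text, mapping=chat_words):
    # \w+ picks out each word and leaves the surrounding punctuation untouched;
    # the lookup is done on the lower-cased word so "OMG" matches "omg"
    return re.sub(r"\w+", lambda m: mapping.get(m.group(0).lower(), m.group(0)), text)

print(expand_chat_words("OMG, lmk asap!"))
```

Case-insensitive lookup is a trade-off: it would also expand a legitimate word that happens to match an abbreviation, so short entries such as `st` or `pt` deserve extra care.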


#### 4) Handle Emojis
- We come across lots and lots of emojis while scraping comments/posts from social media websites like Facebook, Instagram, WhatsApp, Twitter, and LinkedIn, and they need to be handled before analysis.
- Most machine learning algorithms cannot interpret raw emojis, so we have two options:
    - Simply remove the emojis from the text data, which can be done with a regular expression or with the `clean-text` library
    - Replace each emoji with its meaning: happy, sad, angry,....
>- ***a) Remove Emojis***

In [8]:
mystr = "These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏"
mystr

'These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏'

In [9]:
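import re

# Character class listing the Unicode ranges to strip. Some ranges are very broad and
# also match non-emoji characters, so narrow them down for multilingual data.
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        u"\U00002700-\U000027BF"  # dingbats
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
        u"\U000024C2-\U0001F251"  # very broad: also matches CJK, Hangul and other scripts
        u"\U0001f926-\U0001f937"  # gesture and people emojis
        u"\U00010000-\U0010ffff"  # every code point outside the Basic Multilingual Plane
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # miscellaneous symbols through geometric shapes
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
        "]+", flags=re.UNICODE)

print(emoji_pattern.sub(r'', mystr)) # no emoji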


These emojis needs to be removed, there is a huge list...


>- ***b) Replace Emojis with Their Meanings***

In [15]:
import sys
!{sys.executable} -m pip install -q  emoji

In [16]:
import emoji
mystr = "This is  👍"
emoji.demojize(mystr)

'This is  :thumbs_up:'

In [17]:
mystr = "I am 🤔"
emoji.demojize(mystr)

'I am :thinking_face:'

In [18]:
mystr = "This is  👍"
emoji.replace_emoji(mystr, replace='positive')

'This is  positive'
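If installing the `emoji` package is not an option, a rough equivalent of `demojize` can be sketched with the standard library's `unicodedata` module, which knows the official Unicode name of each character. The range in `EMOJI_RE` and the helper name `emoji_to_name` are assumptions made for illustration; the names produced are the Unicode character names, which differ from the aliases used by the `emoji` package.

```python
import re
import unicodedata

# Assumption: a narrow range that covers common pictographic emojis only
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF]")

def emoji_to_name(text):
    def _name(match):
        # unicodedata.name() returns the default ("") for unnamed code points
        name = unicodedata.name(match.group(0), "")
        return ":" + name.lower().replace(" ", "_") + ":"
    return EMOJI_RE.sub(_name, text)

print(emoji_to_name("This is 👍"))
```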

#### 5) Spelling Correction
- Text data often contains spelling errors. If they are not corrected, the same word may be represented in two or more different ways.
- Almost all word processors today underline incorrectly typed words and suggest possible corrections
- So spelling correction is a two-step task:
    - Detection of spelling errors
    - Correction of spelling errors
        - Autocorrect as soon as you type a space
        - Suggest a single correct word
        - Suggest a list of words (from which you can choose one)
- Types of spelling errors:
    - **Non-word Errors:** are words that do not exist in the language dictionary. For example, instead of typing `reading` the user typed `reeding`. These are easy to detect because they are not in the dictionary, and they can be corrected using algorithms like shortest weighted edit distance and highest noisy channel probability.
    - **Real-word Errors:** are dictionary words and are hard to detect. These can be of two types:
        - Typographical errors: For example, instead of typing `great` the user typed `greet`
        - Cognitive errors (homophones): For example, instead of typing `two` the user typed `too`


<h4 align="left" style="font-family:'Arial'">"I am reeding thiss gret boook on deta sciance suject, which is a greet curse"</h4>

In [20]:
import sys
!{sys.executable} -m pip install -q textblob

In [21]:
import textblob
textblob.__version__

'0.17.1'

In [22]:
from textblob import TextBlob
mystr = "I am reeding thiss gret boook on deta sciance suject, which is a greet curse"
bloob = TextBlob(mystr)
print(bloob)
print(type(bloob))

I am reeding thiss gret boook on deta sciance suject, which is a greet curse
<class 'textblob.blob.TextBlob'>


In [24]:
bloob.correct().string

'I am reading this great book on data science subject, which is a greet curse'

>-  The non-word errors like `reeding`, `thiss`, `gret`, `boook`, `deta`, `sciance` and `suject` have been corrected by the `correct()` method of the `TextBlob` object
>- However, the real-word errors like `greet` and `curse` are not corrected
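The edit-distance idea mentioned for non-word errors can be sketched in a few lines. This is a minimal, unweighted Levenshtein distance, not the algorithm TextBlob uses internally; the word list `vocabulary` and the helper `suggest` are hypothetical and stand in for a real dictionary.

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance with unit cost per edit
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical mini-dictionary; a real system would use a full word list
vocabulary = ['reading', 'this', 'great', 'book', 'data', 'science', 'subject']

def suggest(word, vocab=vocabulary):
    # Pick the dictionary word with the smallest edit distance
    return min(vocab, key=lambda w: edit_distance(word, w))

for typo in ['reeding', 'boook', 'sciance']:
    print(typo, '->', suggest(typo))
```

A weighted variant would assign lower costs to likely mistakes, for example keys that sit next to each other on the keyboard.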

#### 6) Tokenization

<img align=right src="images/tokenization.png" width="500">

- **What is Tokenization:** Tokenization is a process of splitting text into meaningful segments called tokens. It can be character level, subword level, word level (unigram), two word level (bigram), three word level (trigram), and sentence level.
- **Why to do Tokenization:** For classification of a product review as positive or negative, we may need to count the number of positive words and compare them with the count of negative words in the text of that review. For this we first need to tokenize the text of the product review. Tokens are the basic building blocks of a document object. Everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.
- **How to do Tokenization:** In a sentence you may come across following four items:
    -  **Prefix**:	Character(s) at the beginning &#9656; `( “ $ Rs Dr`
    -  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
    -  **Infix**:	Character(s) in between &#9656; `- -- / ...`
    -  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied. From `L.A.!` the exclamation mark (!) is separated, while `L.A.` is not split
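To see why these rules matter, consider a naive word-level tokenizer written with a single regular expression. This is only a sketch under simple assumptions (the helper name `simple_tokenize` is hypothetical), and it deliberately has no exception rules.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs each punctuation mark on its own
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This example is great!"))
print(simple_tokenize("Meet me in L.A.!"))
```

The first sentence is split sensibly, but `L.A.` is broken into `L`, `.`, `A`, `.`, which is exactly the case that exception rules in libraries such as NLTK and spaCy are meant to handle.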

#### a) Tokenization using NLTK
- NLTK stands for Natural Language Toolkit (https://www.nltk.org/). It is a suite of libraries and programs for statistical natural language processing, mainly for the English language
- NLTK was released in 2001, and is available for Windows, Mac OS X, and Linux.
- NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
- NLTK fully supports the English language, but others like Spanish or French are not supported as extensively.
- It is a string processing library, i.e., you give a string as input and get strings (or lists of strings) as output
- There are different tokenizers available in nltk:
    - `nltk.tokenize.sent_tokenize(str)` for sentence tokenization
    - `nltk.tokenize.word_tokenize(str)` for word tokenization
    - `nltk.tokenize.treebank.TreebankWordTokenizer().tokenize(str)` for Penn Treebank style word tokenization

In [2]:
import sys
!{sys.executable} -m pip install -q nltk

In [5]:
import nltk
nltk.__version__

'3.7'

In [7]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\iqbal\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iqbal\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\iqbal\AppData\Roaming\nltk_data...


True

In [6]:
from nltk.tokenize import word_tokenize, sent_tokenize
mystr="This example is great!" 
print(word_tokenize(mystr))

['This', 'example', 'is', 'great', '!']


In [7]:
mystr="You should do your Ph.D in A.I!" 
print(word_tokenize(mystr))

['You', 'should', 'do', 'your', 'Ph.D', 'in', 'A.I', '!']


In [8]:
mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
print(word_tokenize(mystr))

['You', 'should', "'ve", 'sent', 'me', 'an', 'email', 'at', 'arif', '@', 'pucit.edu.pk', 'or', 'vist', 'http', ':', '//www/arifbutt.me']


In [9]:
mystr="Here's an example worth $100. I am 384400km away from earth's moon!" 
print(word_tokenize(mystr))

['Here', "'s", 'an', 'example', 'worth', '$', '100', '.', 'I', 'am', '384400km', 'away', 'from', 'earth', "'s", 'moon', '!']


#### b) Tokenization using spaCy
- **spaCy** (https://spacy.io/) is an open-source Natural Language Processing library, released in 2015, designed to handle NLP tasks with efficient, state-of-the-art algorithms.
- spaCy supports tokenization for many languages (over 65). Besides importing spacy, you have to load the appropriate language model using the `spacy.load()` method, so make sure you have downloaded the model to your system first.
- spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

- **Download spacy model for English language**
    - spaCy comes with pretrained models and pipelines for different languages.
    - You can download any of the following models for the English language. The small model is a good starting point, since the larger ones need more disk space and take longer to download:
        - en_core_web_sm
        - en_core_web_md
        - en_core_web_lg
        - en_core_web_trf
    - The model name consists of the following parts:
        - Language (en): The language abbreviation, e.g., `en` for English, `fr` for French, `zh` for Chinese
        - Type (core/dep): `core` is a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition; `dep` provides only tagging, parsing and lemmatization
        - Genre (web/news): The type of text the pipeline is trained on, e.g., web or news.
        - Size: Package size indicator. `sm` for small, `md` for medium, `lg` for large and `trf` for transformer
        - Package version (a.b.c): Here `a` is the spaCy major version, `b` is the spaCy minor version, and `c` is the model version, which depends on the data the model is trained on, its parameters, number of iterations and vectors.
        
> For details read spaCy101: https://spacy.io/usage/spacy-101

In [10]:
import sys
!{sys.executable} -m pip install -q spacy

In [11]:
import spacy
spacy.__version__

'3.6.0'

In [12]:
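import ssl
# Workaround for SSL certificate errors during downloads: this disables certificate
# verification for the whole session, so use it only on a trusted network
ssl._create_default_https_context = ssl._create_unverified_context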

In [13]:
import sys
!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)

#### Example # 01

In [17]:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')
mystr="'A 7km Uber cab ride from Gulberg to Joher Town will cost you $20" 
doc = nlp(mystr)
for i in doc:
    print(i, end = ' , ')

' , A , 7 , km , Uber , cab , ride , from , Gulberg , to , Joher , Town , will , cost , you , $ , 20 , 

> <font color=green> Note that spacy has successfully separated the distance unit `km` from the number, which nltk failed to do.</font>

#### Example # 02

In [18]:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
doc = nlp(mystr)

for token in doc:
    print(token, end=' , ')

You , should , 've , sent , me , an , email , at , arif@pucit.edu.pk , or , vist , http://www , / , arifbutt.me , 

>- <font color=green> Note that spacy has kept the email as a single token, while nltk separated it.</font>
>- <font color=green> However, spacy also failed to properly tokenize the URL :(</font>

#### 7) Creating N-grams
- **What are n-grams?** 
    - A contiguous sequence of n words: a bigram (n = 2), a trigram (n = 3), and so on
- **Why to use n-grams?** 
    - Capture contextual information (`good food` carries more meaning than just `good` and `food` when observed independently)
    - Applications of N-grams:
        - Sentence Completion
        - Auto Spell Check and correction
        - Auto Grammar Check and correction
    - Is there a perfect value of n?
        - Different values of n suit different applications. Try several values of n on your data before concluding which one works best for your text analysis.

In [19]:
import nltk
mystr = "Allama Iqbal was a visionary philosopher and politician. Thank you"
tokens = nltk.tokenize.word_tokenize(mystr)
ngrams = nltk.ngrams(tokens, 2)
for grams in ngrams:
    print(grams)

('Allama', 'Iqbal')
('Iqbal', 'was')
('was', 'a')
('a', 'visionary')
('visionary', 'philosopher')
('philosopher', 'and')
('and', 'politician')
('politician', '.')
('.', 'Thank')
('Thank', 'you')


>- The formula to calculate the count of n-grams in a document is: **`X - N + 1`**, where `X` is the number of tokens in the document (here 11, since the full stop counts as a token) and `N` is the number of tokens in each n-gram
\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 2 + 1 \hspace{0.5cm} = \hspace{0.5cm} 10
\end{equation}

In [20]:
ngrams = nltk.ngrams(tokens, 3)
for grams in ngrams:
    print(grams)

('Allama', 'Iqbal', 'was')
('Iqbal', 'was', 'a')
('was', 'a', 'visionary')
('a', 'visionary', 'philosopher')
('visionary', 'philosopher', 'and')
('philosopher', 'and', 'politician')
('and', 'politician', '.')
('politician', '.', 'Thank')
('.', 'Thank', 'you')


\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 3 + 1 \hspace{0.5cm} = \hspace{0.5cm} 9
\end{equation}

In [21]:
ngrams = nltk.ngrams(tokens, 4)
for grams in ngrams:
    print(grams)

('Allama', 'Iqbal', 'was', 'a')
('Iqbal', 'was', 'a', 'visionary')
('was', 'a', 'visionary', 'philosopher')
('a', 'visionary', 'philosopher', 'and')
('visionary', 'philosopher', 'and', 'politician')
('philosopher', 'and', 'politician', '.')
('and', 'politician', '.', 'Thank')
('politician', '.', 'Thank', 'you')
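The sliding-window idea behind `nltk.ngrams` and the `X - N + 1` count can be reproduced in plain Python. This is a small sketch; the helper name `make_ngrams` is hypothetical, and the token list is written out by hand so that the example does not depend on NLTK.

```python
def make_ngrams(tokens, n):
    # zip() over n shifted copies of the list yields every window of n tokens
    return list(zip(*(tokens[i:] for i in range(n))))

# The 11 tokens of the example sentence (the full stop is a token of its own)
tokens = "Allama Iqbal was a visionary philosopher and politician . Thank you".split()

for n in (2, 3, 4):
    grams = make_ngrams(tokens, n)
    # The number of n-grams always equals X - N + 1
    print(n, len(grams), len(tokens) - n + 1)
```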


#### 8) Stopwords Removal
- Stopwords are extremely common words of a language that carry very little meaning. It is usually safe to remove them, since they add little to the later processing of our data.
- Every language has its own set of stopwords. For example, some stopwords of English language are: the, a, an, was, were, at, will, on, in, from, to, me, you, yours,....
- Whether you should remove stop words from your text or not mainly depends on the problem you are solving.
- Remove stop words from your text if you are working on:
    - Text Classification (Spam Filtering, Language Classification, Genre Classification)
    - Caption Generation
    - Auto-Tag Generation
- Avoid removing stop words from your text if you are working on:
    - Machine Translation
    - Language Modeling
    - Text Summarization
    - Question-Answering problems

#### a) Using NLTK
- The NLTK library ships predefined sets of stopwords for different languages. Here, we will focus on the `english` stopwords. You can also add extra stopwords if required
- Note that there is no single universal list of stopwords. The list of the stop words can change depending on your problem statement
- Installing nltk installs only the base library, not the packages for different languages, tokenization schemes, etc. To install all the nltk packages and corpora use `nltk.download()`
- An installation window will pop up. Select all and click 'Download' to download and install the additional bundles. This will download all the dictionaries and other language and grammar data necessary for full NLTK functionality.

In [22]:
import ssl
# Same workaround as before: disables SSL certificate verification for the session
ssl._create_default_https_context = ssl._create_unverified_context

In [23]:
import nltk
nltk.download("stopwords")
# nltk.download()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iqbal\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

> Once the download completes, you can import `stopwords` from `nltk.corpus` and use it to load the stop words

In [25]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'my', 'ourselves', 'be', 'now', 'why', 'any', "needn't", 'each', 'that', 'don', 'll', "wasn't", 'itself', 'isn', 'myself', 'being', 'him', 'hers', 'this', 'wouldn', 'our', 'ours', 'into', 'through', 'hadn', "hasn't", 'out', 'i', 'under', 'yourself', 'it', 'or', 'some', "don't", 'only', "haven't", 'shouldn', 'been', 'we', 'themselves', 'on', 'o', 'very', 'does', "weren't", "couldn't", 'up', "isn't", 'which', 'so', 'the', 'didn', 'theirs', 'where', 'his', "doesn't", 'of', 'few', 'for', 'ma', 'weren', 'yourselves', 'was', 'm', 'doesn', 'yours', 'from', 'were', 'they', "shouldn't", 'he', 'did', 'can', 'hasn', 'once', 'needn', 'her', 'are', 'while', "won't", 'those', 'and', 'these', "mustn't", 'all', 'here', 'as', 'by', 'because', 'until', 'no', 'am', 'off', 'will', "you've", 'them', 'again', 'having', 'himself', 'd', "didn't", 'you', 'then', 'down', 'not', 'couldn', 'below', 'should', 'than', 'too', 've', 'other', "it's", 'has', 'during', 'over', 'in', 'do', 'if', "should've", 'to', 'afte

In [26]:
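def remove_stopwords(text):
    # Build the stopword set once; calling stopwords.words() for every word is slow
    english_stopwords = set(stopwords.words('english'))
    # The comparison is case-sensitive, so capitalized words such as 'Your' are kept
    return ' '.join(word for word in text.split() if word not in english_stopwords)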

In [27]:
import nltk
from nltk.corpus import stopwords

mystr="Your Google account has been compromised. \
    Your account will be closed. Immediately click this link to update your account"
remove_stopwords(mystr)

'Your Google account compromised. Your account closed. Immediately click link update account'

In [28]:
mystr="This movie is not good"
remove_stopwords(mystr)

'This movie good'

>- <font color=green>For sentiment analysis purposes, the resulting sentence reads as positive, which is the opposite of the original meaning. So either do not remove stopwords while doing sentiment analysis, or handle the negation before removing stopwords.</font>
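One simple way to handle negation is to take the negation words out of the stopword list before filtering. The sketch below uses a tiny hypothetical stopword set so that it runs without NLTK; with NLTK you would start from `set(stopwords.words('english'))` instead. The helper name `remove_stopwords_keep_negation` and the choice of negation words are assumptions.

```python
# Hypothetical, tiny stopword list for illustration; NLTK's English list is much longer
stop_words = {'this', 'is', 'a', 'the', 'not', 'no', 'nor'}

# Words that flip the sentiment and should therefore survive the filtering
negations = {'not', 'no', 'nor'}

def remove_stopwords_keep_negation(text, stop_words=stop_words, keep=negations):
    effective = stop_words - keep
    # Lower-case each word for the lookup so that 'This' is treated like 'this'
    return ' '.join(w for w in text.split() if w.lower() not in effective)

print(remove_stopwords_keep_negation("This movie is not good"))
```

Contracted negations such as `isn't` or `don't` are also on NLTK's list, so they would need the same treatment, or the contractions could be expanded first.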

#### b) Using spaCy
- **Download spacy model for English language:** spaCy comes with pretrained models and pipelines for different languages. We have already downloaded the pretrained spaCy model for the English language
> For details read spaCy101: https://spacy.io/usage/spacy-101

In [29]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [30]:
# returns a set of around 326 English stopwords built into spaCy
print(len(nlp.Defaults.stop_words))
print(nlp.Defaults.stop_words)

326
{'thru', 'although', '’d', 'various', 'thereby', 'due', 'him', 'hundred', 'per', 'ours', 'through', 'bottom', 'i', 'or', 'only', 'been', 'front', 'fifteen', 'become', 'regarding', 'give', 'hereafter', 'very', 'whether', 'up', 'which', 'somewhere', 'must', 'his', 'us', 'yet', 'may', 'yourselves', 'therein', '’ll', 'are', 'her', 'herein', 'whose', '‘ll', 'by', 'no', 'off', 'them', 'himself', 'ten', 'thereafter', 'none', 'though', 'latter', 'noone', 'something', 'in', '’s', 'n‘t', 'perhaps', 'already', 'amount', 'several', 'formerly', 'together', 'about', 'otherwise', 'beside', 'third', 'nor', 'well', 'same', 'whoever', 're', '‘s', '’ve', 'seemed', 'twelve', 'before', 'eight', 'everyone', 'becoming', 'when', 'my', 'thereupon', 'why', 'that', 'also', 'sometime', 'call', 'except', 'some', 'themselves', 'whereby', 'put', 'does', 'nowhere', 'another', 'the', 'hence', 'ever', 'ca', 'nine', 'mine', 'anyone', 'take', 'few', 'afterwards', 'of', 'from', 'were', 'they', 'go', 'can', 'once', 'wh

In [31]:
def remove_stopwords_spacy(text):
    new_text = list()
    for word in text.split():
        if word not in nlp.Defaults.stop_words:
            new_text.append(word)
    return " ".join(new_text)