## NLTK: A NATURAL LANGUAGE PROCESSING TOOLKIT

In [1]:
import nltk

#### TOKENIZING

Here’s how to download the Punkt tokenizer models and split a sentence into word tokens with nltk.word_tokenize():

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
sentence = "ML-Capsule is a github repo that shows a good collection of Machine learning with python and data science with algorithms,projects,explanations from basic to advance level."

In [4]:
# Tokenization
tokens = nltk.word_tokenize(sentence)

In [5]:
tokens

['ML-Capsule',
 'is',
 'a',
 'github',
 'repo',
 'that',
 'shows',
 'a',
 'good',
 'collection',
 'of',
 'Machine',
 'learning',
 'with',
 'python',
 'and',
 'data',
 'science',
 'with',
 'algorithms',
 ',',
 'projects',
 ',',
 'explanations',
 'from',
 'basic',
 'to',
 'advance',
 'level',
 '.']
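NLTK can split text into sentences as well as words (its sentence splitter is nltk.sent_tokenize()). To make the difference between the two levels concrete, here is a minimal sketch that uses only Python's re module. It is a rough approximation and not NLTK's Punkt algorithm; the sample text and both regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical two-sentence sample text, not taken from the notebook above.
text = "NLTK splits text into tokens. It can also split text into sentences!"

# Naive sentence split: break after ., ! or ? when followed by whitespace.
# Unlike NLTK's Punkt models, this mishandles abbreviations such as "Dr.".
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word split: runs of word characters (allowing inner hyphens or
# apostrophes), or any single punctuation mark as its own token.
word_pattern = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
words_per_sentence = [word_pattern.findall(s) for s in sentences]

print(sentences)
print(words_per_sentence[0])
```

Sentence tokens are whole sentences, while word tokens keep punctuation as separate items, which matches the shape of the word_tokenize() output above.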

#### POS TAGGING

Here’s how to download the tagger model and label each token with its part of speech using nltk.pos_tag():

In [6]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [7]:
# POS Tagging
tagged = nltk.pos_tag(tokens)

In [8]:
tagged[0:6]

[('ML-Capsule', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('github', 'NN'),
 ('repo', 'NN'),
 ('that', 'WDT')]
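The tagger returns a list of (token, tag) tuples, with tags drawn from the Penn Treebank tag set. One way to work with that list is sketched below, using only the standard library. The six pairs are copied from the output above so the snippet runs without NLTK; the variable names tag_counts and nouns are illustrative choices.

```python
from collections import Counter

# First six (token, tag) pairs copied from the output above.
tagged = [
    ("ML-Capsule", "NN"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("github", "NN"),
    ("repo", "NN"),
    ("that", "WDT"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the tokens whose tag starts with "NN" (the noun tags).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(1))
print(nouns)
```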

#### NAMED ENTITY RECOGNITION (NER)

With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are. You can use nltk.ne_chunk() on a list of POS-tagged tokens to recognize named entities.

In [9]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [10]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [11]:
# Extracting all types of entities
entities = nltk.chunk.ne_chunk(tagged)

In [12]:
print(entities)

(S
  ML-Capsule/NN
  is/VBZ
  a/DT
  github/NN
  repo/NN
  that/WDT
  shows/VBZ
  a/DT
  good/JJ
  collection/NN
  of/IN
  (GPE Machine/NNP)
  learning/VBG
  with/IN
  python/NN
  and/CC
  data/NNS
  science/NN
  with/IN
  algorithms/NN
  ,/,
  projects/NNS
  ,/,
  explanations/NNS
  from/IN
  basic/JJ
  to/TO
  advance/VB
  level/NN
  ./.)
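The result is a tree in which plain tokens stay as (word, tag) tuples and each recognized entity becomes a labelled subtree. In the output above the only entity is "Machine", labelled GPE (geo-political entity), which is most likely a false positive triggered by the capital letter. A minimal sketch for pulling the entities out of such a tree follows; it builds a small fragment of the tree by hand, so it assumes only that NLTK is installed and needs no downloaded models.

```python
from nltk.tree import Tree

# Hand-built fragment of the tree printed above, so the sketch runs
# without the chunker models.
entities = Tree("S", [
    ("collection", "NN"),
    ("of", "IN"),
    Tree("GPE", [("Machine", "NNP")]),
    ("learning", "VBG"),
])

# Children that are Tree objects are named entities; plain tokens are tuples.
named = [
    (" ".join(word for word, _ in child.leaves()), child.label())
    for child in entities
    if isinstance(child, Tree)
]

print(named)
```

The same list comprehension should work on the full entities tree returned by ne_chunk() in the cell above.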


#### LOWER CASE CONVERSION

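Here’s how to convert a sentence to lowercase and strip punctuation with the re module. Every non-alphanumeric character is replaced with a space, which is why "ML-Capsule" becomes the two words 'ml' and 'capsule':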

In [18]:
import re

In [19]:
# Lowercase conversion: avoids duplicate tokens that differ only in case.
# Non-alphanumeric characters are replaced with spaces before splitting.
text = re.sub(r"[^a-zA-Z0-9]", " ", sentence.lower())
words = text.split()
print(words)

['ml', 'capsule', 'is', 'a', 'github', 'repo', 'that', 'shows', 'a', 'good', 'collection', 'of', 'machine', 'learning', 'with', 'python', 'and', 'data', 'science', 'with', 'algorithms', 'projects', 'explanations', 'from', 'basic', 'to', 'advance', 'level']
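To see why lowercasing reduces redundancy, the following sketch counts distinct tokens before and after case folding. The token list is a made-up example, not data from the notebook.

```python
from collections import Counter

# Hypothetical tokens that differ only in capitalisation.
sample_tokens = ["Machine", "machine", "MACHINE", "learning", "Learning"]

# Without lowercasing, every spelling is a separate vocabulary entry.
raw_counts = Counter(sample_tokens)

# After lowercasing, the variants collapse into two entries.
lower_counts = Counter(t.lower() for t in sample_tokens)

print(len(raw_counts), len(lower_counts))
print(lower_counts)
```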


#### REMOVAL OF STOPWORDS

Here’s how to import the relevant parts of NLTK in order to filter out stop words:

In [23]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
# All stopwords in English
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [26]:
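# Stopwords Removal
# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)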

['ml', 'capsule', 'github', 'repo', 'shows', 'good', 'collection', 'machine', 'learning', 'python', 'data', 'science', 'algorithms', 'projects', 'explanations', 'basic', 'advance', 'level']


#### STEMMING

Here’s how to import the relevant parts of NLTK in order to start stemming. After importing, you can create a stemmer with PorterStemmer().

In [27]:
# Stemming
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems, reusing a single stemmer instance
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['ml', 'capsul', 'github', 'repo', 'show', 'good', 'collect', 'machin', 'learn', 'python', 'data', 'scienc', 'algorithm', 'project', 'explan', 'basic', 'advanc', 'level']
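Stemming is rule-based suffix stripping, so different inflections of a word collapse to one stem, and that stem need not be a dictionary word (for example 'scienc' above). A short sketch of both effects, assuming NLTK is installed; the list of verb forms is a made-up example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # create once and reuse

# Hypothetical inflected forms of a single verb.
forms = ["learn", "learns", "learned", "learning"]

# All four forms reduce to the same stem.
stems = {stemmer.stem(w) for w in forms}
print(stems)

# A stem is not guaranteed to be a real word.
print(stemmer.stem("science"))
```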


#### LEMMATIZATION

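Here’s how to import the relevant parts of NLTK in order to start lemmatizing. After importing, you can create a lemmatizer with WordNetLemmatizer(). By default, lemmatize() treats every word as a noun, which is why 'learning' is left unchanged in the output below.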

In [29]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [30]:
# Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their dictionary form, reusing a single lemmatizer instance
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['ml', 'capsule', 'github', 'repo', 'show', 'good', 'collection', 'machine', 'learning', 'python', 'data', 'science', 'algorithm', 'project', 'explanation', 'basic', 'advance', 'level']
