# Natural Language Processing in Artificial Intelligence

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that studies how machines understand human language. Its goal is to build systems that can make sense of text and perform tasks like translation, grammar checking, or topic classification.

# Applications

* Text Classification and Categorization
* Named Entity Recognition (NER)
* Part-of-Speech Tagging
* Semantic Parsing and Question Answering
* Language Generation and Multi-document Summarization
* Speech Recognition
* Spell Checking
* Sentiment Analysis
* Chatbots & Virtual Assistants
* Text Extraction
* Machine Translation
* Text Summarization
* Market Intelligence
* Auto-Correct
* Intent Classification
* Urgency Detection


# NLP Frameworks 

* NLTK - the Swiss Army knife of NLP, useful for a wide range of tasks
* Gensim - another Python-centric library that is highly effective for topic modeling
* Rasa NLU - an increasingly strong alternative to proprietary language-understanding engines
* spaCy
* Watson by IBM

!['nlp'](nlp_lib.jpg)

# Text Summarization with NLTK in Python

* Build a simple NLP-based technique for text summarization.
* We will use Python's NLTK library to summarize Wikipedia articles.

https://www.nltk.org/

# Text Summarization Steps



So, keep working. Keep striving. Never give up. Fall down seven times, get up eight. Ease is a greater threat to progress than hardship. Ease is a greater threat to progress than hardship. So, keep moving, keep growing, keep learning. See you at work.

1) Convert Paragraphs to Sentences

2) Text Preprocessing

We need to remove:
* all special characters
* stop words
* numbers

3) Tokenizing the Sentences
* Get all the words that exist in the sentences.

4) Find Weighted Frequency of Occurrence
* We can find the weighted frequency of each word by dividing its frequency by the frequency of the most frequently occurring word.

5) Replace Words by Weighted Frequency in Original Sentences

* Note that the weighted frequency of words removed during preprocessing (stop words, punctuation, digits, etc.) is zero, so they do not need to be added. Each sentence's score is the sum of the weighted frequencies of its words.

6) Sort Sentences in Descending Order of Sum
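
The six steps above can be sketched end to end in plain Python. This is a minimal illustration rather than the NLTK pipeline used later: the regex sentence splitter and the tiny `STOP_WORDS` set are simplifying assumptions, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "at", "is", "so", "than", "the", "to", "up", "you"}

def summarize(text, top_n=2):
    # Step 1: split the paragraph into sentences (naive split after ., ! or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Steps 2-3: keep letters only, lowercase, tokenize, drop stop words.
    def tokens(sentence):
        words = re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()
        return [w for w in words if w not in STOP_WORDS]

    # Step 4: weighted frequency = count / count of the most frequent word.
    counts = Counter(w for s in sentences for w in tokens(s))
    top_count = max(counts.values())
    weights = {w: c / top_count for w, c in counts.items()}

    # Step 5: score each sentence as the sum of its words' weights.
    scores = {s: sum(weights[w] for w in tokens(s)) for s in sentences}

    # Step 6: sort sentences by score, highest first, and keep the best ones.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], weights

text = (
    "So, keep working. Keep striving. Never give up. "
    "Fall down seven times, get up eight. "
    "Ease is a greater threat to progress than hardship. "
    "Ease is a greater threat to progress than hardship. "
    "So, keep moving, keep growing, keep learning. See you at work."
)
top_sentences, weights = summarize(text)
print(top_sentences)
```

In this sketch "keep" is the most frequent word, so its weight is 1.0 and every other weight is expressed relative to it.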


# Summarizing Wikipedia Articles

### Fetching Articles from Wikipedia

* We first need to install Beautiful Soup, a very useful Python utility for web scraping.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [1]:
import bs4 as bs
import urllib.request
import re

# We then use the urlopen function from the urllib.request utility to scrape the data.

# https://en.wikipedia.org/wiki/Kenya
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Kenya')
scraped_data 

<http.client.HTTPResponse at 0x2353b005ec8>
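
A slightly more defensive way to fetch the page is sketched below. It assumes you want an explicit `User-Agent` header (some servers refuse requests that carry the default one) and a context manager that closes the connection. The helper names `build_request` and `fetch_html` and the user-agent string are hypothetical.

```python
import urllib.request

def build_request(url, user_agent="nlp-summarizer-demo/0.1"):
    # Attach an explicit User-Agent header to the request (illustrative value).
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    # The with-block closes the HTTP response once the body has been read.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()  # raw bytes, like scraped_data.read()

request = build_request("https://en.wikipedia.org/wiki/Kenya")
print(request.full_url)
# article = fetch_html("https://en.wikipedia.org/wiki/Kenya")  # needs network access
```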

In [23]:
# Call read() on the object returned by urlopen to read the raw page data.
article = scraped_data.read()
article

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Kenya - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b6c290a5-036d-4224-9a7c-cc348dd2ed1a","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Kenya","wgTitle":"Kenya","wgCurRevisionId":977245330,"wgRevisionId":977245330,"wgArticleId":188171,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian\xe2\x80\x93Gregorian uncertainty","Webarchive template wayback links","All articles with dead external links","Articles with dead external links from December 2017","Articles 

In [24]:
parsed_article = bs.BeautifulSoup(article,'lxml')
parsed_article 

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Kenya - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b6c290a5-036d-4224-9a7c-cc348dd2ed1a","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Kenya","wgTitle":"Kenya","wgCurRevisionId":977245330,"wgRevisionId":977245330,"wgArticleId":188171,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","Webarchive template wayback links","All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently d

In [25]:
# To retrieve the text, call find_all on the object returned by BeautifulSoup; the article text sits inside <p> tags.
paragraphs = parsed_article.find_all('p')
paragraphs 

[<p class="mw-empty-elt">
 </p>,
 <p>
 <span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>: <span class="plainlinks nourlexpansion"><a class="external text" href="//geohack.toolforge.org/geohack.php?pagename=Kenya&amp;params=1_N_38_E_" rel="nofollow"><span class="geo-default"><span class="geo-dms" title="Maps, aerial photos, and other data for this location"><span class="latitude">1°N</span> <span class="longitude">38°E</span></span></span><span class="geo-multi-punct">﻿ / ﻿</span><span class="geo-nondefault"><span class="geo-dec" title="Maps, aerial photos, and other data for this location">1°N 38°E</span><span style="display:none">﻿ / <span class="geo">1; 38</span></span></span></a></span></span></span>
 </p>,
 <p><b>Kenya</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-botto

In [26]:
article_text = ""

for p in paragraphs:
    article_text += p.text
    
article_text 

'\n\nCoordinates: 1°N 38°E\ufeff / \ufeff1°N 38°E\ufeff / 1; 38\nKenya (/ˈkɛnjə/ (listen)), officially the Republic of Kenya (Swahili: Jamhuri ya Kenya), is a country in Eastern Africa. At 580,367 square kilometres (224,081\xa0sq\xa0mi), Kenya is the world\'s 48th largest country by total area. With a population of more than 47.6 million people, Kenya is the 29th most populous country.[5] Kenya\'s capital and largest city is Nairobi, while its oldest city and first capital is the coastal city of Mombasa. Kisumu City is the third largest city and also an inland port on Lake Victoria. Other important urban centres include Nakuru and Eldoret. As of 2020, Kenya is the third largest economy in sub-Saharan Africa after Nigeria and South Africa.[11] Kenya is bordered by South Sudan to the northwest, Ethiopia to the north, Somalia to the east, Uganda to the west, Tanzania to the south, and the Indian Ocean to the southeast.\nAccording to archaeological dating of associated artifacts and skelet

### Preprocessing

In [27]:
# Remove references from the article; references are numbers enclosed in square brackets.
# Then replace the resulting runs of whitespace with a single space.
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

In [28]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

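Now we have two objects: article_text, which contains the original article, and formatted_article_text, which contains the article with special characters and digits removed. Sentences are split from article_text, because sentence boundaries depend on punctuation, while word frequencies are counted on formatted_article_text.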
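
To make the difference between the two objects concrete, here is a small sketch that applies the same regular expressions to a short string. The sample text is made up for illustration.

```python
import re

sample = "Kenya is the 29th most populous country.[5]  Its capital is Nairobi.[11]"

# Same substitutions as above: drop numeric references, collapse whitespace.
text = re.sub(r'\[[0-9]*\]', ' ', sample)
text = re.sub(r'\s+', ' ', text)

# Letters only: punctuation and digits become spaces.
formatted = re.sub('[^a-zA-Z]', ' ', text)
formatted = re.sub(r'\s+', ' ', formatted)

print(text)
print(formatted)
```

Note that stripping the digits from "29th" leaves a stray "th" token behind, which is why fragments like "th" can show up in the word counts.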

### Converting Text To Sentences

In [29]:
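import nltk
# sent_tokenize relies on NLTK's Punkt tokenizer data; if it is missing, download it once
# with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).
sentence_list = nltk.sent_tokenize(article_text)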

### Find Weighted Frequency of Occurrence

In [30]:
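# The NLTK stop-word list is lowercase, so capitalized stop words
# such as 'At' or 'With' are not filtered out here.
stopwords = nltk.corpus.stopwords.words('english')

# Count how often each remaining word occurs.
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1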
            
word_frequencies

{'Coordinates': 1,
 'N': 3,
 'E': 3,
 'Kenya': 220,
 'k': 2,
 'nj': 2,
 'listen': 1,
 'officially': 1,
 'Republic': 6,
 'Swahili': 16,
 'Jamhuri': 1,
 'ya': 3,
 'country': 40,
 'Eastern': 6,
 'Africa': 38,
 'At': 6,
 'square': 2,
 'kilometres': 4,
 'sq': 3,
 'mi': 4,
 'world': 19,
 'th': 15,
 'largest': 19,
 'total': 9,
 'area': 12,
 'With': 3,
 'population': 20,
 'million': 23,
 'people': 19,
 'populous': 1,
 'capital': 4,
 'city': 13,
 'Nairobi': 17,
 'oldest': 1,
 'first': 18,
 'coastal': 8,
 'Mombasa': 11,
 'Kisumu': 2,
 'City': 2,
 'third': 4,
 'also': 27,
 'inland': 3,
 'port': 4,
 'Lake': 6,
 'Victoria': 2,
 'Other': 5,
 'important': 7,
 'urban': 7,
 'centres': 3,
 'include': 16,
 'Nakuru': 1,
 'Eldoret': 1,
 'As': 6,
 'economy': 10,
 'sub': 3,
 'Saharan': 2,
 'Nigeria': 2,
 'South': 5,
 'bordered': 1,
 'Sudan': 4,
 'northwest': 3,
 'Ethiopia': 4,
 'north': 5,
 'Somalia': 7,
 'east': 3,
 'Uganda': 7,
 'west': 3,
 'Tanzania': 6,
 'south': 2,
 'Indian': 9,
 'Ocean': 3,
 'southeast

In [31]:
# We can simply divide the number of occurrences of each word by the frequency of the most frequently occurring word.
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

# Show the weight of the last word in the dictionary.
word_frequencies[word]

0.004545454545454545
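
The same normalization can be written more compactly with `collections.Counter`. The sketch below uses a small made-up token list, and `weighted_frequencies` is a hypothetical helper name.

```python
from collections import Counter

def weighted_frequencies(tokens, stopwords=frozenset()):
    # Count every token that is not a stop word.
    counts = Counter(t for t in tokens if t not in stopwords)
    # Divide each count by the count of the most frequent word.
    maximum = max(counts.values())
    return {word: count / maximum for word, count in counts.items()}

tokens = ["Kenya", "is", "a", "country", "Kenya", "is", "Kenya"]
weights = weighted_frequencies(tokens, stopwords={"is", "a"})
print(weights)
```

The most frequent word always gets a weight of 1.0, and every other weight falls between 0 and 1.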

### Calculating Sentence Scores

In [32]:
# Start with an empty sentence_scores dictionary.
sentence_scores = {}
for sent in sentence_list:
    # Tokenize the sentence into lowercase words. The keys of word_frequencies
    # keep their original case, so capitalized words such as 'Kenya' do not match.
    for word in nltk.word_tokenize(sent.lower()):
        # Check whether the word exists in the word_frequencies dictionary.
        if word in word_frequencies.keys():
            # Only score sentences with fewer than 30 words.
            if len(sent.split(' ')) < 30:
                # Add the word's weight to the sentence's running score.
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
                    
sentence_scores

{' Coordinates: 1°N 38°E\ufeff / \ufeff1°N 38°E\ufeff / 1; 38 Kenya (/ˈkɛnjə/ (listen)), officially the Republic of Kenya (Swahili: Jamhuri ya Kenya), is a country in Eastern Africa.': 0.2318181818181818,
 "At 580,367 square kilometres (224,081 sq mi), Kenya is the world's 48th largest country by total area.": 0.509090909090909,
 'With a population of more than 47.6 million people, Kenya is the 29th most populous country.': 0.46818181818181814,
 "Kenya's capital and largest city is Nairobi, while its oldest city and first capital is the coastal city of Mombasa.": 0.42272727272727273,
 'Kisumu City is the third largest city and also an inland port on Lake Victoria.': 0.37727272727272726,
 'Other important urban centres include Nakuru and Eldoret.': 0.15,
 'As of 2020, Kenya is the third largest economy in sub-Saharan Africa after Nigeria and South Africa.': 0.15909090909090912,
 "Nilotic-speaking pastoralists (ancestral to Kenya's Nilotic speakers) started migrating from present-day sou

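One thing to watch in this step: the sentence tokens are lowercased, while the keys of word_frequencies keep their original capitalization. That is why 'Other important urban centres include Nakuru and Eldoret.' scores 0.15, which is (7 + 7 + 3 + 16) / 220: only 'important', 'urban', 'centres' and 'include' are matched. A minimal sketch of a case-consistent variant follows. `score_sentences` is a hypothetical helper, and it splits on whitespace instead of calling NLTK so that it stays self-contained.

```python
def score_sentences(sentences, word_weights, max_words=30):
    # Lowercase the lookup table so it matches the lowercased sentence tokens.
    # (If two keys differ only by case, the later one wins in this sketch.)
    lowered = {word.lower(): weight for word, weight in word_weights.items()}
    scores = {}
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) >= max_words:
            continue  # skip long sentences, as in the cell above
        # Strip surrounding punctuation before looking each word up.
        total = sum(lowered.get(word.strip('.,;:!?()"\''), 0.0) for word in words)
        if total > 0:
            scores[sentence] = total
    return scores

weights = {"Kenya": 1.0, "country": 0.5, "largest": 0.25}
scores = score_sentences(["Kenya is a country.", "It is large."], weights)
print(scores)
```

With this change, proper nouns contribute to the score, which will generally alter which sentences rank highest.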
### Getting the Summary

In [35]:
# We use the heapq library and call its nlargest function to retrieve the top 3 sentences with the highest scores.
import heapq
summary_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Mwai Kibaki became the first president to serve under this new constitution while Uhuru Kenyatta became the first president elected under this constitution. Several contentious clauses, including the one that allowed for only one political party, were changed in the following years. Basic formal education starts at age six and lasts 12 years, consisting of eight years in primary school and four in high school or secondary.
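
`heapq.nlargest` returns sentences ordered by score, not by where they appear in the article. If reading order matters, one option (an assumption about what you may want, not something the steps above require) is to re-sort the selected sentences by their original position. The data below is made up for illustration.

```python
import heapq

sentence_list = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
sentence_scores = {
    "First sentence.": 0.5,
    "Second sentence.": 0.2,
    "Third sentence.": 0.1,
    "Fourth sentence.": 0.9,
}

# Highest-scoring sentences first.
best = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

# Same sentences, restored to their order of appearance.
in_order = sorted(best, key=sentence_list.index)

print(best)
print(in_order)
```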
