<p>&nbsp;</p>
<h1 style="text-align: center;"><strong>Using Natural Language Processing</strong></h1>
<h2 style="text-align: center;"><strong>to create automatic text summarization</strong></h2>
<p>&nbsp;</p><p>&nbsp;</p><p>&nbsp;</p>

# Introduction

We will use the following libraries:

- NLTK (Natural Language Toolkit): an open source library for natural language processing.
- Beautiful Soup: a library for web scraping, a technique for extracting data from websites.

**Imports and Parameters:**

In [45]:
from urllib.request import Request, urlopen
import pandas as pd
import nltk

***

# NLTK Configuration

**To download the NLTK data, simply execute the following code within a REPL:**

In [47]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

It will open the graphical interface of the NLTK downloader, where you can choose the corpora and models to download:

For the tutorial it is recommended to download the following packages:
- averaged_perceptron_tagger
- floresta
- mac_morpho
- machado
- punkt
- stopwords
- wordnet
- words
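
If the graphical downloader is not available (for example, on a machine without a display), the same packages can be fetched by name. The sketch below assumes `nltk.download` is called with a package identifier and the `quiet` flag; `PACKAGES` and `download_packages` are illustrative names, not part of the original tutorial.

```python
# Packages recommended above for this tutorial.
PACKAGES = [
    "averaged_perceptron_tagger",
    "floresta",
    "mac_morpho",
    "machado",
    "punkt",
    "stopwords",
    "wordnet",
    "words",
]


def download_packages(packages=PACKAGES, quiet=True):
    """Download each NLTK package by name and return {name: succeeded}."""
    # Imported lazily so the list above can be inspected without NLTK installed.
    import nltk

    return {name: nltk.download(name, quiet=quiet) for name in packages}
```

Calling `download_packages()` needs network access; any package whose value is `False` in the returned dictionary was not downloaded.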

***

# Extract Site Information

**Read a news article:**

In [48]:
link = Request('https://www.nexojornal.com.br/externo/2019/01/12/5-revela%C3%A7%C3%B5es-da-psicologia-para-voc%C3%AA-encontrar-sua-verdadeira-voca%C3%A7%C3%A3o',headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(link).read().decode('utf-8', 'ignore')

The information returned will be stored in a variable named page.
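
The request above can be wrapped in a small helper so that it is easy to reuse with other addresses. This is only a sketch: `fetch_html` is a hypothetical name, the 10-second timeout is an arbitrary choice, and it assumes the server declares its character set in the `Content-Type` header (falling back to UTF-8 otherwise).

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its content as text."""
    # Some sites reject requests that lack a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the declared charset when there is one, otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="ignore")
```

With this helper, the cell above would become `page = fetch_html(url)`, where `url` is the address of the news article.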

***

# Parsing the Page

**Page:**

In this step it is important to know that the code depends on the structure of the page being scraped; to scrape other pages, it must be adapted to their structure.

On the Nexo Jornal website, the [news](https://www.nexojornal.com.br/externo/2019/01/12/5-revela%C3%A7%C3%B5es-da-psicologia-para-voc%C3%AA-encontrar-sua-verdadeira-voca%C3%A7%C3%A3o) text is inside a DIV with ID = content-body-548-153162.

In [51]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
text = soup.find(id="content-body-548-153162").text

The code searches for the element whose ID is content-body-548-153162, which filters the page down to just the news text.
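
Because the ID is specific to one page, the lookup is worth isolating in a function that fails clearly when the element is missing. The following is a minimal sketch, not the tutorial's own code: `extract_text_by_id` and the sample HTML are made up for illustration, and it uses Python's built-in `html.parser` backend instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_text_by_id(html, element_id, parser="html.parser"):
    """Return the text of the element that has the given id."""
    soup = BeautifulSoup(html, parser)
    element = soup.find(id=element_id)
    if element is None:
        # find() returns None when nothing matches, e.g. after a site redesign.
        raise ValueError(f"no element with id={element_id!r} in the page")
    return element.get_text(separator=" ", strip=True)


# Made-up page with a menu and an article body.
sample_html = (
    '<div id="menu">Home</div>'
    '<div id="article"><p>First.</p><p>Second.</p></div>'
)
print(extract_text_by_id(sample_html, "article"))  # First. Second.
```

To scrape another site, inspect its HTML in the browser and pass the ID of the element that wraps the article text.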

***

# Natural Language Toolkit (NLTK) 

**Imports and Parameters:**

In [52]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

**Divide the text into sentences and then into words:**

In [54]:
sentence = sent_tokenize(text)
words = word_tokenize(text.lower())

***

# Stopwords: 

**Imports and Parameters:**

In [57]:
from nltk.corpus import stopwords
from string import punctuation

**Removing stopwords from the word list:**

In [58]:
# Use a new name so the imported stopwords corpus is not overwritten
stop_words = set(stopwords.words('portuguese') + list(punctuation))
words_without_stopwords = [word for word in words if word not in stop_words]

The NLTK tokenizer treats punctuation marks as tokens too, so they must be removed along with the stopwords.

The code uses a set (which does not allow repeated elements and makes membership tests fast) together with a list comprehension.

The stopword language is set to Portuguese, the language of the article.

All punctuation marks are removed using punctuation from the string library.

The result is the list of words from the text without stopwords, stored in the variable words_without_stopwords.
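
The filtering step can be tried on a handful of tokens without downloading any corpus. In the sketch below, `SAMPLE_STOPWORDS` is a tiny hand-written list used only for illustration (the real list comes from NLTK), and `remove_stopwords` is a hypothetical helper name.

```python
from string import punctuation

# Tiny hand-written stopword list, for illustration only.
SAMPLE_STOPWORDS = {"a", "o", "e", "de", "que", "é"}


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Drop stopwords and single-character punctuation tokens."""
    ignored = set(stop_words) | set(punctuation)
    return [token for token in tokens if token not in ignored]


tokens = ["a", "paixão", "é", "importante", ",", "e", "o", "chamado", "."]
print(remove_stopwords(tokens))  # ['paixão', 'importante', 'chamado']
```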

***

# Frequency Distribution and Sentences

**Imports and Parameters:**

In [61]:
from nltk.probability import FreqDist
from collections import defaultdict

**Discovering Frequency:**

In [62]:
frequency = FreqDist(words_without_stopwords)

**Most Important Sentences:**

In [63]:
byword_importants = defaultdict(int)

Each sentence receives a score based on how often the important words appear in it.

The scores are stored in a special dictionary called defaultdict, from the collections library.

The main difference from a regular dictionary is that it does not raise an exception (KeyError) when you look up a non-existent key. Instead, it adds the key to the dictionary with a default value (0 for int).
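
A short comparison makes the difference concrete. The variable names here are illustrative only.

```python
from collections import defaultdict

# A regular dict raises KeyError when a missing key is incremented.
plain = {}
try:
    plain["missing"] += 1
    plain_raised = False
except KeyError:
    plain_raised = True

# defaultdict(int) creates the key with int() == 0 and then increments it.
scores = defaultdict(int)
scores["missing"] += 1

# Merely reading a missing key also inserts it, with the value 0.
value = scores["other"]

print(plain_raised)   # True
print(dict(scores))   # {'missing': 1, 'other': 0}
```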

**Populating the dictionary:**

A loop goes through all the sentences and accumulates their scores.

In [64]:
for i, sentences in enumerate(sentence):
    for word in word_tokenize(sentences.lower()):
        if word in frequency:
            byword_importants[i] += frequency[word]

Note that the above code populates the dictionary with the index of the sentence (key) and the sum of the frequency of each word present in the sentence (value).
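
The same loop can be followed by hand on a toy example. The frequencies and the already-tokenized sentences below are invented for illustration, with `Counter` standing in for NLTK's `FreqDist`.

```python
from collections import Counter, defaultdict

# Invented word frequencies (stopwords are absent, as in the real pipeline).
frequency = Counter({"paixão": 3, "chamado": 2, "importante": 1})

tokenized_sentences = [
    ["a", "paixão", "é", "importante"],    # 3 + 1 = 4
    ["o", "chamado", "e", "a", "paixão"],  # 2 + 3 = 5
    ["um", "dia", "atarefado"],            # no frequent word, so no entry
]

scores = defaultdict(int)
for index, tokens in enumerate(tokenized_sentences):
    for token in tokens:
        if token in frequency:
            scores[index] += frequency[token]

print(dict(scores))  # {0: 4, 1: 5}
```

A sentence with no frequent word never appears in the dictionary. Longer sentences also tend to score higher, simply because they contain more words.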

***

# Summary

**Imports and Parameters:**

In [68]:
from heapq import nlargest

To make the selection easier, the nlargest function from the heapq library is used.

**Selecting in the dictionary the "n" most important sentences to form our summary:**

In [69]:
idx_byword_importants = nlargest(4, byword_importants, byword_importants.get)

The code above selects the indices of the 4 highest-scoring sentences.
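
When nlargest receives a dictionary, it iterates over the keys and ranks them with the key function, here the dictionary's own `get` method. A small example with invented scores:

```python
from heapq import nlargest

# Invented scores: sentence index -> score.
scores = {0: 7, 1: 2, 2: 9, 3: 1, 4: 6}

# Indices of the 3 highest-scoring sentences, best first.
top = nlargest(3, scores, key=scores.get)
print(top)          # [2, 0, 4]

# Sorting the indices puts the chosen sentences back in reading order.
print(sorted(top))  # [0, 2, 4]
```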

***

# Creating Abstract

**Here's our news summary:**

In [70]:
for i in sorted(idx_byword_importants):
    print(sentence[i])

O que você precisa fazer é primeiro descobrir sua paixão — aquilo que realmente é importante para você.”Barack ObamaSe você, como muitos, está procurando por seu chamado na vida – talvez você ainda esteja incerto sobre qual profissão se alinha com o que é mais importante – aqui estão cinco descobertas de pesquisas recentes que merecem ser levadas em consideração.Primeiro, há uma diferença entre ter uma paixão harmoniosa e uma paixão obsessiva.
Eles concluíram: “ter um chamado é apenas um benefício se for realizado, mas pode ser um prejuízo quando não for, mesmo em comparação a não ter nenhum chamado".A terceira descoberta a ter em mente é que, sem paixão, a ousadia é “apenas uma dia-a-dia atarefado”.
Duckworth sempre enfatizou que existe aí outro componente vital que nos traz de volta à paixão – ao lado da persistência, ela diz que as pessoas corajosas também têm um “interesse maior” (outra maneira de descrever a existência de uma paixão ou de um chamado).Se você continua sem chegar a 

***

<p>&nbsp;</p>
<h1 style="text-align: center;"><strong><span lang="pt">CONCLUSION</span></strong></h1>
<p>&nbsp;</p><p>&nbsp;</p><p>&nbsp;</p>

This tutorial was written by Vinícius R. Lima, and the original can be found [here](https://medium.com/@viniljf/utilizando-processamento-de-linguagem-natural-para-criar-um-sumariza%C3%A7%C3%A3o-autom%C3%A1tica-de-textos-775cb428c84e).

As a challenge, modify the code to extract text from other sites, from txt files, and from other sources.
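
As a starting point for the challenge, the whole pipeline can be collected into one function that takes plain text from any source. This is a simplified sketch, not the tutorial's code: `summarize` is a hypothetical name, and naive regular expressions stand in for NLTK's `sent_tokenize` and `word_tokenize`, so it runs without NLTK but splits text less accurately.

```python
import re
from collections import Counter
from heapq import nlargest


def summarize(text, stop_words, n=4):
    """Return the n highest-scoring sentences of text, in original order."""
    # Naive splitting on ., ! or ? followed by whitespace (stand-in for NLTK).
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [part.strip() for part in parts if part.strip()]
    tokenized = [re.findall(r"\w+", sentence.lower()) for sentence in sentences]

    # Frequency of every word that is not a stopword.
    frequency = Counter(
        word for tokens in tokenized for word in tokens if word not in stop_words
    )

    # Score of a sentence = sum of the frequencies of its words.
    scores = {
        index: sum(frequency[word] for word in tokens)
        for index, tokens in enumerate(tokenized)
    }
    best = nlargest(n, scores, key=scores.get)
    return [sentences[index] for index in sorted(best)]


sample = "Cats like fish. Dogs bark. Cats and dogs like food. Birds sing."
print(summarize(sample, stop_words={"and"}, n=2))
# ['Cats like fish.', 'Cats and dogs like food.']
```

To summarize a txt file, read it with `open(path, encoding="utf-8").read()` and pass the result as `text`. The Portuguese stopword list from NLTK can be passed as `stop_words`.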

I'm using this as a basis for acquiring NLP (Natural Language Processing) and Web Scraping skills: with this template it is easy to extract text from web sites, or from any other source, and then create an automatic summary of it using Python libraries.

**References:**
- [NLTK - Natural Language Toolkit: documentation](http://www.nltk.org/)
- [Beautiful Soup: Python library for pulling data out of HTML and XML files.](https://www.crummy.com/software/BeautifulSoup/)

***

##### INSTALLED VERSIONS

In [71]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.1
pip: 18.1
setuptools: 40.4.3
Cython: 0.29.1
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.1.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


***