---
# Natural Language Processing (NLP)
![](https://www.encora.com/hs-fs/hubfs/Blog-EncoraML.png?width=800&name=Blog-EncoraML.png)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then extract information and insights contained in the documents, as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Natural language processing, which overlaps heavily with computational linguistics, is behind much of today's state-of-the-art technology, and the field has been developing since the 1960s. The task of NLP is to understand natural human utterances, whether speech or text, by taking them as input and producing an appropriate response or output. Text mining, also called text analytics, uses natural language processing to transform an unstructured corpus into standardised, normalised documents or databases for further analysis with artificial intelligence techniques and machine learning algorithms.

## Common NLP tasks
The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks.

Though natural language processing tasks are closely intertwined, they can be subdivided into categories for convenience. A coarse division is given below.
* Text and speech processing
* Morphological analysis 
* Syntactic analysis
* Lexical semantics 
* Relational semantics (semantics of individual sentences)
* Discourse (semantics beyond individual sentences)
* Higher-level NLP applications

---
# PART 1 - MANIPULATING STRINGS

**String:** In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally considered as a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding.
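
To make the difference between characters and their encoded bytes concrete, here is a minimal sketch in Python. The variable names are illustrative and not part of the notebook. It encodes a short string as UTF-8 and compares the two lengths, then shows that a Python `str` is immutable (fixed after creation), so methods return new strings.

```python
# Illustrative sketch: a Python str is a sequence of Unicode characters,
# while its encoded form is a sequence of bytes.
text = "ol\u00e1"  # the word "olá", written with an escape for the accented letter

encoded = text.encode("utf-8")     # bytes object; the accented letter takes two bytes
decoded = encoded.decode("utf-8")  # back to the original str

print(len(text))      # number of characters
print(len(encoded))   # number of bytes
print(list(text))     # the individual characters

# str is immutable: upper() returns a new string and leaves text untouched
shouted = text.upper()
print(text, shouted)
```

The character count and the byte count differ because UTF-8 is a variable-width encoding, which matters when measuring text length in non-English corpora.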

## 1.1. Python String Methods

In [1]:
from notebook.services.config import ConfigManager
cm = ConfigManager().update('notebook', {'limit_output': 50})

In [2]:
# Print a sentence
hello_world="olá mundo , meu nome é Felipe meu sobrenome é Oliveira"
print(hello_world)

olá mundo , meu nome é Felipe meu sobrenome é Oliveira


In [3]:
#>>BASIC
# capitalize()  - Converts the first character to upper case
# title()       - Converts the first character of each word to upper case
# upper()       - Converts a string into upper case
# lower()       - Converts a string into lower case

#>> USEFUL FOR NLP
# count()   - Returns the number of times a specified value occurs in a string
# replace() - Returns a new string where a specified value is replaced with another value
# split()   - Splits the string at the specified separator (whitespace by default) and returns a list

print("Original:                    "    + hello_world)
print("Capitalize:                  "  + hello_world.capitalize())
print("Title:                       "  + hello_world.title())
print("Upper:                       "  + hello_world.upper())
print("Lower:                       "  + hello_world.lower())

print("Count   (meu):               " +  str(hello_world.count("meu") ))
print("Replace (Oliveira<>Ramos):   " +   hello_world.replace('Oliveira','Ramos'))
print("Split:                       " +   str(hello_world.split()))
print("Split (Words Total):         " +   str(len(hello_world.split())))



Original:                    olá mundo , meu nome é Felipe meu sobrenome é Oliveira
Capitalize:                  Olá mundo , meu nome é felipe meu sobrenome é oliveira
Title:                       Olá Mundo , Meu Nome É Felipe Meu Sobrenome É Oliveira
Upper:                       OLÁ MUNDO , MEU NOME É FELIPE MEU SOBRENOME É OLIVEIRA
Lower:                       olá mundo , meu nome é felipe meu sobrenome é oliveira
Count   (meu):               2
Replace (Oliveira<>Ramos):   olá mundo , meu nome é Felipe meu sobrenome é Ramos
Split:                       ['olá', 'mundo', ',', 'meu', 'nome', 'é', 'Felipe', 'meu', 'sobrenome', 'é', 'Oliveira']
Split (Words Total):         11
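
Note that `split()` treats the standalone comma as a word, which is why the total above is 11. One possible way to skip punctuation, sketched here with a hypothetical helper named `tokenize_words` (not part of the notebook), is to extract runs of word characters with the standard `re` module.

```python
import re

# Same sample sentence as in the notebook
hello_world = "olá mundo , meu nome é Felipe meu sobrenome é Oliveira"

def tokenize_words(sentence):
    """Return runs of word characters, skipping punctuation (hypothetical helper)."""
    # For str patterns in Python 3, \w also matches accented letters
    return re.findall(r"\w+", sentence)

tokens = tokenize_words(hello_world)
print(tokens)
print(len(tokens))  # one fewer than len(hello_world.split()): the comma is gone
```

This is only a rough tokenizer: `\w` also matches digits and underscores, and it splits hyphenated or apostrophized words, so a real project would likely rely on a dedicated tokenizer.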


---
## Test 01
Create a function that counts how many times each word occurs in the sentence.

In [4]:
# Function: count whole-word occurrences
# (str.count would also match substrings, e.g. "nome" inside "sobrenome")
def count_words(sentence):
    words = sentence.split()
    for word in words:
        print(f"Count ({word}):{words.count(word)}")
# Test
count_words(hello_world)

Count (olá):1
Count (mundo):1
Count (,):1
Count (meu):2
Count (nome):1
Count (é):2
Count (Felipe):1
Count (meu):2
Count (sobrenome):1
Count (é):2
Count (Oliveira):1


In [5]:
# Alternative
from collections import Counter
Counter(hello_world.split())

Counter({'olá': 1,
         'mundo': 1,
         ',': 1,
         'meu': 2,
         'nome': 1,
         'é': 2,
         'Felipe': 1,
         'sobrenome': 1,
         'Oliveira': 1})

---
## 1.2. NLP Hello World and Collections Methods

In [6]:
# Import the package
from collections import Counter

In [13]:
# Read the file with the text (the NLP "Hello World")
with open('rap_lord.txt', 'r', encoding="utf8") as f:
    rap_lord = f.read()
print(rap_lord)

Lutei pra entrar e não vou sair Os que não pertencem eu devolvi Ácido no metal causa efeito letal Teto baixo te espreme, respira Quem pira tá na mira da minha firma Então me espera recupera o fôlego Se comigo não morre nunca cai Não tento a sorte, Woodstock num flow metódico Toma é pra quem quer dou é pra quem pode E nosso destino é uma caixa de surpresa Leopardo ou zebra me diz 'Cê quer ser predador ou presa (é assim ô) Percorri pela beira da terra Até a sorte me dizer, menino você tem uma aval No tempo essência eu elevo no peito O excesso essencial É muito bom não se acomodar Satisfação se o verso ecoar Vendo em polpa não vou me poupar Então demorou meu mano, let's go Quero que se foda o que disser 'Tô de pé, vou mantendo a fé até Meu mano vou correndo igual ralé Adivinha o que tu quer, vagabundo quer Mas e quem não quer, né? Quero ver dinheiro na responsa Ser amigo da onça, jacaré que banca vira bolsa Mano, então me mostre a cara Convivência com malandro que já foi da costa Fala pra

In [8]:
# Count characters (case-insensitive; spaces and punctuation are included)
Counter(rap_lord.lower()).most_common(10)

[(' ', 791),
 ('a', 498),
 ('e', 454),
 ('o', 400),
 ('n', 220),
 ('r', 216),
 ('i', 202),
 ('m', 198),
 ('u', 191),
 ('s', 186)]

In [9]:
# Count words
Counter(rap_lord.lower().split()).most_common(10)

[('não', 42),
 ('que', 40),
 ('o', 26),
 ('de', 19),
 ('se', 18),
 ('é', 18),
 ('e', 17),
 ('eu', 16),
 ('na', 16),
 ('a', 14)]
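
The most frequent words above are almost all short function words. A common next step is to drop such stop words before counting. The sketch below is only an illustration: the `STOP_WORDS` set is hand-picked for this example rather than a standard list, the helper name `content_word_counts` is hypothetical, and a short sample string stands in for `rap_lord.txt`.

```python
from collections import Counter

# Short sample standing in for rap_lord.txt (first words of the printed text)
sample = "Lutei pra entrar e não vou sair Os que não pertencem eu devolvi"

# Hand-picked, illustrative stop-word set (an assumption, not a standard list)
STOP_WORDS = {"a", "o", "os", "e", "é", "de", "se", "na", "eu", "que", "não"}

def content_word_counts(text, stop_words):
    """Count lowercased, whitespace-separated words that are not stop words."""
    words = text.lower().split()
    return Counter(word for word in words if word not in stop_words)

counts = content_word_counts(sample, STOP_WORDS)
print(counts.most_common(5))
```

Libraries such as NLTK and spaCy ship curated stop-word lists for many languages, which would usually be preferable to a hand-written set.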

---
## Test 02
Create a function that computes the percentage of each letter in the sentence.

In [10]:
def freq_letters(sentence, n_top):
    # Count every character once (case-insensitive; spaces and punctuation included)
    letter_counts = Counter(sentence.lower())
    letters_total = sum(letter_counts.values())

    for letter, count in letter_counts.most_common(n_top):
        print(f"{letter} => {round((count / letters_total) * 100, 2)}%")

freq_letters(rap_lord, 10)

  => 16.69%
a => 10.51%
e => 9.58%
o => 8.44%
n => 4.64%
r => 4.56%
i => 4.26%
m => 4.18%
u => 4.03%
s => 3.92%
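
In the output above, the space character is the most frequent "letter" because every character is counted. If only alphabetic characters are of interest, one option is to filter with `str.isalpha()` before counting. This is a sketch with a hypothetical function name, `letter_frequencies`, that returns the percentages instead of printing them and uses a short sample instead of the full file.

```python
from collections import Counter

def letter_frequencies(text, n_top):
    """Return (letter, percentage) pairs, counting alphabetic characters only."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return []  # avoid dividing by zero on empty or letter-free input
    return [(letter, round(count / total * 100, 2))
            for letter, count in Counter(letters).most_common(n_top)]

# Short sample standing in for rap_lord.txt
sample = "Lutei pra entrar e não vou sair"
for letter, pct in letter_frequencies(sample, 3):
    print(f"{letter} => {pct}%")
```

Because spaces and punctuation are excluded from the total, the percentages are higher than those printed by `freq_letters` for the same text.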


---
## Test 03
Create a function that computes the percentage of each word in the sentence.

In [11]:
def freq_words(sentence, n_top):
    # Count whitespace-separated words once (case-insensitive)
    word_counts = Counter(sentence.lower().split())
    word_total = sum(word_counts.values())

    for word, count in word_counts.most_common(n_top):
        print(f"{word} => {round((count / word_total) * 100, 2)}%")

freq_words(rap_lord, 10)

não => 4.54%
que => 4.32%
o => 2.81%
de => 2.05%
se => 1.94%
é => 1.94%
e => 1.84%
eu => 1.73%
na => 1.73%
a => 1.51%
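
The letter and word versions differ only in what is being counted, so the two could be folded into one helper that accepts any iterable. The following is a sketch under that assumption. `relative_frequencies` is a hypothetical name, and the sample string is made up for illustration.

```python
from collections import Counter

def relative_frequencies(items, n_top):
    """Return the n_top most common items with their share of the total, in percent."""
    counts = Counter(items)
    total = sum(counts.values())
    return [(item, round(count / total * 100, 2))
            for item, count in counts.most_common(n_top)]

sample = "não sei se não vou"
print(relative_frequencies(sample, 3))          # iterating a str yields characters
print(relative_frequencies(sample.split(), 3))  # a list of words
```

Returning the pairs instead of printing them keeps the function reusable, for example for sorting, plotting, or comparing two texts.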
