# 4.1 - NLP

NLP (natural language processing) deals with applications that understand human language: speech recognition, translation, semantic understanding, sentiment analysis, and so on.

**Uses**

+ Search engines
+ Social media feeds
+ Voice assistants
+ Spam filters
+ Chatbots

**Libraries**

+ NLTK
+ spaCy
+ TF-IDF (a weighting scheme rather than a library; used here through scikit-learn's `TfidfVectorizer`)
+ OpenNLP

NLP is difficult at several levels:

+ Ambiguity:

  * Lexical level: for example, words with several meanings
  * Referential level: anaphora, metaphors, etc.
  * Structural level: semantics is needed to understand the structure of a sentence
  * Pragmatic level: double meanings, irony, humor
  
+ Detecting word boundaries (spaces)
+ Imperfect input: accents, -isms, OCR

The process is similar to the one used in USL (unsupervised learning): first the words are vectorized, then the distances/similarities between the resulting vectors are measured.
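As a minimal sketch of these two steps, the example below builds bag-of-words count vectors for three short sentences and compares them with cosine similarity. It uses only the standard library; the sentences and the helper names `vectorize` and `cosine_similarity` are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Return a bag-of-words count vector for `text` over `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy documents (hypothetical, for illustration only)
docs = ["the mafia family runs the city",
        "the family leaves the city",
        "a ship sinks in the ocean"]

# Step 1: vectorize every document over a shared, sorted vocabulary
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [vectorize(doc, vocabulary) for doc in docs]

# Step 2: measure the similarity between the vectors
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The first two sentences share more words than the first and third, so `sim_01` comes out larger than `sim_02`.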

In [4]:
# list of 100 movies

with open('data/title_list.txt') as f:
    titles=f.read().split('\n')[:100]

titles[:10]

['The Godfather',
 'The Shawshank Redemption',
 "Schindler's List",
 'Raging Bull',
 'Casablanca',
 "One Flew Over the Cuckoo's Nest",
 'Gone with the Wind',
 'Citizen Kane',
 'The Wizard of Oz',
 'Titanic']

In [10]:
# synopses are separated by the marker '\n BREAKS HERE'
with open('data/synopses_list.txt') as f:
    synopsis=f.read().split('\n BREAKS HERE')[:100]

synopsis[0][:100]

" Plot  [edit]  [  [  edit  edit  ]  ]  \n  On the day of his only daughter's wedding, Vito Corleone h"

### Cleaning

In [11]:
#!pip install spacy

In [12]:
import string

import spacy

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

import re

In [None]:
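#!python -m spacy download en_core_web_sm

In [13]:
# the 'en' shortcut was removed in spaCy v3; load the model by its full name
nlp=spacy.load('en_core_web_sm')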

parser=English()

In [23]:
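def spacy_tokenizer(frase):
    """Tokenize `frase` and return lowercase, alphabetic, non-stop-word lemmas."""
    
    tokens=parser(frase)
    #print(help(tokens))   # we treat it like a string, but it is a spaCy object
    
    clean_tokens=[]
    
    for e in tokens:
        
        # NOTE: the blank English() pipeline has no trained lemmatizer, so lemma_ may
        # just echo the token text (or be empty in spaCy v3); nlp(frase) gives real lemmas
        lema=e.lemma_.lower().strip()
        
        # keep purely alphabetic tokens that are not stop words
        if lema not in STOP_WORDS and re.search(r'^[a-zA-Z]+$', lema):
            
            clean_tokens.append(lema)
            
    return clean_tokens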

In [25]:
spacy_tokenizer(synopsis[0][:200])

['plot',
 'edit',
 'edit',
 'edit',
 'day',
 'daughter',
 'wedding',
 'vito',
 'corleone',
 'hears',
 'requests',
 'role',
 'godfather',
 'don',
 'new',
 'york',
 'crime',
 'family',
 'vito',
 'youngest',
 'son']
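
The filtering logic of `spacy_tokenizer` can be illustrated without spaCy. The sketch below is a simplified stand-in under stated assumptions: it splits on whitespace instead of using spaCy's tokenizer, skips lemmatization, and uses a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) rather than spaCy's `STOP_WORDS`, so its output will not match the cell above exactly.

```python
import re

# Hypothetical, deliberately tiny stop-word set (spaCy's STOP_WORDS is much larger)
TOY_STOP_WORDS = {"the", "of", "his", "on", "only", "a", "to"}

ONLY_LETTERS = re.compile(r"^[a-zA-Z]+$")


def toy_tokenizer(frase):
    """Lowercase, then drop stop words and any token that is not purely alphabetic."""
    clean_tokens = []
    for raw in frase.split():
        token = raw.lower().strip()
        if token not in TOY_STOP_WORDS and ONLY_LETTERS.search(token):
            clean_tokens.append(token)
    return clean_tokens


result = toy_tokenizer("Plot [edit] On the day of his only daughter's wedding")
print(result)
```

Because this version splits on whitespace, `daughter's` is discarded as non-alphabetic, whereas spaCy separates the possessive and keeps `daughter`. That difference is one reason to prefer a real tokenizer.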

### TF-IDF (term frequency-inverse document frequency)

In [27]:
type(synopsis[0])

str

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
# min_df=0.15 drops terms that appear in fewer than 15% of the synopses
tfidf=TfidfVectorizer(min_df=0.15, tokenizer=spacy_tokenizer)

In [30]:
tfidf_matrix=tfidf.fit_transform(synopsis)

In [33]:
tfidf_matrix.shape, len(synopsis)

((100, 254), 100)
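
To make the weights in this matrix concrete, here is a small pure-Python sketch of TF-IDF as scikit-learn's `TfidfVectorizer` computes it with default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The toy corpus and the function name `tfidf_rows` are assumptions for illustration, and `min_df` filtering is not reproduced.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})

    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))

    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Toy, already-tokenized corpus (hypothetical)
corpus = [["family", "crime", "family"],
          ["family", "war"],
          ["ship", "ocean", "war"]]

vocabulary, rows = tfidf_rows(corpus)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:>8}: {weight:.3f}")
```

Each row corresponds to one document and each column to one term of the vocabulary, which is why the matrix above has one row per synopsis and 254 columns.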

In [57]:
(str(tfidf_matrix[0]).split('\n'))

['  (0, 110)\t0.0795498967156828',
 '  (0, 98)\t0.08264708598686148',
 '  (0, 111)\t0.0695613301538079',
 '  (0, 36)\t0.08796517530628192',
 '  (0, 216)\t0.07415434536439715',
 '  (0, 225)\t0.08608908178427178',
 '  (0, 195)\t0.07673467810212535',
 '  (0, 141)\t0.17217816356854357',
 '  (0, 170)\t0.08432025376158821',
 '  (0, 47)\t0.07811028526143413',
 '  (0, 193)\t0.0665149972056148',
 '  (0, 1)\t0.06463890368360466',
 '  (0, 137)\t0.04655051554995016',
 '  (0, 103)\t0.08432025376158821',
 '  (0, 251)\t0.06287007566092108',
 '  (0, 30)\t0.08608908178427178',
 '  (0, 250)\t0.07811028526143413',
 '  (0, 181)\t0.05461896756448502',
 '  (0, 151)\t0.24317928100601863',
 '  (0, 161)\t0.1412945368753922',
 '  (0, 54)\t0.20249248351645205',
 '  (0, 43)\t0.06119690788619435',
 '  (0, 41)\t0.11474308459843525',
 '  (0, 123)\t0.056660107160767005',
 '  (0, 29)\t0.07415434536439715',
 '  :\t:',
 '  (0, 140)\t0.0637418534667537',
 '  (0, 167)\t0.08432025376158821',
 '  (0, 35)\t0.0729406186033744

In [42]:
import pandas as pd

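# passing the sparse matrix directly gives one column holding each sparse row, not one column per term
df=pd.DataFrame(tfidf_matrix)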

In [43]:
df.head()

|   | 0 |
|---|---|
| 0 | (0, 110)\t0.0795498967156828\n (0, 98)\t0.0... |
| 1 | (0, 34)\t0.07983654988983906\n (0, 71)\t0.1... |
| 2 | (0, 188)\t0.08130989727604856\n (0, 68)\t0.... |
| 3 | (0, 133)\t0.07493861532072099\n (0, 55)\t0.... |
| 4 | (0, 213)\t0.08554981380443288\n (0, 52)\t0.... |


In [41]:
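# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms=tfidf.get_feature_names_out().tolist()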

terms[:15]

['able',
 'agrees',
 'air',
 'american',
 'apartment',
 'army',
 'arrive',
 'arrives',
 'asks',
 'attack',
 'attempt',
 'attempts',
 'attention',
 'away',
 'battle']

In [58]:
# I leave this one to you as a kata (exercise)

tfidf_matrix

<100x254 sparse matrix of type '<class 'numpy.float64'>'
	with 6489 stored elements in Compressed Sparse Row format>
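
The object above is a SciPy sparse matrix in Compressed Sparse Row (CSR) format, which stores only the non-zero entries (6,489 of the 100 × 254 = 25,400 cells here). With SciPy, `tfidf_matrix.toarray()` returns the dense equivalent, which can then be passed to `pd.DataFrame`. The sketch below illustrates the idea behind CSR in plain Python on a toy matrix; the function name `csr_to_dense` and the values are assumptions for illustration, not SciPy's implementation.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a list of dense rows.

    data    -- non-zero values, stored row by row
    indices -- column index of each value in `data`
    indptr  -- row i owns the slice data[indptr[i]:indptr[i + 1]]
    """
    dense = []
    for i in range(len(indptr) - 1):
        row = [0.0] * n_cols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


# Toy 3x4 matrix with four stored elements (hypothetical values)
data = [0.5, 0.8, 1.0, 0.3]
indices = [0, 2, 1, 3]
indptr = [0, 2, 3, 4]

dense = csr_to_dense(data, indices, indptr, n_cols=4)
for row in dense:
    print(row)
```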

### Distances

In [59]:
from sklearn.metrics.pairwise import cosine_similarity as cos

In [60]:
# cosine distance = 1 - cosine similarity, computed between every pair of synopses
distancias=1-cos(tfidf_matrix)

distancias.shape

(100, 100)

In [63]:
pd.DataFrame(distancias).head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.220446e-16 | 0.871403 | 0.8827002 | 0.7558619 | 0.831149 | 0.887473 | 0.696613 | 0.798515 | 0.792525 | 0.800604 | ... | 0.791397 | 0.859286 | 0.800776 | 0.872153 | 0.745884 | 0.704577 | 0.863281 | 0.756453 | 0.881292 | 0.872031 |
| 1 | 0.8714034 | 0.0 | 0.7720319 | 0.8770177 | 0.8232513 | 0.825881 | 0.82456 | 0.742223 | 0.88228 | 0.829316 | ... | 0.801477 | 0.840941 | 0.843457 | 0.868724 | 0.844129 | 0.879514 | 0.90814 | 0.809165 | 0.836998 | 0.892112 |
| 2 | 0.8827002 | 0.772032 | -2.220446e-16 | 0.8435603 | 0.7916849 | 0.873981 | 0.602106 | 0.778088 | 0.775477 | 0.906123 | ... | 0.855187 | 0.917054 | 0.829631 | 0.882148 | 0.87641 | 0.929238 | 0.869463 | 0.809234 | 0.853123 | 0.758212 |
| 3 | 0.7558619 | 0.877018 | 0.8435603 | -2.220446e-16 | 0.8302214 | 0.841258 | 0.712993 | 0.724218 | 0.840976 | 0.818933 | ... | 0.826246 | 0.943786 | 0.788355 | 0.761454 | 0.721442 | 0.866239 | 0.809433 | 0.82218 | 0.867124 | 0.872169 |
| 4 | 0.831149 | 0.823251 | 0.7916849 | 0.8302214 | 1.110223e-16 | 0.734452 | 0.697182 | 0.812257 | 0.787668 | 0.916966 | ... | 0.711169 | 0.807732 | 0.787696 | 0.818702 | 0.686726 | 0.721795 | 0.766586 | 0.706896 | 0.746433 | 0.80796 |
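
Each entry of this matrix is a cosine distance (1 minus the cosine similarity), so 0 means the two synopses point in the same direction and values near 1 mean they share little weighted vocabulary. The tiny values on the diagonal, such as -2.220446e-16, are floating-point rounding error around 0. As a sketch of how the matrix can be read, the snippet below copies the 5 × 5 top-left block from the output above, clamps negatives to 0, and finds each movie's closest neighbour among those five titles only; over the full 100 × 100 matrix the answers may differ. The helper name `nearest` is an assumption.

```python
titles = ['The Godfather', 'The Shawshank Redemption', "Schindler's List",
          'Raging Bull', 'Casablanca']

# Top-left 5x5 block of the distance matrix, copied from the displayed output
block = [
    [-2.220446e-16, 0.871403, 0.8827002, 0.7558619, 0.831149],
    [0.8714034, 0.0, 0.7720319, 0.8770177, 0.8232513],
    [0.8827002, 0.772032, -2.220446e-16, 0.8435603, 0.7916849],
    [0.7558619, 0.877018, 0.8435603, -2.220446e-16, 0.8302214],
    [0.831149, 0.823251, 0.7916849, 0.8302214, 1.110223e-16],
]

# Clamp rounding noise so that no distance is negative
block = [[max(0.0, d) for d in row] for row in block]


def nearest(i, matrix):
    """Index of the closest other item to item `i` (smallest distance)."""
    candidates = [j for j in range(len(matrix)) if j != i]
    return min(candidates, key=lambda j: matrix[i][j])


for i, title in enumerate(titles):
    print(f"{title} -> {titles[nearest(i, block)]}")
```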


### Clustering

### Cluster titles

## NLP_es 

##### Similarity

# WordClouds

#### Mask

### Full example

## NER

### Transformers (text generation)