# Part 1: Preprocessing

- `[18 pts]` Read the file `Princesas.csv` with `pandas` and create a new column containing the text in lowercase, without special characters or digits, without stopwords, and with every word stemmed.

- Import libraries

In [33]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import (
    cosine_distances,
    cosine_similarity,
    euclidean_distances,
)

# Spanish stopword list (requires the NLTK "stopwords" corpus to be downloaded)
stopwords_sp = stopwords.words('spanish')

# Snowball stemmer configured for Spanish
spanishStemmer = SnowballStemmer("spanish")

**1.1 Reading the file:**

In [2]:
# read the CSV file
princesas = pd.read_csv('Princesas.csv', sep=',')
princesas.head()

,Princesa,Personalidad
0,Blancanieves,Blancanieves es una princesa de noble cuna que...
1,Cenicienta,Cenicienta es inicialmente una sirvienta en su...
2,Aurora,"La Princesa Aurora, la Bella Durmiente, es la ..."
3,Bella,Bella es una muchacha que vive en la campiña f...
4,Jasmín,"Cuando se introdujo por primera vez, la Prince..."


In [3]:
princesas.shape

(10, 2)

**1.2 Lowercase text without special characters, digits, or stopwords, and with stemming:**

- Function definitions:

In [4]:
def quitar_tildes(s):
    """Replace lowercase acute-accented vowels with their plain form (ñ and ü are left untouched)."""
    tildes = (
        ("á", "a"),
        ("é", "e"),
        ("í", "i"),
        ("ó", "o"),
        ("ú", "u"),
    )
    for origen, destino in tildes:
        s = s.replace(origen, destino)
    return s
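
The replacement table above only covers the five lowercase acute-accented vowels. As a minimal sketch of a more general variant, the standard-library `unicodedata` module can decompose each character and drop only the combining acute accent and diaeresis, which keeps `ñ` intact. The function name `strip_accents_keep_enye` is hypothetical and is not part of the original notebook.

```python
import unicodedata

# Combining marks to drop: acute accent (U+0301) and diaeresis (U+0308).
# The combining tilde (U+0303) is kept so that "ñ" survives.
MARKS_TO_DROP = {"\u0301", "\u0308"}

def strip_accents_keep_enye(text):
    """Remove acute accents and diaereses from text, preserving ñ/Ñ."""
    decomposed = unicodedata.normalize("NFD", text)  # split letters from their marks
    kept = "".join(ch for ch in decomposed if ch not in MARKS_TO_DROP)
    return unicodedata.normalize("NFC", kept)  # recompose n + tilde into ñ

print(strip_accents_keep_enye("campiña única pingüino MÉRIDA"))
```

Unlike `quitar_tildes`, this version also handles uppercase letters and `ü`, so it does not depend on the text having been lowercased first.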

In [11]:
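def Preprocesamiento(s):
    """Lowercase, strip non-letters and accents, drop stopwords, and stem each word."""
    texto_min = s.lower()  # lowercase
    texto_l = re.sub(r"[\W\d_]+", " ", texto_min)  # replace special characters, digits and underscores with spaces
    texto_sint = quitar_tildes(texto_l)  # remove accent marks
    texto_t = texto_sint.split()  # tokenize on whitespace
    # Note: accents are removed before this filter, so accented entries in
    # stopwords_sp (e.g. 'también') no longer match and those words are kept
    texto_stopW = [palabra for palabra in texto_t if palabra not in stopwords_sp]  # drop stopwords
    texto_stem = [spanishStemmer.stem(palabra) for palabra in texto_stopW]  # stemming
    texto_procesado = " ".join(texto_stem)

    return texto_procesado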
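
The order of the steps matters: `Preprocesamiento` strips accents before filtering stopwords, so a stopword that is spelled with an accent in the list stops matching once its accent is gone. The sketch below illustrates the effect with a tiny hand-written stopword set and a made-up sentence. Both are assumptions for illustration only; the notebook itself uses the NLTK list, and the helper names here are hypothetical.

```python
import re

# Hypothetical mini stopword set, standing in for the NLTK Spanish list
STOPWORDS = {"también", "más", "es", "que"}

# Same five replacements as quitar_tildes
ACCENT_TABLE = str.maketrans("áéíóú", "aeiou")

def tokens(text):
    """Lowercase, replace non-letters with spaces, and split on whitespace."""
    return re.sub(r"[\W\d_]+", " ", text.lower()).split()

def accents_first(text):
    """Order used in Preprocesamiento: strip accents, then filter stopwords."""
    words = [w.translate(ACCENT_TABLE) for w in tokens(text)]
    return [w for w in words if w not in STOPWORDS]

def stopwords_first(text):
    """Alternative order: filter stopwords, then strip accents."""
    words = [w for w in tokens(text) if w not in STOPWORDS]
    return [w.translate(ACCENT_TABLE) for w in words]

sample = "Bella también es más curiosa que tímida"  # made-up sentence
print(accents_first(sample))    # 'tambien' and 'mas' survive the filter
print(stopwords_first(sample))  # only content words remain
```

Switching the order in the notebook would change the vocabulary, and therefore the TF-IDF values and distances reported later, so the results shown here correspond to the accents-first order.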

- Create a new column with the preprocessed text:

In [12]:
princesas['pre-procesado'] = princesas['Personalidad'].apply(Preprocesamiento)

In [13]:
princesas

,Princesa,Personalidad,pre-procesado
0,Blancanieves,Blancanieves es una princesa de noble cuna que...,blancaniev princes nobl cun ve forz servidumbr...
1,Cenicienta,Cenicienta es inicialmente una sirvienta en su...,cenicient inicial sirvient cas constant objet ...
2,Aurora,"La Princesa Aurora, la Bella Durmiente, es la ...",princes auror bell durmient hij unic rein flor...
3,Bella,Bella es una muchacha que vive en la campiña f...,bell muchach viv campiñ frances padr inventor ...
4,Jasmín,"Cuando se introdujo por primera vez, la Prince...",introduj primer vez princes jasmin poc dias de...
5,Pocahontas,"El nombre de Pocahontas significa ""Pequeña Sil...",nombr pocahont signif pequeñ silenci bas figur...
6,Mulan,Mulan es atípica a los anteriores papeles feme...,mul atip anterior papel femenin pelicul disney...
7,Tiana,Es una joven camarera que sueña con ser dueña ...,jov camarer sueñ ser dueñ propi restaur algun ...
8,Mérida,Mérida llama la atención por su característico...,mer llam atencion caracterist pel anaranj oscu...
9,Moana,"Moana, una joven de 16 años de edad, hija únic...",moan jov años edad hij unic sucesor import jef...


# Part 2: TF-IDF

- `[16 pts]` Build the TF-IDF matrix

**2.1 TF-IDF**

In [25]:
tfidf_vect = TfidfVectorizer()

tfidf = tfidf_vect.fit_transform(princesas['pre-procesado'].values)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_matrix = pd.DataFrame(tfidf.toarray(), index=princesas['Princesa'].values, columns=tfidf_vect.get_feature_names_out())

tfidf_matrix.T.round(3)

Term,Blancanieves,Cenicienta,Aurora,Bella,Jasmín,Pocahontas,Mulan,Tiana,Mérida,Moana
abrum,0.000,0.0,0.065,0.000,0.000,0.000,0.0,0.000,0.00,0.000
abuel,0.000,0.0,0.000,0.000,0.000,0.000,0.0,0.000,0.00,0.102
acuerd,0.000,0.0,0.055,0.000,0.000,0.000,0.0,0.000,0.08,0.000
adem,0.000,0.0,0.000,0.000,0.000,0.097,0.0,0.000,0.00,0.000
afortun,0.087,0.0,0.000,0.000,0.000,0.000,0.0,0.000,0.00,0.000
...,...,...,...,...,...,...,...,...,...,...
viv,0.000,0.0,0.000,0.073,0.000,0.000,0.0,0.000,0.00,0.087
volunt,0.000,0.0,0.000,0.000,0.163,0.000,0.0,0.000,0.00,0.000
volv,0.000,0.0,0.065,0.000,0.000,0.000,0.0,0.000,0.00,0.000
vudu,0.000,0.0,0.000,0.000,0.000,0.000,0.0,0.126,0.00,0.000
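
To make the weights in the matrix easier to interpret, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function `tfidf_rows` is a hypothetical helper, and it tokenizes on whitespace only, whereas the vectorizer's default token pattern also discards single-character tokens. The three toy documents are built from stems that appear in the preprocessed column.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ["princes nobl cun", "princes auror bell", "bell muchach viv"]
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first toy document, `princes` appears in two of the three documents and therefore gets a lower weight than `nobl` or `cun`, which appear in only one. The same effect explains why stems shared by several princesses carry small values in the matrix above.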


# Part 3: Cosine distance
- `[12 pts]` Compute the cosine distance between every pair of princesses
- `[2 pts]` Which princesses are the most similar?
- `[2 pts]` Which princesses are the most different?

**3.1 Computing the cosine distance**

In [36]:
dist_cos = cosine_distances(tfidf_matrix.values)
dist_cos = pd.DataFrame(dist_cos, columns = princesas['Princesa'].values, index = princesas['Princesa'].values)
dist_cos

Princess,Blancanieves,Cenicienta,Aurora,Bella,Jasmín,Pocahontas,Mulan,Tiana,Mérida,Moana
Blancanieves,0.0,0.846938,0.841187,0.921131,0.93453,0.920278,0.871945,0.961197,0.985196,0.958144
Cenicienta,0.846938,0.0,0.855609,0.943221,0.95842,0.956574,0.933945,0.918273,0.970544,0.974853
Aurora,0.841187,0.855609,0.0,0.816663,0.937056,0.865628,0.91084,0.919058,0.938931,0.961817
Bella,0.921131,0.943221,0.816663,0.0,0.888784,0.865305,0.879418,0.963203,0.947451,0.907177
Jasmín,0.93453,0.95842,0.937056,0.888784,0.0,0.94359,0.936092,0.935021,0.972123,0.968945
Pocahontas,0.920278,0.956574,0.865628,0.865305,0.94359,0.0,0.877391,0.962937,0.96389,0.917242
Mulan,0.871945,0.933945,0.91084,0.879418,0.936092,0.877391,0.0,0.980113,0.98756,0.947563
Tiana,0.961197,0.918273,0.919058,0.963203,0.935021,0.962937,0.980113,0.0,0.978574,0.97078
Mérida,0.985196,0.970544,0.938931,0.947451,0.972123,0.96389,0.98756,0.978574,0.0,0.965109
Moana,0.958144,0.974853,0.961817,0.907177,0.968945,0.917242,0.947563,0.97078,0.965109,0.0
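
For reference, the cosine distance between two vectors is one minus their cosine similarity, $d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. Because TF-IDF weights are non-negative, every value in the table lies between 0 (same direction) and 1 (no shared terms). The following is a minimal pure-Python sketch of that definition; the helper name `cosine_distance` is hypothetical and the function assumes non-zero vectors of equal length.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy term-count vectors over a three-term vocabulary
shares_one_term = cosine_distance([1, 0, 1], [1, 1, 0])  # roughly 0.5
shares_no_terms = cosine_distance([1, 0, 1], [0, 1, 0])  # no overlap
print(shares_one_term, shares_no_terms)
```

The values in the table are all above 0.8, which suggests that the personality descriptions share relatively little vocabulary once stopwords are removed.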


**3.2 Which princesses are the most similar?**

- Bella and Aurora, with a cosine distance of $0.816663$ (the smallest off-diagonal value)
- Aurora and Blancanieves, with $0.841187$

Personalities of Aurora and Bella:

In [76]:
print("Aurora: \n\n",
      princesas.set_index('Princesa').loc['Aurora','Personalidad'],
      "\nBella: \n\n",
      princesas.set_index('Princesa').loc['Bella','Personalidad'])

Aurora: 

 La Princesa Aurora, la Bella Durmiente, es la hija única de la Reina Flor y el Rey Estéfano. La Princesa Aurora, la Bella Durmiente, es retratada como una chica amable, juguetona, tímida, gentil, y bastante ingenua, que ama a los animales. Su rasgo de personalidad más destacado es su pasión por el amor y es vista como una romántica empedernida. Las hadas buenas del reino la han bendecido con la belleza y el don del canto. Ella se siente sola la mayor parte de la película, ya que está aislada del castillo de su padre dónde ha nacido en una cabaña en el bosque llamada la Cabaña del Leñador con sus tres hadas buenas y madrinas Flora, Fauna y Primavera.
En La Bella Durmiente, cuando llega de recoger las fresas, el hada Flora revela su nombre largo: Rosabelle.
Amante de las cosas simples de la vida, Aurora a menudo se pregunta por qué sus tías la tratan como a una niña. Ella desea conocer gente nueva, conocer nuevos lugares y tomar sus propias decisiones.
Pero aun así, ya que es 

**3.3 Which princesses are the most different?**

- Mérida and Mulan, with a cosine distance of $0.987560$ (the largest value in the matrix)
- Mérida and Blancanieves, with $0.985196$
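
The answers to 3.2 and 3.3 can also be obtained programmatically instead of by inspecting the table. The sketch below is one possible approach using only the standard library: it lists every unordered pair once, sorts the pairs by distance, and reads the extremes from both ends. To keep it short it uses a five-princess subset of the distance table, with values copied from it; the variable names are hypothetical.

```python
from itertools import combinations

names = ["Blancanieves", "Aurora", "Bella", "Mulan", "Mérida"]

# Subset of the cosine-distance table (rows and columns in the order of names)
dist = [
    [0.0,      0.841187, 0.921131, 0.871945, 0.985196],
    [0.841187, 0.0,      0.816663, 0.91084,  0.938931],
    [0.921131, 0.816663, 0.0,      0.879418, 0.947451],
    [0.871945, 0.91084,  0.879418, 0.0,      0.98756],
    [0.985196, 0.938931, 0.947451, 0.98756,  0.0],
]

# Each unordered pair appears once, so the zero diagonal is never considered
pairs = sorted(
    (dist[i][j], names[i], names[j])
    for i, j in combinations(range(len(names)), 2)
)

closest, farthest = pairs[0], pairs[-1]
print("Most similar:  ", closest)
print("Most different:", farthest)
```

On this subset the extremes match the pairs reported above: Aurora and Bella are the closest, and Mulan and Mérida are the farthest apart.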