# Adevinta: Text similarity

Another way of finding similar properties in the dataset is to apply text similarity techniques, which compute how close two texts are in meaning.

**What is text similarity?**

Text similarity determines how "close" two pieces of text are, both in surface form (**lexical similarity**) and in meaning (**semantic similarity**).

On the surface, if you consider only word-level similarity, these two phrases appear very similar, as 3 of the 4 unique words overlap exactly. A word-level comparison typically does not take into account the actual meaning behind the words or the entire phrase in context.

Instead of doing a word-for-word comparison, we also need to pay attention to context in order to capture more of the semantics. To consider semantic similarity, we need to work at the phrase or paragraph level (or lexical chain level), where a piece of text is broken into groups of related words before computing similarity. Although the words overlap significantly, these two phrases actually have different meanings.


Every sentence has a dependency structure: *mouse* is the object of *ate* in the first case, and *food* is the object of *ate* in the second.

Since differences in word order often go hand in hand with differences in meaning (compare *the dog bites the man* with *the man bites the dog*), we'd like our sentence embeddings to be sensitive to this variation.
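
To make the limits of lexical similarity concrete, here is a minimal sketch that measures word overlap with the Jaccard index on the two example sentences quoted above. The helper names `tokens` and `jaccard` are our own and are not part of the notebook.

```python
def tokens(text):
    """Lowercase a sentence and split it on whitespace."""
    return text.lower().split()

def jaccard(a, b):
    """Share of unique words that two sentences have in common."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    return len(set_a & set_b) / len(set_a | set_b)

first = "the dog bites the man"
second = "the man bites the dog"

overlap = jaccard(first, second)               # both sentences use the same words
same_order = tokens(first) == tokens(second)   # but not in the same order
print(overlap, same_order)
```

The overlap is perfect even though the sentences mean different things, which is why a purely lexical measure cannot separate them.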

In [21]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [22]:
%matplotlib inline

In [23]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### Adevinta: Dataset loading and description

In [24]:
df_fotocasa = pd.read_csv("./data/problem_data_reduced.csv",sep="|")

In [25]:
df_fotocasa.head()

Unnamed: 0,idproperty,province,municipality,surface,rooms,baths,property_type,property_subtype,transacion_type,price,description
0,qkgdhixsul,Girona,Castell-Platja d'Aro,60,2,1,Vivienda,Apartamento,Sell,178000.0,"apartamento de 60 m2, dsistribuido en cocina i..."
1,swigwvclxz,Barcelona,Vilanova i la Geltrú,197,4,2,Vivienda,Casa-Chalet,Sell,345000.0,VILANOVA I LA GELTRULes presentamos esta casa ...
2,bfvgsrcdoj,Lleida,Fondarella,375,5,3,Vivienda,Casa-Chalet,Sell,180000.0,
3,tsracvmevc,Girona,Girona Capital,89,4,2,Vivienda,Piso,Sell,187000.0,"Pis de 89m2, menjador de 23m2, cuina office de..."
4,biayppbmen,Barcelona,Manresa,180,6,1,Vivienda,Piso,Sell,350000.0,"MANRESA, piso de 6 habitaciones muy amplias to..."


In [53]:
print("\nObservations: {}, Features: {}\n".format(df_fotocasa.shape[0], df_fotocasa.shape[1]-1))


Observations: 1208, Features: 10



### Adevinta: Dataset missing values and duplicates

For the purpose of this exercise we had to eliminate duplicates so that each property is represented by a single instance. A more detailed look at the different instances could yield better results.

In [54]:
df_fotocasa_text = df_fotocasa[["idproperty","description"]].sort_values(by='idproperty').copy()
df_fotocasa_text = df_fotocasa_text.groupby("idproperty").first().reset_index()
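
A plain-Python sketch of what this deduplication amounts to, assuming that `groupby(...).first()` keeps the first non-null value per group. The ids, descriptions and the helper `first_description_per_id` below are made up for illustration.

```python
def first_description_per_id(rows):
    """Keep one description per id: the first non-empty one seen."""
    kept = {}
    for prop_id, description in rows:
        if prop_id not in kept or kept[prop_id] is None:
            kept[prop_id] = description
    return kept

# Hypothetical (id, description) pairs with repeated ids and a missing value.
rows = [
    ("id_b", None),
    ("id_a", "casa con jardin"),
    ("id_b", "piso reformado"),
    ("id_a", "casa con jardin y piscina"),
]
deduplicated = first_description_per_id(rows)
print(sorted(deduplicated.items()))
```

Under this assumption, a property whose first listing has no description still ends up with text if any later duplicate provides one.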

In [55]:
df_fotocasa_text.head()

Unnamed: 0,idproperty,description
0,abkvpehvdk,"SH194.- Masia rural en Olivella, muy cerca de ..."
1,abrrpeggwd,"VENDO MASÍA - NEGOCIO TURISMO RURAL, en Pineda..."
2,adqphxnrhg,Lapos;Estartit - Els Griells: A 50 metros de l...
3,adsvpzczjm,Casa de aspecto rústico totalmente reformada (...
4,aemkznotwk,"situada en zona muy tranquila, agua de pozo ,..."


In [56]:
print("\nAfter eliminating duplicates we end up with {} observations\n" \
      .format(df_fotocasa_text.shape[0]))


After eliminating duplicates we end up with 904 observations



### Adevinta: Text normalization: Regular expressions, Tokenization and Stemming

Text normalization is the process of transforming text into a single canonical form that it might not have had before. It requires being aware of what type of text is to be normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.

Text normalization performed by regular expressions:

- Remove special characters
- Remove single characters
- Remove multiple spaces
- Remove punctuation
- Lowercase all words (loses semantics in some cases)
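
The regular-expression steps listed above can be sketched on a single string as follows. This is a simplified illustration rather than the class used in the notebook: the function name `normalize`, the sample sentence and the final `strip()` are our own additions.

```python
import re

def normalize(text):
    """Apply the regular-expression steps one after another."""
    text = re.sub(r"\W", " ", text)               # special characters and punctuation -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)   # single letters standing alone
    text = re.sub(r"\s+", " ", text)              # runs of whitespace -> one space
    return text.strip().lower()                   # trim (our addition) and lowercase

sample = "Casa A 50 metros de la playa,  MUY luminosa!"
cleaned = normalize(sample)
print(cleaned)
```

In Python 3, `\W` is Unicode-aware for `str` patterns, so accented letters such as `í` count as word characters and are kept.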

Additional text normalization performed by tokenization and stemming:

- Remove stopwords
- Change blacklisted words (not done)
- Create stems for the words (Snowball stemmer)

Other, more complex techniques could be used in this section (not the point of this exercise, but potentially useful for future work):

- Spelling correction
- Statistical machine translation
- Automatic speech recognition

In [57]:
import string, re

class TextProcessing:

    def remove_punctuation(self, s):
        translate_table = dict(
            (ord(char), None) for char in string.punctuation
        )
        return s.translate(translate_table)

    def remove_special_characters(self, s):
        if pd.isnull(s):
            return ''
        return re.sub(r'\W', ' ', s)

    def remove_single_characters(self, s):
        # Drop single letters surrounded by whitespace, then one at the very start.
        t = re.sub(r'\s+[a-zA-Z]\s+', ' ', s)
        return re.sub(r'^[a-zA-Z]\s+', '', t)

    def multiple_spaces_to_single_spaces(self, s):
        return re.sub(r'\s+', ' ', s)

    def remove_prefixed_b(self, s):
        return re.sub(r'^b\s+', '', s)

    def convert_to_lowercase(self, s):
        return s.lower()
    
    def fit(self):
        pass
    
    def transform(self, df, column, new_column):
        df[new_column] = df[column].apply(
            lambda x: self.remove_special_characters(x))
        df[new_column] = df[new_column].apply(
            lambda x: self.remove_single_characters(x))
        df[new_column] = df[new_column].apply(
            lambda x: self.multiple_spaces_to_single_spaces(x))
        df[new_column] = df[new_column].apply(
            lambda x: self.remove_prefixed_b(x))
        df[new_column] = df[new_column].apply(
            lambda x: self.convert_to_lowercase(x))
        return df

In [58]:
tp = TextProcessing()

In [59]:
df_fotocasa_text_trans = tp.transform(
    df=df_fotocasa_text.copy(), 
    column='description', 
    new_column='description_clean')

df_fotocasa_text_trans.head()

Unnamed: 0,idproperty,description,description_clean
0,abkvpehvdk,"SH194.- Masia rural en Olivella, muy cerca de ...",sh194 masia rural en olivella muy cerca de pla...
1,abrrpeggwd,"VENDO MASÍA - NEGOCIO TURISMO RURAL, en Pineda...",vendo masía negocio turismo rural en pineda de...
2,adqphxnrhg,Lapos;Estartit - Els Griells: A 50 metros de l...,lapos estartit els griells 50 metros de la pla...
3,adsvpzczjm,Casa de aspecto rústico totalmente reformada (...,casa de aspecto rústico totalmente reformada a...
4,aemkznotwk,"situada en zona muy tranquila, agua de pozo ,...",situada en zona muy tranquila agua de pozo ins...


In [60]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

class TextStemmer:
    
    def __init__(self, language):
        self.stopwords = stopwords.words(language)
        self.stemmer = SnowballStemmer(language=language)
    
    def fit(self):
        pass
    
    def transform(self, df, column, new_column):
        df_ = df.copy()
        df_[new_column] = df_[column].apply(
            lambda x: [
                self.stemmer.stem(word) for word in x.split()
                if word not in self.stopwords
            ])
        return df_

In [61]:
ts = TextStemmer(language='spanish')

In [62]:
df_fotocasa_text_trans = ts.transform(
    df=df_fotocasa_text_trans, 
    column='description_clean', 
    new_column='description_stem')

df_fotocasa_text_trans.head()

Unnamed: 0,idproperty,description,description_clean,description_stem
0,abkvpehvdk,"SH194.- Masia rural en Olivella, muy cerca de ...",sh194 masia rural en olivella muy cerca de pla...,"[sh194, masi, rural, olivell, cerc, plan, nove..."
1,abrrpeggwd,"VENDO MASÍA - NEGOCIO TURISMO RURAL, en Pineda...",vendo masía negocio turismo rural en pineda de...,"[vend, mas, negoci, turism, rural, pined, mar,..."
2,adqphxnrhg,Lapos;Estartit - Els Griells: A 50 metros de l...,lapos estartit els griells 50 metros de la pla...,"[lap, estartit, els, griells, 50, metr, play, ..."
3,adsvpzczjm,Casa de aspecto rústico totalmente reformada (...,casa de aspecto rústico totalmente reformada a...,"[cas, aspect, rustic, total, reform, año, fach..."
4,aemkznotwk,"situada en zona muy tranquila, agua de pozo ,...",situada en zona muy tranquila agua de pozo ins...,"[situ, zon, tranquil, agu, poz, instalacion, a..."


## Adevinta: Text embedding and similarity metric

**Text embeddings** are mathematical representations of text as vectors. They are created by analyzing a body of text and representing each word, phrase, or entire document as a vector in a high-dimensional space (similar to a multi-dimensional graph).

We tried two methods for the purpose of this exercise:

- A more classic approach: Word Count Vectors + TF-IDF + Cosine
- Using word models: Doc2Vec + Cosine

Here is a list of embeddings that could be used in future work:

- Bag of Words (BoW)
- Term Frequency - Inverse Document Frequency (TF-IDF)
- Continuous Bag of Words (CBOW) and Skip-Gram model embeddings
- Pre-trained word embedding models: 
  * Word2Vec (by Google) and Doc2Vec
  * GloVe (by Stanford)
  * fastText (by Facebook)
- Poincaré embedding
- Node2Vec embedding based on Random Walk and Graph

These models are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

### Adevinta: Text embedding and similarity: Word Count Vectors + TF-IDF + Cosine

- **Word Count Vectors:** With this method, every column is a term from the corpus, and every cell holds the frequency count of that term in a given document.

- **TF-IDF Vectors:** TF-IDF is a score that represents the relative importance of a term in a document with respect to the entire corpus. TF stands for Term Frequency, and IDF stands for Inverse Document Frequency.

- **Cosine Similarity:** Calculates the cosine similarity metric between two given vectors, here the TF-IDF vectors of two documents.
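
To show how the TF-IDF weighting behaves, here is a small pure-Python sketch on a toy corpus of stemmed words. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which, as far as we know, matches the scikit-learn defaults. The function `tfidf_vectors` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the vocabulary and one L2-normalised TF-IDF vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Made-up stemmed documents.
docs = ["cas rural piscin", "cas rural jardin", "pis centr ascensor"]
vocab, vectors = tfidf_vectors(docs)
weights = dict(zip(vocab, vectors[0]))
print(weights["piscin"] > weights["cas"])
```

In the first document, `piscin` receives a higher weight than `cas` because it appears in fewer documents of the corpus, even though both occur once.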

In [63]:
documents =  df_fotocasa_text_trans['description_stem'].values
documents = [' '.join(document) for document in documents]

**Word Count Vectors**

In [64]:
from sklearn.feature_extraction.text import CountVectorizer

In [65]:
vectorizer = CountVectorizer(max_features=1500, 
                             min_df=5, 
                             max_df=0.7, 
                             stop_words=ts.stopwords)
vectorizer.fit(documents)

X = vectorizer.transform(documents).toarray()
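
The `min_df` and `max_df` arguments prune the vocabulary by document frequency. A minimal sketch of that filtering, assuming an integer `min_df` is an absolute document count and a float `max_df` is a fraction of the corpus; `filter_vocabulary` and the toy documents are hypothetical, and the `max_features` cut is not reproduced here.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least `min_df` documents (absolute count)
    and in at most a `max_df` fraction of all documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    limit = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)

# Made-up stemmed documents.
docs = [
    "cas piscin jardin",
    "cas piscin terraz",
    "cas garaj terraz",
    "cas garaj",
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.7)
print(kept)
```

Here `cas` is dropped because it appears in every document, and `jardin` is dropped because it appears in only one.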

**TF-IDF**

In [66]:
from sklearn.feature_extraction.text import TfidfTransformer

In [67]:
tfidfconverter = TfidfTransformer()
tfidfconverter.fit(X)

X = tfidfconverter.transform(X).toarray()

In [72]:
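# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names().
word_freq_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())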
word_freq_df[word_freq_df.wifi > 0.1]

Unnamed: 0,000,000m2,10,100,100m2,105,10m2,11,115,12,...,vitroceram,viur,viv,vivend,viviend,wc,web,wifi,with,zon
149,0.0,0.155567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.075428,0.0,0.0,0.138496,0.0,0.0
258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.125518,0.0,0.0,0.0,0.0,0.194149,0.0,0.072638
269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.187692,0.0,0.0
312,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.178888,0.0,0.066928
382,0.0,0.0,0.095318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.148416,0.0,0.055527
512,0.0,0.0,0.0,0.0,0.0,0.0,0.184199,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.170632,0.0,0.0
522,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.231511,0.0,0.0,0.0,0.0,0.0,0.0,0.26167,0.0,0.097899
659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.134965,0.0,0.10099
779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.260931,0.1384,0.0
797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.121575,...,0.0,0.0,0.0,0.0,0.226336,0.0,0.0,0.138529,0.0,0.0


In [77]:
df_fototext = pd.concat(
    [df_fotocasa_text_trans[['idproperty']], word_freq_df], sort=False, axis=1)
df_fototext.head()

Unnamed: 0,idproperty,000,000m2,10,100,100m2,105,10m2,11,115,...,vitroceram,viur,viv,vivend,viviend,wc,web,wifi,with,zon
0,abkvpehvdk,0.0,0.183648,0.105003,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,abrrpeggwd,0.120449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,adqphxnrhg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097672
3,adsvpzczjm,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.142983,0.0,0.10448,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,aemkznotwk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113075


**Cosine similarity**

In [82]:
from sklearn.metrics import pairwise_distances

In [83]:
# Cosine similarity = 1 - cosine distance, so despite its name dist_out holds similarities.
dist_out = 1-pairwise_distances(X, metric="cosine")
dist_out

array([[1.        , 0.19681883, 0.0372361 , ..., 0.03217765, 0.03059349,
        0.03993822],
       [0.19681883, 1.        , 0.08504716, ..., 0.20906892, 0.09127283,
        0.08237164],
       [0.0372361 , 0.08504716, 1.        , ..., 0.13742142, 0.09743345,
        0.11721089],
       ...,
       [0.03217765, 0.20906892, 0.13742142, ..., 1.        , 0.14003873,
        0.05508991],
       [0.03059349, 0.09127283, 0.09743345, ..., 0.14003873, 1.        ,
        0.10131242],
       [0.03993822, 0.08237164, 0.11721089, ..., 0.05508991, 0.10131242,
        1.        ]])
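
For reference, a minimal pure-Python sketch of the quantity stored in each cell of this matrix. The function names and the three toy vectors are ours, and returning 0.0 for an all-zero vector is a choice made for this sketch only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities; symmetric, with ones on the diagonal."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy vectors: the second points in the same direction as the first,
# the third shares no non-zero component with either.
vectors = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
sim = similarity_matrix(vectors)
print(sim[0][2])
```

Because the measure depends only on direction, a long and a short description that use the same terms in the same proportions get a similarity close to 1.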

In [89]:
df_res = pd.DataFrame(dist_out)
df_res.columns = list(df_fotocasa_text_trans.idproperty)

df_res['idproperty'] = df_fotocasa_text_trans.idproperty
df_res.head()

Unnamed: 0,abkvpehvdk,abrrpeggwd,adqphxnrhg,adsvpzczjm,aemkznotwk,aeunvoqpfk,aexiyslvzb,afpxjapnaa,aglafrntto,agvdpedzlx,...,zrgbsuyzxw,zrsflarpkv,zswfozpnpq,zufqkgdlos,zuqgmuxggs,zvepjajedz,zxkniyripf,zxmtqvtewr,zzqybyrjsg,idproperty
0,1.0,0.196819,0.037236,0.059884,0.052991,0.037511,0.080819,0.073396,0.022855,0.048095,...,0.029469,0.0,0.049465,0.046885,0.024888,0.021081,0.032178,0.030593,0.039938,abkvpehvdk
1,0.196819,1.0,0.085047,0.046122,0.188895,0.0,0.048406,0.121482,0.102697,0.09971,...,0.084646,0.027377,0.06603,0.069142,0.070403,0.070217,0.209069,0.091273,0.082372,abrrpeggwd
2,0.037236,0.085047,1.0,0.081249,0.04363,0.018452,0.043877,0.054748,0.075536,0.0,...,0.093351,0.044651,0.075739,0.0,0.022833,0.0,0.137421,0.097433,0.117211,adqphxnrhg
3,0.059884,0.046122,0.081249,1.0,0.124933,0.194237,0.079182,0.071844,0.089035,0.015719,...,0.130974,0.0,0.024531,0.082097,0.022194,0.011943,0.062753,0.034066,0.083845,adsvpzczjm
4,0.052991,0.188895,0.04363,0.124933,1.0,0.08286,0.038582,0.092299,0.105353,0.176399,...,0.041149,0.064934,0.054348,0.014821,0.016194,0.0,0.063964,0.148819,0.022154,aemkznotwk


In [160]:
df_res.to_pickle("./models/text_sim_matrix.pkl")

**Prediction**

In [95]:
df_pred = df_res[
    ['idproperty','abkvpehvdk']
].sort_values(by='abkvpehvdk', ascending=False).head(5)
df_pred

Unnamed: 0,idproperty,abkvpehvdk
0,abkvpehvdk,1.0
261,hqehavlecq,0.23102
751,vwhiyxqlye,0.228284
395,loawjvpcup,0.222046
12,aidfoucmhl,0.220192
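
Since every property is most similar to itself, the first row is always the query. A small sketch of a lookup that skips it, using the scores from the table above; the helper `most_similar` is hypothetical and is not the notebook's code.

```python
def most_similar(scores, query_id, k=4):
    """Return the `k` ids with the highest similarity to `query_id`,
    skipping the query itself (its self-similarity is always 1.0)."""
    ranked = sorted(
        ((other, s) for other, s in scores.items() if other != query_id),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

# Similarity scores copied from the table above for property 'abkvpehvdk'.
scores = {
    "abkvpehvdk": 1.0,
    "hqehavlecq": 0.23102,
    "vwhiyxqlye": 0.228284,
    "loawjvpcup": 0.222046,
    "aidfoucmhl": 0.220192,
}
neighbours = most_similar(scores, "abkvpehvdk", k=3)
print(neighbours)
```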


In [164]:
# None shows the full column contents (-1 is deprecated in recent pandas versions).
pd.set_option('display.max_colwidth', None)

pd.merge(df_pred,df_fotocasa_text_trans[['idproperty','description']],how='left',on='idproperty')

Unnamed: 0,idproperty,abkvpehvdk,description
0,abkvpehvdk,1.0,"SH194.- Masia rural en Olivella, muy cerca de Plana Novella nos encontramos con esta Masía del siglo XVIII, completamente restaurada, y en la actualidad con uso de turismo rural. Dispone de 25.000m2 de terreno en un enclave natural inigualable. La casa tiene 10 habitaciones, 7 dormitorios y 7 baños, todos ellos decorados a estilo provenzal con todo lujo de detalles y diferentes cada uno de ellos. La cocina tiene el horno de leña, suelos de madera tratados, todos acabados de alta calidad. Viaje en coche de 15 minutos de Sitges, 20 min de Barcelona y aeropuerto. Un paraiso frente al mar! Nuria Mir sitgeshouses"
1,hqehavlecq,0.23102,"Propiedad rural constituida por dos fincas rústicas de 50 hectáreas de extensión total. Terreno excelente para cultivo. Incluye masía del siglo XVII de 400 m2. Ideal para vivienda, segunda residencia o establecimiento de turismo rural. Posibilidad de construir establos, piscina y pistas de deporte."
2,vwhiyxqlye,0.228284,"SH.- Casa en Quint Mar con vistas mar y montaña. Muy luminosa, amplia y en muy buen estado. Distribuida en tres plantas, acogedor salón con terraza y cocina semi -abierta y equipada , 1 aseo. En piso superior encontramos tres habitaciones dobles , 1 baño y 1 aseo. En planta tres, gran habitación suite con baño, vestidor y terraza. la vivienda tiene calefacción de gas, suelo de gres. Garaje para dos coches y trastero. En la zona hay bus, el puerto y el mar se encuentran a 2km. Nuria Mir sitgeshouses"
3,loawjvpcup,0.222046,"SH.- En Sitges, Casa exclusiva en Can Girona. Entre el campo de golf y el mar y con servicio de seguridad de 24 horas. Dispone de un gran jardín donde hay un porche de 30m2 junto a la piscina. El salón con comedor aparte hacen un total de 55m2. La cocina tiene zona de aguas. 1 aseo de cortesia y 1 baño en planta baja así como habitación de servicio. En planta primera encontramos dos suites y dos habitaciones dobles, todas ellas con armarios y vistas al mar. 3 baños. 2 terrazas. Solarium de 50m2 con espectaculares vistas al golf y al mar. Garaje para tres coches más trastero. Se halla en una de las mejores urbanizaciones de Sitges, a solo 1 km del centro y cerca de colegios , bus y centro comercial. Tiene todo lo necesario para vivir con el máximo confort, calefacción de gas y bomba de calor y alarma. Nuria Mir sitgeshouses"
4,aidfoucmhl,0.220192,"Preciosa casa de pueblo de 208 m2 del siglo XVIII totalmente restaurada en y de estilo rústico. Consta de 4 plantas de 46m2 cada una de ellas tipo loft más una bodega en la planta baja. Las paredes son de piedra natural de gran grosor que añadido a unas ventanas de madera con cristal Climalit, proporciona un aislamiento térmico y acústico. La casa se vende totalmente amueblada con complementos de calidad como frigorífico MIELE, grifería ROCA, chimenea EBRO, azulejos Gres, puerta principal de madera tropical, etc. La casa se puede utilizar tanto como negocio de turismo rural como particular. Prat de Comte pertenece a la comarca de la Terra Alta y ubicado estratégicamente a 5 minutos del parque natural de Els Ports de Beceit y atravesado por la Ruta Verde de la Terra Alta, siendo este tramo el más espectacular de la ruta debido al balneario de la Fontcalda [fuente de agua minero-medicinal que surge a 38 grados] y ha sus preciosos acantilados."


In [166]:
df = df_fotocasa[['idproperty','province','municipality','surface',
   'rooms','baths','price','property_subtype']].groupby('idproperty').first().reset_index().copy()

pd.merge(df_pred,df,how='left',on='idproperty')

Unnamed: 0,idproperty,abkvpehvdk,province,municipality,surface,rooms,baths,price,property_subtype
0,abkvpehvdk,1.0,Barcelona,Sant Pere de Ribes,700,7,7,750000.0,Finca rústica
1,hqehavlecq,0.23102,Barcelona,Santa Maria de Miralles,400,6,1,475000.0,Finca rústica
2,vwhiyxqlye,0.228284,Barcelona,Sitges,320,4,2,650000.0,Casa-Chalet
3,loawjvpcup,0.222046,Barcelona,Sitges,420,6,5,1980000.0,Casa-Chalet
4,aidfoucmhl,0.220192,Tarragona,Prat de Comte,208,4,2,227000.0,Casa adosada


### Adevinta: Text embedding and similarity: Doc2Vec model

Doc2vec is an unsupervised algorithm that generates vectors for sentences. It is an adaptation of word2vec, which generates vectors for words.

The vectors generated by doc2vec can be used for tasks such as finding the similarity between sentences or between word n-grams.

**Training set and Testing set split**

In [139]:
from sklearn.model_selection import train_test_split

In [140]:
df_fotocasa_textmodel = \
    df_fotocasa_text_trans[df_fotocasa_text_trans.description != ' '].copy()

sentences = list(df_fotocasa_textmodel.description_clean)

In [142]:
X_train, X_test = train_test_split(sentences, test_size=0.05, random_state=0)

**Model training**

In [143]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [144]:
tagged_data = [
    TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) \
        for i, _d in enumerate(X_train)]


In [145]:
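max_epochs = 200
vec_size = 20
alpha = 0.025
min_alpha = 0.00025
# Step that takes the learning rate from alpha down to min_alpha over
# max_epochs passes, so that it never becomes negative.
alpha_step = (alpha - min_alpha) / max_epochs

# Written for gensim < 4.0; from 4.0 on, `size` is `vector_size` and
# `model.iter` is `model.epochs`.
model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=min_alpha,
                min_count=1,
                dm=1)  # dm=1: distributed memory (PV-DM)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # Lower the learning rate by hand and freeze it for the next pass.
    model.alpha -= alpha_step
    model.min_alpha = model.alpha

model.save("./models/text_sim_d2v.model")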



iteration 0
iteration 1
iteration 2
...
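
When the learning rate is lowered by hand between passes, the size of the decrement matters. The sketch below compares a fixed decrement with a linear schedule whose step is derived from the start and end values. The function names are hypothetical, and the numbers (start 0.025, floor 0.00025, 200 steps, decrement 0.0002) are used purely for illustration.

```python
def fixed_decrement(start, decrement, steps):
    """Learning rate after each of `steps` equal, hand-picked decrements."""
    return [start - decrement * i for i in range(steps + 1)]

def linear_decay(start, end, steps):
    """Learning rates that go from `start` to `end` in `steps` equal steps."""
    step = (start - end) / steps
    return [start - step * i for i in range(steps + 1)]

fixed = fixed_decrement(start=0.025, decrement=0.0002, steps=200)
linear = linear_decay(start=0.025, end=0.00025, steps=200)

print(min(fixed) < 0)    # the fixed decrement ends below zero
print(min(linear) > 0)   # the linear schedule stays positive
```

Deriving the step from the two endpoints keeps the rate positive for any number of passes, whereas a hand-picked decrement has to be re-checked whenever the number of passes changes.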

In [146]:
new_sentence = ("Casa de pedra reformada estructuralment la façana, " + \
                "fusteria finestres i balcons,falta fer tot els interiors.").split(" ")  

model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)],topn=5)

[('208', 0.8348480463027954),
 ('648', 0.8180837035179138),
 ('239', 0.803045392036438),
 ('731', 0.7766746878623962),
 ('573', 0.7730152606964111)]

In [159]:
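# Note: the tags returned by most_similar are positions in X_train (which
# train_test_split shuffles), whereas iloc indexes rows of df_fotocasa_textmodel.
pd.concat([
    pd.DataFrame(df_fotocasa_textmodel.iloc[208]).T,
    pd.DataFrame(df_fotocasa_textmodel.iloc[648]).T,
    pd.DataFrame(df_fotocasa_textmodel.iloc[239]).T,
    pd.DataFrame(df_fotocasa_textmodel.iloc[731]).T,
    pd.DataFrame(df_fotocasa_textmodel.iloc[573]).T
])[['idproperty','description']] 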

Unnamed: 0,idproperty,description
221,gmudqhvrrc,"Masia de piedra en un valle unico muy bien comunicado(A 1 min.B-300 y a 10 min.C-25) con 14 Ha. de terreno de siembra, bosques y huertos.Mejor verlo.Equipada con agua potable y pozo,luz 220V y 380V, y teléfono.Ideal turismo rural o hipica.Garages colindantes de 600m2 de superficie.Posibilidad de añadir a la venta, otra casa de piedra para restaurar de 80m2 por planta."
694,tzipshomny,"casa ubicada en urbanización muy tranquila, a 30 km de la costa brava y 20 de la capital, tiene a 10 km el aeropuerto y 10 km el tren dirección barcelona-junquera, El pueblo está a 3 km. casa de estructura de hormigón , suelos de porcelana italiana, puertas de cedro. ventanas exteriores de aluminio, todas las estancias tienen fm-tv,telefono, autobuses escolares y tramsporte urbano. Ideal para vivir en la naturaleza y en menos de una hora llegar en tren a Barcelona para trabajar, Casa muy amplia."
257,hoohutjmwi,"La casa consta de 2 viviendas. La primera es un dúplex que se encuentra en la planta baja y el primer piso. 150 m2. Salón comedor con vistas al mar, amplia cocina office, 4 habitaciones, 2 baños, calefacción. Salida directa al paseo y la playa.La segunda se encuentra en la segunda planta, 80 m2, salón comedor con vistas al mar, cocina, 3 habitaciones, 1 baño, aire acondicionado.Dispone de un jardín de 212 m2 y un garaje."
778,whpdgorsec,"CHALET EN TARRAGONAMagnifico y exclusivo chalet individual ubicado en un terreno de 900 m2 en la mejor zona de “Bosques de Tarragona”, a 50 ml del “Centro comercial”, frente a la parada de Autobús, cerca de la “Playa larga”, de 2 clubs de tenis, del “Club de golf costa dorada”. De 400 m2. Consta de 6 habitaciones, 4 baños, Salón comedor de 40 m2 a 2 alturas, comedor de diario, bar, gran despacho. Piscina cubierta y caseta servicio. Zona barbacoa, terraza con gran toldo. Jardín con frutales y fuente luminosa. Cava. Almacén. Parking 3-4 coches. Lavadero."
612,saxdfjvlon,"Ubicada en la urbanizacion el papagayo parcelas con los chalets a 4 vientos sin adosadar extenso terreno m,con posibilidad de hacer piscina campo de tenis o ampliar casa Tiene un magnifico porche vistas al mar a lo lejos tan solo a 10 minutos en coche, magnifica barbacoa con fuente y encimera de gres ideal para lavar platos Magnifico para ver la TV ya que hay instalacion ,comer y descansar por la tranquilidad que hay en la fincaParcela esquinera de m"
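
Each tag passed to `TaggedDocument` is the position of a sentence in `X_train`, so the tags returned by `most_similar` refer to positions in the shuffled training split rather than to rows of the original frame. A minimal sketch of mapping tags back to the training sentences; the helper, the sentences and the scores are all made up for illustration.

```python
def sentences_for_tags(similar, train_sentences):
    """Map (tag, score) pairs back to the training sentences they index."""
    return [(train_sentences[int(tag)], score) for tag, score in similar]

# Hypothetical training split and hypothetical `most_similar` output.
train_sentences = ["masia de piedra", "piso en el centro", "casa con jardin"]
similar = [("2", 0.83), ("0", 0.81)]

matches = sentences_for_tags(similar, train_sentences)
print(matches)
```

Recovering the property id as well would require keeping the ids alongside the sentences when splitting, for instance by splitting a list of `(idproperty, sentence)` pairs instead of bare sentences.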
