# Introduction

I was contracted to evaluate a dataset of real estate prices. The objective is to transform and prepare the data so that it can later be used to estimate rental prices with machine learning.

In [1]:
import pandas as pd
data = pd.read_json('https://caelum-online-public.s3.amazonaws.com/2928-transformacao-manipulacao-dados/dados_hospedagem.json')
data.head()

Unnamed: 0,info_moveis
0,"{'avaliacao_geral': '10.0', 'experiencia_local..."
1,"{'avaliacao_geral': '10.0', 'experiencia_local..."
2,"{'avaliacao_geral': '10.0', 'experiencia_local..."
3,"{'avaliacao_geral': '10.0', 'experiencia_local..."
4,"{'avaliacao_geral': '10.0', 'experiencia_local..."


## JSON normalize
Sometimes the data is nested, and pandas reads each record as a dictionary stored inside a single column. The json_normalize function expands those dictionaries into a regular data frame, with one column per key.

In [2]:
data = pd.json_normalize(data['info_moveis'])
data.head()

Unnamed: 0,avaliacao_geral,experiencia_local,max_hospedes,descricao_local,descricao_vizinhanca,quantidade_banheiros,quantidade_quartos,quantidade_camas,modelo_cama,comodidades,taxa_deposito,taxa_limpeza,preco
0,10.0,--,1,[This clean and comfortable one bedroom sits r...,[Lower Queen Anne is near the Seattle Center (...,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[Real Bed, Futon, Futon, Pull-out Sofa, Real B...","[{Internet,""Wireless Internet"",Kitchen,""Free P...","[$0, $0, $0, $0, $0, $350.00, $350.00, $350.00...","[$0, $0, $0, $20.00, $15.00, $28.00, $35.00, $...","[$110.00, $45.00, $55.00, $52.00, $85.00, $50...."
1,10.0,--,10,[Welcome to the heart of the 'Ballard Brewery ...,"[--, Capital Hill is the heart of Seattle, bor...","[2, 3, 2, 3, 3, 3, 2, 1, 2, 2, 2]","[3, 4, 2, 3, 3, 3, 3, 3, 3, 4, 3]","[5, 6, 8, 3, 3, 5, 4, 5, 6, 7, 4]","[Real Bed, Real Bed, Real Bed, Real Bed, Real ...","[{TV,Internet,""Wireless Internet"",Kitchen,""Fre...","[$500.00, $300.00, $0, $300.00, $300.00, $360....","[$125.00, $100.00, $85.00, $110.00, $110.00, $...","[$350.00, $300.00, $425.00, $300.00, $285.00, ..."
2,10.0,--,11,[New modern house built in 2013. Spectacular ...,[Upper Queen Anne is a charming neighborhood f...,[4],[5],[7],[Real Bed],"[{TV,""Cable TV"",Internet,""Wireless Internet"",""...","[$1,000.00]",[$300.00],[$975.00]
3,10.0,--,12,[Our NW style home is 3200+ sq ft with 3 level...,[The Views from our top floor! Wallingford ha...,"[3, 3, 3, 3, 3, 3, 3, 3]","[6, 6, 5, 5, 5, 5, 4, 4]","[6, 6, 7, 8, 7, 7, 6, 6]","[Real Bed, Real Bed, Real Bed, Real Bed, Real ...","[{Internet,""Wireless Internet"",Kitchen,""Free P...","[$500.00, $500.00, $500.00, $500.00, $500.00, ...","[$225.00, $300.00, $250.00, $250.00, $250.00, ...","[$490.00, $550.00, $350.00, $350.00, $350.00, ..."
4,10.0,--,14,"[Perfect for groups. 2 bedrooms, full bathroom...",[Safeway grocery store within walking distance...,"[2, 3]","[2, 6]","[3, 9]","[Real Bed, Real Bed]","[{TV,Internet,""Wireless Internet"",Kitchen,""Fre...","[$300.00, $2,000.00]","[$40.00, $150.00]","[$200.00, $545.00]"
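
To make the effect of json_normalize easier to see, here is a minimal sketch on a made-up two-row frame. The frame `raw` and its values are assumptions for illustration, not the real file; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the raw file: one column holding a dict per row.
raw = pd.DataFrame({
    "info_moveis": [
        {"avaliacao_geral": "10.0", "max_hospedes": "1", "preco": ["$110.00", "$45.00"]},
        {"avaliacao_geral": "9.0", "max_hospedes": "2", "preco": ["$80.00"]},
    ]
})

# json_normalize takes a list of dicts: each key becomes a column,
# while list values are kept as lists inside their cells.
flat = pd.json_normalize(raw["info_moveis"].tolist())

print(flat.columns.tolist())
print(flat.shape)
```

The lists survive normalization untouched, which is why a further unpacking step is still needed before the values can be used as features.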


avaliacao_geral: refere-se à média de notas dadas para a avaliação da hospedagem no imóvel.

experiencia_local: descreve as experiências oferecidas durante a hospedagem no imóvel.

max_hospedes: informa a quantidade máxima de hóspedes que o local permite.

descricao_local: descreve o imóvel.

descricao_vizinhanca: descreve a vizinhança ao redor do imóvel.

quantidade_banheiros: informa a quantidade de banheiros disponíveis.

quantidade_quartos: informa a quantidade de quartos disponíveis.

quantidade_camas: informa a quantidade de camas disponíveis.

modelo_cama: informa o modelo de cama oferecido.

comodidades: informa as comodidades oferecidas pelo imóvel.

taxa_deposito: informa a taxa de depósito mínima para segurança de hospedagem.

taxa_limpeza: informa a taxa cobrada para o serviço de limpeza.

preco: refere-se ao preço base a ser cobrado pela diária no imóvel.

*Translation*

general_evaluation: the average rating given to stays at the property.

local_experience: describes the experiences offered during a stay at the property.

max_hospedes: the maximum number of guests the property allows.

local_description: describes the property.

neighborhood_description: describes the neighborhood around the property.

quantity_bathrooms: the number of bathrooms available.

quantity_rooms: the number of bedrooms available.

quantity_beds: the number of beds available.

bed_model: the type of bed offered.

amenities: the amenities offered by the property.

deposit_tax: the minimum security deposit required for the stay.

cleaning_tax: the fee charged for the cleaning service.

price: the base price charged per night at the property.

## Translation

Since the column names are in Brazilian Portuguese (PT-BR), they are translated to US English (US-EN) for readability.

In [3]:
columns_pt_br = list(data.columns)
columns_us_en = ['general_evaluation', 'local_experience',
                 'max_hospedes','local_description',
                 'neighborhood_description','quantity_bathrooms',
                 'quantity_rooms','quantity_beds',
                 'bed_model','amenities', 'deposit_tax',
                 'cleaning_tax','price']
data.columns = columns_us_en
data.head()

Unnamed: 0,general_evaluation,local_experience,max_hospedes,local_description,neighborhood_description,quantity_bathrooms,quantity_rooms,quantity_beds,bed_model,amenities,deposit_tax,cleaning_tax,price
0,10.0,--,1,[This clean and comfortable one bedroom sits r...,[Lower Queen Anne is near the Seattle Center (...,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[Real Bed, Futon, Futon, Pull-out Sofa, Real B...","[{Internet,""Wireless Internet"",Kitchen,""Free P...","[$0, $0, $0, $0, $0, $350.00, $350.00, $350.00...","[$0, $0, $0, $20.00, $15.00, $28.00, $35.00, $...","[$110.00, $45.00, $55.00, $52.00, $85.00, $50...."
1,10.0,--,10,[Welcome to the heart of the 'Ballard Brewery ...,"[--, Capital Hill is the heart of Seattle, bor...","[2, 3, 2, 3, 3, 3, 2, 1, 2, 2, 2]","[3, 4, 2, 3, 3, 3, 3, 3, 3, 4, 3]","[5, 6, 8, 3, 3, 5, 4, 5, 6, 7, 4]","[Real Bed, Real Bed, Real Bed, Real Bed, Real ...","[{TV,Internet,""Wireless Internet"",Kitchen,""Fre...","[$500.00, $300.00, $0, $300.00, $300.00, $360....","[$125.00, $100.00, $85.00, $110.00, $110.00, $...","[$350.00, $300.00, $425.00, $300.00, $285.00, ..."
2,10.0,--,11,[New modern house built in 2013. Spectacular ...,[Upper Queen Anne is a charming neighborhood f...,[4],[5],[7],[Real Bed],"[{TV,""Cable TV"",Internet,""Wireless Internet"",""...","[$1,000.00]",[$300.00],[$975.00]
3,10.0,--,12,[Our NW style home is 3200+ sq ft with 3 level...,[The Views from our top floor! Wallingford ha...,"[3, 3, 3, 3, 3, 3, 3, 3]","[6, 6, 5, 5, 5, 5, 4, 4]","[6, 6, 7, 8, 7, 7, 6, 6]","[Real Bed, Real Bed, Real Bed, Real Bed, Real ...","[{Internet,""Wireless Internet"",Kitchen,""Free P...","[$500.00, $500.00, $500.00, $500.00, $500.00, ...","[$225.00, $300.00, $250.00, $250.00, $250.00, ...","[$490.00, $550.00, $350.00, $350.00, $350.00, ..."
4,10.0,--,14,"[Perfect for groups. 2 bedrooms, full bathroom...",[Safeway grocery store within walking distance...,"[2, 3]","[2, 6]","[3, 9]","[Real Bed, Real Bed]","[{TV,Internet,""Wireless Internet"",Kitchen,""Fre...","[$300.00, $2,000.00]","[$40.00, $150.00]","[$200.00, $545.00]"


## Exploding the data and resetting the index

Some values are stored inside lists, which the machine learning model cannot use directly. The explode method unpacks each list so that every element gets its own row, and reset_index then rebuilds a clean sequential index.

In [4]:
data = (data.explode(columns_us_en[3:])).reset_index(drop=True)
data

Unnamed: 0,general_evaluation,local_experience,max_hospedes,local_description,neighborhood_description,quantity_bathrooms,quantity_rooms,quantity_beds,bed_model,amenities,deposit_tax,cleaning_tax,price
0,10.0,--,1,This clean and comfortable one bedroom sits ri...,Lower Queen Anne is near the Seattle Center (s...,1,1,1,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",$0,$0,$110.00
1,10.0,--,1,Our century old Upper Queen Anne house is loca...,"Upper Queen Anne is a really pleasant, unique ...",1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",$0,$0,$45.00
2,10.0,--,1,Cozy room in two-bedroom apartment along the l...,The convenience of being in Seattle but on the...,1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",$0,$0,$55.00
3,10.0,--,1,Very lovely and cozy room for one. Convenientl...,"Ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",$0,$20.00,$52.00
4,10.0,--,1,The “Studio at Mibbett Hollow' is in a Beautif...,--,1,1,1,Real Bed,"{""Wireless Internet"",Kitchen,""Free Parking on ...",$0,$15.00,$85.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,,--,8,Beautiful craftsman home in the historic Wedgw...,--,3,4,5,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...","$1,000.00",$178.00,$299.00
3814,,--,8,Located in a very easily accessible area of Se...,"Quiet, dead end street near I-5. The proximity...",2,4,4,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",$0,$99.00,$199.00
3815,,--,8,This home is fully furnished and available wee...,--,1,3,4,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",$0,$0,$400.00
3816,,--,9,This business-themed modern home features: *H...,Your hosts made Madison Valley their home when...,2,3,6,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...","$1,000.00",$150.00,$250.00
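
The same explode and reset_index combination can be tried on a tiny frame. This is only a sketch: the frame `toy` and its values are invented, and exploding several columns at once assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical toy frame: each row groups several listings in parallel lists.
toy = pd.DataFrame({
    "max_hospedes": ["1", "2"],
    "quantity_beds": [["1", "2"], ["3"]],
    "price": [["$110.00", "$45.00"], ["$80.00"]],
})

# Exploding several columns together requires the lists in one row
# to have the same length; scalar columns are repeated for each new row.
exploded = toy.explode(["quantity_beds", "price"]).reset_index(drop=True)

print(exploded)
```

Without reset_index(drop=True), the rows produced from one list would all share the index label of their original row.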


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   general_evaluation        3818 non-null   object
 1   local_experience          3818 non-null   object
 2   max_hospedes              3818 non-null   object
 3   local_description         3818 non-null   object
 4   neighborhood_description  3818 non-null   object
 5   quantity_bathrooms        3818 non-null   object
 6   quantity_rooms            3818 non-null   object
 7   quantity_beds             3818 non-null   object
 8   bed_model                 3818 non-null   object
 9   amenities                 3818 non-null   object
 10  deposit_tax               3818 non-null   object
 11  cleaning_tax              3818 non-null   object
 12  price                     3818 non-null   object
dtypes: object(13)
memory usage: 387.9+ KB


The info output shows that every column is stored as object, although several of them hold integers or floats. This must be fixed.

In [6]:
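import numpy as np

def convert_column_type(df):
    """Convert numeric-looking string columns of df to int64/float64 in place."""
    for col in df.columns:
        # The first row decides how the whole column is treated
        first_value = df[col].iloc[0]

        try:
            if "." in str(first_value) or "$" in str(first_value):
                # Decimal or currency text: drop "$" and thousands separators, then cast
                df[col] = df[col].str.replace(r'[$,]', '', regex=True).astype(np.float64)
            else:
                df[col] = df[col].astype(np.int64)
        except ValueError as e:
            # Free-text columns cannot be cast; report them and leave them unchanged
            print(f"{col} has no value to be converted: {e}")

convert_column_type(data)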

local_experience has no value to be converted: invalid literal for int() with base 10: '--'
local_description has no value to be converted: could not convert string to float: "This clean and comfortable one bedroom sits right across from Kinnear Park in Seattle's lower Queen Anne neighborhood. Walk to Seattle Center the SAM Sculpture Park or just sit on the deck and enjoy the view of Puget Sound and downtown Seattle. Kitchen has hot water tap and sodastream Excellent water pressure Original art throughout the house Dogs under 30 lbs welcome Roof deck Lower Queen Anne is near the Seattle Center (space needle EMP museum Glass museum Science Center and Children's museum). It's also near SAM sculpture park stores restaurants SIFF theater and more."
neighborhood_description has no value to be converted: could not convert string to float: "Lower Queen Anne is near the Seattle Center (space needle EMP museum Glass museum Science Center and Children's museum). It's also near SAM sculpture park

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   general_evaluation        3162 non-null   float64
 1   local_experience          3818 non-null   object 
 2   max_hospedes              3818 non-null   int64  
 3   local_description         3818 non-null   object 
 4   neighborhood_description  3818 non-null   object 
 5   quantity_bathrooms        3818 non-null   int64  
 6   quantity_rooms            3818 non-null   int64  
 7   quantity_beds             3818 non-null   int64  
 8   bed_model                 3818 non-null   object 
 9   amenities                 3818 non-null   object 
 10  deposit_tax               3818 non-null   float64
 11  cleaning_tax              3818 non-null   float64
 12  price                     3818 non-null   float64
dtypes: float64(4), int64(4), object(5)
memory usage: 387.9+ KB


In [8]:
data.head()

Unnamed: 0,general_evaluation,local_experience,max_hospedes,local_description,neighborhood_description,quantity_bathrooms,quantity_rooms,quantity_beds,bed_model,amenities,deposit_tax,cleaning_tax,price
0,10.0,--,1,This clean and comfortable one bedroom sits ri...,Lower Queen Anne is near the Seattle Center (s...,1,1,1,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,0.0,110.0
1,10.0,--,1,Our century old Upper Queen Anne house is loca...,"Upper Queen Anne is a really pleasant, unique ...",1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,45.0
2,10.0,--,1,Cozy room in two-bedroom apartment along the l...,The convenience of being in Seattle but on the...,1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,55.0
3,10.0,--,1,Very lovely and cozy room for one. Convenientl...,"Ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,20.0,52.0
4,10.0,--,1,The “Studio at Mibbett Hollow' is in a Beautif...,--,1,1,1,Real Bed,"{""Wireless Internet"",Kitchen,""Free Parking on ...",0.0,15.0,85.0
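
The conversion function above decides each column's type from its first row. As an alternative, and not the approach used in this notebook, the currency clean-up can be sketched with pd.to_numeric and errors="coerce", which turns unparsable entries into NaN instead of raising. The series `fees` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical fee strings, including one entry that is not a number.
fees = pd.Series(["$0", "$1,000.00", "$28.00", "n/a"])

# Remove the dollar sign and the thousands separator.
cleaned = fees.str.replace(r"[$,]", "", regex=True)

# Parse to float; values that cannot be parsed become NaN.
as_float = pd.to_numeric(cleaned, errors="coerce")

print(as_float)
```

Coercing makes bad entries visible as missing values, which can then be counted with isna() before deciding how to handle them.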


## Text processing

The text columns are normalized next, starting with lowercasing, so that the descriptions can later be tokenized and used to evaluate whether they affect the rental price of the real estate.


In [9]:
def convert_column_type(df):
    for col in df.columns:
        first_value = df[col].iloc[0]

        try:
            # Skip columns already converted to numbers (NumPy scalars included)
            if isinstance(first_value, (int, float, np.integer, np.floating)):
                pass
            elif "." in str(first_value) or "$" in str(first_value):
                df[col] = df[col].str.replace(r'[$,]', '', regex=True).astype(np.float64)
            else:
                df[col] = df[col].astype(np.int64)
        except ValueError:
            print(f"{col} values were converted to lowercase")
            # Text columns fail the numeric cast, so they are lowercased instead
            df[col] = df[col].str.lower()

convert_column_type(data)

local_experience values were converted to lowercase
local_description values were converted to lowercase
neighborhood_description values were converted to lowercase
bed_model values were converted to lowercase
amenities values were converted to lowercase


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   general_evaluation        3162 non-null   float64
 1   local_experience          3818 non-null   object 
 2   max_hospedes              3818 non-null   int64  
 3   local_description         3818 non-null   object 
 4   neighborhood_description  3818 non-null   object 
 5   quantity_bathrooms        3818 non-null   int64  
 6   quantity_rooms            3818 non-null   int64  
 7   quantity_beds             3818 non-null   int64  
 8   bed_model                 3818 non-null   object 
 9   amenities                 3818 non-null   object 
 10  deposit_tax               3818 non-null   float64
 11  cleaning_tax              3818 non-null   float64
 12  price                     3818 non-null   float64
dtypes: float64(4), int64(4), object(5)
memory usage: 387.9+ KB


In [11]:
data['local_description'][3816]

"this business-themed modern home features:  *high-end kitchen/baths *open concept floor-thru living area  *full floor master suite w/ jetted tub & view deck *garage centrally located near: *top dining *parks *hiking *markets *day spas **note: this home is the latest addition to our portfolio.  photos are dated realtor images and don't reflect the decor and new furniture.  as we await our first guests and great reviews, enjoy a steep discount on this modern home!*** this home has it all: modern amenities, a quiet location, and proximity to all that seattle has to offer.  all furnishings and housewares are brand new as of november 2015, including comfy new bedding and towels, furniture, and appliances. the living areas are spread over three levels.  the entry level is the main living area.  this open concept space has a chef's kitchen with granite counter tops and stainless steel appliances.  this flows directly to the dining area which includes 3 stools at the breakfast counter, and a 

Note that characters such as * and / are present in the text. They must be removed, since they could affect the model.
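
One way to strip such characters is sketched here with Python's re module on a made-up string. The sample text and the choice to keep letters, digits, hyphens, and apostrophes are assumptions for illustration.

```python
import re

# Hypothetical description containing the kinds of symbols seen in the data.
sample = "high-end kitchen/baths *open concept* - don't miss it!"

# Keep letters, digits, hyphens and apostrophes; everything else becomes a space.
step1 = re.sub(r"[^a-zA-Z0-9\-']", " ", sample.lower())

# Replace hyphens that have no word character on either side.
step2 = re.sub(r"(?<!\w)-(?!\w)", " ", step1)

# Collapse the repeated spaces left behind by the replacements.
cleaned = " ".join(step2.split())

print(cleaned)  # high-end kitchen baths open concept don't miss it
```

The hyphen in "high-end" survives because it touches word characters, while the free-standing hyphen is dropped.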

## Text processing 2 - Removing the "stop words" and some other characters

In natural language processing (NLP), "stop words" or "noise words" refer to common, low-information words like "the," "and," or "in," which are typically disregarded during text analysis due to their limited semantic value. These linguistic elements, prevalent in most languages, are filtered out to streamline text data and enhance the focus on meaningful terms, improving the efficiency of tasks such as sentiment analysis or text classification. However, their exclusion is context-dependent, as some applications, like information retrieval, may benefit from retaining stop words. Furthermore, besides stop words, extraneous characters like punctuation and numbers are often removed to tailor the text data for specific NLP objectives.
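
The filtering idea can be sketched without any library. The stop-word set below is a deliberately tiny, hypothetical list chosen for the example; real lists, such as the one shipped with NLTK, are much longer.

```python
# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "and", "in", "a", "is", "of", "to"}

def remove_stop_words(text: str) -> str:
    """Lowercase the text, split on whitespace and drop the stop words."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)

print(remove_stop_words("The house is in the heart of Seattle"))  # house heart seattle
```

Storing the stop words in a set keeps each membership check cheap, which matters when thousands of descriptions are processed.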

In [12]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
english_stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to C:\Users\Gustavo
[nltk_data]     Fortunato\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
def convert_column_type(df):
    stop_words = set(english_stopwords)  # set lookup is faster than scanning a list

    for col in df.columns:
        first_value = df[col].iloc[0]

        try:
            # Skip columns already converted to numbers (NumPy scalars included)
            if isinstance(first_value, (int, float, np.integer, np.floating)):
                pass
            elif "." in str(first_value) or "$" in str(first_value):
                df[col] = df[col].str.replace(r'[$,]', '', regex=True).astype(np.float64)
            else:
                df[col] = df[col].astype(np.int64)
        except ValueError:
            print(f"{col} values were text values, and were processed")

            # Lowercase, then replace unwanted characters with a space.
            # [^a-zA-Z0-9\-'] matches any character that is NOT a letter,
            # a digit, a hyphen (-) or an apostrophe (')
            df[col] = df[col].str.lower().replace(r"[^a-zA-Z0-9\-']", ' ', regex=True)

            # Remove stray hyphens.
            # (?<!\w) requires that no word character comes before the hyphen,
            # (?!\w) requires that none comes after it, so a hyphen touching
            # a word on either side (as in "high-end") is kept
            df[col] = df[col].replace(r'(?<!\w)-(?!\w)', ' ', regex=True)

            # Tokenize (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded)
            df[col] = df[col].apply(nltk.word_tokenize)

            # Remove stop words
            df[col] = df[col].apply(lambda words: [word for word in words if word.lower() not in stop_words])

            # Join the filtered words back into a sentence
            df[col] = df[col].apply(lambda words: ' '.join(words))


convert_column_type(data)

local_experience values were text values, and were processed
local_description values were text values, and were processed
neighborhood_description values were text values, and were processed
bed_model values were text values, and were processed
amenities values were text values, and were processed


In [14]:
data['local_description'][3816]

"business-themed modern home features high-end kitchen baths open concept floor-thru living area full floor master suite w jetted tub view deck garage centrally located near top dining parks hiking markets day spas note home latest addition portfolio photos dated realtor images n't reflect decor new furniture await first guests great reviews enjoy steep discount modern home home modern amenities quiet location proximity seattle offer furnishings housewares brand new november 2015 including comfy new bedding towels furniture appliances living areas spread three levels entry level main living area open concept space chef 's kitchen granite counter tops stainless steel appliances flows directly dining area includes 3 stools breakfast counter"

In [15]:
data.head()

Unnamed: 0,general_evaluation,local_experience,max_hospedes,local_description,neighborhood_description,quantity_bathrooms,quantity_rooms,quantity_beds,bed_model,amenities,deposit_tax,cleaning_tax,price
0,10.0,,1,clean comfortable one bedroom sits right acros...,lower queen anne near seattle center space nee...,1,1,1,real bed,internet wireless internet kitchen free parkin...,0.0,0.0,110.0
1,10.0,,1,century old upper queen anne house located nea...,upper queen anne really pleasant unique little...,1,1,1,futon,tv internet wireless internet kitchen free par...,0.0,0.0,45.0
2,10.0,,1,cozy room two-bedroom apartment along lower ph...,convenience seattle west slope water views sal...,1,1,1,futon,tv internet wireless internet kitchen free par...,0.0,0.0,55.0
3,10.0,,1,lovely cozy room one conveniently located hear...,ballard lovely vibrant one rapidly growing nei...,1,1,1,pull-out sofa,internet wireless internet kitchen free parkin...,0.0,20.0,52.0
4,10.0,,1,studio mibbett hollow ' beautiful houseboat qu...,,1,1,1,real bed,wireless internet kitchen free parking premise...,0.0,15.0,85.0


Now the data is ready to be used by the machine learning model. Another part of this dataset must also be transformed so that it can serve as an additional feature for the model.

# Processing the second dataset

This part of the dataset records the vacancies of the real estate, day by day, during 2016.

In [16]:
data_2 = pd.read_json('https://caelum-online-public.s3.amazonaws.com/2928-transformacao-manipulacao-dados/moveis_disponiveis.json')
data_2.head()

Unnamed: 0,id,data,vaga_disponivel,preco
0,857,2016-01-04,False,
1,857,2016-01-05,False,
2,857,2016-01-06,False,
3,857,2016-01-07,False,
4,857,2016-01-08,False,


In [17]:
# small translation
columns_pt_br2 = list(data_2.columns)
columns_us_en2 = ['id', 'data','vacancy','price']
data_2.columns = columns_us_en2
data_2.head()

Unnamed: 0,id,data,vacancy,price
0,857,2016-01-04,False,
1,857,2016-01-05,False,
2,857,2016-01-06,False,
3,857,2016-01-07,False,
4,857,2016-01-08,False,


The data seems fine, but the data column must be checked to confirm that it is really stored in datetime format.

In [18]:
data_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365000 entries, 0 to 364999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   id       365000 non-null  int64 
 1   data     365000 non-null  object
 2   vacancy  365000 non-null  bool  
 3   price    270547 non-null  object
dtypes: bool(1), int64(1), object(2)
memory usage: 11.5+ MB


The data column is stored as object, not as datetime, so it must be converted.

In [20]:
data_2['data'] = pd.to_datetime(data_2['data'])
data_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365000 entries, 0 to 364999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   id       365000 non-null  int64         
 1   data     365000 non-null  datetime64[ns]
 2   vacancy  365000 non-null  bool          
 3   price    270547 non-null  object        
dtypes: bool(1), datetime64[ns](1), int64(1), object(1)
memory usage: 11.5+ MB
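
As a hedged variation on the conversion above, the sketch below passes an explicit format and errors="coerce" to pd.to_datetime, so that a malformed entry becomes NaT instead of stopping the conversion. The series `dates` is invented for the example, and the "%Y-%m-%d" format is an assumption based on the values printed by head().

```python
import pandas as pd

# Hypothetical date strings, including one malformed entry.
dates = pd.Series(["2016-01-04", "2016-01-05", "not a date"])

# An explicit format avoids guessing; unparsable values become NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())
```

Counting the NaT values afterwards is a quick check that nothing was silently lost in the conversion.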


## Building the vacancies by month

In [24]:
vacancies_by_month = data_2.groupby(data_2['data'].dt.strftime('%Y-%m'))['vacancy'].sum()
vacancies_by_month

data
2016-01    16543
2016-02    20128
2016-03    23357
2016-04    22597
2016-05    23842
2016-06    23651
2016-07    22329
2016-08    22529
2016-09    22471
2016-10    23765
2016-11    23352
2016-12    24409
2017-01     1574
Name: vacancy, dtype: int64

# Conclusions

Now both datasets are ready to be used by the machine learning model. They can also be visualized with libraries such as Seaborn, Plotly, and Matplotlib, for example to show which months have the most vacancies and which traits matter most when writing a description.