# Rental Pricing in New York

## Business Understanding

You have been assigned to an Indicium team that is currently working with a client to build a short-term rental platform in New York City. To develop its pricing strategy, the client asked Indicium to carry out an exploratory analysis of its largest competitor's data, as well as a validation test of a predictive model.

Your goal is to develop a price prediction model from the dataset provided and to evaluate it with the metrics that best suit the problem. Using data sources beyond the dataset is allowed (and encouraged). A data dictionary is attached.
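
Because the target is a continuous price, regression metrics are the natural candidates for that evaluation. The brief does not say which ones to use, so the choice of MAE, RMSE and R² below is an assumption, and the helper name `regression_report` and the sample prices are made up for illustration. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of price predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),  # average absolute error, in price units
        "rmse": float(np.sqrt(mse)),                 # penalizes large misses more heavily
        "r2": r2_score(y_true, y_pred),              # share of variance explained
    }


# Made-up nightly prices, for illustration only (not taken from the dataset).
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 230]

report = regression_report(y_true, y_pred)
print(report)
```

MAE reads directly as "dollars off per listing", while RMSE reacts more strongly to a few very expensive listings, so reporting both can help when prices are skewed.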

In [1]:
import pandas as pd
import os

In [2]:
DATA_DIR = "../Data"
DATASET_FILE_NAME = "teste_indicium_precificacao.csv"

file_path = os.path.join(DATA_DIR, DATASET_FILE_NAME)

# Fail early if the file is missing; otherwise `df` would be undefined below.
if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: {file_path}")

df = pd.read_csv(file_path)

In [3]:
df.head()

| | id | nome | host_id | host_name | bairro_group | bairro | latitude | longitude | room_type | price | minimo_noites | numero_de_reviews | ultima_review | reviews_por_mes | calculado_host_listings_count | disponibilidade_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 1 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 2 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 3 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |
| 4 | 5099 | Large Cozy 1 BR Apartment In Midtown East | 7322 | Chris | Manhattan | Murray Hill | 40.74767 | -73.975 | Entire home/apt | 200 | 3 | 74 | 2019-06-22 | 0.59 | 1 | 129 |


In [4]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 48894.0 | 19017530.0 | 10982880.0 | 2595.0 | 9472371.0 | 19677430.0 | 29152250.0 | 36487240.0 |
| host_id | 48894.0 | 67621390.0 | 78611180.0 | 2438.0 | 7822737.0 | 30795530.0 | 107434400.0 | 274321300.0 |
| latitude | 48894.0 | 40.72895 | 0.05452939 | 40.49979 | 40.6901 | 40.72308 | 40.76312 | 40.91306 |
| longitude | 48894.0 | -73.95217 | 0.04615712 | -74.24442 | -73.98307 | -73.95568 | -73.93627 | -73.71299 |
| price | 48894.0 | 152.7208 | 240.1566 | 0.0 | 69.0 | 106.0 | 175.0 | 10000.0 |
| minimo_noites | 48894.0 | 7.030085 | 20.51074 | 1.0 | 1.0 | 3.0 | 5.0 | 1250.0 |
| numero_de_reviews | 48894.0 | 23.27476 | 44.55099 | 0.0 | 1.0 | 5.0 | 24.0 | 629.0 |
| reviews_por_mes | 38842.0 | 1.373251 | 1.680453 | 0.01 | 0.19 | 0.72 | 2.02 | 58.5 |
| calculado_host_listings_count | 48894.0 | 7.144005 | 32.95286 | 1.0 | 1.0 | 1.0 | 2.0 | 327.0 |
| disponibilidade_365 | 48894.0 | 112.7762 | 131.6187 | 0.0 | 0.0 | 45.0 | 227.0 | 365.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48894 entries, 0 to 48893
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             48894 non-null  int64  
 1   nome                           48878 non-null  object 
 2   host_id                        48894 non-null  int64  
 3   host_name                      48873 non-null  object 
 4   bairro_group                   48894 non-null  object 
 5   bairro                         48894 non-null  object 
 6   latitude                       48894 non-null  float64
 7   longitude                      48894 non-null  float64
 8   room_type                      48894 non-null  object 
 9   price                          48894 non-null  int64  
 10  minimo_noites                  48894 non-null  int64  
 11  numero_de_reviews              48894 non-null  int64  
 12  ultima_review                  38842 non-null 

## MLflow

In [6]:
import mlflow
import mlflow.sklearn

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [8]:
mlflow.set_experiment("pollution_dataset_experiment")

<Experiment: artifact_location='file:///home/aurelio/projetos/Python/indicium/Notebooks/mlruns/854777663987049179', creation_time=1734524203582, experiment_id='854777663987049179', last_update_time=1734524203582, lifecycle_stage='active', name='pollution_dataset_experiment', tags={}>

In [9]:
X = df.iloc[:, :-1]
# Select the last column as a 1-D Series; passing a one-column DataFrame
# to LabelEncoder triggers a DataConversionWarning.
y = df.iloc[:, -1]

In [10]:
encoder = LabelEncoder()
y = encoder.fit_transform(y)


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [13]:
with mlflow.start_run():
    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    # Log parameters and metrics
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log the model; a few rows are enough for the input example
    mlflow.sklearn.log_model(model, "classification_rf_model", input_example=X_test.iloc[:5])
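
The run above fits a classifier on the last column of `df`, whereas the stated goal is to predict `price`. A regression variant might look like the sketch below. It is not the notebook's method: the feature selection (`CATEGORICAL`, `NUMERIC`), the helper names `build_price_pipeline` and `make_toy_listings`, and the synthetic data are all assumptions made so the example runs on its own. Only the column names come from the dataset shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature selection; the column names match the dataset above.
CATEGORICAL = ["bairro_group", "room_type"]
NUMERIC = ["latitude", "longitude", "minimo_noites", "numero_de_reviews",
           "calculado_host_listings_count", "disponibilidade_365"]
TARGET = "price"


def build_price_pipeline(n_estimators=100, random_state=42):
    """One-hot encode the categorical columns, pass numeric ones through, fit a forest."""
    preprocess = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    return Pipeline([("preprocess", preprocess), ("model", model)])


def make_toy_listings(n=200, seed=0):
    """Synthetic stand-in for `df` (NOT real data): price depends mostly on room type."""
    rng = np.random.default_rng(seed)
    room_type = rng.choice(["Entire home/apt", "Private room"], size=n)
    frame = pd.DataFrame({
        "bairro_group": rng.choice(["Manhattan", "Brooklyn"], size=n),
        "room_type": room_type,
        "latitude": rng.uniform(40.5, 40.9, size=n),
        "longitude": rng.uniform(-74.2, -73.7, size=n),
        "minimo_noites": rng.integers(1, 30, size=n),
        "numero_de_reviews": rng.integers(0, 300, size=n),
        "calculado_host_listings_count": rng.integers(1, 10, size=n),
        "disponibilidade_365": rng.integers(0, 366, size=n),
    })
    base = np.where(room_type == "Entire home/apt", 200.0, 80.0)
    frame[TARGET] = base + rng.normal(0, 10, size=n)
    return frame


toy = make_toy_listings()
X_toy = toy[CATEGORICAL + NUMERIC]
y_toy = toy[TARGET]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipeline = build_price_pipeline()
pipeline.fit(X_tr, y_tr)
predictions = pipeline.predict(X_te)
mae = mean_absolute_error(y_te, predictions)
print(f"MAE on toy data: {mae:.2f}")
```

Keeping the encoder inside the `Pipeline` means the same preprocessing is applied at training and prediction time. The fitted pipeline could then be logged with `mlflow.sklearn.log_model`, as the cell above does for the classifier.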



In [1]:
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# set the experiment id
mlflow.set_experiment(experiment_id="0")

mlflow.autolog()
db = load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)

2024/12/18 09:37:08 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/12/18 09:37:08 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '7eef292f94514658942bd094855f689f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [2]:
mlflow.set_tracking_uri("/home/aurelio/projetos/Python/indicium/Notebooks/mlruns")