# Supervised learning
We aim to build a model that suggests a price to landlords who want to list a new home on Airbnb.

In [None]:
import pandas as pd

df = pd.read_csv('./data/supervised_cleaned_airbnb_data.csv')

df.head()


All columns are numeric, so any of them can be used as a feature when training the model.
The table below explains what each column means.

| Column name                  | Description                                                       |
|-----------------------------|-------------------------------------------------------------------|
| **realSum**                 | The total price of the Airbnb listing. **(Numeric)**                 |
| **room_type**               | 1. Private room, 2. Entire home/apt, 3. Shared room **(Numeric)** |
| **room_shared_bool**             | Whether the room is shared or not. **(Boolean)**                     |
| **room_private_bool**            | Whether the room is private or not. **(Boolean)**                    |
| **person_capacity**         | The maximum number of people that can stay in the room. **(Numeric)** |
| **host_is_superhost_bool**       | Whether the host is a superhost or not. **(Boolean)**                |
| **multi_bool**                   | Whether the listing is for multiple rooms or not. **(Boolean)**      |
| **biz_bool**                     | Whether the listing is for business purposes or not. **(Boolean)**   |
| **cleanliness_rating**      | The cleanliness rating of the listing. **(Numeric)**                 |
| **guest_satisfaction_overall** | The overall guest satisfaction rating of the listing. **(Numeric)** |
| **bedrooms**                | The number of bedrooms in the listing. **(Numeric)**                 |
| **dist**                    | The distance from the city centre. **(Numeric)**                     |
| **metro_dist**              | The distance from the nearest metro station. **(Numeric)**           |
| **City**                    | 1. Amsterdam, 2. Athens, 3. Barcelona, 4. Berlin, 5. Budapest, 6. Lisbon, 7. London, 8. Paris, 9. Rome, 10. Vienna **(Numeric)** |
| **Is_weekend_bool**         | Whether the home is available on weekends or not. **(Boolean)**      |
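
The claim that every column is numeric can be checked directly rather than read off `df.head()`. The sketch below assumes a small hypothetical stand-in frame (`sample`, with made-up values apart from the 194.03 price quoted later in this notebook); on the real data the same check would run against `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical three-row stand-in for the cleaned dataset; values are illustrative only.
sample = pd.DataFrame({
    "realSum": [194.03, 344.25, 264.10],
    "room_type": [1, 1, 2],
    "room_private_bool": [1, 1, 0],
    "dist": [5.02, 0.49, 5.75],
})

# Collect any column whose dtype is not numeric (pandas treats bool as numeric).
non_numeric = [col for col in sample.columns if not is_numeric_dtype(sample[col])]
print("Non-numeric columns:", non_numeric)
```

An empty list means every column can be passed to scikit-learn without further conversion.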

## Model 1
Initially we thought it would be a good idea to train a single model on data from all cities across Europe. This model proved to be quite far off the mark.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# The ratings are omitted as a new home would not have ratings upfront.
# We also did not care about multiple rooms or business purpose.

# X: the feature columns the model learns from
X = df[['room_type', 'room_shared_bool', 'room_private_bool', 'person_capacity', 'bedrooms', 'dist', 'metro_dist', 'City', 'Is_weekend_bool']]
# y: the target price the model should predict
y = df['realSum']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
import warnings
warnings.filterwarnings("ignore", message="X does not have valid feature names")

def predict_price(room_type, room_shared_bool, room_private_bool, person_capacity, bedrooms, dist, metro_dist, City, Is_weekend_bool):
    input_data = [[room_type, room_shared_bool, room_private_bool, person_capacity, bedrooms, dist, metro_dist, City, Is_weekend_bool]]
    predicted_price = model.predict(input_data)[0]
    return round(predicted_price, 2)

amsterdam = predict_price(
    room_type=1,
    room_shared_bool=0,
    room_private_bool=1,
    person_capacity=2,
    bedrooms=1,
    dist=5.0,
    metro_dist=2.5,
    City=1,  # 1 = Amsterdam (2 is Athens, see the City codes in the table above)
    Is_weekend_bool=0
)


london = predict_price(
    room_type=1,
    room_shared_bool=0,
    room_private_bool=1,
    person_capacity=2,
    bedrooms=1,
    dist=5.0,
    metro_dist=2.5,
    City=7,        
    Is_weekend_bool=0
)

athens = predict_price(
    room_type=1,
    room_shared_bool=0,
    room_private_bool=1,
    person_capacity=2,
    bedrooms=1,
    dist=5.0,
    metro_dist=2.5,
    City=2,        
    Is_weekend_bool=0
)

print("Predicted price Amsterdam:", amsterdam)
print("Predicted price London:", london)
print("Predicted price Athens:", athens)

First, we compared the predicted price with an actual price in the dataset. The first entry in the dataset has an actual price of 194.03, which leaves our Amsterdam prediction 42.15 off the mark.

We also expected the price to differ from city to city. London is a very expensive city, yet the predicted price is nearly identical regardless of the chosen city.
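
One possible contributor to the near-identical prices is that `City` enters the model as an integer code from 1 to 10. A linear regression fits a single coefficient to that column, so it can only express a price that rises or falls steadily with the code. The sketch below is not what this notebook does; it uses synthetic data with made-up price levels to compare the integer coding against one-hot encoding via `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (not the Airbnb dataset): three cities with made-up price levels
# that do not increase with the city code.
rng = np.random.default_rng(0)
city = rng.integers(1, 4, size=300)
base_price = pd.Series(city).map({1: 500.0, 2: 150.0, 3: 400.0})
toy = pd.DataFrame({"City": city, "realSum": base_price + rng.normal(0, 10, size=300)})

# Integer-coded city: one coefficient, so the fit is a straight line through the codes.
ordinal = LinearRegression().fit(toy[["City"]], toy["realSum"])

# One-hot encoded city: one coefficient per city.
one_hot = pd.get_dummies(toy["City"], prefix="City", dtype=float)
categorical = LinearRegression().fit(one_hot, toy["realSum"])

r2_ordinal = ordinal.score(toy[["City"]], toy["realSum"])
r2_one_hot = categorical.score(one_hot, toy["realSum"])
print("R^2 with integer-coded City:", round(r2_ordinal, 3))
print("R^2 with one-hot City:", round(r2_one_hot, 3))
```

Whether this explains the behaviour on the real data would need to be checked against the dataset itself.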

This led us to do more testing, and we decided on another approach: instead of a single model, we train one model per city and compare the results with our first model.

## Model 2
One model per city.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

city_models = {}


for city, city_data in df.groupby('City'):
    X = city_data[['room_type','room_shared_bool','room_private_bool','person_capacity','bedrooms','dist','metro_dist','Is_weekend_bool']]
    y = city_data['realSum']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Store the trained model
    city_models[city] = model
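
The cell above imports `mean_squared_error` and `r2_score` and creates `X_test` and `y_test`, but never uses them. A minimal sketch of how each city model could be scored on its held-out rows follows. It assumes a synthetic two-city frame (`toy`) with made-up values and only two features; `city_scores` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df: two cities, two features, made-up prices.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "City": np.repeat([1, 2], n // 2),
    "bedrooms": rng.integers(1, 4, size=n),
    "dist": rng.uniform(0.5, 8.0, size=n),
})
noise = rng.normal(0, 15, size=n)
toy["realSum"] = 100 * toy["City"] + 60 * toy["bedrooms"] - 10 * toy["dist"] + noise

city_scores = {}
for city, city_data in toy.groupby("City"):
    X = city_data[["bedrooms", "dist"]]
    y = city_data["realSum"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    # Score on the held-out rows only, never on the rows used for fitting.
    y_pred = model.predict(X_test)
    city_scores[city] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "r2": float(r2_score(y_test, y_pred)),
    }

for city, scores in city_scores.items():
    print(city, scores)
```

Scoring Model 1 the same way would give a like-for-like comparison instead of relying on a single example listing.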



In [None]:
# Example features: room_type, room_shared_bool, room_private_bool, person_capacity, bedrooms, dist, metro_dist, Is_weekend_bool

# Same values as the first test
amsterdam = {
    "data": [[1, 0, 1, 2, 1, 5.0, 2.5, 0]],
    "cityId": 1
}
athens = {
    "data": [[1, 0, 1, 2, 1, 5.0, 2.5, 0]],
    "cityId": 2
}
athens2 = {
    "data": [[1, 0, 1, 2, 1, 1.0, 0.5, 0]],
    "cityId": 2
}

london = { 
    "data": [[1, 0, 1, 2, 1, 5.0, 2.5, 0]],
    "cityId": 7
}

def predict_price(dataObject):
    city_id = dataObject["cityId"]
    model = city_models.get(city_id)
    if model:
        predicted_price = model.predict(dataObject["data"])[0]
        print(f"Predicted price for city {city_id}:", round(predicted_price, 2))
    else:
        print(f"No model found for city {city_id}")

predict_price(amsterdam)
predict_price(athens)
predict_price(athens2)
predict_price(london)

We test the new models with the same values that we used for Model 1.

Now the prices vary more between cities, and going to Athens seems like a great deal to us! We investigated the dataset to find the reason for the negative price in Athens.
It turns out that all homes in Athens are very close to the city centre and the metro, so our test values lie outside anything the Athens model was trained on. Changing those two parameters gives a more fitting price (athens2 contains the changed values).
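
A negative price like this is typical when a linear model is asked about feature values outside the range it was trained on. One way to catch that before predicting is to compare the requested distances with the per-city range in the data. The sketch below assumes a made-up frame (`toy`) and a hypothetical helper `outside_training_range`; on the real data the ranges would come from `df`.

```python
import pandas as pd

# Made-up distances for two cities; the real ranges come from the dataset.
toy = pd.DataFrame({
    "City": [1, 1, 1, 2, 2, 2],
    "dist": [1.2, 5.0, 9.8, 0.3, 1.1, 2.0],
    "metro_dist": [0.4, 2.5, 4.1, 0.1, 0.5, 0.9],
})

# Observed minimum and maximum of each feature per city.
ranges = toy.groupby("City")[["dist", "metro_dist"]].agg(["min", "max"])


def outside_training_range(city_id, dist, metro_dist):
    """Return the names of the features that fall outside the observed range for a city."""
    requested = {"dist": dist, "metro_dist": metro_dist}
    row = ranges.loc[city_id]
    return [
        name for name, value in requested.items()
        if not row[(name, "min")] <= value <= row[(name, "max")]
    ]


print(outside_training_range(1, 5.0, 2.5))
print(outside_training_range(2, 5.0, 2.5))
```

A non-empty result signals that the prediction is an extrapolation and should be treated with caution.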


The price still does not match the 194.03 of the first entry, but it is closer than the first model's prediction. The remaining gap may come from factors that are not in the dataset. The condition of the interior, for instance, can vary a lot: there may be an old bathroom, or the furniture may be of poor quality. A landlord might also want to compete on price and list the home for less than it could fetch.




In [None]:
# TODO: build functionality that receives input and uses Model 2 on it.
# The suggested price should then be printed to the user.
# Write comments/markdown for this part as well.
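
The planned step in the comments above could look roughly like the sketch below. It is only one possible shape, not the notebook's implementation: `suggest_price`, `CITY_IDS` and `FEATURES` are hypothetical names, the city codes are taken from the column table, and the demo model is fitted on four made-up rows so the example runs on its own. In the notebook, `city_models` from Model 2 would be passed in instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["room_type", "room_shared_bool", "room_private_bool", "person_capacity",
            "bedrooms", "dist", "metro_dist", "Is_weekend_bool"]

# City codes as listed in the column table; lowercase keys are a hypothetical convention.
CITY_IDS = {"amsterdam": 1, "athens": 2, "barcelona": 3, "berlin": 4, "budapest": 5,
            "lisbon": 6, "london": 7, "paris": 8, "rome": 9, "vienna": 10}


def suggest_price(city_models, city_name, **features):
    """Return a suggested price from the per-city model, rounded to two decimals."""
    city_id = CITY_IDS.get(city_name.strip().lower())
    if city_id is None or city_id not in city_models:
        raise ValueError(f"No model available for city: {city_name!r}")
    missing = [name for name in FEATURES if name not in features]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # A one-row DataFrame keeps the column names the model was fitted with.
    row = pd.DataFrame([[features[name] for name in FEATURES]], columns=FEATURES)
    return round(float(city_models[city_id].predict(row)[0]), 2)


# Made-up training rows so the sketch is self-contained; these are not real listings.
demo_rows = pd.DataFrame(
    [[1, 0, 1, 2, 1, 5.0, 2.5, 0],
     [2, 0, 0, 4, 2, 1.0, 0.5, 1],
     [1, 0, 1, 2, 1, 3.0, 1.0, 0],
     [2, 0, 0, 6, 3, 2.0, 0.8, 1]],
    columns=FEATURES,
)
demo_prices = [190.0, 420.0, 230.0, 560.0]
demo_models = {1: LinearRegression().fit(demo_rows, demo_prices)}

price = suggest_price(
    demo_models, "Amsterdam",
    room_type=1, room_shared_bool=0, room_private_bool=1, person_capacity=2,
    bedrooms=1, dist=5.0, metro_dist=2.5, Is_weekend_bool=0,
)
print("Suggested price:", price)
```

Passing a named DataFrame row also avoids the "X does not have valid feature names" warning that is suppressed earlier in the notebook.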