### Now I want to build a model and see how the apartment price would go

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('data/data.csv')
df_dummy = pd.get_dummies(df)

#### I used a linear regression model on the dummy-encoded dataframe and got the results below:
##### RMSE: 1163122.1221026003
##### Median salary for data analysts in Stockholm: 34175
##### Monthly savings: 10790.675
##### Predicted down payment based on 15% of apartment price: 3564620.5846851994
##### Months to save up for down payment: 330.3426879861732
##### Years to save up for down payment: 27.528557332181098
##### Years to save up for down payment with salary growth: 18

##### The RMSE of about 1,163,122 SEK is quite high, which suggests that either (1) the data contains outliers or (2) the model does not predict the apartment prices accurately.


##### Let's take a look at the outliers

In [3]:
print(df_dummy.price.max())
print(df_dummy.price.min())
print(df_dummy.price.median())

25900000
2070000
4850000.0


In [4]:
print(df_dummy.size_in_sqm.max())
print(df_dummy.size_in_sqm.min())
print(df_dummy.size_in_sqm.median())

249.0
24.0
48.0


##### The quick check shows that there are outliers in the data: the median listing costs 4,850,000 SEK and measures 48 square metres, while the highest price is 25,900,000 SEK and the largest size is 249 square metres, both far above the median.

##### Next I want to identify the outliers in my data so that the predictions become more reasonable.
##### To do that, I use the interquartile range (IQR) method.

##### I considered both the IQR and the z-score for identifying and handling the outliers. The earlier visualisation showed that the prices in each area are skewed, though only moderately, so I chose the IQR method: it relies on quartiles rather than on the mean and standard deviation, which makes it less sensitive to extreme values than the z-score.
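
##### To illustrate the difference between the two methods, here is a minimal sketch on a made-up sample of ten prices. The sample values and the helper names `iqr_outliers` and `zscore_outliers` are assumptions for illustration only; they are not taken from the collected data.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Hypothetical prices in SEK: eight ordinary listings and two expensive ones
prices = np.array([3.9, 4.2, 4.5, 4.8, 4.9, 5.1, 5.4, 5.8, 12.0, 25.9]) * 1e6

iqr_flags = iqr_outliers(prices)
z_flags = zscore_outliers(prices)

print("Flagged by IQR:    ", prices[iqr_flags])
print("Flagged by z-score:", prices[z_flags])
```

##### In this tiny sample the two expensive listings inflate the standard deviation so much that no z-score exceeds 3, while the quartile-based bounds still flag both of them. The effect is exaggerated by the small sample size, but it shows why extreme values can hide themselves from the z-score.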

In [5]:
# calculate the 25th and 75th percentiles
q1 = np.percentile(df_dummy['price'], 25)
q3 = np.percentile(df_dummy['price'], 75)

iqr = q3 - q1

# lower bound = Q1 - 1.5 * IQR, upper bound = Q3 + 1.5 * IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df_dummy[(df_dummy['price'] < lower_bound) | (df_dummy['price'] > upper_bound)]

print('Number of outliers:', len(outliers))

Number of outliers: 26


##### There are 26 outliers. Given that small number, I checked them manually and found that all of them are priced above 9 million SEK, which is more than a junior data analyst would initially plan to save up for. They were collected because they match our other criteria: 2-room apartments in the 4 popular areas. Let's locate them, create a test dataset without them, and then run the prediction again.

In [6]:
df_no_outlier = df.drop(df[(df['price'] < lower_bound) | (df['price'] > upper_bound)].index)
df_no_outlier.to_csv("no_outlier_price.csv", index=False)

In [7]:
df_cleaned = pd.read_csv("no_outlier_price.csv")
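
##### The same filtering can also be done in memory, without writing and re-reading a CSV file. The sketch below uses a small made-up dataframe and hypothetical bounds (`listings`, `lower_bound` and `upper_bound` here are illustrative values, not the ones computed above).

```python
import pandas as pd

# Hypothetical listings; the values are made up for illustration
listings = pd.DataFrame({
    "price": [2_500_000, 4_850_000, 5_200_000, 9_500_000, 25_900_000],
    "size_in_sqm": [30.0, 48.0, 52.0, 95.0, 249.0],
})

# Hypothetical bounds standing in for the IQR-based ones
lower_bound, upper_bound = 2_000_000, 9_000_000

# between() is inclusive on both ends by default, so rows exactly on a
# bound are kept, just as with the `<` / `>` comparison used for dropping
inside = listings["price"].between(lower_bound, upper_bound)
listings_clean = listings[inside].reset_index(drop=True)

print(listings_clean)
```

##### Resetting the index gives the same consecutive row numbers that a round trip through `to_csv(index=False)` and `read_csv` would produce.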

##### Now I try the linear regression model again, this time on the cleaned data.

In [8]:
median_salary = 34175  # median monthly salary for a junior data analyst in Stockholm

# Calculate the monthly savings
# 0.781 is treated here as the share of salary kept after tax; 15900 SEK is the assumed monthly living cost in Stockholm
monthly_savings = median_salary * 0.781 - 15900

# Set the yearly salary growth rate
salary_growth_rate = 0.05

# Calculate the predicted down payment based on 15% of the apartment price
def calculate_down_payment(price):
    return price * 0.15

# Define the features and target variable
X = df_cleaned[['size_in_sqm', 'monthly_fee']]
y = df_cleaned['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Use the model to make predictions on the testing set
y_pred = lr.predict(X_test)

# Calculate the root mean squared error (RMSE) of the model on the testing set
mse = ((y_test - y_pred) ** 2).mean()
rmse = mse ** 0.5

# Print the RMSE
print('RMSE:', rmse)

# Predict the prices of apartments in each area based on the square meters and fees
prices = lr.predict(df_cleaned[['size_in_sqm', 'monthly_fee']])

# Add the predicted prices to the DataFrame
df_cleaned = df_cleaned.assign(predicted_price=prices)

# Calculate the predicted down payment for each apartment
df_cleaned = df_cleaned.assign(predicted_down_payment=df_cleaned['predicted_price'].apply(calculate_down_payment))

# Calculate how many months it will take to save up for the down payment
df_cleaned = df_cleaned.assign(months_to_save=df_cleaned['predicted_down_payment'] / monthly_savings)

# Calculate the years it will take to save up for the down payment
df_cleaned = df_cleaned.assign(years_to_save=df_cleaned['months_to_save'] / 12)

# Calculate the years it will take to save up for the down payment with salary growth
# (note: the loop below applies the growth rate to the accumulated savings each year)
current_savings = 0
years_to_save_with_growth = 0
while current_savings < df_cleaned['predicted_down_payment'].max():
    current_savings += monthly_savings * 12
    current_savings *= (1 + salary_growth_rate)
    years_to_save_with_growth += 1

# Print the results
print('Median salary for data analysts in Stockholm:', median_salary)
print('Monthly savings:', monthly_savings)
print('Predicted down payment based on 15% of apartment price:', df_cleaned['predicted_down_payment'].max())
print('Months to save up for down payment:', df_cleaned['months_to_save'].max())
print('Years to save up for down payment:', df_cleaned['years_to_save'].max())
print('Years to save up for down payment with salary growth:', years_to_save_with_growth)

RMSE: 1044296.3628306319
Median salary for data analysts in Stockholm: 34175
Monthly savings: 10790.675
Predicted down payment based on 15% of apartment price: 1350677.7044825382
Months to save up for down payment: 125.17082615151863
Years to save up for down payment: 10.430902179293218
Years to save up for down payment with salary growth: 9
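
##### The "with salary growth" figure depends on how the growth is applied. As an alternative sketch, the function below raises the salary itself by the growth rate each year and keeps living costs constant. The function name `years_to_save`, its parameters and the constant-cost assumption are mine, not part of the notebook's calculation above.

```python
def years_to_save(target, monthly_salary, net_share=0.781,
                  monthly_costs=15900, salary_growth=0.05, max_years=100):
    """Count whole years until cumulative savings reach `target`.

    Assumptions (illustrative only): the salary grows once per year,
    living costs stay constant, and savings earn no interest.
    """
    saved = 0.0
    salary = monthly_salary
    for year in range(1, max_years + 1):
        # Savings for this year: net salary minus living costs, times 12 months
        saved += (salary * net_share - monthly_costs) * 12
        if saved >= target:
            return year
        salary *= 1 + salary_growth
    return None  # target not reached within max_years

# Down payment and salary taken from the output above
flat = years_to_save(1_350_677.70, 34175, salary_growth=0.0)
growing = years_to_save(1_350_677.70, 34175)

print("Years without salary growth:", flat)
print("Years with 5% salary growth:", growing)
```

##### Because the function counts whole years, the result without growth is the "Years to save up" value above rounded up to the next full year.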


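##### Compared with the earlier linear regression on the data with outliers, the results on the cleaned data differ a lot: the RMSE drops from about 1,163,122 to about 1,044,296, and the largest predicted down payment falls from about 3.56 million to about 1.35 million SEK.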

##### That takes care of the first possible cause of the high RMSE, the outliers in my data. The second possible cause is that the model does not predict the apartment prices accurately.

##### Now I train and evaluate different models to see which one suits my data best.

##### I chose 3 models that are commonly used for regression problems and have been shown to perform well in many cases.

In [None]:
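# split the data into training and testing sets
# use the same two numeric features as above: dropping only 'price' would leave text columns
# such as 'area' and the columns derived from the earlier predictions in the feature set
features = ['size_in_sqm', 'monthly_fee']
X_train, X_test, y_train, y_test = train_test_split(df_cleaned[features], df_cleaned['price'], test_size=0.3, random_state=42)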

# train and evaluate different models
models = [
    ('Linear Regression', LinearRegression()),
    ('Decision Tree', DecisionTreeRegressor()),
    ('Random Forest', RandomForestRegressor())
]

for name, model in models:
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    
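    # Evaluate the model (taking the square root of the MSE works across
    # scikit-learn versions; the squared=False argument is deprecated in recent releases)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))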
    r2 = r2_score(y_test, y_pred)
    print(f"{name} RMSE: {rmse:.2f}, R-squared: {r2:.2f}")


##### The Random Forest model appears to have the lowest RMSE and the highest R-squared. A higher R-squared means the model explains more of the variance in price, and a lower RMSE means smaller prediction errors, so let's fit the data with the Random Forest model and see.
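
##### A single train/test split can favour one model by chance, so a cross-validated comparison is a useful sanity check. The sketch below runs on synthetic data, because the real dataset is not reproduced here; the generated `size`, `fee` and `price` arrays and their coefficients are arbitrary assumptions, and the resulting scores say nothing about the real listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the listings (arbitrary coefficients and noise)
rng = np.random.default_rng(42)
n = 200
size = rng.uniform(24, 90, n)        # size in square metres
fee = rng.uniform(1500, 5000, n)     # monthly fee in SEK
price = 80_000 * size - 100 * fee + rng.normal(0, 300_000, n)
X = np.column_stack([size, fee])

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validation; the scorer returns negative RMSE, so flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, price, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name} mean CV RMSE: {cv_rmse[name]:.0f}")
```

##### On the real data, `X` and `price` would be replaced by the feature columns and the `price` column of the cleaned dataframe.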

In [10]:
q1 = np.percentile(df['price'], 25)
q3 = np.percentile(df['price'], 75)

iqr = q3 - q1

# lower bound = Q1 - 1.5 * IQR, upper bound = Q3 + 1.5 * IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]

print('Number of outliers:', len(outliers))

Number of outliers: 26


In [11]:
areas = ['Södermalm', 'Årsta', 'Östermalm', 'Vasastan', 'Fredhäll']
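# .copy() makes df_areas an independent dataframe, so adding columns below
# does not trigger pandas' SettingWithCopyWarning
df_areas = df_cleaned[df_cleaned['area'].isin(areas)].copy()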

def calculate_down_payment(price):
    return 0.15 * price

def calculate_time_to_save(down_payment, annual_salary=410100, annual_savings=None, growth_rate=0.05):
    if annual_savings is None:
        annual_savings = annual_salary * 0.781 - 15900 * 12
    current_savings = 0
    years_to_save_with_growth = 0
    while current_savings < down_payment:
        current_savings += annual_savings
        current_savings *= (1 + growth_rate)
        years_to_save_with_growth += 1
    return years_to_save_with_growth

X = df_areas[['size_in_sqm', 'monthly_fee']]
y = df_areas['price']

# train the Random Forest model
rf = RandomForestRegressor()
rf.fit(X, y)

# Predict the prices of apartments in each area based on the square meters and monthly fees
# (these are in-sample predictions: the model predicts the same rows it was trained on)
df_areas['predicted_price'] = rf.predict(X)

# Calculate the down payment for each apartment
df_areas['down_payment'] = df_areas['predicted_price'].apply(calculate_down_payment)

# Calculate the time required to save up for the down payment for each apartment
df_areas['time_to_save'] = df_areas['down_payment'].apply(calculate_time_to_save)

for area in areas:
    df_area = df_areas[df_areas['area'] == area]
    avg_price = df_area['predicted_price'].mean()
    avg_down_payment = df_area['down_payment'].mean()
    avg_time_to_save = df_area['time_to_save'].mean()
    print(f"Area: {area}")
    print(f"Average predicted price: {avg_price}")
    print(f"Average down payment: {avg_down_payment}")
    print(f"Average time to save up for the down payment: {avg_time_to_save} years\n")

Area: Södermalm
Average predicted price: 4956536.397652598
Average down payment: 743480.4596478897
Average time to save up for the down payment: 5.408163265306122 years

Area: Årsta
Average predicted price: 3447215.2604826204
Average down payment: 517082.289072393
Average time to save up for the down payment: 4.105442176870748 years

Area: Östermalm
Average predicted price: 6055774.183059587
Average down payment: 908366.1274589382
Average time to save up for the down payment: 6.367647058823529 years

Area: Vasastan
Average predicted price: 5479117.068113127
Average down payment: 821867.5602169691
Average time to save up for the down payment: 5.901464713715047 years

Area: Fredhäll
Average predicted price: 4005338.0809705807
Average down payment: 600800.7121455871
Average time to save up for the down payment: 4.5675675675675675 years
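
##### The per-area averages can also be produced in one step with `groupby`, which returns a table that is easier to compare than the printed blocks. The sketch below uses a tiny made-up dataframe (`results` and its values are illustrative assumptions, not the model output above).

```python
import pandas as pd

# Hypothetical predictions for two areas; the values are made up
results = pd.DataFrame({
    "area": ["Södermalm", "Södermalm", "Årsta", "Årsta"],
    "predicted_price": [5_000_000.0, 4_800_000.0, 3_400_000.0, 3_600_000.0],
})
results["down_payment"] = 0.15 * results["predicted_price"]

# One row per area, with the mean of each numeric column rounded to whole SEK
summary = (
    results.groupby("area")[["predicted_price", "down_payment"]]
    .mean()
    .round(0)
)

print(summary)
```

##### Rounding to whole SEK also avoids the long decimal tails seen in the printed averages.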


