<h3>Imports</h3>

In [2]:
import duckdb
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

<h3>Extracting data from DB</h3>

In [4]:
con = duckdb.connect("../../../dataManagementBackbone/data/exploitation/crimesPrices.db")
df = con.execute("SELECT * FROM crimesPrices").df()

<h3>Scaling numeric attributes</h3>

We are only interested in scaling the numeric variables that will be used as explanatory variables in the regression model.

In [5]:
cols = df.iloc[:,7:].select_dtypes(np.number).columns
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
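The cell above applies min-max scaling, which maps each numeric column onto the range [0, 1] by subtracting the column minimum and dividing by the column range. A minimal sketch on a toy frame follows; the `toy` data and its values are invented for illustration and are not taken from `crimesPrices`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the numeric explanatory columns.
toy = pd.DataFrame({"Theft Count": [10, 20, 40], "Drugs Count": [0, 5, 10]})

# Min-max scaling: subtract each column's minimum, then divide by its range.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, the smallest value in each column is 0 and the largest is 1, so coefficients of different crime counts can be compared on a common scale.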

<h3>Correlation matrix</h3>

In [34]:
corr = df.corr(numeric_only=True)  # skip the non-numeric Month and District columns
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,AveragePrice,AveragePriceDetached,AveragePriceSemiDetached,AveragePriceTerraced,AveragePriceFlatOrMaisonette,Economically inactive,Employed,Unemployed,Anti-social behaviour Count,Anti-social behaviour Lon,Anti-social behaviour Lat,Criminal damage and arson Count,Criminal damage and arson Lon,Criminal damage and arson Lat,Violence and sexual offences Count,Violence and sexual offences Lon,Violence and sexual offences Lat,Public order Count,Public order Lon,Public order Lat,Other crime Count,Other crime Lon,Other crime Lat,Theft Count,Theft Lon,Theft Lat,Burglary Count,Burglary Lon,Burglary Lat,Vehicle crime Count,Vehicle crime Lon,Vehicle crime Lat,Drugs Count,Drugs Lon,Drugs Lat,Possession of weapons Count,Possession of weapons Lon,Possession of weapons Lat,Shoplifting Count,Shoplifting Lon,Shoplifting Lat,Robbery Count,Robbery Lon,Robbery Lat
AveragePrice,1.0,0.754518,0.799372,0.876338,0.509647,0.375022,0.750531,-0.125724,-0.361809,-0.132157,-0.415141,-0.384998,-0.120898,-0.412096,-0.090093,-0.141427,-0.41353,-0.15935,-0.118068,-0.428707,-0.399378,-0.130743,-0.437537,-0.141636,-0.131562,-0.420799,-0.141636,-0.131562,-0.420799,-0.18199,-0.139156,-0.413158,-0.598792,-0.128117,-0.45547,-0.485104,-0.109257,-0.427515,-0.465646,-0.13437,-0.421818,-0.485104,-0.109257,-0.427515
AveragePriceDetached,0.754518,1.0,0.959234,0.942871,0.858541,0.280017,0.784468,0.252854,-0.069964,-0.475801,-0.403587,0.066424,-0.455382,-0.387695,0.044689,-0.478924,-0.391706,0.036669,-0.455777,-0.40413,-0.054959,-0.486677,-0.406712,0.176977,-0.477295,-0.38718,0.176977,-0.477295,-0.38718,0.138345,-0.476022,-0.389868,-0.157577,-0.464474,-0.403474,-0.002789,-0.442527,-0.400764,0.15027,-0.458403,-0.367996,-0.002789,-0.442527,-0.400764
AveragePriceSemiDetached,0.799372,0.959234,1.0,0.984409,0.856918,0.395619,0.839562,0.339609,0.025567,-0.411362,-0.377614,0.113412,-0.384874,-0.368428,0.076566,-0.41356,-0.368359,0.077041,-0.387009,-0.381393,-0.067748,-0.420451,-0.377718,0.212274,-0.408028,-0.367925,0.212274,-0.408028,-0.367925,0.181739,-0.410164,-0.367572,-0.14664,-0.393548,-0.38755,0.013331,-0.365848,-0.380626,0.120124,-0.378291,-0.357921,0.013331,-0.365848,-0.380626
AveragePriceTerraced,0.876338,0.942871,0.984409,1.0,0.805358,0.357336,0.844232,0.199539,-0.078462,-0.406092,-0.418255,-0.0459,-0.382687,-0.408688,0.016718,-0.410368,-0.410091,-0.004078,-0.38424,-0.423967,-0.197876,-0.413657,-0.425698,0.095395,-0.403636,-0.410512,0.095395,-0.403636,-0.410512,0.076044,-0.406811,-0.409342,-0.282338,-0.394145,-0.434705,-0.134522,-0.365253,-0.425019,-0.038193,-0.382629,-0.403774,-0.134522,-0.365253,-0.425019
AveragePriceFlatOrMaisonette,0.509647,0.858541,0.856918,0.805358,1.0,0.397458,0.752122,0.538228,0.256562,-0.68345,0.037585,0.395247,-0.660236,0.053623,0.126622,-0.678112,0.050712,0.168025,-0.658298,0.03775,0.15349,-0.684821,0.042428,0.335302,-0.67338,0.056684,0.335302,-0.67338,0.056684,0.299894,-0.677595,0.052059,0.111812,-0.642555,0.040823,0.251392,-0.625829,0.031594,0.411993,-0.629177,0.071592,0.251392,-0.625829,0.031594
Economically inactive,0.375022,0.280017,0.395619,0.357336,0.397458,1.0,0.702852,0.661815,0.508184,0.164504,0.230277,0.456117,0.169562,0.223015,0.209585,0.165397,0.236802,0.249433,0.187972,0.219404,0.434276,0.180187,0.247803,0.40985,0.170799,0.22806,0.40985,0.170799,0.22806,0.249456,0.160706,0.238722,0.127151,0.222769,0.192636,0.090089,0.227209,0.213315,0.099521,0.216325,0.232368,0.090089,0.227209,0.213315
Employed,0.750531,0.784468,0.839562,0.844232,0.752122,0.702852,1.0,0.441407,0.259739,-0.27466,-0.232732,0.141749,-0.258554,-0.224259,0.080843,-0.278094,-0.220909,0.081981,-0.247839,-0.241734,0.038935,-0.272575,-0.229645,0.216923,-0.273515,-0.221846,0.216923,-0.273515,-0.221846,0.144763,-0.27808,-0.219061,-0.097741,-0.237684,-0.262161,-0.082643,-0.208516,-0.253689,0.017257,-0.235038,-0.213303,-0.082643,-0.208516,-0.253689
Unemployed,-0.125724,0.252854,0.339609,0.199539,0.538228,0.661815,0.441407,1.0,0.831822,-0.109663,0.287594,0.920071,-0.088585,0.285297,0.348701,-0.098717,0.297316,0.465955,-0.083026,0.290271,0.694102,-0.108602,0.327565,0.689286,-0.099464,0.29861,0.689286,-0.099464,0.29861,0.599474,-0.100966,0.29822,0.697312,-0.050698,0.296197,0.693245,-0.040937,0.28496,0.764001,-0.028504,0.310125,0.693245,-0.040937,0.28496
Anti-social behaviour Count,-0.361809,-0.069964,0.025567,-0.078462,0.256562,0.508184,0.259739,0.831822,1.0,-0.031054,0.271343,0.773229,-0.019199,0.267289,0.274011,-0.021038,0.281682,0.384394,-0.013484,0.276294,0.610353,-0.029276,0.311782,0.537631,-0.024444,0.284825,0.537631,-0.024444,0.284825,0.493847,-0.025474,0.283584,0.766891,0.008795,0.28308,0.613751,0.025984,0.263458,0.643496,0.039052,0.289374,0.613751,0.025984,0.263458
Anti-social behaviour Lon,-0.132157,-0.475801,-0.411362,-0.406092,-0.68345,0.164504,-0.27466,-0.109663,-0.031054,1.0,-0.233571,-0.136946,0.99844,-0.261451,0.044541,0.998125,-0.247196,0.030417,0.998925,-0.243443,0.000477,0.998483,-0.214601,-0.022875,0.997846,-0.263237,-0.022875,0.997846,-0.263237,-0.09993,0.998099,-0.248901,0.082398,0.995155,-0.257648,-0.095323,0.990528,-0.2331,-0.254376,0.992827,-0.277046,-0.095323,0.990528,-0.2331


From this correlation matrix, we can see that every crime count is negatively correlated with the average price, although for some categories the correlation is weak (for example, -0.09 for violence and sexual offences, against -0.60 for drugs). Correlation does not imply causation, so we cannot say that an increase in the number of crimes lowers the price. We can only say that there seems to be a relationship between the number of crimes and the average price of properties per district.

We can also observe that the number of employed residents shows a positive relationship with price (0.75), while the number of unemployed residents shows a weak negative one (-0.13). This is to be expected, as the areas with better, more expensive houses are usually populated by people with well-paid jobs.
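To read these relationships without scanning the whole matrix, one option is to take only the `AveragePrice` column of the correlation matrix and sort it. The sketch below uses a small invented frame (`toy`, with made-up values) in place of `df`; only the column names are borrowed from the data set.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the real data set.
toy = pd.DataFrame({
    "AveragePrice": [200, 180, 160, 150],
    "Drugs Count": [1, 2, 3, 4],
    "Employed": [50, 45, 42, 40],
})

# Correlation of every other column with the target, most negative first.
price_corr = toy.corr()["AveragePrice"].drop("AveragePrice").sort_values()
print(price_corr)
```

Applied to the real `corr` matrix, the same indexing would list the crime counts and employment variables ordered by the strength and sign of their correlation with price.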

<h2>District Only</h2>

<h3>Splitting dataset</h3>

In [36]:
y = df.AveragePrice
X = pd.get_dummies(df.iloc[:, 1])


In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [38]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)


R-squared train
0.8729482303470458

R-squared test
0.865126736169055

Prediction error

Mean Absolute Error
9311.607142857143

Root Mean Squared Error
11209.980506864662

Coefficients


Unnamed: 0,Boston,East Lindsey,Lincoln,North Kesteven,South Holland,South Kesteven,West Lindsey
0,4.822189e+16,4.822189e+16,4.822189e+16,4.822189e+16,4.822189e+16,4.822189e+16,4.822189e+16


-4.822188791563829e+16
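The seven district coefficients above are all about 4.8e16 and the intercept is about -4.8e16. This is most likely a symptom of the dummy variable trap: the seven dummies sum to one in every row, so they are perfectly collinear with the intercept and the individual coefficients are not identifiable, even though the predictions remain valid. A minimal sketch of one common remedy, dropping one dummy level so that it becomes the baseline, is shown below on invented data (`toy` and its prices are hypothetical).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: three districts with different mean prices.
toy = pd.DataFrame({
    "District": ["A", "A", "B", "B", "C", "C"],
    "AveragePrice": [100.0, 102.0, 150.0, 148.0, 200.0, 202.0],
})

# drop_first=True removes the dummy for district "A", making it the baseline.
X_demo = pd.get_dummies(toy["District"], drop_first=True).astype(float)
reg_demo = LinearRegression().fit(X_demo, toy["AveragePrice"])

# The intercept is the baseline district's mean price; each coefficient is
# the difference between that district's mean price and the baseline's.
print(reg_demo.intercept_)
print(dict(zip(X_demo.columns, reg_demo.coef_)))
```

With this encoding the R-squared and predictions are unchanged, but the coefficients become interpretable as price differences relative to the baseline district.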


<h2>Crimes Only</h2>

<h3>Splitting dataset</h3>

In [42]:
y = df.AveragePrice
dropCols = [s for s in df.columns.to_list() if 'Lon' in s or 'Lat' in s] + ['Month', 'District', 'AveragePrice', 'AveragePriceDetached', 'AveragePriceSemiDetached', 'AveragePriceTerraced', 'AveragePriceFlatOrMaisonette', 'Economically inactive', 'Employed', 'Unemployed']
X = df.drop(dropCols, axis=1)
numCols = X.select_dtypes('number').columns.to_list()
X[numCols] = (X[numCols]-X[numCols].min())/(X[numCols].max()-X[numCols].min())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [43]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)


R-squared train
0.7031696313644349

R-squared test
0.6424301534571597

Prediction error

Mean Absolute Error
15410.342013319987

Root Mean Squared Error
18252.500450726588

Coefficients


Unnamed: 0,Anti-social behaviour Count,Criminal damage and arson Count,Violence and sexual offences Count,Public order Count,Other crime Count,Theft Count,Burglary Count,Vehicle crime Count,Drugs Count,Possession of weapons Count,Shoplifting Count,Robbery Count
0,5864.723172,-34144.692765,-31116.954056,-27531.345193,-79072.294104,152846.969036,150103.881478,3381.403127,-74359.511286,24557.327927,-57783.959671,-2886.595887


210809.38282499017


<h2>Crimes And District</h2>

<h3>Splitting dataset</h3>

In [44]:
y = df.AveragePrice
dropCols = [s for s in df.columns.to_list() if 'Lon' in s or 'Lat' in s] + ['Month', 'AveragePrice', 'AveragePriceDetached', 'AveragePriceSemiDetached', 'AveragePriceTerraced', 'AveragePriceFlatOrMaisonette', 'Economically inactive', 'Employed', 'Unemployed']
X = df.drop(dropCols, axis=1)
numCols = X.select_dtypes('number').columns.to_list()
X[numCols] = (X[numCols]-X[numCols].min())/(X[numCols].max()-X[numCols].min())
X = X.join(pd.get_dummies(X['District'])).iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [45]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)


R-squared train
0.9713456860409166

R-squared test
0.9720420595933867

Prediction error

Mean Absolute Error
3854.4150345288353

Root Mean Squared Error
5103.8101493218155

Coefficients


Unnamed: 0,Anti-social behaviour Count,Criminal damage and arson Count,Violence and sexual offences Count,Public order Count,Other crime Count,Theft Count,Burglary Count,Vehicle crime Count,Drugs Count,Possession of weapons Count,Shoplifting Count,Robbery Count,Boston,East Lindsey,Lincoln,North Kesteven,South Holland,South Kesteven,West Lindsey
0,-54170.978457,-54018.511559,32859.713305,29635.236108,-2540.452516,34965.214455,39043.973354,6315.495307,49848.177785,17095.954928,24790.519176,19329.219176,-41970.214869,338.843594,-70960.889413,52079.735511,18471.016053,34132.946547,7908.562576


176209.1137438481


<h2>Crimes, District and Month</h2>

<h3>Splitting dataset</h3>

In [8]:
X.columns

Index(['Anti-social behaviour Count', 'Criminal damage and arson Count',
       'Violence and sexual offences Count', 'Public order Count',
       'Other crime Count', 'Theft Count', 'Burglary Count',
       'Vehicle crime Count', 'Drugs Count', 'Possession of weapons Count',
       'Shoplifting Count', 'Robbery Count', '2021-06', '2021-07', '2021-08',
       '2021-09', '2021-10', '2021-11', '2021-12', '2022-01', '2022-02',
       '2022-03', '2022-04', '2022-05', 'Boston', 'East Lindsey', 'Lincoln',
       'North Kesteven', 'South Holland', 'South Kesteven', 'West Lindsey'],
      dtype='object')

In [6]:
y = df.AveragePrice
dropCols = [s for s in df.columns.to_list() if 'Lon' in s or 'Lat' in s] + ['AveragePrice', 'AveragePriceDetached', 'AveragePriceSemiDetached', 'AveragePriceTerraced', 'AveragePriceFlatOrMaisonette', 'Economically inactive', 'Employed', 'Unemployed']
X = df.drop(dropCols, axis=1)
numCols = X.select_dtypes('number').columns.to_list()
X[numCols] = (X[numCols]-X[numCols].min())/(X[numCols].max()-X[numCols].min())
X = X.join(pd.get_dummies(X['Month'])).iloc[:, 1:]
X = X.join(pd.get_dummies(X['District'])).iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [7]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)

R-squared train
0.9810522885849212

R-squared test
0.9782372748691788

Prediction error

Mean Absolute Error
3285.891351602863

Root Mean Squared Error
4502.964905076852

Coefficients


Unnamed: 0,Anti-social behaviour Count,Criminal damage and arson Count,Violence and sexual offences Count,Public order Count,Other crime Count,Theft Count,Burglary Count,Vehicle crime Count,Drugs Count,Possession of weapons Count,Shoplifting Count,Robbery Count,2021-06,2021-07,2021-08,2021-09,2021-10,2021-11,2021-12,2022-01,2022-02,2022-03,2022-04,2022-05,Boston,East Lindsey,Lincoln,North Kesteven,South Holland,South Kesteven,West Lindsey
0,-15642.150937,-59420.591078,37511.224904,34485.22622,-12564.553601,33936.649957,37157.035088,8859.543693,31421.53471,19462.396075,-8186.182737,21422.09707,-7428.832174,-7470.609321,-9634.445416,-5524.186314,4510.286653,-2564.808878,1570.896158,-1143.004155,3527.194264,5042.613487,7726.627063,11388.268633,-34430.038135,-5593.976478,-58824.369817,42246.01019,11453.618055,36432.764124,8715.992061


180170.4873403276


<h2>Crimes, District and Economic Activity</h2>

<h3>Splitting dataset</h3>

In [8]:
y = df.AveragePrice
dropCols = [s for s in df.columns.to_list() if 'Lon' in s or 'Lat' in s] + ['Month', 'AveragePrice', 'AveragePriceDetached', 'AveragePriceSemiDetached', 'AveragePriceTerraced', 'AveragePriceFlatOrMaisonette']
X = df.drop(dropCols, axis=1)
numCols = X.select_dtypes('number').columns.to_list()
X[numCols] = (X[numCols]-X[numCols].min())/(X[numCols].max()-X[numCols].min())
X = X.join(pd.get_dummies(X['District'])).iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [9]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)

R-squared train
0.9724487886794692

R-squared test
0.9707748970123579

Prediction error

Mean Absolute Error
3971.7326370234478

Root Mean Squared Error
5218.190729651223

Coefficients


Unnamed: 0,Economically inactive,Employed,Unemployed,Anti-social behaviour Count,Criminal damage and arson Count,Violence and sexual offences Count,Public order Count,Other crime Count,Theft Count,Burglary Count,Vehicle crime Count,Drugs Count,Possession of weapons Count,Shoplifting Count,Robbery Count,Boston,East Lindsey,Lincoln,North Kesteven,South Holland,South Kesteven,West Lindsey
0,35609.960504,7970.060179,4840.013595,-54753.15264,-59252.076544,41516.351022,39302.04255,-3954.208715,36902.546052,37758.292531,10609.488627,53906.87695,17895.919446,24325.940077,17305.051679,-22330.533785,-23111.36193,-73386.194426,52565.190434,26796.934968,22713.162443,16752.802295


151436.38054993085


<h2>Crimes, District, Month and Economic Activity</h2>

<h3>Splitting dataset</h3>

In [62]:
y = df.AveragePrice
dropCols = [s for s in df.columns.to_list() if 'Lon' in s or 'Lat' in s] + ['AveragePrice', 'AveragePriceDetached', 'AveragePriceSemiDetached', 'AveragePriceTerraced', 'AveragePriceFlatOrMaisonette']
X = df.drop(dropCols, axis=1)
numCols = X.select_dtypes('number').columns.to_list()
X[numCols] = (X[numCols]-X[numCols].min())/(X[numCols].max()-X[numCols].min())
X = X.join(pd.get_dummies(X['Month'])).iloc[:, 1:]
X = X.join(pd.get_dummies(X['District'])).iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

<h3>Multiple Linear Regression</h3>

In [63]:
reg = LinearRegression().fit(X_train, y_train)
print('R-squared train')
print(reg.score(X_train, y_train))

print('\nR-squared test')
print(reg.score(X_test, y_test))

print('\nPrediction error')
print('\nMean Absolute Error')
print(np.sum(abs(y_test - reg.predict(X_test))) / len(y_test))
print('\nRoot Mean Squared Error')
print(math.sqrt(np.sum((y_test - reg.predict(X_test))*(y_test - reg.predict(X_test))) / len(y_test)))

print('\nCoefficients')
coef = pd.DataFrame([reg.coef_], columns=list(X.columns))
pd.set_option('display.max_columns', None)
display(coef)
print(reg.intercept_)

R-squared train
0.9824114535373675

R-squared test
0.9769292290231852

Prediction error

Mean Absolute Error
3610.8577424125288

Root Mean Squared Error
4636.315455719247

Coefficients


Unnamed: 0,Economically inactive,Employed,Unemployed,Anti-social behaviour Count,Criminal damage and arson Count,Violence and sexual offences Count,Public order Count,Other crime Count,Theft Count,Burglary Count,Vehicle crime Count,Drugs Count,Possession of weapons Count,Shoplifting Count,Robbery Count,2021-06,2021-07,2021-08,2021-09,2021-10,2021-11,2021-12,2022-01,2022-02,2022-03,2022-04,2022-05,Boston,East Lindsey,Lincoln,North Kesteven,South Holland,South Kesteven,West Lindsey
0,28605.647184,15064.147998,-27235.784651,-14398.219366,-73191.798097,58794.409149,56333.162219,-20644.971461,34164.139692,34255.223278,15653.316575,34005.63893,18839.20902,-14394.5125,18200.59283,-8434.038122,-7460.945414,-9918.819776,-3877.278219,4946.516612,-2468.996053,849.78894,-2324.59527,3775.390677,5003.834473,8344.633868,11564.508283,-27399.583804,-14177.999739,-39569.417306,30835.832021,7859.524834,29416.318464,13035.32553


170325.98578598586


Although the error is larger than for the crimes, district and month model (RMSE of 4636 against 4503), we keep the months in the model because they capture a small variation in the average price over time.

In [11]:
con.close()

<h2>Conclusion</h2>
Best model: Crimes, district and month

Disclaimer: The metrics will vary with every new execution of the whole data science project. However, the general behaviour is still the same.

We begin by creating a model using just the districts as predictors. This model performed relatively well (R-squared of 0.87) given the small number of predictors. This shows that the main effect on the average price comes from the district the property is in. This makes sense because there are areas that are going to be more expensive due to the properties being bigger.

The second model uses the counts of all the crimes as predictors. This model did not perform as well as the district model (test R-squared of 0.64). However, it does show that there is a relationship between the number of crimes and the average price.

Next, we create a model using the districts and the crime counts. We did not use the longitude or latitude of the crimes because they did not prove useful. The resulting model is the best performing yet, with an R-squared of 0.97 for both training and testing, which suggests that there is no overfitting or underfitting. This is very promising. With an RMSE of 5103.81, we can say that this model is quite accurate. From now on, we will continue adding different variables to perfect the model without overfitting it.

We created models with:
<ul>
    <li>Crimes, District, Month</li>
    <ul>
        <li>R-squared training: 0.9811</li>
        <li>R-squared testing: 0.9782</li>
        <li>RMSE: 4502.96</li>
    </ul>
    <li>Crimes, District, Economic Status</li>
    <ul>
        <li>R-squared training: 0.9724</li>
        <li>R-squared testing: 0.9708</li>
        <li>RMSE: 5218.19</li>
    </ul>
    <li>Crimes, District, Month, Economic Status</li>
    <ul>
        <li>R-squared training: 0.9824</li>
        <li>R-squared testing: 0.9769</li>
        <li>RMSE: 4636.32</li>
    </ul>
</ul>

All of the models above perform well. However, the model with the lowest RMSE and the best balance between training and testing values of the coefficient of determination, R-squared, is the model with crimes, district and month. Therefore, we select this model as our final version. This means that the data discovery task wasn't useful for our analysis question. However, it could prove useful for other analyses, and it showcases that our data management backbone can scale easily with new data sources. We should mention that new data sources must fulfill certain requirements, such as sharing the same months and districts as our original data. If this isn't fulfilled, we would not be able to perform the join operation.
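The metrics compared above were obtained by repeating the same fit-and-print cell for each feature set. A minimal sketch of how that step could be wrapped in one helper is given below. The function name `evaluate` and the synthetic `X_demo` / `y_demo` data are assumptions made for illustration, and `sklearn.metrics` is used in place of the hand-written MAE and RMSE formulas from the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(X, y, test_size=0.33, random_state=0):
    """Fit a linear regression and return train/test R-squared, MAE and RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    reg = LinearRegression().fit(X_train, y_train)
    pred = reg.predict(X_test)
    return {
        "r2_train": reg.score(X_train, y_train),
        "r2_test": reg.score(X_test, y_test),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }


# Synthetic data standing in for one of the feature sets (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

metrics = evaluate(X_demo, y_demo)
print(metrics)
```

Calling such a helper once per feature set and collecting the returned dictionaries in a `pd.DataFrame` would give a side-by-side table of the models listed above.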



Finally, we should mention that the data set is relatively small and uses aggregated values which don't change much from month to month. This makes it quite easy to predict a correct value for each district. Therefore, this model can only really be used to predict the average price per district over the months. It cannot be used to predict the price of a particular property. For future improvements, we would need to change the main dataset of prices. We would need to find or create a data set with prices of individual properties and attributes such as the district, size, number of rooms and outdoor area. The model resulting from such a dataset would have the potential to be more useful than the one we have developed, provided it performed well.