In [203]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge, LogisticRegression

In [204]:
csv = pd.read_csv('Real estate.csv')
csv.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1


In [205]:
y = csv['Y house price of unit area'].to_numpy()
x = csv.drop(['X1 transaction date', 'Y house price of unit area', 'No'], axis=1).to_numpy()

In [206]:
model = Ridge(alpha=0.0, fit_intercept=True)
model_without = Ridge(alpha=0.0, fit_intercept=False)
model.fit(x, y)
model_without.fit(x, y)

In [207]:
print("θ values = ", model.coef_)

θ values =  [-2.68916833e-01 -4.25908898e-03  1.16302048e+00  2.37767191e+02
 -7.80545273e+00]


What do we learn from the estimate θ̂? Can you, for example, say which variable is more important in the pricing?

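The values of θ are printed above. The three largest coefficients in magnitude belong to the last three columns (number of convenience stores, latitude and longitude), with latitude by far the largest. Note, however, that the features are measured on very different scales: latitude varies only in its decimal places, while the distance to the nearest MRT station ranges over hundreds of units. The raw magnitude of a coefficient is therefore not by itself a reliable measure of importance; the features would need to be standardized before the coefficients can be compared directly.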

In [208]:
print("Model with intercept σ value = ", mean_squared_error(y, model.predict(x)))
print("Model with zero intercept σ value = ", mean_squared_error(y, model_without.predict(x)))

Model with intercept σ value =  79.20185189210909
Model with zero intercept σ value =  79.32492614841333


Is the intercept μ important? What happens if we ignore it?

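Without an intercept the model predicts y = 0 whenever all features are 0, so the fitted hyperplane is forced through the origin. Including μ removes this constraint and lets the model fit the data at least as well: here the mean squared error drops from about 79.32 to about 79.20. The difference is small on this dataset, but the intercept is still worth keeping, since there is no reason to expect the price to be zero when all features are zero.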
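The effect of dropping the intercept is easier to see on data with a clear offset. The following is a small sketch on synthetic data, with the offset of 30 and the helper `fit_mse` chosen only for illustration; it compares a least-squares fit through the origin with one that includes an intercept column.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=(200, 1))
# Hypothetical target with a true offset of 30 and a slope of 2.
y = 30.0 + 2.0 * x[:, 0] + rng.normal(0.0, 1.0, 200)


def fit_mse(A, y):
    """Fit least squares on design matrix A and return the training MSE."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ beta - y) ** 2))


# No intercept: the fitted line must pass through the origin.
mse_no_intercept = fit_mse(x, y)

# With intercept: prepend a column of ones to the design matrix.
mse_with_intercept = fit_mse(np.column_stack([np.ones(len(x)), x]), y)

print("MSE without intercept:", mse_no_intercept)
print("MSE with intercept:   ", mse_with_intercept)
```

Because the model with an intercept contains the model without one as a special case (μ = 0), its training error can never be higher; how much lower it is depends on how far the data sit from the origin.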

What does σ̂ mean? Is it better for it to be small or big?

σ here is the mean squared error, i.e. the average squared difference between the predicted and the true value of y. Smaller is better, since we want to minimize the loss. Comparing the σ values above, the model with an intercept fits the dataset slightly better than the model with the intercept fixed at 0.
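For reference, the quantity reported as σ can be computed by hand. This is a minimal sketch with made-up numbers; the function name `mse` is an assumption for this example. Whether σ̂ denotes the mean squared error itself or its square root depends on the convention of the course, so both are shown.

```python
import numpy as np


def mse(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

value = mse(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = value ** 0.5          # same units as y

print("MSE: ", value)
print("RMSE:", rmse)
```

The squared difference is symmetric, so swapping the two arguments gives the same number.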

In [209]:
model = Ridge(alpha=10000.0, fit_intercept=True)
model.fit(x, y)
print("For model with λ as 10000.0 ")
print("σ value = ", mean_squared_error(y, model.predict(x)))
print("θ values = ", model.coef_)

For model with λ as 10000.0 
σ value =  91.09507555938664
θ values =  [-1.98044932e-01 -6.88274450e-03  2.37091847e-01  1.08220301e-03
 -1.94787873e-04]


In [210]:
model = Ridge(alpha=100000000.0, fit_intercept=True)
model.fit(x, y)
print("For model with λ as 100000000.0 ")
print("σ value = ", mean_squared_error(y, model.predict(x)))
print("θ values = ", model.coef_)

For model with λ as 100000000.0 
σ value =  102.3367953997244
θ values =  [-1.25146484e-04 -6.30377153e-03  3.61994423e-05  1.39964188e-07
  4.46778539e-08]


For ridge regression, how does the coefficient θ̂ change with increasing λ? How does σ̂ change with it?

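As the runs above show, increasing λ shrinks the overall magnitude of θ, because the ridge penalty grows with the size of the coefficients. Most coefficients move towards zero; the latitude coefficient, for example, drops from about 237.8 at λ = 0 to about 0.001 at λ = 10000. At the same time σ increases, from about 79.2 at λ = 0 to about 91.1 at λ = 10000 and about 102.3 at λ = 100000000, since the model gives up some fit to the data in exchange for smaller coefficients.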
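The same trend can be checked over a range of λ values in one loop. This is a sketch on synthetic data (the true weights and the list of `alphas` are assumptions for illustration), using the same `Ridge` and `mean_squared_error` calls as the cells above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# Hypothetical linear target with known weights plus noise.
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0.0, 0.5, 300)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms, errors = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha, fit_intercept=True).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))          # size of θ
    errors.append(float(mean_squared_error(y, model.predict(X))))  # training MSE

for alpha, norm, err in zip(alphas, norms, errors):
    print(f"alpha={alpha:>8}: ||theta||={norm:.4f}, MSE={err:.4f}")
```

The norm of θ falls as alpha grows while the training error rises, which matches what the two runs above show on the real estate data.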

In [211]:
df12 = csv.loc[csv['X1 transaction date'] < 2013.0]
df13 = csv.loc[csv['X1 transaction date'] >= 2013.0]

In [212]:
y = df12['Y house price of unit area'].to_numpy()
x = df12.drop(['X1 transaction date', 'Y house price of unit area', 'No'], axis=1).to_numpy()
model = Ridge(alpha=0.0, fit_intercept=True)
model.fit(x, y)
print("For model with 2012 data σ value = ", mean_squared_error(y, model.predict(x)))

For model with 2012 data σ value =  37.934153591310164


In [213]:
y = df13['Y house price of unit area'].to_numpy()
x = df13.drop(['X1 transaction date', 'Y house price of unit area', 'No'], axis=1).to_numpy()
model = Ridge(alpha=0.0, fit_intercept=True)
model.fit(x, y)
print("For model with 2013 data σ value = ", mean_squared_error(y, model.predict(x)))

For model with 2013 data σ value =  92.50314353688712


Divide the dataset into the two parts related to the transactions of 2012 and 2013. Apply the models to each part separately. Do you see any difference between the two years? Are the individual models per year more reliable than one model?

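Yes. The model trained on the 2012 data fits its data much better (σ ≈ 37.9) than the model trained on the 2013 data (σ ≈ 92.5); for comparison, the single model on the full dataset gave σ ≈ 79.2. Note that these are all training errors measured on different subsets, so they show that the two years behave differently but do not by themselves establish that per-year models are more reliable.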
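A like-for-like comparison evaluates the single model and the per-year model on the same rows. The sketch below does this on synthetic data; the year labels, the weights and the shift of 1.5 for 2013 are invented for illustration. It uses `LinearRegression`, which is the unregularized counterpart of the `Ridge(alpha=0.0)` models above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
year = np.where(rng.random(400) < 0.4, 2012, 2013)  # hypothetical year labels
# Hypothetical target: same slopes in both years, but 2013 is shifted upward.
y = (X @ np.array([1.0, -2.0, 0.5])
     + np.where(year == 2013, 1.5, 0.0)
     + rng.normal(0.0, 1.0, 400))

pooled = LinearRegression().fit(X, y)

results = {}
for value in (2012, 2013):
    mask = year == value
    per_year = LinearRegression().fit(X[mask], y[mask])
    results[value] = (
        float(mean_squared_error(y[mask], pooled.predict(X[mask]))),    # single model
        float(mean_squared_error(y[mask], per_year.predict(X[mask]))),  # per-year model
    )

for value, (pooled_mse, own_mse) in results.items():
    print(f"{value}: pooled MSE={pooled_mse:.3f}, per-year MSE={own_mse:.3f}")
```

On training data the per-year model can never do worse than the pooled model on its own rows, because it minimizes exactly that error. A fair judgment of reliability would therefore need rows that neither model was fitted on.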

In [214]:
data=np.zeros((683,10))
label=np.zeros(683)
count=0


# The file for the dataset is found in the assignment
with open('breast-cancer.txt', 'r') as f:
    for x in f:
        x1=x.strip()
        y=x1.split(' ')
        label[count]=(float(y[0])-2)/2
        for k in range(10):
            h=y[k+2].split(':')
            data[count,k]=float(h[1])
        count+=1

clf = LogisticRegression().fit(data[500:], label[500:])
print("Accuracy on unnormalized data = ", clf.score(data[500:], label[500:]))
print("θ values = ", clf.coef_)

Accuracy on unnormalized data =  0.7704918032786885
θ values =  [[-1.07382720e-06  2.14965367e-11  4.98711172e-11  4.35381449e-11
   3.92419556e-11  2.02094661e-11  4.38989962e-11  4.22894155e-11
   3.71703897e-11  5.89898419e-12]]
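The parsing loop in the cell above relies on fixed token positions (`y[k+2]`). As a possible alternative, here is a sketch of a small helper that parses one line by its `index:value` pairs. The function name `parse_libsvm_line` and the example line are hypothetical, and the sketch assumes the file uses the libsvm layout `label index:value ...` with 1-based indices.

```python
def parse_libsvm_line(line, n_features=10):
    """Parse 'label idx:val idx:val ...' into (label, list of feature values)."""
    tokens = line.split()  # splits on any whitespace, so repeated spaces are harmless
    # Same mapping as in the notebook: a raw label of 2 becomes 0.0, 4 becomes 1.0.
    label = (float(tokens[0]) - 2) / 2
    features = [0.0] * n_features
    for token in tokens[1:]:
        index, value = token.split(':')
        features[int(index) - 1] = float(value)  # assumed 1-based feature indices
    return label, features


# Made-up example line with three features.
example_label, example_features = parse_libsvm_line("4  1:1000025 2:5 3:1", n_features=3)
print(example_label, example_features)
```

Reading the index from each token means a missing feature simply stays at 0.0 instead of shifting the remaining values by one column.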


In [215]:
data=np.zeros((683,10))
label=np.zeros(683)
count=0


# The file for the dataset is found in the assignment
with open('breast-cancer.txt', 'r') as f:
    for x in f:
        x1=x.strip()
        y=x1.split(' ')
        label[count]=(float(y[0])-2)/2
        for k in range(10):
            h=y[k+2].split(':')
            # Here we make the distinction between the first variable and the rest
            # You should implement the two cases
            if k==0:
                data[count,k]=float(h[1])/1000000
            else:
                data[count,k]=float(h[1])
        count+=1

clf = LogisticRegression().fit(data[:500], label[:500])
print("Accuracy on normalized data = ", clf.score(data[:500], label[:500]))
print("θ values = ", clf.coef_)

Accuracy on normalized data =  0.96
θ values =  [[ 0.10124302  0.51957796 -0.05453361  0.31650104  0.29195348  0.12337555
   0.36326314  0.35259347  0.19394253  0.43613779]]


Does the second case (i.e. preprocessing the data) give you a different solution? Why do you think that is?

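Yes. The scale of each feature directly affects the corresponding value of θ, because θ multiplies x: a feature with large values needs only a tiny coefficient to have the same effect. In the first case the first variable is several orders of magnitude larger than the others (the second case divides it by 1,000,000), and its coefficient (about -1e-6) is on a completely different scale from the remaining ones (about 1e-11). If the features are not all on the same scale, the values of θ cannot be compared for importance, since they are on different scales too. It also hurts regularization: the penalty acts on the magnitude of θ, so coefficients that must be large to have an effect are penalized more heavily than coefficients that can stay small.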
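The notebook rescales the first variable by hand. A more general option is to standardize every column before fitting; the sketch below does that with scikit-learn's `StandardScaler` inside a pipeline. This is a swapped-in technique rather than the division by 1,000,000 used above, and the data are synthetic: the ID-like column `big` and the two informative columns are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 600
big = rng.uniform(1e5, 1e7, n)         # hypothetical ID-like column, huge scale, no signal
informative = rng.normal(size=(n, 2))  # two columns that actually determine the label
X = np.column_stack([big, informative])
y = (informative.sum(axis=1) + rng.normal(0.0, 0.3, n) > 0).astype(int)

# Fit on the raw columns; the optimizer may warn about convergence here.
raw_model = LogisticRegression().fit(X, y)

# Fit on standardized columns (zero mean, unit variance per feature).
scaled_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

raw_acc = raw_model.score(X, y)
scaled_acc = scaled_model.score(X, y)
scaled_coef = scaled_model.named_steps["logisticregression"].coef_[0]

print("accuracy on raw features:   ", raw_acc)
print("accuracy on scaled features:", scaled_acc)
print("coefficients after scaling: ", scaled_coef)
```

After scaling, the coefficient of the uninformative column is small compared with the two informative ones, so the coefficients can be read as a rough ranking of the features.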

What can we learn from θ? Is there any particular gene that has a bigger effect on cancer?

With normalized data, θ indicates the relative importance of the different genes for cancer. Genes 2 and 10 have the largest coefficients (about 0.52 and 0.44), so they appear to have the biggest effect.

How good is the accuracy? Discuss what the reason is and what the practical considerations are for using the learned model.

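With normalized data the accuracy is quite high (0.96), so the model can predict cancer from the genes well. With unnormalized data the accuracy is much lower (about 0.77), so that model does not predict cancer reliably. The difference arises because unnormalized data puts the values of θ on different scales, which adversely affects regularization, as discussed earlier. Two practical caveats apply. Both accuracies are measured on the same rows the model was trained on, so they are likely optimistic. In addition, the two cases use different rows (data[500:] for the unnormalized case and data[:500] for the normalized one), so the two numbers are not directly comparable.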
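One way to get a more realistic accuracy estimate is to fit on the first 500 rows and score on the remaining 183. The sketch below shows the pattern on synthetic arrays with the same shapes as `data` and `label`; the values and the weight vector are stand-ins, not the breast cancer data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
data = rng.normal(size=(683, 10))   # synthetic stand-in with the notebook's shape
weights = np.linspace(-1.0, 1.0, 10)  # hypothetical true weights
label = (data @ weights + rng.normal(0.0, 0.5, 683) > 0).astype(float)

# Fit on the first 500 rows, hold out the remaining 183 for evaluation.
train_X, test_X = data[:500], data[500:]
train_y, test_y = label[:500], label[500:]

clf = LogisticRegression().fit(train_X, train_y)
train_acc = clf.score(train_X, train_y)
test_acc = clf.score(test_X, test_y)

print("training accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

The held-out number is the one that indicates how the model might behave on new patients; a large gap between the two would point to overfitting.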