# Morning practical 2:

In the first practical, you worked on getting the basics of hypothesis functions and gradient descent down. That was probably somewhat difficult, but with that basis in place, extending it to polynomial regression will be relatively easy. First, run the two code cells below, then move on. <br>

In [1]:
#run this cell to set things up
import ipywidgets as widgets, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
from IPython.display import display, Markdown
from sklearn.decomposition import PCA
%matplotlib notebook

In [2]:
#necessary functions you defined before:
def univariateHypothesis(x, thetas):
    predict = thetas[0] + thetas[1] * x
    return predict

def MyMSE(dataframe, thetas):
    totalSumSquares = 0
    for index, rowData in dataframe.iterrows():
        prediction = univariateHypothesis(rowData['x'], thetas)
        squareError = (prediction-rowData["y"])**2
        totalSumSquares += squareError
    meanSquaredError = totalSumSquares/len(dataframe) 
    return meanSquaredError

def gradientDescent(dataframe, thetas, alpha):
    m = len(dataframe)
    #print("m: "); print(m)
    totalErrorThetaZero = 0
    totalErrorThetaOne = 0
    for index, row in dataframe.iterrows():
    #print ([row["x"], row["y"]])
        errorThetaZero = univariateHypothesis(row["x"], thetas) - row["y"]
        errorThetaOne  = (univariateHypothesis(row["x"], thetas) - row["y"]) * row["x"]
        totalErrorThetaZero += errorThetaZero
        totalErrorThetaOne += errorThetaOne
    
    thetaZeroStep = thetas[0] - alpha * (1/m) * totalErrorThetaZero
    thetaOneStep  = thetas[1] - alpha * (1/m) * totalErrorThetaOne
    return np.array([thetaZeroStep, thetaOneStep])
    
#sample data
data = pd.read_csv("sampleDataLinearRegression.csv", header= None)
data.columns = ["x", "y"]


## Making some dummy polynomial features
For polynomial regression, we first need some polynomial data. To get that, we give you a function called `makePolynomialFeatures` that takes a feature column `x` and an argument `power`, and returns a DataFrame with `power` columns, where each column contains the original feature column raised to a power in `range(1, power+1)`. So `makePolynomialFeatures(x = pd.Series([3,4,5], name = "x"), power = 3)` returns: <br> 
\[3; 9 ; 27\] <br>
\[4; 16; 64\] <br>
\[5; 25; 125\] <br>
<br>
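
As a quick sanity check of the table above, the same powers can be reproduced with plain NumPy. This is only a sketch: `polynomialColumns` is a hypothetical helper introduced here for illustration, not part of the practical.

```python
import numpy as np

def polynomialColumns(values, power):
    """Return a 2-D array whose columns hold values**1, values**2, ..., values**power."""
    values = np.asarray(values)
    return np.column_stack([values**i for i in range(1, power + 1)])

# One row per original value, one column per exponent.
print(polynomialColumns([3, 4, 5], power=3))
```

Each row corresponds to one original value and each column to one exponent, which is the same layout `makePolynomialFeatures` produces as a DataFrame.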

Run the below cell to see it in action and move on to normalising the data.

In [3]:

def makePolynomialFeatures(x = pd.DataFrame(None), power = 2):
    #raise the feature column to every power from 1 up to and including `power`
    columns = []
    for i in range(1, power+1):
        columns.append(x**i)
    finalFeaturesDataFrame = pd.concat(columns, axis = 1)
    #name each column after its source column and exponent, e.g. 'xtoPower2'
    columnNames = [name + "toPower" + str(exponent) for name, exponent in zip(finalFeaturesDataFrame.columns, range(1, power+1))]
    finalFeaturesDataFrame.columns = columnNames
    return finalFeaturesDataFrame

        
testPolynomial = makePolynomialFeatures(x = data["x"], power = 2)
display(testPolynomial)

Unnamed: 0,xtoPower1,xtoPower2
0,2,4
1,6,36
2,10,100
3,16,256
4,18,324
5,21,441
6,23,529
7,25,625
8,30,900
9,35,1225


---
## Making a function for data normalisation
Good. You've seen in **Exercise 1** and in the lectures that the features should be normalised to make gradient descent work well. High time to write a function to normalise these features. Call it `createNormalisedFeatures`. The function should have an argument `mode` that accepts two inputs, 'SD' and 'range'. For 'SD', the normalisation should return:
* the mean of each feature.
* the standard deviation of each feature.
* the normalised feature itself, using the formula: <br> $$\frac{(feature\ values - feature\ mean)}{feature\ standard\  deviation}$$ <br>

For 'range' as input the normalisation should return:
* the mean of each feature.
* the range of each feature (max - min).
* the normalised feature itself, using the formula:
$$\frac{(feature\ values - feature\ mean)}{(feature\ max - feature\ min)}$$

After you are done, test your function on a DataFrame with a linear and a quadratic feature (i.e. made with `power = 2` in `makePolynomialFeatures`) using `mode = 'range'`. Code to plot the features in this space has been provided below.

##### Why do we need the means and std. dev./range saved?

We need these saved so if we want to predict for an unseen data point, we can transform its features in the same way we do with our training data: by subtracting the means of the training data and dividing by the std.dev./range as we do in our normalisations.
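
A minimal sketch of that idea, using made-up training values (the numbers and the names `trainFeatures` and `newPoint` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical training features; in the practical these would come from makePolynomialFeatures.
trainFeatures = pd.DataFrame({"xtoPower1": [2.0, 6.0, 10.0],
                              "xtoPower2": [4.0, 36.0, 100.0]})

# The statistics are computed on the training data only, and then stored.
trainMeans = trainFeatures.mean()
trainRanges = trainFeatures.max() - trainFeatures.min()

# An unseen point is transformed with the *training* statistics, not with its own.
newPoint = pd.Series({"xtoPower1": 8.0, "xtoPower2": 64.0})
newPointNormalised = (newPoint - trainMeans) / trainRanges
print(newPointNormalised)
```

Because the subtraction and division align on the column names, the stored means and ranges can be applied to any new row that has the same features.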

##### Why these different normalisations? <br>
The SD normalisation (standardisation) rescales each feature to mean 0 and standard deviation 1. This brings *most* values into *approximately* the \[-1,1\] range, but points far from the mean still end up above 1 or below -1: we divide by the standard deviation, which measures the typical distance of the data from the mean, so points further away than that get larger absolute values. For roughly normally distributed data, about two thirds of the observations fall inside \[-1,1\]. <br>
The range normalisation brings everything into the \[-1,1\] range outright, no exceptions, because no point can be further from the mean than the full range (max - min). Do note that if you have huge outliers, most data will be compressed into a tiny interval, with the outliers ending up much closer to 1 or -1. <br>
<br>
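
To make the difference between the two normalisations concrete, here is a small sketch on a made-up feature with one large outlier. The values are hypothetical and chosen only to exaggerate the effect.

```python
import numpy as np

# Hypothetical feature: five ordinary values and one large outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

# Range normalisation: subtract the mean, divide by (max - min).
rangeNormalised = (values - values.mean()) / (values.max() - values.min())

# SD normalisation; ddof=1 matches the sample standard deviation that pandas' .std() uses by default.
sdNormalised = (values - values.mean()) / values.std(ddof=1)

print(np.round(rangeNormalised, 3))
print(np.round(sdNormalised, 3))
```

With range normalisation every value stays inside \[-1,1\], but the five ordinary points are squeezed into a very narrow band. With SD normalisation the outlier ends up outside \[-1,1\].
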
<b> Hints </b> 
* To return multiple values in one function, simply put them in a list: `return [thing1,thing2,ChickenLittle]`. In this case, order the return values like so: `[normalisedFeatures, featureMeans, featureSD/featureRanges]` <br>
* You can simply use `someDataFrame.mean()` and `someDataFrame.std()` to get means and standard deviations for each column, respectively. Range can be found in a similar way, using `someDataFrameRange = someDataFrame.max() - someDataFrame.min()`. If you then do `transformedDataFrame = (someDataFrame - someDataFrame.mean())/someDataFrameRange` you're done for range transformation! <br>

---

In [4]:
#your answer here


#answer

def createNormalisedFeatures(dataFrame, mode = "range"):
    featureMeans = dataFrame.mean()
    if mode == "range":
        featureRanges = dataFrame.max() - dataFrame.min()
        normalisedFeatures = (dataFrame - featureMeans)/featureRanges
        return [normalisedFeatures, featureMeans, featureRanges]
    elif mode == "SD":
        featureSDs = dataFrame.std()
        normalisedFeatures = (dataFrame - featureMeans)/featureSDs
        return [normalisedFeatures, featureMeans, featureSDs]

#plotting code
runOnPowerTwoData = createNormalisedFeatures(makePolynomialFeatures(data["x"], power = 2),
                                          mode = "range")

fig, ax = plt.subplots(figsize=(10,10))
ax.set_xlim([-1,1])
ax.set_ylim([0,100])
ax.scatter(runOnPowerTwoData[0].iloc[:,0], data["y"], label = "linear feature")
ax.plot([np.mean(runOnPowerTwoData[0].iloc[:,0]), np.mean(runOnPowerTwoData[0].iloc[:,0])],
       [0,100], linestyle = 'dashed', linewidth = 1, color = "red", label = 'mean linear feature')
ax.scatter(runOnPowerTwoData[0].iloc[:,1], data["y"], label = "quadratic feature")
ax.plot([np.mean(runOnPowerTwoData[0].iloc[:,1]), np.mean(runOnPowerTwoData[0].iloc[:,1])],
       [0,100], linestyle = 'dotted', linewidth = 1, color = "blue", label = 'mean quadratic feature')
ax.legend()





---

## Plotting different data normalisations
Run the cell below to see the effect of each normalisation on the sample linear regression data, using only the original feature `data['x']`. Note how every normalisation has slightly different characteristics (note the axis scales!).
<br>

---

In [5]:
#set up a figure
figNormPlot, ((ax1NormPlot, ax2NormPlot, ax3NormPlot), (ax4NormPlot, ax5NormPlot, ax6NormPlot)) = plt.subplots(2,3, figsize=(15,6.15))
figNormPlot.tight_layout()

#create normal scatterplot
ax1NormPlot.set_title("Untransformed data")
scatter = ax1NormPlot.scatter(data["x"], data["y"])
regressionLine, = ax1NormPlot.plot(data["x"], 0.31 + 1.05 * data["x"],
                                   color = 'red', linestyle = "dashed")

#range-normalised plot
ax2NormPlot.set_title("Range transformed (mean-normalised) data")
featsRange, featsRangeMean, featsRangeRange =  createNormalisedFeatures(data["x"], mode = "range")
scatter = ax2NormPlot.scatter(featsRange, data["y"])
ax2NormPlot.set_xlim([-1,1])

#st.dev.-normalised plot
ax3NormPlot.set_title("St.Dev. transformed data")
featsSD, featsSDMean, featsSDSD =  createNormalisedFeatures(data["x"], mode = "SD")
scatter = ax3NormPlot.scatter(featsSD, data["y"])
ax3NormPlot.set_xlim([-2.5,2.5])

#distribution plots. They all look the same, but note that the axis limits vary!
sns.kdeplot(data = np.array(data["x"]), ax = ax4NormPlot)
sns.kdeplot(data = np.array(featsRange), ax = ax5NormPlot)
sns.kdeplot(data = np.array(featsSD), ax = ax6NormPlot)
ax4NormPlot.hist(data["x"], bins = 10, edgecolor='black', linewidth=1.2, density = True)
ax5NormPlot.hist(featsRange, bins = 10, edgecolor='black', linewidth=1.2, density = True)
ax6NormPlot.hist(featsSD, bins = 10, edgecolor='black', linewidth=1.2, density = True)

#show all plots
plt.show()


---

## Towards multivariate linear regression: updating the hypothesis function

Okay, now let's put it all together. First, update the hypothesis function to accommodate any number of features and parameters. It should take in the features for one sample (a row of a DataFrame with *n* features) and an array/list of thetas of length *n* + 1, calculate $\theta_0 + \theta_1 x_1 + ... + \theta_n x_n$, and return this predicted value for the data point. Call it `multiHypothesis`. <br> <br>

<b> Hints </b>
* Loop over the thetas, and multiply each with the corresponding element in x. <br>
* You can prepend 1 to the data that the function takes in. By adding this feature that is 1, $\theta_0$ is multiplied by 1 when you use a for loop. To do this, when you are working with a row of a Pandas DataFrame, which is a `pd.Series()` object, use: `pd.concat([pd.Series(1), currentRowOfFeatures])` (the older `pd.Series(1).append(currentRowOfFeatures)` was removed in pandas 2.0).<br>
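
The two hints can be sketched on a single made-up sample as follows. The feature values and thetas are hypothetical, and the sketch uses `pd.concat` to prepend the 1 because `Series.append` is not available in pandas 2.0 and later. The dot product at the end is just an alternative way of writing the same sum, not something the exercise requires.

```python
import numpy as np
import pandas as pd

# Hypothetical example values, not taken from the practical's data.
features = pd.Series({"xtoPower1": 2.0, "xtoPower2": 4.0})
thetas = np.array([1.0, 0.5, 0.25])

# Prepend the constant feature 1 so that theta_0 acts as the intercept.
featuresWithOne = pd.concat([pd.Series([1.0]), features])

# Sum of theta_j * x_j over all features, written as a loop...
loopPrediction = sum(thetas[j] * featuresWithOne.iloc[j] for j in range(len(thetas)))

# ...and as a single dot product.
dotPrediction = float(np.dot(thetas, featuresWithOne.to_numpy()))

print(loopPrediction, dotPrediction)
```

Both lines compute $1 \cdot 1 + 0.5 \cdot 2 + 0.25 \cdot 4$ for this sample.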

---

In [6]:
#answer
def multiHypothesis(x, thetas):
    #add a 1 to x as the first 'feature', so that theta_0 is multiplied by 1
    #(pd.concat replaces Series.append, which was removed in pandas 2.0)
    x = pd.concat([pd.Series(1), x])
    if not len(x) == len(thetas):
        raise ValueError("x (with the leading 1) and thetas should have equal length!")
    #multiply every theta with its corresponding feature and sum the results
    prediction = sum([x.iloc[index] * thetas[index] for index, _ in enumerate(x)])
    return prediction

        
    

---

## Towards multivariate regression: running gradient descent with `multiHypothesis`

Armed with this glorious function, we'll take the following steps:
* Use the `makePolynomialFeatures` function with `power = 2` to make features.
* Normalise the features using `createNormalisedFeatures`, with `mode = 'range'`.
* Use this normalised data in a linear regression with 2 variables by using gradient descent with this new hypothesis function. To do this, we need to change the `MyMSE` and `gradientDescent` functions to use the new `multiHypothesis` function to calculate cost. The new `MyMSE` function has been supplied as an example below. <br>
* <b> You need to change the `gradientDescent` function to use `multiHypothesis`! </b> <br> <br>

<b> Hints </b>
* It is wasteful to keep redefining the MSE function for different hypotheses. Instead, we added an argument `hypothesis` to the MSE function. Then, in the code, we use `globals()[hypothesis](arg1, arg2)`. This causes Python to search the functions that have been defined in the session for the one called `hypothesis` and execute it if found. <br>
* The `multiHypothesis` function does not need y values. When you are iterating over the samples using `for index, rowData in DataFrame.iterrows()`, you can use `withoutYRowData = rowData.drop("y")` to remove it. <br>
* You can assume that there are only two features, 'xtoPower1' and 'xtoPower2'. <br>
* `gradientDescent` should now return 3 values, for $\theta_0$ , $\theta_1$, and $\theta_2$. Do this in a list, as before.
* If you want to access a Pandas DataFrame column or Series object by a number, use `DataFrame.iloc[0,1]` (first row, second column) or `Series.iloc[1]` (2nd item in the series).
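
For reference, the step that `gradientDescent` has to take for each parameter can be written out as below. This assumes the same convention as the univariate `gradientDescent` in cell 2, with $m$ samples, a constant feature $x_0^{(i)} = 1$, and $h_\theta$ standing for `multiHypothesis`. The superscript $(i)$ for the sample index is notation introduced here for illustration.

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}, \qquad j = 0, 1, 2$$

For $j = 0$ the factor $x_0^{(i)}$ is 1, which is why the $\theta_0$ update only sums the prediction errors.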

---

In [7]:
#start things off
startThetasHP = np.array([0,0,0])

featuresHP = makePolynomialFeatures(data["x"], power = 2)

normalisedFeaturesHP, featureMeansHP, featureRangesHP = createNormalisedFeatures(featuresHP, mode = "range")

#add 'y' variable to the normalised data frame
normalisedFeaturesHP["y"] = data["y"]

display(normalisedFeaturesHP)

#example use of globals()
def printMyMan(smiley = False):
    if smiley == True:
        print("My Man! (:")
    else:
        print("My Man!")
    return None

globals()["printMyMan"]()
globals()["printMyMan"](smiley = True)

#example change in MyMse() to use any hypothesis function you define
def MyMSE(dataframe, thetas, hypothesis = "multiHypothesis"):
    totalSumSquares = 0
    for index, rowData in dataframe.iterrows():
        prediction = globals()[hypothesis](rowData.drop("y"), thetas)
        squareError = (prediction-rowData["y"])**2
        totalSumSquares += squareError
    meanSquaredError = totalSumSquares/len(dataframe) 
    return meanSquaredError



#sample usage:
exampleMSE = MyMSE(normalisedFeaturesHP, startThetasHP, "multiHypothesis")
print("MSE for thetas " + np.array2string(startThetasHP) + ": " + str(exampleMSE))

#now it's up to you to change gradientDescent in a similar way in the cell below!

Unnamed: 0,xtoPower1,xtoPower2,y
0,-0.423862,-0.26616,0
1,-0.379906,-0.262458,1
2,-0.33595,-0.255055,8
3,-0.270016,-0.23701,20
4,-0.248038,-0.229144,25
5,-0.215071,-0.21561,27
6,-0.193093,-0.205431,34
7,-0.171115,-0.194326,43
8,-0.11617,-0.162516,45
9,-0.061224,-0.124922,46


My Man!
My Man! (:
MSE for thetas [0 0 0]: 3166.809523809524


In [14]:
#answer    

def gradientDescent(dataframe, thetas, alpha, hypothesis = "multiHypothesis"):
    m = len(dataframe)
    #one running total per theta; index 0 belongs to theta zero
    totalErrors = [0] * len(thetas)
    for index, row in dataframe.iterrows():
        features = row.drop("y")
        #needed for all thetas, so calculate the prediction only once
        error = globals()[hypothesis](features, thetas) - row["y"]
        #partial derivative for theta zero: its 'feature' is the constant 1
        totalErrors[0] += error
        #partial derivatives for all remaining thetas: error times the matching feature
        for featureIndex in range(len(thetas) - 1):
            #use .iloc for positional access; row[0] on string labels is deprecated in recent pandas
            totalErrors[featureIndex + 1] += error * features.iloc[featureIndex]

    #take a step for every theta: subtract alpha/m times its summed partial derivative
    finalThetas = [theta - alpha / m * totalError for theta, totalError in zip(thetas, totalErrors)]
    return np.array(finalThetas)

---

## Running your very own multivariate linear regression algorithm
All done? Great! You've made it through Mean-Squared Error functions, gradient descent implementations, polynomial feature generation and gradient descent that works for >1 feature. Let's see this brand spankin' new set-up in action. Of course, the underlying data is actually linear, so the quadratic equation we're fitting here will not be the best fit. But it will improve thanks to your gradient descent function-writing prowess. Run the cell below and marvel (and/or feel slightly underwhelmed) at the results!

---


In [22]:
thetaValuesDuringDescent = []
MSEDuringDescent = []
thetasNow = startThetasHP.copy()
stepsGradDescent = 20
alpha = 0.2
for i in range(0,stepsGradDescent):
    oneStep = gradientDescent(normalisedFeaturesHP, thetasNow, alpha)
    thetaValuesDuringDescent.append(oneStep)
    thetasNow = oneStep
    MSEDuringDescent.append(MyMSE(normalisedFeaturesHP, thetasNow, hypothesis = 'multiHypothesis'))
#print(thetaValuesDuringDescent)

fig10, ax10 = plt.subplots(figsize = (10,10))
scatter10One = ax10.scatter(normalisedFeaturesHP.iloc[:,0], normalisedFeaturesHP["y"])
regressionLine10, = ax10.plot(normalisedFeaturesHP.iloc[:,0],
                              [multiHypothesis(row, startThetasHP) for index, row in normalisedFeaturesHP.drop("y", axis = 1).iterrows()],
                              color = 'red', linestyle = "dashed",
                              label = "start thetas: " + str(startThetasHP) + " ; MSE: " + str(np.round(exampleMSE,1)))

colors = ['b', 'g', 'm', 'c', 'orange']
#repeat colours if more steps
colors = colors * int(np.ceil(len(thetaValuesDuringDescent)/5))
#show for all gradient steps
for i in range(0, len(thetaValuesDuringDescent)):
    
    ax10.plot(normalisedFeaturesHP.iloc[:,0],
                              [multiHypothesis(row, thetaValuesDuringDescent[i]) for index, row in normalisedFeaturesHP.drop("y", axis = 1).iterrows()],
                              color = colors[i], linestyle = "dashed", alpha = 0.8,
                              label = "thetas: " + str(np.round(thetaValuesDuringDescent[i], 1)) + " ; MSE: " + str(np.round(MSEDuringDescent[i])))

ax10.legend(bbox_to_anchor=(0, 1), loc='upper left', borderaxespad=0, fontsize = "small")    
plt.show()



## Final words

"Aaarghh". But also: well done! Time for one more lecture, in which we'll dive into using linear algebra to make our calculations (and hopefully also our function definitions) easier, and/or further explore the bias-variance trade-off.

Note: linear regression actually has a closed-form solution for the minimum of the cost function, which can be computed directly with something called the Normal Equation (given a tractable data size). Hence, linear regression libraries will probably use this or a related direct method where feasible, along with many other speed-ups and features. Still, doing it yourself is an achievement!
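
As a rough illustration of the Normal Equation, the sketch below solves $(X^T X)\theta = X^T y$ with NumPy on made-up, exactly linear data. The data and variable names are assumptions for illustration; real libraries may well use other, numerically more robust routines.

```python
import numpy as np

# Hypothetical, exactly linear data: y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept theta_0.
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: solve (X^T X) theta = X^T y for theta in one go, no iterations needed.
thetas = np.linalg.solve(X.T @ X, X.T @ y)
print(thetas)
```

On this data the solution recovers an intercept of about 1 and a slope of about 2 without any gradient steps or learning rate.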

## What I'd like you to remember here:
* Extending linear regression to multivariate (and polynomial, i.e. with exponents) regression is relatively straightforward once you have the basic routine down.
* How to normalise data. Especially the mean-centering and scaling to unit variance (data-mean(data))/std(data) is used very often. 
* Why normalising data is important: otherwise you get gradients operating at different scales, messing up your gradient steps, with partial derivatives that are much larger for the polynomial variables than for the others. To avoid this, we normalise. You'll be using normalisation extensively throughout the rest of the course.
* That running polynomial regression on a linear relationship is a bit bonkers (but illustrative, nonetheless)
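
The point about gradient scales can be checked with a small sketch. The data below is made up, and `gradientAtZero` is a hypothetical helper that computes the partial derivatives of the cost at all-zero thetas, using the same error-times-feature form as in this practical.

```python
import numpy as np

# Hypothetical raw feature values and targets.
x = np.array([2.0, 6.0, 10.0, 16.0, 18.0])
y = np.array([0.0, 1.0, 8.0, 20.0, 25.0])

def gradientAtZero(features, targets):
    """Partial derivatives at thetas = 0: intercept first, then one per feature."""
    withOnes = np.column_stack([np.ones(len(features)), features])
    errors = withOnes @ np.zeros(withOnes.shape[1]) - targets
    return withOnes.T @ errors / len(targets)

# Linear and quadratic features, first raw and then range-normalised.
rawFeatures = np.column_stack([x, x**2])
featureRanges = rawFeatures.max(axis=0) - rawFeatures.min(axis=0)
scaledFeatures = (rawFeatures - rawFeatures.mean(axis=0)) / featureRanges

print(gradientAtZero(rawFeatures, y))
print(gradientAtZero(scaledFeatures, y))
```

On the raw features the partial derivative for the quadratic term is orders of magnitude larger than the one for the intercept, so a single learning rate cannot suit all thetas. After normalisation the three partial derivatives are of comparable size.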


## Survey
Hi it's me again, the incessant reminder that you fill out the survey. Boy, surveys huh, who doesn't love 'em!? Great minds think alike, [here you go](https://docs.google.com/forms/d/e/1FAIpQLScoqJtzOclzOl8DrXnoukfySI3HAdfJNeGw_Gxplas09KdEDw/viewform?usp=sf_link)!