# CS4305TU - Assignment 2 - Regression

In this assignment, you will apply the regression techniques you have just learned to real-life data. **You should work in groups for this assignment.**

## Data source

You will work with aircraft trajectory data derived from [ADS-B](https://www.skybrary.aero/index.php/Automatic_Dependent_Surveillance_Broadcast_(ADS-B)). The data is collected with the antenna on top of the aerospace building:

<img src="https://pbs.twimg.com/media/EoBz7vVXEAAze48?format=jpg&name=medium" width="400"/>

Essentially, ADS-B data is what you see on websites like FlightRadar24:

<img src="https://media.giphy.com/media/cPutGcE0a9jdS/giphy.gif" width="400"/>

## Background

In the dataset, all flight trajectories include only the descent part of the flight. The dataset is split into two directories. One directory contains flights that follow the [Continuous Descent Approach (CDA)](https://www.skybrary.aero/index.php/Continuous_Descent). The other directory contains flights that do not follow CDA.

CDA is an operation in which the aircraft does not have any level flight segment during the descent. Follow the link above to learn more.

<img src="https://1.bp.blogspot.com/-UFmjVcjmqCM/UIai54Y_wYI/AAAAAAAAAUM/tW1HTFP1IGI/s1600/image02_05_large.gif" width="400">
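
As a rough illustration of this definition, level flight segments can be flagged from the vertical rate. This is only a sketch under assumptions: the helper name `has_level_segment`, the 100 ft/min threshold, and the 60 s minimum duration are illustrative choices, not part of the dataset or of any official CDA criterion.

```python
import numpy as np


def has_level_segment(vertical_rate, time, threshold=100.0, min_duration=60.0):
    """Return True if |vertical_rate| stays below `threshold` (ft/min)
    for at least `min_duration` seconds of consecutive samples."""
    vertical_rate = np.asarray(vertical_rate, dtype=float)
    time = np.asarray(time, dtype=float)
    level = np.abs(vertical_rate) < threshold

    start = None  # time at which the current level run began
    for t, is_level in zip(time, level):
        if is_level:
            if start is None:
                start = t
            if t - start >= min_duration:
                return True
        else:
            start = None
    return False


# two made-up descents sampled every 20 seconds
time = np.arange(0, 200, 20)
continuous = np.full(10, -1500.0)
stepped = np.array([-1500, -1500, 0, 0, 0, 0, 0, -1500, -1500, -1500], dtype=float)

print(has_level_segment(continuous, time))  # False
print(has_level_segment(stepped, time))     # True
```

A rule like this is sensitive to the chosen threshold, so it is better seen as a way to explore the data than as a classifier.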


## Data attributes

The structures of all CSV files are the same. Here are descriptions of all columns:

- **time**: flight time in seconds, the first row starts at time 0.
- **icao**: aircraft transponder address, string format, unique for each aircraft.
- **type**: aircraft type code, string format.
- **callsign**: string format, often related to the flight number, unique for each flight.
- **latitude**: latitude coordinate in degrees.
- **longitude**: longitude coordinate in degrees.
- **speed**: aircraft speed respective to ground, unit is in knots (1 knot = 0.51444 m/s).
- **track_angle**: direction of aircraft in relation to the true north, in degrees.
- **vertical_rate**: aircraft climb or descent speed in feet/minute (1 ft/min = 0.00508 m/s), negative value indicates aircraft is descending.

The most important features we are using are **time**, **altitude**, **speed**, and **vertical_rate**. 
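
If SI units are preferred, the conversion factors listed for **speed** and **vertical_rate** can be applied column-wise. A minimal sketch, assuming a small hand-made DataFrame with the same column names as the CSV files; the output column names `speed_ms` and `vertical_rate_ms` are hypothetical:

```python
import pandas as pd

KNOT_TO_MS = 0.51444    # 1 knot in m/s
FTMIN_TO_MS = 0.00508   # 1 ft/min in m/s

# three illustrative rows with the same column names as the CSV files
df = pd.DataFrame({
    "time": [0.0, 20.0, 40.0],
    "speed": [445, 441, 436],               # knots
    "vertical_rate": [-192, -1408, -2304],  # ft/min
})

# hypothetical names for the converted columns
df["speed_ms"] = df["speed"] * KNOT_TO_MS
df["vertical_rate_ms"] = df["vertical_rate"] * FTMIN_TO_MS

print(df[["time", "speed_ms", "vertical_rate_ms"]].round(2))
```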

## Instructions

The code in this notebook serves as the base for your assignment. The tasks are defined in each section.

You should implement the solutions using code cells and write your analysis using markdown cells.

Once you have completed everything, before submission, remember to restart the kernel and run all cells again. Make sure there are no errors. Then you should:

 - Save the notebook (**replace XX in the filename with your group number**)
 - Export an HTML version of the notebook. Hint: follow Menu -> File -> Download as -> HTML
 - Submit both the notebook (.ipynb) and the export (.html)


## References

- Quick tutorial for **Jupyter Notebook** : https://www.youtube.com/watch?v=2eCHD6f_phE

- Quick tutorial for **Jupyter Lab** (if you wish to use): https://www.youtube.com/watch?v=A5YyoCKxEOU
 

In [1]:
import glob
import warnings
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import ipywidgets as widgets
warnings.filterwarnings("ignore")

In [2]:
# loading all trajectory files

cda_files = sorted(glob.glob("data/cda/*.csv"))
noncda_files = sorted(glob.glob("data/noncda/*.csv"))

## Examples

The following two cells are examples of data loading and plotting.

Remove these before submitting your assignment.

In [3]:
df_example = pd.read_csv(cda_files[0])
df_example.head(10)

Unnamed: 0,time,icao,type,callsign,latitude,longitude,altitude,speed,track_angle,vertical_rate
0,0.0,40631F,A319,EZY58YF,52.79041,2.55135,24975,445,101,-192
1,20.0,40631F,A319,EZY58YF,52.78042,2.61632,24625,441,106,-1408
2,40.0,40631F,A319,EZY58YF,52.76827,2.68098,23925,436,107,-2304
3,60.0,40631F,A319,EZY58YF,52.75611,2.74282,23225,432,107,-1984
4,80.0,40631F,A319,EZY58YF,52.74248,2.81368,22550,425,107,-1728
5,100.0,40631F,A319,EZY58YF,52.73261,2.86673,22000,420,106,-2112
6,140.0,40631F,A319,EZY58YF,52.70906,2.99515,20550,411,106,-2112
7,160.0,40631F,A319,EZY58YF,52.69504,3.07138,19900,400,106,-1344
8,180.0,40631F,A319,EZY58YF,52.68875,3.10529,19550,397,106,-2304
9,200.0,40631F,A319,EZY58YF,52.6774,3.16626,18775,396,107,-1664


In [4]:
# visualization example

flight_sample = pd.read_csv(noncda_files[0])

fig, ax = plt.subplots(1, 3, figsize=(12, 3))
ax[0].scatter(flight_sample.time, flight_sample.altitude, s=5)
ax[0].set_xlabel("time")
ax[0].set_ylabel("altitude")
ax[1].scatter(flight_sample.time, flight_sample.speed, s=5)
ax[1].set_xlabel("time")
ax[1].set_ylabel("speed")
ax[2].scatter(flight_sample.time, flight_sample.vertical_rate, s=5)
ax[2].set_xlabel("time")
ax[2].set_ylabel("vertical_rate")
plt.tight_layout()
plt.show()


# Task 1: Simple linear regression 

In this task you will learn how to apply a simple linear regression model using a couple of flight trajectories. To complete the task, follow the steps below:

1. Use your group ID as the random seed, select one flight from the CDA trajectories, and another one from the Non-CDA trajectories.

1. Inspect the relationships of (time, altitude), (time, speed), and (time, vertical_rate) for these two trajectories.

1. Apply linear regression to all three parameters for both trajectories, using time as input and each parameter (altitude, speed, and vertical_rate) in turn as output.

1. Evaluate the performance of the estimators using different error metrics.
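
Step 4 can be sketched as follows. The data here is synthetic (a made-up linear descent plus noise), so the numbers say nothing about the real flights, and MAE, RMSE and R² are just three common choices of error metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.RandomState(0)

# synthetic descent: altitude falls roughly linearly with time, plus noise
t = np.arange(0, 1000, 20, dtype=float)
alt = 24000.0 - 20.0 * t + rng.normal(scale=200.0, size=t.size)

X = t.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
model = LinearRegression().fit(X, alt)
pred = model.predict(X)

mae = mean_absolute_error(alt, pred)
rmse = np.sqrt(mean_squared_error(alt, pred))
r2 = r2_score(alt, pred)

print("slope:", model.coef_[0])
print("MAE:", mae, "RMSE:", rmse, "R^2:", r2)
```

MAE and RMSE are in the units of the output, while R² is dimensionless, which makes it easier to compare fits of altitude, speed and vertical rate with each other.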

In [5]:
# set group_id to your own group number

group_id = 19

In [6]:
np.random.seed(group_id)

cda_filename = np.random.choice(cda_files)
print(cda_filename)
noncda_filename = np.random.choice(noncda_files)

df_cda = pd.read_csv(cda_filename)
df_noncda = pd.read_csv(noncda_filename)

type_cda = df_cda["type"].iloc[0]
type_noncda = df_noncda["type"].iloc[0]


data/cda\93ce2914.csv


In [7]:
# write your code here

# create more cells if needed
def plot_pairs(df, title):
    """Scatter-plot the six pairs of time, altitude, speed and vertical_rate for one flight."""
    pairs = [
        ("time", "altitude"), ("time", "speed"), ("time", "vertical_rate"),
        ("altitude", "vertical_rate"), ("altitude", "speed"), ("speed", "vertical_rate"),
    ]
    fig, axes = plt.subplots(2, 3, figsize=(12, 6))
    for ax, (xcol, ycol) in zip(axes.flat, pairs):
        ax.scatter(df[xcol], df[ycol], s=5)
        ax.set_xlabel(xcol)
        ax.set_ylabel(ycol)
    # figure-level title; plt.title would only label the last subplot
    fig.suptitle(title)
    fig.tight_layout(rect=[0, 0, 1, 0.95])  # leave room for the title
    plt.show()


print('cda data representation')
plot_pairs(df_cda, 'cda')

print('non-cda data representation')
plot_pairs(df_noncda, 'non - cda')


cda data representation



non-cda data representation



In [8]:
## Linear regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# for cda: time is the input, altitude is the output
x_train = df_cda["time"].to_numpy().reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
y_train = df_cda["altitude"].to_numpy()
lin_regress = LinearRegression().fit(x_train, y_train)
print('intercept : ', lin_regress.intercept_)
print('slope : ', lin_regress.coef_[0])

# errors are computed on the same points the model was fitted to
MSE = mean_squared_error(y_train, lin_regress.predict(x_train))
print('MSE : ', MSE)
print('RMSE : ', np.sqrt(MSE))

intercept :  23351.76640518252
slope :  -21.009054842311887
MSE :  59504.539276738404
RMSE :  243.93552278571156


In [9]:
# for non - cda: time is the input, altitude is the output
x_train = df_noncda["time"].to_numpy().reshape(-1, 1)
y_train = df_noncda["altitude"].to_numpy()
lin_regress = LinearRegression().fit(x_train, y_train)
print('intercept : ', lin_regress.intercept_)
print('slope : ', lin_regress.coef_[0])
MSE = mean_squared_error(y_train, lin_regress.predict(x_train))
print('MSE : ', MSE)
print('RMSE : ', np.sqrt(MSE))

intercept :  23948.393485141656
slope :  -14.789148407092219
MSE :  1622899.4684553044
RMSE :  1273.93071572017


(this is a markdown cell)

write your analysis here

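- For cda
  - The altitude vs time plot is close to a straight line with an almost constant slope (signifying a steady descent).
  - The speed changes gradually.
  - The vertical rate is negative and almost constant (except at the end).
  - The linear fit of altitude against time gives an RMSE of about 244.


- For non - cda
  - The altitude vs time plot shows a varying slope.
  - The speed varies a lot.
  - The vertical rate is negative and varies strongly.
  - The linear fit of altitude against time gives an RMSE of about 1274, roughly five times that of the cda flight, so a single straight line describes this descent much less well.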



# Task 2: Multiple linear regression

In this task you will learn how to apply a multiple linear regression model. To complete the task, follow the steps below:

1. Use the same trajectories as in the previous task, but choose both speed and altitude as predictors for the vertical rate.

1. Construct a 3D multiple linear regression model.

1. Visualize your result and briefly analyze your results.
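
One possible way to carry out steps 1 and 2 and to put a number on the quality of the fit is sketched below. The data is synthetic: the plane and the noise level are invented for illustration, and only the value ranges resemble a descent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)

# synthetic predictors standing in for altitude and speed
altitude = rng.uniform(5000, 25000, size=200)
speed = rng.uniform(225, 400, size=200)
# made-up plane plus noise standing in for the vertical rate
vertical_rate = -500.0 - 0.05 * altitude + 2.0 * speed + rng.normal(scale=50.0, size=200)

X = np.column_stack([altitude, speed])
X_train, X_test, y_train, y_test = train_test_split(X, vertical_rate, random_state=1)

model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R^2 on the held-out points

print("intercept:", model.intercept_)
print("coefficients (altitude, speed):", model.coef_)
print("test R^2:", test_r2)
```

Passing `random_state` to `train_test_split` keeps the split, and therefore the printed coefficients, the same after a kernel restart.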

In [23]:
# write your code here

# create more cells if needed
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the '3d' projection on older matplotlib
from sklearn.model_selection import train_test_split

# for cda: altitude and speed are the predictors, vertical rate is the response
x = df_cda[["altitude", "speed"]].to_numpy()
y = df_cda["vertical_rate"].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y)

model = LinearRegression().fit(x_train, y_train)
b0 = model.intercept_
b1, b2 = model.coef_

print('Coefficients {} {} {}'.format(b0,b1,b2))


fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(x_train[:,0], x_train[:,1], y_train, color = 'k', s = 15, label = 'training data')
ax.scatter(x_test[:,0], x_test[:,1], y_test, color = 'r', s = 15, label = 'test data')
ax.set_xlabel('altitude')
ax.set_ylabel('speed')
ax.set_zlabel('vertical rate')
ax.legend()

# evaluate the fitted plane on a coarse grid (50 x 50 points is enough for a flat surface)
g1, g2 = np.meshgrid(np.linspace(5000, 25000, 50), np.linspace(225, 400, 50))
z = b0 + b1*g1 + b2*g2

ax.plot_surface(g1, g2, z, alpha = 0.2)
plt.tight_layout()

Coefficients -3645.955578229411 -0.07127946954407555 9.765894118769793

In [61]:
# for non - cda: same model as above, fitted to the non-cda flight
x = df_noncda[["altitude", "speed"]].to_numpy()
y = df_noncda["vertical_rate"].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y)

model = LinearRegression().fit(x_train, y_train)
b0 = model.intercept_
b1, b2 = model.coef_

print('Coefficients {} {} {}'.format(b0,b1,b2))


fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(x_train[:,0], x_train[:,1], y_train, color = 'k', s = 15, label = 'training data')
ax.scatter(x_test[:,0], x_test[:,1], y_test, color = 'r', s = 15, label = 'test data')
ax.set_xlabel('altitude')
ax.set_ylabel('speed')
ax.set_zlabel('vertical rate')
ax.legend()

# evaluate the fitted plane on a coarse grid (50 x 50 points is enough for a flat surface)
g1, g2 = np.meshgrid(np.linspace(5000, 25000, 50), np.linspace(225, 400, 50))
z = b0 + b1*g1 + b2*g2

ax.plot_surface(g1, g2, z, alpha = 0.2)
plt.tight_layout()

Coefficients 2331.99053587175 0.01880347553836803 -11.61879835086252



(this is a markdown cell)

write your analysis here






# Task 3: Polynomial regression

In this task you will learn how to apply a polynomial regression model. To complete the task, follow the steps below:

1. Based on previous trajectories, apply polynomial regression, using altitude as input and speed as output. 

1. Try out different orders of polynomials.

1. Analyze your choice briefly, taking the bias-variance trade-off into consideration.

1. Apply regularization to a high-order polynomial model you have tried earlier. Write a brief analysis of your result.
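
A sketch of steps 2 to 4 on synthetic data. It is not the implementation used in the cells of this notebook: it relies on scikit-learn's `PolynomialFeatures`, `StandardScaler` and `Ridge`, it standardizes the input before building the powers (an assumption made here to keep high orders numerically stable), and the cubic ground truth and `alpha = 1.0` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(2)

# synthetic altitude/speed pairs: an invented smooth curve plus noise
altitude = rng.uniform(2000, 25000, size=120)
u = (altitude - 13500.0) / 11500.0  # altitude mapped to [-1, 1]
speed = 320.0 + 80.0 * u - 70.0 * u**3 + rng.normal(scale=4.0, size=altitude.size)

X = altitude.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, speed, random_state=2)


def fit_poly(k, alpha=None):
    """Degree-k polynomial on standardized altitude; ridge penalty if alpha is given."""
    reg = LinearRegression() if alpha is None else Ridge(alpha=alpha)
    model = make_pipeline(StandardScaler(), PolynomialFeatures(k, include_bias=False), reg)
    return model.fit(X_train, y_train)


# compare a low, a matching and a high polynomial order on the held-out points
scores = {}
for k in (1, 3, 14):
    scores[k] = r2_score(y_test, fit_poly(k).predict(X_test))
    print("k =", k, "test R^2 =", scores[k])

ridge_score = r2_score(y_test, fit_poly(14, alpha=1.0).predict(X_test))
print("k = 14 with ridge (alpha = 1.0), test R^2 =", ridge_score)
```

Because the input is standardized, moderate values of `alpha` already have a visible effect; with raw altitude values the penalty has to compete with feature columns of enormously different magnitude.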


In [97]:
# write your code here

# create more cells if needed

from sklearn.model_selection import train_test_split
from scipy.linalg import inv
from sklearn.metrics import mean_squared_error, r2_score


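#definitions
def print_model(coef):
    """Pretty-print a polynomial given its coefficients in increasing order of power."""
    print("-" * 70)
    print(np.poly1d(coef[::-1]))
    print("-" * 70)


def polynomial_regression(x, y, k):
    """Fit a degree-k polynomial by solving the normal equations (X^T X) b = X^T y.

    Returns the coefficients in increasing order of power. Note that X^T X
    becomes badly conditioned for high k when x is not rescaled.
    """
    X = np.vander(x, k+1, increasing=True)
    Y = y.reshape(-1, 1)
    X = X.astype(np.longdouble)
    Y = Y.astype(np.longdouble)

    res = np.dot(X.T, X)
    res = np.dot(inv(res), X.T)
    res = np.dot(res, Y)

    coef = res.squeeze()
    return coef


def plot_poly(x, coef):
    """Draw the fitted polynomial over the range of x on the current axes."""
    x_ = np.linspace(min(x), max(x), 100)
    y_ = np.zeros(len(x_))
    for i, c in enumerate(coef):
        y_ += c * x_**i
    plt.plot(x_, y_)

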
def plot_data(
    ax,
    df_train,
    df_test,
    xcol,
    ycol,
    show_train=True,
    show_test=False,
    label=True,
    show_type=False,
):
    x_train = df_train[xcol].values
    y_train = df_train[ycol].values

    x_test = df_test[xcol].values
    y_test = df_test[ycol].values

    type_train = df_train.index.values
    type_test = df_test.index.values

    if show_train:
        ax.scatter(
            x_train,
            y_train,
            color="k",
            s=50,
            lw=2,
            label="training set",
            facecolor="w",
            zorder=10,
        )

    if show_test:
        ax.scatter(
            x_test,
            y_test,
            color="r",
            s=50,
            lw=2,
            label="testing set",
            facecolor="w",
            zorder=10,
        )

    if label:
        ax.set_xlabel(xcol)
        ax.set_ylabel(ycol)

    if show_type:
        for x, y, t in zip(x_train, y_train, type_train):
            # move the text label around
            x += max(x_train) / 50
            ax.text(x, y, t, ha="left", va="center", fontsize=8)

    if show_type and show_test:
        for x, y, t in zip(x_test, y_test, type_test):
            x -= max(x_train) / 25
            ax.text(x, y, t, ha="left", va="center", fontsize=8, color="r")

    ax.legend()
    
    
#actual code
# non-cda flight: altitude is the input, speed is the output
x = df_noncda["altitude"].to_numpy().astype(np.longdouble)
y = df_noncda["speed"].to_numpy().astype(np.longdouble)
x_train, x_test, y_train, y_test = train_test_split(x, y)
coef = []
plt.close('all')
for k in range(1,15):
    coef.append(polynomial_regression(x_train, y_train, k))
    print('Coefficients for k=', k)
    print_model(coef[k-1])
    # evaluate the fitted polynomial on the held-out points
    y_pred = np.poly1d(coef[k-1][::-1])(x_test)
    y_pred = y_pred.astype(np.longdouble)
    print('R^2 score is:', r2_score(y_test, y_pred))
    fig, ax = plt.subplots(1)
    ax.scatter(x_train, y_train, s = 5, label = 'training data')
    ax.scatter(x_test,y_test, s = 5, color = 'r', label = 'test data')
    ax.set_xlabel("altitude")
    ax.set_ylabel("speed")
    ax.set_title('non-cda, k = {}'.format(k))
    ax.legend()
    plot_poly(x_train,coef[k-1])

Coefficients for k= 1
----------------------------------------------------------------------
 
0.005213 x + 230.3
----------------------------------------------------------------------
R^2 score is: 0.5883757471419282



Coefficients for k= 2
----------------------------------------------------------------------
           2
-3.98e-07 x + 0.01641 x + 166.4
----------------------------------------------------------------------
R^2 score is: 0.5942116782131592



Coefficients for k= 3
----------------------------------------------------------------------
           3             2
1.315e-10 x - 6.114e-06 x + 0.09215 x - 129.7
----------------------------------------------------------------------
R^2 score is: 0.8019547152987716



Coefficients for k= 4
----------------------------------------------------------------------
           4             3             2
1.751e-14 x - 8.928e-10 x + 1.488e-05 x - 0.08388 x + 377.4
----------------------------------------------------------------------
R^2 score is: 0.9108712534454887



Coefficients for k= 5
----------------------------------------------------------------------
            5             4             3            2
-1.846e-18 x + 1.538e-13 x - 4.711e-09 x + 6.52e-05 x - 0.3933 x + 1085
----------------------------------------------------------------------
R^2 score is: 0.9185137019032585



Coefficients for k= 6
----------------------------------------------------------------------
           6             5             4             3             2
-4.92e-22 x + 4.158e-17 x - 1.373e-12 x + 2.247e-08 x - 0.0001918 x + 0.8251 x - 1174
----------------------------------------------------------------------
R^2 score is: 0.9357565447063539



Coefficients for k= 7
----------------------------------------------------------------------
            7             6             5             4             3
-1.254e-25 x + 1.233e-20 x - 4.995e-16 x + 1.081e-11 x - 1.348e-07 x
              2
 + 0.0009678 x - 3.685 x + 5949
----------------------------------------------------------------------
R^2 score is: 0.897566671303789



Coefficients for k= 8
----------------------------------------------------------------------
            8             7             6             5             4
-1.613e-29 x + 1.752e-24 x - 8.036e-20 x + 2.029e-15 x - 3.078e-11 x
              3            2
 + 2.863e-07 x - 0.001588 x + 4.802 x - 5835
----------------------------------------------------------------------
R^2 score is: 0.732767837778199



Coefficients for k= 9
----------------------------------------------------------------------
            9             8             7             6            5
-4.647e-35 x - 1.014e-29 x + 1.417e-24 x - 6.977e-20 x + 1.82e-15 x
              4             3            2
 - 2.813e-11 x + 2.645e-07 x - 0.001477 x + 4.486 x - 5448
----------------------------------------------------------------------
R^2 score is: 0.6974401280028555



Coefficients for k= 10
----------------------------------------------------------------------
           10             9             8             7             6
4.975e-38 x  - 6.954e-33 x + 4.102e-28 x - 1.332e-23 x + 2.593e-19 x
              5             4             3             2
 - 3.061e-15 x + 2.053e-11 x - 5.675e-08 x - 0.0001339 x + 1.274 x - 2114
----------------------------------------------------------------------
R^2 score is: 0.6554476979321356



Coefficients for k= 11
----------------------------------------------------------------------
           11             10             9            8             7
1.282e-41 x  - 1.417e-36 x  + 5.552e-32 x - 4.39e-28 x - 3.813e-23 x
              6             5             4             3           2
 + 1.605e-18 x - 3.173e-14 x + 3.771e-10 x - 2.822e-06 x + 0.01302 x - 33.86 x + 3.813e+04
----------------------------------------------------------------------
R^2 score is: -1.265051660228698



Coefficients for k= 12
----------------------------------------------------------------------
           12             11             10             9             8
1.815e-46 x  - 2.108e-41 x  + 9.629e-37 x  - 1.947e-32 x + 1.732e-29 x
             7             6             5             4             3
 + 8.47e-24 x - 2.257e-19 x + 3.349e-15 x - 3.364e-11 x + 2.367e-07 x
             2
 - 0.001109 x + 3.05 x - 3431
----------------------------------------------------------------------
R^2 score is: -0.44293313559837233



Coefficients for k= 13
----------------------------------------------------------------------
          13             12             11             10             9
2.21e-49 x  - 2.831e-44 x  + 1.444e-39 x  - 3.339e-35 x  + 1.064e-31 x
              8             7             6             5             4
 + 1.316e-26 x - 3.294e-22 x + 3.077e-18 x + 5.896e-15 x - 4.457e-10 x
              3           2
 + 4.878e-06 x - 0.02673 x + 76.22 x - 8.976e+04
----------------------------------------------------------------------
R^2 score is: 0.7839343232744127



Coefficients for k= 14
----------------------------------------------------------------------
            14             13             12             11
-8.988e-54 x  + 8.636e-49 x  - 2.398e-44 x  - 2.316e-40 x 
              10             9             8             7
 + 2.281e-35 x  - 2.176e-31 x - 8.439e-27 x + 2.089e-22 x
              6             5             4             3           2
 + 2.558e-20 x - 6.442e-14 x + 1.153e-09 x - 1.023e-05 x + 0.05133 x - 139.1 x + 1.587e+05
----------------------------------------------------------------------
R^2 score is: -2.982580197814964



In [110]:
# write your code here

# create more cells if needed

# same procedure for the cda flight; print_model, polynomial_regression and
# plot_poly (and the imports they need) are reused from the previous cell
x = df_cda["altitude"].to_numpy().astype(np.longdouble)
y = df_cda["speed"].to_numpy().astype(np.longdouble)
x_train, x_test, y_train, y_test = train_test_split(x, y)
coef = []
plt.close('all')
for k in range(1,15):
    coef.append(polynomial_regression(x_train, y_train, k))
    print('Coefficients for k=', k)
    print_model(coef[k-1])
    # evaluate the fitted polynomial on the held-out points
    y_pred = np.poly1d(coef[k-1][::-1])(x_test)
    y_pred = y_pred.astype(np.longdouble)
    print('R^2 score is:', r2_score(y_test, y_pred))
    fig, ax = plt.subplots(1)
    ax.scatter(x_train, y_train, s = 5, label = 'training data')
    ax.scatter(x_test,y_test, s = 5, color = 'r', label = 'test data')
    ax.set_xlabel("altitude")
    ax.set_ylabel("speed")
    ax.set_title('cda, k = {}'.format(k))
    ax.legend()
    plot_poly(x_train,coef[k-1])

Coefficients for k= 1
----------------------------------------------------------------------
 
0.007348 x + 241.5
----------------------------------------------------------------------
R^2 score is: 0.9634596494081505



Coefficients for k= 2
----------------------------------------------------------------------
           2
9.068e-10 x + 0.007322 x + 241.7
----------------------------------------------------------------------
R^2 score is: 0.9634606805075348



Coefficients for k= 3
----------------------------------------------------------------------
           3             2
2.492e-11 x - 1.081e-06 x + 0.02184 x + 182.7
----------------------------------------------------------------------
R^2 score is: 0.9852801558754326



Coefficients for k= 4
----------------------------------------------------------------------
           4             3             2
6.834e-15 x - 3.704e-10 x + 7.022e-06 x - 0.04726 x + 387.4
----------------------------------------------------------------------
R^2 score is: 0.9410001582307498



Coefficients for k= 5
----------------------------------------------------------------------
           5             4            3             2
5.309e-19 x - 3.193e-14 x + 7.13e-10 x - 7.371e-06 x + 0.04301 x + 174.9
----------------------------------------------------------------------
R^2 score is: 0.9160711389963558



Coefficients for k= 6
----------------------------------------------------------------------
            6            5             4             3             2
-2.864e-22 x + 2.56e-17 x - 9.142e-13 x + 1.664e-08 x - 0.0001623 x + 0.8097 x - 1327
----------------------------------------------------------------------
R^2 score is: 0.9886586000611003



Coefficients for k= 7
----------------------------------------------------------------------
           7             6             5             4             3
4.218e-26 x - 4.572e-21 x + 2.067e-16 x - 5.027e-12 x + 7.073e-08 x
              2
 - 0.0005731 x + 2.473 x - 4095
----------------------------------------------------------------------
R^2 score is: 0.9050598718252871



Coefficients for k= 8
----------------------------------------------------------------------
            8             7             6             5             4
-4.259e-30 x + 5.345e-25 x - 2.882e-20 x + 8.703e-16 x - 1.604e-11 x
              3            2
 + 1.839e-07 x - 0.001276 x + 4.879 x - 7565
----------------------------------------------------------------------
R^2 score is: 0.9831798462318835



Coefficients for k= 9
----------------------------------------------------------------------
           9             8             7            6             5
-2.29e-33 x + 2.923e-28 x - 1.616e-23 x + 5.06e-19 x - 9.865e-15 x
              4             3            2
 + 1.238e-10 x - 9.952e-07 x + 0.004927 x - 13.56 x + 1.603e+04
----------------------------------------------------------------------
R^2 score is: -0.6186575371999821



Coefficients for k= 10
----------------------------------------------------------------------
            10             9             8             7             6
-1.319e-37 x  + 1.718e-32 x - 9.762e-28 x + 3.183e-23 x - 6.594e-19 x
              5             4             3            2
 + 9.095e-15 x - 8.523e-11 x + 5.437e-07 x - 0.002308 x + 6.021 x - 7129
----------------------------------------------------------------------
R^2 score is: -0.04691339592839383



Coefficients for k= 11
----------------------------------------------------------------------
            11             10             9             8             7
-3.302e-41 x  + 4.247e-36 x  - 2.289e-31 x + 6.471e-27 x - 9.067e-23 x
              6             5             4             3           2
 + 5.671e-20 x + 2.113e-14 x - 3.967e-10 x + 3.777e-06 x - 0.02075 x + 62.53 x - 7.987e+04
----------------------------------------------------------------------
R^2 score is: -26.989663188520964



Coefficients for k= 12
----------------------------------------------------------------------
          12             11             10             9             8
5.36e-46 x  - 5.897e-41 x  + 2.601e-36 x  - 6.329e-32 x + 1.456e-27 x
              7             6             5             4             3
 - 5.235e-23 x + 1.673e-18 x - 3.465e-14 x + 4.552e-10 x - 3.793e-06 x
            2
 + 0.01943 x - 55.71 x + 6.853e+04
----------------------------------------------------------------------
R^2 score is: -18.29031474411296



Coefficients for k= 13
----------------------------------------------------------------------
           13             12             11             10
1.121e-50 x  - 8.436e-46 x  + 1.159e-41 x  + 5.404e-37 x 
              9             8             7             6             5
 - 1.238e-32 x - 2.275e-28 x + 3.911e-24 x + 3.249e-19 x - 1.299e-14 x
             4            3           2
 + 2.22e-10 x - 2.13e-06 x + 0.01189 x - 36.08 x + 4.618e+04
----------------------------------------------------------------------
R^2 score is: -0.47975012974805953



Coefficients for k= 14
----------------------------------------------------------------------
           14            13             12             11
4.852e-55 x  - 3.62e-50 x  + 5.816e-46 x  + 1.602e-41 x 
              10             9             8             7
 - 4.487e-37 x  - 4.256e-33 x + 2.492e-28 x - 7.158e-24 x
              6             5             4            3            2
 + 3.086e-19 x - 8.989e-15 x + 1.494e-10 x - 1.46e-06 x + 0.008353 x - 25.88 x + 3.376e+04
----------------------------------------------------------------------
R^2 score is: -9.122703716699958



In [108]:
#Regularization
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# regularization is studied on the non-cda flight; rebuild the split here so the
# cell does not depend on which of the cells above ran last
x = df_noncda["altitude"].to_numpy().astype(np.longdouble)
y = df_noncda["speed"].to_numpy().astype(np.longdouble)
x_train, x_test, y_train, y_test = train_test_split(x, y)

alphas = [0, 1e10, 1e20, 1e40, 1e60]
X_train = x_train.reshape(-1, 1)
X_test = x_test.reshape(-1, 1)
k=14

fig, axes = plt.subplots(1, len(alphas), figsize=(18, 5))
for ax, alpha in zip(axes.flatten(), alphas):
    # PolynomialFeatures already adds the constant column, hence fit_intercept=False
    model = make_pipeline(PolynomialFeatures(k), Ridge(alpha=alpha, fit_intercept=False))
    model.fit(X_train, y_train)

    coef = model['ridge'].coef_

    RMSE_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train))).round(2)
    RMSE_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test))).round(2)

    ax.scatter(x_train, y_train, s = 5, label = 'training data')
    ax.scatter(x_test,y_test, s = 5, color = 'r', label = 'test data')
    ax.set_xlabel("altitude")
    ax.set_ylabel("speed")
    # evaluate the fitted polynomial on a regular grid
    x_ = np.linspace(min(x), max(x), 100)
    y_ = np.zeros(len(x_))
    for n, c in enumerate(coef):
        y_ += c * x_**n
    ax.plot(x_, y_)
    ax.set_title('$\\alpha$={:.1e} \n RMSE_train:{} \n RMSE_test:{}'.format(alpha, RMSE_train, RMSE_test))
    ax.legend(loc="upper left")
plt.suptitle('Ridge regression')
plt.tight_layout()
plt.show()



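Without regularization, the Goldilocks zone for the non-CDA flight seems to be around k = 6, where the test R^2 score peaks at about 0.94. Lower orders underfit (R^2 is about 0.59 for k = 1), and beyond k = 6 the score drops again and even becomes negative at k = 11, which points to overfitting and possibly to numerical problems in our own solver.

With regularization at k = 14, an alpha around 1e20 gives the lowest RMSE, and the test error is close to the training error, which is a good indication that overfitting is reduced.

The result for alpha = 0 differs from our own unregularized k = 14 fit, which looks odd at first.
A likely explanation is that the two fits are computed differently: the first uses our own normal-equation solution with an explicit matrix inverse, whereas the second uses scikit-learn's `Ridge` solver.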
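
One numerical ingredient of such differences between solvers can be illustrated with the condition number of $X^TX$, the matrix that is inverted in a normal-equation solution. The sketch below uses made-up altitude values (not the flight data) and a hypothetical rescaling of the input to [-1, 1]; it only compares condition numbers and does not fit anything.

```python
import numpy as np

altitude = np.linspace(2000.0, 25000.0, 60)  # made-up altitude samples
k = 6

# Vandermonde matrix built from the raw altitude values
X_raw = np.vander(altitude, k + 1, increasing=True)

# map altitude linearly onto [-1, 1] before taking powers
scaled = 2.0 * (altitude - altitude.min()) / (altitude.max() - altitude.min()) - 1.0
X_scaled = np.vander(scaled, k + 1, increasing=True)

cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)

print("cond(X^T X), raw altitude:    {:.3e}".format(cond_raw))
print("cond(X^T X), scaled altitude: {:.3e}".format(cond_scaled))
```

A very large condition number means that rounding errors are strongly amplified when the matrix is inverted, so two mathematically equivalent solvers can return visibly different coefficients.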

# Task 4: Logistic regression

In this task you will learn how to apply a logistic regression model. You need to generate a new dataset based on the given data. To complete the task, follow the steps below:

1. For all trajectories in the CDA and NON-CDA groups, apply linear regression, using time as input and altitude as output.

1. Calculate the MAE for all regression models. Construct a dataset with MAE as input, and CDA status as output (CDA as 0, and NON-CDA as 1).

1. Determine the logistic regression model that describes the relationship between MAE and CDA status.
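
The last step can be sketched as follows. The MAE values are synthetic (drawn from two made-up ranges), so the fitted coefficients are purely illustrative; the sketch uses scikit-learn's `LogisticRegression`, which applies L2 regularization by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(3)

# synthetic MAE values: small for "CDA" (label 0), larger for "non-CDA" (label 1)
mae_cda = rng.uniform(50, 400, size=100)
mae_noncda = rng.uniform(300, 1500, size=100)

X = np.concatenate([mae_cda, mae_noncda]).reshape(-1, 1)
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

b0 = clf.intercept_[0]
b1 = clf.coef_[0, 0]
boundary = -b0 / b1          # MAE at which P(non-CDA) = 0.5
accuracy = clf.score(X, y)   # fraction of correctly classified flights

print("P(non-CDA | MAE) = 1 / (1 + exp(-({:.3f} + {:.5f} * MAE)))".format(b0, b1))
print("decision boundary at MAE =", boundary)
print("training accuracy:", accuracy)
```

The decision boundary gives a direct reading of the model: flights whose linear fit has an MAE above this value are classified as non-CDA.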

In [None]:
# write your code here

# create more cells if needed


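from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def calcMAE(df):
    """Fit altitude = b0 + b1 * time for one flight and return the MAE of the fit."""
    t = df["time"].to_numpy().reshape(-1, 1)
    alt = df["altitude"].to_numpy()

    # fit linear model to time and altitude
    model = LinearRegression().fit(t, alt)

    # calculate MAE on the same points the model was fitted to
    MAE = mean_absolute_error(alt, model.predict(t))

    return MAE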


new_data = []

for f in cda_files:
    df = pd.read_csv(f)
    MAE = calcMAE(df)
    new_data.append((MAE, 0))

for f in noncda_files:
    df = pd.read_csv(f)

    MAE = calcMAE(df)
    new_data.append((MAE, 1))

In [None]:
# write your logistic regression code here

(this is a markdown cell)

write your analysis here






# Task 5: Bayesian regression

In this task you will learn how to apply a Bayesian regression model. I recommend using the `pymc3` library. To complete the task, follow the steps below:

1. Apply Bayesian linear regression to vertical speed of CDA and Non-CDA trajectories (time as input). Provide an analysis of your result.

1. **(Bonus)** Design a quadratic model to altitude using the Bayesian regression approach. Visualize and analyze your findings.
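
The idea behind Bayesian linear regression can be sketched without a sampler. The block below is not `pymc3`: it assumes a Gaussian prior on the coefficients and a known noise standard deviation, in which case the posterior is Gaussian and available in closed form. The data, the prior scale and the noise level are all made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(4)

# synthetic vertical rate that drifts slowly with time, plus noise
t = np.linspace(0.0, 1000.0, 50)
sigma = 200.0  # assumed known noise standard deviation
y = -1500.0 + 0.5 * t + rng.normal(scale=sigma, size=t.size)

# design matrix with an intercept column; time is rescaled to [0, 1]
X = np.column_stack([np.ones_like(t), t / t.max()])

# Gaussian prior on the coefficients: N(0, prior_std^2 * I)
prior_std = 5000.0
prior_precision = np.eye(2) / prior_std**2

# conjugate update: the posterior is N(post_mean, post_cov)
post_cov = np.linalg.inv(prior_precision + X.T @ X / sigma**2)
post_mean = post_cov @ (X.T @ y) / sigma**2
post_std = np.sqrt(np.diag(post_cov))

print("posterior mean (intercept, change over the full time span):", post_mean)
print("posterior std:", post_std)
```

Unlike an ordinary least-squares fit, the result is a distribution over the coefficients, so `post_std` directly expresses how uncertain the intercept and slope are. With a sampler such as `pymc3`, the noise standard deviation can also be given a prior instead of being fixed.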



In [10]:
# write your code here
!pip install pymc3
# create more cells if needed

# Tip: try different prior probability density functions of parameters. If the regression fails:
#   1. change the initial guess.
#   2. change the variance for the priors of the random variables



xarray 0.19.0 has requirement pandas>=1.0, but you'll have pandas 0.24.2 which is incompatible.


(this is a markdown cell)

write your analysis here






In [11]:
import pymc3 as pm

