# Supervised learning methods on Kickstarter data

This notebook applies basic supervised learning methods for regression and classification to a preprocessed version of the Kickstarter dataset, available at https://www.kaggle.com/kemical/kickstarter-projects

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [6]:
%matplotlib inline

# basic settings
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:,.2f}'.format
plt.rcParams['figure.figsize'] = (16, 12)

## 1. Regression

_Explain which regression problem you have chosen to solve. The explanation should state which variable is predicted from which other variables and briefly discuss appropriate feature transformations, such as one-of-K coding._  

Answer here.
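
As a small illustration of one-of-K coding, the sketch below encodes two categorical columns of a toy frame with `pd.get_dummies`. The column names `category` and `country` mirror the prefixes seen in the cleaned data, but the values are invented, and this is not necessarily how the cleaned file was produced.

```python
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "usd_goal_real": [1533.95, 30000.0, 45000.0],
    "category": ["Poetry", "Film", "Film"],
    "country": ["GB", "US", "US"],
})

# One-of-K (one-hot) coding: each categorical level becomes its own 0/1 column.
encoded = pd.get_dummies(toy, columns=["category", "country"], dtype=int)
print(encoded.columns.tolist())
```

When the encoded columns feed a linear model with an intercept, passing `drop_first=True` to `pd.get_dummies` removes one level per attribute and avoids perfectly collinear columns.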


### 1.1 Linear regression with forward selection
_Explain how a new data observation is predicted according to the linear model estimated by forward selection, that is, what effect each selected attribute has on the prediction._  
_(Note that interpreting the magnitude of the estimated coefficients in general requires that each attribute be normalized prior to the analysis.)_
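
The sketch below shows one possible way to do forward selection and then predict a new observation, using only NumPy on synthetic data. The helper names `cv_mse` and `forward_select`, the 5-fold split, and the stopping rule (stop when no attribute lowers the cross-validated error) are assumptions made for illustration, not the course's reference implementation.

```python
import numpy as np

def cv_mse(X, y, n_splits=5, seed=0):
    """Average test MSE of least-squares regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test_idx in np.array_split(idx, n_splits):
        train_idx = np.setdiff1d(idx, test_idx)
        # Prepend a column of ones so the model has an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        w, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors.append(np.mean((A_test @ w - y[test_idx]) ** 2))
    return float(np.mean(errors))

def forward_select(X, y, n_splits=5):
    """Greedily add the attribute that lowers the CV error most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best_err = cv_mse(X[:, selected], y, n_splits)  # intercept-only model
    while remaining:
        err, j = min((cv_mse(X[:, selected + [j]], y, n_splits), j) for j in remaining)
        if err >= best_err:
            break
        selected.append(j)
        remaining.remove(j)
        best_err = err
    return selected, best_err

# Synthetic data (assumption): only attributes 0 and 2 influence the target.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 5))
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 2] + rng.normal(scale=0.5, size=200)

# Standardize so coefficient magnitudes are comparable across attributes.
mu, sd = X_raw.mean(axis=0), X_raw.std(axis=0)
X = (X_raw - mu) / sd

selected, err = forward_select(X, y)

# Refit on all data using the selected attributes only.
A = np.column_stack([np.ones(len(y)), X[:, selected]])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction for a new observation: intercept plus the weighted sum of its
# standardized selected attributes.
x_new = (rng.normal(size=5) - mu) / sd
y_hat = w[0] + x_new[selected] @ w[1:]
print(selected, w.round(2), round(float(y_hat), 2))
```

Each coefficient in `w[1:]` is the change in the predicted target per one standard deviation of the corresponding attribute, which is why the attributes are standardized before fitting.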

In [10]:
# import cleaned data
# first column contains old indices and is not imported
df = pd.read_csv('kickstarter-projects-cleaned_data.csv', usecols=range(1,229))

In [13]:
df

Unnamed: 0,state,backers,usd_pledged_real,usd_goal_real,category_3D Printing,category_Academic,category_Accessories,category_Action,category_Animals,category_Animation,...,country_JP,country_LU,country_MX,"country_N,0""",country_NL,country_NO,country_NZ,country_SE,country_SG,country_US
0,0,0,0.00,1533.95,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,15,2421.00,30000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,3,220.00,45000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,1,1.00,5000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1,224,52375.00,50000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
331205,0,4,154.00,6500.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
331206,0,5,155.00,1500.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
331207,0,1,20.00,15000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
331208,0,6,200.00,15000.00,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### 1.2 Artificial neural network

_Fit an artificial neural network (ANN) model to the data. Select a __relevant complexity-controlling parameter (such as the number of hidden neurons)__ and apply __two-level cross-validation to both optimize the parameter and estimate the generalization error.__ Recall that in two-level cross-validation, each inner loop selects an "optimal" value of the parameter, and the generalization error for that selected value is then estimated on the outer test fold (this is repeated in each fold of the outer loop)._  
_We want you to produce a __table (or graph) that shows, for each iteration of the outer loop, the selected value of the parameter $s^*$ and the corresponding value of the generalization error ($E^{\text{test}}_i$)__._  
_Finally, compute the __two-level cross-validation estimate of the generalization error__ and include it in the report._  

_Note that for this and the subsequent cross-validation questions, it is vital that you compute the error as the average of the errors on your test set (and not the sum). This is important because you want the errors to have comparable absolute magnitudes, independent of the number of test observations._
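
The two-level procedure can be sketched generically as below. This is a minimal illustration under stated assumptions: `two_level_cv`, `k_fold`, and the `fit_predict(X_train, y_train, X_test, s)` callable are hypothetical names, the data are synthetic, and ridge regression is used as a lightweight stand-in for the ANN so that the example runs with NumPy alone. For the actual task, `fit_predict` would train a network with `s` hidden neurons instead.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def k_fold(n, n_splits, rng):
    """Return (train_idx, test_idx) pairs for a shuffled K-fold split."""
    folds = np.array_split(rng.permutation(n), n_splits)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(n_splits)]

def two_level_cv(X, y, candidates, fit_predict, k_outer=5, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    rows = []  # one (s_star, E_test, n_test) entry per outer fold
    for train, test in k_fold(len(y), k_outer, rng):
        X_par, y_par = X[train], y[train]
        inner = k_fold(len(train), k_inner, rng)
        # Inner loop: mean validation error of each candidate value s
        # (inner folds are near-equal in size, so a plain mean is used).
        val_err = [np.mean([mse(y_par[va], fit_predict(X_par[tr], y_par[tr], X_par[va], s))
                            for tr, va in inner])
                   for s in candidates]
        s_star = candidates[int(np.argmin(val_err))]
        # Outer loop: retrain with s_star on the whole outer training set.
        e_test = mse(y[test], fit_predict(X_par, y_par, X[test], s_star))
        rows.append((s_star, e_test, len(test)))
    # Average (not sum) of the test errors, weighted by outer test-fold size.
    n_test = np.array([r[2] for r in rows])
    e_gen = float(np.sum(n_test * np.array([r[1] for r in rows])) / n_test.sum())
    return rows, e_gen

def ridge_fit_predict(X_train, y_train, X_test, lam):
    """Stand-in model (NOT an ANN): ridge regression with strength lam."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Z_train, Z_test = (X_train - mu) / sd, (X_test - mu) / sd
    y_mean = y_train.mean()
    gram = Z_train.T @ Z_train + lam * np.eye(Z_train.shape[1])
    w = np.linalg.solve(gram, Z_train.T @ (y_train - y_mean))
    return y_mean + Z_test @ w

# Synthetic data (assumption), not the Kickstarter data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

candidates = [0.01, 1.0, 100.0]
rows, e_gen = two_level_cv(X, y, candidates, ridge_fit_predict)
for i, (s_star, e_test, _) in enumerate(rows, start=1):
    print(f"outer fold {i}: s* = {s_star}, E_test = {e_test:.3f}")
print(f"two-level CV estimate of the generalization error: {e_gen:.3f}")
```

The printed lines correspond to the requested table: one selected parameter value and one test error per outer fold, followed by the size-weighted average of those test errors.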

### 1.3 Performance comparison with credibility intervals

_Statistically evaluate whether there is a significant performance difference between the fitted ANN and linear regression models, as well as a baseline, using the credibility-interval method discussed in the lecture notes (Example 2 in section 9.4.3)._

_Recall that to accomplish this, the linear regression model and the ANN need to be evaluated on the same cross-validation splits. Therefore, re-use the outermost cross-validation splits from the previous section and compute the test errors of the linear regression model on those splits. In addition, test whether your models perform better than a simple baseline that predicts the output to be the average of the training data. In other words, complete three tests (linear regression vs. ANN, linear regression vs. average output, and ANN vs. average output) and discuss your findings._
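
A minimal sketch of such a comparison follows. It assumes the method in the lecture notes is the paired Student-t interval on per-fold error differences, which should be checked against Example 2 in section 9.4.3. The function names and the per-fold errors are hypothetical placeholders, not results from the Kickstarter data. The critical value is passed in by hand to keep the example NumPy-only; with SciPy installed it can be obtained from `scipy.stats.t.ppf(0.975, K - 1)`.

```python
import numpy as np

def baseline_mse(y_train, y_test):
    """Test MSE of the baseline that always predicts the training mean."""
    return float(np.mean((y_test - np.mean(y_train)) ** 2))

def paired_interval(err_a, err_b, t_crit):
    """Interval for the mean per-fold error difference (model A minus model B).

    err_a and err_b must come from the SAME outer cross-validation splits.
    t_crit is the Student-t quantile for K - 1 degrees of freedom.
    """
    z = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    k = len(z)
    z_mean = z.mean()
    # Standard error of the mean difference over K folds.
    se = np.sqrt(np.sum((z - z_mean) ** 2) / (k * (k - 1)))
    return float(z_mean - t_crit * se), float(z_mean + t_crit * se)

# Placeholder per-fold test errors for K = 5 outer folds (invented numbers).
err_model_a = [1.0, 1.2, 0.9, 1.1, 1.0]
err_model_b = [2.0, 2.1, 1.8, 2.2, 1.9]

# 97.5% quantile of the Student-t distribution with 4 degrees of freedom.
T_CRIT_K5 = 2.776

lo, hi = paired_interval(err_model_a, err_model_b, T_CRIT_K5)
print(f"95% interval for E_A - E_B: [{lo:.3f}, {hi:.3f}]")
if lo > 0 or hi < 0:
    print("Interval excludes zero: the difference is significant at this level.")
else:
    print("Interval contains zero: no significant difference at this level.")
```

The same function covers all three comparisons by passing the per-fold errors of the relevant pair, with `baseline_mse` supplying the errors of the average-output baseline on each outer fold.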

## 2. Classification