# REGRESSION - PREDICT CONTINUOUS NUMBER
Dependent variable y (outcome we're trying to predict) is continuous  

Goal: learn to predict a target (y) given input (X).  

Examples:  
-  Predicting mm of rainfall tomorrow  
-  Predicting the value of Google’s stock price  

## **Properties**  
- Output type: Continuous Number        
- What are you trying to find: 'Best Fit Line'    
- Evaluation: Sum of Squared Error or R-squared

## Linear v Non-Linear Problem
Linear: 
- Simple or 
- Multiple Linear Regression  

Non-linear:   
- Polynomial Regression, SVR, Decision Tree or Random Forest  
- k-Fold Cross Validation and then pick model with the best results  
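
As a rough illustration of "cross-validate, then pick the model with the best results", the sketch below scores two candidate regressors with 5-fold cross-validation. The synthetic sine-shaped data, the choice of candidates, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, clearly non-linear data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Same folds for every candidate so the comparison is fair
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mean_r2[name] = scores.mean()

best_model_name = max(mean_r2, key=mean_r2.get)
print(mean_r2, "best:", best_model_name)
```

On data this curved, the non-linear model should come out ahead; on roughly linear data the simpler model may well win.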

## Outliers  
### **What causes outliers?**
- Sensor Malfunction (Ignore)
- Data Entry Errors (Ignore)
- Freak Events (Pay Attention)  

### **Outlier Rejection Loop**    
Fit the regression on all training points, discard the 10% of points that have the largest residual errors, and refit on the remaining points. 

**1. Train**

**2. Remove points with the highest residual error after training**
- residual error = error a data point has after you fit the best possible line  
- approx. 10% of data points, but may vary from application to application  

**3. Re-train on the remaining points**

The approach works for any machine learning algorithm.  
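
The three steps can be sketched as follows. This is a minimal example on synthetic data with a few injected outliers; the 10% cutoff, the data, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus noise, with 5 gross outliers injected
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)
y[:5] += 40.0

# 1. Train on all points
model = LinearRegression().fit(X, y)

# 2. Drop the ~10% of points with the largest residual error
residuals = np.abs(y - model.predict(X))
n_keep = int(len(y) * 0.9)
keep_idx = np.argsort(residuals)[:n_keep]

# 3. Re-train on the remaining points
cleaned_model = LinearRegression().fit(X[keep_idx], y[keep_idx])

print("before:", model.coef_[0], model.intercept_)
print("after: ", cleaned_model.coef_[0], cleaned_model.intercept_)
```

The loop can be repeated, but each pass discards data, so the cutoff is worth tuning per application.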

## Coding  
### **Split > Fit > Predict > Score It**  
Fit aka 'train'  
Algorithms that can learn from observational data and make predictions based on it  

### **Using 2 Main Functions**   
fit(X, y) (scikit-learn's name for 'train')  
predict(X)

### **Data Types and Shapes**    
scikit-learn requires everything to be numerical  

X is a **MATRIX** of shape n x d    
X = dataset[['feature1','feature2','feature3']]
- n = # of samples (observations)
- d = number of features (dimensions)

y is a **VECTOR** of length n (shape (n,); a column of shape n x 1 is also accepted by many estimators)  
y = dataset['target']
- for a regression y will be float values  
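
A small sketch of these shapes, assuming the data lives in a pandas DataFrame. The column names simply mirror the placeholders used above, and the values are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset; values are invented for the example
dataset = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [0.5, 0.1, 0.9, 0.3],
    "feature3": [10.0, 20.0, 15.0, 30.0],
    "target": [1.2, 2.3, 3.1, 4.8],
})

X = dataset[["feature1", "feature2", "feature3"]]  # 2-D: (n samples, d features)
y = dataset["target"]                               # 1-D: (n samples,)
print(X.shape, y.shape)

model = LinearRegression()
model.fit(X, y)                  # the 'train' step
predictions = model.predict(X)   # one float prediction per row of X
print(predictions.shape)
```

Note the double brackets when selecting X: they keep it two-dimensional even when only one feature is selected.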

### Generalization  
- Generalization: predict accurately not only for data we trained on but new data we haven’t seen before  
- Usually split data into train/test sets to get an idea of how well a model will generalize  

## **EVALUATION**    

### **Sum of the Squared Error**   
Can compare SSE to figure out which line is being fit better

Drawbacks:  
- As you add more data, the SSE will almost certainly go up, but that doesn't necessarily mean that your fit is doing a worse job.  
- So if you're comparing two data sets with a different number of points, this can be a big problem: the SSE depends on the number of data points you're using, even when the fit is perfectly fine.  
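
To make the drawback concrete, the sketch below fits the same kind of line to 100 and to 1000 points drawn from the same noisy process (synthetic data and the helper `sse_of_fit` are assumptions for illustration). The SSE grows roughly tenfold while the per-point error stays about the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sse_of_fit(n_points):
    """Fit a straight line to n_points noisy samples and return its SSE."""
    x = rng.uniform(0, 10, size=n_points)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=n_points)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

sse_small = sse_of_fit(100)
sse_large = sse_of_fit(1000)

print("SSE:          ", sse_small, sse_large)
print("SSE per point:", sse_small / 100, sse_large / 1000)
```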

### **R-squared (Coefficient of Determination)**    
Answers the question 'how much of my change in the output (y) is explained by the change in my input (x)'?
- % of the total variation in dependent variable y is captured by the model  
- Typically ranges from 0 to 1 (it can go negative when the model predicts worse than always guessing the mean of y)  
- independent of the number of training points in the data set (so more reliable than SSE)  

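import numpy as np    

def compute_r_squared(data, predictions):    
    # data and predictions are NumPy arrays of the same length    
    SST = ((data - np.mean(data)) ** 2).sum()    # total sum of squares    
    SSRes = ((data - predictions) ** 2).sum()    # residual sum of squares    
    r_squared = 1 - SSRes / SST    
    return r_squared    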
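
As a sanity check, the formula R-squared = 1 - (residual sum of squares) / (total sum of squares) can be compared with scikit-learn's `r2_score`. The four sample values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Manual computation of R-squared
ss_total = np.sum((y_true - y_true.mean()) ** 2)
ss_residual = np.sum((y_true - y_pred) ** 2)
manual_r2 = 1 - ss_residual / ss_total

# Library computation for comparison
library_r2 = r2_score(y_true, y_pred)
print(manual_r2, library_r2)

# A deliberately bad model: always predict 10, far from the mean of y_true
bad_r2 = r2_score(y_true, np.full(4, 10.0))
print(bad_r2)
```

The last value is negative, which shows that the score is not strictly bounded below by 0.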

---
 
# **1. SIMPLE LINEAR REGRESSION - - LINEAR**    
Fit straight line to dataset of observations and use this line to predict unobserved values  
One dimensional regression (only one input variable)  

## **Sum of Squared Error (SSE)**  
For supervised learning, we know there is a cost function to minimize.  
The best regression is the one that minimizes the sum of the squared errors.  
SSE = Sum[(actual value - value predicted by the regression line)^2]   

There can be multiple lines that minimize the sum of the absolute errors, but only one line will minimize the sum of the squared errors! So using SSE also makes implementation much easier.
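
A contrived example of this: take four points at the corners of a unit square. Every horizontal line y = b with b between 0 and 1 has the same total absolute error, while the squared error singles out b = 0.5. The points and helper names are assumptions chosen for illustration.

```python
import numpy as np

# Four points at the corners of a unit square
x = np.array([0.0, 0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

def abs_error(m, b):
    """Sum of absolute errors for the line y = m*x + b."""
    return float(np.sum(np.abs(y - (m * x + b))))

def squared_error(m, b):
    """Sum of squared errors for the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))

# Three different horizontal lines
for b in (0.25, 0.5, 0.75):
    print(b, abs_error(0.0, b), squared_error(0.0, b))
```

All three lines tie on absolute error (2.0 each), but only b = 0.5 reaches the lowest squared error (1.0 versus 1.25).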

So in the equation y = mx + b, we want to find the m and b that minimize SSE.    
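
For a single input variable, the m and b that minimize SSE have a well-known closed form: m = Sum[(x - mean(x)) * (y - mean(y))] / Sum[(x - mean(x))^2] and b = mean(y) - m * mean(x). A minimal sketch on made-up data, cross-checked against `np.polyfit`:

```python
import numpy as np

# Small made-up data set, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

# Closed-form least-squares slope and intercept
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Reference fit of a degree-1 polynomial (returns slope first, then intercept)
m_ref, b_ref = np.polyfit(x, y, deg=1)

print(m, b)
print(m_ref, b_ref)
```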


## Algorithms to Minimize SSE
There are several algorithms that minimize the sum of the squared errors in a regression.   
Estimate the coefficients of our linear model, usually using:
1. Ordinary Least Squares
2. Gradient Descent

### **Ordinary Least Squares**   
Used in sklearn LinearRegression  
Gives the same result as maximum likelihood estimation when the errors are assumed to be normally distributed  
The least squares method can effectively fit linear models since it only requires matrix algebra and provides deterministic estimates of the coefficients.     
 
Minimizes the squared-error between each point and the line    
- Error is just the vertical distance between each point and the line  
- Sum up all those squared errors  
- Measuring the variance of the data points from that line  
- By minimizing the variance, find the line that fits best

**Pros:**  
Works on data sets of any size that fit in memory, gives information about relevance of features  
Is always guaranteed to find the optimal solution when performing linear regression

**Cons:**  
Linear regression assumptions  
Least squares directly minimizes the sum of squared errors algebraically. Often we have too much data to fit into memory, and then we can't use least squares. 
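
A minimal sketch of the "matrix algebra" route, using NumPy's least-squares solver on synthetic data and comparing it with sklearn's `LinearRegression`. The data and the explicit column of ones for the intercept are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 1.5*x1 - 2*x2 plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add a column of ones so the intercept is estimated like any other coefficient
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Same fit through scikit-learn
reg = LinearRegression().fit(X, y)

print("lstsq:  ", beta)
print("sklearn:", reg.intercept_, reg.coef_)
```

Both routes are deterministic: running them again on the same data gives the same coefficients.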

### **Gradient Descent**  
Alternative method to least squares  
Gradient descent is a general method that can be used to estimate the coefficients of nearly any model, including linear models.   
At its core, gradient descent minimizes the residuals in the estimated model by repeatedly updating each coefficient in the direction opposite to its gradient.  
Looks for the global minimum


**Pros:**
- Works best with 3D data  
- Iterates to find line that best follows the contours defined by the data  
- Easy to try in Python and compare to Least Squares 

**Cons:**  
- Not always guaranteed to find the optimal value. Usually Least Squares is a perfectly good choice  
- The cost function may have numerous local minima, and gradient descent can become trapped in one that is not the global minimum. (The SSE cost of a linear model is convex, so this mainly affects other kinds of models.) A common workaround is to run gradient descent several times with different random initial values of theta (remember to seed the random values for repeatability).   
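
A minimal sketch of gradient descent for y = mx + b on synthetic data. It minimizes the mean squared error (SSE divided by n, which has the same minimizer); the learning rate, iteration count, and starting values are assumptions picked for this example, not recommended defaults.

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

m, b = 0.0, 0.0          # initial guess
learning_rate = 0.5
for _ in range(2000):
    error = (m * x + b) - y
    grad_m = 2.0 * np.mean(error * x)   # d(MSE)/dm
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    m -= learning_rate * grad_m         # step against the gradient
    b -= learning_rate * grad_b

# Least-squares solution for comparison
m_ols, b_ols = np.polyfit(x, y, deg=1)
print(m, b)
print(m_ols, b_ols)
```

Because the cost surface of a linear model has a single minimum, the iterations here end up at the least-squares answer; a learning rate that is too large would make them diverge instead.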
 
### Coding

#Assumes ages and net_worths are arrays of shape (n, 1)  
import matplotlib.pyplot as plt  

#Split  
from sklearn.model_selection import train_test_split  
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.20, random_state=0)  

#Fit  
from sklearn.linear_model import LinearRegression  
reg = LinearRegression()  
reg.fit(ages_train, net_worths_train)  

#Predict (predict expects a 2-D input: one row per sample)  
print("Katie's net worth prediction:", reg.predict([[27]]))  

#Score It    
print('r-squared score:', reg.score(ages_test, net_worths_test))  
print('slope:', reg.coef_)    
print('intercept:', reg.intercept_)   

#Visualization  
plt.scatter(ages, net_worths)  
plt.plot(ages, reg.predict(ages), color='blue', linewidth=3)  
plt.xlabel('ages')  
plt.ylabel('net worths')  
plt.show()
  

---


## **2. MULTIPLE LINEAR REGRESSION - - LINEAR**  
Sometimes loosely called Multi-Variate Regression (strictly, that term refers to models with more than one output variable)  
More than one input variable  
Fits one coefficient per feature: y = b0 + b1*x1 + b2*x2 + ... (still using least squares)     


### Pros:   
- Works on any size of data set, gives information about relevance of features    
- Can still measure fit with r-squared   

### Cons: 
- Linear regression assumptions
- Need to normalize features before you can compare coefficients in a meaningful way  
- Can’t use categorical data directly; it has to be encoded as numbers first (ordinal data can be used as-is) 
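
The point about normalizing features can be sketched as follows. The two hypothetical features (`size_sqft`, `n_rooms`) and the synthetic target are assumptions for illustration: on raw data the small-scale feature gets the larger coefficient, and only after standardizing do the coefficients reflect how much each feature actually moves the prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
size_sqft = rng.uniform(500, 3000, size=n)            # large-scale feature
n_rooms = rng.integers(1, 6, size=n).astype(float)    # small-scale feature (1 to 5)
X = np.column_stack([size_sqft, n_rooms])
y = 100.0 * size_sqft + 5000.0 * n_rooms + rng.normal(scale=1000.0, size=n)

# Raw features: coefficients are in "target units per feature unit"
raw = LinearRegression().fit(X, y)

# Standardized features: coefficients are in "target units per standard deviation"
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
scaled_coefs = scaled.named_steps["linearregression"].coef_

print("raw coefficients:   ", raw.coef_)
print("scaled coefficients:", scaled_coefs)
print("r-squared (same fit either way):", raw.score(X, y), scaled.score(X, y))
```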

---

## **3.	Polynomial Regression - - NON LINEAR**  
Linear formula y = mx + b is a first degree polynomial; polynomial regression adds higher powers of x (e.g. y = ax^2 + bx + c)   

### Pros:  
- Works on any size of dataset, works very well on non-linear problems    
- Higher orders produce more complex curves    
- Can still measure fit with r-squared    

### Cons:   
- Need to choose the right polynomial degree for a good bias/variance tradeoff  
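
One way to choose the degree is to compare cross-validated scores. The sketch below does this with sklearn's `PolynomialFeatures` on synthetic quadratic data; the data, the candidate degrees, and the 5-fold setup are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data generated from a quadratic curve plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 2.0 + rng.normal(scale=0.2, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = {}
for degree in (1, 2, 3):
    # Expand x into [1, x, x^2, ...] and fit an ordinary linear regression on top
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv_r2[degree] = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

print(cv_r2)
```

A straight line (degree 1) underfits this data, while degree 2 matches the shape that generated it.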

---

## **4.	Support Vector Regression- - NON LINEAR**  
### Pros:  
- Easily adaptable  
- Works very well on non-linear problems  
- Not biased by outliers  

### Cons:  
- Compulsory to apply feature scaling  
- Not well known  
- More difficult to understand  
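
A sketch of why feature scaling matters for SVR, using sklearn's `SVR` with an RBF kernel on synthetic data. The two features (one small-scale and informative, one large-scale and irrelevant) are an assumption constructed to make the effect visible.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
informative = rng.uniform(0, 1, size=n)      # small-scale feature that drives y
irrelevant = rng.uniform(0, 1000, size=n)    # large-scale feature unrelated to y
X = np.column_stack([informative, irrelevant])
y = np.sin(2 * np.pi * informative) + rng.normal(scale=0.05, size=n)

# Without scaling, distances are dominated by the large-scale feature
unscaled_svr = SVR(kernel="rbf").fit(X, y)

# With scaling, both features contribute on a comparable footing
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X, y)

unscaled_r2 = unscaled_svr.score(X, y)
scaled_r2 = scaled_svr.score(X, y)
print(unscaled_r2, scaled_r2)
```

Putting the scaler in a pipeline means the same scaling is applied automatically at prediction time.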


---

## **5.	Decision Tree Regression- - NON LINEAR**  
### Pros:  
- Interpretability  
- No need for feature scaling  
- Works on linear / nonlinear problems  

### Cons:  
- Poor results on too small datasets  
- Overfitting can easily occur  
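
The overfitting risk can be seen by comparing train and test scores of an unrestricted tree. This is a sketch on synthetic noisy data; the `max_depth=4` alternative is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Unrestricted tree: grows until it reproduces the training data, noise included
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Depth-limited tree: forced to average over neighbouring points
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

deep_train_r2 = deep_tree.score(X_train, y_train)
deep_test_r2 = deep_tree.score(X_test, y_test)
shallow_test_r2 = shallow_tree.score(X_test, y_test)
print(deep_train_r2, deep_test_r2, shallow_test_r2)
```

The unrestricted tree scores essentially perfectly on the training set and noticeably lower on the test set. On data like this a depth-limited tree often generalizes better, though that is not guaranteed.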

---

## **6.	Random Forest Regression- - NON LINEAR**  
### Pros:  
- Powerful and accurate  
- Good performance on many problems, including non-linear ones  

### Cons:  
- No interpretability  
- Overfitting can easily occur  
- Need to choose number of trees  
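
A sketch of how the number of trees affects a random forest, using sklearn's `RandomForestRegressor` on synthetic data. The data and the tree counts tried are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with one non-linear and one linear effect
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

test_r2_by_trees = {}
for n_trees in (1, 10, 100):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    test_r2_by_trees[n_trees] = forest.score(X_test, y_test)

# Importances of the last (100-tree) forest; they sum to 1
importances = forest.feature_importances_

print(test_r2_by_trees)
print(importances)
```

Averaging many trees usually smooths out the noise a single tree picks up, and `feature_importances_` gives back a coarse view of which inputs the forest relies on.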