<a href="https://colab.research.google.com/github/JedRoundy/Machine_Learning_For_Economists/blob/main/PSET_5/pset5coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# import the modules and functions you will use here
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge

This problem deals with regularized regression. The California housing dataset is loaded by the code that is already there, and its description is printed as soon as that cell runs.

In [None]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()
print(california['DESCR'])
X = pd.DataFrame(california['data'], columns=california['feature_names'])
y = pd.Series(california['target'])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

$(a)$ Split the data into a train and a test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)


$(b)$ Fit OLS, LASSO, ridge, and ElasticNet models to the training data. For now, use the default penalty coefficient for each. Display the coefficients and test error for each model.

In [None]:
#Fit OLS object
ols = LinearRegression()

ols.fit(X_train, y_train)

#fit Lasso object
lasso = Lasso()

lasso.fit(X_train, y_train)


#fit ridge object
ridge = Ridge()

ridge.fit(X_train, y_train)


# Print test-set scores (for regressors, `score` returns R^2, not an error)
print(f'OLS Score {ols.score(X_test, y_test)}')
print(f'Lasso Score: {lasso.score(X_test, y_test)}')
print(f'Ridge Score: {ridge.score(X_test, y_test)}')


OLS Score 0.5957643114594791
Lasso Score: 0.2859075052315626
Ridge Score: 0.5957631570661484
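The scores printed above are $R^2$ values, and the cell does not yet cover ElasticNet, which part (b) also asks for. A minimal sketch of one way to get test MSE and coefficients for all four models follows. The helper name `fit_default_models` is hypothetical, and the demo runs on synthetic stand-in data from `make_regression` (an assumption so the block runs on its own), not on the housing data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_default_models(X_train, X_test, y_train, y_test):
    """Fit four linear models with default penalties (hypothetical helper).

    Returns a Series of test MSEs and a DataFrame of coefficients.
    """
    models = {
        "OLS": LinearRegression(),
        "Lasso": Lasso(),            # default alpha=1.0
        "Ridge": Ridge(),            # default alpha=1.0
        "ElasticNet": ElasticNet(),  # default alpha=1.0, l1_ratio=0.5
    }
    test_mse, coefs = {}, {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
        coefs[name] = model.coef_
    return pd.Series(test_mse, name="Test MSE"), pd.DataFrame(coefs)


# Synthetic stand-in data (assumption): replace with the notebook's split.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)
mse_demo, coef_demo = fit_default_models(Xtr, Xte, ytr, yte)
print(mse_demo)
print(coef_demo)
```

In the notebook, the same helper could be called as `fit_default_models(X_train, X_test, y_train, y_test)`.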


$(c)$ Describe the differences that you see in the coefficients and error. What is the cause of this difference in coefficients?

In [None]:
r_coef = ridge.coef_
ols_coef = ols.coef_
l_coef = lasso.coef_

coef_dict = {"Variable": X.columns, "Ridge Coefficients": r_coef, "OLS Coefficients": ols_coef, "Lasso Coefficients": l_coef}
coef_df = pd.DataFrame(data = coef_dict)

coef_df

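# OLS and Ridge produce nearly identical coefficients and test R^2, so Ridge's default penalty (alpha=1) barely binds here.
# Lasso's L1 penalty at the default alpha=1 sets five of the eight coefficients exactly to zero and shrinks the rest,
# leaving MedInc (median income) as the largest remaining coefficient. That heavy shrinkage is why Lasso's test R^2 drops to about 0.29.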

| | Variable | Ridge Coefficients | OLS Coefficients | Lasso Coefficients |
|---|---|---|---|---|
| 0 | MedInc | 0.444474 | 0.444621 | 0.148245 |
| 1 | HouseAge | 0.00939 | 0.009387 | 0.005945 |
| 2 | AveRooms | -0.115184 | -0.115465 | 0.0 |
| 3 | AveBedrms | 0.630306 | 0.631725 | -0.0 |
| 4 | Population | -8e-06 | -8e-06 | -6e-06 |
| 5 | AveOccup | -0.003939 | -0.003938 | -0.0 |
| 6 | Latitude | -0.410569 | -0.410582 | -0.0 |
| 7 | Longitude | -0.424418 | -0.424451 | -0.0 |
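For part (c), it may help to write out the objectives that the scikit-learn documentation gives for these estimators. The symbols $\beta$ (coefficients), $n$ (number of training rows), $\alpha$ (penalty), and $\rho$ (standing in for `l1_ratio`) are notation chosen here, and the intercept is omitted for brevity.

$$
\begin{aligned}
\text{OLS:} \quad & \min_\beta \; \lVert y - X\beta \rVert_2^2 \\
\text{Ridge:} \quad & \min_\beta \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2 \\
\text{Lasso:} \quad & \min_\beta \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \\
\text{ElasticNet:} \quad & \min_\beta \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \rho \lVert \beta \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert \beta \rVert_2^2
\end{aligned}
$$

Ridge adds its penalty to an unscaled sum of squares over roughly 13,800 training rows, so $\alpha = 1$ is small relative to the fit term and the coefficients stay close to OLS. Lasso divides the fit term by $2n$, so the same $\alpha = 1$ weighs much more heavily, and the $\ell_1$ penalty can set coefficients exactly to zero.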


$(d)$ Use K-fold cross validation to find an optimal penalty parameter for Ridge and Lasso.

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')
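For part (d), one possible approach is `GridSearchCV` with a `KFold` splitter. The sketch below is an illustration under assumptions: the helper name `tune_alpha` and the grid `alpha_grid` are hypothetical choices, and the demo uses synthetic stand-in data rather than the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold


def tune_alpha(model, X_train, y_train, alphas, k=5):
    """Pick alpha by K-fold CV on mean squared error (hypothetical helper)."""
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    search = GridSearchCV(model, {"alpha": alphas}, cv=cv,
                          scoring="neg_mean_squared_error")
    return search.fit(X_train, y_train)


# Hypothetical log-spaced grid of candidate penalties.
alpha_grid = np.logspace(-4, 2, 13)

# Synthetic stand-in data (assumption): replace with X_train, y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
ridge_search = tune_alpha(Ridge(), X_demo, y_demo, alpha_grid)
lasso_search = tune_alpha(Lasso(max_iter=10_000), X_demo, y_demo, alpha_grid)

print("Ridge best alpha:", ridge_search.best_params_["alpha"])
print("Lasso best alpha:", lasso_search.best_params_["alpha"])
```

Because `GridSearchCV` refits the best model on all the data it was given, `ridge_search.predict(X_test)` can be used directly for the test error.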

$(e)$ Now use cross-validation to find the optimal penalty parameters for the ElasticNet model. Use both LOOCV and K-fold cross-validation with K=5. How do the test errors and optimal parameters differ?
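A sketch of how the two cross-validation schemes in part (e) could be compared is shown here. The helper `tune_elastic_net` and the small `param_grid` are hypothetical, and the demo data is a synthetic stand-in. Note that LOOCV fits one model per training row for every grid point, so on the full training set it is far more expensive than 5-fold; running it on a subsample is one option.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneOut


def tune_elastic_net(X_train, y_train, cv, param_grid):
    """Grid-search alpha and l1_ratio under the given CV splitter (hypothetical helper)."""
    search = GridSearchCV(
        ElasticNet(max_iter=10_000),
        param_grid,
        cv=cv,
        # MSE is used because R^2 is undefined on single-observation folds.
        scoring="neg_mean_squared_error",
    )
    return search.fit(X_train, y_train)


# Hypothetical small grid, kept tiny so LOOCV stays fast.
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}

# Synthetic stand-in data (assumption): replace with X_train, y_train.
X_demo, y_demo = make_regression(n_samples=60, n_features=8, noise=10.0, random_state=0)

kfold_search = tune_elastic_net(
    X_demo, y_demo, KFold(n_splits=5, shuffle=True, random_state=0), param_grid
)
loo_search = tune_elastic_net(X_demo, y_demo, LeaveOneOut(), param_grid)

print("5-fold best:", kfold_search.best_params_)
print("LOOCV best: ", loo_search.best_params_)
```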

$(f)$ Now that we have tuned the models to perform about as well as they can, which one performs best on the training data? Which one performs best on the test data? Which of these models allow us to do effective causal inference with the coefficients? Why?

For the next problem we will be using the `Carseats` data set that is available on learningsuite. Load the data and convert the text variables into dummies so that we can use them in the analysis. Pandas has a function called `get_dummies` that you might want to use.
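As a small illustration of `get_dummies`, the sketch below converts text columns in a tiny hypothetical frame. The column names mimic `Carseats`, but the values are made up, and loading the real file is left out.

```python
import pandas as pd

# Tiny hypothetical frame with Carseats-style columns; the values are made up.
carseats_demo = pd.DataFrame({
    "Sales": [9.5, 11.2, 10.1, 7.4],
    "Price": [120, 83, 80, 97],
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium"],
    "Urban": ["Yes", "Yes", "No", "Yes"],
})

# drop_first=True drops one level per text column so the dummies are not redundant.
numeric_demo = pd.get_dummies(carseats_demo, drop_first=True, dtype=int)
print(numeric_demo.columns.tolist())
```

Dropping the first level matters mostly for linear models; tree-based methods also work with the full set of dummies.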

Now that the data has only numeric columns, we can proceed to the analysis.  
Use `Sales` as the outcome variable  
(a) Split the data set into a training set and a test set.  
(b) Fit a regression tree to the training set with the default depth. What train and test MSE do you obtain?  
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE? Plot a tree with a depth of 3, and interpret the results.  
(d) Use a bagging approach in order to analyze this data. What test MSE do you obtain? Look at the feature importances attribute of your model object to determine which variables are most important.  
(e) Use random forests to analyze this data. What test MSE do you obtain? Look at the feature importances attribute of your model object to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
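The steps in (b) through (e) can be sketched as follows. This is one possible outline, not a reference solution: it runs on synthetic stand-in data, it tunes tree complexity through `max_depth` (cost-complexity pruning via `ccp_alpha` would be another option), and it treats bagging as a random forest that considers every feature at each split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with the Carseats dummies and Sales.
X_demo, y_demo = make_regression(n_samples=400, n_features=10, n_informative=4,
                                 noise=15.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# (b) Fully grown tree with default settings.
full_tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
tree_train_mse = mean_squared_error(ytr, full_tree.predict(Xtr))
tree_test_mse = mean_squared_error(yte, full_tree.predict(Xte))

# (c) Choose the depth by 5-fold cross-validation on MSE.
depth_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": list(range(1, 11))},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(Xtr, ytr)
pruned_test_mse = mean_squared_error(yte, depth_search.predict(Xte))

# (d)/(e) Vary m = max_features; m equal to all features corresponds to bagging.
n_features = Xtr.shape[1]
forest_test_mse = {}
for m in (n_features, n_features // 2, 3):
    forest = RandomForestRegressor(n_estimators=200, max_features=m,
                                   random_state=0).fit(Xtr, ytr)
    forest_test_mse[m] = mean_squared_error(yte, forest.predict(Xte))

print("Tree train/test MSE:", tree_train_mse, tree_test_mse)
print("Best depth:", depth_search.best_params_["max_depth"], "test MSE:", pruned_test_mse)
print("Forest test MSE by m:", forest_test_mse)
print("Importances (last forest):", forest.feature_importances_)
```

For the plot in (c), `sklearn.tree.plot_tree` can draw a fitted `DecisionTreeRegressor(max_depth=3)` when matplotlib is installed.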

We will now use boosting to predict Log Salary in the `Hitters` data set.  
(a) Format the data appropriately for this analysis. Use 200 observations in your training set.  
(b) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis. Add a curve with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis. The shrinkage parameter is often referred to as the learning rate.  
(c) Compare the test MSE of boosting to the test MSE of two of the penalized regression approaches that we discussed.  
(d) Which variables appear to be the most important predictors in the boosted model?  
(e) The default for base estimator is a Decision Tree with a maximum depth of 3. Is that the optimal depth? Justify your response.  
(f) Now that the boosting model is tuned, let's compare the results to bagging and random forests. Report test errors for your models and discuss how they compare.
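A sketch of the shrinkage sweep in (b) follows, assuming `GradientBoostingRegressor` as the boosting implementation (the problem does not name one; its trees default to `max_depth=3`). The helper `shrinkage_sweep` is hypothetical, the data is a synthetic stand-in with 200 training rows, and the demo call uses 200 trees instead of 1,000 only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


def shrinkage_sweep(X_train, y_train, X_test, y_test, learning_rates, n_estimators=1000):
    """Return train and test MSE for each learning rate (hypothetical helper)."""
    train_mse, test_mse = [], []
    for lr in learning_rates:
        gbr = GradientBoostingRegressor(n_estimators=n_estimators,
                                        learning_rate=lr, random_state=0)
        gbr.fit(X_train, y_train)
        train_mse.append(mean_squared_error(y_train, gbr.predict(X_train)))
        test_mse.append(mean_squared_error(y_test, gbr.predict(X_test)))
    return train_mse, test_mse


# Synthetic stand-in data (assumption): replace with Hitters features and log salary.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
Xtr, ytr = X_demo[:200], y_demo[:200]   # 200 training observations, as in (a)
Xte, yte = X_demo[200:], y_demo[200:]

rates = [0.001, 0.01, 0.1, 0.5]
train_mse, test_mse = shrinkage_sweep(Xtr, ytr, Xte, yte, rates, n_estimators=200)
for lr, tr, te in zip(rates, train_mse, test_mse):
    print(f"learning_rate={lr}: train MSE={tr:.1f}, test MSE={te:.1f}")
```

Plotting `rates` against `train_mse` and `test_mse` on a log-scaled x-axis gives the two curves the problem asks for, and the fitted model's `feature_importances_` attribute is one way to approach (d).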

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.  

#### NOTE: SVM algorithms will often take longer than other models to train, particularly when doing cross-validation.

(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.  
(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.  
(c) Make an ROC curve for your model. The module scikitplot has a nice function you might want to use, but you should be able to make it on your own or with another module if you desire.
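The three steps above can be sketched as follows. Everything about the data is an assumption: `weight`, `horsepower`, and `mpg` are randomly generated stand-ins for the Auto columns, and the cost values are an arbitrary grid. The ROC points come from `sklearn.metrics.roc_curve` applied to the classifier's `decision_function` scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for Auto (assumption): made-up predictors and mpg.
rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, size=300)
horsepower = rng.normal(100, 35, size=300)
mpg = 45 - 0.006 * weight - 0.05 * horsepower + rng.normal(0, 2, size=300)

# (a) 1 if mpg is above the median, 0 otherwise.
high_mpg = (mpg > np.median(mpg)).astype(int)
X_demo = np.column_stack([weight, horsepower])

# (b) 5-fold cross-validation error for several values of the cost C.
cv_error = {}
for C in (0.01, 0.1, 1, 10, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    cv_error[C] = 1 - cross_val_score(clf, X_demo, high_mpg, cv=5).mean()
best_C = min(cv_error, key=cv_error.get)

# (c) ROC curve on a held-out set, using signed distances to the margin as scores.
Xtr, Xte, ytr, yte = train_test_split(X_demo, high_mpg, test_size=0.33,
                                      random_state=0, stratify=high_mpg)
best_clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=best_C)).fit(Xtr, ytr)
scores = best_clf.decision_function(Xte)
fpr, tpr, thresholds = roc_curve(yte, scores)
auc = roc_auc_score(yte, scores)

print("CV error by C:", cv_error)
print("Best C:", best_C, "AUC:", auc)
```

Plotting `fpr` against `tpr` gives the ROC curve. Scaling inside the pipeline keeps the standardization from leaking information across cross-validation folds.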

Below are some generated datasets of varying structure that you will classify using SVMs. Plotting the data to see what it looks like will likely be helpful. Find the kernel that does the best job classifying each of them. Because the data is two-dimensional, it might be nice to use a library like mlxtend, which has a function that will display the decision regions from an SVM.

In [None]:
from sklearn.datasets import make_moons
x, y = make_moons(n_samples=100, shuffle=True, noise=1/10, random_state=123)

In [None]:
from sklearn.datasets import make_circles
x, y = make_circles(n_samples=100, shuffle=False, noise=1/50, random_state=123, factor=0.6)

In [None]:
from sklearn.datasets import make_blobs
x, y = make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=2.0,
           center_box=(-10.0, 10.0), shuffle=True, random_state=10)
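One way to compare kernels on the three datasets above is cross-validated accuracy, sketched below. The kernel list, the default `C` and `gamma`, and the 5-fold stratified splitter are assumptions made for illustration; tuning those settings could change which kernel comes out ahead.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# The same three generated datasets as in the cells above.
datasets = {
    "moons": make_moons(n_samples=100, shuffle=True, noise=1/10, random_state=123),
    "circles": make_circles(n_samples=100, shuffle=False, noise=1/50,
                            random_state=123, factor=0.6),
    "blobs": make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=2.0,
                        center_box=(-10.0, 10.0), shuffle=True, random_state=10),
}

# Shuffled, stratified folds so every fold contains each class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_accuracy = {}
for name, (x_data, labels) in datasets.items():
    cv_accuracy[name] = {}
    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        scores = cross_val_score(SVC(kernel=kernel), x_data, labels, cv=cv)
        cv_accuracy[name][kernel] = scores.mean()
    best = max(cv_accuracy[name], key=cv_accuracy[name].get)
    print(f"{name}: best kernel by CV accuracy = {best}")
```

After fitting the chosen `SVC` on a dataset, mlxtend's `plot_decision_regions` can be used to draw its decision regions if that library is installed.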