# Regression

Now let's get some real data. We will use housing information from Zillow; open the file `zillow_data_download_april2020.xlsx` for a quick look at the data. Our goal is to build a linear regression model relating the number of houses listed to the number of houses sold in a few states. The fitted relationship can serve as a key real estate indicator.
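As a minimal sketch of what such a fit computes, the snippet below uses the closed-form solution for a one-variable linear regression. The numbers are made up for illustration (they are not taken from the Zillow file), and the names `listed` and `sold` are hypothetical.

```julia
using Statistics

# Made-up illustrative numbers, NOT taken from the Zillow file.
listed = [120.0, 250.0, 400.0, 610.0, 800.0]   # houses listed
sold   = 0.3 .* listed .+ 10.0                  # houses sold (noise-free on purpose)

# Closed-form simple linear regression: sold ≈ intercept + slope * listed
slope     = cov(listed, sold) / var(listed)
intercept = mean(sold) - slope * mean(listed)

println("slope = ", slope, ", intercept = ", intercept)
```

Because the toy data is generated without noise, the fit recovers the slope and intercept used to build it; with real data the slope is only an estimate of the sold/listed relationship.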

---- 
### 🟠 Logistic regression
So far, we have shown several ways to solve the linear regression problem in Julia. Here, we start with a motivating example of when you would want to use logistic regression instead. Let's assume that our response vector is binary (`0` or `1`) and fit a linear regression model to it.
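Here is a small sketch of that experiment, assuming hypothetical toy data with a 0/1 outcome `y`. It uses the backslash operator for ordinary least squares rather than any particular package, and it prints the fitted values instead of plotting them.

```julia
# Hypothetical toy data: a 0/1 outcome observed at seven x values.
x = collect(1.0:7.0)
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

# Ordinary least squares via the backslash operator.
A = [ones(length(x)) x]      # design matrix: intercept column, then x
beta = A \ y                 # beta[1] = intercept, beta[2] = slope
yhat = A * beta              # fitted values at the observed x

println(round.(yhat, digits=3))
println("min = ", minimum(yhat), ", max = ", maximum(yhat))
```

On this toy data the smallest fitted value falls below 0 and the largest exceeds 1, which is not meaningful if the fitted values are read as probabilities.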

What this plot quickly shows is that linear regression may end up predicting values outside the `[0,1]` interval. For a problem like this, we will use logistic regression. Interestingly, a [generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model) unifies concepts like linear regression and logistic regression, and the `GLM` package lets you apply either of them by specifying the `distribution family` and the `link` function.

To apply logistic regression via the `GLM` package, you can readily use the `Binomial()` family and the `LogitLink()` link function. 
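A minimal sketch of such a call, assuming the `DataFrames` and `GLM` packages are installed. The data frame `df` and its columns `x` and `y` are hypothetical toy values, not the dataset used in this notebook.

```julia
using DataFrames, GLM

# Hypothetical toy data (not the notebook's dataset); deliberately not perfectly separable.
df = DataFrame(x = collect(1.0:10.0),
               y = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

# Logistic regression: Binomial family with a logit link.
logit_fit = glm(@formula(y ~ x), df, Binomial(), LogitLink())

p = predict(logit_fit)        # fitted probabilities at the observed x values
println(coef(logit_fit))      # [intercept, slope] on the logit scale
println(round.(p, digits=3))
```

The fitted probabilities come from applying the inverse logit to a linear function of `x`, so they always lie strictly between 0 and 1.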

Let's load some data and take a look at one example.

We will map the sex of each cat to a binary 0/1 value.
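One simple way to do such a mapping in plain Julia is a broadcasted comparison. This is a sketch with hypothetical labels `"F"` and `"M"`; the actual column may be coded differently, and a label-encoding package could be used instead.

```julia
# Hypothetical labels; the real column may be coded differently.
sex = ["F", "M", "M", "F", "M"]

# Broadcasted comparison gives a Bool vector; Int.() turns it into 0/1.
is_male = Int.(sex .== "M")   # "M" -> 1, "F" -> 0

println(is_male)
```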

Females (color 1) seem to be concentrated in the lower left corner, and males (color 2) in the top right corner. Let's run a logistic regression model on this data.

As you can see, unlike in the linear regression case, the predicted values stay within the `[0,1]` interval.

-----
### 🟠 Nonlinear regression
Finally, sometimes you may have a set of points and the goal is to fit a nonlinear function (maybe a quadratic function, a cubic function, an exponential function...). The way we would solve such a problem is by minimizing the least-squares error between the fitted function and the observations we have. We will use the package `LsqFit` for this task. Note that this problem is usually modeled as a numerical optimization problem.
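Written out, and using notation introduced here only for illustration ($f$ for the function being fitted, $p$ for its parameter vector, and $(x_i, y_i)$ for the $n$ observations), the least-squares problem is:

$$
\hat{p} = \arg\min_{p} \sum_{i=1}^{n} \bigl(y_i - f(x_i, p)\bigr)^2
$$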

We will first set up our data.

Then, we set up the model with `model(x,p)`. The vector `p` holds the parameters to be estimated given a set of values `x`.
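A minimal sketch of this setup, assuming the `LsqFit` package is installed. The exponential-decay form of `model` and the synthetic, noise-free data are assumptions made for illustration; the notebook's own model and data may differ.

```julia
using LsqFit

# Hypothetical model: exponential decay with scale p[1] and rate p[2].
# The @. macro makes every operation elementwise, so x can be a vector.
@. model(x, p) = p[1] * exp(-x * p[2])

# Synthetic, noise-free data generated from known parameters.
p_true = [1.0, 2.0]
xdata = collect(range(0, stop=10, length=20))
ydata = model(xdata, p_true)

p0 = [0.5, 0.5]                            # initial guess for the parameters
fit = curve_fit(model, xdata, ydata, p0)   # nonlinear least-squares fit

println(fit.param)                         # estimated parameter vector
```

Since the data is generated from the model without noise, the estimate should land very close to `p_true`; with noisy data it would only be approximately equal.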

⚠️ A note about `curve_fit`: this function can take several other inputs, for instance the Jacobian of the function you are trying to fit. We don't dive into these details here, but be sure to check out the `LsqFit` package to see what else you can pass to get a better fit.

Also note that Julia has multiple packages that allow you to create Jacobians so you don't have to write them yourself. Two such packages are `FiniteDifferences` and `ForwardDiff`.
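As a rough illustration, the sketch below uses `ForwardDiff.jacobian` to differentiate a hypothetical exponential model with respect to its parameters and compares the result with a hand-written Jacobian. The model form and the evaluation points are assumptions chosen for the example.

```julia
using ForwardDiff

# Hypothetical model: p[1] * exp(-x * p[2]), evaluated elementwise over x.
model(x, p) = p[1] .* exp.(-x .* p[2])

x = collect(range(0, stop=2, length=5))
p = [1.0, 2.0]

# Jacobian with respect to the parameters via automatic differentiation.
# Rows correspond to the points in x, columns to p[1] and p[2].
J_ad = ForwardDiff.jacobian(q -> model(x, q), p)

# Hand-written Jacobian for comparison.
J_hand = hcat(exp.(-x .* p[2]), -x .* p[1] .* exp.(-x .* p[2]))

println(maximum(abs.(J_ad .- J_hand)))   # largest elementwise difference
```

`LsqFit` also provides a method `curve_fit(model, jacobian_model, xdata, ydata, p0)`, where `jacobian_model(x, p)` returns a matrix of this shape.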

↪️ Back to our example. We are now ready to plot the curve we have generated.

Just for fun... Let's try to fit a linear function
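A sketch of what that could look like, assuming `LsqFit` is installed. The function name `line` and the synthetic data are hypothetical. The result is checked against ordinary least squares computed with the backslash operator, since both minimize the same squared error.

```julia
using LsqFit

# Hypothetical linear model: p[1] = intercept, p[2] = slope.
@. line(x, p) = p[1] + p[2] * x

# Synthetic data: a straight line plus a small deterministic wiggle.
xdata = collect(0.0:1.0:9.0)
ydata = 2.0 .+ 0.5 .* xdata .+ 0.1 .* sin.(xdata)

fit = curve_fit(line, xdata, ydata, [0.0, 0.0])

# Reference: closed-form least squares on the same data.
ref = [ones(length(xdata)) xdata] \ ydata

println(fit.param)
println(ref)
```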

# Finally...
After finishing this notebook, you should be able to:
- [ ] run a linear regression model
- [ ] use the GLM package to pass functions and probability distributions to solve your special regression problem
- [ ] use GLM to solve a logistic regression problem
- [ ] fit a nonlinear regression to your data using the LsqFit package
- [ ] use the LsqFit package to fit a linear function too

# 🥳 One cool finding

One metric in real estate is the fraction of houses on the market that are sold. We collect data points from multiple states and fit a linear model for each state. It turns out that North Carolina has the highest sold/listed ratio, Florida has one of the lowest, and California is somewhere in between.

<img src="data/0701.png" width="400">
