In [None]:
!pip install plotly==4.5.2

# More Examples

Let's do a couple more examples to get the hang of linear regression.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm

## Beach Sand

It turns out that there is a correlation between the average size of a grain of sand on a beach and how steep the beach is. There is a small dataset at
`'https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/beach_sand.csv'` that contains the average grain size and steepness of 9 beaches around the world.

1. Load the data and make a scatter plot with `'GranuleSize(mm)'` on the x-axis and `'BeachGrad(deg)'` on the y-axis.

2. Does there appear to be a correlation? What is the correlation coefficient?

3. Use statsmodels to make a linear regression model of the data and print out the summary of the model. Don't forget to add a constant to your x-values. What do the coefficients mean?

4. Plot the original scatter plot with our model line added.

5. According to your model, what would the gradient of a beach be (in degrees) if it had an average granule size of 1 mm?

## Crickets
The rate at which crickets chirp varies according to the temperature, so we can use cricket chirps as a rudimentary thermometer. We have cricket chirping data here:
```'https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/crickets.csv'```

Follow the same steps as above to examine, plot and model the data.  

1. According to the model you just made, what would the temperature be if the crickets were chirping at a rate of 1 chirp per second?
2. What would the temperature be if the rate was 24 chirps/second?

# Multiple linear regression
There are a number of ways to generalize our basic linear models. One method is to have multiple independent variables that help predict our response variable. Let's look at an example.

## Antelopes
There is a dataset at `'https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/antelope_mlr.csv'` that tracked a particular population of antelopes in Wyoming over several years. The researchers wanted to see how the size of the herd, the rainfall and the winter severity affected the number of fawns that were born each spring.

Let's load and look at the data.

In [None]:
data=pd.read_csv('https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/antelope_mlr.csv')
data

The Winter Severity index goes from 1 to 5, with 1 being a mild winter, and 5 being a brutal winter.

Let's make three plots, with all three having the number of fawns born on the y-axis, and then each of the other columns on the x-axis.

In [None]:
px.scatter(data,x='Adults (100s)',y='New Fawns/100')

In [None]:
px.scatter(data,x='Annual Prec. (in.)',y='New Fawns/100')

In [None]:
px.scatter(data,x='Winter Severity',y='New Fawns/100')

We can see that each of the three variables appears to be correlated with the number of fawns born. Precipitation and adult herd size both show a strong positive correlation with fawns, while winter severity shows a negative correlation.
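To put numbers on what the plots suggest, pandas can compute every pairwise correlation coefficient at once. The sketch below uses a small invented frame (the column names `response`, `rises_with` and `falls_with` are hypothetical) so that the signs are easy to check by eye.

```python
import pandas as pd

# Invented numbers for illustration only (NOT the antelope data)
toy = pd.DataFrame({
    "response": [1.0, 2.0, 3.0, 4.0],
    "rises_with": [2.0, 4.0, 6.0, 8.0],   # increases as response increases
    "falls_with": [8.0, 6.0, 4.0, 2.0],   # decreases as response increases
})

# corr() returns the full correlation matrix; pick the column for the response
corr_with_response = toy.corr()["response"]
print(corr_with_response)
```

Assuming the antelope frame is loaded as `data` and all of its columns are numeric, `data.corr()['New Fawns/100']` should give the corresponding coefficients for the real dataset.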

Rather than modeling these separately, we can combine them into a single multivariate model. The process doesn't really change much; we just have multiple columns in our independent variable.

1. Our independent variable will now include three columns, not just one. So we create it slightly differently:

In [None]:
#we create a list with the columns we want to use as independent variables
ind_vars=['Adults (100s)','Annual Prec. (in.)','Winter Severity']

#we set X to the subframe with these columns
X=data[ind_vars]

#we still need to add a constant
X=sm.add_constant(X)

2. Nothing changes about how we create our response variable, or how we make and fit our model:

In [None]:
Y=data['New Fawns/100']
fawn_model=sm.OLS(Y,X)
fawn_model=fawn_model.fit()
fawn_model.summary()

We haven't learned how to decide how good our model is, but trust me when I tell you this model is better than the one we get when using just one of the independent variables. 

The downside of this model is that it is hard to visualize, since we would need 4 dimensions (the 3 independent variables plus the dependent variable).

Even though it is difficult to visualize the model, it is still easy to use. We just need to look at the coefficients in the summary above to get our linear model:

$$
\text{fawns (100s)} = -5.9220 + 0.3382 \cdot \text{adults (100s)} + 0.4015 \cdot \text{annual precipitation (in.)} + 0.2629 \cdot \text{winter severity}
$$

So if we wanted to predict how many fawns would be born in the spring when the herd has 820 adults (8.2 in units of 100), the annual precipitation is 13.6 inches, and the winter has a severity of 2, we would compute:


In [None]:
-5.9220 + 0.3382*(8.2) + 0.4015*(13.6) + 0.2629*(2)

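So the model predicts a value of about 2.84 for `'New Fawns/100'`. Since that column is measured in hundreds, we would expect roughly 284 fawns to be born in those conditions.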
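The same arithmetic can be wrapped in a small function so the unit conversions are explicit. This is only a sketch: the function name `predict_fawns_hundreds` is made up, the coefficients are the ones quoted from the summary above, and it assumes the adult and fawn columns are both recorded in hundreds, as their names suggest.

```python
# Coefficients copied from the model summary quoted in the text
INTERCEPT = -5.9220
COEF_ADULTS = 0.3382   # per 100 adults
COEF_PRECIP = 0.4015   # per inch of annual precipitation
COEF_WINTER = 0.2629   # per point of winter severity


def predict_fawns_hundreds(adults, precip_in, winter_severity):
    """Return the predicted 'New Fawns/100' value (hypothetical helper)."""
    adults_hundreds = adults / 100  # the 'Adults (100s)' column is in hundreds
    return (INTERCEPT
            + COEF_ADULTS * adults_hundreds
            + COEF_PRECIP * precip_in
            + COEF_WINTER * winter_severity)


pred = predict_fawns_hundreds(adults=820, precip_in=13.6, winter_severity=2)
print(round(pred, 5))      # prediction in hundreds of fawns
print(round(pred * 100))   # approximate head count
```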

## Assignment

Use the dataset at 
```
'https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/franchises.csv'
```
It shows the performance of several individual stores of a franchise along with factors that might affect the sales of that store. Your response variable is the `'Annual Sales'` column. All of the other columns are your independent variables. Analyze the dataset as we did for the fawns and create a multiple linear regression model for the variables.

Why might such a model be useful?