# Making a scatter with a trend line

## Getting ready


In addition to `plotly`, `numpy` and `pandas`, make sure the following Python libraries are available in your Python environment:

-  `statsmodels` 
-  `scipy`

You can install them using the command:

```
pip install statsmodels scipy
```

Import the Python modules `numpy` and `pandas`, and the [`norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) object from `scipy.stats`. This object lets us draw random samples from a normal distribution, which we use to create the data sets for this recipe.

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm

Create two data sets to be used in this recipe:

- `data1`: contains two variables, `x` and `y`, with a linear relationship
- `data2`: contains two variables, `x` and `y`, with a non-linear relationship

In [2]:
n = 200
x = np.linspace(0, 15, n)
epsilon = norm().rvs(n)
sigma = 2
y = 2*x + sigma*epsilon
data1 = pd.DataFrame({'x':x, 'y':y})

In [3]:
n = 200
x = np.linspace(0, 15, n)
epsilon = norm(loc=20, scale=100).rvs(n)
y = 0.5*x**3 + epsilon -10
data2 = pd.DataFrame({'x':x, 'y':y})
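
The samples above change on every run because no seed is set. If reproducible figures are wanted, one option is to pass `random_state` to `rvs`. The following is a minimal sketch; the helper name `make_linear_data` and the seed value are arbitrary choices made for illustration and are not part of the recipe.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm


def make_linear_data(n=200, sigma=2, seed=42):
    """Rebuild a data set like data1 with a fixed seed (hypothetical helper)."""
    x = np.linspace(0, 15, n)
    # random_state fixes the random draw; 42 is an arbitrary choice
    epsilon = norm().rvs(size=n, random_state=seed)
    y = 2 * x + sigma * epsilon
    return pd.DataFrame({"x": x, "y": y})


data1_a = make_linear_data()
data1_b = make_linear_data()
# Two calls with the same seed should give identical data frames
print(data1_a.equals(data1_b))
```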

## How to do it

1. Import the `plotly.express` module as `px`

In [4]:
import plotly.express as px

2. Make a scatter plot to illustrate the points in the `data1` data set

In [5]:
df = data1
fig = px.scatter(df, x='x', y ='y', 
                 title='Just a simple scatter')
fig.show()

The plot shows a roughly linear relationship between the two variables.

### Linear Trend

3. Add a line that captures the linear relationship in the data. To do this, add the argument `trendline` and pass the string `ols`. This draws the line fitted by the Ordinary Least Squares (OLS) regression method; `plotly` relies on `statsmodels` for the fit.

In [6]:
fig = px.scatter(df, x='x', y ='y', trendline="ols",
                 title='Scatter with OLS trend line')
fig.show()
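
To see what the OLS trend line represents, the slope and intercept can also be computed directly. The following is a minimal sketch using `np.polyfit` with degree 1, which minimises the same sum of squared residuals. It rebuilds a data set like `data1` (with an arbitrary seed) so that it runs on its own, and it is not meant to show how `plotly` computes the line internally.

```python
import numpy as np
from scipy.stats import norm

n = 200
x = np.linspace(0, 15, n)
# Same construction as data1; the seed 0 is an arbitrary choice
y = 2 * x + 2 * norm().rvs(size=n, random_state=0)

# Degree-1 least-squares fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")

# Points on the fitted line, i.e. what the trend line traces
y_line = intercept + slope * x
```

The slope should land close to the value 2 used to generate the data.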

4. Change the color of the trend line by using `trendline_color_override`

In [7]:
fig = px.scatter(df, x='x', y ='y', trendline="ols", 
                 trendline_color_override="red",
                 title='Scatter with OLS trend line')
fig.show()

5. Retrieve the results of the OLS fit by using the `plotly.express` function `get_trendline_results` and passing your figure object.

In [8]:
results_table = px.get_trendline_results(fig)
results_table

| | px_fit_results |
|---|---|
| 0 | `<statsmodels.regression.linear_model.Regressio...` |


Let's check what type of object this returns.

In [9]:
type(results_table)

pandas.core.frame.DataFrame

It is a pandas `DataFrame`.

6. Extract the object containing the results from the `DataFrame`. This is a `statsmodels.regression.linear_model.RegressionResultsWrapper` object

In [10]:
results = results_table['px_fit_results'][0]
results

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x11d2cb8d0>

In [11]:
type(results)

statsmodels.regression.linear_model.RegressionResultsWrapper

7. Get the full details on the regression by using the method `summary` from the `results` object. This method returns a `statsmodels` `Summary` object, which is displayed as a set of tables in a notebook.

In [14]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.953 |
| Model: | OLS | Adj. R-squared: | 0.953 |
| Method: | Least Squares | F-statistic: | 4029.0 |
| Date: | Sun, 18 Aug 2024 | Prob (F-statistic): | 1.44e-133 |
| Time: | 14:43:56 | Log-Likelihood: | -413.48 |
| No. Observations: | 200 | AIC: | 831.0 |
| Df Residuals: | 198 | BIC: | 837.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2286 | 0.271 | 0.844 | 0.400 | -0.305 | 0.763 |
| x1 | 1.9824 | 0.031 | 63.471 | 0.000 | 1.921 | 2.044 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.318 | Durbin-Watson: | 1.739 |
| Prob(Omnibus): | 0.19 | Jarque-Bera (JB): | 3.325 |
| Skew: | -0.311 | Prob(JB): | 0.19 |
| Kurtosis: | 2.887 | Cond. No. | 17.4 |


Note that there is a similar method named `summary2`, which also returns a summary object. However, this is an **experimental** version and as such it must be used with caution.

### Non-Linear Trend

1. Make a scatter plot to illustrate the points in the `data2` data set. Include the OLS regression line to contrast it against the data. It is clear that the data does not follow a linear relationship.

In [15]:
df = data2
fig = px.scatter(df, x='x', y ='y', trendline="ols", 
                 trendline_color_override="red",
                 title='Scatter with OLS trend line')
fig.show()

2. Import `statsmodels.formula.api` as `smf`. This lets us specify a model with a non-linear term for the data in `data2`

In [17]:
import statsmodels.formula.api as smf

3. Fit an OLS model with a cubic term to the data by using `smf.ols` and passing:

- `formula`: a string that specifies the curve we want to fit. In this case we fit an intercept plus a cubic term, `y ~ I(x**3)`
- `data`: the `DataFrame` with the data set to be fitted

In [26]:
model = smf.ols(formula='y ~ I(x**3)', data = df).fit()
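
In the formula, `I(...)` tells the formula parser to evaluate `x**3` as ordinary arithmetic, so the fitted model has the form `y = b0 + b1*x^3`. The sketch below fits that model on a data set generated like `data2` (seed arbitrary) and, for comparison, a full cubic with all lower-order terms. The second formula and the `new_points` frame are illustrations, not part of the recipe.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm

n = 200
x = np.linspace(0, 15, n)
y = 0.5 * x**3 + norm(loc=20, scale=100).rvs(size=n, random_state=0) - 10
df = pd.DataFrame({"x": x, "y": y})

# Cubic term only, as in the recipe
cubic_term = smf.ols(formula="y ~ I(x**3)", data=df).fit()
# Full cubic polynomial: intercept, x, x^2 and x^3
full_cubic = smf.ols(formula="y ~ x + I(x**2) + I(x**3)", data=df).fit()

print(cubic_term.params)
print(full_cubic.params)

# Predictions at new x values need a DataFrame with an 'x' column
new_points = pd.DataFrame({"x": [5.0, 10.0]})
print(cubic_term.predict(new_points))
```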

4. Compute the fitted values of the model and plot the scatter together with the fitted curve evaluated at the `x` values

In [28]:
# Fitted values of the model at the observed x values
predicted = model.fittedvalues
fig = px.scatter(df, x='x', y='y',
                 title='Scatter + Fitted Polynomial')
fig.add_scatter(x=df.x, y=predicted, mode='lines', name="Fitted Polynomial")
fig.show()

5. Get the full details of the model by using the method `summary`

In [29]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.96 |
| Model: | OLS | Adj. R-squared: | 0.96 |
| Method: | Least Squares | F-statistic: | 4792.0 |
| Date: | Sun, 18 Aug 2024 | Prob (F-statistic): | 1.04e-140 |
| Time: | 14:56:41 | Log-Likelihood: | -1199.9 |
| No. Observations: | 200 | AIC: | 2404.0 |
| Df Residuals: | 198 | BIC: | 2410.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.0492 | 9.237 | 0.980 | 0.328 | -9.166 | 27.264 |
| I(x ** 3) | 0.4981 | 0.007 | 69.226 | 0.000 | 0.484 | 0.512 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.387 | Durbin-Watson: | 2.029 |
| Prob(Omnibus): | 0.824 | Jarque-Bera (JB): | 0.527 |
| Skew: | 0.077 | Prob(JB): | 0.768 |
| Kurtosis: | 2.802 | Cond. No. | 1710.0 |
