<a target="_blank" href="https://colab.research.google.com/github/RodrigoAVargasHdz/CHEM-4PB3/blob/w2024/Course_Notes/Week%203/Week_3_Intro_Linear_Models.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# **Week 3 - Introduction to Linear Models**

## **Abstract**

1. **Brief Overview of Parameter Generation**: Creating arrays of random numbers with **NumPy's** `np.random.uniform` function and adding uniform or Gaussian noise to data.

2. **Animations on Google Colab**: Introducing `matplotlib.animation` and `IPython.display` to build animations and GIFs from individual frames. These can be used to visualize kinetic models in Python.

3. **Linear and Polynomial Models**: The use of linear and polynomial models to predict data, inspecting the advantages and disadvantages of each model. Introducing the mathematical structure of these models.

4. **Overfitting and Underfitting**: Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant details, which leads to poor generalization to new data. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, resulting in a lack of accuracy on both the training and test data.

5. **Understanding Lambda and Regularization**: To balance overfitting against underfitting, the regularization strength lambda ($\lambda$) is introduced as a critical element in machine learning. Inspecting the magnitudes of the model coefficients motivates regularization.


>## **References: Essential Resources for Further Learning**
>
>1. **NumPy Random Tutorial**: [Tutorial](https://numpy.org/doc/stable/reference/random/index.html)
>2. **Creating Animations with Matplotlib**: [Tutorial](https://towardsdatascience.com/animations-with-matplotlib-d96375c5442c)
>3. **Linear Regression in Python**: [Tutorial](https://realpython.com/linear-regression-in-python/)
>4. **Overfitting and Underfitting in Machine Learning**: [Article](https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765)
>5. **Regularization in Machine Learning**: [Tutorial](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net)


Feel free to explore these resources to deepen your understanding of data visualization, data management, and computational tools in Chemistry.





In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML, Image # For GIF
import os

## **Understanding Linear Models in Python - Animation**

To visualize linear models in Python, we first generate noisy data: `get_data` evaluates a known function on a grid and adds random noise drawn with **NumPy's** `np.random.uniform` function.


In [None]:
# generate noisy data from f(x) = sin(0.5*x) + x - 1
def get_data(N):
    x = np.linspace(-1.,1.,N) #This creates an array x of N linearly spaced values between -1 and 1.
    y = np.sin(.5*x) + x -1.
    y = y + np.random.uniform(low = 0.,high=0.5,size=x.shape) #Adds random noise to each y value.
    return x,y
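
The noise added in `get_data` is uniform on $[0, 0.5)$, so it is never negative and shifts the data upward by about 0.25 on average. As a minimal sketch (the seeded generator `rng` and the sample size `n` are assumptions for illustration, not part of the notebook), the following compares it with zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
n = 100_000                     # illustrative sample size

uniform_noise = rng.uniform(low=0.0, high=0.5, size=n)   # same range as in get_data
gaussian_noise = rng.normal(loc=0.0, scale=0.5, size=n)  # zero-mean alternative

print("uniform : mean=%.3f, min=%.3f" % (uniform_noise.mean(), uniform_noise.min()))
print("gaussian: mean=%.3f, min=%.3f" % (gaussian_noise.mean(), gaussian_noise.min()))
```

Uniform noise with a non-zero mean effectively changes the intercept of the underlying function, which is worth keeping in mind when comparing fitted parameters with the formula used to generate the data.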

Linear models 101:
$$
\begin{align}
f(x) &= m x + b
=  \sum_{i=0}^{d} w_i x_i = \begin{bmatrix}
w_0 & w_1 & \cdots & w_d
\end{bmatrix}\begin{bmatrix}
 1 \\
 x_1 \\
 \vdots \\
 x_d
\end{bmatrix}
\end{align}
$$

Here $x_0 = 1$, so $w_0$ is the intercept; for a single input ($d = 1$), $w_0 = b$ and $w_1 = m$.
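
The vector form above can be checked numerically. In this sketch the values of `m`, `b`, and `x` are arbitrary choices for illustration; the input is augmented with a leading 1 so that the intercept becomes part of the dot product.

```python
import numpy as np

m, b = 1.5, -0.5            # arbitrary slope and intercept (illustrative values)
x = 2.0

w = np.array([b, m])        # w_0 = b (intercept), w_1 = m (slope)
x_aug = np.array([1.0, x])  # [1, x]: the leading 1 multiplies the intercept

f_scalar = m * x + b        # f(x) = m x + b
f_vector = w @ x_aug        # the same value written as a dot product

print(f_scalar, f_vector)
```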

In [None]:
def model(x,params):
    m,b = params #tuple
    y = m*x
    y = y + b
    return y

### **Generating Random Parameters**

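*   **Purpose:** The function below generates a 2D array of random numbers with shape $m \times 2$, meaning there are $m$ rows and 2 columns. Here $m$ counts the rows; within each row, the two entries are the slope and intercept of one candidate model, both drawn uniformly from $[-2, 2]$: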

- $$[m,b] \sim U([-2,2])$$






In [None]:
#random parameters
def get_random_params(m):
    theta_random = np.random.uniform(low=-2.,high=2.,size=(m,2))
    return theta_random
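
To see what `get_random_params` returns, the sketch below repeats its sampling call with a seeded generator (the generator `rng` and the count `n_models` are assumptions for illustration) and unpacks each row into a slope and an intercept.

```python
import numpy as np

rng = np.random.default_rng(1)   # seeded generator (assumption)
n_models = 5                     # number of candidate (m, b) pairs

theta = rng.uniform(low=-2.0, high=2.0, size=(n_models, 2))
print(theta.shape)               # one row per model, two columns

for m, b in theta:               # each row unpacks into slope m and intercept b
    print("m=%.2f, b=%.2f" % (m, b))
```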

## **Functions in Python**

1. Built-in functions: `min(x)`
2. User-defined functions: 
   ```python
   def f_min(x):
       return np.min(x)
   ```
3. Lambda functions: 
   ```python
   lambda x: np.min(x)
   ```

In [None]:
f_min = lambda x: np.min(x) 
x = np.array([10.,1.,2.,3.])
print(type(f_min))
print(f_min(x))


**Exercise**<br>
Write a lambda function for a linear model.


In [None]:
#code here!
model_l = lambda x,m,b: m*x+b
m = 1.
b = 2.
x = 10.

print(model(x,(m,b)))
print(model_l(x,m,b))

model_l2 = lambda x: model(x,(m,b))
print(model_l2(x))

## **Figure Per Frame**


> The scatter plot below shows the generated data points. Using Python and linear regression, we can fit a model (a line of best fit) to these points.

> `plot_figure_frame` then draws one figure per frame: the data together with the linear model defined by one pair of parameters **`(m, b)`**.

> The final cell loops over each pair **`(m, b)`** in `theta_rnd` and calls `plot_figure_frame` for each one.

In [None]:
x, y = get_data(25)
plt.scatter(x, y, label='data')
plt.xlabel(r'$x$', fontsize=18)
plt.ylabel(r'$f(x)$', fontsize=18)
plt.ylim(-3., 2.)
plt.legend()
# plt.savefig('Figures/data.png')
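
Rather than guessing `(m, b)` at random, the line of best fit can be obtained by least squares. The sketch below uses `np.polyfit` with degree 1 on noise-free data from a known line (the values `m_true` and `b_true` are illustrative assumptions), so the recovered parameters can be compared with the true ones.

```python
import numpy as np

m_true, b_true = 1.4, -0.75           # illustrative "true" parameters
x = np.linspace(-1.0, 1.0, 25)
y = m_true * x + b_true               # noise-free data on a straight line

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
m_fit, b_fit = np.polyfit(x, y, 1)
print("m=%.3f, b=%.3f" % (m_fit, b_fit))
```

With noisy data the fitted values will differ from the true ones, but they remain the pair that minimizes the sum of squared residuals.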

In [None]:
def plot_figure_frame(data, params, i):
    m, b = params
    x, y = data
    def f(x, m, b): return m*x + b

    x_grid = np.linspace(-1., 1., 100)
    y_pred = f(x_grid, m, b)

    fig, ax = plt.subplots()
    ax.clear()
    ax.scatter(x, y, label='data')
    ax.plot(x_grid, y_pred, color='k', label='model')
    ax.text(0.2, -2.5, 'm=%.2f, b=%.2f' % (m, b), fontsize=15)
    ax.legend(loc=1)
    ax.set_xlabel(r'$x$', fontsize=18)
    ax.set_ylabel(r'$f(x)$', fontsize=18)
    ax.set_ylim(-3., 2.)
    # plt.savefig('Figures/linear_model_%s.png'%(i))
    plt.draw()
    plt.pause(0.1)

In [None]:
theta_rnd = get_random_params(2)
x, y = get_data(25)
for i, p in enumerate(theta_rnd):
    plot_figure_frame((x,y),p,i)

## **Animation**

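> To animate the linear model in Google Colab, `matplotlib.animation.FuncAnimation` builds the animation by calling a drawing function once per frame, and the [`IPython.display`](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html) module displays the result in the notebook. The animation can also be saved as a GIF.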

In [None]:
from matplotlib import animation, rc
from IPython.display import HTML

x,y = get_data(25)
theta_rnd = get_random_params(20)

fig, ax = plt.subplots(figsize=(6,5))
# plt.figure(facecolor='white')

ax.set_xlabel(r'$x$', fontsize=18)
ax.set_ylabel(r'$f(x)$', fontsize=18)
ax.set_ylim(-3., 2.)
ax.set_xlim(-1.1, 1.1)
ax.scatter(x,y,label='data')

# line1, = ax.plot([], [], ms=20, label='data')
line2, = ax.plot([], [], ms=20, color='k', label='model')
txt2 = ax.text(0.2,-2.5, '',fontsize=15)
ax.legend(loc=1)

def drawframe(n):
    params = theta_rnd[n]
    m, b = params
    def f(x, m, b): return m*x + b

    x_grid = np.linspace(-1., 1., 100)
    y_pred = f(x_grid, m, b)

    # line1.set_data(x,y)
    line2.set_data(x_grid,y_pred)
    txt2.set_text('m=%.2f, b=%.2f'%(m,b))

    return (line2, txt2)  # tuple of the artists updated in this frame

> The figure displayed above is the initial frame: the data points only, before any model line is drawn.

In [None]:
%matplotlib inline
# blit=True would re-draw only the artists that changed; blit=False re-draws the whole frame.
anim = animation.FuncAnimation(
    fig, drawframe, frames=len(theta_rnd), interval=200, blit=False)

# Save as GIF (the 'pillow' writer does not require ffmpeg)
anim.save('animation.gif', writer='pillow')

# play animation
HTML(anim.to_html5_video())
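
`anim.to_html5_video()` relies on an external `ffmpeg` installation. If that is unavailable, `FuncAnimation.to_jshtml()` is an alternative that embeds the frames directly in an HTML/JavaScript player. The following is a minimal sketch with a hypothetical three-frame animation (the values in `slopes` are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation

slopes = [-1.0, 0.0, 1.0]            # made-up slopes, one per frame
x_grid = np.linspace(-1.0, 1.0, 50)

fig, ax = plt.subplots()
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.5, 1.5)
line, = ax.plot([], [], color='k')

def drawframe(n):
    line.set_data(x_grid, slopes[n] * x_grid)  # update the line for frame n
    return (line,)                              # tuple of updated artists

anim = animation.FuncAnimation(fig, drawframe, frames=len(slopes), interval=200)
html_str = anim.to_jshtml()          # self-contained HTML/JavaScript player
plt.close(fig)                       # avoid showing a duplicate static figure

# In a notebook cell, display it with:
#   from IPython.display import HTML
#   HTML(html_str)
```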


## **Polynomial Models Are Linear Models**

Polynomial models are a class of regression models that use polynomial functions to fit a relationship between a dependent variable and **one or more independent variables**. Unlike a straight-line model, polynomial models can **fit data with curves and complex** relationships, making them more flexible for a wide range of datasets. They are still linear models, because $f(x)$ is linear in the coefficients $w_i$ even though it is non-linear in $x$.
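
The section title can be checked numerically: a polynomial is non-linear in $x$ but linear in its coefficients, so fitting it is an ordinary linear least-squares problem on the features $1, x, x^2, \ldots$. A minimal sketch, assuming a made-up cubic (`w_true`) as the data source, builds these features with `np.vander` and solves for the coefficients with `np.linalg.lstsq`:

```python
import numpy as np

w_true = np.array([0.5, -1.0, 2.0, 0.3])       # made-up coefficients [w_0, w_1, w_2, w_3]
x = np.linspace(-2.0, 2.0, 20)

# Feature matrix with columns [1, x, x^2, x^3]
X = np.vander(x, N=4, increasing=True)
y = X @ w_true                                 # f(x) = sum_i w_i x^i, linear in w

w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear least squares
print(np.round(w_fit, 3))
```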


- **General Form**: The general form of a polynomial model is given by:
  $$
    f(x) = w_0 + w_1x + w_2x^2 + \ldots + w_px^p = \begin{bmatrix}
    w_0 & w_1 & \cdots & w_p \\
    \end{bmatrix} \begin{bmatrix}
    1 \\
    x^1 \\
    \vdots \\
    x^p
      \end{bmatrix}
  $$
    where:
    - $x$ is the independent variable.
    - $w_0, w_1, \ldots, w_p$ are the coefficients of the model.
    - $p$ is the degree of the polynomial.

- **Degree of the Polynomial**:
  - The degree $p$ of the polynomial determines the curve's complexity.
  - A degree of 1 corresponds to a linear model (straight line).
  - Higher degrees (2 for quadratic, 3 for cubic, etc.) allow for more complex curves.


- **Flexibility**:
  - Polynomial models are more flexible than straight-line models, able to fit data with curves and **non-linear relationships** between $x$ and $f(x)$.


- **Overfitting Concern**:
  - Caution is needed to avoid overfitting, especially with high-degree polynomials.
  - Overfitting occurs when the model becomes too complex, fitting the noise in the data rather than the underlying trend.
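
The overfitting concern can be illustrated by comparing the error on the points used for the fit with the error on held-out points. In the sketch below, the target function, noise level, degrees, and the even/odd split are all illustrative assumptions rather than the notebook's own setup.

```python
import numpy as np

rng = np.random.default_rng(0)               # seeded generator (assumption)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.shape)  # made-up noisy target

x_tr, y_tr = x[::2], y[::2]                  # even indices: training points
x_te, y_te = x[1::2], y[1::2]                # odd indices: held-out points

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

train_err = {}
for deg in (1, 3, 9):
    w = np.polyfit(x_tr, y_tr, deg)          # fit on the training points only
    train_err[deg] = mse(y_tr, np.polyval(w, x_tr))
    test_err = mse(y_te, np.polyval(w, x_te))
    print("deg=%d  train MSE=%.4f  test MSE=%.4f" % (deg, train_err[deg], test_err))
```

The training error cannot increase as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The held-out error is what reveals whether the added flexibility is fitting the trend or the noise.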



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation, rc
from IPython.display import HTML

# New function to generate data
def get_data(N):
    x = np.linspace(-2, 2, N)
    y = np.sin(x**3) * np.cos(x) - x**3
    y += np.random.normal(scale=1, size=x.shape)  # Adding Gaussian noise
    return x, y

# Generate data
x, y = get_data(20)

# Set up the figure, the axis, and the plot element
fig, ax = plt.subplots()
# ax.set_xlim((min(x), max(x)))
# ax.set_ylim((min(y)-1, max(y)+1))

# Plot the data points
ax.scatter(x, y, color='red')



# # Frame by frame
for pi in range(1,20,2): 
    z = np.polyfit(x, y, pi)  # what is this?
    poly_model = np.poly1d(z)  # what is this?
    ax.plot(x, poly_model(x),label='p = %s' % pi)
    # break
plt.legend()
plt.show()
    

# # Animation function
# # Initialization function: plot the background of each frame
# line, = ax.plot([], [], lw=2)
# def init():
#     line.set_data([], [])
#     return (line,)
# def animate(i):
#     if i == 0:
#         return (line,)
#     z = np.polyfit(x, y, i) #what is this? 
#     poly_model = np.poly1d(z) #what is this?
#     line.set_data(x, poly_model(x))
#     ax.set_title(f"Polynomial Degree {i}")
#     return (line,)

# # Call the animator
# anim = animation.FuncAnimation(fig, animate, init_func=init,
#                                frames=15, interval=200, blit=True)

# # Display the animation
# HTML(anim.to_html5_video())
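
The two calls marked `# what is this?` can be unpacked as follows. `np.polyfit(x, y, deg)` returns the least-squares coefficients ordered from the highest power down to the constant term, and `np.poly1d` wraps such a coefficient array in a callable polynomial. A minimal sketch on a made-up, noise-free quadratic:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
y = 3.0 * x**2 - 2.0 * x + 1.0   # made-up noise-free quadratic

z = np.polyfit(x, y, 2)          # coefficients, highest power first
poly_model = np.poly1d(z)        # callable polynomial built from z

print(np.round(z, 6))
print(poly_model(0.5))           # evaluates 3*0.25 - 2*0.5 + 1
```

Because the ordering is highest power first, `z[-1]` is the constant term $w_0$ and `z[0]` is the coefficient of the highest power.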

In [None]:
# polynomial model 1D
def poly_pred(data_tr, deg, x_grid):
    x, y = data_tr

    #training 
    w = np.polyfit(x, y, deg)
    poly_model = np.poly1d(w) 

    #prediction
    y_pred = poly_model(x_grid)
    return y_pred,w

In [None]:
x, y = get_data(10)
data_tr = (x,y)
x_grid = np.linspace(-2., 2., 200)

fig, ax = plt.subplots(figsize=(6, 5))
w_ = [] 
p_ = np.arange(1, 13, 1,dtype=np.int32)
for p in p_: # loop over different degrees
    y_pred, w = poly_pred(data_tr, p, x_grid)
    print(p,w)
    plt.plot(x_grid,y_pred,label='p=%s'%p)
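    # np.polyfit returns the highest power first; reverse so that column i holds w_i,
    # then pad with zeros up to the largest number of coefficients (13)
    w_.append(np.pad(w[::-1], (0, 13-w.shape[0]),
              mode='constant', constant_values=0))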
plt.scatter(x,y,s=75,label='data')
plt.legend(fontsize=5)
plt.xlabel(r'$x$',fontsize=18)
plt.ylabel(r'$f(x)$',fontsize=18)
# plt.savefig('Figures/polyfit_2.png',dpi=1800)

fig, ax0 = plt.subplots(1, 1)
c = ax0.pcolor(np.abs(np.array(w_)), edgecolors='k', linewidths=4)
fig.colorbar(c, ax=ax0,label=r'$|w_i|$')
ax0.set_xlabel(r'$w_i$')
ax0.set_yticks(np.arange(p_.shape[0])+0.5, p_)
ax0.set_ylabel('Poly degree')
fig.tight_layout()
plt.show()
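
The heat map shows how the coefficient magnitudes $|w_i|$ change with the polynomial degree. One common way to keep them small is ridge (L2) regularization, which adds a penalty $\lambda \sum_i w_i^2$ to the least-squares loss. The following is a minimal sketch, not the notebook's method: the data, the degree, the values of `lam`, and the helper `ridge_fit` are all illustrative assumptions, and for simplicity the intercept is penalized as well.

```python
import numpy as np

def ridge_fit(x, y, deg, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y (hypothetical helper)."""
    X = np.vander(x, N=deg + 1, increasing=True)   # columns [1, x, ..., x^deg]
    A = X.T @ X + lam * np.eye(deg + 1)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)                     # seeded generator (assumption)
x = np.linspace(-2.0, 2.0, 10)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)  # made-up noisy data

norms = []
for lam in (1e-3, 1e-1, 10.0):
    w = ridge_fit(x, y, deg=8, lam=lam)
    norms.append(np.linalg.norm(w))
    print("lambda=%g  ||w||=%.4f" % (lam, norms[-1]))
```

Larger values of $\lambda$ shrink the coefficient vector, trading some accuracy on the training points for a smoother model.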