# Using Linear Regression to Extract Physical Constants: Newtonian Gravitation

Author: Julie Butler, butle222@msu.edu

Date Created: October 15, 2022

Last Modified: November 7, 2022

Notebook 2/5 of the DSECOP Module: An Introduction to the Machine Learning Workflow with Linear Regression.  See the entire module [here]().

## Imports

First, let's import the packages needed to run this notebook.

In [None]:
##############################
##          IMPORTS         ##
##############################
# Needed for arrays and mathematical functions
import numpy as np
# For plotting
import matplotlib.pyplot as plt
# For importing and formatting data sets
import pandas as pd
# For the machine learning section; these functions are explained later in
# the notebook
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Now we need to set up the notebook to sync with Google Drive.  Follow the prompts in the pop-up windows that appear.

In [None]:
# GOOGLE DRIVE SET UP
from google.colab import drive
# Mount my Google Drive
drive.mount('/content/gdrive')

### IMPORTANT!

**Problem 1:** Change the line below to match the file directory where the notebooks and data are stored on your Google Drive.

In [None]:
# Directory to retrieve data files from (###CHANGE THIS###)
data_dir = '/content/gdrive/My Drive/Teaching/DSECOP/Module 2: Workflow/'

## Building A Model

Consider a star with mass $m_1$ at location $\vec{r}_1$ and a planet with mass $m_2$ at location $\vec{r}_2$.  The gravitational force on the planet due to the star is:

$$\vec{F}_{G,21} = -G\frac{m_1m_2}{r_{21}^2}\hat{r}_{21},$$

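where $\vec{r}_{21} = \vec{r}_2 - \vec{r}_1$ is the position of the planet relative to the star, $r_{21}$ is its magnitude, and $\hat{r}_{21}$ is the corresponding unit vector.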

However, we also know that the net force on an object is given by Newton's second law:

$$\vec{F}_{net} = m\vec{a},$$

where $\vec{a}$ is the acceleration of the object.

If we assume that the only force acting on our planet is the gravitational force due to the star, then the net force on the planet equals the Newtonian gravitational force.  To simplify calculations, we will also only consider the magnitudes of both equations.

$$\vec{F}_{G,21} = \vec{F}_{net} \longrightarrow F_{G,21} = F_{net}$$

We can expand both equations using the above definitions, where the $m$ in the net force equation is $m_2$, the mass of the planet (i.e., the object whose motion we are studying).

$$G\frac{m_1m_2}{r_{21}^2} = m_2a$$

We can simplify this equation a bit further by cancelling $m_2$ from both sides:

$$a = \frac{Gm_1}{r_{21}^2} = Gm_1\frac{1}{r_{21}^2}$$

Now, let's separate the constants from the time-dependent quantities.  Both the acceleration and the position depend on time, so we can add notation that reflects this.

$$a(t) =  Gm_1\frac{1}{r_{21}(t)^2}$$

Now, let's look at the equation for the output of the linear regression algorithm and compare it to the equation for acceleration which we have just derived.

$$\hat{y} = X\theta$$
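
To make the matrix form concrete, here is a small sketch of how $\hat{y} = X\theta$ can be evaluated with NumPy. The numbers are made up purely for illustration; they are not taken from the data set used in this notebook.

```python
import numpy as np

# Hypothetical design matrix: 4 data points, 1 feature (made-up values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
# Hypothetical weight vector with one entry per feature
theta = np.array([0.5])

# The matrix-vector product gives one prediction per data point
y_hat = X @ theta
print(y_hat.shape)  # (4,)
print(y_hat)
```

Each row of `X` is one data point and each column is one feature, so the product returns one predicted value per row.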

By comparing the equation for $a(t)$ to the equation for $\hat{y}$, we can begin to see how to format our acceleration equation as a linear regression problem.  First, we can see that $\hat{y}$ will be the acceleration of our planet, which we can expand as a vector with one entry for each of the time steps $t_0$ through $t_N$.

$$\hat{y} = a(t)\ = \ \begin{bmatrix}
    a(t_0) \\
    a(t_1) \\
    a(t_2) \\
    . \\
    . \\
    . \\
    a(t_N)
\end{bmatrix}$$

We can also see that our X data will be the inverse squared position, which we can represent in vector form based on time steps as:

$$X\ =\ \frac{1}{r_{21}^2(t)}\ =\ \begin{bmatrix}
    1/r_{21}^2(t_0) \\
    1/r_{21}^2(t_1) \\
    1/r_{21}^2(t_2) \\
    . \\
    . \\
    . \\
    1/r_{21}^2(t_N) \\
\end{bmatrix}$$

Here, we can take a moment to simplify the notation since we only have two objects in total, the planet and the star.  For the remainder of this problem we will simplify the notation to $r_{21} \rightarrow r_2$.

$$X\ =\ \frac{1}{r_{2}^2(t)}\ =\ \begin{bmatrix}
    1/r_{2}^2(t_0) \\
    1/r_{2}^2(t_1) \\
    1/r_{2}^2(t_2) \\
    . \\
    . \\
    . \\
    1/r_{2}^2(t_N) \\
\end{bmatrix}$$

Finally, the weights, $\theta$, will be the gravitational constant, G, multiplied by the mass of the star, m$_1$.

$$\theta = Gm_1$$
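
As a rough check on this setup, the sketch below builds noise-free synthetic data from $a = Gm_1/r^2$ and recovers $\theta$ with the closed-form least-squares solution for a single feature and no intercept, $\theta = \sum_i x_i y_i / \sum_i x_i^2$. The star mass and distances are made-up values chosen for illustration, not the values in the notebook's data file.

```python
import numpy as np

G = 6.67e-11            # gravitational constant in N m^2 / kg^2
m1_assumed = 2.0e30     # hypothetical star mass in kg (illustrative only)

# Hypothetical star-planet distances in m
r = np.linspace(1.0e11, 2.0e11, 50)

x = 1.0 / r**2                  # feature: inverse squared distance
a = G * m1_assumed * x          # target: magnitude of the acceleration

# Least-squares weight for one feature with no intercept
theta = np.sum(x * a) / np.sum(x * x)

# Dividing the weight by G recovers the assumed mass
m1_recovered = theta / G
print(m1_recovered)
```

Because the synthetic data contain no noise, the recovered mass should agree with the assumed one up to floating-point rounding; real data will not match this closely.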

The goal of this linear regression analysis will be to determine the value of m$_1$, the mass of the star the planet is orbiting.

### Import and Format the Data Set

The data needed for this problem is stored in the file `data_notebook_2_one_planet.csv`.  Before we attempt to apply linear regression to this problem we will need to import and format the data set.

**Problem 2:** Import the data set as a Pandas dataframe and print the first few lines.

**Problem 3:** Remembering that a subscript of "2" represents data collected about the planet, what data is provided for this problem?  Write your response in the textbox below.  In the code cell below that, convert each column of the data file to a NumPy array, saving the NumPy arrays with useful and descriptive names (e.g., t, a, v, and r).
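
As a reminder of the general pattern, the sketch below reads a small CSV into a Pandas dataframe and pulls columns out as NumPy arrays. The CSV contents and the column names `time` and `position` are made up for illustration; the real file will have its own column names, which you can see by printing the first few lines.

```python
import io
import pandas as pd

# Made-up CSV contents with hypothetical column names
csv_text = "time,position\n0.0,1.5\n1.0,2.5\n2.0,3.5\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Each column can be pulled out as a NumPy array
t_demo = df["time"].to_numpy()
r_demo = df["position"].to_numpy()
print(r_demo)
```

For a file on disk, a path string is passed to `pd.read_csv` in place of the `io.StringIO` object.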

Now we need to create the design matrix, X = 1/r$^2$.  However, since we were given just the position, we will need to perform some manipulations to create our design matrix:

In [None]:
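# Square the position, then invert it element-wise
r_squared = r**2
r_squared_inverse = 1/r_squared
# Arrange the result as a column: one row per data point, one feature
X = np.array([r_squared_inverse]).T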
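
The `np.array([...]).T` idiom above turns a one-dimensional array into a column with one row per data point. The sketch below, using a made-up position array, shows the resulting shape and that `reshape(-1, 1)` produces the same column.

```python
import numpy as np

# Made-up positions, standing in for the r array from the data file
r_demo = np.array([1.0, 2.0, 4.0])

# Same construction as above: wrap in a list, then transpose
X_demo = np.array([1 / r_demo**2]).T
print(X_demo.shape)  # (3, 1)

# Equivalent construction using reshape
X_alt = (1 / r_demo**2).reshape(-1, 1)
print(np.array_equal(X_demo, X_alt))  # True
```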

Next, we need to split our data into a training set and a test set.

**Problem 4:** Using the Scikit-Learn function `train_test_split` split the X and y data sets into training and test sets.  Use 20% of the data as the test set and the remainder as the training set.
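
For reference, the general call pattern of `train_test_split` looks like the sketch below. It uses made-up arrays rather than the notebook's data, and the `random_state` value is an arbitrary choice made only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with one feature each
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo[:, 0]

# test_size=0.2 holds out 20% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

Additional arrays of the same length, such as a time array, can be passed in the same call so that they are shuffled and split consistently with `X` and `y`.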

Since our X data set only has one feature (i.e., it's a vector instead of a matrix), we will need to reshape the X components of our training and test data sets into columns.

In [None]:
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)

**Problem 5:** Using Scikit-Learn's linear regression model:
* Define a linear regression model, remembering to set `fit_intercept` to False
* Train the model using the training data
* Predict the values of the test set and store the predictions for later use
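
The three steps above follow Scikit-Learn's usual define, fit, predict pattern. A minimal sketch on made-up data (a straight line through the origin with slope 3, not the notebook's data) might look like this; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 3x exactly
X_demo_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo_train = 3.0 * X_demo_train[:, 0]
X_demo_test = np.array([[5.0], [6.0]])

# fit_intercept=False forces the fitted line through the origin
model = LinearRegression(fit_intercept=False)
model.fit(X_demo_train, y_demo_train)

# Predictions for inputs the model has not seen
y_demo_pred = model.predict(X_demo_test)
print(model.coef_, y_demo_pred)
```

Setting `fit_intercept=False` matters here because the physical model $a = Gm_1/r^2$ has no constant offset term.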

Now let's define a function that will calculate the mean-squared error between two data sets.

In [None]:
def mse(A, B):
    # Mean of the squared element-wise differences between A and B
    return np.average((A - B)**2)
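
One thing to watch for when calling a function like this: NumPy broadcasting will silently combine an array of shape `(N,)` with one of shape `(N, 1)` into an `(N, N)` array, which changes the result. The sketch below redefines the same function so it runs on its own and uses made-up numbers to show the effect.

```python
import numpy as np

def mse(A, B):
    return np.average((A - B)**2)

y_true = np.array([1.0, 2.0, 3.0])       # shape (3,)
y_same = np.array([1.0, 2.0, 3.0])       # shape (3,)
y_column = y_same.reshape(-1, 1)         # shape (3, 1)

print(mse(y_true, y_same))               # 0.0, as expected
# Mismatched shapes broadcast to a 3x3 array, so the result is not 0
print(mse(y_true, y_column))
# Flattening both inputs first avoids the problem
print(mse(y_true.ravel(), y_column.ravel()))  # 0.0
```

If an MSE looks unexpectedly large, comparing the `.shape` of the two inputs is a quick first check.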

**Problem 6:** Using the MSE function defined above, calculate the MSE between the test data set and the predictions. Based on the MSE, is the machine learning model a good match for the data?

**Problem 7:** Graph the test data set and the predicted data set as a function of time (remember that the y component of each data set is acceleration).  Make sure you have a legend on your graph and label the x and y axes.
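
Because `train_test_split` shuffles the data by default, the test points are generally not in time order, so a scatter plot is often easier to read than a connected line. The sketch below shows one way to lay out such a figure; all arrays and labels are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up, unordered time values and two made-up data series
t_demo = np.array([3.0, 0.0, 2.0, 1.0])
y_demo_true = t_demo**2
y_demo_pred = t_demo**2 + 0.1

fig, ax = plt.subplots()
# Scatter plots do not connect points, so the order of t does not matter
ax.scatter(t_demo, y_demo_true, label="True values")
ax.scatter(t_demo, y_demo_pred, marker="x", label="Predictions")
ax.set_xlabel("Time (s)")
ax.set_ylabel("y (arbitrary units)")
ax.legend()
```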

Finally, let's extract the linear regression weights to see if we can recover the mass of the star.

In [None]:
# LiR is assumed to be the name given to the trained model in Problem 5
weight = LiR.coef_

**Problem 8:** Perform a mathematical operation on the linear regression weight to recover the mass of the star.  Remember that G = 6.67x10$^{-11}$ $\frac{N m^2}{kg^2}$.  The mass of the star used in the simulation to create the data set is 8x10$^{28}$ kg.  How close is your result?

In [None]:
weight/6.67e-11
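
One common way to quantify how close a result is to a known value is the percent error. The sketch below uses a made-up estimate purely to show the arithmetic; substitute your own result from the cell above.

```python
# Known star mass used to generate the data, in kg (from the problem statement)
m_true = 8e28
# Made-up estimate, for illustration only
m_estimate = 7.9e28

percent_error = abs(m_estimate - m_true) / m_true * 100
print(f"{percent_error:.2f}%")  # 1.25%
```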

## Practice What You Have Learned

Let's assume that we have a system that contains two objects, object 1 (mass = $m_1$, location = $\vec{r}_1$) and object 2 (mass = $m_2$, location = $\vec{r}_2$).  These two objects have a strange interaction but after much study we have been able to determine that the force on object 1 caused by object 2 can be modelled by the following equation:

$$\vec{F}_{12} = \frac{m_1^2m_2}{(\vec{r}_2-\vec{r}_1)^3} + \frac{m_1}{m_2}(\vec{r}_2-\vec{r}_1)$$

Unfortunately, we have not been able to determine their masses. However, we have been able to record their relative positions and the acceleration of the first object. Using this information, you are asked to determine the masses of the two objects.

a. Before we begin coding, we need to develop a theoretical model to try to match.  Using the above force equation and Newton's second law, write an equation that relates the acceleration of object 1 to the relative distance between the two objects.

b. Using your equation from part a and the equation for linear regression ($\hat{y} = X\theta$), figure out which values correspond to $\hat{y}$, X, and $\theta$.

c. The data for this problem is stored in `data_notebook_2_objects`.  Import the data as a Pandas dataframe and print the first few lines.  The columns, in order, are: the time (in seconds), the acceleration of object 1 (m/s$^2$), the velocity of object 1 (m/s), and the relative position between the two objects (i.e., r$_2$ - r$_1$, measured in m).  Extract these columns each as a separate NumPy array and save them with useful names.

d. Create a design matrix X using the imported data that corresponds to your answers to parts a and b. 

e. Using the function `train_test_split` from Scikit-Learn, split your data into a training set and a test set. Use 20% of the data as the test data set.

f. Using Scikit-Learn's linear regression model:
* Define a linear regression model, remembering to set `fit_intercept` to False
* Train the model using the training data
* Predict the values of the test set and store the predictions for later use

g. Calculate the MSE between the test data set and the predicted data set.  Based on this error, is the machine learning model a good fit for the data?

h. Plot the predicted and test data sets on the same graph as a function of time.  Make sure you label your axes and add a legend.

i. Extract the linear regression weights and use them to extract m$_1$ and m$_2$.  The values of m$_1$ and m$_2$ used to generate the data were 5 kg and 7 kg, respectively.  How close were your answers?
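
As a hint for parts b and d: when a model has more than one term, each term becomes its own column of the design matrix and each column gets its own weight. The sketch below illustrates this with a deliberately different, made-up model, $y = \theta_1 x^2 + \theta_2 \sqrt{x}$, so the functional forms, weights, and variable names are illustrative assumptions rather than the solution to this problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up inputs and made-up "true" weights
x = np.linspace(1.0, 5.0, 20)
theta_true = np.array([2.0, -3.0])
y = theta_true[0] * x**2 + theta_true[1] * np.sqrt(x)

# One column per term in the model: shape (20, 2)
X_two = np.column_stack([x**2, np.sqrt(x)])

model = LinearRegression(fit_intercept=False)
model.fit(X_two, y)
print(model.coef_)  # close to [2, -3]
```

The order of the entries in `coef_` matches the order of the columns in the design matrix, which is what lets each weight be tied back to a specific term in the model.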