In [42]:
# #Please DO NOT RUN! This is for printing versions of my packages.
# import sys  # Import sys to get Python version
# import numpy as np
# import matplotlib
# import scipy
# import pandas

# # Print the version of each package
# print("Python version:", sys.version)
# print("NumPy version:", np.__version__)
# print("Matplotlib version:",matplotlib.__version__)
# print("SciPy version:", scipy.__version__)
# print("Pandas version:", pandas.__version__)

Python version: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
NumPy version: 1.23.5
Matplotlib version: 3.4.2
SciPy version: 1.9.1
Pandas version: 1.5.3


### Package Versions

- **Python version**: `3.9.13`
- **NumPy version**: `1.23.5`
- **Matplotlib version**: `3.4.2`
- **SciPy version**: `1.9.1`
- **Pandas version**: `1.5.3`



In [52]:
# Import the necessary libraries
import numpy as np # For calculations
import matplotlib.pyplot as plt # For plotting 
from scipy.optimize import curve_fit # For curve fitting
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,AutoMinorLocator) #make plot nicer
import pandas as pd # For reading the data
import os #For flexible path way to data

First, I read the data file using the `pandas` package. Then I calculate the error of each dependent data point $y_i$ as $e_i = \sqrt{y_i}$.
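The $\sqrt{y_i}$ rule is the usual choice for counting data. It rests on the assumption (not stated explicitly in this notebook) that each $y_i$ is a Poisson-distributed photon count, whose standard deviation equals the square root of its mean. A minimal sketch with simulated counts illustrates this; the mean of 49 and the seed are illustrative choices only, not values taken from the experiment.

```python
import numpy as np

# Illustrative values only (assumptions, not part of the measured data)
mean_count = 49.0
rng = np.random.default_rng(0)

# Draw many Poisson-distributed counts with the same mean
samples = rng.poisson(lam=mean_count, size=100_000)

# For a Poisson variable the standard deviation is sqrt(mean)
sample_std = samples.std()
expected_std = np.sqrt(mean_count)
print(f"sample std = {sample_std:.2f}, sqrt(mean) = {expected_std:.2f}")
```

For a single measurement the true mean is unknown, so the observed count $y_i$ is used in its place, which gives $e_i = \sqrt{y_i}$.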

In [53]:
# Define the path in a more flexible way
root_folder = os.getcwd()
data_file = os.path.join(root_folder, 'photon_exp.dat')
# Read the whitespace-delimited data file using pandas
# (sep=r'\s+' is equivalent to delim_whitespace=True, which is deprecated in newer pandas)
data = pd.read_csv(data_file, sep=r'\s+')

# Extract the (x) and (y)
x = data['X'].values
y = data['Y'].values

#Calculate error of y as sqrt(y)
e_y = np.sqrt(y)
print( "x =", x)
print("y=", y)
print("errors of y = ", e_y)

x = [0. 1. 2. 3. 4.]
y= [25 36 64 49 81]
errors of y =  [5. 6. 8. 7. 9.]


**Section a)**

To find the best-fit parameters $a$ and $b$ for the line $y=bx+a$, I use the `curve_fit` function from the `scipy` package, with the linear model $y=bx+a$ fitted to my data. `curve_fit` returns the best-fit parameters $a$ and $b$ and the covariance matrix of the parameter estimates,
$$
\operatorname{cov} =
\begin{pmatrix}
\sigma^2_a & \operatorname{cov}[a,b] \\
\operatorname{cov}[b,a] & \sigma^2_b
\end{pmatrix}.
$$

Here the diagonal elements of the matrix are the variances of the parameters,
$$
\operatorname{diag} = 
[\sigma^2_a, \sigma^2_b],
$$
which can be extracted using `np.diag()`.

The standard deviation of each parameter is 
$
\sigma_a = \sqrt{\sigma^2_a}, \sigma_b = \sqrt{\sigma^2_b}.
$

The correlation coefficient between the two parameter estimates is defined as
$$
r :=\frac{ \operatorname{cov}[a,b]}{\left(\sigma_a\times\sigma_b\right)}.
$$

In [47]:
# Define the linear model y = a + bx
def linear_model(x, a, b):
    return a + b * x

# Perform the curve fitting; absolute_sigma=True treats e_y as absolute
# uncertainties, so cov is not rescaled by the reduced chi-squared
parameters, cov = curve_fit(linear_model, x, y, sigma=e_y, absolute_sigma=True)
a, b = parameters  # best-fit parameters
sigma_a, sigma_b = np.sqrt(np.diag(cov))  # standard errors for a and b
# Calculate the correlation coefficient
cov_ab = cov[0, 1]  # covariance of a and b
correlation_coefficient = cov_ab / (sigma_a * sigma_b)
# Print the best-fit parameters with their errors
print("Best-fit parameters:")
print(f"a = {a:.2f} ± {sigma_a:.2f}")
print(f"b = {b:.2f} ± {sigma_b:.2f}")
# Print the correlation coefficient
print(f"r = {correlation_coefficient:.2f}")

Best-fit parameters:
a = 25.44 ± 4.26
b = 12.06 ± 2.11
r = -0.72


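The correlation coefficient $r = -0.72$ describes the fitted parameters $a$ and $b$, not the data $x$ and $y$. The estimates of the intercept and the slope are strongly negatively correlated: a fit with a larger intercept $a$ needs a smaller slope $b$ to describe the data about equally well, and vice versa.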
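As a cross-check on the `curve_fit` output above, the same quantities can be obtained in closed form, because a straight line is linear in its parameters. The sketch below solves the weighted least-squares normal equations, $\operatorname{cov} = (A^T W A)^{-1}$ and $(a, b)^T = \operatorname{cov}\, A^T W y$, with $W = \operatorname{diag}(1/e_i^2)$. The data values are copied from the printed output earlier in this notebook; the `_check` names are introduced here only for illustration.

```python
import numpy as np

# Data as printed earlier in the notebook
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([25.0, 36.0, 64.0, 49.0, 81.0])
e_y = np.sqrt(y)

# Design matrix for y = a + b*x: first column multiplies a, second multiplies b
A = np.column_stack([np.ones_like(x), x])
# Weight matrix with 1/sigma_i^2 on the diagonal
W = np.diag(1.0 / e_y**2)

# Covariance matrix of the parameters and the best-fit parameters
cov_check = np.linalg.inv(A.T @ W @ A)
a_check, b_check = cov_check @ A.T @ W @ y

# Standard errors and correlation coefficient of the estimates
sigma_a_check, sigma_b_check = np.sqrt(np.diag(cov_check))
r_check = cov_check[0, 1] / (sigma_a_check * sigma_b_check)

print(f"a = {a_check:.2f} ± {sigma_a_check:.2f}")
print(f"b = {b_check:.2f} ± {sigma_b_check:.2f}")
print(f"r = {r_check:.2f}")
```

The off-diagonal element of $A^T W A$ is $\sum_i x_i / e_i^2$, which is positive when all $x_i \ge 0$; inverting the matrix flips its sign, which is why the correlation between $a$ and $b$ comes out negative for this data set.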

**Section b)**

$\chi^2$ is a measure of how well the model fits the data, taking the uncertainties of the data into account; smaller values of $\chi^2$ indicate closer agreement between the model and the data.
$\chi^2$ is calculated by
$$
\chi^2 := \sum_{i=1}^{n} \left( \frac{y_i - f(x_i)}{\sigma_i} \right)^2 =\sum_{i=1}^{n} \left( \frac{y_i - b x_i -a}{\sigma_i} \right)^2,
$$
where the numerator, $y_i - b x_i - a$, is called the residual.


$R^2$ tells us how much of the variance in the dependent variable is explained by the independent variable. The formula for $R^2$ is
$$
R^2 := 1 - \frac{\sum_{i=1}^{n} (y_i -  b x_i -a)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
$$
where $\bar{y} := \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean value of the variable $y$.

In [46]:
# Calculate the residuals as a NumPy array
residuals = y - linear_model(x, *parameters)

# Calculate the chi-squared
chi_squared = np.sum((residuals / e_y) ** 2)

# Calculate the coefficient of determination
numerator = np.sum(residuals ** 2)
denominator = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - (numerator  / denominator)
#print the results
print(f"χ² = {chi_squared:.2f}")
print(f"R² = {r_squared:.2f}")

χ² = 7.24
R² = 0.79


For a good fit we expect $\chi^2 \approx n-p$, where $n = 5$ is the number of data points and $p = 2$ is the number of fitted parameters. So we expect $\chi^2 \approx 3$ for a good fit, but here we have approximately $7$. The reduced $\chi^2$ is $\frac{7.24}{3} \approx 2.41$, which is greater than 1, so the fit is not great.
This indicates that the model does not fit the data perfectly; the data may be more scattered than the stated uncertainties and the linear model predict.
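One way to put a number on how unusual $\chi^2 = 7.24$ is for $3$ degrees of freedom is the tail probability (p-value) of the $\chi^2$ distribution. This is a sketch that assumes the errors are Gaussian and correctly estimated, which is only approximate for counts of this size. It uses `scipy.stats.chi2.sf`, with the $\chi^2$ value taken from the output above.

```python
from scipy.stats import chi2

chi_squared = 7.24  # value printed above
n_points = 5        # number of data points
n_params = 2        # fitted parameters a and b
dof = n_points - n_params

# Reduced chi-squared: expected to be close to 1 for a good fit
reduced_chi_squared = chi_squared / dof

# Probability of obtaining a chi-squared at least this large by chance
# if the model and the error estimates are correct
p_value = chi2.sf(chi_squared, dof)

print(f"reduced χ² = {reduced_chi_squared:.2f}")
print(f"p-value = {p_value:.3f}")
```

A p-value of a few percent means the fit is in some tension with the data but is not decisively ruled out, which is consistent with describing it as "not great" rather than wrong.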


The $R^2$ value indicates that approximately $79\%$ of the variation in the dependent variable ($y$) can be explained by the independent variable ($x$). The remaining $21\%$ of the variation is due to other factors not accounted for by the model. The closer $R^2$ is to 1, the better the model explains the data.