<a href="https://colab.research.google.com/github/DoubleJ79/AGR-4020-F23/blob/main/Intro%20to%20Jupyter%20for%20descriptive%20stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows you how to use "numpy", a Python library for performing mathematical operations on arrays and matrices that also provides basic statistical functions.

The formula for calculating the standard deviation of a set of values is: s = sqrt(sum((x - x̄)^2) / (n - 1))
where s is the standard deviation, x is the set of values, x̄ is the mean of the values, and n is the number of values in the set.

In this formula, we first calculate the difference between each value and the mean, square each difference, sum the squared differences, divide by the number of values minus one, and then take the square root of the result to obtain the standard deviation.
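As a rough illustration, the steps described above can be traced one at a time in plain Python. The three sample values and the variable names below are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
import math

# Hypothetical sample values, chosen so the arithmetic is easy to follow
values = [2, 4, 6]

n = len(values)                             # number of values
mean = sum(values) / n                      # 4.0
differences = [x - mean for x in values]    # [-2.0, 0.0, 2.0]
squared = [d ** 2 for d in differences]     # [4.0, 0.0, 4.0]
sum_of_squares = sum(squared)               # 8.0
variance = sum_of_squares / (n - 1)         # 8.0 / 2 = 4.0
s = math.sqrt(variance)                     # 2.0

print("Standard deviation:", s)
```

Each intermediate variable corresponds to one step of the formula, so you can print any of them to see where a value comes from.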

The plain-text formula is readable, but it doesn't look professional. How can we improve it? Use LaTeX syntax (pronounced "Lay-Tek"). This is part of the beauty of Jupyter notebooks, which are similar to the R Markdown and Quarto documents you would use inside RStudio.

To insert the equation for standard deviation in a Jupyter notebook, write it in LaTeX syntax inside a text cell. Here's how the formatted equation looks:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

In this equation, s represents the standard deviation, x_i represents the ith value in the set of values (i.e., the ith element of an array), n represents the number of values in the set, and \bar{x} represents the mean of the values.

The dollar signs are delimiters that mark the beginning and end of a math equation. When you enclose an equation in two dollar signs on each side (like bookends), the equation is displayed on a separate line, centered on the page, with extra vertical space above and below it.
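For comparison, here is a short sketch of the two delimiter styles as they might be typed in a text cell. A single dollar sign on each side keeps the math inline with the sentence, while double dollar signs produce a displayed equation. The sentence wording is only illustrative.

```latex
The sample mean is $\bar{x}$ and the sample standard deviation is $s$.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
```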

Next, add a new text cell and modify the LaTeX above to omit the square root symbol and the denominator so that only the sum of squared errors remains, and change "s" to "sum of squares". You can add a space between words by typing a backslash followed by a space.





$$sum\ of\ squares = \sum_{i=1}^{n}(x_i - \bar{x})^2$$

In [18]:
import math  # Standard library module that provides sqrt
import numpy as np  # Array operations and basic statistics

In [20]:
array1 = [1, 2, 4, 8, 12]  # Define a list (i.e., array) of values
xbar = sum(array1) / len(array1)  # Mean: the sum divided by the number of elements in the array (i.e., "n")
sigma_sq = sum((x - xbar) ** 2 for x in array1) / (len(array1) - 1)  # Sample variance: sum of squared differences divided by n - 1
sigma = math.sqrt(sigma_sq)  # Sample standard deviation: square root of the variance




That works, but we can't see the results yet, so let's tell Python to print them:

In [21]:
print("Array1:",array1)
print("x_Bar:", xbar)
print("sigma squared:",sigma_sq)
print("sigma:", sigma)

Array1: [1, 2, 4, 8, 12]
x_Bar: 5.4
sigma squared: 20.8
sigma: 4.560701700396552
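One way to cross-check the hand-written calculation is Python's built-in statistics module, whose stdev and variance functions use the same sample (n - 1) formulas. This is only a sketch for verification and is not part of the original workflow.

```python
import statistics

array1 = [1, 2, 4, 8, 12]

# statistics.variance and statistics.stdev divide by n - 1 (sample formulas)
sample_variance = statistics.variance(array1)
sample_sd = statistics.stdev(array1)

print("variance:", sample_variance)
print("stdev:", sample_sd)
```

Both values should agree with sigma squared and sigma printed above, up to floating-point rounding.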


In [5]:
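array1 = np.array([1,2,4,8,12])  # Convert the list of values to a numpy array
xbar = np.mean(array1)
# Note: np.std and np.var divide by n by default (ddof=0), not n - 1,
# so these results are smaller than the hand-calculated values above
sigma = np.std(array1)
sigma_sq = np.var(array1)
print(xbar)
print(sigma)
print(sigma_sq)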

5.4
4.079215610874228
16.64


In [7]:
print("Array:",array1)
print("X_Bar:", xbar)
print("Sigma:", sigma)
print("Sigma squared:",sigma_sq)

Array: [ 1  2  4  8 12]
X_Bar: 5.4
Sigma: 4.079215610874228
Sigma squared: 16.64
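The numpy results (4.08 and 16.64) differ from the hand-calculated ones (4.56 and 20.8) because np.std and np.var divide by n unless told otherwise. A minimal sketch of how the ddof ("delta degrees of freedom") argument switches to the n - 1 denominator used in the equation for s:

```python
import numpy as np

array1 = np.array([1, 2, 4, 8, 12])

# ddof=0 (the default) divides by n: population formulas
population_sd = np.std(array1)

# ddof=1 divides by n - 1: sample formulas, matching the equation for s
sample_sd = np.std(array1, ddof=1)
sample_variance = np.var(array1, ddof=1)

print("Population SD:", population_sd)
print("Sample SD:", sample_sd)
print("Sample variance:", sample_variance)
```

With ddof=1, the numpy values should match the hand-calculated sigma and sigma squared up to floating-point rounding.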


In [8]:
array2 = np.array([2,4,3,7,16])


Now we can subtract the arrays element by element and square the results. The code below finds the squared differences between the two arrays:

In [12]:
sq_diff = (array1 - array2)**2
print("Squared differences:", sq_diff)

Squared differences: [ 1  4  1  1 16]


In [13]:
mean_sq_error = np.mean(sq_diff)
print("Mean Squared Error:", mean_sq_error)
rmse = np.sqrt(mean_sq_error)
print("Root Mean Squared Error:", rmse)

Mean Squared Error: 4.6
Root Mean Squared Error: 2.1447610589527217


The RMSE shows that the values in array1 and array2 typically differ by about 2.14 units. Now let's bundle everything together into one line of code for the calculation:

In [14]:
# Calculate the RMSE
rmse = np.sqrt(np.mean((array1 - array2) ** 2))

# Print the result
print("RMSE:", rmse)

RMSE: 2.1447610589527217
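If you need the RMSE more than once, the one-line calculation can be wrapped in a small function. The sketch below is one possible way to do that; the function name rmse and the shape check are assumptions added for illustration, not part of numpy.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two equal-shaped sequences (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Guard against silently broadcasting arrays of different shapes
    if a.shape != b.shape:
        raise ValueError("inputs must have the same shape")
    return np.sqrt(np.mean((a - b) ** 2))

# Same values as array1 and array2 in this tutorial
result = rmse([1, 2, 4, 8, 12], [2, 4, 3, 7, 16])
print("RMSE:", result)
```

The shape check is a design choice: without it, numpy may broadcast mismatched inputs and return a number that looks valid but compares the wrong elements.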
