# Tasks
This is a Jupyter notebook containing my solutions to the task assessments in my 4th-year module "Emerging Technologies". Author: Conor Rabbitte (conorrrabbitte.it3@gmail.com)

***

## Task 1: Calculate the square root of 2.
The first task is to create a Python function called **sqrt2** that calculates and prints to the screen the square root of 2 to 100 decimal places. I will achieve this by using a well-known and popular root-finding algorithm known as Newton's method (also known as the Newton-Raphson method) [1].

$$ x_{n+1} = x_{n} - \frac{f(x_{n})}{f'(x_{n})} $$

### Understanding Newton's Method
Newton's method "produces successively better approximations to the roots (or zeroes) of a real-valued function" [1]. This means that with every iteration we get closer to the true root of the function. We start with a function $f$ of $x$, its derivative [3] $f'$, and an initial guess $x_{0}$ for the root of $f$. Applying the formula above gives a better estimate $x_{1}$, and repeating it with $x_{1}$ in place of $x_{0}$ gives $x_{2}$, and so on.

### Example
Let's find the square root of a number $a$, that is to say the positive number $x$ such that $x^2=a$. This $x$ is a zero of the function $f(x)=x^2-a$, whose derivative is $f'(x)=2x$.
For example, to find the square root of $a=2$ with an initial guess $x_{0}=1.4$ (the initial estimate comes from the graph below), the sequence given by Newton's method is:

$$ x_{1} = x_{0}-\frac{f(x_{0})}{f'(x_{0})} = 1.4-\frac{1.4^2-2}{2 \times 1.4} \approx 1.41429$$

If we then repeat the method with the previous answer $x_{1} \approx 1.41429$ we get a better approximation of the root.

$$ x_{2} = x_{1}-\frac{f(x_{1})}{f'(x_{1})} = 1.41429-\frac{1.41429^2-2}{2 \times 1.41429} \approx 1.41421356$$
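
The iterations above can be reproduced with a short sketch. The helper name `newton_sqrt_steps` and its parameters are illustrative assumptions and are not part of the `sqrt2` function the task asks for.

```python
def newton_sqrt_steps(a, x0, steps):
    """Return [x0, x1, ..., x_steps] from Newton's method applied to f(x) = x^2 - a."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        # Newton update: x - f(x) / f'(x), with f(x) = x^2 - a and f'(x) = 2x.
        xs.append(x - (x * x - a) / (2 * x))
    return xs


# Same starting point as the worked example: a = 2, x0 = 1.4.
for n, x in enumerate(newton_sqrt_steps(2, 1.4, 3)):
    print(n, x)
```

The first two iterates come out at roughly 1.41429 and 1.41421356, and each further step roughly doubles the number of correct digits.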


<img src="img/Square_root_graph.png">

### Python code of the function

In [1]:
def sqrt2():
    """
    Calculate the square root of 2 using Newton's method and print it with 100 decimal places.

    Note: x is a Python float, so only about the first 16 significant digits are meaningful.
    """
    # The number whose square root we want.
    a = 2
    # Initial guess for the square root x.
    x = 1.4
    # Loop until x * x is within 1e-14 of a.
    while abs(a - (x * x)) > 1e-14:
        # Newton step for f(x) = x^2 - a, whose derivative is f'(x) = 2x.
        x -= (x * x - a) / (2 * x)
    # Print the result to the screen with 100 decimal places.
    print("%.100f" % x)

### Testing Function Sqrt2

In [2]:
# Test the function sqrt2().
sqrt2()

1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000
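
One caveat about the output above: `x` is a Python float (a 64-bit double), which carries only about 16 significant decimal digits. The digits printed after roughly the 16th place are the exact decimal expansion of the stored float, not digits of $\sqrt{2}$. A minimal sketch of one way to obtain 100 correct decimal places is to run the same Newton iteration on Python integers, which have arbitrary precision. The function name `sqrt2_digits` and the `places` parameter are assumptions made for illustration.

```python
def sqrt2_digits(places=100):
    """Return sqrt(2) truncated to `places` decimal places, as a string (places >= 1)."""
    # Scale 2 by 10**(2 * places): the integer square root of n is then
    # sqrt(2) * 10**places with the fractional part cut off.
    n = 2 * 10 ** (2 * places)
    # Start at or above the true root so the integer iteration only decreases.
    x = n
    while True:
        # Integer form of the Newton step x - (x^2 - n) / (2x) = (x + n / x) / 2.
        y = (x + n // x) // 2
        if y >= x:
            break
        x = y
    digits = str(x)
    # The first digit is the integer part; the rest are the decimal places.
    return digits[0] + "." + digits[1:]


print(sqrt2_digits(100))
```

The result is truncated, not rounded, at the last decimal place.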


## Conclusion
As demonstrated above, we can use Newton's method to find the square root of a given number, with each iteration producing a successively better approximation. However, some numbers, such as $\sqrt{2}$, cannot be expressed as a ratio of two integers. These are known as 'irrational numbers', and their decimal expansions neither terminate nor repeat, so any printed value is only an approximation.

In conclusion, modern mathematicians have nothing to fear from irrational numbers. Unlike the ancient Pythagoreans, we find they make perfect sense, so there is no need to go hurling people overboard [4].

## References
[1] Newton's Method; Wikipedia; https://en.wikipedia.org/wiki/Newton%27s_method

[2] A Tour of Go; Exercise: Loops and Functions; https://tour.golang.org/flowcontrol/8

[3] Derivative; Wikipedia; https://en.wikipedia.org/wiki/Derivative

[4] The Square Root of 2; CosmosMagazine; Paul Davies; https://cosmosmagazine.com/mathematics/the-square-root-of-2/

***

## Task 2: Proving the value of a Chi-squared test
The second task is to write a Python program that runs a Chi-squared test on the categorical data in the table below. The Chi-squared value for this data is reported to be approximately 24.6, and the task is to verify that this is correct.

|              | A   | B   | C   | D   | Total |
| :----------- | :-: | :-: | :-: | :-: | --:   |
| White collar | 90  | 60  | 104 | 95  | 349   |
| Blue collar  | 30  | 50  | 51  | 20  | 151   |
| No collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |



### Understanding the Chi-squared statistic
The chi-squared statistic is a measure of the difference between the observed and expected frequencies of outcomes from a set of events or variables. The data used in calculating a chi-squared statistic must be randomly drawn from independent variables. When the observed data is categorized into a table, like the one above, we can calculate the expected frequency of each cell with a simple calculation. For example, if we take the column *'A'* and the row *'White collar'* and plug their totals into the formula, we get an expected frequency of 80.54.

$$ C \times \frac{R}{T} = 150 \times \frac{349}{650} \approx 80.54 $$

Where:
- *__C__* is the total of a given column
- *__R__* is the total of a given row
- *__T__* is the table's grand total
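
Applying the formula above to every cell can be sketched in plain Python. The observed counts are the ones consistent with the row and column totals of the table (No collar, B = 40), and the variable names are illustrative assumptions.

```python
# Observed counts: rows are White/Blue/No collar, columns are A to D.
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

row_totals = [sum(row) for row in observed]        # R for each row
col_totals = [sum(col) for col in zip(*observed)]  # C for each column
grand_total = sum(row_totals)                      # T

# expected[i][j] = C_j * R_i / T
expected = [[c * r / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```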

Using this formula across the entire table gives us a list of expected results corresponding to our observed results. Following this we can use another formula across each cell in the table whose sum total will give us our test statistic.

$$ \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(90 - 80.54)^2}{80.54} \approx 1.11 $$

The sum of this formula across all the cells of the given table yields approximately __24.6__.
Finally, we can calculate the *Degrees of Freedom* using another simple formula.

$$ (\text{Number of Rows} - 1)(\text{Number of Columns} - 1) = (3-1)(4-1) = 6 $$
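
As a hand-rolled cross-check of the two formulas above, the statistic and the degrees of freedom can be computed without any library. This is a sketch using the observed counts consistent with the table totals (No collar, B = 40); the variable names are assumptions.

```python
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell of the table.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = col_totals[j] * row_totals[i] / grand_total
        chi2 += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 4), dof)
```

The statistic should come out close to 24.57 with 6 degrees of freedom, which agrees with the reported value of approximately 24.6.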

<br>

\*\*This walkthrough of the Chi-squared statistic is based on *Wikipedia* [2] and *Investopedia* [5]. It is not my own work but my interpretation of those sources, and it serves as an explanation of the inner workings of the chi-squared statistic, which I will program in Python below.

### Python code - Chi-square statistic test
The attempt below was constructed using the guide found at pythonhealthcare.org [1]

In [3]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [4]:
data = pd.DataFrame(columns=['A', 'B', 'C', 'D'])

In [5]:
data.loc['White Collar'] = [90, 60, 104, 95]
data.loc['Blue Collar'] = [30, 50, 51, 20]
data.loc['No Collar'] = [30, 40, 45, 35]

In [6]:
data

|              | A   | B   | C   | D   |
| :----------- | :-: | :-: | :-: | :-: |
| White Collar | 90  | 60  | 104 | 95  |
| Blue Collar  | 30  | 50  | 51  | 20  |
| No Collar    | 30  | 40  | 45  | 35  |


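## Chi-square Contingency Method
The scipy.stats module provides the function chi2_contingency, which computes the chi-square statistic and p-value for a contingency table. Here it is called with two arguments, the observed table and correction=False, and it returns four values: the test statistic, the p-value, the degrees of freedom and the table of expected frequencies [3].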

In [7]:
chi2, p, dof, expected = stats.chi2_contingency(data, correction=False)

In [8]:
chi2

24.5712028585826

In [9]:
p

0.0004098425861096696

In [10]:
dof

6

In [11]:
expected

array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
       [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
       [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]])
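
The p-value above can be cross-checked without SciPy. For an even number of degrees of freedom $2m$, the upper tail probability of the chi-squared distribution has the closed form $e^{-x/2}\sum_{k=0}^{m-1}\frac{(x/2)^k}{k!}$. The sketch below applies it to the statistic and degrees of freedom reported above. The function name `chi2_sf_even_dof` is a hypothetical helper, and the formula does not cover odd degrees of freedom.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Upper tail probability P(X > x) of a chi-squared variable with even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only holds for positive, even dof")
    half = x / 2
    terms = dof // 2
    # exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!, with m = dof / 2.
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(terms))


# Statistic and degrees of freedom taken from the cells above.
p_check = chi2_sf_even_dof(24.5712028585826, 6)
print(p_check)
```

A p-value of about 0.0004 is far below the usual 0.05 threshold, so the observed counts differ from the expected ones by more than chance alone would typically explain.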

## References
[1] Statistics: Chi-squared test; pythonhealthcare.org; https://pythonhealthcare.org/2018/04/13/58-statistics-chi-squared-test/

[2] Chi-squared test; Wikipedia.org; https://en.wikipedia.org/wiki/Chi-squared_test

[3] Contingency; SciPy.org Documentation; https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

[4] Chisquare; SciPy.org Documentation; https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

[5] Chi-Square (χ2) Statistic Definition; investopedia.com; https://www.investopedia.com/terms/c/chi-square-statistic.asp

***

## Task 3: Standard Deviation
The third task is to simulate the standard deviation calculation used in Microsoft Excel known as STDEV.S. Further, the task is to demonstrate that the STDEV.S calculation gives a better estimate of the standard deviation of a population than the population formula (Excel's STDEV.P) when it is performed on a sample.

### Defining the difference
*__Standard Deviation (SD):__* measures the amount of *variation* or the *spread* of a data distribution. It summarises how far the data points typically lie from the mean. A low SD indicates the data is close to the mean, while a high SD indicates the data is spread out over a wider range.

*__Population (P):__* is a set of data that contains *all* members of a specified group. This is the entire list of values in a given data set.

*__Sample (S):__* is a set of data that contains only *part*, or a *subset*, of a population. A sample is therefore always smaller than the population from which it is taken.

The distinction between *__P__* and *__S__* is very important when we are working with statistical data. In a *__P__* data set we have all the data we need to make an accurate statistical analysis, *__but__* in a *__S__* data set we have only a portion of the data and therefore need to compensate for this to get a more accurate result. When we are asked to find the *__SD__* of a given data set, the answer will vary depending on whether the data set is of type *__P__* or type *__S__*.


### Population standard deviation
The mathematical formula for the population standard deviation is as follows, where $\sigma$ represents the _population standard deviation:_

$$\sigma = \sqrt{ \frac{\Sigma(x_{i}-\mu)^2}{n} }$$

While this may look complicated I will break the formula down into simple steps.

1. $\mu$: Calculate the mean of the population data.
2. $x_{i}$: Subtract the mean from each data point _(i)_ in the given data set; this is called a deviation.
3. Square each deviation so that it is never negative.
4. $\Sigma$: Sum all squared deviations together.
5. $n$: Divide the sum by the number of data points in the population; this is called the variance.
6. Take the square root of the variance to get the standard deviation.

### Sample standard deviation
The mathematical formula for the sample standard deviation is as follows, where $s_{x}$ represents the _sample standard deviation:_

$$s_{x} = \sqrt{ \frac{\Sigma(x_{i}-\bar{x})^2}{n-1} }$$

Much like the formula for population, this sample formula takes all the same steps with $\bar{x}$ replacing $\mu$ and one small, but crucial, difference in step 5. 

5. $n-1$: Divide the sum by one less than the number of data points in the sample; this is called the sample variance.
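
The two formulas differ only in the divisor, which a small sketch makes explicit. The function name `std_dev` is an assumption, and its `ddof` parameter is named after the argument of the same name in `numpy.std` [4]; the data list is an arbitrary example.

```python
import math


def std_dev(values, ddof=0):
    """Standard deviation: ddof=0 divides by n (population), ddof=1 by n - 1 (sample)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - ddof)
    return math.sqrt(variance)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev(data, ddof=0))  # population formula
print(std_dev(data, ddof=1))  # sample formula
```

For this data the population formula gives exactly 2, and the sample formula gives a slightly larger value of about 2.14.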

### Example
#### Population Standard Deviation
A teacher wants to calculate the standard deviation of the grades that a classroom of 12 students received in a recent Python assessment. They list _all_ the students' grades (ranging from 1 to 10) below:

_**Grades**_ = [3, 5, 6, 9, 10, 2, 4, 5, 7, 8, 4, 2]

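$$\mu = \frac{3 + 5 + 6 + 9 + 10 + 2 + 4 + 5 + 7 + 8 + 4 + 2}{12} = \frac{65}{12} \approx 5.42$$

The mean is approximately 5.42 points. The squared deviations in the table below are computed from the unrounded mean, which is why, for example, $(-2.42)^2$ is shown as 5.84.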


| Grade: $x_{i}$ | Deviation: $(x_{i} - \mu)$ | Squared deviation: $(x_{i} - \mu)^2$ |
| :----------: | :-: | :-: |
| 3 | 3 - 5.42 = -2.42 | (-2.42)$^2$ = 5.84 |
| 5 | 5 - 5.42 = -0.42 | (-0.42)$^2$ = 0.17 |
| 6 | 6 - 5.42 = 0.58 | (0.58)$^2$ = 0.34 |
| 9 | 9 - 5.42 = 3.58 | (3.58)$^2$ = 12.84 |
| 10 | 10 - 5.42 = 4.58 | (4.58)$^2$ = 21.00 |
| 2 | 2 - 5.42 = -3.42 | (-3.42)$^2$ = 11.67 |
| 4 | 4 - 5.42 = -1.42 | (-1.42)$^2$ = 2.00 |
| 5 | 5 - 5.42 = -0.42 | (-0.42)$^2$ = 0.17 |
| 7 | 7 - 5.42 = 1.58 | (1.58)$^2$ = 2.50 |
| 8 | 8 - 5.42 = 2.58 | (2.58)$^2$ = 6.67 |
| 4 | 4 - 5.42 = -1.42 | (-1.42)$^2$ = 2.00 |
| 2 | 2 - 5.42 = -3.42 | (-3.42)$^2$ = 11.67 |

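Add the squared deviations together. The total uses the unrounded values, so it is slightly larger than the sum of the rounded entries shown.

$$\Sigma(x_{i}-\mu)^2 = 5.84 + 0.17 + 0.34 + 12.84 + 21.00 + 11.67 + 2.00 + 0.17 + 2.50 + 6.67 + 2.00 + 11.67 \approx 76.92$$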

Divide the sum of squared deviations by the number of grades.

$$ \frac{\Sigma(x_{i}-\mu)^2}{n} = \frac{76.92}{12} \approx 6.41$$

Take the square root of the result above.

$$\sqrt{6.41} \approx 2.53$$

The population standard deviation is approximately 2.53.

### Python Code

In [12]:
import numpy as np

In [69]:
grades = [3, 5, 6, 9, 10, 2, 4, 5, 7, 8, 4, 2]

In [70]:
mean = (sum(grades) / len(grades))

In [71]:
mean

5.416666666666667

In [72]:
deviation = [i-mean for i in grades]

In [73]:
deviation

[-2.416666666666667,
 -0.41666666666666696,
 0.583333333333333,
 3.583333333333333,
 4.583333333333333,
 -3.416666666666667,
 -1.416666666666667,
 -0.41666666666666696,
 1.583333333333333,
 2.583333333333333,
 -1.416666666666667,
 -3.416666666666667]

In [74]:
sqrdDeviation = [i**2 for i in deviation]

In [75]:
print(sqrdDeviation)

[5.8402777777777795, 0.17361111111111135, 0.34027777777777746, 12.840277777777775, 21.006944444444443, 11.673611111111112, 2.006944444444445, 0.17361111111111135, 2.5069444444444433, 6.67361111111111, 2.006944444444445, 11.673611111111112]


In [76]:
sumDeviation = sum(sqrdDeviation)

In [78]:
sumDeviation

76.91666666666667

In [79]:
variance = (sumDeviation / len(grades))

In [81]:
variance

6.409722222222222

In [82]:
finalAns = np.sqrt(variance)

In [83]:
finalAns

2.5317429218272185
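
To demonstrate the claim from the task statement that dividing by $n-1$ gives a better estimate when only a sample is available, here is a minimal simulation sketch. It assumes a synthetic, normally distributed population (mean 50 and standard deviation 10, both arbitrary choices) and uses Python's standard `statistics` module in place of Excel: `statistics.pstdev` divides by $n$ and `statistics.stdev` divides by $n-1$, which mirrors STDEV.P and STDEV.S respectively.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# A large synthetic "population" whose true standard deviation we can compute.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_sd = statistics.pstdev(population)

sample_size = 5
trials = 2_000
divide_by_n = []          # estimates from the population formula
divide_by_n_minus_1 = []  # estimates from the sample formula

for _ in range(trials):
    sample = rng.sample(population, sample_size)
    divide_by_n.append(statistics.pstdev(sample))
    divide_by_n_minus_1.append(statistics.stdev(sample))

mean_n = statistics.mean(divide_by_n)
mean_n_minus_1 = statistics.mean(divide_by_n_minus_1)

print("true population SD:         ", round(true_sd, 3))
print("average estimate, / n:      ", round(mean_n, 3))
print("average estimate, / (n - 1):", round(mean_n_minus_1, 3))
```

On average the $n-1$ estimate should land closer to the true value, although it still tends to underestimate it slightly for small samples.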

## References
[1] Population and sample standard deviation review; khanacademy; https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/variance-standard-deviation-sample/a/population-and-sample-standard-deviation-review

[2] Standard Deviation; Wikipedia; https://en.wikipedia.org/wiki/Standard_deviation

[3] Population vs Sample Data; mathbitsnotebook; http://mathbitsnotebook.com/Algebra1/StatisticsData/STPopSample.html

[4] numpy.std; numpy; https://numpy.org/doc/stable/reference/generated/numpy.std.html#:~:text=The%20standard%20deviation%20is%20the,N%20%3D%20len(x)%20

[5] LaTeX mathematical symbols; oeis.org; https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols

***