# GMIT HDip Data Analysis 2020. Machine Learning and Statistics  

### Task 1 : Obtain approximation of $\sqrt{2}$ to 100 decimal places in Python (without using library functions)


$\sqrt{2}$ is an irrational number - it cannot be rendered exactly as a fraction, or in decimal notation, no matter how many decimal places are specified [1].   
One way of approximating the value is provided by Newton's method. Iterating the formula $$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right)$$ approximates $\sqrt{a}$, with each subsequent value $x_{n+1}$ being closer to the actual value than $x_n$. A simple explanation of how it works is that if $x_n$ is too large, then $a/x_n$ will be smaller than the square root, and the mean of $x_n$ and $a/x_n$ will be closer to the root than $x_n$ is. Similarly, if $x_n$ is too small, $a/x_n$ is greater than the root and the mean of the two is again closer than $x_n$ [2].   

[1] Proof that $\sqrt{2}$ is irrational ; https://www.homeschoolmath.net/teaching/proof_square_root_2_irrational.php  
[2] Square roots via Newton's method ; https://math.mit.edu/~stevenj/18.335/newton-sqrt.pdf  
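
As a worked illustration of the averaging step, taking $a = 2$ and the starting value $x_0 = 3/2$ (the same starting value as the code further down), the first two iterations are:

$$x_1 = \frac{1}{2}\left(\frac{3}{2} + \frac{2}{3/2}\right) = \frac{1}{2}\left(\frac{3}{2} + \frac{4}{3}\right) = \frac{17}{12} \approx 1.41667$$

$$x_2 = \frac{1}{2}\left(\frac{17}{12} + \frac{24}{17}\right) = \frac{577}{408} \approx 1.414216$$

Here $x_0 = 1.5$ is too large and $2/x_0 \approx 1.3333$ is too small, and their mean is already within about 0.0025 of $\sqrt{2}$. After the second step the error is about 0.000002.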

Python code to demonstrate this, with a starting value of 1.5 (we know the square root of 2 is somewhere between 1 and 2), is given by :

In [3]:
# Approximate square root of 2 with a starting value of 1.5, using Newton's method
# We know the square root will be between 1 and 2, so we'll start at 1.5
ix = 1.5
new = 0
# Keep looping until the value produced by the last iteration equals that of the previous iteration
while (new != ix):
  new = ix
  ix = 0.5*(ix+2/ix)
# Display the result    
print(ix)
print(ix*ix)


1.414213562373095
1.9999999999999996


We see from running the above code that the square root is only calculated to 15 decimal places, and if we square the resulting value it doesn't quite equal 2. This is because of limitations in floating point arithmetic (arithmetic on non-whole numbers).
For example, the decimal value 0.1 cannot be expressed exactly in the binary system employed by computers; it can only be approximated. This is true for all languages, not just Python.  
Floating point numbers in Python only provide accuracy to about 15 to 17 significant decimal digits [3]. Integer arithmetic in Python does not have the same constraint: the size of stored numbers is only restricted by hardware limitations [4].  
To achieve the required accuracy we therefore need to ensure all operations are carried out on integers, and not floating point numbers. In the method above we will replace the multiplication by 0.5 (a floating point number) with division by 2, and use the floor division operator '//', which returns the integer part of the result [5], instead of the true division operator '/'. 
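
The contrast between the two kinds of arithmetic can be seen in a minimal sketch. The particular values below are chosen purely for illustration and are not part of the task.

```python
# Floating point: 0.1 and 0.2 are stored as binary approximations,
# so their sum is not exactly the float closest to 0.3
float_sum = 0.1 + 0.2
print(float_sum == 0.3)        # False

# Integers: Python stores them exactly, however many digits they have
big = 10**200
print(len(str(big)))           # 201 (a 1 followed by 200 zeros)

# True division '/' returns a float, which cannot hold an odd integer this large exactly
odd = 10**30 + 1
as_float = odd / 1
print(int(as_float) == odd)    # False

# Floor division '//' stays in integer arithmetic and remains exact
halved = (2 * odd) // 2
print(halved == odd)           # True
```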

To obtain 100 decimal places using Newton's method we need to start with an integer consisting of the required number (2 in this case) followed by 200 zeros, because $\sqrt{a \times 10^{200}} = \sqrt{a} \times 10^{100}$. The integer square root of this number contains the digits of $\sqrt{2}$ up to the 100th decimal place, so we then format the result by inserting a decimal point after the whole number part of the answer.  
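
The scaling idea can be tried on a small case first. The sketch below finds $\sqrt{2}$ to 5 decimal places. The helper name `int_sqrt_floor` and the choice of 5 places are assumptions made for illustration. It also starts from a value above the root and stops when the estimate no longer decreases, which is a slightly different stopping rule from the "no change" test used in the notebook's own code.

```python
def int_sqrt_floor(n):
    """Largest integer r with r*r <= n, for a positive integer n (hypothetical helper)."""
    r = n                          # any start at or above the root works
    while True:
        nxt = (r + n // r) // 2    # Newton step using integer arithmetic only
        if nxt >= r:               # no further decrease: r is the integer root
            return r
        r = nxt

places = 5
scaled = 2 * 10**(2 * places)      # 2 followed by 10 zeros
root = int_sqrt_floor(scaled)      # integer part of sqrt(2) * 10**5
digits = str(root)
# Insert the decimal point 'places' digits from the right-hand end
result = digits[:-places] + "." + digits[-places:]
print(result)                      # 1.41421
```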

(The function is called 'sqrt2', as the task requires, but it accepts any positive integer argument so that 
 it can be tested against a number whose exact root we know, namely 4.)

[3] Floating point arithmetic: issues and limitations ; https://python.readthedocs.io/en/latest/tutorial/floatingpoint.html  
[4] Numbers in Python; https://realpython.com/python-numbers/  
[5] Floor division; https://python-reference.readthedocs.io/en/latest/docs/operators/floor_division.html


In [4]:
def sqrt2(inum):
# Calculate the square root of an input positive integer to 100 decimal places.

# Create an integer of value 'inum' followed by 200 zeros (by creating a string and converting it to an integer).
# We will find the square root of this number, and then format it to appear as a decimal value with 100 decimal places.

  temp = str(inum) + (200*"0")
  squared = int(temp)
  
# Create an integer of value 1 followed by 100 zeros. Dividing the square root by this value to give the integer part of the 
# square root provides the length of the integer part in digits (not strictly needed here as we know the integer part of the 
# square root of 2 will be one digit long, but it makes it easy to apply the code to other positive integers if desired). 
  temp = "1" + (100*"0")
  divisor = int(temp)

# Use Newton's method to get an approximation of the square root
# This uses the iteration x(n+1) = 0.5(x(n) + (a/x(n))), where 'a' is the number whose square root we are finding, x(n) is the 
# n'th estimate, and x(n+1) is the subsequent approximation. A simple explanation of how it works is that if 'x(n)' is too 
# large, then a/x(n) will be smaller than the square root, and the mean value of x(n) and a/x(n) will be closer to the root 
# than x(n) is. Similarly if x(n) is too small, a/x(n) is greater than the root and the mean of the sum is again 
# closer than x(n).

# Keep looping until there is no change in the value of the root from one iteration to the next
# Avoid generating floating point numbers by using the floor division operator (divide by 2 rather than multiply by 0.5) - in 
# this way we can handle very large integers without losing accuracy.
  root = 1
  saved = 0
  while (saved != root):
    saved = root
    root = (saved + squared//saved)//2

# We now have the approximate root of '2' followed by 200 zeros.
# We can't just divide to get a floating point number as we then lose the required accuracy, so we convert to a string and 
# stick the decimal point in the required place (after the first digit in this case).
  ans = str(root)

# Get the whole number part of the answer by dividing by the number '1' followed by 100 zeros, and find its length (so we can
# put the decimal place in the right position in the final answer).
  whole = root//divisor
  i=len(str(whole))
  wlen=i
  
# Build the string 'sqr' by inserting a decimal point after the integer part of the answer
# (the first 'wlen' digits)
  sqr = ans[:wlen] + "." + ans[wlen:]

# Display the answer (+ve and -ve values), and its length (just to illustrate that it is actually to 100 decimal places).
# Subtract 1 for the decimal point, and the length of the integer part of the value, from the overall length, to get the 
# number of decimal places.
  print("\nSquare roots of ",inum," - ")
  print("(Number of decimal places -",(len(sqr)-1-wlen),")") 
  print("+",sqr)
  print("-",sqr) 

# Call the function to calculate the square root of 4, which we know is exactly 2, to check it first
sqrt2(4)

# Now get the required square root (of 2)
sqrt2(2)


Square roots of  4  - 
(Number of decimal places - 100 )
+ 2.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
- 2.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Square roots of  2  - 
(Number of decimal places - 100 )
+ 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
- 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727


Perform a further check on the answer using the standard library module 'decimal', which can be used to perform arithmetic to high precision [6].  

[6] Arbitrary precision of square roots; https://stackoverflow.com/questions/10725522/arbitrary-precision-of-square-roots

In [2]:
# Get the required names from the 'decimal' module
from decimal import Decimal, getcontext
# Set the precision
getcontext().prec = 101
print("\nSquare root using Python Decimal module - ")
# Get the square root of 2
print(" ",Decimal(2).sqrt())


Square root using Python Decimal module - 
  1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727


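This gives the same value as the function 'sqrt2', which is extra confirmation that the answer is correct. (The 'decimal' module rounds to the requested precision whereas 'sqrt2' truncates, so the two could in principle differ in the final digit; here they agree.)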
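
A check that stays entirely within integer arithmetic is also possible. If $r$ is the integer formed by the digits of the answer, it is the correct truncated root exactly when $r^2 \le 2 \times 10^{200} < (r+1)^2$. A minimal sketch, with the digit string copied from the output above and variable names that are only illustrative:

```python
# Digits of the result printed above
digits = "1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727"
r = int(digits.replace(".", ""))   # the same digits as one large integer
target = 2 * 10**200

# r is the truncated square root when r*r <= target < (r+1)*(r+1)
lower_ok = r * r <= target
upper_ok = target < (r + 1) * (r + 1)
print(lower_ok, upper_ok)
```

Both comparisons should come out true if the printed digits are correct to the last place.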

### Task 2 : Perform a Chi-squared statistical test on a small table obtained from the Wikipedia page about Chi-squared, and provide some analysis of the test. Confirm the test statistic is approximately 24.6, and get the associated p-value.  

The Chi-squared test may be used to determine whether there is a statistically significant difference between expected and observed frequencies of categorical data (e.g. male / female), but not numerical data [1].   

The case considered here is based on hypothetical data taken from Wikipedia, for the residential locations of people from three occupational classes (white collar, blue collar, no collar) living in four city districts. We are testing whether neighbourhood of residence is independent of occupation.  

The formula used to calculate the Chi-squared value is : $\sum{\frac{(O-E)^2}{E}}$ , where 'O' is the observed value and 'E' is the expected value [2]  
If the data is tabulated, with one category represented by columns and the other by rows, then the expected value for each cell is calculated as the product of its row total and its column total, divided by the overall total [2].  

For example, if we had two entries in Category 1, A and B, and two in Category 2, X and Y, we might write some observed values as follows (table syntax from [3]):  

Category | A | B | Total
--- | --- | --- | --- 
__X__ |37 | 59 | 96
__Y__ | 25 | 43 | 68
__Total__ | 62 | 102 | 164  

A table of Expected values would then be :

Category | A Expect | B Expect | Total
--- | --- | --- | --- 
__X Expect__ |$\frac{96 * 62}{164}$ | $\frac{96 * 102}{164}$ | 96
__Y Expect__ | $\frac{68 * 62}{164}$ | $\frac{68 * 102}{164}$ | 68
__Total__ | 62 | 102 | 164  

which equates to :  

Category | A Expect | B Expect | Total
--- | --- | --- | --- 
__X Expect__ |36.29 | 59.71 | 96
__Y Expect__ | 25.71 | 42.29 | 68
__Total__ | 62 | 102 | 164  

Applying the equation above yields :  

$\frac{(37-36.29)^2}{36.29} + \frac{(59 - 59.71)^2}{59.71} + \frac{(25 - 25.71)^2}{25.71} + \frac{(43 - 42.29)^2}{42.29}$  

= 0.014 + 0.008 + 0.020 + 0.012 = 0.054, which is the Chi-squared value (calculated here from the rounded expected values).
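
The same calculation can be sketched in a few lines of Python. The variable names are illustrative. Because the code keeps the expected values unrounded, it gives a statistic of about 0.053, slightly below the hand calculation, which used expected values rounded to two decimal places.

```python
# Observed counts from the small example table (rows X, Y; columns A, B)
observed = [[37, 59],
            [25, 43]]

row_totals = [sum(row) for row in observed]            # [96, 68]
col_totals = [sum(col) for col in zip(*observed)]      # [62, 102]
grand_total = sum(row_totals)                          # 164

# Expected count for a cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic = sum over all cells of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_sq, 3))
```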

The statistic provides a measure of whether the categories are independent or not. For example, if the categories were male and female students taking arts and science courses, and there were no link between gender and course taken, you would expect roughly equal proportions of males and females on each type of course, allowing for the total numbers of each gender present [4]. The null hypothesis for the test is that there is no connection between the variables (they are independent). If the observed values are significantly different from the expected values then the null hypothesis is rejected and you conclude that the variables are linked [5]. The p-value returned by the test is used to decide whether or not the variables are likely to be independent: if the p-value is less than or equal to the chosen significance level, the evidence is that the observed and expected values differ, and therefore that the variables are related [5].  
The test is sensitive to sample size. It is not appropriate if any cross-tabulation cells (individual elements in the table) have fewer than 5 cases, as a single element in the sample may then have an inappropriately large influence on the result. It can also mislead if sample sizes are large (roughly > 500), when small differences between observed and expected values may be deemed statistically significant [6]. Choosing appropriate categories may help to address this issue.
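
The small-count condition is simple to check in code. A minimal sketch, assuming the rule is applied to the expected counts (a common convention) and using a hypothetical helper name `cells_below`:

```python
def cells_below(counts, minimum=5):
    """Return (row, column) positions of cells whose count is below 'minimum'."""
    return [(i, j)
            for i, row in enumerate(counts)
            for j, value in enumerate(row)
            if value < minimum]

# Expected counts from the two-by-two example above
expected = [[36.29, 59.71],
            [25.71, 42.29]]
print(cells_below(expected))                      # []

# A made-up table with one sparse cell, for contrast
print(cells_below([[12.0, 3.5], [20.0, 9.5]]))    # [(0, 1)]
```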

[1] Chi-squared test ; https://en.wikipedia.org/wiki/Chi-squared_test  
[2] Chi-square test  ; https://www.mathsisfun.com/data/chi-square-test.html  
[3] Markdown table syntax ; https://www.makeuseof.com/tag/create-markdown-table/  
[4] What does a Chi-square statistic tell you ; https://www.investopedia.com/terms/c/chi-square-statistic.asp#:~:text=Chi%2Dsquare%20tests%20are%20often,of%20variables%20in%20the%20relationship.  
[5] Overview of the Chi square test of independence ; https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/#:~:text=For%20a%20Chi%2Dsquare%20test,exists%20between%20the%20categorical%20variables. 
[6] What are special concerns with regard to the Chi square statistic ; https://www.statisticssolutions.com/using-chi-square-statistic-in-research/

Our table from Wikipedia with rows as occupational categories and columns as neighbourhoods is :  

Category | A | B | C | D | Total
--- | --- | --- | --- | --- | ---
__White collar__ | 90 | 60 | 104 | 95 | 349
__Blue collar__ | 30 | 50 | 51 | 20 | 151
__No collar__ | 30 | 40 | 45 | 35 | 150
__Total__ | 150 | 150 | 200 | 150 | 650

Using the 'scipy' library, first hard code the table into the code and try the test :

In [41]:
# Import the scipy Chi-squared functions
# ref https://machinelearningmastery.com/chi-squared-test-for-machine-learning/
from scipy.stats import chi2_contingency

# Put the data taken from Wikipedia into a table 
table = [[90, 60, 104, 95],
         [30, 50,  51, 20],
         [30, 40,  45, 35]]

# Perform the test. This returns 3 individual values, the Chi-square statistic, p-value and degrees of freedom, 
# and a table of expected values
stat, p, df, expVals = chi2_contingency(table)

# Print out the Chi-squared stat, p value (to 6 decimal places) and degrees of freedom
print('\nstat=%.2f ; p=%.6f ; degrees of freedom=%2d \n' % (stat,p,df))

# Get the overall total value for the table (the sample size)
total = 0
for i in range(len(table)):
    for j in range(len(table[0])):
        total += table[i][j]

# Calculate and display the individual contributions to the test statistic from each element in the table, using
# (observed - expected)**2 / expected
# The running total is called 'chiTotal' so that it does not hide Python's built-in sum()
chiTotal = 0
for i in range(len(table)):
    for j in range(len(table[0])):
        chiSq = ((table[i][j]-expVals[i][j])**2)/expVals[i][j]
        chiTotal += chiSq
        print(chiSq," (",table[i][j]," ",expVals[i][j],")")

# Check we have the right test statistic using this calculation
print("\nChi squared value - ",chiTotal)


stat=24.57 ; p=0.000410 ; degrees of freedom= 6 

1.1115274410403364  ( 90   80.53846153846153 )
5.237601939607668  ( 60   80.53846153846153 )
0.10667842186466842  ( 104   107.38461538461539 )
2.596723238557052  ( 95   80.53846153846153 )
0.6739684156902702  ( 30   34.84615384615385 )
6.590083205977245  ( 50   34.84615384615385 )
0.44332654100866054  ( 51   46.46153846153846 )
6.32518254372559  ( 20   34.84615384615385 )
0.6153846153846149  ( 30   34.61538461538461 )
0.8376068376068384  ( 40   34.61538461538461 )
0.02884615384615382  ( 45   46.15384615384615 )
0.004273504273504322  ( 35   34.61538461538461 )

Chi squared value -  24.5712028585826


The p-value obtained from the test (0.00041) is < 0.05, and consequently we would reject the null hypothesis and conclude that the variables are not independent of each other. The p-value comes from the probability distribution of the Chi-squared statistic, and is the probability of getting results at least as extreme as the observed values if the null hypothesis is true [7]. The probability distribution depends on the number of degrees of freedom, (number of columns - 1) * (number of rows - 1), which here is 3 * 2 = 6.  
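
For reference, the p-value can be reproduced without `scipy` in this particular case. When the number of degrees of freedom is even, $df = 2k$, the upper-tail probability of the Chi-squared distribution has the closed form $e^{-x/2}\sum_{i=0}^{k-1}\frac{(x/2)^i}{i!}$. The sketch below applies it to the statistic found above. The function name `chi2_upper_tail_even_df` is an assumption made for illustration, and the formula does not apply to odd degrees of freedom.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a Chi-squared variable with a positive, even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term = 1.0          # holds (x/2)^i / i!, starting at i = 0
    total = 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total

rows, cols = 3, 4
df = (rows - 1) * (cols - 1)                       # 6
p_value = chi2_upper_tail_even_df(24.5712028585826, df)
print(df, round(p_value, 6))                       # 6 0.00041
```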

The values in the table that give the highest contribution to the Chi-squared result are the ones that differ most from the expected values (with the expectation being that the variables are independent). If our conclusion is that the variables are not independent, these are the values that most strongly contribute to that conclusion. In this case they are :

60 white collar workers in neighbourhood B (Chi-squared contribution 5.24)  
50 blue collar workers in neighbourhood B (Chi-squared contribution 6.59)  
20 blue collar workers in neighbourhood D (Chi-squared contribution 6.33)  

If the variables were independent, more white collar workers would have been expected in neighbourhood B (about 80), fewer blue collar workers in neighbourhood B (about 35) and more blue collar workers in neighbourhood D (about 35).  
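
The ranking of cells by contribution can also be produced directly rather than read off the printed list. A minimal sketch, using only plain Python loops and illustrative variable names:

```python
# Observed counts from the Wikipedia table (rows: occupations, columns: neighbourhoods)
row_names = ["White collar", "Blue collar", "No collar"]
col_names = ["A", "B", "C", "D"]
observed = [[90, 60, 104, 95],
            [30, 50, 51, 20],
            [30, 40, 45, 35]]

# Row totals, column totals and grand total, using explicit loops
row_totals = [0] * len(row_names)
col_totals = [0] * len(col_names)
grand_total = 0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        row_totals[i] += count
        col_totals[j] += count
        grand_total += count

# Contribution of each cell: (observed - expected)^2 / expected
contributions = []
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        contributions.append(((count - expected) ** 2 / expected,
                              row_names[i], col_names[j], expected))

# Largest contributions first
contributions.sort(reverse=True)
for value, occupation, area, expected in contributions[:3]:
    print(f"{occupation}, {area}: contribution {value:.2f}, expected {expected:.1f}")
```

The three cells listed above come out at the top of the sorted list, with blue collar workers in neighbourhood B making the largest contribution.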

This perhaps reflects what might be expected in a real-life analysis of the occupations of residents in particular areas of cities. More expensive and cheaper housing are frequently segregated to some extent, and those with lower incomes would not be able to afford to live in more expensive accommodation.  

[7] p-value ; https://en.wikipedia.org/wiki/P-value