# Machine learning and statistics Assessment 2020


In [1]:
import scipy.stats
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

### 1. Python function to calculate square root of 2 to 100 decimal places

Our assignment here is to write a Python function, called sqrt2, that calculates and prints the square root of 2 to 100 decimal places. This has to be done without using any modules.
The square root of 2 is an irrational number, which means it cannot be written as a fraction of two integers and its decimal expansion never terminates or repeats.

First I tried the easiest method I could think of, which was:

In [2]:
def sqrt2practice():
    x = 2 ** 0.5
    print(x)
    
sqrt2practice()

1.4142135623730951


Unfortunately, Python only printed 16 decimal places. This is because Python stores the result of `2 ** 0.5` as a 64-bit floating-point number, which holds only about 15 to 17 significant decimal digits, so the value is an approximation of the true square root. Let's try to get Python to print out 100 decimals using the format() function.

In [3]:
def sqrt2practice2():
    x = 2 ** 0.5
    print("{:.100f}".format(x))
    
sqrt2practice2()

1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000


Well, we got a bit further, 52 decimals this time. But then Python is just showing us a bunch of zeros.
We know that the square root of 2 is an irrational number, which can never be represented exactly by a finite number of digits, so this cannot be correct. The extra digits are simply the exact decimal expansion of the stored floating-point approximation, not further digits of the square root of 2. Also, I checked what the correct value should be and I found this:

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
https://nerdparadise.com/math/reference/2sqrt10000 and https://apod.nasa.gov/htmltest/gifcity/sqrt2.1mil

So, it seems we haven't gotten any closer to what we are looking for: the two values already differ at the 16th decimal place. This simple formula is not going to do. Time to really delve into the subject and find a solution.
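To make the comparison concrete, the short sketch below counts how many decimals of the formatted float agree with the reference value quoted above. The names `reference`, `approx` and `matching_decimals` are only illustrative.

```python
# Reference digits of sqrt(2), copied from the value quoted above (100 decimals)
reference = (
    "1."
    "41421356237309504880168872420969807856967187537694"
    "80731766797379907324784621070388503875343276415727"
)

# The float approximation, printed with 100 decimals as before
approx = "{:.100f}".format(2 ** 0.5)

# Count the leading characters on which both strings agree
matching = 0
for ref_char, approx_char in zip(reference, approx):
    if ref_char != approx_char:
        break
    matching += 1

# Subtract the two characters "1." to get the number of matching decimals
matching_decimals = matching - 2
print(matching_decimals)  # 15
```

Only the first 15 decimals are trustworthy; everything after that comes from the binary approximation rather than from the square root of 2.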

#### Newton's Method

While looking for a manual way of calculating the square root, I came across Newton's method.
This method starts from an initial guess at the square root and repeatedly applies a simple formula that moves the guess closer to the true value, stopping when the difference between the last two guesses is sufficiently small.
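For reference, the formula can be derived by applying Newton's method to the function $f(x) = x^2 - 2$, whose positive root is $\sqrt{2}$:

$$
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{x_n^2 - 2}{2x_n} = \frac{1}{2}\left(x_n + \frac{2}{x_n}\right)
$$

Starting from $x_0 = 2$, the first guesses are $3/2$, $17/12$ and $577/408 \approx 1.4142157$.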


In [4]:
# Return the square root of 2 as a string, using Newton's method
# Adapted from: https://www.geeksforgeeks.org/find-root-of-a-number-using-newtons-method/
def squareRoot():

    x = 2  # Initial guess

    l = 0.000001  # Required difference between last and second last guess

    while True:  # Loop until the break below is reached

        root = 0.5 * (x + (2 / x))  # Newton's update: average the guess x and 2 / x

        if abs(root - x) < l:  # Stop once two successive guesses differ by less than l; abs() ignores the sign of the difference
            break

        x = root  # The new guess becomes the starting point for the next iteration

    s = format(root, ".100f")  # Format the float with 100 decimals

    return s

In [5]:
squareRoot()

'1.4142135623730949234300169337075203657150268554687500000000000000000000000000000000000000000000000000'

Newton's method works, but the result is still a floating-point number, so Python can't give us the square root of 2 to 100 digits this way.

After spending hours trying to figure out a way, I found something promising on Stack Overflow. What if we find the integer square root of 2 * 10**200 (which has 101 digits), using only integer arithmetic, and then format it so that the decimal point appears after the first digit? Python integers have arbitrary precision, so no digits are lost. Got some inspiration here: https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra and here:
https://stackoverflow.com/questions/5187664/generating-digits-of-square-root-of-2
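The idea is easier to see at a smaller scale. The sketch below asks for only 5 decimals and, purely for illustration, uses `math.isqrt` (available from Python 3.8) in place of a hand-written loop. The assignment itself does not allow modules, so this is not part of the solution, and the variable names are my own.

```python
import math

decimals = 5  # illustrative, small number of decimals

# Scale 2 by 10**(2 * decimals): the integer square root of the scaled
# number is sqrt(2) * 10**decimals, rounded down to a whole number
scaled = 2 * 10 ** (2 * decimals)
root = math.isqrt(scaled)  # integer square root, no floats involved

# Put the decimal point back after the first digit
digits = str(root)
result = digits[:1] + "." + digits[1:]
print(result)  # 1.41421
```

Replacing 5 by 100 gives the scaling used in the function that follows.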

In [6]:
# code adapted from https://stackoverflow.com/questions/5187664/generating-digits-of-square-root-of-2

def sqrt2(): # Function that calculates and prints the square root of 2 to 100 decimal places. Keeping numbers as integers.
    
    a = 2 * (10**(2*100)) # 2 scaled by 10**200, so that its integer square root is sqrt(2) * 10**100, rounded down
    
    x_prev = 0 # Previous guess; starts at 0 so that the loop runs at least once
    
    x_next = 1 * (10**100) # Initial guess
    
    while x_prev != x_next: # Keep going until the guess stops changing
        
        x_prev = x_next # previous x is replaced by next x
        
        x_next = (x_prev + (a // x_prev)) >> 1 # Newton's update with integer division; >> 1 halves the sum
        
    digits = str(x_next) # Change x_next into a string so it can be sliced
    
    print('{}.{}'.format(digits[:1], digits[1:101])) # Format so the first digit comes before the '.' and the next 100 digits come after.




In [7]:
sqrt2()

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727


Now, let's compare this value to the number found on Nerdparadise, as well as on the official NASA website:

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727

It looks like we finally got to where we wanted to be!
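The result can also be checked without relying on an external source. If n is the 101-digit integer found by the iteration, it is the correct truncation of sqrt(2) * 10**100 exactly when n * n <= 2 * 10**200 < (n + 1) * (n + 1). A minimal sketch, where `integer_sqrt` is a hypothetical helper that repeats the loop from sqrt2():

```python
def integer_sqrt(a, guess):
    """Integer Newton iteration; stops when the guess no longer changes.

    Hypothetical helper for this check: it mirrors the loop in sqrt2()
    and assumes the starting guess is positive.
    """
    x_prev, x_next = 0, guess
    while x_prev != x_next:
        x_prev = x_next
        x_next = (x_prev + a // x_prev) >> 1
    return x_next


a = 2 * 10 ** 200
n = integer_sqrt(a, 10 ** 100)

# n is the truncated square root exactly when a lies between n^2 and (n + 1)^2
is_correct = n * n <= a < (n + 1) ** 2
print(is_correct)   # True
print(len(str(n)))  # 101
```

Because the check uses only integer arithmetic, it is not affected by the floating-point limits seen earlier.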

### 2. Chi-squared test

A Chi-squared test tells us if there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. https://en.wikipedia.org/wiki/Chi-squared_test 
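For a contingency table, the test statistic compares each observed count $O_{ij}$ with the count $E_{ij}$ that would be expected if rows and columns were independent:

$$
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
$$

where $N$ is the grand total. For a table with $r$ rows and $c$ columns, the statistic has $(r - 1)(c - 1)$ degrees of freedom.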

First, I need to create a Pandas DataFrame from the table, so I can perform the Chi-squared test on it. I got some help on how to create a DataFrame and how to rename the rows here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [8]:
df = pd.DataFrame(np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]]), columns=["A", "B", "C", "D"])
df = df.rename(index={0: "White collar", 1: "Blue collar", 2: "No collar"}) # rename returns a new DataFrame, so assign it back
df


| | A | B | C | D |
|---|---|---|---|---|
| White collar | 90 | 60 | 104 | 95 |
| Blue collar | 30 | 50 | 51 | 20 |
| No collar | 30 | 40 | 45 | 35 |


In [9]:
chi2, p, dof, expected = chi2_contingency(df) # Learned how to do this here: https://stackoverflow.com/questions/43963606/python-pandas-chi-squared-test-of-independence

In [10]:
chi2 # Display Chi-squared value

24.5712028585826

In [11]:
p # Display the associated p-value

0.0004098425861096696

In [12]:
dof # display the degrees of freedom

6

In [13]:
expected # Display the expected values if neighbourhood of residence is independent of occupation.

array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
       [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
       [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]])

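With a p-value of about 0.0004, far below the usual 0.05 significance level, it is extremely unlikely that random chance alone can explain the difference between the observed and expected values. We therefore reject the hypothesis that neighbourhood of residence is independent of occupation.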
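As a cross-check, the numbers reported by chi2_contingency can be recomputed by hand in plain Python. This is a sketch under one assumption worth stating: the closed-form p-value used here only holds for an even number of degrees of freedom (6 in this case); for the general case a library routine such as `scipy.stats.chi2.sf` would be the usual choice. All names ending in `_manual` are my own.

```python
import math

# Observed counts, same values as the DataFrame above
observed = [
    [90, 60, 104, 95],   # White collar
    [30, 50, 51, 20],    # Blue collar
    [30, 40, 45, 35],    # No collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected_manual = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_manual)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (len(observed) - 1) * (len(observed[0]) - 1)

# For an even number of degrees of freedom 2m, the upper tail probability
# has the closed form exp(-x/2) * sum over i < m of (x/2)^i / i!
half = chi2_manual / 2
p_manual = math.exp(-half) * sum(
    half ** i / math.factorial(i) for i in range(dof_manual // 2)
)

print(chi2_manual, dof_manual, p_manual)
```

The three printed values should agree with chi2, dof and p above, up to floating-point rounding.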

### 3. Standard deviation (STDEV.P vs STDEV.S)

Microsoft Excel has two different functions for calculating the standard deviation of an array of numbers: STDEV.P and STDEV.S. What is the difference?
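In formulas, for values $x_1, \dots, x_N$ with mean $\bar{x}$, the population and sample standard deviations are usually written as:

$$
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}, \qquad s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}
$$

STDEV.P corresponds to the first expression and STDEV.S to the second.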

In [25]:
x = np.array([2,4,7,9,11,100])

First, I will use NumPy to calculate the standard deviation of my array, dividing by the number of elements len(x) without subtracting 1. This is the same calculation as the STDEV.P function in Excel. I am not specifying any ddof (Delta Degrees of Freedom), so it is set to its default of 0. The formula NumPy uses to calculate the standard deviation of x is then: *np.sqrt(np.sum((x - np.mean(x))**2)/len(x))*

In [31]:
np.sqrt(np.sum((x - np.mean(x))**2)/len(x))

34.93525758059073

The built-in NumPy function np.std gives us the same result.

In [29]:
np.std(x)

34.93525758059073

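Now, I will add 'ddof = 1'. The sum of squared deviations is then divided by the number of elements minus 1 (N - 1) instead of N, which is what the STDEV.S function in Excel does. This provides an unbiased estimator of the variance of the population from which the sample is drawn, although its square root is still not an unbiased estimate of the standard deviation itself. https://numpy.org/doc/stable/reference/generated/numpy.std.html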


In [30]:
np.std(x, ddof = 1)

38.2696572582858
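As a cross-check outside NumPy, Python's standard library offers the same two calculations as `statistics.pstdev` (population) and `statistics.stdev` (sample). The sketch below, with illustrative variable names, also shows that the two results differ only by the factor sqrt(n / (n - 1)):

```python
import math
import statistics

data = [2, 4, 7, 9, 11, 100]  # same values as the array x above
n = len(data)

population_sd = statistics.pstdev(data)  # divides by n, like STDEV.P
sample_sd = statistics.stdev(data)       # divides by n - 1, like STDEV.S

# The two differ only by the factor sqrt(n / (n - 1))
ratio = sample_sd / population_sd
print(population_sd, sample_sd)
print(math.isclose(ratio, math.sqrt(n / (n - 1))))  # True
```

Since sqrt(n / (n - 1)) approaches 1 as n grows, the choice between the two functions matters most for small samples like this one.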