# Tasks


These are my solutions to the Tasks assessment for Machine Learning and Statistics in 2020. The author is Katarzyna Chmielowiec-Connell (G00376370@gmit.ie)

***

### Task 1: Calculate a square root

***

In Python we can calculate the square root of a number in many different ways, but the aim of this task is to calculate it without using a built-in library function. One method that can be used for that purpose is Newton's method.

Every positive real number has two square roots, one positive and one negative. The most common analytical methods of finding a square root are iterative and require two steps: finding a suitable starting value, then refining it iteratively until a termination criterion is met. A method well suited to programmatic calculation is Newton's method, which is based on a property of the derivative in calculus [1, 2, 3, 4].

To find the square root of a number $x$, we start from a guess $b$ and repeatedly replace it with a better guess $b_{next}$ using the following equation.

$$ b_{next} = b - \frac{b^2 - x}{2b}  $$
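
As a brief sketch of where this update comes from (standard Newton's method; the function name $f$ is notation introduced here for illustration): the square root of $x$ is a root of $f(b) = b^2 - x$, and Newton's method replaces a guess $b$ with $b - f(b)/f'(b)$.

$$ f(b) = b^2 - x, \qquad f'(b) = 2b, \qquad b_{next} = b - \frac{f(b)}{f'(b)} = b - \frac{b^2 - x}{2b} $$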

With only a few iterations one can obtain a solution accurate to many decimal places. 
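
The following is a minimal sketch of that convergence, not part of the original solution. It applies the update a fixed number of times and prints how far each guess squared is from $x$. The helper name `newton_steps`, the starting guess `x / 2` and the number of steps are illustrative assumptions.

```python
def newton_steps(x, steps=6):
    """Return the successive Newton guesses for the square root of x."""
    b = x / 2  # starting guess (an assumption made for this sketch)
    guesses = [b]
    for _ in range(steps):
        # Newton update: b_next = b - (b^2 - x) / (2b)
        b -= (b * b - x) / (2 * b)
        guesses.append(b)
    return guesses


for i, b in enumerate(newton_steps(2)):
    # the distance between b squared and x shrinks rapidly from step to step
    print(i, b, abs(b * b - 2))
```

For $x = 2$ the guesses settle after a handful of steps, at which point further updates no longer change the floating-point value in any meaningful way.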



#### Code

In [7]:
# code adapted from https://web.microsoftstream.com/video/0519941d-9f8b-4ae1-8935-6711117cf8fe 

# definition of the function sqrt_2
def sqrt_2(x):
    """
    A function to calculate the square root of a non-negative number.
    """
    # Newton's method cannot converge to a real root of a negative number, so reject it
    if x < 0:
        raise ValueError("x must be non-negative")
    # divide x by 2 to get the initial guess (the starting value)
    b = x / 2
    # loop until b squared is within 0.001 of x
    while abs(x - (b * b)) > 0.001:
        # starting with guess b, calculate a better guess based on how close b squared is to x
        b -= (b * b - x) / (2 * b)
    # return the (approximate) square root of x
    return b

##### Tests of the function

Here we test the function with some known values.

In [8]:
#Test the function on number 2
sqrt_2(2)

1.4142156862745099

In [9]:
# import math library to compare the results 
import math
math.sqrt(2)

1.4142135623730951

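The function works, but with a tolerance of 0.001 its result agrees with `math.sqrt(2)` only to five decimal places, and a Python float holds only about 16 significant digits in any case. To display one hundred decimal places, the following integer-based code was adapted: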

In [40]:
# Code adapted from https://rosettacode.org/wiki/Integer_roots#Python
def sqrt2(digits):
    """
    Return the square root of 2 as a string with `digits` decimal places.
    """
    # set y to be a large integer: its integer square root is the
    # square root of 2 multiplied by 10**digits, rounded down
    y = 2 * 100**digits
    # set the starting value to 1
    x_1 = 1
    # calculate the second step of the approximation
    x_2 = (x_1 + y // x_1) // 2

    # repeat the integer Newton step until the guess stops changing
    while x_1 != x_2:
        x_1 = x_2
        x_2 = (x_2 + y // x_2) // 2

    # insert the decimal point and pad the fractional part with leading zeros
    result = f'{x_2 // 10**digits}.{x_2 % 10**digits:0{digits}d}'
    return result

In [41]:
root = sqrt2(100)
print(root)

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
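
As a hedged cross-check (using a library only for comparison, as with `math.sqrt` above), Python 3.8 and later provide `math.isqrt`, which returns the exact integer square root. The helper name `sqrt2_digits` is hypothetical, and the sketch assumes that truncating the last digit, rather than rounding it, is acceptable.

```python
import math


def sqrt2_digits(digits):
    """Return the square root of 2 truncated to `digits` decimal places, as a string."""
    scale = 10 ** digits
    # isqrt(2 * scale**2) is floor(sqrt(2) * scale), computed exactly with integers
    root = math.isqrt(2 * scale * scale)
    # split into integer part and zero-padded fractional part
    return f"{root // scale}.{root % scale:0{digits}d}"


print(sqrt2_digits(100))
```

Because both approaches work on exact integers, the printed string can be compared character by character with the result above.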


##### Conclusion
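Newton's method converges towards the desired result but, for an irrational root such as the square root of 2, never reaches it in a finite number of steps. How fast it converges is therefore a key question; for square roots, the number of correct digits roughly doubles with each iteration [5].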



##### References 

[1] A Tour of Go; Exercise: Loops and Functions; https://tour.golang.org/flowcontrol/8   

[2] Newton's method; Wikipedia; https://en.wikipedia.org/wiki/Newton%27s_method

[3] Methods of computing square roots; Wikipedia; https://en.wikipedia.org/wiki/Methods_of_computing_square_roots

[4] Getting started with task assessment; Ian McLoughlin; https://web.microsoftstream.com/video/0519941d-9f8b-4ae1-8935-6711117cf8fe

[5] newton-sqrt; math.mit.edu; https://math.mit.edu/~stevenj/18.335/newton-sqrt.pdf

[6] sqrt2; apod.nasa.gov; https://apod.nasa.gov/htmltest/gifcity/sqrt2.1mil

[7] Integer roots; Rosetta Code; https://rosettacode.org/wiki/Integer_roots#Python

***

### Task 2: Chi-squared test of independence
***

#### Chi-squared test
***
The chi-squared test of independence checks whether there is an association between two categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test, also known as the chi-squared test of association.
The test uses a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in the columns. Each variable must have two or more categories. Each cell holds the total count of cases for a specific pair of categories [8].

Once the expected counts under independence have been calculated for every cell, we can conduct a hypothesis test in which a test statistic is computed from the observed and expected counts and compared to a critical value. The critical value for the chi-squared statistic is determined by the level of significance (typically 0.05) and the degrees of freedom. If the observed chi-squared test statistic is greater than the critical value, the null hypothesis can be rejected.

Null hypothesis: Assumes that there is no association between the two variables.

Alternative hypothesis: Assumes that there is an association between the two variables. [11]
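
For reference, the quantities described above can be written out as follows. This is the standard formulation of the test; the symbols are notation introduced here rather than taken from the cited sources. $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ the expected count under the null hypothesis, $R_i$ and $C_j$ the row and column totals, $n$ the sample size, and $r$ and $c$ the number of rows and columns (excluding totals).

$$ E_{ij} = \frac{R_i C_j}{n}, \qquad \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad \text{degrees of freedom} = (r - 1)(c - 1) $$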

***

#### Wikipedia example

The Wikipedia example of the chi-squared test [9] describes a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as follows:


|              | A   | B   | C   | D   | Total |
|--------------|-----|-----|-----|-----|-------|
| White collar | 90  | 60  | 104 | 95  | 349   |
| Blue collar  | 30  | 50  | 51  | 20  | 151   |
| No collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |


Wikipedia also states that the chi-squared value for this table is approximately 24.6 [9].
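
As a rough sketch, not part of the original solution, the quoted value can be reproduced directly from the definition of the statistic using plain Python. The variable names are illustrative, and the counts are copied from the table above without the "Total" row and column.

```python
# observed counts from the table, without the "Total" row and column
observed = [
    [90, 60, 104, 95],  # white collar
    [30, 50, 51, 20],   # blue collar
    [30, 40, 45, 35],   # no collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

stat = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # expected count if neighborhood and occupation were independent
        e = row_totals[i] * col_totals[j] / n
        stat += (o - e) ** 2 / e

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(stat, 3), dof)
```

This should give a statistic of about 24.57 with 6 degrees of freedom, in line with the value of approximately 24.6 quoted on Wikipedia.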

In the task I will use scipy.stats to verify this value and calculate the associated p value. 


#### Code

In [9]:
# import packages
# import pandas for the data frame
import pandas as pd
# import numpy for the arrays of numbers
import numpy as np
# import scipy.stats for the statistical functions
import scipy.stats as ss
# import the chi-squared test of independence and the chi-squared distribution
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# import matplotlib for plotting
import matplotlib.pyplot as plt

# plotting style for the plots:
plt.style.use("fivethirtyeight")

# create the table with the observed counts, including the totals for display
df = pd.DataFrame([[90,60,104,95,349],[30,50,51,20,151],[30,40,45,35,150],[150,150,200,150,650]], index=["White Collar","Blue Collar","No Collar","Total"], columns=["A","B","C","D","Total"])
print("Data Table")
print(df)

# the test must only see the observed counts, so drop the "Total" row and column;
# including them would inflate the degrees of freedom from 6 to 12
observed = df.iloc[:-1, :-1]

# code adapted from https://machinelearningmastery.com/chi-squared-test-for-machine-learning/
# run the chi-squared test: it returns the statistic, the p value, the degrees of freedom and the expected counts
stat, p, dof, expected = chi2_contingency(observed)
print("\nStat: %.3f" % stat)

# print the p value: the probability of a statistic at least this large if the variables were independent
print("p value: %.4f" % p)

# print the degrees of freedom
print('Degrees of freedom: %d \n' % dof)

# print the expected counts under independence
print(expected)

# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
# if statistic >= critical value, the result is significant, dependency exists, reject null hypothesis (H0)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
# if statistic < critical value, the result is not significant, fail to reject null hypothesis (H0)
else:
    print('Independent (fail to reject H0)')
# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.4f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Data Table
                A    B    C    D  Total
White Collar   90   60  104   95    349
Blue Collar    30   50   51   20    151
No Collar      30   40   45   35    150
Total         150  150  200  150    650

Stat: 24.571
p value: 0.0004
Degrees of freedom: 6 

[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]
probability=0.950, critical=12.592, stat=24.571
Dependent (reject H0)
significance=0.050, p=0.0004
Dependent (reject H0)


#### Conclusion

The chi-squared test of independence carried out above confirms the statistic quoted on the Wikipedia page: approximately 24.57, which rounds to 24.6. The observed chi-squared test statistic is greater than the critical value, and the p value is below the 0.05 significance level, so the null hypothesis of independence can be rejected: the data suggest an association between neighborhood and occupational classification.

#### References

[8] SPSS Tutorials: Chi-Square Test of Independence [online] https://libguides.library.kent.edu/spss/chisquare

[9] Chi-squared tests; Wikipedia; https://en.wikipedia.org/wiki/Chi-squared_test

[10] Chi-squared test for machine learning; https://machinelearningmastery.com/chi-squared-test-for-machine-learning/

[11] Chi-Square Test of Independence; https://www.statisticssolutions.com/non-parametric-analysis-chi-square/