# Tasks

Machine Learning

Winter 2023/24

by James Connolly (G00232918)

***

## Task 1

> Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such
as math. In this task, you should write a function sqrt(x) to approximate the square root of a floating point number x without
using the power operator or a package.

> Rather, you should start with an initial guess for the square root called $z_0$. You then repeatedly improve it using the following formula, until the difference between some previous guess $z_i$ and the next $z_{i+1}$ is less than some threshold, say 0.01.

$$ z_{i+1} = z_i - \frac{z_i \times z_i - x}{2z_i} $$


In [4]:
def sqrt(x):
    # Initial guess for the square root
    z = 1.0

    # Improve the guess a fixed number of times
    # (used here in place of a threshold-based stopping test)
    for i in range(100):
        # Newton's method for a better approximation
        z = z - (((z*z)-x)/(2*z))

    # z should now be a good approximation for the square root
    return z

In [5]:
# test the function on 3.
sqrt(3)

1.7320508075688774

In [6]:
# Check Python's value for square root of 3.
3**0.5

1.7320508075688772

In [21]:
### Alternative answer

def sqrt1(x):

    # Starting point is 'x / 2.0', a reasonable initial estimate (see the reference below).
    z = x / 2.0

    # Stopping threshold: iteration ends when the difference between
    # consecutive approximations is less than or equal to this value.
    threshold = 0.01

    while True:
        # Use Newton's method to compute a better approximation of the square root.
        z_next = z - ((z * z - x) / (2 * z))
        
        # Check if the absolute difference between the current and next approximation is within the threshold.
        # If it is, we consider 'z_next' to be a good approximation and return it.
        if abs(z_next - z) <= threshold:
            return z_next  
        
        # Update 'z' with the new approximation for the next iteration.
        z = z_next
        

# Test the sqrt1 function on the number 3.
result = sqrt1(3)
print(result)  

# Check with Python Square root operator
python_sqrt = 3**0.5
print(python_sqrt)


1.7320508100147276
1.7320508075688772


Reference - https://www.rookieslab.com/posts/finding-square-root-using-guess-and-check-algorithm-in-python
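
Both versions above divide by `2 * z`, so an input of 0 would raise a `ZeroDivisionError` in `sqrt1` (its starting guess is `0 / 2.0`), and negative inputs have no real square root. The following is a minimal sketch of one way to guard against those cases. The name `sqrt_safe` and the `max_iter` parameter are assumptions introduced for illustration and are not part of the original task.

```python
def sqrt_safe(x, threshold=0.01, max_iter=100):
    """Approximate the square root of x with Newton's method (illustrative sketch)."""
    # Negative numbers have no real square root.
    if x < 0:
        raise ValueError("x must be non-negative")

    # Avoid dividing by zero in the update step.
    if x == 0:
        return 0.0

    # Assumed starting guess: x / 2 for larger inputs, 1.0 otherwise.
    z = x / 2.0 if x > 1 else 1.0

    for _ in range(max_iter):
        # Newton update, written without the power operator.
        z_next = z - (z * z - x) / (2 * z)

        # Stop when consecutive guesses are close enough.
        if abs(z_next - z) <= threshold:
            return z_next

        z = z_next

    # Fall back to the latest guess if the threshold was never met.
    return z


print(sqrt_safe(3))
print(sqrt_safe(0))
```

The iteration cap is a design choice: it keeps the loop from running indefinitely if the threshold is set unrealistically small.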

### Notes

***

1. The calculation $z^2 - x$ is exactly zero when $z$ is the square root of $x$. It is greater than zero when $z$ is too big. It is less than zero when $z$ is too small. Thus $(z^2 - x)^2$ is a good candidate for a cost function. 
2. The derivative of the numerator $z^2 - x$ with respect to $z$ is $2z$. This is the denominator of the fraction in the formula from the question. 
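
The two notes can be connected by reading the formula as the standard Newton root-finding step applied to $f(z) = z^2 - x$. The symbol $f$ is introduced here for illustration only:

$$ f(z) = z^2 - x, \qquad f'(z) = 2z $$

$$ z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)} = z_i - \frac{z_i^2 - x}{2z_i} = \frac{1}{2}\left(z_i + \frac{x}{z_i}\right) $$

As a worked example with $x = 3$ and $z_0 = 1$:

$$ z_1 = \frac{1}{2}\left(1 + \frac{3}{1}\right) = 2, \qquad z_2 = \frac{1}{2}\left(2 + \frac{3}{2}\right) = 1.75, \qquad z_3 = \frac{1}{2}\left(1.75 + \frac{3}{1.75}\right) \approx 1.7321 $$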

***
## Task 1 End

## Task 2

> Consider the below contingency table based on a survey asking respondents whether they prefer coffee or tea and whether they
prefer plain or chocolate biscuits. Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.




In [12]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as ss

# Defining the variables as Categorical
drink = pd.Categorical(['Coffee', 'Tea'], categories=['Coffee', 'Tea']) 
biscuit = pd.Categorical(['Chocolate', 'Plain'], categories=['Chocolate', 'Plain'])  

# converting the data to a NumPy array
data = np.array([[43, 57], [56, 45]])
# Creating the cross tabulation
cross_tab = pd.DataFrame(data, index=drink, columns=biscuit)  

# Performing the chi-squared test
chi2, p, dof, expected = ss.chi2_contingency(cross_tab)

# Output the results
# Chi-squared statistic
print("Chi-squared statistic:", chi2)
# p-value
print("P-value:", p)
# degrees of freedom
print("Degrees of freedom:", dof)
# expected frequency table
print("Expected frequencies table:")
print(expected)


Chi-squared statistic: 2.6359100836554257
P-value: 0.10447218120907394
Degrees of freedom: 1
Expected frequencies table:
[[49.25373134 50.74626866]
 [49.74626866 51.25373134]]


## Conclusion

Based on these results, there is no significant evidence of an association between drink preference and biscuit preference: the p-value (about 0.104) is greater than the conventional significance level of 0.05, so the null hypothesis of independence is not rejected.
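
To make the reported numbers less of a black box, the sketch below recomputes the expected frequencies and the statistic by hand and compares them with `chi2_contingency`. It assumes that SciPy applies Yates' continuity correction by default for a 2x2 table (one degree of freedom). The variable names `observed`, `chi2_manual` and `p_manual` are illustrative.

```python
import numpy as np
import scipy.stats as ss

# Observed counts, same values as in the cell above.
observed = np.array([[43, 57], [56, 45]])

# Expected count for each cell: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Yates' continuity correction: shrink each |observed - expected| by 0.5.
corrected = np.abs(observed - expected_manual) - 0.5
chi2_manual = (corrected ** 2 / expected_manual).sum()

# p-value from the chi-squared distribution with 1 degree of freedom.
p_manual = ss.chi2.sf(chi2_manual, df=1)

# Compare with SciPy's result.
chi2, p, dof, expected = ss.chi2_contingency(observed)
print(chi2_manual, chi2)
print(p_manual, p)
```

Passing `correction=False` to `chi2_contingency` would give the uncorrected statistic instead.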

### References
* How to set up the dataframe for cross tab - https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
* Understanding pd.categorical - https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Categorical.html
* chi square example - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
* Analysing the results of p-value - https://study.com/skill/learn/how-to-interpret-the-p-value-for-the-chi-square-test-for-goodness-of-fit-explanation.html

***
## Task 2 End

## Task 3

> Perform a t-test on the famous penguins dataset to investigate whether there is evidence of a significant difference in the body mass of male and female gentoo penguins.

In [2]:
import pandas as pd

# read in the dataset to be used.
df = pd.read_csv('data/penguins.csv')

# Show.
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [3]:
# Male body mass (all species; no species filter is applied here).
# The to_numpy method converts the selected column values into a NumPy array.
sample_a = df[df['sex'] == 'MALE']['body_mass_g'].to_numpy()

sample_a


array([3750., 3650., 4675., 3800., 4400., 4500., 4200., 3600., 3950.,
       3800., 3550., 3950., 3900., 3900., 4150., 3950., 4650., 3900.,
       4400., 4600., 3425., 4150., 4300., 4050., 3700., 3800., 3750.,
       4400., 4050., 3950., 4100., 4450., 3900., 4150., 4250., 3900.,
       4000., 4700., 4200., 3550., 3800., 3950., 4300., 4450., 4300.,
       4350., 4100., 4725., 4250., 3550., 3900., 4775., 4600., 4275.,
       4075., 3775., 3325., 3500., 3875., 4000., 4300., 4000., 3500.,
       4475., 3900., 3975., 4250., 3475., 3725., 3650., 4250., 3750.,
       4000., 3900., 3650., 3725., 3750., 3700., 3775., 4050., 4050.,
       3300., 4400., 3400., 3800., 4150., 3800., 4550., 4300., 4100.,
       3600., 4800., 4500., 3950., 3550., 4450., 4300., 3250., 3950.,
       4050., 3450., 4050., 3800., 3950., 4000., 3775., 4100., 5700.,
       5700., 5400., 5200., 5150., 5550., 5850., 5850., 6300., 5350.,
       5700., 5050., 5100., 5650., 5550., 5250., 6050., 5400., 5250.,
       5350., 5700.,

In [4]:
# female body mass
sample_b = df[df['sex'] == 'FEMALE']['body_mass_g'].to_numpy()

sample_b

array([3800., 3250., 3450., 3625., 3200., 3700., 3450., 3325., 3400.,
       3800., 3800., 3200., 3150., 3250., 3300., 3325., 3550., 3300.,
       3150., 3100., 3000., 3450., 3500., 3450., 2900., 3550., 2850.,
       3150., 3600., 2850., 3350., 3050., 3600., 3550., 3700., 3700.,
       3550., 3200., 3800., 3350., 3500., 3600., 3550., 3400., 3300.,
       3700., 2900., 3725., 3075., 2925., 3750., 3175., 3825., 3200.,
       3900., 2900., 3350., 3150., 3450., 3050., 3275., 3050., 3325.,
       3500., 3425., 3175., 3400., 3400., 3050., 3000., 3475., 3450.,
       3700., 3500., 3525., 3950., 3250., 4150., 3800., 3700., 3575.,
       3700., 3450., 3600., 2900., 3300., 3400., 3700., 3200., 3350.,
       3900., 3850., 2700., 3650., 3500., 3675., 3400., 3675., 3325.,
       3600., 3350., 3250., 3525., 3650., 3650., 3400., 3775., 4500.,
       4450., 4550., 4800., 4400., 4650., 4650., 4200., 4150., 4800.,
       5000., 4400., 5000., 4600., 4700., 5050., 5150., 4950., 4350.,
       3950., 4300.,

In [6]:
import scipy.stats as ss

ss.ttest_ind(sample_a, sample_b)

TtestResult(statistic=8.541720337994516, pvalue=4.897246751596224e-16, df=331.0)

## Conclusion

A t-test assesses whether the means of two groups differ by more than would be expected from random chance alone.

Using the t-statistic, the degrees of freedom and the t-distribution, we can assess the null hypothesis that the two groups have the same mean body mass.

Note that `sample_a` and `sample_b` as defined above contain the males and females of every species in the dataset, not only Gentoo penguins, so the result describes the dataset as a whole (df = 331).

A summary of the results:

T-statistic: 8.54, which suggests a significant difference between the means of the two samples.

P-value: The p-value is very small (4.897e-16), indicating strong evidence against the null hypothesis. A p-value this small means that the observed difference between the two sample means is very unlikely to be due to random chance.
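
The task statement asks specifically about Gentoo penguins. A minimal sketch of how the same test could be restricted to that species is given below. The helper name `gentoo_body_mass_ttest` is hypothetical, and the column names and the `'Gentoo'`, `'MALE'` and `'FEMALE'` labels are assumed to match the DataFrame shown earlier. No result is quoted here because this restricted test was not run in the cells above.

```python
import pandas as pd
import scipy.stats as ss


def gentoo_body_mass_ttest(df):
    """Independent t-test on body mass of male vs female Gentoo penguins (sketch)."""
    # Keep only Gentoo rows that have both a sex and a body mass recorded.
    gentoo = df[df['species'] == 'Gentoo'].dropna(subset=['sex', 'body_mass_g'])

    # Split the body masses by sex and convert to NumPy arrays.
    male = gentoo.loc[gentoo['sex'] == 'MALE', 'body_mass_g'].to_numpy()
    female = gentoo.loc[gentoo['sex'] == 'FEMALE', 'body_mass_g'].to_numpy()

    # Same test as above, applied to the filtered samples.
    return ss.ttest_ind(male, female)


# Usage with the DataFrame loaded earlier in this task:
# result = gentoo_body_mass_ttest(df)
# print(result.statistic, result.pvalue)
```

`ttest_ind` assumes equal variances by default; passing `equal_var=False` would run Welch's t-test instead if that assumption is in doubt.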


### References

* To numpy method - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
* T-test definition - https://www.investopedia.com/terms/t/t-test.asp


## Task 3 End