Machine Learning and Statistics

Winter 2023

by Ioan Domsa

***

## Task 1

***

> Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such
as `math`. In this task you should write a function `sqrt(x)` to approximate the square root of a floating point number x without
using the power operator or a package.

>Rather, you should use Newton’s method. Start with an initial guess for the square root called $z_0$. You then repeatedly
improve it using the following formula, until the difference between some previous guess $z_{i}$ and the next $z_{i+1}$
is less than some threshold, say 0.01.

$$ z_{i+1} = z_i - \frac{z_i \times z_i - x}{2 z_i} $$


In [1]:
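def sqrt(x):
    """Approximate the square root of x using Newton's method."""
    # The iteration below only makes sense for non-negative x.
    if x < 0:
        raise ValueError("x must be non-negative")
    # The square root of zero is zero; returning early avoids dividing by zero.
    if x == 0:
        return 0.0

    # Initial guess for the square root.
    z1 = x / 2.0
    # Threshold: stop once successive guesses differ by no more than this.
    t = 0.0000001

    # Loop until we are accurate enough.
    while True:
        # Newton's method update.
        z2 = z1 - ((z1 * z1) - x) / (2 * z1)
        # Check the threshold.
        if abs(z2 - z1) <= t:
            break
        z1 = z2
    return z2

x = 3
result = sqrt(x)
print(result)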

1.7320508075688772


In [2]:
# Test function
sqrt(3)

1.7320508075688772

In [3]:
# Check Python's value for square root of 3
3**0.5

1.7320508075688772
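
To see how quickly the iteration settles, the sketch below records every guess instead of only the last one. `sqrt_trace` is a hypothetical helper introduced here for illustration, not part of the task; it uses the same starting guess (`x / 2`) and threshold as `sqrt` above and assumes `x` is positive.

```python
def sqrt_trace(x, t=0.0000001):
    """Return every Newton guess for the square root of x (assumes x > 0)."""
    guesses = [x / 2.0]
    while True:
        z1 = guesses[-1]
        # Same update rule as in sqrt(x).
        z2 = z1 - ((z1 * z1) - x) / (2 * z1)
        guesses.append(z2)
        # Stop once two successive guesses agree to within the threshold.
        if abs(z2 - z1) <= t:
            return guesses


guesses = sqrt_trace(3)
for i, z in enumerate(guesses):
    print(i, z)
```

Each update shrinks the error roughly quadratically, which is why only a handful of updates are needed for `x = 3`.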

### Notes

***

1. The expression $z^2 - x$ is exactly zero when $z$ is the square root of $x$. It is greater than zero when $z$ is too big and less than zero when $z$ is too small. Thus $(z^2 - x)^2$ is a good candidate for a cost function.

2. The derivative of the numerator $z^2 - x$ with respect to $z$ is $2z$. That is the denominator of the fraction in the formula from the question.
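
Taken together, the two notes describe the standard Newton update for finding a root of a function. As a short worked derivation (the name $f$ is introduced here for illustration):

$$ f(z) = z^2 - x, \qquad f'(z) = 2z $$

$$ z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)} = z_i - \frac{z_i^2 - x}{2 z_i} = \frac{1}{2}\left(z_i + \frac{x}{z_i}\right) $$

The last form shows that each new guess is simply the average of $z_i$ and $x / z_i$.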

***

## Task 2

***

> Consider the below contingency table based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits.

>Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.

| **Drink** | **Chocolate biscuit** | **Plain biscuit** |
|:---------:|:---------------------:|:-----------------:|
| Coffee    |          43           |        57         |
| Tea       |          56           |        45         |

In [4]:
# Data frames.
import pandas as pd

# Shuffles.
import random

# Statistics.
import scipy.stats as ss

In [5]:
# 43 coffee drinkers who preferred chocolate biscuit
coffee_choc = [["coffee", "chocolate"]] * 43

# Show
coffee_choc

[['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 

In [6]:
# 56 tea drinkers who preferred chocolate biscuit
tea_choc = [["tea", "chocolate"]] * 56

# Show
tea_choc

[['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'c

In [7]:
# 57 coffee drinkers who preferred plain biscuit
coffee_plain = [["coffee", "plain"]] * 57

# Show
coffee_plain

[['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee'

In [8]:
# 45 tea drinkers who preferred plain biscuit
tea_plain = [["tea", "plain"]] * 45

# Show
tea_plain

[['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain']]

In [9]:
# Raw data, merge the four lists

raw_data = coffee_choc + coffee_plain + tea_choc + tea_plain

# show raw data
raw_data

[['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 

In [10]:
# Shuffle the data

random.shuffle(raw_data)

# show raw_data
raw_data

[['tea', 'chocolate'],
 ['coffee', 'plain'],
 ['tea', 'plain'],
 ['coffee', 'chocolate'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['tea', 'plain'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['coffee', 'chocolate'],
 ['tea', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['tea', 'plain'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['coffee', 'plain'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['tea', 'plain'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['coffee', 'plain'],
 ['coffee', 'chocolate'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['coffee', 'chocolate'],
 ['coffee', 'chocolate'],
 ['tea', 'chocolate'],
 ['coffee', 'plain'],
 ['tea', 'chocolate'],
 ['tea', 'chocolate'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'plain'],
 ['tea', 'chocolate'],
 ['coffee', 'ch

In [11]:
# Zip the list - make the rows columns and the columns rows
# Interchange the outer and inner lists
drink, biscuit = list(zip(*raw_data))

# Show drink, biscuit
drink, biscuit

(('tea',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'tea',
  'tea',
  'tea',
  'tea',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'tea',
  'coffee',
  'tea',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'tea',
  'tea',
  'tea',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'tea',
  'coffee',
  'coffee',
  'coffee',
  'tea',
  'tea',
  'coffee',
  'coffee',
  'tea',
  't

In [12]:
# create a data frame
df = pd.DataFrame({"drink": drink, "biscuit": biscuit})

# show
df

Unnamed: 0,drink,biscuit
0,tea,chocolate
1,coffee,plain
2,tea,plain
3,coffee,chocolate
4,coffee,plain
...,...,...
196,coffee,plain
197,tea,plain
198,tea,plain
199,coffee,chocolate


In [13]:
# perform cross tab contingency
cross = ss.contingency.crosstab(df["drink"], df["biscuit"])

# show
cross

CrosstabResult(elements=(array(['coffee', 'tea'], dtype=object), array(['chocolate', 'plain'], dtype=object)), count=array([[43, 57],
       [56, 45]]))

In [14]:
# The counts.
cross.count

array([[43, 57],
       [56, 45]])
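
Because the counts are already given in the task's table, the same array could also be typed in directly, without rebuilding and shuffling the raw responses. A minimal sketch, assuming the row order coffee, tea and the column order chocolate, plain; the names `observed`, `row_totals`, `col_totals` and `total` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the task's contingency table.
observed = pd.DataFrame(
    [[43, 57], [56, 45]],
    index=pd.Index(["coffee", "tea"], name="drink"),
    columns=pd.Index(["chocolate", "plain"], name="biscuit"),
)

# Totals per drink, per biscuit, and overall.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
total = int(observed.to_numpy().sum())

print(observed)
print(row_totals.to_dict(), col_totals.to_dict(), total)
```

The labelled totals make it easy to check the table against the question before running any test.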

In [15]:
# The first variable and the second

first, second = cross.elements

#show
first, second

(array(['coffee', 'tea'], dtype=object),
 array(['chocolate', 'plain'], dtype=object))

In [16]:
# Do the statistics (correction=False turns off Yates' continuity correction).
result = ss.chi2_contingency(cross.count, correction=False)

# Show.
result

Chi2ContingencyResult(statistic=3.113937364324669, pvalue=0.07762509678333357, dof=1, expected_freq=array([[49.25373134, 50.74626866],
       [49.74626866, 51.25373134]]))
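
The call above turns the continuity correction off. By default `chi2_contingency` applies Yates' correction when there is one degree of freedom, which gives a smaller statistic and a larger p-value. The sketch below compares the two settings, with the counts typed in directly; the names `plain` and `yates` are introduced here for illustration.

```python
import numpy as np
import scipy.stats as ss

observed = np.array([[43, 57], [56, 45]])

plain = ss.chi2_contingency(observed, correction=False)
yates = ss.chi2_contingency(observed, correction=True)

# Index by position: element 0 is the statistic, element 1 is the p-value.
print("no correction:", plain[0], plain[1])
print("Yates:        ", yates[0], yates[1])
```

Which setting is appropriate is a judgment call; the correction is more conservative and matters most for small counts.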

In [17]:
# The expected frequencies if drink and biscuit are independent
result.expected_freq

array([[49.25373134, 50.74626866],
       [49.74626866, 51.25373134]])

In [18]:
# Proportion who preferred chocolate biscuits irrespective of drink.
99 / 201

0.4925373134328358

In [19]:
# If there is no relationship between drink and biscuit,
# then the 100 coffee drinkers should like chocolate biscuits
# in the same proportion as the whole sample does.
100 * (99 / 201)

49.25373134328358

In [20]:
# If there is no relationship between drink and biscuit,
# then the 102 people who like plain biscuits should include
# tea drinkers in the same proportion as the whole sample does.
102 * (101 / 201)

51.25373134328359
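
The calculations above generalise: each expected count is (row total times column total) divided by the grand total. The sketch below rebuilds the whole expected table that way, recomputes the chi-squared statistic by hand, and compares the p-value with a 5% significance level. The 0.05 level is a conventional choice assumed here; the task does not specify one.

```python
import numpy as np
import scipy.stats as ss

observed = np.array([[43, 57], [56, 45]])

# Expected counts under independence: outer product of the margins over the total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
total = observed.sum()
expected = np.outer(row_totals, col_totals) / total

# Pearson's chi-squared statistic, without continuity correction.
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
p_value = ss.chi2.sf(statistic, dof)

alpha = 0.05  # assumed significance level
print(expected)
print(statistic, p_value)
print("reject independence" if p_value < alpha else "fail to reject independence")
```

With these counts the p-value (about 0.078) is above 0.05, so at that level the data do not give significant evidence of an association between drink preference and biscuit preference.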

## Task 3

***

> Perform a t-test on the famous penguins data set to investigate whether there is evidence of a significant difference in the body mass of male and female gentoo penguins.

In [21]:
# Numerical arrays.
import numpy as np

In [53]:
# Load Penguins
df = pd.read_csv('topic3/notes/data/penguins.csv')

# Show
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [38]:
# Select the Gentoo penguins (both sexes)

df_gentoo = df[df["species"] == "Gentoo"]

df_gentoo

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
220,Gentoo,Biscoe,46.1,13.2,211.0,4500.0,FEMALE
221,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,MALE
222,Gentoo,Biscoe,48.7,14.1,210.0,4450.0,FEMALE
223,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,MALE
224,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,MALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE
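
The preview shows at least one Gentoo row (index 339) with no measurements. Comparing `sex` against a string already skips rows where `sex` is missing, but it can be worth dropping incomplete rows explicitly, since `ttest_ind` propagates NaN values by default. A minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete male, one complete female,
# one row missing everything, and one row missing only the body mass.
toy = pd.DataFrame({
    "sex": ["MALE", "FEMALE", np.nan, "MALE"],
    "body_mass_g": [5700.0, 4500.0, np.nan, np.nan],
})

# Count the missing values per column.
missing = toy.isna().sum()

# Keep only rows where both columns needed for the t-test are present.
clean = toy.dropna(subset=["sex", "body_mass_g"])

print(missing)
print(clean)
```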


In [64]:
# Gentoo male body mass
sample_male = df_gentoo.loc[df_gentoo['sex'] == 'MALE', 'body_mass_g'].to_numpy()

count_sample_male = len(sample_male)
count_sample_male, sample_male

(61,
 array([5700., 5700., 5400., 5200., 5150., 5550., 5850., 5850., 6300.,
        5350., 5700., 5050., 5100., 5650., 5550., 5250., 6050., 5400.,
        5250., 5350., 5700., 4750., 5550., 5400., 5300., 5300., 5000.,
        5050., 5000., 5550., 5300., 5650., 5700., 5800., 5550., 5000.,
        5100., 5800., 6000., 5950., 5450., 5350., 5600., 5300., 5550.,
        5400., 5650., 5200., 4925., 5250., 5600., 5500., 5500., 5500.,
        5500., 5950., 5500., 5850., 6000., 5750., 5400.]))

In [65]:
# Gentoo female body mass
sample_female = df_gentoo.loc[df_gentoo['sex'] == 'FEMALE', 'body_mass_g'].to_numpy()

count_sample_female = len(sample_female)
count_sample_female, sample_female

(58,
 array([4500., 4450., 4550., 4800., 4400., 4650., 4650., 4200., 4150.,
        4800., 5000., 4400., 5000., 4600., 4700., 5050., 5150., 4950.,
        4350., 3950., 4300., 4900., 4200., 5100., 4850., 4400., 4900.,
        4300., 4450., 4200., 4400., 4700., 4700., 4750., 5200., 4700.,
        4600., 4750., 4625., 4725., 4750., 4600., 4875., 4950., 4750.,
        4850., 4875., 4625., 4850., 4975., 4700., 4575., 5000., 4650.,
        4375., 4925., 4850., 5200.]))

In [66]:
# Perform t-test

ss.ttest_ind(sample_male, sample_female)

TtestResult(statistic=14.721676481405709, pvalue=2.133687602018886e-28, df=117.0)

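> The t-test above suggests a significant difference between the body mass of the male and female Gentoo penguins in the samples analysed.
> The statistic of about 14.7 shows that the difference between the two sample means is large relative to the variation within the samples.
> The very small p-value (about 2.1e-28) means that a difference this large would be very unlikely if the two population means were equal.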
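
As a cross-check, the statistic can be reproduced from the standard pooled-variance formula, and Welch's version of the test (`equal_var=False`) shows whether the conclusion depends on assuming equal variances in the two groups. The arrays are copied from the outputs above; `male`, `female`, `t_manual` and `welch` are names introduced here for illustration.

```python
import numpy as np
import scipy.stats as ss

# Body masses in grams, copied from the arrays printed above.
male = np.array([
    5700., 5700., 5400., 5200., 5150., 5550., 5850., 5850., 6300.,
    5350., 5700., 5050., 5100., 5650., 5550., 5250., 6050., 5400.,
    5250., 5350., 5700., 4750., 5550., 5400., 5300., 5300., 5000.,
    5050., 5000., 5550., 5300., 5650., 5700., 5800., 5550., 5000.,
    5100., 5800., 6000., 5950., 5450., 5350., 5600., 5300., 5550.,
    5400., 5650., 5200., 4925., 5250., 5600., 5500., 5500., 5500.,
    5500., 5950., 5500., 5850., 6000., 5750., 5400.])
female = np.array([
    4500., 4450., 4550., 4800., 4400., 4650., 4650., 4200., 4150.,
    4800., 5000., 4400., 5000., 4600., 4700., 5050., 5150., 4950.,
    4350., 3950., 4300., 4900., 4200., 5100., 4850., 4400., 4900.,
    4300., 4450., 4200., 4400., 4700., 4700., 4750., 5200., 4700.,
    4600., 4750., 4625., 4725., 4750., 4600., 4875., 4950., 4750.,
    4850., 4875., 4625., 4850., 4975., 4700., 4575., 5000., 4650.,
    4375., 4925., 4850., 5200.])

n1, n2 = len(male), len(female)

# Pooled variance: both groups are assumed to share one variance.
pooled_var = ((n1 - 1) * male.var(ddof=1)
              + (n2 - 1) * female.var(ddof=1)) / (n1 + n2 - 2)

# Difference in means divided by its standard error.
t_manual = (male.mean() - female.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Welch's t-test drops the equal-variance assumption.
welch = ss.ttest_ind(male, female, equal_var=False)

print(t_manual)
print(welch[0], welch[1])
```

If the Welch result is close to the pooled one, the equal-variance assumption is not driving the conclusion.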

***
## End
