<p align="center">
  <img src="https://www.edvancer.in/wp-content/uploads/2016/01/ML-vs.-stats1.png">
</p>

## <div align="center">Machine Learning and Statistics: Tasks</div>
### <div align="center">Author: Sean Elliott</div>

----

In [119]:
# Data frames.
import pandas as pd

# Statistics.
import scipy.stats as ss

# Shuffling the data.
import random

# Numerical arrays.
import numpy as np

## Task 1 
Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such as `math`. In this task, you should write a function `sqrt(x)` to approximate the square root of a floating point number `x` without using the power operator or a package.
Rather, you should use Newton's method. Start with an initial guess for the square root called $z_0$. You then repeatedly improve it using the following formula, until the difference between the previous guess $z_i$ and the next guess $z_{i+1}$ is less than some threshold, say 0.01.

$$z_{i+1} = z_i - \frac{z_i * z_i - x}{2z_i} $$

'*' denotes multiplication


In [120]:
# First attempt at writing code for the square root.
def sqrt(x):
  # First guess for the square root.
  z = x / 4.0
  # Apply Newton's update a fixed number of times (no threshold check yet).
  for i in range(1000):
    z = z - (((z * z) - x) / (2 * z))
  # Return z, which should be a good approximation of the square root.
  return z

In [121]:
# test function created above.
sqrt(15)

3.8729833462074166

In [122]:
# Compare with Python's built-in power operator.
15**0.5

3.872983346207417
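
The function above always runs 1000 iterations, whereas the task statement describes stopping once two successive guesses differ by less than a threshold. The following is a minimal sketch of that stopping rule. The name `sqrt_threshold`, the `max_iter` safety cap, and the handling of zero and negative inputs are assumptions added for illustration and are not part of the original task.

```python
def sqrt_threshold(x, threshold=0.01, max_iter=1000):
    """Approximate the square root of x with Newton's method (illustrative sketch)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        # Avoid dividing by zero in the update step.
        return 0.0
    # Same initial guess as the first attempt.
    z = x / 4.0
    for _ in range(max_iter):
        z_next = z - ((z * z) - x) / (2 * z)
        # Stop once successive guesses are closer than the threshold.
        if abs(z_next - z) < threshold:
            return z_next
        z = z_next
    return z


print(sqrt_threshold(15))
```

A smaller `threshold` gives a closer approximation at the cost of a few more iterations.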

## References: 

https://medium.com/@shouke.wei/how-to-embed-an-image-size-and-align-it-in-the-jupyter-notebook-542a2e4e2c98 Date Accessed: 26/09/2023 19:42

https://saturncloud.io/blog/how-to-position-embedded-images-in-jupyter-notebooks-using-markdown/ Date Accessed: 26/09/2023 19:47


***

## Task 2 

Consider the contingency table below, based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits.

| | Chocolate | Plain |
|---|---|---|
| Coffee | 43 | 57 |
| Tea | 56 | 45 |

Use `scipy.stats` to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.


We will start this task by defining the chi-squared test and what its correct uses are.
The formula for the chi-squared test statistic is as follows:


$$\chi^2 = \sum \frac {(O - E)^2}{E}$$

The breakdown of the above terms:

- $\chi^2$ is the chi-squared test statistic
- $\sum$ is the summation operator (meaning "find the sum of")
- $O$ is the observed frequency value
- $E$ is the expected frequency value


The idea behind the chi-squared test is a simple one: the test compares the observed data values with the values that would be expected if the null hypothesis (here, that drink preference and biscuit preference are independent) were true. For each cell of the table, the test takes the squared difference between the observed and expected counts and divides it by the expected count. The test statistic is the sum of these values over all cells.
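
As a rough illustration of the formula, the statistic can be worked out by hand for the counts used in this task (43, 57, 56 and 45). The sketch below uses only built-in Python and assumes the standard expected count under independence, namely row total times column total divided by the grand total. The variable names are illustrative.

```python
# Observed counts: rows are drinks (Coffee, Tea), columns are biscuits (Chocolate, Plain).
observed = [[43, 57],
            [56, 45]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over every cell of the table.
chi2 = sum(
    (o - e) * (o - e) / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi2)
```

The result should agree with the statistic that `scipy.stats` reports for the same table when no continuity correction is applied.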

In [123]:
# Create the data represented in the table so that it can be fed into the program.

coffee_choc = [['Coffee','Chocolate']] * 43
coffee_plain = [['Coffee','Plain']] * 57
tea_choc = [['Tea','Chocolate']] * 56
tea_plain = [['Tea','Plain']] * 45

# Store the 4 value sets above in one variable, 'data'.
data = coffee_choc + coffee_plain + tea_choc + tea_plain

In [124]:
# Shuffle the row order so the data doesn't look contrived; this doesn't alter the results.
random.shuffle(data)

In [125]:
# Transpose the list of [drink, biscuit] pairs into two tuples, one per column.
drink, biscuit = list(zip(*data))

In [126]:
# Create the dataframe.
df = pd.DataFrame({'drink': drink, 'biscuit': biscuit})

# Print out the dataframe to ensure it is as expected.
df


| | drink | biscuit |
|---|---|---|
| 0 | Coffee | Plain |
| 1 | Coffee | Plain |
| 2 | Tea | Plain |
| 3 | Tea | Chocolate |
| 4 | Tea | Chocolate |
| ... | ... | ... |
| 196 | Tea | Plain |
| 197 | Coffee | Chocolate |
| 198 | Tea | Chocolate |
| 199 | Coffee | Plain |


In [127]:
# Cross-tabulate drink against biscuit preference to get the observed counts.
cross = ss.contingency.crosstab(df['drink'], df['biscuit'])

# Show.
cross

CrosstabResult(elements=(array(['Coffee', 'Tea'], dtype=object), array(['Chocolate', 'Plain'], dtype=object)), count=array([[43, 57],
       [56, 45]]))

In [128]:
# Unpack the row and column labels from the crosstab result.
first, second = cross.elements

# Show arrays 
first, second

(array(['Coffee', 'Tea'], dtype=object),
 array(['Chocolate', 'Plain'], dtype=object))

In [129]:
cross.count 

array([[43, 57],
       [56, 45]])

In [130]:
# Run the chi-squared test without Yates' continuity correction.
result = ss.chi2_contingency(cross.count, correction=False)

# Show.
result

Chi2ContingencyResult(statistic=3.113937364324669, pvalue=0.07762509678333357, dof=1, expected_freq=array([[49.25373134, 50.74626866],
       [49.74626866, 51.25373134]]))

In [131]:
# The expected frequencies if the two variables are independent.
result.expected_freq

array([[49.25373134, 50.74626866],
       [49.74626866, 51.25373134]])

In [132]:
# Difference between observed and expected counts.
cross.count - result.expected_freq

array([[-6.25373134,  6.25373134],
       [ 6.25373134, -6.25373134]])

In [133]:
# Squared differences.
(cross.count - result.expected_freq)**2

array([[39.10915571, 39.10915571],
       [39.10915571, 39.10915571]])

In [134]:
# Squared differences divided by the expected counts (each cell's contribution).
(cross.count - result.expected_freq)**2 / result.expected_freq

array([[0.79403437, 0.77068042],
       [0.78617265, 0.76304992]])

In [135]:
# Sum of the contributions: this reproduces the chi-squared statistic.
((cross.count - result.expected_freq)**2 / result.expected_freq).sum()

3.113937364324669
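
The result can be turned into a conclusion by comparing the p-value with a significance level. The sketch below assumes the conventional level of 0.05, which is not specified in the task. It also relies on the fact that, for one degree of freedom only, the chi-squared survival function reduces to `erfc(sqrt(x / 2))`, so the p-value can be reproduced with the standard library.

```python
import math

statistic = 3.113937364324669  # chi-squared statistic reported above
alpha = 0.05                   # assumed significance level, not given in the task

# p-value for a chi-squared statistic; this shortcut is valid only for dof = 1.
p_value = math.erfc(math.sqrt(statistic / 2))

if p_value < alpha:
    conclusion = "reject the null hypothesis of independence"
else:
    conclusion = "fail to reject the null hypothesis of independence"

print(p_value, conclusion)
```

With these numbers the p-value (about 0.078) is above 0.05, so at that assumed level the data do not provide significant evidence of an association between drink preference and biscuit preference.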

*** 

## Task 3



In [136]:
# Load the penguins CSV file.
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'
df = pd.read_csv(url)
df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |


In [137]:
# Gentoo male samples: combine both conditions in one boolean mask.
male = df[(df['species'] == 'Gentoo') & (df['sex'] == 'MALE')]

In [None]:
# Gentoo female samples: combine both conditions in one boolean mask.
female = df[(df['species'] == 'Gentoo') & (df['sex'] == 'FEMALE')]
female

References:

https://www.simplilearn.com/tutorials/statistics-tutorial/chi-square-test Date Accessed: 17/10/2023

https://ezspss.com/interpreting-chi-square-results-in-spss/ Date Accessed: 17/10/2023

https://www.scribbr.com/statistics/chi-square-tests/#when Date Accessed: 16/10/2023
