# Machine Learning and Statistics Tasks

**Tatjana Staunton**

***

### Task 1







>Square roots are difficult to calculate. In Python, you typically
use the power operator (a double asterisk) or a package such
as `math`. In this task, you should write a function `sqrt(x)` to 
approximate the square root of a floating point number x without
using the power operator or a package.

>Rather, you should use Newton's method. Start with an 
initial guess for the square root called $z_0$. You then repeatedly
improve it using the following formula, until the difference between the previous guess $z_i$ and the next guess $z_{i+1}$
is less than some threshold, say 0.01.

$$ z_{i+1} = z_i - \frac{z_i^2 - x}{2z_i} $$



In [1]:
def sqrt(x):
    # Initial guess for the square root (assumes x > 0).
    z = x / 4.0
    # Run a fixed 100 Newton iterations instead of checking a threshold.
    for i in range(100):
        # Newton's method update for a better approximation.
        z = z - (((z * z) - x) / (2 * z))
    # Now z should be a good approximation of the square root.
    return z

In [2]:
# Test the function on 5.
sqrt(5)

2.23606797749979

In [3]:
# Check against Python's power operator for the square root of 5.
5 ** 0.5

2.23606797749979
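
The task statement describes stopping once successive guesses differ by less than a threshold, whereas the function above runs a fixed 100 iterations. A minimal sketch of the threshold-based variant follows. The name `sqrt_threshold`, the `max_iter` safety cap, and the guards for zero and negative input are assumptions added for illustration and are not part of the original task.

```python
def sqrt_threshold(x, threshold=0.01, max_iter=100):
    """Approximate the square root of x with Newton's method (illustrative sketch)."""
    # Guard clauses (an assumption, not required by the task).
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    # Same initial guess as the sqrt function above.
    z = x / 4.0
    for _ in range(max_iter):
        # Newton update: z_next = z - (z^2 - x) / (2z), without the power operator.
        z_next = z - ((z * z) - x) / (2 * z)
        # Stop once successive guesses differ by less than the threshold.
        converged = abs(z_next - z) < threshold
        z = z_next
        if converged:
            break
    return z


print(sqrt_threshold(5))
```

With a loose threshold such as 0.01 the loop typically stops after only a few iterations, so the result may agree with `5 ** 0.5` to fewer digits than the fixed-iteration version.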

#### Notes


>1. The calculation $z^2 - x$ is exactly zero when $z$ is the square root of $x$. It is greater than zero when $z$ is too big and less than zero when $z$ is too small. Thus $(z^2 - x)^2$ is a good candidate for a cost function.
>2. The derivative of the numerator $z^2 - x$ with respect to $z$ is $2z$. That is the denominator.
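
These two notes connect the task's formula to the general Newton update for finding a root of a function. As a short worked derivation, write $f(z) = z^2 - x$ (the name $f$ is introduced here only for illustration):

$$ f(z) = z^2 - x, \qquad f'(z) = 2z $$

$$ z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)} = z_i - \frac{z_i^2 - x}{2z_i} $$

This matches the update formula given in the task statement.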



#### References

https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/stream.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F1%5Fgeneral%2Ft01v11%5Ftask%5Fone%5Fand%5Frepo%2Emkv&referrer=StreamWebApp%2EWeb&referrerScenario=AddressBarCopied%2Eview

***

### Task 2

>Consider the below contingency table based on a survey asking
respondents whether they prefer coffee or tea and whether they
prefer plain or chocolate biscuits. Use `scipy.stats` to perform
a `chi-squared` test to see whether there is any evidence of an association between drink preference and biscuit preference in this
instance.



\begin{array}{|c|c|c|}
\hline
\text{} & \text{Chocolate Biscuit} & \text{Plain Biscuit}\\
\hline
\text{Coffee}  & \text{43}& \text{57}\\
\hline
\text{Tea} & \text{56} & \text{45}\\
\hline
\end{array}


In [4]:
# Importing libraries.
# Numerical arrays.
import numpy as np

# Statistics.
import scipy.stats as ss

# Creating the contingency table.
# Rows: coffee, tea. Columns: chocolate biscuit, plain biscuit.
contingency_table = np.array([[43, 57], [56, 45]])

# Performing the chi-squared test of independence.
result = ss.chi2_contingency(contingency_table)

result

Chi2ContingencyResult(statistic=2.6359100836554257, pvalue=0.10447218120907394, dof=1, expected_freq=array([[49.25373134, 50.74626866],
       [49.74626866, 51.25373134]]))

#### Notes
The test gives a statistic of about 2.64 with 1 degree of freedom and a p-value of about 0.104. Since the p-value is above the conventional 0.05 significance level, we fail to reject the null hypothesis of independence: the survey data do not provide sufficient evidence of an association between drink preference (coffee or tea) and biscuit preference (plain or chocolate). Note that for a 2x2 table `chi2_contingency` applies Yates' continuity correction by default, which is reflected in the reported statistic.
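
To make the reported numbers less of a black box, the following sketch recomputes the expected frequencies, the statistic, and the p-value by hand using only the standard library. It assumes the Yates continuity correction that `chi2_contingency` applies by default to 2x2 tables, and it relies on the fact that for 1 degree of freedom the chi-squared tail probability can be written with the complementary error function. The variable names are illustrative.

```python
import math

# Observed counts. Rows: coffee, tea. Columns: chocolate biscuit, plain biscuit.
observed = [[43, 57], [56, 45]]

# Marginal totals.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-squared statistic with Yates' correction: shrink each |O - E| by 0.5.
# (Every |O - E| here exceeds 0.5, so the simple form of the correction applies.)
statistic = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)
print(statistic, p_value)
```

The printed values should agree with the `expected_freq`, `statistic`, and `pvalue` fields shown in the output above, up to floating-point rounding.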

#### References

https://www.overleaf.com/learn/latex/Tables#Creating_a_simple_table_in_LaTeX

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F2%5Fchi%5Fsquare&ga=1

***

### Task 3

>Perform a `t-test` on the famous penguins data set to investigate 
whether there is evidence of a significant difference in the body
mass of male and female gentoo penguins.

In [5]:
# Importing libraries.
# Plots.
import matplotlib.pyplot as plt

# Numerical arrays.
import numpy as np

# Data frames.
import pandas as pd

# Statistics.
import scipy.stats as ss

# Loading Palmer Penguins dataset.
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")


# To display data.
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [6]:
# Filtering for Gentoo penguins only.
gentoo_df = df[df['species'] == 'Gentoo']

gentoo_df 

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
220,Gentoo,Biscoe,46.1,13.2,211.0,4500.0,FEMALE
221,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,MALE
222,Gentoo,Biscoe,48.7,14.1,210.0,4450.0,FEMALE
223,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,MALE
224,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,MALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [7]:
# The body mass of male Gentoo penguins.
sample_a = gentoo_df[gentoo_df['sex'] == 'MALE']['body_mass_g'].to_numpy()

sample_a

array([5700., 5700., 5400., 5200., 5150., 5550., 5850., 5850., 6300.,
       5350., 5700., 5050., 5100., 5650., 5550., 5250., 6050., 5400.,
       5250., 5350., 5700., 4750., 5550., 5400., 5300., 5300., 5000.,
       5050., 5000., 5550., 5300., 5650., 5700., 5800., 5550., 5000.,
       5100., 5800., 6000., 5950., 5450., 5350., 5600., 5300., 5550.,
       5400., 5650., 5200., 4925., 5250., 5600., 5500., 5500., 5500.,
       5500., 5950., 5500., 5850., 6000., 5750., 5400.])

In [8]:
# The body mass of female Gentoo penguins.
sample_b = gentoo_df[gentoo_df['sex'] == 'FEMALE']['body_mass_g'].to_numpy()

sample_b

array([4500., 4450., 4550., 4800., 4400., 4650., 4650., 4200., 4150.,
       4800., 5000., 4400., 5000., 4600., 4700., 5050., 5150., 4950.,
       4350., 3950., 4300., 4900., 4200., 5100., 4850., 4400., 4900.,
       4300., 4450., 4200., 4400., 4700., 4700., 4750., 5200., 4700.,
       4600., 4750., 4625., 4725., 4750., 4600., 4875., 4950., 4750.,
       4850., 4875., 4625., 4850., 4975., 4700., 4575., 5000., 4650.,
       4375., 4925., 4850., 5200.])

In [9]:
# Performing t-test.
ss.ttest_ind(sample_a, sample_b)

Ttest_indResult(statistic=14.721676481405709, pvalue=2.133687602018886e-28)

#### Notes

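The test gives a t-statistic of about 14.72 and a p-value of about $2.1 \times 10^{-28}$, far below the conventional 0.05 significance level. This is strong evidence of a difference in mean body mass between male and female Gentoo penguins in this data set. Because `sample_a` holds the males, the positive statistic indicates that males are heavier on average. Note that `ttest_ind` assumes equal population variances by default.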
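
As a cross-check on what `ttest_ind` computes with its default equal-variance assumption, the sketch below recomputes the pooled two-sample t-statistic by hand. The body masses are copied from the `sample_a` and `sample_b` arrays printed above; the names `males`, `females`, `mean`, and `sample_variance` are illustrative and not part of the original notebook.

```python
import math

# Body masses in grams, copied from the sample_a (male) output above.
males = [
    5700, 5700, 5400, 5200, 5150, 5550, 5850, 5850, 6300,
    5350, 5700, 5050, 5100, 5650, 5550, 5250, 6050, 5400,
    5250, 5350, 5700, 4750, 5550, 5400, 5300, 5300, 5000,
    5050, 5000, 5550, 5300, 5650, 5700, 5800, 5550, 5000,
    5100, 5800, 6000, 5950, 5450, 5350, 5600, 5300, 5550,
    5400, 5650, 5200, 4925, 5250, 5600, 5500, 5500, 5500,
    5500, 5950, 5500, 5850, 6000, 5750, 5400,
]
# Body masses in grams, copied from the sample_b (female) output above.
females = [
    4500, 4450, 4550, 4800, 4400, 4650, 4650, 4200, 4150,
    4800, 5000, 4400, 5000, 4600, 4700, 5050, 5150, 4950,
    4350, 3950, 4300, 4900, 4200, 5100, 4850, 4400, 4900,
    4300, 4450, 4200, 4400, 4700, 4700, 4750, 5200, 4700,
    4600, 4750, 4625, 4725, 4750, 4600, 4875, 4950, 4750,
    4850, 4875, 4625, 4850, 4975, 4700, 4575, 5000, 4650,
    4375, 4925, 4850, 5200,
]


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    # Unbiased sample variance (divides by n - 1).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


n_a, n_b = len(males), len(females)

# Pooled variance, weighting each sample's variance by its degrees of freedom.
pooled_var = (
    (n_a - 1) * sample_variance(males) + (n_b - 1) * sample_variance(females)
) / (n_a + n_b - 2)

# Standard error of the difference in means, then the t-statistic.
std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean(males) - mean(females)) / std_error

print(mean(males), mean(females))
print(t_stat)
```

The printed statistic should agree with the value reported by `ss.ttest_ind` above, up to floating-point rounding. The group means also show the direction of the difference directly.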



#### References

https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F3%5Ft%5Ftests&ga=1

https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html




***

### Task 4

>Using the famous iris data set, suggest whether the setosa class 
is easily separable from the other two classes. Provide evidence
for your answer.

#### Notes

#### References

https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F4%5Fknn&ga=1

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv


***

### Task 5


>Perform Principal Component Analysis on the iris data set,
reducing the number of dimensions to two. Explain the purpose
of the analysis and your results.