# Tasks

#### This notebook contains 5 tasks supporting the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023. 

### 5 tasks:

   [[Task 1]](#ref601) Square roots
   
   [[Task 2]](#ref602) Chi-square test
   
   [[Task 3]](#ref603) t-test
   
   [[Task 4]](#ref604) knn
   
   [[Task 5]](#ref605) Principal Component Analysis
   

<a id='ref601'></a>
# Task 1

Part of the assignments for the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023

Winter 23/24

Author: Jarlath Scarry

[Back to top of notebook](#Tasks)

## Square roots 

> Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such as `math`. In this task, you should write a function `sqrt(x)` to approximate the square root of a floating point number x without using the power operator or a package.

> Rather, you should use Newton's method. Start with an initial guess for the square root called $z_0$. You then repeatedly improve it using the following formula, until the difference between some previous guess $z_i$ and the next $z_{i+1}$ is less than some threshold, say 0.01.

$$z_{i+1} = z_i - \frac{z_i \times z_i - x}{2z_i}$$

[[101]](#ref101) (MACHINE LEARNING AND STATISTICS course material)
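
The stopping rule described in the task (stop when two successive guesses differ by less than 0.01) can be sketched as below. This is only a minimal illustration: the function name `newton_sqrt` and the choice of initial guess are assumptions of this sketch, and it assumes `x` is greater than zero.

```python
def newton_sqrt(x, threshold=0.01):
    """Approximate the square root of a positive number x with Newton's method."""
    # Initial guess z_0 (an assumption of this sketch; any positive value works for x > 0).
    z = x / 2.0 if x > 1.0 else 1.0
    while True:
        # z_{i+1} = z_i - (z_i * z_i - x) / (2 * z_i)
        z_next = z - (z * z - x) / (2.0 * z)
        # Stop when two successive guesses differ by less than the threshold.
        if abs(z_next - z) < threshold:
            return z_next
        z = z_next


print(newton_sqrt(16.0))
print(newton_sqrt(2.0))
```

A smaller `threshold` gives a more accurate result at the cost of extra iterations.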

### Ways to find a square root in Python

The easiest way is to use a built-in tool. One such method is to import the `math` library and use `math.sqrt(x)`. Another is to raise the number to the power of a half, which in Python is written `x**0.5`. Both methods are shown below.
In this exercise we will do this without those tools, using Newton's method instead.


### Code examples

Examples of finding a square root using the built-in tools

In [1]:
import math
x = 16
math.sqrt(x)

In [2]:
4**0.5

2.0

### Coding Newton's method
Let's put this formula into code. We take $x$ as the input number, and the output should be the square root of $x$.

$$z_{i+1} = z_i - \frac{z_i \times z_i - x}{2z_i}$$



### Start with a guess
We start by making a guess at the result. Let's take `z = 2` as our initial guess when trying to calculate $\sqrt{16}$. The code below takes our guess and calculates an estimate of $\sqrt{16}$.

Each time we re-run the code we get a better estimate of the result.
**If we re-run this piece of code 6 times we have a good result.**

In [3]:
# The number that we want to calculate the square root of.
x = 16
# Our initial guess for the square root. Set it as a floating point number, since the result may not be a whole number.
z = 2.0

In [None]:
# Newton's method for a better approximation.
# Each time we run this line of code it calculates a better result.
# If we re-run this piece of code 6 times we have a good estimate of the answer.
# Results:
#   1st = 5.0
#   2nd = 4.1
#   3rd = 4.001219512195122
#   4th = 4.0000001858445895
#   5th = 4.000000000000004
#   6th = 4.0
z = z - (((z*z) - x) / (2*z))

z

### Iterative method 

This is an iterative method. Rather than manually re-running the code 6 times until we get a good result, we can set it to iterate a fixed number of times. **Let's code this to loop 10 times, which should give a reasonable result in most cases.** The example below shows ten iterations of the calculation of $\sqrt{15}$.

In [None]:
#Put the code into a loop to run 10 times
def sqrt(x):
    #initial guess for the square root
    z = x/4.0
    #Loop newtons method 10 times until we get a good approximation.
    for i in range(10):
        # Newtons method for a better approximation
        z = z -(((z*z)-x)/ (2*z))
        print(z)
        print ((z*z)-x)
    # z should now be a good approximation
    return z   



In [None]:
# Call the function and test it on 15.
# The function also prints each estimate and the value of ((z*z)-x), which we will look at later.
result1 = sqrt(15)
result1

In [None]:
#check pythons value for square root of 15 
result2 = 15**0.5
result2

### Result

After running the code we get a result from the `sqrt(15)` function: $\sqrt{15}$ is returned as `3.8729833462074166`. This compares closely to the result of raising 15 to the power of 0.5.

`sqrt(15)` function result = 3.8729833462074166

`15**0.5`           result = 3.872983346207417 

### Improving the code to be more efficient
  
#### Negative numbers

The square root of a negative number is not defined in the real numbers: any real number squared is non-negative, so no real number squared gives a negative result. If a negative number is input, Newton's method never settles on a value, so a loop that runs a fixed number of times returns a meaningless result and a loop that waits for a good result would never stop. We could improve the code by limiting the input to non-negative numbers, using an `if` to return the message "undefined" if a negative number is input.
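
A minimal sketch of such a check is shown below. The name `sqrt_checked` is hypothetical, and raising a `ValueError` is a design choice of this sketch rather than the "undefined" message suggested above; it also treats zero separately, since an initial guess of zero would lead to a division by zero.

```python
def sqrt_checked(x):
    """Newton's method square root that rejects negative input."""
    if x < 0:
        # No real square root exists, so fail loudly instead of iterating.
        raise ValueError("square root is undefined for negative numbers")
    if x == 0:
        # Avoid dividing by zero below; the square root of 0 is 0.
        return 0.0
    # Initial guess for the square root.
    z = x / 4.0
    # Loop Newton's method 10 times, as in the function above.
    for _ in range(10):
        z = z - (((z * z) - x) / (2 * z))
    return z


print(sqrt_checked(16))
```

Raising an exception keeps the return type numeric, so code that calls the function does not need to check for a string.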

#### Iterations 

This `sqrt()` code is somewhat inefficient. It will repeat the loop 10 times even if a good answer is achieved after the first loop. 

The calculation $z^2 - x$ is exactly zero when $z$ is the square root of $x$. It is greater than zero when $z$ is too big and less than zero when $z$ is too small. Therefore, by using $z^2 - x$ as a cost function and stopping the loop when it approaches zero, we can make the code more efficient. This improvement stops the code when a "good" result is achieved, rather than always running the 10 loops.



In [None]:
def sqrt(x):
    # Initial guess for the square root.
    z = x/4.0
    # Loop while ((z*z)-x) is not very close to 0.
    while abs((z*z) - x) > 1e-10:
        print(z)
        # Newton's method for a better approximation.
        z = z - (((z*z) - x) / (2*z))
    # z should now be a good approximation.
    return z


In [None]:
# Square root of 4. The result is very close to Python's 4**0.5 method.
sqrt(4)

### Conclusion

In this case, calculating the square root of 4 stopped after 6 iterations because we had achieved a "good" result. This reduces the "cost" from 10 iterations to 6.


### Acknowledgments

Ian McLoughlin ATU.ie MLAS lecture notes. For much of the code and inspiration for the rest.

### References

[101]<a id='ref101'></a> MACHINE LEARNING AND STATISTICS course material. Task 1 - Square roots 

## End task1

[Back to top of task](#Task-1)

[Back to top of notebook](#Tasks)
<hr style="border: 2px solid black" />

<a id='ref602'></a>
# Task 2

Part of the assignments for the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023

Winter 23/24

Author: Jarlath Scarry

[Back to top of notebook](#Tasks)

### Chi-square test

> Consider the below contingency table based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits. Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.

![images/task2_image1.PNG](images/task2_image1.PNG)



### Assumptions for Chi-square test

When you choose to analyse your data using a chi-square test for independence, you need to make sure that the data you want to analyse "passes" two assumptions. You need to do this because it is only appropriate to use a chi-square test for independence if your data passes these two assumptions. If it does not, you cannot use a chi-square test for independence. These two assumptions are:

Assumption #1:
Your two variables should be measured at an ordinal or nominal level (i.e., categorical data).

Assumption #2:
Your two variables should each consist of two or more categorical, independent groups. Example independent variables that meet this criterion include gender (2 groups: Males and Females), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth. These independent groups should not overlap. For example, a respondent who sometimes drinks Tea and sometimes Coffee cannot be placed in both groups; the choice should be one or the other.

[[201]](#ref201) (statistics.laerd.com Chi-Square Test for Association using SPSS Statistics Oct 2023)


We have two variables, which are cross-tabulated in the table. One variable is Drinks, with a choice of Tea or Coffee, on the vertical axis on the left. The other variable is Biscuits, with a choice of Chocolate or Plain, on the horizontal axis at the top.

### Chi-Square Tests

    
### Laerd Statistics Chi-Square Test for Independence


### Data for the test
The data is generated from the results table above. This is done in a few steps below using pandas, and the generated rows are shuffled so that they look more like raw survey responses.

In [None]:
import pandas as pd
import random

import scipy.stats as ss
from scipy.stats.contingency import crosstab


# One [drink, biscuit] row per respondent.
coffee_chocolate = [['Coffee','Chocolate']]*43
coffee_plain = [['Coffee','Plain']]*57
tea_chocolate = [['Tea','Chocolate']]*56
# NOTE: these rows are labelled ['Tea','Chocolate'], the same as tea_chocolate,
# so the generated data contains no Tea/Plain rows. Check the label and the
# count against the Tea/Plain cell of the survey table.
tea_plain = [['Tea','Chocolate']]*56

raw_data = coffee_chocolate + coffee_plain + tea_chocolate + tea_plain
# Shuffle the data.
random.shuffle(raw_data)
# Zip the list: make the rows columns and the columns rows,
# i.e. interchange the outer and inner lists.
drink, biscuit = list(zip(*raw_data))  # 2 tuples, one with drinks and one with biscuits.

# Create a data frame.
df = pd.DataFrame({'drink': drink, 'biscuit': biscuit})

# Save the data frame generated from the table in the question.
df.to_csv(r'data\survey.csv')

# Show.
df

## Contingency Table

Contingency tables are used in statistics to summarize the relationship between several categorical variables. In our example, the contingency table for the two variables "Drinks" and "Biscuits" is a frequency table of these variables presented simultaneously. A chi-squared test conducted on a contingency table can test whether or not a relationship exists between the variables. These effects are defined as relationships between rows and columns.

[[202]](#ref202) (stackoverflow.com, How to understand the chi square contingency table, Oct 2023)

The Contingency Table function produces a table of the joint distribution of two categorical variables. This technique is often used to analyze survey data such as in our small survey.

[[203]](#ref203) (scipy.org, scipy.stats.contingency.crosstab, Oct 2023)

scipy.stats.contingency.crosstab

> This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed.

[[204]](#ref204) (Wikipedia.org, Contingency table, Oct 2023) 

> The expected frequencies are computed based on the marginal sums under the assumption of independence; see scipy.stats.contingency.expected_freq.

Crosstab returns a table of counts for each possible unique combination in the argument.
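
To make the idea of a table of counts concrete, here is a small sketch that builds the same kind of table by hand with the standard library. The responses are hypothetical and are not the survey data; the SciPy function used below does the equivalent work for us.

```python
from collections import Counter

# Hypothetical survey responses: one (drink, biscuit) pair per respondent.
responses = [
    ("Coffee", "Plain"),
    ("Tea", "Chocolate"),
    ("Coffee", "Chocolate"),
    ("Tea", "Chocolate"),
    ("Coffee", "Plain"),
]

# Count how often each combination occurs.
counts = Counter(responses)

# The unique values of each variable, in sorted order.
drinks = sorted({drink for drink, _ in responses})
biscuits = sorted({biscuit for _, biscuit in responses})

# Arrange the counts as rows (drinks) by columns (biscuits).
table = [[counts[(d, b)] for b in biscuits] for d in drinks]

print(biscuits)
for d, row in zip(drinks, table):
    print(d, row)
```

A combination that never occurs, such as Tea with Plain here, gets a count of zero because a `Counter` returns 0 for missing keys.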

In [None]:
#cross = pd.crosstab(index=df['drink'], columns=df['biscuit'])
#cross

In [None]:
# Perform Crosstabs Contingency.

cross = ss.contingency.crosstab(df['drink'], df['biscuit'])
# Show.
cross



In [None]:
first, second = cross.elements
first, second

In [None]:
cross.count

In [None]:
import scipy
print('SciPy version:', scipy.__version__)

In [None]:
df[df['drink'] == first[0]]

####  Statistical Test

> Chi-square test of independence of variables in a contingency table.

[[205]](#ref205) (scipy.org, scipy.stats.chi2_contingency, Oct 2023) 

> This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed. The expected frequencies are computed based on the marginal sums under the assumption of independence; see scipy.stats.contingency.expected_freq. The number of degrees of freedom is (expressed using numpy functions and attributes): `dof = observed.size - sum(observed.shape) + observed.ndim - 1`


#### In our case the parameters are:

    The array: the observed contingency table
    
    Correction value: when set to False, this prevents the function from applying the Yates correction
  
> In statistics, Yates's correction for continuity (or Yates's chi-squared test) is used in certain situations when testing for independence in a contingency table. It aims at correcting the error introduced by assuming that the discrete probabilities of frequencies in the table can be approximated by a continuous distribution (chi-squared). In some cases, Yates's correction may adjust too far, and so its current use is limited. 

[[206]](#ref206) (wikipedia.org, Yates's correction for continuity, Oct 2023) 

> Like scipy.stats.chisquare, this function computes a chi-square statistic; the convenience this function provides is to figure out the expected frequencies and degrees of freedom from the given contingency table.

    dof: the degrees of freedom, returned as part of the result
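
To show what the correction value changes, here is a minimal pure-Python sketch of the chi-square statistic for a 2x2 table with and without Yates's correction. The function name `chi_square_2x2` and the counts are hypothetical and chosen only for illustration; in the notebook itself the SciPy function does this calculation.

```python
def chi_square_2x2(observed, correction=False):
    """Chi-square statistic for a 2x2 table, optionally with Yates's correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(obs - expected)
            if correction:
                # Yates: shrink each |observed - expected| by 0.5 (not below zero).
                diff = max(diff - 0.5, 0.0)
            statistic += diff * diff / expected
    return statistic


# Hypothetical counts, not the survey data.
table = [[30, 20], [10, 40]]
print(chi_square_2x2(table, correction=False))
print(chi_square_2x2(table, correction=True))
```

The corrected statistic is always the smaller of the two, which makes the test more conservative.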
   

In [None]:
# Run the chi-squared test for independence on the contingency table.
# correction=False switches off the Yates correction.
Chi2ContingencyResult = ss.chi2_contingency(cross.count, correction=False)

# Show.
Chi2ContingencyResult

### Results

#### Expected results?
What results would we expect to see if drink and biscuit preferences were independent? The expected frequencies are already part of the result returned above, in its `expected_freq` attribute. They can also be calculated directly using the `expected_freq` function in `scipy.stats.contingency`. This returns the expected frequencies based on the marginal sums of the table, in a table the same shape as the observed table.

The expected results show what we should expect to see in a table of independent groups. In this case the expected results can be seen in the table below. For example, we would expect around 73 people to respond saying they would like Coffee with a Chocolate biscuit.
       
[[205]](#ref205)(scipy.org, scipy.stats.chi2_contingency, Oct 2023)  
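
The phrase "based on the marginal sums" can be sketched in a few lines: each expected count is the row total multiplied by the column total, divided by the grand total. The helper name `expected_frequencies` and the counts below are hypothetical and are used only for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Hypothetical counts, not the survey data.
observed = [[30, 20], [10, 40]]
expected = expected_frequencies(observed)

for row in expected:
    print(row)
```

The expected table has the same row and column totals as the observed table, which is a quick way to sanity-check it.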

In [None]:
import numpy
expected = Chi2ContingencyResult.expected_freq
numpy.savetxt(r'data\expected.csv', expected, delimiter=",")
title = 'Expected result'
expected_df = pd.DataFrame(expected,columns=['Chocolate','Plain'],index=['Coffee', 'Tea'])
expected_df = expected_df.style.set_caption(title)
expected_df

#### Actual survey results
<img src="images/task2_image1.PNG" align="left"/>

### Actual results

The function returns a result object containing the following attributes:

  statistic (float): the test statistic. `statistic=87.31664516129032`

  pvalue (float): the p-value of the test. `pvalue=9.24664373812942e-21`

  dof (int): the degrees of freedom. This is generally calculated as `dof` = (number of rows - 1) * (number of columns - 1). Here `dof=1`

  expected_freq (ndarray, same shape as observed): the expected frequencies, based on the marginal sums of the table.

[[205]](#ref205) (scipy.org, scipy.stats.chi2_contingency, Oct 2023) 

### Interpreting the results

In the chi-squared test for independence we are testing whether the proportions differ across the categories.

Is the choice of a plain or chocolate biscuit dependent on the drink chosen? Does knowing what drink a person chooses tell us anything about what biscuit they might like? Looking at the data, it appears that someone choosing Tea is more likely to choose a Chocolate biscuit, but we can check this with the chi-squared test for independence. 

Suppose we assume at the outset that there is no association between whether a person likes Tea or Coffee and whether they like Plain or Chocolate biscuits, and take this to be $H_0$, the null hypothesis. If this is true, what are the chances of seeing sample data at least as extreme as ours? We need to set a threshold for this probability. If the p-value falls below the threshold, we reject the null hypothesis. 

A common choice is to treat a p-value of less than 5% (p-value < 0.05) as statistically significant. In other words, we reject the null hypothesis and conclude that the association seen in the sample is unlikely to be due to chance alone. Our p-value is far lower than 5%, so we reject the null hypothesis in favour of the alternative hypothesis: based on our survey, the assumption that drink and biscuit preference are independent does not hold, and a person who chooses Tea is more likely to have a Chocolate biscuit.
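
As a cross-check on the reported p-value and on the decision rule, here is a minimal sketch using only the standard library. It relies on the fact that, for one degree of freedom, the upper tail probability of the chi-square distribution equals `erfc(sqrt(s / 2))`, so it only applies to a 2x2 table like ours. The 5% significance level is the conventional choice discussed above.

```python
import math

# Test statistic reported in the results above (dof = 1).
statistic = 87.31664516129032

# For one degree of freedom, P(chi-square > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(statistic / 2.0))

# Significance level.
alpha = 0.05

print(p_value)
print("reject the null hypothesis:", p_value < alpha)
```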

### Conclusion

 For a Chi-square test, a p-value that is less than or equal to your significance level indicates there is sufficient evidence to conclude that the observed distribution is not the same as the expected distribution. You can conclude that a relationship exists between the categorical variables. 
 
 [[207]](#ref207) (statisticsbyjim.com, Chi-Square Test of Independence and an Example, Oct 2023) 
 
Our p-value is far lower than 5%, so we reject the null hypothesis in favour of the alternative hypothesis: based on our survey, drink preference and biscuit preference are not independent.

So we should conclude that someone who chooses Tea to drink is more likely to have a Chocolate biscuit.
 
The p-value measures the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. 

[[208]](#ref208) (stackoverflow.com, how to understand the chi square contingency table, Oct 2023) 



### Acknowledgments

Ian McLoughlin ATU.ie MLAS lecture notes. For much of the code and inspiration for the rest.

### References

[201]<a id='ref201'></a> (statistics.laerd.com Chi-Square Test for Association using SPSS Statistics Oct 2023) https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php

[202]<a id='ref202'></a> (stackoverflow.com How to understand the chi square contingency table, Oct 2023)
https://stackoverflow.com/questions/52692315/how-to-understand-the-chi-square-contingency-table

[203]<a id='ref203'></a> (scipy.org, scipy.stats.contingency.crosstab, Oct 2023)
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.crosstab.html

[204]<a id='ref204'></a> (Wikipedia.org, Contingency table, Oct 2023) https://en.wikipedia.org/wiki/Contingency_table 

[205]<a id='ref205'></a> (scipy.org, scipy.stats.chi2_contingency, Oct 2023) https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

[206]<a id='ref206'></a> (wikipedia.org, Yates's correction for continuity, Oct 2023) https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity

[207]<a id='ref207'></a> (statisticsbyjim.com, Chi-Square Test of Independence and an Example, Oct 2023) https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/#:~:text=For%20a%20Chi%2Dsquare%20test,exists%20between%20the%20categorical%20variables. 

[208]<a id='ref208'></a> (stackoverflow.com, how to understand the chi square contingency table, Oct 2023) https://stackoverflow.com/questions/52692315/how-to-understand-the-chi-square-contingency-table


## END task2

[Back to top of task](#Task-2)

[Back to top of notebook](#Tasks)
<hr style="border: 2px solid black" />

<a id='ref603'></a>
# Task 3

Part of the assignments for the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023

Winter 23/24

Author: Jarlath Scarry

[Back to top of notebook](#Tasks)

### Perform a t-test

>Perform a t-test on the famous penguins data set to investigate
whether there is evidence of a significant difference in the body
mass of male and female gentoo penguins.

#### t-Tests

>A t-test is a type of statistical analysis used to compare the averages of two groups and determine whether the differences between them are more likely to arise from random chance. 
**It is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.** It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known (typically, the scaling term is unknown and is therefore a nuisance parameter). When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student's t distribution. The t-test's most common application is to test whether the means of two populations are different.

>History
The t-distribution, also known as Student's t-distribution, gets its name from William Sealy Gosset, who first published it in English in 1908 in the scientific journal Biometrika using the pseudonym "Student" because his employer preferred staff to use pen names when publishing scientific papers.

>Although it was William Gosset after whom the term "Student" is penned, it was actually through the work of Ronald Fisher that the distribution became well known as "Student's distribution"

[[301]](#ref301) (wikipedia.org Student's t-test, Oct 2023)

#### What Is a T-Distribution?

>The t-distribution, also known as the Student’s t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape. Unlike the normal distribution, the t-distribution has heavier tails, which result in a greater chance of extreme values.
The t-distribution is used in statistics to estimate the significance of population parameters for small sample sizes or unknown variations.
The t-distribution is the basis for computing t-tests in statistics.

[[303]](#ref303) (Investopedia.com, What Is T-Distribution in Probability? How Do You Use It? Oct 2023)
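
The "heavier tails" can be seen numerically by comparing the two density functions far from the centre. The sketch below codes the standard formula for the density of Student's t-distribution using only the `math` module; the function names are hypothetical and the choice of x = 3 is arbitrary.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return coeff * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Compare the densities in the tail, at x = 3, for several degrees of freedom.
for df in (2, 5, 30):
    print(df, t_pdf(3.0, df), std_normal_pdf(3.0))
```

For every value of `df` the t density at x = 3 is larger than the normal density, and the gap narrows as `df` grows.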

### Normal distribution

A normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graphical form, the normal distribution appears as a "bell curve". In summary, the term "Normal Distribution Curve" or "Bell Curve" describes the mathematical concept called the normal distribution, sometimes referred to as the Gaussian distribution. It refers to the shape that is created when a line is plotted using the data points for an item that meets the criteria of a normal distribution. [[2]](#ref2) (Normal distribution, Wikipedia, Nov 2022)

### Properties of a Normal distribution bell curve and the Area under the curve

The area under the curve represents 100% of the data.

Empirical rule: the data falls within a certain number of standard deviations of the mean.

68% within 1σ (sigma)

95% within 2σ

99.7% within 3σ

[[315]](#ref315) (Normal Distribution Definition and Properties, Prof. Essa, Nov 2015)
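
The percentages in the empirical rule can be checked with the error function from the standard library, since for a normal distribution the probability of falling within k standard deviations of the mean is `erf(k / sqrt(2))`. The function name `prob_within` is hypothetical.

```python
import math


def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```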

There are a few basic properties of a normal distribution.

The graph will be Symmetrical.

The Mean, Mode and Median are all equal.

The standard normal distribution has a mean of 0 and a standard deviation of 1, so its variance is also 1.

### Probability density function (PDF)

In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. 

So the PDF is used to find the probability of the random variable falling within a range of values, rather than being a specific value. 

[[316]](#ref316) (Probability Density Function, wikipedia, Jan 2023)

Let's consider the probability of a woman's height being exactly 1.7m.

If we want to find someone exactly 1.700000m tall it would be very unlikely; in fact, we could say that the probability is close to zero. 

However, what is the probability of finding a woman around 1.7m tall, say if we were willing to accept a height between 1.68m and 1.72m? Then we can calculate the probability based on the area under that part of the bell curve.
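
A rough sketch of that calculation is shown below. The mean of 1.65m and standard deviation of 0.07m are assumed values chosen only for illustration, not figures from the source, and the helper `normal_cdf` is a hypothetical name. The area under the curve between two heights is the difference between the cumulative probabilities at those heights.

```python
import math


def normal_cdf(x, mu, sigma):
    """Probability that a normal value with mean mu and sd sigma is below x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Assumed values, for illustration only: mean height 1.65 m, sd 0.07 m.
mu, sigma = 1.65, 0.07

# Area under the curve between 1.68 m and 1.72 m.
p_range = normal_cdf(1.72, mu, sigma) - normal_cdf(1.68, mu, sigma)

# Area under the curve at exactly 1.70 m: an interval of zero width.
p_exact = normal_cdf(1.70, mu, sigma) - normal_cdf(1.70, mu, sigma)

print(p_range)
print(p_exact)
```

The interval has a clearly non-zero probability, while the single exact height has a probability of zero.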


![image info](https://plus.maths.org/content/sites/plus.maths.org/files/articles/2021/Prob_dist/normal_height.png)

[[314]](#ref314) (mathspluss.org, Probability distributions, Oct 2023)
***

#### Laerd Statistics t-test

The independent-samples t-test (or independent t-test, for short) compares the means between two unrelated groups on the same continuous, dependent variable. 

For example, you could use an independent t-test to understand whether first year graduate salaries differed based on gender. 
The dependent variable would be "first year graduate salaries". 
Your independent variable would be "gender", which has two groups: "male" and "female".

[[305]](#ref305) (statistics.laerd.com, Independent t-test using SPSS Statistics, Oct 2023)

**Probability Density Function**

$ f(x) = \frac{1}{\sigma \sqrt{2 \pi} } e^{- \frac{1}{2} \big(\frac{x - \mu}{\sigma}\big)^2 } $

In [None]:
# Plots.
import matplotlib.pyplot as plt

# Numerical arrays.
import numpy as np

# Data frames.
import pandas as pd

# Statistics.
import scipy.stats as ss

In [None]:
##function as written in lectures
def normal_pdf(x, mu=0.0, sigma=1.0):
  # Answer: A*B.
  A = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
  B = np.exp(-0.5 * ((x - mu) / sigma)**2)
  return A * B

#### Plot the normal distribution

Let's plot the normal distribution using the formula coded in the `normal_pdf()` function above. The code calculates the probability density at each value of `x` and plots the resulting curve.

[[306]](#ref306) (MLAS lecture notes, Ian McLoughlin, Oct 2023)

In [None]:
# Create a blank plot.
fig, ax = plt.subplots(figsize=(12,6))

# Range of x values.
x = np.linspace(-4.0, 4.0, 1001)

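# Plot the pdf for the standard normal distribution.
mu, sigma2 = 0.0, 1.0
y = normal_pdf(x, mu=mu, sigma=np.sqrt(sigma2))
# Raw f-string so that \mu and \sigma reach matplotlib's mathtext unchanged.
ax.plot(x, y, label=rf"$\mu = {mu}, \sigma^2 = {sigma2}$")
# Show the label in a legend.
ax.legend();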

Evaluate the `normal_pdf()` function at a few points, to compare with the curves in the Wikipedia plot further down.

In [None]:
normal_pdf(0)   ##red line. Middle of the standard normal pdf

In [None]:
normal_pdf(0,0,np.sqrt(0.2))   ##blue line

In [None]:
normal_pdf(0,-2,np.sqrt(0.5))   ##green line... So it looks like we have done a good job of calculating the normal distribution

#### Different normally distributed curves

The plot below, from Wikipedia, shows normally distributed curves for differing values of $\mu$ and $\sigma^2$.

![Normal Distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1280px-Normal_Distribution_PDF.svg.png)

[[302]](#ref302) (Normal distribution, Wikipedia, Oct 2023)

#### Generating a normal distribution

There are many built-in functions for this in Python libraries; NumPy and SciPy's stats module both have them. NumPy will let us generate normally distributed data using `np.random.normal()`, or standard normally distributed data using `np.random.standard_normal()`. The standard normal distribution is a normal distribution where the mean is 0 and the standard deviation is 1.

#### Let's plot some normal distribution examples, first using functions in the Python NumPy library

The code below uses NumPy to generate random numbers from a normal distribution.

[[307]](#ref307) (W3Schools.com, Normal (Gaussian) Distribution, Oct 2023)

[[308]](#ref308) (sharpsightlabs.com, A Quick Introduction to Numpy Random Normal, Joshua Ebner, Oct 2023)

`np.random.normal()` takes 3 parameters: `loc`, `scale` and `size`. These allow us to control the mean, the standard deviation, and the number of values generated.

I ran a random sample of 100 numbers with a mean of 0 and a standard deviation of 0.1, and plotted the results. The data appears to be normally distributed, but the bell curve is not well defined due to the small sample size. When the sample size is increased to 10000 we can see a well defined bell curve matching the familiar normal distribution.
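
The effect of the three parameters can be sketched as follows. This sketch uses NumPy's newer `default_rng` generator with a fixed seed so that it is repeatable; that is a choice made for this illustration, and the seed value is arbitrary. The same `loc`, `scale` and `size` arguments apply to `np.random.normal()`.

```python
import numpy as np

# Seeded generator so the example is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(seed=42)

# loc = mean, scale = standard deviation, size = number of values.
small = rng.normal(loc=0.0, scale=0.1, size=100)
large = rng.normal(loc=0.0, scale=0.1, size=10000)

# The larger sample should sit closer to the requested mean and standard deviation.
print(small.mean(), small.std())
print(large.mean(), large.std())
```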

In [None]:
random_data = np.random.standard_normal(100000) # Generate random normal data using numpy.random

fig, ax = plt.subplots(figsize=(12,6)) # Create an empty plot.

ax.hist(random_data, bins=40, density=True) # Plot a histogram of the data.

# Plot the pdf for the standard normal distribution over the histogram.
x = np.linspace(-4.0, 4.0, 1001)
mu, sigma2 = 0.0, 1.0
y = normal_pdf(x, mu=mu, sigma=np.sqrt(sigma2))
# Raw f-string so that \mu and \sigma reach matplotlib's mathtext unchanged.
ax.plot(x, y, label=rf'$\mu = {mu}, \sigma^2 = {sigma2}$');

### Sampling Distribution

Let's look a little closer at the random (pseudo-random) numbers generated by NumPy. The code below generates an array of 10000 rows, each 25 numbers long. The numbers are independently generated and should have no relationship to each other, other than that they are all drawn from a standard normal distribution. 

[[309]](#ref309) (Numpy.org, Random sampling, Oct 2023)

The numpy.random module implements pseudo-random number generators (PRNGs or RNGs, for short) with the ability to draw samples from a variety of probability distributions.

In [None]:
# Generate some random normal data.
random_data = np.random.standard_normal((10000, 25))

# Show.
random_data

In [None]:
# Mean across the rows.
random_data.mean(axis=1)

In [None]:
# Create an empty figure.
fig, ax = plt.subplots(figsize=(12,6))

# Histogram of means.
ax.hist(random_data.mean(axis=1), bins=30);

In [None]:
# Create an empty figure.
fig, ax = plt.subplots(figsize=(12,6))

# Histogram of means.
ax.hist(random_data.mean(axis=1), bins=30, density=True)

# Plot standard normal distribution.
x = np.linspace(-4.0, 4.0, 1001)
y = normal_pdf(x)
ax.plot(x, y);
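
The histogram of means is much narrower than the standard normal curve. A minimal sketch of why, using only the standard library: the mean of n independent values with standard deviation σ has a standard deviation of σ/√n (the standard error), which is 1/√25 = 0.2 here. The number of samples is reduced from the array above to keep the sketch quick, and the seed is arbitrary.

```python
import math
import random
import statistics

rng = random.Random(0)  # seeded so the example is repeatable

n = 25          # values per sample, as in the array above
samples = 2000  # number of samples (fewer than above, to keep it quick)

# Mean of each sample of n standard normal values.
means = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(samples)]

# Spread of the sample means, observed and predicted.
observed_sd = statistics.stdev(means)
expected_sd = 1.0 / math.sqrt(n)  # standard error of the mean: sigma / sqrt(n)

print(observed_sd, expected_sd)
```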

## The Penguin data set.

That brings us on to the data for this test. Let's look at the penguins dataset and perform a t-test to see if there is a difference between the body mass of male and female Gentoo penguins. First we load the dataset and have a look at it. The seaborn library provides it through `load_dataset`; here it is read from a local CSV copy.

[[310]](#ref310) (seaborn.pydata.org, seaborn.load_dataset, Oct 2023)

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as sps
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

#penguins = sns.load_dataset('penguins') #load the dataset from seaborn library
penguins = pd.read_csv(r"data\penguins.csv")
penguins.head(10) #show the first 10 rows

Let's select the penguins of one species: Gentoo

In [None]:
gentoos = penguins[penguins['species'] == 'Gentoo']
gentoos.head(10)

The gentoo data info below shows an overview of the data we have for gentoos

In [None]:
gentoos.info()

### Two independent groups for the t-test


We can't measure the weight of all the Gentoos in the world, but we do have a sample of them. The dataset has a number of categories, including male/female, so let's count how many of each there are and separate them out. The dataset contains the weights of a sample of 61 male and 58 female Gentoos. A statistic calculated on this sample can help us to estimate the corresponding population parameter; in our case, the mean of the sample can be used as an estimate of the mean of the population. Male and female penguins are different individuals, so separating them gives us the 2 independent groups (or samples) required for the t-test, rather than paired measurements.

In [None]:
# Count the males and females.
gentoos['sex'].value_counts()

In [None]:
sns.histplot(data=gentoos, x="body_mass_g", hue="sex")

In [None]:
gentoos_male = gentoos[gentoos['sex'] == 'MALE']
gentoos_male.head(10)

In [None]:
gentoos_female = gentoos[gentoos['sex'] == 'FEMALE']
gentoos_female.head(10)

Now that we have the two groups, we need to get the body mass for each of them. 

In [None]:
male_mass = gentoos_male['body_mass_g']
male_mass.head(10)

In [None]:
female_mass = gentoos_female['body_mass_g']
female_mass.head(10)

In [None]:
# Create an empty plot.
fig, ax = plt.subplots(figsize=(12,6))
# Plot a histogram of the data.

plt.hist(female_mass, bins=100,range=(2000,8000))
plt.title("Female mass", size=20, color="red")

In [None]:
# Create an empty plot.
fig, ax = plt.subplots(figsize=(12,6))
# Plot a histogram of the data.

plt.hist(male_mass, bins=100,range=(2000,8000))
plt.title("Male mass", size=20, color="red")

### Any evidence of a difference in mass of Male and Female Gentoos?

The body mass of both male and female Gentoos is plotted on one histogram with seaborn further down. There is some overlap, but the two sets have separate distributions. The plot also includes the KDE (kernel density estimate) curve.

From this plot it looks as though the answer is YES, "there is evidence of a significant difference in the body mass of male and female gentoo penguins". A plot alone cannot show statistical significance, so we check this with a t-test below.

#### Getting more values from the dataset

Next we get the mean body mass of both groups. The mean mass of female Gentoo penguins is around 4680 g, while the mean mass of male Gentoo penguins is around 5485 g. Here we also find the mean and standard deviation for the Gentoos overall.

In [None]:
gentoos_mass = gentoos['body_mass_g']
gentoos_mean_mass = gentoos_mass.mean()
gentoos_sd_mass = gentoos_mass.std()
female_mass_mean = female_mass.mean()
male_mass_mean = male_mass.mean()
female_mass_mean, male_mass_mean

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
sns.histplot(data=gentoos, x="body_mass_g", hue="sex", bins=100, kde=True)
#sns.histplot(data=gentoos, x="body_mass_g", kde=True)

plt.xlim(2000, 8000)
plt.title("Gentoo Penguin mass g", size=20, color="red")

### Running a ttest on the groups

Now that we have extracted the data from our dataset, let's run a t-test. The two arrays of data we have lend themselves to one.

We can use `scipy.stats.ttest_ind()`, which the SciPy documentation describes as follows:

>Calculate the T-test for the means of two independent samples of scores. This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

We now perform a t-test on these two arrays.

[[311]](#ref311) (docs.scipy.org, scipy.stats.ttest_ind, Oct 2023)
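
To make the idea less of a black box, here is a minimal sketch of the equal-variance (pooled) two-sample t-statistic worked out by hand with the standard library. The function name `pooled_t_statistic` and the sample masses are made up for illustration; this is not SciPy's implementation, only the textbook formula that the equal-variance test is based on.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    dof = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof

# Made-up masses in grams, for illustration only (not the penguin data).
heavier = [5400, 5550, 5300, 5600, 5450]
lighter = [4700, 4600, 4800, 4650, 4750]
t_stat, dof = pooled_t_statistic(heavier, lighter)
print(t_stat, dof)
```

The statistic is the difference between the two sample means divided by its standard error, so a large absolute value means the means are many standard errors apart.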

In [None]:
# t-test.
ss.ttest_ind(male_mass, female_mass)

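Let's calculate the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom. For an independent two-sample t-test with equal variances the degrees of freedom are the combined sample size minus 2, so 61 males and 58 females give `(61 + 58) - 2 = 117`. The cell below uses `df=120`; the two-sided 95% critical values for 117 and 120 degrees of freedom agree to two decimal places (about ±1.98), so the conclusion is unaffected. We can calculate the quantiles using the SciPy percent point function `ss.t.ppf()`.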

[[312]](#ref312) (Youtube.com, T-test, ANOVA and Chi Squared test made easy, Oct 2023) 

[[313]](#ref313) (youtube.com, Python for Data Analysis: Hypothesis Testing and T-Tests, DataDaft, Oct 2023) 
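
The decision rule behind those quantiles can be sketched without SciPy. The version below is an assumption-laden stand-in: it uses the standard normal quantile from `statistics.NormalDist` as a large-sample approximation of the t quantile (about 1.96 rather than the 1.98 that `ss.t.ppf()` reports for 120 degrees of freedom). The helper names are hypothetical.

```python
from statistics import NormalDist

def two_sided_critical_value(alpha=0.05):
    """Normal-approximation critical value for a two-sided test at level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def reject_null(t_stat, alpha=0.05):
    """True when the statistic falls outside the central region (-c, +c)."""
    return abs(t_stat) > two_sided_critical_value(alpha)

critical = two_sided_critical_value(0.05)  # roughly 1.96
print(critical, reject_null(14.72), reject_null(0.5))
```

With a small sample the t quantile is noticeably wider than the normal one, so for real work the `ss.t.ppf()` call used in this notebook is the better choice.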

In [None]:
ss.t.ppf(q=.025, df=120), ss.t.ppf(q=.975, df=120)  # Percent point function: lower and upper critical values for a two-sided test at the 5% level.

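The cell below builds a 95% confidence interval around the mean body mass of all the Gentoos in our sample. Its width is set by the standard error of the mean, which is the sample standard deviation divided by the square root of the sample size. The interval describes how precisely the overall mean is estimated, not the spread of individual penguins, so we can use it to check whether the female and male means sit close to the overall mean or well away from it.

In [None]:
import math
n = gentoos_mass.count()  # number of non-missing body mass values
sem = gentoos_sd_mass / math.sqrt(n)  # standard error = sample std dev / sqrt(sample size)

ss.t.interval(0.95,   # confidence level
              df=n - 1,  # degrees of freedom
              loc=gentoos_mean_mass,  # sample mean
              scale=sem)  # standard error of the mean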
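
As a rough cross-check of how such an interval is put together, here is a standard-library sketch. It assumes a normal quantile in place of the t quantile (reasonable only for larger samples), and both the function name `mean_confidence_interval` and the masses are invented for illustration.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    # Normal quantile used as a large-sample stand-in for the t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Made-up masses in grams, for illustration only.
masses = [4700, 5400, 4600, 5550, 4800, 5300, 4650, 5600, 4750, 5450]
low, high = mean_confidence_interval(masses)
print(low, high)
```

The key point is that the scale is the standard deviation divided by the square root of the sample size, not the mean divided by it.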

### ttest Results

The SciPy t-test returns a `TtestResult` object with the following attributes:

`statistic` (float or ndarray): The t-statistic.

`pvalue` (float or ndarray): The p-value associated with the given alternative.

`df` (float or ndarray): The number of degrees of freedom used in calculation of the t-statistic. This is always NaN for a permutation t-test.
    

### Understanding the results

Suppose we assume at the outset that there is no difference between the mean mass of males and females, and take this to be $H_0$, the null hypothesis. If this were true, what are the chances of seeing sample data like ours? From what we have seen it looks unlikely, but how do we read that from the t-test results?

To call a result very improbable we need to set a threshold. The p-value is the probability of seeing a difference in sample means at least as large as ours if males and females in the overall population had the same mean mass. Let's set the threshold at 5%.

So we take a p-value of less than 5% (p-value < 0.05) to mean the result is statistically significant. In other words, we reject the null hypothesis and accept the alternative: the difference seen in the sample reflects a real difference in the overall population, and a Gentoo penguin's weight is likely to vary depending on its sex. On the plots earlier we can see males are generally heavier than females.

If a p-value reported from a t-test is less than 0.05, then that result is said to be statistically significant. If a p-value is greater than 0.05, then the result is not statistically significant.

In this case the p-value is far lower than 0.05, which indicates the result is statistically significant. Seeing a sample like ours from a population where there is no difference between male and female weights would be very unlikely.

This result (`pvalue=2.133687602018886e-28`) gives us strong enough evidence to reject the null hypothesis and say, yes, there is likely to be a difference in the weight of a Gentoo penguin based on its sex.

The t-statistic (`statistic=14.721676481405709`) tells us how far the difference between the sample means is from zero, the value expected under the null hypothesis, measured in standard errors. If the statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis.

Our t-statistic of 14.721676481405709 is well outside the range of -1.979930405052777 to 1.9799304050527766, so this is further evidence to reject the null hypothesis and accept the alternative hypothesis.


### Conclusion

In this case I would recommend rejecting the null hypothesis. I would expect that in the general population a Gentoo penguin's weight will likely vary depending on its sex. On the plots earlier we can see males are generally heavier than females. The results of the t-test support this.

### Acknowledgments

Ian McLoughlin ATU.ie MLAS lecture notes. For much of the code and inspiration for the rest.

Palmer Archipelago (Antarctica) penguin data. For making the penguins dataset available for use. https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

### References
[301]<a id='ref301'></a> (wikipedia.org, Student's t-test) Retrieved from: https://en.wikipedia.org/wiki/Student%27s_t-test

[302]<a id='ref302'></a> (Normal distribution, Wikipedia, Oct 2023) Retrieved from: https://en.wikipedia.org/wiki/Normal_distribution

[303]<a id='ref303'></a> (Investopedia.com, What Is T-Distribution in Probability? How Do You Use It? Oct 2023) Retrieved from: https://www.investopedia.com/terms/t/tdistribution.asp#toc-what-is-a-t-distribution

[304]<a id='ref304'></a> (Normal Distributions (Bell Curve): Definition, Word Problems, Stephanie Glen, Nov 2022) Retrieved from: https://www.statisticshowto.com/probability-and-statistics/normal-distributions/#whatisND

[305]<a id='ref305'></a> (statistics.laerd.com, Independent t-test using SPSS Statistics, Oct 2023) Retrieved from: https://statistics.laerd.com/spss-tutorials/independent-t-test-using-spss-statistics.php

[306]<a id='ref306'></a> (MLAS lecture notes, Ian McLoughlin, Oct 2023) Retrieved from: ATU.ie lectures

[307]<a id='ref307'></a> (W3Schools.com, Normal (Gaussian) Distribution, Oct 2023) Retrieved from: https://www.w3schools.com/python/numpy/numpy_random_normal.asp

[308]<a id='ref308'></a> (sharpsightlabs.com, A Quick Introduction to Numpy Random Normal, Joshua Ebner, Oct 2023) Retrieved from: https://www.sharpsightlabs.com/blog/np-random-randn-explained/

[309]<a id='ref309'></a> (Numpy.org, Random sampling, Oct 2023) Retrieved from: https://numpy.org/devdocs/reference/random/index.html#quick-start

[310]<a id='ref310'></a> (seaborn.pydata.org, seaborn.load_dataset, Oct 2023) Retrieved from: https://seaborn.pydata.org/generated/seaborn.load_dataset.html

[311]<a id='ref311'></a> (docs.scipy.org, scipy.stats.ttest_ind, Oct 2023) Retrieved from: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

[312]<a id='ref312'></a> (Youtube.com, T-test, ANOVA and Chi Squared test made easy, Global Health with Greg Martin, Oct 2023) Retrieved from: https://www.youtube.com/watch?v=ijeEYFnS2v4

[313]<a id='ref313'></a> (youtube.com, Python for Data Analysis: Hypothesis Testing and T-Tests, DataDaft, Oct 2023) Retrieved from: https://www.youtube.com/watch?v=CIbJSX-biu0

[314]<a id='ref314'></a> (plus.maths.org, Probability distributions, Oct 2023) Retrieved from: https://plus.maths.org/content/maths-minute-probability-distributions

[315]<a id='ref315'></a> (Normal Distribution Definition and Properties, Prof. Essa, Nov 2015) Retrieved from: https://youtu.be/iMak-EW4HtM

[316]<a id='ref316'></a> (Probability Density Function, wikipedia, Jan 2023) Retrieved from: https://en.wikipedia.org/wiki/Probability_density_function


## END task3

[Back to top of task](#Task-3)

[Back to top of notebook](#Tasks)
<hr style="border: 2px solid black" />

Topic 4: k Nearest Neighbours

<a id='ref604'></a>
# Task 4

Part of assignments for the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023

Winter 23/24

Author: Jarlath Scarry

[Back to top of notebook](#Tasks)

### k Nearest Neighbours

>Using the famous iris data set suggest whether the setosa class is easily separable from the other two classes. Provide evidence for your answer.

### Content

In this exercise we aim to train a machine learning model on feature data from the iris dataset and test how accurately it predicts which of the three flower varieties a set of features comes from.

The inputs are recorded measurements of the flowers' features under the column headings sepal_length, sepal_width, petal_length and petal_width. The varieties are setosa, versicolor and virginica, under the heading class.

###  scikit-learn

We will use scikit-learn, a machine learning library for Python. It is built on NumPy and SciPy and contains a wide range of tools for machine learning, including classification, regression, clustering and dimension reduction. It is a popular choice for machine learning beginners as it has good documentation and a broad library of functions, which makes it an excellent platform for learning. The package is imported in Python under the shorter name `sklearn`. It is generally included with Anaconda, so there is no need to install it separately.

>Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.



[[401]](#ref401) (scikit-learn.org, Getting Started, Nov 2023)

### scikit-learn fit method

The scikit-learn `fit()` method is used to train the model on a specific dataset. A newly created model starts with default settings and no knowledge of flowers, so it cannot yet predict the type of a flower. Calling `fit()` lets it learn the patterns in the training data, and it then uses those patterns to make predictions on new data. A training step with a `fit()` call is required for most machine learning models before they can be used to make predictions.

## Start coding

In [None]:
# Machine Learning.
import sklearn as sk
# Data frames.
import pandas as pd

In [None]:
#imports required to run the notebook code
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import math

Lets have a look at the data and check for null values. 

In [None]:
iris = pd.read_csv("data/iris.csv")  # forward slash works on Windows, macOS and Linux
iris.head()

In [None]:
iris.info()

In [None]:
iris.isnull().values.any()

The dataset is quite small, with just 150 entries, but it has no null (or `NaN`) values. This is good, because we don't need to worry about our results being skewed by empty cells. We can see the column headings in the `head()` output. Varieties are under the heading `class`. The other feature headings can also be seen. We will need these too.

To carry out a train/test split on the data we need to separate out the target variable. In this case we are looking to predict class, so we separate it out as y. The feature data we are using to predict the class is separated into the variable X. After that we will split the dataset into training (70%) and testing (30%) sets using the `train_test_split()` function from scikit-learn.
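
To show what a shuffled 70/30 split does under the hood, here is a small standard-library sketch. It is not scikit-learn's implementation: the function name `split_train_test`, the `seed` argument and the toy rows are assumptions made for illustration. The fixed seed plays a similar role to a `random_state` argument, making the split repeatable.

```python
import random

def split_train_test(rows, labels, test_size=0.3, seed=0):
    """Shuffle row indices, then split rows and labels with the same indices."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical feature rows and labels, for illustration only.
rows = [[i, i * 2] for i in range(10)]
labels = ["a" if i < 5 else "b" for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(rows, labels)
print(len(X_train), len(X_test))
```

Because features and labels are selected with the same shuffled indices, each row stays matched to its own label after the split.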

In [None]:
# Separate out the target variable. In this case we are looking to predict class, so we separate it out as y. 
# The feature data we are using to predict the class is separated into the variable X. 
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = iris['class']

In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

#Split the dataset into training (70%) and testing (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

Below, using `head()` again, we can see that the `X_train` and `X_test` sets contain the same columns for different randomly selected rows. Also notice that the class we wish to predict is now missing. It has been separated into the `y` datasets.

In [None]:
X_train.head()

In [None]:
X_test.head()

K-Nearest Neighbors (KNN) is an algorithm that classifies a new data point based on the majority class of its k nearest neighbors in the training set.

The k-value is an important parameter in KNN. It determines the number of nearest neighbors considered. The best k-value differs from dataset to dataset and can be optimised by experimentation.

[[402]](#ref402) (ibm.com, What is the k-nearest neighbors algorithm? Nov 2023)
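
The description above can be turned into a few lines of plain Python. This is a minimal sketch, not scikit-learn's `KNeighborsClassifier`: the class name `TinyKNN`, its attributes and the demo measurements are all made up for illustration, and it uses straight Euclidean distance with a simple majority vote.

```python
import math
from collections import Counter

class TinyKNN:
    """Minimal k-nearest-neighbours classifier (illustrative only)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" for KNN simply stores the labelled examples.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict_one(self, point):
        # Sort training points by Euclidean distance to the query point.
        by_distance = sorted(zip(self.X_, self.y_),
                             key=lambda pair: math.dist(pair[0], point))
        nearest_labels = [label for _, label in by_distance[:self.n_neighbors]]
        # Majority vote among the k nearest labels.
        return Counter(nearest_labels).most_common(1)[0][0]

    def predict(self, X):
        return [self.predict_one(point) for point in X]

# Hypothetical petal (length, width) pairs in cm, for illustration only.
X_demo = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
y_demo = ["setosa"] * 3 + ["versicolor"] * 3
model = TinyKNN(n_neighbors=3).fit(X_demo, y_demo)
predictions = model.predict([(1.6, 0.25), (4.6, 1.3)])
print(predictions)
```

Sorting every training point for each query is fine for 150 flowers but slow on large datasets, which is one reason to use a library implementation in practice.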

>What is a good value for K in KNN?
>The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.

[[403]](#ref403) (towardsdatascience.com, How to find the optimal value of K in KNN? Amey Band, 23 May 2020, retrieved Nov 2023)



In [None]:
knn_value = round(math.sqrt((len(iris.index))))

In [None]:
# Create and fit a KNeighborsClassifier model
knn = KNeighborsClassifier(n_neighbors=knn_value)
knn.fit(X_train, y_train)  # fit once; fit() returns the fitted model itself
print(knn)

In [None]:
accuracy = (knn.score(X_test, y_test))*100
accuracy

### Results

When the model is evaluated on the testing set, it does so with a mixture of all three classes. It uses what it learned about each class from the training set to predict a class for each test row. As a result, the accuracy is a measure of the model's ability to correctly classify all three varieties of flowers, not just Iris setosa.

[[404]](#ref404) (towardsdatascience.com, Importance of Distance Metrics in Machine Learning Modelling, Nov 2023)

In this case the model predicted a flower's variety or `class` with 97.77% accuracy. The split is random, so the exact figure can change from run to run.


### Results for setosa

Let's look at the results more closely to see how accurately setosa can be predicted. For this we can use the scikit-learn `classification_report()` function. It outputs a report of our model's performance, giving scores for `precision`, `recall`, `F1-score` and `support` for each class.

    Precision is the proportion of positive predictions that are correct

    Recall is the proportion of actual positives that are correctly identified

    F1-score is a measure of the balance between precision and recall

    Support is the number of actual occurrences of each class in the dataset

[[405]](#ref405) (kaggle.com/, kNN Classifier Tutorial, Nov 2023)
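
The four definitions above can be checked by hand on a tiny example. The sketch below counts true positives, false positives and false negatives for one class at a time; the function name `class_metrics` and the label lists are invented for illustration and are not the notebook's results.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and support for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true members of the class
    return precision, recall, f1, support

# Made-up labels, for illustration only.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]
setosa_scores = class_metrics(y_true, y_pred, "setosa")
virginica_scores = class_metrics(y_true, y_pred, "virginica")
print(setosa_scores, virginica_scores)
```

Note that precision and recall are not symmetric in the true and predicted labels, so swapping the two lists changes the per-class numbers.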

In [None]:
y_pred = knn.predict(X_test)

In [None]:
#https://medium.com/@mehtashubh1029/iris-flower-classification-using-knn-1eef6e7f3f84
#The classification_report function builds a text report showing the main classification metrics. 
#https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
# classification_report expects the true labels first, then the predictions.
print(classification_report(y_test, y_pred))

### Conclusion

Using scikit-learn on the iris dataset we predicted a flower's variety or `class` with 97.77% accuracy. More importantly for this exercise, we predicted `setosa` with 100% accuracy, which supports the view that the setosa class is easily separable from the other two.

### References
[401]<a id='ref401'></a>  (scikit-learn.org, Getting Started, Nov 2023) Retrieved from: https://scikit-learn.org/stable/getting_started.html

[402]<a id='ref402'></a>  (ibm.com, What is the k-nearest neighbors algorithm? Nov 2023) Retrieved from: https://www.ibm.com/topics/knn

[403]<a id='ref403'></a>  (towardsdatascience.com, How to find the optimal value of K in KNN? Nov 2023) Retrieved from: https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb

[404]<a id='ref404'></a>  (towardsdatascience.com, Importance of Distance Metrics in Machine Learning Modelling, Nov 2023) Retrieved from: https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d

[405]<a id='ref405'></a>  (kaggle.com, kNN Classifier Tutorial, Nov 2023) Retrieved from: https://www.kaggle.com/code/prashant111/knn-classifier-tutorial

## END task4

[Back to top of task](#Task-4)

[Back to top of notebook](#Tasks)
<hr style="border: 2px solid black" />

<a id='ref605'></a>
# Task 5

Part of assignments for the Machine Learning and Statistics module of the Higher Diploma in Science in Data Analytics course at ATU 2023

Winter 23/24

Author: Jarlath Scarry

[Back to top of notebook](#Tasks)

### Preprocessing

>Perform Principal Component Analysis on the iris data set reducing the number of dimensions to two. Explain the purpose
of the analysis and your results.

### Content

This task involves assessing the normality of the dataset as a whole, and of the 'setosa' class specifically, using the Shapiro-Wilk test. Following this, standard scaling is applied to normalize the data features. Finally, Principal Component Analysis (PCA) is carried out to reduce the dimensions to two. This allows better visualisation of the variance explained by these variables.

In [None]:
#imports
import pandas as pd                   # Data frames.
import sklearn as sk                  # Machine Learning.
import sklearn.neighbors as ne        # Nearest neighbors.
import sklearn.preprocessing as pre   # Preprocessing.
import sklearn.decomposition as dec   # Decomposition.
import scipy.stats as ss              # Statistical test.
import matplotlib.pyplot as plt       # Plots.
import seaborn as sns                 # Statistical plots.
import warnings
warnings.filterwarnings('error', category=DeprecationWarning)

In [None]:
# Load the Iris dataset
df = pd.read_csv("data/iris.csv")  # forward slash works on Windows, macOS and Linux
df.head()


In [None]:
# Drop any rows with NA/Nan.
df = df.dropna()

# Show.
df

Separate out the target variable. In this case we are looking to predict class, so we separate it out as y. 

The feature data we are using to predict the class is separated into the variable X. 

In [None]:
# Separate out the target variable. In this case we are looking to predict class, so we separate it out as y. 
# The feature data we are using to predict the class is separated into the variable X. 
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['class']

## Tests for Normality

The approaches can be divided into two main themes: relying on statistical tests or visual inspection. Statistical tests have the advantage of making an objective judgement of normality, but are disadvantaged by sometimes not being sensitive enough at low sample sizes or overly sensitive to large sample sizes. As such, some statisticians prefer to use their experience to make a subjective judgement about the data from plots/graphs. 

#### Shapiro test

The Shapiro (Shapiro-Wilk) test is a statistical test used to test if a sample data is normally distributed or not. 
The null-hypothesis of this test is that the population is normally distributed. Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. 

 [[502]](#ref502) (wikipedia, Shapiro–Wilk test, Dec 2023)

If the Sig. value of the Shapiro-Wilk Test is greater than 0.05, the data is normal. If it is below 0.05, the data significantly deviate from a normal distribution.

 [[504]](#ref504) (statistics.laerd.com, Testing for Normality using SPSS Statistics, Dec 2023)


In [None]:
# Let's look at the distribution of petal_length. Setosa is separate.

# Histogram.
# Create a histogram for petal length separated by class
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='petal_length', hue='class', kde=True, palette='Set1')
plt.title('Distribution of Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Count')
# seaborn draws the class legend itself from hue='class', so no plt.legend() call is needed.
plt.show()

#### Shapiro test result

The Shapiro-Wilk result below indicates that petal length, taken across all classes together, is far from normally distributed.

In [None]:
ss.shapiro(df['petal_length'])

In [None]:
# Separate out setosa. Let's look at the distribution of setosa in isolation.
df_seto = df[df['class'] == 'setosa']

# Histogram.
df_seto['petal_length'].hist()


#### Shapiro test result for just setosa

The p-value measures the strength of evidence against the null hypothesis that the data is normally distributed. With a p-value > 0.05 we cannot reject that hypothesis, so the 'petal_length' data for 'setosa' does not significantly deviate from a normal distribution. So it looks like, within the setosa class, petal_length is plausibly normally distributed. The data for all classes combined, however, is not likely to be normally distributed.

In [None]:
ss.shapiro(df_seto['petal_length'])

## Scaling Data

Scaling data for machine learning involves transforming the variable features to a similar scale so they are comparable. It reduces unwanted or disproportionate impact on analyses or machine learning.

Standardize features by removing the mean and scaling to unit variance. We will use `StandardScaler` to scale the data.

 [[503]](#ref503) (scikit-learn.org, StandardScaler, Dec 2023)
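
What "removing the mean and scaling to unit variance" means for a single column can be sketched with the standard library. The function name `standardize` and the measurements are made up for illustration; the sketch uses the population standard deviation (dividing by n), which is the convention `StandardScaler` follows.

```python
import statistics

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    # pstdev divides by n, the same convention StandardScaler uses.
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up sepal lengths in cm, for illustration only.
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
scaled = standardize(sepal_length)
print(scaled)
```

After the transformation the column has mean 0 and standard deviation 1, so features measured on different scales become directly comparable.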

In [None]:
#import the df again
# Load the Iris dataset
df2 = pd.read_csv("data/iris.csv")  # forward slash works on Windows, macOS and Linux
df2.head()

In [None]:
df2 = df2.dropna()

In [None]:
# Separate out the target variable. In this case we are looking to predict class, so we separate it out as y. 
# The feature data we are using to predict the class is separated into the variable X. 
# Use df2, the freshly loaded copy from the cells above.
X = df2[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df2['class']

In [None]:
# Create a standard scaler.
scaler = pre.StandardScaler()

# Show.
scaler

In [None]:
# Fit the data to the scaler.
scaler.fit(X)

In [None]:
# Show the means and variances.
scaler.mean_, scaler.var_

In [None]:
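# Verify the above: the means match. describe() reports the sample std (ddof=1), while
# scaler.var_ is the population variance (ddof=0), so std**2 is slightly larger than var_.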
X.describe()

In [None]:
# Transformed X array.
X_transformed = scaler.transform(X)

#the earlier fitted scaler is used to transform the 
#original dataset X into a scaled form (X_transformed) for further analysis. 
#It keeps the same scaling learned from the original data.

# Show.
X_transformed

In [None]:
# Means.
X_transformed.mean(axis=0)

In [None]:
# Standard deviations.
X_transformed.std(axis=0)

In [None]:
# Differences squared between first and last row.
(X_transformed[0] - X_transformed[-1])**2

In [None]:
# Original column names.
X.columns

In [None]:
# Re-create data frame.
df_X_trans = pd.DataFrame(X_transformed, columns=X.columns)

# Show.
df_X_trans

## Dimensions

***

In [None]:
# Look at the data again.
df

In [None]:
# Scatter plots and histograms.
sns.pairplot(df, hue='class');

Looking at the pair plots, we can see that setosa is linearly separable in a number of them, but there is no way to separate all classes with a single pair of variables.

#### Principal component analysis (PCA) 

There is no single good visualisation for all the variables together. As we have seen, the pair plot is probably the best quick visualisation of all the variables in the iris dataset. We can see at a glance the spread of points and then easily focus on one pair for a closer look, but is there a better way to combine the data from all the variables?

Principal component analysis attempts to do this. It tries to take the key points (or essence) of all the variables and combine them into two variables that can be easily plotted.

PCA is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. 

 [[501]](#ref501) (wikipedia.org, Principal component analysis Dec 2023)
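
For just two variables the idea can be worked through by hand: the principal components are the eigenvectors of the covariance matrix, and each eigenvalue is the variance along its component. Below is a minimal standard-library sketch using the closed-form eigenvalues of a 2x2 symmetric matrix. The function name `explained_variance_ratio_2d` and the measurements are assumptions for illustration, not the iris data.

```python
import math
import statistics

def explained_variance_ratio_2d(xs, ys):
    """Share of total variance along each principal axis of 2-D data."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (half_trace + spread, half_trace - spread)
    total = a + c
    return tuple(value / total for value in eigenvalues)

# Made-up petal measurements in cm, for illustration only.
petal_length = [1.4, 1.5, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.3, 1.4, 1.5, 2.5, 1.9]
ratios = explained_variance_ratio_2d(petal_length, petal_width)
print(ratios)
```

When two variables are strongly correlated, as in this toy data, almost all of the variance lies along the first component, which is why a single combined axis can stand in for both.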

In [None]:
# Load the Iris dataset
df = pd.read_csv("data/iris.csv")  # forward slash works on Windows, macOS and Linux
df.head()

In [None]:
# Create a new PCA instance.

pca = dec.PCA(n_components=2)

In [None]:
X

In [None]:
#fit the data to pca
pca.fit(X) 


In [None]:
pca.explained_variance_ratio_

In [None]:
X_pca = pca.transform(X)

#### Explained variance ratios obtained from performing Principal Component Analysis (PCA)

In this case, the `array([0.92461621, 0.05301557])` represents the proportion of variance explained by each principal component. Running the PCA above resulted in two principal components. The variance ratios for these components are approximately 92.46% for the first and 5.30% for the second. This tells us that most of the variance in the original variables is retained by the first principal component. The second retains very little in comparison.

In [None]:
X_pca

In [None]:
# Create an empty plot.
fig, ax = plt.subplots()

# Plot scatter plot.
ax.plot(X_pca[:, 0], X_pca[:, 1], 'k.');

In [None]:
# Original classifications.
df_pca = pd.DataFrame(df[['class']])

# Show.
df_pca

In [None]:
# Incorporate our PCA variables.
df_pca['pca0'] = X_pca[:, 0]
df_pca['pca1'] = X_pca[:, 1]

# Show.
df_pca

In [None]:
# Pair plot.
sns.pairplot(df_pca, hue='class')

If we look at this plot we can say that KNN will still have a very difficult time separating versicolor and virginica. So it looks like reducing the variables by PCA has not yielded much.

In [None]:
# The scaled data.
df_X_trans

In [None]:
# Create a new PCA instance.
pca = dec.PCA(n_components=2)

# Fit the scaled data.
pca.fit(df_X_trans)

# Transform.
X_trans_pca = pca.transform(df_X_trans)

# Original classifications.
df_trans_pca = pd.DataFrame(df[['class']])

# Incorporate our PCA variables.
df_trans_pca['pca0'] = X_trans_pca[:, 0]
df_trans_pca['pca1'] = X_trans_pca[:, 1]

# Show.
df_trans_pca

In [None]:
# Look at the variance.
pca.explained_variance_ratio_

The explained variance ratios obtained from performing Principal Component Analysis (PCA) on the `df_X_trans` data are more balanced: about 75% to 25%, compared to about 92% to 5% earlier.
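
Why scaling changes the balance can be seen on a toy pair of variables. In this sketch (the helper names and data are hypothetical) one feature has a much larger numeric range than the other, so it dominates the unscaled PCA. After standardizing, the covariance matrix is proportional to the correlation matrix and the first component's share becomes (1 + |r|) / 2, where r is the correlation.

```python
import math
import statistics

def pca_ratio_first_component(xs, ys):
    """Fraction of total variance captured by the first principal component (2-D)."""
    a, c = statistics.variance(xs), statistics.variance(ys)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    # Largest eigenvalue of the 2x2 covariance matrix [[a, b], [b, c]].
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return largest / (a + c)

def zscores(column):
    """Standardize a column to mean 0 and population standard deviation 1."""
    mean, std = statistics.fmean(column), statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up data: the first feature has a much larger numeric range than the second.
wide = [10.0, 40.0, 20.0, 50.0, 30.0, 60.0]
narrow = [0.5, 0.1, 0.4, 0.3, 0.2, 0.6]

raw_ratio = pca_ratio_first_component(wide, narrow)
scaled_ratio = pca_ratio_first_component(zscores(wide), zscores(narrow))
print(raw_ratio, scaled_ratio)
```

On the raw data the first component is essentially the large-range feature on its own; after scaling, both features contribute and the split depends only on how correlated they are.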

In [None]:
# Pair plot.
sns.pairplot(df_trans_pca, hue='class')

### Conclusion

After preprocessing the data we get a good picture of how difficult it might be to classify the data. This last pair plot suggests that it will be very difficult to get a clean separation of versicolor and virginica by machine learning. We have succeeded in reducing the variables from four to two.

### References

[501]<a id='ref501'></a> (wikipedia.org, Principal component analysis, Dec 2023) Retrieved from: https://en.wikipedia.org/wiki/Principal_component_analysis

[502]<a id='ref502'></a> (wikipedia, Shapiro–Wilk test, Dec 2023) Retrieved from: https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

[503]<a id='ref503'></a> (scikit-learn.org, StandardScaler, Dec 2023) Retrieved from: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

[504]<a id='ref504'></a> (statistics.laerd.com, Testing for Normality using SPSS Statistics, Dec 2023) Retrieved from: https://statistics.laerd.com/spss-tutorials/testing-for-normality-using-spss-statistics.php



### END tasks

[Back to top of task](#Task-5)

[Back to top of notebook](#Tasks)
<hr style="border: 2px solid black" />