# **<span style="color:blue"> Weekly Tasks </span>**


#### This notebook contains the required weekly tasks for the completion of module Machine Learning and Statistics 23/24. 

***

## <ins><span style="color:blue"> Task 1 </span></ins>
> *Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such as `math`. In this task, you should write a function `sqrt(x)` to approximate the square root of a floating point number x without using the power operator or a package.*

>*Rather, you should use the Newton’s method. Start with an initial guess for the square root called $z_{0}$. You then repeatedly improve it using the following formula, until the difference between some previous guess $z_{i}$ and the next $z_{i+1}$ is less than some threshold, say 0.01.*

$$ z_{i+1} = z_i - \frac{z_i \times z_i - x}{2 z_i} $$

In [1]:
# The number that we want to get the square root of

x = 16

In [2]:
# Our initial guess for the square root
# Set as a floating point number, since the square root will be a float unless x is a perfect square
z = 5.0

Now, we use Newton's method. Below, using ``z=5`` as the initial guess, the next approximation of the square root of 16 is calculated to be 4.1. This is a much more accurate approximation, as 4.1 is much closer to the true square root of 16 than the initial guess of 5.

In [3]:
# Calculate the first approximation for the square root of 16

zfirst = z - (((z*z)-x) /(2*z))
zfirst

4.1

The next six cells of code show how feeding the previous approximation of the square root of 16 ($z_i$) back into the formula brings the next approximation ($z_{i+1}$) closer and closer to the correct value each time the code is run.

In [4]:
zsecond = zfirst - (((zfirst*zfirst)-x)/(2*zfirst))
zsecond

4.001219512195122

In [5]:
zthird = zsecond - (((zsecond*zsecond) -x)/(2*zsecond))
zthird

4.0000001858445895

In [6]:
zfourth = zthird - (((zthird*zthird) - x)/(2*zthird))
zfourth

4.000000000000004

In [7]:
zfifth = zfourth - (((zfourth*zfourth) - x)/(2*zfourth))
zfifth

4.0

In [8]:
zsixth = zfifth - (((zfifth*zfifth) - x)/(2*zfifth))
zsixth

4.0

In [9]:
zseventh = zsixth - (((zsixth*zsixth) - x)/(2*zsixth))
zseventh

4.0

To carry out the above using fewer lines of code, the following line could be used so that the value of `z` is overwritten every time the code is run:

`z = z - (((z*z) - x) / (2*z))`
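
The overwriting idea can be sketched as a short loop. This is only an illustration: the starting guess of 5.0 and the five iterations mirror the cells above, and `approximations` is a name introduced here to record each step.

```python
# Number whose square root we want, and the initial guess (as in the cells above)
x = 16
z = 5.0

# Hypothetical helper list that records each successive approximation
approximations = []

for _ in range(5):
    # Newton's update: overwrite z with the improved guess
    z = z - ((z * z) - x) / (2 * z)
    approximations.append(z)

print(approximations)
```

Each pass through the loop does the work of one of the separate cells above, so no new variable name is needed per step.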

#### <span style="color:blue"> Creating a function to calculate the square root of ``n`` using a ``for`` loop </span>

In [10]:
def sqrt(n): # Defining the function
    z = n/4.0 # Initial guess for the square root, stored in the variable z

    # Apply a fixed number of Newton's method updates
    for i in range(100):
        # Newton's method for a better approximation
        z = z - (((z*z)-n) /(2*z))

    return z

In [11]:
sqrt(16)

4.0

#### <span style="color:blue"> Creating a function to calculate the square root of ``n`` using a ``while`` loop </span>

The following code creates a function to calculate the square root of a number. The number whose square root we want is given the name <span style="color:red"> "n" </span>. Firstly, we use the ``def`` keyword followed by the name of the function we want to create, ``sqrt``. In the brackets we pass the parameter <span style="color:red">  "n" </span> into the function.

I have sourced code from [How to Think Like a Computer Scientist: Learning with Python 3](https://openbookproject.net/thinkcs/python/english3e/iteration.html) and lecture videos from  [McLoughlin. I, 2023](https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/stream.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F1%5Fgeneral%2Ft01v11%5Ftask%5Fone%5Fand%5Frepo%2Emkv&referrer=OneDriveForBusiness&referrerScenario=OpenFile) for the purpose of this task.

The first guess <span style="color:red"> n/2.0</span> is stored in the variable <span style="color:red"> "approx" </span>.

A ``while True`` loop then repeats the code inside it until the function returns.

Inside the loop, Newton's method is applied first. The new approximation $z_{i+1}$ is the current approximation $z_i$ minus a correction: the current approximation squared ($z_i \times z_i$) less $n$, all divided by twice the current approximation ($2z_i$). When the new approximation equals the current one, the correction is zero and the square root has been found (to floating point precision).

$$ z_{i+1} = z_i - \frac{z_i \times z_i - n}{2 z_i} $$

Take, for example, the current approximation <span style="color:red"> "z"</span> to be 2 and <span style="color:red"> "n"</span> to be 4 to demonstrate Newton's method:

$$ z_{i+1} = 2 - \frac{2 \times 2 - 4}{2 \times 2} $$

$$ z_{i+1} = 2 - \frac{4 - 4}{4} $$

$$ z_{i+1} = 2 - \frac{0}{4} $$

$$ z_{i+1} = 2 - 0 $$

$$ z_{i+1} = 2 $$

The new approximation equals the current one, so 2 is the square root of 4.


The next line of code checks whether the current approximation is equal to the new approximation and, if so, returns the new approximation. Otherwise the new approximation becomes the current one and the loop repeats.

Finally, to test the function, the ``print`` function is used and a figure (for example 27) is passed in to get its square root.

The whole function is demonstrated below:

In [12]:
def sqrt(n): # Defining the function
    approx = n/2.0 # Initial guess, stored in the variable approx
    while True: # Loop until the function returns
        newApprox = approx - (((approx*approx)-n)/(2*approx)) # Newton's method update
        if abs(approx - newApprox) == 0: # If approx and newApprox are identical, the guess has stopped changing
            return newApprox # Return the new approximation
        approx = newApprox # Otherwise carry the new approximation into the next iteration

# Testing the function
print(sqrt(27.0))

5.196152422706632


Lastly, to check that the output from the function above is correct, Python's power operator is used to calculate the square root of 27. This gives the same figure as the function above:

In [13]:
# Check Python's value for the square root of 27:
27**0.5

5.196152422706632
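
The task statement suggests stopping once successive guesses differ by less than a threshold such as 0.01, whereas the function above stops only on exact equality. A minimal sketch of a threshold-based variant follows. The name `sqrt_tol`, the `tolerance` parameter and the handling of zero and negative inputs are assumptions added here for illustration, not part of the original solution.

```python
def sqrt_tol(n, tolerance=0.01):
    """Approximate the square root of n with Newton's method (illustrative sketch)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0.0  # avoids dividing by zero in the update below

    approx = n / 2.0  # same initial guess as the function above
    while True:
        new_approx = approx - ((approx * approx) - n) / (2 * approx)
        # Stop once two successive guesses are closer than the tolerance
        if abs(approx - new_approx) < tolerance:
            return new_approx
        approx = new_approx


print(sqrt_tol(27.0))
```

A looser tolerance ends the loop a step or two earlier, at the cost of a slightly less precise answer.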

# <span style="color:blue">End</span>
***

## <ins><span style="color:blue"> Task 2 </span></ins>

> *Consider the below contingency table based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits. Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.*

<br>
<div align="center">

|             |            | *Biscuit*     |           |
|-------------|------------|---------------|-----------|
|             |            | **Chocolate** | **Plain** |
| ***Drink*** | **Coffee** | 43            | 57        |
|             | **Tea**    | 56            | 45        |

</div>

In [14]:
# Importing Libraries

# For Working With Dataframes
import pandas as pd

# For doing Statistics 
import scipy.stats as ss

# Shuffle the simulated data
import random as rd

# Plotting data
import seaborn as sns
import matplotlib.pyplot as plt

#### <span style="color:blue"> Synthesise a dataset </span>

Using the data from the contingency table above, we first synthesise a dataset using Python. Four lists are created, one for each cell of the table. Each list repeats an inner list of two strings, which are the names of the two categories.

In [15]:
# Create a list of coffee drinkers who prefer chocolate biscuits. There are 43 of them, so multiply the list by 43
# Put this list into another list so it generates 43 lists of coffee drinkers who prefer chocolate biscuits
# Store this in a variable called coffee_choc

coffee_choc = [['Coffee','Chocolate']]*43

# Show
coffee_choc

[['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ...

In [16]:
# Create a list of tea drinkers who prefer chocolate biscuits. There are 56 of them, so multiply the list by 56
# Put this list into another list so it generates 56 lists of tea drinkers who prefer chocolate biscuits
# Store this in a variable called tea_choc

tea_choc = [['Tea','Chocolate']]*56

# Show
tea_choc

[['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ...

In [17]:
# Create a list of coffee drinkers who prefer plain biscuits. There are 57 of them, so multiply the list by 57
# Put this list into another list so it generates 57 lists of coffee drinkers who prefer plain biscuits
# Store this in a variable called coffee_plain

coffee_plain = [['Coffee','Plain']]*57

# Show
coffee_plain

[['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ...

In [18]:
# Create a list of tea drinkers who prefer plain biscuits. There are 45 of them, so multiply the list by 45
# Put this list into another list so it generates 45 lists of tea drinkers who prefer plain biscuits
# Store this in a variable called tea_plain

tea_plain = [['Tea','Plain']]*45

# Show
tea_plain

[['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain']]
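
A side note on the list multiplication used in the cells above: multiplying an outer list repeats references to the same inner list rather than creating copies. This is harmless in this notebook because the inner lists are never modified, but the sketch below shows the difference. The names `rows` and `independent_rows` are hypothetical and used only for this illustration.

```python
# List multiplication repeats references to one inner list
rows = [['Tea', 'Plain']] * 3
print(rows[0] is rows[1])  # every element is the same object

# A list comprehension builds a separate inner list for each row
independent_rows = [['Tea', 'Plain'] for _ in range(3)]
print(independent_rows[0] is independent_rows[1])

# Changing the shared inner list shows up in every row
rows[0][1] = 'Chocolate'
print(rows)
```

If the rows ever needed to be edited individually, the list comprehension form would be the safer choice.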

#### <span style="color:blue">Concatenating the four lists</span>

Next, using the ``+`` operator, the four lists are [concatenated](https://www.freecodecamp.org/news/joining-lists-in-python-how-to-concat-lists/) (or merged) and the merged dataset is stored in a variable called raw_data. This creates a 2D list of all the raw data: an outer list containing inner lists, each of which contains two strings.

In [19]:
# Merge the four lists - creates a 2D list of all the raw data
raw_data = coffee_choc + tea_choc + coffee_plain + tea_plain

# Show
raw_data

[['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ...

#### <span style="color:blue">Shuffle the data</span>

At this point, it is very apparent that this is simulated data because of the ordered structure of the ``raw_data`` 2D list. To make it look more realistic, the [random.shuffle](https://www.w3schools.com/python/ref_random_shuffle.asp) function from the [random](https://www.w3schools.com/python/module_random.asp) library can be used to shuffle the data. It is the outer list rather than the inner lists that is shuffled, so the data within each inner list is not altered. This step would not be needed with real survey data, as responses would naturally be collected in no particular order.

In [20]:
# Shuffle the data
rd.shuffle(raw_data) 

#Show 
raw_data

[['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Coffee', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Plain'],
 ['Coffee', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Plain'],
 ['Coffee', 'Plain'],
 ['Tea', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Tea', 'Plain'],
 ['Tea', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Tea', 'Plain'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Tea', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Tea', 'Plain'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Chocolate'],
 ['Coffee', 'Plain'],
 ['Coffee', 'Chocolate'],
 ...
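
Because the shuffle is random, the order shown above will differ on every run. If a repeatable order is wanted, one option is a seeded generator, sketched below. The `sample` list is a small stand-in for `raw_data`, and the seed value 42 is arbitrary; both are assumptions made for this illustration.

```python
import random

# Small stand-in for raw_data (hypothetical values for illustration)
sample = [['Coffee', 'Chocolate'], ['Coffee', 'Plain'],
          ['Tea', 'Chocolate'], ['Tea', 'Plain']]

# A dedicated generator with a fixed seed gives the same order on every run
rng = random.Random(42)   # 42 is an arbitrary seed
shuffled = sample.copy()  # shuffle a copy so the original order is kept
rng.shuffle(shuffled)

print(shuffled)
```

Shuffling only changes the order of the rows, so the counts in each category are unaffected either way.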

#### <span style="color:blue">Zipping the data</span>

Next, the [zip() function ](https://www.programiz.com/python-programming/methods/built-in/zip) is used together with the ``*`` [operator ](https://initialcommit.com/blog/python-zip-two-lists#:~:text=With%20the%20use%20of%20the,is%20assigned%20to%20the%20variable.&text=This%20operator%20is%20often%20used%20with%20the%20zip()%20function%20in%20Python.) to zip the data. The ``*`` operator unpacks the argument (raw_data) so that each inner list is passed to ``zip()`` separately, and ``zip()`` returns the regrouped values as [tuples ](https://www.w3schools.com/python/python_tuples.asp).

The ``zip()`` function flips the data so that the outer and inner lists are swapped. Essentially, it turns the rows into columns and the columns into rows so that the [Pandas Library](https://www.w3schools.com/python/pandas/pandas_dataframes.asp) can read the data and turn it into a dataframe. The data is now in the form of two tuples: one holding every value for <span style="color:blue">drink</span> and one holding every value for <span style="color:blue">biscuit</span>.
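
The effect is easier to see on a small example. The three rows below are illustrative only, and the names `pairs`, `drinks` and `biscuits` are introduced here so as not to clash with the notebook's own variables.

```python
# Three illustrative rows in the same [drink, biscuit] layout as raw_data
pairs = [['Coffee', 'Chocolate'], ['Tea', 'Plain'], ['Tea', 'Chocolate']]

# zip(*pairs) is equivalent to zip(pairs[0], pairs[1], pairs[2])
drinks, biscuits = zip(*pairs)

print(drinks)    # ('Coffee', 'Tea', 'Tea')
print(biscuits)  # ('Chocolate', 'Plain', 'Chocolate')
```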






In [21]:
# Zip the list to flip the data so the outer and inner lists are swapped
# This gives two tuples - drink and biscuit

drink, biscuit =  list(zip(*raw_data))

# Show
drink, biscuit

(('Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Tea',
  'Tea',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Tea',
  'Tea',
  'Tea',
  'Tea',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Tea',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Coffee',
  'Tea',
  'Tea',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Coffee',
  'Tea',
  'Coffee',
  'Tea',
  'Tea',
  'Coffee',
  'Tea',
  'Coffee',
  'Tea',
  'Coffee',
  'Tea'

#### <span style="color:blue"> Using Pandas to create a dataframe from the simulated data </span>

Next, using the [Pandas Library](https://www.w3schools.com/python/pandas/pandas_intro.asp#:~:text=Pandas%20is%20a%20Python%20library,by%20Wes%20McKinney%20in%202008.), a [dataframe](https://www.w3schools.com/python/pandas/pandas_dataframes.asp) (a table made up of columns and rows) can be created from the simulated data above. The data is passed in as a [dictionary](https://www.geeksforgeeks.org/python-dictionary/) of key:value pairs, where the keys are the column names, 'Drink' and 'Biscuit', and the values are the data in the rows below them. The resulting dataframe contains 201 rows and 2 columns, where the rows are indexed from 0 through to 200 and the columns are labelled by the keys 'Drink' and 'Biscuit'.

In [22]:
# Create a Dataframe

df = pd.DataFrame({"Drink":drink, "Biscuit": biscuit})

# Show
df

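|     | Drink  | Biscuit   |
|-----|--------|-----------|
| 0   | Coffee | Chocolate |
| 1   | Coffee | Chocolate |
| 2   | Coffee | Plain     |
| 3   | Tea    | Chocolate |
| 4   | Tea    | Plain     |
| ... | ...    | ...       |
| 196 | Tea    | Plain     |
| 197 | Coffee | Plain     |
| 198 | Tea    | Chocolate |
| 199 | Coffee | Plain     |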


#### <span style="color:blue"> Contingency Table</span>


From the dataframe, the pandas ``crosstab`` function (SciPy provides a similar [crosstab contingency function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.crosstab.html)) gives the number of coffee drinkers who prefer chocolate biscuits, coffee drinkers who prefer plain biscuits, tea drinkers who prefer chocolate biscuits and, lastly, tea drinkers who prefer plain biscuits. The function creates a [contingency table](https://www.statology.org/contingency-table-python/) to illustrate the preferences. With ``margins=True``, the table also shows the total number of tea drinkers, coffee drinkers, chocolate biscuit eaters and plain biscuit eaters under the **All** headings. The total number of observations is 201, which matches the dataframe.

In [23]:
# Perform Crosstabs Contingency
# store as variable contingencyTable
contingencyTable = pd.crosstab(index=df['Drink'], columns=df['Biscuit'], margins=True)

# Show 
contingencyTable


| Drink \ Biscuit | Chocolate | Plain | All |
|-----------------|-----------|-------|-----|
| Coffee          | 43        | 57    | 100 |
| Tea             | 56        | 45    | 101 |
| All             | 99        | 102   | 201 |
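
As a sanity check, the same table can be rebuilt straight from the four counts in the task. The sketch below is an alternative construction, not the notebook's own method; the names `counts`, `respondents`, `check_df` and `observed` are hypothetical.

```python
import pandas as pd

# Counts from the task's contingency table
counts = {
    ('Coffee', 'Chocolate'): 43,
    ('Coffee', 'Plain'): 57,
    ('Tea', 'Chocolate'): 56,
    ('Tea', 'Plain'): 45,
}

# Expand the counts into one (drink, biscuit) row per respondent
respondents = [pair for pair, count in counts.items() for _ in range(count)]
check_df = pd.DataFrame(respondents, columns=['Drink', 'Biscuit'])

# Cross-tabulate without margins to recover the four observed counts
observed = pd.crosstab(index=check_df['Drink'], columns=check_df['Biscuit'])
print(observed)
```

Leaving out `margins=True` keeps only the four observed cells, which is the form a chi-squared test of independence works on.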


#### <span style="color:blue"> The statistical Test</span>

With the contingency table created above, the [scipy.stats.chi2_contingency function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) is used to calculate the chi-squared statistic and test for independence between the two variables, drink and biscuit.

In [24]:
chisquare = ss.chi2_contingency(contingencyTable, correction=False)

chisquare

Chi2ContingencyResult(statistic=3.113937364324669, pvalue=0.5389425856850661, dof=4, expected_freq=array([[ 49.25373134,  50.74626866, 100.        ],
       [ 49.74626866,  51.25373134, 101.        ],
       [ 99.        , 102.        , 201.        ]]))

In [25]:
# The expected frequencies if the variables are independent 
chisquare.expected_freq

array([[ 49.25373134,  50.74626866, 100.        ],
       [ 49.74626866,  51.25373134, 101.        ],
       [ 99.        , 102.        , 201.        ]])

#### <span style="color:blue"> Interpreting the results</span>

For the chi-squared test, the null hypothesis $H_0$ is that there is **no** relationship between the variables being tested. The alternative hypothesis $H_1$ is that there is a significant relationship between the two variables. $H_0$ is assumed until there is evidence to reject it [geeksforgeeks, 2023](https://www.geeksforgeeks.org/python-pearsons-chi-square-test/). The **significance level** generally chosen is 0.05. If the p-value from the chi-squared test is lower than 0.05, there is evidence to reject the null hypothesis $H_0$. If the p-value is greater than 0.05, we fail to reject the null hypothesis [McLeod, S. 2023](https://www.simplypsychology.org/p-value.html).


##### <span style="color:blue"> Where are the expected frequency values coming from?</span>

The expected frequency values above are generated on the assumption that the null hypothesis $H_0$ is true. In other words, the proportion of people preferring each biscuit type is assumed to be the same within each drink group as it is overall.

For example, the proportion of coffee drinkers who prefer chocolate biscuits should be the same as the overall proportion of people (irrespective of drink preference) who prefer chocolate biscuits.
Similarly, the proportion of tea drinkers who prefer plain biscuits should be the same as the overall proportion of people (irrespective of drink preference) who prefer plain biscuits.

Take the following two examples: 

##### <span style="color:blue"> Example 1</span>

In [26]:
# Calculate the proportion of people who preferred chocolate biscuits irrespective of drink type
# There are 99 people who prefer chocolate biscuits and 201 people in total 
preferchoc= 99/201
preferchoc 

0.4925373134328358

In [27]:
# Multiply the above proportion by the total number of people who drink coffee 
totalpreferchoc =100*(preferchoc)

totalpreferchoc

49.25373134328358

The result 49.25373134328358 is the same as the expected frequency generated above for coffee drinkers who prefer chocolate biscuits. So, under $H_0$, the expected proportion of coffee drinkers who prefer chocolate biscuits is the same as the proportion of people who prefer chocolate biscuits overall.

##### <span style="color:blue"> Example 2</span>

In [28]:
# Calculate the proportion of people who preferred plain biscuits irrespective of drink type
# There are 102 people who prefer plain biscuits and 201 people in total 
preferplain = 102/201
preferplain

0.5074626865671642

In [29]:
# Multiply the above proportion by the total number of people who drink tea
totalpreferplain = 101*(preferplain)
totalpreferplain

51.25373134328358

The result 51.25373134328358 is the same as the expected frequency generated above for tea drinkers who prefer plain biscuits. So, under $H_0$, the expected proportion of tea drinkers who prefer plain biscuits is the same as the proportion of people who prefer plain biscuits overall.
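
The two examples generalise to every cell: the expected count is the row total multiplied by the column total, divided by the grand total. A plain-Python sketch of that rule follows; the variable names are introduced here for illustration.

```python
# Observed counts (rows: Coffee, Tea; columns: Chocolate, Plain)
observed = [[43, 57],
            [56, 45]]

row_totals = [sum(row) for row in observed]        # [100, 101]
col_totals = [sum(col) for col in zip(*observed)]  # [99, 102]
grand_total = sum(row_totals)                      # 201

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```

The first row reproduces Example 1 and the second row reproduces Example 2.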

#### <span style="color:blue"> Comparing the expected frequencies with the contingency table</span>

When comparing the expected frequencies with the counts in the contingency table, we can see that the figures are not quite the same. Again, take for example the coffee drinkers who prefer chocolate biscuits. The contingency table shows that 43 coffee drinkers prefer chocolate biscuits. The expected frequency table shows 49.25373134.

The same is true of the tea drinkers and plain biscuits. The contingency table shows that 45 tea drinkers prefer plain biscuits and the expected frequency is 51.25373134.

The p-value is the probability of seeing a result at least as extreme as this one if $H_0$ were true. The value reported above, pvalue=0.5389425856850661, was calculated on 4 degrees of freedom because the **All** totals were included in the table passed to `chi2_contingency`. The totals add nothing to the statistic, since their observed and expected values are equal, so the statistic of about 3.11 stands. On the 1 degree of freedom appropriate to a 2×2 table, however, the p-value is approximately 0.08. Either way it is greater than 0.05, so we fail to reject $H_0$: there is no significant evidence of an association between drink preference and biscuit preference.
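
A sketch of the test run on the four observed counts alone, without the **All** totals, is given below. The name `observed_2x2` is hypothetical, and `correction=False` is kept to match the call used earlier in the notebook.

```python
import scipy.stats as ss

# Observed counts only, without the "All" totals
# (rows: Coffee, Tea; columns: Chocolate, Plain)
observed_2x2 = [[43, 57],
                [56, 45]]

# The result unpacks into statistic, p-value, degrees of freedom and expected counts
statistic, pvalue, dof, expected = ss.chi2_contingency(observed_2x2, correction=False)

print(statistic)  # chi-squared statistic
print(pvalue)     # p-value on 1 degree of freedom
print(dof)        # (2 - 1) * (2 - 1) = 1
```

The statistic matches the one reported earlier, while the degrees of freedom and the p-value change.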

# <span style="color:blue"> End</span>
***

## <ins><span style="color:blue"> Task 3 </span></ins>

> *Perform a t-test on the famous [penguins dataset](https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv) to investigate whether there is evidence of a significant difference in the body mass of male and female gentoo penguins.*

### <span style ="color:blue">Introduction</span>

The *t-distribution* was first derived in 1876 by [Friedrich Robert Helmert](https://en.wikipedia.org/wiki/Friedrich_Robert_Helmert) (1843-1917) and [Jacob Lüroth](https://en.wikipedia.org/wiki/Jacob_L%C3%BCroth) (1844-1910). However, it was developed further by [William Sealy Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset) (1876-1937), an English statistician, and became known as the [Student's t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) [Wikipedia (2023)](https://en.wikipedia.org/wiki/Student%27s_t-test).

<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/William_Sealy_Gosset.jpg/186px-William_Sealy_Gosset.jpg"
         alt="William Sealy Gosset">
    <figcaption>William Sealy Gosset</figcaption>
    <figcaption>https://en.wikipedia.org/wiki/William_Sealy_Gosset</figcaption>
</figure>


### <span style ="color:blue">Types of *t-tests*</span>

According to [JMP Statistical Discovery](https://www.jmp.com/en_ch/statistics-knowledge-portal/t-test.html), there are **three** types of *t-test*:

- the one-sample *t-test*
- the two-sample *t-test*
- the paired *t-test*

For all *t-tests* it is assumed that:
1. The data are continuous.
2. The sample data have been randomly sampled from a population.
3. There is homogeneity of variance (i.e., the variability of the data in each group is similar).
4. The distribution is approximately normal.

To delve further into point 2 and point 3 above, it is assumed that the values in both samples are **IID**. This means that all of the values in each of the samples are **1)** independent, meaning that there is no connection between the values and that one value has not influenced the next value in any way, and **2)** identically distributed, meaning that all values in the sample have the same probability distribution. In other words, all of the values come from a distribution that has the same mean ($\mu$) and the same standard deviation ($\sigma$) [Wikipedia, 2023](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables).


For the purpose of this task, a two-sample test will be used, where the null hypothesis $H_0$ is that the means of the two populations are the same. The *t-test* will be used to establish whether there is evidence that the means of the two populations differ, given a sample from each.
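
A minimal sketch of how such a two-sample test might be set up is shown below. The function name `gentoo_body_mass_ttest` is hypothetical, the column names `species`, `sex` and `body_mass_g` are assumed to match the linked CSV, and the small `toy` dataframe contains made-up numbers purely to show the call; it does not contain real penguin measurements.

```python
import pandas as pd
import scipy.stats as ss


def gentoo_body_mass_ttest(penguins):
    """Two-sample t-test on body mass of male vs female Gentoo penguins (sketch)."""
    gentoo = penguins[penguins['species'] == 'Gentoo'].dropna(subset=['sex', 'body_mass_g'])
    sex = gentoo['sex'].str.upper()  # tolerate 'Male' or 'MALE' spellings
    male = gentoo.loc[sex == 'MALE', 'body_mass_g']
    female = gentoo.loc[sex == 'FEMALE', 'body_mass_g']
    # Independent two-sample t-test; equal variances are assumed by default
    return ss.ttest_ind(male, female)


# Tiny made-up frame purely to show the call; these are NOT real measurements
toy = pd.DataFrame({
    'species': ['Gentoo'] * 6 + ['Adelie'] * 2,
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
    'body_mass_g': [5500.0, 5600.0, 5400.0, 4700.0, 4600.0, 4800.0, 4000.0, 3400.0],
})

t_statistic, p_value = gentoo_body_mass_ttest(toy)
print(t_statistic, p_value)
```

With the real dataset, the dataframe loaded from the CSV would be passed in place of `toy`, and the p-value compared with the chosen significance level.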

# **<span style="color:blue"> References </span>**
***

#### Task 1


Document 360. How to change the color of the text in Markdown? 30 March 2023. <https://docs.document360.com/docs/how-to-change-the-color-of-the-text-in-markdown>.


Markdown Basics. 04 October 2023. <https://docs.document360.com/docs/how-to-change-the-color-of-the-text-in-markdown>


Overleaf. How do I use Subscripts? 2023. <https://www.overleaf.com/learn/latex/Questions/How_do_I_use_subscripts%3F>


Soualem, Nadir. Math-Linux.com. 02 April 2023. <https://www.math-linux.com/latex-26/faq/latex-faq/article/latex-derivatives-limits-sums-products-and-integrals>


Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers. How to Think Like a Computer Scientist: Learning with Python 3. October 2012. https://openbookproject.net/thinkcs/python/english3e/iteration.html

Dionysia Lemonaki. While Loops in Python - While True Loop Statement Example. 19 July 2022. https://www.freecodecamp.org/news/while-loops-in-python-while-true-loop-statement-example/#while-true

Ian McLoughlin. t01v11_task_one_and_repo. 13 September 2023. 
https://atlantictu-my.sharepoint.com/personal/ian_mcloughlin_atu_ie/_layouts/15/stream.aspx?id=%2Fpersonal%2Fian%5Fmcloughlin%5Fatu%5Fie%2FDocuments%2Fstudent%5Fshares%2Fmachine%5Flearnning%5Fand%5Fstatistics%2F1%5Fgeneral%2Ft01v11%5Ftask%5Fone%5Fand%5Frepo%2Emkv&referrer=OneDriveForBusiness&referrerScenario=OpenFile


Matt Cone. Markdown Guide. Extended Syntax. 2023. https://www.markdownguide.org/extended-syntax/


How To Center Things In Markdown. Markdown Land 02 November 2021. https://markdown.land/markdown-center


Matt Ball. Stackoverflow How do I ensure that whitespace is preserved in Markdown?. 30 March 2013. https://stackoverflow.com/questions/15721373/how-do-i-ensure-that-whitespace-is-preserved-in-markdown#:~:text=To%20preserve%20spaces%20in%20a,line%20break%20at%20its%20position%22.

#### Task 2
W3Schools Python Join Two Lists. 2023. https://www.w3schools.com/python/gloss_python_join_lists.asp

Programiz Python zip(). 2023. https://www.programiz.com/python-programming/methods/built-in/zip

Initial commit. Python Zip Two Lists. 14 November 2021. https://initialcommit.com/blog/python-zip-two-lists#:~:text=With%20the%20use%20of%20the,is%20assigned%20to%20the%20variable.&text=This%20operator%20is%20often%20used%20with%20the%20zip()%20function%20in%20Python.


W3Schools. Python Tuples. 2023. https://www.w3schools.com/python/python_tuples.asp


Codecademy. Docs/Markdown/Links. 22 August 2023. https://www.codecademy.com/resources/docs/markdown/links

W3Schools. Pandas Dataframes. 2023. https://www.w3schools.com/python/pandas/pandas_dataframes.asp


W3Schools. Pandas Introduction. 2023. https://www.w3schools.com/python/pandas/pandas_intro.asp#:~:text=Pandas%20is%20a%20Python%20library,by%20Wes%20McKinney%20in%202008.


Shittu Olumide. freeCodeCamp. Joining Lists in Python - How to Concat Lists. 14 March 2023. https://www.freecodecamp.org/news/joining-lists-in-python-how-to-concat-lists/


w3schools. Python Random shuffle() method. 2023. https://www.w3schools.com/python/ref_random_shuffle.asp


w3schools. Python Random Module. 2023. https://www.w3schools.com/python/module_random.asp


geeksforgeeks. Python Dictionary. 2023. https://www.geeksforgeeks.org/python-dictionary/


scipy.org. scipy.stats.contingency.crosstab. 2023. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.crosstab.html

Zach. statology. How to create a Contingency Table in Python. 12 March 2023. https://www.statology.org/contingency-table-python/


scipy.org. scipy.stats.chi2_contingency. 2023. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html


geeksforgeeks. 2023. https://www.geeksforgeeks.org/python-pearsons-chi-square-test/


McLeod. S. SimplyPsychology. P-Value And Statistical Significance: What It Is & Why It Matters. 13/10/2023. https://www.simplypsychology.org/p-value.html



#### Task 3
Cone.M. Markdown Guide. Hacks. 2023. https://www.markdownguide.org/hacks/#image-captions


JMP Statistical Discovery. Statistics Knowledge Portal. 2023. The t-test. https://www.jmp.com/en_ch/statistics-knowledge-portal/t-test.html


Wikipedia. Independent and identically distributed random variables. 2023. https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables

# **<span style="color:blue"> End </span>**
***