# Machine Learning and Statistics 2023

Student: Lais Coletta Pereira

Lecturer: Ian Mcloughlin

This Jupyter notebook contains my submissions for the tasks in the Machine Learning and Statistics module of the Higher Diploma of Data Analysis at ATU.

## Task 1

> Square roots are difficult to calculate. In Python, you typically
use the power operator (a double asterisk) or a package such
as `math`. In this task, you should write a function `sqrt(x)` to
approximate the square root of a floating point number x without
using the power operator or a package.

> Rather, you should use Newton's method. Start with an initial guess for the square root called $z_0$. You then repeatedly improve it using the following formula, until the difference between some previous guess $z_i$ and the next $z_{i+1}$ is less than some threshold, say 0.01.

$$ z_{i+1} = z_i - \frac{z_i \times z_i - x}{2 z_i} $$

In [1]:
import numpy as np

In [2]:
def sqrt(x):
    # Initial guess for the square root.
    z = x / 4.0
    # Apply a fixed number of Newton iterations; 100 is far more
    # than the guess needs to settle for typical inputs.
    for i in range(100):
        z = z - ((z * z) - x) / (2 * z)
    # z should now be a good approximation of the square root.
    return z

In [3]:
# test the function on 3
sqrt (3)

1.7320508075688774

In [4]:
# Check Python's value for square root of 3

3**0.5

1.7320508075688772
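
The task statement asks for the loop to stop once two successive guesses differ by less than a threshold, whereas the function above always runs 100 iterations. A minimal sketch of the threshold-based variant follows. The name `sqrt_threshold`, the `max_iter` safety cap, and the handling of zero and negative inputs are my own assumptions and are not part of the task.

```python
def sqrt_threshold(x, threshold=0.01, max_iter=100):
    """Approximate the square root of x with Newton's method.

    Stops when two successive guesses differ by less than `threshold`.
    `max_iter` is an assumed safety cap so the loop always terminates.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        # Avoid dividing by zero in the update step.
        return 0.0
    # Same initial guess as the function above.
    z = x / 4.0
    for _ in range(max_iter):
        z_next = z - (z * z - x) / (2 * z)
        # Stop once the improvement is smaller than the threshold.
        if abs(z_next - z) < threshold:
            return z_next
        z = z_next
    return z


approx = sqrt_threshold(3)
print(approx)
```

With the default threshold of 0.01 the loop stops after only a few iterations, so the result agrees with `3**0.5` to fewer decimal places than the 100-iteration version. A smaller threshold trades extra iterations for more accuracy.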

## Task 2

Consider the contingency table below, based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits. Use `scipy.stats` to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.

|      |     |      Biscuit    |          |
| ---- | ---------- | --------- | -------- |
|      |            | Chocolate | Plain    |
| Drink | Coffee    | 43        | 57       |
|      | Tea       | 56        | 45       |


In [5]:
import scipy.stats as ss

# Create a contingency table from the task example
contingency_table = np.array([[43, 57],
                                 [56, 45]])

# Do the chi-squared test
result = ss.chi2_contingency(contingency_table, correction=False)

# Show.
result


(3.113937364324669,
 0.07762509678333357,
 1,
 array([[49.25373134, 50.74626866],
        [49.74626866, 51.25373134]]))

The array of expected frequencies shows the counts we would expect in each cell if there were no relationship between the variables. If drink and biscuit preference were independent, the proportion of coffee drinkers who prefer chocolate biscuits would match the proportion in the sample overall. In the observed data, coffee drinkers lean towards plain biscuits and tea drinkers towards chocolate biscuits, but the test statistic of about 3.11 with 1 degree of freedom gives a p-value of about 0.078. This is above the usual 0.05 threshold, so there is no statistically significant evidence of an association between drink preference and biscuit preference in this sample.
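
To make the reasoning about expected counts concrete, the sketch below rebuilds the expected frequencies by hand from the row and column totals and compares the p-value with a significance level. The 0.05 level and the variable names are assumptions chosen for illustration; the task does not specify a level.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are Coffee, Tea; columns are Chocolate, Plain.
observed = np.array([[43, 57],
                     [56, 45]])

# Under independence, expected count = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Same test as above, without Yates' continuity correction.
statistic, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Assumed significance level of 5%.
alpha = 0.05
reject_null = p_value < alpha

print(expected_manual)
print(reject_null)
```

The hand-computed array should match the `expected` array returned by `chi2_contingency`, which is a quick check that the table was entered in the intended orientation.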

## Task 3

Perform a t-test on the famous penguins data set to investigate whether there is evidence of a significant difference in the body mass of male and female Gentoo penguins.

In [6]:
import pandas as pd

# Read the dataset and create a variable
penguins = pd.read_csv('penguins.csv')

print (penguins)



    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    MALE  
1         3800.0  FEMALE  
2     

Get an overview of the dataset, including data types and missing values:


In [7]:
print(penguins.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None


Perform the t-test:

In [8]:
# Drop rows with a missing body mass from the original DataFrame
penguins.dropna(subset=['body_mass_g'], inplace=True)

# Select the rows for Gentoo penguins
gentoo_penguins = penguins[penguins["species"] == "Gentoo"]

# Split the Gentoo penguins into male and female
male_penguins = gentoo_penguins[gentoo_penguins["sex"] == "Male"]
female_penguins = gentoo_penguins[gentoo_penguins["sex"] == "Female"]

# Perform a t-test
result = ss.ttest_ind(male_penguins["body_mass_g"], female_penguins["body_mass_g"], equal_var=False, nan_policy='raise')


# Display the results
print (result)

Ttest_indResult(statistic=nan, pvalue=nan)


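The result I am receiving is `NaN`. The filters compare `sex` against "Male" and "Female", but the printed rows show that the dataset stores these values in upper case ("MALE", "FEMALE"). Both selections are therefore empty, and a t-test on two empty samples returns `NaN`.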
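
A minimal sketch of a version that matches the upper-case labels seen in the printed rows (`MALE`, `FEMALE`) is given below. The helper name `gentoo_mass_ttest` and the small `demo` frame are hypothetical and exist only to show the call; the numbers in `demo` are made up and are not taken from `penguins.csv`.

```python
import pandas as pd
from scipy import stats


def gentoo_mass_ttest(df):
    """Welch's t-test on body mass of male vs female Gentoo penguins."""
    # Keep Gentoo rows that have both a body mass and a recorded sex.
    gentoo = df[df["species"] == "Gentoo"].dropna(subset=["body_mass_g", "sex"])
    # Normalise the case so "Male", "male" and "MALE" all match.
    sex = gentoo["sex"].str.upper()
    male = gentoo.loc[sex == "MALE", "body_mass_g"]
    female = gentoo.loc[sex == "FEMALE", "body_mass_g"]
    # equal_var=False gives Welch's t-test, as in the cell above.
    return stats.ttest_ind(male, female, equal_var=False)


# Hypothetical, made-up rows used only to demonstrate the function.
demo = pd.DataFrame({
    "species": ["Gentoo"] * 6 + ["Adelie"],
    "sex": ["MALE", "MALE", "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE"],
    "body_mass_g": [5500.0, 5700.0, 5400.0, 4700.0, 4600.0, 4800.0, 3750.0],
})
demo_result = gentoo_mass_ttest(demo)
print(demo_result)
```

On the real data the call would be `gentoo_mass_ttest(penguins)`. Dropping rows with a missing `sex` inside the helper also means the nine missing values in that column cannot leak into either group.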

In [9]:
# Drop rows with null values
penguins_cleaned = penguins.dropna()

# Display the cleaned dataset
print("Cleaned Dataset:")
print(penguins_cleaned)

Cleaned Dataset:
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
4    Adelie  Torgersen            36.7           19.3              193.0   
5    Adelie  Torgersen            39.3           20.6              190.0   
..      ...        ...             ...            ...                ...   
338  Gentoo     Biscoe            47.2           13.7              214.0   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    MALE  
1         3800.0

In [10]:
# Check for null values in the dataset
null_values = penguins.isnull()

# Sum the null values for each column to count the missing values
missing_values_count = null_values.sum()

# Display the count of missing values for each column
print("Missing Values Count for Each Column:")
print(missing_values_count)

Missing Values Count for Each Column:
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  9
dtype: int64
