# Machine Learning and Statistics Tasks

***

Roberto Vergano

## Task 1. Newton's method

***

Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such
as math. In this task,(1) you should write a function sqrt(x) to approximate the square root of a floating-point number x without
using the power operator or a package.

Rather, you should use Newton's method.(2) Start with an initial guess for the square root called z0. You then repeatedly improve it using the following formula, until the difference between
some previous guess zi and the next zi+1 is less than some threshold, say 0.01.

In [3]:
# For LaTex code:
from IPython.display import display, Math
latex_code = r"z_{i+1} = z_i - \frac{{z_i*z_i - x}}{{2z_i}}"
display(Math(latex_code))

$$z_{i+1} = z_i - \frac{z_i^2 - x}{2 z_i}$$

Where:

zi = current estimate of the square root.  
zi+1 = next estimate of the square root.  
x = the number whose square root is being approximated.  

**How does the formula work?**

1. Initial guess. Choose a starting value z0.
2. Iteration. For each iteration (i), the formula computes zi+1 from the current estimate zi.
3. Convergence. The iteration repeats until zi+1 is close enough to zi, that is, until successive estimates differ by less than the chosen threshold. At that point the estimate is close to the root of the equation z*z - x = 0, which is the square root of x.
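
For context, the update rule is what Newton's method gives when it is applied to a function whose positive root is the square root of x. A short derivation, where the name f is introduced here only for illustration:

$$
f(z) = z^2 - x, \qquad f'(z) = 2z, \qquad z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)} = z_i - \frac{z_i^2 - x}{2 z_i}
$$

Each step replaces the current estimate with the point where the tangent line to f at zi crosses zero.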

In [2]:
# Function "get_positive_float" keeps zero and negative numbers out of the formula.
# Using a while loop, the function keeps asking the user until a positive floating-point number is entered.
def get_positive_float():
    while True:
        try:
            x = float(input("Enter a positive floating number: "))
            if x > 0:
                return x
            else:
                print("That is not a positive floating number. Please try again.")
        except ValueError:
            print("Invalid input. Please enter a valid floating-point number.")

# Function "sqrt" takes the user's input and approximates its square root using Newton's method.
def sqrt(x):
    # Initial guess
    z = x / 4.0

    # Iterative approximation using Newton's method.
    # A fixed number of iterations is used here instead of a stopping threshold.
    for i in range(100):
        z = z - (((z*z) - x) / (2*z))

    return z

# Get a positive floating-point number from the user
x = get_positive_float()

# Calculate and print the square root
result = sqrt(x)
print(f"The square root of {x} is approximately {result}")


The square root of 25.0 is approximately 5.0
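
The function above always runs 100 iterations. The stopping rule described in the task (stop once two successive guesses differ by less than a threshold) could be sketched as follows. The name `newton_sqrt` and the parameters `threshold` and `max_iter` are illustrative choices, not part of the original solution.

```python
def newton_sqrt(x, threshold=0.01, max_iter=100):
    """Approximate the square root of a positive number x with Newton's method."""
    z = x / 4.0  # same initial guess as the sqrt function above

    for _ in range(max_iter):
        z_next = z - (z * z - x) / (2 * z)
        # Stop as soon as two successive guesses are closer than the threshold.
        if abs(z_next - z) < threshold:
            return z_next
        z = z_next

    # Fall back to the last guess if the threshold was never reached.
    return z


print(newton_sqrt(25.0))
print(newton_sqrt(2.0))
```

Because the threshold is an absolute difference, 0.01 is a coarse stopping rule for very small inputs. A smaller threshold gives more accurate results at the cost of a few extra iterations.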


**References**

1. A Tour of Go. Aug. 18, 2023. URL: https://go.dev/tour/flowcontrol/8 (visited on 08/18/2023).
2. Square Roots via Newton's Method. Feb. 4, 2015. URL: https://math.mit.edu/~stevenj/18.335/newtonsqrt.pdf (visited on 08/18/2023).


## Task 2. Chi-squared test

***

Consider the below contingency table based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits. Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.

To create the dataframe:

In [2]:
import pandas as pd

# Create a DataFrame
data = {
    'Chocolate': [43, 56],
    'Plain': [57, 45]
}

index = ['Coffee', 'Tea']

df = pd.DataFrame(data, index=index)

# Define a style function to add borders to the table:
def borders(s):
    return 'border: 1px solid black;'

# Apply the style function to the DataFrame
df_border = df.style.applymap(borders)

# Display the styled DataFrame
df_border

|        | Chocolate | Plain |
|--------|-----------|-------|
| Coffee | 43        | 57    |
| Tea    | 56        | 45    |


For this table, we have 2 nominal variables:  
- biscuit (chocolate/plain)  
- drink (coffee/tea)  

To calculate the chi-squared test, we transform the dataframe into a NumPy array and then run the chi-squared test for contingency tables from scipy.stats.(1) Finally, we compare the chi-squared results with and without the continuity correction.

In [12]:
import numpy as np
from scipy.stats import chi2_contingency
from tabulate import tabulate

# Create numpy array with the contingency table
observed_data = np.array([[43, 57], [56, 45]])

# chi-squared test with "correction=True"
chi2_stat, p_val, dof, expected = chi2_contingency(observed_data, correction=True)

# chi-squared test with "correction=False"
chi2_stat_no_correction, p_val_no_correction, dof_no_correction, expected_no_correction = chi2_contingency(observed_data, correction=False)

# To display the results in a table
table_data = [
    ["Chi-squared Statistic", chi2_stat, chi2_stat_no_correction],
    ["P-value", p_val, p_val_no_correction],
    ["Degrees of Freedom", dof, dof_no_correction],
    ["Expected Frequencies", ""],
    ["", expected[0, 0], expected_no_correction[0, 0]],
    ["", expected[0, 1], expected_no_correction[0, 1]],
    ["", expected[1, 0], expected_no_correction[1, 0]],
    ["", expected[1, 1], expected_no_correction[1, 1]],
]

table = tabulate(table_data, headers=["Statistic", "Value (correction=True)", "Value (correction=False)"], tablefmt="fancy_grid", colalign=("center", "center", "center"))

print(table)

╒═══════════════════════╤═══════════════════════════╤════════════════════════════╕
│       Statistic       │  Value (correction=True)  │  Value (correction=False)  │
╞═══════════════════════╪═══════════════════════════╪════════════════════════════╡
│ Chi-squared Statistic │    2.6359100836554257     │          3.11394           │
├───────────────────────┼───────────────────────────┼────────────────────────────┤
│        P-value        │    0.10447218120907394    │         0.0776251          │
├───────────────────────┼───────────────────────────┼────────────────────────────┤
│  Degrees of Freedom   │             1             │             1              │
├───────────────────────┼───────────────────────────┼────────────────────────────┤
│ Expected Frequencies  │                           │                            │
├───────────────────────┼───────────────────────────┼────────────────────────────┤
│                       │     49.25373134328358     │          49.2537           │
├───

Results:  
1. Chi-squared statistic: both values (2.64 with the correction, 3.11 without) are below 3.84, the critical value for 1 degree of freedom at the 0.05 significance level, so the observed counts do not deviate enough from the expected counts to reject independence.  
2. P-value: taking a p-value <= 0.05 as statistically significant, both results (0.104 with the correction, 0.078 without) are above that threshold, so there is no evidence of an association between drink and biscuit preference.  
3. Correction (True/False): the continuity correction lowers the statistic and raises the p-value, but it does not change the overall conclusion.
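
As a cross-check of the values reported by chi2_contingency above, the statistic can be recomputed by hand using only the standard library. This is a minimal sketch, not how SciPy implements the test: the helper names `chi2_statistic` and `p_value_dof1` are hypothetical, and the p-value formula used here is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

observed = [[43, 57], [56, 45]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]


def chi2_statistic(observed, expected, yates=False):
    """Pearson chi-squared statistic, optionally with Yates' continuity correction."""
    stat = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Shrink each deviation by 0.5, but never past zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff**2 / e
    return stat


def p_value_dof1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


for yates in (True, False):
    stat = chi2_statistic(observed, expected, yates)
    print(f"correction={yates}: chi2={stat:.5f}, p={p_value_dof1(stat):.5f}")
```

The printed statistics and p-values should agree with the table above to within rounding.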

**References**  
1. SciPy documentation, scipy.stats.chi2_contingency: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html  

## Task 3. t-test

***

Perform a t-test on the famous penguins data set to investigate whether there is evidence of a significant difference in the body mass of male and female gentoo penguins.

The penguins dataset was downloaded from https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

First, we need to load the dataset using pandas. Then, we need to find the target columns and prepare the samples on which we will perform the t-test.

In [39]:
# "read_csv" from pandas to open and read the dataset.
import pandas as pd 
dt = pd.read_csv("datasets/penguins.csv")

In [40]:
# info() prints a summary of the dataset.
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


Our target columns are 0 ("species"), 5 ("body_mass_g") and 6 ("sex"). Therefore, we first extract our target population from the dataset (Gentoo penguins), then separate it by sex and select the variable we want to test ("body_mass_g").

In [53]:
# Describe the "species" column. 
dt["species"].describe()

count        344
unique         3
top       Adelie
freq         152
Name: species, dtype: object

The "unique" value indicates that the column contains 3 distinct categories, corresponding to 3 different species. 

In [61]:
# First, store the Gentoo penguins in the "df" variable.
df = dt.loc[dt["species"]=="Gentoo"]

# Separate by sex:

male = df.loc[df["sex"] == "MALE"]
female = df.loc[df["sex"] == "FEMALE"]

# Count the number of samples in each sex:
m_sample = male.shape[0]
f_sample = female.shape[0]
print("Number of male samples:", m_sample)
print("Number of female samples:", f_sample)

Number of male samples: 61
Number of female samples: 58


The male group has 3 more samples than the female group (rows with a missing sex value fall into neither group). An independent two-sample t-test does not require equal group sizes, but to keep both groups the same size we randomly select 58 samples from the male group.

In [69]:
# Sample from pandas to select 58 random rows.
male_random = male.sample(58)
male_random.shape[0]

58
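
Because `sample` draws different rows on each run, the statistics that follow can change slightly between runs. A minimal sketch of making the draw reproducible with the `random_state` parameter, using a hypothetical toy DataFrame in place of the Gentoo male rows:

```python
import pandas as pd

# Hypothetical toy data standing in for the Gentoo male rows.
toy = pd.DataFrame({"body_mass_g": [5000.0, 5200.0, 5400.0, 5600.0, 5800.0]})

# random_state fixes the seed, so the same rows are drawn on every run.
first = toy.sample(n=3, random_state=42)
second = toy.sample(n=3, random_state=42)

# True when both draws selected the same rows in the same order.
print(first.index.equals(second.index))
```

The seed value 42 is arbitrary. Any fixed integer gives a repeatable selection.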

Finally, we need to get the data only for the body_mass column for each group. 

In [83]:
male_mass = male_random["body_mass_g"]
female_mass = female["body_mass_g"]
print("Gentoo male body mass descriptive statistics:")
print(male_mass.describe())
print()
print("Gentoo female body mass descriptive statistics:")
print(female_mass.describe())

Gentoo male body mass descriptive statistics:
count      58.000000
mean     5488.793103
std       309.788143
min      4750.000000
25%      5300.000000
50%      5500.000000
75%      5700.000000
max      6300.000000
Name: body_mass_g, dtype: float64

Gentoo female body mass descriptive statistics:
count      58.000000
mean     4679.741379
std       281.578294
min      3950.000000
25%      4462.500000
50%      4700.000000
75%      4875.000000
max      5200.000000
Name: body_mass_g, dtype: float64


On average, the body mass of the male Gentoo penguins is greater than that of the females (5488.8 g vs 4679.7 g). Let's perform the t-test to find out whether this difference is statistically significant.

**T-test**  


In [86]:
from scipy.stats import ttest_ind

# t-test:
t_statistic, p_value = ttest_ind(male_mass, female_mass)

# Display the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Check for statistical significance (assuming significance level of 0.05)
if p_value < 0.05:
    print("The difference is statistically significant.")
else:
    print("The difference is not statistically significant.")

T-statistic: 14.718211530302684
P-value: 4.015926522817719e-28
The difference is statistically significant.


Results:  
1. The large T-statistic indicates a big difference between the two sample means relative to the variability within the groups.  
2. The p-value, far below 0.05, indicates that this difference is statistically significant.  
3. Therefore, there is evidence of a statistically significant difference in body mass between male and female Gentoo penguins.
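
As a rough cross-check, the t statistic can be recomputed from the descriptive statistics printed above. This sketch assumes the pooled-variance (equal variances) form of the test, which is the default of ttest_ind. The numbers are copied from that output, so they apply only to that particular random draw of male samples.

```python
import math

# Summary statistics copied from the describe() output above.
n_m, mean_m, std_m = 58, 5488.793103, 309.788143
n_f, mean_f, std_f = 58, 4679.741379, 281.578294

# Pooled variance under the equal-variance assumption.
dof = n_m + n_f - 2
pooled_var = ((n_m - 1) * std_m**2 + (n_f - 1) * std_f**2) / dof

# Standard error of the difference between the two means.
std_err = math.sqrt(pooled_var * (1 / n_m + 1 / n_f))

t_stat = (mean_m - mean_f) / std_err
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

The result should agree with the T-statistic reported above, up to the rounding of the summary statistics.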

**References**
1. SciPy documentation, scipy.stats.ttest_ind: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html