# Statistical Data Management Session 9: Inferences Based on a Single Sample: Tests of Hypothesis (Chapter 8 in McClave & Sincich)


**We expect you to be able to solve these exercises both with and without Python, except of course for part 1 of exercise 3, which requires Python.**

## 1. The European ℮ Standard

Prepackaged items in the EU may bear the ℮-mark to show that they conform to EU weight standards (see https://europa.eu/youreurope/business/product-requirements/labels-markings/emark/index_en.htm).

1. To test the claim of your favourite crisp brand that their packages contain 120 g, you weigh the contents of 20 packages and find $\bar{x} = 119.5$ g and $s = 0.8$ g. Is this brand complying with EU regulations? You may assume the weights follow a normal distribution.

    The Council Directive of 20 January 1976 "on the approximation of the laws of the Member States relating to the making-up by weight or by volume of certain prepackaged products" OJ L 046 21.2.1976, p. 1 stipulates a one-sided t-test at significance level $\alpha = 0.005$ (i.e. confidence level $1 - \alpha = 0.995$).

    Use both Python and the t-table!

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
import time
%matplotlib inline



2. Now assume you have information that the machines that fill the packages do this with a standard deviation $\sigma = 0.8$. Perform the test again!
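
A minimal sketch of how both parts could be set up in Python, assuming the hypotheses $H_0: \mu = 120$ versus $H_a: \mu < 120$ (our reading of the claim; the exercise does not spell them out). The variable names are illustrative only.

```python
import math
from scipy import stats as sts

# Summary statistics taken from the exercise text
n, x_bar, mu0, alpha = 20, 119.5, 120.0, 0.005

# Part 1: sigma unknown, so use s and the t-distribution with n - 1 degrees of freedom
s = 0.8
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
t_crit = sts.t.ppf(alpha, df=n - 1)   # lower-tail critical value
p_t = sts.t.cdf(t_stat, df=n - 1)     # lower-tail p-value

# Part 2: sigma known, so use the standard normal distribution
sigma = 0.8
z_stat = (x_bar - mu0) / (sigma / math.sqrt(n))
z_crit = sts.norm.ppf(alpha)          # lower-tail critical value
p_z = sts.norm.cdf(z_stat)            # lower-tail p-value

print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p-value = {p_t:.4f}")
print(f"z = {z_stat:.3f}, critical value = {z_crit:.3f}, p-value = {p_z:.4f}")
```

Note that the test statistic has the same value in both parts; only the reference distribution, and hence the critical value and p-value, changes. Compare `t_crit` with the value you read from the t-table.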

## 2. Comparing Exam UML Tasks

Say that for an *Object Oriented Software Development* exam, it is known that an expected proportion of 75% of participants passes the UML modelling task. In the interest of fairness, the teachers of this course want to check whether different exams are comparable. In the file ``uml.csv`` in the ``shared`` folder, you find scores, out of 64, of an exam UML task. Run the cell below to compute the proportion of students who passed this task (a score of at least 32).

In [None]:
df_uml = pd.read_csv("../../shared/uml.csv")
n = len(df_uml)
p_hat = len(df_uml[df_uml["MarksUML"] >= 32]) / n

Perform a test at significance level $\alpha = 0.05$ to check whether this pass-rate is significantly different from $75\%$.
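
A possible sketch of a two-sided large-sample z-test for a single proportion. The function name `one_proportion_z_test` and the example numbers are hypothetical and are not taken from ``uml.csv``; plug in your own `p_hat` and `n`. The normal approximation is usually considered reasonable only when both $n p_0$ and $n(1-p_0)$ are large enough, so check that condition first.

```python
import math
from scipy import stats as sts

def one_proportion_z_test(p_hat, n, p0):
    """Two-sided large-sample z-test of H0: p = p0 against Ha: p != p0."""
    se = math.sqrt(p0 * (1 - p0) / n)    # standard error computed under H0
    z = (p_hat - p0) / se
    p_value = 2 * sts.norm.sf(abs(z))    # both tails
    return z, p_value

# Illustrative values only, NOT the values from uml.csv
z, p_value = one_proportion_z_test(p_hat=0.70, n=100, p0=0.75)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

Reject $H_0$ at $\alpha = 0.05$ if the p-value is below 0.05 or, equivalently, if $|z| > z_{0.025} \approx 1.96$.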

## 3. List Performance

How long does it take Python to do an operation on a huge list? Repeated simulations lead me to hazard the opinion that the function ``fill_with_ones()`` defined below takes 0.032 seconds on average to run in my Notebook interpreter. We will test whether your hub performs worse (i.e. has a longer mean execution time).

1. To check this, we need a data set. One execution of a function is not representative as the execution time depends on other processes as well. To overcome this problem, the code below executes the function call to ``fill_with_ones()`` 100 times. Run the code to generate your data set.

In [None]:
def fill_with_ones(array): #      silly function that simply overwrites all entries in an array with ones
    for i in range(len(array)):
        array[i] = 1

dummy_array = [0]*1000000 #       define an array with a million zeroes
times = np.empty(100) #           array to catch the time it takes for 100 simulations

for i in range(100): #            repeat 100 times
    start = time.time() #         log the time now, before the function call
    fill_with_ones(dummy_array) # call the function
    end = time.time() #           log the time again, after the function call
    print(end - start) #          print the time difference
    times[i] = end - start #      save the time difference
print("Mean:", times.mean())

2. Formulate $H_0$ and $H_a$.
3. Perform the test at significance level $\alpha = 0.01$.
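
One way to carry out such an upper-tailed test is sketched below, assuming $H_0: \mu = 0.032$ versus $H_a: \mu > 0.032$ and relying on the large sample size ($n = 100$) to justify the normal approximation. The helper `upper_tail_mean_test` is a hypothetical name, and `fake_times` is synthetic data standing in for your own `times` array.

```python
import numpy as np
from scipy import stats as sts

def upper_tail_mean_test(sample, mu0):
    """Large-sample z-test of H0: mu = mu0 against Ha: mu > mu0."""
    sample = np.asarray(sample, dtype=float)
    se = sample.std(ddof=1) / np.sqrt(sample.size)  # estimated standard error
    z = (sample.mean() - mu0) / se
    p_value = sts.norm.sf(z)                        # upper-tail probability
    return z, p_value

# Synthetic stand-in for `times`; replace it with the array generated above
rng = np.random.default_rng(0)
fake_times = rng.normal(loc=0.035, scale=0.004, size=100)

z, p_value = upper_tail_mean_test(fake_times, mu0=0.032)
print(f"z = {z:.3f}, p-value = {p_value:.4g}")
```

Reject $H_0$ at $\alpha = 0.01$ if the p-value is below 0.01. Note the `ddof=1` argument: NumPy's default `std()` divides by $n$ rather than $n-1$.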

## 4. Birth Weight

In last week's exercise 4, we obtained a $90\%$ confidence interval, based on $n=42$ and $\bar{x}=3.31$, for babies' birth weight (in kg): $[3.16, 3.47]$. Assume that these data were obtained from a sample in one hospital. We want to test, at $\alpha=0.05$, whether the weight of babies born in this hospital is significantly less than the national average, which is 3.4 kg.

1. Comment on the following reasoning: "3.4 lies more to the right ($3.4>3.31$) in this interval, so the birth weight in this hospital is indeed significantly less than in the national population."   
2. Perform the test.
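
Since the sample standard deviation is not repeated here, one option is to recover the standard error from the confidence interval. The sketch below assumes the interval was built as $\bar{x} \pm z_{0.05} \cdot SE$ (a large-sample interval); if last week's interval used a t-value instead, the numbers shift slightly. The hypotheses are taken to be $H_0: \mu = 3.4$ versus $H_a: \mu < 3.4$.

```python
from scipy import stats as sts

x_bar, mu0, alpha = 3.31, 3.4, 0.05
lower, upper = 3.16, 3.47                # 90% confidence interval from the text

# ASSUMPTION: the interval is x_bar +/- z_{0.05} * SE, so SE = half-width / z_{0.05}
z_90 = sts.norm.ppf(0.95)
se = (upper - lower) / 2 / z_90

z_stat = (x_bar - mu0) / se
z_crit = sts.norm.ppf(alpha)             # lower-tail critical value
p_value = sts.norm.cdf(z_stat)           # lower-tail p-value

print(f"SE = {se:.4f}, z = {z_stat:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.3f}")
```

A one-sided test at $\alpha = 0.05$ uses the same critical value as a two-sided $90\%$ interval, which is worth keeping in mind when commenting on the reasoning in part 1.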

## 5. SQL Recap

The file ``uml.sql`` provided on Toledo contains the information used in exercise 2: student q-numbers and scores. Note that certain students occur twice, e.g. q-number 114 with scores 14 and 15. In that case, their answer was spread over multiple pages and their score is the sum of these individual numbers. Import the file using MySQL Workbench and write the appropriate queries to retrieve the relevant information. Re-run your analysis (without running the cell which defined the dataframe!) to check whether you have the correct information.

In [None]:
conn = sqlite3.connect("../../shared/uml.db")

query_total = """
SELECT ... AS total 
FROM ...
"""

query_passed = """
SELECT ... AS passed 
FROM ...
"""

df_total = pd.read_sql_query(query_total, conn)
df_passed = pd.read_sql_query(query_passed, conn)
print(df_total)
print(df_passed)
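
The query pattern (sum the partial scores per student, then count) can be tried out on a toy database first. Everything in this sketch is hypothetical: the table name `scores`, the column names `qnumber` and `marks`, and the rows are made up for illustration and do not reflect the actual schema of ``uml.sql``.

```python
import sqlite3
import pandas as pd

# Toy in-memory database with a made-up schema and made-up rows
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE scores (qnumber INTEGER, marks INTEGER);
INSERT INTO scores VALUES (1, 40), (2, 14), (2, 15), (3, 20), (3, 20);
""")

# Number of distinct students
query_total_demo = "SELECT COUNT(DISTINCT qnumber) AS total FROM scores"

# Number of students whose summed score reaches the pass mark of 32
query_passed_demo = """
SELECT COUNT(*) AS passed
FROM (
    SELECT qnumber
    FROM scores
    GROUP BY qnumber
    HAVING SUM(marks) >= 32
) AS per_student
"""

total = int(pd.read_sql_query(query_total_demo, conn_demo)["total"].iloc[0])
passed = int(pd.read_sql_query(query_passed_demo, conn_demo)["passed"].iloc[0])
conn_demo.close()

print("total:", total, "passed:", passed, "p_hat:", passed / total)
```

Grouping before counting matters: counting rows directly would count a student with two pages twice and would compare each partial score, rather than the sum, with the pass mark.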