### NAME: Elias De Hondt ---- CLASS: ISB204B
You submit an **executed notebook**, so the **results of the calculations are included as well**. Results must **not** be rounded.

In [2]:
import math                                                   # Mathematical functions
import pandas as pd                                           # Data manipulation
from scipy.stats import binom as binomial                     # Binomial distribution
from scipy.stats import norm as normal                        # Normal distribution
from scipy.stats import poisson as poisson                    # Poisson distribution
from scipy.stats import t as student                          # Student distribution
from scipy.stats import ttest_1samp                           # One-sample t-test
from scipy.stats import chisquare                             # Chi-squared test
from mlxtend.frequent_patterns import apriori                 # Apriori algorithm
from mlxtend.frequent_patterns import association_rules       # Association rules

def rule_filter(row, min_len, max_len):
    length = len(row['antecedents']) + len(row['consequents'])
    return min_len <= length <= max_len

def get_item_list(string):
    items = string[1:-1]
    return items.split(';')

def no_outliers(data): # Return the data without outliers
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    I = Q3 - Q1
    low = Q1 - 1.5 * I
    high = Q3 + 1.5 * I
    outliers = data[(data < low) | (data > high)]
    
    print("Low: ",low)
    print("High:",high)
    print("Len: ",len(data))
    print("Outliers:", outliers.values, "\n")
    return data[(data >= low) & (data <= high)]

def plot_confidence_interval(population_size, sample_mean, sample_standard_deviation, degrees_freedom, plot_factor):
    from matplotlib import pyplot as plt
    import numpy as np
    from scipy.stats import t as student

    margin_of_error = plot_factor * sample_standard_deviation / np.sqrt(population_size)
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    # Plotting the confidence interval
    plt.figure(figsize=(10, 6))
    x_axis = np.linspace(sample_mean - 3 * sample_standard_deviation, sample_mean + 3 * sample_standard_deviation, 1000)
    y_axis = student.pdf(x_axis, degrees_freedom, loc=sample_mean, scale=sample_standard_deviation / np.sqrt(population_size))

    plt.plot(x_axis, y_axis, label='t-distribution')
    plt.axvline(lower_bound, color='red', linestyle='--', label='Lower Bound')
    plt.axvline(upper_bound, color='blue', linestyle='--', label='Upper Bound')
    plt.axvline(sample_mean, color='green', linestyle='-', label='Sample Mean')

    # Shade the area under the curve between the two bounds
    plt.fill_between(x_axis, y_axis, where=(x_axis >= lower_bound) & (x_axis <= upper_bound), color='orange', alpha=0.3, label='Confidence Interval')

    plt.title('Confidence Interval Plot')
    plt.xlabel('Sample Mean')
    plt.ylabel('Probability Density Function')
    plt.legend()
    plt.grid(True)
    plt.show()

In [3]:
bevolkingData = pd.read_csv('../Data/Bevolking.csv', delimiter=';', decimal='.', index_col='id')
display(bevolkingData)

id,age,sex,region,income,married,children,car,fiber,iphone,linux
ID12101,48,FEMALE,INNER_CITY,17546.00,False,1,False,True,False,True
ID12102,40,MALE,TOWN,30085.10,True,3,True,True,True,False
ID12103,51,FEMALE,INNER_CITY,16575.40,True,0,True,True,False,False
ID12104,23,FEMALE,TOWN,20375.40,True,3,False,True,False,False
ID12105,57,FEMALE,RURAL,50576.30,True,0,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
ID12696,61,FEMALE,INNER_CITY,47025.00,False,2,True,True,True,False
ID12697,30,FEMALE,INNER_CITY,9672.25,True,0,True,True,False,False
ID12698,31,FEMALE,TOWN,15976.30,True,0,True,False,False,True
ID12699,29,MALE,INNER_CITY,14711.80,True,0,False,False,True,False


**Question 1**

In [4]:
# Empirical probability: share of people who live in a TOWN or own a car

probability=len(bevolkingData[ (bevolkingData["region"] == "TOWN") | (bevolkingData["car"] == True)]) / len(bevolkingData)
print(f"The probability is as follows: {round(probability,4)} or {round(probability*100,4)}%.")

The probability is as follows: 0.63 or 63.0%.
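
The union probability computed above can be cross-checked with the inclusion-exclusion rule, P(A or B) = P(A) + P(B) - P(A and B). The sketch below is only an illustration: `toy` is a small hypothetical DataFrame that reuses the `region` and `car` column names, not the real `Bevolking.csv` data.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook, the values do not.
toy = pd.DataFrame({
    "region": ["TOWN", "RURAL", "TOWN", "INNER_CITY", "RURAL"],
    "car":    [True,   False,   False,  True,         False],
})

in_town = toy["region"] == "TOWN"
has_car = toy["car"]

# Direct share of rows in the union, the same idea as the cell above.
p_union = (in_town | has_car).mean()

# The same probability via inclusion-exclusion.
p_check = in_town.mean() + has_car.mean() - (in_town & has_car).mean()

print(p_union, p_check)
```

Both values should agree up to floating-point error; if they do not, the filter on the real data is probably counting something other than the union.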


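**Question 2**

In [5]:
# Technique used: Laplace weights combined with the law of total probability
# P(A) = P(B1)*P(A|B1) + P(B2)*P(A|B2) + P(B3)*P(A|B3)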

probability=(1/4*0.001)+(1/4*0.002)+(2/4*0.003)
print(f"The probability is as follows: {round(probability,4)} or {round(probability*100,4)}%.")

The probability is as follows: 0.0023 or 0.225%.
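
The weighted sum in the cell above has the shape of the law of total probability. A minimal sketch with the same numbers follows; the names `priors` and `conditionals` are introduced here for illustration, and what the three cases stand for is not stated in the notebook.

```python
import math

# Values taken from the cell above: P(B_i) and P(A | B_i) for three cases.
priors = [1/4, 1/4, 2/4]
conditionals = [0.001, 0.002, 0.003]

# The priors should form a complete partition, so they must sum to 1.
assert math.isclose(sum(priors), 1.0)

# Law of total probability: P(A) = sum_i P(B_i) * P(A | B_i)
p_total = sum(p * c for p, c in zip(priors, conditionals))
print(p_total)
```

Checking that the priors sum to 1 is a cheap guard against forgetting one of the cases.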


**Question 3**

In [6]:
# Technique used: The Binomial Distribution

true=len(bevolkingData[(bevolkingData["married"] == True)])
total=len(bevolkingData)

k=7                 # Threshold: the code below computes P(X > k)
n=10                # Number of trials
p=true/total        # Probability of success in each trial
probability=1-binomial.cdf(k, n, p)
print(f"The probability is as follows: {round(probability,4)} or {round(probability*100,4)}%.")

The probability is as follows: 0.2838 or 28.377%.
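
Because `1 - binomial.cdf(k, n, p)` is the probability of strictly more than `k` successes, it can help to write the tail out term by term. The sketch below uses only the standard library; `binomial_pmf` is a helper defined here, and `p = 0.66` is a hypothetical success probability, not the value derived from the data.

```python
from math import comb, isclose

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
k = 7
p = 0.66  # hypothetical success probability, not the value from the data

# P(X > k) as a sum over the upper tail ...
upper_tail = sum(binomial_pmf(i, n, p) for i in range(k + 1, n + 1))
# ... and as the complement of the lower tail, which is what 1 - cdf(k) computes.
complement = 1 - sum(binomial_pmf(i, n, p) for i in range(0, k + 1))

print(upper_tail, complement)
```

If the question asks for "at least 7" rather than "more than 7", the tail would have to start at `k` instead of `k + 1`.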


**Question 4**

In [7]:
# Technique used: The Poisson Distribution

x=3      # Threshold: the code below computes P(X > x)
y=10     # Average number of events (lambda)
probability=1-poisson.cdf(x, y)
print(f"The probability is as follows: {round(probability,4)} or {round(probability*100,4)}%.")

The probability is as follows: 0.9897 or 98.9664%.
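
The Poisson tail probability above can be reproduced from the probability mass function directly. This is a minimal sketch using the same two numbers as the cell; the variable names `lam` and `p_more_than_x` are chosen here for readability.

```python
from math import exp, factorial

lam = 10   # average number of events, as in the cell above
x = 3      # threshold

# P(X <= x) = sum_{k=0}^{x} e^{-lam} * lam^k / k!
cdf = sum(exp(-lam) * lam**k / factorial(k) for k in range(x + 1))
p_more_than_x = 1 - cdf

print(round(p_more_than_x, 4))  # 0.9897, matching the output above
```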


**Question 5**

In [8]:
# Technique used: Normal Distribution

loc=bevolkingData['income'].mean()
scale=bevolkingData['income'].std()
x=65000
probability=1-normal.cdf(x, loc, scale)
print(f"The probability is as follows: {round(probability,4)} or {round(probability*100,4)}%")

The probability is as follows: 0.0018 or 0.1835%
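
The same kind of upper-tail probability can be obtained by standardising the threshold to a z-score first. The sketch below uses the standard library's `statistics.NormalDist`; the values of `mu` and `sigma` are hypothetical placeholders, not the mean and standard deviation of the income column.

```python
from statistics import NormalDist

mu = 27500.0     # hypothetical mean income
sigma = 12900.0  # hypothetical standard deviation
x = 65000

z = (x - mu) / sigma                         # standardise the threshold
p_via_z = 1 - NormalDist().cdf(z)            # tail of the standard normal
p_direct = 1 - NormalDist(mu, sigma).cdf(x)  # tail of N(mu, sigma)

print(z, p_via_z, p_direct)
```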


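**Question 6**

In [9]:
a=0.05                             # Significance level
x_bar=bevolkingData['age'].mean()  # Sample mean
s=bevolkingData['age'].std()       # Sample standard deviation
n=len(bevolkingData['age'])        # Sample size
df=n-1                             # Degrees of freedom
p=1-a                              # Confidence level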

interval=student.interval(confidence=p, df=df, loc=x_bar, scale=s/math.sqrt(n))
print("Confidence Interval:",interval)

Confidence Interval: (41.23844813332852, 43.55155186667149)
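
The interval returned by `student.interval` corresponds to the formula x_bar ± t(1 - a/2, n - 1) · s / sqrt(n). A minimal sketch of that calculation follows; the summary statistics are hypothetical and only chosen to resemble the scale of the age column.

```python
import math
from scipy.stats import t as student

# Hypothetical summary statistics, not the values from the data set.
x_bar = 42.0   # sample mean
s = 14.0       # sample standard deviation
n = 600        # sample size
a = 0.05       # significance level

t_crit = student.ppf(1 - a / 2, df=n - 1)   # two-sided critical value
margin = t_crit * s / math.sqrt(n)          # margin of error
manual = (x_bar - margin, x_bar + margin)

# The same interval from the library call, using positional arguments.
builtin = student.interval(1 - a, n - 1, loc=x_bar, scale=s / math.sqrt(n))
print(manual, builtin)
```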


**Question 7**

In [10]:
income=no_outliers(bevolkingData['income'])
mu=6500
a=0.05
data=ttest_1samp(income, mu)
p_value= data.pvalue
print("P Value:",p_value)
print("5%", income.mean()-(income.mean()/100*5))

if p_value < a:
    print('No')
else:
    print('Yes')

print(f"The mean is not {mu} because {a} is greater than the p-value {p_value}.")

Low:  -11097.762499999993
High: 64534.937499999985
Len:  600
Outliers: [] 

P Value: 6.179025530039627e-171
5% 26147.82965583333
No
The mean is not 6500 because 0.05 is greater than the p-value 6.179025530039627e-171.
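
What `ttest_1samp` computes can be written out by hand: the statistic is t = (x_bar - mu0) / (s / sqrt(n)) and the two-sided p-value is twice the upper tail of the t-distribution with n - 1 degrees of freedom. The sketch below checks this on a short hypothetical sample (`sample` and `mu0` are made up for the illustration).

```python
import math
import statistics
from scipy.stats import t as student, ttest_1samp

sample = [31.2, 28.9, 35.1, 30.4, 27.8, 33.6, 29.5, 32.0]  # hypothetical values
mu0 = 30.0                                                   # hypothesised mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (divides by n - 1)

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_manual = 2 * student.sf(abs(t_stat), df=n - 1)   # two-sided p-value

result = ttest_1samp(sample, mu0)
print(t_stat, p_manual, result.statistic, result.pvalue)
```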


**Question 8**

In [25]:
children=bevolkingData['children']
children1 = 0
children2 = 0
children3 = 0

for i in range(len(children)):
    if children.iloc[i] == 1:
        children1 += 1
    if children.iloc[i] == 2:
        children2 += 1
    if children.iloc[i] == 3:
        children3 += 1

total = children1 + children2 + children3

print("Amount of children1:",children1)
print("Amount of children2:",children2)
print("Amount of children3:",children3)
print("Total:",total, len(children))

measured_values = [children1,children2,children3]
# The expected fractions sum to 0.75, so the expected counts (252.75) do not
# add up to the observed total (337); chisquare raises a ValueError for this.
expected_values = [total*0.45,total*0.20,total*0.10]
data = chisquare(measured_values, expected_values)

print("P Value:", data.pvalue)

a=0.05
if data.pvalue < a:   # use this test's p-value, not the p_value left over from the earlier t-test cell
    print('No')
else:
    print('Yes')

print("Something is wrong here in my for loop, but I do not see the problem in 1,2,3")
print("And it will be 'Yes' because p_value will be < a, so the statistics bureau's claim is correct.")

Amount of children1: 135
Amount of children2: 134
Amount of children3: 68
Total: 337 600


ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.3333333333333333
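
The `ValueError` comes from the expected fractions 0.45, 0.20 and 0.10 adding up to only 0.75, so observed and expected totals cannot match. One possible repair, sketched below, is to include the category with 0 children. It rests on two assumptions that the notebook does not confirm: that the remaining 600 - 337 = 263 people have 0 children, and that the missing 25% of the claimed distribution belongs to that category.

```python
from scipy.stats import chisquare

# Observed counts: 1, 2 and 3 children come from the output above.
# ASSUMPTION: the remaining 600 - 337 = 263 people have 0 children.
observed = [263, 135, 134, 68]

# ASSUMPTION: the missing 25% of the claimed distribution belongs to 0 children.
fractions = [0.25, 0.45, 0.20, 0.10]
total = sum(observed)
expected = [total * f for f in fractions]

result = chisquare(observed, expected)
print("Statistic:", result.statistic)
print("P Value:", result.pvalue)

a = 0.05
print("Reject the claimed distribution" if result.pvalue < a else "Do not reject")
```

If the assignment specifies a different share for families without children, only `fractions` needs to change; the observed and expected totals must still agree.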

**Question 9**

**Question 10**

In [None]:
# Apriori algorithm
support=0.2
min_threshold=0.7
bevolkingData_dummies=bevolkingData.drop(columns=['age', 'sex', 'region', 'income', 'married', 'children'])

item_sets_apriori=apriori(bevolkingData_dummies, min_support=support, use_colnames=True)

rules_apriori=association_rules(item_sets_apriori, metric='confidence', min_threshold=min_threshold)
rules_apriori=rules_apriori.drop(columns=['leverage', 'conviction', 'zhangs_metric'])

display(rules_apriori.sort_values(by='confidence', ascending=False).head(1)['antecedents'])
print("Because (linux) has the highest confidence with fiber.")
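
To make the `confidence` metric used above concrete: for a rule A -> B it is support(A and B) / support(A). The sketch below computes it in plain Python on a handful of hypothetical rows; `rows`, `support` and `confidence` are names defined here and are not part of mlxtend.

```python
# Hypothetical transactions; each set lists the items one person has.
rows = [
    {"fiber", "linux"},
    {"fiber", "car"},
    {"fiber", "linux", "iphone"},
    {"car"},
    {"fiber", "iphone"},
]

def support(itemset):
    """Fraction of rows that contain every item in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"fiber"}))                 # 0.8
print(confidence({"linux"}, {"fiber"}))   # 1.0
```

A confidence of 1.0 here means that every hypothetical row containing `linux` also contains `fiber`; it says nothing about the reverse rule.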