# P3 Example Exam

In [4]:
import math                                                   # Mathematical functions
import pandas as pd                                           # Data manipulation
import numpy as np                                            # Scientific computing
import matplotlib.pyplot as plt                               # Data visualization
from scipy.stats import binom as binomial                     # Binomial distribution
from scipy.stats import norm as normal                        # Normal distribution
from scipy.stats import poisson as poisson                    # Poisson distribution
from scipy.stats import t as student                          # Student distribution
from scipy.stats import chi2                                  # Chi-squared distribution
from scipy.stats import ttest_1samp                           # One-sample t-test
from scipy.stats import chisquare                             # Chi-squared test
from scipy.special import comb                                # Combinations
from mlxtend.frequent_patterns import apriori                 # Apriori algorithm
from mlxtend.frequent_patterns import fpgrowth                # FP-growth algorithm
from mlxtend.frequent_patterns import association_rules       # Association rules
from mlxtend.preprocessing import TransactionEncoder          # Transaction encoder

In [5]:
def rule_filter(row, min_len, max_len):
    length = len(row['antecedents']) + len(row['consequents'])
    return min_len <= length <= max_len

def get_item_list(string):
    items = string[1:-1]
    return items.split(';')

def plot_confidence_interval(population_size, sample_mean, sample_standard_deviation, degrees_freedom, plot_factor):
    margin_of_error = plot_factor * sample_standard_deviation / np.sqrt(population_size)
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    # Plotting the confidence interval
    plt.figure(figsize=(10, 6))
    x_axis = np.linspace(sample_mean - 3 * sample_standard_deviation, sample_mean + 3 * sample_standard_deviation, 1000)
    y_axis = student.pdf(x_axis, degrees_freedom, loc=sample_mean, scale=sample_standard_deviation / np.sqrt(population_size))

    plt.plot(x_axis, y_axis, label='t-distribution')
    plt.axvline(lower_bound, color='red', linestyle='--', label='Lower Bound')
    plt.axvline(upper_bound, color='blue', linestyle='--', label='Upper Bound')
    plt.axvline(sample_mean, color='green', linestyle='-', label='Sample Mean')

    # Mark the confidence interval
    plt.fill_between(x_axis, y_axis, where=(x_axis >= lower_bound) & (x_axis <= upper_bound), color='orange', label='Confidence Interval')

    plt.title('Confidence Interval Plot')
    plt.xlabel('Sample Mean')
    plt.ylabel('Probability Density Function')
    plt.legend()
    plt.grid(True)
    plt.show()

## Exercises Probability

### Question 1:
You roll `2 dice`.
What is the probability that you roll a `5` and a `2`? (3 significant figures)
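
A possible way to check the answer is to enumerate all outcomes. This is only a sketch: it assumes two fair six-sided dice and that the order of the dice does not matter; the names `outcomes` and `favourable` are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair six-sided dice (assumption).
outcomes = list(product(range(1, 7), repeat=2))

# Favourable outcomes: one die shows 5 and the other shows 2, in either order.
favourable = [roll for roll in outcomes if sorted(roll) == [2, 5]]

probability = Fraction(len(favourable), len(outcomes))
print(probability, f"{float(probability):.3g}")
```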

### Question 2:
We have `2 dice`.
-   1st: `3 yellow` faces and `3 red` faces.
-   2nd: `5 yellow` faces and `1 red` face.

What is the probability of rolling one `yellow` and one `red` face?

### Question 3:
A box contains `4 yellow`, `2 brown`, `1 orange`, `1 green` and `2 purple` marbles. We always draw without replacing. (2 significant figures each)
- If we draw `4 marbles`, what is the probability that all marbles are `yellow`?
- If we draw `2 marbles`, what is the probability of getting a `brown` and a `green` marble?
- When we draw `1 marble`, what is the chance of getting an `orange` or a `green` marble?
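
The three sub-questions can be approached by counting combinations. The sketch below assumes every marble is equally likely to be drawn; the dictionary `marbles` and the variable names are illustrative.

```python
from math import comb

# Marble counts taken from the question; 10 marbles in total.
marbles = {"yellow": 4, "brown": 2, "orange": 1, "green": 1, "purple": 2}
total = sum(marbles.values())

# Four draws, all yellow: the single way to pick all 4 yellow marbles,
# divided by the number of ways to pick any 4 marbles.
p_all_yellow = comb(marbles["yellow"], 4) / comb(total, 4)

# Two draws, one brown and one green (order irrelevant).
p_brown_green = marbles["brown"] * marbles["green"] / comb(total, 2)

# One draw, orange or green: the two events are mutually exclusive.
p_orange_or_green = (marbles["orange"] + marbles["green"]) / total

print(f"{p_all_yellow:.2g} {p_brown_green:.2g} {p_orange_or_green:.2g}")
```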

### Question 4:
In a group of `144 students`, `90` passed the "OO Concepts" exam and `60` passed "JavaFX".
Of the `90` who passed "OO Concepts", `30` also passed "JavaFX".
What is the probability that a randomly selected student passed at least 1 of the 2 exams? (4 significant digits)
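
"At least one of the two" suggests the inclusion-exclusion rule, |A ∪ B| = |A| + |B| - |A ∩ B|. A short sketch with illustrative variable names:

```python
from fractions import Fraction

students = 144
passed_oo = 90
passed_javafx = 60
passed_both = 30

# Inclusion-exclusion: students who passed both exams are counted only once.
passed_at_least_one = passed_oo + passed_javafx - passed_both

probability = Fraction(passed_at_least_one, students)
print(probability, f"{float(probability):.4g}")
```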

### Question 5:
(C) In a group of `1000 students`, `400` regularly play computer games.
(B) `200` regularly play board games.
(C and B) There are `100` who regularly play computer games and board games.

If you choose a random student, what are the chances that they regularly play games (computer or board)?

### Question 6:
`70%` of students pass a "Data Science 1" exam. Of these, `50%` are day students and `50%` are blended students. Of all participants, `60%` are day students and `40%` are blended students. `1` random student is selected from the group each time. (2 significant figures each)

- What is the probability that the student passed?
- What is the probability that the student is a day student, given that they passed?
- What is the probability that the student is a day student who passed?
- What is the probability that the student passed, given that they are a day student?
- What is the probability that the student failed, given that they are a day student?
- What is the probability that the student is a day student who failed?
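
A sketch of the conditional-probability bookkeeping for this question. It assumes the statement is read as P(pass) = 0.70, P(day | pass) = 0.50 and P(day) = 0.60; the variable names are illustrative.

```python
p_pass = 0.70            # P(pass)
p_day_given_pass = 0.50  # P(day | pass)
p_day = 0.60             # P(day)

# Product rule: P(pass and day) = P(pass) * P(day | pass)
p_pass_and_day = p_pass * p_day_given_pass

# Definition of conditional probability: P(pass | day) = P(pass and day) / P(day)
p_pass_given_day = p_pass_and_day / p_day

# Complement within the day students.
p_fail_given_day = 1 - p_pass_given_day

# Day students either pass or fail: P(day and fail) = P(day) - P(day and pass)
p_day_and_fail = p_day - p_pass_and_day

print(f"{p_pass_and_day:.2g} {p_pass_given_day:.2g} "
      f"{p_fail_given_day:.2g} {p_day_and_fail:.2g}")
```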

### Question 7:
`3 men` go to a party wearing identical coats. They hang their coats on the coat rack and then get very drunk.
At the end, each man takes a coat at random. What is the probability that every man takes his own coat?

### Question 8:
If it rains heavily, there is a `50%` chance that my basement will flood. It rains heavily on average `35 days per year` and my basement floods `20 days per year` (even sometimes when it is not raining hard).
If my basement is flooded, what are the chances that it is because it is raining heavily?
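
This is a direct application of Bayes' rule, P(rain | flood) = P(flood | rain) · P(rain) / P(flood). The sketch below assumes a 365-day year (the value cancels out) and uses illustrative variable names.

```python
days_per_year = 365  # assumption; cancels out in the ratio

p_rain = 35 / days_per_year
p_flood = 20 / days_per_year
p_flood_given_rain = 0.50

# Bayes' rule
p_rain_given_flood = p_flood_given_rain * p_rain / p_flood
print(f"{p_rain_given_flood:.3f}")
```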

### Question 9:
`90%` of PCs use Windows. If a PC crashes, you are `99.9%` sure it is running Windows. The chance that a PC will crash (regardless of the operating system) is `0.01`.

What are the chances of a PC crashing if you know it is running Windows?

### Question 10:
In 1988 there was discussion about introducing mandatory AIDS testing. A very reliable test has been developed that achieves the following results:
- `P(positive | infected) = 0.999`
- `P(negative | not infected) = 0.99`

It is estimated that `0.60%` of people are carriers of the virus.
What are the chances that someone is actually infected if you know the test is positive? Do you see why the mandatory test was not introduced?
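
One way to work this out is to combine the law of total probability with Bayes' rule. The sketch below uses only the three numbers given in the question; the variable names are illustrative.

```python
p_infected = 0.006             # prevalence: 0.60%
p_pos_given_infected = 0.999   # P(positive | infected)
p_neg_given_healthy = 0.99     # P(negative | not infected)

# False-positive rate is the complement of the true-negative rate.
p_pos_given_healthy = 1 - p_neg_given_healthy

# Law of total probability: P(positive)
p_positive = (p_pos_given_infected * p_infected
              + p_pos_given_healthy * (1 - p_infected))

# Bayes' rule: P(infected | positive)
p_infected_given_pos = p_pos_given_infected * p_infected / p_positive
print(f"{p_infected_given_pos:.3f}")
```

Because healthy people vastly outnumber infected people, the false positives dominate the positive results even with a very accurate test.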

### Question 11:
A program was written that determines whether a text is interesting for a reader.
The program has already been trained and the following values were noted:
`1642 texts` were analyzed by hand. `46` of the texts were identified as interesting. A map was made of words and the number of times they occurred as well as the number of times they occurred in interesting texts.
We now get the following text:

`"This is a text to see if it will be judged as interesting."`

The map contains the following values for the (Dutch) words of this text:

| Word        | Frequency | Frequency in interesting texts |
|-------------|-----------|--------------------------------|
| dit         | 1368      | 40                             |
| is          | 1642      | 46                             |
| een         | 1642      | 46                             |
| tekst       | 159       | 1                              |
| om          | 1480      | 24                             |
| te          | 1574      | 38                             |
| kijken      | 947       | 11                             |
| of          | 1642      | 35                             |
| deze        | 1608      | 46                             |
| als         | 1576      | 39                             |
| interessant | 156       | 2                              |
| zal         | 1387      | 46                             |
| worden      | 1589      | 14                             |
| beoordeeld  | 176       | 1                              |

What is the probability that this text is an interesting text, according to the algorithm covered in the course? Is this probability large or small?

## Exercises Probability Distributions

### Question 1:
Luna rolls a die `4` times (answer with `4` significant figures):
- What is the probability of exactly one `6`?
- What is the probability of getting four sixes?
- What is the probability of `2` rolls below `3` and `2` rolls `3` or higher?


### Question 2:
`6` Applied Computer Science students must take an `English` test. The probability of passing the exam is `0.75`.
- What is the probability that exactly `4` of them will succeed?
- What is the probability that exactly `5` of them will succeed?
- What is the probability that exactly `6` of them will succeed?
- What is the probability that fewer than `4` of them will succeed?
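
Assuming the students pass independently of each other, the number of successes follows a binomial distribution with n = 6 and p = 0.75. The sketch below uses a small helper, `binomial_pmf`, which is an illustrative name and not part of the notebook; `scipy.stats.binom.pmf` gives the same values.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_students, p_pass = 6, 0.75

# Exactly 4, 5 and 6 successes.
p_exactly = {k: binomial_pmf(k, n_students, p_pass) for k in (4, 5, 6)}

# Fewer than 4 successes: sum over k = 0, 1, 2, 3.
p_fewer_than_4 = sum(binomial_pmf(k, n_students, p_pass) for k in range(4))

for k, value in p_exactly.items():
    print(f"P(X = {k}) = {value:.4f}")
print(f"P(X < 4) = {p_fewer_than_4:.4f}")
```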

### Question 3:
On the way to school, you pass `6` traffic lights. Unfortunately, they are not synchronized at all: they work completely independently of each other. Each traffic light is red approximately `40%` of the time.
- What is the probability that you do not encounter a single red light on the way?
- What is the probability that all traffic lights are red?
- What is the probability of encountering more than `2` red lights?
- You lose an average of `2` minutes per red light. How much time do you expect to lose on average per trip?


### Question 4:
If you cycle to school, you will have a flat tire `2` times a year on average.
- What is the probability that you will not have a flat tire in a given year?
- What is the probability that you will have more than `3` flat tires in a year?
- What is the probability that you will have `2` flat tires in `1` month?
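
A sketch under the assumption that flat tires follow a Poisson distribution with rate 2 per year, and that a month is one twelfth of a year. The helper `poisson_pmf` is an illustrative name; `scipy.stats.poisson.pmf` gives the same values.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam**k / factorial(k)

flats_per_year = 2

p_no_flat = poisson_pmf(0, flats_per_year)

# More than 3 flats: complement of 0, 1, 2 or 3 flats.
p_more_than_3 = 1 - sum(poisson_pmf(k, flats_per_year) for k in range(4))

# The rate scales with the length of the period: 2/12 flats per month.
p_two_in_month = poisson_pmf(2, flats_per_year / 12)

print(f"{p_no_flat:.4f} {p_more_than_3:.4f} {p_two_in_month:.4f}")
```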

### Question 5:
A user receives an average of `20` emails per day. One day he gets `100` and wonders if this is still normal.
- What is the probability that he will receive more than `100` emails in one day?
- What is the probability that he will get more than `30` in one day?
- What is the probability that he will get exactly `20` in one day?
- What is the probability that he will get `10` or less in a day?
- What is the probability that he will get `650` or less in a month (`30` days)?

### Question 6:
An average of `3.5` bitcoin transactions are carried out per second. However, the maximum number of transactions the network can process is `7` per second.
- What is the probability that more than `7` transactions need to be executed during one second?
- What is the probability that there is no transaction during a given second?
- What is the probability that there are `3` or fewer transactions during a second?
- What is the probability that there are `2` or more transactions during a second?
- How many transactions do you expect in a day?

### Question 7:
A test consists of `40` questions and the average difficulty of the questions is `0.85` (the probability of a correct answer is `0.85`). Students receive `1` point per question. Which of the following pairs gives the correct expected value and standard deviation of the score on this test?
- µ=34 and σ=2.26
- µ=29 and σ=2.26
- µ=34 and σ=5.10
- µ=29 and σ=5.10
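
If the questions are answered independently (an assumption), the score is binomial, so µ = n·p and σ = √(n·p·(1 − p)). A quick check with illustrative variable names:

```python
from math import sqrt

n_questions, p_correct = 40, 0.85

expected_score = n_questions * p_correct                        # mu = n * p
std_score = sqrt(n_questions * p_correct * (1 - p_correct))     # sigma = sqrt(n * p * (1 - p))

print(f"mu = {expected_score:.0f}, sigma = {std_score:.2f}")
```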

### Question 8:
For the standard normal distribution, what are the following probabilities?
- P(Z ≥ +1.64)
- P(Z ≥ -1.32)
- P(Z ≤ -0.18)
- P(Z ≤ +1.28)
- P(0.45 ≤ Z ≤ 0.92)
- P(-0.72 ≤ Z ≤ -0.38)
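
These probabilities all follow from the cumulative distribution function Φ: P(Z ≤ a) = Φ(a), P(Z ≥ a) = 1 − Φ(a) and P(a ≤ Z ≤ b) = Φ(b) − Φ(a). The sketch below uses `statistics.NormalDist` from the standard library; `scipy.stats.norm.cdf` would work the same way.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

probabilities = {
    "P(Z >= 1.64)": 1 - Z.cdf(1.64),
    "P(Z >= -1.32)": 1 - Z.cdf(-1.32),
    "P(Z <= -0.18)": Z.cdf(-0.18),
    "P(Z <= 1.28)": Z.cdf(1.28),
    "P(0.45 <= Z <= 0.92)": Z.cdf(0.92) - Z.cdf(0.45),
    "P(-0.72 <= Z <= -0.38)": Z.cdf(-0.38) - Z.cdf(-0.72),
}

for label, value in probabilities.items():
    print(f"{label} = {value:.4f}")
```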


### Question 9:
In a class, hair length is normally distributed with `µ=20cm` and `σ=4cm`. What are the following probabilities?
- The probability that someone's hair is longer than `28cm`
- The probability that someone's hair is shorter than `16cm`
- The probability that someone has hair between `18cm` and `22cm`

### Question 10:
We are building a website that hosts a competition with several questions. The scores for the competition are normally distributed with `µ=50` and `σ=5`.
- What is the probability that a randomly selected score in this competition is lower than `40`?
- What is the probability that a randomly selected score in this competition falls between `42` and `52`?
- Anyone who achieves a score higher than `58.75` will receive a special prize with a mention in the newspaper. If `75` people participate, how many do you expect to win a special prize?

### Question 11:
In an IQ test, the expected value is 100 (normally distributed). In Antwerp the standard deviation is `15`, in Ghent it is `18`.
- What is the probability that someone in Antwerp has an IQ greater than `120`?
- With what IQ do you belong to the people in Antwerp who have the `16%` lowest scores?
- What IQ (or more) do the `5%` smartest people in Ghent have?
- What is the ratio of Antwerp residents to Ghent residents for scores above `130`?
    1. `1:1`
    2. `1:2`
    3. `1:3`
    4. `5:6`
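
This question mixes tail probabilities with quantiles (inverse CDF). A sketch using `statistics.NormalDist`, with illustrative variable names; `scipy.stats.norm` offers `cdf` and `ppf` for the same purpose.

```python
from statistics import NormalDist

antwerp = NormalDist(mu=100, sigma=15)
ghent = NormalDist(mu=100, sigma=18)

# Tail probability: P(IQ > 120) in Antwerp.
p_antwerp_above_120 = 1 - antwerp.cdf(120)

# Quantiles: the IQ below which the lowest 16% fall (Antwerp),
# and the IQ above which the top 5% fall (Ghent).
iq_lowest_16_percent = antwerp.inv_cdf(0.16)
iq_top_5_percent_ghent = ghent.inv_cdf(0.95)

# Ratio of the two tail probabilities above 130 (assumes equal population sizes).
ratio_above_130 = (1 - antwerp.cdf(130)) / (1 - ghent.cdf(130))

print(f"{p_antwerp_above_120:.4f} {iq_lowest_16_percent:.1f} "
      f"{iq_top_5_percent_ghent:.1f} {ratio_above_130:.2f}")
```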

### Question 12:
Research has shown that the lifespan of lamps is normally distributed. A certain type of lamp has a mean lifespan of `500` hours, with a standard deviation of `100` hours. A chain store purchases `50,000` lamps of this type.
- How many of these lamps do you expect to burn for longer than `400` hours?
- How many of these lamps do you expect to have a lifespan between `400` and `700` hours?
- How many of these lamps do you expect to have a lifespan of less than `600` hours?

### Question 13:
In a simple competition on the website, `8` out of `10` people give the correct answer.
- What is the probability that at least `55` participants from a random sample of `60` people answered correctly (`2` significant figures)?

### Question 14:
A new machine makes processors completely autonomously. Only `2/5` of the processors meet the quality standard.
- We take `10` processors made by this machine. What is the probability that fewer than `2` are correct?
- We take `100` processors made by this machine. What is the probability that fewer than `20` are correct?
- Both questions above ask for less than `20%` of the total number of processors. What difference do you notice? What could be the cause?

### Question 15:
Many things can go wrong during a laptop exam. We know that a computer running Windows has a `2%` chance of crashing during the exam. For Mac OSX it is `0.2%` and for Linux it is `0.1%`. Suppose a class of `40` students comes to take an exam.

#### Situation 1: Everyone runs Windows:
- What is the expected number of crashes during the exam?
- What is the chance that `1` computer will crash during the exam?
- What is the chance that `2` computers will crash during the exam?

#### Situation 2: 10 students run Windows, 10 students run Mac OSX and 20 run Linux
- What is the expected number of crashes during the exam?
- What is the chance that `1` computer will crash during the exam?
- What is the chance that `2` computers will crash during the exam?

### Question 16:
In a textile factory, rolls of fabric are produced with a length of `50` meters per roll. The number of weaving errors per roll is Poisson distributed with an expected value of `1` weaving error per roll. During the quality inspection, the rolls of fabric are separated into rolls of `A quality` (with `0` or `1` weaving defect per roll) and rolls of `B quality` (with two or more weaving defects per roll).
- Calculate the probability that a random roll will receive the designation `B quality`.
- The production volume per day is equal to `2000` meters of fabric. What is the probability that at least `30` rolls of `A quality` are made on any given day?
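
A possible two-step approach: first use the Poisson distribution for a single roll, then treat the daily number of A-quality rolls as binomial. This sketch assumes the rolls are independent of each other; the variable names are illustrative.

```python
from math import comb, exp

mean_defects = 1  # expected weaving errors per roll

# Poisson: P(0 defects) + P(1 defect) = e^(-lam) * (1 + lam)
p_a = exp(-mean_defects) * (1 + mean_defects)
p_b = 1 - p_a

# 2000 meters per day at 50 meters per roll.
rolls_per_day = 2000 // 50

# Binomial tail: at least 30 of the daily rolls are A quality.
p_at_least_30_a = sum(
    comb(rolls_per_day, k) * p_a**k * p_b**(rolls_per_day - k)
    for k in range(30, rolls_per_day + 1)
)

print(f"P(B quality) = {p_b:.4f}")
print(f"P(at least 30 A rolls) = {p_at_least_30_a:.4f}")
```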

### Question 17:
An IT service desk receives an average of `10` service requests per hour.
- Calculate the probability that more than `15` service requests will be received in an hour.
- Calculate the probability that at least `10` but no more than `15` service requests will be received.
- Calculate the probability that no service requests will be received during a fifteen-minute period.

## Exercises Tests Part 1

### Question 1:
- We want to know how many requests a server has to process on average per day. We do a measurement for this. We measure for `30 days` and count the number of requests every day. We arrive at a mean of `975` and a standard deviation equal to `100`.
     - Between what limits does the average number of requests lie if we want to be `95%` sure?
     - Suppose we found the same mean and standard deviation, but with a sample size of `100 days`. Then we are `95%` certain that the average is between ... and ....
     - Suppose someone claims that the server must process `1000 requests` per day. Can you support or reject this statement in the two cases? You want a certainty of `95%` again.
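
A sketch of the interval calculation, assuming a two-sided Student t interval with n − 1 degrees of freedom because the standard deviation is estimated from the sample. The function `mean_confidence_interval` is an illustrative helper, not part of the notebook.

```python
from math import sqrt
from scipy.stats import t as student

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Two-sided t interval for the mean, with n - 1 degrees of freedom."""
    factor = student.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = factor * sample_std / sqrt(n)
    return sample_mean - margin, sample_mean + margin

interval_30 = mean_confidence_interval(975, 100, 30)
interval_100 = mean_confidence_interval(975, 100, 100)

for n, (low, high) in ((30, interval_30), (100, interval_100)):
    contains_claim = low <= 1000 <= high
    print(f"n={n}: [{low:.2f}, {high:.2f}] contains 1000: {contains_claim}")
```

The claim of `1000 requests` is compatible with the data only if `1000` lies inside the interval; a larger sample narrows the interval, so the same mean can lead to a different conclusion.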

### Question 2:
- A factory makes `12kg` bags of corn flakes. To test this, random samples are taken regularly. `100` bags are weighed. The first sample yields a mean of `11.9kg` and a standard deviation of `1kg`. We use an `alpha=0.01`.
     - Which factor (Z-value) will you use to determine the interval?
     - Do we need to adjust the machine?

- A second sample also yields a mean of `11.9kg`, but a standard deviation of `0.1kg`.
     - Which factor (Z-value) do we use this time?
     - Do we need to adjust the machine this time?
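
With `100` bags per sample, a two-sided z-test is a reasonable assumption. The sketch below compares the standardized difference with the critical value for `alpha=0.01`; the helper `z_check` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def z_check(sample_mean, sample_std, n, target=12, alpha=0.01):
    """Return (critical value, z statistic, whether the mean deviates significantly)."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    z = (sample_mean - target) / (sample_std / sqrt(n))
    return critical, z, abs(z) > critical

first = z_check(11.9, 1, 100)     # standard deviation 1 kg
second = z_check(11.9, 0.1, 100)  # standard deviation 0.1 kg

for name, (critical, z, adjust) in (("first", first), ("second", second)):
    print(f"{name}: critical = {critical:.3f}, z = {z:.2f}, adjust machine: {adjust}")
```

The critical value is the same for both samples; only the standard error changes, which is why the same mean of `11.9kg` can lead to different decisions.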

### Question 3:
- On behalf of a cheese factory, we investigate whether some suppliers tamper with their milk by adding water. We take `5` consecutive shipments of milk and see at what temperature they freeze. We know that the freezing point of milk is `-0.545°C` with a standard deviation of `0.008°C`. The freezing point of water is of course `0°C`. In our sample we find an average freezing point of `-0.539 °C`. Set `alpha=0.1`
     - Which test are we going to use?
     - Should we apply this test one-sided or two-sided?
     - Was the milk tampered with?
     - What is the probability that the previous answer is wrong?

### Question 4:
- We wish to hire a programmer. We subject the candidates to a test. We know that a good programmer gets an average score of `100` on this test. A few (`16`) students from KdG volunteered. We arrive at a mean of `107.3` and a standard deviation of `8.0`.
     - Take `alpha=0.05`. We suspect that this group deviates from the average population.
         - Which test do we use?
         - Which factor are we going to use?
     - We can construct an interval that contains the mean score of KdG students with `95%` confidence.
         - What is the lower limit of the interval?
         - What is the upper limit of the interval?
         - Can we say on the basis of this interval that KdG students score better than average?

## Exercises Tests Part 2

### Question 1:
- Open the file `/gamerps.csv`. Rock paper scissors is a game in which you choose one of these three options. Do our scout members prefer one of the choices? The expected distribution is uniform: each choice is equally likely. Check whether our count differs from the expected distribution.
    - Which test will you use for this?
    - What is `H0`? `The count in our sample is/is not evenly distributed.`
    - What is the value of `X^2`? (3 significant digits)
    - What is the probability that a sample would have a higher value than `X^2` (p-value)? (3 significant digits)
    - Can `H0` be rejected with `95%` confidence?
    - So can we say that the scout members have a preference?

### Question 2:
- Open the file `/ColorHairBrussels.csv`. This concerns data on the hair color of a sample of people in Brussels. The expected distribution of hair colors in Europe is: `30% blonde`, `12% red`, `30% brown`, `25% dark`, and `3% black`. Check whether our count differs from the expected distribution.
     - Which test will you use for this?
     - What is `H0`? `The hair color count in our sample does/does not deviate from the expected distribution.`
     - What `X^2` value do you find?
     - What `p-value` do you find?
     - Can `H0` be rejected with a reliability of `95%` or not?
     - Is there a deviation in Brussels compared to the expected distribution?

### Question 3:
- We wish to conduct a survey on students' favorite browser. We expect the following distribution:
     - Internet explorer: `8`
     - Opera: `10`
     - Mozilla Firefox: `10`
     - Google chrome: `12`
- When inquiring about `1` class group, the distribution appears as follows:
     - Internet explorer: `17`
     - Opera: `10`
     - Mozilla Firefox: `8`
     - Google chrome: `5`

- Question:
     - Which test are you going to apply?
     - Is this deviation within our expectations (`alpha=0.01`)? Yes/No
     - What is the critical value for this test?
     - What `X^2` value do you find?
     - What p-value do you find?
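
A sketch of how the numbers could be obtained with a chi-squared goodness-of-fit test, using `chisquare` and `chi2` from `scipy.stats` (both already imported at the top of this notebook). The variable names are illustrative.

```python
from scipy.stats import chi2, chisquare

# Order: Internet explorer, Opera, Mozilla Firefox, Google chrome
expected = [8, 10, 10, 12]
observed = [17, 10, 8, 5]
alpha = 0.01

# Goodness-of-fit test; degrees of freedom = number of categories - 1.
result = chisquare(f_obs=observed, f_exp=expected)
critical_value = chi2.ppf(1 - alpha, df=len(observed) - 1)

print(f"X^2 = {result.statistic:.3f}")
print(f"critical value = {critical_value:.3f}")
print(f"p-value = {result.pvalue:.4f}")
print("within expectations:", result.statistic <= critical_value)
```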

### Question 4:
- A programmer has written a Python class to generate random integers between `0` and `10`. This code looks like this:

## Exercises Association Rules

### Question 1: [Income questionnaire]
The [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) contains a number of interesting datasets, including the so-called AdultUCI dataset. This is a dataset with a questionnaire that a significant number of respondents completed about their income. In addition to an indication of the income level, it also contains some other attributes. Before we can use it, you have to make some adjustments to the data. The data management chapter is therefore useful here.

- 1.1. Use Pandas to read this data (`../Data/AdultUCI.csv`) as a data frame called adultUCI.

- 1.2. View the data set.

- 1.3. Remove the following columns from the data frame: `fnlwgt`, `education-num`, `capital-gain`, `capital-loss`.

- 1.4. Association rules cannot work with numerical data directly. Therefore, convert the numeric columns to categories:
    - Convert the **age** column to classes. The breaks of the classes are (`15`, `25`, `45`, `65`, `100`). Convert the classes to the following names (`Young`, `Middle-aged`, `Senior`, `Old`).
    - Convert the **hours-per-week** column to classes. The breaks of the classes are (`0`, `25`, `40`, `60`, `168`). Convert the classes to the following names (`Part-time`, `Full-time`, `Over-time`, `Workaholic`)

- 1.5. Convert the data frame to a transactions object with the Pandas `get_dummies` function. Use the parameter `prefix_sep='='`. Study the result.

- 1.6. Create a barchart of all items with a support of `0.1` or more.

- 1.7. Which two items have very high support (proportionally)? Can you conclude from this that the administered questionnaire is a good example of a random sample?

- 1.8. Apply the apriori and association_rules algorithms with the following parameters:
    - support= `0.05`,
    - confidence=`0.6`,
    - min_len=`2`, max_len=`3`.

- 1.9. You can use the following filter function in combination with the `.apply` function of a DataFrame: `def rule_filter(row, min_len, max_len):`. How many rules did the algorithm find?

- 1.10. View the rules with the highest confidence. What stands out?

- 1.11. Can you explain why there is such a high confidence in this case?

- 1.12. That rule and variations on that rule are pretty useless. Therefore, remove the `relationship` column from the original data.

- 1.13. Run the apriori algorithm again. Which rule has the greatest confidence?

- 1.14. If you look at the lift of this rule, would you still consider it a good association rule?

- 1.15. If a respondent indicates that he works overtime (`hours-per-week`) and has a limited income (`income = small`), in which age category can we expect him to be? How sure are you of that?

- 1.16. Describe what the lift says about the rule used in the previous question.

- 1.17. Does the combination of the three items from the previous 2 questions occur often? Which number did you use for this?

- 1.18. Have you come across a rule somewhere that contains `hours-per-week=Workaholic`? Can you explain why this is so?
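
Steps 1.4, 1.5 and 1.9 can be tried out on a tiny example before running them on the real data. The sketch below uses a hypothetical toy data frame and hypothetical rules (neither is taken from `AdultUCI.csv`) to show how `pd.cut`, `pd.get_dummies` and `rule_filter` fit together.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from AdultUCI.csv.
toy = pd.DataFrame({
    "age": [17, 25, 38, 50, 70],
    "hours-per-week": [20, 40, 45, 80, 60],
})

# Step 1.4: bin the numeric columns. pd.cut uses right-closed intervals
# by default, so an age of exactly 25 falls in (15, 25] and becomes "Young".
toy["age"] = pd.cut(toy["age"], bins=[15, 25, 45, 65, 100],
                    labels=["Young", "Middle-aged", "Senior", "Old"])
toy["hours-per-week"] = pd.cut(toy["hours-per-week"], bins=[0, 25, 40, 60, 168],
                               labels=["Part-time", "Full-time", "Over-time", "Workaholic"])

# Step 1.5: one column per "attribute=value" item.
transactions = pd.get_dummies(toy, prefix_sep="=")
print(list(transactions.columns))

# Step 1.9: keep rules whose total number of items lies in [min_len, max_len].
def rule_filter(row, min_len, max_len):
    length = len(row["antecedents"]) + len(row["consequents"])
    return min_len <= length <= max_len

# Hypothetical rules, using the antecedents/consequents columns
# that association_rules returns.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})],
    "consequents": [frozenset({"b"}), frozenset({"c"}), frozenset({"d"})],
})
filtered = toy_rules[toy_rules.apply(rule_filter, axis=1, args=(2, 3))]
print(len(filtered))
```

On the real data, the same `apply` call would be made on the DataFrame returned by `association_rules`.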

### Question 2: [Fruit promotion]
A supermarket wants to attract people to the store with a very strong promotion on fruit. Because the store does not make a profit on that promotion, it wants to compensate by placing another type of fruit right next to it and raising its price slightly, so that the profit margin on it can partially offset the loss. The store wants to know which fruit to promote and which fruit is most likely to be purchased together with the fruit on promotion.

- 2.1. Use the fruit preferences from the questionnaire dataset (`../DataFruitPurchase.csv`) to draw up the rules.

- 2.2. Create association rules using this list. Use the following parameters for the apriori algorithm or the fp-growth algorithm:
    - support=`0.1`
    - confidence=`0.3`
    - min_len=`2`, max_len=`2`

- 2.3. Find the association rule with the highest confidence.

- 2.4. Which fruit will the store promote based on that rule?

- 2.5. Based on that rule, which fruit will the store place next to the promotional item?

- 2.6. What percentage of the students who completed the questionnaire have the combination of the two fruit types in their top 3?

- 2.7. What can you say about the fruit in promotion based on the lift?