# Exercises: association rules

In [19]:
import pandas as pd # Data manipulation

def rule_filter(row, min_len, max_len):
    length = len(row['antecedents']) + len(row['consequents'])
    return min_len <= length <= max_len

## Theoretical questions

### Question 1: [Income questionnaire]
The [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) contains a number of interesting datasets, including the AdultUCI dataset. It holds the answers of a large number of respondents to a questionnaire about their income. Besides an indication of the income level, it contains several other attributes. Before you can use it, you have to make some adjustments to the data, so the data management chapter is useful here.

- 1.1. Use Pandas to read this data (`../Data/AdultUCI.csv`) as a data frame called adultUCI.

In [20]:
adultUCI = pd.read_csv('../Data/AdultUCI.csv', delimiter=';', decimal=',')

- 1.2. View the data set.

In [21]:
display(adultUCI.head())

| Unnamed: 0 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | small |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | small |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | small |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | small |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | small |


- 1.3. Remove the following columns from the data frame: `fnlwgt`, `education-num`, `capital-gain`, `capital-loss`.

In [22]:
adultUCI = adultUCI.drop(columns=['fnlwgt', 'education-num', 'capital-gain', 'capital-loss'])
display(adultUCI.head())

| Unnamed: 0 | age | workclass | education | marital-status | occupation | relationship | race | sex | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 40 | United-States | small |
| 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | United-States | small |
| 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | United-States | small |
| 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | United-States | small |
| 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | Cuba | small |


- 1.4. The association rule algorithms cannot work with numerical data. Therefore, convert the numeric columns to categories:
    - Convert the `age` column to classes. The breaks of the classes are (`15`, `25`, `45`, `65`, `100`). Name the classes (`Young`, `Middle-aged`, `Senior`, `Old`).
    - Convert the `hours-per-week` column to classes. The breaks of the classes are (`0`, `25`, `40`, `60`, `168`). Name the classes (`Part-time`, `Full-time`, `Over-time`, `Workaholic`).
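
One possible way to do this kind of binning is `pd.cut`. The sketch below runs on a small made-up frame (`toy` and its values are assumptions for illustration; only the column names, breaks, and labels come from the exercise).

```python
import pandas as pd

# Made-up values; only the column names, breaks and labels come from the exercise.
toy = pd.DataFrame({'age': [19, 39, 50, 70],
                    'hours-per-week': [20, 40, 13, 65]})

# pd.cut builds right-closed intervals by default: (15, 25], (25, 45], ...
toy['age'] = pd.cut(toy['age'],
                    bins=[15, 25, 45, 65, 100],
                    labels=['Young', 'Middle-aged', 'Senior', 'Old'])
toy['hours-per-week'] = pd.cut(toy['hours-per-week'],
                               bins=[0, 25, 40, 60, 168],
                               labels=['Part-time', 'Full-time', 'Over-time', 'Workaholic'])
print(toy)
```

Because the intervals are closed on the right, a value of exactly 40 hours falls in `Full-time`, not `Over-time`.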

- 1.5. Convert the data frame to a transactions object with the Pandas `get_dummies` function. Use the parameter `prefix_sep='='`. Study the result.
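
To get a feel for what `prefix_sep='='` does, here is a minimal sketch on a made-up frame (`toy` and its values, including `large`, are assumptions for illustration).

```python
import pandas as pd

# Made-up categorical data; the value 'large' is an assumption.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'income': ['small', 'small', 'large']})

# Each column=value combination becomes its own indicator column.
items = pd.get_dummies(toy, prefix_sep='=')
print(items.columns.tolist())
```

Each row of `items` can then be read as one transaction, and each indicator column as one item.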

- 1.6. Create a bar chart of all items with a support of 0.1 or more.
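
The support of a single item is the fraction of rows in which it occurs, so for a one-hot frame it is simply the column mean. A minimal sketch, assuming a made-up indicator frame named `items`:

```python
import pandas as pd

# Made-up indicator columns over 20 rows (values are assumptions).
items = pd.DataFrame({
    'race=White': [True] * 17 + [False] * 3,
    'sex=Male': [True] * 14 + [False] * 6,
    'native-country=Cuba': [True] * 1 + [False] * 19,
})

# Support per item = share of rows where the item is present.
support = items.mean().sort_values(ascending=False)
frequent = support[support >= 0.1]
print(frequent)
```

Calling `frequent.plot.bar()` on the result should draw the bar chart, provided matplotlib is installed.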

- 1.7. Which two items have very high support? Can you conclude from this that the administered questionnaire is a good example of a random sample?

- 1.8. Apply the apriori and association_rules algorithms with the following parameters:
    - support=`0.05`,
    - confidence=`0.6`,
    - minlen=`2`, maxlen=`3`.

- 1.9. You can use the following filter function in combination with the .apply function of a DataFrame:
```python
def rule_filter(row, min_len, max_len):
    length = len(row['antecedents']) + len(row['consequents'])
    return min_len <= length <= max_len
```
> How many rules did the algorithm find?
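
A minimal sketch of how the filter could be combined with `.apply`, assuming the rules table stores `antecedents` and `consequents` as sets (the hand-made `rules` frame and its values are assumptions for illustration):

```python
import pandas as pd

def rule_filter(row, min_len, max_len):
    length = len(row['antecedents']) + len(row['consequents'])
    return min_len <= length <= max_len

# Hand-made rules table; item names and confidences are made up.
rules = pd.DataFrame({
    'antecedents': [frozenset({'a'}), frozenset({'a', 'b'}), frozenset({'a', 'b'})],
    'consequents': [frozenset({'c'}), frozenset({'c'}), frozenset({'c', 'd'})],
    'confidence': [0.7, 0.8, 0.9],
})

# axis=1 passes each row to the function; extra keyword arguments are forwarded.
mask = rules.apply(rule_filter, axis=1, min_len=2, max_len=3)
filtered = rules[mask]
print(len(filtered))
```

The third rule contains four items in total, so it is dropped; `len(filtered)` then gives the number of remaining rules.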

- 1.10. View the rules with the highest confidence. What stands out?

- 1.11. Can you explain why there is such a high confidence in this case?

- 1.12. That rule and variations on that rule are pretty useless. Therefore, remove the 'relationship' column.

- 1.13. Run the apriori algorithm again. Which rule has the greatest confidence?

- 1.14. If you look at the lift of this rule, would you still consider it a good association rule?

- 1.15. If a respondent indicates that he works overtime (`hours-per-week`) and has a limited income (`income = small`), in which age category can we expect him to be? How sure are you of that?

- 1.16. Describe what the lift says about the rule used in the previous question.

- 1.17. Does the combination of the three items from the previous two questions occur often? Which number did you use to determine this?

- 1.18. Have you come across a rule anywhere that contains `hours-per-week=Workaholic`? Can you explain why this is so?

### Question 2: [Fruit promotion]
A supermarket wants to attract customers to the store with a very strong promotion on one type of fruit. Because it makes no profit on that promotion, it wants to compensate by placing another type of fruit right next to it and raising that fruit's price slightly, so that its profit margin partially offsets the loss. The store wants to know which fruit to promote and which fruit is most likely to be purchased together with the fruit on promotion.

- 2.1. Use the fruit preferences from the questionnaire dataset (`../Data/FruitPurchase.csv`) to draw up the rules.

In [23]:
fruit = pd.read_csv('../Data/FruitPurchase.csv', delimiter=';', decimal=',')
display(fruit.head())

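| Unnamed: 0 | Aardbei | Ananas | Appel | Banaan | Kers | Kiwi | Meloen | Peer | Pruim | Sinaasappel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | False | False | False | False | True | False | False | False |
| 1 | True | False | False | False | False | True | True | False | False | False |
| 2 | True | False | True | False | False | False | False | False | False | True |
| 3 | True | False | False | True | True | False | False | False | False | False |
| 4 | True | True | True | False | False | False | False | False | False | False |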


- 2.2. Create association rules using this list. Use the following parameters for the apriori algorithm or the fp-growth algorithm:
    - support=`0.1`
    - confidence=`0.3`
    - minlen=`2`, maxlen=`2`
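
To see how these parameters interact for rules of exactly two items, the sketch below enumerates all item pairs by brute force. It is not the apriori or fp-growth algorithm named in the exercise, only an illustration of the support, confidence, and lift definitions; the frame `toy` and its items `A`, `B`, `C` are made up.

```python
from itertools import permutations

import pandas as pd

# Made-up boolean purchase data over 10 rows (not the fruit dataset).
toy = pd.DataFrame({
    'A': [True] * 4 + [False] * 4 + [True] * 2,
    'B': [True] * 3 + [False] * 6 + [True] * 1,
    'C': [False] * 4 + [True] * 3 + [False] * 3,
})

rows = []
for a, c in permutations(toy.columns, 2):      # every ordered pair a -> c
    support = (toy[a] & toy[c]).mean()         # share of rows with both items
    if support < 0.1:
        continue
    confidence = support / toy[a].mean()       # P(c | a)
    if confidence < 0.3:
        continue
    lift = confidence / toy[c].mean()          # > 1: bought together more often than by chance
    rows.append({'antecedent': a, 'consequent': c,
                 'support': support, 'confidence': confidence, 'lift': lift})

rules = (pd.DataFrame(rows)
         .sort_values('confidence', ascending=False)
         .reset_index(drop=True))
print(rules)
```

Note that `A -> B` and `B -> A` share the same support and lift but differ in confidence, which is why the direction of a rule matters when choosing which item to promote.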

- 2.3. Find the association rule with the highest confidence.

- 2.4. Which fruit will the store promote based on that rule?

- 2.5. Based on that rule, which fruit will the store place next to the promotional item?

- 2.6. What percentage of the students who completed the questionnaire have both of these fruit types in their top 3?

- 2.7. What can you say about the fruit on promotion based on the lift?

## Practical exercises (Python)

### Exercise 1:

### Exercise 2:

### Exercise 3:

### Exercise 4: