## Task 1

### Implement the Apriori algorithm to analyse the Disease Symptom dataset, identifying common combinations of symptoms that frequently co-occur within the same disease profile.

### Approach:
- Data Exploration:
    - Load and explore the dataset structure and composition
    - Understand the schema
    - Identify the distribution of diseases and symptoms
    - Visualize symptom frequency and disease prevalence
- Itemset generation:
    - Treat each disease together with its symptoms as a single itemset
    - Count these disease-symptom itemsets
- Data Mirroring:
    - Apply controlled dataset augmentation for diseases with fewer transactions than the minimum threshold
    - Generate mirrored transactions by creating clinically plausible symptom subsets, e.g. {A,B,C,D} --> {A,B,C}, {A,B}
    - Balance augmented dataset to avoid over-representation of any single pattern
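
The mirroring step above could be sketched roughly as follows. This is only an illustration: `mirror_transaction` and its `min_size` parameter are hypothetical names, and the code enumerates every smaller subset without judging clinical plausibility, which would need a separate filter.

```python
from itertools import combinations

def mirror_transaction(symptoms, min_size=2):
    """Return the proper subsets of a symptom list, largest subsets first.

    Hypothetical helper: subsets smaller than `min_size` are skipped.
    """
    unique = list(dict.fromkeys(symptoms))  # drop duplicates, keep order
    mirrored = []
    for size in range(len(unique) - 1, min_size - 1, -1):
        mirrored.extend(list(combo) for combo in combinations(unique, size))
    return mirrored

subsets = mirror_transaction(["A", "B", "C", "D"], min_size=2)
print(subsets[0])    # ['A', 'B', 'C']
print(len(subsets))  # 10 (four 3-item subsets plus six 2-item subsets)
```

Balancing would then amount to sampling from these subsets so that no single pattern dominates the augmented data.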


In [None]:
import os
os.chdir('..')

from global_functions import *
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
disease_df = load_data_as_df()
disease_df

Successfully loaded data from RawData/dataset.csv
DataFrame shape: (4920, 18)


Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4915,(vertigo) Paroymsal Positional Vertigo,vomiting,headache,nausea,spinning_movements,loss_of_balance,unsteadiness,,,,,,,,,,,
4916,Acne,skin_rash,pus_filled_pimples,blackheads,scurring,,,,,,,,,,,,,
4917,Urinary tract infection,burning_micturition,bladder_discomfort,foul_smell_of urine,continuous_feel_of_urine,,,,,,,,,,,,,
4918,Psoriasis,skin_rash,joint_pain,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,,,,,,,,,,,


In [5]:
disease_df.groupby('Disease').count()

Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
(vertigo) Paroymsal Positional Vertigo,120,120,120,120,120,78,0,0,0,0,0,0,0,0,0,0,0
AIDS,120,120,120,78,0,0,0,0,0,0,0,0,0,0,0,0,0
Acne,120,120,120,78,0,0,0,0,0,0,0,0,0,0,0,0,0
Alcoholic hepatitis,120,120,120,120,120,120,78,0,0,0,0,0,0,0,0,0,0
Allergy,120,120,120,72,0,0,0,0,0,0,0,0,0,0,0,0,0
Arthritis,120,120,120,120,90,0,0,0,0,0,0,0,0,0,0,0,0
Bronchial Asthma,120,120,120,120,120,72,0,0,0,0,0,0,0,0,0,0,0
Cervical spondylosis,120,120,120,120,78,0,0,0,0,0,0,0,0,0,0,0,0
Chicken pox,120,120,120,120,120,120,120,120,120,120,66,0,0,0,0,0,0
Chronic cholestasis,120,120,120,120,120,120,78,0,0,0,0,0,0,0,0,0,0


#### Every disease has a healthy number of data points (120 rows each), so no mirroring is required at this stage.

### Furthermore, we will run Apriori on each disease profile individually: measured against the global dataset (~5000 rows), a single disease's 120 rows are too small a share for its symptom combinations to reach the support threshold.
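
As a quick check of that reasoning, the arithmetic below uses the row counts reported above (120 rows per disease, 4920 rows in total) and the `min_support` of 0.6 used later in this notebook. It assumes an itemset that occurs only within one disease's rows; symptoms shared across diseases could score somewhat higher.

```python
rows_per_disease = 120   # rows per disease, from the groupby count above
total_rows = 4920        # from the reported DataFrame shape
min_support = 0.6        # threshold passed to Apriori later in the notebook

# Highest support a disease-specific itemset can reach in the global dataset
global_support_ceiling = rows_per_disease / total_rows

print(round(global_support_ceiling, 4))       # 0.0244
print(global_support_ceiling >= min_support)  # False
```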

### Grouping diseases together with their symptoms

In [9]:
from pprint import pprint
disease_dataset = {}
for disease, group in disease_df.groupby('Disease'):
    # Build list of symptom sets per reported case
    transactions = group.iloc[:, 1:].apply(lambda row: [s for s in row if pd.notna(s)], axis=1).tolist()
    # For each disease, represent all transactions in a dataframe
    disease_dataset[disease] = pd.DataFrame.from_records(transactions)
pprint(disease_dataset['(vertigo) Paroymsal  Positional Vertigo'])

             0          1                    2                    3  \
0     vomiting   headache               nausea   spinning_movements   
1     vomiting   headache               nausea   spinning_movements   
2     headache     nausea   spinning_movements      loss_of_balance   
3     vomiting     nausea   spinning_movements      loss_of_balance   
4     vomiting   headache   spinning_movements      loss_of_balance   
..         ...        ...                  ...                  ...   
115   vomiting   headache               nausea   spinning_movements   
116   vomiting   headache               nausea   spinning_movements   
117   vomiting   headache               nausea   spinning_movements   
118   vomiting   headache               nausea   spinning_movements   
119   vomiting   headache               nausea   spinning_movements   

                    4              5  
0     loss_of_balance   unsteadiness  
1     loss_of_balance   unsteadiness  
2        unsteadiness         

### Applying the Apriori Algorithm to the dataset
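
For reference, the level-wise search that Apriori performs can be sketched in a few lines. This is a minimal illustration on toy data, not the `apriori` module imported below; the function names `support` and `frequent_itemsets` are assumptions made for this example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        previous = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return result

toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
found = frequent_itemsets(toy, min_support=0.5)
print(len(found))  # 6 (three single items and three pairs)

# Confidence of the rule a -> b is support({a, b}) / support({a})
confidence = found[frozenset({"a", "b"})] / found[frozenset({"a"})]
print(round(confidence, 2))  # 0.75
```

Storing itemsets as `frozenset` objects means {a, b} and {b, a} are counted once, so n distinct items yield at most C(n, k) itemsets of size k.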

In [7]:
os.chdir('Task_1/')

In [8]:
from apriori import Apriori

In [14]:
# Initialize Apriori
apriori = Apriori(min_support=0.6, min_confidence=0.4)
all_freq_results = []
all_rule_results = []

for disease, transactions in disease_dataset.items():
    print(f'==============={disease}==================')
    apriori.load_data(transactions)

    # Find frequent itemsets
    frequent_itemsets = apriori.find_frequent_itemsets()

    for size, itemsets in frequent_itemsets.items():
        for itemset in itemsets:
            all_freq_results.append({
                'disease': disease,
                'itemset_size': size,
                'itemset': ', '.join([apriori.idx_to_item[idx] for idx in itemset]),
                'support': apriori.calculate_support(itemset)
            })
            
    # Generate association rules
    association_rules = apriori.generate_association_rules()

    for rule in association_rules:
        all_rule_results.append({
            'disease': disease,
            'antecedent': ', '.join(list(rule['antecedent'])),
            'consequent': ', '.join(list(rule['consequent'])),
            'support': round(rule['support'], 3),
            'confidence': round(rule['confidence'], 3)
        })

freq_df = pd.DataFrame(all_freq_results)
rules_df = pd.DataFrame(all_rule_results)

freq_df.to_csv('disease_frequent_itemsets.csv', index=False)
rules_df.to_csv('disease_association_rules.csv', index=False)

print("Saved frequent itemsets and association rules to CSV.")

Loaded 120 transactions with 6 unique items.
Finding frequent itemsets...
Found 6 frequent 1-itemsets
Found 30 frequent 2-itemsets
Found 40 frequent 3-itemsets
Found 30 frequent 4-itemsets
Found 12 frequent 5-itemsets
Found 2 frequent 6-itemsets
Found 0 frequent 7-itemsets
Frequent itemset mining completed in 0.01 seconds
Generating association rules...
Association rules completed in 0.02 seconds
Generated 1204 rules
Loaded 120 transactions with 4 unique items.
Finding frequent itemsets...
Found 4 frequent 1-itemsets
Found 12 frequent 2-itemsets
Found 8 frequent 3-itemsets
Found 2 frequent 4-itemsets
Found 0 frequent 5-itemsets
Frequent itemset mining completed in 0.00 seconds
Generating association rules...
Association rules completed in 0.00 seconds
Generated 100 rules
Loaded 120 transactions with 4 unique items.
Finding frequent itemsets...
Found 4 frequent 1-itemsets
Found 12 frequent 2-itemsets
Found 8 frequent 3-itemsets
Found 2 frequent 4-itemsets
Found 0 frequent 5-itemsets
Fre
