# Classification rules for covid dataset


## 1. Classification Rules Implementation

### 1.1. PRISM algorithm
In this lab we are going to implement the PRISM algorithm to extract the classification rules with the highest accuracy and coverage from the hospital patients dataset described in the class demo notebook.

Our algorithm extracts rules ranked by accuracy, from highest to lowest; ties are broken by choosing the rule with the higher coverage. If both accuracy and coverage are the same, the condition is selected arbitrarily.
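
The ranking described above can be sketched on a few candidate rules. This is a minimal illustration, not the implementation used later in the notebook: the helper `score_condition` and the tiny weather table are assumptions made for this example, and the last column is assumed to hold the class label.

```python
import pandas as pd

# Toy data; the last column ("Play") is assumed to be the class label.
toy = pd.DataFrame(
    [["s", "h", "y"], ["s", "h", "y"], ["r", "c", "n"], ["s", "h", "n"], ["r", "c", "y"]],
    columns=["Outlook", "Temp", "Play"],
)

def score_condition(df, attr, value, label):
    """Return (accuracy, coverage) of the rule 'attr = value -> label'."""
    covered = df[df[attr] == value]
    coverage = len(covered)
    correct = int((covered[df.columns[-1]] == label).sum())
    accuracy = correct / coverage if coverage else 0.0
    return accuracy, coverage

# Hypothetical candidates as (attribute, value, class label)
candidates = [("Outlook", "r", "y"), ("Outlook", "s", "y"), ("Outlook", "r", "n")]
scored = [(score_condition(toy, a, v, c), (a, v, c)) for a, v, c in candidates]

# Sorting the (accuracy, coverage) tuples in descending order puts the most
# accurate rule first and breaks accuracy ties by the higher coverage.
ranked = sorted(scored, key=lambda item: item[0], reverse=True)
best_score, best_candidate = ranked[0]
print(best_candidate, best_score)
```

Comparing `(accuracy, coverage)` tuples keeps the tie-breaking logic in one place instead of spreading it over nested `if` statements.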


In [None]:
import pandas as pd
import numpy as np

We first implement the algorithm for learning one rule. Besides the dataset, we also pass two parameters:

- the accuracy threshold: a number from 0 to 1 that specifies which rules are considered valid. If the best accuracy reached after refining a rule, while staying within the coverage threshold, is still below this threshold, the rule is not added to the solution.

- the coverage threshold: the absolute number of records a rule must cover. If a more precise rule covers fewer records than this threshold, the algorithm stops refining the rule.
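
How the two thresholds interact can be sketched with two small predicates. The helper names `may_refine` and `is_valid_rule` and the numbers are hypothetical, chosen only to illustrate the description above.

```python
def may_refine(coverage, coverage_threshold):
    """A refined rule is kept only if it still covers enough records."""
    return coverage >= coverage_threshold

def is_valid_rule(accuracy, accur_threshold):
    """A finished rule is reported only if it is accurate enough."""
    return accuracy >= accur_threshold

# Successive refinements of one rule as (accuracy, coverage) pairs, from the
# most general to the most specific. The numbers are made up.
refinements = [(0.70, 120), (0.85, 40), (0.95, 8)]

coverage_threshold = 20
accur_threshold = 0.8

accepted = None
for accuracy, coverage in refinements:
    if not may_refine(coverage, coverage_threshold):
        break  # too few records: stop refining and keep the previous version
    accepted = (accuracy, coverage)

# The rule is added to the solution only if it passes the accuracy threshold
keep = accepted is not None and is_valid_rule(accepted[0], accur_threshold)
print(accepted, keep)
```

In this sketch the last refinement is more accurate but covers only 8 records, so the rule stops at the second refinement, which then passes the accuracy threshold.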

In [None]:
def learn_one_rule(data, accur_threshold, coverage_threshold):
    # The last column holds the class label; all other columns are attributes
    columns = list(data.columns)
    class_labels = columns[-1]
    cur_attr = columns[:-1]      # attributes not yet used in the rule

    # Initialize the rule with an empty lhs; it covers the whole dataset
    rule = [[], ""]
    rule_accur = 0
    cur_data = data

    while cur_attr:
        # For each attribute a not mentioned in the rule, and each value v,
        # consider adding the condition a = v to the lhs.
        # Every attribute is compared by equality, so numeric columns are
        # treated as categorical values.
        best_accur = 0
        best_coverage = 0
        best_pair = []
        best_class = ""

        for a in cur_attr:
            for v in cur_data[a].unique():
                covered = cur_data[cur_data[a] == v]
                coverage = covered.shape[0]
                if coverage == 0:    # e.g. v is NaN, which never equals itself
                    continue
                for clas in covered[class_labels].unique():
                    # compute accuracy of "a = v -> clas" on the covered records
                    correct = covered[covered[class_labels] == clas].shape[0]
                    accur = correct / coverage

                    # keep the pair with the best accuracy; break ties by coverage
                    if (accur > best_accur) or (accur == best_accur and coverage > best_coverage):
                        best_accur = accur
                        best_coverage = coverage
                        best_pair = [a, v]
                        best_class = clas

        # stop refining if the refined rule covers too few records
        # or does not improve the accuracy
        if not best_pair or best_coverage < coverage_threshold or best_accur <= rule_accur:
            break

        # add the condition a = v to the lhs and restrict the covered records
        a, v = best_pair
        rule[0].append(best_pair)
        rule[1] = best_class
        rule_accur = best_accur
        cur_attr.remove(a)
        cur_data = cur_data[cur_data[a] == v]

        if rule_accur == 1:
            break

    # Remove the records covered by the rule so that the next call makes
    # progress; an empty lhs covers everything and leaves no data behind
    data = data.drop(cur_data.index)

    # discard the rule if its accuracy is below the threshold
    if rule_accur < accur_threshold:
        rule = [[], ""]
    return rule, data
    

In [None]:
def prism(data,accur_threshold, coverage_threshold):
    rules = []
    while data.shape[0] > 0:
        rule,data = learn_one_rule(data,accur_threshold, coverage_threshold)
        if len(rule[0]) >0:
            rules.append(rule)
    return rules

In [None]:
def print_rules(data, rules):
    class_labels = data.columns[-1]

    for r in rules:
        # Start from the full dataset for every rule, so that the filter of
        # one rule does not leak into the next
        covered = data
        for attr, value in r[0]:
            covered = covered[covered[attr] == value]
        conditions = ", ".join("{} = '{}'".format(attr, value) for attr, value in r[0])

        coverage = covered.shape[0]
        correct = covered[covered[class_labels] == r[1]].shape[0]
        accur = correct / coverage if coverage > 0 else 0
        print("If {}, then {} = '{}'".format(conditions, class_labels, r[1]))
        print("Accuracy: {}, Coverage: {}".format(accur, coverage))

We can test the functions above on a toy example of weather data:

In [None]:
data = [["s","h","y"],["s","h","y"],["r","c","n"],["s","h","n"],["r","c","y"]]
data = pd.DataFrame (data,columns=['Outlook','Temp','Play'])
# rule,data = learn_one_rule(data,0.9,1)
rules = prism(data,0.9,2)
print_rules(data,rules)

## 2. Coronavirus dataset application

Finally, we can apply the algorithm to the COVID-19 dataset to learn reliable rules that show which symptoms and preexisting conditions, alone or in combination, lead to a fatal outcome of a COVID-19 infection.

This Mexican dataset, which contains information from the Statistical Yearbooks of Morbidity 2015-2017 as well as information on cases associated with COVID-19, was found on [Kaggle](https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset).

This preprocessed dataset contains only patients who tested positive for COVID-19, with the symptom attributes converted to categorical values.

In this dataset we have the following attributes:
1. sex: 1 - woman, 2 - man
2. age: numeric
3. diabetes: yes/no
4. copd (chronic obstructive pulmonary disease): yes/no
5. asthma: yes/no
6. imm_supr (suppressed immune system): yes/no
7. hypertension: yes/no
8. cardiovascular: yes/no
9. renal_chronic: yes/no
10. tobacco: yes/no
11. outcome: alive/dead
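
Because the functions in section 1 treat the last column as the class label, it can help to check the layout of a table before running them. The sketch below is only an illustration: the column names are taken from the list above, their exact spelling in the CSV file is an assumption, and `check_layout` is a hypothetical helper.

```python
import pandas as pd

# Column names follow the attribute list above; the exact spelling in the
# CSV file is an assumption.
EXPECTED_COLUMNS = [
    "sex", "age", "diabetes", "copd", "asthma", "imm_supr",
    "hypertension", "cardiovascular", "renal_chronic", "tobacco", "outcome",
]

def check_layout(df):
    """Return a list of problems found in df; an empty list means it fits."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append("missing columns: {}".format(missing))
    if df.columns[-1] != "outcome":
        problems.append("the class label 'outcome' must be the last column")
    return problems

# Two made-up records, only to exercise the check
sample = pd.DataFrame(
    [[1, 54, "no", "no", "no", "no", "yes", "no", "no", "no", "alive"],
     [2, 71, "yes", "no", "no", "no", "yes", "yes", "no", "yes", "dead"]],
    columns=EXPECTED_COLUMNS,
)
print(check_layout(sample))                      # no problems expected
print(check_layout(sample[["outcome", "age"]]))  # wrong order, columns missing
```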

In [None]:
data_file = "../data_set/covid_categorical_good.csv"

In [None]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

In [None]:
data_rows = data.to_numpy().tolist()
len(data_rows)

In [None]:
columns_list = data.columns.to_numpy().tolist()
print(columns_list)

### 2.1. Using classification algorithm

In [None]:
rules = prism(data,0.8,20)
print_rules(data,rules)

Running the classification algorithm on the COVID dataset gave the following rule:

If imm_supr = 'no', copd = 'no', asthma = 'no', renal_chronic = 'no', cardiovascular = 'no', then outcome = 'alive'

If the patient has no suppressed immune system, no chronic obstructive pulmonary disease, no asthma, no chronic renal disease, and no cardiovascular disease, then the rule indicates that he or she will very likely be alive after testing positive for COVID-19.

Accuracy: 0.8902581098724516, Coverage: 199140
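
The accuracy and coverage of a rule like this one can be re-checked independently with a boolean mask. The following is a minimal sketch on made-up records: `evaluate_rule` is a hypothetical helper, and the sample table only borrows two column names from the dataset.

```python
import pandas as pd

def evaluate_rule(df, conditions, label, class_col="outcome"):
    """Return (accuracy, coverage) of 'conditions -> label' on df."""
    mask = pd.Series(True, index=df.index)
    for attr, value in conditions.items():
        mask &= df[attr] == value          # records must satisfy every condition
    coverage = int(mask.sum())
    correct = int((df.loc[mask, class_col] == label).sum())
    accuracy = correct / coverage if coverage else float("nan")
    return accuracy, coverage

# Made-up records, not taken from the COVID dataset
sample = pd.DataFrame({
    "imm_supr": ["no", "no", "yes", "no"],
    "copd": ["no", "no", "no", "no"],
    "outcome": ["alive", "dead", "dead", "alive"],
})

accuracy, coverage = evaluate_rule(sample, {"imm_supr": "no", "copd": "no"}, "alive")
print(accuracy, coverage)
```

On the real data, the same call with the five conditions of the rule and the label `'alive'` should reproduce the reported accuracy and coverage, provided the column names and values match the ones assumed here.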