# Classification rules for covid dataset


## 1. Classification Rules Implementation

### 1.1. PRISM algorithm
In this lab we are going to implement the PRISM algorithm to extract the classification rules with the highest accuracy and coverage from the hospital patients dataset described in the class demo notebook.

Our algorithm extracts the rules ranked by the accuracy (from highest to lowest), and the ties are broken by choosing the rule with higher coverage. If both accuracy and coverage are the same - the condition is selected arbitrarily.


In [1]:
import pandas as pd
import numpy as np

We first implement the algorithm for learning one rule. Besides the dataset, we also pass two optional parameter:

the accuracy threshold - the number from 0 to 1 which specifies when you need to stop refining the rule. I.e. if you reached this accuracy threshold, you do not need to add more conditions to the rule's antedescent.

the coverage threshold - the absolute number of records covered by the rule. If the more precise rule covers less records than this threshold, the algorithm should stop refining this rule.

In [2]:
def learn_one_rule(data, accur_threshold, coverage_threshold):
    # Find class labels and attributes from data
    columns = list(data.columns)
    class_labels = columns[-1]
    attributes = columns[:-1]
    
    # Initialize the rule with empty lhs
    rule_lhs = []
    rule = [rule_lhs,""]
    cur_attr = attributes
    cur_data = data
    cur_rule = rule[:]

    while cur_attr:
        # For each attribute a not mentioned in the rule, and each attr value v, consider adding the condition a = v to rule_lhs
        best_accur = 0
        best_pair = []
        best_class = ""

        for clas in cur_data[class_labels].unique():
            cur_data = cur_data[cur_data[class_labels]==clas]
            for a in cur_attr:
                if cur_data[a].dtypes != np.int64 or np.float64:
                    values = cur_data[a].unique()
                else:
                    values = cur_data[a].mean()
                
                for v in values:
                    cur_rule[0] = [a,v]
                    cur_rule[1] = clas
                    print(cur_rule)

                    # compute accuracy and coverage
                    if cur_data[a].dtype != "int64" or "float64":
                        coverage = cur_data[(cur_data[a]==v) & (cur_data[class_labels]==clas)].shape[0]
#                         print(cur_data[(cur_data[a]==v) & (cur_data[class_labels]==clas)])
#                         print("WHYYYY",cur_data[cur_data[a]==v])
                        accur = coverage /cur_data[cur_data[a]==v].shape[0]
                    else:
                        coverage = cur_data[(cur_data[a]>=v) & (cur_data[class_labels]==clas)].shape[0]
                        accur = coverage /cur_data[cur_data[a]>=v].shape[0]
            
                    # select the rule of the attribute,value pair with best accuracy
                    if (accur > best_accur):
                        best_accur = accur
                        best_pair = [a,v]
                        best_class = clas
                    elif (accur == best_accur):
                        cur_correct = coverage
                        correct_so_far = cur_data[(cur_data[best_pair[0]]==best_pair[1]) & (cur_data[class_labels]==clas)].shape[0]
                        if cur_correct > correct_so_far:
                            best_pair = [a,v]
                            best_class = clas
        print(accur,coverage,(accur > accur_threshold),best_pair)

        # check accuracy threshold, coverage threshold
        if (accur > accur_threshold) or (coverage < coverage_threshold):
            # add condition a = v to the LHS of rule R
            rule[0].append(best_pair)
            rule[1] = best_class
            print(rule)
            if len(best_pair)!=0:
                a,v = best_pair[0],best_pair[1]
                cur_attr.remove(a)
                cur_data = data[data[a]==v]
                data = data[data[a]!=v]
            break
        print(rule)
        if data.empty:
            break
    return rule,data
    

In [3]:
def prism(data,accur_threshold, coverage_threshold):
    rules = []
    while data.shape[0] > 0:
        rule,data = learn_one_rule(data,accur_threshold, coverage_threshold)
        rules.append(rule)
    return rules

In [4]:
def print_rules(data,rules):
    columns = list(data.columns)
    class_labels = columns[-1]
    attributes = columns[:-1]

    for r in rules:
        print("If ",end="")
        for ele in r[0]:
            coverage = data[(data[ele[0]]==ele[1]) & (data[class_labels]==r[1])].shape[0]
            accur = coverage / data[data[ele[0]]==ele[1]].shape[0]
            print("{} = '{}', ".format(ele[0],ele[1]),end="")
        print("then {} = '{}'".format(class_labels,r[1]))
        print("Accuracy: {}, Coverage: {}".format(accur, coverage))

We can test the above function with a toy example of weather data

In [5]:
data = [["s","h","y"],["s","h","y"],["r","c","n"],["s","c","n"],["r","c","y"]]
data = pd.DataFrame (data,columns=['Outlook','Temp','Play'])
# rule,data = learn_one_rule(data,0.9,1)
rules = prism(data,0.9,3)
print_rules(data,rules)

[['Outlook', 's'], 'y']
[['Outlook', 'r'], 'y']
[['Temp', 'h'], 'y']
[['Temp', 'c'], 'y']
1.0 1 True ['Outlook', 's']
[[['Outlook', 's']], 'y']
[['Outlook', 'r'], 'n']
[['Temp', 'c'], 'n']
1.0 1 True ['Outlook', 'r']
[[['Outlook', 'r']], 'n']
If Outlook = 's', then Play = 'y'
Accuracy: 0.6666666666666666, Coverage: 2
If Outlook = 'r', then Play = 'n'
Accuracy: 0.5, Coverage: 1


## 2. Coronavirus dataset application

Finally, we can apply the algorithm to the COVID-19 dataset to learn reliable rules which determine which symptoms/preexisting conditions and their combination lead to the deadly outcomes of the COVID-19 infection.

Tis Mexican dataset which contains the information from the Statistical Yearbooks of Morbidity 2015-2017 (as well as the information regarding cases associated with COVID-19) was found on [kaggle](https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset).

This preprocessed dataset contains only patients that tested positive for COVID-19 and with symptom atributes converted to categorical.

In this dataset we have the following attributes:
1. sex: 1 -woman, 2-man
2. age: numeric
3. diabetes: yes/no
4. copd (chronic obstructive pulmonary disease): yes/no
5. asthma: yes/no
6. imm_supr (suppressed immune system): yes/no
7. hypertension: yes/no
8. cardiovascular: yes/no
9. renal_chronic: yes/no
10. tobacco: yes/no	
11. outcome: alive/dead

In [6]:
data_file = "../data_set/covid_categorical_good.csv"

In [7]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

In [8]:
data_rows = data.to_numpy().tolist()
len(data_rows)

219179

In [249]:
columns_list = data.columns.to_numpy().tolist()
print(columns_list)

['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension', 'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome']


### 2.1. Using classification algorithm

In [9]:
rules = prism(data,0.8,30)
print_rules(data,rules)
# print(data[:10])

[['sex', 'male'], 'alive']
[['sex', 'female'], 'alive']
[['age', 27], 'alive']
[['age', 24], 'alive']
[['age', 54], 'alive']
[['age', 30], 'alive']
[['age', 63], 'alive']
[['age', 56], 'alive']
[['age', 41], 'alive']
[['age', 39], 'alive']
[['age', 46], 'alive']
[['age', 45], 'alive']
[['age', 28], 'alive']
[['age', 34], 'alive']
[['age', 38], 'alive']
[['age', 49], 'alive']
[['age', 25], 'alive']
[['age', 40], 'alive']
[['age', 31], 'alive']
[['age', 33], 'alive']
[['age', 52], 'alive']
[['age', 83], 'alive']
[['age', 43], 'alive']
[['age', 37], 'alive']
[['age', 32], 'alive']
[['age', 47], 'alive']
[['age', 42], 'alive']
[['age', 53], 'alive']
[['age', 48], 'alive']
[['age', 59], 'alive']
[['age', 44], 'alive']
[['age', 65], 'alive']
[['age', 73], 'alive']
[['age', 51], 'alive']
[['age', 50], 'alive']
[['age', 35], 'alive']
[['age', 67], 'alive']
[['age', 61], 'alive']
[['age', 55], 'alive']
[['age', 60], 'alive']
[['age', 29], 'alive']
[['age', 79], 'alive']
[['age', 58], 'alive']
[

We discover that with classification algorithm the rules for covid dataset are:

If imm_supr = 'no', then outcome = 'alive' 
Accuracy: 0.879777224745816, Coverage: 190192

If imm_supr = 'yes', then outcome = 'alive' 
Accuracy: 0.7430764097430764, Coverage: 2227