# TASK

## Implement Rule Compression

You are [provided](https://drive.google.com/file/d/1i897nQmji1wUK7p8p_4Y6LaWL0RPoBos/view?usp=sharing) with a portion of a real dataset describing donors categorized into two age groups: young and old. The dataset includes various biomarkers measured from donor blood. Each biomarker is represented as a logical predicate that evaluates to true when the biomarker's value is relatively high.

## Task:

A system has generated a set of rules to identify which biomarkers (or biomarker combinations) are indicative of an "old" donor. Your objective is to compress this rule set while preserving the key insights.

Each rule follows the structure LHS => donor_is_old. The left-hand side (LHS) contains one or more predicates, which may appear in their normal or negated form, combined using the AND keyword. Negations are applied only to individual predicates, using the keyword NOT, which must immediately precede the predicate name. The right-hand side (RHS) always consists solely of the predicate donor_is_old.

Examples:

TNF_α => donor_is_old

Monocytes AND NOT sCD86 => donor_is_old
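The rule grammar described above is small enough to parse with plain string operations. The following is a minimal sketch, not part of the task statement: the function name `parse_rule_literals` and the `(predicate, is_negated)` tuple representation are assumptions chosen for illustration.

```python
from typing import List, Tuple

# (predicate name, is_negated); this representation is an illustrative assumption.
Literal = Tuple[str, bool]

def parse_rule_literals(rule: str) -> List[Literal]:
    """Split 'A AND NOT B => donor_is_old' into [('A', False), ('B', True)]."""
    lhs, rhs = rule.split("=>")
    if rhs.strip() != "donor_is_old":
        raise ValueError(f"unexpected right-hand side: {rhs.strip()!r}")
    literals = []
    # Split on " AND " (with spaces) so predicate names containing "AND" survive.
    for term in lhs.strip().split(" AND "):
        term = term.strip()
        if term.startswith("NOT "):
            literals.append((term[4:].strip(), True))
        else:
            literals.append((term, False))
    return literals

print(parse_rule_literals("TNF_α => donor_is_old"))
print(parse_rule_literals("Monocytes AND NOT sCD86 => donor_is_old"))
```

Keeping the negation flag separate from the predicate name makes it straightforward to look the predicate up as a dataset column later.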

## Format:

We expect your program to take two files as input: the original dataset used to generate the rules and the list of generated rules. We provide example files ([rules](https://drive.google.com/file/d/1QS7_65o63YzX4R8Ci0XUagIqxnfY0kSZ/view?usp=sharing), [dataset](https://drive.google.com/file/d/1i897nQmji1wUK7p8p_4Y6LaWL0RPoBos/view?usp=sharing)) that can be used to evaluate your program. The dataset might contain NAs. Your program should generate a file containing the compressed rule set.
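One possible way to satisfy this input/output contract is a small command-line interface. The sketch below is an assumption about how such a program could be invoked; the argument names and the `-o/--output` option are hypothetical, and only the file names come from the notebook below.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Argument names are illustrative; the task only fixes "two input files, one output file".
    parser = argparse.ArgumentParser(description="Compress a rule set.")
    parser.add_argument("dataset", help="path to the TSV dataset")
    parser.add_argument("rules", help="path to the rules file, one rule per line")
    parser.add_argument("-o", "--output", default="compressed_rules.txt",
                        help="where to write the compressed rule set")
    return parser

# Parse an explicit argument list so the example also runs inside a notebook.
args = build_parser().parse_args(["dataset.tsv", "rules.txt"])
print(args.dataset, args.rules, args.output)
```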

Provide the solution as a GitHub repository with a README file.
The README file must include a text description of the heuristics used to compress the rule set.
If the GitHub repository is private, please grant read-only access to the mentor (the mentor's email is listed in the project description).

## Possible directions

*   Identify rules that can be merged without loss of information.
*   Remove redundant or overly specific rules.
*   Implement a scoring mechanism to rank rules by usefulness.

There is no single correct solution to this problem; we want to see how you decide what is important for rule compression.

Use of AI-based code assistant tools is not prohibited, but an interview will contain questions regarding key decisions of your implementation.




# Solution

First, install gdown so that we can download the files from Google Drive:

In [1]:
!pip install --quiet gdown

Now we can download the dataset and the rules file:

In [2]:
!gdown --id 1i897nQmji1wUK7p8p_4Y6LaWL0RPoBos -O dataset.tsv
!gdown --id 1QS7_65o63YzX4R8Ci0XUagIqxnfY0kSZ -O rules.txt

Downloading...
From: https://drive.google.com/uc?id=1i897nQmji1wUK7p8p_4Y6LaWL0RPoBos
To: /content/dataset.tsv
100% 7.35k/7.35k [00:00<00:00, 22.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QS7_65o63YzX4R8Ci0XUagIqxnfY0kSZ
To: /content/rules.txt
100% 1.93k/1.93k [00:00<00:00, 8.73MB/s]


Import key libraries

In [12]:
import pandas as pd
import numpy as np

Loading the dataset

In [21]:
df = pd.read_csv('dataset.tsv', sep='\t', na_values=["NA"])
df = df.replace({'TRUE': True, 'FALSE': False})
print("Dataset loaded with shape:", df.shape)
df

Dataset loaded with shape: (39, 32)


Unnamed: 0,donor_is_old,BMI,WBC,Neutrophils,Lymphocytes,Monocytes,Eosinophils,Basophils,Neutrophils_percent,Lymphocytes_percent,...,IL_2,IL_4,IL_6,IL_8,TNF_α,sCD86,GDF_15,SOST,OMD,Notch_1
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,True
1,False,True,False,False,False,False,False,True,False,True,...,False,False,True,False,False,False,False,False,False,False
2,False,True,True,True,True,False,False,False,False,False,...,True,True,True,False,True,False,False,False,,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,True,False,False,False,False,False,True,True,False,...,False,False,True,True,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,True,True,False,False,True,False,False,False,True
6,False,False,True,False,True,False,False,False,False,True,...,False,False,True,False,False,False,False,True,True,False
7,False,False,False,,True,False,,False,,True,...,False,False,True,False,False,False,,,False,False
8,False,False,False,False,False,False,False,True,True,False,...,False,False,True,False,False,True,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,True,True,True,False,False,False,False,False,False,True


Next, read the rules file:

In [22]:
with open("rules.txt", "r") as file:
    raw_rules = file.readlines()

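Then we define helper functions that parse a rule and evaluate it against a dataset row. A condition on a missing value (NA) counts as not satisfied.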

In [23]:
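def parse_rule(rule_line):
    # Keep only the left-hand side; the right-hand side is always donor_is_old.
    lhs, _ = rule_line.split("=>")
    # Split on " AND " (with spaces) so predicate names containing "AND" stay intact.
    conditions = [cond.strip() for cond in lhs.strip().split(" AND ")]
    return conditions

def evaluate_condition(row, condition):
    # A negated condition expects False in the column, a plain one expects True.
    negated = condition.startswith("NOT ")
    col = condition[4:].strip() if negated else condition
    # Unknown predicates and missing values (NA) never satisfy a condition.
    if col not in row.index or pd.isna(row[col]):
        return False
    return row[col] == (not negated)

def evaluate_rule(row, conditions):
    # The column lookup uses the predicate name inside evaluate_condition;
    # checking the full condition string (including "NOT ") against row.index
    # would make every negated condition evaluate to False.
    return all(evaluate_condition(row, cond) for cond in conditions)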
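The row-by-row helpers above can also be expressed with column operations, which avoids calling a Python function once per row. This is a sketch under assumptions: `rule_mask` is a hypothetical name and the toy DataFrame is made up for illustration. It keeps the same convention that NA never satisfies a condition.

```python
import pandas as pd

def rule_mask(df: pd.DataFrame, conditions: list) -> pd.Series:
    """Boolean mask of the rows that satisfy every condition of a rule."""
    mask = pd.Series(True, index=df.index)
    for cond in conditions:
        negated = cond.startswith("NOT ")
        col = cond[4:].strip() if negated else cond
        if col not in df.columns:
            # An unknown predicate can never be satisfied.
            return pd.Series(False, index=df.index)
        # Equality with True/False is False for NA, so missing values never match.
        mask = mask & (df[col] == (not negated))
    return mask

# Made-up data: the last row has a missing Monocytes value.
toy = pd.DataFrame({
    "Monocytes": [True, True, False, None],
    "sCD86": [False, True, False, False],
})
print(rule_mask(toy, ["Monocytes", "NOT sCD86"]).tolist())
```

Only the first toy row has Monocytes set and sCD86 unset, so it is the only row the rule `Monocytes AND NOT sCD86` matches.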


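Let's calculate the metrics for each rule. Support is the fraction of donors that satisfy the rule's LHS, confidence is the fraction of those donors that are old, and the score is the product of the two.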

In [24]:
rule_metrics = []
total_rows = len(df)
for rule_line in raw_rules:
    conditions = parse_rule(rule_line)
    rule_mask = df.apply(lambda row: evaluate_rule(row, conditions), axis=1)
    support_count = rule_mask.sum()
    support = support_count / total_rows if total_rows > 0 else 0

    if support_count > 0:
        confidence = df.loc[rule_mask, "donor_is_old"].sum() / support_count
    else:
        confidence = 0

    score = support * confidence
    rule_metrics.append({
        "rule": rule_line,
        "conditions": set(conditions),
        "support": support,
        "confidence": confidence,
        "score": score
    })
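To make the three metrics concrete, here is a small worked example on made-up data (10 hypothetical donors, not taken from the dataset). The helper name `rule_stats` is an assumption for illustration.

```python
def rule_stats(matches, is_old):
    """Return (support, confidence, score) for one rule.

    matches[i] is True when donor i satisfies the rule's LHS;
    is_old[i] is True when donor i is old.
    """
    total = len(matches)
    matched = sum(matches)
    hits = sum(1 for m, o in zip(matches, is_old) if m and o)
    support = matched / total if total else 0.0
    confidence = hits / matched if matched else 0.0
    return support, confidence, support * confidence

# Made-up example: the rule matches 4 of 10 donors, and 3 of those 4 are old.
matches = [True, True, True, True, False, False, False, False, False, False]
is_old = [True, True, True, False, True, False, False, False, False, False]
support, confidence, score = rule_stats(matches, is_old)
print(f"support={support:.2f} confidence={confidence:.2f} score={score:.2f}")
# prints: support=0.40 confidence=0.75 score=0.30
```

Note that support × confidence simplifies to (matching old donors) / (all donors), here 3/10, so the score rewards rules that cover many old donors rather than rules that are merely precise.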

Let's see what we got...

In [25]:
print("Original rules and their metrics:")
for rm in rule_metrics:
    print(f"{rm['rule']} -> Support: {rm['support']:.2f}, Confidence: {rm['confidence']:.2f}, Score: {rm['score']:.2f}")


Original rules and their metrics:
Monocytes AND NOT sCD86 => donor_is_old
 -> Support: 0.00, Confidence: 0.00, Score: 0.00
TNF_α => donor_is_old
 -> Support: 0.26, Confidence: 0.70, Score: 0.18
NOT Hb AND NOT Monocytes AND NOT Notch_1 => donor_is_old
 -> Support: 0.00, Confidence: 0.00, Score: 0.00
OMD => donor_is_old
 -> Support: 0.26, Confidence: 0.80, Score: 0.21
NOT Hb AND NOT Notch_1 => donor_is_old
 -> Support: 0.00, Confidence: 0.00, Score: 0.00
IL_10 => donor_is_old
 -> Support: 0.26, Confidence: 0.60, Score: 0.15
NOT Notch_1 => donor_is_old
 -> Support: 0.00, Confidence: 0.00, Score: 0.00
SOST => donor_is_old
 -> Support: 0.26, Confidence: 0.90, Score: 0.23
WBC => donor_is_old
 -> Support: 0.26, Confidence: 0.70, Score: 0.18
Eosinophils => donor_is_old
 -> Support: 0.23, Confidence: 0.67, Score: 0.15
NOT IL_4 AND NOT Lymphocytes => donor_is_old
 -> Support: 0.00, Confidence: 0.00, Score: 0.00
Monocytes_percent => donor_is_old
 -> Support: 0.26, Confidence: 0.80, Score: 0.21
NO

Rules with zero support match no donor in the dataset and therefore carry no evidence, so we filter them out:

In [26]:
filtered_rule_metrics = [rm for rm in rule_metrics if rm["support"] > 0]

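Finally, we can apply rule compression: a rule is dropped when another rule uses a subset of its conditions and has an equal or higher score.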

In [32]:
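n = len(filtered_rule_metrics)
keep_flags = [True] * n

# Drop rule j when a kept rule i uses a subset of its conditions and scores at least as well.
for i in range(n):
    # A rule that was already dropped must not eliminate others; otherwise two
    # rules with identical condition sets and scores would remove each other.
    if not keep_flags[i]:
        continue
    for j in range(n):
        if i == j or not keep_flags[j]:
            continue
        more_general = filtered_rule_metrics[i]["conditions"].issubset(filtered_rule_metrics[j]["conditions"])
        if more_general and filtered_rule_metrics[i]["score"] >= filtered_rule_metrics[j]["score"]:
            keep_flags[j] = False

compressed_rule_metrics = [filtered_rule_metrics[i] for i in range(n) if keep_flags[i]]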

print("\nCompressed rules and their metrics:")
for rm in compressed_rule_metrics:
    print(f"{rm['rule']} -> Support: {rm['support']:.2f}, Confidence: {rm['confidence']:.2f}, Score: {rm['score']:.2f}")


Compressed rules and their metrics:
TNF_α => donor_is_old
 -> Support: 0.26, Confidence: 0.70, Score: 0.18
OMD => donor_is_old
 -> Support: 0.26, Confidence: 0.80, Score: 0.21
IL_10 => donor_is_old
 -> Support: 0.26, Confidence: 0.60, Score: 0.15
SOST => donor_is_old
 -> Support: 0.26, Confidence: 0.90, Score: 0.23
WBC => donor_is_old
 -> Support: 0.26, Confidence: 0.70, Score: 0.18
Eosinophils => donor_is_old
 -> Support: 0.23, Confidence: 0.67, Score: 0.15
Monocytes_percent => donor_is_old
 -> Support: 0.26, Confidence: 0.80, Score: 0.21
Eosinophils_percent => donor_is_old
 -> Support: 0.23, Confidence: 0.78, Score: 0.18
IL_8 => donor_is_old
 -> Support: 0.26, Confidence: 0.80, Score: 0.21
GDF_15 => donor_is_old
 -> Support: 0.26, Confidence: 1.00, Score: 0.26
Basophils => donor_is_old
 -> Support: 0.33, Confidence: 0.62, Score: 0.21
Neutrophils => donor_is_old
 -> Support: 0.23, Confidence: 0.89, Score: 0.21
IL_13 => donor_is_old
 -> Support: 0.26, Confidence: 0.70, Score: 0.18
Mon
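The subset-based pruning step can be illustrated on a handful of toy rules. This is a sketch, not the notebook's exact code: `prune_specializations` is a hypothetical name, and the scores are arbitrary numbers chosen to exercise both branches rather than values computed from the dataset.

```python
def prune_specializations(rules):
    """rules: list of (frozenset of conditions, score) pairs.

    Drops a rule when a kept rule uses a subset of its conditions
    and has an equal or higher score.
    """
    keep = [True] * len(rules)
    for i, (cond_i, score_i) in enumerate(rules):
        if not keep[i]:
            continue  # dropped rules do not eliminate others
        for j, (cond_j, score_j) in enumerate(rules):
            if i != j and keep[j] and cond_i <= cond_j and score_i >= score_j:
                keep[j] = False
    return [rule for rule, kept in zip(rules, keep) if kept]

# Arbitrary toy scores (assumption), not derived from the dataset.
toy_rules = [
    (frozenset({"SOST"}), 0.23),
    (frozenset({"SOST", "WBC"}), 0.10),  # more specific and weaker: dropped
    (frozenset({"WBC"}), 0.18),
    (frozenset({"WBC", "IL_8"}), 0.20),  # more specific but stronger: kept
]
for conds, score in prune_specializations(toy_rules):
    print(" AND ".join(sorted(conds)), score)
```

Because support × confidence equals the fraction of all donors that match the rule and are old, a rule with extra conditions can never score higher than its generalization on the same data. Under that score the comparison always succeeds, so the last toy case can only occur with a different scoring function.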

In the end, we save the compressed rules to a file:

In [33]:
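with open("compressed_rules.txt", "w", encoding="utf-8") as f:
    for rm in compressed_rule_metrics:
        # Rules read with readlines() still end in "\n"; strip it to avoid blank lines.
        f.write(rm["rule"].strip() + "\n")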

print("\nCompressed rules saved to compressed_rules.txt")


Compressed rules saved to compressed_rules.txt
