# Breast Cancer Data Conjectures with TxGraffiti

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RandyRDavila/AI-discovery-in-mathematics-with-TxGraffiti/blob/main/notebooks/breast_cancer.ipynb)

## Introduction

This notebook applies the TxGraffiti algorithm to generate conjectures on the Breast Cancer Wisconsin (Diagnostic) dataset, loaded here through scikit-learn's `load_breast_cancer`. The dataset is widely used for binary classification and contains numerical features describing breast tumors.
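
Each conjecture produced in this notebook is a linear inequality between two numerical features. As a sketch read off the code cell further down (the symbols $x_i$, $y_i$, $w$, and $b$ are notation introduced here, not the author's), the upper bound on a target feature $y$ in terms of another feature $x$ over $n$ tumors comes from a small linear program:

```latex
\begin{aligned}
\min_{w,\,b}\quad & \sum_{i=1}^{n} \left( w x_i + b - y_i \right) \\
\text{subject to}\quad & w x_i + b \ge y_i, \qquad i = 1, \dots, n.
\end{aligned}
```

The lower bound maximizes the same sum subject to $w x_i + b \le y_i$. In both cases the solver's $w$ and $b$ are then rounded to fractions with denominator at most 10 before the conjecture is stated.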

## Dataset

The dataset describes cell nuclei in digitized images of breast mass samples. Each "mean" feature averages a per-nucleus measurement over one image. The numerical properties used in this notebook are:
- **Mean Radius**: The mean distance from the center of a nucleus to points on its perimeter.
- **Mean Texture**: The standard deviation of gray-scale values.
- **Mean Perimeter**: The perimeter of the nucleus.
- **Mean Area**: The area of the nucleus.
- **Mean Smoothness**: The local variation in radius lengths.
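
To see these columns concretely, the following sketch loads the data the same way the notebook does and summarizes the five features listed above. It assumes scikit-learn and pandas are installed; the names `features` and `summary` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset into a DataFrame with one column per feature.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# The five features this notebook works with.
features = ["mean radius", "mean texture", "mean perimeter",
            "mean area", "mean smoothness"]

# Summary statistics give a feel for the different scales involved.
summary = df[features].describe()
print(summary.loc[["min", "50%", "max"]])
```

The features live on very different scales (area versus smoothness, for example), which is one reason the slopes in the generated conjectures can be very large.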

## Objectives

- Generate conjectures relating different numerical properties of breast cancer tumors.
- Identify significant relationships and patterns in tumor properties.
- Apply the Theo and Static Dalmatian heuristics to filter and refine the conjectures.
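
The Theo heuristic is, roughly, a redundancy filter: when two conjectures share a conclusion, the one with the more general hypothesis makes the other unnecessary. The following is a minimal sketch of that idea on hand-written data, not the notebook's implementation. The tuple layout, the placeholder conclusions, and the name `drop_less_general` are assumptions made for illustration.

```python
def drop_less_general(conjectures):
    """Keep only conjectures not implied by a more general one.

    Each conjecture is a (hypotheses, conclusion) pair, where hypotheses
    is a frozenset of property names and conclusion is any hashable value.
    """
    kept = []
    for hyps, conclusion in conjectures:
        # A conjecture is implied when another one reaches the same
        # conclusion from a strict subset of its hypotheses.
        implied = any(
            other_conclusion == conclusion and other_hyps < hyps
            for other_hyps, other_conclusion in conjectures
        )
        if not implied:
            kept.append((hyps, conclusion))
    return kept


# Placeholder conjectures; the conclusions are made-up strings.
toy = [
    (frozenset({"high_mean_radius"}), "y <= 2*x + 1"),
    (frozenset({"high_mean_radius", "high_mean_texture"}), "y <= 2*x + 1"),
    (frozenset({"high_mean_texture"}), "y >= 3*x"),
]

for hyps, conclusion in drop_less_general(toy):
    print(sorted(hyps), "=>", conclusion)
```

The second toy conjecture is dropped because the first one states the same bound with fewer conditions.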

## Usage

1. **Run the cells to load the dataset and apply TxGraffiti.**
2. **Examine the generated conjectures and their significance.**

Discover new insights into breast cancer tumor properties with TxGraffiti.
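
When examining a conjecture, two quantities matter: whether the inequality holds on every instance, and how many instances attain equality (the touch number the conjectures are sorted by). Here is a minimal sketch of that check on made-up points, using exact `Fraction` arithmetic; the helper `check_upper_bound` is hypothetical and not part of the notebook.

```python
from fractions import Fraction


def check_upper_bound(points, slope, intercept):
    """Return (holds, touch) for the claim y <= slope * x + intercept."""
    holds = True
    touch = 0
    for x, y in points:
        bound = slope * x + intercept
        if y > bound:
            holds = False      # the claimed bound is violated here
        elif y == bound:
            touch += 1         # the bound is attained with equality
    return holds, touch


# Made-up (x, y) pairs; two of them lie exactly on y = x/2 + 2.
points = [(Fraction(1), Fraction(5, 2)),
          (Fraction(2), Fraction(3)),
          (Fraction(4), Fraction(7, 2)),
          (Fraction(6), Fraction(9, 2))]

holds, touch = check_upper_bound(points, Fraction(1, 2), Fraction(2))
print(holds, touch)
```

The notebook itself works with floats and counts equality through `np.isclose` rather than exact comparison.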

---

In [12]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from pulp import LpProblem, LpVariable, LpMinimize, LpMaximize, lpSum
from fractions import Fraction
from itertools import combinations

# Define the hypothesis, conclusion, and conjecture classes
class Hypothesis:
    def __init__(self, statements):
        self.statements = statements

class LinearConclusion:
    def __init__(self, target, inequality, slope, other, intercept):
        self.target = target
        self.inequality = inequality
        self.slope = slope
        self.other = other
        self.intercept = intercept

class LinearConjecture:
    def __init__(self, hypothesis, conclusion, symbol, touch, type="tumor"):
        self.hypothesis = hypothesis
        self.conclusion = conclusion
        self.symbol = symbol
        self.touch = touch
        self.type = type

    def __repr__(self):
        if self.hypothesis.statements:
            hypothesis_str = " and ".join([f"{self.symbol} is {h}" for h in self.hypothesis.statements])
            return (f"For any {self.type} {self.symbol}, if {hypothesis_str}, then "
                    f"{self.conclusion.target}({self.symbol}) {self.conclusion.inequality} "
                    f"{self.conclusion.slope}*{self.conclusion.other}({self.symbol}) + "
                    f"{self.conclusion.intercept}, with equality on {self.touch} instances.")
        else:
            return (f"For any {self.type} {self.symbol}, "
                    f"{self.conclusion.target}({self.symbol}) {self.conclusion.inequality} "
                    f"{self.conclusion.slope}*{self.conclusion.other}({self.symbol}) + "
                    f"{self.conclusion.intercept}, with equality on {self.touch} instances.")

    def get_sharp_objects(self, df):
        X = df[self.conclusion.other].to_numpy()
        Y = df[self.conclusion.target].to_numpy()
        sharp_indices = df[np.isclose(Y, float(self.conclusion.slope) * X + float(self.conclusion.intercept))].index
        return df.loc[sharp_indices]

    def calculate_distances(self, df):
        X = df[self.conclusion.other].to_numpy()
        Y = df[self.conclusion.target].to_numpy()
        distances = np.abs(Y - (float(self.conclusion.slope) * X + float(self.conclusion.intercept)))
        return distances

def make_upper_linear_conjecture(df, target, other, hypothesis, symbol="C"):
    # Restrict to the rows that satisfy every boolean hypothesis.
    for hyp in hypothesis:
        df = df[df[hyp]]
    X = df[other].to_numpy()
    Y = df[target].to_numpy()

    prob = LpProblem("UpperBoundConjecture", LpMinimize)
    w = LpVariable("w")
    b = LpVariable("b")

    prob += lpSum([w * x + b - y for x, y in zip(X, Y)])

    for x, y in zip(X, Y):
        prob += w * x + b - y >= 0

    prob.solve()

    if w.varValue is None or b.varValue is None:
        return None

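    # Round the LP solution to simple fractions for readability. The rounded
    # line is no longer guaranteed to bound every point in the data.
    m = Fraction(w.varValue).limit_denominator(10)
    b = Fraction(b.varValue).limit_denominator(10)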
    if m == 0:
        return None  # Skip trivial conjectures

    touch = np.sum(np.isclose(Y, float(m) * X + float(b)))

    hypothesis = Hypothesis(hypothesis)
    conclusion = LinearConclusion(target, "<=", m, other, b)

    return LinearConjecture(hypothesis, conclusion, symbol, touch)

def make_lower_linear_conjecture(df, target, other, hypothesis, symbol="C"):
    # Restrict to the rows that satisfy every boolean hypothesis.
    for hyp in hypothesis:
        df = df[df[hyp]]
    X = df[other].to_numpy()
    Y = df[target].to_numpy()

    prob = LpProblem("LowerBoundConjecture", LpMaximize)
    w = LpVariable("w")
    b = LpVariable("b")

    prob += lpSum([w * x + b - y for x, y in zip(X, Y)])

    for x, y in zip(X, Y):
        prob += w * x + b - y <= 0

    prob.solve()

    if w.varValue is None or b.varValue is None:
        return None

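    # Round the LP solution to simple fractions for readability. The rounded
    # line is no longer guaranteed to bound every point in the data.
    m = Fraction(w.varValue).limit_denominator(10)
    b = Fraction(b.varValue).limit_denominator(10)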
    if m == 0:
        return None  # Skip trivial conjectures

    touch = np.sum(np.isclose(Y, float(m) * X + float(b)))

    hypothesis = Hypothesis(hypothesis)
    conclusion = LinearConclusion(target, ">=", m, other, b)

    return LinearConjecture(hypothesis, conclusion, symbol, touch)

def make_all_upper_linear_conjectures(df, target, others, properties):
    conjectures = []
    for other in others:
        for k in range(4):  # Considering hypotheses of none, one, two, and three boolean properties
            for prop_comb in combinations(properties, k):
                if other != target:
                    conjecture = make_upper_linear_conjecture(df, target, other, prop_comb)
                    if conjecture:
                        conjectures.append(conjecture)
    return conjectures

def make_all_lower_linear_conjectures(df, target, others, properties):
    conjectures = []
    for other in others:
        for k in range(4):  # Considering hypotheses of none, one, two, and three boolean properties
            for prop_comb in combinations(properties, k):
                if other != target:
                    conjecture = make_lower_linear_conjecture(df, target, other, prop_comb)
                    if conjecture:
                        conjectures.append(conjecture)
    return conjectures

def sort_by_touch_number(conjectures):
    return sorted(conjectures, key=lambda x: x.touch, reverse=True)

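def apply_theo_heuristic(conjectures):
    # Drop a conjecture when an already kept conjecture states the same
    # conclusion (same target, other, slope, intercept, and inequality)
    # under a hypothesis that is at least as general.
    filtered_conjectures = []
    for conj_1 in conjectures:
        is_general = True
        for conj_2 in filtered_conjectures:
            if (conj_1.conclusion.target == conj_2.conclusion.target and
                conj_1.conclusion.other == conj_2.conclusion.other and
                conj_1.conclusion.slope == conj_2.conclusion.slope and
                conj_1.conclusion.intercept == conj_2.conclusion.intercept and
                conj_1.conclusion.inequality == conj_2.conclusion.inequality and
                set(conj_2.hypothesis.statements).issubset(set(conj_1.hypothesis.statements))):
                is_general = False
                break
        if is_general:
            filtered_conjectures.append(conj_1)
    return filtered_conjectures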

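def apply_static_dalmatian_heuristic(df, conjectures):
    # Keep a conjecture only if, compared with each conjecture already kept,
    # its line is strictly closer to the data on at least one row of df.
    filtered_conjectures = []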
    for conj in conjectures:
        conj_distances = conj.calculate_distances(df)
        keep_conj = True
        for other_conj in filtered_conjectures:
            other_distances = other_conj.calculate_distances(df)
            if np.all(conj_distances >= other_distances):
                keep_conj = False
                break
        if keep_conj:
            filtered_conjectures.append(conj)
    return filtered_conjectures

def txgraffiti_conjecture_generation(df, targets, invariants, properties):
    conjectures = []
    for target in targets:
        upper_conjectures = make_all_upper_linear_conjectures(df, target, invariants, properties)
        lower_conjectures = make_all_lower_linear_conjectures(df, target, invariants, properties)
        conjectures += upper_conjectures + lower_conjectures

    conjectures = sort_by_touch_number(conjectures)
    conjectures = apply_theo_heuristic(conjectures)
    conjectures = apply_static_dalmatian_heuristic(df, conjectures)

    return conjectures

# Load the breast cancer dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Define boolean properties based on some thresholds
df['high_mean_radius'] = df['mean radius'] > df['mean radius'].median()
df['high_mean_texture'] = df['mean texture'] > df['mean texture'].median()
df['high_mean_perimeter'] = df['mean perimeter'] > df['mean perimeter'].median()
df['high_mean_area'] = df['mean area'] > df['mean area'].median()
df['high_mean_smoothness'] = df['mean smoothness'] > df['mean smoothness'].median()

# Define the targets, invariants, and properties
targets = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
invariants = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
properties = ["high_mean_radius", "high_mean_texture", "high_mean_perimeter", "high_mean_area", "high_mean_smoothness"]

# Generate conjectures using the TxGraffiti algorithm
conjectures = txgraffiti_conjecture_generation(df, targets, invariants, properties)

In [13]:
# Print the generated conjectures
for i, conj in enumerate(conjectures[:20]):
    print(f"Conjecture {i+1}. ", conj, "\n")


Conjecture 1.  For any tumor C, if C is high_mean_texture, then mean perimeter(C) >= 168*mean smoothness(C) + 304/9, with equality on 2 instances. 

Conjecture 2.  For any tumor C, if C is high_mean_radius and C is high_mean_perimeter, then mean perimeter(C) >= 80/3*mean smoothness(C) + 671/8, with equality on 2 instances. 

Conjecture 3.  For any tumor C, if C is high_mean_radius and C is high_mean_texture and C is high_mean_perimeter, then mean perimeter(C) >= 887/7*mean smoothness(C) + 597/8, with equality on 2 instances. 

Conjecture 4.  For any tumor C, mean area(C) <= 169721/5*mean smoothness(C) + -10607/9, with equality on 2 instances. 

Conjecture 5.  For any tumor C, if C is high_mean_smoothness, then mean area(C) <= 966263/10*mean smoothness(C) + -55813/7, with equality on 2 instances. 

Conjecture 6.  For any tumor C, if C is high_mean_texture, then mean area(C) >= 95*mean radius(C) + -741, with equality on 2 instances. 

Conjecture 7.  For any tumor C, if C is high_mean_rad