# Wine Data Conjectures with TxGraffiti

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RandyRDavila/AI-discovery-in-mathematics-with-TxGraffiti/blob/main/notebooks/wine_data.ipynb)

## Introduction

This notebook applies the TxGraffiti algorithm to generate conjectures on the red wine quality dataset. The dataset, commonly used in machine learning, records chemical properties of wines alongside a quality score. Here, TxGraffiti searches for linear inequalities that bound one property in terms of another, optionally under boolean conditions on the wine.

## Dataset

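The dataset is loaded from the UCI Machine Learning Repository (`winequality-red.csv`). Each row is one wine sample, described by two kinds of properties:
- **Numerical Properties**: Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.
- **Boolean Properties**: High acidity, high sugar, high alcohol, high pH, and high quality. These are not columns of the original data; the notebook derives each one by testing whether the corresponding numerical value is strictly above the column median.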
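The median thresholding behind the boolean properties can be sketched with the standard library alone. The values below are hypothetical toy numbers, not rows from the dataset, and the variable names are chosen for illustration. The example is built so that several values tie with the median, which shows that a strict comparison can flag fewer than half of the samples as "high".

```python
from statistics import median

# Hypothetical toy values standing in for one numerical column (e.g. alcohol).
alcohol = [9.4, 9.8, 10.0, 10.0, 10.0, 12.5]

threshold = median(alcohol)  # middle value of the column

# Strict ">" mirrors the notebook: samples equal to the median are not "high".
high_alcohol = [value > threshold for value in alcohol]

print(threshold)
print(high_alcohol)
```

Because three of the six toy values equal the median, only one sample is flagged.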

## Objectives

- Generate conjectures relating different numerical properties of wines.
- Rank the conjectures by touch number, the number of samples on which a bound holds with equality.
- Apply the Theo and Static Dalmatian heuristics to filter and refine the conjectures.
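As a rough illustration of the Theo-style filtering named above, the sketch below drops a conjecture when another conjecture states the same conclusion under strictly fewer assumptions. `ToyConjecture` and `theo_filter` are hypothetical names introduced here, not part of TxGraffiti, and the numbers are illustrative; the notebook's own `apply_theo_heuristic` may differ in detail.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ToyConjecture:
    """Hypothetical stand-in for the notebook's LinearConjecture."""
    hypothesis: frozenset  # boolean properties assumed of the wine
    target: str
    inequality: str
    slope: Fraction
    other: str
    intercept: Fraction

    def conclusion(self):
        return (self.target, self.inequality, self.slope, self.other, self.intercept)


def theo_filter(conjectures):
    """Keep only conjectures not implied by a more general one in the list."""
    kept = []
    for conj in conjectures:
        implied = any(
            rival.conclusion() == conj.conclusion()
            and rival.hypothesis < conj.hypothesis  # proper subset: fewer assumptions
            for rival in conjectures
        )
        if not implied:
            kept.append(conj)
    return kept


# Illustrative bound shared by two conjectures with different hypotheses.
bound = dict(target="quality", inequality="<=", slope=Fraction(5, 6),
             other="alcohol", intercept=Fraction(-7, 4))
general = ToyConjecture(frozenset({"high_alcohol"}), **bound)
specific = ToyConjecture(frozenset({"high_alcohol", "high_pH"}), **bound)
unrelated = ToyConjecture(frozenset({"high_sugar"}), "chlorides", ">=",
                          Fraction(1, 10), "pH", Fraction(-1, 4))

survivors = theo_filter([specific, general, unrelated])
print([sorted(c.hypothesis) for c in survivors])
```

Here `specific` is removed because `general` makes the same claim about a larger class of wines, while `unrelated` is untouched because its conclusion differs.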

## Usage

1. **Run the cells to load the dataset and apply TxGraffiti.**
2. **Examine the generated conjectures, which are listed in decreasing order of touch number (the number of samples on which the bound holds with equality).**
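When examining a conjecture, two quick checks are useful: whether the bound actually holds on every sample, and how many samples meet it with equality. The sketch below shows both on hypothetical toy points. The helper names `touch_number` and `holds_as_upper_bound` are introduced here for illustration, and the sketch uses exact fractions, whereas the notebook compares floats with `np.isclose`.

```python
from fractions import Fraction


def touch_number(points, slope, intercept):
    """Count samples (x, y) lying exactly on the line y = slope * x + intercept."""
    return sum(1 for x, y in points if y == slope * x + intercept)


def holds_as_upper_bound(points, slope, intercept):
    """True when y <= slope * x + intercept for every sample."""
    return all(y <= slope * x + intercept for x, y in points)


# Hypothetical (x, y) samples, stored as exact fractions to avoid round-off.
points = [
    (Fraction(1), Fraction(2)),
    (Fraction(2), Fraction(3)),
    (Fraction(3), Fraction(7, 2)),
    (Fraction(4), Fraction(5)),
]

slope, intercept = Fraction(1), Fraction(1)  # candidate bound: y <= x + 1
print(holds_as_upper_bound(points, slope, intercept))
print(touch_number(points, slope, intercept))
```

On these toy points the bound holds and is met with equality by three of the four samples.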

Discover new insights into wine properties and quality with TxGraffiti.

---
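For orientation, the bound-fitting step in the code cell that follows can be read as a small linear program. The notation is introduced here and is not the author's: $(x_i, y_i)$, $i = 1, \dots, n$, are the values of the explanatory property and the target property on the samples that satisfy the hypothesis, and $w$ and $b$ are the slope and intercept being fitted. Under that reading, the upper-bound fit is

$$
\min_{w,\,b} \; \sum_{i=1}^{n} \left( w x_i + b - y_i \right)
\quad \text{subject to} \quad w x_i + b \ge y_i, \qquad i = 1, \dots, n,
$$

and the lower-bound fit is the mirror image,

$$
\max_{w,\,b} \; \sum_{i=1}^{n} \left( w x_i + b - y_i \right)
\quad \text{subject to} \quad w x_i + b \le y_i, \qquad i = 1, \dots, n.
$$

In both cases the line is pushed as close to the data as the constraints allow, so it tends to touch at least one sample.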

In [6]:
# If running in Google Colab, you will need to pip install pulp.
# !pip install pulp

# Import the necessary libraries.
import pandas as pd
import numpy as np
from pulp import LpProblem, LpVariable, LpMinimize, LpMaximize, lpSum
from fractions import Fraction
from itertools import combinations

# Define the hypothesis, conclusion, and conjecture classes
class Hypothesis:
    def __init__(self, statements):
        self.statements = statements

class LinearConclusion:
    def __init__(self, target, inequality, slope, other, intercept):
        self.target = target
        self.inequality = inequality
        self.slope = slope
        self.other = other
        self.intercept = intercept

class LinearConjecture:
    def __init__(self, hypothesis, conclusion, symbol, touch, type="vino"):
        self.hypothesis = hypothesis
        self.conclusion = conclusion
        self.symbol = symbol
        self.touch = touch
        self.type = type

    def __repr__(self):
        if self.hypothesis.statements:
            hypothesis_str = " and ".join([f"{self.symbol} is {h}" for h in self.hypothesis.statements])
            return (f"For any {self.type} {self.symbol}, if {hypothesis_str}, then "
                    f"{self.conclusion.target}({self.symbol}) {self.conclusion.inequality} "
                    f"{self.conclusion.slope}*{self.conclusion.other}({self.symbol}) + "
                    f"{self.conclusion.intercept}, with equality on {self.touch} instances.")
        else:
            return (f"For any {self.type} {self.symbol}, "
                    f"{self.conclusion.target}({self.symbol}) {self.conclusion.inequality} "
                    f"{self.conclusion.slope}*{self.conclusion.other}({self.symbol}) + "
                    f"{self.conclusion.intercept}, with equality on {self.touch} instances.")

    def get_sharp_objects(self, df):
        # Rows of df that lie on the bounding line, up to floating-point tolerance.
        # The hypothesis is not applied here; pass a pre-filtered df if needed.
        X = df[self.conclusion.other].to_numpy()
        Y = df[self.conclusion.target].to_numpy()
        on_line = np.isclose(Y, float(self.conclusion.slope) * X + float(self.conclusion.intercept))
        return df[on_line]

    def calculate_distances(self, df):
        # Vertical distance from each row of df to the bounding line.
        X = df[self.conclusion.other].to_numpy()
        Y = df[self.conclusion.target].to_numpy()
        return np.abs(Y - (float(self.conclusion.slope) * X + float(self.conclusion.intercept)))

def make_upper_linear_conjecture(df, target, other, hypothesis, symbol="W"):
    # Restrict to the samples that satisfy every boolean property in the hypothesis.
    for hyp in hypothesis:
        df = df[df[hyp]]
    X = df[other].to_numpy()
    Y = df[target].to_numpy()

    # Fit the line w*x + b that lies on or above every sample while
    # minimizing the total vertical gap to the samples.
    prob = LpProblem("UpperBoundConjecture", LpMinimize)
    w = LpVariable("w")
    b = LpVariable("b")

    prob += lpSum([w * x + b - y for x, y in zip(X, Y)])

    for x, y in zip(X, Y):
        prob += w * x + b - y >= 0

    prob.solve()

    if w.varValue is None or b.varValue is None:
        return None

    # Round to small-denominator fractions for readability. Note that the
    # rounded line is not guaranteed to still bound every sample.
    m = Fraction(w.varValue).limit_denominator(10)
    intercept = Fraction(b.varValue).limit_denominator(10)
    if m == 0:
        return None  # Skip trivial conjectures

    touch = np.sum(np.isclose(Y, float(m) * X + float(intercept)))

    hypothesis = Hypothesis(hypothesis)
    conclusion = LinearConclusion(target, "<=", m, other, intercept)

    return LinearConjecture(hypothesis, conclusion, symbol, touch)

def make_lower_linear_conjecture(df, target, other, hypothesis, symbol="W"):
    # Restrict to the samples that satisfy every boolean property in the hypothesis.
    for hyp in hypothesis:
        df = df[df[hyp]]
    X = df[other].to_numpy()
    Y = df[target].to_numpy()

    # Fit the line w*x + b that lies on or below every sample while
    # minimizing the total vertical gap (the maximized sum is never positive).
    prob = LpProblem("LowerBoundConjecture", LpMaximize)
    w = LpVariable("w")
    b = LpVariable("b")

    prob += lpSum([w * x + b - y for x, y in zip(X, Y)])

    for x, y in zip(X, Y):
        prob += w * x + b - y <= 0

    prob.solve()

    if w.varValue is None or b.varValue is None:
        return None

    # Round to small-denominator fractions for readability. Note that the
    # rounded line is not guaranteed to still bound every sample.
    m = Fraction(w.varValue).limit_denominator(10)
    intercept = Fraction(b.varValue).limit_denominator(10)
    if m == 0:
        return None  # Skip trivial conjectures

    touch = np.sum(np.isclose(Y, float(m) * X + float(intercept)))

    hypothesis = Hypothesis(hypothesis)
    conclusion = LinearConclusion(target, ">=", m, other, intercept)

    return LinearConjecture(hypothesis, conclusion, symbol, touch)

def make_all_upper_linear_conjectures(df, target, others, properties):
    conjectures = []
    for other in others:
        if other == target:
            continue  # A property is never bounded in terms of itself
        for k in range(4):  # Hypotheses built from zero, one, two, or three boolean properties
            for prop_comb in combinations(properties, k):
                conjecture = make_upper_linear_conjecture(df, target, other, prop_comb)
                if conjecture:
                    conjectures.append(conjecture)
    return conjectures

def make_all_lower_linear_conjectures(df, target, others, properties):
    conjectures = []
    for other in others:
        if other == target:
            continue  # A property is never bounded in terms of itself
        for k in range(4):  # Hypotheses built from zero, one, two, or three boolean properties
            for prop_comb in combinations(properties, k):
                conjecture = make_lower_linear_conjecture(df, target, other, prop_comb)
                if conjecture:
                    conjectures.append(conjecture)
    return conjectures

def sort_by_touch_number(conjectures):
    return sorted(conjectures, key=lambda x: x.touch, reverse=True)

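def apply_theo_heuristic(conjectures):
    # Keep a conjecture only if no already-kept conjecture states the same
    # bound (same target, inequality, slope, other, and intercept) under a
    # hypothesis that is at least as general.
    filtered_conjectures = []
    for conj_1 in conjectures:
        is_general = True
        for conj_2 in filtered_conjectures:
            if (conj_1.conclusion.target == conj_2.conclusion.target and
                conj_1.conclusion.other == conj_2.conclusion.other and
                conj_1.conclusion.slope == conj_2.conclusion.slope and
                conj_1.conclusion.intercept == conj_2.conclusion.intercept and
                conj_1.conclusion.inequality == conj_2.conclusion.inequality and
                # conj_2 assumes no more than conj_1, so conj_1 adds nothing new
                set(conj_2.hypothesis.statements).issubset(set(conj_1.hypothesis.statements))):
                is_general = False
                break
        if is_general:
            filtered_conjectures.append(conj_1)
    return filtered_conjectures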

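def apply_static_dalmatian_heuristic(df, conjectures):
    # Drop a conjecture when some already-kept conjecture is at least as close
    # to the data on every row of df. Distances are computed on the full df,
    # without applying each conjecture's hypothesis.
    filtered_conjectures = []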
    for conj in conjectures:
        conj_distances = conj.calculate_distances(df)
        keep_conj = True
        for other_conj in filtered_conjectures:
            other_distances = other_conj.calculate_distances(df)
            if np.all(conj_distances >= other_distances):
                keep_conj = False
                break
        if keep_conj:
            filtered_conjectures.append(conj)
    return filtered_conjectures

def txgraffiti_conjecture_generation(df, targets, invariants, properties):
    conjectures = []
    for target in targets:
        upper_conjectures = make_all_upper_linear_conjectures(df, target, invariants, properties)
        lower_conjectures = make_all_lower_linear_conjectures(df, target, invariants, properties)
        conjectures += upper_conjectures + lower_conjectures

    conjectures = sort_by_touch_number(conjectures)
    conjectures = apply_theo_heuristic(conjectures)
    conjectures = apply_static_dalmatian_heuristic(df, conjectures)

    return conjectures

# Load the wine quality dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(url, sep=';')

# Define boolean properties: each flag is True when the value is strictly above the column median
df['high_acidity'] = df['fixed acidity'] > df['fixed acidity'].median()
df['high_sugar'] = df['residual sugar'] > df['residual sugar'].median()
df['high_alcohol'] = df['alcohol'] > df['alcohol'].median()
df['high_pH'] = df['pH'] > df['pH'].median()
df['high_quality'] = df['quality'] > df['quality'].median()

# Define the targets, invariants, and properties
targets = ["quality", "volatile acidity", "chlorides"]
invariants = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
properties = ["high_acidity", "high_sugar", "high_alcohol", "high_pH", "high_quality"]

# Generate conjectures using the TxGraffiti algorithm
conjectures = txgraffiti_conjecture_generation(df, targets, invariants, properties)

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/randydavila/Documents/Automated-Conjecturing/AI-discovery-in-mathematics-with-TxGraffiti/env/lib/python3.11/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/92/bxgdy2896wdgw0bx9f_1ghhh0000gn/T/c435d920c5274f2dbbb884c66c7a4b19-pulp.mps -timeMode elapsed -branch -printingOptions all -solution /var/folders/92/bxgdy2896wdgw0bx9f_1ghhh0000gn/T/c435d920c5274f2dbbb884c66c7a4b19-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 1604 COLUMNS
At line 4805 RHS
At line 6405 BOUNDS
At line 6408 ENDATA
Problem MODEL has 1599 rows, 2 columns and 3198 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Presolve 96 (-1503) rows, 2 (0) columns and 192 (-3006) elements
0  Obj 0 Primal inf 76.213316 (96) Dual inf 206.585 (2) w.o. free dual inf (0)
4  Obj 12792
Optimal - objective value 12792
After Postsolve, objective 12792, infea

In [7]:
# Print the generated conjectures
for i, conj in enumerate(conjectures[:20]):
    print(f"Conjecture {i+1}. ", conj, "\n")

Conjecture 1.  For any vino W, if W is high_sugar, then chlorides(W) >= 1/10*pH(W) + -1/4, with equality on 13 instances. 

Conjecture 2.  For any vino W, if W is high_alcohol, then chlorides(W) >= 1/10*sulphates(W) + 0, with equality on 13 instances. 

Conjecture 3.  For any vino W, if W is high_alcohol and W is high_pH, then quality(W) <= 5/6*alcohol(W) + -7/4, with equality on 7 instances. 

Conjecture 4.  For any vino W, if W is high_acidity and W is high_pH, then chlorides(W) >= -1/10*citric acid(W) + 1/10, with equality on 5 instances. 

Conjecture 5.  For any vino W, if W is high_acidity and W is high_pH and W is high_quality, then quality(W) <= 10/3*volatile acidity(W) + 33/5, with equality on 4 instances. 

Conjecture 6.  For any vino W, if W is high_acidity and W is high_sugar and W is high_pH, then quality(W) <= 5/9*alcohol(W) + 7/6, with equality on 4 instances. 

Conjecture 7.  For any vino W, if W is high_sugar and W is high_alcohol, then quality(W) >= -25/7*pH(W) + 108/7