# 1. Foundations: Probability and Toy Modeling

Consider a toy dataset with $n$ samples such that each attribute $\text{attr}(i)$, for $i \in \{1, 2, \dots, n\}$, is either $x$ or $y$. Put simply, we have flipped a lopsided¹ coin $n$ times, and the dataset records heads or tails for the $i^{\text{th}}$ flip.

Let's say we want to predict, with some probability, what the $k^{\text{th}}$ flip will be. To do that, we first need to revisit some basic probability:

### i. Probability Space
A probability space is a triple of the form $\langle \text{W},\mathscr{A},\textbf{Pr} \rangle$, where $\text{W}$ is a sample space (here, $\text{W} = \{H, T\}$) and $\mathscr{A}$ is an algebra over $\text{W}$. The power set of $\text{W}$, $\{\emptyset, \{H\}, \{T\}, \{H, T\}\}$, is one of the possible algebras. Lastly, $\textbf{Pr}$ is a function $\mathscr{A} \to \mathbb{R}$ such that for all $X,Y\in \mathscr{A}$:

1. $\textbf{Pr}(X)\ge0$
2. $\textbf{Pr}(\text{W})=1$
3. if $X\cap Y = \emptyset$, then $\textbf{Pr}(X \cup Y)=\textbf{Pr}(X)+\textbf{Pr}(Y)$²

### ii. Probability Measure

Suppose $n=30$, and out of those $30$ flips, $18$ were heads and $12$ were tails. Let's define our measure as follows:

1. $\textbf{Pr}(\text{head})=18/30$
2. $\textbf{Pr}(\text{tail})=12/30$

You can verify that the conditions outlined above are satisfied. More generally, we define the probability measure of any event $\text{event}_i$ with $k$ favorable outcomes in the dataset as
$$\mathbf{Pr}(\text{event}_i) = \frac{\text{count}(\text{event}_i)}{n} = \frac{k}{n}$$
where $n$ denotes the total number of observed outcomes in the dataset.
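
The verification can also be done mechanically. The following is a minimal sketch, assuming the 30 flips are stored as a plain list (the names `flips`, `pr`, and `algebra` are illustrative, not part of the original notebook); it builds the power-set algebra and checks the three conditions with exact fractions.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical record of the 30 flips from the text: 18 heads, 12 tails.
flips = ["H"] * 18 + ["T"] * 12
n = len(flips)


def pr(event):
    """Empirical measure: fraction of flips whose outcome lies in `event`."""
    return Fraction(sum(1 for f in flips if f in event), n)


# The algebra here is the power set of the sample space W = {H, T}.
W = frozenset({"H", "T"})
algebra = [
    frozenset(s)
    for s in chain.from_iterable(
        combinations(sorted(W), r) for r in range(len(W) + 1)
    )
]

# Condition 1: non-negativity.
assert all(pr(X) >= 0 for X in algebra)
# Condition 2: the whole sample space has probability 1.
assert pr(W) == 1
# Condition 3: additivity for disjoint events.
assert all(
    pr(X | Y) == pr(X) + pr(Y)
    for X in algebra
    for Y in algebra
    if not (X & Y)
)

print(pr({"H"}), pr({"T"}))  # 3/5 2/5
```

Using `Fraction` keeps the additivity check exact, which floating-point arithmetic would not guarantee in general.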

### Notes:
¹ As much as we want life to be fair, it rarely is... and in this case, that's a blessing since a fair coin will eventually converge to $P(\text{head})\approx 1/2$ and $P(\text{tail})\approx 1/2$ which is elegant, yes—but boring, if not outright useless to model.

² Based on Kolmogorov’s Foundations of the Theory of Probability, which formalized the axioms of modern probability theory.

# 2. Predicting $k^{\text{th}}$ Flip

Now that we have the foundational details out of the way, we have a toy classifier ready. What's the probability of the $k^{\text{th}}$ flip being a head? Simple: it's $18/30 = 0.6$. In other words, our model is 60 percent confident that the $k^{\text{th}}$ flip is a head. Although this is a useful exercise for grappling with probabilistic intuition, it's not quite as practical in the real world, hence a toy "Bayesian"¹ classifier.
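
As a rough sketch of this prior-only classifier (the function names `fit_prior` and `predict_next` are hypothetical, chosen for illustration), we can estimate the relative frequencies and always predict the most frequent outcome:

```python
from collections import Counter


def fit_prior(outcomes):
    """Estimate Pr(outcome) as a relative frequency over the observed flips."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {outcome: count / n for outcome, count in counts.items()}


def predict_next(prior):
    """Predict the most probable outcome together with its probability."""
    label = max(prior, key=prior.get)
    return label, prior[label]


prior = fit_prior(["head"] * 18 + ["tail"] * 12)
print(predict_next(prior))  # ('head', 0.6)
```

Note that the prediction is the same for every $k$, since nothing about the flip itself is used.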

### Notes:
¹ It's a special case of a Bayesian classifier that uses only priors, without any conditioning. Although simplified, it's still theoretically Bayesian.

# 3. Conditional Probability
As we said earlier, although our toy classifier is pedagogically useful, it's not practical. We could tack additional conditions onto our theory to strengthen it, but an old adage comes to mind: GIGO, garbage in, garbage out. If our dataset is inherently limited, then no matter how much we strengthen our model, it won't fare better than our toy model. So let's construct a slightly more granular dataset:

Suppose we have $n = 8$ samples, one attribute and one predictor:

| Smokes | HasLungDisease |
|--------|----------------|
| Yes    | Yes            |
| No     | No             |
| Yes    | Yes            |
| No     | No             |
| Yes    | No             |
| No     | Yes            |
| Yes    | Yes            |
| No     | No             |

Before anything else, for comparison, let's model this using our earlier classifier:

$$
\textbf{Pr}(\text{HasLungDisease} = \text{Yes}) = \frac{1}{2} = \textbf{Pr}(\text{HasLungDisease} = \text{No})
$$

This doesn't quite tell us much except that everyone has a $50\%$ chance of having a lung disease. Sort of like walking around with a hard hat in case of a falling fridge.

Since we have additional information, why not use it?

### i. Attribute Conditioned on Predictor

We define the probability that an individual has lung disease given that they smoke, written $\text{Pr}(\text{HasLungDisease = Yes} \mid \text{Smokes = Yes})$, as
$$
\frac{\text{Pr}(\text{HasLungDisease = Yes and Smokes = Yes})} { \text{Pr}(\text{Smokes = Yes})}
$$

I won't get into probability proper here, but for further exploration you can refer to John E. Freund's *Introduction to Probability* (1973).

Let's evaluate our conditional probability.

To do this, count the individuals who both smoke and have lung disease: 3 of the 8. This gives the joint probability, which is symmetric in its two events:

$$
\text{Pr}(\text{HasLungDisease} = \text{Yes} \text{ and } \text{Smokes} = \text{Yes}) = \frac{3}{8}
$$

Likewise, the proportion of smokers in our dataset is:

$$
\text{Pr}(\text{Smokes} = \text{Yes}) = \frac{4}{8}
$$

So the conditional probability becomes:

$$
\text{Pr}(\text{HasLungDisease} = \text{Yes} \mid \text{Smokes} = \text{Yes}) = \frac{3/8}{4/8} = \frac{3}{4}
$$

That’s a 75% chance of having lung disease if you smoke—much more informative than the flat 50% we started with.

A similar calculation for the incidence of lung disease conditioned on non-smoking yields:

$$
\text{Pr}(\text{HasLungDisease} = \text{Yes} \mid \text{Smokes} = \text{No}) = \frac{1/8}{4/8} = \frac{1}{4}
$$

That's a 25% chance of having lung disease if you do not smoke.
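
The same numbers can be reproduced from the table with pandas. This is a minimal sketch that re-enters the eight rows by hand (the variable names are illustrative); `pd.crosstab` with `normalize="index"` divides each row of counts by its row total, which is exactly the conditioning on `Smokes`.

```python
import pandas as pd

# The eight samples from the table above.
df = pd.DataFrame(
    {
        "Smokes": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
        "HasLungDisease": ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],
    }
)

# Joint and marginal probabilities as relative frequencies.
joint = ((df["Smokes"] == "Yes") & (df["HasLungDisease"] == "Yes")).mean()
marginal = (df["Smokes"] == "Yes").mean()
print(float(joint), float(marginal), float(joint / marginal))  # 0.375 0.5 0.75

# All conditionals Pr(HasLungDisease | Smokes) at once: one row per Smokes value.
cond = pd.crosstab(df["Smokes"], df["HasLungDisease"], normalize="index")
print(cond)
```

Reading off `cond.loc["Yes", "Yes"]` and `cond.loc["No", "Yes"]` gives the 75% and 25% figures derived by hand.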

Although this dataset was "cooked"¹, so to speak, it was "cooked" to reflect real life... let's suppose this is indeed a real life sample for fun.

### Notes:
¹ "Cooked" (i.e., artificially constructed for illustrative purposes)

# 4. Operationalizing Our Toy Classifier

In the previous section we saw that a person who smokes has a 75% chance of having a lung disease, whereas a non-smoker has a 25% chance of having a lung disease. In other words:
$$
\text{Pr}(\text{HasLungDisease} = \text{Yes} \mid \text{Smokes} = \text{Yes}) > \text{Pr}(\text{HasLungDisease} = \text{Yes} \mid \text{Smokes} = \text{No})
$$

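This comparison becomes our rule: predict lung disease for smokers (75% beats the complementary 25%) and no lung disease for non-smokers (25% loses to the complementary 75%).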

Suppose we get a new datapoint:

| Smokes | HasLungDisease |
|--------|----------------|
| Yes    |                |

Then, based on our rule, we classify it as:


| Smokes | HasLungDisease |
|--------|----------------|
| Yes    | Yes            |

Let's code this dummy model:

In [18]:
import pandas as pd


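def ModelToy(df):
    # Apply the rule above on a copy: predict lung disease exactly when the person smokes.
    df = df.copy()
    df["HasLungDisease"] = df["Smokes"].apply(lambda x: "Yes" if x == "Yes" else "No")
    return df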


TestData = pd.DataFrame(
    {"Smokes": ["Yes", "Yes", "No", "Yes"], "HasLungDisease": ["", "", "", ""]}
)

print("Before classification:")
print(TestData)

classified = ModelToy(TestData)

print("\nAfter classification:")
print(classified)

Before classification:
  Smokes HasLungDisease
0    Yes               
1    Yes               
2     No               
3    Yes               

After classification:
  Smokes HasLungDisease
0    Yes            Yes
1    Yes            Yes
2     No             No
3    Yes            Yes
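
`ModelToy` hardcodes the rule. As a rough sketch of how the same rule could instead be read off the training table (the helper name `fit_rule` is hypothetical and not part of the original notebook), we can pick, for each value of `Smokes`, the class with the highest conditional probability:

```python
import pandas as pd


def fit_rule(train, predictor="Smokes", target="HasLungDisease"):
    """Map each predictor value to its most probable target value."""
    cond = pd.crosstab(train[predictor], train[target], normalize="index")
    return cond.idxmax(axis=1).to_dict()


# The eight training samples from section 3.
train = pd.DataFrame(
    {
        "Smokes": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
        "HasLungDisease": ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],
    }
)

rule = fit_rule(train)
print(rule)  # {'No': 'No', 'Yes': 'Yes'}

# Classify new, unlabeled samples by looking up the learned rule.
new = pd.DataFrame({"Smokes": ["Yes", "Yes", "No", "Yes"]})
new["HasLungDisease"] = new["Smokes"].map(rule)
print(new)
```

On this dataset the learned mapping coincides with the hardcoded one, so the classifications match the output above.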


# 5. A Complete Naive Bayes Model (for Categorical Attributes and Classes)

Let's vectorize our model.

Suppose we now have $i$ predictors, $\mathbf{X} = [x_1, x_2, \cdots, x_i]$, rather than a single $x$, and several possible classes indexed by $k$. Then, for each class $\text{Class}_k$, we generalize the likelihood as:

$$
\text{Pr}(\mathbf{X} \mid \text{Class}_k) = \prod_{j=1}^{i} \text{Pr}(x_j \mid \text{Class}_k)
$$

This is the heart of the Naive Bayes assumption: **conditional independence**. We assume that, given the class, the features $x_j$ are independent of one another, so each contributes its own factor to the likelihood.

Is this assumption always justified? Probably not. But when push comes to shove, we can often circumvent dependence through thoughtful feature engineering, collapsing correlated attributes, or something similar.
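
Combined with the class prior from the earlier sections, the factorized likelihood gives the score that is compared across classes. As a sketch (the symbol $\hat{k}$ for the predicted class index is introduced here only for illustration):

$$
\text{Pr}(\text{Class}_k \mid \mathbf{X}) \propto \text{Pr}(\text{Class}_k) \prod_{j=1}^{i} \text{Pr}(x_j \mid \text{Class}_k), \qquad \hat{k} = \arg\max_{k} \; \text{Pr}(\text{Class}_k) \prod_{j=1}^{i} \text{Pr}(x_j \mid \text{Class}_k)
$$

The proportionality hides the normalizing constant $\text{Pr}(\mathbf{X})$, which is the same for every class and therefore does not affect which class wins.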

Let's see how a working model might look.

In [19]:
import pandas as pd


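class NaiveBayesClassifier:
    """Categorical Naive Bayes built from relative frequencies."""

    def __init__(self):
        # Prior Pr(Class_k) for each class label.
        self.LabelProbs = {}
        # ConditionalProbs[label][feature][value] = Pr(feature = value | label).
        self.ConditionalProbs = {}

    def fit(self, X, Y):
        # Join features and labels so rows can be filtered by class.
        data = X.copy()
        data[Y.name] = Y

        Classes = Y.unique()

        self.Attributes = X.columns
        self.Labels = Y.name

        # Priors: relative frequency of each class.
        self.LabelProbs = Y.value_counts(normalize=True).to_dict()

        self.ConditionalProbs = {label: {} for label in Classes}

        for label in Classes:
            subset = data[data[self.Labels] == label]
            for feature in self.Attributes:
                # Relative frequency of each value within this class
                # (value_counts ignores missing values by default).
                probs = subset[feature].value_counts(normalize=True).to_dict()
                self.ConditionalProbs[label][feature] = probs

    def predict(self, X):
        predictions = []

        for _, row in X.iterrows():
            Scores = {}

            for label in self.LabelProbs:
                # Start from the prior, then multiply in one conditional per feature.
                score = self.LabelProbs[label]

                for Attribute in self.Attributes:
                    value = row.get(Attribute)

                    ConditionalProb = self.ConditionalProbs[label][Attribute].get(value)
                    # Values never seen with this class (including missing ones)
                    # get a small fallback probability instead of zero.
                    score *= ConditionalProb if ConditionalProb is not None else 1e-6

                Scores[label] = score

            PredictedLabel = max(Scores, key=Scores.get)
            # Scores are unnormalized: Pr(Class_k) times the product of conditionals.
            # The reported "percentage" is therefore not a posterior probability.
            LabelScore = (PredictedLabel, str(Scores[PredictedLabel] * 100) + "%")
            predictions.append(LabelScore)

        return predictions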
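
The class above multiplies raw probabilities, falls back to a fixed `1e-6` for unseen values, and reports an unnormalized score. A possible alternative, sketched below under stated assumptions, swaps in three standard techniques that are not part of the original class: Laplace (add-`alpha`) smoothing, summing log-probabilities instead of multiplying, and normalizing the scores into posteriors with log-sum-exp. The function names `fit_smoothed` and `posterior` are hypothetical, and skipping missing values (rather than penalizing them) is a design choice of this sketch.

```python
import math

import pandas as pd


def fit_smoothed(X, y, alpha=1.0):
    """Return log-priors and Laplace-smoothed log-conditionals."""
    log_prior = {c: math.log(p) for c, p in y.value_counts(normalize=True).items()}
    log_cond = {c: {} for c in log_prior}
    for col in X.columns:
        # Categories observed anywhere in the training data for this feature.
        values = X[col].dropna().unique()
        for c in log_prior:
            counts = X.loc[y == c, col].value_counts()
            total = counts.sum() + alpha * len(values)
            log_cond[c][col] = {
                v: math.log((counts.get(v, 0) + alpha) / total) for v in values
            }
    return log_prior, log_cond


def posterior(row, log_prior, log_cond):
    """Return normalized posteriors Pr(class | row) computed in log space."""
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for col, table in log_cond[c].items():
            v = row.get(col)
            # Assumption of this sketch: skip missing or never-seen values.
            if pd.isna(v) or v not in table:
                continue
            s += table[v]
        scores[c] = s
    # Log-sum-exp normalization keeps the arithmetic stable for many features.
    m = max(scores.values())
    z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {c: math.exp(s - z) for c, s in scores.items()}


# Usage on the small smoking dataset from section 3.
train = pd.DataFrame(
    {
        "Smokes": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
        "HasLungDisease": ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],
    }
)
log_prior, log_cond = fit_smoothed(train[["Smokes"]], train["HasLungDisease"])
post = posterior({"Smokes": "Yes"}, log_prior, log_cond)
print(round(post["Yes"], 3), round(post["No"], 3))  # 0.667 0.333
```

With `alpha=1.0`, the smoothed estimate of 2/3 sits slightly closer to 1/2 than the unsmoothed 3/4, which is the expected effect of adding one pseudo-count per category.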

In [20]:
import pandas as pd

columns = [
    "class",
    "cap-shape",
    "cap-surface",
    "cap-color",
    "bruises",
    "odor",
    "gill-attachment",
    "gill-spacing",
    "gill-size",
    "gill-color",
    "stalk-shape",
    "stalk-root",
    "stalk-surface-above-ring",
    "stalk-surface-below-ring",
    "stalk-color-above-ring",
    "stalk-color-below-ring",
    "veil-type",
    "veil-color",
    "ring-number",
    "ring-type",
    "spore-print-color",
    "population",
    "habitat",
]

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
df = pd.read_csv(url, header=None, names=columns)


df.replace("?", pd.NA, inplace=True)

X = df.drop("class", axis=1)
Y = df["class"].map({"e": "edible", "p": "poisonous"})

In [21]:
X

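| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | s | n | t | p | f | c | n | k | e | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | x | s | y | t | a | f | c | b | k | e | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | b | s | w | t | l | f | c | b | n | e | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | x | y | w | t | p | f | c | n | n | e | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | x | s | g | f | n | f | w | b | k | t | ... | s | w | w | p | w | o | e | n | a | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8119 | k | s | n | f | n | a | c | b | y | e | ... | s | o | o | p | o | o | p | b | c | l |
| 8120 | x | s | n | f | n | a | c | b | y | e | ... | s | o | o | p | n | o | p | b | v | l |
| 8121 | f | s | n | f | n | a | c | b | n | e | ... | s | o | o | p | o | o | p | b | c | l |
| 8122 | k | y | n | f | y | f | c | n | b | t | ... | k | w | w | p | w | o | e | w | v | l |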


In [22]:
Y

0       poisonous
1          edible
2          edible
3       poisonous
4          edible
          ...    
8119       edible
8120       edible
8121       edible
8122    poisonous
8123       edible
Name: class, Length: 8124, dtype: object

In [23]:
model = NaiveBayesClassifier()
model.fit(X,Y)


In [24]:
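# Note: these samples are drawn from the training data, so this is a sanity check
# rather than a held-out evaluation.
TestSamples = X.sample(5, random_state=42)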
TrueLabels = Y[TestSamples.index]

PredictedLabels = model.predict(TestSamples)

for i in range(len(TestSamples)):
    print(f"Sample {i+1}")
    print("Features:", TestSamples.iloc[i].to_dict())
    print("True Label:", TrueLabels.iloc[i])
    print("Predicted:", PredictedLabels[i])
    print()

Sample 1
Features: {'cap-shape': 'f', 'cap-surface': 'f', 'cap-color': 'n', 'bruises': 'f', 'odor': 'n', 'gill-attachment': 'f', 'gill-spacing': 'w', 'gill-size': 'b', 'gill-color': 'h', 'stalk-shape': 't', 'stalk-root': 'e', 'stalk-surface-above-ring': 's', 'stalk-surface-below-ring': 'f', 'stalk-color-above-ring': 'w', 'stalk-color-below-ring': 'w', 'veil-type': 'p', 'veil-color': 'w', 'ring-number': 'o', 'ring-type': 'e', 'spore-print-color': 'n', 'population': 's', 'habitat': 'g'}
True Label: edible
Predicted: ('edible', '2.625882975726804e-07%')

Sample 2
Features: {'cap-shape': 'f', 'cap-surface': 's', 'cap-color': 'e', 'bruises': 'f', 'odor': 'y', 'gill-attachment': 'f', 'gill-spacing': 'c', 'gill-size': 'n', 'gill-color': 'b', 'stalk-shape': 't', 'stalk-root': None, 'stalk-surface-above-ring': 's', 'stalk-surface-below-ring': 's', 'stalk-color-above-ring': 'p', 'stalk-color-below-ring': 'p', 'veil-type': 'p', 'veil-color': 'w', 'ring-number': 'o', 'ring-type': 'e', 'spore-print