# Bayesian Classification

## Objectives

* **Understand** Bayes' Theorem,
* **Learn and Explore** how Bayes' Theorem can aid in the construction of Bayesian models,
* **Create** algorithms for Bayesian classification,
* **Observe** the results of these new models.

## Bayes' Theorem, the start:

<p style='text-align: justify;'>
Bayes' Theorem, formulated by Thomas Bayes, is a fundamental concept in probability and statistics that describes how to update the probability of a hypothesis (or event) based on new evidence. It relates the conditional probabilities of two events to each other and finds wide application in many fields, including data science, machine learning, medicine, and economics.

Suppose we have two events: "$H$" and "$E$". Bayes' Theorem allows us to calculate the conditional probability of "$H$" occurring, given that "$E$" has already happened, using the following formula **(1)**:
</p>

\begin{equation}
P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
\end{equation}

where:

$P(H|E)$, also known as "**Posterior**", is the probability of the hypothesis $H$ occurring given that evidence $E$ has occurred.

$P(E|H)$, also known as "**Likelihood**", is the probability of observing evidence $E$ given that hypothesis $H$ has occurred.

$P(H)$, also known as "**Prior**", is the initial probability of the hypothesis $H$ occurring before considering any evidence.

$P(E)$, also known as "**Marginal Likelihood**" or "**Evidence**", is the probability of observing evidence $E$.
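As a quick numerical illustration of formula **(1)**, the sketch below computes a posterior from a prior and two likelihoods. The function name `posterior` and all numbers are made up for illustration; the evidence $P(E)$ is obtained by summing over the two cases "$H$ is true" and "$H$ is false".

```python
def posterior(prior, likelihood, likelihood_if_not_h):
    """Return P(H|E) from P(H), P(E|H) and P(E|not H), following formula (1)."""
    # Evidence P(E): sum over the two cases "H is true" and "H is false"
    evidence = likelihood * prior + likelihood_if_not_h * (1 - prior)
    return likelihood * prior / evidence


# Illustrative (made-up) numbers: a rare hypothesis and fairly reliable evidence
p = posterior(prior=0.05, likelihood=0.90, likelihood_if_not_h=0.10)
print(f"P(H|E) = {p:.4f}")  # prints "P(H|E) = 0.3214"
```

Even with evidence that is right 90% of the time, the posterior stays well below 50% because the prior is small.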

From this theorem, two significant solutions in the field of machine learning and artificial intelligence were developed: Bayesian Networks (also known as Probability Networks) and Bayesian Classifiers. Do not forget to watch [this video](https://www.youtube.com/watch?v=HZGCoVF3YvM&ab_channel=3Blue1Brown) to get a better understanding of how to use Bayes' Theorem.

It's important to notice that although neural networks and Bayesian models (such as the Bayesian classifier and the Bayesian network) are distinct approaches to tackling machine learning problems, they are conceptually connected. Neural networks are commonly trained by maximum likelihood, the same principle used in statistics and probabilistic modeling, including Bayesian methods. As universal function approximators, neural networks are therefore capable of learning underlying probability distributions, even though that is not their primary purpose.


## ☆ Challenge #1: Protect the Oranges! (Bayesian Network) ☆ 

A Bayesian network is a graphical representation of the probabilistic relationships between different variables. It's a powerful tool for modeling uncertain situations and making predictions involving many non-independent variables.

The strength of Bayesian networks lies in their ability to update probabilities when new information is available. Let's say one event occurs, and we want to know how it affects the likelihood of another event happening. Bayesian networks allow us to do this by applying Bayes' theorem, which takes into account the prior probabilities and the new evidence. To make predictions using a Bayesian network, we follow a process called "inference." It involves using the network to calculate the probabilities of events, given the evidence we have. Let's discover more with an example:

The farmer faces concerns about the security of his valuable orange farm during the night while he is **sleeping** because he fears the possibility of **thieves** invading the property to steal his precious fruits. To protect his plantation, he has installed a sensitive **alarm** system that is triggered if it detects the presence of intruders on the farm.

However, the farmer realizes that the alarm can be triggered not only by **thieves** but also by the farm's **sheep** or by **thunder** during the night.

In order to make more informed decisions about when to activate the alarm and ensure the **safety** of the oranges, the farmer has decided to use a probabilistic model known as "Bayesian Network." This Bayesian Network consists of six random variables:

- **IntruderPresence**: This variable represents the presence of intruders on the farm during the night. It can take two possible states: "True" if intruders are present and "False" if the farm is secure.

- **SheepPresence**: This variable represents the presence of sheep on the farm during the night. Sheep roaming around could potentially trigger the alarm. It can take two states: "True" if sheep are present and "False" if there are no sheep.

- **Thunder**: This variable represents the occurrence of thunder during the night. Thunder can also trigger the sensitive alarm system. It can take two states: "True" if there is thunder and "False" if there is no thunder.

- **AlarmTriggered**: This variable represents whether the alarm system is triggered or not. It can take two states: "True" if the alarm is triggered and "False" if it remains inactive.

- **FarmerWakeUp**: This variable represents the farmer's decision to wake up and defend the oranges in the event of a potential threat. It can take two states: "True" if the farmer wakes up and defends the oranges, and "False" if the farmer remains asleep.

- **OrangeSafety**: This variable represents the safety of the oranges on the farm. It can take two states: "1" if the oranges are secure and "0" if there is a potential threat to the oranges.

Our task is to develop a solution that predicts the probability of each event within the Bayesian Network. To do that, we will use the probabilities collected by the farmer in his dataset ```./datasets/night_farm_dataset.csv```. Now, it's your challenge to:


**a)** Model the Bayesian Network for the problem.

**b)** Based on the network created in **a)**, construct the useful tables for relating variables states and probabilities.

**c)** Calculate the probabilities for each variable, with no prior evidence.

**d)** What is the probability that the alarm has been activated due to a sheep in the plantation? Additionally, what is the probability of the farmer waking up if there is an intruder at the farm?

**e)** Verify your results at [this](http://bayesjs.github.io/bayesjs-editor/) link.


### ☆ Solution ☆

**a)** We can start modeling the network using any node. It is possible to model a Bayesian network by choosing an initial random variable and establishing the relationships on a case-by-case basis. However, it is worth noting that there is not just one way to model a Bayesian network. Below, we present the most intuitive way to model it for this problem:

<div align="center">
	<img src = "./assets/network_1.png" />
</div>

Notice that the intruders, the thunder, and the sheep are linked only to the alarm, because if the alarm goes off, one of them will be the cause. Similarly, the alarm and the farmer waking up are connected to the orange safety, because if the alarm goes off and the farmer wakes up, the oranges are no longer at risk. With the data collected and the network built, we can start calculating the probabilities for each scenario.
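One way to make the structure explicit in code is to store, for each variable, the list of its parents. The sketch below is only an illustration: the `parents` dictionary is read off the conditional tables built in item **b)** (each table conditions a variable on its parents), so it is an assumption about the figure, not a transcription of it. It uses `graphlib` from the standard library (Python 3.9+) to obtain an order in which every variable comes after its parents.

```python
from graphlib import TopologicalSorter

# Assumed parent sets, taken from the conditioning variables of each table in b)
parents = {
    "IntruderPresence": [],
    "SheepPresence": [],
    "Thunder": [],
    "AlarmTriggered": ["IntruderPresence", "SheepPresence", "Thunder"],
    "FarmerWakeUp": ["AlarmTriggered"],
    "OrangeSafety": ["IntruderPresence", "FarmerWakeUp"],
}

# Root nodes have no parents, so they only need marginal probability tables
roots = [node for node, node_parents in parents.items() if not node_parents]

# An order in which each node appears after all of its parents
order = list(TopologicalSorter(parents).static_order())

print("Roots:", roots)
print("Evaluation order:", order)
```

Processing the variables in this order guarantees that the probabilities of a node's parents are available before the node itself is evaluated.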

**b)** We build the tables from the modeled Bayesian network, linking the variables according to the hierarchy defined there. Before that, we need to analyze the dataset to collect the initial probabilities for each table. Using pandas, we can start by calculating the marginal probabilities of the independent variables:

In [None]:
import pandas as pd

df = pd.read_csv('./datasets/night_farm_dataset.csv')
    
print('IntruderPresence marginal probabilities:')
p_IntruderPresence = df['IntruderPresence'].value_counts(normalize=True).to_dict()
print(p_IntruderPresence)
print()

print('SheepPresence marginal probabilities:')
p_SheepPresence = df['SheepPresence'].value_counts(normalize=True).to_dict()
print(p_SheepPresence)
print()

print('Thunder marginal probabilities:')
p_Thunder = df['Thunder'].value_counts(normalize=True).to_dict()
print(p_Thunder)
print()

After that, calculate the conditional probabilities for non-independent variables:

In [20]:
# AlarmTriggered conditional probability

p_AlarmTriggered = {}
print('AlarmTriggered probabilities:')
for intruder in [True, False]:
    for sheep in [True, False]:
        for thunder in [True, False]:
            sub_df = df[(df.IntruderPresence == intruder) & 
                        (df.SheepPresence == sheep) & 
                        (df.Thunder == thunder)]
            key = f'Intruder: {intruder}, Sheep: {sheep}, Thunder: {thunder}'
            p_AlarmTriggered[key] = sub_df.AlarmTriggered.value_counts(normalize=True).to_dict()
            print(key)
            print(p_AlarmTriggered[key])
            print()

# FarmerWakeUp conditional probability 

p_FarmerWakeUp = {}
print('FarmerWakeUp probabilities:')
for alarm in [True, False]:
    key = f'Alarm: {alarm}'    
    p_FarmerWakeUp[key] = df.FarmerWakeUp[df.AlarmTriggered == alarm].value_counts(normalize=True).to_dict()
    print(key)
    print(p_FarmerWakeUp[key])
    print()

# OrangeSafety conditional probability 

p_OrangeSafety = {}
print('OrangeSafety probabilities:')
for intruder in [True, False]:
    for wake_up in [True, False]:
        sub_df = df[(df.IntruderPresence == intruder) & (df.FarmerWakeUp == wake_up)]
        key = f'Intruder: {intruder}, WakeUp: {wake_up}'
        p_OrangeSafety[key] = sub_df.OrangeSafety.value_counts(normalize=True).to_dict()
        print(key)
        print(p_OrangeSafety[key]) 
        print()

AlarmTriggered probabilities:
Intruder: True, Sheep: True, Thunder: True
{True: 0.9064748201438849, False: 0.09352517985611511}

Intruder: True, Sheep: True, Thunder: False
{True: 0.9424083769633508, False: 0.05759162303664921}

Intruder: True, Sheep: False, Thunder: True
{True: 0.9189873417721519, False: 0.0810126582278481}

Intruder: True, Sheep: False, Thunder: False
{True: 0.9156010230179028, False: 0.08439897698209718}

Intruder: False, Sheep: True, Thunder: True
{True: 0.7982770997846375, False: 0.20172290021536252}

Intruder: False, Sheep: True, Thunder: False
{True: 0.7478422417463327, False: 0.2521577582536672}

Intruder: False, Sheep: False, Thunder: True
{True: 0.5923654760079718, False: 0.4076345239920282}

Intruder: False, Sheep: False, Thunder: False
{False: 0.9904435719655835, True: 0.009556428034416507}

FarmerWakeUp probabilities:
Alarm: True
{True: 0.8018630495178951, False: 0.19813695048210492}

Alarm: False
{False: 0.9513723795115626, True: 0.04862762048843743}

Ora

In the end, you will end up with these tables (values are rounded):

| IntruderPresence | Probability |
|:----------------:|:-----------:|
| True             |    0.05     |
| False            |    0.95     |


| SheepPresence | Probability |
|:------------:|:-----------:|
|     True     |    0.30     |
|    False     |    0.70     |

|  Thunder  | Probability |
|:---------:|:-----------:|
|   True    |    0.10     |
|   False   |    0.90     |

| IntruderPresence | SheepPresence | Thunder | Alarm(True) | Alarm(False) |
|:----------------:|:-------------:|:-------:|:-----------:|:-----------:|
|       True       |     True      |   True  |    0.95     |    0.05     |
|       True       |     True      |  False  |    0.94     |    0.06     |
|       True       |    False      |   True  |    0.93     |    0.07     |
|       True       |    False      |  False  |    0.92     |    0.08     |
|      False       |     True      |   True  |    0.80     |    0.20     |
|      False       |     True      |  False  |    0.75     |    0.25     |
|      False       |    False      |   True  |    0.60     |    0.40     |
|      False       |    False      |  False  |    0.01     |    0.99     |


| Alarm | FarmerWakeUp(True) | FarmerWakeUp(False) |
|:-----:|:-----------------:|:------------------:|
|  True |       0.80        |       0.20         |
|  False |       0.05        |       0.95         |

| IntruderPresence | FarmerWakeUp | OrangeSafety(Safe) | OrangeSafety(NotSafe) | 
|:----------------:|:------------:|:------------------:|:--------------------:|
|      False       |     True     |        1.00        |         0.00         |
|       True       |    False     |        0.00        |         1.00         |
|       True       |     True     |        1.00        |         0.00         |
|      False       |    False     |        1.00        |         0.00         |
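
The rounded tables can also be encoded directly as dictionaries, which is handy for checking calculations by hand without the dataset. The sketch below is an illustration under that assumption: it copies the rounded values from the tables (not the exact dataset estimates) and marginalizes over the parents to obtain $P(Alarm)$ and $P(FarmerWakeUp)$. All variable names are chosen for this example.

```python
from itertools import product

# Rounded marginals of the root variables
p_intruder = {True: 0.05, False: 0.95}
p_sheep = {True: 0.30, False: 0.70}
p_thunder = {True: 0.10, False: 0.90}

# P(Alarm = True | Intruder, Sheep, Thunder), keyed by (intruder, sheep, thunder)
p_alarm_given = {
    (True, True, True): 0.95, (True, True, False): 0.94,
    (True, False, True): 0.93, (True, False, False): 0.92,
    (False, True, True): 0.80, (False, True, False): 0.75,
    (False, False, True): 0.60, (False, False, False): 0.01,
}

# P(FarmerWakeUp = True | Alarm)
p_wake_given_alarm = {True: 0.80, False: 0.05}

# Sum P(Alarm | parents) * P(parents) over every combination of parent states
p_alarm = sum(
    p_alarm_given[(i, s, t)] * p_intruder[i] * p_sheep[s] * p_thunder[t]
    for i, s, t in product([True, False], repeat=3)
)

# Same idea one level down: sum over the two alarm states
p_wake = p_wake_given_alarm[True] * p_alarm + p_wake_given_alarm[False] * (1 - p_alarm)

print(f"P(Alarm = True) = {p_alarm:.4f}")        # 0.3074
print(f"P(FarmerWakeUp = True) = {p_wake:.4f}")  # 0.2806
```

Because the inputs are rounded, these numbers are only approximations of the values estimated from the full dataset.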


**c)** Using the probabilities calculated in item **b)**, we can now calculate the probabilities of AlarmTriggered, FarmerWakeUp and OrangeSafety with no prior evidence, using the Law of Total Probability:

\begin{equation}
P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)
\end{equation}


In [21]:
print('AlarmTriggered no prior evidence probability:')

nep_AlarmTriggered = 0
for intruder in [True, False]:
    for sheep in [True, False]:
        for thunder in [True, False]:
            value = p_AlarmTriggered[f'Intruder: {intruder}, Sheep: {sheep}, Thunder: {thunder}']
            nep_AlarmTriggered += value[True] * p_IntruderPresence[intruder] * p_SheepPresence [sheep] * p_Thunder[thunder]

print(f'True: {nep_AlarmTriggered}')
print(f'False: {1 - nep_AlarmTriggered}')
print()

print('FarmerWakeUp no prior evidence probability:')


nep_FarmerWakeUp = {True: 0, False: 0}

value = p_FarmerWakeUp[f'Alarm: {True}']
nep_FarmerWakeUp[True] += value[True] * nep_AlarmTriggered
print(nep_FarmerWakeUp[True])
value = p_FarmerWakeUp[f'Alarm: {False}']
nep_FarmerWakeUp[True] += value[True] * (1-nep_AlarmTriggered)

nep_FarmerWakeUp[False] = 1 - nep_FarmerWakeUp[True]  

print(f'True: {nep_FarmerWakeUp[True]}')
print(f'False: {nep_FarmerWakeUp[False]}')
print()

print('OrangeSafety no prior evidence probability:')
p_OrangeSafety[f'Intruder: {True}, WakeUp: {False}'][True] = 0

nep_OrangeSafety = 0
for intruder in [True, False]:
    for wake_up in [True, False]:
        value = p_OrangeSafety[f'Intruder: {intruder}, WakeUp: {wake_up}']
        nep_OrangeSafety += value[True] * p_IntruderPresence[intruder] * nep_FarmerWakeUp[wake_up]

print(f'True: {nep_OrangeSafety}')
print(f'False: {1 - nep_OrangeSafety}')
print()

AlarmTriggered no prior evidence probability:
True: 0.3058336980004401
False: 0.6941663019995599

FarmerWakeUp no prior evidence probability:
0.24523674172396787
True: 0.2789923972134645
False: 0.7210076027865355

OrangeSafety no prior evidence probability:
True: 0.9639568299367012
False: 0.036043170063298824



**d)** In the first scenario, we can use Bayes' Theorem to update our beliefs about the other variables given this new evidence:

\begin{equation}
P(Sheep|Alarm = True) = \frac{P(Alarm|Sheep = True) \cdot P(Sheep)}{P(Alarm)} = \frac{P(Alarm \cap Sheep)}{P(Alarm)}
\end{equation}

In the second scenario:

\begin{equation}
P(Farmer|Intruder = True) = \frac{P(Intruder|Farmer = True) \cdot P(Farmer)}{P(Intruder)} = \frac{P(Farmer \cap Intruder)}{P(Intruder)}
\end{equation}

In [37]:
alarm_sheep = 0
for intruder in [True, False]:
    for sheep in [True]:
        for thunder in [True, False]:
            value = p_AlarmTriggered[f'Intruder: {intruder}, Sheep: {sheep}, Thunder: {thunder}']
            alarm_sheep += value[True] * p_IntruderPresence[intruder] * p_SheepPresence [sheep] * p_Thunder[thunder]
            
p_1 = alarm_sheep / nep_AlarmTriggered
print(p_1)

alarm_intruder = 0
for intruder in [True]:
    for sheep in [True, False]:
        for thunder in [True, False]:
            for alarm in [True, False]:
                value = p_AlarmTriggered[f'Intruder: {intruder}, Sheep: {sheep}, Thunder: {thunder}']
                alarm_intruder += nep_FarmerWakeUp[True] * value[alarm] * p_IntruderPresence[True] * p_SheepPresence [sheep] * p_Thunder[thunder]

p_2 = alarm_intruder/p_IntruderPresence[True]
print(p_2)

0.7500379028982553
0.27899239721346447
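
Note that the second value printed above, 0.2790, matches the no-evidence $P(FarmerWakeUp = True)$ from item **c)**. The second loop multiplies by the marginal `nep_FarmerWakeUp[True]` and sums the alarm probabilities over both alarm states, so the intruder evidence cancels out. A possible alternative, sketched below, propagates the evidence through the alarm: $P(W|I) = \sum_{a} P(W|a) \cdot P(a|I)$. It uses the rounded tables from item **b)** rather than the dataset, so the numbers are approximate, and the variable names are chosen for this example.

```python
from itertools import product

# Rounded values copied from the tables in b)
p_sheep = {True: 0.30, False: 0.70}
p_thunder = {True: 0.10, False: 0.90}

# P(Alarm = True | Intruder = True, Sheep, Thunder), keyed by (sheep, thunder)
p_alarm_given_intruder_and = {
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.93, (False, False): 0.92,
}

# P(FarmerWakeUp = True | Alarm)
p_wake_given_alarm = {True: 0.80, False: 0.05}

# P(Alarm = True | Intruder = True): marginalize over sheep and thunder only
p_alarm_given_intruder = sum(
    p_alarm_given_intruder_and[(s, t)] * p_sheep[s] * p_thunder[t]
    for s, t in product([True, False], repeat=2)
)

# P(WakeUp = True | Intruder = True): the farmer depends on the intruder only via the alarm
p_wake_given_intruder = (
    p_wake_given_alarm[True] * p_alarm_given_intruder
    + p_wake_given_alarm[False] * (1 - p_alarm_given_intruder)
)

print(f"P(Alarm = True | Intruder = True) = {p_alarm_given_intruder:.4f}")   # 0.9270
print(f"P(WakeUp = True | Intruder = True) = {p_wake_given_intruder:.4f}")   # 0.7453
```

With an intruder present, the farmer is far more likely to wake up than the prior of roughly 0.28 suggests, which is the behavior we expect from the network.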


## Bayesian Classifier

<p style='text-align: justify;'>
    
Also based on Bayes' Theorem, the Bayesian Classifier is a widely used supervised machine learning algorithm for assigning objects to distinct categories based on their attributes or features. Its simplest variation is called the Naive Bayesian Classifier because it assumes that the variables of the problem are independent of each other, which simplifies the calculation of probabilities.

<div align="center">
	<a href="rep_icon">
	<img height = "400em" src = "./assets/bayes.png" />
    </a>
</div>

The classifier receives a training set and then calculates the probability of an example belonging to each class, based on the frequency of features in the training examples belonging to each class. To do so, it assumes conditional independence of the variables (or features) given the class value. This means that, given the value of class $C_k$, the features $X_1, X_2, \ldots, X_n$ are considered independent of each other. This is known as the Naive Bayes Assumption **(2)**:
</p>

\begin{equation}
P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n | C_k) = \prod_{i=1}^{n} P(X_i=x_i | C_k)
\end{equation}

<p style='text-align: justify;'>
   
This assumption greatly simplifies the process of calculating probabilities and makes the algorithm more efficient, but it may not reflect reality in some cases. With this in mind, the classifier uses Bayes' Theorem to calculate the probability of the test example belonging to each class. Because the evidence term in the denominator is the same for every class, it can be dropped from the comparison. The classifier then assigns the test example to the class with the highest resulting value, as described in formula **(3)**:
</p>

\begin{equation}
\text{Class: } \arg\max_{C_k} P(C_k) \prod_{i=1}^{n} P(x_i | C_k)
\end{equation}
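
As an illustration of formula **(3)**, the sketch below scores two classes for a tiny spam example. All probabilities are made up, and `predict` is a hypothetical helper. One deviation from the formula as written is labeled in the code: the product is computed as a sum of logarithms, a common trick to avoid numerical underflow when many small probabilities are multiplied. It does not change which class wins.

```python
import math

# Made-up priors P(C_k) and per-word likelihoods P(x_i | C_k)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.6},
}


def predict(words):
    """Return the class maximizing P(C_k) * prod_i P(x_i | C_k), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        # log(P(C_k)) + sum_i log(P(x_i | C_k)) has the same argmax as formula (3)
        scores[label] = math.log(prior) + sum(
            math.log(likelihoods[label][word]) for word in words
        )
    return max(scores, key=scores.get), scores


print(predict(["offer"])[0])             # spam (0.4 * 0.7 > 0.6 * 0.2)
print(predict(["offer", "meeting"])[0])  # ham  (0.028 < 0.072)
```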

<p style='text-align: justify;'>

Bayesian classifiers are particularly useful in situations where data is available for training the classifier, such as in spam detection in emails. In this case, the classifier can be trained using a set of labeled emails (spam and non-spam), where features of the emails, such as keywords, term frequencies, or syntactic characteristics, are extracted. This approach is particularly efficient when dealing with categorical data, such as words in a text. Additionally, spam detection requires a quick response, as unwanted emails need to be filtered in real-time. Trained Bayesian classifiers demonstrate fast computational speed, making them a suitable choice for this type of problem.

</p>

## ☆ Challenge #2: Plague on orange farm is back. ☆

The same farmer owns an orange farm and is concerned about the possibility of his oranges being affected by an unknown pest. He has decided to use the knowledge of Bayesian Classifier to make informed decisions regarding pest control on his plantation.

The farmer noticed that some oranges show suspicious symptoms, while others appear healthy. He is also considering the application of pesticides to control the possible pest infestation. To assist him in his decisions, let's model this situation with a Bayesian Classifier, using the dataset available in ```./datasets/orange_farm.csv```, which contains the following variables:

- **Infestation**: A binary variable indicating whether the oranges are infected with the pest or not. "1" represents infected oranges, and "0" indicates healthy oranges.
- **Symptoms**: A binary variable representing the presence or absence of symptoms in the oranges. "1" indicates oranges with suspicious symptoms, and "0" represents oranges without visible symptoms.
- **Pesticides**: A binary variable indicating whether the farmer applied pesticides to the plantation or not. "1" represents the application of pesticides, and "0" indicates that no pesticides were used.
- **Rainfall**: A numerical variable indicating the amount of recent rainfall in millimeters.
- **Soil Moisture**: A numerical variable representing the soil moisture in the plantation as a percentage.
- **Temperature**: A numerical variable representing the average temperature in degrees Celsius.
- **Management Practices**: A categorical variable indicating the management practices adopted in the plantation, "Conventional" and "Organic".

Now, let's address the four parts of the farmer's request:

**a)** Make an exploratory analysis of the data. The farmer should examine which variables seem to have the greatest influence on pest infestation. For example, he can use data visualization techniques like histograms, scatter plots, and correlation matrices to understand the relationships between the variables and the infestation status. This analysis will help him identify potential patterns and select the most relevant variables for the Bayesian classification.

**b)** Calculate the marginal probabilities of infestation (I), symptoms (S), and pesticide application (P). To do so, the farmer needs to count the occurrences of each state and divide them by the total number of observations in the dataset.

**c)** Determine the conditional probabilities. The farmer needs to calculate the probabilities of pest infestation (I) given the presence of symptoms (S), the application of pesticides (P), and different levels of rainfall (C), soil moisture (U), temperature (T), and management practices (M). He can do this by dividing the number of infested observations under each condition by the total number of observations for that condition.

**d)** Create a Bayesian Classification model. The farmer can use the same kind of conditional probabilities he calculated in part c) to estimate the likelihood of infestation (I) given the values of the other variables. He can then use Bayes' theorem to compute the posterior probabilities and make predictions on new data. The farmer should use the test dataset ```./datasets/orange_farm_test.csv``` to evaluate the model's accuracy.

### ☆ Solution ☆

As requested in item a), we can begin the exploratory analysis to help the farmer visualize the data and gain a better understanding of the problem. It is essential to plot the data on graphs and create a correlation matrix with the variables. This will enable a more in-depth comprehension of the situation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('./datasets/orange_farm.csv')

# Total number of observations (used later for the marginal probabilities)
total_samples = len(data)

# Work on `df` for the exploratory analysis below
df = pd.DataFrame(data)

# Display the first few rows of the dataset to get an overview
print(df.head())

# Get summary statistics for numerical variables
print(df.describe())

# Check the data types and missing values (info() prints its report and returns None)
df.info()

# Set up subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Bar plot for Infestation
sns.countplot(x='Infestation', data=df, ax=axes[0, 0])
axes[0, 0].set_title('Infestation Distribution')

# Bar plot for Symptoms
sns.countplot(x='Symptoms', data=df, ax=axes[0, 1])
axes[0, 1].set_title('Symptoms Distribution')

# Histogram for Rainfall
axes[1, 0].hist(df['Rainfall'], bins=20, edgecolor='k')
axes[1, 0].set_xlabel('Rainfall (mm)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Rainfall Distribution')

# Histogram for Soil Moisture
axes[1, 1].hist(df['Soil Moisture'], bins=20, edgecolor='k')
axes[1, 1].set_xlabel('Soil Moisture (%)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Soil Moisture Distribution')

# Add some space between the subplots
plt.tight_layout()

# Display the plots
plt.show()

# Set up subplots for the next set of plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram for Temperature
axes[0, 0].hist(df['Temperature'], bins=20, edgecolor='k')
axes[0, 0].set_xlabel('Temperature (°C)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Temperature Distribution')

# Bar plot for Pesticides
sns.countplot(x='Pesticides', data=df, ax=axes[0, 1])
axes[0, 1].set_title('Pesticides Application')

# Bar plot for Management Practices
sns.countplot(x='Management Practices', hue='Infestation', data=df, ax=axes[1, 0])
axes[1, 0].set_title('Management Practices and Infestation')

# Compute the correlation matrix
correlation_matrix = df[['Infestation', 'Symptoms', 'Pesticides', 'Rainfall', 'Soil Moisture', 'Temperature']].corr()

# Plot the correlation matrix as a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', ax=axes[1, 1])
axes[1, 1].set_title('Correlation Matrix')

# Add some space between the subplots
plt.tight_layout()

# Display the plots
plt.show()


As requested in item b), we can calculate the marginal probabilities using the following code:

In [None]:
# Count occurrences of each state for 'Infestation'
infestation_counts = data['Infestation'].value_counts()
p_infestation = infestation_counts[1] / total_samples

# Count occurrences of each state for 'Symptoms'
symptoms_counts = data['Symptoms'].value_counts()
p_symptoms = symptoms_counts[1] / total_samples

# Count occurrences of each state for 'Pesticides'
pesticides_counts = data['Pesticides'].value_counts()
p_pesticides = pesticides_counts[1] / total_samples

# Display the marginal probabilities
print(f"Marginal Probability of Infestation (I): {p_infestation:.2f}")
print(f"Marginal Probability of Symptoms (S): {p_symptoms:.2f}")
print(f"Marginal Probability of Pesticide Application (P): {p_pesticides:.2f}")

As requested in item c), we can calculate and store the conditional probabilities using the following code:

In [None]:
# Conditional Probability of Infestation (I) given Symptoms (S)
infestation_given_symptoms = data[data['Infestation'] == 1]['Symptoms'].value_counts() / data['Symptoms'].value_counts()

# Conditional Probability of Infestation (I) given Pesticides (P)
infestation_given_pesticides = data[data['Infestation'] == 1]['Pesticides'].value_counts() / data['Pesticides'].value_counts()

# Conditional Probability of Infestation (I) given Rainfall (C)
rainfall_intervals = pd.cut(data['Rainfall'], bins=[0, 50, 100, 150, 200])
infestation_given_rainfall = data[data['Infestation'] == 1].groupby(rainfall_intervals)['Infestation'].count() / data.groupby(rainfall_intervals)['Infestation'].count()

# Conditional Probability of Infestation (I) given Soil Moisture (U)
moisture_intervals = pd.cut(data['Soil Moisture'], bins=[0, 25, 50, 75, 100])
infestation_given_moisture = data[data['Infestation'] == 1].groupby(moisture_intervals)['Infestation'].count() / data.groupby(moisture_intervals)['Infestation'].count()

# Conditional Probability of Infestation (I) given Temperature (T)
temperature_intervals = pd.cut(data['Temperature'], bins=[10, 20, 30, 40])
infestation_given_temperature = data[data['Infestation'] == 1].groupby(temperature_intervals)['Infestation'].count() / data.groupby(temperature_intervals)['Infestation'].count()

# Conditional Probability of Infestation (I) given Management Practices (M)
infestation_given_management = data[data['Infestation'] == 1]['Management Practices'].value_counts() / data['Management Practices'].value_counts()

# Print infestation_given_symptoms
print("Conditional Probability of Infestation (I) given Symptoms (S):")
print(infestation_given_symptoms)

# Print infestation_given_pesticides
print("\nConditional Probability of Infestation (I) given Pesticides (P):")
print(infestation_given_pesticides)

# Print infestation_given_rainfall
print("\nConditional Probability of Infestation (I) given Rainfall (C):")
print(infestation_given_rainfall)

# Print infestation_given_moisture
print("\nConditional Probability of Infestation (I) given Soil Moisture (U):")
print(infestation_given_moisture)

# Print infestation_given_temperature
print("\nConditional Probability of Infestation (I) given Temperature (T):")
print(infestation_given_temperature)

# Print infestation_given_management
print("\nConditional Probability of Infestation (I) given Management Practices (M):")
print(infestation_given_management)
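
For a binary target coded as 0/1, the same kind of conditional probability can be obtained more compactly, because the mean of a 0/1 column within a group is the fraction of ones in that group. The sketch below shows this on a small made-up DataFrame named `toy` (not the farm dataset), as one possible alternative to dividing two `value_counts()` results.

```python
import pandas as pd

# Made-up observations, only to illustrate the calculation
toy = pd.DataFrame({
    "Symptoms":    [1, 1, 1, 0, 0, 0, 0, 1],
    "Infestation": [1, 1, 0, 0, 0, 1, 0, 1],
})

# P(Infestation = 1 | Symptoms = s): mean of the 0/1 target within each group
p_infestation_given_symptoms = toy.groupby("Symptoms")["Infestation"].mean()

print(p_infestation_given_symptoms)
```

In this toy data, 3 of the 4 oranges with symptoms are infested (0.75), against 1 of the 4 without symptoms (0.25).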

Now it's time to begin item d), where we will develop the Bayesian model. First, let's define a function that calculates the joint probabilities between the target variable (Infestation) and the selected features (the other variables). Following the same counting approach as in item c), it divides, for each target value, the number of occurrences of each target-feature combination by the total number of occurrences of that target value, and then multiplies the result by the probability of the target value. The resulting joint probabilities are the building blocks of the Bayesian classifier.

In [None]:
# Define the features and the target
features = ['Symptoms', 'Pesticides', 'Rainfall', 'Soil Moisture', 'Temperature', 'Management Practices']
target = 'Infestation'

# Function to calculate the joint probabilities P(target) * P(feature value | target)
def calculate_joint_probabilities(data, features, target):
    joint_probabilities = {}
    
    target_values = data[target].unique()
    feature_values = [data[feature].unique() for feature in features]

    for target_value in target_values:
        # Compute P(target)
        p_target = len(data[data[target] == target_value]) / len(data)

        # Initialize dictionary for this target value
        joint_probabilities[target_value] = {}

        for feature, feature_value in zip(features, feature_values):
            # Initialize dictionary for this feature
            joint_probabilities[target_value][feature] = {}

            for value in feature_value:
                # Compute P(feature | target)
                conditional_probability = len(data[(data[target] == target_value) & (data[feature] == value)]) / len(data[data[target] == target_value])

                # Compute joint probability and store it
                joint_probabilities[target_value][feature][value] = p_target * conditional_probability
                
    return joint_probabilities
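
The function above treats every distinct value of a feature as its own category. For continuous columns such as Rainfall, Soil Moisture and Temperature, almost every value is unique, so a value from a new sample will rarely match one seen in training. One possible remedy, sketched below, is to discretize those columns before counting. The helper `discretize` and the `BIN_EDGES` dictionary are hypothetical names, and the edges are simply the ones used with `pd.cut` in item c); other choices may suit the data better.

```python
import pandas as pd

# Bin edges reused from the conditional-probability step in c) (an assumption)
BIN_EDGES = {
    "Rainfall": [0, 50, 100, 150, 200],
    "Soil Moisture": [0, 25, 50, 75, 100],
    "Temperature": [10, 20, 30, 40],
}


def discretize(frame, bin_edges=BIN_EDGES):
    """Return a copy of `frame` with the continuous columns replaced by bin codes."""
    out = frame.copy()
    for column, edges in bin_edges.items():
        # labels=False yields integer bin codes; values outside the edges become NaN
        out[column] = pd.cut(out[column], bins=edges, labels=False)
    return out


# Made-up rows, only to show the effect
toy = pd.DataFrame({
    "Rainfall": [12.5, 75.0, 180.0],
    "Soil Moisture": [10.0, 60.0, 99.0],
    "Temperature": [15.0, 25.0, 35.0],
})
binned = discretize(toy)
print(binned)
```

The same `discretize` call would have to be applied to both the training and the test data, so that the classifier sees the same categories in both.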

After creating the function to calculate probabilities, it's time to train the classifier and test it with a new dataset to assess its performance and compare the results. We can create a `classify` function that takes a new data point (sample) and predicts its class by combining, for each class, the joint probabilities of all its feature values. It then assigns the sample to the class with the highest resulting probability.

In [None]:
# Calculate joint probabilities
joint_probabilities = calculate_joint_probabilities(data, features, target)

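# Function to classify a new sample
def classify(sample, joint_probabilities, features):
    max_prob = -1
    predicted_class = None

    for target_value, per_feature in joint_probabilities.items():
        # Each stored entry is P(target) * P(value | target), so summing the
        # entries of one feature recovers the prior P(target)
        prior = sum(per_feature[features[0]].values())

        # Formula (3): P(target) * prod_i P(x_i | target), with the prior counted only once
        prob = prior
        for feature in features:
            # Assume a joint probability of 1e-10 if the feature value was not seen in the training set
            joint = per_feature[feature].get(sample[feature], 1e-10)
            prob *= joint / prior
        if prob > max_prob:
            max_prob = prob
            predicted_class = target_value

    return predicted_class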

# Load the test dataset
test_data = pd.read_csv('./datasets/orange_farm_test.csv')

# Use the classifier to predict the target for each sample in the test set
test_data['Predicted Infestation'] = test_data.apply(lambda row: classify(row, joint_probabilities, features), axis=1)

# Calculate the accuracy of the classifier
accuracy = (test_data['Predicted Infestation'] == test_data['Infestation']).mean()
print(f'Accuracy: {accuracy:.2f}')

# Count the occurrences of each class in the real and predicted data
actual_counts = test_data['Infestation'].value_counts().sort_index()
predicted_counts = test_data['Predicted Infestation'].value_counts().sort_index()

# Create a DataFrame from these counts
counts_df = pd.DataFrame({'Actual': actual_counts, 'Predicted': predicted_counts})

# Plot a stacked bar chart
counts_df.plot(kind='bar', stacked=True, color=['skyblue', 'orange'])
plt.xlabel('Infestation')
plt.ylabel('Count')
plt.title('Comparison of actual and predicted infestation')
plt.show()
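
Accuracy alone does not show which kind of mistake the classifier makes. A confusion matrix separates missed infestations from false alarms. The sketch below builds one with `pd.crosstab` on a small made-up `results` DataFrame that reuses the column names from the cell above; with the real data, `test_data` would take its place.

```python
import pandas as pd

# Made-up actual and predicted labels, only to illustrate the layout
results = pd.DataFrame({
    "Infestation":           [1, 0, 1, 1, 0, 0],
    "Predicted Infestation": [1, 0, 0, 1, 1, 0],
})

# Rows are the actual classes, columns the predicted classes
confusion = pd.crosstab(
    results["Infestation"],
    results["Predicted Infestation"],
    rownames=["Actual"],
    colnames=["Predicted"],
)
print(confusion)

# Recall for the infested class: infested oranges correctly flagged / all infested oranges
recall_infested = confusion.loc[1, 1] / confusion.loc[1].sum()
print(f"Recall (infested): {recall_infested:.2f}")
```

For pest control, a missed infestation (actual 1, predicted 0) is probably the more costly error, so the recall of the infested class is worth tracking alongside accuracy.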

## Clear the Memory

Before moving on, please execute the following cell to clear up the CPU memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)