<div style="text-align:center">
    <h1>Universidad Politecnica de Yucatan</h1>
    <h2>Computational Robotics Engineering</h2>
    <h2>Machine Learning</h2>
    <h3>Task: Supervised Learning - SVM, DT &amp; Naive Bayes</h3>
    <h3>Teacher Victor Alejandro Ortiz Santiago</h3>
    <h3>Mariana Guadalupe Chi Centeno</h3>
    <h3>20009038</h3>
    <h3>October 30, 2023</h3>
    <h3>9B</h3>
</div>

# Naive Bayes

Naive Bayes is a family of probabilistic classifiers that apply Bayes' Theorem to predict the class of a data point. The term "naive" refers to the assumption that the features are conditionally independent of one another given the class.

## 1. Basics:
- **Bayes' Theorem**: The foundational pillar of the Naive Bayes algorithm. The theorem links the conditional and marginal probabilities of two random events. Mathematically:

  \[
  P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
  \]

  Where:
  - \( P(A|B) \): Probability of \( A \) given that \( B \) is true (the posterior).
  - \( P(B|A) \): Probability of \( B \) given that \( A \) is true (the likelihood).
  - \( P(A) \) and \( P(B) \): Marginal probabilities of \( A \) and \( B \), respectively.
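
The formula above can be checked with a small numeric sketch. The events and all probabilities below are hypothetical values chosen for illustration; \( P(B) \) is obtained from the law of total probability.

```python
def bayes_posterior(prior_a: float, likelihood_b_given_a: float, evidence_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical events: A = "message is spam", B = "message contains the word 'offer'"
p_spam = 0.2                 # P(A), assumed
p_offer_given_spam = 0.5     # P(B|A), assumed
p_offer_given_ham = 0.05     # P(B|not A), assumed

# P(B) via the law of total probability
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

posterior = bayes_posterior(p_spam, p_offer_given_spam, p_offer)
print(round(posterior, 3))
```

With these assumed numbers, observing the word raises the probability of spam from the prior of 0.2 to roughly 0.71.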

## 2. Classification Context:
Let \( A \) be the event that a data point belongs to a specific class, and \( B \) the event that the data point has certain features. We're interested in \( P(A|B) \), the posterior probability that the data point belongs to that class given its features.

## 3. Why "Naive"?
The algorithm is termed "naive" because it presumes the features are conditionally independent given the class. Even though this rarely holds in real-world data, the algorithm often performs well in practice.
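
The effect of the independence assumption can be written out explicitly. Here \( C \) denotes a class and \( x_1, \dots, x_n \) the feature values; this notation is introduced only for illustration. Applying Bayes' Theorem and then the assumption gives:

\[
P(C | x_1, \dots, x_n) = \frac{P(C) \times P(x_1, \dots, x_n | C)}{P(x_1, \dots, x_n)} \propto P(C) \prod_{i=1}^{n} P(x_i | C)
\]

The denominator is the same for every class, so the predicted class is the one that maximizes the numerator:

\[
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i | C)
\]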

## 4. Types of Naive Bayes:
- **Gaussian Naive Bayes**: Assumes continuous features are normally distributed within each class.
- **Multinomial Naive Bayes**: Designed for discrete counts. Commonly applied in text classification, where documents are represented as word-count vectors.
- **Bernoulli Naive Bayes**: Designed for binary (0/1) feature vectors.
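
As a rough sketch of how the three variants map onto scikit-learn, the example below fits `GaussianNB`, `MultinomialNB`, and `BernoulliNB` on tiny made-up arrays. The data and labels are hypothetical and only illustrate which feature type each class expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical labels for four toy samples (two per class)
y = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian Naive Bayes
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
gnb = GaussianNB().fit(X_cont, y)

# Non-negative counts (e.g. word counts) -> Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Binary presence/absence indicators -> Bernoulli Naive Bayes
X_bin = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_bin, y)

pred_gnb = gnb.predict([[1.1, 2.0]])
pred_mnb = mnb.predict([[2, 0, 1]])
pred_bnb = bnb.predict([[0, 1, 1]])
print(pred_gnb, pred_mnb, pred_bnb)
```

All three classes share the same `fit`/`predict` interface, so switching variants is mostly a matter of matching the estimator to the feature representation.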

## 5. Advantages:
- **Efficiency**: Only a small amount of training data is needed to estimate the required parameters for classification.
- **Simplicity**: The algorithm decouples class conditional feature distributions, allowing each distribution to be independently estimated as one-dimensional.

## 6. Disadvantages:
- **Naive Assumption**: Naive Bayes assumes that features are conditionally independent given the class. Correlated features violate this assumption, which can distort the probability estimates and hurt performance.
- **Zero Frequency Issue**: If a category appears in the test data but was never observed with a given class in the training data, the model assigns it a probability of zero, which zeroes out the whole product for that class. This can be mitigated with techniques like Laplace (additive) smoothing.
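
The zero frequency issue and its correction can be sketched in a few lines of plain Python. The helper `smoothed_likelihoods` and the toy values are hypothetical; the idea is to add a pseudo-count `alpha` to every category before normalizing.

```python
from collections import Counter


def smoothed_likelihoods(values, vocabulary, alpha=1.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    `values` are the observed feature values within one class and
    `vocabulary` lists every category the feature can take.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}


# Hypothetical training values of one categorical feature within a single class
train_values = ["China", "China", "Canada"]
vocabulary = ["China", "Canada", "Mexico"]  # "Mexico" was never seen in this class

unsmoothed = smoothed_likelihoods(train_values, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(train_values, vocabulary, alpha=1.0)
print(unsmoothed["Mexico"], smoothed["Mexico"])
```

Without smoothing the unseen category gets probability 0; with `alpha=1.0` it gets a small positive probability and the estimates still sum to 1. In scikit-learn, `MultinomialNB` exposes the same idea through its `alpha` parameter, which defaults to 1.0.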

In summary, despite its straightforward nature and naive assumptions, Naive Bayes can be highly effective in various real-world situations, especially text classification tasks.


# References:

[1] L. Gonzalez. “Naive Bayes – Teoría - Aprende IA”. Aprende IA. Accessed October 31, 2023. [Online]. Available: https://aprendeia.com/algoritmo-naive-bayes-machine-learning/

[2] “Algoritmo Bayes naive de Microsoft”. Microsoft Learn: Build skills that open doors in your career. Accessed October 31, 2023. [Online]. Available: https://learn.microsoft.com/es-es/analysis-services/data-mining/microsoft-naive-bayes-algorithm?view=asallproducts-allversions

[3] “Modelos Naive Bayes: Precisión e independencia - The Black Box Lab”. The Black Box Lab. Accessed October 31, 2023. [Online]. Available: https://theblackboxlab.com/2022/03/30/modelos-naive-bayes/

[4] “Clasificación de Texto con Naive Bayes en Python - Ander Fernández”. Ander Fernández. Accessed October 31, 2023. [Online]. Available: https://anderfernandez.com/blog/naive-bayes-en-python/

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [4]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\Desktop\9no Cuatrimestre\Machine Learning\datos\kaggle.csv')
df

Unnamed: 0,Lat,Lng,What Dinosaurs Eat,Accepted Name,Country,Cc,Diet,Early Interval,Formation,Geological Interval,Geological Time Period,Ref Author,Ref Pubyr,State,Max Ma,Min Ma
0,42.933300,123.966698,PLANT,Chaoyangsaurus youngi,China,CN,herbivore,Late Tithonian,Tuchengzi,Tithonian,Jurassic,Dong,1992,Liaoning,150.8,132.90
1,41.799999,120.733330,PLANT and ANIMAL,Protarchaeopteryx robusta,China,CN,omnivore,Late Barremian,Yixian,Barremian,Cretaceous,Ji et al.,1998,Liaoning,130.0,122.46
2,41.799999,120.733330,PLANT and ANIMAL,Caudipteryx zoui,China,CN,omnivore,Late Barremian,Yixian,Barremian,Cretaceous,Ji and Ji,1997,Liaoning,130.0,122.46
3,50.740726,-111.528732,FLESH,Gorgosaurus libratus,Canada,CA,carnivore,Late Campanian,Dinosaur Park,Campanian,Cretaceous,Matthew and Brown,1922,Alberta,83.5,70.60
4,50.737015,-111.549347,FLESH,Gorgosaurus libratus,Canada,CA,carnivore,Late Campanian,Dinosaur Park,Campanian,Cretaceous,Russell,1970,Alberta,83.5,70.60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2458,49.183334,-98.099998,FISH,Hesperornis chowi,Canada,CA,piscivore,Early Campanian,Pierre Shale,Campanian,Cretaceous,Aotsuka and Sato,2016,Manitoba,83.5,70.60
2459,49.183334,-98.099998,FISH,Hesperornis macdonaldi,Canada,CA,piscivore,Early Campanian,Pierre Shale,Campanian,Cretaceous,Aotsuka and Sato,2016,Manitoba,83.5,70.60
2460,49.183334,-98.099998,FISH,Hesperornis macdonaldi,Canada,CA,piscivore,Early Campanian,Pierre Shale,Campanian,Cretaceous,Aotsuka and Sato,2016,Manitoba,83.5,70.60
2461,49.183334,-98.099998,FISH,Hesperornis chowi,Canada,CA,piscivore,Early Campanian,Pierre Shale,Campanian,Cretaceous,Aotsuka and Sato,2016,Manitoba,83.5,70.60


In [5]:
# Checking for missing values in the selected columns
missing_values = df[['What Dinosaurs Eat', 'Country', 'Geological Time Period', 'Diet']].isnull().sum()

# Checking the distribution of the 'Diet' column
diet_distribution = df['Diet'].value_counts()

missing_values, diet_distribution

(What Dinosaurs Eat        0
 Country                   0
 Geological Time Period    0
 Diet                      0
 dtype: int64,
 herbivore              1183
 carnivore              1085
 carnivore, omnivore      84
 omnivore                 42
 piscivore                37
 herbivore, omnivore      32
 Name: Diet, dtype: int64)

In [None]:
# Filtering the dataset for the three primary diet categories
filtered_data = df[df['Diet'].isin(['herbivore', 'carnivore', 'omnivore'])]

# One-hot encoding the categorical features
encoded_data = pd.get_dummies(filtered_data[['What Dinosaurs Eat', 'Country', 'Geological Time Period']], drop_first=True)
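
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does on a tiny, made-up `Country` column. The toy frame is hypothetical and is not part of the dinosaur dataset.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"Country": ["Canada", "China", "Mexico", "China"]})

# One indicator column per category, minus the first category in sorted order
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

The dropped category ("Canada" here) is represented implicitly by a row in which every indicator is zero, which is why the encoded table has one column fewer than the number of categories.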

In [7]:
# Adding the 'Diet' column to our encoded data
encoded_data['Diet'] = filtered_data['Diet']

# Displaying the first few rows of the encoded data
encoded_data.head()

Unnamed: 0,What Dinosaurs Eat_PLANT,What Dinosaurs Eat_PLANT and ANIMAL,Country_China,Country_Mexico,Country_United States,Geological Time Period_Jurassic,Geological Time Period_Triassic,Diet
0,1,0,1,0,0,1,0,herbivore
1,0,1,1,0,0,0,0,omnivore
2,0,1,1,0,0,0,0,omnivore
3,0,0,0,0,0,0,0,carnivore
4,0,0,0,0,0,0,0,carnivore


In [9]:
# Splitting the data into features (X) and target (y)
X = encoded_data.drop('Diet', axis=1)
y = encoded_data['Diet']

In [10]:
# Splitting the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train.shape, X_test.shape

((1848, 7), (462, 7))

In [13]:
# Initializing the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Training the classifier
nb_classifier.fit(X_train, y_train)

# Predicting on the test set
y_pred = nb_classifier.predict(X_test)

In [16]:
# Evaluating the model's performance
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9264069264069265

In [17]:
# Per-class precision, recall, and F1-score on the test set
class_report = classification_report(y_test, y_pred)
print(class_report)

              precision    recall  f1-score   support

   carnivore       1.00      0.84      0.92       217
   herbivore       0.87      1.00      0.93       237
    omnivore       1.00      1.00      1.00         8

    accuracy                           0.93       462
   macro avg       0.96      0.95      0.95       462
weighted avg       0.94      0.93      0.93       462