
| Name                   | ID           |
|------------------------|--------------|
| <font color="#008000" size="5">Antonios Gerges</font> | <font color="#008000" size="5">20221903971</font> |


### Import Necessary Libraries

This cell imports essential Python libraries for data handling and machine learning:
- **pandas**: For data loading, manipulation, and analysis.
- **scikit-learn's `train_test_split`**: To split the dataset into training and testing sets.
- **scikit-learn's `LabelEncoder`**: To encode each categorical value as an integer between 0 and the number of classes minus 1.
- **scikit-learn's `GaussianNB`**: The Gaussian Naive Bayes classifier used for the classification task.
- **scikit-learn's metrics**: To evaluate the model using the confusion matrix and the accuracy, recall, and precision scores.
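
As a small illustration of the integer encoding described for `LabelEncoder`, the sketch below encodes a handful of made-up income labels. The list `demo_labels` and the name `demo_le` are hypothetical and used only for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for a categorical column (illustrative only)
demo_labels = ['>50K', '<=50K', '<=50K', '>50K', '<=50K']

demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(demo_labels)

# classes_ holds the sorted unique labels; a label's index is its integer code
print(list(demo_le.classes_))
print(demo_codes.tolist())

# inverse_transform maps integer codes back to the original strings
print(demo_le.inverse_transform([0, 1]).tolist())
```

Because the classes are sorted, `'<=50K'` receives code 0 and `'>50K'` receives code 1, which is why class 1 can be treated as the positive class later on.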


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score

### Steps 0 & 1: Data Loading and Preparation

**Step 0: Data Loading**
- **Data Source**: Load the Adult dataset from the UCI Machine Learning Repository. It contains demographic and employment details of adults, together with an income label (at most $50K or over $50K).
- **Columns Defined**: The raw file has no header row, so a name is assigned to each column explicitly to keep later manipulation and analysis readable.

**Step 1: Data Preparation**
- **Missing Values**: Replace missing values (marked with `?` in the raw file) with the mode of each column so that no gaps remain.
- **Categorical Encoding**: Convert every categorical column, including the target, into integer codes using `LabelEncoder`, since the classifier requires numeric input.
- **Feature Selection**: Separate the dataset into features (`X`) and the target variable (`y`), where the target is the 'income' column.
- **Data Splitting**: Divide the data into training and testing sets, allocating 75% for training and 25% for testing, so that the model is evaluated on data it has not seen during training.
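
To make the mode-based imputation concrete, here is a minimal sketch on a made-up four-row frame. The frame `toy` and its values are hypothetical; only the `mode().iloc[0]` and `fillna` idiom mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap per column (illustrative only)
toy = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', 'State-gov'],
    'hours-per-week': [40, 40, np.nan, 20],
})

# mode() returns the most frequent value(s) per column as a DataFrame;
# .iloc[0] keeps the first row, i.e. one mode per column
toy_modes = toy.mode().iloc[0]

# fillna with a Series fills each column using the entry under its own label
toy_filled = toy.fillna(toy_modes)
print(toy_filled)
```

Filling with the mode keeps every row, at the cost of making the most frequent category slightly more dominant.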


In [None]:
# Step 0: Data Loading
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Define column names
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country', 'income']

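# Load data; skipinitialspace=True strips the blank after each comma,
# so missing values reach the parser as '?' (not ' ?')
data = pd.read_csv(url, names=columns, na_values='?', skipinitialspace=True)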

# Step 1: Data Preparation
# Handle missing values by filling with the mode
data = data.fillna(data.mode().iloc[0])

# Encode categorical features
encoders = {}
for col in data.select_dtypes(include=['object']):
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    encoders[col] = le

# Split data into features and target variable
X = data.drop('income', axis=1)
y = data['income']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
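
Since `skipinitialspace=True` strips the blank after each comma, it may be worth confirming that the NA marker passed to `read_csv` matches the stripped field. The sketch below does this on a made-up three-row sample in the same "comma plus space" layout; the sample rows and the helper `count_missing` are hypothetical.

```python
import io
import pandas as pd

# Three made-up rows laid out like the raw file (illustrative only)
sample = "39, State-gov, 77516\n54, ?, 180211\n25, Private, 226802\n"
sample_cols = ['age', 'workclass', 'fnlwgt']

def count_missing(na_marker):
    """Parse the sample with the given NA marker and count missing cells."""
    df = pd.read_csv(io.StringIO(sample), names=sample_cols,
                     na_values=na_marker, skipinitialspace=True)
    return int(df.isna().sum().sum())

# Compare the two candidate markers; the one that matches the stripped
# field should report exactly one missing cell
print("missing with '?': ", count_missing('?'))
print("missing with ' ?':", count_missing(' ?'))
```

Running the same check on the full frame with `data.isna().sum()` right after loading shows whether any placeholders were picked up before imputation.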

### Steps 2 & 3: Naive Bayes Model Training and Evaluation

**Step 2: Naive Bayes Model**
- **Model Initialization and Training**: Initialize and train a Gaussian Naive Bayes model on the training data. The model assumes that, within each class, every feature is normally distributed and independent of the other features. The label-encoded categorical columns only approximate this assumption, which can limit performance.
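
For reference, the standard Gaussian Naive Bayes rule can be written as follows. The notation is introduced here for illustration and does not appear in the notebook: $c$ is a class, $x_j$ is the $j$-th of $d$ features, and $\mu_{jc}$, $\sigma_{jc}^{2}$ are the mean and variance of feature $j$ within class $c$, estimated from the training data.

$$
P(c \mid x) \;\propto\; P(c) \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{jc}^{2}}} \exp\!\left(-\frac{(x_j - \mu_{jc})^{2}}{2\sigma_{jc}^{2}}\right)
$$

The predicted class is the one with the largest posterior; normalizing the right-hand side over all classes gives the probabilities themselves.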

**Step 3: Prediction and Evaluation**
- **Prediction**: Use the trained model to predict the income classification on the test data.
- **Evaluation Metrics**: Calculate key performance metrics to evaluate the model:
  - **Accuracy**: Overall correctness of the model.
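  - **Sensitivity (Recall)**: Fraction of actual positive instances that the model identifies. The code calls `recall_score` with `average='macro'`, so the reported figure is the unweighted mean of the recall of both classes rather than the recall of the over-$50K class alone.
  - **Specificity**: Fraction of actual negative instances that the model identifies, computed from the confusion matrix as TN / (TN + FP).
  - **Precision**: Fraction of positive predictions that are correct. It is also computed with `average='macro'`, so the reported figure averages the precision of both classes.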
- **Confusion Matrix**: Display the confusion matrix to visualize true positives, true negatives, false positives, and false negatives.



In [None]:
# Step 2: Naive Bayes Model
model = GaussianNB()
model.fit(X_train, y_train)

# Step 3: Prediction and Evaluation
y_pred = model.predict(X_test)

# Compute evaluation metrics; average='macro' averages recall and precision over both classes
accuracy = accuracy_score(y_test, y_pred)
sensitivity = recall_score(y_test, y_pred, average='macro')
precision = precision_score(y_test, y_pred, average='macro')

# Compute specificity
conf_matrix = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Sensitivity (Recall):", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Confusion Matrix:")
print(conf_matrix)



Accuracy: 0.800761577201818
Sensitivity (Recall): 0.6346307925138157
Specificity: 0.9501126488574188
Precision: 0.7415233415233415
Confusion Matrix:
[[5904  310]
 [1312  615]]
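
Because recall and precision are computed with `average='macro'`, the printed values blend both classes. As a rough cross-check, the sketch below recomputes the figures for the positive class alone from the confusion matrix printed above. The names `cm`, `sensitivity_pos`, `precision_pos`, and `macro_recall` are introduced here for illustration.

```python
import numpy as np

# Confusion matrix as printed above (rows: true class, columns: predicted class)
cm = np.array([[5904, 310],
               [1312, 615]])
cm_tn, cm_fp, cm_fn, cm_tp = cm.ravel()

sensitivity_pos = cm_tp / (cm_tp + cm_fn)   # recall of the positive class
specificity_neg = cm_tn / (cm_tn + cm_fp)   # recall of the negative class
precision_pos = cm_tp / (cm_tp + cm_fp)     # precision of the positive class

# Macro recall is the unweighted mean of the two per-class recalls
macro_recall = (sensitivity_pos + specificity_neg) / 2

print(f"Sensitivity (positive class): {sensitivity_pos:.4f}")
print(f"Specificity:                  {specificity_neg:.4f}")
print(f"Precision (positive class):   {precision_pos:.4f}")
print(f"Macro-averaged recall:        {macro_recall:.4f}")
```

On the recorded matrix this gives a positive-class sensitivity of about 0.3191 and a positive-class precision of about 0.6649, while the macro-averaged recall reproduces the printed 0.6346.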


### Step 4: Posterior Probability Analysis

This step focuses on extracting and analyzing the posterior probabilities from the trained Gaussian Naive Bayes model:

- **Probability Calculation**: Use `predict_proba` to compute, for each test instance, the posterior probability of the positive class (income over $50K), which is column 1 of the returned array.
- **Specific Instance Probability**: Display the posterior probability for the first test instance, i.e. how likely the model considers that person to earn over $50K.
- **Average Probability**: Optionally, compute the mean of this probability across all test instances, which summarizes how strongly the model leans toward the positive class on the test set as a whole.


In [None]:
# Step 4: Posterior Probability
prob_positive = model.predict_proba(X_test)[:, 1]
print("Posterior Probability of making over 50K (first instance):", prob_positive[0])

# Optional: Show average probability of making over 50K for all predictions
avg_probability = prob_positive.mean()
print("Average Posterior Probability of making over 50K:", avg_probability)


Posterior Probability of making over 50K (first instance): 0.005650398505762114
Average Posterior Probability of making over 50K: 0.12565043191240785


### Model Evaluation Metrics and Posterior Probabilities

| Metric                                                | Value                      |
|-------------------------------------------------------|----------------------------|
| **Accuracy**                                          | 0.8008                     |
| **Sensitivity (recall, macro-averaged)**              | 0.6346                     |
| **Specificity**                                       | 0.9501                     |
| **Precision (macro-averaged)**                        | 0.7415                     |
| **Confusion Matrix** (rows: true, columns: predicted) | [[5904, 310], [1312, 615]] |
| **Posterior Probability (1st instance)**              | 0.0057                     |
| **Average Posterior Probability**                     | 0.1257                     |

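The model classifies about 80% of the test set correctly, but the errors are uneven. Specificity is 0.9501, so the majority class is recognized well, yet the confusion matrix shows that only 615 of the 1,927 test instances with income over $50K are predicted as such. The sensitivity and precision values are macro-averaged over both classes, so they sit between the per-class figures. The posterior probabilities give the model's estimated likelihood of an income over $50K for the first test instance (0.0057) and on average across the test set (0.1257).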
