# Gaussian Naive Bayes model

In [2]:
import numpy as np 
import pandas as pd 
import warnings 
warnings.filterwarnings("ignore")

DATA_PATH="/kaggle/input/crop-recommendation-dataset/Crop_recommendation.csv"


1.  Import the necessary libraries (numpy and pandas) and suppress warnings.
2.  Specify the path to the dataset.

### Read the dataset and display the DataFrame

In [3]:
df=pd.read_csv(DATA_PATH)
df.head()

|   | N | P | K | temperature | humidity | ph | rainfall | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 20.879744 | 82.002744 | 6.502985 | 202.935536 | rice |
| 1 | 85 | 58 | 41 | 21.770462 | 80.319644 | 7.038096 | 226.655537 | rice |
| 2 | 60 | 55 | 44 | 23.004459 | 82.320763 | 7.840207 | 263.964248 | rice |
| 3 | 74 | 35 | 40 | 26.491096 | 80.158363 | 6.980401 | 242.864034 | rice |
| 4 | 78 | 42 | 42 | 20.130175 | 81.604873 | 7.628473 | 262.71734 | rice |


1. Load the dataset from the specified path into a DataFrame.
2. Display the first five rows of the dataset using 'head()'.

In [4]:
Y = np.asarray(df['label'])
unique_ele = np.unique(Y)
print(unique_ele) # Printing the unique crop types to view

['apple' 'banana' 'blackgram' 'chickpea' 'coconut' 'coffee' 'cotton'
 'grapes' 'jute' 'kidneybeans' 'lentil' 'maize' 'mango' 'mothbeans'
 'mungbean' 'muskmelon' 'orange' 'papaya' 'pigeonpeas' 'pomegranate'
 'rice' 'watermelon']


1. Separate the labels from the features by converting the 'label' column of the DataFrame to a NumPy array.
2. Identify the unique elements (crop types) in the label array.
3. 'Y' now contains the labels, and 'unique_ele' contains the 22 distinct crop types in sorted order.
4. Both are used later to build and evaluate the model.

### Indexing each distinct label

In [5]:
for i in range(len(Y)):
    for j in range(len(unique_ele)):
        if Y[i] == unique_ele[j]:
            Y[i] = j
print(Y)

[20 20 20 ... 5 5 5]


1. Each original crop type label is replaced by its index in the sorted list of unique crop types.
2. Converting categorical labels (such as crop types) into numerical indices makes it easier for the model to process and learn from the data.
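
The same label-to-index mapping can be obtained without nested loops. The following is a minimal sketch on hypothetical toy labels (not the crop dataset), assuming the goal is only to reproduce the mapping described above; `labels`, `classes`, and `encoded` are illustrative names.

```python
import numpy as np

# Hypothetical toy labels standing in for df['label']
labels = np.array(["rice", "maize", "rice", "apple", "maize"])

# np.unique returns the sorted distinct values; with return_inverse=True it
# also returns, for every element, its index into that sorted array.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)  # ['apple' 'maize' 'rice']
print(encoded)  # [2 1 2 0 1]
```

Indexing `classes[encoded]` recovers the original labels, which is handy when reporting predictions by crop name.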

## Train-Test data split

In [7]:
def train_test_split(seed,split,X,Y):
    # Shuffle the row indices reproducibly
    np.random.seed(seed)
    # Use the length of X (not the global df) so the function works on any input
    indices = np.random.permutation(len(X))
    
    # Define the test size
    test_size = split
    num_test_samples = int(len(X) * test_size)
    
    # Split indices into train and test
    train_indices = indices[:-num_test_samples]
    test_indices = indices[-num_test_samples:]
    x_train = X.iloc[train_indices].values
    y_train = Y[train_indices]
    x_test = X.iloc[test_indices].values
    y_test = Y[test_indices]
    
    return x_train,y_train,x_test,y_test



1. Set the random seed so the shuffle is reproducible, then generate a random permutation of the row indices.
2. Calculate the test size: the 'split' parameter is the fraction of the data allocated for testing.
3. Split the shuffled indices into training and testing sets.
4. Extract the corresponding rows and labels for training and testing.
5. Return the training and testing data.
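
The index-splitting idea can be checked in isolation. The sketch below is an illustration, not the notebook's function: `split_indices` is a hypothetical helper, and it uses `np.random.default_rng` rather than `np.random.seed`, so it will not reproduce the same shuffle as the cell above.

```python
import numpy as np

def split_indices(n_samples, test_fraction, seed):
    """Return disjoint (train, test) index arrays from one shuffled permutation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_fraction)
    # Slicing from the front means n_test == 0 yields an empty test set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=10, test_fraction=0.2, seed=20)
print(len(train_idx), len(test_idx))             # 8 2
print(np.intersect1d(train_idx, test_idx).size)  # 0
```

Because both parts come from a single permutation, no row can appear in both the training and the test set.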

In [8]:
X=df.drop('label',axis=1).copy()
x_train,y_train,x_test,y_test=train_test_split(seed=20,split=0.2,X=X,Y=Y)

1. Sets the seed to 20 for random number generation.
2. Uses a split of 0.2, i.e., 20% of the rows are held out as test data.
3. X is the DataFrame excluding the 'label' column.

## Gaussian Naive Bayes Classifier

1.  **A Gaussian Naive Bayes (GNB) classifier is a type of Naive Bayes classifier that assumes the features follow a Gaussian (normal) distribution.**
1.  **Naive Bayes Classification: Like other Naive Bayes classifiers, GNB calculates the probability of a data point belonging to each class and then assigns the data point to the class with the highest probability.**
1. **Training Phase: During training, GNB calculates and stores**
    1. **Class probabilities: The prior probability of each class, i.e., its relative frequency in the training data.**
    2. **Mean and variance of features for each class: These parameters are used to model the Gaussian distribution for each class.**
1. **Prediction Phase:**
    1. **Given a new data point with feature values, GNB calculates the probability of the data point belonging to each class using Bayes' theorem.**
    2. **Bayes' theorem combines the prior probability (class probabilities) with the likelihood (probability of feature values given the class) to compute the posterior probability (probability of the class given the feature values).**
    3. **The class with the highest posterior probability is predicted as the class for the new data point.**
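
The prediction rule described above can be written compactly. The notation is introduced here for illustration: $x = (x_1, \dots, x_d)$ is a data point with $d$ features, $P(c)$ is the prior of class $c$, and $\mu_{c,j}$, $\sigma_{c,j}^2$ are the mean and variance of feature $j$ within class $c$. Assuming the features are conditionally independent given the class, and working in log space:

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{j=1}^{d} \log \mathcal{N}\!\left(x_j \mid \mu_{c,j}, \sigma_{c,j}^2\right) \right]
$$

$$
\mathcal{N}\!\left(x_j \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_j - \mu)^2}{2\sigma^2}\right)
$$

The evidence term $P(x)$ from Bayes' theorem is the same for every class, so it can be dropped without changing which class attains the maximum.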
    

In [9]:
class GaussianNaiveBayes:
    def fit(self, X_train, y_train):
        self.classes = np.unique(y_train)
        self.class_probs = {}
        self.means = {}
        self.vars = {}
        
        for c in self.classes:
            indices = np.where(y_train == c)[0]
            X_c = X_train[indices]
            self.class_probs[c] = len(X_c) / len(X_train)
            self.means[c] = np.mean(X_c, axis=0)
            self.vars[c] = np.var(X_c, axis=0)
            print(self.class_probs[c])
    
    def predict(self, X_test):
        predictions = []
        for x in X_test:
            posteriors = []
            for c in self.classes:
                # Work in log space: log posterior = log prior + log likelihood
                prior = np.log(self.class_probs[c])
                # pdf() already returns the joint density over all features
                likelihood = np.log(self.pdf(x, self.means[c], self.vars[c]))
                posterior = prior + likelihood
                posteriors.append(posterior)
            # Pick the class with the highest (unnormalized) log posterior
            predicted_class = self.classes[np.argmax(posteriors)]
            predictions.append(predicted_class)
        return predictions

    def pdf(self, x, mean, var):
        sqrt_var = np.sqrt(var)
        if np.any(sqrt_var == 0):
            return 0  # Zero variance: avoid division by zero (its log is -inf)
        else:
            # Product of independent univariate Gaussians (diagonal covariance)
            exponent = -np.sum((x - mean)**2 / (2 * var))
            pdf = np.exp(exponent) / ((2 * np.pi) ** (len(x) / 2) * np.prod(sqrt_var))
            return pdf


1. This code defines a class GaussianNaiveBayes that implements a Gaussian Naive Bayes classifier.
2. fit(self, X_train, y_train): Trains the classifier. For each class it computes and stores the class probability (prior) and the per-feature mean and variance, and prints the prior.
3. predict(self, X_test): Predicts class labels for new data by choosing, for each row, the class with the highest log posterior (log prior plus log likelihood).
4. pdf(self, x, mean, var): Computes the Gaussian density of a data point x given per-feature means and variances. Because the features are treated as independent, this is the product of the univariate densities; it returns 0 if any variance is zero.
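
Taking the exponential and then the logarithm can underflow to zero for points far from a class mean. One common alternative, shown here only as a sketch and not as the notebook's method, is to compute the log density directly. `gaussian_log_density` and its `eps` variance floor are hypothetical names introduced for this illustration.

```python
import numpy as np

def gaussian_log_density(x, mean, var, eps=1e-9):
    """Log of the product of independent univariate Gaussian densities.

    eps is a hypothetical variance floor (an assumption, not part of the
    class above) that keeps the result finite when a variance is zero.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float) + eps
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([0.5, -1.0])
mean = np.array([0.0, 0.0])
var = np.array([1.0, 4.0])
print(gaussian_log_density(x, mean, var))

# A point very far from the mean still gives a finite log density
print(gaussian_log_density(np.array([1000.0, 1000.0]), mean, var))
```

With `eps=0` the result equals the logarithm of the density formula used in `pdf`, so the predicted class is unchanged whenever the direct computation does not underflow.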

## PCA - Principal Component Analysis

1. **Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much of the original variability as possible.**
1. **Dimensionality Reduction: PCA aims to reduce the number of features (dimensions) in a dataset while retaining as much information as possible. This is particularly useful when dealing with high-dimensional data where the number of features is large compared to the number of samples.**
1. **Orthogonal Transformation: PCA performs an orthogonal linear transformation to project the original data onto a new coordinate system defined by principal components. These components are orthogonal to each other and are ordered by the amount of variance they explain in the data.**
1. **Variance Maximization: The first principal component captures the direction of maximum variance in the data. Subsequent components capture the remaining variance in decreasing order, ensuring that the most important patterns in the data are retained in the lower-dimensional space.**
1. **Decorrelation: PCA also decorrelates the features in the transformed space, meaning that the new features (principal components) are uncorrelated with each other. This can be beneficial for certain machine learning algorithms that assume feature independence.**
1. **Eigenanalysis: PCA is based on eigenanalysis, where the eigenvectors of the covariance matrix of the data represent the directions of maximum variance, and the eigenvalues correspond to the amount of variance explained along those directions.**
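
The variance-ordering and decorrelation properties can be verified on a small example. This is a sketch on hypothetical synthetic data (two correlated features), and it uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: two strongly correlated features
a = rng.normal(size=500)
data = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=500)])

# Standardize, then eigendecompose the symmetric covariance matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(z.T))

# eigh returns eigenvalues in ascending order; reorder to largest first
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

projected = z @ eig_vecs
print(eig_vals / eig_vals.sum())         # explained variance ratio
print(np.round(np.cov(projected.T), 6))  # off-diagonal entries are ~0
```

The covariance of the projected data is diagonal, with the eigenvalues on the diagonal, which is the decorrelation property in numerical form.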


In [10]:
class MyPCA:
    
    def __init__(self, n_components):
        self.n_components = n_components   
        
    def fit(self, X):
        # Standardize data 
        X = X.copy()
        self.mean = np.mean(X, axis = 0)
        self.scale = np.std(X, axis = 0)
        X_std = (X - self.mean) / self.scale
        
        # Eigendecomposition of covariance matrix       
        cov_mat = np.cov(X_std.T)
        eig_vals, eig_vecs = np.linalg.eig(cov_mat) 
        
        # Flip each eigenvector so that its largest-magnitude entry is positive
        max_abs_idx = np.argmax(np.abs(eig_vecs), axis=0)
        signs = np.sign(eig_vecs[max_abs_idx, range(eig_vecs.shape[0])])
        eig_vecs = eig_vecs*signs[np.newaxis,:]
        eig_vecs = eig_vecs.T
       
        eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[i,:]) for i in range(len(eig_vals))]
        eig_pairs.sort(key=lambda x: x[0], reverse=True)
        eig_vals_sorted = np.array([x[0] for x in eig_pairs])
        eig_vecs_sorted = np.array([x[1] for x in eig_pairs])
        
        self.components = eig_vecs_sorted[:self.n_components,:]
        
        # Explained variance ratio
        self.explained_variance_ratio = [i/np.sum(eig_vals) for i in eig_vals_sorted[:self.n_components]]
        
        self.cum_explained_variance = np.cumsum(self.explained_variance_ratio)

        return self

    def transform(self, X):
        X = X.copy()
        X_std = (X - self.mean) / self.scale
        X_proj = X_std.dot(self.components.T)
        
        return X_proj
    

1. The 'MyPCA' class implements principal component analysis.
2. `__init__(self, n_components)`: The constructor stores the number of components (n_components) to retain after dimensionality reduction.
3. fit(self, X): Fits the PCA model to the input data X and computes the principal components.
    1. Standardize the data.
    2. Compute the eigendecomposition of the covariance matrix.
    3. Flip the sign of each eigenvector so that its largest-magnitude entry is positive, which makes the output deterministic.
    4. Sort the eigenpairs by eigenvalue and compute the explained variance ratio.
4. transform(self, X): Standardizes X with the stored mean and scale and projects it onto the retained principal components.

### Displaying PCA results

In [11]:
my_pca = MyPCA(n_components = 6).fit(X)

print('Components:\n', my_pca.components)
print('Explained variance ratio :\n', my_pca.explained_variance_ratio)
print('Cumulative explained variance :\n', my_pca.cum_explained_variance)

X_proj = my_pca.transform(X)
print('Transformed data shape :', X_proj.shape)

Components:
 [[-0.30219096  0.64378667  0.62260719 -0.21242839 -0.06848339 -0.22694272
  -0.07253163]
 [ 0.33410693  0.03435809  0.2838292   0.35948683  0.73791663 -0.22065738
   0.290158  ]
 [-0.11204501 -0.10993913 -0.1631733  -0.24822796 -0.21359908 -0.54852029
   0.73526701]
 [-0.54165059 -0.04629318 -0.15486709  0.69082649 -0.0671714  -0.39570047
  -0.20531846]
 [-0.50778466  0.08233115  0.03342452  0.15486542  0.12887133  0.65188053
   0.51838188]
 [-0.48290443 -0.376847   -0.02896707 -0.50041798  0.54787098 -0.12571195
  -0.23992979]]
Explained variance ratio :
 [0.27588831430207345, 0.18484431095364603, 0.153787037552692, 0.1461273083991125, 0.1151326253047569, 0.09665166249057841]
Cumulative explained variance :
 [0.27588831 0.46073263 0.61451966 0.76064697 0.8757796  0.97243126]
Transformed data shape : (2200, 6)


1. Prints the principal components calculated by the PCA, one row per component.
2. Prints the explained variance ratio for each of the selected principal components. This ratio is the proportion of the total variance explained by that component.
3. Prints the cumulative explained variance, which shows the proportion of variance explained as more principal components are included. The six retained components explain about 97% of the variance.
4. Transforms the original data X into the lower-dimensional space using the computed principal components.
5. Prints the shape of the transformed data: 2200 rows and 6 components.
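
The cumulative explained variance can also guide the choice of n_components. The sketch below reuses the cumulative values printed above; the 0.85 threshold is a hypothetical target chosen for illustration, not a value used in this notebook.

```python
import numpy as np

# Cumulative explained variance as printed above for the six components
cum_explained = np.array([0.27588831, 0.46073263, 0.61451966,
                          0.76064697, 0.8757796, 0.97243126])

threshold = 0.85  # hypothetical target, an assumption for this example

# searchsorted returns the first index whose cumulative value reaches the
# threshold; add 1 to turn the index into a component count.
n_needed = int(np.searchsorted(cum_explained, threshold)) + 1
print(n_needed)  # 5
```

If the threshold exceeds the last cumulative value, the computed count goes past the number of available components, so a real implementation would need to handle that case.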

In [13]:
x_train,y_train,x_test,y_test=train_test_split(seed=20,split=0.2,X=X_proj,Y=Y)

## Printing the prior probability of each class

In [14]:
model = GaussianNaiveBayes()
model.fit(x_train, y_train)

0.048863636363636366
0.04375
0.045454545454545456
0.04431818181818182
0.045454545454545456
0.04318181818181818
0.04261363636363636
0.042045454545454546
0.04715909090909091
0.048295454545454544
0.045454545454545456
0.03977272727272727
0.045454545454545456
0.045454545454545456
0.04772727272727273
0.04375
0.04602272727272727
0.044886363636363634
0.048295454545454544
0.04602272727272727
0.04772727272727273
0.048295454545454544


## Accuracy on prediction using GNB

In [15]:
predictions = model.predict(x_test)
correct_predictions=0
for pred, true_label in zip(predictions, y_test):
    if pred == true_label:
        correct_predictions += 1
accuracy=correct_predictions/len(y_test)
print(accuracy)

0.9204545454545454


## Precision and recall for multiclass data

In [16]:
def calculate_precision_recall_multi_class(true_labels, predicted_labels, num_classes):
    # Convert labels to numpy arrays for easier manipulation
    true_labels = np.array(true_labels)
    predicted_labels = np.array(predicted_labels)
    
    # Initialize dictionaries to store true positives, false positives, false negatives for each class
    true_positives = {}
    false_positives = {}
    false_negatives = {}
    
    # Calculate true positives, false positives, false negatives for each class
    for cls in range(num_classes):
        true_positives[cls] = sum((true == cls) and (pred == cls) for true, pred in zip(true_labels, predicted_labels))
        false_positives[cls] = sum((true != cls) and (pred == cls) for true, pred in zip(true_labels, predicted_labels))
        false_negatives[cls] = sum((true == cls) and (pred != cls) for true, pred in zip(true_labels, predicted_labels))
    
    # Calculate precision and recall for each class
    precision_recall = {}
    for cls in range(num_classes):
        precision = true_positives[cls] / (true_positives[cls] + false_positives[cls]) \
            if true_positives[cls] + false_positives[cls] > 0 else 0
        recall = true_positives[cls] / (true_positives[cls] + false_negatives[cls]) \
            if true_positives[cls] + false_negatives[cls] > 0 else 0
        precision_recall[cls] = (precision, recall)
    
    return precision_recall

num_classes = 22  # Assuming classes range from 0 to 21
precision_recall = calculate_precision_recall_multi_class(y_test, predictions, num_classes)

# Print precision and recall for each class
for cls, (precision, recall) in precision_recall.items():
    print(f'Class {cls}: Precision={precision:.4f}, Recall={recall:.4f}')

Class 0: Precision=1.0000, Recall=1.0000
Class 1: Precision=1.0000, Recall=1.0000
Class 2: Precision=0.8636, Recall=0.9500
Class 3: Precision=1.0000, Recall=1.0000
Class 4: Precision=0.9524, Recall=1.0000
Class 5: Precision=1.0000, Recall=1.0000
Class 6: Precision=0.8621, Recall=1.0000
Class 7: Precision=1.0000, Recall=1.0000
Class 8: Precision=0.8500, Recall=1.0000
Class 9: Precision=1.0000, Recall=1.0000
Class 10: Precision=0.8000, Recall=0.8000
Class 11: Precision=1.0000, Recall=0.8667
Class 12: Precision=0.9048, Recall=0.9500
Class 13: Precision=0.8462, Recall=0.5500
Class 14: Precision=0.9412, Recall=1.0000
Class 15: Precision=1.0000, Recall=1.0000
Class 16: Precision=0.8333, Recall=0.7895
Class 17: Precision=0.9474, Recall=0.8571
Class 18: Precision=0.6667, Recall=0.8000
Class 19: Precision=0.8421, Recall=0.8421
Class 20: Precision=0.8667, Recall=0.8125
Class 21: Precision=1.0000, Recall=1.0000


1. This code calculates per-class precision and recall for a multiclass classification problem.
2. Data Preparation: Converts the true labels (true_labels) and predicted labels (predicted_labels) to NumPy arrays for easier manipulation.
3. Initialization: Initializes dictionaries (true_positives, false_positives, false_negatives) to store the counts of true positives, false positives, and false negatives for each class.
4. Counting: For each class, counts the true positives, false positives, and false negatives, treating that class as positive and all other classes as negative.
5. Calculating Precision and Recall: 
    1. Precision for class cls is calculated as true_positives[cls] / (true_positives[cls] + false_positives[cls]), and is set to 0 when the denominator is zero.
    2. Recall for class cls is calculated as true_positives[cls] / (true_positives[cls] + false_negatives[cls]), and is set to 0 when the denominator is zero.
6. Returns a dictionary mapping each class to its (precision, recall) pair.
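
The same counts can be read off a confusion matrix. The sketch below works on hypothetical toy labels for three classes; `confusion_matrix` is an illustrative helper written here, not a function from the notebook.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[int(t), int(p)] += 1
    return cm

# Hypothetical toy labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
tp = np.diag(cm)                   # true positives per class
predicted_totals = cm.sum(axis=0)  # TP + FP per class
actual_totals = cm.sum(axis=1)     # TP + FN per class

# Return 0 where the denominator is zero
precision = np.divide(tp, predicted_totals, out=np.zeros(3), where=predicted_totals > 0)
recall = np.divide(tp, actual_totals, out=np.zeros(3), where=actual_totals > 0)
print(np.round(precision, 4))  # 0.5, 0.6667, 1.0
print(np.round(recall, 4))     # 0.5, 1.0, 0.5
```

Building the matrix takes one pass over the data, whereas the dictionary version makes three passes per class.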

## F1-score calculation

In [17]:
def calculate_f1_score_multi_class(precision_recall):
    f1_scores = {}
    for cls, (precision, recall) in precision_recall.items():
        f1_scores[cls] = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    return f1_scores

# Assuming you have already calculated precision_recall using calculate_precision_recall_multi_class
f1_scores = calculate_f1_score_multi_class(precision_recall)

# Print F1 score for each class
for cls, f1 in f1_scores.items():
    print(f'Class {cls}: F1 Score={f1:.4f}')

Class 0: F1 Score=1.0000
Class 1: F1 Score=1.0000
Class 2: F1 Score=0.9048
Class 3: F1 Score=1.0000
Class 4: F1 Score=0.9756
Class 5: F1 Score=1.0000
Class 6: F1 Score=0.9259
Class 7: F1 Score=1.0000
Class 8: F1 Score=0.9189
Class 9: F1 Score=1.0000
Class 10: F1 Score=0.8000
Class 11: F1 Score=0.9286
Class 12: F1 Score=0.9268
Class 13: F1 Score=0.6667
Class 14: F1 Score=0.9697
Class 15: F1 Score=1.0000
Class 16: F1 Score=0.8108
Class 17: F1 Score=0.9000
Class 18: F1 Score=0.7273
Class 19: F1 Score=0.8421
Class 20: F1 Score=0.8387
Class 21: F1 Score=1.0000


1. The F1 score is a metric that combines precision and recall into a single value, providing a balanced assessment of a classifier's performance, especially in situations where there is an imbalance between classes or when both false positives and false negatives are important considerations.
2. F1 = 2*(precision * recall) / (precision + recall)
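
As a worked check of the formula, the sketch below plugs in the Class 13 values reported above (Precision=0.8462, Recall=0.5500). `f1_score` is an illustrative helper, not the notebook's function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0

# Class 13 values reported above
print(round(f1_score(0.8462, 0.5500), 4))  # 0.6667

# The arithmetic mean would be higher, hiding the weak recall
print(round((0.8462 + 0.5500) / 2, 4))     # 0.6981
```

The harmonic mean is pulled toward the smaller of the two values, which is why Class 13 has the lowest F1 score despite a reasonable precision.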


## Overall (macro-averaged) precision, recall, F1 score, and accuracy

In [18]:
def calculate_overall_metrics(precision_recall, f1_scores):
    # Macro-average: unweighted mean of the per-class precision, recall, and F1 score
    overall_precision = sum(precision for precision, _ in precision_recall.values()) / len(precision_recall)
    overall_recall = sum(recall for _, recall in precision_recall.values()) / len(precision_recall)
    overall_f1 = sum(f1 for f1 in f1_scores.values()) / len(f1_scores)
    return overall_precision, overall_recall, overall_f1

# Assuming you have already calculated precision_recall and f1_scores
overall_precision, overall_recall, overall_f1 = calculate_overall_metrics(precision_recall, f1_scores)

print('Overall Precision:', overall_precision)
print('Overall Recall:', overall_recall)
print('Overall F1 Score:', overall_f1)
print(' Accuracy : ', accuracy)

Overall Precision: 0.9171054024507154
Overall Recall: 0.9189949305080883
Overall F1 Score: 0.9152672417087657
 Accuracy :  0.9204545454545454
