# Naive Bayes Classification using Scikit-learn
Find out more in [this tutorial](https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn).


## Classification Workflow
+ Understand the problem and identify the potential features and the label.
+ Features are the characteristics or attributes that affect the value of the label.
+ Classification has two phases:
    + a learning phase, in which the classifier is trained on a given dataset
    + an evaluation phase, in which the classifier's performance is tested
+ Performance is evaluated on the basis of several metrics:
    + accuracy, error, precision, and recall.
    
![Classification workflow](images/nbc.webp "Classification workflow")
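
As a minimal sketch of the metrics listed above, the snippet below computes accuracy, error, precision, and recall with `sklearn.metrics`. The `y_true` and `y_pred` values are made up for illustration and are not data from this tutorial.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground-truth and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # fraction of correct predictions
error = 1 - accuracy                         # fraction of wrong predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(f'Accuracy: {accuracy:.2f}, Error: {error:.2f}')
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
```

With these labels there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all come out to 0.75.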

## What is the Naive Bayes Classifier?
+ The Naive Bayes classifier assumes that the effect of a particular feature on a class is independent of the other features.
+ Even if the features are interdependent, each one is still treated independently.
+ This assumption simplifies computation, which is why the classifier is called naive.
+ The assumption is known as class conditional independence.

![Bayes Theorem Equation for Naive Bayes Classification](images/nbc_equation.webp "Bayes Theorem Equation")
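
Written out, Bayes' theorem and the class conditional independence assumption can be combined as follows. The symbols $y$ (class label) and $x_1, \dots, x_n$ (feature values) are notation assumed here, since the figure's own symbols are not reproduced in the text.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

$$
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
\quad\Longrightarrow\quad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.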

## How does the Naive Bayes classifier work?
### First Approach (In case of a single feature)
Naive Bayes classifier calculates the probability of an event in the following steps:

+ **Step 1:** Calculate the prior probability of each class label.
+ **Step 2:** Find the likelihood of each attribute value for each class.
+ **Step 3:** Put these values into Bayes' formula and calculate the posterior probability.
+ **Step 4:** See which class has the higher posterior probability; the input is assigned to that class.

![Weather Table](images/wheather-table-1.webp "Weather Table")
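
The four steps above can be sketched in plain Python using frequency counts. The `weather` and `play` lists are the ones this document uses for its dummy dataset; treating them as the contents of the table in the figure is an assumption, and the helper name `posterior` is hypothetical.

```python
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']


def posterior(weather_value, play_value):
    """Return P(play_value | weather_value) estimated from frequency counts."""
    n = len(play)
    class_count = play.count(play_value)

    # Step 1: prior probability of the class
    prior = class_count / n

    # Step 2: likelihood of the weather value given the class
    joint_count = sum(w == weather_value and p == play_value
                      for w, p in zip(weather, play))
    likelihood = joint_count / class_count

    # Step 3: Bayes' formula, with P(weather_value) as the evidence
    evidence = weather.count(weather_value) / n
    return likelihood * prior / evidence


p_yes = posterior('Sunny', 'Yes')
p_no = posterior('Sunny', 'No')

# Step 4: pick the class with the higher posterior probability
prediction = 'Yes' if p_yes > p_no else 'No'
print(f'P(Yes | Sunny) = {p_yes:.2f}, P(No | Sunny) = {p_no:.2f} -> {prediction}')
```

In this data, 2 of the 5 sunny days have play = 'Yes', so the posterior for 'Yes' is 0.40 and the prediction is 'No'.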

### Second Approach (In case of multiple features)

![Weather Table](images/wheather-table-2.webp "Weather Table")
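
With several features, the naive independence assumption lets the likelihoods be multiplied together. The sketch below is one way to do this by hand for two features. It reuses the dummy dataset from this document (assumed to match the table in the figure), applies no smoothing, and uses a hypothetical helper named `class_score`.

```python
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']


def class_score(label, weather_value, temp_value):
    """Unnormalized posterior: P(label) * P(weather | label) * P(temp | label)."""
    rows = [(w, t) for w, t, p in zip(weather, temp, play) if p == label]
    prior = len(rows) / len(play)
    p_weather = sum(w == weather_value for w, _ in rows) / len(rows)
    p_temp = sum(t == temp_value for _, t in rows) / len(rows)
    return prior * p_weather * p_temp


# Score each class for a sunny, mild day
scores = {label: class_score(label, 'Sunny', 'Mild') for label in ('Yes', 'No')}

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Because no smoothing is applied, a feature value that never occurs with a class (for example 'Overcast' with 'No') drives that class's score to zero.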

## Classifier Building in Scikit-learn
### Naive Bayes Classifier
#### Defining Dataset
In this example, you can use a dummy dataset with three columns: weather, temperature, and play. The first two (weather and temperature) are the features, and the third (play) is the label.

In [1]:
# Assigning features and label variables

weather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast',
           'Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']

temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
        'Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play = ['No','No','Yes','Yes','Yes','No','Yes',
        'No','Yes','Yes','Yes','Yes','Yes','No']

#### Encoding Features
First, you need to convert these string labels into numbers, for example 'Overcast', 'Rainy', 'Sunny' as 0, 1, 2. This is known as label encoding. Scikit-learn provides the LabelEncoder class for encoding labels with values between 0 and one less than the number of discrete classes.

In [2]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Create the LabelEncoder
le = LabelEncoder()

# Convert string labels into numbers
weather_encoded = le.fit_transform(weather)
print(weather_encoded)

[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
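
To see which number each string was mapped to, you can inspect the fitted encoder. The following is a small sketch on a shortened sample list (an assumption made for brevity); `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Shortened sample of the weather column, for illustration only
sample = ['Sunny', 'Overcast', 'Rainy', 'Sunny']

le = LabelEncoder()
encoded = le.fit_transform(sample)

# classes_ holds the original labels in sorted order; the index is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the codes back to the original strings
decoded = [str(value) for value in le.inverse_transform(encoded)]
print(decoded)
```

Note that refitting the same encoder on another column replaces `classes_`, so keep one encoder per column if you need to decode each column later.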

Similarly, you can also encode temp and play columns.

In [None]:
# Converting string labels into numbers

temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)

print(f'Temp: {temp_encoded}')
print(f'Play: {label}')

Now combine both the features (weather and temp) in a single variable (list of tuples).

In [None]:
# Combining weather and temp into a single list of tuples

features_zip = zip(weather_encoded,temp_encoded)
features = list(features_zip)
print(features)

#### Generating Model
Generate a model using the Naive Bayes classifier in the following steps:
+ Create the Naive Bayes classifier
+ Fit the classifier on the dataset
+ Perform prediction

In [None]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features, label)

#Predict Output
predicted = model.predict([[0,2]]) # 0:Overcast, 2:Mild
print(f'Predicted Value: {predicted}')

Here, 1 is the encoded value of 'Yes', which indicates that players can play.
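
`GaussianNB` models each feature as a continuous, normally distributed value, so it treats the encoded integers as ordered numbers. As a hedged alternative (a swapped-in technique, not the method used above), scikit-learn's `CategoricalNB` models each feature as a discrete category. The sketch below assumes scikit-learn 0.22 or later, where `CategoricalNB` is available, and repeats the dataset so that it runs on its own.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# One encoder per column, so each mapping stays independent
weather_encoded = LabelEncoder().fit_transform(weather)  # Overcast=0, Rainy=1, Sunny=2
temp_encoded = LabelEncoder().fit_transform(temp)        # Cool=0, Hot=1, Mild=2
label = LabelEncoder().fit_transform(play)               # No=0, Yes=1

features = list(zip(weather_encoded, temp_encoded))

# The default alpha=1.0 applies Laplace smoothing to the category counts
model = CategoricalNB()
model.fit(features, label)

predicted = model.predict([[0, 2]])            # 0: Overcast, 2: Mild
probabilities = model.predict_proba([[0, 2]])  # columns: P(No), P(Yes)
print(f'Predicted Value: {predicted}')
print(f'Class probabilities: {probabilities}')
```

The smoothing keeps the probability of 'No' above zero even though no overcast day in the data has play = 'No'.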

### Naive Bayes with Multiple Labels
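So far the label has been binary. Naive Bayes also handles labels with more than two classes, which is known as multi-class classification (not to be confused with the multinomial Naive Bayes variant, which models count features). For example, you may want to classify a news article as technology, entertainment, politics, or sports.

This example uses the wine dataset: "This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars." ([UC Irvine](https://archive.ics.uci.edu/ml/datasets/wine))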

#### Loading Data
Let's first load the required wine dataset from scikit-learn datasets.

In [4]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

#### Exploring Data
You can print the target and feature names to make sure you have the right dataset:

In [5]:
# print the names of the 13 features
print(f'Features: {wine.feature_names}')

# print the label type of wine(class_0, class_1, class_2)
print(f'Labels: {wine.target_names}')

Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels: ['class_0' 'class_1' 'class_2']


In [6]:
# print the shape of the data (features)
wine.data.shape

(178, 13)

In [7]:
# print the wine data features (top 5 records)
print(wine.data[0:5])

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
  3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
  2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
  3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]


In [8]:
# print the wine labels (0:class_0, 1:class_1, 2:class_2)
print(wine.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


#### Splitting Data
First, you separate the columns into dependent and independent variables (that is, the label and the features).
Then you split those variables into a training set and a test set.

![Test-Train Ratio](images/test-train.webp "Test Train Ratio")

In [9]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data,
                                                    wine.target,
                                                    test_size=0.2,
                                                    random_state=109) # 80% training and 20% test
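
As a quick check on the split, the sketch below repeats it in a self-contained form and prints the resulting set sizes. It imports `train_test_split` from `sklearn.model_selection`, which is where the function lives in current scikit-learn releases.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()

X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=109)

n_total = wine.data.shape[0]
n_train = X_train.shape[0]
n_test = X_test.shape[0]

print(f'Training samples: {n_train}')
print(f'Test samples: {n_test}')
print(f'Test fraction: {n_test / n_total:.2f}')
```

If you want each set to keep the same class proportions as the full dataset, `train_test_split` also accepts `stratify=wine.target`; that is an optional variation, not something the split above uses.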



#### Model Generation
After splitting, you will generate a Gaussian Naive Bayes model on the training set and perform prediction on the test set features.

In [10]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)

#### Evaluating Model
After model generation, check the accuracy using actual and predicted values.

In [11]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9444444444444444
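
Accuracy is a single number, so it can hide which classes are being confused. As a hedged extension of the evaluation step, the sketch below rebuilds the same model in a self-contained form and prints a confusion matrix and a per-class report with `sklearn.metrics`.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=109)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Rows are the actual classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 score for each class
report = metrics.classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=list(wine.target_names))
print(report)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of test samples equals the accuracy.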
