## Naive Bayes Classifier Using the Extended PlayTennis Dataset

We will use the PlayTennis dataset to predict whether a player will play tennis under the given weather conditions. The data has 4 independent variables and 1 dependent variable.
The independent variables are **OUTLOOK**, **TEMPERATURE**, **HUMIDITY**, and **WIND**; the dependent variable is whether the person will play or not.

In [1]:
# First, import the libraries needed for training. Install pandas and scikit-learn beforehand with: pip install pandas scikit-learn
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Next, load the CSV file with pandas' read_csv() function and preview the first rows
play_tennis = pd.read_csv("PlayTennis.csv")
play_tennis.head()

    Outlook  Temperature  Humidity   Wind  Play Tennis
0     sunny          hot      high  False           no
1     sunny          hot      high   True           no
2  overcast          hot      high  False          yes
3     rainy         mild      high  False          yes
4     rainy         cool    normal  False          yes


In [3]:
# LabelEncoder converts categorical values into integer codes
number = LabelEncoder()

In [4]:
# Let's convert the categorical columns into numbers.
# LabelEncoder assigns codes in sorted order of the distinct values.
# 3 classes from Outlook (0 - Overcast, 1 - Rainy, 2 - Sunny)
play_tennis['Outlook'] = number.fit_transform(play_tennis['Outlook'])

# 3 classes from Temperature (0 - Cool, 1 - Hot, 2 - Mild)
play_tennis['Temperature'] = number.fit_transform(play_tennis['Temperature'])

# 2 classes from Humidity (0 - High, 1 - Normal)
play_tennis['Humidity'] = number.fit_transform(play_tennis['Humidity'])

# 2 classes from Wind (0 - False, 1 - True), i.e. whether wind is present or not.
play_tennis['Wind'] = number.fit_transform(play_tennis['Wind'])

# 2 classes from Play Tennis (0 - No, 1 - Yes)
play_tennis['Play Tennis'] = number.fit_transform(play_tennis['Play Tennis'])

# Print the table after the transformation to see the results
print(play_tennis)


    Outlook  Temperature  Humidity  Wind  Play Tennis
0         2            1         0     0            0
1         2            1         0     1            0
2         0            1         0     0            1
3         1            2         0     0            1
4         1            0         1     0            1
5         1            0         1     1            0
6         0            0         1     1            1
7         2            2         0     0            0
8         2            0         1     0            1
9         1            2         1     0            1
10        2            2         1     1            1
11        0            2         0     1            1
12        0            1         1     0            1
13        1            2         0     1            0
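The code-to-label mapping in the comments above follows from sorting the distinct values of each column. The sketch below mimics that behaviour in plain Python so the mapping can be checked by hand; `fit_label_mapping` is a hypothetical helper written for this illustration, not part of scikit-learn, and the sample values are the five Outlook entries shown by `head()`.

```python
def fit_label_mapping(values):
    """Map each distinct value to its position in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Outlook values of the first five rows shown by head()
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy"]

mapping = fit_label_mapping(outlook)
encoded = [mapping[value] for value in outlook]

# Build the reverse lookup to turn codes back into labels
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)  # {'overcast': 0, 'rainy': 1, 'sunny': 2}
print(encoded)  # [2, 2, 0, 1, 1]
```

In scikit-learn itself, a fitted `LabelEncoder` exposes the sorted labels as `classes_` and can reverse the encoding with `inverse_transform()`. Because the notebook reuses the single encoder `number` for every column, it only remembers the classes of the last column it was fitted on; keeping one encoder per column would be needed to decode each feature later.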


In [5]:
# Let's now define the features and the target variable.
# The features are the independent variables and the target is the dependent variable.
features = ["Outlook", "Temperature", "Humidity", "Wind"]
target = "Play Tennis"



In [12]:
# After defining the features and target variable, we proceed to the train/test split.
# We will build the model using the TRAIN dataset and validate it on the TEST dataset.
# We can use sklearn's train_test_split() function to split the data into training and test sets.

features_train, features_test, target_train, target_test = train_test_split(
    play_tennis[features], play_tennis[target], test_size=0.30, random_state=40
)

In [15]:
# Let's display the split datasets
print('\t Training Features \n', features_train)
print('\t Testing Features \n', features_test)
print('\t Training Target \n', target_train)
print('\t Testing Target \n', target_test)


	 Training Features 
     Outlook  Temperature  Humidity  Wind
4         1            0         1     0
1         2            1         0     1
2         0            1         0     0
9         1            2         1     0
8         2            0         1     0
5         1            0         1     1
7         2            2         0     0
11        0            2         0     1
6         0            0         1     1
	 Testing Features 
     Outlook  Temperature  Humidity  Wind
0         2            1         0     0
13        1            2         0     1
10        2            2         1     1
3         1            2         0     0
12        0            1         1     0
	 Training Target 
 4     1
1     0
2     1
9     1
8     1
5     0
7     0
11    1
6     1
Name: Play Tennis, dtype: int32
	 Testing Target 
 0     0
13    0
10    1
3     1
12    1
Name: Play Tennis, dtype: int32


Here we can see the data split into training and test sets. With test_size = 0.30, 5 of the 14 rows (30% of 14 is 4.2, rounded up) were randomly selected for testing, and the remaining 9 rows are used for training.
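The sizes of the two sets can be worked out by hand. The sketch below assumes that, for a fractional `test_size` with `train_size` left unset, the test count is rounded up and the remaining rows go to training, which is consistent with the 9 training rows and 5 test rows printed above. `split_sizes` is a hypothetical helper for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the rest is used for training
    return n_train, n_test

n_train, n_test = split_sizes(14, 0.30)
print(n_train, n_test)  # 9 5
```

With only 14 rows, small changes to `test_size` move whole rows between the sets: a value of 0.20 would give 3 test rows under the same assumption.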


In [8]:
# We now proceed to the machine learning part: the Gaussian Naive Bayes implementation.
# We are going to use sklearn's GaussianNB class. Let's first create the model.
model = GaussianNB()


We have created the classifier, but it has not been trained yet. We will use the fit() method to train it on the training data. After this, the model can make predictions.
We will then call the predict() method with the features of the **TEST SET** as its argument.

In [9]:
model.fit(features_train, target_train)

# After fitting on the training data, we make predictions by passing the test features to predict().
pred = model.predict(features_test)
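To make the fit() and predict() steps less of a black box, here is a minimal from-scratch sketch of Gaussian Naive Bayes in plain Python. It estimates a prior, a mean, and a variance per class and feature from the nine training rows printed earlier, then picks the class with the highest log-posterior. The function names and the `EPS` variance floor are assumptions of this sketch; scikit-learn's GaussianNB applies its own variance smoothing, so its internal numbers may differ slightly.

```python
import math

# Label-encoded training rows copied from the split printed earlier:
# (Outlook, Temperature, Humidity, Wind) and the Play Tennis labels.
X_train = [
    [1, 0, 1, 0], [2, 1, 0, 1], [0, 1, 0, 0],
    [1, 2, 1, 0], [2, 0, 1, 0], [1, 0, 1, 1],
    [2, 2, 0, 0], [0, 2, 0, 1], [0, 0, 1, 1],
]
y_train = [1, 0, 1, 1, 1, 0, 0, 1, 1]

EPS = 1e-9  # small variance floor to avoid division by zero (assumption)

def fit_gaussian_nb(X, y):
    """Return {label: (prior, means, variances)} estimated from the data."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + EPS
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(y), means, variances)
    return params

def log_posterior(x, prior, means, variances):
    """Unnormalised log-posterior: log prior plus Gaussian log-likelihoods."""
    total = math.log(prior)
    for v, m, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total

def predict_one(params, x):
    """Return the label with the highest log-posterior for one sample."""
    scores = {label: log_posterior(x, *p) for label, p in params.items()}
    return max(scores, key=scores.get)

params = fit_gaussian_nb(X_train, y_train)
prediction = predict_one(params, [1, 0, 1, 0])
print(prediction)  # 1
```

Note that the Gaussian variant treats the integer codes as continuous measurements, so the arbitrary ordering of the codes (for example Cool = 0, Hot = 1, Mild = 2) influences the fitted means and variances.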


In [10]:
# Then calculate the accuracy of our model on the test set
accuracy = accuracy_score(target_test, pred)
print("Model Accuracy = ", accuracy*100,"%")

Model Accuracy =  80.0 %
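Accuracy here is simply the fraction of test rows whose predicted label matches the true label. The sketch below shows the arithmetic in plain Python. The true labels are the five test targets printed earlier, but the predictions are hypothetical (the notebook never prints `pred`), chosen only to contain exactly one mistake.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

target_test = [0, 0, 1, 1, 1]        # true test labels printed earlier
hypothetical_pred = [0, 0, 1, 1, 0]  # made-up predictions with one mistake

score = accuracy(target_test, hypothetical_pred)
print("Accuracy =", score * 100, "%")  # Accuracy = 80.0 %
```

With only five test rows, accuracy can only take the values 0%, 20%, 40%, 60%, 80%, or 100%, so the reported 80% corresponds to exactly one misclassified row.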


In [16]:
# Now we will predict for a specific set of conditions.
# Suppose we want a prediction for:
# Outlook = Rainy (encoded as 1 in the Outlook column)
# Temperature = Cool (encoded as 0 in the Temperature column)
# Humidity = Normal (encoded as 1 in the Humidity column)
# Wind = False (encoded as 0 in the Wind column)
# This matches row 4 of our dataset, where Play Tennis is 1, or "PLAY"

answer = model.predict([[1,0,1,0]])

# predict() returns an array with one entry per input row, so check the first entry
if answer[0] == 1:
    print("\nPlay")
elif answer[0] == 0:
    print("\nNo Play")

# There, we have built a simple Gaussian Naive Bayes classifier with 80% accuracy on the test set.
# You can now create random datasets to see how well the model performs.


Play


