### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB, MultinomialNB

warnings.filterwarnings("ignore")
%matplotlib inline

<!--  -->

### Import Data

In [2]:
file_path = os.getcwd()+"\\Data\\tennis.csv"
df_tennis_data = pd.read_csv(file_path, usecols=['outlook', 'temp', 'humidity', 'windy', 'play'])

<!--  -->

### Descriptive Analysis

In [3]:
print(df_tennis_data.head(2),"\n")
print("Columns: {}\n".format(df_tennis_data.columns.to_list()))
print("Data shape: {}\n".format(df_tennis_data.shape))
print(df_tennis_data.info(),"\n")
print("Missing Records per Column:")
print("--"*10)
print(df_tennis_data.isnull().sum()) # Missing values per Column
df_tennis_data.describe(include="all") # Stats

  outlook temp humidity  windy play
0   sunny  hot     high  False  yes
1   sunny  hot     high   True   no 

Columns: ['outlook', 'temp', 'humidity', 'windy', 'play']

Data shape: (15, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   outlook   15 non-null     object
 1   temp      15 non-null     object
 2   humidity  15 non-null     object
 3   windy     15 non-null     bool  
 4   play      15 non-null     object
dtypes: bool(1), object(4)
memory usage: 627.0+ bytes
None 

Missing Records per Column:
--------------------
outlook     0
temp        0
humidity    0
windy       0
play        0
dtype: int64


Unnamed: 0,outlook,temp,humidity,windy,play
count,15,15,15,15,15
unique,3,3,2,2,2
top,sunny,hot,high,False,yes
freq,10,9,13,11,12


<!--  -->

### Manual Calculation[Categorical Variables]

- Predict probability (or) classify play as yes or no based on below weather condition 

In [22]:
weather = {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": False}
# weather = {"outlook": "overcast", "temp": "mild", "humidity": "high", "windy": True}

- Calculating `Prior Probability (P(Play = yes))`
    - This is general chance of playing tennis[yes] on any given day (regardless of weather evidence/condition).
    - In simpler words, how many times play_tennis is true out of total data that can be both play_tennis cases i.e., yes or no 

In [23]:
# Counting positive cases (play = yes)
positive_count = sum(row["play"] == "yes" for index, row in df_tennis_data.iterrows())

# Total no of data points
total_data_points = len(df_tennis_data)

# Prior Probability i.e., Probability of being "play=yes" cases in training data
prior_probability = round(positive_count / total_data_points, 2)
print(f"Prior Probability (P(Play = yes)) --> {prior_probability}")

Prior Probability (P(Play = yes)) --> 0.8


- Calculating `Likelihood (P(Evidence/Condition | Play = yes))`
    - Chance of having the specific weather i.e., (sunny, hot, high humidity, no wind) given that someone will play tennis[yes].
    - In simpler words, how many times we found above weather evidence / condition when play_tennis[yes] out of over all data

In [24]:
# Filtering records where matching our weather condition during play_tennis[yes]  
filtered_data = [row for index, row in df_tennis_data.iterrows() if row["play"] == "yes" and all(row[key] == weather[key] for key in weather)]
print(filtered_data, "\n")

# Count filtered data
matching_evidence_count = len(filtered_data)
print(f"matching_evidence_count --> {matching_evidence_count}", "\n")

# Out of overall positive(i.e., play_tennis[true]) count, what is likelihood/probability/percentage of being both our weather condition matched and play_tennis[yes]
likelihood = round(matching_evidence_count / positive_count, 2)
print("Likelihood (P(Evidence | Play = yes)):", likelihood)

[outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 0, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 2, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 4, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 6, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 8, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 10, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 12, dtype: object, outlook     sunny
temp          hot
humidity     high
windy       False
play          yes
Name: 13, dtype: object] 

matching_evidence_count --> 8 

Likelihood (P(Evidence | Play = yes)): 0.67


- Predict Probability
    - Actual formula => `P(Play = yes | Evidence) = (P(Play = yes) * P(Evidence | Play = yes)) / P(Evidence)`
        - Total evidence probability - The probability of having the specific weather conditions (Evidence) regardless of whether someone plays tennis (Yes or No).
        - P(Evidence) can be challenging to calculate in real-world scenarios. It requires considering all possible weather combinations and their probabilities (Play = Yes or No).
            - i.e., {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": False}, {"outlook": "sunny", "temp": "hot", "humidity": "low", "windy": True} etc.,
        - Assuming all evidence combinations are equally likely/distributed i.e., not baised. We ignore this parameter in the formula. Hence make formula as 
    - Updated formula => `P(Play = yes | Evidence) = (P(Play = yes)[Prior Probability] * P(Evidence | Play = yes)) [Likelihood]`
        - We multiply these two probabilities (prior chance i.e., play_tennis[yes] and chance of specific weather given playing tennis) to get a rough estimate of the chance of playing tennis today based on the weather we see.
        - Why multiply?
            - Multiplying combines these two pieces of information to give a more specific prediction. A high prior chance and a high likelihood for the weather would suggest a higher chance of playing tennis today.
            - In other words, `If there is high play_tennis[true] and have high counts / percentage of records with this weather during play_tennis[yes] out of overall records, results into => high chance / probability / percentage of play_tennis[true] of having this weather condition`

In [25]:
predicted_probability = round(prior_probability * likelihood, 2)
print(
    "Probability of Play = yes on given weather condition:", predicted_probability
)

if predicted_probability > 0.5:
    print("Play_tennis is predicted as 'yes'")
else:
    print("Play_tennis is predicted as 'no'")

Probability of Play = yes on given weather condition: 0.54
Play_tennis is predicted as 'yes'


Observation

- First we analysed prior probability of play_tennis[true] i.e., P(Play = yes) and likelihood / percentage of matching given weather condition on being play_tennis[true] i.e., P(Evidence | Play = yes). 
- Then with this prior knowledge of probabilities, hence we predicted predicted probability of play_tennis[yes] on given weather condition i.e., P(Play = yes | Evidence).
- `weather = {"outlook": "overcast", "temp": "mild", "humidity": "high", "windy": True}`
    - Play_tennis is predicted as 'no'. It seems correct, because we have less evidence in the data while with this weather condition of being play_tennis[yes]  
- `weather = {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": False}`
    - Play_tennis is predicted as 'yes'. It seems correct, because we have enough evidence in the data while with this weather condition of being play_tennis[yes]  

<!--  -->

### Sklearn[Categorical Variables]

- Predict probability (or) classify play as yes or no based on below weather condition 

In [14]:
weather = {"outlook": "overcast", "temp": "mild", "humidity": "high", "windy": True}

In [15]:
# Separate features (predictors) and target variable (response)
X = df_tennis_data[["outlook", "temp", "humidity", "windy"]]  # Predictor features
y = df_tennis_data["play"]  # Target variable (play or not)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Convert categorical into numerical features
encoder = LabelEncoder()
X_train["outlook"] = encoder.fit_transform(X_train["outlook"])
X_train["temp"] = encoder.fit_transform(X_train["temp"])
X_train["humidity"] = encoder.fit_transform(X_train["humidity"])
X_train["windy"] = encoder.fit_transform(X_train["windy"])

X_test["outlook"] = encoder.transform(X_test["outlook"])
X_test["temp"] = encoder.transform(X_test["temp"])
X_test["humidity"] = encoder.transform(X_test["humidity"])
X_test["windy"] = encoder.transform(X_test["windy"])

In [17]:
print(X_train.head(2))
print(X_test.head(2))

    outlook  temp  humidity  windy
13        2     1         0      0
5         1     0         1      1
    outlook  temp  humidity  windy
9         1     1         1      0
11        1     1         1      1


In [18]:
# Create a Multinomial Naive Bayes model
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train, y_train)

In [20]:
# Original weather case
print(weather)
# Label encoded weather
df_weather_case_records = df_tennis_data[(df_tennis_data['outlook'] == 'overcast') & 
                            (df_tennis_data['temp'] == 'mild') & 
                            (df_tennis_data['humidity'] == 'high') & 
                            (df_tennis_data['windy']== True)
                            ]
print(df_weather_case_records)
weather_label_encoded = X_train.loc[14,:].to_dict()
weather_label_encoded

{'outlook': 'overcast', 'temp': 'mild', 'humidity': 'high', 'windy': True}
     outlook  temp humidity  windy play
11  overcast  mild     high   True  yes
14  overcast  mild     high   True  yes


{'outlook': 0, 'temp': 2, 'humidity': 0, 'windy': 1}

In [21]:
# Convert new evidence to a format the model expects (usually a 2D array)
new_weather_array = [list(weather_label_encoded.values())]

# Predict the probability of playing tennis (Yes) for the new evidence
predicted_proba = round(model.predict_proba(new_weather_array)[0][1], 10)  # Probability of class "Yes"

print(
    "Probability of Play = yes given weather {'outlook': 'sunny', 'temp': 'hot', 'humidity': 'high', 'windy': False} ---> ",
    predicted_proba,
)

Probability of Play = yes given weather {'outlook': 'sunny', 'temp': 'hot', 'humidity': 'high', 'windy': False} --->  0.6709667443


Observation

- This doesn't matches with manual calculation, because of below possible resons
    - Limited Data:  The provided sample data is small, which can lead to inaccurate estimates, especially for manual calculations. With a small dataset, a single data point with a different outcome ("no" in this case) can significantly affect the likelihood calculation.
    - Manual Likelihood Estimation: The manual calculation assumes all "Yes" cases are equally likely.  In reality, the likelihood of specific weather conditions given "play=yes" might vary based on the data.
    - Model Training: Sklearn's MultinomialNB learns from the entire dataset, including the "no" cases for sunny, hot, high, and no wind. This can influence the model's prediction compared to the simplified manual approach.
- On this small dataset sklearn prediction doesnt seems to be correct. May be it give correct on higher dataset.