# Naive Bayes

Naive Bayes is a probabilistic machine learning algorithm based on Bayes’ Theorem. It is called “naive” because it assumes that the features in a dataset are independent of one another once the class is known, which is rarely true in real-world data. Despite this naive assumption, Naive Bayes often performs well in practice, especially for classification tasks. In this notebook we will use it to classify an unknown material as an insulator or a metal (conductor).
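
In symbols, this can be sketched as follows. The class label $C$ (metal or insulator) and the features $x_1, \dots, x_n$ are notation introduced here for illustration and do not appear in the notebook's code. Bayes' theorem gives the posterior probability of a class, and the naive assumption lets the joint likelihood factor into one term per feature:

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
$$

$$
P(x_1, \dots, x_n \mid C) \approx \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.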

#### Video

https://www.youtube.com/watch?v=26wC9WmEWlw&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=36 (Naive Bayes and Bayes' Theorem)

https://www.youtube.com/watch?v=_mHmo6B6NSw&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=37 (Coding Naive Bayes classifier from scratch)

## Setup

Let's try out Naive Bayes.
This notebook uses the old (legacy) MPRester API from `pymatgen.ext.matproj`.

In [None]:
# First, import some libraries
import pandas as pd
from pymatgen.ext.matproj import MPRester
import os
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import matplotlib.gridspec as gridspec

In [None]:
# Set up MPRester
filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'

def get_file_contents(filename):
    """Return the stripped contents of `filename`, or None if the file is missing."""
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print(f"'{filename}' file not found")
        return None


Sparks_API = get_file_contents(filename)

Now let's grab some data. We'll pick stable metals and stable insulators (energy above hull of at most 0.02 eV/atom) and collect their formula, band gap, density, formation energy per atom, and volume. We will also print the mean and standard deviation of the density for each class.

In [None]:
mpr = MPRester(Sparks_API)

# Criteria for stable insulators: e_above_hull <= 0.02, band_gap > 0
criteria = {'e_above_hull': {'$lte': 0.02}, 'band_gap': {'$gt': 0}}
props = ['pretty_formula', 'band_gap', "density", 'formation_energy_per_atom', 'volume']
entries = mpr.query(criteria=criteria, properties=props)

# Create a DataFrame for the found insulators
df_insulators = pd.DataFrame(entries)
print(f"Average density of insulators: {df_insulators['density'].mean()}")
print(f"Standard deviation of density for insulators: {df_insulators['density'].std()}")

# Criteria for stable metals: e_above_hull <= 0.02, band_gap = 0
criteria = {'e_above_hull': {'$lte': 0.02}, 'band_gap': {'$eq': 0}}
entries = mpr.query(criteria=criteria, properties=props)

# Create a DataFrame for the found metals
df_metals = pd.DataFrame(entries)
print(f"Average density of metals: {df_metals['density'].mean()}")
print(f"Standard deviation of density for metals: {df_metals['density'].std()}")

Now let's plot Gaussian fits to our data as probability density functions, with one curve per class for each property.

In [None]:
# Plot the Gaussian fits for density, volume, and formation energy
fig = plt.figure(1, figsize=(5,7))
gs = gridspec.GridSpec(3,1)
gs.update(wspace=0.2, hspace=0.6)  # extra vertical space for the x-axis labels

# Density plot
ax = fig.add_subplot(gs[0:1,0:1])
x = np.arange(0,20,0.1)
y_metals = scipy.stats.norm(df_metals['density'].mean(), df_metals['density'].std()).pdf(x)  # probability density function
y_ins = scipy.stats.norm(df_insulators['density'].mean(), df_insulators['density'].std()).pdf(x)  # probability density function
plt.plot(x,y_metals)
plt.plot(x,y_ins)
plt.xlabel(r'$\rho$ (g/cc)')
plt.ylabel('PDF')

# Volume plot
ax = fig.add_subplot(gs[1:2,0:1])
x = np.arange(-1000,5000,0.1)
y_metals = scipy.stats.norm(df_metals['volume'].mean(), df_metals['volume'].std()).pdf(x)  # probability density function
y_ins = scipy.stats.norm(df_insulators['volume'].mean(), df_insulators['volume'].std()).pdf(x)  # probability density function
plt.plot(x,y_metals)
plt.plot(x,y_ins)
plt.xlabel(r'$V$ (Å$^3$)')
plt.ylabel('PDF')

# Formation energy plot
ax = fig.add_subplot(gs[2:3,0:1])
x = np.arange(-4,2,0.1)
y_metals = scipy.stats.norm(df_metals['formation_energy_per_atom'].mean(), df_metals['formation_energy_per_atom'].std()).pdf(x)  # probability density function
y_ins = scipy.stats.norm(df_insulators['formation_energy_per_atom'].mean(), df_insulators['formation_energy_per_atom'].std()).pdf(x)  # probability density function
plt.plot(x,y_metals,label='metal')
plt.plot(x,y_ins,label='insulator')
plt.xlabel(r'$\Delta H$ (eV/atom)')
plt.ylabel('PDF')

plt.legend()
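
Each curve above is a normal density built from a class mean and standard deviation. As a minimal sketch of what `scipy.stats.norm(mean, std).pdf(x)` evaluates, the closed-form expression can be written by hand and cross-checked against the standard library's `statistics.NormalDist`. The function name `gaussian_pdf` and the numbers below are hypothetical, chosen only so the snippet runs without the Materials Project data.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mean, std):
    """Normal probability density evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical class statistics, not values from the notebook's data
mean, std = 5.0, 2.0
x = 4.0

by_hand = gaussian_pdf(x, mean, std)
by_library = NormalDist(mean, std).pdf(x)  # standard-library cross-check

print('closed form:', by_hand)
print('NormalDist: ', by_library)
```

Note that these values are densities rather than probabilities, so they can exceed 1 when the standard deviation is small. Only their relative size between the two classes matters for classification.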

Let's classify a new mystery material based on its density, volume, and formation energy.

In [None]:
# Define the properties of the mystery material
density = 4            # g/cc
volume = 800           # cubic angstroms
formation_energy = -2  # eV/atom
# Is it a metal or an insulator?

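We will classify the mystery material by evaluating, for each class, the prior probability and the Gaussian likelihood of each property, and then summing their logarithms. The class with the larger total is our prediction.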
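
Working with a sum of logarithms instead of a product of raw values is a common precaution against floating-point underflow. With only three features underflow is unlikely here, but the effect shows up quickly as features are added. The sketch below uses a hypothetical case of 400 features that each contribute a likelihood of 1e-3. These numbers are illustrative and are not taken from the notebook's data.

```python
import math

# Hypothetical setup: 400 features, each with a likelihood of 1e-3
likelihoods = [1e-3] * 400

# Multiplying the raw likelihoods underflows to 0.0 in double precision
product = 1.0
for value in likelihoods:
    product *= value

# Summing the logs carries the same information as a finite number
log_sum = sum(math.log(value) for value in likelihoods)

print('product of likelihoods:', product)
print('sum of log likelihoods:', log_sum)
```

Because the logarithm is monotonic, comparing log sums picks the same class as comparing the products would, provided the products do not underflow.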

In [None]:
# Prior: initial guess based on the proportion of metals vs. insulators in the data
prior_metals = df_metals['density'].count()/(df_insulators['density'].count()+df_metals['density'].count())
prior_insulators = 1-prior_metals
print('The first guess based on metal vs insulator proportion.')
print('Probability of being metal:',prior_metals)
print('Probability of being insulator:',prior_insulators,'\n')

# Likelihood of the observed density under each class
density_metals = scipy.stats.norm(df_metals['density'].mean(), df_metals['density'].std()).pdf(density)
density_insulators = scipy.stats.norm(df_insulators['density'].mean(), df_insulators['density'].std()).pdf(density)
print('The second guess based on density.')
print('Density likelihood for metal:',density_metals)
print('Density likelihood for insulator:',density_insulators,'\n')

# Likelihood of the observed volume under each class
volume_metals = scipy.stats.norm(df_metals['volume'].mean(), df_metals['volume'].std()).pdf(volume)
volume_insulators = scipy.stats.norm(df_insulators['volume'].mean(), df_insulators['volume'].std()).pdf(volume)
print('The third guess based on volume.')
print('Volume likelihood for metal:',volume_metals)
print('Volume likelihood for insulator:',volume_insulators,'\n')

# Likelihood of the observed formation energy under each class
energy_metals = scipy.stats.norm(df_metals['formation_energy_per_atom'].mean(), df_metals['formation_energy_per_atom'].std()).pdf(formation_energy)
energy_insulators = scipy.stats.norm(df_insulators['formation_energy_per_atom'].mean(), df_insulators['formation_energy_per_atom'].std()).pdf(formation_energy)
print('The fourth guess based on formation energy.')
print('Energy likelihood for metal:',energy_metals)
print('Energy likelihood for insulator:',energy_insulators,'\n')

# Now we add up the logs of the prior and the likelihoods and compare.
# These sums are unnormalized log posteriors (log scores), not odds.
log_score_metal = np.log(prior_metals)+np.log(density_metals)+np.log(volume_metals)+np.log(energy_metals)
log_score_insulator = np.log(prior_insulators)+np.log(density_insulators)+np.log(volume_insulators)+np.log(energy_insulators)
print('Our final guess is based on all of these probabilities combined!')
print('The log score for metal is:',log_score_metal)
print('The log score for insulator is:',log_score_insulator,'\n')

# Classify the material by the larger log score
if log_score_metal > log_score_insulator:
    print('new material is probably a metal!')
else:
    print('new material is probably an insulator!')
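
The two log sums compared above tell us which class wins but not by how much, because the shared denominator of Bayes' theorem was never computed. One way to turn them into posterior probabilities that sum to 1 is to exponentiate and normalize, subtracting the largest value first for numerical stability. The following is a minimal sketch: the helper name `posterior_from_log_scores` and the example scores are hypothetical and are not results from the Materials Project data.

```python
import math

def posterior_from_log_scores(log_scores):
    """Normalize unnormalized log posteriors into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the largest score before exponentiating to avoid underflow
    shifted = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}

# Hypothetical log scores, used only to illustrate the normalization
example_scores = {'metal': -12.0, 'insulator': -10.0}
posteriors = posterior_from_log_scores(example_scores)

for label, p in posteriors.items():
    print(f'P({label} | features) = {p:.3f}')
```

A difference of 2 between the log scores corresponds to a posterior of about 0.88 for the higher-scoring class. These probabilities inherit the independence and Gaussian assumptions of the model, so they are best read as a rough confidence measure.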