<!-- # Predicting Stock Change With Python
> Scrapying, analyzing, and notificating stock change, inspired by this [blog](https://towardsdatascience.com/predicting-stock-prices-with-python-ec1d0c9bece1)
- toc: true 
- badges: true
- comments: true
- author: Genghua Chen
- annotations: true
<!-- - image: https://g.foolcdn.com/editorial/images/663842/why-amzn-stock-is- -->
<!-- - hide: false -->

&emsp;&emsp;A Naive Bayes classifier is a simple probabilistic algorithm that applies Bayes' theorem under a "naive" assumption: given the class label, every feature is conditionally independent of the other features and carries the same weight in the prediction. Naive Bayes is a very popular algorithm for classification problems in machine learning and data science. It is popular because it is simple, extremely fast, and has few hyper-parameters to tune, yet it often produces a surprisingly good model for many different kinds of problems.

&emsp;&emsp;A well-known puzzle from behavioral economics demonstrates Bayes' theorem and helped me understand the mechanics of Naive Bayes. Two psychologists posed the question of whether a person is more likely to be a farmer or a librarian. The description of the person is: "Steve is very **shy and withdrawn**, invariably helpful but with very little interest in people or in the world of reality. **A meek and tidy soul**, he has a need for order and structure, and a passion for detail." Reading this description, most people conclude that Steve is a librarian, because the key words "shy", "tidy", and "meek" fit that stereotype. However, the point of the question is not really whether Steve is a farmer or a librarian. People rely on their first impressions to make the judgement and do not take into account the ratio of farmers to librarians in reality (20:1). Statistically, Steve is more likely to be a farmer than a librarian: even though many words suggest a librarian, we cannot ignore the base rates of farmers and librarians. In Bayes' theorem that base rate is the prior, P(C), and it is weighed together with the evidence. The Naive Bayes assumption is a separate simplification: all features are independent given the class, and every feature carries the same weight.

Naive Bayes is built on Bayes' theorem:

$$
P(C~|~{\rm features}) = \frac{P({\rm features}~|~C)P(C)}{P({\rm features})}
$$

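Explanation:

P(C | features): the posterior, the probability that the hypothesis (class C) is true given the evidence.

P(C): the prior, the probability that the hypothesis is true before seeing any evidence.

P(features | C): the likelihood, the probability of seeing the evidence if the hypothesis is true.

P(features): the evidence, the overall probability of seeing the evidence.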
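
To make the base-rate argument concrete, here is a small numeric sketch of Bayes' theorem for the Steve puzzle. The 20:1 farmer-to-librarian ratio comes from the text; the two likelihoods (how often a librarian or a farmer would fit the description) are made-up numbers chosen only for illustration, not data.

```python
# Prior P(C): 20 farmers for every librarian (ratio taken from the text).
prior = {"farmer": 20 / 21, "librarian": 1 / 21}

# Likelihood P(description | C): ASSUMED values, for illustration only.
likelihood = {"farmer": 0.10, "librarian": 0.40}

# Evidence P(description): total probability of the description over both classes.
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(C | description) via Bayes' theorem.
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

for c, p in posterior.items():
    print(f"P({c} | description) = {p:.3f}")
```

Under these assumed likelihoods the posterior for "farmer" is about 0.83, even though the description is four times more typical of a librarian. The prior outweighs the evidence.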


In [1]:
# jupyter nbconvert --to pdf naive_bayes.ipynb --no-input
# tidy industrious kindness

In [2]:
%%capture 
print('hello world')

In [3]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
sns.set(style="whitegrid")

In [4]:
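iris = datasets.load_iris()
# Combine the four features and the target into a single DataFrame.
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

# Shuffle the rows reproducibly: the raw data is ordered by class.
df = df.sample(frac=1, random_state=1)

# Features are all columns but the last; the target is the last column.
X, y = df.iloc[:, :-1], df.iloc[:, -1]

# First 100 shuffled rows for training, the remaining 50 for testing.
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)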

(100, 4) (100,)
(50, 4) (50,)


In [5]:
X_train.groupby(y_train).var().values.tolist()

[[0.11739784946236563,
  0.11298924731182801,
  0.02511827956989247,
  0.009225806451612906],
 [0.2198387096774194,
  0.08539314516129025,
  0.19161290322580646,
  0.03318548387096776],
 [0.3665765765765765,
  0.11654654654654656,
  0.33780780780780784,
  0.06085585585585581]]

In [6]:
X_train.groupby(y_train).var().to_numpy()

array([[0.11739785, 0.11298925, 0.02511828, 0.00922581],
       [0.21983871, 0.08539315, 0.1916129 , 0.03318548],
       [0.36657658, 0.11654655, 0.33780781, 0.06085586]])
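
The per-class variances above, together with the per-class means, are the parameters of a Gaussian likelihood. As a sketch of what a Gaussian Naive Bayes classifier computes (the symbols $x_i$, $\mu_{C,i}$, and $\sigma_{C,i}^2$ for the value, class mean, and class variance of feature $i$ are notation introduced here, not taken from the text above), the naive assumption factorizes the likelihood over the $n$ features:

$$
P({\rm features}~|~C) = \prod_{i=1}^{n} P(x_i~|~C),
\qquad
P(x_i~|~C) = \frac{1}{\sqrt{2\pi\sigma_{C,i}^2}} \exp\left(-\frac{(x_i - \mu_{C,i})^2}{2\sigma_{C,i}^2}\right)
$$

Because $P({\rm features})$ is the same for every class, the predicted class is the one that maximizes the numerator of Bayes' theorem:

$$
\hat{C} = \arg\max_{C}~P(C)\prod_{i=1}^{n} P(x_i~|~C)
$$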

In [7]:
l = [1,2,3,4,5,6,7, 9,8,9]
len(list(set(l)))

9

In [104]:
class NaiveBayesClassifier():
    """Gaussian Naive Bayes classifier for numeric features."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

        self.len_y = len(y)
        # Sorted class labels, so that position k in self.prior, self.mean
        # and self.var always refers to the class self.unique_y[k].
        self.unique_y = np.unique(y)
        self.unique_len_y = len(self.unique_y)

    def fit(self, X, y):
        self._process_prior(y)
        self._process_mean_var(X, y)

    def predict(self, X):
        # Predict a class label for every row of the DataFrame X.
        pred = []
        for row in X.values:
            pred.append(self.posterior(row))
        return pred

    def posterior(self, x):
        # Unnormalized posterior P(C) * P(x | C) for each class. The evidence
        # P(x) is identical for all classes, so it does not change the argmax.
        posterior = []
        for k in range(self.unique_len_y):
            prior = self.prior[k]
            conditional = self._process_conditional(x, k)
            posterior.append(prior * conditional)
        return self.unique_y[np.argmax(posterior)]

    def _process_prior(self, y):
        # Prior P(C): share of the training samples that belong to each class.
        self.prior = []
        for c in self.unique_y:
            self.prior.append(np.sum(np.asarray(y) == c) / len(y))
        return self.prior

    def _gaussian(self, mean, var, x):
        # Gaussian density of every feature value, multiplied across features
        # (the naive independence assumption).
        gau_num = np.exp(-((x - mean) ** 2) / (2 * var))
        gau_deno = np.sqrt(2 * np.pi * var)
        self.gau = np.prod(gau_num / gau_deno)
        return self.gau

    def _process_mean_var(self, X, y):
        # Per-class mean and sample variance of every feature. groupby sorts
        # the class labels, which matches the order of self.unique_y.
        self.mean = X.groupby(y).mean().values
        self.var = X.groupby(y).var().values
        return self.mean, self.var

    def _process_conditional(self, x, k):
        # Likelihood P(x | C) for the class at position k.
        self.conditional = self._gaussian(self.mean[k], self.var[k], x)
        return self.conditional

In [105]:
y_test.values

array([0., 1., 0., 1., 1., 0., 1., 0., 0., 2., 2., 2., 0., 0., 1., 0., 2.,
       0., 2., 2., 0., 2., 0., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0.,
       1., 1., 1., 1., 2., 0., 0., 2., 1., 2., 1., 2., 2., 1., 2., 0.])

In [106]:
nb = NaiveBayesClassifier(X_train.values, y_train.values)
nb.fit(X_train, y_train)


In [107]:
nb.predict(X_test)
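
Multiplying many small densities can underflow, so implementations usually compare sums of log-densities instead of products. The sketch below is not the class above. It is a separate, self-contained version (the function names `fit_gaussian_nb` and `predict_log_space` are hypothetical) run on the same shuffled iris split, with scikit-learn's `GaussianNB` as a reference point.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Same preparation as in the notebook: shuffle with random_state=1, 100/50 split.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df = df.sample(frac=1, random_state=1)
X, y = df.iloc[:, :-1].to_numpy(), df.iloc[:, -1].to_numpy()
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]


def fit_gaussian_nb(X, y):
    """Return class labels, log-priors, per-class means and sample variances."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    mean = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0, ddof=1) for c in classes])
    return classes, log_prior, mean, var


def predict_log_space(X, classes, log_prior, mean, var):
    """Pick the class with the largest log-prior plus summed log-densities."""
    # X[:, None, :] has shape (n, 1, d); mean and var have shape (k, d),
    # so the broadcast result has shape (n, k, d).
    log_density = -0.5 * (np.log(2 * np.pi * var) + (X[:, None, :] - mean) ** 2 / var)
    log_posterior = log_prior + log_density.sum(axis=2)  # shape (n, k)
    return classes[np.argmax(log_posterior, axis=1)]


params = fit_gaussian_nb(X_train, y_train)
pred = predict_log_space(X_test, *params)
reference = GaussianNB().fit(X_train, y_train).predict(X_test)

print("from-scratch accuracy:", np.mean(pred == y_test))
print("scikit-learn accuracy:", np.mean(reference == y_test))
print("agreement with scikit-learn:", np.mean(pred == reference))
```

scikit-learn estimates the variances with the population formula and adds a small smoothing term, so the two sets of predictions may differ on borderline samples.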

In [35]:
x = np.arange(24).reshape((2, 3, 4))
# argmax returns an index into the flattened array, so index x.flat with it.
int(x.flat[np.argmax(x)])

23

In [23]:
iris = datasets.load_iris()
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

In [25]:
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]

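| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2.0 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2.0 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2.0 |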


In [None]:
# Recompute the class priors with the classifier's own method.
nb._process_prior(y_train)

In [None]:
nb.prior
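
The class priors are simply the relative class frequencies in the training labels. One compact way to obtain them with pandas is sketched below, on a small made-up label series rather than the notebook's `y_train`.

```python
import pandas as pd

# Hypothetical training labels, for illustration only.
labels = pd.Series([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])

# normalize=True turns counts into proportions; sort_index orders by class label.
prior = labels.value_counts(normalize=True).sort_index()
print(prior.to_dict())
```

With three, two, and five samples out of ten, the priors for classes 0, 1, and 2 come out as 0.3, 0.2, and 0.5.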

In [107]:
class abc():


    def __init__(self, ls):
        self.a = len(ls)
        

    def b(self):
        
        return self.a
    

def a():
    return 20


a = abc([1,2,3,4,5,6,7,8,9])
a.b()

9