# Naive Bayes classifier

TASK:

Imagine a test for disease X. The test correctly detects 90% of sick people, but when a healthy person takes it, it gives a wrong result in 30% of cases. Disease X affects 10% of the population on average.

What is the probability that a person who took the test and received a positive result (i.e. was flagged as sick) is actually healthy?

P(A|B) = (P(B|A) * P(A)) / P(B)
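
To apply the formula to the task, take A = "healthy" and B = "positive result". The denominator P(B) follows from the law of total probability over sick and healthy people. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
# Quantities given in the task
p_sick = 0.10                 # disease X affects 10% of the population
p_pos_given_sick = 0.90       # the test detects 90% of sick people
p_pos_given_healthy = 0.30    # the test is wrong for 30% of healthy people

p_healthy = 1 - p_sick

# Law of total probability: P(positive)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy

# Bayes' theorem: P(healthy | positive)
p_healthy_given_pos = p_pos_given_healthy * p_healthy / p_pos

print(f"P(positive) = {p_pos:.2f}")
print(f"P(healthy | positive) = {p_healthy_given_pos:.2f}")
```

This gives P(positive) = 0.36 and P(healthy | positive) = 0.75: because healthy people are nine times more numerous than sick ones, most positive results come from healthy people.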

## Building a model on real data

First, we need to import the data. Let's download the datasets from GitHub.

In [3]:
!git clone https://github.com/matzim95/ML-datasets


fatal: destination path 'ML-datasets' already exists and is not an empty directory.


In [53]:
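import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def load_dataset(filename, class_column, index_col=None):
    # Read the CSV file from the cloned repository
    dataset = pd.read_csv(f'ML-datasets/{filename}.csv', index_col=index_col)
    # Encode the class labels as integer codes in a new 'class' column
    dataset['class'] = dataset[class_column].astype('category').cat.codes
    # Drop the original label column and keep its distinct values
    # (in order of first appearance, which need not match the code order)
    classes = dataset.pop(class_column).unique()
    return dataset, classes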

def calculate_metrics(target, prediction, average='weighted'):
    # Precision, recall and F1 are averaged across classes according to `average`
    accuracy = accuracy_score(target, prediction)
    precision = precision_score(target, prediction, average=average)
    recall = recall_score(target, prediction, average=average)
    f1 = f1_score(target, prediction, average=average)
    # Number of samples whose predicted class differs from the true class
    mislabeled = (target != prediction).sum()
    total = len(target)
    return accuracy, precision, recall, f1, mislabeled, total

def print_results(metrics, classifier_id='classifier'):
    print(f'Results for {classifier_id}')
    print('----')
    print(f'  Accuracy:  {metrics[0]}')
    print(f'  Precision: {metrics[1]}')
    print(f'  Recall:    {metrics[2]}')
    print(f'  F1 score:  {metrics[3]}')
    print(f'  Mislabeled {metrics[4]} out of {metrics[5]}')
    print('\n')
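
One detail of `load_dataset` is worth checking: `cat.codes` numbers the categories in sorted order, while `unique()` returns the labels in order of first appearance, so `classes[i]` is not guaranteed to be the label behind code `i`. A small sketch on made-up labels (the series `labels` is hypothetical) shows the difference, with `cat.categories` giving the code-aligned names:

```python
import pandas as pd

# Toy labels whose order of first appearance is not alphabetical
labels = pd.Series(["tableware", "containers", "tableware", "headlamps"])
as_category = labels.astype("category")

codes = as_category.cat.codes                 # integer code for every row
names_by_code = as_category.cat.categories    # names_by_code[i] is the label for code i
names_by_appearance = labels.unique()         # order of first appearance instead

print(list(codes))
print(list(names_by_code))
print(list(names_by_appearance))
```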

### Loading the dataset

Glass

In [39]:
glass, glass_classes = load_dataset("glass", "Type", "ID")

In [40]:
import pprint

pprint.pprint(glass_classes)

array(['building_windows_float_processed',
       'building_windows_non_float_processed',
       'vehicle_windows_float_processed', 'containers', 'tableware',
       'headlamps'], dtype=object)


In [41]:
y = glass.pop("class")
X = glass

Wine

### Normalization / standardization

In [23]:
X.head()

ID,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron
1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0
2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0
3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0
4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0
5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0


In [44]:
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler


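# MinMaxScaler: maps each feature (column) to the [0, 1] range
min_max = MinMaxScaler()
# Normalizer: rescales each sample (row) to unit norm
normalizer = Normalizer()
# StandardScaler: centers each feature to mean 0 and scales it to unit variance
standard_scaller = StandardScaler()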

In [27]:
glass.describe()

statistic,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,1.518365,13.40785,2.684533,1.444907,72.650935,0.497056,8.956963,0.175047,0.057009
std,0.003037,0.816604,1.442408,0.49927,0.774546,0.652192,1.423153,0.497219,0.097439
min,1.51115,10.73,0.0,0.29,69.81,0.0,5.43,0.0,0.0
25%,1.516522,12.9075,2.115,1.19,72.28,0.1225,8.24,0.0,0.0
50%,1.51768,13.3,3.48,1.36,72.79,0.555,8.6,0.0,0.0
75%,1.519157,13.825,3.6,1.63,73.0875,0.61,9.1725,0.0,0.1
max,1.53393,17.38,4.49,3.5,75.41,6.21,16.19,3.15,0.51


In [34]:
X_normalized = X.copy()

normalizer.fit(X)
X_matrix = normalizer.transform(X)

X_normalized[:] = X_matrix

In [36]:
X_normalized.describe()

statistic,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,0.020373,0.179893,0.036079,0.019389,0.974684,0.006695,0.120183,0.002353,0.000766
std,0.000214,0.010981,0.019395,0.006734,0.002748,0.008941,0.019227,0.006722,0.001311
min,0.019452,0.146107,0.0,0.003909,0.960172,0.0,0.072963,0.0,0.0
25%,0.020258,0.173025,0.028478,0.015932,0.973149,0.001663,0.110382,0.0,0.0
50%,0.020338,0.178892,0.046603,0.018337,0.975522,0.007438,0.115211,0.0,0.0
75%,0.020479,0.186234,0.048366,0.02184,0.976717,0.008155,0.124221,0.0,0.001343
max,0.02127,0.223717,0.060884,0.048796,0.979958,0.085825,0.221568,0.043756,0.00689
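
The statistics above reflect that `Normalizer` works row by row: each sample is divided by its own L2 norm, so Silicon, by far the largest component of every sample, ends up close to 1. `StandardScaler` and `MinMaxScaler` work column by column instead. A small sketch on made-up numbers (the array `toy` is hypothetical and not taken from the glass data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler

toy = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [0.0, 5.0]])

row_normalized = Normalizer().fit_transform(toy)     # each row scaled to unit L2 norm
standardized = StandardScaler().fit_transform(toy)   # each column: mean 0, std 1
min_maxed = MinMaxScaler().fit_transform(toy)        # each column mapped to [0, 1]

print(np.linalg.norm(row_normalized, axis=1))        # norm of every row
print(standardized.mean(axis=0))                     # mean of every column
print(min_maxed.min(axis=0), min_maxed.max(axis=0))  # column minima and maxima
```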


In [42]:
X_standarized = X.copy()

standard_scaller.fit(X)
X_matrix = standard_scaller.transform(X)

X_standarized[:] = X_matrix

In [45]:
X_standarized.describe()

statistic,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,-2.877034e-14,2.191393e-15,-1.328117e-16,-2.988264e-16,9.525091e-16,3.527811e-17,-3.154278e-16,-6.640586e-17,-3.73533e-17
std,1.002345,1.002345,1.002345,1.002345,1.002345,1.002345,1.002345,1.002345,1.002345
min,-2.381516,-3.286943,-1.865511,-2.318616,-3.676472,-0.7639186,-2.484084,-0.3528768,-0.5864509
25%,-0.6082728,-0.614158,-0.3957744,-0.511756,-0.4800288,-0.5756501,-0.5049657,-0.3528768,-0.5864509
50%,-0.2262293,-0.1323817,0.5527787,-0.1704602,0.1799655,0.08905322,-0.2514132,-0.3528768,-0.5864509
75%,0.2614331,0.5120326,0.636168,0.3715977,0.5649621,0.173582,0.1518057,-0.3528768,0.4422417
max,5.137232,4.875637,1.254639,4.125851,3.570524,8.780145,5.094318,5.99721,4.659881
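
The `std` row above shows 1.002345 rather than exactly 1 because the two libraries use different estimators: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). For 214 rows the ratio is sqrt(214/213), roughly 1.002345. A quick check on synthetic data (the random column is an assumption, used only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
column = pd.DataFrame({"x": rng.normal(size=214)})  # same row count as the glass data

scaled = pd.DataFrame(StandardScaler().fit_transform(column), columns=["x"])

population_std = scaled["x"].std(ddof=0)  # the quantity StandardScaler sets to 1
sample_std = scaled["x"].std(ddof=1)      # the quantity describe() reports

print(population_std, sample_std)
print(np.sqrt(214 / 213))
```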


In [47]:
X_min_maxed = X.copy()

min_max.fit(X)
X_matrix = min_max.transform(X)

# Write the scaled values into the min-max copy
X_min_maxed[:] = X_matrix

In [48]:
# After min-max scaling, every column should have min 0.0 and max 1.0
X_min_maxed.describe()


We can now move on to defining the model and validating it.

In [49]:
from sklearn.model_selection import train_test_split

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=30)
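
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in the training and test parts, which matters for a small, imbalanced dataset. A minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=30
)

# Class counts in each part; both keep the 4:1 ratio of the full set
print(np.bincount(y_tr), np.bincount(y_te))
```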

In [52]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

In [58]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [64]:
# Fit each naive Bayes variant on the raw (unscaled) features and report its test metrics
for classifier in [MultinomialNB(), GaussianNB(), BernoulliNB()]:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print_results(calculate_metrics(y_test, y_pred), classifier_id=classifier)

Results for MultinomialNB()
----
  Accuracy:  0.6046511627906976
  Precision: 0.5639602883176935
  Recall:    0.6046511627906976
  F1 score:  0.5653992040790947
  Mislabeled 17 out of 43


Results for GaussianNB()
----
  Accuracy:  0.3023255813953488
  Precision: 0.5454263565891473
  Recall:    0.3023255813953488
  F1 score:  0.34274039573917175
  Mislabeled 30 out of 43


Results for BernoulliNB()
----
  Accuracy:  0.4418604651162791
  Precision: 0.3817389006342495
  Recall:    0.4418604651162791
  F1 score:  0.40216872399918163
  Mislabeled 24 out of 43
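
With `average='weighted'`, precision is undefined for any class that a classifier never predicts; scikit-learn then substitutes 0 and emits a warning. The `zero_division` parameter of `precision_score` (available in recent scikit-learn versions) makes that choice explicit. A minimal sketch on made-up labels (`y_true` and `y_hat` are hypothetical):

```python
from sklearn.metrics import precision_score

# Hypothetical labels: class 2 occurs in the target but is never predicted
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 1, 0, 1]

# zero_division=0 scores the never-predicted class as 0 without a warning
precision = precision_score(y_true, y_hat, average="weighted", zero_division=0)
print(precision)
```

Here classes 0 and 1 each have precision 2/3 and class 2 contributes 0, so the weighted average is 4/9.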






This way of comparing results is not very readable, which is why we use metrics and confusion matrices. Let's prepare a function:
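
One possible shape for such a function, given as a sketch rather than the notebook's original code: wrap `sklearn.metrics.confusion_matrix` so that the result is a labeled `DataFrame`. The name `confusion_frame` and the toy labels are assumptions made for illustration:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def confusion_frame(target, prediction, labels):
    """Confusion matrix as a DataFrame (rows: true class, columns: predicted class)."""
    matrix = confusion_matrix(target, prediction, labels=labels)
    return pd.DataFrame(
        matrix,
        index=[f"true {label}" for label in labels],
        columns=[f"pred {label}" for label in labels],
    )

# Toy example with three classes
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]
frame = confusion_frame(y_true, y_hat, labels=[0, 1, 2])
print(frame)
```

Off-diagonal cells show which classes are confused with which, something a single accuracy figure cannot reveal.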

And what is discretization good for? Let's find out!

Let's define several discretization functions, using the ones pandas provides:
* pd.cut()
* pd.qcut()

Functions that perform the categorization:
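
A minimal sketch of what such helpers might look like (the names `discretize_equal_width` and `discretize_equal_frequency` and the bin count are assumptions, not the notebook's original code). `pd.cut` splits the value range into bins of equal width, while `pd.qcut` places the bin edges at quantiles so that each bin holds roughly the same number of samples:

```python
import pandas as pd

def discretize_equal_width(frame, bins=4):
    """Bin every column into `bins` equal-width intervals; returns integer bin codes."""
    return frame.apply(lambda col: pd.cut(col, bins=bins, labels=False))

def discretize_equal_frequency(frame, bins=4):
    """Bin every column at quantiles; duplicate edges are dropped for skewed columns."""
    return frame.apply(
        lambda col: pd.qcut(col, q=bins, labels=False, duplicates="drop")
    )

# Toy column with one outlier to make the difference visible
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]})
equal_width = discretize_equal_width(toy)["a"].tolist()
equal_frequency = discretize_equal_frequency(toy)["a"].tolist()
print(equal_width)
print(equal_frequency)
```

On this toy column the outlier pushes all other values into the first equal-width bin, whereas the quantile-based bins each receive two values.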

## Decision boundaries on an artificially generated dataset: