# Assignment

In this assignment, three models will be developed using the Naive Bayes algorithm. The goal is to understand the theoretical foundations of Naive Bayes and to apply them in practice. The theoretical part of the assignment will be explained in the GitHub readme.md. The IMRAD format (Introduction (Abstract), Methods, Results, and Discussion) may be used. The theoretical structure of the algorithm and the working principle of the hyperparameters used must be explained. The work must be reproducible by reading the repository's Readme.md file.
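As a possible starting point for the theoretical part, the Gaussian Naive Bayes decision rule can be written as below. This is the standard textbook formulation, not something prescribed by the assignment; the symbols $\mu_{k,i}$ and $\sigma_{k,i}^{2}$ (the mean and variance of feature $i$ within class $k$) are notation introduced here.

$$
\hat{y} = \arg\max_{k} \; P(y = k) \prod_{i=1}^{n} P(x_i \mid y = k),
\qquad
P(x_i \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{k,i})^{2}}{2\sigma_{k,i}^{2}}\right)
$$

The product over features reflects the "naive" assumption that features are conditionally independent given the class.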

## Dataset

In [2]:
import numpy as np # Numerical Python (NumPy): linear algebra, fast array operations, etc.
import pandas as pd # DataFrame library for Python, modeled on R's data frames

In [3]:
df = pd.read_csv("diabetes.csv") # reads the CSV file into a DataFrame object

In [6]:
df.head(10) # shows the first 10 rows of the data

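|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |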


The dataset can be analyzed in more detail with pandas and matplotlib, and the findings can be interpreted. If you wish, you may also modify the dataset.
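As one example of such an analysis, the sketch below re-types the ten rows printed by `df.head(10)` so that it runs without `diabetes.csv`, then counts zero entries and summarizes each class. Treating a zero in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, or `BMI` as a missing measurement is an assumption made here, not something stated in the assignment; the names `sample` and `suspect_columns` are illustrative.

```python
import pandas as pd

# The ten rows shown by df.head(10), re-typed so the snippet is self-contained.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 5, 3, 10, 2, 8],
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288,
                                 0.201, 0.248, 0.134, 0.158, 0.232],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Assumption: a zero in these columns stands in for a missing measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Class balance and per-class feature means.
class_counts = sample["Outcome"].value_counts()
print(class_counts)
print(sample.groupby("Outcome").mean().round(2))
```

On the full dataset the same calls can be applied to `df` directly.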

In [7]:
from sklearn.model_selection import train_test_split # splits the data into training and test sets

In [8]:
X = df.drop(['Outcome'], axis = 1)
y = df["Outcome"]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=31)
print("The length of X_train is :", len(X_train))
print("The length of X_test is :", len(X_test))

The length of X_train is : 514
The length of X_test is : 254


## Model (Training with Default Hyperparameters)

In [20]:
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes classifier

In [21]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix # metrics used to evaluate the model

In [22]:
gnb = GaussianNB()

In [23]:
gnb.get_params()

{'priors': None, 'var_smoothing': 1e-09}
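To make these two hyperparameters concrete, the sketch below fits `GaussianNB` on synthetic stand-in data (not the diabetes data) and inspects the fitted attributes `epsilon_` and `class_prior_`. It illustrates that `var_smoothing` is a fraction of the largest feature variance that is added to every variance for numerical stability, and that `priors=None` means the class priors are estimated from the class frequencies. The names `X_demo`, `y_demo`, and `expected_epsilon` are made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in data: two features on very different scales.
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 50, 200)])
y_demo = rng.integers(0, 2, 200)

model = GaussianNB(var_smoothing=1e-9).fit(X_demo, y_demo)

# epsilon_ is the absolute amount added to each per-class variance:
# var_smoothing times the largest feature variance in the training data.
expected_epsilon = 1e-9 * X_demo.var(axis=0).max()
print(model.epsilon_, expected_epsilon)

# With priors=None the priors equal the observed class frequencies.
observed_frequencies = np.bincount(y_demo) / len(y_demo)
print(model.class_prior_, observed_frequencies)
```

Larger `var_smoothing` values widen every Gaussian, which smooths the decision boundary; very small values leave the estimated variances essentially untouched.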

In [24]:
gnb.fit(X_train, y_train)

In [25]:
y_pred = gnb.predict(X_test)

The model has learned from the training set; we now evaluate it on the test data.

In [26]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[126,  33],
       [ 34,  61]], dtype=int64)
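The entries of this matrix determine the usual metrics. A minimal sketch, using the scikit-learn convention that rows are true labels and columns are predicted labels, and treating class 1 as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[126, 33],
               [34, 61]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the predicted positives, how many are correct
recall_1 = tp / (tp + fn)      # of the actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

These values agree with the class 1 row of the classification report and with `accuracy_score`.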

In [27]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79       159
           1       0.65      0.64      0.65        95

    accuracy                           0.74       254
   macro avg       0.72      0.72      0.72       254
weighted avg       0.74      0.74      0.74       254



In [28]:
accuracy_score(y_test, y_pred)

0.7362204724409449

## Model Tuning (GridSearch, RandomSearch, Bayesian Optimization)

Using one of these methods (GridSearch is sufficient for the assignment) or several of them, we can tune the model and obtain results with a new set of hyperparameters.

In [29]:
gnb.get_params()

{'priors': None, 'var_smoothing': 1e-09}

In [30]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'var_smoothing': [100, 10, 1, 1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11, 1e-13] 
}

grid = GridSearchCV(gnb, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

In [31]:
grid.fit(X_train, y_train)

In [32]:
grid.best_estimator_
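Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and `cv_results_`, which are convenient to report in the README. The sketch below shows them on synthetic data from `make_classification`, an assumption made so the snippet runs without `diabetes.csv`. It also builds the grid with `np.logspace`, one candidate per decade, as an alternative to typing the values by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=31)

# 16 log-spaced candidates from 1e2 down to 1e-13.
param_grid = {"var_smoothing": np.logspace(2, -13, 16)}

search = GridSearchCV(GaussianNB(), param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)              # winning hyperparameter value
print(round(search.best_score_, 4))     # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))  # number of candidates evaluated
```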

In [33]:
y_pred_grid = grid.predict(X_test)
accuracy_score(y_test, y_pred_grid)

0.7401574803149606

In [34]:
print(classification_report(y_test, y_pred_grid))

              precision    recall  f1-score   support

           0       0.78      0.81      0.80       159
           1       0.66      0.62      0.64        95

    accuracy                           0.74       254
   macro avg       0.72      0.72      0.72       254
weighted avg       0.74      0.74      0.74       254



In [41]:
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'var_smoothing': np.linspace(1, 1e-20, 300)
}

random = RandomizedSearchCV(gnb, n_iter = 150, param_distributions=param_distributions, cv=3, scoring='accuracy', n_jobs=-1)
random.fit(X_train, y_train)
random.best_estimator_
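One thing to be aware of: `np.linspace(1, 1e-20, 300)` is linearly spaced, so almost all candidates lie between roughly 0.003 and 1, and the small values that the grid search covered are barely sampled. A possible alternative, sketched here on synthetic data, is to sample on a log scale with `scipy.stats.loguniform`. The bounds `1e-13` and `1e2`, `n_iter=30`, and `random_state=31` are choices made for this example, not part of the original run.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB

# How many of the 300 linearly spaced candidates fall below 0.01?
linear_grid = np.linspace(1, 1e-20, 300)
n_below = int((linear_grid < 1e-2).sum())
print(n_below)

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=31)

# Log-uniform sampling gives every decade the same chance of being drawn.
param_distributions = {"var_smoothing": loguniform(1e-13, 1e2)}
search = RandomizedSearchCV(
    GaussianNB(),
    param_distributions=param_distributions,
    n_iter=30,
    cv=3,
    scoring="accuracy",
    random_state=31,  # fixed seed so the sampled candidates are reproducible
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Passing `random_state` also matters for the assignment's reproducibility requirement, since without it each run draws different candidates.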

In [42]:
y_pred_rand = random.predict(X_test)
accuracy_score(y_test, y_pred_rand)

0.7362204724409449

In [43]:
print(classification_report(y_test, y_pred_rand))

              precision    recall  f1-score   support

           0       0.76      0.85      0.80       159
           1       0.68      0.55      0.61        95

    accuracy                           0.74       254
   macro avg       0.72      0.70      0.70       254
weighted avg       0.73      0.74      0.73       254



## Data Manipulation

We can modify the data to obtain a new dataset and train the model on it. Examples include normalization (min-max, z-score), feature selection, and feature engineering. Normalization is sufficient for the assignment.

In [46]:
from sklearn.preprocessing import StandardScaler 

sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

In [50]:
gnb.fit(X_train_scaled, y_train)
y_pred_scaled = gnb.predict(X_test_scaled)
accuracy_score(y_test, y_pred_scaled)

0.7362204724409449

In [51]:
print(classification_report(y_test, y_pred_scaled))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79       159
           1       0.65      0.64      0.65        95

    accuracy                           0.74       254
   macro avg       0.72      0.72      0.72       254
weighted avg       0.74      0.74      0.74       254
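The scaled model reproduces the baseline scores exactly, which is expected rather than a coincidence: Gaussian Naive Bayes estimates a separate mean and variance for each feature, so a per-feature linear rescaling such as z-scoring multiplies every class likelihood by the same constant and leaves the predicted class unchanged (up to the tiny `var_smoothing` term). The sketch below checks this on synthetic data with deliberately mismatched feature scales; the data and the names `raw_pred` and `scaled_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic data with features stretched to very different scales.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=31)
X_demo = X_demo * np.array([1, 10, 100, 1, 5, 50, 2, 20])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=31
)

# Model on raw features.
raw_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

# Model on z-scored features (scaler fitted on the training split only).
scaler = StandardScaler().fit(X_tr)
scaled_pred = (
    GaussianNB().fit(scaler.transform(X_tr), y_tr).predict(scaler.transform(X_te))
)

# Fraction of test points on which the two models agree.
agreement = (raw_pred == scaled_pred).mean()
print(agreement)
```

A transformation has to change the shape of the feature distributions, not only their location and scale, to affect this classifier.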



In [54]:
# non-linear scaling
df['SkinThickness'].min(), df['SkinThickness'].max()

(0, 99)

In [55]:
df['BMI'].min(), df['BMI'].max()

(0.0, 67.1)

In [82]:
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()

X_train_normalized = pt.fit_transform(X_train)
X_test_normalized = pt.transform(X_test)
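Unlike `StandardScaler`, `PowerTransformer` applies a non-linear map (Yeo-Johnson by default, followed by standardization) that pushes each feature toward a Gaussian shape, which is the distribution `GaussianNB` assumes. A minimal sketch on synthetic log-normal data, chosen here only because it is strongly right-skewed; `skewed` and `pt_demo` are illustrative names.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(31)
# Strongly right-skewed synthetic feature.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt_demo = PowerTransformer()  # defaults: method="yeo-johnson", standardize=True
transformed = pt_demo.fit_transform(skewed)

skew_before = skew(skewed[:, 0])
skew_after = skew(transformed[:, 0])
print(round(skew_before, 2), round(skew_after, 2))

# One fitted power parameter (lambda) per feature.
print(pt_demo.lambdas_)
```

The skewness should drop toward zero after the transform, and the output has zero mean and unit variance because `standardize=True`.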

In [78]:
gnb.fit(X_train_normalized, y_train)
y_pred_normalized = gnb.predict(X_test_normalized)
accuracy_score(y_test, y_pred_normalized)

0.7716535433070866

In [80]:
print(classification_report(y_test, y_pred_normalized))

              precision    recall  f1-score   support

           0       0.82      0.81      0.82       159
           1       0.69      0.71      0.70        95

    accuracy                           0.77       254
   macro avg       0.76      0.76      0.76       254
weighted avg       0.77      0.77      0.77       254

