## End-to-end data science project steps

🔹 1. Problem Definition
🔹 2. Data Collection (Kaggle)
🔹 3. EDA – Exploratory Data Analysis (dimensions, column types, missing values, summary statistics, outlier detection)
🔹 4. Data Preprocessing (imputing or dropping missing values, One-Hot Encoding, Label Encoding)
🔹 5. Model Training (Classification)
🔹 6. Model Evaluation (Accuracy, Precision, Recall, F1, ROC-AUC)

In [1]:
import pandas as pd

### Problem Definition

In [None]:
### The goal of this study is to predict churn (customer loss) among online retail customers.

### Data Collection (Kaggle)

In [8]:
# Dataset Link : https://www.kaggle.com/datasets/hassaneskikri/online-retail-customer-churn-dataset

df = pd.read_csv("online_retail_customer_churn.csv")

In [9]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn
0,1,62,Other,45.15,5892.58,5,22,453.8,2,0,3,129,True,Responded,True
1,2,65,Male,79.51,9025.47,13,77,22.9,2,2,3,227,False,Responded,False
2,3,18,Male,29.19,618.83,13,71,50.53,5,2,2,283,False,Responded,True
3,4,21,Other,79.63,9110.3,3,33,411.83,5,3,5,226,True,Ignored,True
4,5,21,Other,77.66,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False


### EDA – Exploratory Data Analysis

In [12]:
df.describe()

Unnamed: 0,Customer_ID,Age,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,43.267,111.96296,5080.79265,9.727,49.456,266.87653,4.612,1.934,2.974,182.89
std,288.819436,15.242311,52.844111,2862.12335,5.536346,28.543595,145.873445,2.896869,1.402716,1.391855,104.391319
min,1.0,18.0,20.01,108.94,1.0,1.0,10.46,0.0,0.0,1.0,1.0
25%,250.75,30.0,67.8,2678.675,5.0,25.0,139.6825,2.0,1.0,2.0,93.0
50%,500.5,43.0,114.14,4986.195,9.0,49.0,270.1,5.0,2.0,3.0,180.5
75%,750.25,56.0,158.4525,7606.47,14.0,74.0,401.6025,7.0,3.0,4.0,274.0
max,1000.0,69.0,199.73,9999.64,19.0,99.0,499.57,9.0,4.0,5.0,364.0


In [15]:
# Missing-value check and column dtype preview
display(pd.DataFrame({
    'dtype': df.dtypes,
    'missing_count': df.isna().sum(),
    'missing_pct': df.isna().mean()
}).sort_values('missing_pct', ascending=False))

Unnamed: 0,dtype,missing_count,missing_pct
Customer_ID,int64,0,0.0
Age,int64,0,0.0
Gender,object,0,0.0
Annual_Income,float64,0,0.0
Total_Spend,float64,0,0.0
Years_as_Customer,int64,0,0.0
Num_of_Purchases,int64,0,0.0
Average_Transaction_Amount,float64,0,0.0
Num_of_Returns,int64,0,0.0
Num_of_Support_Contacts,int64,0,0.0


In [19]:
### None of the columns has missing values.

In [22]:
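# Numeric columns, defined here so this cell does not depend on a later cell.
# Target_Churn is boolean, so this selection leaves it out.
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Count values outside the IQR fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column.
outlier_summary = {}
for c in num_cols:
    q1 = df[c].quantile(0.25)
    q3 = df[c].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outlier_count = ((df[c] < lower) | (df[c] > upper)).sum()
    outlier_summary[c] = {'outlier_count': int(outlier_count), 'pct': outlier_count / len(df)}
outlier_df = pd.DataFrame(outlier_summary).T.sort_values('outlier_count', ascending=False)
display(outlier_df)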

Unnamed: 0,outlier_count,pct
Customer_ID,0.0,0.0
Age,0.0,0.0
Annual_Income,0.0,0.0
Total_Spend,0.0,0.0
Years_as_Customer,0.0,0.0
Num_of_Purchases,0.0,0.0
Average_Transaction_Amount,0.0,0.0
Num_of_Returns,0.0,0.0
Num_of_Support_Contacts,0.0,0.0
Satisfaction_Score,0.0,0.0
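
The all-zero outlier table follows directly from the summary statistics. As a quick check, the sketch below recomputes the IQR fences for two columns from the quartiles reported by `df.describe()`. The helper name `iqr_fences` is ours for illustration and is not part of the notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (min, 25%, 75%, max) copied from the df.describe() output.
stats = {
    "Age": (18.0, 30.0, 56.0, 69.0),
    "Total_Spend": (108.94, 2678.675, 7606.47, 9999.64),
}

for name, (col_min, q1, q3, col_max) in stats.items():
    lower, upper = iqr_fences(q1, q3)
    # A column has no IQR outliers when its min and max lie inside the fences.
    inside = lower <= col_min and col_max <= upper
    print(f"{name}: fences=({lower:.2f}, {upper:.2f}), min/max inside: {inside}")
```

Both columns sit well inside their fences, which is consistent with the zero counts reported for them.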


### Data Preprocessing

In [26]:
# Numeric columns: median imputation would be applied here, but there are no missing values.


In [29]:
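# Target variable
target = "Target_Churn"
# Note: X still contains Customer_ID, a row identifier rather than a
# customer attribute; it is passed on as a numeric feature below.
X = df.drop(columns=[target])
y = df[target]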

In [30]:
y

0       True
1      False
2       True
3       True
4      False
       ...  
995    False
996     True
997    False
998     True
999     True
Name: Target_Churn, Length: 1000, dtype: bool

In [32]:
### To apply both One-Hot and Label Encoding, we split the columns by type and then check their cardinality.

num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
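
One thing to watch with this dtype-based split: a boolean column matches neither `int64`/`float64` nor `object`. If `Email_Opt_In` was parsed as `bool` (the `True`/`False` values in `df.head()` suggest so, although the dtype listing shown earlier does not include that column), it lands in neither list and is later discarded by `remainder='drop'`. The small frame below reuses three rows from `df.head()` purely for illustration; the `*_toy` names and the suggested fix are assumptions, not the notebook's choice.

```python
import pandas as pd

# Three rows taken from df.head(); only the dtypes matter here.
toy = pd.DataFrame({
    "Age": [62, 65, 18],
    "Annual_Income": [45.15, 79.51, 29.19],
    # dtype pinned to object to match the notebook's dtype listing
    "Gender": pd.Series(["Other", "Male", "Male"], dtype="object"),
    "Email_Opt_In": [True, False, False],
})

num_cols_toy = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols_toy = toy.select_dtypes(include=["object"]).columns.tolist()

# Columns that neither selector picked up.
unassigned = [c for c in toy.columns if c not in num_cols_toy + cat_cols_toy]
print("unassigned:", unassigned)

# One possible fix (an assumption): treat booleans as numeric 0/1 features.
num_cols_with_bool = toy.select_dtypes(include=["number", "bool"]).columns.tolist()
print("numeric incl. bool:", num_cols_with_bool)
```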

In [33]:
# Split the categorical columns by cardinality
low_card_cols = [c for c in cat_cols if X[c].nunique() <= 10]   # for One-Hot
high_card_cols = [c for c in cat_cols if X[c].nunique() > 10]   # for Label Encoding

In [34]:
print("Columns to one-hot encode:", low_card_cols)
print("Columns to label encode:", high_card_cols)

Columns to one-hot encode: ['Gender', 'Promotion_Response']
Columns to label encode: []
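
No column exceeds the cardinality threshold in this dataset, so the high-cardinality branch is never exercised. For reference, here is a minimal sketch of how such a column could be encoded with scikit-learn's `OrdinalEncoder`, the feature-side counterpart of `LabelEncoder` (which is intended for target labels). The `City` column and its values are hypothetical and do not appear in the churn dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical high-cardinality column; not part of the churn dataset.
train = pd.DataFrame({"City": ["Izmir", "Ankara", "Bursa", "Ankara"]})
new = pd.DataFrame({"City": ["Bursa", "Adana"]})  # "Adana" was never seen in training

# Categories are sorted and mapped to 0..n-1; unseen categories
# get the sentinel -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_new = encoder.transform(new)
print(encoder.categories_[0].tolist())
print(encoded_new.ravel().tolist())
```

Keep in mind that integer codes impose an arbitrary order on the categories, so whether this encoding suits a given model is a separate decision.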


In [41]:
## The Label Encoding list is therefore empty; only One-Hot encoding and numeric preprocessing will run.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [43]:
from sklearn.compose import ColumnTransformer

# ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ],
    remainder='drop'
)


In [45]:
from sklearn.naive_bayes import GaussianNB

nb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', GaussianNB())
])

In [46]:
nb_pipeline

### Model Training (Classification)

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
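
Passing `stratify=y` keeps the class proportions the same in the training and test sets. A small sketch on synthetic labels (60% positive, made up for illustration; the `*_demo` names are not from the notebook) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 positives and 40 negatives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([True] * 60 + [False] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the 60/40 ratio of the full data.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```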


In [50]:
nb_pipeline.fit(X_train, y_train)

In [51]:
y_pred = nb_pipeline.predict(X_test)
y_proba = nb_pipeline.predict_proba(X_test)[:,1]

In [53]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix, roc_curve
import matplotlib.pyplot as plt


print("\n--- Naive Bayes (GaussianNB) Results ---")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


--- Naive Bayes (GaussianNB) Results ---
Accuracy: 0.475
Precision: 0.5
Recall: 0.6190476190476191
F1: 0.5531914893617021
ROC-AUC: 0.46967418546365913

Confusion Matrix:
 [[30 65]
 [40 65]]

Classification Report:
               precision    recall  f1-score   support

       False       0.43      0.32      0.36        95
        True       0.50      0.62      0.55       105

    accuracy                           0.47       200
   macro avg       0.46      0.47      0.46       200
weighted avg       0.47      0.47      0.46       200
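
Each headline metric can be recovered from the confusion matrix alone, which makes for a useful sanity check. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with rows as actual classes and columns as predicted classes, so the reported values can be reproduced as follows:

```python
# Confusion matrix from the output: rows = actual, columns = predicted.
tn, fp = 30, 65
fn, tp = 40, 65

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted churners, how many churned
recall = tp / (tp + fn)      # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```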



### Comments

In [55]:
### Accuracy: 0.475 → The model classifies 47.5% of the test customers correctly.

### Precision (True): 0.50 → When the model predicts churn, it is right 50% of the time.

### Recall (True): 0.62 → It catches 62% of the customers who actually churn.

### F1 (True): 0.55 → The harmonic mean of Precision and Recall.

### ROC-AUC: 0.47 → Close to random guessing (0.5); the model does not separate churners from non-churners.
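
To put the accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The class counts below come from the `support` column of the classification report; the comparison itself is our addition, not part of the original notebook.

```python
# Test-set class counts from the classification report's support column.
support = {"False": 95, "True": 105}
model_accuracy = 0.475  # reported by accuracy_score

# A classifier that always predicts the most frequent class
# is right exactly as often as that class occurs.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(f"always predict {majority_class}: accuracy = {baseline_accuracy:.3f}")
print(f"GaussianNB beats baseline: {model_accuracy > baseline_accuracy}")
```

The baseline reaches 0.525, so the GaussianNB pipeline as configured does worse than always predicting churn, which is consistent with the ROC-AUC of 0.47.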