# Problem: 

**Şirketi terk edecek müşterileri tahmin edebilecek bir makine öğrenmesi modeli geliştirebilir misiniz?**

- Amaç bir bankanın müşterilerinin bankayı terk etme ya da terk etmeme durumunun tahmin edilmesidir.

- Müşteri terkini tanımlayan olay müşterinin banka hesabını kapatmasıdır.

**Veri Seti Hikayesi:**

- 10000 gözlemden ve 12 değişkenden oluşmaktadır. 
- Bağımsız değişkenler müşterilere ilişkin bilgiler barındırmaktadır.
- Bağımlı değişken müşteri terk durumunu ifade etmektedir.

**Değişkenler:**

- Surname : Soy isim
- CreditScore : Kredi skoru
- Geography : Ülke (Germany/France/Spain)
- Gender : Cinsiyet (Female/Male)
- Age : Yaş
- Tenure : Kaç yıllık müşteri
- Balance : Bakiye(In banking, the account balance is the amount of money you have available in your checking or savings)
- NumOfProducts : Kullanılan banka ürünü
- HasCrCard : Kredi kartı durumu (0=No,1=Yes)
- IsActiveMember : Aktif üyelik durumu (0=No,1=Yes)
- EstimatedSalary : Tahmini maaş
- Exited : Terk mi değil mi? (0=No,1=Yes)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression  
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, GridSearchCV

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 
warnings.filterwarnings("ignore", category=UserWarning) 
np.seterr(divide='ignore', invalid='ignore')

%config InlineBackend.figure_format = 'retina'

# to display all columns and rows:
pd.set_option('display.max_columns', None); pd.set_option('display.max_rows', None);

In [2]:
df = pd.read_csv(r'C:\Users\Sadullah\data_science\8. Hafta\ödev_churn\churn.csv')
df=df.copy()

In [3]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [4]:
df=df.drop(["RowNumber","CustomerId","Surname"],axis=1)

In [5]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [6]:
df["Exited"].value_counts()*100/len(df)

0    79.63
1    20.37
Name: Exited, dtype: float64

In [7]:
df["Geography"].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [8]:
df=pd.get_dummies(df, columns=["Geography"],drop_first=True)

In [9]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [10]:
df["Gender"]=le.fit_transform(df["Gender"])

In [11]:
df["Gender"].value_counts()

1    5457
0    4543
Name: Gender, dtype: int64

In [12]:
df.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain
0,619,0,42,2,0.0,1,1,1,101348.88,1,0,0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0,1
2,502,0,42,8,159660.8,3,1,0,113931.57,1,0,0
3,699,0,39,1,0.0,2,0,0,93826.63,0,0,0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0,1


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Gender             10000 non-null  int32  
 2   Age                10000 non-null  int64  
 3   Tenure             10000 non-null  int64  
 4   Balance            10000 non-null  float64
 5   NumOfProducts      10000 non-null  int64  
 6   HasCrCard          10000 non-null  int64  
 7   IsActiveMember     10000 non-null  int64  
 8   EstimatedSalary    10000 non-null  float64
 9   Exited             10000 non-null  int64  
 10  Geography_Germany  10000 non-null  uint8  
 11  Geography_Spain    10000 non-null  uint8  
dtypes: float64(2), int32(1), int64(7), uint8(2)
memory usage: 761.8 KB


In [14]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVC', SVC(gamma='auto')))
models.append(("XGBoost", XGBClassifier()))
models.append(("LGBM", LGBMClassifier()))

In [15]:
X = df.drop("Exited",axis=1)
y = df["Exited"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=12345)


for name, model in models:
    
        mod = model.fit(X_train,y_train)
        y_pred = mod.predict(X_test)
        res = accuracy_score(y_test,y_pred)
        print(name+": "+str(res))

LR: 0.783
KNN: 0.7565
CART: 0.7885
RF: 0.854
SVC: 0.7865
XGBoost: 0.8465
LGBM: 0.8515


Hiç dokunulmamış değerler

## Preprocessing

In [16]:
df.isnull().sum()

CreditScore          0
Gender               0
Age                  0
Tenure               0
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
Geography_Germany    0
Geography_Spain      0
dtype: int64

In [17]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CreditScore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
Gender,10000.0,0.5457,0.497932,0.0,0.0,1.0,1.0,1.0
Age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
Tenure,10000.0,5.0128,2.892174,0.0,3.0,5.0,7.0,10.0
Balance,10000.0,76485.889288,62397.405202,0.0,0.0,97198.54,127644.24,250898.09
NumOfProducts,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
HasCrCard,10000.0,0.7055,0.45584,0.0,0.0,1.0,1.0,1.0
IsActiveMember,10000.0,0.5151,0.499797,0.0,0.0,1.0,1.0,1.0
EstimatedSalary,10000.0,100090.239881,57510.492818,11.58,51002.11,100193.915,149388.2475,199992.48
Exited,10000.0,0.2037,0.402769,0.0,0.0,0.0,0.0,1.0


In [18]:
df["BalanceToCS"] = df["Balance"]/df["CreditScore"]
df["BalanceToAge"] = df["Balance"]/df["Age"]
df["BalanceToTenure"] = df["Balance"]/df["Tenure"]
df["TenureToCS"] = df["Tenure"]/df["CreditScore"]
df["Salary"] = df["EstimatedSalary"]/12
df["EstimatedToCS"] = df["EstimatedSalary"]/df["CreditScore"]

In [19]:
df.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,BalanceToCS,BalanceToAge,BalanceToTenure,TenureToCS,Salary,EstimatedToCS
0,619,0,42,2,0.0,1,1,1,101348.88,1,0,0,0.0,0.0,0.0,0.003231,8445.74,163.730016
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0,1,137.841875,2044.094146,83807.86,0.001645,9378.548333,185.102928
2,502,0,42,8,159660.8,3,1,0,113931.57,1,0,0,318.049402,3801.447619,19957.6,0.015936,9494.2975,226.955319
3,699,0,39,1,0.0,2,0,0,93826.63,0,0,0,0.0,0.0,0.0,0.001431,7818.885833,134.2298
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0,1,147.659788,2918.856279,62755.41,0.002353,6590.341667,93.040118


In [20]:
df.isnull().sum()

CreditScore            0
Gender                 0
Age                    0
Tenure                 0
Balance                0
NumOfProducts          0
HasCrCard              0
IsActiveMember         0
EstimatedSalary        0
Exited                 0
Geography_Germany      0
Geography_Spain        0
BalanceToCS            0
BalanceToAge           0
BalanceToTenure      137
TenureToCS             0
Salary                 0
EstimatedToCS          0
dtype: int64

In [21]:
df.dropna(inplace=True)

In [22]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor

In [23]:
X = df.drop(["Exited"],axis = 1)
y = df["Exited"]

In [24]:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()

In [25]:
def select_features(X,y):
    # numerik olmayan degiskenlerin silinmesi
    X = X.select_dtypes([np.number]).dropna(axis=1)
    
    clf = RandomForestRegressor(random_state=46)
    clf.fit(X, y)
    
    selector = RFECV(clf,cv=10)
    selector.fit(X, y)
    
    features = pd.DataFrame()
    features['Feature'] = X.columns
    features['Importance'] = clf.feature_importances_
    features.sort_values(by=['Importance'], ascending=False, inplace=True)
    features.set_index('Feature', inplace=True)
    features.plot(kind='bar', figsize=(12, 5))
    
    
    best_columns = list(X.columns[selector.support_])
    print("Best Columns \n"+"-"*12+"\n{}\n".format(best_columns))
    
    return best_columns

In [26]:
#best_features = select_features(X,y)
#best_features

## Outliers

# Local Outlier Factor Yöntemi

In [27]:
num_features = list(df.select_dtypes(['int64','float64']).columns)

In [28]:
from sklearn.neighbors import LocalOutlierFactor

clf=LocalOutlierFactor(n_neighbors=20, contamination=0.1)
clf.fit_predict(df[num_features])
df_scores=clf.negative_outlier_factor_
df_scores= np.sort(df_scores)
df_scores[0:20]

array([-34.61814452, -31.70149714, -26.24924626, -24.94533876,
       -24.651736  , -17.77286965, -12.51052035,  -7.32454184,
        -3.10290731,  -2.21272363,  -1.99196382,  -1.89867408,
        -1.87240539,  -1.81305591,  -1.76904515,  -1.71780963,
        -1.71039718,  -1.70469615,  -1.70140671,  -1.69472298])

In [29]:
threshold=np.sort(df_scores)[5]
print(threshold)
df = df.loc[df_scores > threshold]
df = df.reset_index(drop=True)

-17.772869654980553


In [30]:
df.shape

(9581, 18)

# BoxPlot Yöntemi

In [31]:
for feature in df:
    
    Q1 = df[feature].quantile(0.10)
    Q3 = df[feature].quantile(0.90)
    IQR = Q3-Q1
    lower = Q1- 1.5*IQR
    upper = Q3 + 1.5*IQR
    
    if df[(df[feature] > upper) | (df[feature] < lower)].any(axis=None):
        print(feature,"yes")
        print(df[(df[feature] > upper) | (df[feature] < lower)].shape[0])
    else:
        print(feature, "no")

CreditScore no
Gender no
Age no
Tenure no
Balance no
NumOfProducts yes
60
HasCrCard no
IsActiveMember no
EstimatedSalary no
Exited no
Geography_Germany no
Geography_Spain no
BalanceToCS no
BalanceToAge no
BalanceToTenure yes
58
TenureToCS no
Salary no
EstimatedToCS no


## Data Scaling

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9581 entries, 0 to 9580
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        9581 non-null   int64  
 1   Gender             9581 non-null   int32  
 2   Age                9581 non-null   int64  
 3   Tenure             9581 non-null   int64  
 4   Balance            9581 non-null   float64
 5   NumOfProducts      9581 non-null   int64  
 6   HasCrCard          9581 non-null   int64  
 7   IsActiveMember     9581 non-null   int64  
 8   EstimatedSalary    9581 non-null   float64
 9   Exited             9581 non-null   int64  
 10  Geography_Germany  9581 non-null   uint8  
 11  Geography_Spain    9581 non-null   uint8  
 12  BalanceToCS        9581 non-null   float64
 13  BalanceToAge       9581 non-null   float64
 14  BalanceToTenure    9581 non-null   float64
 15  TenureToCS         9581 non-null   float64
 16  Salary             9581 

In [33]:
#df["Gender"]=df["Gender"].astype("object")
#df["IsActiveMember"]=df["IsActiveMember"].astype("object")
#df["HasCrCard"]=df["HasCrCard"].astype("object")
#df["Exited"]=df["Exited"].astype("object")

In [34]:
num_features = list(df.select_dtypes(['int64','float64']).columns)

In [None]:
#num_features

In [36]:
y = df["Exited"]
X = df.drop(["Exited"], axis = 1)

In [37]:
from sklearn.preprocessing import RobustScaler
rs = RobustScaler().fit(X)
df[num_features] = rs.fit_transform(df[num_features])

In [38]:
df.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,BalanceToCS,BalanceToAge,BalanceToTenure,TenureToCS,Salary,EstimatedToCS
0,1.259259,1,1.083333,0.4,-0.760543,1.0,0.0,0.0,-0.914347,0.0,0,0,-0.722881,-0.669286,-0.5139,0.069684,-0.914347,-0.935934
1,-2.044444,0,-0.666667,-0.2,0.141282,3.0,0.0,-1.0,0.197654,1.0,1,0,0.811046,0.48097,0.48202,0.356421,0.197654,1.074845
2,-1.118519,1,0.583333,-0.2,0.352963,1.0,-1.0,0.0,-0.254196,0.0,0,0,0.698548,0.266788,0.715787,-0.002157,-0.254196,-0.030995
3,0.237037,1,-0.833333,-0.6,0.294586,0.0,0.0,0.0,-0.286907,0.0,0,0,0.26367,0.776194,1.816538,-0.685745,-0.286907,-0.325655
4,-0.918519,1,-0.5,0.2,0.039143,1.0,-1.0,-1.0,-0.200871,0.0,0,0,0.245744,0.284889,0.074849,0.45441,-0.200871,-0.015996


In [39]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVC', SVC(gamma='auto')))
models.append(("XGBoost", XGBClassifier()))
models.append(("LGBM", LGBMClassifier()))

In [40]:
X = df.drop("Exited",axis=1)
y = df["Exited"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=12345)


for name, model in models:
    
        mod = model.fit(X_train,y_train)
        y_pred = mod.predict(X_test)
        res = accuracy_score(y_test,y_pred)
        print(name+": "+str(res))

LR: 0.809076682316119
KNN: 0.8215962441314554
CART: 0.7949921752738655
RF: 0.8612415232133542
SVC: 0.844548774126239
XGBoost: 0.8633281168492436
LGBM: 0.8633281168492436


In [None]:
LR: 0.809076682316119
KNN: 0.8215962441314554
CART: 0.7949921752738655
RF: 0.8612415232133542
SVC: 0.844548774126239
XGBoost: 0.8633281168492436
LGBM: 0.8633281168492436

# Smote Yöntemi

In [None]:
from imblearn.over_sampling import SMOTE

In [60]:
X = df.drop(["Exited"],axis = 1)
y = df["Exited"]

In [61]:
rs= RobustScaler().fit(X)
X_Sc=rs.transform(X)

In [62]:
training_features, test_features, \
training_target, test_target, = train_test_split(df.drop(['Exited'], axis=1),
                                               df['Exited'],
                                               test_size = .2,
                                               random_state=12)

In [63]:
sm = SMOTE(random_state=12)
X_res, y_res = sm.fit_sample(training_features, training_target)

In [64]:
X_train_res, X_val_res, y_train_res, y_val_res = train_test_split(X_res,
                                                    y_res,
                                                    test_size = .2,
                                                    random_state=12)

In [65]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12345)
clf_rf.fit(X_train_res, y_train_res)
clf_rf.score(X_val_res, y_val_res)

0.9047424366312347

In [66]:
X_train, X_val, y_train, y_val = train_test_split(training_features, training_target,
                                                  test_size = .2,
                                                  random_state=12345)

In [67]:
sm = SMOTE(random_state=12345)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)

In [68]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12345)
clf_rf.fit(X_train_res, y_train_res)

RandomForestClassifier(n_estimators=25, random_state=12345)

In [69]:
print ('Validation Results')
print (clf_rf.score(X_val, y_val))

Validation Results
0.8427919112850619


In [70]:
print ('Test Results')
print (clf_rf.score(test_features, test_target))

Test Results
0.8304642670839854
