![Churn](https://www.subscreasy.com/wp-content/uploads/2018/07/Abonelikte-Churn-Nedir-Neden-Onemlidir.png)

# What is churn?

Churn is the ratio of the customers you lost in a given period to your total number of customers. For example, if you have 1000 customers and 10 of them cancel their subscriptions in a month, your churn rate for that month is 10/1000, or 1%.
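
As a minimal sketch of the calculation above, the churn rate is the number of customers lost divided by the total number of customers. The function name `churn_rate` is illustrative and does not come from the dataset or any library.

```python
def churn_rate(customers_lost: int, total_customers: int) -> float:
    """Return churn as a fraction of the customer base for one period."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    return customers_lost / total_customers


# The example from the text: 10 cancellations out of 1000 customers.
rate = churn_rate(10, 1000)
print(f"{rate:.1%}")  # prints 1.0%
```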

Suppose you gain 100 new customers each month and your churn rate is 10%. While your customer base is small, it keeps growing. But if the rate is still 10% when you reach 1000 customers, you are losing 100 customers every month. The 100 new customers per month only replace the ones who leave, so growth stops.
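
The growth scenario above can be sketched as a small simulation. It assumes that each month the churn rate is applied to the existing customers first and the new customers are added afterwards. The function name and the 120-month horizon are illustrative choices.

```python
def simulate_customers(new_per_month: int, churn: float, months: int) -> list[float]:
    """Track the customer count when a fixed number join and a fixed share leave each month."""
    counts = []
    customers = 0.0
    for _ in range(months):
        # Existing customers shrink by the churn rate, then the new ones arrive.
        customers = customers * (1 - churn) + new_per_month
        counts.append(customers)
    return counts


counts = simulate_customers(new_per_month=100, churn=0.10, months=120)
equilibrium = 100 / 0.10  # the count at which monthly losses equal monthly gains
print(round(counts[0]), round(counts[11]), round(counts[-1]), round(equilibrium))
```

Under these assumptions the customer count rises quickly at first and then flattens out as it approaches 1000.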

At this point, the next step is to identify why you are losing customers. Churn may be due to poor service quality, or a competitor may have started selling the same service on more attractive subscription terms.

# Meaning of terms
* **CreditScore:** A score derived from a person's credit history and various risk factors. It is the first criterion that banks check in a loan application and the most important criterion in loan evaluation. Scores range from 300 to 850.
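* **Tenure:** How long the customer has been with the bank (in this dataset the values appear to be whole years). As a general metric, how long you can expect someone to remain a customer is estimated as 1 / churn rate.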
* **Balance:** The amount in the account.
* **EstimatedSalary:** The customer's estimated salary.
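
The 1 / churn relationship mentioned under Tenure can be sketched as follows. It assumes a constant churn rate per period. The function name `expected_lifetime` is illustrative.

```python
def expected_lifetime(churn: float) -> float:
    """Expected number of periods a customer stays, assuming a constant churn rate."""
    if not 0 < churn <= 1:
        raise ValueError("churn must be in (0, 1]")
    return 1 / churn


# With a monthly churn rate of 1%, a customer is expected to stay about 100 months.
print(expected_lifetime(0.01))
```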

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
df = pd.read_csv('../input/churn-modelling/Churn_Modelling.csv')

In [None]:
df

In [None]:
df.info()

# Visualizations

In [None]:
fig = px.box(df, y="Age")
fig.show()

In [None]:
plt.figure(figsize=(15,15))
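# Restrict the correlation to numeric columns; text columns such as Surname cannot be correlated
sns.heatmap(df.corr(numeric_only=True), annot=True)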

**Age feature enhancement**

In [None]:
age_labels = ['18-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
# Use right-closed bins (the default) so that the intervals match the labels, e.g. (20, 30] -> '21-30'
Age_group = pd.cut(df['Age'], range(10, 101, 10), labels=age_labels)
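
How `pd.cut` assigns an age that sits exactly on a bin edge depends on the `right` argument. The sketch below uses a few toy ages (not taken from the dataset) to show the difference for the same bin edges and labels.

```python
import pandas as pd

age_labels = ['18-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
bins = list(range(10, 101, 10))  # edges 10, 20, ..., 100

ages = pd.Series([18, 20, 21, 30, 31])  # toy values for illustration

# right=True (the default): intervals are (10, 20], (20, 30], ... so 20 falls in '18-20'.
closed_right = pd.cut(ages, bins, labels=age_labels)
# right=False: intervals are [10, 20), [20, 30), ... so 20 falls in '21-30'.
closed_left = pd.cut(ages, bins, right=False, labels=age_labels)

print(pd.DataFrame({'age': ages, 'right=True': closed_right, 'right=False': closed_left}))
```

With labels written as '21-30', '31-40' and so on, the right-closed intervals are the ones that agree with the label text.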

In [None]:
df.groupby(Age_group)['EstimatedSalary'].mean().plot(kind='bar',stacked=True)
plt.title("Mean Estimated Salary by Age Group",fontsize=14)
plt.ylabel('Estimated Salary')
plt.xlabel('Age Group');

In [None]:
# The mean of the 0/1 Exited column is the churn rate within each age group
df.groupby(Age_group)['Exited'].mean().plot(kind='bar',stacked=True)
plt.title("Churn Rate by Age Group",fontsize=14)
plt.ylabel('Churn Rate (mean of Exited)')
plt.xlabel('Age Group')

In [None]:
# catplot is a figure-level function and creates its own figure, so size it with height/aspect
sns.catplot(x="Geography", y="EstimatedSalary", hue="Gender", kind="box", data=df, height=8, aspect=1.2)
plt.title("Geography VS Estimated Salary")
plt.xlabel("Geography")
plt.ylabel("Estimated Salary")

In [None]:
fig = px.box(df, x="Age", y="Geography", notched=True)
fig.show()


In [None]:
fig = px.parallel_categories(df, dimensions=['HasCrCard', 'IsActiveMember'],
                 color_continuous_scale=px.colors.sequential.Inferno,
                labels={'HasCrCard':'Credit Card Holder', 'IsActiveMember':'Activity Status'})
fig.show()

In [None]:
fig = px.parallel_categories(df, dimensions=['HasCrCard', 'Gender','IsActiveMember'],
                 color_continuous_scale=px.colors.sequential.Inferno,
                labels={'Gender':'Gender', 'HasCrCard':'Credit Card Holder', 'IsActiveMember':'Activity Status'})
fig.show()


In [None]:
fig = px.parallel_categories(df, dimensions=['IsActiveMember', 'Exited',],
                 color_continuous_scale=px.colors.sequential.Inferno,
                labels={'IsActiveMember':'Activity Status', 'Exited':'Exited Members',})
fig.show()


**Distributions**

In [None]:
fig = plt.figure(figsize=(8,8))
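# distplot is deprecated in seaborn; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df.CreditScore, kde=True, color="orange", label="CreditScore")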
plt.legend();

In [None]:
fig = plt.figure(figsize=(8,8))
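# distplot is deprecated in seaborn; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df.Balance, kde=True, color="red", label="Balance")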
plt.legend();

In [None]:
fig = plt.figure(figsize=(8,8))
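# distplot is deprecated in seaborn; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df.EstimatedSalary, kde=True, color="blue", label="Estimated Salary")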
plt.legend();

## Feature Editing

**Drop unnecessary columns for training**

In [None]:
# Identifier columns carry no predictive information, so drop them in one call
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

**There are only three countries**

In [None]:
df.Geography.unique()

**One-hot-encoding Gender and Geography**

In [None]:
# get_dummies on a single Series needs no `columns` argument; one 0/1 column is created per category
df_geo = pd.get_dummies(df['Geography'], dtype='int64')
df_gender = pd.get_dummies(df['Gender'], dtype='int64')

In [None]:
df = df.join(df_geo)
df = df.join(df_gender)

In [None]:
df.drop('Geography', axis = 1, inplace = True)
df.drop('Gender', axis = 1, inplace = True)
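
The encode, join and drop steps above can also be written as one `pd.get_dummies` call on the whole frame. The sketch below uses a toy frame with hypothetical rows. Note that this variant prefixes the new columns (for example `Geography_France`), unlike the plain `France` and `Female` column names produced above.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same two categorical columns.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 35, 51, 29],
})

# Encodes the listed columns and removes the originals in one step.
encoded = pd.get_dummies(toy, columns=['Geography', 'Gender'], dtype='int64')
print(encoded.columns.tolist())
```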

**More than three thousand rows have a "Balance" of 0. These zeros are treated as missing values and imputed so that the "Balance" distribution is closer to normal.**

In [None]:
df["Balance"] = df["Balance"].replace(0, np.nan)
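
As a small illustration of the step above, the sketch below counts the zeros in a toy balance column and then marks them as missing. The values are made up for the example.

```python
import numpy as np
import pandas as pd

balance = pd.Series([0.0, 83807.86, 0.0, 125510.82])  # toy values for illustration

n_zero = int((balance == 0).sum())       # how many rows hold exactly 0
as_missing = balance.replace(0, np.nan)  # zeros become NaN, other values are unchanged

print(n_zero, int(as_missing.isna().sum()))
```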

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

In [None]:
df.info()

**By default, the iterative imputer uses a Bayesian ridge regressor. Here a KNN regressor is used instead, with the aim of getting a normal distribution for the imputed values.**

In [None]:
imp = IterativeImputer(KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='kd_tree'))
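
To see what the imputer does, here is a minimal sketch on a toy array with missing entries in the second column. The data, `n_neighbors=2`, `max_iter=10` and `random_state=0` are assumptions chosen for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

# Toy data (hypothetical): the second column has two missing entries.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, np.nan],
    [6.0, 60.0],
])

imputer = IterativeImputer(
    estimator=KNeighborsRegressor(n_neighbors=2, weights="distance"),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)  # NaNs are replaced, observed values are kept
print(X_filled)
```

Each missing value is predicted from the other column using the rows where the value is observed, so the filled entries depend on the nearest neighbours in that column.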

In [None]:
feature_names = df.columns  # fit_transform returns a NumPy array, so keep the column names
imputed = imp.fit_transform(df)

In [None]:
# Rebuild the DataFrame from the imputed array; a second transform call is not needed
df = pd.DataFrame(data=imputed, columns=feature_names)

In [None]:
fig = plt.figure(figsize=(8,8))
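# distplot is deprecated in seaborn; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df.Balance, kde=True, color="red", label="Balance")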
plt.legend();

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
columns = ['CreditScore', 'Balance', 'EstimatedSalary']
# MinMaxScaler rescales each column independently to the range [0, 1]
df[columns] = scaler.fit_transform(df[columns])
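
Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below applies that formula by hand with NumPy to a few toy credit scores. The helper name `min_max_scale` is illustrative.

```python
import numpy as np


def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no spread; map everything to 0.
        return np.zeros_like(values, dtype=float)
    return (values - lo) / (hi - lo)


scores = np.array([350.0, 600.0, 850.0])  # toy credit scores
print(min_max_scale(scores))  # 0, 0.5 and 1
```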

In [None]:
exited_df = df['Exited']
df.drop('Exited', axis = 1, inplace = True)
df = df.join(exited_df)

**Correlation matrix after feature editing**

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot=True)

# Random Forest Classification

In [None]:
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1].astype('float')

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state=42)
len(y_train), len(y_val)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Fix the seed so that the score is reproducible between runs
model = RandomForestClassifier(random_state=42)

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
# scikit-learn metrics expect the true labels first, then the predictions
accuracy_score(y_val, y_preds)
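
Accuracy is the share of predictions that match the true labels. When one class is much more common, it helps to compare the score with a baseline that always predicts the majority class. The sketch below does this with NumPy on made-up labels and predictions, not on the notebook's results.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])  # toy labels: 20% churned
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])  # toy predictions

accuracy = float(np.mean(y_true == y_pred))

# Baseline: always predict the majority class.
majority = int(np.mean(y_true) >= 0.5)
baseline = float(np.mean(y_true == majority))

print(accuracy, baseline)
```

In this toy case the model and the baseline reach the same accuracy, which shows why the score should be read alongside the class balance.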

# PCA with K-Means

**Decide how many features we’d like to keep based on the cumulative variance plot.**

In [None]:
from sklearn.decomposition import PCA

pca = PCA()

pca.fit(df)

In [None]:
pca.explained_variance_ratio_

In [None]:
plt.figure(figsize = (8,8))
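# Derive the x-axis from the number of components instead of hard-coding 14
n_comp = len(pca.explained_variance_ratio_)
plt.plot(range(1, n_comp + 1), pca.explained_variance_ratio_.cumsum(), marker= 'o', linestyle= '--')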
plt.title('Explained Variance by Component')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
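
One way to read a cumulative variance plot like the one above is to pick the smallest number of components that reaches a target share of the variance. The sketch below uses hypothetical variance ratios and an assumed 90% threshold. Neither comes from the notebook.

```python
import numpy as np

# Hypothetical explained-variance ratios, in decreasing order as PCA returns them.
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
cumulative = ratios.cumsum()

threshold = 0.90  # assumed target share of variance to keep
# argmax returns the first index where the condition is True.
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)  # 3
```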

In [None]:
pca = PCA(n_components=2)

principalComponents = pca.fit_transform(df)

principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])

**Decide how many clusters to use by testing several clustering solutions.**

In [None]:
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares for each number of clusters
for i in range (1, 15):
    kmeans_pca = KMeans(n_clusters=i, init= 'k-means++', random_state = 42)
    kmeans_pca.fit(principalDf)
    wcss.append(kmeans_pca.inertia_)

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,15), wcss, marker= 'o', linestyle= '--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-Means with PCA')
plt.show()
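
WCSS, the quantity plotted above, is the sum of squared distances from each point to the mean of its cluster. The sketch below computes it by hand for four toy points, to show how it drops when the clustering matches the structure of the data. The helper name `wcss_of` is illustrative.

```python
import numpy as np


def wcss_of(points: np.ndarray, labels: np.ndarray) -> float:
    """Within-cluster sum of squares: squared distance of each point to its cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])  # toy data

one_cluster = wcss_of(points, np.array([0, 0, 0, 0]))
two_clusters = wcss_of(points, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)  # 104.0 4.0
```

The elbow in the WCSS curve marks the point where adding another cluster no longer reduces this value by much.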

In [None]:
kmeans_pca = KMeans(n_clusters= 2, init = 'k-means++', random_state=42)
kmeans_pca.fit(principalDf)

In [None]:
principalDf['KmeansPredict'] = kmeans_pca.labels_
principalDf

**PCA with K-Means comparison**

In [None]:
x_axis = principalDf['principal component 1']
y_axis = principalDf['principal component 2']
plt.figure(figsize= (8,8))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=x_axis, y=y_axis, hue= principalDf['KmeansPredict'], palette=['r', 'b'])
plt.show()

In [None]:
x_axis = principalDf['principal component 1']
y_axis = principalDf['principal component 2']
plt.figure(figsize= (8,8))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=x_axis, y=y_axis, hue= df['Exited'], palette=['r', 'b'])
plt.show()

# NN Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
from tensorflow.keras import Sequential

**The orthogonal initializer is used with the aim of speeding up training when a larger batch size is used.**

In [None]:
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='orthogonal', activation='softplus'))
model.add(Dense(8, kernel_initializer='orthogonal', activation='softplus'))
model.add(Dense(4, kernel_initializer='orthogonal', activation='softplus'))
model.add(Dense(1, kernel_initializer='orthogonal', activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
model.fit(X_train, y_train, epochs=250, batch_size= 200)

In [None]:
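y_preds = model.predict(X_val)
# Threshold the sigmoid outputs at 0.5 and flatten them to a 1-D array of 0/1 labels
y_preds = (y_preds > 0.5).astype(int).ravel()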

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy_score(y_val, y_preds)
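
The network's sigmoid output is a probability-like value per row, with shape `(n, 1)`. The sketch below shows on toy values how thresholding at 0.5 turns those outputs into class labels. The numbers are made up for the example.

```python
import numpy as np

probs = np.array([[0.12], [0.50], [0.73], [0.49]])  # toy sigmoid outputs, shape (n, 1)

# Strictly greater than 0.5 counts as churn; flatten to 1-D integer labels.
labels = (probs > 0.5).astype(int).ravel()
print(labels)  # [0 0 1 0]
```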