# <font color = 'red'> Bank customer churn analysis and prediction</font>

#### This dataset labels each customer as churned or not churned; our goal is to predict which customers will churn.
#### We use the "Customer-Churn-Records.csv" dataset. Below you will find an analysis of the data, the processing of that data, and the use of machine learning classification models to achieve our goal.

<font color = 'blue'>**1. Importing libraries and taking a first look at the data**</font>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('datasets/Customer-Churn-Records.csv')
df.head()

#### Columns description
- RowNumber: corresponds to the record (row) number and has no effect on the output.
- CustomerId: contains random values and has no effect on customer leaving the bank.
- Surname: the surname of a customer has no impact on their decision to leave the bank.
- CreditScore: can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
- Geography: a customer’s location can affect their decision to leave the bank.
- Gender: it’s interesting to explore whether gender plays a role in a customer leaving the bank.
- Age: this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
- Tenure: refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
- Balance: also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
- NumOfProducts: refers to the number of products that a customer has purchased through the bank.
- HasCrCard: denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
- IsActiveMember: active customers are less likely to leave the bank.
- EstimatedSalary: as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
- Exited: whether or not the customer left the bank.
- Complain: whether or not the customer has filed a complaint.
- Satisfaction Score: score provided by the customer for their complaint resolution.
- Card Type: type of card held by the customer.
- Points Earned: the points earned by the customer for using their credit card.

In [3]:
#checking for null values
df.isnull().sum()

<font color='blue'> There are no columns with missing values </font>

In [4]:
df.info()

<font color='blue'> The dataset has 18 columns and 10000 non-null rows </font>

In [5]:
df.nunique()

- <font color= 'blue'>The output above shows that most of the columns have only a few unique values and are likely categorical variables; they are analysed further below.</font>

In [6]:
#check for presence of duplicates
df.loc[df.duplicated()]

- no duplicates detected

### 2. Data Processing

In [7]:
# drop columns that carry no useful information for the modelling, i.e. RowNumber, CustomerId, Surname
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

In [8]:
df["Exited"].value_counts()/len(df)

- 20% of the customers churn
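With a churn rate of about 20%, a model that always predicts "no churn" already scores about 80% accuracy, so that is a useful reference point for the models that follow. The sketch below illustrates this on synthetic labels (not the real data); the names `y_demo`, `X_demo` and `baseline` are made up for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the same 80/20 split as `Exited` (not the real data)
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.2).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class (0 = stayed)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_pred)
baseline_recall = recall_score(y_demo, baseline_pred)  # share of churners caught
print("baseline accuracy:", baseline_acc)
print("baseline recall on churners:", baseline_recall)
```

The baseline never flags a churner, which is why accuracy alone can be misleading on imbalanced data.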

In [9]:
sns.countplot(x = 'Geography', data= df, hue='Exited' )

In [10]:
# select the column before aggregating so that text columns are not summed
df.groupby('Geography')['Exited'].sum()

In [11]:
#checking categorical variables
output = []
for col in df.columns:
    output.append({'columns': col, 'unique_labels': len(df[col].unique())})

output_df = pd.DataFrame(output)
output_df

In [12]:
full_df = pd.get_dummies(df,drop_first= True, dtype= int)
full_df.head()
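To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values are illustrative and are not rows from the dataset; `toy` and `toy_encoded` are names made up for the example.

```python
import pandas as pd

# Toy frame; the values are illustrative, not rows from the dataset
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 35, 51, 29],
})

# Each text column becomes 0/1 indicator columns; the first category
# (alphabetically) is dropped and acts as the reference level
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Dropping the first level avoids perfectly collinear indicator columns, which matters mostly for linear models such as logistic regression.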

In [13]:
X = full_df.drop(columns=['Exited'])
y = full_df['Exited']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42)
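Because the classes are imbalanced, one optional variation is to pass `stratify=y` so that the train and test sets keep the same churn rate. The sketch below shows the idea on synthetic labels; it is not what the cell above does, and the variable names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 20% positives (stand-in for `Exited`)
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```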

In [15]:
print('X_train.shape:', X_train.shape)
print('X_test.shape:', X_test.shape)
print('y_train.shape:', y_train.shape)
print('y_test.shape:', y_test.shape)


##### Using AdaBoost to train the model

In [16]:
#create adaboost classifier
adaModel = AdaBoostClassifier(n_estimators=50, learning_rate= 1)

In [17]:
#train adaboost classifier
model = adaModel.fit(X_train, y_train)

#predict
y_pred = model.predict(X_test)

In [18]:
print('accuracy:', metrics.accuracy_score(y_test, y_pred))

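- AdaBoost reached an accuracy of 99.85% on the test set. Because this score is measured on held-out data, overfitting alone is an unlikely explanation; it more likely means that one feature almost determines `Exited`, which is worth checking (for example `Complain`).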
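A test accuracy this close to 100% deserves a sanity check for a feature that nearly duplicates the target. The sketch below shows one simple check, the absolute correlation of each feature with the target, on synthetic data. The column names (`noise_feature`, `weak_feature`, `leaky_feature`) are hypothetical and are not columns of the churn dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
target = (rng.random(n) < 0.2).astype(int)

demo = pd.DataFrame({
    "noise_feature": rng.normal(size=n),                      # unrelated to the target
    "weak_feature": target + rng.normal(scale=2.0, size=n),   # weakly related
    "leaky_feature": target.copy(),                           # hypothetical near-copy of the target
    "target": target,
})

# Flip a handful of values so the leaky column is not an exact copy
flip_idx = rng.choice(n, size=5, replace=False)
demo.loc[flip_idx, "leaky_feature"] = 1 - demo.loc[flip_idx, "leaky_feature"]

# Absolute correlation of every feature with the target, largest first
corr_with_target = (
    demo.drop(columns="target")
        .corrwith(demo["target"])
        .abs()
        .sort_values(ascending=False)
)
print(corr_with_target)
```

In this notebook the same check could be run as `full_df.drop(columns='Exited').corrwith(full_df['Exited'])`; a feature with a correlation near 1 would be a candidate to drop before refitting.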

#### Using logistic regression

In [19]:
#create logistic regression classifier
Logreg = LogisticRegression(max_iter= 1000)


In [20]:
#train logistic regression classifier
logmodel = Logreg.fit(X_train, y_train)

#predict
log_pred = logmodel.predict(X_test)

In [21]:
print('accuracy:', metrics.accuracy_score(y_test, log_pred))

- Logistic regression had an accuracy score of 79.9%, which is roughly what always predicting "no churn" would achieve given the 20% churn rate.

### Using logistic regression as the base estimator for the adaboost

In [22]:
#creating the classifier
adaModel_log = AdaBoostClassifier(n_estimators= 150, estimator= Logreg, learning_rate=2)

In [23]:
#train the model
ada_log_model = adaModel_log.fit(X_train, y_train)

#predict
ada_log_pred = ada_log_model.predict(X_test)

In [24]:
print('accuracy:', metrics.accuracy_score(y_test, ada_log_pred))

- AdaBoost with logistic regression as the base estimator had an accuracy score of 90.6%

### Using gradient boosting

In [25]:
gradient_classifier = GradientBoostingClassifier()

#train
gradient_classifier_model = gradient_classifier.fit(X_train, y_train)

#predict
gradient_classifier_pred = gradient_classifier_model.predict(X_test)


In [26]:
print('accuracy:', metrics.accuracy_score(y_test, gradient_classifier_pred))
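Accuracy alone says little about how many churners are actually caught. As a hedged sketch, the example below fits a gradient boosting model on a synthetic dataset with about 20% positives (a stand-in, not the churn data) and prints a confusion matrix and per-class precision and recall. All variable names here are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 20% positives (not the churn data)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)
print(classification_report(y_te, pred, digits=3))
```

In the notebook, the equivalent calls would be `metrics.confusion_matrix(y_test, gradient_classifier_pred)` and `metrics.classification_report(y_test, gradient_classifier_pred)`.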

In [27]:
# plot feature importance
feature_importance = gradient_classifier_model.feature_importances_

# make feature importance relative to the max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])

plt.xlabel('Relative Importance')
plt.ylabel('Features')
plt.title('Feature Importance Plot')
plt.show()


### Hyperparameter tuning

In [None]:
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300],
    #'max_depth': [3, 5, 7]
}

# Perform grid search with 5-fold cross-validation on the training split only,
# so the held-out test set stays unseen during tuning
grid_search = GridSearchCV(gradient_classifier_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding mean cross-validated score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

