# **Hi everyone!**

This is my first public notebook, in which I analyse the [Telco Dataset](https://www.kaggle.com/blastchar/telco-customer-churn). Based on the available data, the goal is to predict customer behavior: whether they will stay with the operator or leave.

References: special thanks to Janio Martinez and his [notebook](https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets)!

# 0. Libraries!

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# plotting
import seaborn as sns 
import matplotlib.pyplot as plt

# data encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from imblearn.pipeline import make_pipeline
from imblearn.pipeline import Pipeline as imb_pipeline

# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# 1. Primary analysis

Let's take a look at our data:

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv',usecols = lambda column : column not in 
["customerID"])
df.head()

In [None]:
df.info()

As we can see, there are no empty cells - that's amazing! We don't need to think about how to fill the gaps. **But there are a couple of nuances** - almost all columns are in the "object" format, which is inconvenient for processing. This is especially odd for the "TotalCharges" column: it holds numerical values, unlike the other "object" columns, which are categorical. We are going to fix it:

First, we need to convert "TotalCharges" to float:

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])

Oops! Looks like we have empty values - let's find them:

In [None]:
# Positions of the rows where "TotalCharges" is a single space
empty_values = [i for i, value in enumerate(df['TotalCharges']) if value == ' ']
print("There are empty indexes found:", empty_values)
# Show each affected row in full
for i in empty_values:
    print(df.iloc[i])

**Interesting fact:** for all rows with an empty value in the "TotalCharges" cell, the "tenure" cell has a value of zero, which means that these are *new* users, and we can replace the empty value with zero:

In [None]:
df["TotalCharges"] =  df["TotalCharges"].replace(r' ', '0')

A second attempt at converting the column:

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
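As an alternative to scanning for blank strings by hand, `pd.to_numeric` accepts `errors='coerce'`, which turns anything it cannot parse into `NaN`. The sketch below runs on a small made-up frame (the `toy` values are assumptions for illustration, not the Telco data) and shows how the same check and fix could be done in a few lines:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "TotalCharges": [" ", "350.5", " ", "99.9"],
})

# errors='coerce' converts unparseable strings (here: a single space) to NaN
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows where the conversion failed
bad_rows = toy[converted.isna()]
print(bad_rows)

# Check the observation from the text: every blank row has zero tenure
all_new_customers = bool((bad_rows["tenure"] == 0).all())
print("All blank rows have tenure == 0:", all_new_customers)

# Fill the blanks with zero, which gives a float column
toy["TotalCharges"] = converted.fillna(0.0)
print(toy.dtypes)
```

The advantage of this approach is that it also catches other malformed values, not only a single space.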

So:

In [None]:
df.info()

Yay! We will return to processing later.

Let's see how many unique values each categorical column contains:

In [None]:
for i in df.columns:
    if i not in ['tenure','MonthlyCharges','TotalCharges']:
        print(i,'column has',len(pd.unique(df[i])),'unique values or rather:')
        print(pd.unique(df[i]))

Now let's take a look at the target feature, "Churn":

In [None]:
plt.figure(figsize=(10,6))

plt.title("Churn chart")

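# Pass the column by keyword: recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=df['Churn'])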

Wow! **Our classes are very imbalanced.** In numbers:

In [None]:
churn_yes = df[df.Churn == "Yes"].shape[0]
churn_no = df[df.Churn == "No"].shape[0]

churn_yes_percent = round((churn_yes / (churn_yes + churn_no) * 100),2)
churn_no_percent = round((churn_no / (churn_yes + churn_no) * 100 ),2)

print('There are',churn_yes_percent,'percent of customers that will churn and',churn_no_percent,'percent of customers that will not churn')
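The same percentages can be obtained more compactly with `value_counts(normalize=True)`. This is a minimal sketch on a made-up column (the `churn` series below is an assumption, not the real data):

```python
import pandas as pd

# Toy target column (made-up values, not the Telco data)
churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

# normalize=True returns class shares instead of raw counts
shares = churn.value_counts(normalize=True).mul(100).round(2)
print(shares)
```

On the real data, replacing `churn` with `df['Churn']` should give the same two numbers as the cell above.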

The conclusion suggests itself: we need to somehow **balance the classes** so that the model neither overfits to the majority class nor fails to learn the minority class. To do this, you can use **resampling methods**, which are described below, but first we will create a training sample (which we will balance) and a validation sample.

Before splitting, let's list the categorical columns that will be one-hot encoded (see [One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)):

In [None]:
categorial_columns = [cname for cname in df.columns if cname not in ['tenure','MonthlyCharges','TotalCharges','Churn']]

print("Our categorial columns:", categorial_columns)

Let's create the training and validation datasets. One-hot encoding should be done after splitting, not before: at the validation stage the model must work with "raw" data that it sees for the first time, and anything learned from the entire dataset before the split can cause data leakage.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']

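# Creating train and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# One Hot Encoding
X_train = pd.get_dummies(X_train_full)
# Align the validation columns with the training columns, in case a category is missing from one of the subsets
X_valid = pd.get_dummies(X_valid_full).reindex(columns=X_train.columns, fill_value=0)

# For y-values we will use LabelEncoder ("No" -> 0, "Yes" -> 1)

label_enc  = LabelEncoder()
y_train = label_enc.fit_transform(y_train)
# Reuse the mapping learned on the training labels instead of refitting
y_valid = label_enc.transform(y_valid)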
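For comparison, here is a minimal sketch of the same idea with scikit-learn's `OneHotEncoder`, which is fitted on the training data only and then reused on the validation data. The toy frames and the `Contract` values below are made up for illustration; `handle_unknown='ignore'` is one possible way to deal with categories that never appeared in training:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): the validation set contains a category unseen in training
train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Month-to-month"]})
valid = pd.DataFrame({"Contract": ["One year", "Two year"]})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()  # learn categories from train only
valid_encoded = encoder.transform(valid).toarray()      # reuse them on validation

print(encoder.categories_)
print(valid_encoded)
```

Because the encoder remembers the training categories, the validation matrix always has the same columns as the training matrix.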

Now our data looks like this:

In [None]:
X_train.head()

Amazing! Now we scale the data, more precisely the columns "tenure", "MonthlyCharges" and "TotalCharges", so that the model can fit faster on features with very different ranges:

In [None]:
from sklearn.preprocessing import RobustScaler

# I use RobustScaler because it's quite robust to outliers

rob_scaler = RobustScaler()

columns_to_scale = ['tenure','MonthlyCharges','TotalCharges']

# Fit the scaler on the training data only, then reuse it on the validation data
X_train[columns_to_scale] = rob_scaler.fit_transform(X_train[columns_to_scale])
X_valid[columns_to_scale] = rob_scaler.transform(X_valid[columns_to_scale])

In [None]:
X_valid.head()
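To make the scaling step less of a black box: with default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), both computed on the data it was fitted on. The sketch below checks this by hand on made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme outlier (made-up values)
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
valid = np.array([[3.0], [5.0]])

# Statistics are learned from the training data only
scaler = RobustScaler().fit(train)
scaled_valid = scaler.transform(valid)

# Manual version: (x - median) / (Q3 - Q1), using training statistics
median = np.median(train)
q1, q3 = np.percentile(train, [25, 75])
manual_valid = (valid - median) / (q3 - q1)

print(scaled_valid.ravel(), manual_valid.ravel())
```

The outlier of 100 barely affects the median and the quartiles, which is why this scaler is described as robust.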

# 2. Base model

Our data is ready for a basic algorithm - let's use logistic regression:

In [None]:
# Use GridSearchCV to find the best parameters.
# from sklearn.model_selection import GridSearchCV

# Logistic Regression 
log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)

In [None]:
predictions = log_reg.predict(X_valid)

In [None]:
# Calculate accuracy
ACC = accuracy_score(y_valid, predictions)

In [None]:
print(ACC)

Wow! Our algorithm is pretty accurate - let's take a look at the confusion matrix:

Note: to get acquainted with the confusion matrix, I recommend [this article](https://en.wikipedia.org/wiki/Confusion_matrix)

In [None]:
log_reg_cf = confusion_matrix(y_valid, predictions)

fig, axes = plt.subplots(1, 1, figsize=(12, 6))

sns.heatmap(log_reg_cf, annot=True, cmap=plt.cm.Pastel1)
plt.title("Logistic Regression Confusion Matrix", fontsize=14)
plt.xlabel("Predicted classes")
plt.ylabel("Actual classes")

plt.show()

As we can see, the model makes two kinds of errors: 160 customers who actually churned were predicted to stay, while (relatively) few, 94, loyal customers were predicted to churn. With churn encoded as the positive class (1), these are false negatives and false positives respectively. It means that our model is good at detecting 'non-churn' customers and bad at detecting 'churn' customers. This is to be expected, since the dataset is dominated by rows about customers who are not going to leave. That is, **our model is biased toward the majority class**. Let's try to fix it.
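To read the error counts without squinting at the heatmap, the matrix can be unpacked directly. In scikit-learn, rows are actual classes and columns are predicted classes, so for binary labels `ravel()` yields TN, FP, FN, TP in that order. This is a small sketch on made-up labels (1 stands for churn, as `LabelEncoder` maps "No" to 0 and "Yes" to 1):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = churn, 0 = no churn
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Recall: share of actual churners that were caught
recall = recall_score(y_true, y_pred)
# Precision: share of predicted churners that really churned
precision = precision_score(y_true, y_pred)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

On imbalanced data, recall and precision for the churn class often say more about the model than accuracy alone.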

# 3. Undersampling data

For starters, you can shrink the original dataset by reducing the number of rows that belong to the majority class. This can be done with the [NearMiss technique](https://imbalanced-learn.org/stable/generated/imblearn.under_sampling.NearMiss.html):

In [None]:
from imblearn.under_sampling import NearMiss

undersample_pipeline = make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
undersample_model = undersample_pipeline.fit(X_train, y_train)
undersample_predictions = undersample_model.predict(X_valid)

In [None]:
# Calculate accuracy
ACC = accuracy_score(y_valid, undersample_predictions)

In [None]:
print(ACC)

Well... It looks like accuracy has dropped noticeably. But this is no reason to be upset - let's take a look at the confusion matrix:

In [None]:
log_reg_cf = confusion_matrix(y_valid, undersample_predictions)

fig, axes = plt.subplots(1, 1, figsize=(12, 6))

sns.heatmap(log_reg_cf, annot=True, cmap=plt.cm.Pastel1)
plt.title("Logistic Regression Confusion Matrix", fontsize=14)
plt.xlabel("Predicted classes")
plt.ylabel("Actual classes")

plt.show()

Now everything is exactly the opposite: only 68 customers who churned were predicted to stay (false negatives, with churn as the positive class), but as many as 440 loyal customers were predicted to churn (false positives) - it means that our model is very bad at detecting 'non-churn' customers and quite good at detecting 'churn' customers.

Here a little philosophical question arises - which is more profitable: poorly recognizing clients who are going to leave, or spamming a bunch of clients who are definitely not going to leave? I would love to participate in the discussion :)
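One way to make this question concrete is to attach a cost to each kind of error. The sketch below takes the error counts quoted above, reading the larger base-model count (160) as churners the model missed, as the conclusions in the text imply. The two cost values are purely hypothetical assumptions in arbitrary units, not something derived from the dataset:

```python
# Error counts quoted in the text for each model
models = {
    "base": {"missed_churners": 160, "needless_contacts": 94},
    "undersampled": {"missed_churners": 68, "needless_contacts": 440},
}

# Hypothetical costs in arbitrary units (assumptions for illustration only)
COST_MISSED_CHURNER = 10.0
COST_NEEDLESS_CONTACT = 1.0


def total_cost(errors, cost_missed, cost_contact):
    """Total cost of a model's mistakes under the given per-error costs."""
    return (errors["missed_churners"] * cost_missed
            + errors["needless_contacts"] * cost_contact)


costs = {
    name: total_cost(errors, COST_MISSED_CHURNER, COST_NEEDLESS_CONTACT)
    for name, errors in models.items()
}
print(costs)

# Cost ratio at which both models are equally expensive:
# extra needless contacts divided by extra churners caught
break_even_ratio = (440 - 94) / (160 - 68)
print(f"Break-even ratio: {break_even_ratio:.2f}")
```

Under these assumptions, the undersampled model is cheaper whenever a missed churner costs more than about 3.76 needless contacts; below that ratio the base model wins.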

But that's not all - let's try to oversample our data (that is, we will artificially increase the number of customers who are going to leave)!

# 4. Oversampling data

This can be done with the [SMOTE](https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/) technique:

In [None]:
from imblearn.over_sampling import SMOTE

# I use another solver and increase the number of iterations because our dataset will become larger
oversample_pipeline = make_pipeline(SMOTE(sampling_strategy='minority'), LogisticRegression(solver = 'saga', max_iter=10000))
oversample_model = oversample_pipeline.fit(X_train, y_train)
oversample_predictions = oversample_model.predict(X_valid)
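The core idea behind SMOTE is to create a new minority point somewhere on the line segment between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only; it is not the `imblearn` implementation, and the function name `smote_like_sample` and the toy points are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2D (made-up values, all distinct)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    base = points[rng.integers(len(points))]
    # Euclidean distances from the base point to every point
    distances = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours; position 0 is the base point itself
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random interpolation factor in [0, 1)
    lam = rng.random()
    return base + lam * (neighbour - base)


synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies between two real minority points, so the new samples stay inside the region already occupied by the minority class.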

In [None]:
# Calculate accuracy
ACC = accuracy_score(y_valid, oversample_predictions)

In [None]:
print(ACC)

In [None]:
log_reg_cf = confusion_matrix(y_valid, oversample_predictions)

fig, axes = plt.subplots(1, 1, figsize=(12, 6))

sns.heatmap(log_reg_cf, annot=True, cmap=plt.cm.Pastel1)
plt.title("Logistic Regression Confusion Matrix", fontsize=14)
plt.xlabel("Predicted classes")
plt.ylabel("Actual classes")

plt.show()

Wow! We got something in between the first and second options - we can say that it is "in the neutral zone": at predicting clients who are going to leave, it is better than the first model but worse than the second, and vice versa for clients who are going to stay.

# 5. Deep look into oversampled model

Suppose you decided to choose the third model because you want acceptable results on average. Let's try to improve it:

In [None]:
# I use Grid Search to find best parameters for our model
from sklearn.model_selection import GridSearchCV

# Creating a pipeline with oversampling followed by logistic regression
pipeline = imb_pipeline([
    ('smote', SMOTE(sampling_strategy='minority')),
    ('logreg', LogisticRegression(solver='saga', max_iter=10000)),
])

# Pipeline step parameters are addressed as '<step name>__<parameter>'
parameters = {}
parameters['logreg__penalty'] = ['l1', 'l2']
parameters['logreg__C'] = list(range(80, 420, 40))

CV = GridSearchCV(pipeline, parameters, scoring='accuracy', n_jobs=1)
CV.fit(X_train, y_train)

print('Best parameter combination for logistic regression is:', CV.best_params_)
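Instead of copying the best parameters into a new pipeline by hand, the fitted search object can be used directly: with the default `refit=True`, `best_estimator_` is the pipeline refitted on all the data passed to `fit` with the best parameters. This is a minimal sketch on synthetic data with a plain scikit-learn pipeline (the names `X_demo`, `demo_pipeline` and the small grid are assumptions for illustration, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the Telco features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
demo_grid = {"logreg__C": [0.1, 1, 10]}

search = GridSearchCV(demo_pipeline, demo_grid, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

# best_estimator_ is already refitted with the best parameters
best_model = search.best_estimator_
demo_predictions = best_model.predict(X_demo)
print(search.best_params_)
```

In the notebook, the equivalent would be calling `CV.best_estimator_.predict(X_valid)`, which avoids retyping the chosen values.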

In [None]:
oversample_pipeline = make_pipeline(SMOTE(sampling_strategy='minority'), LogisticRegression(solver = 'saga', penalty = 'l1', C=80, max_iter=10000))
oversample_model = oversample_pipeline.fit(X_train, y_train)
oversample_predictions = oversample_model.predict(X_valid)

print('Accuracy on validation set: %s' % (accuracy_score(y_valid, oversample_predictions)))

Well, it looks like we achieved good accuracy - 81%! You can adjust the parameters and improve the result yourself, but that's all from me for the moment. Thank you for reading! I would be glad to receive feedback and interesting suggestions! Also, I'm ready to listen to criticism and different opinions. See you later!

*Contacts:*
Telegram - @univanxx, Instagram - @univanxx