# **Telecom Churn Data | Extract, Load**

*Authors:*
- *Myroslava Sánchez Andrade A01730712*
- *Karen Rugerio Armenta A01733228*
- *José Antonio Bobadilla García A01734433*
- *Alejandro Castro Reus A01731065*

*Creation date: 17/08/2022*

*Last updated: 11/09/2022*

---

## **Extract**
The dataset used for this project is **[telecom_churn_me.csv](https://www.kaggle.com/datasets/mark18vi/telecom-churn-data?resource=download)**, downloaded from the Kaggle platform.
<br>This dataset from a telecommunications company contains the customers' account information and whether each customer left within the last month.


In [None]:
# REQUIRED LIBRARIES
# !pip install pandas numpy statsmodels scikit-learn scipy category_encoders joblib imbalanced-learn

In [None]:
# RUN ONLY FOR GOOGLE COLAB

# from google.colab import drive

# drive.mount("path")  

# %cd "path"

In [2]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import preprocessing, impute
from scipy import stats
from sklearn.experimental import enable_iterative_imputer
from category_encoders import OrdinalEncoder
from joblib import dump
from imblearn.over_sampling import SMOTE

In [3]:
# Reading data via Pandas from CSV
telco_customers_data = pd.read_csv('../../../data/telecom_churn_me/telecom_churn_me.csv')
telco_customers_data

Unnamed: 0.1,Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
0,0,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,931.208938,812.175000,...,35.850,34.015,72.075,141.840,56.115,11944.079102,0.0,0.0,0,25
1,1,Prestige,EMIRATI,United Arab Emirates,M,0,1994,2019,431.082618,486.500000,...,10.595,7.715,11.750,5.110,0.000,9903.157715,0.0,0.0,0,25
2,2,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,50.619644,52.815000,...,0.000,0.000,0.000,0.000,0.000,0.102539,0.0,0.0,0,25
3,3,Prestige,EMIRATI,United Arab Emirates,M,0,1994,2019,399.710034,422.235000,...,158.500,2.670,15.965,0.000,0.000,3600.322266,0.0,0.0,0,25
4,4,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,612.665844,825.888333,...,186.050,17.515,28.685,3.235,4.475,3852.026367,0.0,0.0,0,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1140610,Residential,EMIRATI,United Arab Emirates,M,0,2017,2019,297.752650,313.950000,...,0.000,0.000,0.000,0.000,0.000,307945.957031,0.0,0.0,0,2
1140600,1140611,Residential,YOUTH,United Arab Emirates,M,0,2017,2019,160.663773,178.500000,...,0.000,0.000,0.000,0.000,0.000,22647.873535,0.0,0.0,0,2
1140601,1140612,Consumer via Retailer,EXPATS,Comoros,M,0,2017,2019,570.147016,642.911667,...,64.990,3.660,10.050,0.000,0.000,17582.867188,0.0,0.0,0,2
1140602,1140613,Residential,EXPATS,Philippines,M,0,2017,2019,452.736799,525.413333,...,102.075,54.065,7.980,5.350,0.065,3015.338867,0.0,0.0,0,2


#### ***Verifying structure and content***

In [4]:
# Validating the information of each column => There are no null values in the whole DF, and several columns have the object dtype.
telco_customers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1140604 entries, 0 to 1140603
Data columns (total 28 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Unnamed: 0                 1140604 non-null  int64  
 1   PTY_PROFILE_SUB_TYPE       1140604 non-null  object 
 2   SOCIO_ECONOMIC_SEGMENT     1140604 non-null  object 
 3   PARTY_NATIONALITY          1140604 non-null  object 
 4   PARTY_GENDER_CD            1140604 non-null  object 
 5   TARGET                     1140604 non-null  int64  
 6   YEAR_JOINED                1140604 non-null  int64  
 7   CURRENT_YEAR               1140604 non-null  int64  
 8   BILL_AMOUNT                1140604 non-null  float64
 9   PAID_AMOUNT                1140604 non-null  float64
 10  PAYMENT_TRANSACTIONS       1140604 non-null  int64  
 11  PARTY_REV                  1140604 non-null  float64
 12  PREPAID_LINES              1140604 non-null  int64  
 13  POSTPAID_LIN

---
## **Transform**

### ***Row analysis***

#### *Row valuation*

Dropping the rows (clients) that have more than 75% null values

In [5]:
# Calculating the total cells per client (each client is one row, so this is the number of columns)
cells_per_client = len(telco_customers_data.columns)

# Calculating the total non null cells per client
non_null_cells_per_client = telco_customers_data.count(axis=1)
non_null_cells_per_client


0          28
1          28
2          28
3          28
4          28
           ..
1140599    28
1140600    28
1140601    28
1140602    28
1140603    28
Length: 1140604, dtype: int64

In [6]:
# Calculating the total non null cells per client in terms of percentages
percentages_non_null_cells_per_clients = (non_null_cells_per_client * 100) / (cells_per_client)
percentages_non_null_cells_per_clients

0          100.0
1          100.0
2          100.0
3          100.0
4          100.0
           ...  
1140599    100.0
1140600    100.0
1140601    100.0
1140602    100.0
1140603    100.0
Length: 1140604, dtype: float64

In [7]:
# Defining the list of the names of the columns to drop
drop_column_names = []

# Obtaining the clients (rows) that have more than 75% null values
clients_to_drop = percentages_non_null_cells_per_clients < 25
clients_to_drop = clients_to_drop[clients_to_drop]
clients_to_drop = list(clients_to_drop.index)

# No client reached that percentage of null values, so nothing is dropped here.
# Row labels are dropped directly instead of being mixed into the list of column names.
telco_customers_data = telco_customers_data.drop(index=clients_to_drop)
clients_to_drop

[]
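
The row filter above can be illustrated on a toy table. This is a minimal sketch in plain Python rather than pandas, using made-up rows and a hypothetical helper `non_null_percentage`; it only mirrors the rule of dropping clients with less than 25% non-null cells.

```python
# Toy rows standing in for clients; None marks a missing cell (made-up data).
rows = [
    [1, "Residential", 931.2, 25],
    [2, None, None, None],
    [None, None, None, None],
]


def non_null_percentage(row):
    """Share of non-missing cells in a row, expressed as a percentage."""
    return 100 * sum(cell is not None for cell in row) / len(row)


percentages = [non_null_percentage(row) for row in rows]

# Same rule as the cell above: drop rows with less than 25% non-null cells.
rows_to_drop = [i for i, pct in enumerate(percentages) if pct < 25]

print(percentages)
print(rows_to_drop)
```

Only the last toy row falls below the 25% limit. In the real dataset every row is complete, which is why the list of clients to drop is empty.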

### ***Column analysis***

#### *Column valuation*

Dropping the columns that have more than 65% null values

In [8]:
# Calculating the total non null cells per column
non_null_cells_per_column = telco_customers_data.count(axis=0)
percentages_non_null_cells_per_column = (non_null_cells_per_column * 100) / (telco_customers_data.shape[0])
# Obtaining the columns that have more than 65% of null values
columns_to_drop = percentages_non_null_cells_per_column < 35
columns_to_drop = columns_to_drop[columns_to_drop]
columns_to_drop = list(columns_to_drop.index)

drop_column_names.extend(columns_to_drop)

Storing the columns that are full of unique values

In [9]:
# Storing the number of unique values of each column
no_unique_values = telco_customers_data.nunique().to_frame()

# Storing the name of columns that are full of unique values (id)
drop_columns = no_unique_values[no_unique_values == telco_customers_data.shape[0]]
drop_columns = drop_columns.dropna()
drop_columns

drop_column_names.append(drop_columns.index[0])
drop_column_names

['Unnamed: 0']

Storing the columns that are full (or almost full) of the same values

In [10]:
# Calculating the percentiles of each column
data_description = telco_customers_data.describe()
data_description = data_description.drop(['count', 'mean', 'std', 'min', 'max'], axis = 0)
data_description

Unnamed: 0.1,Unnamed: 0,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
25%,285154.75,0.0,2013.0,2019.0,174.137757,181.666667,1.0,423.187917,0.0,1.0,...,0.425,0.35,0.015,0.0,0.0,708.101562,0.0,0.0,0.0,2.0
50%,570306.5,0.0,2016.0,2019.0,290.72394,300.729167,1.0,834.713333,1.0,2.0,...,29.445,7.16,10.025,2.175,0.0,4394.218506,0.0,0.0,0.0,3.0
75%,855461.25,0.0,2017.0,2019.0,460.9771,476.423333,2.0,1553.675,3.0,3.0,...,141.895,22.035,39.26,54.09,1.71,9955.910278,0.0,0.0,0.0,6.0


In [11]:
# Storing the difference column by column of the percentiles
data_description = data_description.diff()
data_description

Unnamed: 0.1,Unnamed: 0,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
25%,,,,,,,,,,,...,,,,,,,,,,
50%,285151.75,0.0,3.0,0.0,116.586183,119.0625,0.0,411.525417,1.0,1.0,...,29.02,6.81,10.01,2.175,0.0,3686.116943,0.0,0.0,0.0,1.0
75%,285154.75,0.0,1.0,0.0,170.25316,175.694167,1.0,718.961667,2.0,1.0,...,112.45,14.875,29.235,51.915,1.71,5561.691772,0.0,0.0,0.0,3.0


In [12]:
# If the 50% and 75% rows of the differences are both 0, the three quartiles coincide, so the column is full (or almost full) of the same value
percentiles = pd.concat([data_description[1:2] == 0.0, data_description[2:3] == 0.0])

# Keeping only the columns where both differences are 0
for col in data_description.columns:
    if not percentiles[col].all():
        percentiles = percentiles.drop(col, axis=1)

print(percentiles)
drop_column_names.extend(list(percentiles.columns))
drop_column_names

     TARGET  CURRENT_YEAR  DATA_OUT_BNDL  DATA_USG_PAYG  COMPLAINTS
50%    True          True           True           True        True
75%    True          True           True           True        True



['Unnamed: 0',
 'TARGET',
 'CURRENT_YEAR',
 'DATA_OUT_BNDL',
 'DATA_USG_PAYG',
 'COMPLAINTS']
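
The quartile test used in the cell above can be sketched on toy columns. This is an illustration only: the helper `is_quasi_constant` and the sample values are hypothetical, and `statistics.quantiles` with `method="inclusive"` is assumed to be close enough to the linear interpolation that `describe()` applies.

```python
from statistics import quantiles


def is_quasi_constant(values):
    """True when the 25th, 50th and 75th percentiles all coincide."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return q50 - q25 == 0 and q75 - q50 == 0


complaints = [0] * 9 + [5]         # made-up column dominated by zeros
bill_amount = list(range(1, 11))   # made-up column with real spread

print(is_quasi_constant(complaints))
print(is_quasi_constant(bill_amount))
```

A heavily imbalanced binary column such as `TARGET` also trips this rule, which is why the notebook removes it from the drop list before dropping anything.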

Dropping the columns collected in the previous steps

In [13]:
# Removing the target from the columns to drop
y = 'TARGET'
drop_column_names.remove(y)

# Dropping the columns to drop
telco_customers_data = telco_customers_data.drop(drop_column_names, axis=1)
telco_customers_data

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,...,LINE_REV,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed
0,Residential,EMIRATI,United Arab Emirates,M,0,1994,931.208938,812.175000,1,5968.700000,...,945.040000,ACTIVE,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25
1,Prestige,EMIRATI,United Arab Emirates,M,0,1994,431.082618,486.500000,1,6245.141667,...,493.815000,ACTIVE,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25
2,Residential,EMIRATI,United Arab Emirates,M,0,1994,50.619644,52.815000,1,1666.488333,...,50.300000,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25
3,Prestige,EMIRATI,United Arab Emirates,M,0,1994,399.710034,422.235000,1,2522.008333,...,406.586667,ACTIVE,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25
4,Residential,EMIRATI,United Arab Emirates,M,0,1994,612.665844,825.888333,1,1219.961667,...,751.185000,ACTIVE,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,Residential,EMIRATI,United Arab Emirates,M,0,2017,297.752650,313.950000,1,2418.486667,...,303.166667,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2
1140600,Residential,YOUTH,United Arab Emirates,M,0,2017,160.663773,178.500000,1,454.116667,...,170.000000,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2
1140601,Consumer via Retailer,EXPATS,Comoros,M,0,2017,570.147016,642.911667,1,615.866667,...,609.630000,ACTIVE,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2
1140602,Residential,EXPATS,Philippines,M,0,2017,452.736799,525.413333,2,735.645000,...,414.840000,ACTIVE,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2


#### *Ordinal encoding*

In [14]:
# Identifying the categorical columns
categorical_columns = telco_customers_data.select_dtypes(include='object')
categorical_columns = categorical_columns.columns
categorical_columns = list(categorical_columns)
categorical_columns

['PTY_PROFILE_SUB_TYPE',
 'SOCIO_ECONOMIC_SEGMENT',
 'PARTY_NATIONALITY',
 'PARTY_GENDER_CD',
 'STATUS']

In [15]:
# Creating an ordinal encoder object and fitting it on the categorical columns
encoder = OrdinalEncoder().fit(telco_customers_data[categorical_columns])
# Replacing each category with its integer code
categorical_columns_encoded = encoder.transform(telco_customers_data[categorical_columns])
categorical_columns_encoded

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,STATUS
0,1,1,1,1,1
1,2,1,1,1,1
2,1,1,1,1,1
3,2,1,1,1,1
4,1,1,1,1,1
...,...,...,...,...,...
1140599,1,1,1,1,1
1140600,1,3,1,1,1
1140601,3,2,44,1,1
1140602,1,2,54,1,1


In [16]:
# Exporting the encoder
dump(encoder, "../joblibs/telecom_churn_me/etl/encoder.joblib")

['../joblibs/telecom_churn_me/etl/encoder.joblib']
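
What the encoder does to each column can be sketched in a few lines. The integer codes in the output above are consistent with numbering categories from 1 in order of first appearance; that ordering is an assumption here, and `fit_ordinal_mapping` is a hypothetical helper, not part of `category_encoders`.

```python
def fit_ordinal_mapping(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


# Toy sample of the SOCIO_ECONOMIC_SEGMENT column (made-up order).
segments = ["EMIRATI", "EMIRATI", "EXPATS", "YOUTH", "EXPATS"]

mapping = fit_ordinal_mapping(segments)
encoded = [mapping[value] for value in segments]

print(mapping)
print(encoded)
```

Unlike one-hot encoding, this keeps one column per feature but imposes an arbitrary numeric order on the categories, which distance-based or linear models may interpret as meaningful.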

In [17]:
# Substituting the categorical columns in the dataset with their encoded values
telco_customers_data[categorical_columns] = categorical_columns_encoded
telco_customers_data

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,...,LINE_REV,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed
0,1,1,1,1,0,1994,931.208938,812.175000,1,5968.700000,...,945.040000,1,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25
1,2,1,1,1,0,1994,431.082618,486.500000,1,6245.141667,...,493.815000,1,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25
2,1,1,1,1,0,1994,50.619644,52.815000,1,1666.488333,...,50.300000,1,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25
3,2,1,1,1,0,1994,399.710034,422.235000,1,2522.008333,...,406.586667,1,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25
4,1,1,1,1,0,1994,612.665844,825.888333,1,1219.961667,...,751.185000,1,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1,1,1,1,0,2017,297.752650,313.950000,1,2418.486667,...,303.166667,1,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2
1140600,1,3,1,1,0,2017,160.663773,178.500000,1,454.116667,...,170.000000,1,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2
1140601,3,2,44,1,0,2017,570.147016,642.911667,1,615.866667,...,609.630000,1,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2
1140602,1,2,54,1,0,2017,452.736799,525.413333,2,735.645000,...,414.840000,1,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2


### ***Imputation***

In [18]:
# Using the iterative imputer to estimate missing values
imp_mean = impute.IterativeImputer()

In [19]:
# Imputing the missing values of the dataset
imputed_data = imp_mean.fit_transform(telco_customers_data)
imputed_data

array([[1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        5.61150000e+01, 1.19440791e+04, 2.50000000e+01],
       [2.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        0.00000000e+00, 9.90315771e+03, 2.50000000e+01],
       [1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        0.00000000e+00, 1.02539062e-01, 2.50000000e+01],
       ...,
       [3.00000000e+00, 2.00000000e+00, 4.40000000e+01, ...,
        0.00000000e+00, 1.75828672e+04, 2.00000000e+00],
       [1.00000000e+00, 2.00000000e+00, 5.40000000e+01, ...,
        6.50000000e-02, 3.01533887e+03, 2.00000000e+00],
       [1.00000000e+00, 2.00000000e+00, 5.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 2.00000000e+00]])

In [22]:
# Exporting the fitted imputer
dump(imp_mean, "../joblibs/telecom_churn_me/etl/imputation.joblib")

['../joblibs/telecom_churn_me/etl/imputation.joblib']

In [21]:
# Since the imputation returns an array, we convert it back to a DF
telco_customers_data = pd.DataFrame(imputed_data, columns=list(telco_customers_data.columns))
telco_customers_data

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,...,LINE_REV,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed
0,1.0,1.0,1.0,1.0,0.0,1994.0,931.208938,812.175000,1.0,5968.700000,...,945.040000,1.0,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25.0
1,2.0,1.0,1.0,1.0,0.0,1994.0,431.082618,486.500000,1.0,6245.141667,...,493.815000,1.0,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25.0
2,1.0,1.0,1.0,1.0,0.0,1994.0,50.619644,52.815000,1.0,1666.488333,...,50.300000,1.0,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25.0
3,2.0,1.0,1.0,1.0,0.0,1994.0,399.710034,422.235000,1.0,2522.008333,...,406.586667,1.0,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25.0
4,1.0,1.0,1.0,1.0,0.0,1994.0,612.665844,825.888333,1.0,1219.961667,...,751.185000,1.0,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1.0,1.0,1.0,1.0,0.0,2017.0,297.752650,313.950000,1.0,2418.486667,...,303.166667,1.0,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2.0
1140600,1.0,3.0,1.0,1.0,0.0,2017.0,160.663773,178.500000,1.0,454.116667,...,170.000000,1.0,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2.0
1140601,3.0,2.0,44.0,1.0,0.0,2017.0,570.147016,642.911667,1.0,615.866667,...,609.630000,1.0,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2.0
1140602,1.0,2.0,54.0,1.0,0.0,2017.0,452.736799,525.413333,2.0,735.645000,...,414.840000,1.0,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2.0


#### *Multicollinearity*

In [23]:
# VIF dataframe
vif_data = pd.DataFrame()

# Removing the dependent variable
x_variables = telco_customers_data.drop(y, axis=1)
vif_data["x variables"] = x_variables.columns
vif_data

Unnamed: 0,x variables
0,PTY_PROFILE_SUB_TYPE
1,SOCIO_ECONOMIC_SEGMENT
2,PARTY_NATIONALITY
3,PARTY_GENDER_CD
4,YEAR_JOINED
5,BILL_AMOUNT
6,PAID_AMOUNT
7,PAYMENT_TRANSACTIONS
8,PARTY_REV
9,PREPAID_LINES


In [24]:
# Calculating the vif of the columns and dropping the high multicollinearity
def calculate_vif(vif_data, x_variables):
    vif_data['VIF'] = [variance_inflation_factor(x_variables.values, i) for i in range(len(x_variables.columns))]
    while vif_data['VIF'].max() > 5:
        max_index = vif_data['VIF'].idxmax()
        delete_column = vif_data['x variables'].iloc[max_index]
        # Adding deleted column to the global variable
        drop_column_names.append(delete_column)
        x_variables = x_variables.drop(columns=[delete_column], axis=1)
        vif_data = vif_data.drop(index=max_index, axis=0)
        vif_data['VIF'] = [variance_inflation_factor(x_variables.values, i) for i in range(len(x_variables.columns))]
        vif_data.reset_index(inplace=True, drop=True)
    return vif_data
    

# Storing the columns with no high multicollinearity
filtered_vif_data = calculate_vif(vif_data, x_variables)
filtered_vif_data

Unnamed: 0,x variables,VIF
0,PARTY_NATIONALITY,1.39736
1,BILL_AMOUNT,2.47479
2,PAYMENT_TRANSACTIONS,3.592786
3,PARTY_REV,1.914242
4,PREPAID_LINES,1.415734
5,OTHER_LINES,1.848064
6,STATUS,3.538873
7,MOUS_TO_LOCAL_MOBILES,1.364588
8,MOUS_FROM_LOCAL_MOBILES,1.344523
9,MOUS_TO_LOCAL_LANDLINES,1.278394


Multicollinearity occurs when two or more independent variables are highly correlated with each other. It can make the model's estimates unreliable, so these variables must be detected and discarded.

For the detection of multicollinearity, the **Variance Inflation Factor (VIF)** technique was used. This method regresses each independent variable against all the others. The VIF is calculated as $VIF = {1\over 1 - R^2}$, where $R^2$ is the coefficient of determination of that regression. A higher VIF denotes stronger collinearity. Generally, a VIF above 5 indicates high multicollinearity.
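
The formula can be checked by hand in the simplest case. With only one other predictor (and an intercept), $R^2$ equals the squared Pearson correlation, so the VIF follows directly from it. The sketch below uses made-up values and hypothetical helpers `pearson_r` and `vif_two_predictors`; it illustrates the textbook definition and is not a reimplementation of the `statsmodels` function, which works on whatever design matrix it is given.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2), where R^2 = r^2 for a single other predictor."""
    r_squared = pearson_r(xs, ys) ** 2
    return 1 / (1 - r_squared)


bill = [10, 20, 30, 40, 50]    # made-up values
paid = [12, 19, 33, 41, 48]    # nearly proportional to bill
lines = [3, 1, 2, 1, 3]        # uncorrelated with bill

print(vif_two_predictors(bill, paid))    # far above the threshold of 5
print(vif_two_predictors(bill, lines))   # no inflation
```

Two nearly proportional columns give a VIF well above 5, so one of them would be discarded, while an uncorrelated column gives a VIF of 1.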

In [25]:
filtered_columns = list(filtered_vif_data['x variables'])
filtered_columns.append(y)

telco_customers_data = telco_customers_data[filtered_columns]
telco_customers_data

Unnamed: 0,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,1.0,931.208938,1.0,5968.700000,2.0,2.0,1.0,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25.0,0.0
1,1.0,431.082618,1.0,6245.141667,6.0,2.0,1.0,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25.0,0.0
2,1.0,50.619644,1.0,1666.488333,2.0,1.0,1.0,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25.0,0.0
3,1.0,399.710034,1.0,2522.008333,3.0,3.0,1.0,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25.0,0.0
4,1.0,612.665844,1.0,1219.961667,0.0,1.0,1.0,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1.0,297.752650,1.0,2418.486667,5.0,3.0,1.0,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2.0,0.0
1140600,1.0,160.663773,1.0,454.116667,0.0,1.0,1.0,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2.0,0.0
1140601,44.0,570.147016,1.0,615.866667,1.0,0.0,1.0,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2.0,0.0
1140602,54.0,452.736799,2.0,735.645000,1.0,0.0,1.0,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2.0,0.0


#### *Standardization*

In [26]:
# Obtaining the mean and standard deviation
telco_customers_data_mean_std = telco_customers_data.describe()
telco_customers_data_mean_std = telco_customers_data_mean_std.drop(['count', 'min', '25%', '50%', '75%', 'max'], axis = 0)
telco_customers_data_mean_std

Unnamed: 0,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
mean,12.353525,381.18039,1.346936,1910.808512,2.1447,0.899103,1.105673,393.181442,129.417629,17.171341,36.186004,51.084417,6.840673,10491.521078,5.566531,0.052783
std,20.371123,369.70395,0.730928,18370.153789,5.751809,7.864311,0.478543,901.601659,298.945225,38.574464,120.017158,114.449291,35.604586,33572.216744,6.104279,0.223601


In [27]:
# Exporting the data for the standardization
dump(telco_customers_data_mean_std, '../joblibs/telecom_churn_me/etl/mean_std.joblib')

['../joblibs/telecom_churn_me/etl/mean_std.joblib']

In [28]:
# Calculating the z-score
z_score = telco_customers_data.copy()
for column in z_score.columns:
    z_score[column] = z_score[column] - telco_customers_data_mean_std.loc['mean', column]
    z_score[column] = z_score[column].div(telco_customers_data_mean_std.loc['std', column])

# Keeping the target in its original 0/1 form
z_score[y] = telco_customers_data[y]

# Updating the main variable
telco_customers_data = z_score
telco_customers_data

Unnamed: 0,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,-0.557334,1.487754,-0.474652,0.220896,-0.025157,0.139987,-0.220822,0.677559,-0.312993,0.436653,0.299032,0.792976,1.383932,0.043267,3.183581,0.0
1,-0.557334,0.134979,-0.474652,0.235944,0.670276,0.139987,-0.220822,-0.259684,-0.397473,-0.245145,-0.203604,-0.401701,-0.192129,-0.017525,3.183581,0.0
2,-0.557334,-0.894123,-0.474652,-0.013300,-0.025157,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312503,3.183581,0.0
3,-0.557334,0.050120,-0.474652,0.033271,0.148701,0.267143,-0.220822,-0.115768,0.097283,-0.375931,-0.168484,-0.446350,-0.192129,-0.205265,3.183581,0.0
4,-0.557334,0.626137,-0.474652,-0.037607,-0.372874,0.012830,-0.220822,-0.203440,0.189441,0.008909,-0.062499,-0.418084,-0.066443,-0.197768,3.183581,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,-0.557334,-0.225661,-0.474652,0.027636,0.496418,0.267143,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,8.860137,-0.584267,0.0
1140600,-0.557334,-0.596468,-0.474652,-0.079297,-0.372874,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,0.362096,-0.584267,0.0
1140601,1.553497,0.511130,-0.474652,-0.070492,-0.199016,-0.114327,-0.220822,-0.265119,-0.215517,-0.350266,-0.217769,-0.446350,-0.192129,0.211227,-0.584267,0.0
1140602,2.044388,0.193551,0.893471,-0.063971,-0.199016,-0.114327,-0.220822,-0.193407,-0.091464,0.956427,-0.235016,-0.399604,-0.190303,-0.222690,-0.584267,0.0


The **z-score** (standard score) is a popular method to standardize data. It measures how many standard deviations a data point lies from the mean. <br>The formula to standardize the data is: $z\_score = {data\_point - mean \over std. deviation}$
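
As a small worked example of the formula, the sketch below standardizes a made-up list with the standard library. The helper `z_scores` is hypothetical; it uses the sample standard deviation (n - 1 in the denominator), which is the convention `describe()` reports.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample std."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (ddof = 1)
    return [(v - mu) / sigma for v in values]


years_stayed = [1, 2, 3, 4, 5]   # made-up values
z = z_scores(years_stayed)

print([round(v, 3) for v in z])
```

After standardization the values have mean 0 and standard deviation 1, so columns measured in very different units become comparable.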

#### *Outlier management*

In [29]:
# Defining the threshold
threshold = 3

# Row positions with at least one z-score above the threshold (upper tail only)
row_postion_outlier = np.where(z_score > threshold)[0]
row_postion_outlier = np.unique(row_postion_outlier)
len(row_postion_outlier)

155720

In [30]:
# Storing the outliers for the test
outliers = telco_customers_data.iloc[row_postion_outlier]
outliers = outliers.reset_index()
outliers

Unnamed: 0,index,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,0,-0.557334,1.487754,-0.474652,0.220896,-0.025157,0.139987,-0.220822,0.677559,-0.312993,0.436653,0.299032,0.792976,1.383932,0.043267,3.183581,0.0
1,1,-0.557334,0.134979,-0.474652,0.235944,0.670276,0.139987,-0.220822,-0.259684,-0.397473,-0.245145,-0.203604,-0.401701,-0.192129,-0.017525,3.183581,0.0
2,2,-0.557334,-0.894123,-0.474652,-0.013300,-0.025157,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312503,3.183581,0.0
3,3,-0.557334,0.050120,-0.474652,0.033271,0.148701,0.267143,-0.220822,-0.115768,0.097283,-0.375931,-0.168484,-0.446350,-0.192129,-0.205265,3.183581,0.0
4,4,-0.557334,0.626137,-0.474652,-0.037607,-0.372874,0.012830,-0.220822,-0.203440,0.189441,0.008909,-0.062499,-0.418084,-0.066443,-0.197768,3.183581,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155715,1140567,-0.262800,4.915750,-0.474652,0.124195,-0.199016,0.012830,-0.220822,-0.144860,0.368453,-0.339508,0.297574,0.017262,1.261335,0.035596,-0.584267,0.0
155716,1140580,-0.410067,0.449948,0.893471,-0.066096,-0.372874,-0.114327,-0.220822,-0.149341,0.286565,5.283901,0.103852,-0.187851,-0.172047,-0.246633,-0.584267,0.0
155717,1140583,-0.213711,-0.474021,-0.474652,-0.045320,-0.199016,0.012830,-0.220822,4.913632,0.008705,-0.409243,-0.162152,-0.446350,-0.192129,-0.167769,-0.584267,0.0
155718,1140595,0.669893,-0.200308,6.365964,-0.061388,-0.372874,-0.114327,-0.220822,-0.355913,-0.432914,-0.219221,-0.301507,-0.446350,-0.192129,-0.274189,-0.584267,0.0


In [31]:
row_postion_outlier

array([      0,       1,       2, ..., 1140583, 1140595, 1140599],
      dtype=int64)

In [32]:
# Removing the outliers from the main dataset and rebuilding a clean 0..n-1 index
telco_customers_data = telco_customers_data.drop(row_postion_outlier).reset_index(drop=True)
telco_customers_data

Unnamed: 0,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,-0.017354,-0.529086,0.893471,-0.061609,-0.025157,0.012830,-0.220822,-0.113999,0.246592,0.321421,0.442178,-0.409347,-0.030212,-0.278056,2.855942,1.0
1,-0.508245,-0.900672,-0.474652,-0.097808,-0.025157,-0.114327,-0.220822,-0.424324,-0.390799,-0.445148,-0.283426,-0.446350,-0.192129,-0.312505,2.855942,1.0
2,1.209873,-0.030584,-0.474652,-0.071117,-0.372874,-0.114327,-0.220822,-0.354088,-0.423732,-0.429594,0.074481,-0.049668,1.035662,-0.299820,2.855942,1.0
3,0.326269,-0.580203,-0.474652,-0.062318,-0.025157,0.012830,-0.220822,0.084254,-0.300365,-0.101656,-0.290675,-0.446350,-0.191848,-0.296438,2.855942,1.0
4,0.326269,-0.976798,-0.474652,-0.062318,-0.025157,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312506,2.855942,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984879,-0.017354,-0.161310,-0.474652,-0.056754,0.670276,0.012830,1.868853,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312506,-0.584267,0.0
984880,-0.557334,-0.596468,-0.474652,-0.079297,-0.372874,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,0.362096,-0.584267,0.0
984881,1.553497,0.511130,-0.474652,-0.070492,-0.199016,-0.114327,-0.220822,-0.265119,-0.215517,-0.350266,-0.217769,-0.446350,-0.192129,0.211227,-0.584267,0.0
984882,2.044388,0.193551,0.893471,-0.063971,-0.199016,-0.114327,-0.220822,-0.193407,-0.091464,0.956427,-0.235016,-0.399604,-0.190303,-0.222690,-0.584267,0.0


With the z-scores available, outliers are easy to identify with a threshold. The threshold is the number of standard deviations from the mean beyond which a data point is treated as an outlier. A value of 3 is the usual choice because, for a Gaussian distribution, about 99.7% of the data points lie within 3 standard deviations of the mean (the empirical 68-95-99.7 rule).
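
The thresholding step can be sketched on a toy column. The helper `outlier_positions` and the data are made up for illustration. Note one deliberate difference: the notebook cell compares `z_score > threshold`, which flags only the upper tail, while this sketch uses the absolute value to flag both tails.

```python
from statistics import mean, stdev


def outlier_positions(values, threshold=3):
    """Positions whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


usage = [10] * 20 + [100]   # made-up column with one extreme value

print(outlier_positions(usage))
```

Only the extreme value at the last position is flagged; the twenty identical values sit well within 3 standard deviations of the mean.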

In [33]:
# Exporting the column names to drop
dump(drop_column_names, '../joblibs/telecom_churn_me/etl/drop_columns_names.joblib')

['../joblibs/telecom_churn_me/etl/drop_columns_names.joblib']

### ***Data preparation***

In [36]:
# Shuffling the dataset to remove any pre-existing order
telco_customers_data = telco_customers_data.sample(frac = 1).reset_index(drop = True)
telco_customers_data

Unnamed: 0,PARTY_NATIONALITY,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,-0.557334,-0.348558,-0.474652,0.140438,-0.372874,-0.114327,-0.220822,0.038391,-0.432914,0.581179,-0.301507,-0.446350,-0.192129,-0.312506,-0.420448,0.0
1,-0.557334,-0.070440,-0.474652,-0.028510,-0.372874,-0.114327,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312506,-0.256628,0.0
2,2.633457,-0.080500,0.893471,-0.051061,-0.372874,0.012830,-0.220822,-0.408846,-0.398627,0.016945,-0.237058,-0.054910,-0.191146,-0.203272,-0.420448,0.0
3,-0.557334,-0.691682,-0.474652,-0.028877,-0.025157,0.012830,-0.220822,-0.329937,-0.305198,-0.276254,-0.299716,-0.446350,-0.192129,-0.280508,2.692123,0.0
4,-0.410067,-0.898229,-0.474652,-0.043469,-0.025157,0.012830,-0.220822,-0.436092,-0.432914,-0.445148,-0.301507,-0.446350,-0.192129,-0.312506,-0.420448,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984879,-0.557334,-0.489847,-0.474652,-0.034312,0.844135,0.012830,-0.220822,-0.202813,0.408678,-0.445148,-0.239432,-0.446350,-0.177665,-0.312499,2.036845,0.0
984880,-0.213711,-0.112294,0.893471,-0.082831,-0.372874,-0.114327,-0.220822,-0.378972,-0.255290,-0.272106,-0.031754,0.562656,0.122718,-0.172546,-0.748087,0.0
984881,-0.410067,-0.586804,-0.474652,-0.093218,-0.025157,-0.114327,-0.220822,-0.313277,-0.378857,-0.250460,0.075189,0.608353,-0.003951,-0.279148,-0.420448,0.0
984882,-0.459156,1.642950,-0.474652,0.044539,-0.025157,-0.114327,-0.220822,0.672041,-0.137291,-0.134321,-0.062874,-0.412055,-0.191708,-0.183640,-0.092809,0.0


In [37]:
# Test dataset: 30% of all rows, made up of every outlier plus the first shuffled non-outlier rows
no_rows_test = int((telco_customers_data.shape[0]+outliers.shape[0])*0.3)
test = outliers
test = pd.concat([test, telco_customers_data[:no_rows_test-outliers.shape[0]]]).reset_index()
test = test.drop(['level_0', 'index'], axis=1)

# Train dataset
train = telco_customers_data[no_rows_test-outliers.shape[0]:].reset_index()
train = train.drop(['index'], axis=1)
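
The sizes produced by this split can be checked by hand from the row counts reported in the notebook's outputs (1140604 rows in total, 155720 of them flagged as outliers). The variable names `total_clean`, `from_clean_to_test` and `n_train` are introduced only for this check.

```python
n_outliers = 155720                  # rows flagged as outliers
total_clean = 1140604 - n_outliers   # rows left after removing the outliers

# 30% of all rows go to the test set, and every outlier is placed there first.
no_rows_test = int((total_clean + n_outliers) * 0.3)
from_clean_to_test = no_rows_test - n_outliers
n_train = total_clean - from_clean_to_test

print(no_rows_test, from_clean_to_test, n_train)
```

The resulting train size agrees with the shape of the original train set printed by the SMOTE cell, and it shows that the train set contains no outliers at all while the test set is mostly outliers.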


In [38]:
# Using smote algorithm for over-sampling
sm = SMOTE(random_state = 2)
x_train, y_train = sm.fit_resample(train.drop([y], axis=1), train[y])

In [39]:
print('Shape of the original train: ', train.drop([y], axis=1).shape)
print('Shape of the smote train: ', x_train.shape)
print('Count of label 0.0, original train: ', train[y].value_counts()[0.0])
print('Count of label 0.0, smote train: ', y_train.value_counts()[0.0])
print('Count of label 1.0, original train: ', train[y].value_counts()[1.0])
print('Count of label 1.0, smote train: ', y_train.value_counts()[1.0])


Shape of the original train:  (798423, 15)
Shape of the smote train:  (1513438, 15)
Count of label 0.0, original train:  756719
Count of label 0.0, smote train:  756719
Count of label 1.0, original train:  41704
Count of label 1.0, smote train:  756719
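
The counts above show the minority class growing from 41704 to 756719 rows, so the added churn rows are synthetic. The core idea behind them can be sketched as a linear interpolation between a minority row and one of its minority neighbours. This is a simplified illustration with made-up points and a hypothetical helper `smote_like_sample`; the real `SMOTE` class also selects the neighbour among the k nearest minority rows, which is omitted here.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(2)     # fixed seed so the sketch is repeatable
churner_a = [0.2, -0.5]    # made-up standardized minority-class rows
churner_b = [0.6, 0.1]

synthetic = smote_like_sample(churner_a, churner_b, rng)
print(synthetic)
```

Because every synthetic row lies between two real minority rows, oversampling is applied to the train set only, and the test set keeps its original class balance.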


---
## **Load**

In [40]:
# Separating the target from the features
y_test = pd.DataFrame(test[y])
x_test = test.drop([y], axis=1)

# Restructuring the train target as a DataFrame
y_train = pd.DataFrame(y_train)

In [41]:
# Exporting the test and train dataframes
x_test.to_csv('../../../data/telecom_churn_me/test/x_test.csv', index=False)
y_test.to_csv('../../../data/telecom_churn_me/test/y_test.csv', index=False)

x_train.to_csv('../../../data/telecom_churn_me/train/x_train.csv', index=False)
y_train.to_csv('../../../data/telecom_churn_me/train/y_train.csv', index=False)

train.to_csv('../../../data/telecom_churn_me/original_train.csv', index=False)
telco_customers_data.to_csv('../../../data/telecom_churn_me/full_transformed_data.csv', index=False)