# **Telecom Churn Data | ETL**

*Authors:*
- *Myroslava Sánchez Andrade A01730712*
- *Karen Rugerio Armenta A01733228*
- *José Antonio Bobadilla García A01734433*
- *Alejandro Castro Reus A01731065*

*Creation date: 17/08/2022*

*Last updated: 11/09/2022*

---

## **Extract**
The dataset used for this project is **[telecom_churn_me.csv](https://www.kaggle.com/datasets/mark18vi/telecom-churn-data?resource=download)**, downloaded from the Kaggle platform.
<br>This dataset from a telecommunications company contains the customers' account information and whether each customer left within the last month.


In [97]:
# REQUIRED LIBRARIES
# !pip install pandas numpy matplotlib statsmodels scikit-learn scipy

In [98]:
# RUN ONLY FOR GOOGLE COLAB

# from google.colab import drive

# drive.mount("path")  

# %cd "path"

In [130]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
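from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
# IterativeImputer is experimental: this import enables it before sklearn.impute is used
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn import preprocessing, impute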



In [100]:
# Reading data via Pandas from CSV
telco_customers_data = pd.read_csv('../../../data/telecom_churn_me.csv')
telco_customers_data

Unnamed: 0.1,Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
0,0,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,931.208938,812.175000,...,35.850,34.015,72.075,141.840,56.115,11944.079102,0.0,0.0,0,25
1,1,Prestige,EMIRATI,United Arab Emirates,M,0,1994,2019,431.082618,486.500000,...,10.595,7.715,11.750,5.110,0.000,9903.157715,0.0,0.0,0,25
2,2,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,50.619644,52.815000,...,0.000,0.000,0.000,0.000,0.000,0.102539,0.0,0.0,0,25
3,3,Prestige,EMIRATI,United Arab Emirates,M,0,1994,2019,399.710034,422.235000,...,158.500,2.670,15.965,0.000,0.000,3600.322266,0.0,0.0,0,25
4,4,Residential,EMIRATI,United Arab Emirates,M,0,1994,2019,612.665844,825.888333,...,186.050,17.515,28.685,3.235,4.475,3852.026367,0.0,0.0,0,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1140610,Residential,EMIRATI,United Arab Emirates,M,0,2017,2019,297.752650,313.950000,...,0.000,0.000,0.000,0.000,0.000,307945.957031,0.0,0.0,0,2
1140600,1140611,Residential,YOUTH,United Arab Emirates,M,0,2017,2019,160.663773,178.500000,...,0.000,0.000,0.000,0.000,0.000,22647.873535,0.0,0.0,0,2
1140601,1140612,Consumer via Retailer,EXPATS,Comoros,M,0,2017,2019,570.147016,642.911667,...,64.990,3.660,10.050,0.000,0.000,17582.867188,0.0,0.0,0,2
1140602,1140613,Residential,EXPATS,Philippines,M,0,2017,2019,452.736799,525.413333,...,102.075,54.065,7.980,5.350,0.065,3015.338867,0.0,0.0,0,2


#### ***Verifying structure and content***

In [101]:
# Validating the information of each column => There are no null values in the whole DF, and several columns are of dtype object.
telco_customers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1140604 entries, 0 to 1140603
Data columns (total 28 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Unnamed: 0                 1140604 non-null  int64  
 1   PTY_PROFILE_SUB_TYPE       1140604 non-null  object 
 2   SOCIO_ECONOMIC_SEGMENT     1140604 non-null  object 
 3   PARTY_NATIONALITY          1140604 non-null  object 
 4   PARTY_GENDER_CD            1140604 non-null  object 
 5   TARGET                     1140604 non-null  int64  
 6   YEAR_JOINED                1140604 non-null  int64  
 7   CURRENT_YEAR               1140604 non-null  int64  
 8   BILL_AMOUNT                1140604 non-null  float64
 9   PAID_AMOUNT                1140604 non-null  float64
 10  PAYMENT_TRANSACTIONS       1140604 non-null  int64  
 11  PARTY_REV                  1140604 non-null  float64
 12  PREPAID_LINES              1140604 non-null  int64  
 13  POSTPAID_LIN

---
## **Transform**

### ***Column analysis***

#### *Column valuation*

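Dropping the columns that have less than 65% non-null values (that is, more than 35% null values)

In [102]:
# thresh is the minimum number of non-null values a column needs in order to be kept
telco_customers_data = telco_customers_data.dropna(thresh=int(telco_customers_data.shape[0] * 0.65), axis=1)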
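
To make the meaning of `thresh` concrete, here is a minimal sketch on a hypothetical toy DataFrame (the column names and values are illustrative assumptions, not part of the dataset). It shows that `thresh` counts non-null values, so a column is kept only when it has at least that many of them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows: columns with 10, 7 and 3 non-null values
toy = pd.DataFrame({
    "full": [1.0] * 10,
    "mostly_full": [1.0] * 7 + [np.nan] * 3,
    "mostly_missing": [1.0] * 3 + [np.nan] * 7,
})

# Minimum number of NON-null values required to keep a column
min_non_null = int(toy.shape[0] * 0.65)  # int(6.5) == 6

kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Only the column with 3 non-null values falls below the minimum of 6, so it is the only one removed.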

Storing the columns in which every value is unique (ID-like columns)

In [103]:
# Storing the number of unique values of each column
no_unique_values = telco_customers_data.nunique().to_frame()

# Storing the name of columns that are full of unique values (id)
drop_columns = no_unique_values[no_unique_values == telco_customers_data.shape[0]]
drop_columns = drop_columns.dropna()

# Keeping every matching column, not only the first one
drop_column_names = list(drop_columns.index)
drop_column_names

['Unnamed: 0']
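
The same idea can be sketched on a small example. The frame below is a hypothetical illustration (its column names are assumptions): a column whose number of distinct values equals the number of rows behaves like an identifier and carries no information for a model.

```python
import pandas as pd

# Hypothetical toy frame: "row_id" is unique per row, the other columns repeat values
toy = pd.DataFrame({
    "row_id": [10, 11, 12, 13],
    "segment": ["A", "B", "A", "A"],
    "bill": [5.0, 7.5, 5.0, 9.1],
})

# Number of distinct values per column
unique_counts = toy.nunique()

# Columns with as many distinct values as there are rows
id_like = list(unique_counts[unique_counts == len(toy)].index)
print(id_like)
```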

Storing the columns that are full (or almost full) of the same values

In [104]:
# Calculating the percentiles of each column
data_description = telco_customers_data.describe()
data_description = data_description.drop(['count', 'mean', 'std', 'min', 'max'], axis = 0)
data_description

Unnamed: 0.1,Unnamed: 0,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
25%,285154.75,0.0,2013.0,2019.0,174.137757,181.666667,1.0,423.187917,0.0,1.0,...,0.425,0.35,0.015,0.0,0.0,708.101562,0.0,0.0,0.0,2.0
50%,570306.5,0.0,2016.0,2019.0,290.72394,300.729167,1.0,834.713333,1.0,2.0,...,29.445,7.16,10.025,2.175,0.0,4394.218506,0.0,0.0,0.0,3.0
75%,855461.25,0.0,2017.0,2019.0,460.9771,476.423333,2.0,1553.675,3.0,3.0,...,141.895,22.035,39.26,54.09,1.71,9955.910278,0.0,0.0,0.0,6.0


In [105]:
# Storing the difference between consecutive percentiles (50% - 25% and 75% - 50%), column by column
data_description = data_description.diff()
data_description

Unnamed: 0.1,Unnamed: 0,TARGET,YEAR_JOINED,CURRENT_YEAR,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,...,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,DATA_OUT_BNDL,DATA_USG_PAYG,COMPLAINTS,Years_stayed
25%,,,,,,,,,,,...,,,,,,,,,,
50%,285151.75,0.0,3.0,0.0,116.586183,119.0625,0.0,411.525417,1.0,1.0,...,29.02,6.81,10.01,2.175,0.0,3686.116943,0.0,0.0,0.0,1.0
75%,285154.75,0.0,1.0,0.0,170.25316,175.694167,1.0,718.961667,2.0,1.0,...,112.45,14.875,29.235,51.915,1.71,5561.691772,0.0,0.0,0.0,3.0


In [106]:
# If both differences (50% - 25% and 75% - 50%) are 0, the three quartiles are equal, so the column is full (or almost full) of the same value
# pd.concat replaces DataFrame.append, which is deprecated
percentiles = pd.concat([data_description[1:2] == 0.0, data_description[2:3] == 0.0])

for col in data_description.columns:
    if not percentiles[col].all():
        percentiles = percentiles.drop(col, axis=1)

print(percentiles)
drop_column_names.extend(list(percentiles.columns))
drop_column_names

     TARGET  CURRENT_YEAR  DATA_OUT_BNDL  DATA_USG_PAYG  COMPLAINTS
50%    True          True           True           True        True
75%    True          True           True           True        True




['Unnamed: 0',
 'TARGET',
 'CURRENT_YEAR',
 'DATA_OUT_BNDL',
 'DATA_USG_PAYG',
 'COMPLAINTS']
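
The quartile check above can be illustrated on a hypothetical toy frame (names and values are assumptions). A column is flagged when its 25th, 50th and 75th percentiles coincide, which means at least half of its values are identical.

```python
import pandas as pd

# Hypothetical toy frame: one column dominated by a single value, one with spread
toy = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 0, 7],
    "varied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Quartiles of every column
quartiles = toy.quantile([0.25, 0.50, 0.75])

# Differences between consecutive quartiles: (Q2 - Q1) and (Q3 - Q2)
steps = quartiles.diff().iloc[1:]

# Columns where both differences are zero
constant_mask = (steps == 0).all()
near_constant = list(constant_mask[constant_mask].index)
print(near_constant)
```

This heuristic also flags a binary column whose minority class covers less than a quarter of the rows, which is consistent with `TARGET` appearing in the list above and having to be removed from it before dropping.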

Dropping the columns stored in the previous steps

In [107]:
# Removing the target from the columns to drop
y = 'TARGET'
drop_column_names.remove(y)

# Dropping the columns to drop
telco_customers_data = telco_customers_data.drop(drop_column_names, axis=1)
telco_customers_data

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,...,LINE_REV,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed
0,Residential,EMIRATI,United Arab Emirates,M,0,1994,931.208938,812.175000,1,5968.700000,...,945.040000,ACTIVE,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25
1,Prestige,EMIRATI,United Arab Emirates,M,0,1994,431.082618,486.500000,1,6245.141667,...,493.815000,ACTIVE,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25
2,Residential,EMIRATI,United Arab Emirates,M,0,1994,50.619644,52.815000,1,1666.488333,...,50.300000,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25
3,Prestige,EMIRATI,United Arab Emirates,M,0,1994,399.710034,422.235000,1,2522.008333,...,406.586667,ACTIVE,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25
4,Residential,EMIRATI,United Arab Emirates,M,0,1994,612.665844,825.888333,1,1219.961667,...,751.185000,ACTIVE,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,Residential,EMIRATI,United Arab Emirates,M,0,2017,297.752650,313.950000,1,2418.486667,...,303.166667,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2
1140600,Residential,YOUTH,United Arab Emirates,M,0,2017,160.663773,178.500000,1,454.116667,...,170.000000,ACTIVE,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2
1140601,Consumer via Retailer,EXPATS,Comoros,M,0,2017,570.147016,642.911667,1,615.866667,...,609.630000,ACTIVE,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2
1140602,Residential,EXPATS,Philippines,M,0,2017,452.736799,525.413333,2,735.645000,...,414.840000,ACTIVE,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2


#### *Label encoding*

In [108]:
# Identifying the categorical columns
no_unique_values = telco_customers_data.drop(y, axis=1).nunique().to_frame()
categorical_columns = no_unique_values[no_unique_values <= telco_customers_data.shape[0]*0.02]
categorical_columns = categorical_columns.dropna()
categorical_columns = list(categorical_columns.index)
categorical_columns

['PTY_PROFILE_SUB_TYPE',
 'SOCIO_ECONOMIC_SEGMENT',
 'PARTY_NATIONALITY',
 'PARTY_GENDER_CD',
 'YEAR_JOINED',
 'PAYMENT_TRANSACTIONS',
 'PREPAID_LINES',
 'POSTPAID_LINES',
 'OTHER_LINES',
 'STATUS',
 'Years_stayed']

In [109]:
# Creating a label encoder object
le = preprocessing.LabelEncoder()
# Categorizing the column values with numbers
categorical_columns = telco_customers_data[categorical_columns].apply(le.fit_transform)
categorical_columns

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,YEAR_JOINED,PAYMENT_TRANSACTIONS,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,Years_stayed
0,2,0,179,1,0,1,2,5,2,0,25
1,1,0,179,1,0,1,6,3,2,0,25
2,2,0,179,1,0,1,2,2,1,0,25
3,1,0,179,1,0,1,3,3,3,0,25
4,2,0,179,1,0,1,0,1,1,0,25
...,...,...,...,...,...,...,...,...,...,...,...
1140599,2,0,179,1,23,1,5,3,3,0,2
1140600,2,2,179,1,23,1,0,0,1,0,2
1140601,0,1,39,1,23,1,1,0,0,0,2
1140602,2,1,136,1,23,2,1,1,0,0,2
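
The cells above assign one integer per category with `LabelEncoder`, which is label encoding; one-hot encoding would instead create one indicator column per category. The sketch below contrasts the two on a hypothetical toy column. As an assumption for brevity, pandas category codes stand in for `LabelEncoder`; both number the categories in sorted order.

```python
import pandas as pd

# Hypothetical toy column using segment names that appear in the dataset
toy = pd.DataFrame({"segment": ["EXPATS", "EMIRATI", "YOUTH", "EXPATS"]})

# Label encoding: one integer per category, assigned in sorted order
label_encoded = toy["segment"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["segment"], prefix="segment")

print(label_encoded.tolist())
print(list(one_hot.columns))
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of extra columns (a relevant trade-off for a column such as `PARTY_NATIONALITY`, which has many distinct values).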


In [110]:
# Substituting the encoded categorical columns in the dataset
telco_customers_data[categorical_columns.columns] = categorical_columns
telco_customers_data

Unnamed: 0,PTY_PROFILE_SUB_TYPE,SOCIO_ECONOMIC_SEGMENT,PARTY_NATIONALITY,PARTY_GENDER_CD,TARGET,YEAR_JOINED,BILL_AMOUNT,PAID_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,...,LINE_REV,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed
0,2,0,179,1,0,0,931.208938,812.175000,1,5968.700000,...,945.040000,0,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25
1,1,0,179,1,0,0,431.082618,486.500000,1,6245.141667,...,493.815000,0,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25
2,2,0,179,1,0,0,50.619644,52.815000,1,1666.488333,...,50.300000,0,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25
3,1,0,179,1,0,0,399.710034,422.235000,1,2522.008333,...,406.586667,0,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25
4,2,0,179,1,0,0,612.665844,825.888333,1,1219.961667,...,751.185000,0,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,2,0,179,1,0,23,297.752650,313.950000,1,2418.486667,...,303.166667,0,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2
1140600,2,2,179,1,0,23,160.663773,178.500000,1,454.116667,...,170.000000,0,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2
1140601,0,1,39,1,0,23,570.147016,642.911667,1,615.866667,...,609.630000,0,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2
1140602,2,1,136,1,0,23,452.736799,525.413333,2,735.645000,...,414.840000,0,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2


#### *Multicollinearity*

In [111]:
# VIF dataframe
vif_data = pd.DataFrame()

# Removing the dependent variable
x_variables = telco_customers_data.drop(y, axis=1)
vif_data["x variables"] = x_variables.columns
vif_data

Unnamed: 0,x variables
0,PTY_PROFILE_SUB_TYPE
1,SOCIO_ECONOMIC_SEGMENT
2,PARTY_NATIONALITY
3,PARTY_GENDER_CD
4,YEAR_JOINED
5,BILL_AMOUNT
6,PAID_AMOUNT
7,PAYMENT_TRANSACTIONS
8,PARTY_REV
9,PREPAID_LINES


In [112]:
# Calculating the VIF of the columns and iteratively dropping the column with the highest VIF
def calculate_vif(vif_data, x_variables):
    vif_data['VIF'] = [variance_inflation_factor(x_variables.values, i) for i in range(len(x_variables.columns))]
    while vif_data['VIF'].max() > 5:
        # Locating the column with the highest VIF
        max_index = vif_data['VIF'].idxmax()
        delete_column = vif_data.loc[max_index, 'x variables']
        # Dropping it from both the features and the VIF table
        x_variables = x_variables.drop(columns=[delete_column])
        vif_data = vif_data.drop(index=max_index).reset_index(drop=True)
        # Recalculating the VIF with the remaining columns
        vif_data['VIF'] = [variance_inflation_factor(x_variables.values, i) for i in range(len(x_variables.columns))]
    return vif_data


# Storing the columns with no multicollinearity
filtered_vif_data = calculate_vif(vif_data, x_variables)
filtered_vif_data

Unnamed: 0,x variables,VIF
0,SOCIO_ECONOMIC_SEGMENT,2.467935
1,PARTY_GENDER_CD,3.912827
2,BILL_AMOUNT,2.470802
3,PAYMENT_TRANSACTIONS,3.629384
4,PARTY_REV,2.527949
5,PREPAID_LINES,2.179423
6,POSTPAID_LINES,3.19945
7,OTHER_LINES,3.724654
8,STATUS,1.083224
9,MOUS_TO_LOCAL_MOBILES,1.366994


Multicollinearity occurs when two or more independent variables are highly correlated with each other. Because it can make the model's estimates unreliable, these variables must be detected and discarded.

For the detection of multicollinearity, the **Variance Inflation Factor (VIF)** technique was used. This method regresses each independent variable against all the others. The VIF is calculated as $VIF = {1\over 1 - R^2}$, where $R^2$ is the coefficient of determination of that regression. A higher VIF indicates stronger collinearity. Generally, a VIF above 5 indicates high multicollinearity.
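
To show where the formula comes from, the following sketch computes the VIF by hand with NumPy on synthetic data. The helper `vif` and the random columns are illustrative assumptions. The sketch fits an intercept in each regression, so its numbers need not match those of `variance_inflation_factor` applied to a matrix without a constant column.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic data: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print(vifs)
```

The two nearly identical columns get a VIF far above 5, while the independent one stays close to 1.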

In [113]:
filtered_columns = list(filtered_vif_data['x variables'])
filtered_columns.append(y)

telco_customers_data = telco_customers_data[filtered_columns]
telco_customers_data

Unnamed: 0,SOCIO_ECONOMIC_SEGMENT,PARTY_GENDER_CD,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,0,1,931.208938,1,5968.700000,2,5,2,0,1004.070,35.850,34.015,72.075,141.840,56.115,11944.079102,25,0
1,0,1,431.082618,1,6245.141667,6,3,2,0,159.050,10.595,7.715,11.750,5.110,0.000,9903.157715,25,0
2,0,1,50.619644,1,1666.488333,2,2,1,0,0.000,0.000,0.000,0.000,0.000,0.000,0.102539,25,0
3,0,1,399.710034,1,2522.008333,3,3,3,0,288.805,158.500,2.670,15.965,0.000,0.000,3600.322266,25,0
4,0,1,612.665844,1,1219.961667,0,1,1,0,209.760,186.050,17.515,28.685,3.235,4.475,3852.026367,25,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,0,1,297.752650,1,2418.486667,5,3,3,0,0.000,0.000,0.000,0.000,0.000,0.000,307945.957031,2,0
1140600,2,1,160.663773,1,454.116667,0,0,1,0,0.000,0.000,0.000,0.000,0.000,0.000,22647.873535,2,0
1140601,1,1,570.147016,1,615.866667,1,0,0,0,154.150,64.990,3.660,10.050,0.000,0.000,17582.867188,2,0
1140602,1,1,452.736799,2,735.645000,1,1,0,0,218.805,102.075,54.065,7.980,5.350,0.065,3015.338867,2,0


#### *Standardization*

In [114]:
# Calculating the absolute z-score of every feature (the target is added back unchanged)
z = np.abs(stats.zscore(telco_customers_data.drop(y, axis=1)))
z[y] = telco_customers_data[y]

# Updating the main variable
telco_customers_data = z
telco_customers_data

Unnamed: 0,SOCIO_ECONOMIC_SEGMENT,PARTY_GENDER_CD,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,1.272502,0.490663,1.487755,0.474766,0.220896,0.020694,1.177967,0.615191,0.272202,0.677560,0.312993,0.436653,0.299032,0.792977,1.383933,0.043267,3.183582,0
1,1.272502,0.490663,0.134979,0.474766,0.235944,1.051171,0.530613,0.615191,0.272202,0.259684,0.397473,0.245145,0.203604,0.401701,0.192129,0.017525,3.183582,0
2,1.272502,0.490663,0.894123,0.474766,0.013300,0.020694,0.206937,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.312503,3.183582,0
3,1.272502,0.490663,0.050120,0.474766,0.033271,0.247272,0.530613,1.095280,0.272202,0.115768,0.097283,0.375931,0.168484,0.446350,0.192129,0.205265,3.183582,0
4,1.272502,0.490663,0.626138,0.474766,0.037607,0.556627,0.116740,0.135102,0.272202,0.203440,0.189441,0.008909,0.062499,0.418084,0.066443,0.197768,3.183582,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140599,1.272502,0.490663,0.225661,0.474766,0.027636,0.783205,0.530613,1.095280,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,8.860141,0.584268,0
1140600,1.855048,0.490663,0.596468,0.474766,0.079297,0.556627,0.440417,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.362096,0.584268,0
1140601,0.291273,0.490663,0.511130,0.474766,0.070492,0.288660,0.440417,0.344986,0.272202,0.265119,0.215517,0.350267,0.217769,0.446350,0.192129,0.211227,0.584268,0
1140602,0.291273,0.490663,0.193551,0.893708,0.063971,0.288660,0.116740,0.344986,0.272202,0.193407,0.091464,0.956427,0.235017,0.399604,0.190303,0.222690,0.584268,0


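The **z-score** (standard score) is a popular way to standardize data: it expresses how many standard deviations a data point lies from the mean. <br>The formula to standardize the data is: $z = {x - \mu \over \sigma}$, where $x$ is the data point, $\mu$ is the mean and $\sigma$ is the standard deviation of the column. Note that the cell above stores the absolute value of each z-score, so the sign (whether the point lies below or above the mean) is discarded.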
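
As a worked example of the formula, the sketch below standardizes a small hypothetical array by hand. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical values with mean 5 and population standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Signed z-scores: negative below the mean, positive above it
z_scores = (values - mean) / std

# Absolute z-scores: distance from the mean only
abs_z_scores = np.abs(z_scores)
print(z_scores)
```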

#### *Outlier management*

In [115]:
# Defining the threshold (number of standard deviations)
threshold = 3

# Positions of the rows that contain at least one value above the threshold
row_postion_outlier = np.where(z > threshold)[0]
row_postion_outlier = np.unique(row_postion_outlier)
len(row_postion_outlier)

203842

In [116]:
# Storing the outliers for the test
outliers = telco_customers_data.iloc[row_postion_outlier]
outliers

Unnamed: 0,SOCIO_ECONOMIC_SEGMENT,PARTY_GENDER_CD,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,1.272502,0.490663,1.487755,0.474766,0.220896,0.020694,1.177967,0.615191,0.272202,0.677560,0.312993,0.436653,0.299032,0.792977,1.383933,0.043267,3.183582,0
1,1.272502,0.490663,0.134979,0.474766,0.235944,1.051171,0.530613,0.615191,0.272202,0.259684,0.397473,0.245145,0.203604,0.401701,0.192129,0.017525,3.183582,0
2,1.272502,0.490663,0.894123,0.474766,0.013300,0.020694,0.206937,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.312503,3.183582,0
3,1.272502,0.490663,0.050120,0.474766,0.033271,0.247272,0.530613,1.095280,0.272202,0.115768,0.097283,0.375931,0.168484,0.446350,0.192129,0.205265,3.183582,0
4,1.272502,0.490663,0.626138,0.474766,0.037607,0.556627,0.116740,0.135102,0.272202,0.203440,0.189441,0.008909,0.062499,0.418084,0.066443,0.197768,3.183582,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1140594,1.272502,0.490663,0.520234,0.474766,0.076665,3.998801,1.501643,0.135102,0.272202,0.082821,0.732316,0.209889,0.003341,0.446350,0.192129,0.253995,0.584268,0
1140595,0.291273,1.969370,0.200308,6.367603,0.061388,0.556627,0.206937,0.344986,0.272202,0.355913,0.432914,0.219221,0.301507,0.446350,0.192129,0.274189,0.584268,0
1140598,0.291273,0.490663,0.161310,0.474766,0.056755,1.051171,0.116740,0.135102,3.562971,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.312506,0.584268,0
1140599,1.272502,0.490663,0.225661,0.474766,0.027636,0.783205,0.530613,1.095280,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,8.860141,0.584268,0


In [143]:
# Removing the outliers from the main dataset and rebuilding a consecutive index
telco_customers_data = telco_customers_data.drop(row_postion_outlier).reset_index(drop=True)
telco_customers_data

Unnamed: 0,SOCIO_ECONOMIC_SEGMENT,PARTY_GENDER_CD,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,0.291273,0.490663,0.529087,0.893708,0.061609,0.020694,0.440417,0.135102,0.272202,0.113999,0.246592,0.321422,0.442179,0.409347,0.030212,0.278056,2.855943,1
1,0.291273,0.490663,0.900673,0.474766,0.097808,0.020694,0.440417,0.344986,0.272202,0.424324,0.390800,0.445148,0.283426,0.446350,0.192129,0.312505,2.855943,1
2,0.291273,0.490663,0.030584,0.474766,0.071117,0.556627,0.116740,0.344986,0.272202,0.354088,0.423732,0.429594,0.074481,0.049668,1.035663,0.299820,2.855943,1
3,0.291273,0.490663,0.580204,0.474766,0.062318,0.020694,0.116740,0.135102,0.272202,0.084254,0.300365,0.101656,0.290675,0.446350,0.191848,0.296438,2.855943,1
4,0.291273,0.490663,0.976799,0.474766,0.062318,0.020694,0.116740,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.312506,2.855943,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
936757,1.855048,1.969370,0.308659,0.474766,0.089308,0.556627,0.440417,0.344986,0.272202,0.422566,0.399330,0.222073,0.294300,1.337323,0.192129,0.006076,0.584268,0
936758,0.291273,0.490663,0.466712,2.262182,0.032082,0.288660,0.206937,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.095889,0.192129,0.312506,0.584268,0
936759,1.855048,0.490663,0.596468,0.474766,0.079297,0.556627,0.440417,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.362096,0.584268,0
936760,0.291273,0.490663,0.511130,0.474766,0.070492,0.288660,0.440417,0.344986,0.272202,0.265119,0.215517,0.350267,0.217769,0.446350,0.192129,0.211227,0.584268,0


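Once the z-scores are available, outliers can be identified with a threshold. The threshold is the number of standard deviations from the mean beyond which a data point is treated as an outlier. A value of 3 is the usual choice because, for normally distributed (Gaussian) data, about 99.7% of the points lie within 3 standard deviations of the mean (the empirical 68-95-99.7 rule). Here, a row is removed if any of its features exceeds the threshold, which discards 203,842 of the 1,140,604 rows.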
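
The thresholding step can be sketched end to end on synthetic data. The frame below and its planted extreme value are assumptions made for illustration; a row is flagged as soon as any one of its columns exceeds the threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data with one planted extreme value in row 0
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "usage": rng.normal(100, 10, size=1000),
    "bill": rng.normal(50, 5, size=1000),
})
toy.loc[0, "usage"] = 500.0

# Absolute z-scores per column (population standard deviation)
abs_z = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# Row positions where at least one column exceeds the threshold
threshold = 3
outlier_rows = np.unique(np.where((abs_z > threshold).to_numpy())[0])

# Dropping those rows and rebuilding a consecutive index
cleaned = toy.drop(index=outlier_rows).reset_index(drop=True)
print(len(toy), len(outlier_rows), len(cleaned))
```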

### ***Row analysis***

#### *Row valuation*

Identifying the rows (clients) that have more than 75% null values

In [144]:
# Calculating the total cells per client (one cell per column)
cells_per_client = len(telco_customers_data.columns)

# Calculating the total non null cells per client
non_null_cells_per_client = telco_customers_data.count(axis=1)
non_null_cells_per_client


0         18
1         18
2         18
3         18
4         18
          ..
936757    18
936758    18
936759    18
936760    18
936761    18
Length: 936762, dtype: int64

In [145]:
# Calculating the total non null cells per client in terms of percentages
percentages_non_null_cells_per_clients = (non_null_cells_per_client * 100) / (cells_per_client)
percentages_non_null_cells_per_clients

0         100.0
1         100.0
2         100.0
3         100.0
4         100.0
          ...  
936757    100.0
936758    100.0
936759    100.0
936760    100.0
936761    100.0
Length: 936762, dtype: float64

In [146]:
# Obtaining the clients that have more than 75% of null values
clients_to_drop = percentages_non_null_cells_per_clients < 25
clients_to_drop = clients_to_drop[clients_to_drop]
# There were actually no clients that had that percentage of null values
clients_to_drop

Series([], dtype: bool)
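
Since no row in this dataset meets the condition, the effect of the filter is easier to see on a hypothetical toy frame (values are assumptions) in which one row is mostly empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has only 1 of 5 cells filled (20% non-null)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
    "d": [1.0, np.nan, 3.0],
    "e": [1.0, 5.0, 3.0],
})

# Percentage of non-null cells per row
non_null_pct = toy.count(axis=1) * 100 / toy.shape[1]

# Rows with less than 25% non-null cells (more than 75% null)
rows_to_drop = non_null_pct[non_null_pct < 25].index
cleaned = toy.drop(index=rows_to_drop)
print(list(rows_to_drop))
```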

### ***Imputation***

In [147]:
# Using the iterative imputer to estimate missing values
imp_mean = impute.IterativeImputer()

In [148]:
# Imputing the missing values of the dataset
imputed_data = imp_mean.fit_transform(telco_customers_data)
imputed_data

array([[0.29127291, 0.49066344, 0.52908672, ..., 0.27805604, 2.85594335,
        1.        ],
       [0.29127291, 0.49066344, 0.90067276, ..., 0.31250532, 2.85594335,
        1.        ],
       [0.29127291, 0.49066344, 0.03058398, ..., 0.29981977, 2.85594335,
        1.        ],
       ...,
       [1.8550475 , 0.49066344, 0.59646837, ..., 0.36209577, 0.58426765,
        0.        ],
       [0.29127291, 0.49066344, 0.51112981, ..., 0.21122672, 0.58426765,
        0.        ],
       [0.29127291, 0.49066344, 0.19355065, ..., 0.22268966, 0.58426765,
        0.        ]])

In [149]:
# Since the imputation returns an array, we convert it back to a DF
telco_customers_data = pd.DataFrame(imputed_data, columns=list(telco_customers_data.columns))
telco_customers_data

Unnamed: 0,SOCIO_ECONOMIC_SEGMENT,PARTY_GENDER_CD,BILL_AMOUNT,PAYMENT_TRANSACTIONS,PARTY_REV,PREPAID_LINES,POSTPAID_LINES,OTHER_LINES,STATUS,MOUS_TO_LOCAL_MOBILES,MOUS_FROM_LOCAL_MOBILES,MOUS_TO_LOCAL_LANDLINES,MOUS_FROM_LOCAL_LANDLINES,MOUS_TO_INT_NUMBER,MOUS_FROM_INT_NUMBER,DATA_IN_BNDL,Years_stayed,TARGET
0,0.291273,0.490663,0.529087,0.893708,0.061609,0.020694,0.440417,0.135102,0.272202,0.113999,0.246592,0.321422,0.442179,0.409347,0.030212,0.278056,2.855943,1.0
1,0.291273,0.490663,0.900673,0.474766,0.097808,0.020694,0.440417,0.344986,0.272202,0.424324,0.390800,0.445148,0.283426,0.446350,0.192129,0.312505,2.855943,1.0
2,0.291273,0.490663,0.030584,0.474766,0.071117,0.556627,0.116740,0.344986,0.272202,0.354088,0.423732,0.429594,0.074481,0.049668,1.035663,0.299820,2.855943,1.0
3,0.291273,0.490663,0.580204,0.474766,0.062318,0.020694,0.116740,0.135102,0.272202,0.084254,0.300365,0.101656,0.290675,0.446350,0.191848,0.296438,2.855943,1.0
4,0.291273,0.490663,0.976799,0.474766,0.062318,0.020694,0.116740,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.312506,2.855943,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
936757,1.855048,1.969370,0.308659,0.474766,0.089308,0.556627,0.440417,0.344986,0.272202,0.422566,0.399330,0.222073,0.294300,1.337323,0.192129,0.006076,0.584268,0.0
936758,0.291273,0.490663,0.466712,2.262182,0.032082,0.288660,0.206937,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.095889,0.192129,0.312506,0.584268,0.0
936759,1.855048,0.490663,0.596468,0.474766,0.079297,0.556627,0.440417,0.135102,0.272202,0.436092,0.432914,0.445148,0.301507,0.446350,0.192129,0.362096,0.584268,0.0
936760,0.291273,0.490663,0.511130,0.474766,0.070492,0.288660,0.440417,0.344986,0.272202,0.265119,0.215517,0.350267,0.217769,0.446350,0.192129,0.211227,0.584268,0.0
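
Because `info()` reported no null values, the imputer has nothing to fill in this dataset. The sketch below shows what it does when a value is missing, using a hypothetical toy frame (an assumption for illustration): each column with gaps is modelled from the other columns and the gaps are replaced by the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the next import)
from sklearn.impute import IterativeImputer

# Hypothetical toy frame where y is roughly 2 * x and one y value is missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
})

# Estimating the missing value from the other column
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Observed values are left as they are; only the missing cell is replaced by an estimate.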


### ***Data preparation***

In [160]:
# Splitting the data into 3 contiguous folds for k-fold cross-validation
no_rows_folds = telco_customers_data.shape[0] // 3
fold_1 = telco_customers_data[:no_rows_folds]
fold_2 = telco_customers_data[no_rows_folds:no_rows_folds*2]
fold_3 = telco_customers_data[no_rows_folds*2:]

A k-fold cross-validation scheme will be used for training. The dataset is split into 3 parts (folds), and during training these parts are rotated so that each one is used once for validation while the other two are used for training.
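
The rotation of the folds can be sketched as follows, on a hypothetical toy frame (an assumption for illustration) split the same way as above. In each round one fold is held out for validation and the remaining two are concatenated for training.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 9 rows
toy = pd.DataFrame({"feature": np.arange(9), "TARGET": [0, 1] * 4 + [0]})

# Three contiguous folds, as in the cell above
fold_size = len(toy) // 3
folds = [toy[:fold_size], toy[fold_size:fold_size * 2], toy[fold_size * 2:]]

# Rotating the folds: each one is the validation set exactly once
splits = []
for i, validation in enumerate(folds):
    training = pd.concat([fold for j, fold in enumerate(folds) if j != i])
    splits.append((training, validation))
    print(f"round {i}: train={len(training)} rows, validation={len(validation)} rows")
```

Contiguous slices keep the original row order inside each fold. If that order is meaningful, shuffling the rows before splitting (for example with `DataFrame.sample(frac=1, random_state=0)`) is a common option.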

---
## **Load**

In [161]:
# Exporting the folds of the dataframe (transformations applied)
fold_1.to_csv("../../../data/fold_1.csv")
fold_2.to_csv("../../../data/fold_2.csv")
fold_3.to_csv("../../../data/fold_3.csv")