# Alura Challenge - Data Science - Week 1

Guilherme Lupinari Volpato
E-mail: lv.gui97@gmail.com
Github: https://github.com/LupiVolpi

### *Week 1*
- You have been hired as a data scientist by the *telecom operator Alura Voz*. In the initial meeting with the people responsible for the company's *sales area*, they explained the importance of reducing the Customer Evasion Rate, known as the *Churn Rate*. Basically, the Churn Rate indicates how much revenue or how many customers the company lost in a given period of time.


### *Challenges:*
- Understand the dataset information
- Analyse data types
- Check for inconsistencies in the data
- Fix data inconsistencies
- Create daily account columns


### *Data index:*
- *customerID:* each customer's unique identification number.
- *Churn:* whether the client has left the company or not.
- *gender:* Male or Female (according to the database).
- *SeniorCitizen:* whether a client is 65 years of age or older.
- *Partner:* whether the client is partnered or not.
- *Dependents:* whether the client has got dependents or not.
- *tenure:* duration (in months) of the client's contract with the company.
- *PhoneService:* whether the client has hired the company's phone service.
- *MultipleLines:* whether the client has hired more than one phone line.
- *InternetService:* whether the client has hired a provider of internet.
- *OnlineSecurity:* whether the client has hired an additional online security membership.
- *OnlineBackup:* whether the client has hired an additional online backup membership.
- *DeviceProtection:* whether the client has hired an additional device protection membership.
- *TechSupport:* whether the client has hired an additional technical support membership (with decreased waiting time for services).
- *StreamingTV:* whether the client has hired the cable TV service.
- *StreamingMovies:* whether the client has hired a movie streaming membership.
- *Contract:* the type of the client's contract.
- *PaperlessBilling:* whether the client prefers to receive their bills online.
- *PaymentMethod:* the client's preferred method of payment.
- *Charges.Monthly:* the monthly sum of the client's hired services and memberships.
- *Charges.Total:* the total sum of the client's hired services and memberships.

---
# Importing libraries and setting preferences

In [1]:
import pandas as pd

pd.set_option("display.max_rows", 100) # Pandas will display 100 DataFrame rows at most.
pd.set_option("display.max_columns", None) # Pandas will display every DataFrame column instead of truncating them.
pd.set_option("display.max_colwidth", None) # Pandas will display all the information in each column, regardless of how large the values are.

---
# Understanding the dataset

##### To understand the layout of the dataset before actually loading it, I used the Online JSON Viewer website, which can be found [here](http://jsonviewer.stack.hu/)
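The same kind of layout inspection can also be sketched in plain Python. The record below is a hypothetical one that only mirrors the nesting used in this dataset (the ID and the helper name `describe_layout` are made up for illustration).

```python
# Hypothetical record mirroring the nested layout of the dataset.
sample_record = {
    "customerID": "0000-AAAAA",
    "Churn": "No",
    "customer": {"gender": "Female", "SeniorCitizen": 0, "tenure": 9},
    "account": {"Contract": "One year", "Charges": {"Monthly": 65.6, "Total": "593.3"}},
}


def describe_layout(node, depth=0):
    """Return one indented line per key, with the Python type of each leaf value."""
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            # Nested dictionary: print the key and recurse one level deeper.
            lines.append(f"{'  ' * depth}{key}:")
            lines.extend(describe_layout(value, depth + 1))
        else:
            lines.append(f"{'  ' * depth}{key}: {type(value).__name__}")
    return lines


layout = describe_layout(sample_record)
print("\n".join(layout))
```

A listing like this makes it easy to spot leaves whose type looks suspicious, such as a total stored as `str` next to a monthly charge stored as `float`.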

---
# Loading the dataset.

In [2]:
data_url = "https://raw.githubusercontent.com/sthemonica/alura-voz/main/Dados/Telco-Customer-Churn.json"

data = pd.read_json(path_or_buf = data_url)

data.head(10)

Unnamed: 0,customerID,Churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'Yes', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'Yes', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'No'}","{'Contract': 'One year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 65.6, 'Total': '593.3'}}"
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'No', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'No', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 59.9, 'Total': '542.4'}}"
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'No', 'tenure': 4}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'Yes', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 73.9, 'Total': '280.85'}}"
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 13}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'No', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 98.0, 'Total': '1237.85'}}"
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 3}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 83.9, 'Total': '267.4'}}"
5,0013-MHZWF,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'Yes', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Credit card (automatic)', 'Charges': {'Monthly': 69.4, 'Total': '571.45'}}"
6,0013-SMEOE,No,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 71}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Bank transfer (automatic)', 'Charges': {'Monthly': 109.7, 'Total': '7904.25'}}"
7,0014-BMAQU,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 63}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Credit card (automatic)', 'Charges': {'Monthly': 84.65, 'Total': '5377.8'}}"
8,0015-UOCOJ,No,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'No', 'Dependents': 'No', 'tenure': 7}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 48.2, 'Total': '340.35'}}"
9,0016-QLJIS,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'Yes', 'tenure': 65}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 90.45, 'Total': '5957.9'}}"


In [3]:
# Standardizing the "customerID" and "Churn" column names.

data = data.rename({"customerID": "customer_id", "Churn": "churn"}, axis = 1)

data.head(10)

Unnamed: 0,customer_id,churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'Yes', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'Yes', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'No'}","{'Contract': 'One year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 65.6, 'Total': '593.3'}}"
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'No', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'No', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 59.9, 'Total': '542.4'}}"
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'No', 'tenure': 4}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'Yes', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 73.9, 'Total': '280.85'}}"
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 13}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'No', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 98.0, 'Total': '1237.85'}}"
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 3}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 83.9, 'Total': '267.4'}}"
5,0013-MHZWF,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'No', 'Dependents': 'Yes', 'tenure': 9}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'No', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Credit card (automatic)', 'Charges': {'Monthly': 69.4, 'Total': '571.45'}}"
6,0013-SMEOE,No,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 71}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Bank transfer (automatic)', 'Charges': {'Monthly': 109.7, 'Total': '7904.25'}}"
7,0014-BMAQU,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'No', 'tenure': 63}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'Fiber optic', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'Yes', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Credit card (automatic)', 'Charges': {'Monthly': 84.65, 'Total': '5377.8'}}"
8,0015-UOCOJ,No,"{'gender': 'Female', 'SeniorCitizen': 1, 'Partner': 'No', 'Dependents': 'No', 'tenure': 7}","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'No', 'DeviceProtection': 'No', 'TechSupport': 'No', 'StreamingTV': 'No', 'StreamingMovies': 'No'}","{'Contract': 'Month-to-month', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Electronic check', 'Charges': {'Monthly': 48.2, 'Total': '340.35'}}"
9,0016-QLJIS,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Partner': 'Yes', 'Dependents': 'Yes', 'tenure': 65}","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': 'Yes', 'OnlineBackup': 'Yes', 'DeviceProtection': 'Yes', 'TechSupport': 'Yes', 'StreamingTV': 'Yes', 'StreamingMovies': 'Yes'}","{'Contract': 'Two year', 'PaperlessBilling': 'Yes', 'PaymentMethod': 'Mailed check', 'Charges': {'Monthly': 90.45, 'Total': '5957.9'}}"


---
# Normalizing the DataFrame

##### As I had seen that the "customer", "phone", "internet" and "account" columns all had more information in dictionaries, my next step was to extract this data into separate columns for 2 reasons:
##### 1. to improve visibility and readability of the DataFrame and;
##### 2. to allow for operations and analysis to be made with the data, which would have been impossible in the form of dictionaries.

In [4]:
# As we check the data entries for all these categories, we see that they really are dictionaries.
# Because of that, we can extract the data inside each of them with the pd.json_normalize() function.

data["customer"].apply(lambda value: type(value) == dict).value_counts(normalize = True) * 100

True    100.0
Name: customer, dtype: float64

In [5]:
data["phone"].apply(lambda value: type(value) == dict).value_counts(normalize = True) * 100

True    100.0
Name: phone, dtype: float64

In [6]:
data["internet"].apply(lambda value: type(value) == dict).value_counts(normalize = True) * 100

True    100.0
Name: internet, dtype: float64

In [7]:
data["account"].apply(lambda value: type(value) == dict).value_counts(normalize = True) * 100

True    100.0
Name: account, dtype: float64
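The four checks above repeat the same expression, so they could also be run in a single pass. This is a minimal sketch on a toy DataFrame (the values are illustrative, not taken from the dataset) that uses `isinstance` instead of comparing types directly.

```python
import pandas as pd

# Toy stand-in for the loaded DataFrame; the values are illustrative only.
toy = pd.DataFrame({
    "customer": [{"gender": "Female"}, {"gender": "Male"}],
    "phone": [{"PhoneService": "Yes"}, {"PhoneService": "No"}],
})

nested_columns = ["customer", "phone"]

# Percentage of entries in each column that are dictionaries.
dict_share = {
    column: toy[column].map(lambda value: isinstance(value, dict)).mean() * 100
    for column in nested_columns
}
print(dict_share)
```

With the real data, `nested_columns` would list "customer", "phone", "internet" and "account".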

### <font color = "Blue"> Information under "customer".

In [8]:
data_customer = pd.json_normalize(data = data["customer"], sep = "_")

data_customer.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure'], dtype='object')

In [9]:
# Standardizing the columns' names.

data_customer = data_customer.rename({"SeniorCitizen": 'senior_citizen', "Partner": 'partner', "Dependents": 'dependents'}, axis = 1)

data_customer

Unnamed: 0,gender,senior_citizen,partner,dependents,tenure
0,Female,0,Yes,Yes,9
1,Male,0,No,No,9
2,Male,0,No,No,4
3,Male,1,Yes,No,13
4,Female,1,Yes,No,3
...,...,...,...,...,...
7262,Female,0,No,No,13
7263,Male,0,Yes,No,22
7264,Male,0,No,No,2
7265,Male,0,Yes,Yes,67


### <font color = "Blue"> Information under "phone".

In [10]:
data_phone = pd.json_normalize(data = data["phone"], sep = "_")

data_phone.columns

Index(['PhoneService', 'MultipleLines'], dtype='object')

In [11]:
# Standardizing the columns' names.

data_phone = data_phone.rename({"PhoneService": "phone_service", "MultipleLines": "multiple_lines"}, axis = 1)

data_phone

Unnamed: 0,phone_service,multiple_lines
0,Yes,No
1,Yes,Yes
2,Yes,No
3,Yes,No
4,Yes,No
...,...,...
7262,Yes,No
7263,Yes,Yes
7264,Yes,No
7265,Yes,No


### <font color = "Blue"> Information under "internet".

In [12]:
data_internet = pd.json_normalize(data= data["internet"], sep = "_")

data_internet.columns

Index(['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies'],
      dtype='object')

In [13]:
# Standardizing the columns' names.

data_internet = data_internet.rename({"InternetService": "internet_service", "OnlineSecurity": "online_security", "OnlineBackup": "online_backup", "DeviceProtection": "device_protection", "TechSupport": "tech_support", "StreamingTV": "streaming_tv", "StreamingMovies": "streaming_movies"}, axis = 1)

data_internet

Unnamed: 0,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies
0,DSL,No,Yes,No,Yes,Yes,No
1,DSL,No,No,No,No,No,Yes
2,Fiber optic,No,No,Yes,No,No,No
3,Fiber optic,No,Yes,Yes,No,Yes,Yes
4,Fiber optic,No,No,No,Yes,Yes,No
...,...,...,...,...,...,...,...
7262,DSL,Yes,No,No,Yes,No,No
7263,Fiber optic,No,No,No,No,No,Yes
7264,DSL,No,Yes,No,No,No,No
7265,DSL,Yes,No,Yes,Yes,No,Yes


### <font color = "Blue"> Information under "account".

In [14]:
data_account = pd.json_normalize(data = data["account"])

data_account.columns

Index(['Contract', 'PaperlessBilling', 'PaymentMethod', 'Charges.Monthly',
       'Charges.Total'],
      dtype='object')
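The dotted names come from the nested "Charges" dictionary: `pd.json_normalize` joins nested keys with its `sep` argument, which defaults to ".". A small sketch with illustrative records shaped like the "account" column shows the difference:

```python
import pandas as pd

# Illustrative records shaped like the "account" column.
records = [
    {"Contract": "One year", "Charges": {"Monthly": 65.6, "Total": "593.3"}},
    {"Contract": "Month-to-month", "Charges": {"Monthly": 59.9, "Total": "542.4"}},
]

default_sep = pd.json_normalize(records)              # nested keys joined with "."
underscore_sep = pd.json_normalize(records, sep="_")  # nested keys joined with "_"

print(list(default_sep.columns))
print(list(underscore_sep.columns))
```

Either form works here, since the columns are renamed in the next cell anyway.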

In [15]:
# Standardizing the columns' names.

data_account = data_account.rename({"Contract": "contract", "PaperlessBilling": "paperless_billing", "PaymentMethod": "payment_method", "Charges.Monthly": "charges_monthly", "Charges.Total": "charges_total"}, axis = 1)

data_account

Unnamed: 0,contract,paperless_billing,payment_method,charges_monthly,charges_total
0,One year,Yes,Mailed check,65.60,593.3
1,Month-to-month,No,Mailed check,59.90,542.4
2,Month-to-month,Yes,Electronic check,73.90,280.85
3,Month-to-month,Yes,Electronic check,98.00,1237.85
4,Month-to-month,Yes,Mailed check,83.90,267.4
...,...,...,...,...,...
7262,One year,No,Mailed check,55.15,742.9
7263,Month-to-month,Yes,Electronic check,85.10,1873.7
7264,Month-to-month,Yes,Mailed check,50.30,92.75
7265,Two year,No,Mailed check,67.85,4627.65
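The rename dictionaries in the cells above are written by hand. As an alternative, the same snake_case names could be generated with a small helper. The function `to_snake_case` below is a hypothetical helper written for this sketch, not part of pandas.

```python
import re


def to_snake_case(name):
    """Convert names such as 'SeniorCitizen' or 'Charges.Monthly' to snake_case."""
    name = name.replace(".", "_")
    # Insert "_" between a lowercase letter or digit and the uppercase letter that follows it.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


original_names = ["customerID", "SeniorCitizen", "StreamingTV", "Charges.Monthly", "gender"]
renamed = {name: to_snake_case(name) for name in original_names}
print(renamed)
```

A mapping like `renamed` can be passed to `DataFrame.rename(columns=...)` in place of a hand-written dictionary.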


### <font color = "Blue"> Concatenating all subsets into the original DataFrame laterally.

In [16]:
# The "customer", "phone", "internet" and "account" columns of the original DataFrame have been left out deliberately.

data = pd.concat([data[["customer_id", "churn"]], data_customer, data_phone, data_internet,data_account], axis = 1)

data.head(10)

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,No,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,No,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,No,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.0,1237.85
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,No,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.9,267.4
5,0013-MHZWF,No,Female,0,No,Yes,9,Yes,No,DSL,No,No,No,Yes,Yes,Yes,Month-to-month,Yes,Credit card (automatic),69.4,571.45
6,0013-SMEOE,No,Female,1,Yes,No,71,Yes,No,Fiber optic,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),109.7,7904.25
7,0014-BMAQU,No,Male,0,Yes,No,63,Yes,Yes,Fiber optic,Yes,No,No,Yes,No,No,Two year,Yes,Credit card (automatic),84.65,5377.8
8,0015-UOCOJ,No,Female,1,No,No,7,Yes,No,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,48.2,340.35
9,0016-QLJIS,No,Female,0,Yes,Yes,65,Yes,Yes,DSL,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Mailed check,90.45,5957.9
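Note that `pd.concat` with `axis = 1` aligns rows by index label, not by position. It works here because every subset has the same default 0..n-1 index as the original DataFrame. The toy frames below (made-up values) sketch what would happen if the indexes did not match, and how resetting them restores positional alignment.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"b": [3, 4]}, index=[1, 2])

# Index labels differ, so concat produces the union of labels and fills gaps with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the rows line up by position again.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape, aligned.shape)
```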


---
# Understanding the dataset information

In [17]:
print(f"The dataset currently has {data.shape[0]} customer entries and {data.shape[1]} different pieces of information about each of them")

The dataset currently has 7267 customer entries and 21 different pieces of information about each of them


---
# Analysing data entries and their types

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        7267 non-null   object 
 1   churn              7267 non-null   object 
 2   gender             7267 non-null   object 
 3   senior_citizen     7267 non-null   int64  
 4   partner            7267 non-null   object 
 5   dependents         7267 non-null   object 
 6   tenure             7267 non-null   int64  
 7   phone_service      7267 non-null   object 
 8   multiple_lines     7267 non-null   object 
 9   internet_service   7267 non-null   object 
 10  online_security    7267 non-null   object 
 11  online_backup      7267 non-null   object 
 12  device_protection  7267 non-null   object 
 13  tech_support       7267 non-null   object 
 14  streaming_tv       7267 non-null   object 
 15  streaming_movies   7267 non-null   object 
 16  contract           7267 

##### - As the DataFrame info is analysed, we can see that none of the columns contains values that pandas considers null.
##### - The variable "senior_citizen", which should be a boolean one, is actually labeled as being of the int type. This happened because "No" and "Yes" have been encoded as 0 and 1. At the stage of fixing data inconsistencies, it will be decided whether to switch to "Yes" and "No" or maintain zeroes and ones.
##### - The variable "charges_total", which should be of the float/integer type, is actually an object. The reasons for this irregularity and its correction will be addressed in the stage of fixing data inconsistencies.

### <font color = "Blue"> Boolean Variables

Beginner-friendly text: a boolean variable is one that involves a dichotomy of "True"/"False", "Yes"/"No", one or the other, without any middle ground.
- churn
- senior_citizen
- partner
- dependents
- phone_service
- multiple_lines
- online_security
- online_backup
- device_protection
- tech_support
- streaming_tv
- streaming_movies
- paperless_billing

### <font color = "Blue"> Quantitative Variables

- tenure
- charges_monthly
- charges_total

### <font color = "Blue"> Descriptive Variables

- customer_id
- gender
- internet_service
- contract
- payment_method

---
# Checking for and Fixing inconsistencies in the data

#### <font color = "Blue"> 1. The first thing that came to mind was to check if there were no duplicated customers in the dataset. This was done by using the unique() method and applying it to the customer_id column.

In [19]:
# The result of this cell indicates that there are 7267 different customer IDs in the dataset.

data["customer_id"].unique().shape[0]

7267

In [20]:
data.shape[0]

7267

In [21]:
# Let's quickly check if the customer_id has got any blank entries.

data.query("customer_id == ''")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### <font color = "Green"> Since the dataset has 7,267 entries, none of the customer IDs is blank, and every entry has a unique customer ID, we can rule out the possibility of duplicated entries.
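The same three checks (uniqueness, duplicates and blank IDs) can be sketched with pandas' own helpers. The toy IDs below are made up and deliberately include one duplicate and one blank entry.

```python
import pandas as pd

# Made-up IDs: one duplicated value and one blank entry, on purpose.
toy = pd.DataFrame({"customer_id": ["0001-AAAA", "0002-BBBB", "0002-BBBB", ""]})

has_unique_ids = toy["customer_id"].is_unique
duplicated_rows = toy[toy["customer_id"].duplicated(keep=False)]  # every copy of a repeated ID
blank_ids = toy["customer_id"].str.strip().eq("").sum()           # also catches whitespace-only IDs

print(has_unique_ids, len(duplicated_rows), blank_ids)
```

Stripping whitespace before comparing is a small extra safeguard, since an ID made only of spaces would not match `''` exactly.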

#### <font color = "Blue"> 2. The *senior_citizen* variable has got 0 and 1 for values rather than "Yes" and "No".

##### First, I wanted to understand which number represented which condition of "Yes" (is 65 years old or older) or "No".

In [22]:
data["senior_citizen"].value_counts(normalize = True) * 100

0    83.734691
1    16.265309
Name: senior_citizen, dtype: float64

##### Given that about 83.73% of customers are "0", it is most likely that 0 means "No" (they are not 65 years or older).


##### For the purpose of standardizing variable labels, <font color = "Green"> I have decided to replace 0 with "No" and 1 with "Yes".

In [23]:
data["senior_citizen"] = data["senior_citizen"].transform(lambda value: "Yes" if value == 1 else "No")

In [24]:
# With the test in this cell, we can see that the percentages have not changed after altering the labels of each value.

data["senior_citizen"].value_counts(normalize = True) * 100

No     83.734691
Yes    16.265309
Name: senior_citizen, dtype: float64

##### After this check, we are certain that all values in the *senior_citizen* column have been replaced with "Yes" and "No".
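One design note: the lambda used above turns every value other than 1 into "No", so an unexpected value would be relabeled silently. A stricter alternative, sketched here on made-up flags, is `Series.map` with an explicit dictionary, which leaves unknown values as NaN so they can be counted.

```python
import pandas as pd

flags = pd.Series([0, 1, 0, 2])  # the 2 is a deliberately invalid value
labels = flags.map({0: "No", 1: "Yes"})

# Values missing from the mapping become NaN instead of being forced to "No".
unmapped = labels.isna().sum()
print(labels.tolist(), unmapped)
```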

#### <font color = "Blue"> 3. As has been mentioned before, the "charges_total" variable values are labeled as objects when we use the info() method. Let's see what that's about.

In [25]:
data["charges_total"][0]

'593.3'

In [26]:
type(data["charges_total"][0])

str

##### The test above already raises a possibility. Though the values appear to be floats, they might all be strings. Let's build a more sophisticated test and apply it to all entries in this column.

In [27]:
data["charges_total"].apply(lambda value: type(value) == str).value_counts(normalize = True) * 100

True    100.0
Name: charges_total, dtype: float64

##### <font color = "Green"> As this test has proven, 100% of the values in this column are strings. We can change their type to floats using the pd.to_numeric() function.

In [28]:
data["charges_total"] = pd.to_numeric(arg = data["charges_total"], errors = "coerce")

##### After this switch has been done, let's reapply the test from before, but checking if they have become floats.

In [29]:
data["charges_total"].apply(lambda value: type(value) == float).value_counts(normalize = True) * 100

True    100.0
Name: charges_total, dtype: float64

##### 100% of them are now floats!
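One caveat about `errors = "coerce"`: strings that cannot be parsed as numbers become NaN, and NaN is itself a float, so a type check like the one above still reports 100%. The sketch below uses made-up values (the unparseable strings are an assumption for illustration, not taken from the dataset) to show how to list the raw entries that failed to convert.

```python
import pandas as pd

raw = pd.Series(["593.3", "542.4", " ", "n/a"])  # made-up values
converted = pd.to_numeric(raw, errors="coerce")

# Entries that existed as strings but could not be parsed as numbers.
failed = raw[converted.isna() & raw.notna()]

# NaN counts as a float, so this check passes even though two values were lost.
all_floats = converted.map(lambda value: isinstance(value, float)).all()

print(failed.tolist(), all_floats)
```

Running a comparison like `failed` before overwriting the column shows exactly which raw strings were turned into nulls.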

#### <font color = "Blue"> 4. Still on the "charges_total" variable, a quick use of the isnull() method shows us that there are 11 entries with null values.

In [30]:
data["charges_total"].isnull().value_counts()

False    7256
True       11
Name: charges_total, dtype: int64

##### Let's understand the reason by visualizing these 11 entries through a query.

In [31]:
data[data["charges_total"].isna()]

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total
975,1371-DWPAZ,No,Female,No,Yes,Yes,0,No,No phone service,DSL,Yes,Yes,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,
1775,2520-SGTTA,No,Female,No,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,
1955,2775-SEFEE,No,Male,No,No,Yes,0,Yes,Yes,DSL,Yes,Yes,No,Yes,No,No,Two year,Yes,Bank transfer (automatic),61.9,
2075,2923-ARZLG,No,Male,No,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,
2232,3115-CZMZD,No,Male,No,No,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,
2308,3213-VVOLG,No,Male,No,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,
2930,4075-WKNIU,No,Female,No,Yes,Yes,0,Yes,Yes,DSL,No,Yes,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,
3134,4367-NUYAO,No,Male,No,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,
3203,4472-LVYGI,No,Female,No,Yes,Yes,0,No,No phone service,DSL,Yes,No,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,
4169,5709-LVOEQ,No,Female,No,Yes,Yes,0,Yes,No,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,


##### As future inspiration for analysis, it's interesting to note that:
##### - All of these customers are under 65 years of age (as seen by the *senior_citizen* variable).
##### - All of these customers have dependents.
##### - None of these customers have left the company (as seen by the *churn* variable).

##### For this specific issue, though, the most important information is that <font color = "Green"> none of these customers have completed a full month of tenure with the company, which is probably why the *charges_total* values have appeared as null.
##### That being the case, let's see how *charges_total* behaves for 1-month-tenure customers.

In [32]:
data.query("tenure == 1").shape[0]

634

In [33]:
data.query("tenure == 1").query("charges_monthly == charges_total").shape[0]

634

##### As shown above, the *charges_monthly* and *charges_total* variables have the same value in every entry for 1-month-tenure customers. This suggests that charges are only added to *charges_total* once a month of tenure has been completed, so customers with 0 months of tenure have not been charged anything yet.

##### That being the case, <font color = "Green"> I have decided to fill the null values of 0-month-tenure customers with 0 rather than with their *charges_monthly* value.

In [34]:
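# Let's apply 0 to the null values.
# Assigning the result back avoids calling inplace = True on a column selection, which recent pandas versions warn about.

data["charges_total"] = data["charges_total"].fillna(value = 0)

# And repeat the test to see if they have actually been replaced.

data[data["charges_total"].isna()]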

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


#### <font color = "Blue"> 5. Checking for contradictory entries between columns.

##### By "contradictory entries" I mean values that would describe impossible situations in reality. For example, it should be impossible for a customer to have "Yes" on the *multiple_lines* column and "No" on the *phone_service* column (in order to actually have multiple phone lines, the customer has to have at least one to begin with).

##### <font color = "Blue"> 5a. It should be impossible for a customer to have “Yes” on the multiple_lines column and “No” on the phone_service column.

In [35]:
data.query("phone_service == 'No' and multiple_lines == 'Yes'")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.

##### <font color = "Blue"> 5b. Whenever a customer has got "No" on the *phone_service* column, the *multiple_lines* column should be labeled as "No phone service"

In [36]:
data.query("phone_service == 'No' and multiple_lines == 'No'")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.

##### <font color = "Blue"> 5c. Whenever a customer has got "No" on the *internet_service* column, the *online_security*, *online_backup*, *device_protection*, *tech_support*, *streaming_tv* and *streaming_movies* columns should all be labeled as "No internet service".

In [37]:
data.query("internet_service == 'No' and (online_security == 'No' or online_backup == 'No' or device_protection == 'No' or tech_support == 'No' or streaming_tv == 'No' or streaming_movies == 'No')")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.

##### <font color = "Blue"> 5d. Whenever a customer has got "Yes" on the *phone_service* column, the *multiple_lines* column can't be labeled as "No phone service".

In [38]:
data.query("phone_service == 'Yes' and multiple_lines == 'No phone service'")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.

##### <font color = "Blue"> 5e. Whenever a customer has got "Yes" on the *internet_service* column, the *online_security*, *online_backup*, *device_protection*, *tech_support*, *streaming_tv* and *streaming_movies* columns can't be labeled as "No internet service".


In [39]:
# internet_service holds 'DSL', 'Fiber optic' or 'No' (never 'Yes'), so customers with internet are selected with != 'No'.
data.query("internet_service != 'No' and (online_security == 'No internet service' or online_backup == 'No internet service' or device_protection == 'No internet service' or tech_support == 'No internet service' or streaming_tv == 'No internet service' or streaming_movies == 'No internet service')")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.

##### <font color = "Blue"> 5f. Since this is a customer database, every customer must have hired at least one of the services available. So there shouldn't be entries that have "No" in both the *phone_service* and *internet_service* columns.


In [40]:
data.query("phone_service == 'No' and internet_service == 'No'")

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


##### No entries are returned, so <font color = "Green"> **we can consider the data to be in conformity**.
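
##### The phone-related checks above can also be run in a single pass. The cell below is only a sketch on a small, made-up DataFrame (the `toy` data and the `rules` dictionary are my own illustration, not part of the original dataset): each rule is written as a query that selects the rows *violating* it, so a count of 0 means conformity.

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as the notebook.
toy = pd.DataFrame({
    "phone_service":    ["Yes", "No", "Yes", "No"],
    "multiple_lines":   ["No", "No phone service", "Yes", "Yes"],
    "internet_service": ["DSL", "Fiber optic", "No", "DSL"],
})

# Rule label -> query selecting the rows that VIOLATE the rule.
rules = {
    "5a": "phone_service == 'No' and multiple_lines == 'Yes'",
    "5b": "phone_service == 'No' and multiple_lines != 'No phone service'",
    "5d": "phone_service == 'Yes' and multiple_lines == 'No phone service'",
    "5f": "phone_service == 'No' and internet_service == 'No'",
}

# Count the offending rows for every rule.
violations = {name: len(toy.query(expr)) for name, expr in rules.items()}
print(violations)
```

##### In this toy data the last row (no phone service, but multiple lines marked "Yes") breaks rules 5a and 5b, while 5d and 5f report no violations.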


#### <font color = "Blue"> 6. Checking for blank values in the DataFrame.

In [41]:
# Checking for blank values with spaces (' ').

data[data.eq(' ').any(axis = 1)]

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total


In [42]:
# Checking for blank values without spaces ('').

data[data.eq('').any(axis = 1)]

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_monthly,charges_total
30,0047-ZHDTW,,Female,No,No,No,11,Yes,Yes,Fiber optic,Yes,No,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),79.00,929.30
75,0120-YZLQA,,Male,No,No,No,71,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Credit card (automatic),19.90,1355.10
96,0154-QYHJU,,Male,No,No,No,29,Yes,No,DSL,Yes,Yes,No,Yes,No,No,One year,Yes,Electronic check,58.75,1696.20
98,0162-RZGMZ,,Female,Yes,No,No,5,Yes,No,DSL,Yes,Yes,No,Yes,No,No,Month-to-month,No,Credit card (automatic),59.90,287.85
175,0274-VVQOQ,,Male,Yes,Yes,No,65,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Bank transfer (automatic),103.15,6792.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7158,9840-GSRFX,,Female,No,No,No,14,Yes,Yes,DSL,No,Yes,No,No,No,No,One year,Yes,Mailed check,54.25,773.20
7180,9872-RZQQB,,Female,No,Yes,No,49,No,No phone service,DSL,Yes,No,No,No,Yes,No,Month-to-month,No,Bank transfer (automatic),40.65,2070.75
7211,9920-GNDMB,,Male,No,No,No,9,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,76.25,684.85
7239,9955-RVWSC,,Female,No,Yes,Yes,67,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),19.25,1372.90


##### Though the *churn* variable is supposed to hold only "Yes" or "No", the filter above shows that some entries have a blank value in it.

In [43]:
# In 224 of them, to be exact.

data["churn"].value_counts()

No     5174
Yes    1869
        224
Name: churn, dtype: int64

##### As the *churn* variable will be the most important one for grouping entries in the analysis stage, a decision has to be made on whether to keep the blank entries in the dataset or not.
##### Let's see what these 224 blank entries represent in terms of percentage.

In [44]:
data["churn"].value_counts(normalize = True) * 100

No     71.198569
Yes    25.719004
        3.082427
Name: churn, dtype: float64
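
##### As a side note, blank strings are not counted as missing by pandas. A possible alternative (a sketch on made-up data, not what this notebook does) is to turn them into NaN first, so that `isna()` can be used. The last line simply redoes the arithmetic behind the percentage above: 224 blank entries out of 7267.

```python
import pandas as pd

# Toy churn column (hypothetical) with two blank entries.
toy_churn = pd.Series(["No", "Yes", "", "No", ""], name="churn")

# mask() replaces the values where the condition is True with NaN.
churn_clean = toy_churn.mask(toy_churn.eq(""))

n_missing = int(churn_clean.isna().sum())      # number of blank entries
pct_missing = churn_clean.isna().mean() * 100  # share of blank entries, in %
print(n_missing, pct_missing)

# The figure for the real dataset: 224 blanks out of 7267 rows (about 3.08%).
pct_real = 224 / 7267 * 100
print(round(pct_real, 2))
```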

##### <font color = "Green"> Since blank *churn* values represent a small percentage of entries in the DataFrame, I have decided to assign *data* to a new, sanitized version of the original DataFrame, excluding these blank entries.

##### <font color = "Green"> Nevertheless, I have decided to create a second DataFrame named *data_all_entries*, which will be an identical version of the *data* DataFrame (before excluding the blank entries), to act as a backup in case it becomes relevant in the future.

#### <font color = "Red"> The creation of both DataFrames is done at the end of this notebook.

---
# Creating the *charges_daily* column

##### The company has requested that a new column be created in the 18th position to store each customer's charges per day. For this exercise, I have decided to consider a 30-day month, so the daily charge is the monthly charge divided by 30.

In [45]:
# Calculating the daily spending of each customer and assigning them to a Series.

charges_daily = (data["charges_monthly"] / 30).round(2)

charges_daily

0       2.19
1       2.00
2       2.46
3       3.27
4       2.80
        ... 
7262    1.84
7263    2.84
7264    1.68
7265    2.26
7266    1.97
Name: charges_monthly, Length: 7267, dtype: float64

In [46]:
# Renaming the label of the "charges_daily" Series.

charges_daily.rename("charges_daily", inplace = True)

charges_daily

0       2.19
1       2.00
2       2.46
3       3.27
4       2.80
        ... 
7262    1.84
7263    2.84
7264    1.68
7265    2.26
7266    1.97
Name: charges_daily, Length: 7267, dtype: float64
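
##### The two steps above (calculating and renaming) can also be chained. A minimal sketch, using the first three *charges_monthly* values shown in this notebook and the same 30-day assumption (the `toy_monthly` name is my own):

```python
import pandas as pd

# First three monthly charges from the DataFrame preview.
toy_monthly = pd.Series([65.6, 59.9, 73.9], name="charges_monthly")

# Divide by a 30-day month, round to cents and rename in one chain.
toy_daily = (toy_monthly / 30).round(2).rename("charges_daily")
print(toy_daily)
```

##### Without `inplace = True`, `rename()` returns a new Series, which is why the result is assigned directly.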

In [47]:
# The company asked to insert this column on the 18th position, which would be where the paperless_billing column is. Let's see if it is a good place to put it.

data.columns[17:]

Index(['paperless_billing', 'payment_method', 'charges_monthly',
       'charges_total'],
      dtype='object')

##### Though the company has asked to put the *charges_daily* column on the 18th position, <font color = "Green"> I have decided to insert it in the 20th position, as it will be right after *payment_method* and together with other columns with charges information.

In [48]:
# Inserting the Series into the DataFrame.

data.insert(loc = 19, column = "charges_daily", value = charges_daily)
          # loc = 19 because the columns index starts at 0.
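
##### Instead of hard-coding the position, it can be looked up from a column name. This is only a sketch on a reduced, made-up DataFrame: `get_loc()` returns the position of *charges_monthly*, and inserting at that position places the new column right before it.

```python
import pandas as pd

# Reduced toy DataFrame (hypothetical) with the last three columns only.
toy = pd.DataFrame({
    "payment_method": ["Mailed check", "Electronic check"],
    "charges_monthly": [65.6, 73.9],
    "charges_total": [593.3, 280.85],
})

daily = (toy["charges_monthly"] / 30).round(2)

# Position of charges_monthly; inserting there pushes it one column to the right.
pos = toy.columns.get_loc("charges_monthly")
toy.insert(loc = pos, column = "charges_daily", value = daily)

print(list(toy.columns))
```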

In [49]:
data.head(10)

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,charges_daily,charges_monthly,charges_total
0,0002-ORFBO,No,Female,No,Yes,Yes,9,Yes,No,DSL,No,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,2.19,65.6,593.3
1,0003-MKNFE,No,Male,No,No,No,9,Yes,Yes,DSL,No,No,No,No,No,Yes,Month-to-month,No,Mailed check,2.0,59.9,542.4
2,0004-TLHLJ,Yes,Male,No,No,No,4,Yes,No,Fiber optic,No,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,2.46,73.9,280.85
3,0011-IGKFF,Yes,Male,Yes,Yes,No,13,Yes,No,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,3.27,98.0,1237.85
4,0013-EXCHZ,Yes,Female,Yes,Yes,No,3,Yes,No,Fiber optic,No,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,2.8,83.9,267.4
5,0013-MHZWF,No,Female,No,No,Yes,9,Yes,No,DSL,No,No,No,Yes,Yes,Yes,Month-to-month,Yes,Credit card (automatic),2.31,69.4,571.45
6,0013-SMEOE,No,Female,Yes,Yes,No,71,Yes,No,Fiber optic,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),3.66,109.7,7904.25
7,0014-BMAQU,No,Male,No,Yes,No,63,Yes,Yes,Fiber optic,Yes,No,No,Yes,No,No,Two year,Yes,Credit card (automatic),2.82,84.65,5377.8
8,0015-UOCOJ,No,Female,Yes,No,No,7,Yes,No,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,1.61,48.2,340.35
9,0016-QLJIS,No,Female,No,Yes,Yes,65,Yes,Yes,DSL,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Mailed check,3.02,90.45,5957.9


---
# Creating the *services_hired* column.

##### After treating this dataset for some time now, I realised how difficult it is to visualise how many services each customer has hired, which is why I have decided to create the *services_hired* column.

##### The values of this column will be integers and will be calculated by counting each instance of "Yes" in the services columns (*phone_service*, *multiple_lines*, *online_security*, *online_backup*, *device_protection*, *tech_support*, *streaming_tv* and *streaming_movies*). Because the *internet_service* column labels are different from "Yes" and "No", we just have to include all of its labels except for "No".

In [50]:
# Using the eq() method we can obtain a boolean version of the Yes/No service columns, where "True" indicates services that have been hired and "False" indicates the opposite.
# internet_service is not part of this selection (its labels are "DSL", "Fiber optic" and "No"), so it is handled separately below.

services_hired = data[["phone_service", "multiple_lines", "online_security", "online_backup", "device_protection", "tech_support", "streaming_tv", "streaming_movies"]].eq("Yes")

services_hired

Unnamed: 0,phone_service,multiple_lines,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies
0,True,False,False,True,False,True,True,False
1,True,True,False,False,False,False,False,True
2,True,False,False,False,True,False,False,False
3,True,False,False,True,True,False,True,True
4,True,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...
7262,True,False,True,False,False,True,False,False
7263,True,True,False,False,False,False,False,True
7264,True,False,False,True,False,False,False,False
7265,True,False,True,False,True,True,False,True
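
##### A note on matching several labels at once: in Python, `or` between non-empty strings evaluates to the first one, so an expression such as `"Yes" or "DSL" or "Fiber optic"` is just `"Yes"`. To test a column against more than one label, `isin()` is the usual tool. A small sketch on made-up data:

```python
import pandas as pd

# `or` returns its first truthy operand, not a set of alternatives.
label = "Yes" or "DSL" or "Fiber optic"
print(label)

# isin() checks each value against a list of accepted labels.
toy_internet = pd.Series(["DSL", "Fiber optic", "No"])
has_internet = toy_internet.isin(["DSL", "Fiber optic"])
print(has_internet.tolist())
```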


In [51]:
data["internet_service"].unique()

array(['DSL', 'Fiber optic', 'No'], dtype=object)

In [52]:
data_internet_service_dsl = data["internet_service"].eq("DSL")

In [53]:
data_internet_service_fiber_optic = data["internet_service"].eq("Fiber optic")

In [54]:
# Let's insert the two internet_service boolean Series into the services_hired DataFrame.

services_hired.insert(loc = 2, column = "internet_service_dsl", value = data_internet_service_dsl)

services_hired.insert(loc = 3, column = "internet_service_fiber_optic", value = data_internet_service_fiber_optic)

In [55]:
services_hired

Unnamed: 0,phone_service,multiple_lines,internet_service_dsl,internet_service_fiber_optic,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies
0,True,False,True,False,False,True,False,True,True,False
1,True,True,True,False,False,False,False,False,False,True
2,True,False,False,True,False,False,True,False,False,False
3,True,False,False,True,False,True,True,False,True,True
4,True,False,False,True,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...
7262,True,False,True,False,True,False,False,True,False,False
7263,True,True,False,True,False,False,False,False,False,True
7264,True,False,True,False,False,True,False,False,False,False
7265,True,False,True,False,True,False,True,True,False,True


In [56]:
# Let's add the True values for each client separately.

services_hired.sum(axis = 1)

0       5
1       4
2       3
3       6
4       4
       ..
7262    4
7263    4
7264    3
7265    6
7266    6
Length: 7267, dtype: int64
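
##### The same count can be obtained without the two helper columns, since any *internet_service* label other than "No" counts as one hired service. A sketch on a reduced, made-up DataFrame (only three of the Yes/No columns are included to keep it short):

```python
import pandas as pd

# Reduced toy DataFrame (hypothetical).
toy = pd.DataFrame({
    "phone_service":    ["Yes", "Yes", "No"],
    "multiple_lines":   ["No", "Yes", "No phone service"],
    "internet_service": ["DSL", "No", "Fiber optic"],
    "online_security":  ["Yes", "No internet service", "No"],
})

yes_no_cols = ["phone_service", "multiple_lines", "online_security"]

# Number of "Yes" labels per row, plus 1 when an internet service is hired.
toy_services = (
    toy[yes_no_cols].eq("Yes").sum(axis = 1)
    + toy["internet_service"].ne("No").astype(int)
)
print(toy_services.tolist())
```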

##### After the data has been obtained, we can add it to the original DataFrame. <font color = "Green"> I have decided to insert it on the 17th position for a more intuitive visualization.

In [57]:
data.insert(loc = 16, column = "services_hired", value = services_hired.sum(axis = 1))

In [58]:
data.sample(5)

Unnamed: 0,customer_id,churn,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,services_hired,contract,paperless_billing,payment_method,charges_daily,charges_monthly,charges_total
5758,7868-TMWMZ,No,Female,Yes,Yes,No,60,Yes,No,Fiber optic,Yes,Yes,Yes,Yes,Yes,Yes,8,Two year,Yes,Bank transfer (automatic),3.67,110.0,6668.35
505,0719-SYFRB,Yes,Female,No,No,No,12,Yes,Yes,DSL,Yes,No,Yes,Yes,No,No,6,Month-to-month,Yes,Mailed check,2.06,61.65,713.75
5339,7293-LSCDV,No,Female,No,Yes,Yes,60,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,1,Two year,Yes,Credit card (automatic),0.64,19.25,1103.25
6822,9405-GPBBG,No,Female,No,No,No,64,Yes,Yes,Fiber optic,No,Yes,Yes,Yes,Yes,Yes,8,Two year,Yes,Credit card (automatic),3.68,110.5,7069.25
4899,6696-YDAYZ,No,Male,No,Yes,Yes,16,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,1,Two year,No,Mailed check,0.68,20.5,290.55


---
# Exporting the datasets

##### As decided in step 6 (checking for blank values), two different DataFrames will be created and exported in this section.

#### <font color = "Blue"> 1. First, a DataFrame with all entries, including the ones with blank values in the "churn" variable. This DataFrame will be called *data_all_entries*, and the JSON file that originates from it *data_all_entries.json*.

In [59]:
# .copy() makes the backup independent of any later changes to data.
data_all_entries = data.copy()

In [60]:
data_all_entries.to_json(path_or_buf = "data_all_entries.json", orient = "split", index = False)

#### <font color = "Blue"> 2. Next, a DataFrame without the entries with blank values in the "churn" variable. This DataFrame keeps the name *data*, and the JSON file that originates from it will be called *data_treated.json*.

In [61]:
data = data.query("churn == 'Yes' or churn == 'No'")

In [62]:
data.to_json(path_or_buf = "data_treated.json", orient = "split", index = False)
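
##### To check that an export can be read back, the JSON can be loaded again with the same `orient`. The cell below is a sketch on a made-up three-row DataFrame; it writes to a string instead of a file so that nothing is saved to disk.

```python
import io
import pandas as pd

# Toy DataFrame (hypothetical) with one blank churn value.
toy = pd.DataFrame({
    "customer_id": ["0002-ORFBO", "0047-ZHDTW", "0004-TLHLJ"],
    "churn": ["No", "", "Yes"],
    "charges_monthly": [65.6, 79.0, 73.9],
})

all_entries = toy.copy()                       # backup with every row
treated = toy.query("churn in ['Yes', 'No']")  # rows with a valid churn label

# Without a path, to_json() returns the JSON as a string.
payload = treated.to_json(orient = "split", index = False)

# Reading it back with the same orient rebuilds the DataFrame.
restored = pd.read_json(io.StringIO(payload), orient = "split")
print(restored)
```

##### Because the export uses `index = False`, the restored DataFrame gets a fresh index starting at 0.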