# Introduction

In this preprocessing section, we prepare the data so that it is suitable for this study. We need to find any null values in the data and convert some of the columns to numerical values, so that the data is easier to work with in the machine learning model.

# Import and load

We start by importing the functions that load and save data, and then load the raw dataset.

In [23]:
import sys
import os

sys.path.append(os.path.abspath('../src'))

from data_loader import load_data, save_data

In [24]:
data = load_data("telco.csv", "raw", False)

Now, we will check which columns contain null values.

In [25]:
data.isna().value_counts()

Customer ID  Gender  Age    Under 30  Senior Citizen  Married  Dependents  Number of Dependents  Country  State  City   Zip Code  Latitude  Longitude  Population  Quarter  Referred a Friend  Number of Referrals  Tenure in Months  Offer  Phone Service  Avg Monthly Long Distance Charges  Multiple Lines  Internet Service  Internet Type  Avg Monthly GB Download  Online Security  Online Backup  Device Protection Plan  Premium Tech Support  Streaming TV  Streaming Movies  Streaming Music  Unlimited Data  Contract  Paperless Billing  Payment Method  Monthly Charge  Total Charges  Total Refunds  Total Extra Data Charges  Total Long Distance Charges  Total Revenue  Satisfaction Score  Customer Status  Churn Label  Churn Score  CLTV   Churn Category  Churn Reason
False        False   False  False     False           False    False       False                 False    False  False  False     False     False      False       False    False              False                False             True  

Looking at the counts of null values, we find that most columns do not contain any. The columns that do contain null values are either not needed for this analysis or will be converted into numerical values for the machine learning model.
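
`data.isna().value_counts()` reports how often each row-wise pattern of nulls occurs, which is hard to read when there are 50 columns. A per-column count can be easier to scan. The following is a minimal sketch on a toy frame; the toy columns and values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-in for the raw data (assumption); the real frame comes from load_data(...).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Offer": ["Offer A", None, "Offer E"],
    "Churn Reason": [None, "placeholder reason", None],
})

# Count missing values per column, then keep only the columns that have any.
null_counts = toy.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```

On the real dataset, the same two lines could be applied to `data` instead of `toy`.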

# Prepare the data for the machine learning model

We start by looking at the columns and deciding which ones to use to train the model, and which one to use as the target.

In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 50 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Gender                             7043 non-null   object 
 2   Age                                7043 non-null   int64  
 3   Under 30                           7043 non-null   object 
 4   Senior Citizen                     7043 non-null   object 
 5   Married                            7043 non-null   object 
 6   Dependents                         7043 non-null   object 
 7   Number of Dependents               7043 non-null   int64  
 8   Country                            7043 non-null   object 
 9   State                              7043 non-null   object 
 10  City                               7043 non-null   object 
 11  Zip Code                           7043 non-null   int64

After looking at the columns, we decide to use the following columns to train the model:
<ul>
<li>Gender</li>
<li>Age</li>
<li>Senior Citizen</li>
<li>Married</li>
<li>Dependents</li>
<li>Tenure in Months</li>
<li>Offer</li>
<li>Phone Service</li>
<li>Contract</li>
<li>Monthly Charge</li>
<li>Premium Tech Support</li>
</ul>

- For the target value, we will use **Churn Label**, which indicates whether a customer churned or not.

In [27]:
# Feature columns followed by the target column (Churn Label)
model_info = [
    "Gender", "Age", "Senior Citizen", "Married", "Dependents",
    "Tenure in Months", "Offer", "Phone Service", "Contract",
    "Monthly Charge", "Premium Tech Support", "Churn Label",
]

We will now create a new dataset that contains only the columns for the machine learning model.

In [28]:
# Copy the selection so that later changes do not affect the original data
model_data = data[model_info].copy()
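
When the model is trained later, the feature columns and the target column usually need to be separated. The following is a minimal sketch of that split on a toy frame; the names `toy`, `features`, `X` and `y` and the toy values are assumptions for illustration and are not part of this notebook.

```python
import pandas as pd

# Toy rows using a few of the selected column names; the values are illustrative.
toy = pd.DataFrame({
    "Age": [34, 61],
    "Contract": ["One Year", "Month-to-Month"],
    "Monthly Charge": [70.5, 29.9],
    "Churn Label": ["No", "Yes"],
})

target = "Churn Label"
features = [col for col in toy.columns if col != target]

X = toy[features].copy()  # model inputs
y = toy[target].copy()    # value to predict
print(X.shape, y.shape)
```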

We will also take a look at the value counts to see which values need to be converted to numerical values.

In [29]:
model_data.value_counts()

Gender  Age  Senior Citizen  Married  Dependents  Tenure in Months  Offer    Phone Service  Contract        Monthly Charge  Premium Tech Support  Churn Label
Male    80   Yes             Yes      Yes         68                Offer A  Yes            One Year        107.15          Yes                   No             1
Female  19   No              No       No          1                 Offer E  Yes            Month-to-Month  60.15           No                    Yes            1
                                                  2                 Offer E  No             Month-to-Month  34.70           No                    Yes            1
                                                  4                 Offer E  No             Month-to-Month  48.25           Yes                   No             1
                                                  11                Offer D  Yes            Month-to-Month  64.05           No                    No             1
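
Calling `value_counts()` on the whole frame counts distinct combinations of values across all columns, so almost every row appears once. To see which categories each column contains, the counts can be taken one column at a time. The following is a minimal sketch on a toy frame; the toy values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for model_data (assumption).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Contract": ["One Year", "Month-to-Month", "Month-to-Month"],
    "Monthly Charge": [107.15, 60.15, 34.70],
})

# Only non-numeric columns need encoding, so inspect those one at a time.
text_columns = toy.select_dtypes(exclude="number").columns
per_column_counts = {
    col: toy[col].value_counts(dropna=False) for col in text_columns
}
for col, counts in per_column_counts.items():
    print(f"--- {col} ---")
    print(counts)
```

Passing `dropna=False` keeps missing values in the counts, which helps when deciding how to fill them.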
                           

We will then fill the missing values with 0 and replace the text values with numbers:

In [33]:
yes_no_replacements = {
    "Yes":1,
    "No":0,
    "Male":0,
    "Female":1
}
offers = {
    "Offer A": 1,
    "Offer B": 2,
    "Offer C": 3,
    "Offer D": 4,
    "Offer E": 5
}
contracts = {
    "Month-to-Month":0,
    "One Year":1,
    "Two Year":2,
}
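# Fill missing values with 0; assigning the result avoids inplace=True,
# which can raise SettingWithCopyWarning on a frame selected from `data`
model_data = model_data.fillna(0)
# Replace the text values with their numeric codes
model_data = model_data.replace(yes_no_replacements).infer_objects(copy=False)
model_data = model_data.replace(contracts).infer_objects(copy=False)
model_data = model_data.replace(offers).infer_objects(copy=False)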
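
Replacing values across the whole frame works here because the category names do not clash between columns. An alternative is one mapping per column, followed by a check that no text columns remain. The following is a minimal sketch on a toy frame; `column_maps`, `encoded` and the toy values are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for model_data (assumption).
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Married": ["Yes", "No"],
    "Offer": ["Offer A", None],
    "Contract": ["Two Year", "Month-to-Month"],
})

# One mapping per column, so a value is only replaced where it is expected.
column_maps = {
    "Gender": {"Male": 0, "Female": 1},
    "Married": {"Yes": 1, "No": 0},
    "Offer": {"Offer A": 1, "Offer B": 2, "Offer C": 3, "Offer D": 4, "Offer E": 5},
    "Contract": {"Month-to-Month": 0, "One Year": 1, "Two Year": 2},
}

encoded = toy.copy()
for col, mapping in column_maps.items():
    # map() turns missing or unmapped values into NaN; fillna(0) then sets them to 0.
    encoded[col] = encoded[col].map(mapping).fillna(0).astype(int)

# Every column should now be numeric.
assert encoded.select_dtypes(exclude="number").empty
print(encoded)
```

Note that with this approach an unexpected category (for example a typo) also becomes 0, so it may be worth checking the per-column value counts first.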

We can now take a look at a sample row from the model dataset.

In [31]:
model_data.sample()

      Gender  Age  Senior Citizen  Married  Dependents  Tenure in Months  Offer  Phone Service  Contract  Monthly Charge  Premium Tech Support  Churn Label
5326       1   22               0        1           1                54      0              1         2            60.0                     1            0


Finally, we will save the prepared dataset.

In [32]:
save_data(model_data, "model_data.csv", False)