### KDD Process (reminder of the steps to perform, as indicated by the professor)
1. Dataset load and features semantics
1. Data Cleaning (handle missing values, remove useless variables)
1. Feature Engineering
1. Classification Preprocessing (feature reshaping, train/test partitioning)
1. Parameter Tuning
1. Perform Classification
1. Analyze the classification results
1. Analyze the classification performance
1. Can we improve the performance using another classifier?

In [1]:
import numpy as np
import pandas as pd 

### Dataset load and features semantics

In [2]:
# load the dataset 
df = pd.read_csv("data/telecom_users.csv")
df.head()

| | Unnamed: 0 | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.1 | 1734.65 | No |
| 1 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.2 | No |
| 2 | 6344 | 9286-DOJGF | Female | 1 | Yes | No | 38 | Yes | Yes | Fiber optic | ... | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 3 | 6739 | 6994-KERXL | Male | 0 | No | No | 4 | Yes | No | DSL | ... | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.9 | 238.5 | No |
| 4 | 432 | 2181-UAESM | Male | 0 | No | No | 2 | Yes | No | DSL | ... | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.5 | No |


- ``customerID`` - customer id
- ``gender`` - client gender (Male, Female)
- ``SeniorCitizen`` - is the client retired (1, 0)
- ``Partner`` - is the client married (Yes, No)
- ``Dependents`` - does the client have dependents (Yes, No)
- ``tenure`` - how many months a person has been a client of the company
- ``PhoneService`` - is the telephone service connected (Yes, No)
- ``MultipleLines`` - are multiple phone lines connected (Yes, No, No phone service)
- ``InternetService`` - client's Internet service provider (DSL, Fiber optic, No)
- ``OnlineSecurity`` - is the online security service connected (Yes, No, No internet service)
- ``OnlineBackup`` - is the online backup service activated (Yes, No, No internet service)
- ``DeviceProtection`` - does the client have equipment insurance (Yes, No, No internet service)
- ``TechSupport`` - is the technical support service connected (Yes, No, No internet service)
- ``StreamingTV`` - is the streaming TV service connected (Yes, No, No internet service)
- ``StreamingMovies`` - is the streaming cinema service activated (Yes, No, No internet service)
- ``Contract`` - type of customer contract (Month-to-month, One year, Two year)
- ``PaperlessBilling`` - whether the client uses paperless billing (Yes, No)
- ``PaymentMethod`` - payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- ``MonthlyCharges`` - current monthly payment
- ``TotalCharges`` - the total amount that the client paid for the services for the entire time
- ``Churn`` - whether there was a churn (Yes or No)
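
The value sets listed above can be checked against the data by enumerating the distinct values of each non-numeric column. The following is only a minimal sketch: `toy` is a small made-up frame (not the real file) and `levels` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Contract": ["Two year", "Month-to-month", "Month-to-month", "One year"],
    "InternetService": ["No", "Fiber optic", "DSL", "DSL"],
    "tenure": [72, 44, 4, 2],
})

# Collect the sorted distinct values of every non-numeric column.
levels = {
    col: sorted(toy[col].unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in levels.items():
    print(col, values)
```

Running the same loop on `df` should make any value that is not covered by the descriptions easy to spot.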

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5986 entries, 0 to 5985
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        5986 non-null   int64  
 1   customerID        5986 non-null   object 
 2   gender            5986 non-null   object 
 3   SeniorCitizen     5986 non-null   int64  
 4   Partner           5986 non-null   object 
 5   Dependents        5986 non-null   object 
 6   tenure            5986 non-null   int64  
 7   PhoneService      5986 non-null   object 
 8   MultipleLines     5986 non-null   object 
 9   InternetService   5986 non-null   object 
 10  OnlineSecurity    5986 non-null   object 
 11  OnlineBackup      5986 non-null   object 
 12  DeviceProtection  5986 non-null   object 
 13  TechSupport       5986 non-null   object 
 14  StreamingTV       5986 non-null   object 
 15  StreamingMovies   5986 non-null   object 
 16  Contract          5986 non-null   object 


### Data cleaning

`TotalCharges`: this column had 10 missing values. Inspecting these records, we noticed that `tenure` was zero in all of them, meaning they are new customers who have not yet made their first payment. We therefore decided to fill `TotalCharges` with each customer's `MonthlyCharges`.

In [4]:
df['TotalCharges'] = np.where(df['tenure'] == 0, df['MonthlyCharges'], df['TotalCharges'])
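
The check described above (every missing `TotalCharges` belongs to a `tenure == 0` customer) can be sketched as follows. This is an illustration on a made-up frame, not the notebook's actual verification; it assumes the missing entries are stored as blank strings, which is an assumption about the file and not something the notebook states.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "tenure": [72, 0, 4, 0],
    "MonthlyCharges": [24.10, 52.55, 55.90, 20.25],
    # Assumption: missing totals appear as blank strings.
    "TotalCharges": ["1734.65", " ", "238.5", " "],
})

# Convert to numbers; anything unparseable (such as " ") becomes NaN.
total = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Verify that every missing TotalCharges belongs to a tenure == 0 row.
missing = total.isna()
all_missing_are_new = bool((toy.loc[missing, "tenure"] == 0).all())

# Fill the gaps with MonthlyCharges, mirroring the choice made above.
toy["TotalCharges"] = total.fillna(toy["MonthlyCharges"])
```

Unlike the `np.where` on `tenure`, `fillna` only touches rows that are actually missing, so it would leave a `tenure == 0` row with a valid total unchanged.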

In [5]:
df.isnull().sum()

Unnamed: 0          0
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [6]:
df['TotalCharges'] = df['TotalCharges'].astype(float)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5986 entries, 0 to 5985
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        5986 non-null   int64  
 1   customerID        5986 non-null   object 
 2   gender            5986 non-null   object 
 3   SeniorCitizen     5986 non-null   int64  
 4   Partner           5986 non-null   object 
 5   Dependents        5986 non-null   object 
 6   tenure            5986 non-null   int64  
 7   PhoneService      5986 non-null   object 
 8   MultipleLines     5986 non-null   object 
 9   InternetService   5986 non-null   object 
 10  OnlineSecurity    5986 non-null   object 
 11  OnlineBackup      5986 non-null   object 
 12  DeviceProtection  5986 non-null   object 
 13  TechSupport       5986 non-null   object 
 14  StreamingTV       5986 non-null   object 
 15  StreamingMovies   5986 non-null   object 
 16  Contract          5986 non-null   object 


Remove the useless `Unnamed: 0` column.

In [8]:
df.drop(columns='Unnamed: 0', inplace=True)
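
As an alternative to dropping the column after loading, unnamed columns can be skipped at read time through the callable form of `usecols` in `pd.read_csv`. The sketch below uses a tiny in-memory CSV (`csv_text`, made up for illustration). It assumes the extra column comes from an empty header field, which the notebook does not confirm.

```python
import io

import pandas as pd

# Tiny in-memory CSV with an empty first header field (illustrative only).
csv_text = """,customerID,tenure,Churn
1869,7010-BRBUU,72,No
4528,9688-YGXVR,44,No
"""

# pandas names the empty header "Unnamed: 0"; the callable filters it out.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: not name.startswith("Unnamed"),
)

print(df_small.columns.tolist())
```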