## Listing needed Data

The data set includes information about:

- **Customers who left within the last month**: recorded in the `Churn` column.
- **Services each customer signed up for**: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
- **Account information**: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
- **Demographic info**: gender, age range, and whether they have partners and dependents.
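
One way to map these groups onto column names is a plain dictionary. This is an illustrative sketch: the group keys and the `COLUMN_GROUPS` name are assumptions, the column names are the ones visible in the `data.head()` preview in this notebook, and the column hidden behind `...` in that truncated preview is deliberately left out.

```python
# Hypothetical grouping of the columns visible in the data.head() preview.
# The column hidden behind "..." in the truncated preview is not listed here.
COLUMN_GROUPS = {
    "identifier": ["customerID"],
    "target": ["Churn"],
    "services": [
        "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
        "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
    ],
    "account": [
        "tenure", "Contract", "PaperlessBilling", "PaymentMethod",
        "MonthlyCharges", "TotalCharges",
    ],
    "demographics": ["gender", "SeniorCitizen", "Partner", "Dependents"],
}

# Print each group with the number of columns it holds.
for group, columns in COLUMN_GROUPS.items():
    print(f"{group}: {len(columns)} column(s)")
```

A mapping like this makes it easy to select one group at a time, for example `data[COLUMN_GROUPS["services"]]` once the CSV is loaded.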

## Data Size Estimate

Ideally, several thousand records (customers) are needed for reliable modeling.<br>
**Minimum**: At least 1,000 customers' worth of data, depending on the model complexity and the number of features.

## Data Availability

The data is available on <a href="https://www.kaggle.com/datasets/blastchar/telco-customer-churn">Kaggle</a>.

## Space it takes

Very small: about 1 MB.

## Legal obligations & Authorization

The data is openly available and can be used for any purpose.

## Get Data

Download the data from Kaggle and place it at **../data/WA_Fn-UseC_-Telco-Customer-Churn.csv**.
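
Before loading, it can help to confirm that the file is where the notebook expects it. The sketch below is one possible check, not part of the original workflow: `describe_file` and `DATA_PATH` are hypothetical names, and the path is resolved relative to the notebook's working directory.

```python
from pathlib import Path

# Path taken from the text above; relative to the notebook's working directory.
DATA_PATH = Path("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")


def describe_file(path: Path) -> str:
    """Return a one-line status for the expected data file (hypothetical helper)."""
    if not path.is_file():
        return f"missing: {path}"
    size_mb = path.stat().st_size / 1_000_000  # bytes to megabytes (decimal)
    return f"found: {path} ({size_mb:.2f} MB)"


print(describe_file(DATA_PATH))
```

The reported size also gives a quick check against the storage estimate of roughly 1 MB.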

## Load Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load CSV Data
data = pd.read_csv("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [3]:
# Shape of the data
data.shape

(7043, 21)

In [4]:
# Check the class balance of the target column
np.unique(data.Churn, return_counts=True)

(array(['No', 'Yes'], dtype=object), array([5174, 1869], dtype=int64))

The classes are imbalanced: 5,174 customers (about 73%) stayed and 1,869 (about 27%) churned.<br>
**SMOTE** will be used to balance the classes later.
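
The size of the imbalance can be quantified from the counts printed above. The following is a small sketch rather than a cell from the original notebook: it rebuilds only the label column from those two counts, so it runs without the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild only the label column from the counts printed above
# (5174 "No" and 1869 "Yes"); no other columns are reproduced here.
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

counts = churn.value_counts()                 # absolute class counts
shares = churn.value_counts(normalize=True)   # class shares that sum to 1
imbalance_ratio = counts["No"] / counts["Yes"]

print(shares.round(3))
print(f"majority-to-minority ratio: {imbalance_ratio:.2f}")
```

With the real data loaded, the same two `value_counts` calls can be applied directly to `data.Churn`.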

## Check the size and type of data

It’s a sample of records describing customer interactions and account details.<br>
It contains 7,043 customer records across 21 columns.

## Sample a test set

In [5]:
from sklearn.model_selection import train_test_split

# Hold out 10% as a test set, stratified on Churn so both splits keep the same class ratio
train_data, test_data = train_test_split(data, stratify=data.Churn, test_size=0.1, random_state=42)

In [6]:
# Save both splits without writing the DataFrame index as an extra column
train_data.to_csv("../data/train_data.csv", index=False)
test_data.to_csv("../data/test_data.csv", index=False)
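
A quick way to see what `stratify` does is to compare the churn rate in each split with the rate in the full data. The sketch below is an assumption-laden stand-in, not the notebook's own check: it uses a frame holding only the `Churn` labels, rebuilt from the counts reported earlier, and `churn_rate`, `labels`, `train_part`, and `test_part` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame: only the Churn labels, rebuilt from the counts reported earlier.
labels = pd.DataFrame({"Churn": ["No"] * 5174 + ["Yes"] * 1869})

# Same split settings as in the notebook cell above.
train_part, test_part = train_test_split(
    labels, stratify=labels.Churn, test_size=0.1, random_state=42
)


def churn_rate(frame: pd.DataFrame) -> float:
    """Fraction of rows labelled "Yes" (hypothetical helper)."""
    return float((frame.Churn == "Yes").mean())


print(len(train_part), len(test_part))
print(
    f"full: {churn_rate(labels):.3f}  "
    f"train: {churn_rate(train_part):.3f}  "
    f"test: {churn_rate(test_part):.3f}"
)
```

With the real `train_data` and `test_data`, the same helper can be applied directly to confirm that both splits keep roughly the same share of churned customers.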