## Problem Statement
Given a customer's historical data, we want to predict whether that customer will stop using a company's product. We will use the Telco Customer Churn dataset to build an end-to-end, production-grade machine learning system that predicts whether a customer will stay loyal. The dataset has 20 input features and a target variable for 7043 customers.

Churn occurs when a customer leaves a company, that is, stops paying for its services or products. A company's churn rate is the number of customers who churned divided by the total number of customers, multiplied by 100 to express it as a percentage.
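
The churn-rate formula above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up five-row frame (the `toy` DataFrame and its values are assumptions, not taken from the Telco data); only the `Churn` column name and its Yes/No values follow the dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Telco dataset.
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes"]})

churned = (toy["Churn"] == "Yes").sum()  # number of customers who churned
total = len(toy)                         # total number of customers
churn_rate = churned / total * 100       # expressed as a percentage

print(f"Churn rate: {churn_rate:.1f}%")
```

The same three lines should work on the real frame once it is loaded, provided the `Churn` column still holds the strings Yes and No.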

## Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, refers to the process of identifying and correcting or removing errors, inconsistencies, inaccuracies, and anomalies in a dataset. It involves tasks such as handling missing values, correcting typos, resolving formatting issues, dealing with outliers, and eliminating duplicate or redundant entries.
Why data cleaning matters:
+ Data Accuracy
+ Reliable Insights
+ Effective Decision-making
+ Data Integration
+ Regulatory Compliance
+ Efficient Analysis
+ Data Reusability


In [13]:
# Importing the libraries
import pandas as pd
import numpy as np

In [14]:
# Loading the data
data = pd.read_csv('/Users/tarakram/Documents/Customer-Churn/data/raw/customer_churn_raw_data.csv')
# pd.set_option('display.max_columns', None)
data.head()

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | InternetService | OnlineSecurity | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | DSL | No | Month-to-month | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | DSL | Yes | One year | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | DSL | Yes | Month-to-month | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | DSL | Yes | One year | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | Fiber optic | No | Month-to-month | Electronic check | 70.7 | 151.65 | Yes |


## Data Description
The full Telco Customer Churn dataset has the columns listed below. The raw file loaded in this notebook contains only 13 of them, as the outputs show.

+ customerID: A unique identifier assigned to each customer.
+ gender: The gender of the customer (Female or Male).
+ SeniorCitizen: Whether the customer is a senior citizen (1 for senior citizen, 0 otherwise).
+ Partner: Whether the customer has a partner (Yes or No).
+ Dependents: Whether the customer has dependents (Yes or No).
+ tenure: The number of months the customer has been with the telecom company.
+ PhoneService: Whether the customer has a phone service (Yes or No).
+ MultipleLines: Whether the customer has multiple phone lines (Yes, No, or No phone service).
+ InternetService: The type of internet service the customer has subscribed to (DSL, Fiber optic, or No).
+ OnlineSecurity: Whether the customer has online security (Yes, No, or No internet service).
+ OnlineBackup: Whether the customer has online backup (Yes, No, or No internet service).
+ DeviceProtection: Whether the customer has device protection (Yes, No, or No internet service).
+ TechSupport: Whether the customer has technical support (Yes, No, or No internet service).
+ StreamingTV: Whether the customer has streaming TV (Yes, No, or No internet service).
+ StreamingMovies: Whether the customer has streaming movies (Yes, No, or No internet service).
+ Contract: The type of contract the customer has subscribed to (Month-to-month, One year, or Two year).
+ PaperlessBilling: Whether the customer has opted for paperless billing (Yes or No).
+ PaymentMethod: The payment method used by the customer (Electronic check, Mailed check, Bank transfer (automatic), or Credit card (automatic)).
+ MonthlyCharges: The amount charged to the customer on a monthly basis.
+ TotalCharges: The total amount charged to the customer over the entire tenure period.
+ Churn: The target variable, indicating whether the customer has churned (Yes or No).


In [15]:
# shape of the data
print(f'Our Data has {data.shape[0]} rows, and {data.shape[1]} Columns.')

Our Data has 7043 rows, and 13 Columns.


In [16]:
# checking for null values 
data.isnull().sum()

gender             0
SeniorCitizen      0
Partner            0
Dependents         0
tenure             0
PhoneService       0
InternetService    0
OnlineSecurity     0
Contract           0
PaymentMethod      0
MonthlyCharges     0
TotalCharges       0
Churn              0
dtype: int64
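
One caveat about the check above: `isnull()` only flags true missing values (`NaN`/`None`), so an entry that is an empty or whitespace-only string in a text column is not counted. A minimal sketch of how such entries could be detected, on a made-up column (the `toy` frame and the blank entry are assumptions for illustration; the outputs in this notebook do not show whether the real file contains blanks):

```python
import pandas as pd

# Toy column for illustration: the second entry is a blank string, not NaN.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isnull() does not treat the blank string as missing
n_null = toy["TotalCharges"].isnull().sum()

# Count entries that are empty once surrounding whitespace is stripped
n_blank = (toy["TotalCharges"].str.strip() == "").sum()

print(f"nulls: {n_null}, blank strings: {n_blank}")
```

Running the same `str.strip()` comparison on each text column of the real data would be one way to confirm that zero nulls really means zero missing values.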

In [17]:
# checking for duplicated values.
data.duplicated().sum()

34

In [18]:
data.drop_duplicates(inplace=True)
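
To make the two steps above concrete, here is a small sketch of what `duplicated()` and `drop_duplicates()` do, using a made-up frame (the `toy` rows are assumptions; only the column names follow the dataset). By default, `duplicated()` marks every row that is identical to an earlier row, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Toy rows for illustration; the values are made up.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 1, 2],
    "MonthlyCharges": [29.85, 56.95, 29.85, 53.85],
})

# Rows that repeat an earlier row exactly (here: row 2 repeats row 0)
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicates: {n_dupes}, rows after dropping: {len(deduped)}")
```

Since the loaded file has no customerID column, two different customers with identical attributes would also be flagged as duplicates, so it may be worth confirming that dropping them is intended.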

In [19]:
# counting the distinct values in each column.
data.nunique()

gender                2
SeniorCitizen         2
Partner               2
Dependents            2
tenure               73
PhoneService          2
InternetService       3
OnlineSecurity        3
Contract              3
PaymentMethod         4
MonthlyCharges     1585
TotalCharges       6531
Churn                 2
dtype: int64
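
`nunique()` reports how many distinct values each column has, but not what they are. A possible follow-up, sketched here on a made-up frame (the `toy` data is an assumption for illustration), is to list the distinct values of every non-numeric column:

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
})

# For every non-numeric column, collect its distinct values in sorted order
categories = {
    col: sorted(toy[col].unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in categories.items():
    print(f"{col}: {values}")
```

Listing the values this way makes it easy to see which columns are truly two-valued and which, like OnlineSecurity with 3 distinct values, carry an extra category.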

In [20]:
# checking the data type of each column
data.dtypes

gender              object
SeniorCitizen        int64
Partner             object
Dependents          object
tenure               int64
PhoneService        object
InternetService     object
OnlineSecurity      object
Contract            object
PaymentMethod       object
MonthlyCharges     float64
TotalCharges        object
Churn               object
dtype: object

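Most columns have the object dtype, and machine learning models work on numbers, so we will convert these columns to numeric types in the preprocessing module. Note that TotalCharges is stored as object even though it holds numeric values.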
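
As a rough preview of that conversion, the sketch below shows one common approach on a made-up frame. It is not the preprocessing module itself: the `toy` data, the `yes_no` mapping, and the choice of columns are all assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "No", "Yes"],
})

# Parse numeric text; entries that cannot be parsed become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Map two-valued Yes/No columns to 1/0 (hypothetical mapping)
yes_no = {"Yes": 1, "No": 0}
for col in ["Partner", "Churn"]:
    toy[col] = toy[col].map(yes_no)

print(toy.dtypes)
```

Columns with more than two categories, such as Contract or PaymentMethod, would need a different encoding (for example one-hot encoding), which is left to the preprocessing step.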

In [21]:
# Saving the data into processed folder.
data.to_csv('/Users/tarakram/Documents/Customer-Churn/data/processed/cleaned_data.csv', index=False)