
# Data Preprocessing for Customer Churn Prediction

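This Jupyter notebook walks through preprocessing the Telco Customer Churn dataset: loading the CSV, handling missing values, converting data types, standardizing numerical features, encoding categorical variables, and saving the result.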

## Import Libraries

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
```

## Load the Data

```python
# Load the dataset
data_path = './WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(data_path)
```

## Initial Data Exploration

```python
# Display the first few rows of the dataframe
df.head()
```
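Beyond `head()`, a quick look at column dtypes and the target balance can flag problems before any cleaning. The following is a minimal sketch on a hand-made frame; the `toy` values are illustrative assumptions, not rows read from the dataset file.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # stored as text
    "Churn": ["No", "No", "Yes", "No"],
})

# Column dtypes reveal numeric fields that were parsed as text
print(toy.dtypes)

# Share of each class in the target column
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share)
```

On the real data, the same two calls would be run on `df` instead of `toy`.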

## Handling Missing Values

```python
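# Convert 'TotalCharges' first: entries that cannot be parsed as numbers
# (for example, blank strings) only count as missing once
# pd.to_numeric(errors='coerce') has turned them into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check for missing values
print(df.isnull().sum())

# Fill missing 'TotalCharges' values with the column mean; assigning the result
# back avoids the chained inplace fillna, which newer pandas versions warn about
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())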
```
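To see why the numeric conversion matters for the missing-value count, here is a small sketch on a made-up Series. The `raw` values are assumptions chosen for illustration; the real column may contain different kinds of bad entries.

```python
import pandas as pd

# Made-up stand-in for the raw column: two entries are whitespace only
raw = pd.Series(["29.85", " ", "1889.5", " "], name="TotalCharges")

# Before conversion the blanks are ordinary strings, so isnull() reports none
missing_before = int(raw.isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")
missing_after = int(numeric.isnull().sum())

# Fill with the mean of the parsed values, assigning the result back
filled = numeric.fillna(numeric.mean())

print(missing_before, missing_after)
print(filled)
```

Mean imputation is the choice made in this notebook; other strategies such as the median or dropping the rows are equally possible and depend on the data.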

## Converting Data Types

```python
# Ensure 'TotalCharges' is float; pd.to_numeric above normally already returns
# float64, so this cast is an explicit safeguard rather than a required step
df['TotalCharges'] = df['TotalCharges'].astype(float)
```

## Normalizing or Standardizing Data

```python
# Standardize numerical features
scaler = StandardScaler()
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']  # example columns
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
```
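The cell above fits the scaler on the full dataframe. If the data is later split into training and test sets, a common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows, so that no information from the test set enters preprocessing. The sketch below is not part of the original notebook; it uses synthetic data, and the split ratio and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numerical feature columns
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_test_scaled.mean(axis=0))   # near 0, but not exactly
```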

## Encoding Categorical Variables

```python
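# Encode categorical variables as integer codes.
# Note: LabelEncoder assigns codes in sorted label order, which implies an
# ordering that nominal features do not actually have.
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols.remove('customerID')  # Assuming 'customerID' is an identifier that should not be encoded

# Keep one fitted encoder per column so the codes can be mapped back later
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])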
```
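Integer codes are compact, but some models read them as ordered quantities. One alternative, shown here as a sketch rather than as the notebook's method, is to map the binary target explicitly and one-hot encode nominal features with `pd.get_dummies`. The `toy` frame and its category values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# Binary target: an explicit mapping keeps the meaning of 0 and 1 visible
toy["Churn"] = toy["Churn"].map({"No": 0, "Yes": 1})

# Nominal feature: one indicator column per category, no implied ordering
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)

print(encoded)
```

One-hot encoding widens the frame by one column per category, so it suits columns with few distinct values.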

## Save the Preprocessed Data

```python
# Save the cleaned dataframe
df.to_csv('./processed_telco_churn.csv', index=False)
```
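After saving, a quick round trip can confirm that the file reloads with the expected columns. This is a minimal sketch on a made-up frame written to a temporary directory; the file name `processed_example.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the preprocessed data
frame = pd.DataFrame({
    "customerID": ["A-1", "A-2"],
    "tenure": [-0.5, 0.5],
    "Churn": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed_example.csv")  # hypothetical file name
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column appears on reload
same_columns = list(reloaded.columns) == list(frame.columns)
same_shape = reloaded.shape == frame.shape
print(same_columns, same_shape)
```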


Output of `df.head()` from the Initial Data Exploration step (middle columns elided as `...`):

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |

