## 📊 DATA UNDERSTANDING & CLEANING


In this notebook, we load the raw churn dataset, perform initial exploration to understand its structure, identify missing or inconsistent values, and clean/transform the data so that it’s ready for feature engineering (Notebook 04).

In [None]:
#Import the packages we need
import pandas as pd
import numpy as np


###  Load Dataset



In [None]:
df = pd.read_csv("../data/customer_churn.csv")
# Show the first few rows to confirm the file loaded correctly
df.head()



### Initial Structure: Shape, Info, and Columns



We check the number of rows and columns, data types, and non-null counts to see if any obvious issues pop up right away.


In [None]:
#Identifying how many rows and columns are in the dataset
print("Dataset shape:", df.shape)

#Identifying the data types and non-null counts
df.info()

#Looking at the column names and trying to get more information generally about what we're working with
print("\nColumns:\n", df.columns.tolist())



- The dataset contains **7,043 rows** and **21 columns**.
- **No columns report missing values** (all non-null counts equal 7,043), although this check does not catch blank strings.
- There are **18 object-type columns** and only **3 numeric columns** (`SeniorCitizen`, `tenure`, and `MonthlyCharges`).
- Notably, **`TotalCharges` is read as an object** even though it should be numeric. We will need to convert it to a numeric type in a later step.
- Most object columns correspond to categorical variables (e.g., `gender`, `Partner`, `Contract`, `Churn`), so we’ll handle those appropriately during preprocessing.
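
To illustrate why a column of amounts can load as a non-numeric type, here is a minimal sketch on a made-up three-row sample (the values are hypothetical, not taken from the dataset). It assumes the problem is a whitespace-only field, which is enough to stop pandas from parsing the column as numbers.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the blank TotalCharges field is an assumption
# used to mimic a customer who has not been billed yet.
csv_text = (
    "tenure,MonthlyCharges,TotalCharges\n"
    "1,29.85,29.85\n"
    "0,52.55, \n"
    "34,56.95,1889.5\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# A single whitespace-only entry keeps the whole column from being numeric.
total_is_numeric = pd.api.types.is_numeric_dtype(sample["TotalCharges"])
monthly_is_numeric = pd.api.types.is_numeric_dtype(sample["MonthlyCharges"])

print(sample.dtypes)
print("TotalCharges numeric:", total_is_numeric)
print("MonthlyCharges numeric:", monthly_is_numeric)
```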









### Descriptive Statistics & Unique Values



- Use `.describe()` on numeric columns to spot odd min/max.
- Use `.nunique()` on every column to see how many distinct values each has (especially important for categorical features).


In [None]:
# Numeric summary
print("Numeric summary:\n")
print(df.describe().T)

# Unique counts per column
print("\nUnique counts per column:")
print(df.nunique().sort_values(ascending=False))


- From the `.describe()` summary, `tenure` and `MonthlyCharges` both span a wide range of values, so they are worth examining against churn in later notebooks.
- `SeniorCitizen` is binary but encoded as integers (0 or 1), so it’s fine as-is for modeling.
- `TotalCharges` has 6,531 unique values across 7,043 rows. Repeated amounts alone could explain that gap, so the stronger warning sign is its `object` dtype: a column of amounts that pandas could not parse as numeric likely contains **blank or non-numeric values** that need to be cleaned.
- Categorical features mostly have 2 or 3 unique values, so they’ll be easy to encode later.
- `customerID` is a unique identifier with no modeling value, so we’ll drop it eventually.


In [None]:
# Checking problematic values in TotalCharges
print("Non-numeric 'TotalCharges' rows:")
print(df[pd.to_numeric(df["TotalCharges"], errors="coerce").isna()][["customerID", "tenure", "TotalCharges"]])


In [None]:
# Convert TotalCharges to numeric, forcing invalid strings to NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')

# Drop rows where TotalCharges is NaN and reset the index.
# dropna returns a new frame, which can avoid SettingWithCopyWarning in later column assignments.
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)


- We discovered that 11 rows had blank strings in the `TotalCharges` column, all with `tenure = 0`. These are likely new customers who haven't been billed yet.
- We converted the column to numeric, coercing invalid entries to `NaN`, and then removed those rows since they don't offer meaningful information for churn prediction.
- Index was reset to keep the dataset clean and continuous.
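
The coercion step can be illustrated on a small made-up Series (the values are hypothetical). This is only a sketch of how `errors="coerce"` behaves, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical values: two valid amounts and one whitespace-only string.
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
n_invalid = int(converted.isna().sum())

print(converted)
print("Invalid entries:", n_invalid)
```

Counting the NaNs before dropping them gives a quick way to confirm how many rows the cleaning step will remove.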


### Checking for duplicated rows
- We check for duplicate rows in the dataset. Duplicate records could distort model learning and skew metrics.
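
As a small illustration on made-up data, `DataFrame.duplicated()` only flags rows that match an earlier row in every column. Passing `subset="customerID"` is an additional, stricter check we assume could be useful here: it flags repeated IDs even when other fields differ.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are exact copies; rows 2 and 3 share an ID only.
sample = pd.DataFrame({
    "customerID": ["B2", "B2", "C3", "C3"],
    "tenure": [12, 12, 5, 6],
})

# Full-row duplicates: every column must match an earlier row.
full_dupes = int(sample.duplicated().sum())

# ID-level duplicates: the same customerID appears more than once.
id_dupes = int(sample.duplicated(subset="customerID").sum())

print("Full-row duplicates:", full_dupes)
print("Repeated customerIDs:", id_dupes)
```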


In [None]:

duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")


### Checking for missing values
- `.info()` earlier reported every column as non-null, yet the `TotalCharges` column contained blank strings, which pandas does not count as missing.
- We therefore look deeper for empty strings or whitespace-only values masquerading as real data.

In [None]:
# Check for missing values (NaNs)
print("Missing values (NaNs):\n", df.isna().sum())

# Check for empty strings or spaces in object columns
print("\nEmpty strings or whitespace in object columns:\n")
for col in df.select_dtypes(include='object'):
    empty_count = df[col].apply(lambda x: isinstance(x, str) and x.strip() == "").sum()
    if empty_count > 0:
        print(f"{col}: {empty_count}")


- We verified the dataset for both traditional missing values (NaN) and hidden missing indicators such as empty or whitespace-only strings.
- No NaN values were found in any column.
- No object-type columns contained empty or whitespace-only entries.
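
The same blank-string check can also be written with pandas string methods instead of a per-value `lambda`. The sketch below uses a hypothetical sample and lists the text columns explicitly; both are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with one whitespace-only entry and one empty string.
sample = pd.DataFrame({
    "Partner": ["Yes", " ", "No"],
    "Contract": ["One year", "Two year", ""],
    "tenure": [1, 0, 24],
})

# Text columns are listed explicitly here to keep the sketch simple.
text_cols = ["Partner", "Contract"]

# For each text column, count entries that are empty once whitespace is stripped.
blank_counts = sample[text_cols].apply(lambda s: s.str.strip().eq("").sum())

print(blank_counts[blank_counts > 0])
```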

### Strip Whitespace from Object Columns
In this step, we remove any leading or trailing spaces from all string-based (object) columns to ensure consistency (e.g., no accidental "Yes " vs. "Yes"). After stripping, we’ll re-print unique values for each object column to confirm that no stray spaces remain.
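
To show why stray spaces matter, the following sketch uses a made-up column (hypothetical values) in which padded labels are counted as separate categories until they are stripped.

```python
import pandas as pd

# Hypothetical column where leading/trailing spaces create spurious categories.
col = pd.Series(["Yes", "Yes ", " No", "No"])

# Before stripping, "Yes" and "Yes " (and " No" and "No") are distinct values.
before = col.nunique()

# After stripping, only the two intended labels remain.
after = col.str.strip().nunique()

print("Distinct values before:", before)
print("Distinct values after:", after)
```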

In [None]:
# Strip leading/trailing whitespace from all object-type columns
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip()

# Re-print unique values for each object column to verify
for col in df.select_dtypes(include="object"):
    print(f"\nUnique values in '{col}':")
    print(sorted(df[col].unique()))


### Whitespace Stripping & Category Review

- After stripping leading/trailing spaces, we re-inspected each object-type column’s unique values.
- **`customerID`**: All values look uniform (alphanumeric IDs), so no residual whitespace issues here.
- **Binary “Yes/No” columns** (e.g., `Partner`, `Dependents`, `PhoneService`, `PaperlessBilling`, `Churn`) now correctly show only `'Yes'` or `'No'`. `gender` is also binary, with the values `'Female'` and `'Male'`.
- For **service-related columns** (`MultipleLines`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`), we still see entries like `"No phone service"` or `"No internet service"`. Those are legitimate labels (indicating the customer doesn’t have that service), but we need to decide how to encode them:
  - Since these columns are effectively binary (has service vs. doesn’t), we’ll consolidate `"No phone service"` -> `"No"` and `"No internet service"` -> `"No"` in the next step.
- **`InternetService`** values are `['DSL', 'Fiber optic', 'No']`, which aligns with expectations (no extra whitespace).
- **`Contract`** values are `['Month-to-month', 'One year', 'Two year']` (no stray spaces).
- **`PaymentMethod`** shows all four expected categories (`'Bank transfer (automatic)'`, `'Credit card (automatic)'`, `'Electronic check'`, `'Mailed check'`), so no further trimming is needed.

**Conclusion:** Whitespace cleanup succeeded. Our next move is to standardize service-related columns by mapping `"No phone service"` and `"No internet service"` to a simple `"No"`, preparing them for binary encoding.



### Consolidate Service Labels ("No phone service" / "No internet service" -> "No")

- Each of the 7 service-related columns is effectively binary: the customer either has the service or does not.
- We standardize all 7 columns so that they contain only "Yes" or "No".

In [None]:
# Consolidate "No phone service" and "No internet service" to "No"
replace_cols = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# Replace both "No phone service" and "No internet service" with "No"
for col in replace_cols:
    df[col] = df[col].replace({'No phone service': 'No', 'No internet service': 'No'})

# Double-check the unique values now
for col in replace_cols:
    print(f"\nUnique values in '{col}': {sorted(df[col].unique())}")


### Service Label Consolidation



- We simplified category values across 7 service-related columns:
  - `MultipleLines`
  - `OnlineSecurity`
  - `OnlineBackup`
  - `DeviceProtection`
  - `TechSupport`
  - `StreamingTV`
  - `StreamingMovies`

- After the replacements, all 7 columns now contain **only `'Yes'` or `'No'`**, making them ready for encoding.
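
A programmatic check can back up the visual inspection of unique values. The sketch below is a minimal example on a hypothetical two-column sample; it applies the same mapping as the cleaning cell and then collects any values outside `{"Yes", "No"}`.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame, restricted to two service columns.
sample = pd.DataFrame({
    "MultipleLines": ["No phone service", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "No", "Yes"],
})
service_cols = ["MultipleLines", "OnlineSecurity"]

# Same mapping as in the consolidation cell.
mapping = {"No phone service": "No", "No internet service": "No"}
sample[service_cols] = sample[service_cols].replace(mapping)

# Collect any column that still holds values outside the expected labels.
expected = {"Yes", "No"}
unexpected = {
    col: sorted(set(sample[col].unique()) - expected)
    for col in service_cols
    if not set(sample[col].unique()) <= expected
}

print("Columns with unexpected labels:", unexpected)
```

An empty dictionary means every checked column contains only the two expected labels.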


In [None]:
# Confirm that no missing values or unexpected issues arose during cleaning
df.info()  # info() prints its report directly and returns None, so no print() is needed
print(df.isna().sum())


After all cleaning steps, no columns contain nulls and all dtypes are as expected (`TotalCharges` is now numeric). The dataset is ready for feature engineering.

In [None]:
#Saving the cleaned file to a csv 
df.to_csv("../data/customer_churn_cleaned.csv", index=False)
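
After saving, a quick round-trip read can confirm that the file reloads with the same shape and a numeric `TotalCharges`. The sketch below uses a hypothetical two-row frame and a temporary directory so that it does not touch the real data file.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned sample; the notebook itself would reload the saved CSV.
cleaned = pd.DataFrame({"tenure": [1, 34], "TotalCharges": [29.85, 1889.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_cleaned.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should match in shape and keep TotalCharges numeric.
same_shape = reloaded.shape == cleaned.shape
numeric_total = pd.api.types.is_numeric_dtype(reloaded["TotalCharges"])

print("Same shape:", same_shape)
print("TotalCharges numeric:", numeric_total)
```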
