## Phase 1 – Data Setup & Ingestion

In this phase, we will:
1. Import required libraries  
2. Load the raw Telco Customer Churn dataset  
3. Preview the dataset (rows, columns, datatypes)  
4. Check for missing values  
5. Check the target column (`Churn`) distribution  
6. Review summary statistics for numerical and categorical columns  
7. Save a small sample and metadata for future reference  


## Step 1 – Load the dataset

We first check that the dataset exists at the expected path in the raw folder.  
Then we read only the first 5 rows, which gives a quick look at the data without loading the whole file.


In [18]:
# Import libraries

import pandas as pd
import os
import json

# Define dataset path
DATA_PATH = "data/raw/Telco-Customer-Churn.csv"  


In [39]:
# Cell 2: Check if file exists and load preview
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}")

# Preview only first 5 rows
df_preview = pd.read_csv(DATA_PATH, nrows=5)
df_preview


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Step 2 – Load full dataset

Now we will load the complete dataset to inspect its structure (rows, columns, datatypes).


In [27]:
df = pd.read_csv(DATA_PATH)

print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)


Dataset Shape: (7043, 21)

Column Names: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Data Types:
 customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


## Step 3 – Missing Values Check

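Missing values can affect data quality, so we count the null values in each column. Note that `isnull()` only counts true nulls (`NaN`/`None`); blank or whitespace-only strings are not counted. Because `TotalCharges` loaded as `object` rather than a numeric type, it is worth a closer look during cleaning.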


In [42]:

df.isnull().sum()


customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
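
A zero count here only rules out true nulls. Since `TotalCharges` is stored as text, one way to look for hidden gaps is to count the entries that fail to parse as numbers. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset), and the helper name `count_unparseable` is hypothetical.

```python
import pandas as pd


def count_unparseable(series: pd.Series) -> int:
    """Count non-null entries that cannot be parsed as numbers."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Entries that were present originally but became NaN after coercion
    return int((coerced.isna() & series.notna()).sum())


# Illustrative data only: one whitespace entry hiding in a text column
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

print("Nulls reported by isnull():", int(toy["TotalCharges"].isnull().sum()))
print("Entries that do not parse as numbers:", count_unparseable(toy["TotalCharges"]))
```

Running the same helper on `df["TotalCharges"]` would show whether the full dataset has any such entries; the result is left for the cleaning phase.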

## Step 4 – Target Column Distribution

The dataset has a target column `Churn` (Yes/No).  
We check its frequency and percentage distribution.


In [45]:

if "Churn" in df.columns:
    print(df["Churn"].value_counts(dropna=False))
    print("\nProportion:\n", df["Churn"].value_counts(normalize=True, dropna=False))
else:
    print("Target column 'Churn' not found.")


Churn
No     5174
Yes    1869
Name: count, dtype: int64

Proportion:
 Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
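
The proportions above show a moderately imbalanced target. As a rough sketch of what that implies, the arithmetic below recomputes the shares from the two counts and derives the accuracy of a trivial model that always predicts the majority class. The variable names are illustrative, and the counts are copied by hand from the output above.

```python
# Counts copied from the value_counts() output above
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

# Number of non-churners per churner
imbalance_ratio = counts["No"] / counts["Yes"]

print(f"Total customers: {total}")
print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"No:Yes ratio: {imbalance_ratio:.2f}")
```

Any model built in later phases should beat this baseline to be considered useful, which is why accuracy alone can be a misleading metric on this target.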


## Step 5 – Quick Statistical Summary

We display summary statistics for the numerical columns, then list the categorical columns and print the number of unique values for the first five.


In [50]:
# Summary statistics for numeric columns
# ("number" matches every integer and float dtype, regardless of platform)
display(df.describe(include="number").T.head(10))

# Show categorical columns (first 10 names) and unique counts (first 5 columns)
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(f"Categorical columns ({len(cat_cols)}): {cat_cols[:10]}")
for c in cat_cols[:5]:
    print(f"{c}: {df[c].nunique()} unique values")


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


Categorical columns (18): ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection']
customerID: 7043 unique values
gender: 2 unique values
Partner: 2 unique values
Dependents: 2 unique values
PhoneService: 2 unique values


## Step 6 – Save Metadata & Sample

To keep track of the dataset, we save:  
- Metadata (row count, column count, missing values) as JSON  
- A sample of the first 500 rows as CSV (useful for quick testing / sharing)  


In [72]:
# Save metadata and sample (creates the output folder if needed)

# Output folder, relative to the notebook's working directory
OUT_DIR = "../data/analysis"
META_PATH = os.path.join(OUT_DIR, "raw_preview_meta.json")
SAMPLE_PATH = os.path.join(OUT_DIR, "sample_first_500_rows.csv")

# Ensure folder exists
os.makedirs(OUT_DIR, exist_ok=True)

# Save metadata
meta = {
    "rows": df.shape[0],
    "cols": df.shape[1],
    "missing_counts": df.isnull().sum().to_dict()
}

with open(META_PATH, "w") as fh:
    json.dump(meta, fh, indent=2)

# Save sample of first 500 rows
df.head(500).to_csv(SAMPLE_PATH, index=False)

# Report the paths actually written rather than hard-coded names
print("Saved metadata to", META_PATH)
print("Saved sample (first 500 rows) to", SAMPLE_PATH)
print("File absolute path:", os.path.abspath(META_PATH))



Saved metadata to ../data/analysis/raw_preview_meta.json
Saved sample (first 500 rows) to ../data/analysis/sample_first_500_rows.csv
File absolute path: /Users/hrithik/data/analysis/raw_preview_meta.json
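
To confirm that saved files are usable later, it can help to read them back and compare them with the in-memory data. The sketch below does this on a small made-up frame in a temporary folder, so it does not touch the project files; the file names `meta.json` and `sample.csv` are placeholders, not the names used above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame standing in for the full dataset
toy = pd.DataFrame({"customerID": ["a1", "b2", "c3"], "Churn": ["No", "Yes", "No"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "analysis"
    out_dir.mkdir(parents=True, exist_ok=True)

    meta = {
        "rows": toy.shape[0],
        "cols": toy.shape[1],
        # int() turns NumPy integers into plain ints that json can serialize
        "missing_counts": {col: int(n) for col, n in toy.isnull().sum().items()},
    }
    meta_path = out_dir / "meta.json"
    meta_path.write_text(json.dumps(meta, indent=2))

    sample_path = out_dir / "sample.csv"
    toy.head(2).to_csv(sample_path, index=False)

    # Read both files back and compare against the in-memory frame
    loaded_meta = json.loads(meta_path.read_text())
    loaded_sample = pd.read_csv(sample_path)

    meta_ok = loaded_meta["rows"] == len(toy) and loaded_meta["cols"] == toy.shape[1]
    sample_ok = list(loaded_sample.columns) == list(toy.columns) and len(loaded_sample) == 2

print("metadata ok:", meta_ok)
print("sample ok:", sample_ok)
```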


## Phase 1 Completed

We have:
- Successfully loaded the raw dataset  
- Explored basic info (shape, dtypes, missing values)  
- Checked churn distribution  
- Reviewed summary statistics  
- Saved metadata and sample file  

**Next Phase: Data Cleaning & Exploratory Data Analysis (EDA)**
