# Customer Study Notebook

## Objectives

*   Answer business requirement 1: 
    * The client wants to understand patterns in the customer base in order to learn which variables are most strongly correlated with customer churn.

## Inputs

* outputs/datasets/collection/train.csv
* outputs/datasets/collection/val.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app

---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/cherryleaves/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/cherryleaves'
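Re-running the `os.chdir` cell above moves up one more level each time, which can leave the kernel outside the project. A minimal sketch of an idempotent alternative follows. The helper name `resolve_project_root` and the default folder name `jupyter_notebooks` are assumptions taken from the path shown in the output, not part of the original notebook.

```python
from pathlib import Path


def resolve_project_root(start, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of `start` only when `start` is the notebooks folder.

    `notebooks_dirname` is an assumed folder name based on the path printed
    in the first cell; any other directory is returned unchanged.
    """
    start = Path(start)
    return start.parent if start.name == notebooks_dirname else start


# Pure path manipulation, so this runs without touching the filesystem.
root = resolve_project_root("/workspace/cherryleaves/jupyter_notebooks")
print(root)
```

In the notebook this could be used as `os.chdir(resolve_project_root(os.getcwd()))`, which is safe to execute more than once.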

---

# Load Data

In [4]:
import pandas as pd

train_file = '/workspace/cherryleaves/outputs/datasets/collection/train.csv'
val_file = '/workspace/cherryleaves/outputs/datasets/collection/val.csv'

train_df = pd.read_csv(train_file)
val_df = pd.read_csv(val_file)

print("Training Data Preview:")
print(train_df.head())

print(f"Training data shape: {train_df.shape}")
print(f"Validation data shape: {val_df.shape}")


Training Data Preview:
                                          image_path  label
0  /workspace/cherryleaves/inputs/datasets/raw/ch...      1
1  /workspace/cherryleaves/inputs/datasets/raw/ch...      0
2  /workspace/cherryleaves/inputs/datasets/raw/ch...      0
3  /workspace/cherryleaves/inputs/datasets/raw/ch...      1
4  /workspace/cherryleaves/inputs/datasets/raw/ch...      0
Training data shape: (3366, 2)
Validation data shape: (842, 2)
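Both splits are expected to share the `image_path` and `label` columns shown in the preview. The sketch below wraps `pd.read_csv` in a small check so that a malformed file fails early. The helper `load_split` and the constant `EXPECTED_COLUMNS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Column names taken from the preview output above.
EXPECTED_COLUMNS = ("image_path", "label")


def load_split(csv_path, expected_columns=EXPECTED_COLUMNS):
    """Read one split and raise if an expected column is absent."""
    df = pd.read_csv(csv_path)
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{csv_path} is missing columns: {missing}")
    return df
```

With the working directory set to the project root, relative paths such as `load_split("outputs/datasets/collection/train.csv")` should work as well as the absolute ones used in the cell.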


## Validate image paths

In [5]:
import os

# Collect every training image path that does not exist on disk
missing_files = [path for path in train_df['image_path'] if not os.path.exists(path)]
print(f"Missing files in training data: {len(missing_files)}")

if missing_files:
    print("Sample of missing files:")
    print(missing_files[:5])  # Display the first few missing file paths
else:
    print("All training image paths are valid!")


Missing files in training data: 0
All training image paths are valid!
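The check above covers the training split only. A minimal sketch of the same idea applied to any number of splits is shown here. The function names `find_missing_files` and `report_missing` are hypothetical, and `Path.is_file` is slightly stricter than `os.path.exists` because it also rejects directories.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point to an existing regular file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def report_missing(splits: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each split name to its list of missing files and print a summary."""
    result = {name: find_missing_files(paths) for name, paths in splits.items()}
    for name, missing in result.items():
        print(f"Missing files in {name} data: {len(missing)}")
    return result
```

In the notebook this might be called as `report_missing({"training": train_df["image_path"], "validation": val_df["image_path"]})`.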


---

# Data Exploration

Explore the dataset to understand the variable types and their distributions, and what these imply for the study.

In [17]:
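import pandas as pd
from ydata_profiling import ProfileReport

# Reload the training split so this cell can be run on its own
train_df = pd.read_csv("/workspace/cherryleaves/outputs/datasets/collection/train.csv")

# Build the profiling report and render it inline in the notebook
profile = ProfileReport(train_df, title="Training Data Exploration Report")
profile.to_notebook_iframe()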




In [18]:

print("Missing values per column:")
print(train_df.isnull().sum())

print("Data types of columns:")
print(train_df.dtypes)

print("Label distribution in training data:")
print(train_df['label'].value_counts())


Missing values per column:
image_path    0
label         0
dtype: int64
Data types of columns:
image_path    object
label          int64
dtype: object
Label distribution in training data:
label
1    1702
0    1664
Name: count, dtype: int64
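The raw counts above suggest the two classes are close to balanced (roughly 50.6% versus 49.4%). The sketch below shows one way to express that as proportions and as a majority-to-minority ratio. The helper `class_balance` is a hypothetical name, and the label series is rebuilt from the counts printed above so the example runs without the CSV.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and proportions, indexed by label value."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Rebuild the label column from the counts reported in the cell output.
labels = pd.Series([1] * 1702 + [0] * 1664, name="label")

summary = class_balance(labels)
print(summary.round(4))

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = summary["count"].max() / summary["count"].min()
print(f"Imbalance ratio: {imbalance_ratio:.3f}")
```

In the notebook, `class_balance(train_df["label"])` would give the same summary directly from the loaded data.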
