Your company wants to deploy an AI model that helps retain customers who are about to leave. The IT department has sent you a first sample of historical data (`sample_data`). Your task is not to build the model yet, but to assess the quality of the raw data.

In [None]:
import pandas as pd
import numpy as np

data = [
    {"customer_id": "C001", "join_date": "2022-01-15", "monthly_spend": 59.99, "contract_type": "One year", "age": 34, "last_interaction_channel": "App", "complaints_count": 0, "churned": False},
    {"customer_id": "C002", "join_date": "2023-03-10", "monthly_spend": np.nan, "contract_type": "Month-to-month", "age": 150, "last_interaction_channel": "Call", "complaints_count": 2, "churned": True},
    {"customer_id": "C003", "join_date": np.nan, "monthly_spend": -10.00, "contract_type": "Two year", "age": 45, "last_interaction_channel": "Email", "complaints_count": 0, "churned": False},
    {"customer_id": "C001", "join_date": "2022-01-15", "monthly_spend": 59.99, "contract_type": "One-year", "age": 34, "last_interaction_channel": "App", "complaints_count": 0, "churned": False}, # Duplicate and typo in contract
    {"customer_id": "C005", "join_date": "2021-11-20", "monthly_spend": 85.50, "contract_type": "Month-to-month", "age": 29, "last_interaction_channel": None, "complaints_count": 5, "churned": True},
    {"customer_id": "C006", "join_date": "2020-05-05", "monthly_spend": 120.00, "contract_type": "Two year", "age": 19, "last_interaction_channel": "Branch", "complaints_count": 1, "churned": False}
]

In [None]:
df = pd.DataFrame(data)

# Display data
print("Data preview:")
display(df)

Data preview:


| | customer_id | join_date | monthly_spend | contract_type | age | last_interaction_channel | complaints_count | churned |
|---|---|---|---|---|---|---|---|---|
| 0 | C001 | 2022-01-15 | 59.99 | One year | 34 | App | 0 | False |
| 1 | C002 | 2023-03-10 | NaN | Month-to-month | 150 | Call | 2 | True |
| 2 | C003 | NaN | -10.0 | Two year | 45 | Email | 0 | False |
| 3 | C001 | 2022-01-15 | 59.99 | One-year | 34 | App | 0 | False |
| 4 | C005 | 2021-11-20 | 85.5 | Month-to-month | 29 | None | 5 | True |
| 5 | C006 | 2020-05-05 | 120.0 | Two year | 19 | Branch | 1 | False |


In [None]:
print("\nStructure information:")
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()


Structure information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               6 non-null      object 
 1   join_date                 5 non-null      object 
 2   monthly_spend             5 non-null      float64
 3   contract_type             6 non-null      object 
 4   age                       6 non-null      int64  
 5   last_interaction_channel  5 non-null      object 
 6   complaints_count          6 non-null      int64  
 7   churned                   6 non-null      bool   
dtypes: bool(1), float64(1), int64(2), object(4)
memory usage: 474.0+ bytes
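
The non-null counts reported above can be turned into explicit missing-value counts. A minimal sketch, assuming the sample is rebuilt here with only the columns that have gaps (values copied from the data cell) so that the snippet runs on its own; the names `missing_counts` and `rows_with_gaps` are illustrative:

```python
import numpy as np
import pandas as pd

# Subset of the sample: only the ID and the columns that contain gaps.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C001", "C005", "C006"],
    "join_date": ["2022-01-15", "2023-03-10", np.nan,
                  "2022-01-15", "2021-11-20", "2020-05-05"],
    "monthly_spend": [59.99, np.nan, -10.00, 59.99, 85.50, 120.00],
    "last_interaction_channel": ["App", "Call", "Email", "App", None, "Branch"],
})

# isna() treats both NaN and None as missing; sum() counts them per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Rows that contain at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Each of the three affected columns is missing exactly one entry, but in a different row, so dropping every incomplete row would discard half of this small sample.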


In [None]:
print("\nDescriptive statistics:")
print(df.describe())


Descriptive statistics:
       monthly_spend         age  complaints_count
count       5.000000    6.000000          6.000000
mean       63.096000   51.833333          1.333333
std        47.703051   48.823833          1.966384
min       -10.000000   19.000000          0.000000
25%        59.990000   30.250000          0.000000
50%        59.990000   34.000000          0.500000
75%        85.500000   42.250000          1.750000
max       120.000000  150.000000          5.000000
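
The `min` and `max` rows above already hint at implausible values. The sketch below flags them, assuming plausibility bounds of 0 to 120 years for `age` and a non-negative `monthly_spend`. The constant names and thresholds are illustrative assumptions, not rules confirmed by the data owner, and replacing bad values with NaN is only one of several possible treatments.

```python
import numpy as np
import pandas as pd

# Subset of the sample: only the ID and the two numeric columns being checked.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C001", "C005", "C006"],
    "monthly_spend": [59.99, np.nan, -10.00, 59.99, 85.50, 120.00],
    "age": [34, 150, 45, 34, 29, 19],
})

# Assumed plausibility bounds (hypothetical; agree on real ones with the business).
MIN_AGE, MAX_AGE = 0, 120
MIN_SPEND = 0.0

# between() is inclusive on both ends; ~ inverts it to mark out-of-range ages.
bad_age = ~df["age"].between(MIN_AGE, MAX_AGE)
# NaN < 0 evaluates to False, so missing spend is not flagged by this check.
bad_spend = df["monthly_spend"] < MIN_SPEND

suspicious = df[bad_age | bad_spend]
print(suspicious)

# One possible treatment: turn impossible values into NaN so that they are
# handled together with the other missing values later on.
cleaned = df.copy()
cleaned["age"] = cleaned["age"].astype("float64").mask(bad_age)
cleaned["monthly_spend"] = cleaned["monthly_spend"].mask(bad_spend)
print(cleaned)
```

Masking rather than deleting keeps the rest of each row available, which matters in a sample this small.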


1.  **Duplicate Identification:** Are there duplicate records in the data? Check the `customer_id` column. Why can duplicates be dangerous for the model?

2.  **Value Analysis (Sanity Check):** Look at the numeric columns (`age`, `monthly_spend`). Do you see values that are impossible in reality (e.g., negative spend, age > 120 years)? What would you do with them?

3.  **Missing Data:** In which columns is data missing? Does a missing value in `monthly_spend` mean the same thing as a missing value in `last_interaction_channel`?

4.  **Category Consistency:** Check the `contract_type` column. Is each contract type always written the same way? (Pay attention to typos and formatting.)
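
The duplicate and category checks from the list above can be sketched as follows. This is a minimal example, assuming the sample is rebuilt with only the two relevant columns so that the snippet runs on its own. The normalization rule (trim, lowercase, hyphens to spaces) is a hypothetical choice for illustration; it also rewrites `Month-to-month` as `month to month`, which is acceptable for a comparison key but not necessarily for display.

```python
import pandas as pd

# Subset of the sample: only the columns needed for these two checks.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C001", "C005", "C006"],
    "contract_type": ["One year", "Month-to-month", "Two year",
                      "One-year", "Month-to-month", "Two year"],
})

# A full-row comparison finds nothing, because the two C001 rows differ
# in how the contract type is spelled.
exact_duplicates = df.duplicated().sum()
print("Exact duplicate rows:", exact_duplicates)

# keep=False marks every row that shares a customer_id, not only the repeats.
dup_rows = df[df.duplicated(subset="customer_id", keep=False)]
print(dup_rows)

# Distinct spellings and their frequencies, before normalization.
print(df["contract_type"].value_counts())

# Hypothetical normalization: trim whitespace, lowercase, hyphens to spaces.
normalized = (
    df["contract_type"]
    .str.strip()
    .str.lower()
    .str.replace("-", " ", regex=False)
)
print(normalized.value_counts())
```

Checking duplicates on the key column rather than on whole rows is a deliberate choice here: in this sample the repeated customer would otherwise go unnoticed.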

In [None]:
print("Data preview:")
display(df)

Data preview:


| | customer_id | join_date | monthly_spend | contract_type | age | last_interaction_channel | complaints_count | churned |
|---|---|---|---|---|---|---|---|---|
| 0 | C001 | 2022-01-15 | 59.99 | One year | 34 | App | 0 | False |
| 1 | C002 | 2023-03-10 | NaN | Month-to-month | 150 | Call | 2 | True |
| 2 | C003 | NaN | -10.0 | Two year | 45 | Email | 0 | False |
| 3 | C001 | 2022-01-15 | 59.99 | One-year | 34 | App | 0 | False |
| 4 | C005 | 2021-11-20 | 85.5 | Month-to-month | 29 | None | 5 | True |
| 5 | C006 | 2020-05-05 | 120.0 | Two year | 19 | Branch | 1 | False |
