# Assessment

## Questions

1. How many facilities do we have?
2. How many premium facilities do we have?
3. What’s the facility distribution by size?
4. How many valid phones do we have?
5. What’s the state with the most premium facilities as of today?
6. How many facilities have churned since June 2022?
7. What are the top 3 states with the highest churn percentage?
8. What’s the churn probability for facilities in their 5th month (at national level)?
9. What are the top 3 states with the highest proportion of invalid phone numbers?
10. What’s the average facility lifespan?
11. What are the top 3 states with the biggest facilities?
12. Do we have duplicated phone numbers?
13. What are the top 3 valid duplicated numbers?
14. What are the top 3 states with the most valid duplicated numbers?
15. What is the relation between valid/invalid phones and churn propensity?

In [None]:
import pandas as pd
import numpy as np

In [35]:
df = pd.read_csv("CS_Ops_Assessment_dataset_JuanReyes.csv")

## EDA

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [36]:
# Strip leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()

In [37]:
# Strip leading and trailing whitespace from every string cell
df = df.map(lambda x: x.strip() if isinstance(x, str) else x)

In [38]:
# Replace the standalone word 'na' with NaN; words that merely contain 'na' (e.g. 'nan') are left alone
df = df.replace(r'\bna\b', np.nan, regex=True)
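
If the intent is to null out only cells whose entire value is `na`, an exact-match replace is a simpler alternative to the regex. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; object dtype mirrors string columns read from a CSV
toy = pd.DataFrame({"city": ["na", "nayarit", "sonora"]}, dtype=object)

# Without regex=True, replace() only matches cells that are exactly 'na'
cleaned = toy.replace("na", np.nan)
print(cleaned["city"].isna().tolist())
```

Because the whitespace was stripped in the previous cells, values such as `'na '` would already have been normalized to `'na'` before this step.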

In [None]:
#Visualize the unique values per column of interest
print(f"""
Unique Facility Categories:
{df['facility_category'].unique()}

Unique Facility Sizes:
{df['facility_size'].unique()}

Unique States:
{df['state'].unique()}

Unique Cities:
{df['city'].unique()}

Unique Is_premium:
{df["is_premium"].unique()}
""")


## 1. How many facilities do we have?

In [None]:
# Assuming facility_id is a primary key, the number of facilities is simply the number of unique IDs
total_facilities = df["facility_id"].nunique()

print(total_facilities)
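
The primary-key assumption can be checked before relying on it. A small sketch on a hypothetical `ids` series (in the notebook this would presumably be `df["facility_id"]`):

```python
import pandas as pd

# Hypothetical IDs with one repeated value
ids = pd.Series([101, 102, 102, 103], name="facility_id")

# A primary key must be unique and must not contain missing values
is_primary_key = bool(ids.is_unique and ids.notna().all())

# List the IDs that appear more than once, if any
duplicated_ids = ids[ids.duplicated(keep=False)].unique().tolist()

print(is_primary_key, duplicated_ids)
```

If the check fails, `nunique()` still counts distinct facilities, but the row count and the facility count would no longer agree.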


## 2. How many premium facilities do we have?

In [None]:
# Two equivalent approaches: a value_counts lookup and a query filter
total_premium_facilities = df['is_premium'].value_counts().get(1, 0)
total_premium_facilities_query = df.query('is_premium == 1').shape[0]


In [None]:
print(total_premium_facilities)
print(total_premium_facilities_query)

## 3. What’s the facility distribution by size?

In [None]:
# value_counts() is equivalent to grouping by facility_size and counting rows
facility_size_distribution = df['facility_size'].value_counts()
print(facility_size_distribution)
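
A distribution is often easier to read with shares next to the raw counts. A minimal sketch, assuming hypothetical size labels (`small`, `medium`, `large`) that may differ from the real categories:

```python
import pandas as pd

# Hypothetical sizes standing in for df['facility_size']
sizes = pd.Series(["small", "small", "medium", "large"], name="facility_size")

counts = sizes.value_counts()
# normalize=True returns fractions; convert them to percentages
shares = sizes.value_counts(normalize=True).mul(100).round(1)

distribution = pd.DataFrame({"count": counts, "share_pct": shares})
print(distribution)
```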


## 4. How many valid phones do we have?


In [None]:
# All valid phone numbers in MX must have 10 digits: https://telmex.com/10digitos

# Cast to int first: a float would stringify as e.g. '5512345678.0' and fail the length check
df['phone'] = df['phone'].astype(int)
df['phone'] = df['phone'].astype(str)
valid_phones_df = df.loc[df['phone'].str.len() == 10]
number_of_valid_phones = valid_phones_df.shape[0]


print(f"Number of valid phone numbers with exactly 10 digits: {number_of_valid_phones}")
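
The cast to `int` above raises if the column contains missing values. A hedged sketch of a variant that skips missing phones and checks for exactly ten digits with a regex; `count_valid_phones` is a hypothetical helper and the sample numbers are made up:

```python
import numpy as np
import pandas as pd

def count_valid_phones(phones: pd.Series) -> int:
    """Count entries made of exactly 10 digits, ignoring missing values."""
    # Drop NaN first so the integer cast cannot fail on missing data
    digits = phones.dropna().astype("int64").astype(str)
    return int(digits.str.fullmatch(r"\d{10}").sum())

# Hypothetical sample: two valid numbers, one too short, one missing
sample = pd.Series([5512345678.0, 123456.0, np.nan, 3312345678.0])
valid_count = count_valid_phones(sample)
print(valid_count)
```

This counts rows, not distinct numbers, which matches the approach in the cell above.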


## 5. What’s the state with the most premium facilities as of today?

In [None]:
df['is_premium'] = df['is_premium'].astype(int)  # Ensure the flag is an int
# Count premium facilities per state, sorted in descending order
premium_counts_by_state = df[df['is_premium'] == 1]['state'].value_counts()
print(f"Counts of premium facilities by state:\n{premium_counts_by_state}")

most_premium_state = premium_counts_by_state.idxmax()
print(f"\nThe state with the most premium facilities is: {most_premium_state} with {premium_counts_by_state.max()} premium facilities.")


## 6. How many facilities have churned since June 2022?

In [57]:
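def parse_date_with_multiple_formats(date_str):
    # Despite the name, only the MM/DD/YYYY format is attempted
    if pd.isna(date_str):
        return pd.NaT
    try:
        return pd.to_datetime(date_str, format="%m/%d/%Y")
    except ValueError:
        # Report unparseable values and treat them as missing
        print(f"Failed to parse '{date_str}'")
        return pd.NaT

df['churn_since'] = df['churn_since'].apply(parse_date_with_multiple_formats)

# is_churn is stored as text, hence the comparison with "1".
# Note: the strict > excludes facilities that churned exactly on 2022-06-01.
churn_since_june_df = df[(df['is_churn'] == "1") & (df['churn_since'] > pd.to_datetime('2022-06-01'))]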


In [58]:
churn_since_june_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 194 entries, 11 to 1627
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   facility_id        194 non-null    int64         
 1   facility_category  194 non-null    object        
 2   facility_size      194 non-null    object        
 3   state              194 non-null    object        
 4   city               194 non-null    object        
 5   phone              194 non-null    float64       
 6   is_premium         194 non-null    int64         
 7   premium_since      194 non-null    object        
 8   is_churn           194 non-null    object        
 9   churn_since        194 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(6)
memory usage: 16.7+ KB


In [59]:
churn_since_june_df.count()

facility_id          194
facility_category    194
facility_size        194
state                194
city                 194
phone                194
is_premium           194
premium_since        194
is_churn             194
churn_since          194
dtype: int64
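
The count can also be read off directly instead of from `info()` or `count()`. A minimal sketch on a hypothetical `toy` frame, assuming "since June 2022" is meant to include June 1 itself (an assumption; the filter above uses a strict `>`):

```python
import pandas as pd

# Hypothetical rows mirroring the relevant columns
toy = pd.DataFrame({
    "facility_id": [1, 2, 3, 4],
    "is_churn": ["1", "1", "0", "1"],
    "churn_since": ["06/01/2022", "07/15/2022", None, "05/31/2022"],
})

# errors="coerce" turns unparseable or missing dates into NaT
toy["churn_since"] = pd.to_datetime(
    toy["churn_since"], format="%m/%d/%Y", errors="coerce"
)

cutoff = pd.Timestamp("2022-06-01")
# >= keeps facilities that churned on the cutoff date itself
churned = toy[(toy["is_churn"] == "1") & (toy["churn_since"] >= cutoff)]
churned_count = churned["facility_id"].nunique()
print(churned_count)
```

Counting unique `facility_id` values rather than rows guards against double counting if an ID were ever repeated.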

## 7. What are the top 3 states with the highest churn percentage?