# Doctoralia Assessment

# Questions

# Table of Contents

1. [How many facilities do we have?](#1-how-many-facilities-do-we-have)
2. [How many premium facilities do we have?](#2-how-many-premium-facilities-do-we-have)
3. [What’s the facilities distribution by size?](#3-whats-the-facilities-distribution-by-size)
4. [How many valid phones do we have?](#4-how-many-valid-phones-do-we-have)
5. [What’s the state with the most premium facilities as of today?](#5-whats-the-state-with-the-most-premium-facilities-as-of-today)
6. [How many facilities have churned since June 2022?](#6-how-many-facilities-have-churned-since-june-2022)
7. [What’s the top 3 states with the highest churn percentage?](#7-whats-the-top-3-states-with-the-highest-churn-percentage)
8. [What’s the churn probability for facilities in its 5th month (at national level)?](#8-whats-the-churn-probability-for-facilities-in-its-5th-month-at-national-level)
9. [What’s the top 3 states with the most invalid phone number proportion?](#9-whats-the-top-3-states-with-the-most-invalid-phone-number-proportion)
10. [What’s the average facilities lifespan?](#10-whats-the-average-facilities-lifespan)
11. [What’s the top 3 states with the biggest facilities?](#11-whats-the-top-3-states-with-the-biggest-facilities)
12. [Do we have duplicated phone numbers?](#12-do-we-have-duplicated-phone-numbers)
13. [What’s the top 3 valid duplicated numbers?](#13-whats-the-top-3-valid-duplicated-numbers)
14. [What’s the top 3 states with the most valid duplicated numbers?](#14-whats-the-top-3-states-with-the-most-valid-duplicated-numbers)
15. [What is the relation between valid/invalid phones with churn propensity?](#15-what-is-the-relation-between-validinvalid-phones-with-churn-propensity)


# Data Acquisition

In [1]:
import pandas as pd
import numpy as np
import logging
from datetime import datetime
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


In [2]:
log_format = '%(asctime)s %(levelname)s: \n%(message)s\n'

# Configure the logger with the custom format
logging.basicConfig(filename="logs/LoggerDoctoralia.log",
                    level=logging.INFO,
                    format=log_format)

In [3]:
df = pd.read_csv("data/CS_Ops_Assessment_dataset_JuanReyes.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1631 entries, 0 to 1630
Data columns (total 10 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   facility_id                                                                                       1631 non-null   int64  
 1   facility_category                                                                                 1631 non-null   object 
 2   facility_size                                                                                     1631 non-null   object 
 3   state                                                                                             1631 non-null   object 
 4   city                                                                                              1631 non-null   object 
 5  

In [5]:
df.head()

Unnamed: 0,facility_id,facility_category,facility_size,state,city,phone,is_premium,premium_since,is_churn,churn_since
0,80,consultorios de medicina general del sector pr...,0 a 5 personas,chiapas,san cristobal de las casas ...,0.0,1,9/8/2020,0,na
1,160,consultorios del sector privado de audiologia ...,0 a 5 personas,guanajuato,leon ...,1772161000.0,0,na,na,na
2,240,consultorios de medicina general del sector pr...,0 a 5 personas,morelos,tepoztlan ...,0.0,1,5/27/2020,1,9/30/2021
3,320,laboratorios medicos y de diagnostico del sect...,0 a 5 personas,queretaro,queretaro ...,0.0,0,na,na,na
4,400,consultorios dentales del sector privado ...,0 a 5 personas,veracruz de ignacio de la llave,xalapa ...,2261960000.0,1,12/11/2021,0,na


In [6]:
df.tail()

Unnamed: 0,facility_id,facility_category,facility_size,state,city,phone,is_premium,premium_since,is_churn,churn_since
1626,130160,consultorios dentales del sector privado ...,0 a 5 personas,zacatecas,ojocaliente ...,0.0,0,na,na,na
1627,130240,consultorios de medicina general del sector pr...,0 a 5 personas,zacatecas,tepetongo ...,19191110.0,1,5/22/2022,1,6/12/2022
1628,130320,consultorios dentales del sector privado ...,0 a 5 personas,zacatecas,guadalupe ...,1921662000.0,1,11/22/2020,0,na
1629,130400,laboratorios medicos y de diagnostico del sect...,0 a 5 personas,zacatecas,fresnillo ...,19393370.0,1,3/7/2022,0,na
1630,130480,consultorios de medicina general del sector pr...,0 a 5 personas,zacatecas,calera ...,176962.0,0,na,na,na


In [7]:
# Strip leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()
# Strip leading and trailing whitespace from every string cell
df = df.map(lambda x: x.strip() if isinstance(x, str) else x)

In [8]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping in the display
pd.set_option('display.width', None)
pd.set_option('max_colwidth', None)

In [9]:
#Visualize the unique values per column of interest
print(f"""
Unique Facility Categories:
{df['facility_category'].unique()}

Unique Facility Sizes:
{df['facility_size'].unique()}

Unique States:
{df['state'].unique()}

Unique Cities:
{df['city'].unique()}

Unique Is_premium:
{df["is_premium"].unique()}
""")



Unique Facility Categories:
['consultorios de medicina general del sector privado'
 'consultorios del sector privado de audiologia y de terapia ocupacional, fisica y del lenguaje'
 'laboratorios medicos y de diagnostico del sector privado'
 'consultorios dentales del sector privado'
 'consultorios de medicina especializada del sector privado'
 'consultorios de optometria'
 'clinicas de consultorios medicos del sector privado'
 'consultorios de nutriologos y dietistas del sector privado'
 'hospitales generales del sector privado'
 'consultorios de psicologia del sector privado'
 'consultorios de quiropractica del sector privado'
 'otros consultorios del sector privado para el cuidado de la salud'
 'hospitales del sector privado de otras especialidades medicas']

Unique Facility Sizes:
['0 a 5 personas' '6 a 10 personas' '11 a 30 personas' '31 a 50 personas'
 '101 a 250 personas' '251 y mas personas' '51 a 100 personas']

Unique States:
['chiapas' 'guanajuato' 'morelos' 'queretaro'
 've

# Initial Data Analysis (IDA)

In [10]:
# Replace the standalone token 'na' with NaN; the word boundaries avoid matching 'na' inside longer words
df = df.replace(r'\bna\b', np.nan, regex=True)

In [11]:
# Parse date strings into datetime64 values; missing or unparseable entries become NaT
def parse_date(date_str):
    if pd.isna(date_str):
        return pd.NaT
    try:
        return pd.to_datetime(date_str, format="%m/%d/%Y")
    except ValueError:
        logging.error(f"Failed to parse '{date_str}'")
        return pd.NaT

df['churn_since'] = df['churn_since'].apply(parse_date)
df['premium_since'] = df['premium_since'].apply(parse_date)
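
The row-by-row parsing above could also be written as a single vectorized call. The following is a minimal sketch on hypothetical sample values (not rows from the dataset), assuming that silently coercing bad entries to `NaT` is acceptable; unlike `parse_date`, it does not log the failures.

```python
import pandas as pd

# Hypothetical sample values: one valid date, one missing entry, one malformed string.
raw_dates = pd.Series(["9/8/2020", None, "not a date"])

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
```

If the failures need to be inspected, comparing `raw_dates.notna()` with `parsed.isna()` identifies the entries that were present but could not be parsed.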

In [12]:
df['is_churn'] = df['is_churn'].astype('Int64')  # Nullable Int64 keeps the missing values while storing the flags as integers

In [13]:
df['is_premium'] = df['is_premium'].astype(int)  # Ensure it is an int

In [14]:
df['phone'] = df['phone'].astype(int)  # Cast to int first so the string version has no trailing '.0'
df['phone'] = df['phone'].astype(str)  # Strings let us check the length and the leading digits

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1631 entries, 0 to 1630
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   facility_id        1631 non-null   int64         
 1   facility_category  1631 non-null   object        
 2   facility_size      1631 non-null   object        
 3   state              1631 non-null   object        
 4   city               1631 non-null   object        
 5   phone              1631 non-null   object        
 6   is_premium         1631 non-null   int64         
 7   premium_since      799 non-null    datetime64[ns]
 8   is_churn           799 non-null    Int64         
 9   churn_since        380 non-null    datetime64[ns]
dtypes: Int64(1), datetime64[ns](2), int64(2), object(5)
memory usage: 129.1+ KB
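
The two-step cast of `phone` (float to int, then int to str) matters because converting a float directly to a string keeps the decimal part, which would break any length or prefix check. A small sketch with hypothetical values shaped like the raw column:

```python
# Hypothetical sample values mimicking the float-typed `phone` column.
raw_phones = [1772161292.0, 0.0, 176962.0]

# Direct conversion keeps the trailing ".0", so a 10-digit number becomes 12 characters.
as_float_str = [str(p) for p in raw_phones]

# Going through int first yields clean digit strings.
as_int_str = [str(int(p)) for p in raw_phones]

print(as_float_str)
print(as_int_str)
```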


# EDA

In [16]:
df.describe()

Unnamed: 0,facility_id,is_premium,premium_since,is_churn,churn_since
count,1631.0,1631.0,799,799.0,380
mean,65280.0,0.489884,2021-06-13 22:29:53.241551872,0.475594,2022-04-02 08:12:37.894736896
min,80.0,0.0,2020-01-02 00:00:00,0.0,2020-01-12 00:00:00
25%,32680.0,0.0,2020-10-14 00:00:00,0.0,2021-11-29 18:00:00
50%,65280.0,0.0,2021-06-08 00:00:00,0.0,2022-06-05 12:00:00
75%,97880.0,1.0,2022-02-12 12:00:00,1.0,2022-10-01 00:00:00
max,130480.0,1.0,2022-11-30 00:00:00,1.0,2022-12-30 00:00:00
std,37677.876798,0.500051,,0.499717,


# 1. How many facilities do we have?

In [17]:
total_facilities = df["facility_id"].nunique()

print(f"Doctoralia has a total of {total_facilities} facilities")


Doctoralia has a total of 1631 facilities


# 2. How many premium facilities do we have?

In [18]:
# Two readings: every facility that has ever been premium vs. those still active (not churned)
total_premium_facilities_df = df[(df["is_premium"] == 1)]
total_premium_facilities_active_df = df[(df["is_premium"] == 1) & (df["is_churn"] == 0)]


In [19]:
print(f"Historically, Doctoralia has had a total of {total_premium_facilities_df.shape[0]} premium facilities")
print(f"As of today, Doctoralia has a total of {total_premium_facilities_active_df.shape[0]} active premium facilities")


Historically, Doctoralia has had a total of 799 premium facilities
As of today, Doctoralia has a total of 419 active premium facilities


# 3. What’s the facilities distribution by size?

In [20]:
# value_counts() gives the same result as grouping by size and counting rows
facility_size_distribution = df['facility_size'].value_counts()
print(facility_size_distribution)


facility_size
0 a 5 personas        1519
6 a 10 personas         74
11 a 30 personas        25
31 a 50 personas         8
101 a 250 personas       3
251 y mas personas       1
51 a 100 personas        1
Name: count, dtype: int64


# 4. How many valid phones do we have?


In [21]:


# A complete, reliable list of LADA (area) codes was surprisingly hard to find. This one comes from https://www.bajaregroup.com/pdf/mexican_area_codes.pdf

with open('data/TodasLadaMexico.md', 'r', encoding='utf-8') as file:
    md_content = file.read()

lines = md_content.strip().split('\n')

data = []
for line in lines:
    # Split on the last occurrence of '. '
    parts = line.rsplit('. ', 1)
    if len(parts) == 2:
        city_with_state, lada = parts
        data.append({'city_with_state': city_with_state, 'lada': lada})

df_lada = pd.DataFrame(data)

df_lada['lada'] = df_lada['lada'].astype(str)

lada_numbers = set(df_lada['lada'])

def check_lada_vectorized(phone_numbers):
    mask_two_digits = phone_numbers.str[:2].isin(lada_numbers)
    mask_three_digits = phone_numbers.str[:3].isin(lada_numbers)
    return mask_two_digits | mask_three_digits

df.loc[:, 'lada_check'] = check_lada_vectorized(df['phone'])

#All valid numbers in MX need to have 10 digits: https://telmex.com/10digitos

mask_ten_digits = df['phone'].str.len() == 10

df['lada_check'] = np.where(mask_ten_digits, df['lada_check'], pd.NA)

# Log phone numbers that do not match the LADA pattern
invalid_lada_phones = df.loc[df['lada_check'] == False, 'phone']
for phone in invalid_lada_phones:
    logging.error(f"Phone number does not match LADA pattern: {phone}")





In [22]:
valid_phones_df = df.loc[df['lada_check'] == True]
number_of_valid_phones = valid_phones_df.shape[0]

print(f"Number of phone numbers with exactly 10 digits & valid LADA codes: {number_of_valid_phones}")

Number of phone numbers with exactly 10 digits & valid LADA codes: 296
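
The validity rule used above (exactly 10 digits, and a 2- or 3-digit prefix that is a known LADA code) can be sketched for a single number as follows. `LADA_CODES` is only an illustrative subset and `is_valid_mx_phone` is a hypothetical helper, not part of the notebook. Note one difference: the notebook marks numbers that are not 10 digits long as missing (`pd.NA`), while this sketch simply returns `False` for them.

```python
# Illustrative subset of LADA (area) codes; the notebook loads the full list from a file.
LADA_CODES = {"55", "33", "81", "228", "661"}

def is_valid_mx_phone(phone: str, lada_codes: set = LADA_CODES) -> bool:
    """Return True for a 10-digit number whose 2- or 3-digit prefix is a known LADA code."""
    if len(phone) != 10 or not phone.isdigit():
        return False
    return phone[:2] in lada_codes or phone[:3] in lada_codes

samples = ["2281960106", "3317373737", "1772161292", "61616", "0"]
results = {p: is_valid_mx_phone(p) for p in samples}
print(results)
```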


# 5. What’s the state with the most premium facilities as of today?

In [23]:
premium_counts_by_state = df[(df['is_premium'] == 1) & (df['is_churn'] == 0)]['state'].value_counts()
print(f"Counts of premium facilities by state with an active premium status:\n{premium_counts_by_state}")

most_premium_state = premium_counts_by_state.idxmax()
print(f"\nThe state with the most premium facilities is: {most_premium_state} with {premium_counts_by_state.max()} active premium facilities.")

Counts of premium facilities by state with an active premium status:
state
mexico                             40
ciudad de mexico                   33
jalisco                            31
veracruz de ignacio de la llave    29
puebla                             27
michoacan de ocampo                24
guanajuato                         22
nuevo leon                         20
oaxaca                             18
sinaloa                            17
baja california                    13
tamaulipas                         13
chihuahua                          13
guerrero                           11
yucatan                            10
chiapas                            10
coahuila de zaragoza               10
hidalgo                             9
morelos                             9
queretaro                           8
sonora                              7
nayarit                             6
tabasco                             6
zacatecas                           6
colima       

# 6. How many facilities have churned since June 2022?

In [24]:
# Strictly after 2022-06-01: a churn dated exactly June 1 is not counted (use >= to include it)
churn_since_june_df = df[(df['is_churn'] == 1) & (df['churn_since'] > pd.to_datetime('2022-06-01'))]


In [25]:
churn_since_june_df.head()

Unnamed: 0,facility_id,facility_category,facility_size,state,city,phone,is_premium,premium_since,is_churn,churn_since,lada_check
11,960,consultorios dentales del sector privado,0 a 5 personas,queretaro,queretaro,0,1,2022-09-19,1,2022-11-16,
16,1360,clinicas de consultorios medicos del sector privado,0 a 5 personas,aguascalientes,aguascalientes,1191109600,1,2021-04-07,1,2022-11-05,False
26,2160,clinicas de consultorios medicos del sector privado,0 a 5 personas,nuevo leon,monterrey,6163691111,1,2021-08-16,1,2022-08-18,True
32,2640,consultorios dentales del sector privado,0 a 5 personas,sinaloa,el fuerte,0,1,2021-10-17,1,2022-06-18,
34,2800,consultorios de quiropractica del sector privado,0 a 5 personas,tlaxcala,tlaxcala,0,1,2022-10-28,1,2022-12-25,


In [26]:
churn_since_june = churn_since_june_df.shape[0]

print(f"{churn_since_june} facilities have churned since June 2022")

194 facilities have churned since June 2022


# 7. What’s the top 3 states with the highest churn percentage?

In [27]:

churn_percentage_by_state = df.groupby('state')['is_churn'].mean() * 100
sorted_states = churn_percentage_by_state.sort_values(ascending=False)
top_3_states = sorted_states.head(3)

print(sorted_states, "\n\n", top_3_states)

state
campeche                               100.0
quintana roo                       83.333333
chiapas                            61.538462
san luis potosi                    61.538462
mexico                             59.183673
chihuahua                          56.666667
sonora                                 56.25
queretaro                          55.555556
baja california                    53.571429
ciudad de mexico                   53.521127
morelos                            52.631579
zacatecas                               50.0
baja california sur                     50.0
nuevo leon                         47.368421
oaxaca                             45.454545
aguascalientes                     44.444444
tlaxcala                           44.444444
durango                            42.857143
michoacan de ocampo                41.463415
coahuila de zaragoza               41.176471
veracruz de ignacio de la llave    40.816327
hidalgo                                 40.0
pueb

In [28]:
#campeche = df.query('state == "campeche"')
#campeche

# 8. What’s the churn probability for facilities in its 5th month (at national level)?


In [29]:
df.head()

Unnamed: 0,facility_id,facility_category,facility_size,state,city,phone,is_premium,premium_since,is_churn,churn_since,lada_check
0,80,consultorios de medicina general del sector privado,0 a 5 personas,chiapas,san cristobal de las casas,0,1,2020-09-08,0.0,NaT,
1,160,"consultorios del sector privado de audiologia y de terapia ocupacional, fisica y del lenguaje",0 a 5 personas,guanajuato,leon,1772161292,0,NaT,,NaT,False
2,240,consultorios de medicina general del sector privado,0 a 5 personas,morelos,tepoztlan,0,1,2020-05-27,1.0,2021-09-30,
3,320,laboratorios medicos y de diagnostico del sector privado,0 a 5 personas,queretaro,queretaro,0,0,NaT,,NaT,
4,400,consultorios dentales del sector privado,0 a 5 personas,veracruz de ignacio de la llave,xalapa,2261960106,1,2021-12-11,0.0,NaT,True


In [30]:
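# Active facilities are measured up to the run date, so their tenure changes each time the notebook is executed
reference_date = datetime.now()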

df['tenure_days'] = np.where(
    pd.notnull(df['churn_since']),
    (df['churn_since'] - df['premium_since']).dt.days,
    (reference_date - df['premium_since']).dt.days
)

# Fill NaN values that result from NaT entries with the placeholder -1
df['tenure_days'] = df['tenure_days'].fillna(-1)

df['tenure_months'] = (df['tenure_days'] / 30.44).astype(int)


# tenure_months is zero-based, so the 5th month is 4. This includes facilities that churned exactly at the 5th-month mark
facilities_in_5th_month = df[(df['tenure_months'] == 4)]
facilities_in_5th_month.describe()

Unnamed: 0,facility_id,is_premium,premium_since,is_churn,churn_since,tenure_days,tenure_months
count,26.0,26.0,26,26.0,26,26.0,26.0
mean,70581.538462,1.0,2021-11-22 04:36:55.384615424,1.0,2022-04-07 21:13:50.769230848,136.692308,4.0
min,1760.0,1.0,2020-06-18 00:00:00,1.0,2020-11-16 00:00:00,122.0,4.0
25%,45940.0,1.0,2021-09-06 12:00:00,1.0,2022-01-14 18:00:00,128.25,4.0
50%,79480.0,1.0,2021-12-07 12:00:00,1.0,2022-04-28 00:00:00,136.5,4.0
75%,96780.0,1.0,2022-06-09 18:00:00,1.0,2022-10-24 00:00:00,145.5,4.0
max,126800.0,1.0,2022-08-11 00:00:00,1.0,2022-12-30 00:00:00,151.0,4.0
std,34355.003559,0.0,,0.0,,9.540521,0.0


In [31]:
churned_in_5th_month = facilities_in_5th_month[facilities_in_5th_month['is_churn'] == 1].shape[0]
total_premium_facilities = df[pd.notnull(df['premium_since'])].shape[0]

churn_probability_5th_month = churned_in_5th_month / total_premium_facilities

print(f"The churn probability for a facility in its 5th month is: {churn_probability_5th_month:.2%}")


The churn probability for a facility in its 5th month is: 3.25%
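
The figure above divides the facilities that churned in their 5th month by all premium facilities. Another reading of the question conditions on the facilities that actually reached the 5th month. The sketch below contrasts the two on hypothetical toy data; `fifth_month_churn_rates` is an illustrative helper, and it ignores censoring (facilities still active whose tenure has not yet passed the 5th month).

```python
def fifth_month_churn_rates(tenure_months, churned):
    """Return (share of all facilities, share of facilities that reached month index 4).

    tenure_months: completed months per facility, zero-based (5th month == 4).
    churned: matching booleans, True if the facility churned.
    """
    total = len(tenure_months)
    churned_in_5th = sum(1 for m, c in zip(tenure_months, churned) if m == 4 and c)
    at_risk = sum(1 for m in tenure_months if m >= 4)  # survived at least into the 5th month
    share_of_all = churned_in_5th / total
    conditional = churned_in_5th / at_risk if at_risk else float("nan")
    return share_of_all, conditional

# Hypothetical toy data, not taken from the dataset.
toy_tenure = [1, 4, 4, 7, 12, 2, 4, 20]
toy_churned = [True, True, False, True, False, True, True, False]

share_of_all, conditional = fifth_month_churn_rates(toy_tenure, toy_churned)
print(f"Share of all facilities: {share_of_all:.2%}")
print(f"Conditional on reaching the 5th month: {conditional:.2%}")
```

The conditional version is never smaller than the unconditional one, because its denominator excludes facilities that left before the 5th month.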


# 9. What’s the top 3 states with the most invalid phone number proportion?

In [32]:
df['invalid_phone'] = 0

df.loc[df['lada_check'] != True, 'invalid_phone'] = 1

state_phone_stats = df.groupby('state')['invalid_phone'].agg(['sum', 'count'])

state_phone_stats["invalid_phone_proportion"] = state_phone_stats["sum"] / state_phone_stats["count"]

top_invalid_phone_states = state_phone_stats.sort_values(
    by=['invalid_phone_proportion', 'sum'], 
    ascending=[False, False]
)

top_invalid_phone_states.head()

Unnamed: 0_level_0,sum,count,invalid_phone_proportion
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
guanajuato,92,92,1.0
san luis potosi,34,34,1.0
zacatecas,23,23,1.0
aguascalientes,22,22,1.0
queretaro,41,42,0.97619


In [33]:
#aguascalientes = df.query("state == 'aguascalientes'")
#aguascalientes

# 10. What’s the average facilities lifespan?


In [34]:
df.describe()

Unnamed: 0,facility_id,is_premium,premium_since,is_churn,churn_since,tenure_days,tenure_months,invalid_phone
count,1631.0,1631.0,799,799.0,380,1631.0,1631.0,1631.0
mean,65280.0,0.489884,2021-06-13 22:29:53.241551872,0.475594,2022-04-02 08:12:37.894736896,321.446965,10.331085,0.818516
min,80.0,0.0,2020-01-02 00:00:00,0.0,2020-01-12 00:00:00,-1.0,0.0,0.0
25%,32680.0,0.0,2020-10-14 00:00:00,0.0,2021-11-29 18:00:00,-1.0,0.0,1.0
50%,65280.0,0.0,2021-06-08 00:00:00,0.0,2022-06-05 12:00:00,-1.0,0.0,1.0
75%,97880.0,1.0,2022-02-12 12:00:00,1.0,2022-10-01 00:00:00,602.5,19.0,1.0
max,130480.0,1.0,2022-11-30 00:00:00,1.0,2022-12-30 00:00:00,1517.0,49.0,1.0
std,37677.876798,0.500051,,0.499717,,459.417947,14.905947,0.385537


In [35]:
avg_lifespan = df["tenure_days"].mean()

print(f"The average facilities lifespan is {avg_lifespan:.0f} days.")

The average facilities lifespan is 321 days.
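
Note that `tenure_days` holds the placeholder `-1` for facilities that were never premium, and `df.describe()` shows that value as the minimum, first quartile and median. Averaging the whole column therefore mixes placeholders into the lifespan. A minimal sketch on hypothetical toy data shows how much that can shift the mean, assuming lifespan is meant to cover premium facilities only:

```python
import pandas as pd

# Hypothetical toy data: three premium facilities and two non-premium placeholders.
toy = pd.DataFrame({
    "is_premium": [1, 1, 0, 0, 1],
    "tenure_days": [100.0, 500.0, -1.0, -1.0, 300.0],
})

# Mean over every row, placeholders included.
mean_all = toy["tenure_days"].mean()

# Mean over rows with a real tenure only.
mean_premium = toy.loc[toy["tenure_days"] >= 0, "tenure_days"].mean()

print(f"Including placeholders: {mean_all:.1f} days")
print(f"Premium facilities only: {mean_premium:.1f} days")
```

Applied to the notebook, the same filter would be `df.loc[df["tenure_days"] >= 0, "tenure_days"].mean()`.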


# 11. What’s the top 3 states with the biggest facilities?


In [36]:
big_facility_categories = ['51 a 100 personas', '101 a 250 personas', '251 y mas personas']
big_facilities_df = df[df['facility_size'].isin(big_facility_categories)]
state_category_crosstab = pd.crosstab(big_facilities_df['state'], big_facilities_df['facility_size'])

state_category_crosstab['Total'] = state_category_crosstab.sum(axis=1)

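# Rank states by their number of big facilities; a stable sort keeps tied states in alphabetical order
top_big_states = state_category_crosstab.sort_values('Total', ascending=False, kind='stable')
print(top_big_states.head(3))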

facility_size     101 a 250 personas  251 y mas personas  51 a 100 personas  Total
state                                                                             
ciudad de mexico                   0                   1                  1      2
jalisco                            1                   0                  0      1
mexico                             1                   0                  0      1


# 12. Do we have duplicated phone numbers?


In [37]:
phones_not_0 = df[df['phone'] != '0']

# keep=False flags every occurrence of a repeated number, including the first one.
duplicated_phones = phones_not_0['phone'].duplicated(keep=False)

number_of_duplicated_phones = duplicated_phones.sum()

df_duplicated_phones = phones_not_0[duplicated_phones]

print(f"There are {number_of_duplicated_phones} rows with a duplicated phone number (every occurrence counted), excluding the placeholder '0'.\n{df_duplicated_phones['phone']}")

There are 18 rows with a duplicated phone number (every occurrence counted), excluding the placeholder '0'.
10      6009116666
12      3317373737
49      6616391600
223     6009116666
409     6009116666
463     6009116666
630     6009116666
711          61616
781     6009116666
823     6616391600
1051    3317373737
1087    9931261396
1189         61616
1198    3317373737
1233        996216
1359        996216
1468    6009116666
1576    9931261396
Name: phone, dtype: object


# 13. What’s the top 3 valid duplicated numbers?


In [38]:
df_valid_duplicated_phones = df_duplicated_phones.loc[df_duplicated_phones['invalid_phone'] == 0]

top_valid_duplicated_phones = df_valid_duplicated_phones["phone"].value_counts()

top_valid_duplicated_phones.head(3)

phone
3317373737    3
6616391600    2
9931261396    2
Name: count, dtype: int64

# 14. What’s the top 3 states with the most valid duplicated numbers?


In [39]:


valid_duplicated_phones_crosstab = pd.crosstab(df_valid_duplicated_phones['state'], df_valid_duplicated_phones['phone'])

valid_duplicated_phones_crosstab['Total'] = valid_duplicated_phones_crosstab.sum(axis=1)

valid_duplicated_phones_crosstab.head(3)

phone,3317373737,6616391600,9931261396,Total
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baja california,0,2,0,2
ciudad de mexico,1,0,0,1
nayarit,1,0,0,1
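
`head(3)` on a crosstab returns the first three rows in index order (alphabetical by state), which is not necessarily the top three by `Total`. A minimal sketch of ranking before slicing, using hypothetical state labels and phone numbers rather than the dataset's values:

```python
import pandas as pd

# Hypothetical toy data: four placeholder states sharing three duplicated numbers.
toy = pd.DataFrame({
    "state": ["state_a", "state_b", "state_b", "state_c", "state_d", "state_d", "state_d"],
    "phone": ["111", "222", "222", "111", "333", "333", "333"],
})

counts = pd.crosstab(toy["state"], toy["phone"])
counts["Total"] = counts.sum(axis=1)

# Sort by Total before taking the top rows; the stable sort keeps ties in index order.
ranked = counts.sort_values("Total", ascending=False, kind="stable")
top_states = ranked.head(3)

print(top_states)
```

Here the unsorted `counts.head(3)` would start with `state_a`, while the ranked version starts with `state_d`, the state with the most duplicated numbers.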


# 15. What is the relation between valid/invalid phones with churn propensity?

## Chi-squared test

In [44]:


# Keep only facilities with a known churn status (the premium facilities).
# Placeholder '0' phones stay in the sample and count as invalid.
df_chi = df.dropna(subset=['is_churn'])

invalid_phone_counts = df_chi['invalid_phone'].value_counts()
is_churn_counts = df_chi['is_churn'].value_counts()


print(f"invalid_phone_counts: {invalid_phone_counts}\n")
print(f"is_churn_counts: {is_churn_counts}\n")

contingency_table = pd.crosstab(df_chi['invalid_phone'], df_chi['is_churn'])

chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared test results:\n")
print(f"Chi-squared: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

alpha = 0.05  
if p < alpha:
    print("There is a significant relationship between phone validity and churn propensity.")
else:
    print("There is no significant relationship between phone validity and churn propensity.")


invalid_phone_counts: invalid_phone
1    654
0    145
Name: count, dtype: int64

is_churn_counts: is_churn
0    419
1    380
Name: count, dtype: Int64

Chi-squared test results:

Chi-squared: 0.07213020851360304
P-value: 0.7882600798587198
Degrees of freedom: 1
Expected frequencies:
[[ 76.0387985  68.9612015]
 [342.9612015 311.0387985]]
There is no significant relationship between phone validity and churn propensity.
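
The expected frequencies printed above follow directly from the marginal counts: under independence, each cell is its row total times its column total, divided by the sample size. A short sketch that reproduces them from the counts printed in this cell (145 valid and 654 invalid phones, 419 active and 380 churned facilities):

```python
# Marginal counts taken from the cell output above.
row_totals = [145, 654]  # invalid_phone == 0, invalid_phone == 1
col_totals = [419, 380]  # is_churn == 0, is_churn == 1
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 4) for value in row])
```

The test statistic compares these expected counts with the observed contingency table; the closer the two are, the smaller the chi-squared value and the larger the p-value.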


## Logistic Regression

In [41]:
df.head()

Unnamed: 0,facility_id,facility_category,facility_size,state,city,phone,is_premium,premium_since,is_churn,churn_since,lada_check,tenure_days,tenure_months,invalid_phone
0,80,consultorios de medicina general del sector privado,0 a 5 personas,chiapas,san cristobal de las casas,0,1,2020-09-08,0.0,NaT,,1267.0,41,1
1,160,"consultorios del sector privado de audiologia y de terapia ocupacional, fisica y del lenguaje",0 a 5 personas,guanajuato,leon,1772161292,0,NaT,,NaT,False,-1.0,0,1
2,240,consultorios de medicina general del sector privado,0 a 5 personas,morelos,tepoztlan,0,1,2020-05-27,1.0,2021-09-30,,491.0,16,1
3,320,laboratorios medicos y de diagnostico del sector privado,0 a 5 personas,queretaro,queretaro,0,0,NaT,,NaT,,-1.0,0,1
4,400,consultorios dentales del sector privado,0 a 5 personas,veracruz de ignacio de la llave,xalapa,2261960106,1,2021-12-11,0.0,NaT,True,808.0,26,0


In [45]:
df_logit = df_chi.copy()

#df_logit = df_logit[df_logit['phone'] != '0']

X = df_logit[['invalid_phone']]  
y = df_logit['is_churn'].dropna()

# Since 'is_churn' has missing values, we need to filter those out
X = X.loc[y.index]

X_counts = X.value_counts()

y_counts = y.value_counts()

print(f"X_counts: {X_counts}\n\ny_counts: {y_counts}")



X_counts: invalid_phone
1                654
0                145
Name: count, dtype: int64

y_counts: is_churn
0    419
1    380
Name: count, dtype: Int64


In [59]:



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

print(f"Coefficient for invalid_phone: {model.coef_[0]}")

              precision    recall  f1-score   support

         0.0       0.51      1.00      0.67       122
         1.0       0.00      0.00      0.00       118

    accuracy                           0.51       240
   macro avg       0.25      0.50      0.34       240
weighted avg       0.26      0.51      0.34       240

Coefficient for invalid_phone: [0.08344299]
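
Two quick checks help read this report. The recall of 1.00 for class 0 and 0.00 for class 1 means the model predicts "no churn" for every test row, so its accuracy should equal the majority-class share of the test set. The coefficient can also be read as an odds ratio by exponentiating it. A minimal sketch using only the numbers printed above:

```python
import math

# Support per class and the fitted coefficient, copied from the output above.
support = {0: 122, 1: 118}
coef_invalid_phone = 0.08344299

# Accuracy of always predicting the most frequent class in the test set.
majority_baseline_accuracy = max(support.values()) / sum(support.values())

# exp(coefficient) is the multiplicative change in churn odds for invalid_phone == 1.
odds_ratio = math.exp(coef_invalid_phone)

print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.2f}")
print(f"Odds ratio for invalid_phone: {odds_ratio:.3f}")
```

The baseline matches the reported accuracy, and the odds ratio is close to 1, which is consistent with the chi-squared result of no significant relationship.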


