# 1.0 Business Understanding

## 1.1 Background

What is customer churn? It is the rate at which customers leave a business, a metric tracked especially closely by SaaS and other subscription companies.<br>
SyriaTel is a telecommunications company facing the challenge of customer churn, which refers to customers discontinuing their services. Churn can have significant financial implications for SyriaTel, including the loss of recurring revenue, increased customer acquisition costs, and potential negative impact on the company's reputation. To address this issue, SyriaTel aims to build a churn prediction system that can identify customers likely to churn in the near future.

According to a Forbes Advisor article by Monique Danao (published March 2, 2023), customer churn rate "refers to the rate at which subscribers or customers stop transacting with your business." https://www.forbes.com/advisor/business/churn-rate/

In other words, customers stop using the company's services and products, which leads to revenue loss and, in the long run, reduces the company's profitability.


## Project Questions
* Are there any predictable patterns that indicate whether a customer will soon stop doing business with SyriaTel?

## Objectives
* To determine whether customers who have a high number of outbound calls or data usage are likely to churn
* To identify if the number of customer service calls has a relationship to churn
* To assess whether the number of outbound calls or data usage affects churn
* To determine if the time of day influences churn

## Stakeholders
SyriaTel Telecommunications Company, with a focus on three departments:
 * Top Management
 * Marketing Team
 * Customer Retention Team

## Hypothesis
H0: The number of customer service calls has no relationship to churn<br>
H1: The number of customer service calls is related to churn

Customers who have a shorter tenure with SyriaTel are more likely to churn <br>
Customers who have a high number of outbound calls or data usage are less likely to churn
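One possible way to test H0 against H1 is a chi-square test of independence on a contingency table of customer service calls versus churn. The sketch below uses a small made-up table rather than the SyriaTel data; the `many_calls` bucket, its threshold of 4 calls, and the 0.05 significance level are illustrative assumptions, not choices taken from this project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only; these are NOT SyriaTel records.
toy = pd.DataFrame({
    "customer service calls": [0, 1, 1, 2, 4, 5, 0, 1, 6, 4, 2, 5],
    "churn": [False, False, False, False, True, True,
              False, False, True, True, False, True],
})

# Bucket the call counts (assumed threshold) to get a compact 2x2 table.
toy["many_calls"] = toy["customer service calls"] >= 4

# Contingency table: rows = few/many calls, columns = churned or not.
table = pd.crosstab(toy["many_calls"], toy["churn"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")

alpha = 0.05  # assumed significance level
reject_h0 = bool(p_value < alpha)
print("Reject H0:", reject_h0)
```

With only 12 toy rows the expected cell counts are too small for the chi-square approximation to be trustworthy; on the full dataset the same call would run on `pd.crosstab(df['customer service calls'], df['churn'])` or a bucketed version of it.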

# 2.0 Data Understanding

In [None]:
# from pyforest import *

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./bigml_59c28831336c6604c800002a.csv', index_col=0)
display(df.head())

state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [None]:
def analyze_dataset(df):
    # Dataset shape
    print("Shape of the dataset:", df.shape, '\n')
    print("========================================", '\n')

    # Missing values
    print("Null Values count:", df.isnull().sum(), '\n')
    print("========================================", '\n')

    # Duplicate values
    print("Number of duplicates:", df.duplicated().sum(), '\n')
    print("========================================", '\n')

    # Target value count
    print("Count of each value of target:")
    print(df['churn'].value_counts(normalize=True), '\n')
    print("========================================", '\n')

    # Unique values
    print("The unique values per column are:")
    print(df.nunique(), '\n')
    print("========================================", '\n')

    # Dataset information
    print("Information about the dataset:")
    df.info()  # info() prints its report directly and returns None

    # distribution
    display(df.describe())


# Usage example
analyze_dataset(df)


Shape of the dataset: (3333, 20) 


Null Values count: account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64 


Number of duplicates: 0 


Count of each value of target:
False    0.855086
True     0.144914
Name: churn, dtype: float64 


The unique values per column are:
account length             212
area code                    3
phone number              3333
international plan           2
voice mail plan              2
number vmail messages       46
total day minutes       

statistic,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0


## Observations
* The dataset contains 3,333 rows and 20 columns, with `state` used as the index

Variable names and descriptions:<br>
1. State: the US state in which the customer resides, indicated by a two-letter abbreviation
2. Account Length: the number of days that this account has been active
3. Area Code: the three-digit area code of the corresponding customer’s phone number
4. Phone: the remaining seven-digit phone number
5. Int’l Plan: whether the customer has an international calling plan: yes/no
6. VMail Plan: whether the customer has a voice mail feature: yes/no
7. VMail Message: presumably the average number

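The dataset has no null values and no duplicate rows. The target is imbalanced: about 14.5% of customers churned and 85.5% did not. The `international plan` and `voice mail plan` columns are binary categoricals, `area code` takes only 3 distinct values, and `phone number` is unique for every row.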
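The variable descriptions above list yes/no plan columns, and the unique-value counts show that `phone number` is distinct for every row while `area code` has only 3 values. A minimal sketch of how such columns might be prepared for modelling follows. It runs on made-up rows, and `prepare_features` is a hypothetical helper, not part of this notebook; treating `phone number` as an identifier to drop and `area code` as a label to one-hot encode are assumptions.

```python
import pandas as pd

# Toy rows that mimic this dataset's column names (values are made up).
toy = pd.DataFrame({
    "phone number": ["000-0001", "000-0002", "000-0003"],
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
    "area code": [415, 408, 510],
    "churn": [False, True, False],
})


def prepare_features(frame):
    """Return a copy with the identifier dropped and categoricals encoded."""
    # Assumption: a per-row unique identifier has no predictive value.
    out = frame.drop(columns=["phone number"])
    # Map the yes/no plan flags to 1/0.
    for col in ["international plan", "voice mail plan"]:
        out[col] = out[col].map({"yes": 1, "no": 0})
    # Area code is a label, not a quantity, so one-hot encode it.
    out = pd.get_dummies(out, columns=["area code"], prefix="area")
    # Convert the boolean target to 1/0.
    out["churn"] = out["churn"].astype(int)
    return out


prepared = prepare_features(toy)
print(prepared)
```

Because `drop` returns a new frame, the original `toy` frame is left unchanged.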




# 3.0 Data Preparation

## 3.1 Data Cleaning

In [None]:
# # Split features and target before handling the class imbalance
# from sklearn.model_selection import train_test_split
# X = df.drop('churn', axis=1)
# y = df['churn'] # target

# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# print(X_train.shape, y_train.shape)

In [None]:
# from imblearn.over_sampling import SMOTE

# oversample = SMOTE()
# X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

# print(X_train_smote.shape, y_train_smote.shape)

In [None]:
# y_train_smote.value_counts()
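
SMOTE synthesises new minority rows and needs numeric features, so it would fail on the raw string columns in this dataset. As a simpler stand-in that depends only on pandas, the sketch below uses plain random oversampling, which duplicates existing minority rows instead of synthesising new ones. The data is made up and `random_oversample` is a hypothetical helper, not the method used in this notebook.

```python
import pandas as pd

# Toy training set for illustration; values are made up.
train = pd.DataFrame({
    "customer service calls": [0, 1, 2, 1, 0, 3, 5, 4],
    "total day minutes": [180.0, 150.5, 200.1, 170.2, 160.0, 190.3, 260.4, 240.9],
    "churn": [False, False, False, False, False, False, True, True],
})


def random_oversample(frame, target="churn", random_state=42):
    """Duplicate minority-class rows until every class matches the majority size."""
    counts = frame[target].value_counts()
    majority_size = int(counts.max())
    parts = []
    for label, size in counts.items():
        subset = frame[frame[target] == label]
        if size < majority_size:
            # Draw the missing number of rows with replacement.
            extra = subset.sample(
                n=majority_size - int(size), replace=True, random_state=random_state
            )
            subset = pd.concat([subset, extra])
        parts.append(subset)
    # Shuffle so the duplicated rows are not grouped together.
    combined = pd.concat(parts).sample(frac=1, random_state=random_state)
    return combined.reset_index(drop=True)


balanced = random_oversample(train)
print(balanced["churn"].value_counts())
```

Whichever resampling method is used, it should be applied to the training split only, after `train_test_split`, so that the test set keeps the original class distribution.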

In [None]:
# from lazypredict.Supervised import LazyClassifier
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=.5,random_state =123)
# clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
# models,predictions = clf.fit(X_train, X_test, y_train, y_test)
# models