# Problem Background and Motivation

### Brief Intro on Telecommunication 
The telecommunication sector is one of the major growing sectors in terms of revenue and technology. An **interesting** aspect of this sector is that plain old telephone calls are still the biggest revenue generator. With advances in technology, however, telecom is becoming less about voice and increasingly about video, text, and data, which in turn increases revenue.
 
>* The major contributor to revenue in this sector is the **CUSTOMER**.
>* Some major telecom players are AT&T, T-Mobile, and Verizon.
>* Customer is *GOD*
>* More information: [Research](https://www.investopedia.com/ask/answers/070815/what-telecommunications-sector.asp)


<img src="http://www.allaccesstelecom.com/wp-content/uploads/2019/12/Telecom-Networks-small-848x300.jpg" alt="Telecom Networks" />
  
<img src="https://www.boldbi.com/wp-content/uploads/2021/02/General-Management@2x-1.png" alt="Dashboard for revenue in Telecom Industry" />

### Motivation for taking the case

>* It is important for such big industry players to know whether their customers are loyal or not (customer churn)
>* Retaining customers is less costly than acquiring new ones
>* In simple terms, keeping the churn rate at a minimum means more profit or revenue
>* Project reference for a telco churn problem: [REFERENCE](https://towardsdatascience.com/end-to-end-machine-learning-project-telco-customer-churn-90744a8df97d)

<img src="https://skillsireupload.s3.amazonaws.com/upload/photos/2020/05/5twqQmRLzwjfNafls5WA_21_34ff7747f0fbaa5f140904a2c44ffe5a_image.jpg" />


>**Questions To Think About:**
>- What if the model gives a false positive and the company/stakeholders give compensation to that individual? Who should be held responsible?
>- Who is going to consume our model?  Who are the stakeholders? Who do we need to convince that our work is good/useful?
>- What are the risks? How accurate or precise will our results be?
>- How are we checking the error rate? Is there an error handling mechanism for it?
>- What if our eventual model makes mistakes (it will) - are some mistakes more costly than others?
>- Could we be held liable for those mistakes?

# Import Package Dependencies
  Import the required packages needed for the model.
  
  ### Pandas:
  >* Pandas library is used mainly for data manipulation and analysis
  >* pandas is aliased as pd for ease of usage
  >* https://pandas.pydata.org/
  
  ### SKLEARN:
  >* Widely used machine learning library with various classification, regression, and clustering algorithms
  >* We import only the functions and classes the model needs, using " from **library** import **name** "
  >* https://scikit-learn.org/stable/
  
  ### PICKLE:
  >* This module implements binary protocols for serializing and de-serializing a Python object structure.
  >* Lets us save the trained model to a file so it can be loaded again later, which makes deployment easier
  >* https://docs.python.org/3/library/pickle.html#module-pickle

In [4]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pickle

# Get the data ( Input )

>* The pandas function read_csv is called on the CSV file (the dataset), and the result is stored in a variable called df (DataFrame)
>* By default, read_csv returns a DataFrame

>**Questions to think about**
>- Can we read non-CSV files?  How do we do that? -> We can use other pandas functions such as read_table, read_clipboard, and read_excel for other types of input
>- Does the read_csv() function have other arguments besides the file name?
>- How much data can a pandas DataFrame handle without bogging down my system? -> pandas works with data held in memory, so if the dataset is bigger than the available memory (RAM), system performance will suffer
>- What are the default parameters for the read_csv function?


>**Future Reference:**
>- In the future, I might be working with multiple data sets and might need to concatenate or merge the tables.
>- pandas has built-in functions for this -> https://pandas.pydata.org/docs/user_guide/merging.html#
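
A minimal sketch of the points above, assuming a small in-memory CSV rather than the real dataset (the table contents and the `plans` table are made up for illustration). It shows a few of the optional `read_csv` arguments and how two tables could be combined with `merge` and `pd.concat`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or a URL
csv_text = """customer_id,churn,months_as_customer
1,0,12
2,1,3
3,0,48
"""

# read_csv accepts many optional arguments besides the file name:
# index_col picks the index column, usecols limits the columns read,
# dtype sets column types
customers = pd.read_csv(
    io.StringIO(csv_text),
    index_col="customer_id",
    usecols=["customer_id", "churn", "months_as_customer"],
    dtype={"churn": "int8"},
)

# A second hypothetical table keyed by the same customer_id
plans = pd.DataFrame(
    {"customer_id": [1, 2, 3], "has_internet_service": [1, 0, 1]}
).set_index("customer_id")

# merge joins columns on a key; concat stacks frames on top of each other
merged = customers.merge(plans, left_index=True, right_index=True, how="left")
stacked = pd.concat([customers, customers])

print(merged)
print(stacked.shape)
```
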

In [5]:
#Read Data
url = 'https://github.com/Bhyrav17/Telco_Churn_prediction/blob/main/Data/telco_customer_churn.csv?raw=true'
df = pd.read_csv(url, index_col=0)

# X and Y values ( Data Prep Stage)

>* iloc is a pandas indexer that selects rows and columns from the data set by integer position.
>* X -> Contains all the rows and the columns from position 1 up to the last one (the 5 predictors). The end of the slice, the number of columns in the dataframe (6), is itself excluded
>* y -> Contains all the rows and the column at position 0, the target variable
>* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
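
To see the slicing in isolation, here is a small sketch on a toy frame (the values are made up; only the layout, target first and predictors after, mirrors the notebook's data).

```python
import pandas as pd

# Toy frame laid out like the notebook's data: target first, predictors after
toy = pd.DataFrame(
    {
        "churn": [0, 1, 0],
        "senior_citizen": [0, 1, 0],
        "months_as_customer": [12, 3, 48],
    }
)

# iloc[rows, columns] selects by integer position; ":" means "all rows"
X_toy = toy.iloc[:, 1:]  # every column from position 1 onward (predictors)
y_toy = toy.iloc[:, 0]   # the column at position 0 (target)

# The end of a slice is exclusive, so 1:len(toy.columns) selects the same columns as 1:
same = toy.iloc[:, 1:len(toy.columns)]

print(list(X_toy.columns))
print(y_toy.name)
print(X_toy.equals(same))
```
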

## Note to Myself 
DrG gave us **'clean'** data.  In reality, our data won't be clean and will need to be processed in multiple steps.  Let's create a simple checklist so we don't forget the many things we should look into with future data sets.

- [x] **Data Types & Definitions** - Go through the given data and get to know the variables and their definitions.
       For example, the telecom churn data has 5 predictor variables (senior_citizen -> type: bool,
       months_as_customer -> type: number, etc.), and churn (the target variable) indicates whether the customer has churned or not.
- [x] **Explore our data** - Evaluate (plot) distributions and get to know the central tendencies and spread (mean, standard deviation, and variance)
- [ ] **Outlier Detection** - We can get a first impression from common plots such as scatter plots or bar graphs. There are also several algorithms for finding outliers, such as Elliptic Envelope or Isolation Forest (go through these once and read up on them)
- [ ] **Data cleansing and validation** - Remove duplicate data. Missing data should either be removed or estimated properly.
- [ ] **Data structuring** - Sometimes the data may be in a form that is not suitable for the ML model, so we need to perform transformations to get it into a structured form
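
As a rough sketch of how the checklist could translate into code, the block below runs the checks on made-up data (the `monthly_charge` column and all values are hypothetical, not part of the churn dataset). Filling missing values with the median is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up data with one duplicate row, one missing value and one extreme value
raw = pd.DataFrame(
    {
        "months_as_customer": [12, 3, 48, 48, 24, 500],
        "monthly_charge": [30.0, 55.5, 70.0, 70.0, np.nan, 65.0],
    }
)

# Data types and summary statistics (mean, std, quartiles)
print(raw.dtypes)
print(raw.describe())

# Data cleansing: drop exact duplicates, then count and fill missing values
clean = raw.drop_duplicates()
print(clean.isna().sum())
clean = clean.fillna(clean.median())

# Outlier detection: IsolationForest labels each row 1 (inlier) or -1 (outlier)
detector = IsolationForest(random_state=0)
labels = detector.fit_predict(clean)
print(labels)
```
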


In [6]:
X = df.iloc[:,1:len(df.columns)]
y = df.iloc[:,0]

# Building the model 

What is the nature of our prediction problem?  In this case, we are predicting a categorical target variable, binary to be exact (either YES or NO: is the customer going to churn?). We are doing classification, and there are a lot of alternative modeling frameworks for us to choose from. Before we get to modeling and model selection:

**Final Check:** - Is our data REALLY prepared for modeling?  OK then, let's go!

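>* LogisticRegression is an estimator class from the sklearn package. Despite its name, it is a classification model: it models the log-odds of the target class as a linear function of the predictors
>* By default, the lbfgs algorithm is used as the solver. (The scikit-learn documentation suggests liblinear as a good choice for small datasets, with sag and saga being faster for large ones.) The max_iter parameter sets the maximum number of iterations allowed for the solver to converge.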
>* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

>* The fit method fits the model, where X is the training data and y is the target variable
>* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
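
The notebook fits the model on the full dataset. As a hedged sketch of one common alternative, the block below holds out part of the data so the model can be scored on rows it has not seen. It assumes synthetic data from `make_classification` in place of the churn table, and the 25% split is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the churn table
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)

# Hold out 25% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Same estimator settings as the notebook
clf = LogisticRegression(solver="lbfgs", max_iter=800)
clf.fit(X_train, y_train)

# coef_ holds one weight per predictor (change in log-odds per unit change)
print(clf.coef_.shape)

# score() returns the mean accuracy on the given data
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The fitted weights in `coef_` are also one starting point for the interpretability question below, since each one relates a predictor to the log-odds of the positive class.
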



**Questions to think about(For Me):**

- What are the key performance indicators (KPIs)? Is it enough to show the stakeholders that the model is good?
- Is it possible for us to adjust our models so they perform better (however that is defined)?
- How can we choose which solver to go for in classification? lbfgs, newton-cg, liblinear, sag, or saga?
- How do we determine which is the "best" model?
- What about model interpretability?  Is that important for our current use-case?  In other words, do we care whether or not we can measure and interpret relationships between predictors and our target...or do we ONLY care about the prediction?
- How would I convince others that our model is a good one? With accuracy_score, Z-score, precision, or what?


(Note: I have reused some of the questions from the template by Dr.G because I have the same questions.)


In [7]:
model = LogisticRegression(solver='lbfgs',max_iter=800)
model.fit(X,y)

# Prediction 

>* The predict method predicts the class labels for the samples passed to it, i.e. X is the data matrix for which we want to get the predictions
>* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict
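
Besides hard labels, `LogisticRegression` can also return class probabilities through `predict_proba`. The sketch below, on synthetic data and with a hypothetical stricter threshold of 0.8, shows how the two relate. This matters for the earlier question about costly false positives, because the threshold controls how readily a customer is flagged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the churn table
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
clf = LogisticRegression(max_iter=800).fit(X_demo, y_demo)

# predict returns hard class labels
labels = clf.predict(X_demo[:5])

# predict_proba returns one column per class; column 1 is P(class 1)
proba_positive = clf.predict_proba(X_demo[:5])[:, 1]

# For a binary model, predict corresponds to thresholding P(class 1) at 0.5
default_labels = (proba_positive > 0.5).astype(int)

# A stricter, hypothetical threshold can only flag the same or fewer rows
strict_labels = (proba_positive > 0.8).astype(int)

print(labels)
print(np.round(proba_positive, 3))
print(strict_labels)
```
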


In [8]:
predictions = model.predict(X)

# Accuracy Score

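>* As the name suggests, accuracy_score computes the accuracy of the predicted values against the true values, i.e. the fraction of predictions that match. Here it is computed on the same data the model was trained on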
>* https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score

**Questions to think about:**

- How is the accuracy score calculated?
- Is it a reliable measure of how good the model is?
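
To make the first question concrete, here is a small worked example with made-up labels: accuracy is the number of matching predictions divided by the total. The confusion matrix, precision, and recall are shown next to it because accuracy alone does not say which kind of mistake the model makes.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (1 = churn, 0 = no churn)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Accuracy by hand: 7 of the 10 predictions match
manual_accuracy = (y_true == y_pred).mean()
print(manual_accuracy)                 # 0.7
print(accuracy_score(y_true, y_pred))  # 0.7

# Rows are the true class, columns the predicted class:
# 5 true negatives, 1 false positive, 2 false negatives, 2 true positives
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision = TP / (TP + FP) = 2/3; recall = TP / (TP + FN) = 0.5
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(precision, recall)
```
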

In [9]:
print(accuracy_score(y,predictions))

0.8378531875621185


# Pickling

>* Pickling saves a Python object to a file so that we can retrieve it later in its original form
>* We create a file object called pickle_out by opening a new file called classifier in binary write mode (wb)
>* https://sites.pitt.edu/~naraehan/python3/pickling.html

>* The dump function writes the object out to the designated pickle file (pickle_out)
>* The close method closes the opened file
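
A minimal round-trip sketch, assuming a tiny made-up training set and a temporary file named `classifier_demo` (both hypothetical). It uses `with` blocks, which close the file automatically and are an alternative to calling close by hand.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only to have a fitted object to save
X_small = [[0, 1], [1, 5], [0, 40], [1, 60]]
y_small = [1, 1, 0, 0]
clf = LogisticRegression(max_iter=800).fit(X_small, y_small)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier_demo")

    # "wb" = write binary; the with-block closes the file automatically
    with open(path, "wb") as pickle_out:
        pickle.dump(clf, pickle_out)

    # "rb" = read binary; load returns an equivalent Python object
    with open(path, "rb") as pickle_in:
        restored = pickle.load(pickle_in)

# The restored model should make the same predictions as the original
same_predictions = list(restored.predict(X_small)) == list(clf.predict(X_small))
print(same_predictions)
```

As the pickle documentation warns, only unpickle files that come from a source you trust.
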


# Deployment time for the '🖖ChurnPredictor SP17000X'

>**Questions of Interest(For ME):**
>- Is web app deployment the best and easiest way to deploy an ML model? What are the other ways we can deploy an ML model?
>- Where can we host our deployed web app?  Does it cost anything to deploy it on AWS or another cloud service?
>- Can we use authentication to make sure our app is secure and no data breach occurs?
>- Streamlit commands and docs: https://docs.streamlit.io/
>- Streamlit templates: see https://streamlit.io/gallery


In [10]:
pickle_out = open('classifier', mode='wb')
pickle.dump(model, pickle_out)
pickle_out.close()

### Extra Notes

>* **The code below (app.py) contains the unpickling, the prediction, and the inputs to our churn predictor model**

In [11]:
%%writefile app.py

import pickle
import streamlit as st

# Unpickling: load the trained classifier saved by the notebook
with open('classifier', 'rb') as pickle_in:
    classifier = pickle.load(pickle_in)

# The set_page_config method lets us define the webpage title and icon
st.set_page_config(page_title='ChurnPredictor', page_icon="🖖")


# Define the function which will make the prediction using data
# inputs from users.
# Note: st.cache is deprecated in newer Streamlit releases in favor of st.cache_data.
@st.cache()
def prediction(senior_citizen, has_dependents,
               months_as_customer, has_internet_service, has_month_to_month_contract):

    # predict expects a 2D array: one row per customer, one column per predictor
    label = classifier.predict(
        [[senior_citizen, has_dependents, months_as_customer,
          has_internet_service, has_month_to_month_contract]])[0]

    if label == 0:
        pred = 'Everything Looks good. The Customer is Loyal!'
    else:
        pred = 'Oh No! The customer is gonna CHURN! **Better do something about it**'
    return pred

# This is the main function in which we define our webpage
def main():
    
    st.title("The Churn Predictor Model")   # Title of the model displayed in the webpage
    
    #Give a little bit information of the Model
    st.info('The Model takes in the below predictor variables for a telecom company and predicts if a customer is going to churn or not!', icon="ℹ️")
    
    # Create input fields
    senior_citizen = st.number_input("Are you a senior citizen? ('1' for Yes and '0' for NO)",
                                  min_value=0,
                                  max_value=1,
                                  value=0,
                                  step=1,
                                 )
    has_dependents = st.number_input("Do you have any dependents? ('1' for Yes and '0' for NO) ",
                              min_value=0,
                              max_value=1,
                              value=0,
                              step=1
                             )

    months_as_customer = st.number_input("Enter the months the customer has stayed with the company(0-72)",
                              min_value=0,
                              max_value=72,
                              value=10,
                              step=3
                             )
    has_internet_service = st.number_input("Does the customer have internet service? ('1' for Yes and '0' for NO)",
                          min_value=0,
                          max_value=1,
                          value=0,
                          step=1
                         )
    has_month_to_month_contract = st.number_input("Does the customer have month to month contract?('1' for Yes and '0' for NO)",
                          min_value=0,
                          max_value=1,
                          value=0,
                          step=1
                         )

    result = ""
    
    # When 'Predict' is clicked, make the prediction and store it
    if st.button("Predict"):
        result = prediction(senior_citizen, has_dependents,months_as_customer, has_internet_service, has_month_to_month_contract)
        st.success(result)
        # If the customer is predicted to be loyal, celebrate with balloons; otherwise show a churn warning image
        if(result == 'Everything Looks good. The Customer is Loyal!'):
            st.balloons()   
        else:
            st.image('https://www.smartkarrot.com/wp-content/uploads/2020/09/Customer-churn-reduction.png',caption="Customer CHURN ALERT",width=150)
        
if __name__=='__main__':
    main()

Overwriting app.py


In [12]:
!streamlit run app.py

^C
