<img style="text-align:left;" src="https://github.com/Microsoft/sqlworkshops/blob/master/graphics/solutions-microsoft-logo-small.png?raw=true" alt="Microsoft">
<br>

# Machine Learning and the Team Data Science Process 
## http://aka.ms/tdsp

This Jupyter Notebook walks through the steps of the Team Data Science Process (TDSP) using machine learning in Python. We'll work through each phase of the process to predict customer churn, which is a common use case.

## Phase One: Business Understanding
The Orange Telecom company in France is one of the largest operators of mobile and internet services in Europe and Africa and a global leader in corporate telecommunication services. It has 256 million customers worldwide, with significant coverage in France, Spain, Belgium, Poland, Romania, Slovakia, and Moldova, and a large presence in Africa and the Middle East. Customer churn is a concern for any company. Orange would like to predict the propensity of customers to switch providers (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). For this effort, the company wants to focus on churn first.

In this Jupyter Notebook, you'll create, train, and store a machine learning model using scikit-learn, so that it can be deployed to multiple hosts.

Let's start by bringing in the libraries we'll use for machine learning in Python:

In [11]:
import pickle
import pandas as pd
import numpy as np
import csv
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
print("Libraries Loaded.")

Libraries Loaded.


## Phase Two: Data Acquisition and Understanding

In the Data Acquisition and Understanding phase of the TDSP, you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. Nearly all data needs cleaning, so after the inspection phase, you’ll replace missing values and add or change columns. Other labs cover more extensive data wrangling tasks.
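
The inspect-then-clean steps described above can be sketched on a small, made-up table. The column names echo the churn dataset, but the values, the median and "Unknown" fill choices, and the `has_unpaid_balance` column are assumptions for illustration only; the notebook itself simply fills missing values with 0 in a later phase.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the customer data (values are invented for illustration).
sample = pd.DataFrame({
    "age": [34, 51, np.nan, 28],
    "education": ["Bachelor or equivalent", None, "PhD or equivalent", "High School or below"],
    "unpaidbalance": [19.0, 0.0, 45.5, np.nan],
})

# Inspect: count the missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Clean: fill numeric gaps with a median or zero, and text gaps with a placeholder.
cleaned = sample.copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["unpaidbalance"] = cleaned["unpaidbalance"].fillna(0.0)
cleaned["education"] = cleaned["education"].fillna("Unknown")

# Add a derived column (hypothetical feature, not part of the original dataset).
cleaned["has_unpaid_balance"] = cleaned["unpaidbalance"] > 0
print(cleaned)
```

Which fill strategy is appropriate depends on the column and the business question, so treat the choices here as placeholders rather than recommendations.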

In this section, we’ll use a single file-based dataset, provided by the company, to train our model. We'll then explore that data a bit; exploration is often done with graphical outputs as well:

In [15]:
# Read customer data from a single file
df = pd.read_csv('https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/CATelcoCustomerChurnTrainingSample.csv', header=0)

# Ensure that you have 29 columns and 20,468 rows loaded
print('Data Loaded. There should be 20468 observations of 29 variables:')
print(df.shape, '\n')

# Show the size and shape of data:
print('The size of the data is: %d rows and  %d columns' % df.shape, '\n')

# Show the first and last 10 rows
# print('First ten rows of the data: ')
# print(df.head(10), '\n')
# print('Last ten rows of the data: ')
# print(df.tail(10), '\n')

# Show the dataframe structure (df.info() prints its own output and returns None):
print('Dataframe Structure: ', '\n')
df.info()
print()

# Check for missing values:
# print('Missing values: ', '\n')
# print(df.apply(lambda x: sum(x.isnull()),axis=0), '\n')

# Perform a simple statistical display:
# print('Dataframe Statistics: ', '\n')
# print(df.describe(), '\n')

Data Loaded. There should be 20468 observations of 29 variables:
(20468, 29) 

The size of the data is: 20468 rows and  29 columns 

Dataframe Structure:  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20468 entries, 0 to 20467
Data columns (total 29 columns):
age                                     20468 non-null int64
annualincome                            20468 non-null int64
calldroprate                            20468 non-null float64
callfailurerate                         20468 non-null float64
callingnum                              20468 non-null int64
customerid                              20468 non-null int64
customersuspended                       20468 non-null object
education                               20468 non-null object
gender                                  20468 non-null object
homeowner                               20468 non-null object
maritalstatus                           20468 non-null object
monthlybilledamount                     20468 non-null 

## Phase Three: Modeling

In this phase, we'll create the experiment runs, perform feature engineering, and run experiments with various settings and parameters. After selecting the best performing run, we'll create a trained model and save it for operationalization in the next phase.

In [16]:
# Fill all NA values with 0:
df = df.fillna(0)

# Drop all duplicate observations:
df = df.drop_duplicates()

# We don't need the 'year' or 'month' variables
df = df.drop(columns=['year', 'month'])

# Implement One-Hot Encoding for this model (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) 
columns_to_encode = list(df.select_dtypes(include=['category','object']))
dummies = pd.get_dummies(df[columns_to_encode])

# Drop the original categorical columns:
df = df.drop(columns_to_encode, axis=1)

# Re-join the dummies frame to the original data:
df = df.join(dummies)

# Show the new columns in the joined dataframe:
print(df.columns, '\n')

# Experiment using Naive Bayes:
nb_model = GaussianNB()
random_seed = 42
split_ratio = .3
train, test = train_test_split(df, random_state = random_seed, test_size = split_ratio)

target = train['churn'].values
train = train.drop('churn', axis=1)
train = train.values
nb_model.fit(train, target)

# Compare predictions on the held-out test set against the known values
expected = test['churn'].values
test = test.drop('churn', axis=1).values
predicted = nb_model.predict(test)

# Print out the Naive Bayes Classification Accuracy:
print("Naive Bayes Classification Accuracy", accuracy_score(expected, predicted))

# Experiment using Decision Trees:
dt_model = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt_model.fit(train, target)
predicted = dt_model.predict(test)

# Print out the Decision Tree Accuracy:
print("Decision Tree Classification Accuracy", accuracy_score(expected, predicted))


Index(['age', 'annualincome', 'calldroprate', 'callfailurerate', 'callingnum',
       'customerid', 'monthlybilledamount', 'numberofcomplaints',
       'numberofmonthunpaid', 'numdayscontractequipmentplanexpiring',
       'penaltytoswitch', 'totalminsusedinlastmonth', 'unpaidbalance',
       'percentagecalloutsidenetwork', 'totalcallduration', 'avgcallduration',
       'churn', 'customersuspended_No', 'customersuspended_Yes',
       'education_Bachelor or equivalent', 'education_High School or below',
       'education_Master or equivalent', 'education_PhD or equivalent',
       'gender_Female', 'gender_Male', 'homeowner_No', 'homeowner_Yes',
       'maritalstatus_Married', 'maritalstatus_Single', 'noadditionallines_\N',
       'occupation_Non-technology Related Job', 'occupation_Others',
       'occupation_Technology Related Job', 'state_AK', 'state_AL', 'state_AR',
       'state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DE', 'state_FL',
       'state_GA', 'state_HI', 'state_IA'

Decision Tree Classification Accuracy 0.9091353199804592
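
The cell above compares the two models on a single train/test split. As a hedged sketch of how the same comparison might be made less dependent on one split, the example below scores both candidates with 5-fold cross-validation and prints a majority-class baseline for context. It runs on synthetic data from `make_classification`, which is an assumption standing in for the churn table, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the churn table (an assumption).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# The same two model types used in the notebook.
candidates = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(min_samples_split=20, random_state=99),
}

# Score each candidate with 5-fold cross-validation.
cv_scores = {}
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

# Accuracy of always predicting the most common class, for context.
baseline = max(y.mean(), 1 - y.mean())
print(f"Majority-class baseline: {baseline:.3f}")
```

When one class is rare, as churn often is, an accuracy figure is easier to interpret next to that baseline.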


## Phase Four: Deployment

Once you are satisfied with the model, you can save it using Python's pickle module for deployment to other systems.

In [17]:
# Serialize the best-performing model to disk
print("Serialize the model to a model.pkl file in the root")
with open('./model.pkl', 'wb') as model_file:
    pickle.dump(dt_model, model_file)


Serialize the model to a model.pkl file in the root
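
Before shipping a pickled model to another host, it can be worth confirming that the file loads and predicts the same way as the in-memory model. The sketch below does that round trip with a tiny stand-in classifier and a temporary directory; both are assumptions made so the example runs on its own, and they do not touch the notebook's `model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model (the real notebook would use dt_model here).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=99).fit(X, y)

# Write the model to a temporary location rather than the working directory.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)

# Read it back and check that the predictions are unchanged.
with open(model_path, "rb") as model_file:
    restored = pickle.load(model_file)

round_trip_ok = np.array_equal(model.predict(X), restored.predict(X))
print("Round trip preserved predictions:", round_trip_ok)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.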


## Phase Five: Customer Acceptance

The final phase involves testing the model predictions on real-world queries to ensure that the model meets all requirements. In this phase we also document the project so that all parameters are well-known. Finally, a mechanism is created to re-train the model. In this code, we'll open the model we deployed, send it some new data, and predict whether a customer the model has never seen will leave, allowing a salesperson to concentrate on retaining them:


In [18]:
# Prepare the web service definition before deploying
# Import pickle to read the model file saved in the previous phase
import pickle

# Load the model file
with open('model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

# Import for handling the JSON file
import json
import pandas as pd

# Set up a sample "call" from a client:
input_df = "{\"callfailurerate\": 0, \"education\": \"Bachelor or equivalent\", \"usesinternetservice\": \"No\", \"gender\": \"Male\", \"unpaidbalance\": 19, \"occupation\": \"Technology Related Job\", \"year\": 2015, \"numberofcomplaints\": 0, \"avgcallduration\": 663, \"usesvoiceservice\": \"No\", \"annualincome\": 168147, \"totalminsusedinlastmonth\": 15, \"homeowner\": \"Yes\", \"age\": 12, \"maritalstatus\": \"Single\", \"month\": 1, \"calldroprate\": 0.06, \"percentagecalloutsidenetwork\": 0.82, \"penaltytoswitch\": 371, \"monthlybilledamount\": 71, \"churn\": 0, \"numdayscontractequipmentplanexpiring\": 96, \"totalcallduration\": 5971, \"callingnum\": 4251078442, \"state\": \"WA\", \"customerid\": 1, \"customersuspended\": \"Yes\", \"numberofmonthunpaid\": 7, \"noadditionallines\": \"\\\\N\"}"

# Cleanup: parse the JSON and drop the columns that were not used in training
input_df_encoded = json.loads(input_df)
input_df_encoded = pd.DataFrame([input_df_encoded], columns=input_df_encoded.keys())
input_df_encoded = input_df_encoded.drop(columns=['year', 'month', 'churn'])

# Pre-process scoring data consistent with training data
columns_to_encode = ['customersuspended', 'education', 'gender', 'homeowner', 'maritalstatus', 'noadditionallines', 'occupation', 'state', 'usesinternetservice', 'usesvoiceservice']
dummies = pd.get_dummies(input_df_encoded[columns_to_encode])
input_df_encoded = input_df_encoded.join(dummies)
input_df_encoded = input_df_encoded.drop(columns_to_encode, axis=1)

columns_encoded = ['age', 'annualincome', 'calldroprate', 'callfailurerate', 'callingnum',
       'customerid', 'monthlybilledamount', 'numberofcomplaints',
       'numberofmonthunpaid', 'numdayscontractequipmentplanexpiring',
       'penaltytoswitch', 'totalminsusedinlastmonth', 'unpaidbalance',
       'percentagecalloutsidenetwork', 'totalcallduration', 'avgcallduration',
       'customersuspended_No', 'customersuspended_Yes',
       'education_Bachelor or equivalent', 'education_High School or below',
       'education_Master or equivalent', 'education_PhD or equivalent',
       'gender_Female', 'gender_Male', 'homeowner_No', 'homeowner_Yes',
       'maritalstatus_Married', 'maritalstatus_Single', 'noadditionallines_\\N',
       'occupation_Non-technology Related Job', 'occupation_Others',
       'occupation_Technology Related Job', 'state_AK', 'state_AL', 'state_AR',
       'state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DE', 'state_FL',
       'state_GA', 'state_HI', 'state_IA', 'state_ID', 'state_IL', 'state_IN',
       'state_KS', 'state_KY', 'state_LA', 'state_MA', 'state_MD', 'state_ME',
       'state_MI', 'state_MN', 'state_MO', 'state_MS', 'state_MT', 'state_NC',
       'state_ND', 'state_NE', 'state_NH', 'state_NJ', 'state_NM', 'state_NV',
       'state_NY', 'state_OH', 'state_OK', 'state_OR', 'state_PA', 'state_RI',
       'state_SC', 'state_SD', 'state_TN', 'state_TX', 'state_UT', 'state_VA',
       'state_VT', 'state_WA', 'state_WI', 'state_WV', 'state_WY',
       'usesinternetservice_No', 'usesinternetservice_Yes',
       'usesvoiceservice_No', 'usesvoiceservice_Yes']

# A single record only produces dummy columns for the categories it contains.
# Add the missing columns as 0's and put every column in the training order:
input_df_encoded = input_df_encoded.reindex(columns=columns_encoded, fill_value=0)

# Return final prediction
pred = model.predict(input_df_encoded)

# (In production you would replace the print() statements here with a JSON response)
print('JSON sent to the prediction Model:', '\n')
print(input_df, '\n')
print('For the JSON string sent from the client, The prediction is returned as more JSON (0 = No churn, 1 = Churn):', '\n')
print(json.dumps(str(pred[0])))

JSON sent to the prediction Model: 

{"callfailurerate": 0, "education": "Bachelor or equivalent", "usesinternetservice": "No", "gender": "Male", "unpaidbalance": 19, "occupation": "Technology Related Job", "year": 2015, "numberofcomplaints": 0, "avgcallduration": 663, "usesvoiceservice": "No", "annualincome": 168147, "totalminsusedinlastmonth": 15, "homeowner": "Yes", "age": 12, "maritalstatus": "Single", "month": 1, "calldroprate": 0.06, "percentagecalloutsidenetwork": 0.82, "penaltytoswitch": 371, "monthlybilledamount": 71, "churn": 0, "numdayscontractequipmentplanexpiring": 96, "totalcallduration": 5971, "callingnum": 4251078442, "state": "WA", "customerid": 1, "customersuspended": "Yes", "numberofmonthunpaid": 7, "noadditionallines": "\\N"} 

For the JSON string sent from the client, The prediction is returned as more JSON (0 = No churn, 1 = Churn): 

"0"
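
The scoring steps in the cell above (parse the JSON, one-hot encode, align the columns, predict, return JSON) could be wrapped in a single function for a web service. The sketch below is one possible shape, not the notebook's actual service code: the `score` function, its parameters, the three-column training layout, and the toy model are all hypothetical and exist only to keep the example self-contained.

```python
import json

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def score(payload, model, training_columns, categorical_columns):
    """Return a JSON prediction for one JSON-encoded customer record."""
    record = pd.DataFrame([json.loads(payload)])
    # One-hot encode the categorical fields, as was done for the training data.
    encoded = pd.get_dummies(record, columns=categorical_columns)
    # Align to the training layout: absent dummy columns become 0,
    # unused fields are dropped, and the column order matches training.
    encoded = encoded.reindex(columns=training_columns, fill_value=0)
    prediction = model.predict(encoded.to_numpy(dtype=float))[0]
    return json.dumps({"churn": int(prediction)})


# Hypothetical training layout and toy model, for illustration only.
training_columns = ["age", "gender_Female", "gender_Male"]
train = pd.DataFrame([[25, 1, 0], [60, 0, 1], [30, 1, 0], [55, 0, 1]],
                     columns=training_columns)
toy_model = DecisionTreeClassifier(random_state=99).fit(train.to_numpy(dtype=float), [0, 1, 0, 1])

result = score('{"age": 58, "gender": "Male", "year": 2015}',
               toy_model, training_columns, ["gender"])
print(result)
```

Passing the training column list in explicitly keeps the scoring layout tied to whatever the model was trained on, so a record with unseen or missing categories still produces a frame of the expected width and order.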


