#### OFM3 — OFM3 TASK 3: ASSOCIATION RULES AND LIFT ANALYSIS

<ul>
<li>Ryan L. Buchanan</li>
<li>Student ID:  001826691</li>
<li>Masters Data Analytics (12/01/2020)</li>
<li>Program Mentor:  Dan Estes</li>
<li>385-432-9281 (MST)</li>
<li>rbuch49@wgu.edu</li>
</ul>

#### Scenario 1
One of the most critical factors in customer relationship management that directly affects a company’s long-term profitability is understanding its customers. When a company can better understand its customer characteristics, it is better able to target products and marketing campaigns for customers, resulting in better profits for the company in the long term.

You are an analyst for a telecommunications company that wants to better understand the characteristics of its customers. You have been asked to perform a market basket analysis to analyze customer data to identify key associations of your customer purchases, ultimately allowing better business and strategic decision-making.

#### Part I: Research Question
A.  Describe the purpose of this data mining report by doing the following:

#### 1. Propose one question relevant to a real-world organizational situation that you will answer using market basket analysis.

#### <span style="color:green"><b>A1. Proposal of Question</b>:</span>
Which principal variables of your customers demonstrate that they are at high risk of churn?  And, therefore, which customers will churn?
This question will be answered using <span style="color:red">market basket analysis</span>.

#### 2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

#### <span style="color:green"><b>A2. Defined Goal</b>:</span>
Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.
The goal of this data analysis is to present numerical values to company stakeholders to help them better understand their customers.

#### Part II: Market Basket Justification
B.  Explain the reasons for using market basket analysis by doing the following:

#### <span style="color:green"><b>B1. Explanation of Market Basket</b>:</span>
<span style="color:red">Explain how market basket analyzes the selected dataset. Include expected outcomes.</span>


#### <span style="color:green"><b>B2. Transaction Example</b>:</span>
<span style="color:red">Provide one example of transactions in the dataset.</span>


#### <span style="color:green"><b>B3. Market Basket Assumption</b>:</span>
<span style="color:red">Summarize one assumption of market basket analysis.</span>


#### <span style="color:green"><b>C1. Transforming the Dataset</b>:</span>
<span style="color:red">Transform the dataset to make it suitable for market basket analysis. Include a copy of the cleaned dataset.</span>

In [None]:
# Extract Clean dataset
churn_df.to_csv('data/churn_clean_mba.csv')
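
One possible shape for the transformed data, sketched under assumptions: market basket analysis usually expects one "basket" per customer, listing only the items that customer has. The rows below are invented for illustration (only the column names mirror the service columns in this notebook), and `to_transactions` is a hypothetical helper, not part of any library.

```python
# Hypothetical sample rows; values are invented, column names mirror
# the Yes/No service columns used elsewhere in this notebook.
records = [
    {"Phone": "Yes", "Multiple": "No", "OnlineBackup": "Yes", "StreamingTV": "No"},
    {"Phone": "Yes", "Multiple": "Yes", "OnlineBackup": "No", "StreamingTV": "Yes"},
    {"Phone": "No", "Multiple": "No", "OnlineBackup": "No", "StreamingTV": "No"},
]


def to_transactions(rows):
    """Turn each row into the sorted list of services marked 'Yes'."""
    return [
        sorted(name for name, value in row.items() if value == "Yes")
        for row in rows
    ]


transactions = to_transactions(records)
for basket in transactions:
    print(basket)
```

A customer with no services yields an empty basket, which could reasonably be dropped before mining rules.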

#### <span style="color:green"><b>C2. Code Execution</b>:</span>
<span style="color:red">Execute the code used to generate association rules with the Apriori algorithm. Provide screenshots that demonstrate the error-free functionality of the code.</span>
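
As a rough illustration of what the Apriori step does, here is a minimal pure-Python sketch of level-wise frequent itemset mining. It is not the code used for this report and not a library API: `apriori_itemsets` and `toy_transactions` are hypothetical names, and the baskets are invented. In practice a library implementation would likely be preferable.

```python
from itertools import combinations


def apriori_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Invented baskets for illustration only
toy_transactions = [
    ["Phone", "OnlineBackup"],
    ["Phone", "StreamingTV"],
    ["Phone", "OnlineBackup", "StreamingTV"],
    ["OnlineBackup"],
]
frequent = apriori_itemsets(toy_transactions, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is discarded without scanning the data if any of its subsets is already known to be infrequent.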


#### <span style="color:green"><b>C3. Association Rules Table</b>:</span>
<span style="color:red">Provide values for the support, lift, and confidence of the association rules table.</span>
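
To make the three columns of an association rules table concrete, the sketch below computes them for a single rule on invented baskets. `rule_metrics` is a hypothetical helper written for this illustration, and it assumes the antecedent appears in at least one basket.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence and lift for antecedent -> consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    a, c = set(antecedent), set(consequent)

    support_a = sum(a <= b for b in baskets) / n          # baskets with A
    support_c = sum(c <= b for b in baskets) / n          # baskets with B
    support_ac = sum((a | c) <= b for b in baskets) / n   # baskets with A and B

    confidence = support_ac / support_a   # assumes support_a > 0
    lift = confidence / support_c         # assumes support_c > 0
    return {"support": support_ac, "confidence": confidence, "lift": lift}


# Invented baskets for illustration only
example_baskets = [
    ["Phone", "OnlineBackup"],
    ["Phone", "OnlineBackup"],
    ["Phone"],
    ["StreamingTV"],
]
metrics = rule_metrics(example_baskets, ["Phone"], ["OnlineBackup"])
print({name: round(value, 3) for name, value in metrics.items()})
```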


In [None]:
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Scikit-learn
import sklearn
from sklearn import datasets
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report

In [None]:
# Change color of Matplotlib font
import matplotlib as mpl

COLOR = 'white'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR

In [None]:
# Increase Jupyter display cell-width
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [None]:
# Ignore Warning Code
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data set into Pandas dataframe
churn_df = pd.read_csv('data/churn_clean.csv', index_col=0)

In [None]:
# Examine the features of the dataset
churn_df.columns

In [None]:
# Get an idea of dataset size
churn_df.shape

In [None]:
# Examine first few records of dataset
churn_df.head()

In [None]:
# View DataFrame info
churn_df.info()

In [None]:
# Provide an initial look at extant dataset
churn_df.head()

In [None]:
# Get an overview of descriptive statistics
churn_df.describe()

In [None]:
# Get data types of features
churn_df.dtypes

In [None]:
# Rename last 8 survey columns for better description of variables
churn_df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'Fixes', 
                     'Item3':'Replacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'Respectfulness', 
                     'Item7':'Courteous', 
                     'Item8':'Listening'}, 
          inplace=True)

In [None]:
# Create histograms of continuous & categorical variables
churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 
          'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 
          'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()
# Apply the layout before saving so the saved image matches the displayed one
plt.tight_layout()
plt.savefig('churn_pyplot.jpg')

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['Outage_sec_perweek'], y=churn_df['Churn'], color='blue')
plt.show();

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['Tenure'], y=churn_df['Churn'], color='blue')
plt.show();

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['MonthlyCharge'], y=churn_df['Outage_sec_perweek'], color='blue')
plt.show();

In [None]:
# Provide a scatter matrix of numeric variables for high level overview of potential relationships & distributions
churn_numeric = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 
                          'Email', 'Contacts','Yearly_equip_failure', 'Tenure', 
                          'MonthlyCharge', 'Bandwidth_GB_Year', 'Replacements', 
                          'Reliability', 'Options', 'Respectfulness', 'Courteous', 
                          'Listening']]

pd.plotting.scatter_matrix(churn_numeric, figsize = [15, 15]);

In [None]:
# Create individual scatterplot for viewing relationship of key financial feature against target variable
sns.scatterplot(x = churn_df['MonthlyCharge'], y = churn_df['Churn'], color='red')
plt.show();

In [None]:
# Set plot style to ggplot for aesthetics & R style
plt.style.use('ggplot')

# Countplot more useful than scatter_matrix when features of dataset are binary
plt.figure()
sns.countplot(x='Techie', hue='Churn', data=churn_df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

In [None]:
# Countplot more useful than scatter_matrix when features of dataset are binary
plt.figure()
sns.countplot(x='PaperlessBilling', hue='Churn', data=churn_df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

In [None]:
# Countplot of churn by internet service type; tick labels come from the data's own categories
plt.figure()
sns.countplot(x='InternetService', hue='Churn', data=churn_df, palette='RdBu')
plt.show()

In [None]:
# Create multiple boxplots for continuous & categorical variables
churn_df.boxplot(column=['MonthlyCharge','Bandwidth_GB_Year'])

In [None]:
# Create Seaborn boxplots for continuous & categorical variables
sns.boxplot(x='MonthlyCharge', data=churn_df)
plt.show()

In [None]:
# Create Seaborn boxplots for continuous & categorical variables
sns.boxplot(x='Bandwidth_GB_Year', data=churn_df)
plt.show()

In [None]:
# Create Seaborn boxplots for continuous variables
sns.boxplot(x='Tenure', data=churn_df)
plt.show()

#### Anomalies
It appears that anomalies have been removed from the supplied dataset, churn_clean.csv. &nbsp; The boxplots above show no remaining outliers.
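
A numeric check could back up the visual reading of the boxplots. The sketch below applies the common 1.5 × IQR whisker rule to a list of values; `iqr_outliers` is a hypothetical helper, the sample numbers are invented, and the quartile method (Python's default "exclusive" method) may differ slightly from the one a plotting library uses.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented numbers for illustration only
sample = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(sample)
print(flagged)
```

Running the same function over each numeric column and getting an empty list each time would support the statement above.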

In [None]:
# Discover missing data points within dataset
data_nulls = churn_df.isnull().sum()
print(data_nulls)

In [None]:
# Check for missing data & visualize missing values in dataset 

# Install appropriate library
!pip install missingno

# Importing the libraries
import missingno as msno

# Visualize missing values as a matrix (GeeksForGeeks, p. 1)
msno.matrix(churn_df);

In [None]:
# Encode binary categorical variables with dummies
churn_df['DummyGender'] = [1 if v == 'Male' else 0 for v in churn_df['Gender']]
churn_df['DummyChurn'] = [1 if v == 'Yes' else 0 for v in churn_df['Churn']] ### If the customer left (churned) they get a '1'
churn_df['DummyTechie'] = [1 if v == 'Yes' else 0 for v in churn_df['Techie']]
churn_df['DummyContract'] = [1 if v == 'Two Year' else 0 for v in churn_df['Contract']]
churn_df['DummyPort_modem'] = [1 if v == 'Yes' else 0 for v in churn_df['Port_modem']]
churn_df['DummyTablet'] = [1 if v == 'Yes' else 0 for v in churn_df['Tablet']]
churn_df['DummyInternetService'] = [1 if v == 'Fiber Optic' else 0 for v in churn_df['InternetService']]
churn_df['DummyPhone'] = [1 if v == 'Yes' else 0 for v in churn_df['Phone']]
churn_df['DummyMultiple'] = [1 if v == 'Yes' else 0 for v in churn_df['Multiple']]
churn_df['DummyOnlineSecurity'] = [1 if v == 'Yes' else 0 for v in churn_df['OnlineSecurity']]
churn_df['DummyOnlineBackup'] = [1 if v == 'Yes' else 0 for v in churn_df['OnlineBackup']]
churn_df['DummyDeviceProtection'] = [1 if v == 'Yes' else 0 for v in churn_df['DeviceProtection']]
churn_df['DummyTechSupport'] = [1 if v == 'Yes' else 0 for v in churn_df['TechSupport']]
churn_df['DummyStreamingTV'] = [1 if v == 'Yes' else 0 for v in churn_df['StreamingTV']]
churn_df['DummyStreamingMovies'] = [1 if v == 'Yes' else 0 for v in churn_df['StreamingMovies']]
churn_df['DummyPaperlessBilling'] = [1 if v == 'Yes' else 0 for v in churn_df['PaperlessBilling']]

In [None]:
# Drop original categorical features from dataframe
churn_df = churn_df.drop(columns=['Gender', 'Churn', 'Techie', 'Contract', 'Port_modem', 'Tablet', 
                                  'InternetService', 'Phone', 'Multiple', 'OnlineSecurity', 
                                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 
                                  'StreamingTV', 'StreamingMovies', 'PaperlessBilling'])

In [None]:
churn_df.head()

In [None]:
# Remove less meaningful categorical variables from dataset to provide fully numerical dataframe for further analysis
churn_df = churn_df.drop(columns=['Customer_id', 'Interaction', 'UID', 
                            'City', 'State', 'County', 'Zip', 'Lat', 'Lng', 
                            'Area', 'TimeZone', 'Job', 'Marital', 'PaymentMethod'])
churn_df.head()

In [None]:
# Move DummyChurn to end of dataset to set as target
churn_df = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts',
       'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
        'TimelyResponse', 'Fixes', 'Replacements',
       'Reliability', 'Options', 'Respectfulness', 'Courteous', 'Listening',
       'DummyGender', 'DummyTechie', 'DummyContract',
       'DummyPort_modem', 'DummyTablet', 'DummyInternetService', 'DummyPhone',
       'DummyMultiple', 'DummyOnlineSecurity', 'DummyOnlineBackup',
       'DummyDeviceProtection', 'DummyTechSupport', 'DummyStreamingTV',
       'DummyPaperlessBilling', 'DummyChurn',]]

churn_df.head()

In [None]:
# List features for analysis
features = (list(churn_df.columns[:-1]))
print('Features for analysis include: \n', features)

#### <span style="color:green"><b>C4. Top Three Rules</b>:</span>
<span style="color:red">Identify the top three rules generated by the Apriori algorithm. Include a screenshot of the top rules along with their summaries.</span>
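
One common way to pick the "top" rules, sketched here as an assumption rather than as the method this report must use, is to sort by lift and break ties with confidence. The rule values below are invented placeholders, not results from the churn data, and the item names are deliberately generic.

```python
# Placeholder rules; every number here is invented for illustration
rules = [
    {"antecedent": ("item_a",), "consequent": ("item_b",),
     "support": 0.10, "confidence": 0.40, "lift": 1.10},
    {"antecedent": ("item_c",), "consequent": ("item_d",),
     "support": 0.08, "confidence": 0.55, "lift": 1.60},
    {"antecedent": ("item_a", "item_c"), "consequent": ("item_d",),
     "support": 0.05, "confidence": 0.70, "lift": 2.00},
    {"antecedent": ("item_b",), "consequent": ("item_a",),
     "support": 0.10, "confidence": 0.35, "lift": 1.10},
]

# Highest lift first; confidence breaks ties between equal lifts
top_three = sorted(
    rules, key=lambda r: (r["lift"], r["confidence"]), reverse=True
)[:3]

for r in top_three:
    print(r["antecedent"], "->", r["consequent"], "lift:", r["lift"])
```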


#### Part IV: Analysis
D. Perform the data analysis and report on the results by doing the following:

In [None]:
# Re-read fully numerical prepared dataset
churn_df = pd.read_csv('data/churn_prepared.csv')

# Set predictor features & target variable
X = churn_df.drop('DummyChurn', axis=1).values
y = churn_df['DummyChurn'].values

In [None]:
# Import model, splitting method & metrics from sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

#### <span style="color:green"><b>D1. Significance of Support, Lift, and Confidence Summary</b></span>
<span style="color:red">Summarize the significance of support, lift, and confidence from the results of the analysis.</span>
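
For reference when interpreting the results, the standard definitions for a rule $A \Rightarrow B$ over $N$ transactions $t_1, \dots, t_N$ are given below. The symbols $A$, $B$, $N$ and $t_i$ are notation introduced here, not taken from the assignment text.

```latex
\begin{aligned}
\operatorname{support}(A \Rightarrow B) &= \frac{\left|\{\, i : A \cup B \subseteq t_i \,\}\right|}{N} \\[4pt]
\operatorname{confidence}(A \Rightarrow B) &= \frac{\operatorname{support}(A \Rightarrow B)}{\operatorname{support}(A)} \\[4pt]
\operatorname{lift}(A \Rightarrow B) &= \frac{\operatorname{confidence}(A \Rightarrow B)}{\operatorname{support}(B)}
\end{aligned}
```

A lift above 1 indicates that $A$ and $B$ occur together more often than they would if independent, a lift of 1 indicates no association, and a lift below 1 indicates that they occur together less often than expected.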

In [None]:
# Set seed for reproducibility
SEED = 1

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = SEED)

In [None]:
# Instantiate KNN model 
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit data to KNN model
knn.fit(X_train, y_train)

# Predict outcomes from test set
y_pred = knn.predict(X_test)

#### <span style="color:green"><b>D2. Practical Significance of Findings</b></span>
<span style="color:red">Discuss the practical significance of the findings from the analysis.</span>

#### <span style="color:green"><b>D3. Course of Action</b></span>
<span style="color:red">Recommend a course of action for the real-world organizational situation from part A1 based on your results from part D1.</span>

It is critical that decision-makers & marketers understand that our predictor variables produce a relatively low accuracy score of 0.84 after scaling.   We should analyze the features that customers who leave the company have in common & try to reduce the likelihood of those features occurring for any given customer in the future.   The analysis suggests that as a customer subscribes to more of the services the company provides, an additional port modem or online backup for example, they are less likely to leave.   Clearly, it is in the best interest of customer retention to provide customers with more services & to improve their experience by helping them understand all the services available to them as subscribers, not simply mobile phone service.

In [None]:
# Print initial accuracy score of KNN model
print('Initial accuracy score KNN model: ', accuracy_score(y_test, y_pred))

In [None]:
# Compute classification metrics
print(classification_report(y_test, y_pred))

In [None]:
# Create pipeline object & scale dataframe
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Set steps for pipeline object
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]

# Instantiate pipeline
pipeline = Pipeline(steps)

# Split dataframe
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X, y, test_size = 0.2, random_state = SEED)

# Fit the pipeline (scaling, then KNN) on the training split
knn_scaled = pipeline.fit(X_train_scaled, y_train_scaled)

# Predict from scaled dataframe
y_pred_scaled = pipeline.predict(X_test_scaled)

In [None]:
# Print new accuracy score of scaled KNN model
print('New accuracy score of scaled KNN model: {:0.3f}'.format(accuracy_score(y_test_scaled, y_pred_scaled)))

In [None]:
# Compute classification metrics after scaling
print(classification_report(y_test_scaled, y_pred_scaled))

In [None]:
# Import sklearn confusion_matrix & generate results
from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

In [None]:
# Create a visually more intuitive confusion matrix (Dennis, pg. 1)
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

#### Model Comparison
It appears that scaling improved model performance from an <b>Accuracy</b> of 0.71 to 0.79 & <b>Precision</b> of 0.78 to 0.84. The area under the curve is a decent score at 0.7959.
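
For readers who want to see how accuracy and precision follow from a confusion matrix, here is a small sketch. The counts are invented and are not this model's results. The layout `[[TN, FP], [FN, TP]]` matches the order scikit-learn uses for binary labels 0 and 1, and `accuracy_precision` is a hypothetical helper.

```python
def accuracy_precision(cf):
    """Derive accuracy and precision from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cf
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # assumes at least one positive prediction
    return accuracy, precision


# Invented counts for illustration only
example_cf = [[50, 10], [5, 35]]
acc, prec = accuracy_precision(example_cf)
print("accuracy:", round(acc, 3), "precision:", round(prec, 3))
```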

In [None]:
# Import GridSearchCV for cross validation of model
from sklearn.model_selection import GridSearchCV

# Set up parameters grid
param_grid = {'n_neighbors': np.arange(1, 50)}

# Re-instantiate KNN for cross validation
knn = KNeighborsClassifier()

# Instantiate GridSearch cross validation
knn_cv = GridSearchCV(knn , param_grid, cv=5)

# Fit model to training data
knn_cv.fit(X_train, y_train)

# Print best parameters
print('Best parameters for this KNN model: {}'.format(knn_cv.best_params_))

In [None]:
# Generate model best score
print('Best score for this KNN model: {:.3f}'.format(knn_cv.best_score_))

In [None]:
# Import ROC AUC metrics for explaining the area under the curve
from sklearn.metrics import roc_auc_score

# Fit on the training split only so the test set stays unseen when scoring
knn_cv.fit(X_train, y_train)

# Compute predicted probabilities: y_pred_prob
y_pred_prob = knn_cv.predict_proba(X_test)[:,1]

# Compute and print AUC score
print("The Area under curve (AUC) on validation dataset is: {:.4f}".format(roc_auc_score(y_test, y_pred_prob)))

In [None]:
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(knn_cv, X, y, cv=5, scoring='roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))

#### <span style="color:green"><b>E2. Results and Implications</b></span>
<span style="color:red">Discuss the results and implications of your clustering analysis.</span>

#### <span style="color:green"><b>E. Panopto Recording</b></span>
 <span style="color:red">link</span>

#### <span style="color:green"><b>F. Video</b></span>
<span style="color:red">link</span>

#### <span style="color:green"><b>F. Web Sources</b></span>
* GeeksForGeeks. &ensp; (2019, July 4). &ensp; <i>Python | Visualize missing values (NaN) values using Missingno Library</i>. &ensp; GeeksForGeeks. &ensp; https://www.geeksforgeeks.org/python-visualize-missing-values-nan-values-using-missingno-library/
<br>
* Dennis, T. &ensp; (2019, July 25). &ensp; <i>Confusion Matrix Visualization</i>. &ensp; Medium. &ensp; https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea

#### <span style="color:green"><b>G. Sources</b></span>
* CBTNuggets. &ensp; (2018, September 20). &ensp; <i>Why Data Scientists Love Python</i>. &ensp; CBTNuggets. &ensp; https://www.cbtnuggets.com/blog/technology/data/why-data-scientists-love-python
<br> 
* Grant, P. &ensp; (2019, July 21). &ensp; <i>Introducing k-Nearest Neighbors</i>. &ensp; Towards Data Science. &ensp; https://towardsdatascience.com/introducing-k-nearest-neighbors-7bcd10f938c5
<br> 
* Massaron, L. & Boschetti, A. &ensp; (2016). &ensp; <i>Regression Analysis with Python</i>. &ensp; Packt Publishing.

In [None]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('D209 Data Mining 1 - NVM2 - Classification Analysis.ipynb')