#### OFM3 — OFM3 TASK 3: ASSOCIATION RULES AND LIFT ANALYSIS

<ul>
<li>Ryan L. Buchanan</li>
<li>Student ID:  001826691</li>
<li>Masters Data Analytics (12/01/2020)</li>
<li>Program Mentor:  Dan Estes</li>
<li>385-432-9281 (MST)</li>
<li>rbuch49@wgu.edu</li>
</ul>

#### Scenario 1
One of the most critical factors in customer relationship management that directly affects a company’s long-term profitability is understanding its customers. When a company can better understand its customer characteristics, it is better able to target products and marketing campaigns for customers, resulting in better profits for the company in the long term.

You are an analyst for a telecommunications company that wants to better understand the characteristics of its customers. You have been asked to perform a market basket analysis to analyze customer data to identify key associations of your customer purchases, ultimately allowing better business and strategic decision-making.

#### Part I: Research Question

#### <span style="color:green"><b>A1. Proposal of Question</b>:</span>
Which items, offered at a discount alongside our services, might reduce customer churn?  That is, by analyzing a list of transactions, can we better understand which items would most endear us to customers if offered at a discount with our services?
This question will be answered using <b>market basket analysis</b>.

#### <span style="color:green"><b>A2. Defined Goal</b>:</span>
Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.
The goal of this data analysis is to present items for discount purchase to company stakeholders to consider when creating customer enticements and marketing promotions.  We will endeavor to help decision makers better understand which combinations of features (items in concert with telecom services) put their customers at lower risk of churning.

#### Part II: Market Basket Justification

#### <span style="color:green"><b>B1. Explanation of Market Basket</b>:</span>
As pointed out by Li, "\[m\]arket basket analysis is one of the key techniques used . . . to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions" <span style="color:orange">(Li, p. 1)</span>.  

This analysis proposes to identify which combinations of telecom peripherals and ICT tools customers prefer and purchase together most often.  We will try to identify those items purchased most often together and demonstrate the relationships between these different items.

We expect that we will discover an optimal combination of items to offer at discounts in coordination with our services.

Our plan for analysis includes: 
* Prepare the dataset
* Discover missing values
* Run the Apriori method to identify association rules
* Check the rules with highest values for confidence, support and lift
* Recommend a course of action following the results of our analysis

#### <span style="color:green"><b>B2. Transaction Example</b>:</span>
On quick inspection of the given dataset, transactions are easily distinguishable.  The very first transaction includes a large list of twenty items:
* Logitech M510 Wireless mouse	
* HP 63 Ink	
* HP 65 ink	
* nonda USB C to USB Adapter	
* 10ft iPHone Charger Cable	
* HP 902XL ink	
* Creative Pebble 2.0 Speakers	
* Cleaning Gel Universal Dust Cleaner	
* Micro Center 32GB Memory card	
* YUNSONG 3pack 6ft Nylon Lightning Cable	
* TopMate C5 Laptop Cooler pad	
* Apple USB-C Charger cable	
* HyperX Cloud Stinger Headset	
* TONOR USB Gaming Microphone	
* Dust-Off Compressed Gas 2 pack	
* 3A USB Type C Cable 3 pack 6FT	
* HOVAMP iPhone charger	
* SanDisk Ultra 128GB card	
* FEEL2NICE 5 pack 10ft Lighning cable	
* FEIYOLD Blue light Blocking Glasses

These twenty items were purchased by one customer in a single transaction.

#### <span style="color:green"><b>B3. Market Basket Assumption</b>:</span>
Market basket analysis (MBA) assumes that items appearing together in the same transaction are related, and it expresses those relationships as association rules.  These rules, suggests Dr. Susan Sivek, "are just statements that connect an 'antecedent' item to a 'consequent' item. Association rules also do not imply causal relationships, only co-occurrence" <span style="color:orange">(Sivek, p. 1)</span>.

So, for instance, in our research proposal, we would like to identify items that would be purchased before subscribing to a telecom service or, perhaps, items that would be used in coordination with telecom services.

#### <span style="color:green"><b>C1. Transforming the Dataset</b>:</span>

In [None]:
# Standard data science imports
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Change color of Matplotlib font
import matplotlib as mpl

COLOR = 'white'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR

In [None]:
# Increase Jupyter display cell-width
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [None]:
# Ignore Warning Code
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data set into Pandas dataframe
teleco = pd.read_csv('data/teleco_market_basket.csv')

In [None]:
# Examine the features of the dataset
teleco.columns

In [None]:
# Get an idea of dataset size
teleco.shape

In [None]:
# Examine first few records of dataset
teleco.head()

In [None]:
# View DataFrame info
teleco.info()

In [None]:
# Get an overview of descriptive statistics
teleco.describe()

In [None]:
# Get data types of features
teleco.dtypes

In [None]:
# Discover missing data points within dataset
data_nulls = teleco.isnull().sum()
print(data_nulls)

In [None]:
# Check for missing data & visualize missing values in dataset 

# Install appropriate library
!pip install missingno

# Importing the libraries
import missingno as msno

# Visualize missing values as a matrix
# Adapted from (GeeksForGeeks, p. 1)
msno.matrix(teleco);

In [None]:
# Drop records in which every column is missing (completely empty rows)
teleco.dropna(how='all', inplace=True)

# Review changes
teleco.head()

In [None]:
# Replace empty values with 0
teleco.fillna(0, inplace=True)

In [None]:
# Get an idea of dataset size after changes
teleco.shape

In [None]:
# Review changes to DataFrame
teleco.head()

In [None]:
# Confirm no null values
teleco.info()

In [None]:
# Convert dataset into list format for use with Apriori algorithm
# Take the dimensions from the DataFrame rather than hard-coding them
n_rows, n_cols = teleco.shape
teleco_list = []
for i in range(n_rows):
    teleco_list.append([str(teleco.values[i, j]) for j in range(n_cols)])
teleco_cleaned = pd.DataFrame(teleco_list)

In [None]:
# Review DataFrame
teleco_cleaned.head()

In [None]:
# Extract prepared dataset
teleco_cleaned.to_csv('data/teleco_market_basket_prepared.csv')

In [None]:
teleco_list[:1]
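
Because the empty cells were filled with 0 before the conversion to strings, the placeholder `'0'` ends up inside every transaction shorter than twenty items and may be counted as an item by the algorithm. A minimal sketch of an alternative, assuming a small made-up DataFrame (`toy` and its column names `Item01` and `Item02` are hypothetical, not taken from the dataset), keeps only the items actually present in each transaction:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two item columns and one completely empty row
toy = pd.DataFrame({
    'Item01': ['mouse', np.nan, 'ink'],
    'Item02': ['ink', np.nan, np.nan],
})

# Drop rows in which every column is missing
toy = toy.dropna(how='all')

# Build one list per transaction, skipping missing cells entirely
toy_transactions = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy.to_numpy()
]
print(toy_transactions)
```

With this approach no placeholder value needs to be filtered out of the rules afterwards.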

#### <span style="color:green"><b>C2. Code Execution</b>:</span>

In [None]:
# Generate association rules from Apriori algorithm
from apyori import apriori

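# Train Apriori algorithm on the dataset
# Note: apyori documents min_support, min_confidence, min_lift and max_length;
# min_length is not among them, so it may have no effect on the results
rule_list = apriori(teleco_list, min_support=0.003, min_confidence=0.3, min_lift=3, min_length=2)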

In [None]:
# Review generated rules
rule_list = list(rule_list)
print(rule_list[0])

In [None]:
# Print number of rules
print(len(rule_list))
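
The idea behind the Apriori step can be sketched in plain Python. This is only an illustration on made-up transactions, not apyori's implementation: the names `toy_transactions` and `itemset_support` and the threshold of 0.4 are assumptions chosen for the example. The key point is that candidate pairs are built only from items that are already frequent on their own, since an itemset cannot be more frequent than any of its members.

```python
from itertools import combinations

# Hypothetical toy transactions
toy_transactions = [
    ['cable', 'ink', 'mouse'],
    ['cable', 'ink'],
    ['cable', 'webcam'],
    ['ink', 'mouse'],
    ['webcam', 'gel'],
]
min_support = 0.4
n_transactions = len(toy_transactions)


def itemset_support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(set(itemset) <= set(t) for t in toy_transactions)
    return hits / n_transactions


# Level 1: keep single items that meet the support threshold
all_items = sorted({item for t in toy_transactions for item in t})
frequent_items = [i for i in all_items if itemset_support([i]) >= min_support]

# Level 2: candidate pairs are built only from frequent single items
frequent_pairs = [
    pair for pair in combinations(frequent_items, 2)
    if itemset_support(pair) >= min_support
]

print(frequent_items)
print(frequent_pairs)
```

Here `gel` appears in only one of five transactions, so it is dropped at the first level and no pair containing it is ever counted.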

In [None]:
# Transform results into DataFrame structure
results = pd.DataFrame(rule_list)

In [None]:
# View results list
results

In [None]:
# Separate support into an individual Series
support = results.support

In [None]:
# Instantiate four empty lists to contain lhs, rhs, confidence and lift
first_values = []
second_values = []
third_values = []
fourth_values = []

In [None]:
# Create for loop to iterate over list 
for i in range(results.shape[0]):
    single_list = results['ordered_statistics'][i][0]
    first_values.append(list(single_list[0]))
    second_values.append(list(single_list[1]))
    third_values.append(single_list[2])
    fourth_values.append(single_list[3])

In [None]:
# Convert lists into DataFrame
lhs = pd.DataFrame(first_values)
rhs = pd.DataFrame(second_values)
confidence = pd.DataFrame(third_values, columns=['confidence'])
lift = pd.DataFrame(fourth_values, columns=['lift'])

In [None]:
# Concatenate the pieces into a single DataFrame
results_final = pd.concat([lhs, rhs, support, confidence, lift], axis=1)
results_final.fillna(value=' ', inplace=True)

In [None]:
# View final results
results_final
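
The extraction above reads only the first entry of each record's `ordered_statistics` and relies on positional indexes. As a hedged alternative, the sketch below flattens every statistic of every record into one row. The two namedtuples are local stand-ins that mimic the record layout used in the loop above; the field names `items_base` and `items_add` and the toy values are assumptions for illustration, and nothing here is imported from apyori.

```python
from collections import namedtuple

import pandas as pd

# Local stand-ins for the rule records (assumed layout, for illustration only)
OrderedStatistic = namedtuple(
    'OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])
RelationRecord = namedtuple(
    'RelationRecord', ['items', 'support', 'ordered_statistics'])

toy_rules = [
    RelationRecord(
        items=frozenset({'cable', 'ink'}),
        support=0.0057,
        ordered_statistics=[
            OrderedStatistic(frozenset({'cable'}), frozenset({'ink'}), 0.3007, 3.7908),
        ],
    ),
]

# One row per (record, statistic) pair, with item sets joined into strings
rows = [
    {
        'lhs': ', '.join(sorted(stat.items_base)),
        'rhs': ', '.join(sorted(stat.items_add)),
        'support': record.support,
        'confidence': stat.confidence,
        'lift': stat.lift,
    }
    for record in toy_rules
    for stat in record.ordered_statistics
]
toy_table = pd.DataFrame(rows)
print(toy_table)
```

Joining the item sets into strings keeps the table at five named columns regardless of how many items appear on either side of a rule.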

#### <span style="color:green"><b>C3. Association Rules Table</b>:</span>

In [None]:
# Set column names
results_final.columns = ['lhs', 1, 2, 'rhs', 1, 2, 'support', 'confidence', 'lift']
results_final_1 = results_final[['lhs', 'rhs', 'support', 'confidence', 'lift']]
results_final_1

#### Highest combination of Support, Confidence and Lift
After building the association rules table, we find that the rule pairing "5pack Nylon Braided USB C cables" with "HP 63XL Ink" has the strongest combination of values across our three metrics:

For "5pack Nylon Braided USB C cables" &#10230; "HP 63XL Ink"
* Support = 0.0057
* Confidence = 0.3007
* Lift = 3.7908


In [None]:
# Print the full list of rules
# (iterate over rule_list directly so the results DataFrame is not overwritten)
for rule in rule_list:
    print('\n')
    print(rule)
    print('**********')

#### <span style="color:green"><b>C4. Top Three Rules</b>:</span>
The top three rules are as follows:

#1. If "5pack Nylon Braided USB C cables" then "HP 63XL Ink" with:
* Support = 0.0057
* Confidence = 0.3007 ≈ 30%
* Lift = 3.7908
<br>The confidence of this rule indicates that, of all transactions containing the "5pack Nylon Braided USB C cables", about 30% also contained the "HP 63XL Ink".
The support of 0.0057 indicates that slightly more than half a percent (0.57%) of all transactions contain both items.
A lift of 3.7908 indicates that a customer who has purchased the "5pack Nylon Braided USB C cables" is about 3.8 times as likely to also purchase the "HP 63XL Ink" as a customer overall.
<br><br>
#2. If "AutoFocus 1080p Webcam" then "SanDisk Ultra 64GB card" with:
* Support = 0.0053
* Confidence = 0.3774 = 38% of customers also purchased consequent
* Lift = 3.8407 = 3.8 times more likely to purchase consequent following purchase of antecedent
<br><br>
#3. If "iPhone 11 case" then "HP 63XL Ink" with:
* Support = 0.0051
* Confidence = 0.3729 = 37% of customers also purchased consequent
* Lift = 4.7008 = 4.7 times more likely to purchase consequent following purchase of antecedent
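
As a quick cross-check of the three rules listed above, the sketch below loads their reported values into a small DataFrame (the name `top_rules` is an assumption) and ranks them by each metric. It also backs out the item-level supports implied by the identities support(X) = support(X, Y) / confidence and support(Y) = confidence / lift. Because the inputs are rounded to four decimals, the derived figures are approximate.

```python
import pandas as pd

# Values copied from the three rules reported above
top_rules = pd.DataFrame({
    'lhs': ['5pack Nylon Braided USB C cables', 'AutoFocus 1080p Webcam', 'iPhone 11 case'],
    'rhs': ['HP 63XL Ink', 'SanDisk Ultra 64GB card', 'HP 63XL Ink'],
    'support': [0.0057, 0.0053, 0.0051],
    'confidence': [0.3007, 0.3774, 0.3729],
    'lift': [3.7908, 3.8407, 4.7008],
})

# Which rule ranks first depends on the metric used for sorting
for metric in ['support', 'confidence', 'lift']:
    best = top_rules.sort_values(metric, ascending=False).iloc[0]
    print(f"Highest {metric}: {best['lhs']} -> {best['rhs']} ({best[metric]})")

# Item-level supports implied by the reported metrics (approximate)
top_rules['antecedent_support'] = top_rules['support'] / top_rules['confidence']
top_rules['consequent_support'] = top_rules['confidence'] / top_rules['lift']
print(top_rules[['lhs', 'antecedent_support', 'rhs', 'consequent_support']].round(4))
```

Rules #1 and #3 share the consequent "HP 63XL Ink", so their implied consequent supports should agree closely, which offers a simple sanity check on the reported numbers.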

#### Part IV: Analysis

#### <span style="color:green"><b>D1. Significance of Support, Lift, and Confidence Summary</b></span>
Our top three rules are compared using the following metrics:
* $ Support(X, Y) = \frac {frequency (X, Y)}{N} $
&ensp; &#10233; &ensp; Giving us the proportion of all transactions that contain this particular itemset.
* $ Confidence(X \rightarrow Y) = \frac {frequency (X, Y)}{frequency(X)} $
&ensp; &#10233; &ensp; Giving us a probability of the consequent given the antecedent.
* $ Lift(X \rightarrow Y) = \frac {Support(X, Y)}{Support(X) \times Support(Y)} $
&ensp; &#10233; &ensp; Giving us the coefficient of likelihood given the antecedent; that is, how many times more likely the consequent is to be purchased once the antecedent has been purchased than it is to be purchased overall.
<br><br>
The results of this analysis are not particularly compelling.  None of the rules has a confidence greater than 40%, and certainly none exceeds the 80% that would be an optimal value for significance.
<br>
Our highest <b>confidence</b> is in rule #2 at 38%, while rule #1 (ranked first when considered in combination with our other metrics of interest) has a confidence of only 30%.
<br>
The <b>support</b> for the itemset behind each of the top three rules is only slightly more than half a percent of all transactions, which, again, is not compelling.
<br>
Finally, the <b>lift</b> ratio gives us some hope that once a customer has purchased the antecedent item they will also purchase the consequent item.  Our highest lift metric at "4.7 times more likely" is demonstrated by the relationship between purchasing an "iPhone 11 case" and then purchasing some "HP 63XL Ink".  
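
To make the three metrics concrete, here is a minimal sketch that computes them for a single rule on made-up data. The transactions, the function name `rule_metrics`, and the item names are all assumptions for illustration; they are not drawn from the teleco dataset.

```python
# Hypothetical toy transactions, each stored as a set of items
toy_transactions = [
    {'cable', 'ink'},
    {'cable', 'ink', 'mouse'},
    {'cable'},
    {'ink'},
    {'mouse'},
    {'webcam'},
    {'webcam', 'mouse'},
    {'mouse'},
]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    freq_x = sum(antecedent <= t for t in transactions)
    freq_y = sum(consequent <= t for t in transactions)
    freq_xy = sum((antecedent | consequent) <= t for t in transactions)
    support = freq_xy / n                        # share of baskets with X and Y
    confidence = freq_xy / freq_x                # share of X baskets that also hold Y
    lift = support / ((freq_x / n) * (freq_y / n))
    return support, confidence, lift


toy_support, toy_confidence, toy_lift = rule_metrics({'cable'}, {'ink'}, toy_transactions)
print(f"support={toy_support:.4f}, confidence={toy_confidence:.4f}, lift={toy_lift:.4f}")
```

A lift above 1 means the two items appear together more often than they would if they were purchased independently; a lift near 1 means the antecedent tells us little about the consequent.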

#### <span style="color:green"><b>D2. Practical Significance of Findings</b></span>
We do not find much practical significance in these results, as we cannot be confident that the consequent of any rule will be purchased even half of the time.  We would have a better chance of predicting the outcome of a coin flip.  We can see that if one of the antecedents is selected for purchase, say a webcam, it is about 4 times more likely that the customer will also purchase the consequent, say a memory card.  
<br>So, for example, customers who pick up a 5 pack of USB cables are nearly 4 times as likely to pick up some HP ink for the printer, yet the two items appear together in only about half a percent of all transactions.  
<br>These results really do not give us much to go on.
Perhaps we need to gather more data before suggesting any significance.  And, of course, further analysis is recommended.

#### <span style="color:green"><b>D3. Course of Action</b></span>
Therefore, based on the previous analysis and commentary on significance, we do not recommend that company decision makers move forward with the original plan of promoting our service by discounting, or even giving away, items to customers who subscribe to our telecom service.  Not only did we find no meaningful significance in our market basket analysis of this transaction dataset, but none of the pairings suggested that customers who use telecom services would like or need some consequent item. 
<br><br>
That is, if we had found a significant relationship, say, many transactions in which customers purchased two related telecommunications peripherals, we might have suggested one of those items for a potential customer discount and marketing promotion.  We did not find that.  We found ink being purchased when, perhaps, we were looking for a relationship in which both a webcam and an ethernet cable were purchased at the same time.  
<br>
No action is warranted at this time.  More data needs to be gathered and analyzed before confident action can be recommended by our data science team.

#### <span style="color:green"><b>E. Panopto Recording</b></span>
 <span style="color:red">link</span>

#### <span style="color:green"><b>F. Web Sources</b></span>
* GeeksForGeeks. &ensp; (2019, July 4). &ensp; <i>Python | Visualize missing values (NaN) values using Missingno Library</i>. &ensp; GeeksForGeeks. &ensp; https://www.geeksforgeeks.org/python-visualize-missing-values-nan-values-using-missingno-library/
<br>
* Gupta, A. &ensp; (2021). &ensp; <i>Implementing Apriori algorithm in Python</i>. &ensp; GeeksForGeeks. &ensp; https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/
<br>
* Kumar, V. &ensp; (2020, May 11). &ensp; <i>Hands-On Guide To Market Basket Analysis With Python Codes</i>. &ensp; AnalyticsIndiaMag.com. &ensp; https://analyticsindiamag.com/hands-on-guide-to-market-basket-analysis-with-python-codes/
<br>
* Intellipaat. &ensp; (2021). &ensp; <i>Introduction to Apriori Algorithm in Python</i>. &ensp; Intellipaat. &ensp; https://intellipaat.com/blog/data-science-apriori-algorithm/
<br>
* Umredkar, R. &ensp; (2020, November 30). &ensp; <i>Guide To Association Rule Mining From Scratch</i>. &ensp; AnalyticsIndiaMag.com. &ensp;  https://analyticsindiamag.com/guide-to-association-rule-mining-from-scratch/
<br>
* Yogesh. &ensp; (2018). &ensp; <i>Market Basket Analysis (Apriori) in Python</i>. &ensp; Kaggle. &ensp; https://www.kaggle.com/yugagrawal95/market-basket-analysis-apriori-in-python


#### <span style="color:green"><b>G. Sources</b></span>
* Li, S. &ensp; (2017, September 24). &ensp; <i>A Gentle Introduction on Market Basket Analysis — Association Rules</i>. &ensp; TowardsDataScience. &ensp; https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
<br>
* Sivek, S. &ensp; (2020, November 16). &ensp; <i>Market Basket Analysis 101: Key Concepts</i>. &ensp; TowardsDataScience. &ensp; https://towardsdatascience.com/market-basket-analysis-101-key-concepts-1ddc6876cd00