# A5 - Python

This assignment covers topics in association analysis.

Make sure that you keep this notebook named "a5.ipynb".

Any packages or tools other than those listed in the assignments or on Canvas must be cleared
with Dr. Brown before you use them in your submission.

## Q0 - Setup

The following code checks whether your notebook is running on Gradescope (GS), Colab (COLAB), or the Linux Python environment you were asked to set up (LLM).

In [1]:
import re 
import os
import platform 
import sys 

# flag if notebook is running on Gradescope 
if re.search(r'amzn', platform.uname().release): 
    GS = True
else: 
    GS = False

# flag if notebook is running on Colaboratory 
try:
    import google.colab
    COLAB = True
except ImportError:
    # google.colab is only importable inside Colab
    COLAB = False

# flag if running on Linux lab machines. 
cname = platform.uname().node
if re.search(r'(guardian|colossus|c28|lebrown|rovernet)', cname):
    LLM = True 
else: 
    LLM = False

print("System: GS - %s, COLAB - %s, LLM - %s" % (GS, COLAB, LLM))

System: GS - False, COLAB - False, LLM - True


## Notebook Setup

It is good practice to list all imports at the top of the notebook. You can import modules in later cells as needed, but listing them at the top makes clear which packages must be available and installed.

If you are doing development on Colab, the otter-grader package is not available, so you will need to install it with pip (uncomment the cell directly below).

In [2]:
# Only uncomment if you are developing on Colab
# if COLAB == True: 
#     print("Installing otter:")
#     !pip install otter-grader==4.2.0 

In [3]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


from mlxtend.preprocessing import TransactionEncoder 
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import fpgrowth

import warnings
warnings.filterwarnings('ignore')

# Package for Autograder 
import otter 
grader = otter.Notebook()

In [4]:
grader.check("q0")

# Q1 - Association Analysis

For this problem, you will analyze a portion of the Instacart Online Grocery Shopping Dataset from 2017.  The full data set is available if you are interested.  
https://www.instacart.com/datasets/grocery-shopping-2017

The original dataset has 3 million orders.  We will work with a smaller data set.  

You will use the following files for this analysis: 

* `products_orders.csv`  
* `products.csv` 



File structure for `products_orders`: 
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

File structure for `products`: 
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

You can look up the name of a product, `product_name`, from its `product_id` using the `products.csv` file. 

## Q1(a) - Load the Data

Load the 2 data files mentioned above.  

You will need to transform this data into a boolean transaction DataFrame `orders`.  

This boolean transaction DataFrame has one row per order (transaction), with the `order_id` as the row index, and one column per product, with the `product_name` as the column name.  Entry [i, j] is `True` if product j was purchased in order i and `False` otherwise. 

The rows in the DataFrame should be in order of `order_id`.  The columns should be in alphanumeric order of the `product_name`.

Note that you cannot use mlxtend's `TransactionEncoder` for this, because it expects the data as a list of lists.  

*Hint:* several `pandas` functions such as `join`, `merge`, or `pivot` may be useful to construct the `orders` DataFrame.
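As a rough illustration of the hint, the sketch below builds a boolean transaction DataFrame from a tiny made-up pair of tables. The names `toy_lines`, `toy_products`, and `toy_orders` and all values are hypothetical, and `pd.crosstab` is only one of several ways to do the pivot. Note that pandas sorts labels with ordinary string comparison, which may or may not match the ordering the assignment expects in every case.

```python
import pandas as pd

# Toy long-format data (hypothetical values, not the assignment files).
toy_lines = pd.DataFrame({
    "order_id":   [10, 10, 11, 12, 12, 12],
    "product_id": [1, 2, 2, 1, 2, 3],
})
toy_products = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "product_name": ["Milk", "Bread", "Apples"],
})

# Attach a product name to every order line.
merged = toy_lines.merge(toy_products, on="product_id", how="left")

# Count each (order, product) pair, then turn the counts into booleans.
# crosstab returns sorted row and column labels.
toy_orders = pd.crosstab(merged["order_id"], merged["product_name"]) > 0

print(toy_orders)
```

Each row of `toy_orders` is one order and each column is one product, so the result has the same layout that the `orders` DataFrame is asked to have.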

Also calculate the mean and maximum number of products per order for the `orders` dataset: `mean_num_prods` and `max_num_prods`. 

<br>
After creating the `orders` DataFrame, capture aspects of the data: 

* `orders_num_rows` 
* `orders_num_cols` 
* `orders_col_names` 

Then save a slice of the DataFrame, `orders_small`, containing the first 50 rows and the first 100 items (columns). 
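The quantities above can be read off a boolean DataFrame fairly directly. The following is a minimal sketch on a made-up 3 x 3 frame; all `toy_*` names are hypothetical, and whether the autograder expects the column names as a list or as a pandas `Index` is not stated here, so treat the `.tolist()` call as an assumption.

```python
import pandas as pd

# Hypothetical boolean transaction data: 3 orders, 3 products.
toy_orders = pd.DataFrame(
    {
        "Apples": [False, False, True],
        "Bread":  [True, True, True],
        "Milk":   [True, False, True],
    },
    index=pd.Index([10, 11, 12], name="order_id"),
)

# Each True is one product in an order, so a row sum counts products per order.
toy_prods_per_order = toy_orders.sum(axis=1)
toy_mean = toy_prods_per_order.mean()
toy_max = toy_prods_per_order.max()

# Shape and column labels.
toy_num_rows, toy_num_cols = toy_orders.shape
toy_col_names = toy_orders.columns.tolist()

# Positional slice: first 2 rows and first 2 columns.
toy_small = toy_orders.iloc[:2, :2]

print(toy_mean, toy_max, toy_small.shape)
```

`iloc` slices by position rather than by label, which is why it works regardless of what the `order_id` values are.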

In [9]:
# Load and prepare the data 

prods_orders = pd.read_csv('products_orders.csv')
prods = pd.read_csv('products.csv')

orders = ... 


orders_num_rows = ...
orders_num_cols = ...
orders_col_names = ...


mean_num_prods = ...
max_num_prods = ...


orders_small = ...

orders.head()

print("Mean number of products per order: ", mean_num_prods)
print("Max number of products per order:  ", max_num_prods)

# clean up unneeded raw data
del prods_orders, prods

product_name  order_id  #2 Coffee Filters  0% Fat Free Organic Milk  \
0                    1                NaN                       NaN   
1                   36                NaN                       NaN   
2                   38                NaN                       NaN   
3                   96                NaN                       NaN   
4                   98                NaN                       NaN   

product_name  0% Fat Organic Greek Vanilla Yogurt  \
0                                             NaN   
1                                             NaN   
2                                             NaN   
3                                             NaN   
4                                             NaN   

product_name  0% Fat Superfruits Greek Yogurt  0% Greek Strained Yogurt  \
0                                         NaN                       NaN   
1                                         NaN                       NaN   
2                            

In [10]:
grader.check("q1a")

<!-- BEGIN QUESTION -->

## Q1(b) - Explore the Data 

Create a density plot showing the number of products per order using the `orders` data set.



In [None]:
# Plot the number of products per order 




<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Q1(c) - Explore the Data, part 2 

For the `orders` dataset, create a top-15 item frequency plot; that is, plot the 15 most frequently purchased items. This should be a bar plot of items vs. frequency (relative support).
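Relative support of a single item is the fraction of orders that contain it, which for a boolean DataFrame is just the column mean. The sketch below shows the calculation on made-up data; `toy_orders`, `item_support`, and `top_items` are hypothetical names, and the top 2 items stand in for the top 15.

```python
import pandas as pd

# Hypothetical boolean transaction data: 3 orders, 3 products.
toy_orders = pd.DataFrame({
    "Apples": [False, False, True],
    "Bread":  [True, True, True],
    "Milk":   [True, False, True],
})

# Column mean of booleans = fraction of orders containing the item.
item_support = toy_orders.mean(axis=0)

# Keep the k items with the largest relative support (k = 2 here).
top_items = item_support.nlargest(2)

print(top_items)
```

A Series like `top_items` can then be drawn with `top_items.plot.bar()` or passed to a seaborn or matplotlib bar plot.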

In [None]:
# Plot the top 15 most frequently purchased products (by relative support) 





<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Q1(d) - Explore the Data, part 3 

For the `orders` dataset, create a histogram of the number of times an item is purchased. You may want to consider using log scaling to view the data distribution more easily. 
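One possible way to apply log scaling is to use logarithmically spaced bins together with a log-scaled x-axis, so that rarely and frequently purchased items are both visible. The sketch below uses invented counts (`item_counts` is hypothetical); on the real data the counts would come from the column sums of `orders`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical purchase counts per item (made up for illustration).
item_counts = np.array([1, 1, 1, 1, 2, 2, 3, 5, 8, 40, 300])

# Log-spaced bin edges from 1 up to the largest count (endpoints included).
bin_edges = np.geomspace(1, item_counts.max(), num=10)

fig, ax = plt.subplots()
ax.hist(item_counts, bins=bin_edges)
ax.set_xscale("log")
ax.set_xlabel("number of purchases (log scale)")
ax.set_ylabel("number of items")
```

Passing `log=True` to `ax.hist` instead scales the y-axis, which is a reasonable alternative when the bin heights, rather than the bin positions, span several orders of magnitude.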

In [None]:
# Plot histogram of number of purchases per item. 





<!-- END QUESTION -->

## Q1(e) - Apriori 

For the `orders` dataset, use Apriori to find association rules, `rules`, with a minimum relative support of 0.0035 and a minimum confidence of 0.5.  

Sort the rules by leverage (descending order), then by confidence (descending order), and store the top 20 rules in `q1e_df`.

Note that the minimum support level is rather high given the information plotted in Q1(c) and Q1(d). It was chosen to avoid using too much memory (lower support values will require 15-20 GB of memory). 
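To make the sorting keys concrete, the sketch below computes support, confidence, lift, and leverage by hand for a few single-item rules on made-up data, then sorts by leverage and breaks ties by confidence. The helper `rule_metrics` and the `toy` data are hypothetical; the formulas are the standard definitions (confidence = supp(A and C) / supp(A), lift = confidence / supp(C), leverage = supp(A and C) - supp(A) * supp(C)), which is an illustration rather than how mlxtend computes them internally.

```python
import pandas as pd

# Hypothetical boolean transaction data: 5 orders, 3 products.
toy = pd.DataFrame({
    "Bread": [True, True, True, False, True],
    "Milk":  [True, False, True, False, True],
    "Eggs":  [False, True, True, True, False],
})

def rule_metrics(df, antecedent, consequent):
    """Metrics for the single-item rule antecedent -> consequent."""
    supp_a = df[antecedent].mean()
    supp_c = df[consequent].mean()
    supp_ac = (df[antecedent] & df[consequent]).mean()
    confidence = supp_ac / supp_a
    return {
        "antecedent": antecedent,
        "consequent": consequent,
        "support": supp_ac,
        "confidence": confidence,
        "lift": confidence / supp_c,
        "leverage": supp_ac - supp_a * supp_c,
    }

toy_rules = pd.DataFrame([
    rule_metrics(toy, "Eggs", "Bread"),
    rule_metrics(toy, "Bread", "Milk"),
    rule_metrics(toy, "Milk", "Bread"),
])

# Primary key: leverage; secondary key: confidence; both descending.
ranked = toy_rules.sort_values(["leverage", "confidence"], ascending=[False, False])
print(ranked.head(20))
```

Here the rules Milk -> Bread and Bread -> Milk have the same leverage, because leverage is symmetric in the two itemsets, so the secondary key (confidence) decides their order.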

In [None]:
# Run Apriori as instructed

rules = ...
 
q1e_df = ...

q1e_df.iloc[0:10, [0, 1, 4, 5, 7]]


In [None]:
grader.check("q1e")

<!-- BEGIN QUESTION -->

## Q1(f) - Apriori, part 2

Create a scatterplot of the rules, plotting support vs. confidence colored by lift value. 
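A minimal sketch of this kind of plot with matplotlib is given below, using a small invented rules table (`toy_rules` and its values are hypothetical; the column names `support`, `confidence`, and `lift` match those in an mlxtend rules DataFrame). Passing the lift values as `c` maps them to colors, and the colorbar provides the legend.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rule metrics (made up for illustration).
toy_rules = pd.DataFrame({
    "support":    [0.010, 0.004, 0.006, 0.020],
    "confidence": [0.55, 0.80, 0.62, 0.51],
    "lift":       [3.1, 12.4, 5.0, 1.8],
})

fig, ax = plt.subplots()
points = ax.scatter(
    toy_rules["support"],
    toy_rules["confidence"],
    c=toy_rules["lift"],   # color each point by its lift
    cmap="viridis",
)
fig.colorbar(points, ax=ax, label="lift")
ax.set_xlabel("support")
ax.set_ylabel("confidence")
```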


In [None]:
# Plot the results of Apriori




<!-- END QUESTION -->

## Q1(g) - FPGrowth 

For the `orders` dataset, use FPGrowth to find association rules, `rules2`, with a minimum support of 0.0035 and a minimum confidence of 0.5.

Sort the rules by conviction (descending order), then by support (descending order).  Store the top 20 rules in `q1g_df`.
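Conviction is commonly defined as (1 - supp(C)) / (1 - confidence), so it is infinite for a rule whose confidence is exactly 1, and such rules sort to the top in descending order. The sketch below illustrates this on invented rule metrics; the helper `conviction` and the `toy_rules` table are hypothetical and are not taken from mlxtend.

```python
import numpy as np
import pandas as pd

def conviction(consequent_support, confidence):
    """(1 - supp(C)) / (1 - conf); infinite when the rule never fails."""
    if confidence == 1:
        return np.inf
    return (1 - consequent_support) / (1 - confidence)

# Hypothetical rule metrics (made up for illustration).
toy_rules = pd.DataFrame({
    "rule":               ["Eggs -> Bread", "Bread -> Milk", "Milk -> Bread"],
    "consequent support": [0.8, 0.6, 0.8],
    "confidence":         [2 / 3, 0.75, 1.0],
    "support":            [0.4, 0.6, 0.6],
})
toy_rules["conviction"] = [
    conviction(s, c)
    for s, c in zip(toy_rules["consequent support"], toy_rules["confidence"])
]

# Primary key: conviction; secondary key: support; both descending.
ranked = toy_rules.sort_values(["conviction", "support"], ascending=[False, False])
print(ranked[["rule", "conviction", "support"]])
```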

Note the relative speed of FPGrowth compared with Apriori.

In [None]:
# Run FPGrowth as instructed

rules2 = ...
 
q1g_df = ...

q1g_df.iloc[0:10, [0, 1, 4, 5, 8]]

In [None]:
grader.check("q1g")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

**NOTE:** The submission must be run on the campus Linux machines.  See the instructions in the Canvas assignment.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()