In [None]:
# exercise 01

"""
Visualizing itemset support

A content-streaming start-up has approached you for consulting services. To keep licensing fees low, they want to assemble a narrow library of movies that all appeal to the same audience. While they'll provide a smaller selection of content than the big players in the industry, they'll also be able to offer a low subscription fee.

You decide to use the MovieLens data and a heatmap for this project. Using a simple support-based heatmap will allow you to identify individual titles that have high support with other titles. The one-hot encoded data is available as the DataFrame onehot. Additionally, pandas is available as pd, seaborn is available as sns, and apriori() and association_rules() have both been imported.
"""

# Instructions

"""

    Compute the frequent itemsets using a minimum support of 0.07.

    Compute the association rules and apply no pruning to the rules.
---

    Replace the frozen sets in the antecedents columns with strings using a lambda function.

    Replace the frozen sets in the consequents columns with strings using a lambda function.

    Generate a heatmap using support values.

"""

# solution

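# pyplot is used below for tick formatting and for displaying the plot
import matplotlib.pyplot as plt

# Compute frequent itemsets using a minimum support of 0.07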
frequent_itemsets = apriori(onehot, min_support = 0.07, 
                            use_colnames = True, max_len = 2)

# Compute the association rules
rules = association_rules(frequent_itemsets, metric = 'support', 
                          min_threshold = 0.0)

#----------------------------------#

# Replace frozen sets with strings
rules['antecedents'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transform data to matrix format and generate heatmap
pivot = rules.pivot(index='consequents', columns='antecedents', values='support')
sns.heatmap(pivot)

# Format and display plot
plt.yticks(rotation=0)
plt.show()

#----------------------------------#

# Conclusion

"""
Excellent work! Based on the heatmap, can you identify a narrow set of movies that might provide a good starting point for the streaming service?
"""

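To make explicit what each cell of the support heatmap measures, here is a minimal sketch in plain Python rather than mlxtend. The transactions and titles are made-up placeholders, not MovieLens data, and `pair_support` is a hypothetical helper name. The support of a pair is the fraction of transactions that contain both titles.

```python
from itertools import combinations

# Hypothetical viewing histories (placeholders, not MovieLens data)
transactions = [
    {"Film A", "Film B", "Film C"},
    {"Film A", "Film B"},
    {"Film A", "Film C"},
    {"Film B"},
]

def pair_support(transactions):
    """Return {frozenset({x, y}): share of transactions containing both}."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    return {
        frozenset(pair): sum(set(pair) <= t for t in transactions) / n
        for pair in combinations(items, 2)
    }

support = pair_support(transactions)
for pair, value in support.items():
    print(sorted(pair), value)
```

Because the support of a rule is the support of the combined itemset, the rules A -> B and B -> A share one value, so a heatmap built from these values is symmetric across its diagonal.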

In [1]:
# exercise 02

"""
Heatmaps with lift

The founder likes the heatmap you've produced for her streaming service. After discussing the project further, however, you decide that it is important to examine other metrics before making a final decision on which movies to license. In particular, the founder suggests that you select a metric that tells you whether the support values are higher than you would expect given the films' individual support values.

You recall that lift does this well and decide to use it as a metric. You also remember that lift has an important threshold at 1.0 and decide that it is important to replace the colorbar with annotations, so you can determine whether a value is greater than 1.0. Note that the rules from the previous exercise are available to you as rules.
"""

# Instructions

"""

    Import seaborn under its standard alias.

    Transform the DataFrame containing rules into a matrix using the lift metric.

    Generate a heatmap with annotations on and the colorbar off.

"""

# solution

# Import seaborn under its standard alias
import seaborn as sns

# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules.pivot(index = 'consequents', 
                   columns = 'antecedents', values= 'lift')

# Generate a heatmap with annotations on and the colorbar off
sns.heatmap(pivot, annot = True, cbar = False)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

#----------------------------------#

# Conclusion

"""
Excellent work! In the next question, we'll interpret what the heatmap is telling us.
"""

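As a reminder of why 1.0 is the threshold that matters, lift divides the joint support of two titles by the product of their individual supports, so independent titles give exactly 1.0. The sketch below uses made-up support values, and `lift` is a hypothetical helper rather than an mlxtend function.

```python
def lift(support_ab, support_a, support_b):
    """Lift of A -> B: joint support relative to what independence predicts."""
    return support_ab / (support_a * support_b)

# Made-up supports: each title is watched by half of the users
independent = lift(support_ab=0.25, support_a=0.5, support_b=0.5)
associated = lift(support_ab=0.40, support_a=0.5, support_b=0.5)

print(independent)  # joint support equals the product of the individual supports
print(associated)   # watched together more often than independence predicts
```

Annotating the heatmap cells makes it possible to read off on which side of 1.0 each value falls, which a colorbar alone does not show precisely.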

# Interpreting heatmaps

In the previous exercise, you generated the heatmap shown below. Each cell of the heatmap shows the lift value for an association rule. Recall that your goal was to identify a narrow set of films that were all strongly associated according to the lift metric. These films would form the initial content library for a streaming service. Which of the following statements is true?

![heatmap](images/ch04-heatmap.png)

### Possible Answers


    Fight Club, Braveheart, and Batman Begins would be the best initial content library.
    
    
    Batman Begins, The Dark Knight, and The Dark Knight Rises would make a good initial library. {Answer}
    
    
    Braveheart and The Dark Knight Rises would make the best initial content library.
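
One way to read such a heatmap programmatically is to look for groups of titles whose pairwise lift values all exceed 1.0. The sketch below does this for groups of three. The lift values and the placeholder titles are invented for illustration and are not read off the figure; `strongly_associated_triples` is a hypothetical helper.

```python
from itertools import combinations

# Invented pairwise lift values for four placeholder titles
pair_lift = {
    frozenset({"Film A", "Film B"}): 1.8,
    frozenset({"Film A", "Film C"}): 1.5,
    frozenset({"Film B", "Film C"}): 2.1,
    frozenset({"Film A", "Film D"}): 0.9,
    frozenset({"Film B", "Film D"}): 1.2,
    frozenset({"Film C", "Film D"}): 0.7,
}

def strongly_associated_triples(pair_lift, threshold=1.0):
    """Return triples in which every pair has lift above the threshold."""
    titles = sorted(set().union(*pair_lift))
    return [
        triple
        for triple in combinations(titles, 3)
        if all(pair_lift[frozenset(pair)] > threshold
               for pair in combinations(triple, 2))
    ]

triples = strongly_associated_triples(pair_lift)
print(triples)
```

Lift is symmetric in the two titles, so a single value per pair is enough here.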
    
    

In [2]:
# exercise 03

"""
Pruning with scatterplots

After viewing your Batman-based streaming service proposal from the previous exercise, the founder realizes that her initial plan may have been too narrow. Rather than focusing on initial titles, she asks you to focus on general patterns in the association rules and then perform pruning accordingly. Your goal should be to identify a large set of strong associations.

Fortunately, you've just learned how to generate scatterplots. You decide to start by plotting support and confidence, since all optimal rules according to many common metrics are located on the confidence-support border. The one-hot encoded data has been imported for you and is available as onehot. Additionally, apriori() and association_rules() have been imported and pandas is available as pd.
"""

# Instructions

"""

    Generate a large number of itemsets with 2 items by setting the minimum support to 0.0075 and setting the maximum length to 2.

    Complete the statement for association_rules() in a way that avoids additional filtering.

    Complete the statement to generate the scatterplot, setting the y variable to use confidence.

"""

# solution

# Import seaborn under its standard alias
import seaborn as sns

# Apply the Apriori algorithm with a support value of 0.0075
frequent_itemsets = apriori(onehot, min_support = 0.0075, 
                            use_colnames = True, max_len = 2)

# Generate association rules without performing additional pruning
rules = association_rules(frequent_itemsets, metric = 'support', 
                          min_threshold = 0.0)

# Generate scatterplot using support and confidence
sns.scatterplot(x = "support", y = "confidence", data = rules)
plt.show()

#----------------------------------#

# Conclusion

"""
Great work! Notice that the confidence-support border roughly forms a triangle. This suggests that throwing out some low support rules would also mean that we would discard rules that are strong according to many common metrics.
"""

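Part of the triangular shape follows directly from the definitions: confidence is the rule's support divided by the antecedent's support, and the antecedent's support is at most 1, so no point can fall below the line confidence = support. The sketch below checks this on made-up transactions in plain Python; `rule_metrics` is a hypothetical helper, not part of mlxtend.

```python
from itertools import permutations

# Made-up transactions (placeholders, not MovieLens data)
transactions = [
    {"Film A", "Film B"},
    {"Film A", "Film B", "Film C"},
    {"Film A"},
    {"Film C"},
    {"Film B", "Film C"},
]

def rule_metrics(transactions):
    """Return {(antecedent, consequent): (support, confidence)} for single-item rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    metrics = {}
    for a, b in permutations(items, 2):
        count_a = sum(a in t for t in transactions)
        count_ab = sum(a in t and b in t for t in transactions)
        if count_ab:
            metrics[(a, b)] = (count_ab / n, count_ab / count_a)
    return metrics

metrics = rule_metrics(transactions)
for rule, (support, confidence) in metrics.items():
    print(rule, round(support, 2), round(confidence, 2))
```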

In [3]:
# exercise 04

"""
Optimality of the support-confidence border

You return to the founder with the scatterplot produced in the previous exercise and ask whether she would like you to use pruning to recover the support-confidence border. You tell her about the Bayardo-Agrawal result, but she seems skeptical and asks whether you can demonstrate this in an example.

Recalling that scatterplots can scale the size of dots according to a third metric, you decide to use that to demonstrate optimality of the support-confidence border. You will show this by scaling the dot size using the lift metric, which was one of the metrics to which Bayardo-Agrawal applies. The one-hot encoded data has been imported for you and is available as onehot. Additionally, apriori() and association_rules() have been imported and pandas is available as pd.
"""

# Instructions

"""

    Apply the Apriori algorithm to the DataFrame onehot.

    Compute the association rules using the support metric and a minimum threshold of 0.0.

    Complete the expression for the scatterplot such that the dot size is scaled by lift.

"""

# solution

# Import seaborn under its standard alias
import seaborn as sns

# Apply the Apriori algorithm with a support value of 0.0075
frequent_itemsets = apriori(onehot, min_support = 0.0075, 
                         use_colnames = True, max_len = 2)

# Generate association rules without performing additional pruning
rules = association_rules(frequent_itemsets, metric = "support", 
                          min_threshold = 0.0)

# Generate scatterplot of support and confidence, scaling dot size by lift
sns.scatterplot(x = "support", y = "confidence", 
                size = "lift", data = rules)
plt.show()

#----------------------------------#

# Conclusion

"""
Great work! If you look at the plot carefully, you'll notice that the highest values of lift are always at the support-confidence border for any given value of support or confidence.
"""

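A rough way to see which rules sit on the upper support-confidence border is to keep only the rules that no other rule matches or beats on both support and confidence. This is a simplified sketch, not the procedure behind the Bayardo-Agrawal result, which is usually stated for rules that share a consequent; that restriction is ignored here for brevity. The rule names and values are made up, and `upper_border` is a hypothetical helper.

```python
# Made-up (support, confidence) pairs for five hypothetical rules
toy_rules = {
    "r1": (0.10, 0.90),
    "r2": (0.20, 0.70),
    "r3": (0.15, 0.60),
    "r4": (0.30, 0.40),
    "r5": (0.05, 0.50),
}

def upper_border(rules):
    """Keep rules that no other rule matches or beats on both metrics."""
    border = []
    for name, (sup, conf) in rules.items():
        dominated = any(
            other != name and o_sup >= sup and o_conf >= conf
            for other, (o_sup, o_conf) in rules.items()
        )
        if not dominated:
            border.append(name)
    return border

border = upper_border(toy_rules)
print(border)
```

Here `r3` is dropped because `r2` has both higher support and higher confidence, and `r5` is dropped because `r1` does.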

In [None]:
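def rules_to_coordinates(rules):
    # Keep the first item of each frozenset; this assumes single-item
    # antecedents and consequents (itemsets generated with max_len = 2)
    rules['antecedent'] = rules['antecedents'].apply(lambda antecedent: list(antecedent)[0])
    rules['consequent'] = rules['consequents'].apply(lambda consequent: list(consequent)[0])
    # Use the row index as the rule identifier (one line per rule in the plot)
    rules['rule'] = rules.index
    return rules[['antecedent','consequent','rule']]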
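
To illustrate what this helper extracts, here is the same idea on plain Python records instead of a DataFrame. The rules are invented, and `to_coordinates` is a hypothetical stand-in, not the helper defined above.

```python
# Invented rules with frozenset antecedents and consequents
toy_rules = [
    {"antecedents": frozenset({"Film A"}), "consequents": frozenset({"Film B"})},
    {"antecedents": frozenset({"Film C"}), "consequents": frozenset({"Film B"})},
]

def to_coordinates(rules):
    """Unpack single-item frozensets and number the rules."""
    coords = []
    for rule_id, rule in enumerate(rules):
        (antecedent,) = rule["antecedents"]  # raises if not exactly one item
        (consequent,) = rule["consequents"]
        coords.append({"antecedent": antecedent,
                       "consequent": consequent,
                       "rule": rule_id})
    return coords

coords = to_coordinates(toy_rules)
print(coords)
```

Unlike `list(...)[0]`, the tuple unpacking raises a `ValueError` when a set holds more than one item, which makes the single-item assumption explicit.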

In [None]:
# exercise 05

"""
Using parallel coordinates to visualize rules

Your visual demonstration in the previous exercise convinced the founder that the support-confidence border is worthy of further exploration. She now suggests that you extract part of the border and visualize it. Since the rules that fall on the border are strong with respect to most common metrics, she argues that you should simply visualize whether a rule exists, rather than the intensity of the rule according to some metric.

You realize that a parallel coordinates plot is ideal for such cases. The data has been imported for you as onehot. Additionally, apriori(), association_rules(), and parallel_coordinates() have been imported, and pandas is available as pd. The function rules_to_coordinates() has been defined and is available.
"""

# Instructions

"""

    Complete the Apriori algorithm statement using a minimum support of 0.05.
    
    Compute association rules using a minimum confidence threshold of 0.50. This is sufficiently high to exclusively capture points near the upper part of the support-confidence border.
    
    Convert the rules into coordinates.
    
    Plot the coordinates using parallel_coordinates().

"""

# solution

# Compute the frequent itemsets
frequent_itemsets = apriori(onehot, min_support = 0.05, 
                         use_colnames = True, max_len = 2)

# Compute rules from the frequent itemsets with the confidence metric
rules = association_rules(frequent_itemsets, metric = 'confidence', 
                          min_threshold = 0.50)

# Convert rules into coordinates suitable for use in a parallel coordinates plot
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule')
plt.legend([])
plt.show()

#----------------------------------#

# Conclusion

"""
Good job! Do any patterns in the associations between antecedents and consequents stand out to you?
"""

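One pattern to look for is a consequent on which many lines converge. Counting rules per consequent gives the same information numerically. The sketch below uses invented antecedent and consequent pairs, similar in spirit to the columns of `coords` but held in a plain list.

```python
from collections import Counter

# Invented coordinates: one (antecedent, consequent) pair per rule
toy_coords = [
    ("Film A", "Film D"),
    ("Film B", "Film D"),
    ("Film C", "Film D"),
    ("Film D", "Film A"),
]

# Number of rules pointing at each consequent
consequent_counts = Counter(consequent for _, consequent in toy_coords)
hub, hub_count = consequent_counts.most_common(1)[0]
print(hub, hub_count)
```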

In [5]:
# exercise 06

"""
Refining a parallel coordinates plot

After viewing your parallel coordinates plot, the founder concludes that her decision to step away from a Batman-centered streaming platform may have been premature. Indeed, the parallel coordinates plot seems to suggest that many popular movies are strongly associated with The Dark Knight. She decides instead to pitch the idea to her staff in a meeting, but has asked you to make some refinements to the plot to make it more visually appealing for the presentation.

Note that the rules have been generated and are available for you as rules. Additionally, pandas is available as pd and the function rules_to_coordinates() has been defined and is available.
"""

# Instructions

"""

    Import the parallel coordinates function from the plotting submodule of pandas.
    
    Convert the DataFrame of rules into coordinates.
    
    Complete the statement to generate a parallel coordinates plot, using ocean as the colormap.

"""

# solution

# Import the parallel coordinates function from the plotting submodule of pandas
from pandas.plotting import parallel_coordinates

# Convert rules into coordinates suitable for use in a parallel coordinates plot
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule', colormap = 'ocean')
plt.legend([])
plt.show()

#----------------------------------#

# Conclusion

"""
Great work! Notice that the color scheme for the parallel coordinates plot has now changed.
"""
