# Machine Learning and Mathematical Methods

<table>
    <thead>
        <tr>
            <th>Type of ML Problem</th>
            <th>Description / Example</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Classification</td>
            <td>Pick one of the N labels, samples, or nodes</td>
        </tr>
        <tr>
            <td>(Linear) Regression</td>
            <td>
                Predict numerical or time-based values<br />
                e.g. click-through rate
            </td>
        </tr>
        <tr>
            <td>Clustering</td><td>Group similar examples (relevance)</td>
        </tr>
        <tr>
            <td>Association rule learning</td><td>Infer likely association patterns in data</td>
        </tr>
        <tr>
            <td>Structured output</td>
            <td>
                Natural language processing<br />
                image recognition
            </td>
        </tr>
    </tbody>
</table>

<p>In this lesson we will focus on association rule learning (<b>affinity analysis</b>) and structured output (<b>bag of words</b>).</p>

### Affinity Analysis - to determine hobby recommendations

Affinity analysis is a data analysis technique used to determine the similarities between objects (samples). It's used in:
* Sales, advertising, or similar recommendations (you liked "Movie", so you might also like "Other Movie")
* Mapping familial human genes (i.e. the "do you share ancestors?" people)
* Social web-maps that associate friends' "likes" and "shares" (guess which one we are doing)

It works by finding associations among samples, that is, "finding combinations of items, objects, or anything else which happen frequently together". It then builds rules from these combinations to estimate the likelihood (probability) that items are related...and then we build another graph (no kidding, well, kind of; it depends on the application).
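As a rough illustration of "finding combinations which happen frequently together", here is a minimal sketch that counts how often pairs of items show up in the same basket. The `baskets` data is invented purely for this example and is not part of the lesson's dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, made up only to illustrate the idea
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives every pair one fixed order, so each pair has a single key
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that occur together most often are the candidates for rules
for pair, count in pair_counts.most_common():
    print(pair, count)
```

The pairs with the highest counts are the ones worth turning into rules; the rest of this lesson does the same kind of counting on hobbies.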

We are going to use this with our data, but a real analysis would typically need hundreds to millions of transactions to ensure statistical significance.

<small>If you're really interested in how these work (on a macro level): <a href="https://medium.com/@smirnov.am/e-commerce-recommendation-systems-basket-analysis-518009d46b79">Smirnov has a great blog about it</a></small>

### We start by building our data into a set of arrays (*cough* a matrix if you will)

So we will be using numpy (a great library for vector, array, set, and other numeric operations on matrices/arrays in Python). The arrays it uses are grids of values (all of the same type), indexed by a tuple of nonnegative integers.
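To make "a grid of values indexed by a tuple" concrete, here is a small sketch. The `grid` array is a throwaway example, not the lesson's dataset.

```python
import numpy as np

# A small 2 x 3 grid; every element shares one dtype
grid = np.array([[1, 0, -1],
                 [0, 1, 1]])

print(grid.shape)   # (rows, columns)
print(grid.dtype)   # one integer type for the whole array
print(grid[0, 2])   # row 0, column 2, addressed with a tuple of indices
print(grid[:, 1])   # every row of column 1
```

Column slicing like `grid[:, 1]` is what lets us pull out a single hobby for every person at once.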

The dataset can be thought of as one column per hobby (or hobby type), holding -1 (dislike), 0 (neutral), or 1 (like) for each friend. We will assume all were strong relationships at this point; weighting will be added later. So think of it like this (except we are dropping the person column, because who the person is doesn't matter here):

<table>
    <thead>
        <tr>
            <th>Person</th>
            <th>Football</th>
            <th>Reading</th>
            <th>Chess</th>
            <th>Sketching</th>
            <th>Video Games</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Josiah</th>
            <th>1</th>
            <th>1</th>
            <th>1</th>
            <th>-1</th>
            <th>1</th>
        </tr>
        <tr>
            <th>Jill</th>
            <th>1</th>
            <th>0</th>
            <th>0</th>
            <th>1</th>
            <th>-1</th>
        </tr>
        <tr>
            <th>Mark</th>
            <th>-1</th>
            <th>1</th>
            <th>1</th>
            <th>1</th>
            <th>-1</th>
        </tr>
    </tbody>
</table>
<p>Now let's look at our <b>premise</b> (the thing to find out): <i>A person that likes Football will also like Video Games</i></p>

In [None]:
from numpy import array  # array is all we need here

# We are going to see if a person that likes Football also likes Video Games (could do the reverse too)
# Start by building our data (capital X by convention for a matrix of samples; it will be available in other cells)
X = array([
  [1,1,1,-1,1],
  [1,0,0,1,-1],
  [-1,1,1,1,-1]
])

features = ["football", "reading", "chess", "sketching", "video games"]
n_features = len(features)  # for iterating over features

In [None]:
football_fans = 0
# Even though it is a numpy array, we can still iterate over it row by row
for sample in X:
  if sample[0] == 1:  # person really likes football
    football_fans += 1
print("{}: people love Football".format(football_fans))

# 2 of our 3 people like football, and 1 of those 2 also likes video games,
# so we can already see the premise holds 50% of the time
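The same count can be done without an explicit loop. This is a minimal sketch using numpy boolean masks; the names `likes_football`, `likes_video_games`, and `fans_of_both` are introduced here for illustration only.

```python
import numpy as np

X = np.array([
    [1, 1, 1, -1, 1],
    [1, 0, 0, 1, -1],
    [-1, 1, 1, 1, -1],
])

likes_football = X[:, 0] == 1       # boolean mask, one entry per person
likes_video_games = X[:, 4] == 1

football_fans = int(likes_football.sum())                      # True counts as 1
fans_of_both = int((likes_football & likes_video_games).sum())

print("{} people love Football".format(football_fans))
print("{} of them also love Video Games".format(fans_of_both))
```

Dividing `fans_of_both` by `football_fans` gives the share of football fans who also like video games.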

### Let's build some rule sets
<p>The simplest measurements of rules are <b>support</b> and <b>confidence</b>.<br />
    <br /><b>Support</b> = Number of times the rule occurs (frequency count)
    <br /><b>Confidence</b> = Percentage of times the rule applies when our premise applies<br /><br />
We will use dictionaries (defaultdicts, which supply a default value for missing keys) to compute these. We will count the number of valid results and a simple frequency of our premises. To test multiple premises we will loop over all of them. By the end the dictionaries will have:
<ul><li>A tuple as the key, e.g. (0, 4) for Football vs. Video Games</li><li>The count of valid or total occurrences (depending on the dict)</li></ul></p>
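Written as formulas, the two measurements look like this. The notation is an assumption made for this sketch: $s$ is one sample (one row of the table), and $s_A$ is that sample's value for hobby $A$.

$$
\text{support}(A \Rightarrow B) = \bigl|\{\, s : s_A = 1 \text{ and } s_B = 1 \,\}\bigr|
$$

$$
\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \Rightarrow B)}{\bigl|\{\, s : s_A = 1 \,\}\bigr|}
$$

For Football $\Rightarrow$ Video Games in the table above, only Josiah likes both, so the support is 1. Two people like Football, so the confidence is $1/2 = 0.5$.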

#### Why must we test multiple premises? Because this is ML, it's analytics: it is based on statistical calculation, not on a human querying the data

<sub><i>Those who have done Python may see areas where comprehensions, enumerators, generators, and caches could speed this up. If so, great! But let's start simple.</i></sub>


<sub>We call these simple rule sets, but they are the same ones used for much more complex data: <a href="https://charleshsliao.wordpress.com/2017/06/10/movie-recommender-affinity-analysis-of-apriori-in-python/">See lines 59, 109, and 110</a></sub>

In [None]:
from collections import defaultdict

valid_rules = defaultdict(int)  # count of times each rule held (premise and conclusion both liked)
num_occurances = defaultdict(int)  # count of times each premise occurred
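For anyone new to `defaultdict`, this short sketch shows why it is handy for counting. The `counts` and `plain` names are only for this illustration.

```python
from collections import defaultdict

counts = defaultdict(int)   # int() returns 0, so missing keys start at 0
counts[(0, 4)] += 1         # no KeyError, even though (0, 4) was never set
counts[(0, 4)] += 1

plain = {}
try:
    plain[(0, 4)] += 1      # a plain dict raises KeyError on a missing key
except KeyError:
    plain[(0, 4)] = 1       # so we would have to initialise it ourselves

print(counts[(0, 4)], plain[(0, 4)])
```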

In [None]:
for sample in X:
  for premise in range(n_features):
    if sample[premise] == 1:  # we are only looking at likes right now
      num_occurances[premise] += 1  # that's one more like for this premise
      for conclusion in range(n_features):
        if premise == conclusion:
          continue  # same index, so move on to the next conclusion
        if sample[conclusion] == 1:
          # the conclusion also shows a like (1), so this is a valid rule
          valid_rules[(premise, conclusion)] += 1
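The nested loop above can also be expressed as a single matrix product. This is a sketch of one possible vectorized approach, not the method used in the rest of the lesson; `likes`, `co_likes`, `occurrences`, and `confidence_matrix` are names made up for it.

```python
import numpy as np

X = np.array([
    [1, 1, 1, -1, 1],
    [1, 0, 0, 1, -1],
    [-1, 1, 1, 1, -1],
])

likes = (X == 1).astype(int)       # 1 where a person likes the hobby, else 0

# co_likes[i, j] = number of people who like both hobby i and hobby j
co_likes = likes.T @ likes
occurrences = np.diag(co_likes)    # diagonal = number of people who like hobby i

# confidence_matrix[i, j] = co_likes[i, j] / occurrences[i]
# (left at 0 where nobody likes hobby i; the diagonal is not a real rule)
confidence_matrix = np.divide(
    co_likes,
    occurrences[:, None],
    out=np.zeros(co_likes.shape, dtype=float),
    where=occurrences[:, None] > 0,
)

print(co_likes[1, 2], confidence_matrix[1, 2])  # reading -> chess
print(co_likes[0, 4], confidence_matrix[0, 4])  # football -> video games
```

The off-diagonal entries of `co_likes` hold the same counts the loop stores in `valid_rules`, and the diagonal holds the per-premise totals.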

### Now we determine the confidence of our rules

Use our collection of valid rules and counts (the valid_rules dict) as the support. Then loop over the rules and divide the frequency of valid occurrences by the total frequency of the premise. If this reminds you of one item in your ATM project, well, it should.

In [None]:
support = valid_rules  # same dict under a new name (an alias, not a copy)
# key: tuple of two indexes, e.g. (0, 4); value: count of matching 1s (likes)
confidence = defaultdict(float)
for (premise, conclusion) in valid_rules.keys():
  rule = (premise, conclusion)
  confidence[rule] = valid_rules[rule] / num_occurances[premise]
# key: tuple of indexes; value: number of valid occurrences / total occurrences of the premise

### Then it's just time to print out the results (let's say the top 7)

In [None]:
from operator import itemgetter  # needed for the sort keys below

# Let's find the top 7 rules (by occurrence, not confidence)
sorted_support = sorted(support.items(),
    key=itemgetter(1), # sort by the values of the dictionary
    reverse=True)      # descending
sorted_confidence = sorted(confidence.items(),
    key=itemgetter(1), reverse=True) # same idea, but ranked by confidence instead

# Now just print out the top 7 (run the print_rule cell first)
for i in range(7):
  print("Associated Rule {}".format(i + 1))
  premise, conclusion = sorted_support[i][0]
  print_rule(premise, conclusion, support, confidence, features)

In [None]:
# This function would usually go at the top, but in a notebook we can run this cell before the earlier one, and this order shows the progression
def print_rule(premise, conclusion, support, confidence, features):
  premise_name = features[premise] #so if 0 = football, 1 = ...
  conclusion_name = features[conclusion]
  print("rule: if someone likes {} they will also like {}".format(premise_name, conclusion_name))
  print("confidence: {0:.3f} : idx {1} vs. idx {2}".format(
    confidence[(premise, conclusion)], premise, conclusion))
  print("support:{}".format(support[(premise, conclusion)]))

## Prints

In [None]:
Associated Rule 1
rule: if someone likes reading they will also like chess
confidence: 1.000 : idx 1 vs. idx 2
support:2
Associated Rule 2
rule: if someone likes chess they will also like reading
confidence: 1.000 : idx 2 vs. idx 1
support:2
Associated Rule 3
rule: if someone likes football they will also like reading
confidence: 0.500 : idx 0 vs. idx 1
support:1
Associated Rule 4
rule: if someone likes football they will also like chess
confidence: 0.500 : idx 0 vs. idx 2
support:1
Associated Rule 5
rule: if someone likes football they will also like video games
confidence: 0.500 : idx 0 vs. idx 4
support:1
Associated Rule 6
rule: if someone likes reading they will also like football
confidence: 0.500 : idx 1 vs. idx 0
support:1
Associated Rule 7
rule: if someone likes reading they will also like video games
confidence: 0.500 : idx 1 vs. idx 4
support:1
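The output above is ranked by support. As a sketch of the alternative, the cell below ranks the same rules by confidence first and uses support only to break ties. It rebuilds the counts in a compact form so it can run on its own; `num_occurrences` and `ranked` are names introduced here.

```python
from collections import defaultdict

X = [
    [1, 1, 1, -1, 1],
    [1, 0, 0, 1, -1],
    [-1, 1, 1, 1, -1],
]
features = ["football", "reading", "chess", "sketching", "video games"]

valid_rules = defaultdict(int)
num_occurrences = defaultdict(int)
for sample in X:
    liked = [i for i, value in enumerate(sample) if value == 1]
    for premise in liked:
        num_occurrences[premise] += 1
        for conclusion in liked:
            if premise != conclusion:
                valid_rules[(premise, conclusion)] += 1

confidence = {rule: count / num_occurrences[rule[0]]
              for rule, count in valid_rules.items()}

# Rank by confidence first, then by support to break ties
ranked = sorted(valid_rules,
                key=lambda rule: (confidence[rule], valid_rules[rule]),
                reverse=True)

for premise, conclusion in ranked[:5]:
    print("{} -> {}: confidence {:.3f}, support {}".format(
        features[premise], features[conclusion],
        confidence[(premise, conclusion)], valid_rules[(premise, conclusion)]))
```

Rules whose premise is video games reach a confidence of 1.000 on a support of only 1, because a single person likes video games. That is why confidence should be read alongside support, and why real analyses need far more data than three rows.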