# OpenPetition.de Recommendation System

The following Jupyter notebook was created during the [DataBBQ "Open Data in Action"](https://www.meetup.com/de-DE/Data-Visualization-RheinMain/events/239007070/), organized by the [DataVizRM Meetup](https://www.meetup.com/de-DE/Data-Visualization-RheinMain/) and hosted by [SAS](https://www.sas.com/de_de/company-information/office-listing.html) in Heidelberg.

This notebook is part of a team effort for the mini-hackathon held during the BBQ. The team members are:
  * Cornelius Schwab
  * Danny Tobisch
  * Dilshod
  * Dirk Toewe

The goal of the hackathon was to improve the [openpetition](https://www.openpetition.de/) platform using the anonymized user data that had been published beforehand. This notebook represents the preprocessing step used to convert the user data into a reduced form suited for visualization.

In [None]:
%matplotlib notebook
from pandas import read_csv, DataFrame
from collections import defaultdict, Counter
from itertools import permutations, groupby
import csv, os, numpy as np, pandas as pd
from ipywidgets import FloatProgress
from IPython.display import display

# Reading Raw Data

For our recommendation system we needed to know, for each pair of petitions (p,q), how many signatures they have in common. Our hypothesis was that two petitions with a lot of signatures in common are closely related, so that q would be a sensible recommendation for a new signer of p.
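
Stated a little more formally (the symbols $S_p$ and $w$ are notation introduced here for illustration only, not taken from the data set): if $S_p$ denotes the set of signers of petition $p$, the quantity we are after for each pair of petitions is

```latex
$$ w(p,q) = \left| S_p \cap S_q \right| $$
```

A large $w(p,q)$ is then read as "p and q are closely related".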

The signature data can be found in *signer.csv*. The file is, however, too large to read in as a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Instead, the data is streamed using the [csv](https://docs.python.org/2/library/csv.html) module, and unnecessary data is discarded directly.

In [None]:
def signatures():

  filepath = os.path.expanduser('~/Documents/openpetition_data/signer.csv')

  with open(filepath,'r') as file:

    def lines():
      prgrs= FloatProgress(
        description = '0%',
        min=0, max=os.path.getsize(filepath)
      )
      prgrs_val = 0
      display(prgrs)
      for line in file:
        yield line
        prgrs_val += len( line.encode('utf-8') )
        if prgrs_val*100//prgrs.max != prgrs.value*100//prgrs.max:
          prgrs.description = '%3d%%' % (prgrs_val*100//prgrs.max)
          prgrs.value = prgrs_val

    reader = csv.reader(lines(), delimiter=',', quotechar='"')

    head = { field: i for i,field in enumerate(next(reader)) }
    iEmail= head['email']
    iPid  = head['petition_id']
    del head
    emails = []

    def pids():
        
      for row in reader:
        email = row[iEmail]
        if 0 < len(email):
          pid = row[iPid]
          assert pid.startswith('=')
          emails.append(email)
          yield pid[1:]

    pids = np.fromiter( pids(), dtype= np.int64 )
    return emails,pids

signatures = signatures()
signatures = DataFrame({
  'email'      : signatures[0],
  'petition_id': signatures[1]
})

The parsed result is a list of (email, petition_id) pairs, each of which represents a person signing a particular petition. We assumed the e-mail address to be a unique identifier for a signer. Anonymous signatures, i.e. those without an e-mail address, were discarded.
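
To make the parsing rules concrete, here is a minimal sketch on a tiny in-memory sample. The sample rows and the two-column layout are made up for illustration (the real *signer.csv* presumably has more columns). Only the `=` prefix on `petition_id` and the dropping of rows without an e-mail are taken from the code above.

```python
import csv
import io

# Hypothetical sample; the real signer.csv is assumed to have a header row
# containing at least the columns 'email' and 'petition_id'.
sample = io.StringIO(
    'email,petition_id\n'
    'a@example.org,=101\n'
    ',=101\n'                 # anonymous signature, will be discarded
    'b@example.org,=102\n'
)

reader = csv.reader(sample, delimiter=',', quotechar='"')
head = {field: i for i, field in enumerate(next(reader))}

pairs = [
    (row[head['email']], int(row[head['petition_id']][1:]))  # strip leading '='
    for row in reader
    if row[head['email']]                                    # skip empty e-mails
]
print(pairs)
```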

In [None]:
signatures

# Group by E-Mail

The data representation is not yet ideal for the purposes of a recommendation system. We are not interested in particular signatures but rather in all signatures that two petitions have in common. The first processing step is to group all petition_ids signed by an individual e-mail/user. At the same time, we can already filter out all the e-mails used to sign only a single petition, as they don't give us any connection between two or more petitions.

In [None]:
multisigner = defaultdict(set)
for row in signatures.itertuples():
  multisigner[row.email].add(row.petition_id)

del signatures # <- save memory

multisigner = {
  k : np.fromiter(v, dtype=np.int64)
  for k,v in multisigner.items()
  if len(v) > 1
}
multisigner
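
As a side note, the same grouping can be sketched with pandas' own `groupby`. This is an alternative formulation on made-up toy data, not the approach used in this notebook, and we did not check whether it scales to the full data set.

```python
import pandas as pd

# Hypothetical toy signatures: b@x signs only once, c@x signs petition 3 twice.
toy = pd.DataFrame({
    'email':       ['a@x', 'a@x', 'b@x', 'c@x', 'c@x', 'c@x'],
    'petition_id': [1,     2,     1,     2,     3,     3],
})

# Collect the distinct petitions per e-mail ...
grouped = toy.groupby('email')['petition_id'].apply(set)

# ... and keep only e-mails that connect at least two petitions.
multi = {email: pids for email, pids in grouped.items() if len(pids) > 1}
print(multi)
```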

# Filter Outliers

It is interesting to see the highest number of petitions signed using the same e-mail. As it turns out, there is an e-mail which was used in 2174 petitions, which is more than suspicious. Overall, however, there are only 88 e-mails used in more than 200 petitions.

In [None]:
bot = max( multisigner.items(), key = lambda kv: len(kv[1]) )
print( len(bot[1]) )
len({
  k : v
  for k,v in multisigner.items()
  if len(v) > 200
})

Let us keep only the users who have signed a hundred petitions or fewer, considering them "normal" users, and filter out everyone else. As it turns out, the remaining sample size is still over 1 million, which is more than reasonable.

In [None]:
nonbots = {
  k: v
  for k,v in multisigner.items()
  if len(v) <= 100
}
len(nonbots)

# Building Graph

In this final step of the data transformation, we turn the data into a weighted <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">undirected graph</a>. For each pair of petitions a user has signed, we increase the weight of the edge between the two petitions by 1.

In [None]:
graph = Counter(
  ab
  for v in nonbots.values()
  for ab in permutations(v,2)
)

print( len(graph) )
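
The following toy example (with made-up users and petition ids) sketches what the counting does. Because `permutations` yields both `(a,b)` and `(b,a)`, every undirected edge is stored twice, once per direction, with the same weight.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: each user maps to the petitions they signed.
toy_signers = {
    'alice': [1, 2, 3],
    'bob':   [1, 2],
    'carol': [2, 3],
}

toy_graph = Counter(
    edge
    for pids in toy_signers.values()
    for edge in permutations(pids, 2)   # all ordered pairs of distinct petitions
)

# Petitions 1 and 2 share two signers (alice and bob), in both directions.
print(toy_graph[(1, 2)], toy_graph[(2, 1)])
```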

Now all we have to do is save the data so that we can visualize it. We utilized [Microsoft Power BI](https://powerbi.microsoft.com/de-de/).

In [None]:
with open( os.path.expanduser('~/Documents/openpetition_data/graph.csv'), 'w') as signer:
  writer = csv.writer(signer, delimiter=',', quotechar='"')
  writer.writerow(['petition_from','petition_to','weight'])
  for (a,b),n in graph.items():
    writer.writerow([a,b,n])

# Reducing to Recommendations

Since the graph potentially contains a large number of edges, we can go on and remove all but the 5 strongest edges for each petition node. Keep in mind that this results in a *directed graph*.

In [None]:
recommendations = defaultdict(list)

for (a,b),n in graph.items():
  recommendations[a] += [b,n]

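nVertices = 5 # <- number of strongest edges to keep per petition

def rows():
  for a,bn in recommendations.items():
    bn = np.array(bn).reshape(-1,2) # <- one (petition_to, weight) row per edge
    # lexsort uses the last key (weight) as primary key; the reversed slice picks the heaviest edges
    yield a, bn[ np.lexsort(bn.T)[:-nVertices-1:-1] ]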

recommendations = list( rows() )

del rows
recommendations.sort(key = lambda kv: kv[0])
for k,v in recommendations[:+2]:
  print('%d ->\n%s' % (k,v) )
print('...')
for k,v in recommendations[-2:]:
  print('%d ->\n%s' % (k,v) )
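
The `lexsort` slice used above is rather compact, so here is a small sketch of it on made-up numbers. `np.lexsort` treats the *last* key as the primary one, so passing `bn.T` sorts by weight first and uses the neighbour id to break ties. The reversed slice then takes the `k` heaviest edges in descending order.

```python
import numpy as np

# Hypothetical (neighbour, weight) rows for a single petition.
bn = np.array([
    [10, 3],
    [11, 7],
    [12, 1],
    [13, 7],
])
k = 2  # keep the two strongest edges

order = np.lexsort(bn.T)      # ascending by weight, ties broken by neighbour id
top_k = bn[order[:-k-1:-1]]   # last k entries, strongest first
print(top_k)
```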

We can also incorporate the petition name into the graph.

In [None]:
petitions = read_csv(
  os.path.expanduser('~/Documents/openpetition_data/petition.csv'),
  index_col='petition_id', usecols=['petition_id','title']
)
assert petitions.index.str.startswith('=').all()
petitions.index = pd.to_numeric( petitions.index.str[1:] )
# look up each source petition only once instead of once per edge
petitions = petitions.loc[np.fromiter( {a for (a,_) in graph}, dtype=np.int64 )]

petitions = {
  row[0]: row.title
  for row in petitions.itertuples()
}
petitions

In [None]:
with open( os.path.expanduser('~/Documents/openpetition_data/recommendations.csv'), 'w') as signer:
  writer = csv.writer(signer, delimiter=',', quotechar='"')
  writer.writerow(['petition_from','petition_to','weight','title_from','title_to'])
  for a,bn in recommendations:
    for b,n in bn:
      writer.writerow([
        a,b,n,
        petitions[a],
        petitions[b]
      ])