# Chapter 3: Introducing Snorkel

[Snorkel](https://www.snorkel.org/) is a [software project](https://github.com/snorkel-team/snorkel), originally from the Hazy Research group at Stanford University, that enables the practice of *weak supervision*, *distant supervision*, *data augmentation* and *data slicing*. Weak supervision, distant supervision and data augmentation are terms we’ve discussed, but what is data slicing?

The project has an excellent [Get Started](https://www.snorkel.org/get-started/) page, and I recommend you spend some time working through the [tutorials](https://github.com/snorkel-team/snorkel-tutorials) before proceeding beyond this chapter.

Snorkel implements an unsupervised generative model that accepts a matrix of weak labels for records in your training data and produces strong labels by learning the relationships between these weak labels through matrix factorization.
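
To make the idea of a "matrix of weak labels" concrete, here is a minimal sketch in plain Python. It combines the votes with a simple majority vote, which is only a baseline and is not Snorkel's generative model. The matrix values, the `majority_vote` helper and the use of `-1` for "no vote" are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: -1 means a labeling function did not vote

# Hypothetical weak-label matrix: one row per record, one column per labeling function
weak_labels = [
    [1, 1, -1],    # two votes for class 1, one abstain
    [0, -1, 0],    # two votes for class 0, one abstain
    [-1, -1, -1],  # nobody voted
    [1, 0, 1],     # votes disagree, class 1 has the majority
]


def majority_vote(row, abstain=ABSTAIN):
    """Return the most common non-abstain vote, or abstain on no votes or a tie."""
    votes = [vote for vote in row if vote != abstain]
    if not votes:
        return abstain
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain  # tie between the top two classes
    return ranked[0][0]


labels = [majority_vote(row) for row in weak_labels]
print(labels)
```

Majority vote treats every labeling function as equally reliable. A learned label model aims to do better by estimating how much to trust each column.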

In [1]:
import sys
sys.path.append("..")

import numpy as np
import pandas as pd
import pyarrow

from lib import utils

[nltk_data] Downloading package punkt to /home/rjurney/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rjurney/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Part 1: Loading the Labeled Amazon Github Projects

I previously hand labeled about 2,600 Github projects belonging to Amazon and its subsidiaries into categories related to their purpose. We're going to use this dataset to introduce Snorkel.

### Hand Labeling this Data

In order to get a ground truth dataset against which to benchmark our Snorkel labeling, I hand labeled all Amazon Github projects in [this sheet](https://docs.google.com/spreadsheets/d/1wiesQSde5LwWV_vpMFQh24Lqx5Mr3VG7fk_e6yht0jU/edit?usp=sharing). The label categories are:

| Number | Code      | Description                          |
|--------|-----------|--------------------------------------|
| 0      | GENERAL   | A FOSS project of general utility    |
| 1      | API       | API library for AWS / Amazon product |
| 2      | RESEARCH  | A research paper and/or dataset      |
| 3      | DEAD      | Project is dead, no longer useful    |
| 4      | OTHER     | Uncertainty... what is this thing?   |

If you want to make corrections, please open the sheet, click on `File --> Make a Copy`, make any edits and then share the sheet with me.

In [2]:
readme_df = pd.read_parquet('../data/aws_github.parquet', engine='pyarrow')
readme_df.head()

id,full_name,html_url,description,readme,label
61861755,alexa/alexa-skills-kit-sdk-for-nodejs,https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Node.js helps you...,"<p align=""center"">\n <img src=""https://m.medi...",API
84138837,alexa/alexa-cookbook,https://github.com/alexa/alexa-cookbook,A series of sample code projects to be used fo...,\n# Alexa Skill Building Cookbook\n\n<div styl...,API
63275452,alexa/skill-sample-nodejs-fact,https://github.com/alexa/skill-sample-nodejs-fact,Build An Alexa Fact Skill,"# Build An Alexa Fact Skill\n<img src=""https:/...",API
81483877,alexa/avs-device-sdk,https://github.com/alexa/avs-device-sdk,An SDK for commercial device makers to integra...,### What is the Alexa Voice Service (AVS)?\n\n...,API
38904647,alexa/alexa-skills-kit-sdk-for-java,https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Java helps you ge...,"<p align=""center"">\n <img src=""https://m.medi...",API


## Profile the Data

Let's take a quick look at the labels to see what we'll be classifying.

In [5]:
print(f'Total records: {len(readme_df.index):,}')

readme_df['label'].value_counts()

Total records: 2,568


API         2265
GENERAL      279
DEAD          14
RESEARCH       9
OTHER          1
Name: label, dtype: int64

### How much general utility do Amazon's Github projects have?

One natural question is: how much general utility do Amazon's Github projects have? Let's compare the number of `GENERAL` purpose projects to the number of `API` projects.

In [26]:
api_count = readme_df[readme_df['label'] == 'API'].count(axis='index')['full_name']
general_count = readme_df[readme_df['label'] == 'GENERAL'].count(axis='index')['full_name']

general_pct = 100 * (general_count / (api_count + general_count))
api_pct     = 100 * (api_count / (api_count + general_count))

print(f'Percentage of projects having general utility:   {general_pct:,.3f}%')
print(f'Percentage of projects for Amazon products/APIs: {api_pct:,.3f}%')

Percentage of projects having general utility:   10.967%
Percentage of projects for Amazon products/APIs: 89.033%
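
The same arithmetic can be sketched without pandas, which also makes the class imbalance explicit. This is a minimal illustration using the two counts printed by `value_counts()` above. The variable names and the "always predict the most common label" baseline are assumptions added here, not part of the original analysis.

```python
from collections import Counter

# Label counts copied from the value_counts() output above
label_counts = Counter({"API": 2265, "GENERAL": 279})

total = sum(label_counts.values())
shares = {label: 100 * count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the most common label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = 100 * majority_count / total

for label, share in shares.items():
    print(f"{label}: {share:.3f}%")
print(f"Always-predict-{majority_label} baseline accuracy: {baseline_accuracy:.3f}%")
```

Because roughly nine in ten projects are `API`, accuracy alone says little about how well a classifier finds `GENERAL` projects.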


### Simplify to `API` vs `GENERAL`

We throw out `DEAD`, `RESEARCH` and `OTHER` to focus on `API` vs `GENERAL`: is an open source project of general utility, or is it a client to a company's commercial products?

In [7]:
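# Copy the filtered rows so that adding columns later does not trigger pandas' SettingWithCopyWarning
df = readme_df[readme_df['label'].isin(['API', 'GENERAL'])].copy()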

print(f'Total records with API/GENERAL labels: {len(df.index):,}')

df.head()

Total records with API/GENERAL labels: 2,544


id,full_name,html_url,description,readme,label
61861755,alexa/alexa-skills-kit-sdk-for-nodejs,https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Node.js helps you...,"<p align=""center"">\n <img src=""https://m.medi...",API
84138837,alexa/alexa-cookbook,https://github.com/alexa/alexa-cookbook,A series of sample code projects to be used fo...,\n# Alexa Skill Building Cookbook\n\n<div styl...,API
63275452,alexa/skill-sample-nodejs-fact,https://github.com/alexa/skill-sample-nodejs-fact,Build An Alexa Fact Skill,"# Build An Alexa Fact Skill\n<img src=""https:/...",API
81483877,alexa/avs-device-sdk,https://github.com/alexa/avs-device-sdk,An SDK for commercial device makers to integra...,### What is the Alexa Voice Service (AVS)?\n\n...,API
38904647,alexa/alexa-skills-kit-sdk-for-java,https://github.com/alexa/alexa-skills-kit-sdk-...,The Alexa Skills Kit SDK for Java helps you ge...,"<p align=""center"">\n <img src=""https://m.medi...",API


## Introducing Snorkel

### Defining Label Schema

The labels for this analysis are:

| Number | Code      | Description                       |
|--------|-----------|-----------------------------------|
| -1     | ABSTAIN   | No vote, for Labeling Functions   |
| 0      | GENERAL   | A FOSS project of general appeal  |
| 1      | API       | An API library for AWS            |
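
As a minimal sketch, the schema above can be written as integer constants, together with one labeling function that either votes or abstains. The constant names mirror the table. `LABEL_TO_INT` and `lf_mentions_sdk` are hypothetical names invented for illustration, and the keyword rule is an assumption rather than a rule used in this analysis.

```python
# Integer codes from the label schema table
ABSTAIN = -1
GENERAL = 0
API = 1

# Hypothetical mapping from the hand-labeled string codes to integers
LABEL_TO_INT = {"GENERAL": GENERAL, "API": API}


def lf_mentions_sdk(readme_text):
    """Hypothetical labeling function: vote API if the text mentions an SDK, else abstain."""
    return API if "sdk" in readme_text.lower() else ABSTAIN


print(LABEL_TO_INT["API"])
print(lf_mentions_sdk("The Alexa Skills Kit SDK for Node.js helps you"))
print(lf_mentions_sdk("A fast JSON parser"))
```

A plain function like this is enough to show the idea. In Snorkel itself, functions of this shape are registered with the `labeling_function` decorator from `snorkel.labeling`.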

In [None]:
df['readme_text'] = df['readme'].apply(utils.markdown_to_text)
df['readme_code'] = df['readme'].apply(utils.markdown_to_code)

df.head()

In [None]:
utils.markdown_to_text(df['readme'].iloc[0])
utils.markdown_to_code(df['readme'].iloc[0])

In [None]:
import io
import re

from bs4 import BeautifulSoup
from markdown import markdown


def markdown_to_code(markdown_text):
    """Extract source code (fenced blocks and inline snippets) from Markdown"""
    code_blocks = []
    code_snippets = []  # Inline snippets are combined into a single block

    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            # Collect lines until the closing fence; also stop at EOF so an
            # unclosed fence cannot loop forever
            code_block = []
            while True:
                block_line = f.readline()
                if not block_line or "```" in block_line:
                    break
                code_block.append(block_line)
            code_blocks.append("".join(code_block))
        else:
            # findall captures every inline snippet on the line, not just one
            code_snippets.extend(re.findall("`(.+?)`", line))

    # Now combine all snippets into one code block
    code_blocks.append(' '.join(code_snippets))

    return '\n'.join(code_blocks)


def markdown_to_text(markdown_text):
    """Extract plaintext - minus the code snippets - from Markdown as a list of text fragments"""
    text_blocks = []
    f = io.StringIO(markdown_text)
    while True:
        line = f.readline()
        if not line:
            # EOF
            break
        is_block = re.match("[^`]*```(.*)$", line)
        if is_block:
            # Skip the fenced block up to its closing fence; also stop at EOF
            # so an unclosed fence cannot loop forever
            while True:
                block_line = f.readline()
                if not block_line or "```" in block_line:
                    break
        else:
            # Drop inline code snippets, keep the surrounding prose
            text_blocks.append(re.sub("`(.+?)`", "", line))

    md = ''.join(text_blocks)
    html = markdown(md)
    soup = BeautifulSoup(html, 'lxml')
    # Keep every text node except bare newlines
    return [text for text in soup.find_all(string=True) if text != '\n']

print(df['readme'].iloc[6][1204:-1])

markdown_to_text(df['readme'].iloc[6])
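
For comparison, here is a compact alternative sketch that separates prose from code using only the standard library `re` module. It is not the implementation in `lib/utils`. The `split_markdown` name, the two regular expressions and the sample string are assumptions for illustration, and it does not render the Markdown to HTML or strip other markup.

```python
import re

# Fenced block: opening fence with optional language tag, body, closing fence
FENCE_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)
# Inline snippet: text between single backticks on one line
INLINE_RE = re.compile(r"`([^`\n]+)`")


def split_markdown(markdown_text):
    """Return (prose, code): prose with code removed, and all code joined by newlines."""
    code_parts = FENCE_RE.findall(markdown_text)
    prose = FENCE_RE.sub("", markdown_text)
    # Handle inline snippets only after the fenced blocks are gone
    code_parts.extend(INLINE_RE.findall(prose))
    prose = INLINE_RE.sub("", prose)
    return prose, "\n".join(code_parts)


sample = "Install with `pip install foo`.\n\n```python\nimport foo\n```\n\nDone.\n"
prose, code = split_markdown(sample)
print(repr(prose))
print(repr(code))
```

Working on the whole string at once avoids the line-by-line bookkeeping, at the cost of assuming every fenced block is properly closed.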