# Welcome to the Iris Data Cleaning Sandbox!

This notebook gives you a framework for examining data. It provides some starter functions along with a guide to the things you should look for during your research. Duplicate this notebook wherever you need it.

In [1]:
# You probably won't need much more than this
import json
import os
import re
import requests
import time
from bs4 import BeautifulSoup
from datetime import datetime

# Optional statistics
import numpy as np
import seaborn as sns
import statistics
import matplotlib.pyplot as plt
from scipy import stats

#THE BELOW IS ONLY FOR KANYES BECAUSE HIS PYTHONPATHS ARE ALL BROKEN
import sys
sys.path.extend(['', '/Users/kanyes/miniconda3/lib/python37.zip', '/Users/kanyes/miniconda3/lib/python3.7', '/Users/kanyes/miniconda3/lib/python3.7/lib-dynload', '/Users/kanyes/.local/lib/python3.7/site-packages', '/Users/kanyes/miniconda3/lib/python3.7/site-packages', '/Users/kanyes/miniconda3/lib/python3.7/site-packages/planetrl-1.0.0-py3.7.egg', '/Users/kanyes/miniconda3/lib/python3.7/site-packages/tensorflow_probability-0.6.0-py3.7.egg'])

In [4]:
# HERE'S WHERE YOU CAN DO ALL OF YOUR WORK!
# USE THE SIMPLE FUNCTIONS IN THE LATER SECTIONS IF YOU WANT
BCOURSES_TOKEN = "1072~gOcNUfmWGmZSmaI51e3qpPoOvwqjeIK65lk1SdMDznH7tQKHkHLgmGBvEcC9gDko"
data = requests.get('https://canvas.instructure.com/api/v1/courses', 
#                     params = {
# #                         'type':'assignment',
#                         'start_date':'2017-08-01',
#                         'end_date': '2020-08-01'
#                     },
                    headers={
                        'Authorization':f'Bearer {BCOURSES_TOKEN}'
                    }).json()

# print(json.dumps(data, indent=4))
call = requests.get("https://bcourses.berkeley.edu/feeds/calendars/course_gEPZNIYldwd8lINjXt6byN9O1Le5OpBY6psucsid.ics", headers={
                        'Authorization':f'Bearer {BCOURSES_TOKEN}'
                    }, allow_redirects=True)
print(call.text)

BEGIN:VCALENDAR
VERSION:2.0
PRODID:icalendar-ruby
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:Avant-Garde Film (Fall 2019) Calendar (Canvas)
X-WR-CALDESC:Calendar events for the course\, Avant-Garde Film (Fall 2019)
BEGIN:VEVENT
DTSTAMP:20191216T193600Z
UID:event-assignment-8009194
DTSTART;VALUE=DATE:20190910T000000
DTEND;VALUE=DATE:20190910T000000
CLASS:PUBLIC
SEQUENCE:0
SUMMARY:Thought Piece #1 [FILM 129 - LEC 001]
URL:https://bcourses.berkeley.edu/calendar?include_contexts=course_1484507&
 month=09&year=2019#assignment_8009194
END:VEVENT
BEGIN:VEVENT
DTSTAMP:20191216T200800Z
UID:event-assignment-8009199
DTSTART;VALUE=DATE:20190917T000000
DTEND;VALUE=DATE:20190917T000000
CLASS:PUBLIC
SEQUENCE:0
SUMMARY:Thought Piece #2 [FILM 129 - LEC 001]
URL:https://bcourses.berkeley.edu/calendar?include_contexts=course_1484507&
 month=09&year=2019#assignment_8009199
END:VEVENT
BEGIN:VEVENT
DTSTAMP:20191216T234300Z
UID:event-assignment-8009200
DTSTART;VALUE=DATE:2

# Appendix: Data Cleaning Guide and Functions
## Retrieving Data from a Data Source

Most of the data you work with will be in JSON format. This kind of data is usually pretty easy to work with -- it's easy to read, neatly nested, and you've seen it everywhere. Here's a brief script that gives you a framework for extracting a JSON object from an API response.

In [None]:
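def get_api_json(url, params=None):
    """Return the JSON object retrieved from a url with optional parameters"""
    try:
        obj = requests.get(url, params=params)
    except requests.exceptions.RequestException as e:
        # Stop with the error message rather than returning an exception object
        raise SystemExit(e)

    try:
        data = obj.json()
    except ValueError:
        # Fall back to parsing the raw body; `text` is an attribute, not a method
        data = json.loads(obj.text)

    return data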
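
To see what `params` does before hitting a real endpoint, you can ask `requests` to build a request without sending it. This is only a minimal sketch: the endpoint `https://example.com/api/recipes` and the parameter names `q` and `limit` are made up for illustration.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
url = 'https://example.com/api/recipes'
params = {'q': 'soup', 'limit': 5}

# Build the request without sending it, so we can inspect the final URL
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
```

Passing the same `url` and `params` to `get_api_json` would request that URL.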

Sometimes the data you're looking for will be HTML data. This is true when we're dealing with a plain webscrape as opposed to an official or external-facing API. You shouldn't encounter these too often.

In [None]:
def get_api_html(url):
    """Return a BeautifulSoup object representing the HTML of a website"""
    try:
        obj = requests.get(url)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
    
    soup = BeautifulSoup(obj.content, 'html.parser')
    return soup

For `BeautifulSoup` you only really need to be concerned with one method -- `find_all`, which takes in a string or a regular expression and returns a list of all matches. For instance, `soup.find_all('h3')` returns a list of all `h3` tags in the HTML. Additionally, `soup.find_all(class_='desc_wrap_ck3')` returns a list of all items that match that `class`.
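
The `find_all` calls above can be tried on a small inline document without scraping anything. This is a rough sketch: the HTML snippet and its contents are made up, and only the class name `desc_wrap_ck3` comes from the example above.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only
html = """
<div>
  <h2>Menu</h2>
  <h3>Soup</h3>
  <p class="desc_wrap_ck3">Tomato</p>
  <h3>Salad</h3>
  <p class="desc_wrap_ck3">Caesar</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Match by tag name (string)
h3_titles = [tag.get_text() for tag in soup.find_all('h3')]

# Match tag names with a regular expression (any heading level)
headings = [tag.name for tag in soup.find_all(re.compile(r'^h[1-6]$'))]

# Match by CSS class
descriptions = [tag.get_text() for tag in soup.find_all(class_='desc_wrap_ck3')]

print(h3_titles, headings, descriptions)
```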

## REGular EXpressions

A lot of times you'll need regex to filter out API results and clean entries dynamically. Regex is supplied to Python via the `re` module. A guide to regular expressions can be found at https://www.regular-expressions.info/tutorial.html (the Bible for regex). If you need help with a particular regex just message Kanyes :)

The function you will use most is `re.sub`, which comes in two forms -- `re.sub(pattern, str or func, target)` searches for a pattern within a target string and replaces each match, either with a static string or with the return value of a function that takes the match object as an argument.

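For example, `re.sub(r'\(.*\)|\{.*\}', "", "hello, (world)")` strips out anything in parentheses or curly braces from the target string, leaving `"hello, "`. Note that `.*` is greedy, so everything from the first opening delimiter to the last closing one is removed.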

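Likewise, `re.sub(r'([^0-9])', lambda rgx: f"{rgx.group(1)}", "hello 123 world")` matches every character that isn't a digit (`[^0-9]`), passes each match to the function, and substitutes the function's return value back into the target string. Because this function returns the matched character unchanged, the result is still `"hello 123 world"`; to remove the digits so that `"hello 123 world"` becomes `"hello  world"`, use `re.sub(r'[0-9]', "", "hello 123 world")`.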
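
Here is a short sketch of both forms of `re.sub` on throwaway strings, along with the greedy versus non-greedy difference. The sample strings and the doubling function are only illustrations.

```python
import re

# Static replacement: drop anything in parentheses or curly braces
stripped = re.sub(r'\(.*\)|\{.*\}', "", "hello, (world)")

# Static replacement: drop every digit
no_digits = re.sub(r'[0-9]', "", "hello 123 world")

# Function replacement: the function receives the match object
doubled = re.sub(r'[0-9]+', lambda rgx: str(int(rgx.group(0)) * 2), "hello 123 world")

# Greedy `.*` spans from the first "(" to the last ")";
# non-greedy `.*?` stops at the nearest ")"
greedy = re.sub(r'\(.*\)', "", "a (b) c (d) e")
lazy = re.sub(r'\(.*?\)', "", "a (b) c (d) e")

print(repr(stripped), repr(no_digits), repr(doubled), repr(greedy), repr(lazy))
```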

## Filtering and Modifying an API response

Sometimes, the data within an API response isn't good enough itself -- we might need to add a temporary field in real-time as we're processing an API result. For instance, in the Yummly 28k dataset, we didn't have a flag for diet. While we're examining data and checking if it's good enough for a card, we need a way to create this new diet field within the response temporarily. The following functions allow us to do some basic filtering and modifying of a response.

This is most useful when the API returns a collection or list of objects rather than just 1 object -- for instance, a recipes endpoint which returns 100 recipes or a concerts endpoint that returns 20 concerts.

Adding a temp field is simple enough -- just do something like:
```
for result in data['results']:
    result['new_field'] = function_or_value()
```

Removing a field is also easy; use
```
del data['field_to_be_removed']
```
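
As a rough sketch of both operations on a list response, the snippet below adds a temporary `diet` field to each result and drops a field from each one. The sample data, the `MEATS` set, the `debug_id` field, and the `guess_diet` helper are all hypothetical; real diet detection would need far more than a three-word list.

```python
# Made-up response, for illustration only
data = {
    'results': [
        {'title': 'Tofu stir fry', 'ingredients': ['tofu', 'rice'], 'debug_id': 1},
        {'title': 'Beef stew', 'ingredients': ['beef', 'carrot'], 'debug_id': 2},
    ]
}

MEATS = {'beef', 'chicken', 'pork'}  # hypothetical, deliberately tiny list

def guess_diet(recipe):
    """Return a rough diet label based on the ingredient list"""
    has_meat = any(ingredient in MEATS for ingredient in recipe['ingredients'])
    return 'omnivore' if has_meat else 'vegetarian'

for result in data['results']:
    result['diet'] = guess_diet(result)  # add the temporary field
    result.pop('debug_id', None)         # remove a field; no error if it is missing

diets = [result['diet'] for result in data['results']]
print(diets)
```

Using `pop(key, None)` instead of `del` avoids a `KeyError` when some results lack the field.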

In [None]:
def extract_field_from_list(l, field_name):
    """Given a list of dicts, extract a field from each dict"""
    return [e[field_name] for e in l]

def filter_list(l, boolexpr):
    """Filter a list based on a boolean expression"""
    return [e for e in l if boolexpr(e)]

def clean_list(l, regexpr):
    """Given a regex lambda function, clean a particular field in a list of dicts"""
    return [regexpr(e) for e in l]

# Example usage: 
# extract_field_from_list(data['recipes'], 'title')
# filter_list(data['recipes'], lambda recipe: recipe['rating'] > 3)
# clean_list(data['recipes'], lambda recipe: re.sub(r'[^A-Za-z ]', '', recipe['title']))  # keep only letters and spaces
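
The three helpers chain together naturally: filter first, then extract, then clean. This sketch runs them on made-up recipe data; the titles, ratings, and the parenthesis-stripping regex are assumptions chosen for illustration.

```python
import re

# Helper definitions repeated from the cell above so this snippet runs on its own
def extract_field_from_list(l, field_name):
    return [e[field_name] for e in l]

def filter_list(l, boolexpr):
    return [e for e in l if boolexpr(e)]

def clean_list(l, regexpr):
    return [regexpr(e) for e in l]

# Made-up sample data, for illustration only
data = {'recipes': [
    {'title': 'Tofu Stir Fry (v2)', 'rating': 4},
    {'title': 'Beef Stew (draft)', 'rating': 2},
    {'title': 'Lentil Soup', 'rating': 5},
]}

# Keep well-rated recipes, pull out their titles, then strip parenthesized notes
good = filter_list(data['recipes'], lambda recipe: recipe['rating'] > 3)
titles = extract_field_from_list(good, 'title')
clean_titles = clean_list(titles, lambda title: re.sub(r'\s*\(.*?\)', '', title))

print(clean_titles)
```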

## Analytics and Information
Here are some basic functions that allow you to see the distribution of a certain field in the data. This can be helpful in testing how diverse an API response is, or testing the quality of the response.

In [None]:
def hist(vals):
    """Display a histogram for a list `vals` """
    mean_val = statistics.mean(vals)
    print(f"Mean value: {mean_val}")
    sns.set(color_codes=True)
    sns.histplot(vals, kde=True)  # distplot is deprecated; histplot needs seaborn >= 0.11
    plt.show()

def count(vals):
    """Display a count plot for a list `vals` (categorical data)"""
    sns.set(color_codes=True)
    sns.countplot(x=vals)
    plt.show()

def percentile(l, item):
    """find the percentile of an item based on a list"""
    return stats.percentileofscore(l, item)
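
When a plot is more than you need, a plain-text summary can answer the same "how diverse is this response?" question. The sketch below uses only the standard library; `summarize` and `category_shares` are hypothetical helper names, and the sample values are made up.

```python
import statistics
from collections import Counter

def summarize(vals):
    """Return basic summary statistics for a list of numbers"""
    return {
        'count': len(vals),
        'mean': statistics.mean(vals),
        'median': statistics.median(vals),
        # stdev needs at least two data points
        'stdev': statistics.stdev(vals) if len(vals) > 1 else 0.0,
    }

def category_shares(vals):
    """Return each category's share of the list, most common first"""
    counts = Counter(vals)
    total = len(vals)
    return {category: n / total for category, n in counts.most_common()}

# Made-up sample values, for illustration only
ratings = [3, 4, 4, 5, 2]
diets = ['vegan', 'vegetarian', 'vegan', 'omnivore']

summary = summarize(ratings)
shares = category_shares(diets)
print(summary)
print(shares)
```

A response where one category holds nearly all of the share is probably not diverse enough to be useful.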

## Save and Load
Sometimes you may want to save an API response after you've modified it or load it again later. These functions let you do that.

In [None]:
def save(filename, data):
    """Save `data` as JSON to `filename` (relative paths resolve to the current directory)"""
    try:
        # Serialize first so a failure never leaves a half-written file
        text = json.dumps(data, indent=4)
    except (TypeError, ValueError):
        print("Invalid data type")
        return

    # 'w' creates the file if needed and truncates any old contents
    with open(filename, 'w') as f:
        f.write(text)

def load(filename):
    """Load JSON from `filename`, falling back to the raw text if it is not valid JSON"""
    with open(filename, 'r') as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text
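
One thing to keep in mind when saving and reloading: JSON does not preserve every Python type. The sketch below does a round trip through a temporary file using `json.dump` and `json.load` directly. The file name and sample data are made up for illustration.

```python
import json
import os
import tempfile

# Made-up data, for illustration only
recipes = {'results': [{'title': 'Tofu stir fry', 'rating': 4}]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'recipes.json')
    with open(path, 'w') as f:
        json.dump(recipes, f, indent=4)
    with open(path) as f:
        restored = json.load(f)

# Plain dicts, lists, strings, and numbers survive the round trip unchanged
print(restored == recipes)

# Tuples come back as lists, and non-string keys come back as strings
lossy = json.loads(json.dumps({1: (2, 3)}))
print(lossy)
```

If you rely on tuples or integer keys, convert them back after loading.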