# Welcome to the Iris Data Cleaning Sandbox!

This notebook gives you a framework for examining data: some starter functions, along with a guide to the things you should look for during your research. Duplicate this notebook wherever you need it.

In [4]:
# You probably won't need much more than this
import json
import os
import re
import requests
import time
from bs4 import BeautifulSoup
from datetime import datetime

# Optional statistics
import numpy as np
import seaborn as sns
import statistics
import matplotlib.pyplot as plt
from scipy import stats

In [19]:
# HERE'S WHERE YOU CAN DO ALL OF YOUR WORK!
# USE THE SIMPLE FUNCTIONS IN THE LATER SECTIONS IF YOU WANT

page_content = requests.get("https://www.wenxuecity.com/", verify=False).content
soup = BeautifulSoup(page_content, 'html.parser')
results = soup.select("div div ul li")
print(results)



[<li class="selected"><a href="//www.wenxuecity.com/">首页</a></li>, <li><a href="//www.wenxuecity.com/news/">新闻</a></li>, <li><a href="//www.wenxuecity.com/news/photo/?utm_source=wxc&amp;utm_medium=navi&amp;utm_campaign=05242016">读图</a></li>, <li><a href="//bbs.wenxuecity.com/financenews/">财经</a></li>, <li><a href="//www.wenxuecity.com/edu/">教育</a></li>, <li><a href="//www.wenxuecity.com/home/">家居</a></li>, <li><a href="//www.wenxuecity.com/health/">健康</a></li>, <li><a href="//www.wenxuecity.com/cooking/">美食</a></li>, <li><a href="//www.wenxuecity.com/style/">时尚</a></li>, <li><a href="//www.wenxuecity.com/travel/">旅游</a></li>, <li><a href="//www.wenxuecity.com/video/">影视</a></li>, <li><a href="//blog.wenxuecity.com/">博客</a></li>, <li><a href="//groups.wenxuecity.com/">群组</a></li>, <li><a href="/yoyo/?act=list">悠游</a></li>, <li class="last"><a href="//bbs.wenxuecity.com/">论坛</a></li>, <li>
<a href="/news/2020/08/27/9808438.html">
                        他当面告诉王毅：中国的胁迫外交对加拿大无效            

In [20]:
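headlines = []  # Tuples in format ("This is a headline", "/this/is/a/URL.html")
for result in results:
    link = result.find('a')
    if link is None or not link.has_attr('href'):
        continue  # Skip list items that have no usable link
    headline = link.get_text().replace("\r\n", "").strip()
    headlines.append((headline, link['href']))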

In [21]:
headlines

[('首页', '//www.wenxuecity.com/'),
 ('新闻', '//www.wenxuecity.com/news/'),
 ('读图',
  '//www.wenxuecity.com/news/photo/?utm_source=wxc&utm_medium=navi&utm_campaign=05242016'),
 ('财经', '//bbs.wenxuecity.com/financenews/'),
 ('教育', '//www.wenxuecity.com/edu/'),
 ('家居', '//www.wenxuecity.com/home/'),
 ('健康', '//www.wenxuecity.com/health/'),
 ('美食', '//www.wenxuecity.com/cooking/'),
 ('时尚', '//www.wenxuecity.com/style/'),
 ('旅游', '//www.wenxuecity.com/travel/'),
 ('影视', '//www.wenxuecity.com/video/'),
 ('博客', '//blog.wenxuecity.com/'),
 ('群组', '//groups.wenxuecity.com/'),
 ('悠游', '/yoyo/?act=list'),
 ('论坛', '//bbs.wenxuecity.com/'),
 ('他当面告诉王毅：中国的胁迫外交对加拿大无效', '/news/2020/08/27/9808438.html'),
 ('便宜！Ann Taylor全网高达5折, 打折区额外70% off!', 'https://home.dealsaving.com/'),
 ('赶快申请蓝钻卡，开户送$750，旅游餐饮返2%!',
  'https://www.cardbenefit.com/search-credit-cards/?cardChoice=creditRating&ccRating=good-credit&introid=sapphire&show=42&src=wxc1'),
 ('“香港人”若获诺贝尔和平奖？ 王毅：勿干涉中国内政', '/news/2020/08/27/9808435.html'),
 ('

In [24]:
def get_author(url):
    page_content = requests.get(url, verify=False).content # fetch without verifying the SSL certificate
    soup = BeautifulSoup(page_content, 'html.parser') # bs4's built-in parser; works with Chinese characters
    author = soup.find('span', itemprop="author") # Grabs the author tag (None if the page has none)
    return author

authors = [] # Author names for the on-site headlines
for headline in headlines: # Tuples in format ("This is a headline", "/this/is/a/URL.html")
    if headline[1].find('.com') == -1: # If not an ad/external link
        url = "https://www.wenxuecity.com" + headline[1] # headline[1] already starts with "/"
        author = get_author(url)
        if author is not None:
            authors.append(author.text)
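
The hrefs scraped above come in three shapes: protocol-relative (`//bbs.wenxuecity.com/`), root-relative (`/news/2020/08/27/9808438.html`), and absolute (the ad links). As a minimal sketch, assuming the standard library's `urllib.parse` is acceptable here, `urljoin` resolves all three against the site root, and comparing hosts is a stricter on-site test than searching for `.com`. `BASE_URL`, `resolve`, and `is_on_site` are hypothetical names introduced for illustration; they are not part of this notebook.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.wenxuecity.com/"  # assumed site root

def resolve(href, base=BASE_URL):
    """Turn a scraped href into an absolute URL."""
    return urljoin(base, href)

def is_on_site(href, base=BASE_URL):
    """True when the resolved URL points at the same host as `base`."""
    return urlparse(resolve(href, base)).netloc == urlparse(base).netloc

# hrefs copied from the scraped output above
samples = [
    "//bbs.wenxuecity.com/",
    "/news/2020/08/27/9808438.html",
    "https://home.dealsaving.com/",
]
for href in samples:
    print(resolve(href), is_on_site(href))
```

Under this check, subdomains such as `bbs.wenxuecity.com` count as off-site; loosen the comparison if you want to follow those links too.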


# Appendix: Data Cleaning Guide and Functions
## Retrieving Data from a Data Source

Most of the data you work with will be in JSON format. This kind of data is usually easy to work with: it's readable, neatly nested, and you've seen it everywhere. Here's a brief function that gives you a framework for extracting a JSON object from an API response.

In [5]:
def get_api_json(url, params=None):
    """Return the JSON object retrieved from a url with optional parameters"""
    try:
        obj = requests.get(url, params=params)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)  # raise (not return) so a failure isn't mistaken for data

    try:
        data = obj.json()
    except ValueError:
        # .text is a property, not a method; this raises again if the body isn't JSON
        data = json.loads(obj.text)

    return data

Sometimes the data you're looking for will be HTML data. This is true when we're dealing with a plain webscrape as opposed to an official or external-facing API. You shouldn't encounter these too often.

In [6]:
def get_api_html(url):
    """Return a BeautifulSoup object representing the HTML of a website"""
    try:
        obj = requests.get(url)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)  # raise (not return) so a failure isn't mistaken for a soup object

    soup = BeautifulSoup(obj.content, 'html.parser')
    return soup

For `BeautifulSoup` you only really need to be concerned with one method -- `find_all`, which takes a tag name (a string) or a compiled regular expression and returns a list of all matches. For instance, `soup.find_all('h3')` returns a list of all `h3` tags in the HTML. Additionally, `soup.find_all(class_='desc_wrap_ck3')` returns a list of all elements whose `class` attribute matches.

## REGular EXpressions

You'll often need regex to filter API results and clean entries dynamically. Python provides regex through the built-in `re` module. A guide to regular expressions can be found at https://www.regular-expressions.info/tutorial.html (the Bible for regex). If you need help with a particular regex just message Kanyes :)

The function you should know how to use in particular is `re.sub(pattern, str or func, target)`. It searches for a pattern within a target string and replaces each match, either with a static string or with the return value of a function that takes the match object as its argument.

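For example, `re.sub(r'\(.*\)|\{.*\}', "", "hello, (world)")` will strip out anything in parentheses or curly braces from the target string, leaving `"hello, "`. The `.*` is greedy, so everything from the first opening delimiter to the last closing one is removed.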

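Likewise, `re.sub(r'[0-9]', "", "hello 123 world")` removes every digit from the target string, so `"hello 123 world"` becomes `"hello  world"`. Note that the negated class `[^0-9]` matches anything that *isn't* a digit, and that a replacement function such as `lambda rgx: f"{rgx.group(1)}"` simply puts each match back unchanged. The function form is only worth using when the replacement depends on what was matched.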
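
To make the two replacement styles concrete, here is a small sketch using only the standard `re` module. The `double` helper and the sample strings are made up for illustration.

```python
import re

# Static replacement: drop anything in parentheses or curly braces
stripped = re.sub(r'\(.*\)|\{.*\}', "", "hello, (world)")

# Static replacement: remove every digit
no_digits = re.sub(r'[0-9]', "", "hello 123 world")

# Function replacement: the callable receives a match object and
# returns the string that should replace that match
def double(match):
    return str(int(match.group(1)) * 2)

doubled = re.sub(r'([0-9]+)', double, "hello 123 world")

print(repr(stripped))   # repr() makes trailing/double spaces visible
print(repr(no_digits))
print(repr(doubled))
```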

## Filtering and Modifying an API response

Sometimes the data within an API response isn't enough on its own -- we might need to add a temporary field in real time as we're processing an API result. For instance, the Yummly 28k dataset had no flag for diet. While we're examining data and checking if it's good enough for a card, we need a way to create this new diet field within the response temporarily. The following functions allow us to do some basic filtering and modifying of a response.

This is most useful when the API returns a collection or list of objects rather than just 1 object -- for instance, a recipes endpoint which returns 100 recipes or a concerts endpoint that returns 20 concerts.

Adding a temp field is simple enough -- just do something like:
```
for result in data['results']:
    result['new_field'] = function_or_value()
```

Removing a field is also easy; use
```
del data['field_to_be_removed']
```
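
As a worked sketch of both operations on a list-style response: the payload, the `MEATS` set, and the diet rule below are all invented for illustration and are not how any real dataset defines diet.

```python
# Made-up payload standing in for an API response
data = {
    "results": [
        {"title": "Lentil Soup", "ingredients": ["lentils", "carrot"], "source_id": 1},
        {"title": "Beef Stew", "ingredients": ["beef", "potato"], "source_id": 2},
    ]
}

MEATS = {"beef", "chicken", "pork"}  # illustrative only

for result in data["results"]:
    # Add a temporary field derived from fields that already exist
    has_meat = bool(MEATS & set(result["ingredients"]))
    result["diet"] = "omnivore" if has_meat else "vegetarian"
    # Remove a field from each result; pop() with a default
    # does not raise if the key is already missing
    result.pop("source_id", None)

print(data["results"])
```

Using `pop(key, None)` instead of `del` is a design choice for messy responses where some objects may lack the field.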

In [7]:
def extract_field_from_list(l, field_name):
    """Given a list of dicts, extract a field from each dict"""
    return [e[field_name] for e in l]

def filter_list(l, boolexpr):
    """Filter a list based on a boolean expression"""
    return [e for e in l if boolexpr(e)]

def clean_list(l, regexpr):
    """Given a regex lambda function, clean a particular field in a list of dicts"""
    return [regexpr(e) for e in l]

# Example usage:
# extract_field_from_list(data['recipes'], 'title')
# filter_list(data['recipes'], lambda recipe: recipe['rating'] > 3)
# clean_list(data['recipes'], lambda recipe: re.sub(r'[^A-Za-z ]', '', recipe['title']))  # keep only letters and spaces
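
Here is a runnable sketch that chains the three helpers on a tiny made-up recipe list. The helpers are repeated so the block runs on its own; the sample data and the cleaning regex are assumptions for illustration.

```python
import re

def extract_field_from_list(l, field_name):
    """Given a list of dicts, extract a field from each dict"""
    return [e[field_name] for e in l]

def filter_list(l, boolexpr):
    """Filter a list based on a boolean expression"""
    return [e for e in l if boolexpr(e)]

def clean_list(l, regexpr):
    """Apply a cleaning function to every element of a list"""
    return [regexpr(e) for e in l]

# Made-up records standing in for data['recipes']
recipes = [
    {"title": "Pad Thai #1!", "rating": 4.5},
    {"title": "Plain Toast", "rating": 2.0},
    {"title": "Chili (2)", "rating": 3.8},
]

good = filter_list(recipes, lambda recipe: recipe["rating"] > 3)
titles = extract_field_from_list(good, "title")
# Keep only letters and spaces, then trim leftover whitespace
clean_titles = clean_list(titles, lambda title: re.sub(r"[^A-Za-z ]", "", title).strip())

print(clean_titles)
```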

## Analytics and Information
Here are some basic functions that allow you to see the distribution of a certain field in the data. This can be helpful in testing how diverse an API response is, or testing the quality of the response.

In [8]:
def hist(vals):
    """Display a histogram for a list `vals` """
    mean_val = statistics.mean(vals)
    print(f"Mean value: {mean_val}")
    sns.set(color_codes=True)
    sns.histplot(vals, kde=True)  # replaces the deprecated sns.distplot
    plt.show()

def count(vals):
    """Display a count plot for a list `vals` (categorical data)"""
    sns.set(color_codes=True)
    sns.countplot(x=vals)  # pass the values by keyword
    plt.show()

def percentile(l, item):
    """Find the percentile of an item based on a list"""
    return stats.percentileofscore(l, item)
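
When you only want a quick look at a distribution without opening a plot, a text-only check can be enough. The sketch below swaps the plotting functions for the standard library's `collections.Counter` and `statistics`; the `cuisines` and `ratings` lists are made-up sample fields.

```python
from collections import Counter
import statistics

cuisines = ["thai", "italian", "thai", "mexican", "thai", "italian"]  # made-up categorical field
ratings = [4.5, 2.0, 3.8, 4.1, 3.0]  # made-up numeric field

# Categorical data: how often does each value occur?
counts = Counter(cuisines)
for value, n in counts.most_common():
    print(f"{value:10s} {n:3d} {'#' * n}")

# Numeric data: a few summary statistics
summary = {
    "mean": statistics.mean(ratings),
    "median": statistics.median(ratings),
    "stdev": statistics.stdev(ratings),
}
print(summary)
```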

## Save and Load
Sometimes you may want to save an API response after you've modified it or load it again later. These functions let you do that.

In [9]:
def save(filename, data):
    """Save `data` as JSON to `filename` (relative to the current directory)"""
    try:
        text = json.dumps(data, indent=4)
    except (TypeError, ValueError):
        # Not JSON-serializable, so there is nothing sensible to write
        print("Invalid data type")
        return

    # 'w' creates the file if needed and truncates any previous contents
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)

def load(filename):
    """Load a file, parsed as JSON if possible, otherwise as raw text"""
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text
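
A quick way to convince yourself that nothing is lost is a save-then-load round trip. The sketch below uses `json` directly inside a temporary directory so it leaves no files behind; the filename and sample data are placeholders, and `ensure_ascii=False` is an optional choice that keeps Chinese text readable in the saved file.

```python
import json
import os
import tempfile

# Made-up data; lists are used because JSON has no tuple type
data = {"headlines": [["首页", "//www.wenxuecity.com/"]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "response.json")  # hypothetical filename

    # Write: mode 'w' creates the file or truncates an existing one
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

    # Read back what was just written
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == data)
```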