# Section 1: Introduction to Fake News Classification


In [None]:
# Run this to load data 
import os
from bs4 import BeautifulSoup as bs
import pickle
  
import requests
import zipfile
import io

# Download class resources...
r = requests.get("https://www.dropbox.com/s/2pj07qip0ei09xt/inspirit_fake_news_resources.zip?dl=1")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

basepath = '.'

## Anatomy of a (Fake) News Website

Have you ever wondered how websites like *google.com* and *nytimes.com* work under the hood? When you use the internet every day, it is easy to forget how magical even the most mundane web browsing experiences are. Consider, for example, this article on the New York Times:

![NYTimes Article](https://www.niemanlab.org/images/ochs-nytimes-article-page.png)


How does the browser know to show the title of the article near the top of the page? How does it know that the words "Art & Design" should be left-aligned and gray? How does it know where to find the image to display?

All of these questions can be answered by probing through the HTML of a webpage. HTML is a simple markup language that augments text with the structure you'd expect from a webpage. It's the language that provides the structure for every webpage you see. Here's an example of an HTML document for a simple webpage.

![HTML Example](http://www.goodellgroup.com/tutorial/wpimages/wpccea2291_05_06.jpg)

### HTML in a Nutshell

HTML is the standard markup language for creating Web pages.
* HTML stands for Hyper Text Markup Language
* HTML describes the structure of Web pages using markup
* HTML elements are the building blocks of HTML pages
* HTML elements are represented by tags
* HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
* Browsers do not display the HTML tags, but use them to render the content of the page
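
To make the idea of tags labeling content concrete, here is a minimal sketch that walks a tiny hand-written HTML snippet and records which tag wraps each piece of text. It uses Python's standard-library `html.parser` instead of the BeautifulSoup import used elsewhere in this notebook, and the `TagCollector` class and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical toy page, written by hand for this example.
SAMPLE_HTML = "<html><body><h1>Breaking News</h1><p>Something happened today.</p></body></html>"

class TagCollector(HTMLParser):
    """Records (tag, text) pairs for every piece of text inside a tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of tags that are currently open
        self.labeled_text = []   # (innermost open tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if data.strip() and self.open_tags:
            self.labeled_text.append((self.open_tags[-1], data.strip()))

collector = TagCollector()
collector.feed(SAMPLE_HTML)
collector.close()

for tag, text in collector.labeled_text:
    print(f"<{tag}> labels: {text}")
```

The browser does something similar in spirit: it reads the tags, never shows them, and uses them to decide that "Breaking News" is a heading and the sentence after it is a paragraph.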




## Problem Statement

**Given the URL of a news website and its HTML, can we classify the news website as either fake or real?** 

## Task 1 | Exploring the Data 

### Dataset 

Load the training and validation data in the cell below:


In [None]:
with open(os.path.join(basepath, 'train_val_data.pkl'), 'rb') as f:
    train_data, val_data = pickle.load(f)

print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))

# Label 1 means fake, label 0 means real.
print('Fraction of train examples that are fake:', len([datapoint for datapoint in train_data if datapoint[2] == 1]) / float(len(train_data)))
print('Fraction of val examples that are fake:', len([datapoint for datapoint in val_data if datapoint[2] == 1]) / float(len(val_data)))

The printed counts show the size of each portion of the data, and each portion has roughly 50% fake news websites. Next, let's explore what each data point looks like.

### Changing The Example Index

Spend ~15 minutes browsing through the data by changing example_idx below. You are able to see the URL, label (0 is real, 1 is fake), and part of the HTML for an example.

Observe that each data point has three values: the URL, the HTML, and the binary (0 or 1) label. A label of "1" indicates that the website is a fake news website, and a label of "0" indicates that the website does not have fake news. See if you can spot some differences between examples with label 0 and examples with label 1, especially in their URLs! The HTML may be a bit difficult to read, since it is much longer, so don't worry about this.
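
As a small illustration of this three-value layout, the sketch below builds two made-up data points and unpacks them. The URLs, HTML strings, and the `toy_data` name are hypothetical and are not taken from the real dataset.

```python
# Hypothetical data points in the same (URL, HTML, label) layout as train_data.
toy_data = [
    ("examplenews.com", "<html><body><p>City council meets.</p></body></html>", 0),
    ("totally-real-news.example.co", "<html><body><p>You won't believe this!</p></body></html>", 1),
]

for url, html, label in toy_data:
    kind = "fake" if label == 1 else "real"
    print(f"{url}: {kind} ({len(html)} characters of HTML)")

# Count how many of the toy data points carry the fake label (1).
fake_count = sum(1 for _, _, label in toy_data if label == 1)
print("Fraction fake:", fake_count / len(toy_data))
```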

In [None]:
### YOUR CODE HERE ###
example_idx = int(input("example_idx: "))
### END CODE HERE ###

print('Number of values per data point: %d\n' % len(train_data[0]))

print('URL for chosen example:', train_data[example_idx][0])
print('Label for chosen example:', train_data[example_idx][2])
# Name the parser explicitly so BeautifulSoup does not have to guess one.
print('HTML for chosen example (first 5000 chars):\n\n', bs(train_data[example_idx][1], 'html.parser').prettify()[:5000])

## Task 3 | Fake vs Real Fraction


### Probing Hypotheses

Browsing through the examples above, you might have gotten a few ideas for differences between real and fake news websites. For instance, you might have noticed that many fake news websites use domain name extensions other than ".com", whereas this is less common for real news websites. So a possible hypothesis could be: 
 
#### Websites with .com extensions are more likely to be real news.




### Real Fraction 

One simple way to quantify our observation is to see what percentage of real news websites use a certain extension (.com, .org, etc.). We can call this number the Real Fraction.


### Fake Fraction

Likewise, we can find what percentage of fake news websites use a certain extension (.com, .org, etc.). We can call this number the Fake Fraction.

### Fake vs Real Ratio

How do we use the Fake Fraction and Real Fraction to test our hypothesis? We could divide them to form a ratio, which we can call the Fake vs Real Ratio.

For the .com extension, the Fake vs Real Ratio would be as follows.

#### (.com) Fake vs Real Ratio = Fraction of Fake Sites w/ (.com) / Fraction of Real Sites w/ (.com)



### Interpreting Ratios

* If the ratio is less than 1, then we have reason to believe that real news websites disproportionately use ".com" extensions.
* If the ratio is greater than 1, then we have reason to believe that fake news websites disproportionately use ".com" extensions.
* If the ratio is 1, then fake and real news websites use the ".com" extension at about the same rate. This means that our hypothesis isn't very useful for separating real and fake news websites, at least not by itself.
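
Here is a small worked example of these quantities on made-up data. The eight URLs, the `toy_data` list, and the `fraction_with_extension` helper are all hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
# Hypothetical toy data: (URL, HTML, label) with label 1 = fake, 0 = real.
toy_data = [
    ("dailyreport.com", "", 0),
    ("citytimes.com", "", 0),
    ("worldpost.com", "", 0),
    ("localnews.org", "", 0),
    ("shockingtruth.com", "", 1),
    ("realfacts.info", "", 1),
    ("hiddenstory.net", "", 1),
    ("insidernews.co", "", 1),
]

def fraction_with_extension(data, label, extension):
    """Fraction of sites with the given label whose URL ends with extension."""
    urls = [url for url, _, site_label in data if site_label == label]
    return sum(url.endswith(extension) for url in urls) / len(urls)

real_fraction = fraction_with_extension(toy_data, 0, ".com")  # 3 of 4 real sites
fake_fraction = fraction_with_extension(toy_data, 1, ".com")  # 1 of 4 fake sites
ratio = fake_fraction / real_fraction

print("Real fraction:", real_fraction)
print("Fake fraction:", fake_fraction)
print("Fake vs Real Ratio:", round(ratio, 3))
```

In this toy data the ratio is below 1, so ".com" shows up disproportionately among the real sites. The real training data may of course behave differently.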


### Test in Code

We define a function below that returns the real and fake fractions of the training data that satisfy a hypothesis. In our code, our hypotheses will just be simple functions that take in a single data point and return "True" or "False". 


Finish the function below, which computes the real and fake fractions as described above. For each datapoint, compute whether the hypothesis is true, and use this along with the label to update *real_true*, *real_total*, *fake_true*, and *fake_total*.

In [None]:
def get_real_and_fake_fractions(train_data, hypothesis):
    # Label 0, hypothesis true
    real_true = 0.0
    # Label 0 total
    real_total = 0.0
    # Label 1, hypothesis true
    fake_true = 0.0
    # Label 1 total
    fake_total = 0.0
    
    for datapoint in train_data:
        # Each datapoint has URL, HTML, label in that order.
        label = datapoint[2]
        hypothesis_truth = hypothesis(datapoint)
        
        if label == 1:
            fake_total += 1
            if hypothesis_truth:
                fake_true += 1
        else:
            real_total += 1
            if hypothesis_truth:
                real_true += 1
            
    return real_true / real_total, fake_true / fake_total

Now, play around with this demonstration that asks you for a domain name extension, and prints out the real fraction, the fake fraction, and the ratio of fake fraction to real fraction. Make sure you understand what the code is doing! After running initially, try other values, like ".org", ".co.uk", and ".edu"! The printed values will update automatically. Note that in some cases, the ratio may be "Infinity", if no real websites in the training data have that domain name extension.

In [None]:
#@title Run this cell with your hypothesis domain name extension { run: "auto" }

# Ask for the extension once, rather than once per data point.
extension = input("Enter domain extension: ")

def domain_extension_hypothesis(datapoint):
    url = datapoint[0]
    return url.endswith(extension)
  
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           domain_extension_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)

# Simple logic for making the printed ratio more interpretable.
def pretty_ratio(fake_fraction, real_fraction):
    ratio = (fake_fraction / real_fraction) if real_fraction > 0 else 'Infinity'
    if fake_fraction == real_fraction:
        ratio = 1
    return ratio
  
print('Ratio fraction:', pretty_ratio(fake_fraction, real_fraction))

## Task 5: Determine Word Frequency Method 


One natural idea is to check whether the number of times a word appears in the HTML of a webpage meets a certain threshold. For example, given the word "Clinton" and a threshold of 3, does nytimes.com mention "Clinton" at least 3 times? Does infowars.com? This may tell us something about how useful the word "Clinton" is for telling us whether a website is fake or not.
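
One detail worth keeping in mind when counting words: a plain substring count also matches the word inside longer words. The sketch below compares a substring count with a whole-word count on a hand-written snippet. The `toy_html` string and both helper functions are hypothetical, and the whole-word variant is an alternative shown for comparison, not something this notebook requires.

```python
import re

# Hypothetical HTML snippet, written by hand for this example.
toy_html = "<p>Clinton spoke today. The Clintons attended. Critics of clinton responded.</p>"

def count_substring(html, word):
    """Case-insensitive count of every occurrence, including inside longer words."""
    return html.lower().count(word.lower())

def count_whole_word(html, word):
    """Case-insensitive count of occurrences that stand alone as a word."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, html.lower()))

threshold = 3
substring_count = count_substring(toy_html, "Clinton")    # also matches "Clintons"
whole_word_count = count_whole_word(toy_html, "Clinton")  # skips "Clintons"

print(substring_count, substring_count >= threshold)
print(whole_word_count, whole_word_count >= threshold)
```

With a threshold of 3, the two counting styles disagree on this snippet, which is a reminder to check what a hypothesis function is actually counting.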


### Test in Code

Now, code up the hypothesis function below, which tests whether the count of a provided word is at or above a threshold, and play with the resulting demo (~15 minutes). We have provided some starter code for you.

In [None]:
#@title Run this cell with a word and a threshold { run: "auto" }

def get_count_from_html(html, hypothesis_word):
    # Transform word to lowercase for consistent results.
    return html.count(hypothesis_word.lower())

def word_threshold_hypothesis(datapoint):
    hypothesis_word = "stocks" #@param {type:"string"}
    threshold = 2 #@param {type:"integer"}
    # Transform HTML to lowercase for consistent results.
    html = datapoint[1].lower() 
    ### YOUR CODE HERE ### (Use get_count_from_html!)
    word_count = get_count_from_html(html,hypothesis_word)
    return word_count >= threshold
    ### END CODE HERE ###
    
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           word_threshold_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)
  
print('Ratio fraction:', pretty_ratio(fake_fraction, real_fraction))

## Task 6:  Hypothesize



Once you have "Clinton" working with a threshold of 3, try other words, like "Trump", "Obama", "Sports", "Finance", and "Opinion". 

Discuss three interesting hypothesis word and threshold combinations, with an explanation of why you think each result occurs.

Be prepared to share with the class! 

## Task 7 | Custom Hypothesis


Now, create your own custom hypotheses! All you should change is the hypothesis function (~20 minutes). 

Some ideas: 
* check whether websites contain certain HTML tags (e.g. "\<table>", "\<section>"), 
* check whether websites contain certain words or phrases in the URL, 
* check whether websites are Wordpress blogs (hint: check whether they contain "wp-content" frequently).
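
As a starting point for ideas like these, here is a minimal sketch of a helper that builds hypothesis functions from a substring and a threshold. The `make_html_substring_hypothesis` name and the two toy data points are hypothetical, and the labels in the toy data are arbitrary, so they say nothing about whether WordPress sites are more or less likely to be fake.

```python
# Hypothetical data points in the (URL, HTML, label) layout used in this notebook.
toy_data = [
    ("siteone.com", '<img src="/wp-content/a.png"><img src="/wp-content/b.png">', 1),
    ("sitetwo.com", "<section><table></table></section>", 0),
]

def make_html_substring_hypothesis(substring, threshold):
    """Build a hypothesis: the HTML contains substring at least threshold times."""
    def hypothesis(datapoint):
        html = datapoint[1].lower()
        return html.count(substring.lower()) >= threshold
    return hypothesis

# "wp-content" appearing at least twice is a rough signal of a WordPress site.
wordpress_hypothesis = make_html_substring_hypothesis("wp-content", 2)
# Searching for "<table" matches the opening tag with or without attributes.
table_hypothesis = make_html_substring_hypothesis("<table", 1)

for datapoint in toy_data:
    print(datapoint[0], wordpress_hypothesis(datapoint), table_hypothesis(datapoint))
```

Each function returned by the helper takes a single data point and returns True or False, so it should fit the same shape that `get_real_and_fake_fractions` expects from a hypothesis.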

In [None]:
### YOUR CODE HERE ###
# Hypothesis ideas:
# - check word count of heading
# - article has four or more images
# - if it is close to a credible source
# - large number of exclamation marks
# - use of pronouns

def get_count_from_url(url, hypothesis_word):
    # Split the URL on "." and count the parts that exactly match the word.
    url_list = url.split(".")
    return url_list.count(hypothesis_word.lower())

def word_threshold_hypothesis(datapoint):
    hypothesis_word = "com" #@param {type:"string"}
    threshold = 1 #@param {type:"integer"}
    # Transform URL to lowercase for consistent results.
    url = datapoint[0].lower()
    ### YOUR CODE HERE ### (Use get_count_from_url!)
    word_count = get_count_from_url(url, hypothesis_word)
    return word_count >= threshold
    ### END CODE HERE ###

# This call must sit outside the function body so that it actually runs.
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data,
                                                           word_threshold_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)

print('Ratio fraction:', pretty_ratio(fake_fraction, real_fraction))

### END CODE HERE ###

Once you are done, list your most interesting hypotheses below and prepare to discuss with the class!

Congratulations on completing this notebook! Tomorrow, we'll use the insights you just built up to build our baseline model.