### Introduction to Data Science - Homework 5
*CS 5963 / MATH 3900, University of Utah, http://datasciencecourse.net/*

Due: Friday, October 28, 11:59pm.

In this homework, you will (i) scrape data about happy hours at various restaurants and bars from a website and (ii) use classification tools to predict the popularity of online news.

## Your Data
Fill out the following information: 

*First Name:* Martin  
*Last Name:*   Raming
*E-mail:*   martin.raming@utah.edu
*UID:*  u0228111


## Part 1: Scrape SLC happy hour data

In this part, you'll explore happy hours close to Salt Lake City. Unfortunately, you'll probably have to drive a bit, since Utah doesn't do happy hours. Nevertheless, hopefully you'll get an idea for a great location for your next party!

You're going to scrape [The Happy Hour Finder](http://thehappyhourfinder.com/us_ut/salt-lake-city/).

### Task 1.1 Check whether you are permitted to scrape the data. 

Investigate the terms of service of the website and check whether there is a `robots.txt` file and, if so, whether it permits you to scrape the website. Make sure you are allowed to scrape this website. Are you?
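
One possible way to check `robots.txt` rules programmatically is the standard-library `urllib.robotparser` module. The sketch below parses a made-up rules file (the `SAMPLE_ROBOTS_TXT` text and the `can_fetch` helper are assumptions for illustration, not the site's real rules), so it runs without any network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# it is NOT the actual file served by thehappyhourfinder.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def can_fetch(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_listing = can_fetch(
    SAMPLE_ROBOTS_TXT, "http://thehappyhourfinder.com/us_ut/salt-lake-city/"
)
allowed_admin = can_fetch(SAMPLE_ROBOTS_TXT, "http://thehappyhourfinder.com/admin/")
print(allowed_listing, allowed_admin)
```

To check the live site instead, `RobotFileParser` can load the real file with `set_url(...)` followed by `read()`. Note that `robots.txt` does not replace reading the terms of service.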

**Your determination:** TODO

### Task 1.2 Download the website

To avoid sending too many requests to the server, download the HTML file using Python and save it locally on your machine while you are developing. You should then only access the downloaded HTML.

The website allows us to specify a search. Ignore this and just scrape the default happy hours shown (today's happy hours). 
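
A minimal sketch of the "download once, then work from the local copy" idea is shown here. The helper name `load_html` and the cache file name used in the usage note are assumptions; the function only contacts the server when no local copy exists yet.

```python
import os
import urllib.request


def load_html(url, cache_path):
    """Return the page's HTML as bytes, downloading only if no local copy exists."""
    if not os.path.exists(cache_path):
        # First run: fetch the page once and store it on disk.
        with urllib.request.urlopen(url) as response:
            html = response.read()
        with open(cache_path, "wb") as f:
            f.write(html)
    # Every run: read the saved copy instead of hitting the server again.
    with open(cache_path, "rb") as f:
        return f.read()
```

It could be called as `html = load_html('http://thehappyhourfinder.com/us_ut/salt-lake-city/', 'happy_hours.html')`, where `happy_hours.html` is a hypothetical file name. Delete the file to force a fresh download.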

In [2]:
# imports and setup 

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('ggplot')

In [3]:
# Your code here
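# Download the page and parse it with BeautifulSoup
address = 'http://thehappyhourfinder.com/us_ut/salt-lake-city/'
with urllib.request.urlopen(address) as response:
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')
# Inspect the main content block that holds the happy hour listings
soup.select(".content")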

[<div class="row-fluid content">
 <div class="span8 left-column">
 <div class="well">
 <h4>All Happy Hours in Salt Lake City, UT </h4>
 <div class="social-buttons">
 <!-- AddThis Button BEGIN -->
 <div class="addthis_toolbox addthis_default_style ">
 <a class="addthis_button_facebook_like" fb:like:layout="button_count"></a>
 <a class="addthis_button_tweet"></a>
 <a class="addthis_button_pinterest_pinit"></a>
 <a class="addthis_counter addthis_pill_style"></a>
 </div>
 <!--script type="text/javascript">var addthis_config = {"data_track_addressbar":true};</script-->
 <script src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-520201cd67fcd783" type="text/javascript"></script>
 <!-- AddThis Button END -->
 </div>
 <br/>
 <div class="hh-list">
 <div class="simple-business">
 <h4>520 miles</h4>
 <ul>
 <li class="row">
 <div class="span2 image-block">
 <a href="/us_mn/orem/pf-changs-china-bistro-45/">
 <div>
 <img alt="P.F. Chang's China Bistro" src="/static/img/default_business.png">
 <

### Task 1.3 Create a dataframe
We want to know which day of the week and which time of day are the most popular for happy hours.
Create a pandas dataframe that includes the name of the bar and a binary entry for each day of the week indicating whether it has a happy hour that day.

Also add a link to the website in each row.

Hint: use CSS selectors to find the information you are looking for. You will also need to retrieve information from nested pages such as [this one](http://thehappyhourfinder.com/us_wy/jackson-hole/the-rose/). To extract the final data, you will need to work with strings; `str.replace()` and `str.split()` will help you here.

Your dataset in the end should look something like this: 

![data frame](data_frame.png)
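
The string handling mentioned in the hint could look roughly like the sketch below. It assumes a hypothetical schedule format such as `'Mon, Wed, Fri: 4:00pm - 7:00pm'`; the text on the real pages may be laid out differently and need different cleaning. The `DAYS` list and the `parse_days` helper are illustrative names.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def parse_days(schedule):
    """Turn a schedule string into a dict of 0/1 flags, one per day of the week.

    Assumes a hypothetical format like 'Mon, Wed, Fri: 4:00pm - 7:00pm'.
    """
    day_part = schedule.split(":")[0]      # text before the first colon
    day_part = day_part.replace(" ", "")   # drop spaces -> 'Mon,Wed,Fri'
    listed = day_part.split(",")
    return {day: int(day in listed) for day in DAYS}


flags = parse_days("Mon, Wed, Fri: 4:00pm - 7:00pm")
print(flags)
```

A list of such dictionaries, each extended with the bar's name and link, can be passed directly to `pd.DataFrame(...)` to get one row per establishment.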

In [None]:
# Your code here


### Task 1.4 Find popular days
Which day of the week is the most popular for happy hours? Create a bar chart showing how many establishments have happy hours each day of the week. Try to explain any pattern in the chart. 
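
Because the day columns are binary, counting establishments per day reduces to a column sum. The sketch below uses a small made-up dataframe (the bar names, values, and column labels are assumptions standing in for the scraped data).

```python
import pandas as pd

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up rows standing in for the scraped dataframe (all values are hypothetical).
df = pd.DataFrame(
    [
        ["Bar A", 1, 1, 1, 1, 1, 0, 0],
        ["Bar B", 0, 0, 1, 1, 1, 1, 0],
        ["Bar C", 1, 1, 1, 1, 1, 1, 1],
    ],
    columns=["name"] + DAYS,
)

# Summing each binary column counts the establishments with a happy hour that day.
counts = df[DAYS].sum()
print(counts)
```

In the notebook, `counts.plot(kind='bar')` would then draw the bar chart with the days in calendar order.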

In [None]:
# Your code here


### Task 1.5 Find popular times

Plot a histogram of happy hour times during the week, and a second histogram for happy hour times on the weekend. What time of the day is the most popular for a happy hour? Is there a difference between weekdays and weekends?
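
Before plotting, the time strings have to become numbers. The sketch below is one way to do that, assuming times are written like `'4:30pm'` (a hypothetical format) and that a happy hour does not run past midnight. The helpers `to_hour` and `hours_covered` are illustrative names.

```python
import math
from datetime import datetime


def to_hour(time_text):
    """Convert a clock string such as '4:30pm' to a fractional hour (16.5)."""
    t = datetime.strptime(time_text.strip().upper(), "%I:%M%p")
    return t.hour + t.minute / 60


def hours_covered(start_text, end_text):
    """List the clock hours a happy hour touches, e.g. 4:00pm to 7:00pm -> [16, 17, 18].

    Assumes the end time falls on the same day as the start time.
    """
    start, end = to_hour(start_text), to_hour(end_text)
    return list(range(int(start), math.ceil(end)))


print(to_hour("4:30pm"), hours_covered("4:00pm", "7:00pm"))
```

Collecting `hours_covered(...)` over all weekday rows into one list, and over all weekend rows into another, gives two sequences that can each be passed to `plt.hist(..., bins=range(0, 25))`.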

In [None]:
# Your code here


## Part 2: Classification

For this problem, you will use classification tools to predict the popularity of online news based on attributes such as the length of the article, the number of images, the day of the week that the article was published, and some variables related to the content of the article. The dataset is described on and can be downloaded from the 
[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity). 
This dataset was first used in the following conference paper: 

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. *Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence* (2015).

The dataset contains variables describing 39,644 articles published between January 7, 2013 and January 7, 2015 on the news website [Mashable](http://mashable.com/). 
There are 61 variables associated with each article. Of these, 58 are *predictor* variables, 2 are variables that we will not use (url and timedelta), and the last is the number of shares of each article. The number of shares is what we will use to define whether or not the article was *popular*, which is what we will try to predict. You should read about the predictor variables in the file *OnlineNewsPopularity.names*. Further details about the collection and processing of the articles can be found in the conference paper. 


### Task 2.1 Import the data 
Use the pandas.read_csv() function to import the dataset.

To use the Python library [scikit-learn](http://scikit-learn.org), we'll need to save the data as a numpy array. We don't need the url and timedelta columns, so drop them first. Then use *DataFrame.to_numpy()* (older pandas releases used *DataFrame.as_matrix()*, which has since been removed) to export the predictor variables as a numpy array called *X*. This array should not include our target variable (the number of shares). 

Export the number of shares as a separate numpy array, called *shares*. We'll define an article to be popular if it received more shares than the median number of shares. Create a binary numpy array, *y*, which indicates whether or not each article is popular.
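
The steps above can be sketched on a tiny made-up table. The column names mirror the ones named in the text, but the values are invented and the real file has 61 columns. `DataFrame.to_numpy()` is used here because it is the current pandas replacement for `as_matrix()`.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the real dataset (values are invented).
df = pd.DataFrame({
    "url": ["a", "b", "c", "d", "e"],
    "timedelta": [731, 730, 729, 728, 727],
    "n_tokens_title": [12, 9, 10, 13, 8],
    "num_imgs": [1, 0, 4, 2, 1],
    "shares": [100, 500, 1400, 3000, 900],
})

# Target: number of shares, kept separate from the predictors.
shares = df["shares"].to_numpy()

# Predictors: everything except the unused columns and the target.
X = df.drop(columns=["url", "timedelta", "shares"]).to_numpy()

# An article counts as popular if it has more shares than the median.
y = (shares > np.median(shares)).astype(int)
print(X.shape, y)
```

If the column headers in the downloaded CSV turn out to contain leading whitespace, `df.columns = df.columns.str.strip()` right after `pd.read_csv(...)` makes the names above match.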

In [None]:
# imports and setup 

import pandas as pd
import numpy as np

from sklearn import tree, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, KFold

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('ggplot')

In [None]:
# Your code here


### Task 2.2 Exploratory data analysis 

First check to see if the values are reasonable. What are the min, median, and maximum number of shares? 

In [None]:
# Your code here


### Task 2.3 Classification using k-NN

Develop a k-NN classification model for the data. Use cross validation to choose the best value of k. What is the best accuracy you can obtain on the test data? 
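
One common pattern for choosing k is to hold out a test set, cross-validate each candidate k on the training portion only, and then score the winning model once on the held-out data. The sketch below shows this on synthetic data from `make_classification`, not on the news dataset; the candidate values of k are arbitrary. It imports from `sklearn.model_selection`, which replaced the older `sklearn.cross_validation` module in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the news data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Mean 5-fold cross-validation accuracy for each candidate k, training data only.
candidate_ks = [1, 5, 15, 31]
cv_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Refit with the best k and evaluate once on the held-out test set.
best_k = max(cv_scores, key=cv_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, test_accuracy)
```

Keeping the test set out of the search avoids reporting an accuracy that was tuned on the same data it is measured on.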

In [None]:
# Your code here


### Task 2.4 Classification using SVM

Develop a support vector machine classification model for the data.

*Hint:* SVM is more computationally expensive, so you might want to start by using only a fraction of the data, say 5,000 articles. 
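
The subsampling suggested in the hint might be sketched as follows, again on synthetic data rather than the news dataset. The subset size of 500, the RBF kernel, and the feature scaling step are assumptions of this sketch; scaling is added because SVMs are sensitive to the range of each feature, not because the assignment asks for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the news data (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# Draw a random subset of rows without replacement to keep training fast.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=500, replace=False)
X_small, y_small = X[subset], y[subset]

X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_train), y_train)
test_accuracy = model.score(scaler.transform(X_test), y_test)
print(test_accuracy)
```

Once the pipeline works on the subset, the `size` argument can be raised step by step to see how training time and accuracy change.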

In [None]:
# Your code here


### Task 2.5 Classification using decision trees

Develop a decision tree classification model for the data. Use cross validation to choose good values of the max tree depth (*max_depth*) and minimum samples split (*min_samples_split*). 
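
Searching over two parameters at once can be done with scikit-learn's `GridSearchCV`, which cross-validates every combination in a grid. The sketch below runs on synthetic data, and the grid values are arbitrary choices for illustration rather than recommended settings for the news dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the news data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of these values is scored with 5-fold cross validation.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_split": [2, 10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print(best_params, best_cv_accuracy)
```

The full table of scores is available in `search.cv_results_`, which helps when describing how each parameter influenced the accuracy.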

In [None]:
# Your code here


### Task 2.6 Describe your findings
1. Which method (k-NN, SVM, Decision Tree) worked best?
+ How did different parameters influence the accuracy?
+ Which model is easiest to interpret?
+ How would you interpret your results?


**Your Solution:** TODO
1. 
+ 
+ 
+ 