# HackOnData.com

## Exercise #6 - Linear Regression

### Week 6 Lab 1 (Public link to the lab solution):
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6762635509390866/1784028329753015/7348375132556489/latest.html

In [3]:
%%bash

S3PATH="https://s3-us-west-2.amazonaws.com/hackondata.evaluation"
FILES=("OnlineNewsTesting.csv" "OnlineNewsTrainingAndValidation.csv")

for file in "${FILES[@]}"
do
    wget "$S3PATH/$file"
done

In [4]:
%python

proj_folder = "/mnt/E6"
filenames = ("OnlineNewsTesting.csv", "OnlineNewsTrainingAndValidation.csv")

dbutils.fs.mkdirs(proj_folder)
# Copy each file downloaded to the driver's local disk into the project folder
for f in filenames:
    dbutils.fs.cp("file:/databricks/driver/%s" % f, proj_folder)

In [5]:
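import os
from pyspark.sql import Row

def load_csv(filename):
    """Read one CSV from proj_folder into an RDD of Rows with stripped column names."""
    df = (sqlContext
          .read.format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("mode", "DROPMALFORMED")  # skip rows that fail to parse
          .load(os.path.join(proj_folder, filename)))
    # Column names are stripped because the CSV header may contain padding spaces
    return df.rdd.map(lambda r: Row(**{k.strip(): v
                                       for k, v in r.asDict().items()}))

raw_travRDD, raw_tstRDD = map(load_csv, filenames)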

In [6]:
display(sqlContext.createDataFrame(raw_travRDD))

In [7]:
url_sharesRDD = raw_travRDD.map(lambda r: Row(url=r.url, shares=r.shares))
display(sqlContext.createDataFrame(url_sharesRDD).orderBy('shares', ascending=False))

In [8]:
0. url: URL of the article (non-predictive) 
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive) 
2. n_tokens_title: Number of words in the title 
3. n_tokens_content: Number of words in the content 
4. n_unique_tokens: Rate of unique words in the content 
5. n_non_stop_words: Rate of non-stop words in the content 
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content 
7. num_hrefs: Number of links 
8. num_self_hrefs: Number of links to other articles published by Mashable 
9. num_imgs: Number of images 
10. num_videos: Number of videos 
11. average_token_length: Average length of the words in the content 
12. num_keywords: Number of keywords in the metadata 
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'? 
14. data_channel_is_entertainment: Is data channel 'Entertainment'? 
15. data_channel_is_bus: Is data channel 'Business'? 
16. data_channel_is_socmed: Is data channel 'Social Media'? 
17. data_channel_is_tech: Is data channel 'Tech'? 
18. data_channel_is_world: Is data channel 'World'? 
19. kw_min_min: Worst keyword (min. shares) 
20. kw_max_min: Worst keyword (max. shares) 
21. kw_avg_min: Worst keyword (avg. shares) 
22. kw_min_max: Best keyword (min. shares) 
23. kw_max_max: Best keyword (max. shares) 
24. kw_avg_max: Best keyword (avg. shares) 
25. kw_min_avg: Avg. keyword (min. shares) 
26. kw_max_avg: Avg. keyword (max. shares) 
27. kw_avg_avg: Avg. keyword (avg. shares) 
28. self_reference_min_shares: Min. shares of referenced articles in Mashable 
29. self_reference_max_shares: Max. shares of referenced articles in Mashable 
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable 
31. weekday_is_monday: Was the article published on a Monday? 
32. weekday_is_tuesday: Was the article published on a Tuesday? 
33. weekday_is_wednesday: Was the article published on a Wednesday? 
34. weekday_is_thursday: Was the article published on a Thursday? 
35. weekday_is_friday: Was the article published on a Friday? 
36. weekday_is_saturday: Was the article published on a Saturday? 
37. weekday_is_sunday: Was the article published on a Sunday? 
38. is_weekend: Was the article published on the weekend? 
39. LDA_00: Closeness to LDA topic 0 
40. LDA_01: Closeness to LDA topic 1 
41. LDA_02: Closeness to LDA topic 2 
42. LDA_03: Closeness to LDA topic 3 
43. LDA_04: Closeness to LDA topic 4 
44. global_subjectivity: Text subjectivity 
45. global_sentiment_polarity: Text sentiment polarity 
46. global_rate_positive_words: Rate of positive words in the content 
47. global_rate_negative_words: Rate of negative words in the content 
48. rate_positive_words: Rate of positive words among non-neutral tokens 
49. rate_negative_words: Rate of negative words among non-neutral tokens 
50. avg_positive_polarity: Avg. polarity of positive words 
51. min_positive_polarity: Min. polarity of positive words 
52. max_positive_polarity: Max. polarity of positive words 
53. avg_negative_polarity: Avg. polarity of negative words 
54. min_negative_polarity: Min. polarity of negative words 
55. max_negative_polarity: Max. polarity of negative words 
56. title_subjectivity: Title subjectivity 
57. title_sentiment_polarity: Title polarity 
58. abs_title_subjectivity: Absolute subjectivity level 
59. abs_title_sentiment_polarity: Absolute polarity level 
60. shares: Number of shares (target)
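
The field list marks `url` and `timedelta` as non-predictive and `shares` as the target. As a minimal sketch, assuming the header names are available as a plain Python list, the feature columns can be separated from those three. The names `split_columns`, `NON_PREDICTIVE`, and the sample `header` are illustrative and not part of the lab.

```python
# Columns the field list marks as non-predictive, plus the regression target.
NON_PREDICTIVE = {"url", "timedelta"}
TARGET = "shares"


def split_columns(columns):
    """Return (feature_columns, target) from a list of header names."""
    # Strip names first, since header entries may carry padding spaces.
    cleaned = [c.strip() for c in columns]
    if TARGET not in cleaned:
        raise ValueError("target column %r not found" % TARGET)
    features = [c for c in cleaned
                if c not in NON_PREDICTIVE and c != TARGET]
    return features, TARGET


# Hypothetical header fragment, for illustration only.
header = ["url", " timedelta", " n_tokens_title", " num_imgs", " shares"]
features, target = split_columns(header)
print(features)
print(target)
```

Whether further columns should be dropped (for example, one of the mutually redundant weekday indicators) is left open here; that is one of the questions the exercise asks.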

## Linear Regression:

This week we are working with linear regression:

- We have split the dataset into two parts. Use the first part for training, testing, and validating your model. Once you have your best model, evaluate it on the second dataset, which is for scoring only. The purpose of the scoring set is for all of us to score against the same data points.
  - Download the dataset from: http://tranquant.com/search?search=HackOn(Data)%20Exercise%2006
  - First dataset: OnlineNewsTrainingAndValidation.csv
  - Scoring dataset: OnlineNewsTesting.csv
  - Note that the dataset comes from http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity, where you can find information about each field.
    <pre>K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.</pre>
    
- The goal of the regression task is to predict the number of shares an article will generate (last column in the dataset). The features are already generated for you.
- Use the Spark ML implementation of linear regression:

  example: https://spark.apache.org/docs/1.6.2/ml-classification-regression.html#linear-regression
  
  documentation: https://spark.apache.org/docs/1.6.1/api/python/pyspark.ml.html?highlight=linearregression#pyspark.ml.regression.LinearRegression
  
  To tune the model parameters, you can use the ParamGridBuilder() of spark.ml.tuning. See:
   - https://spark.apache.org/docs/latest/ml-tuning.html
  
- To analyze the data, follow a strategy similar to the one used in the lab:
  - Start by doing some data exploration
  - Would you use all the features? Which ones would you remove?
  - Which parameters would you tune?
  - What metrics would you include?
  - How can you improve the model?
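
One possible way to approach the metric and tuning questions above is sketched below. It is not the lab solution. `regression_metrics` is a plain-Python helper for RMSE, MAE, and R², and `build_cross_validator` wires `VectorAssembler`, `LinearRegression`, `ParamGridBuilder`, and `CrossValidator` from `pyspark.ml` into a grid search scored by RMSE. Both function names, the `label` column name, the grid values, and the fold count are assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = float(len(y_true))
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when the target is constant; report NaN in that case.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": r2}


def build_cross_validator(feature_cols, label_col="label", num_folds=3):
    """Return a CrossValidator that grid-searches LinearRegression by RMSE.

    Spark is imported inside the function so the metric helper above
    stays usable on a machine without Spark installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Pack the chosen numeric columns into the single vector column ML expects.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol=label_col,
                          maxIter=50)
    # Illustrative grid values, not tuned recommendations.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
            .build())
    evaluator = RegressionEvaluator(labelCol=label_col,
                                    predictionCol="prediction",
                                    metricName="rmse")
    return CrossValidator(estimator=Pipeline(stages=[assembler, lr]),
                          estimatorParamMaps=grid,
                          evaluator=evaluator,
                          numFolds=num_folds)


# Sanity check of the metric helper on made-up numbers (not dataset values).
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(metrics)
```

In a notebook, this might be used as `model = build_cross_validator(feature_cols).fit(train_df)`, where `train_df` is a hypothetical DataFrame built from the training file with `shares` cast to double and stored in a `label` column. The scoring file would then be passed through `model.transform` exactly once, after the best model has been chosen.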