## CS696 Final Project - Predicting the Popularity of Online News
Xinyu Zhang 820935369  
Xufei Zhao

## Description
There are two main approaches to popularity prediction: those that use features known only after publication and those that do not. The latter approach is rarer and, although lower prediction performance might be expected, its predictions are more useful because they allow content to be improved before publication, as is done in this work.

### Data extraction and processing

In [1]:
import org.apache.spark.sql.functions._

val rawData = spark.read.format("csv").option("inferSchema",true).option("header",true).load("news.csv")
rawData.printSchema
rawData.show(1)

root
 |-- url: string (nullable = true)
 |-- timedelta: integer (nullable = true)
 |-- n_tokens_title: integer (nullable = true)
 |-- n_tokens_content: integer (nullable = true)
 |-- n_unique_tokens: double (nullable = true)
 |-- n_non_stop_words: double (nullable = true)
 |-- n_non_stop_unique_tokens: double (nullable = true)
 |-- num_hrefs: integer (nullable = true)
 |-- num_self_hrefs: integer (nullable = true)
 |-- num_imgs: integer (nullable = true)
 |-- num_videos: integer (nullable = true)
 |-- average_token_length: double (nullable = true)
 |-- num_keywords: integer (nullable = true)
 |-- data_channel_is_lifestyle: integer (nullable = true)
 |-- data_channel_is_entertainment: integer (nullable = true)
 |-- data_channel_is_bus: integer (nullable = true)
 |-- data_channel_is_socmed: integer (nullable = true)
 |-- data_channel_is_tech: integer (nullable = true)
 |-- data_channel_is_world: integer (nullable = true)
 |-- kw_min_min: integer (nullable = true)
 |-- kw_max_min: double 

rawData = [url: string, timedelta: int ... 59 more fields]



### Binary Classification task
We treat this as a binary classification task: an article is considered "popular" if its number of shares is at least a fixed decision threshold (1400), and "unpopular" otherwise.

In [2]:
val popularityDf = rawData.withColumn("popularity", when(col("shares") >= 1400,1).otherwise(0))
popularityDf.select("shares","popularity").show(5)

+------+----------+
|shares|popularity|
+------+----------+
|   593|         0|
|   711|         0|
|  1500|         1|
|  1200|         0|
|   505|         0|
+------+----------+
only showing top 5 rows



popularityDf = [url: string, timedelta: int ... 60 more fields]
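
The labeling rule can also be sketched in plain Scala, which makes it easy to check the class balance on a handful of values. This is a minimal illustration, not part of the pipeline: the helper name `label` is hypothetical, and the share counts are the five rows shown in the output above.

```scala
// Decision threshold taken from the text above.
val threshold = 1400

// Label an article 1 ("popular") when shares >= threshold, else 0.
def label(shares: Int): Int = if (shares >= threshold) 1 else 0

// The five share counts shown in the cell output above.
val shares = Seq(593, 711, 1500, 1200, 505)
val labels = shares.map(label)

// Class balance: label -> number of articles with that label.
val balance = labels.groupBy(identity).map { case (k, v) => k -> v.size }

println(labels)
println(balance)
```

On the full DataFrame, the same balance check would presumably be `popularityDf.groupBy("popularity").count().show()`.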



### Feature Extraction
We extracted 45 feature columns for the learning models, matching the 45 entries of the assembled feature vector.

List of attributes by category:
- Words
- Links
- Digital Media
- Time
- Keywords
- Natural Language Processing
- Target

In [3]:
// Keep the label plus the feature columns, grouped by category (one line per group).
val newsDf = popularityDf.select("popularity",
                          "n_tokens_title","n_tokens_content","average_token_length","n_non_stop_words","n_unique_tokens","n_non_stop_unique_tokens",
                          "num_hrefs","num_self_hrefs","self_reference_min_shares","self_reference_max_shares","self_reference_avg_sharess",
                          "num_imgs","num_videos",
                          "is_weekend",
                          "num_keywords","kw_min_min","kw_max_min","kw_avg_min","kw_min_max","kw_max_max","kw_avg_max","kw_min_avg","kw_max_avg","kw_avg_avg",
                          "LDA_00","LDA_01","LDA_02","LDA_03","LDA_04",
                          "global_subjectivity","title_subjectivity","abs_title_subjectivity","global_sentiment_polarity","title_sentiment_polarity","abs_title_sentiment_polarity",
                          // rate_positive_words was listed twice; the second entry is taken to mean rate_negative_words
                          "rate_positive_words","rate_negative_words","global_rate_positive_words","global_rate_negative_words",
                          "avg_positive_polarity","min_positive_polarity","max_positive_polarity",
                          "avg_negative_polarity","min_negative_polarity","max_negative_polarity")
newsDf.select("popularity").show(3)

+----------+
|popularity|
+----------+
|         0|
|         0|
|         1|
+----------+
only showing top 3 rows



newsDf = [popularity: int, n_tokens_title: int ... 44 more fields]
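
Because the same long column list is needed again when the feature vector is assembled, it may help to define it once, grouped by the categories listed above, and to check it for accidental repeats. The sketch below uses only a subset of the columns for brevity; the names `featureGroups`, `allFeatures`, and `duplicates` are hypothetical, and the assignment of columns to categories is an assumption.

```scala
// Feature columns grouped by category (a subset of the notebook's list).
val featureGroups: Seq[(String, Seq[String])] = Seq(
  "Words"         -> Seq("n_tokens_title", "n_tokens_content", "average_token_length"),
  "Links"         -> Seq("num_hrefs", "num_self_hrefs"),
  "Digital Media" -> Seq("num_imgs", "num_videos"),
  "Time"          -> Seq("is_weekend"),
  "Keywords"      -> Seq("num_keywords", "kw_avg_avg"),
  "Natural Language Processing" ->
    Seq("LDA_00", "global_subjectivity", "rate_positive_words")
)

// Flatten the groups into one ordered column list.
val allFeatures: Seq[String] = featureGroups.flatMap(_._2)

// Names that occur more than once in a column list, sorted alphabetically.
def duplicates(cols: Seq[String]): Seq[String] =
  cols.groupBy(identity).collect { case (name, hits) if hits.size > 1 => name }.toSeq.sorted

// Fail early if a column was listed twice by mistake.
require(duplicates(allFeatures).isEmpty,
  s"duplicate feature columns: ${duplicates(allFeatures)}")
```

With such a list, the selection could be written as `popularityDf.select("popularity", allFeatures: _*)`, and the same list could be handed to `VectorAssembler.setInputCols(allFeatures.toArray)`.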



### Random Forest Classification Method
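A random forest with 10 trees is trained on a random 70/30 train/test split. Open question: should a rolling window be used for evaluation instead?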

In [4]:
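import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// MulticlassMetrics (used below) lives in the RDD-based mllib package
import org.apache.spark.mllib.evaluation.MulticlassMetrics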

// Every column of newsDf except the label is a feature, so the column list is not repeated here.
val featureCols = newsDf.columns.filter(_ != "popularity")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val assembledData = assembler.transform(newsDf)
assembledData.select("popularity","features").show(3)

val Array(trainingData, testData) = assembledData.randomSplit(Array(0.7, 0.3))
val rf = new RandomForestClassifier().setLabelCol("popularity").setFeaturesCol("features").setNumTrees(10)
val rfModel = rf.fit(trainingData)
val predictions = rfModel.transform(testData)

predictions.select("popularity","features","prediction").show

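// The evaluator's default metric is "f1"; request accuracy explicitly so the printed label is correct.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("popularity").setPredictionCol("prediction").setMetricName("accuracy")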
val predictionsAndLabels = predictions.select("prediction", "popularity").as[(Double, Double)].rdd
val myMatrix = new MulticlassMetrics(predictionsAndLabels)

val accuracy = evaluator.evaluate(predictions)
println("Test Accuracy = " + accuracy)
println("Test Error = " + (1.0 - accuracy))


+----------+--------------------+
|popularity|            features|
+----------+--------------------+
|         0|[12.0,219.0,4.680...|
|         0|(45,[0,1,2,3,4,5,...|
|         1|[9.0,211.0,4.3933...|
+----------+--------------------+
only showing top 3 rows

+----------+--------------------+----------+
|popularity|            features|prediction|
+----------+--------------------+----------+
|         0|[4.0,348.0,4.7442...|       1.0|
|         0|[4.0,359.0,4.6100...|       1.0|
|         0|[5.0,119.0,4.8235...|       0.0|
|         0|[5.0,140.0,5.1785...|       1.0|
|         0|[5.0,213.0,4.7136...|       0.0|
|         0|[5.0,271.0,4.9077...|       1.0|
|         0|[5.0,277.0,4.7978...|       0.0|
|         0|[5.0,308.0,4.8149...|       1.0|
|         0|[5.0,348.0,4.7155...|       1.0|
|         0|[5.0,490.0,4.7142...|       0.0|
|         0|[5.0,543.0,4.7053...|       1.0|
|         0|[5.0,1302.0,4.721...|       1.0|
|         0|[6.0,73.0,4.30136...|       0.0|
|         0|[6.0,

assembler = vecAssembler_1080c1d1301e
assembledData = [popularity: int, n_tokens_title: int ... 45 more fields]
trainingData = [popularity: int, n_tokens_title: int ... 45 more fields]
testData = [popularity: int, n_tokens_title: int ... 45 more fields]
rf = rfc_5d4ef57fddf7


rfModel: org.apache.spark.ml.classification.Random...
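
On the rolling-window question raised at the top of this section: one possible reading is that the articles are ordered by time, the model is trained on a fixed-size window, tested on the block that follows, and the window then slides forward. The sketch below only generates the index ranges for such a scheme. The names `Window` and `rollingWindows` and the window sizes are hypothetical, and ordering the rows by `timedelta` would be an assumption about how time is represented.

```scala
// One train/test split, expressed as index ranges over time-ordered rows.
final case class Window(train: Range, test: Range)

// Slide a training window of trainSize rows over n rows; each window is
// followed by a test block of testSize rows, and the window advances by
// testSize so that every test block is used exactly once.
def rollingWindows(n: Int, trainSize: Int, testSize: Int): Seq[Window] = {
  require(trainSize > 0 && testSize > 0, "window sizes must be positive")
  (0 to n - trainSize - testSize by testSize).map { start =>
    val split = start + trainSize
    Window(start until split, split until split + testSize)
  }
}

// Toy usage: 10 time-ordered rows, train on 4, test on the next 2.
rollingWindows(10, 4, 2).foreach { w =>
  println(s"train rows ${w.train.head}..${w.train.last}, " +
          s"test rows ${w.test.head}..${w.test.last}")
}
```

Unlike `randomSplit`, this never tests on articles older than the ones the model was trained on, which is closer to how a pre-publication predictor would be used.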



### Further work

- AdaBoost Method?
- KNN Method?
- Compare the best accuracy then save the model?
- Accuracy, Precision, Recall, F1, AUC?
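
For the last item, accuracy, precision, recall, and F1 all follow from the four confusion-matrix counts. The sketch below computes them in plain Scala from (prediction, label) pairs, treating label 1 ("popular") as the positive class. The names `BinaryMetrics` and `binaryMetrics` are hypothetical and the input pairs are toy values, not results from this notebook. AUC is left out because it needs class scores rather than hard predictions.

```scala
final case class BinaryMetrics(accuracy: Double, precision: Double,
                               recall: Double, f1: Double)

// Compute metrics from (prediction, label) pairs; 1 is the positive class.
def binaryMetrics(predictionsAndLabels: Seq[(Int, Int)]): BinaryMetrics = {
  val tp = predictionsAndLabels.count { case (p, l) => p == 1 && l == 1 }
  val fp = predictionsAndLabels.count { case (p, l) => p == 1 && l == 0 }
  val tn = predictionsAndLabels.count { case (p, l) => p == 0 && l == 0 }
  val fn = predictionsAndLabels.count { case (p, l) => p == 0 && l == 1 }

  // Define a ratio with an empty denominator as 0.0 instead of NaN.
  def ratio(num: Int, den: Int): Double = if (den == 0) 0.0 else num.toDouble / den

  val precision = ratio(tp, tp + fp)
  val recall    = ratio(tp, tp + fn)
  val f1 =
    if (precision + recall == 0.0) 0.0
    else 2 * precision * recall / (precision + recall)

  BinaryMetrics(ratio(tp + tn, predictionsAndLabels.size), precision, recall, f1)
}

// Toy input: 2 true positives, 1 false positive, 2 true negatives, 1 false negative.
val toy = Seq((1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0))
println(binaryMetrics(toy))
```

In Spark itself, `MulticlassClassificationEvaluator` accepts metric names such as "accuracy", "weightedPrecision", "weightedRecall", and "f1", and `BinaryClassificationEvaluator` offers "areaUnderROC", so the hand-rolled version is mainly useful as a cross-check.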