# Overview
I will use unsupervised learning methods and NLP techniques, running on Spark, to understand why some [Medium blog posts](https://medium.com/) are more popular than others. Many factors can contribute to a post's popularity: a trendy topic, digestible and meaningful content, or a thesis that resonates with the majority of readers. I will not explore every one of these factors. Instead, I narrow the research scope to one specific area, the "topic", and try to answer the following research questions (the third is optional):
1. What "latent topics" do blog writers like to write about? In other words, what is the current trend of "latent topics" in the contents of Medium blog posts?
2. What "latent topics" are well received and resonate with the majority of Medium readers?
3. (Optional) Should blog post writers produce more posts that cater to readers' appetites?

First, we need to define a "popularity" metric for a blog post. I will use the number of claps a post receives as that metric.

To uncover these "latent topics", I use [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA), a topic model that infers a set of topics from the words in the posts.
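
For reference, the generative process that LDA assumes can be written as below. This is the standard textbook formulation, not something specific to this analysis; the symbols $K$ (number of topics), $D$ (number of posts), and the Dirichlet priors $\alpha$ and $\beta$ are notation introduced here for illustration.

```latex
\begin{align*}
\phi_k   &\sim \operatorname{Dirichlet}(\beta),          && k = 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha),         && d = 1, \dots, D \\
z_{d,n}  &\sim \operatorname{Categorical}(\theta_d)      && \text{for each word position } n \text{ in post } d \\
w_{d,n}  &\sim \operatorname{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here $\theta_d$ is the topic mixture of post $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of post $d$, and $w_{d,n}$ is the observed word. The "latent topics" are the $\phi_k$: they are never observed directly and have to be inferred from the words.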


## Data Sources
No data science project can exist without data, so I use [scrapy](https://scrapy.org/) to crawl pages on [Medium](https://medium.com/) and scrape each post's title, author, content, claps, and other information, then save them as raw data for research. The source code can be found on [GitHub](https://github.com/KevinLiao159/MediumBlog/tree/master/src). Due to limited resources, I sub-sample only a few high-level topics (data science, blockchain, artificial intelligence, startup, web development, software development) and query only posts published between 2018/01/01 and 2018/05/01.


## Contents
1. Load Data & Basic Data Cleaning 
2. Basic Exploratory Data Analysis<br/>
    i. trends of different high-level topics<br/>
    ii. high-level topic distribution<br/>
    iii. high-level topics vs. claps<br/>
3. Natural Language Processing<br/>
    i. preprocess / clean text, tokenization, lemmatization/stemming<br/>
    ii. TF-IDF vectorization<br/>
    iii. K-means clustering<br/>
    iv. LDA (Latent Dirichlet allocation)<br/>
    v. validate models by visual displays with dimension reduction (PCA/t-SNE)<br/>
4. Analysis on Blog Post Popularity vs. Latent Topics
5. Analysis on Trends in Different Latent Topics
6. Conclusion

In [140]:
# must come first in the cell: __future__ imports are only valid at the top
from __future__ import division # for Python 2.x

# spark entry point
from pyspark import SparkContext
# dataframe / SQL entry point
from pyspark.sql import SQLContext, SparkSession, DataFrame
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType, IntegerType, TimestampType, ArrayType

# other imports
import os
from unicodedata import normalize
from functools import reduce

import warnings
warnings.filterwarnings('ignore')

data_path = '../data/'

In [6]:
# spark config
sc = SparkContext(master="local[4]")
sql_sc = SQLContext(sc)
spark = SparkSession(sc)

## Load Data & Basic Data Cleaning 
I use a Spark SQL session to read the JSON lines files directly:
1. read the data into Spark
2. normalize the text
3. convert each column to its proper data type
4. filter by date range
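
The four steps above can be sketched on a couple of records in plain Python, without Spark. This is only an illustration: the sample records are made up, and the raw formats (`publish_time` as a `YYYY-MM-DD` string, `claps` as a numeric string) are assumptions about the scraped data, not something confirmed here. The helper names `clean_text` and `clean_record` are hypothetical as well.

```python
import json
from datetime import datetime
from unicodedata import normalize

# made-up JSON lines standing in for one scraped .jl file (assumed raw formats)
RAW_LINES = "\n".join([
    json.dumps({"publish_time": "2018-03-08", "title": "\ufb01ve tips; part\u00a01",
                "contents": "caf\u00e9", "claps": "202", "tags": ["Startup"]}),
    json.dumps({"publish_time": "2018-06-01", "title": "too late",
                "contents": "out of range", "claps": "3", "tags": ["Blockchain"]}),
])

START, END = datetime(2018, 1, 1), datetime(2018, 5, 1)


def clean_text(text):
    # NFKD splits compatibility characters (the ligature U+FB01 becomes "fi",
    # a no-break space becomes a plain space); semicolons are replaced by spaces
    return normalize("NFKD", text).replace(";", " ")


def clean_record(record):
    # steps 2 and 3: normalize the text and convert each field to a proper type
    return {
        "publish_time": datetime.strptime(record["publish_time"], "%Y-%m-%d"),
        "title": clean_text(record["title"]),
        "contents": clean_text(record["contents"]),
        "claps": int(record["claps"]),
        "tags": list(record["tags"]),
    }


# step 1: one JSON document per line
records = [json.loads(line) for line in RAW_LINES.splitlines()]
# step 4: keep only posts strictly inside the date range
cleaned = [r for r in map(clean_record, records) if START < r["publish_time"] < END]

print(len(cleaned))
print(repr(cleaned[0]["title"]))
```

The Spark version below follows the same shape, with the text normalization wrapped in a user defined function and the type conversions expressed as column casts.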

In [141]:
def preprocess(data):
    """
    preprocess raw data
    
    input: spark dataframe
    output: spark dataframe with proper dtypes and clean/normalized unicode string
    """
    # create a user defined function to apply to spark dataframe columns;
    # pass nulls through, since normalize() raises TypeError on None
    normalizer = UserDefinedFunction(
        lambda x: normalize('NFKD', x).replace(';', ' ') if x is not None else None,
        StringType(),
    )
    # normalization and date filter
    data = data.select(
        data.publish_time.cast(TimestampType()),
        normalizer("title").alias("title"),
        normalizer("contents").alias("contents"),
        data.claps.cast(IntegerType()),
        data.tags.cast(ArrayType(StringType())),
    )
    return data.filter(data.publish_time > '2018-01-01').filter(data.publish_time < '2018-05-01')


def unionAll(*dfs):
    """
    union all tables into one table vertically
    """
    return reduce(DataFrame.unionAll, dfs)

In [143]:
# read json lines file
path_files = [os.path.join(data_path, f_name) for f_name in os.listdir(data_path) if f_name.endswith('.jl')]

# init list of spark dataframes
dfs = []
for path in path_files:
    # read json lines into a spark dataframe
    data = spark.read.json(path)
    # preprocess
    data = preprocess(data)
    # append to list
    dfs.append(data)
    
# union all
df = unionAll(*dfs).drop_duplicates(subset=['title'])
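
What `unionAll(*dfs)` followed by `drop_duplicates(subset=['title'])` does can be illustrated on plain Python lists. This is a stand-in sketch, not Spark code: the helper names `union_all` and `drop_duplicate_titles` and the sample rows are hypothetical. One difference worth keeping in mind: the sketch keeps the first row seen for each title, whereas Spark, as far as I know, makes no promise about which of the duplicate rows survives.

```python
from functools import reduce


def union_all(*tables):
    # fold the tables pairwise: ((t1 + t2) + t3) + ...,
    # the same shape as reduce(DataFrame.unionAll, dfs)
    return reduce(lambda left, right: left + right, tables)


def drop_duplicate_titles(rows):
    # keep one row per title (in this sketch: the first one encountered)
    seen, unique = set(), []
    for row in rows:
        if row["title"] not in seen:
            seen.add(row["title"])
            unique.append(row)
    return unique


# made-up rows: the same post appears in two scraped files
file_a = [{"title": "A", "claps": 17}, {"title": "B", "claps": 1}]
file_b = [{"title": "B", "claps": 1}, {"title": "C", "claps": 202}]

rows = drop_duplicate_titles(union_all(file_a, file_b))
titles = [row["title"] for row in rows]
print(titles)
```

Deduplicating on `title` alone also merges distinct posts that happen to share a title, which is a trade-off of this choice of key.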

In [145]:
df.show(5)

+-------------------+--------------------+--------------------+-----+--------------------+
|       publish_time|               title|            contents|claps|                tags|
+-------------------+--------------------+--------------------+-----+--------------------+
|2018-01-08 00:00:00|12 Question Co-Fo...|When building a b...|   17|[Startup, Retrosp...|
|2018-03-08 00:00:00|12 Startups in 12...|Few days ago duri...|    1|[Startup, Product...|
|2018-03-08 00:00:00|17 Tips & Tricks ...|This is the third...|  202|[Freelancing, Sof...|
|2018-03-27 00:00:00|3 Great Ways in W...|“If you run a bus...|    0|[Web Development,...|
|2018-03-17 00:00:00|4 Things To Keep ...|As you know web d...|    0|[Web Development,...|
+-------------------+--------------------+--------------------+-----+--------------------+
only showing top 5 rows



## Basic Exploratory Data Analysis
1. trends of different high-level topics
2. high-level topic distribution
3. high-level topics vs. claps