## https://nbviewer.jupyter.org/github/WillKoehrsen/Data-Analysis/blob/master/medium/Medium%20Stats%20Analysis.ipynb



# Introduction: Analysis of Medium Stats

In this notebook, we will analyze my Medium article stats. The functions for scraping and formatting the data were developed in the `Development` notebook, and here we will focus on looking at the data quantitatively and visually.

## Instructions

To apply this analysis to your own Medium data:

1. Go to the stats page https://medium.com/me/stats
2. Make sure to scroll all the way down to the bottom so all the articles are loaded
3. Right-click and choose 'Save as'.
4. Save the file as `stats.html` in the `data/` directory. You can also save the responses to do a similar analysis.

![](images/stats-saving-medium.gif)

    # You might need to run this on macOS for multiprocessing to work properly
    # see https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr
    export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

For any of the figures, I recommend opening them in plotly and touching them up. `plotly` is an incredible library and I highly recommend it as a replacement for whatever plotting library you are using.

# Retrieve Statistics

Thanks to a few functions already developed, you can get all of the statistics for your articles in under 10 seconds.

In [1]:
from retrieval import process_in_parallel, get_table_rows

table_rows = get_table_rows(fname='stats.html')

ModuleNotFoundError: No module named 'retrieval'

Each of these entries is a separate article. To get the information about each article, we use the next function. This scrapes both the article metadata and the article itself (using `requests` and `BeautifulSoup`).

In [None]:
df = process_in_parallel(table_rows=table_rows, processes=25)
df.head()
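
The parallel step above can be pictured roughly as follows. This is a minimal sketch, not the actual `process_in_parallel` from the `retrieval` module: `process_row`, `process_rows_in_parallel`, and the row dictionaries are hypothetical stand-ins, and the sketch uses standard-library threads, which may differ from the worker pool the real function uses.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(row):
    """Hypothetical stand-in for scraping one article's metadata and text."""
    # The real function would request the article and parse the response.
    return {'title': row['title'],
            'title_word_count': len(row['title'].split())}


def process_rows_in_parallel(rows, workers=4):
    """Apply process_row to every row concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_row, rows))


# Made-up rows for illustration only
rows = [{'title': 'A first article'}, {'title': 'Another one'}]
results = process_rows_in_parallel(rows)
print(results)
```

Because scraping is mostly waiting on the network, running many rows at once is what keeps the total time short.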

# Analysis

With the comprehensive data, we can do any sort of analysis we want. There's a lot of data here and I'm sure you'll be able to find other interesting things to do with the data.

In [None]:
# Data science imports
import pandas as pd
import numpy as np

%load_ext autoreload
%autoreload 2

# Options for pandas
pd.options.display.max_columns = 25

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

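# `plotly.plotly` moved to the separate `chart_studio` package in plotly 4;
# it is not needed for the offline plots used in this notebook.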
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot

import cufflinks
cufflinks.go_offline()

## Correlations

We can start off by looking at correlations. We'll limit this to the `published` articles for now.

In [None]:
corrs = df[df['type'] == 'published'].corr()
corrs.round(2)

If we are looking at maximizing claps, what do we want to focus on?

In [None]:
corrs['claps'].sort_values(ascending=False)

Okay, so most of the variables that correlate strongly with claps are only known after the article is released. However, the tag `Towards Data Science` seems to help quite a bit! It also looks like the read time is negatively correlated with the number of claps.
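
To make the ranking step concrete, here is a small sketch on a toy DataFrame. The numbers are made up for illustration and are not real article stats; only the column names `type`, `claps`, `fans`, and `read_time` are taken from the notebook.

```python
import pandas as pd

# Made-up numbers purely for illustration; not real article stats.
toy = pd.DataFrame({
    'type': ['published', 'published', 'published', 'published', 'unlisted'],
    'claps': [100, 400, 900, 1600, 50],
    'fans': [10, 45, 80, 170, 5],
    'read_time': [12, 9, 7, 4, 15],
})

# Keep only published rows and numeric columns before computing correlations.
published = toy[toy['type'] == 'published'].select_dtypes('number')

# Pearson correlation of every numeric column with claps, strongest first.
claps_corr = published.corr()['claps'].sort_values(ascending=False)
print(claps_corr.round(2))
```

Selecting the numeric columns first is a design choice of this sketch: it keeps `corr()` from tripping over text columns such as `type`.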

## Correlation Heatmap

Using the `plotly` Python library, we can very rapidly create great-looking interactive charts.

Here are the available colorscales if you want to try others:

    colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
            'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
            'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']

In [None]:
colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
        'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
        'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']

In [None]:
figure = ff.create_annotated_heatmap(z = corrs.round(2).values, 
                                     x =list(corrs.columns), 
                                     y=list(corrs.index), 
                                     colorscale='Portland',
                                     annotation_text=corrs.round(2).values)
iplot(figure)

Correlations by themselves don't tell us that much. It does not help that most of these are pretty obvious: for example, `claps` and `fans` are highly correlated. Sometimes correlations by themselves are useful, but not really in this case.

## Scatterplot Matrix

In [None]:
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'type']],
                                     index = 'type', colormap='Jet', title='Scatterplot Matrix by Type',
                                     diag='histogram', width=800, height=800)
iplot(figure)

In [None]:
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'publication']],
                                     index = 'publication', title='Scatterplot Matrix by Publication',
                                     diag='histogram', width=800, height=800)
iplot(figure)

In [None]:
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'views',
                                         'num_responses', 'publication']],
                                     index = 'publication', 
                                     diag='histogram', 
                                     size=8, width=1000, height=1000,
                                     title='Scatterplot Matrix by Publication')

iplot(figure)

In [None]:
figure = ff.create_scatterplotmatrix(df[['read_time', 'views', 'read_ratio', 'publication']],
                                     index = 'publication', 
                                     diag='histogram', 
                                     size=8, width=1000, height=1000,
                                     title='Scatterplot Matrix by Publication')

iplot(figure)

# Histograms

In [None]:
from visuals import make_hist

In [None]:
figure = make_hist(df, x='views', category='publication')
iplot(figure)

In [None]:
figure = make_hist(df, x='word_count', category='type')
iplot(figure)

In [None]:
figure=make_hist(df, x='claps')
iplot(figure)

# Cumulative Plot

In [None]:
from visuals import make_cum_plot

In [None]:
figure = make_cum_plot(df, y='views')
iplot(figure)

In [None]:
figure = make_cum_plot(df, y='word_count')
iplot(figure)

In [None]:
figure = make_cum_plot(df, y='views', category='publication')
iplot(figure)

In [None]:
figure = make_cum_plot(df, y=['word_count', 'views'])
iplot(figure)

In [None]:
figure = make_cum_plot(df, y=['views', 'reads'])
iplot(figure)
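
For readers wondering what a cumulative plot accumulates, the underlying calculation is probably close to the sketch below. This is an assumption about the idea, not the actual body of `make_cum_plot`; the column name `published_date` is hypothetical and the rows are made up.

```python
import pandas as pd

# Made-up rows; `published_date` is an assumed column name.
toy = pd.DataFrame({
    'published_date': pd.to_datetime(['2018-03-01', '2018-01-15', '2018-02-10']),
    'views': [300, 100, 200],
})

# Order articles chronologically, then keep a running total of the metric.
toy = toy.sort_values('published_date').reset_index(drop=True)
toy['cum_views'] = toy['views'].cumsum()
print(toy)
```

Plotting `cum_views` against `published_date` gives a line that only ever rises, with each step equal to one article's contribution.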

# With Range Slider

The neat part about plotly is that we can easily add more elements to our plots. For example, to add a range selector and a range slider, we just pass one extra parameter, `ranges=True`, to the function.

In [None]:
figure = make_cum_plot(df, 'word_count', ranges=True)
iplot(figure)

In [None]:
figure = make_cum_plot(df, 'read_time', ranges=True)
iplot(figure)
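
Under the hood, a range selector and range slider are just extra entries in the x-axis layout of a plotly figure. The sketch below builds such a layout as a plain dictionary; `range_layout` is a hypothetical helper written for illustration, and how `make_cum_plot` actually wires up `ranges=True` is not shown in this notebook.

```python
def range_layout(title):
    """Build a layout dict with a range selector and slider on a date x-axis."""
    return dict(
        title=title,
        xaxis=dict(
            type='date',
            # Buttons that zoom to the most recent month, half year, or year
            rangeselector=dict(buttons=[
                dict(count=1, label='1m', step='month', stepmode='backward'),
                dict(count=6, label='6m', step='month', stepmode='backward'),
                dict(count=1, label='1y', step='year', stepmode='backward'),
                dict(step='all'),
            ]),
            # Draggable miniature of the whole series under the main plot
            rangeslider=dict(visible=True),
        ),
    )


layout = range_layout('Cumulative Word Count')
print(layout['xaxis'].keys())
```

A dictionary like this can be supplied as the `layout` when constructing a `go.Figure`.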

# Scatter Plots

In [None]:
from visuals import make_scatter_plot

In [None]:
figure = make_scatter_plot(df, x='read_time', y='read_ratio')
iplot(figure)

In [None]:
figure = make_scatter_plot(df, x='read_time', y='read_ratio', category='type')
iplot(figure)

In [None]:
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
                           category='type')
iplot(figure)

In [None]:
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
                           scale='read_ratio', sizeref=0.2)
iplot(figure)

In [None]:
# Bin edges run from 0 to 100 inclusive so read ratios above 90 still get a bin
df['binned_ratio'] = pd.cut(df['read_ratio'], list(range(0, 101, 10))).astype('str')
df['binned_claps'] = pd.cut(df['claps'], list(np.insert(np.logspace(start=0, stop=5, num=6),0,-1).astype(int))).astype(str)
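
The claps binning above packs a few steps into one line, so here is the same idea unpacked on a handful of made-up clap counts. The values in `claps` are illustrative only.

```python
import numpy as np
import pandas as pd

# Bin edges: -1 first, then the powers of ten from 10**0 to 10**5.
edges = np.insert(np.logspace(start=0, stop=5, num=6), 0, -1).astype(int)
print(edges.tolist())

# Each value lands in the interval (left, right] that contains it,
# so starting at -1 gives articles with 0 claps a bin of their own.
claps = pd.Series([0, 7, 250, 12000])
binned = pd.cut(claps, list(edges)).astype(str)
print(binned.tolist())
```

Logarithmic edges are a reasonable choice here because clap counts span several orders of magnitude.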

In [None]:
figure = make_scatter_plot(df, x='word_count', y='fans',
                           scale='claps', sizeref=5)
iplot(figure)

In [None]:
figure = make_scatter_plot(df, x='word_count', y='reads', xlog=True,
                           scale='claps', sizeref=3)
iplot(figure)

# Univariate Linear Regressions

For the linear regressions, we'll focus on articles that were published in Towards Data Science. This makes the relationships clearer because the other articles are a mixed bag. We'll start off using a single variable - univariate - and focusing on linear relationships.

In [None]:
tds = df[df['publication'] == 'Towards Data Science'].copy()
figure = make_scatter_plot(tds, 'word_count', 'views')
iplot(figure)

## Views Regressed by Word Count

Let's regress views on word count for articles published in Towards Data Science. We are using `statsmodels.api.OLS`, which does not add an intercept on its own, so the fitted line is forced through the origin. I made this choice because the number of views can never be negative. Sometimes we do need an intercept, so the `make_linear_regression` helper used later exposes this as the `intercept_0` parameter.

In [None]:
import statsmodels.api as sm

lin_reg=sm.OLS(tds['views'], tds['word_count']).fit()
lin_reg.summary()

This tells us that for every extra word, I get 13 more views! If we look at the plot, there is one outlying data point beyond 5000 words. What happens if I stick to articles under 5000 words published on Towards Data Science?

In [None]:
tds_clean = tds[tds['word_count'] < 5000].copy()
lin_reg = sm.OLS(tds_clean['views'], tds_clean['word_count']).fit()
lin_reg.summary()

Now we see that for every extra word, I get 14 more views! However, it looks like I want to keep my articles under 5000 words (about a 25 minute reading time). 
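
To see what fitting without an intercept means, the sketch below computes the slope by hand on synthetic data and compares it with a general least-squares solve. It uses NumPy rather than `statsmodels`, and the numbers are invented for illustration.

```python
import numpy as np

# Synthetic data for illustration only.
x = np.array([500.0, 1000.0, 2000.0, 4000.0])
y = np.array([6000.0, 13000.0, 27000.0, 51000.0])

# No intercept: minimise sum((y - b*x)**2), which gives b = sum(x*y) / sum(x*x).
slope_origin = np.sum(x * y) / np.sum(x * x)

# The same fit from the general solver, with x as the only column.
slope_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

# Adding a column of ones lets the line have a non-zero intercept.
design = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]

print(slope_origin, slope_lstsq, intercept, slope)
```

In `statsmodels`, the column of ones is what `sm.add_constant` appends to the predictors.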

## Read Ratio Regressed by Reading Time

If we want to fit a model with an intercept, we can use `scipy.stats.linregress`.

In [None]:
figure = make_scatter_plot(tds_clean, 'read_time', 'read_ratio')
iplot(figure)

In [None]:
from scipy import stats
stats.linregress(tds_clean['read_time'], tds_clean['read_ratio'])

This time, we see that for every additional minute of reading time, the percentage of people who read the article declines by 2.3 percentage points. The intercept says that an article with a 0 minute reading time would be read by 53% of people, although no real article sits at that point!
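
Reading a slope and an intercept off a fit works as in the sketch below. The data are synthetic and exactly linear, chosen so the answer is known in advance, and the fit uses `np.polyfit` as a stand-in for `scipy.stats.linregress`; `predict_ratio` is a hypothetical helper.

```python
import numpy as np

# Synthetic, exactly linear data: the ratio falls 2 points per extra minute.
read_time = np.array([2.0, 5.0, 8.0, 12.0, 20.0])
read_ratio = 50.0 - 2.0 * read_time

# With deg=1, polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(read_time, read_ratio, deg=1)


def predict_ratio(minutes):
    """Predicted read ratio for an article of the given reading time."""
    return intercept + slope * minutes


print(slope, intercept, predict_ratio(10))
```

The intercept is simply the prediction at zero minutes, which lies outside the range of the data used for the fit.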

Let's take a look at a few different fits.

In [None]:
from visuals import make_linear_regression

figure, summary = make_linear_regression(tds_clean, x='word_count', y='views', intercept_0=True)
iplot(figure)

In [None]:
summary

In [None]:
tds_clean['read_pct'] = list(tds_clean['read_ratio'])
figure, summary = make_linear_regression(tds_clean, x='read_time', y='read_pct', intercept_0=False)
iplot(figure)

In [None]:
summary

In [None]:
figure, summary = make_linear_regression(tds_clean, x='title_word_count', y='fans', intercept_0=True)
iplot(figure)

In [None]:
summary

This clearly is not the best fit! 

# Univariate Polynomial Regressions

Next, we'll let the degree of the fit increase above 1. Overfitting (especially with limited data) is definitely going to be the outcome, but we'll let this serve as a lesson about having too many parameters in your model! 

In [None]:
from visuals import make_poly_fits

In [None]:
figure, fit_stats = make_poly_fits(tds_clean, x='word_count', y='reads', degree=6)
fit_stats

In [None]:
iplot(figure)

In [None]:
tds_clean['log_views'] = np.log10(tds_clean['views'])
figure, fit_stats = make_poly_fits(tds_clean, x='word_count', y='log_views', degree=15)
fit_stats

In [None]:
iplot(figure)

In [None]:
figure, fit_stats = make_poly_fits(tds_clean, x='title_word_count', y='fans', degree=10)
iplot(figure)
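
The overfitting warning can be checked on synthetic data. The sketch below is not `make_poly_fits`; it fits polynomials of increasing degree with NumPy to a noisy straight line and reports the error on the points used for fitting and on held-out points. All names and numbers are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data: a straight line plus noise, split into two interleaved halves.
x = np.linspace(0, 1, 30)
y = 3 * x + rng.normal(scale=0.3, size=x.size)
train, held_out = np.arange(0, 30, 2), np.arange(1, 30, 2)


def rmse(model, idx):
    """Root mean squared error of the model on the selected points."""
    return float(np.sqrt(np.mean((model(x[idx]) - y[idx]) ** 2)))


errors = {}
for degree in (1, 3, 6, 10):
    model = Polynomial.fit(x[train], y[train], deg=degree)
    errors[degree] = (rmse(model, train), rmse(model, held_out))

for degree, (train_err, held_err) in errors.items():
    print(f'degree={degree:2d}  train RMSE={train_err:.3f}  held-out RMSE={held_err:.3f}')
```

The training error can only go down as the degree rises, which is exactly why it is a poor guide to how well the model generalizes.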

# Multivariate Regressions

Next, we'll consider more independent variables in our model. For this, we need to break out the exceptional Scikit-Learn library. We'll use `linear_model.LinearRegression`, which supports multiple independent variables.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = ['read_time', 'editing_days', 'title_word_count']
x.extend(c for c in df.columns if '<tag>' in c)
x

In [None]:
lin_model = LinearRegression()
lin_model.fit(tds[x],  tds['reads'])

slopes, intercept = lin_model.coef_, lin_model.intercept_
fit = lin_model.predict(tds[x])
r2 = lin_model.score(tds[x], tds['reads'])
rmse = np.sqrt(mean_squared_error(y_true=tds['reads'], y_pred=fit))

In [None]:
for p, s in zip(x, slopes):
    print(f'Independent Variable: {p.replace("_", " ").title():25} Slope: {s:.2f}')

print(f'Intercept: {intercept:.2f}')
print(f'\nCoefficient of Determination: {r2:.2f}')
print(f'RMSE: {rmse:.2f}')

We can see that some variables contribute positively to the number of reads, while others decrease the number of reads! Evidently, I should decrease the reading time, not use the tag education, and use the tags Towards Data Science and Python.
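
The slopes, intercept, coefficient of determination, and RMSE reported above can all be reproduced by hand. The sketch below does so with `np.linalg.lstsq` instead of Scikit-Learn, on synthetic data generated from a known relationship so the expected answer is clear. The variables `a` and `b` are hypothetical predictors.

```python
import numpy as np

# Synthetic data generated from a known relationship: y = 5 + 3*a - 2*b.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 5 + 3 * a - 2 * b

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(a), a, b])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slopes = coefs[0], coefs[1:]

# Goodness of fit: RMSE and the coefficient of determination.
fit = X @ coefs
rmse = np.sqrt(np.mean((y - fit) ** 2))
r2 = 1 - np.sum((y - fit) ** 2) / np.sum((y - y.mean()) ** 2)

print(intercept, slopes, r2, rmse)
```

Each slope is the change in the target for a one-unit change in that variable with the others held fixed, which is how the tag coefficients above should be read.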

In [None]:
figure, summary = make_linear_regression(tds, x=x, y='reads', intercept_0=False)
iplot(figure)

In [None]:
summary

In [None]:
figure, summary = make_linear_regression(tds, x=x, y='fans', intercept_0=False)
iplot(figure)

In [None]:
summary

# Extrapolations

The most fun part of this is extrapolating wildly into the future! Using the past stats, we can make estimates for the future based on the number of days since publishing.

In [None]:
from visuals import make_extrapolation

In [None]:
figure, future_df = make_extrapolation(tds, y='reads', years=1.5, degree=3)
iplot(figure)

In [None]:
figure, future_df = make_extrapolation(df, y='word_count', years=2.5, degree=3)
iplot(figure)

In [None]:
figure, future_df = make_extrapolation(df, 'read_time', years=1, degree=3)
iplot(figure)
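
One plausible reading of these extrapolations is a polynomial fitted to a running total against days elapsed and then evaluated past the last observation. The sketch below illustrates that idea with NumPy on synthetic data; it is an assumption about the approach, not the actual body of `make_extrapolation`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic history: days since first publication and a cumulative total.
days = np.arange(0, 301, 30, dtype=float)
total = 0.001 * days ** 3 + 2 * days

# Fit a cubic to the history, then evaluate it beyond the last observed day.
model = Polynomial.fit(days, total, deg=3)
future_days = np.array([330.0, 365.0, 400.0])
forecast = model(future_days)

print(forecast)
```

Because a cubic keeps curving outside the fitted range, forecasts like these drift quickly and are best treated as entertainment rather than prediction.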

# Conclusions

Well, that's about all I have! There is a lot of additional analysis that could be done here, and going forward, I'll be further developing these functions and trying to extract more information. Feel free to use these functions on your own articles, and of course, contribute as needed! Developing this library has been enjoyable, and I look forward to expanding it so any suggestions are welcome and appreciated.