# Introduction: Linear Regression with Medium Articles

In this notebook, we'll perform some basic linear regression on data from my Medium articles. This is a continuation of the earlier data analysis of those articles.

In [None]:
# Data science imports
import pandas as pd
import numpy as np

from scipy import stats

# Options for pandas
pd.options.display.max_columns = 20

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Interactive plotting (offline mode)
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()

%load_ext autoreload
%autoreload 2

from timeit import default_timer as timer

from collections import Counter, defaultdict
from itertools import chain

from bs4 import BeautifulSoup
import re

import requests
from multiprocessing import Pool

In [None]:
from utils import process_in_parallel, get_links, make_iplot

In [None]:
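# Parse the saved page; name the parser explicitly and close the file when done
with open('data/published.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')
soup.text[:100]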

In [None]:
links = get_links(soup)

In [None]:
# Process every link in parallel, then separate responses from full articles
data = process_in_parallel(links)
responses = data[data['response'] == 'response'].copy()
articles = data[data['response'] == 'article'].copy()
responses.head()

# Linear Regression

In [None]:
# Ordinary least-squares fit of claps (y) on word count (x)
regression = stats.linregress(x=articles['word_count'], y=articles['claps'])
slope = regression.slope
intercept = regression.intercept
rvalue = regression.rvalue

regression
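
For reference, the slope, intercept, and r-value reported above can be reproduced by hand. The sketch below is a minimal, standard-library-only illustration of a least-squares fit with one predictor. The helper name `simple_linregress` and the toy numbers are invented for illustration and are not the article data.

```python
from math import sqrt


def simple_linregress(x, y):
    """Least-squares fit of y = slope * x + intercept for one predictor.

    Hypothetical helper for illustration; returns (slope, intercept, rvalue).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations from the means
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    rvalue = ss_xy / sqrt(ss_xx * ss_yy)  # Pearson correlation coefficient
    return slope, intercept, rvalue


# Toy data (invented): word counts and claps for five imaginary articles
toy_words = [500, 1000, 1500, 2000, 2500]
toy_claps = [120, 90, 300, 150, 400]

toy_slope, toy_intercept, toy_r = simple_linregress(toy_words, toy_claps)
print(f"slope={toy_slope:.4f}, intercept={toy_intercept:.1f}, r^2={toy_r ** 2:.3f}")
```

The squared r-value is the fraction of the variance in `y` that the fitted line accounts for, so a value near 0 indicates little linear relationship and a value near 1 indicates a strong one.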

In [None]:
figure = make_iplot(articles, x = 'word_count', y = 'claps', base_title='Claps vs Word Count')
iplot(figure)

In [None]:
figure = make_iplot(data, x = 'word_count', y = 'claps', base_title='Claps vs Word Count (Articles and Responses)')
iplot(figure)

In [None]:
figure = make_iplot(articles, x = 'read_time', y = 'word_count', base_title='Word Count vs Read Time')
iplot(figure)

# Time Since Start Comparisons

In [None]:
# Days elapsed since the first article was published
seconds_per_day = 60 * 60 * 24
articles['time_since_start'] = (
    (articles['time_published'] - articles['time_published'].min()).dt.total_seconds()
    / seconds_per_day
)
figure = make_iplot(articles, x = 'time_since_start', y = 'word_count', 
                    base_title='Word Count vs Time Since Start', eq_pos=(0.5, 0.75))
iplot(figure)
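
The `time_since_start` conversion above divides elapsed seconds by the number of seconds in a day. As a small sketch of the same arithmetic, the example below uses only the standard library's `datetime`; the timestamps are invented and the variable names are illustrative.

```python
from datetime import datetime

SECONDS_PER_DAY = 60 * 60 * 24

# Invented publication timestamps, not taken from the article data
published = [
    datetime(2018, 1, 1, 12, 0),
    datetime(2018, 1, 2, 0, 0),
    datetime(2018, 1, 11, 12, 0),
]

start = min(published)
# Fractional days elapsed since the earliest publication
days_since_start = [
    (t - start).total_seconds() / SECONDS_PER_DAY for t in published
]
print(days_since_start)
```

Going through `total_seconds()` keeps the fractional part of each day, whereas a timedelta's `days` attribute would truncate the 12-hour gap in this example to 0.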

In [None]:
figure = make_iplot(articles, x = 'time_since_start', y = 'claps', 
                    base_title='Claps vs Time Since Start', eq_pos=(0.5, 0.75))
iplot(figure)

In [None]:
figure = make_iplot(articles, x = '<tag>Towards Data Science', y = 'claps', 
                    base_title='Claps vs Tag Towards Data Science', eq_pos=(0.5, 0.75))
iplot(figure)
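
When the predictor is a 0/1 indicator, the fitted line has a simple reading: the intercept is the mean of the untagged group and the slope is the difference between the two group means. The sketch below illustrates this, assuming the tag column is coded as 0/1 (the notebook does not show its values). The helper `ols_slope_intercept` and the numbers are invented for illustration.

```python
def ols_slope_intercept(x, y):
    """Hypothetical helper: least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    return slope, mean_y - slope * mean_x


# Toy data (invented): 1 means the article carries the tag, 0 means it does not
has_tag = [0, 0, 0, 1, 1]
toy_claps = [100, 200, 300, 500, 700]

tag_slope, tag_intercept = ols_slope_intercept(has_tag, toy_claps)

# Group means computed directly for comparison
tagged = [c for t, c in zip(has_tag, toy_claps) if t == 1]
untagged = [c for t, c in zip(has_tag, toy_claps) if t == 0]
mean_tagged = sum(tagged) / len(tagged)
mean_untagged = sum(untagged) / len(untagged)

print(tag_slope, mean_tagged - mean_untagged)  # slope vs difference in means
print(tag_intercept, mean_untagged)            # intercept vs untagged mean
```

A positive slope here describes an association between the tag and claps; on its own it does not show that adding the tag causes more claps.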

# Conclusions

In this notebook, we ran simple linear regressions on my Medium article data. Few pairs of variables show a linear relationship; the clear exception is reading time versus word count. The number of claps does not appear to be linearly related to any other variable, although articles tagged Towards Data Science seem to receive more claps.