### Web Scraping Art Articles from Medium.com
Walking the branches of the DOM and extracting the relevant tags.

## Part 0: Check the robots.txt
Before scraping, check which paths the site allows crawlers to access: https://medium.com/robots.txt
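
This check can also be done in code. The sketch below uses the standard-library `urllib.robotparser` and parses rules supplied as text, so it runs offline. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are made up for illustration; they are not Medium's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (NOT Medium's real robots.txt)
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://example.com/topic/art"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/drafts"))
```

For the live file, `RobotFileParser` can instead be pointed at the robots.txt URL with `set_url(...)` and loaded with `read()`, which fetches it over the network.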

## Part 1: Sample - looking at a single page
Basic page access and scraping of the top of the art topic page, without scrolling. Only about 10 articles are present in the initial HTML, but their links could be retrieved.


In [18]:
#imports
import re
import requests
from bs4 import BeautifulSoup
import pandas

#the url topic (root url)
url = "https://medium.com/topic/art"

#requesting the url to get access to the page
r = requests.get(url)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

Inspecting a Medium article with the browser's developer tools shows that certain tags appear consistently across articles:

- `<div class="n p">`: the whole article (title and paragraphs)
- `<div class="o n">`: the title specifically

In [19]:
#look at the html in nice format
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <script>
   !function(c,f){var t,o,i,e=[],r={passive:!0,capture:!0},n=new Date,a="pointerup",u="pointercancel";function p(n,e){t||(t=e,o=n,i=new Date,w(f),s())}function s(){0<=o&&o<i-n&&(e.forEach(function(n){n(o,t)}),e=[])}function l(n){if(n.cancelable){var e=(1e12<n.timeStamp?new Date:performance.now())-n.timeStamp;"pointerdown"==n.type?function(n,e){function t(){p(n,e),i()}function o(){i()}function i(){f(a,t,r),f(u,o,r)}c(a,t,r),c(u,o,r)}(e,n):p(e,n)}}function w(e){["click","mousedown","keydown","touchstart","pointerdown"].forEach(function(n){e(n,l,r)})}w(c),self.perfMetrics=self.perfMetrics||{},self.perfMetrics.onFirstInputDelay=function(n){e.push(n),s()}}(addEventListener,removeEventListener)
  </script>
  <title data-rh="true">
   Art - Medium
  </title>
  <meta charset="utf-8" data-rh="true"/>
  <meta content="width=device-width,minimum-scale=1,initial-scale=1" data-rh="true" name="viewport"/>
  <meta content="#000000" data-rh="true" na

In [20]:
#example of a link I want to capture
'<a href="https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1"'

#regex pattern to capture the href value (the entire link, up to the topic_page suffix)
#raw string so that \d reaches the regex engine unchanged
pattern = r'href="(.{5,100}source=topic_page-+\d-+\d)'
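
To make the pattern concrete, here is a small self-contained check that applies it to the example link from this cell. The surrounding HTML string, including the extra `/about` link, is constructed here for illustration.

```python
import re

# Same pattern as in the cell above, written as a raw string
pattern = r'href="(.{5,100}source=topic_page-+\d-+\d)'

# The example link from the cell above
link = ("https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da"
        "?source=topic_page---------6------------------1")

# Illustrative HTML: the wanted link plus an unrelated link without the topic_page suffix
html = f'<a href="{link}"></a><a href="/about"></a>'

# findall returns only the captured group, i.e. the href value without the quotes
matches = re.findall(pattern, html)
print(matches)
```

Only the link that carries the `source=topic_page` suffix is captured; the `/about` link is skipped because the pattern requires that suffix.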

In [21]:
#find all the links in the page source using the regex
result = re.findall(pattern, str(soup))
result[:10]

['/dave-mann/review-david-koloane-retrospective-at-standard-bank-195ab7bbd514?source=topic_page---------0------------------1',
 '/dave-mann/review-david-koloane-retrospective-at-standard-bank-195ab7bbd514?source=topic_page---------0------------------1',
 '/dave-mann/review-david-koloane-retrospective-at-standard-bank-195ab7bbd514?source=topic_page---------0------------------1',
 '/@david_mann92?source=topic_page---------0------------------1',
 '/dave-mann?source=topic_page---------0------------------1',
 '/dave-mann/review-david-koloane-retrospective-at-standard-bank-195ab7bbd514?source=topic_page---------0------------------1',
 '/traveling-through-history/the-abstract-art-of-josef-albers-d4840e06a6ad?source=topic_page---------1------------------1',
 '/traveling-through-history/the-abstract-art-of-josef-albers-d4840e06a6ad?source=topic_page---------1------------------1',
 '/traveling-through-history/the-abstract-art-of-josef-albers-d4840e06a6ad?source=topic_page---------1--------------

###### references:
1. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho
2. https://stackoverflow.com/questions/37207959/how-to-scrape-all-contents-from-infinite-scroll-website-scrapy
3. https://stackoverflow.com/questions/42478591/python-selenium-chrome-webdriver
4. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho (used for page scrolling)
5. https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

## Part 2: Using Selenium to Scroll and Get all the URLs at once 
Selenium opens a new browser window that scrolls down the page automatically. The developer tools showed that scrolling triggers "POST" calls but no "GET" calls, so the additional articles could not be fetched with plain GET requests. We therefore drive a real browser and scroll instead.

In [1]:
import selenium
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import time

In [16]:
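#opens a new browser to scroll down automatically
#note: executable_path is the Selenium 3 way of passing the driver path; Selenium 4 uses a Service object instead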
browser = webdriver.Chrome(executable_path = r"C:\Users\jesse\Downloads\chromedriver_win32\chromedriver.exe")
browser.get("https://medium.com/topic/politics")

In [None]:
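#code based on Stack Overflow reference #4
from selenium.webdriver.common.by import By

time.sleep(1)
#find_element(By.TAG_NAME, ...) replaces the deprecated find_element_by_tag_name
elem = browser.find_element(By.TAG_NAME, "body")
#upper bound on PAGE_DOWN presses (about 28 hours at one press per second)
pagedowns = 100000
while pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    pagedowns -= 1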
    
#pages = browser.find_elements_by_class_nam
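
A fixed count of 100000 page-downs has no stopping condition tied to the page itself. One possible alternative, sketched here as an assumption rather than what this notebook ran, is to stop once the page height has not grown for a few consecutive scrolls. The function name `scroll_until_stable` and its parameters are hypothetical; it takes two callables so the loop logic can be exercised without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_once, max_rounds=1000, patience=3, pause=1.0):
    """Scroll until the height stops growing for `patience` rounds in a row.

    get_height:  callable returning the current page height
    scroll_once: callable that performs one scroll step
    Returns the number of scroll steps performed.
    """
    last_height = get_height()
    stalled = 0
    rounds = 0
    while rounds < max_rounds and stalled < patience:
        scroll_once()
        time.sleep(pause)  # give the page time to load new content
        rounds += 1
        new_height = get_height()
        if new_height > last_height:
            last_height = new_height
            stalled = 0  # progress was made, reset the counter
        else:
            stalled += 1
    return rounds


# Possible wiring with the objects from this notebook (untested assumption):
#   scroll_until_stable(
#       get_height=lambda: browser.execute_script("return document.body.scrollHeight"),
#       scroll_once=lambda: elem.send_keys(Keys.PAGE_DOWN),
#   )
```

Because a single PAGE_DOWN may not reach the bottom of the page, `patience` would need to be large enough to cover several presses without growth; the right value depends on the page and is not something the notebook establishes.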

In [27]:
#get the full HTML of the scrolled page as a string
pages = browser.page_source

## Part 3: Processing
Using the regex tested above, extract the links from the browser's page source, then drop duplicates and clean the links.

In [29]:
import re

#regex pattern to capture the href values; \d{1,5} allows article indexes of up to five digits
pattern = r'href="(.{5,100}source=topic_page-+\d{1,5}-+\d)'

#get all the links that match the pattern
result = re.findall(pattern, pages)
result[-5:]

['https://gen.medium.com/does-cutting-u-s-aid-help-or-hurt-central-america-55db640f2add?source=topic_page---------5109------------------1',
 'https://gen.medium.com/does-cutting-u-s-aid-help-or-hurt-central-america-55db640f2add?source=topic_page---------5109------------------1',
 '/@johnbwashington?source=topic_page---------5109------------------1',
 'https://gen.medium.com/?source=topic_page---------5109------------------1',
 'https://gen.medium.com/does-cutting-u-s-aid-help-or-hurt-central-america-55db640f2add?source=topic_page---------5109------------------1']
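
The difference between this pattern and the one from Part 1 is the `\d{1,5}` for the article index. The short check below, using a link copied from the output above, illustrates why the single `\d` is not enough once the scrolled page contains indexes such as 5109.

```python
import re

part1_pattern = r'href="(.{5,100}source=topic_page-+\d-+\d)'
part3_pattern = r'href="(.{5,100}source=topic_page-+\d{1,5}-+\d)'

# A link from the output above whose article index (5109) has four digits
link = "/@johnbwashington?source=topic_page---------5109------------------1"
html = f'<a href="{link}">'

# The Part 1 pattern expects exactly one digit between the two runs of dashes
single_digit_matches = re.findall(part1_pattern, html)
# The Part 3 pattern accepts one to five digits there
multi_digit_matches = re.findall(part3_pattern, html)

print(single_digit_matches)
print(multi_digit_matches)
```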

In [None]:
#put the links into a Series to process
html_links = pd.Series(result)

#export to csv for safekeeping (header=False so it can be read back with header=None)
html_links.to_csv("htmls_politics.csv", header=False)

#drop all the duplicates
htmls = html_links.drop_duplicates().reset_index(drop=True)

In [32]:
#a look at some of the links... some are relative paths missing the "https://medium.com" prefix
htmls.head()

0    https://arcdigital.media/trumps-trade-war-is-k...
1    /@maxburnswrites?source=topic_page---------0--...
2    https://arcdigital.media/?source=topic_page---...
3    /@fnfwriter?source=topic_page---------1-------...
4    /politically-speaking?source=topic_page-------...
dtype: object

In [36]:
topic = pd.read_csv("htmls_politics.csv", index_col = 0, header = None)

In [70]:
#drop all the duplicates
htmls = topic.drop_duplicates().reset_index(drop=True)
#just get the url column
htmls = htmls[1]

In [72]:
#links with only a short path before "?source=topic" are author profile or publication home pages, not articles, so filter them out
clean_htmls = htmls[~htmls.str.contains(r"^.{2,30}\?source=topic")]
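
To see what this filter removes, the sketch below applies the same expression with plain `re` to three links copied from the outputs in this notebook. It uses a list comprehension instead of pandas only to stay self-contained.

```python
import re

# Same expression passed to str.contains above
short_link = re.compile(r"^.{2,30}\?source=topic")

# Links copied from the outputs in this notebook
links = [
    "/@johnbwashington?source=topic_page---------5109------------------1",
    "https://gen.medium.com/?source=topic_page---------5109------------------1",
    "https://gen.medium.com/does-cutting-u-s-aid-help-or-hurt-central-america-55db640f2add"
    "?source=topic_page---------5109------------------1",
]

# Keep only links with more than 30 characters before "?source=topic"
kept = [link for link in links if not short_link.search(link)]
print(kept)
```

The author profile and the publication home page are dropped, and the article link is kept. Note that the cutoff is purely length-based: an article whose whole path is 30 characters or shorter (a hypothetical `/@bob/art-cb7ebf24a6da`, for instance) would be dropped as well.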

In [73]:
### split the links by whether they already include https:// ###

#links with https already included (custom domains, no user in url)
with_http = clean_htmls[clean_htmls.str.contains("https://")].reset_index(drop=True)

#links without https (relative paths on medium.com)
without_http = clean_htmls[~clean_htmls.str.contains("https://")].reset_index(drop=True)

In [75]:
#adding medium before in order to get the full url
urls = "https://medium.com" + without_http
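
As an alternative to splitting the links and prepending the domain by hand, the standard-library `urllib.parse.urljoin` resolves relative paths against a base and leaves absolute URLs untouched. This is a swapped-in technique shown as a sketch, not what the notebook does; the two links are copied from earlier outputs.

```python
from urllib.parse import urljoin

BASE = "https://medium.com"

links = [
    # relative path: needs the medium.com prefix
    "/dave-mann/review-david-koloane-retrospective-at-standard-bank-195ab7bbd514"
    "?source=topic_page---------0------------------1",
    # already absolute (custom domain): should stay as it is
    "https://gen.medium.com/does-cutting-u-s-aid-help-or-hurt-central-america-55db640f2add"
    "?source=topic_page---------5109------------------1",
]

# urljoin handles both cases in one pass, so no with/without split is needed
full_urls = [urljoin(BASE, link) for link in links]
for url in full_urls:
    print(url)
```

With a pandas Series the same function could be applied through `Series.apply`, at the cost of being slower than the vectorized string concatenation used above.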

In [77]:
#see how many links
len(with_http), len(without_http)

(218, 1466)

Spot-checking that the rebuilt URLs look well-formed:

In [78]:
urls.head()

0    https://medium.com/dave-mann/review-david-kolo...
1    https://medium.com/traveling-through-history/t...
2    https://medium.com/@elizabethswebster/whitewas...
3    https://medium.com/@lizadonnelly/drawing-and-r...
4    https://medium.com/@lizadonnelly/trumps-patrio...
Name: 1, dtype: object

In [79]:
with_http.head()

0    https://arcdigital.media/wall-e-with-a-paintbr...
1    https://eidolon.pub/the-green-fiasco-in-contex...
2    https://psiloveyou.xyz/remembering-the-terribl...
3    https://curiosityneverkilledthewriter.com/a-la...
4    https://curiosityneverkilledthewriter.com/?sou...
Name: 1, dtype: object

In [82]:
#combine all the urls together
pd.concat([urls, with_http], ignore_index = True).tail()

1679    https://magenta.as/legendary-cartoonist-ben-ka...
1680    https://timeline.com/hannah-wilke-labial-art-9...
1681    https://artplusmarketing.com/kathy-griffins-ar...
1682    https://medium.muz.li/why-gradients-are-the-ne...
1683    https://brightthemag.com/a-tale-of-two-artists...
Name: 1, dtype: object

In [83]:
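#save the cleaned links (full URLs); the combined set is saved so the links that already had https:// are included
pd.concat([urls, with_http], ignore_index=True).to_csv("cleaned_htmls.csv")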

  
