### Webscraping Art Articles from Medium.com
Going into the branches of the DOM and getting relevant tags

#### references:
https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

### Check the robots.txt
to see what you can scrape at: https://medium.com/robots.txt
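
A minimal sketch of checking a URL against robots.txt rules with the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are hypothetical and only there so the example runs offline; they are not Medium's actual rules, and `make_parser` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Medium's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def make_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = make_parser(SAMPLE_ROBOTS_TXT)
topic_allowed = parser.can_fetch("*", "https://medium.com/topic/art")
private_allowed = parser.can_fetch("*", "https://medium.com/private/page")
print(topic_allowed, private_allowed)
```

Against the live site, calling `parser.set_url("https://medium.com/robots.txt")` followed by `parser.read()` on a fresh `RobotFileParser` should load the real rules instead of the sample text.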

## Part 1: Sample - looking at a single page

In [1]:
#imports
import requests
from bs4 import BeautifulSoup
import pandas
import re

#the url topic (root url)
url = "https://medium.com/topic/art"

#requesting the url to get access to the page
r = requests.get(url)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

Inspecting a Medium article with the browser's developer tools shows that certain tags appear consistently within an article:

- `<div class="n p">`: everything (title and paragraphs)
- `<div class="o n">`: the title specifically

In [2]:
#look at the html in nice format
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <script>
   !function(c,f){var t,o,i,e=[],r={passive:!0,capture:!0},n=new Date,a="pointerup",u="pointercancel";function p(n,e){t||(t=e,o=n,i=new Date,w(f),s())}function s(){0<=o&&o<i-n&&(e.forEach(function(n){n(o,t)}),e=[])}function l(n){if(n.cancelable){var e=(1e12<n.timeStamp?new Date:performance.now())-n.timeStamp;"pointerdown"==n.type?function(n,e){function t(){p(n,e),i()}function o(){i()}function i(){f(a,t,r),f(u,o,r)}c(a,t,r),c(u,o,r)}(e,n):p(e,n)}}function w(e){["click","mousedown","keydown","touchstart","pointerdown"].forEach(function(n){e(n,l,r)})}w(c),self.perfMetrics=self.perfMetrics||{},self.perfMetrics.onFirstInputDelay=function(n){e.push(n),s()}}(addEventListener,removeEventListener)
  </script>
  <title data-rh="true">
   Art - Medium
  </title>
  <meta charset="utf-8" data-rh="true"/>
  <meta content="width=device-width,minimum-scale=1,initial-scale=1" data-rh="true" name="viewport"/>
  <meta content="#000000" data-rh="true" na

In [3]:
#example of link I want to get
'<a href="https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1"'

#regex pattern to capture each link (want the entire href value)
#raw string so that \d is not treated as a string escape
pattern = r'href="(.{5,100}source=topic_page-+\d-+\d)'

In [209]:
#find all the links in the page source using regex
result = re.findall(pattern, str(soup))
result[:10]

['/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/@georgeannsack?source=topic_page---------0------------------1',
 '/awake-alive-mind?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/@juliettefgreene?source=topic_page---------1---------


for link in soup.find_all("a", href = True):
    print(link.get("href"))
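
The regex results above are mostly relative paths, and the same link shows up several times. A small sketch, assuming `https://medium.com` is the right base for the relative paths, that turns them into absolute URLs and drops repeats while keeping the original order. `absolute_unique_links` is a hypothetical helper name, and the sample list is copied from the output and the example link above.

```python
from urllib.parse import urljoin

BASE_URL = "https://medium.com"  # assumption: relative hrefs belong to medium.com

def absolute_unique_links(hrefs, base=BASE_URL):
    """Make hrefs absolute and drop repeats, keeping first-seen order."""
    absolute = (urljoin(base, href) for href in hrefs)
    # dict keys keep insertion order, so this removes duplicates without sorting
    return list(dict.fromkeys(absolute))

sample = [
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1",
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1",
    "/@georgeannsack?source=topic_page---------0------------------1",
    "https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1",
]
links = absolute_unique_links(sample)
for link in links:
    print(link)
```

Links that are already absolute, such as the psiloveyou.xyz one, pass through `urljoin` unchanged.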

##### reference: 
1. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho
2. https://stackoverflow.com/questions/37207959/how-to-scrape-all-contents-from-infinite-scroll-website-scrapy
3. https://stackoverflow.com/questions/42478591/python-selenium-chrome-webdriver
4. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho (used for page scrolling)

## Part 2: Using Selenium to Scroll and Get all the URLs at once 

In [4]:
import selenium
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import time

In [136]:
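#Selenium 4 takes the driver path through a Service object (the executable_path argument was removed)
from selenium.webdriver.chrome.service import Service

browser = webdriver.Chrome(service=Service(r"C:\Users\jesse\Downloads\chromedriver_win32\chromedriver.exe"))
browser.get("https://medium.com/topic/art")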

In [None]:
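#code based off of stackoverflow code #4
#find_element_by_tag_name was removed in Selenium 4, so use find_element with By
from selenium.webdriver.common.by import By

time.sleep(1)
elem = browser.find_element(By.TAG_NAME, "body")
#press PAGE_DOWN 500 times, pausing so the page can load more articles
pagedowns = 500
while pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    pagedowns-=1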
    
#pages = browser.find_elements_by_class_nam
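
As an alternative to a fixed number of page-downs, a sketch that keeps scrolling until the page height stops growing. This is a swapped-in technique, not the code from the StackOverflow answer. `scroll_to_bottom` is a hypothetical helper; it only relies on the driver's `execute_script` method, and the `pause` and `max_rounds` defaults are guesses that mirror the values in the cell above.

```python
import time

def scroll_to_bottom(driver, pause=0.5, max_rounds=500):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more articles
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded on this round
        last_height = new_height
    return max_rounds
```

It would be called as `scroll_to_bottom(browser)`. One caveat: on a slow connection the height can stay the same for a single pause even though more content is coming, so this may stop early; a longer `pause` makes that less likely.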

In [167]:
#get the full html of the scrolled page
pages = browser.page_source

In [168]:
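#regex pattern to capture each link (the position number can now have up to 5 digits)
#raw string so that \d is not treated as a string escape
pattern = r'href="(.{5,100}source=topic_page-+\d{1,5}-+\d)'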

#get all the html links
result = re.findall(pattern, pages)
result

In [176]:
#put into a series to process
html_links = pd.Series(result)

#export to csv for safekeeping
html_links.to_csv("htmls_art.csv")

In [181]:
#drop all the duplicates
htmls = html_links.drop_duplicates().reset_index(drop=True)

In [183]:
htmls.head()

0    /awake-alive-mind/the-tension-of-expectation-5...
1    /@georgeannsack?source=topic_page---------0---...
2    /awake-alive-mind?source=topic_page---------0-...
3    /swlh/what-its-like-to-hear-the-world-in-color...
4    /@juliettefgreene?source=topic_page---------1-...
dtype: object

In [196]:
#the pattern matches links with only 2-30 characters before "?source=topic"
#(author and publication pages), so this keeps just the longer article links
clean_htmls = htmls[~htmls.str.contains(r"^.{2,30}\?source=topic")]

In [211]:
#getting the topic number in the string (since each one should be 
#different, to indicate another article)
clean_htmls.str.extract(r"--+(\d{1,4})-+").head()

   0
0  0
3  1
6  2
8  0
9  1


### Webscraping Practice on Infinite Scrolling Pages
reference: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016 (which also says it's OK to scrape this site)

Scraping from http://spidyquotes.herokuapp.com/, a website to scrape quotes from

In [125]:
#imports
import requests
from bs4 import BeautifulSoup

#the url topic (root url)
url = "http://spidyquotes.herokuapp.com/scroll"

#requesting the url to get access to the page
r = requests.get(url)
print(r)

#parsing in the information
soup = BeautifulSoup(r.content, "lxml")

<Response [200]>


## Part 3: Looking at the Webscraped htmls
Importing the saved links back in and cleaning them up so they are ready for looping

In [5]:
import pandas as pd


In [21]:
art = pd.read_csv("htmls_art.csv", index_col = 0, header = None)

In [22]:
art.head()

                                                   1
0                                                   
0  /awake-alive-mind/the-tension-of-expectation-5...
1  /awake-alive-mind/the-tension-of-expectation-5...
2  /awake-alive-mind/the-tension-of-expectation-5...
3  /@georgeannsack?source=topic_page---------0---...
4  /awake-alive-mind?source=topic_page---------0-...


In [23]:
#drop all the duplicates
htmls = art.drop_duplicates().reset_index(drop=True)

In [27]:
htmls = htmls[1]

In [28]:
#the pattern matches links with only 2-30 characters before "?source=topic"
#(author and publication pages), so this keeps just the longer article links
clean_htmls = htmls[~htmls.str.contains(r"^.{2,30}\?source=topic")]

In [29]:
#getting the topic number in the string (since each one should be 
#different, to indicate another article)
clean_htmls.str.extract(r"--+(\d{1,4})-+").head()

   0
0  0
3  1
6  2
8  0
9  1
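
The length-based filter works on the sample, but it depends on profile links being short. A sketch of a different heuristic in plain Python (not pandas): it assumes, based only on the sample links in this notebook, that article paths end in a hyphen followed by a 12-character hex id. `article_positions` and both regex names are hypothetical.

```python
import re

# Assumption from the sample links: article paths end in "-<12 hex characters>"
ARTICLE_RE = re.compile(r"-[0-9a-f]{12}$")
POSITION_RE = re.compile(r"source=topic_page-+(\d+)-+")

def article_positions(links):
    """Map each article path (query string removed) to its topic_page position."""
    articles = {}
    for link in links:
        path, _, query = link.partition("?")
        if not ARTICLE_RE.search(path):
            continue  # author or publication page, not an article
        match = POSITION_RE.search(query)
        if path not in articles:  # keep the first occurrence only
            articles[path] = int(match.group(1)) if match else None
    return articles

sample = [
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1",
    "/@georgeannsack?source=topic_page---------0------------------1",
    "/awake-alive-mind?source=topic_page---------0------------------1",
    "/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1",
]
positions = article_positions(sample)
print(positions)
```

Dropping the query string also gives a clean path to deduplicate on, since the same article can appear with different `source=` suffixes.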


## EXTRA: Getting the paragraphs from a single article


In [1]:
url = "https://medium.com/@insensified/a-handprint-against-the-sunset-e0fb404d690a?source=topic_page---------0------------------1"

In [2]:
#imports
import requests
from bs4 import BeautifulSoup

#requesting the url to get access to the page
r = requests.get(url)
print(r)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

<Response [200]>


In [3]:
#find the paragraphs in an article
paragraphs = soup.find_all('p')

In [4]:
one_big_paragraph = str(paragraphs)

In [5]:
one_big_paragraph[:2000]

'[<p class="gx gy ar bz gz b ha hb hc hd he hf hg hh hi hj hk" id="7b05">In the last room of a retrospective, you always think of death. These two exhibitions — Claude Monet and David Park, in adjacent museums in Fort Worth — were no exception, but it was as a duet that the two shows sung me the saddest requiem.</p>, <p class="gx gy ar bz gz b ha hb hc hd he hf hg hh hi hj hk" id="e1c5">Both artists completed their most ambitious and monumental work near the end of their lives. The shows’ penultimate rooms showcase this ambition, plunging me into oceanic visions: Monet’s lake-sized processions of lilies and Park’s brushstroke-sculpted, nearly life-size bathers.</p>, <p class="gx gy ar bz gz b ha hb hc hd he hf hg hh hi hj hk" id="43d6">Then, in the last rooms, the works shrink to painfully modest proportions — as if to bow before the exit, to stoop before the final gate.</p>, <p class="gx gy ar bz gz b ha hb hc hd he hf hg hh hi hj hk" id="f920">Park, diagnosed with terminal cancer in 
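
Calling `str()` on the result of `find_all` keeps all of the `<p>` markup, as the output above shows. A sketch of pulling out only the text with `get_text`, using a shortened sample of the paragraphs above so that it runs offline. It uses Python's built-in `html.parser` instead of `html5lib` purely to avoid an extra dependency; that choice is an assumption, not part of the original cells.

```python
from bs4 import BeautifulSoup

# Shortened sample of the article paragraphs, for illustration only
sample_html = """
<article>
  <p class="gx gy" id="7b05">In the last room of a retrospective, you always think of death.</p>
  <p class="gx gy" id="e1c5">Both artists completed their most ambitious and monumental work near the end of their lives.</p>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text drops the tags and attributes, leaving only the visible text
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p")]

# join the paragraphs into one string, separated by blank lines
article_text = "\n\n".join(paragraph_texts)
print(article_text)
```

Keeping the paragraphs as a list until the final join makes it easy to filter out short or empty paragraphs before building the article text.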

Harsha figure it out! Look at his folder.

***

## Part 1: Web Scraping Users:
can automate getting a user's followers

In [27]:
#practice url:

url = 'https://medium.com/_/api/users/ec92ba75ef27/profile/stream?limit=8&to=1135f5736c0d&source=followers&page=4'

In [28]:
#imports
import requests
from bs4 import BeautifulSoup

#requesting the url to get access to the page
r = requests.get(url)
print(r)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

<Response [200]>


In [39]:
url = 'https://medium.com/@lemonsand/followers'
#requesting the url to get access to the page
r = requests.get(url)
print(r)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

<Response [200]>


In [None]:
soup.prettify()

In [112]:
#look for the script tags (this is where the links to the followers are)
paging = str(soup.find_all('script'))

In [119]:
#find the paging information; it comes back as one big string that still needs to be split up
link_info = re.findall(r'"paging":{(.+)}},"streamItems*', paging)[0]

In [156]:
#getting the individual parts of the link

#get the overall body link
link =  re.findall(r'(https.+)","next', link_info)[0]
#get the limit
limit = re.findall(r'limit":(\d+)', link_info)[0]
#not sure what "to" is, but it is needed in the link
to = re.findall(r'to":"([\w\d]+)"', link_info)[0]
#source: should just be followers for all
source = "followers"
#start at page 2: the first followers can be scraped from the profile page itself,
#and there is no page 1. Scrape that first, then use this link to move to the next page.
page_no = "2"


In [158]:
#get the individual pieces
link, limit, to, source, page_no

('https://medium.com/_/api/users/ec92ba75ef27/profile/stream',
 '10',
 '10a75bdee834',
 'followers',
 '2')

In [160]:
#link the pieces together (note the "=" after page, as in the practice url)
page_request_link = link + "?limit=" + limit + "&to=" + to + "&source=" + source + "&page=" + page_no

Now we can loop, increasing page_no each time, and scrape the users from each page to get all the followers!
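
A sketch of building the link for several pages with `urllib.parse.urlencode` instead of string concatenation. `follower_page_url` is a hypothetical helper, and the values are the ones extracted above. Whether `to` stays the same from page to page is not confirmed here; each response may carry its own paging values, in which case they would need to be re-read after every request.

```python
from urllib.parse import urlencode

def follower_page_url(base, limit, to, page, source="followers"):
    """Build the followers stream URL for one page number."""
    query = urlencode({"limit": limit, "to": to, "source": source, "page": page})
    return f"{base}?{query}"

base = "https://medium.com/_/api/users/ec92ba75ef27/profile/stream"

# pages 2, 3 and 4; the upper bound is arbitrary for this illustration
urls = [follower_page_url(base, 10, "10a75bdee834", page) for page in range(2, 5)]
for page_url in urls:
    print(page_url)
```

Letting `urlencode` assemble the query string avoids missing `=` or `&` characters and escapes any values that need it.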

In [44]:
#the json fragment in the page that describes how to get the next page of followers:
next_url = '"paging":{"path":"https://medium.com/_/api/users/ec92ba75ef27/profile/stream","next":{"limit":10,"to":"10a75bdee834","source":"followers","page":2}},'
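
Since the fragment is JSON, it may be simpler to decode it with the `json` module than to pull each field out with its own regex. A sketch, assuming the `"paging"` object appears in the script text as plain, unescaped JSON like the string above; `parse_paging` is a hypothetical helper name.

```python
import json

def parse_paging(script_text):
    """Return the object that follows '"paging":' in script_text, or None if absent."""
    marker = '"paging":'
    start = script_text.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever text comes after it
    paging, _end = json.JSONDecoder().raw_decode(script_text, start + len(marker))
    return paging

sample = '"paging":{"path":"https://medium.com/_/api/users/ec92ba75ef27/profile/stream","next":{"limit":10,"to":"10a75bdee834","source":"followers","page":2}},'
paging = parse_paging(sample)
print(paging["path"])
print(paging["next"])
```

The decoded `paging["next"]` dictionary holds `limit`, `to`, `source` and `page` together, with the numbers already converted to integers.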

In [None]:
#what the url looks like in practice
#https://medium.com/_/api/users/ec92ba75ef27/profile/stream?limit=5&to=10a75bdee834&source=followers&page=1