### Webscraping Art Articles from Medium.com
Walking the branches of the DOM and extracting the relevant tags

#### references:
https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

### Check the robots.txt
Before scraping, check which paths the site allows crawlers to access: https://medium.com/robots.txt
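
The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `sample_rules` are made up for illustration and are not Medium's actual robots.txt, and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_lines):
    """Build a robots.txt parser from an iterable of rule lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical rules for illustration only (NOT Medium's real robots.txt)
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = build_parser(sample_rules)

# Ask whether a generic crawler ("*") may fetch each URL
print(parser.can_fetch("*", "https://example.com/topic/art"))
print(parser.can_fetch("*", "https://example.com/private/x"))
```

To check the live file instead, call `parser.set_url("https://medium.com/robots.txt")` followed by `parser.read()` before `can_fetch`.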

## Part 1: Sample - looking at a single page

In [205]:
#imports
import re  #used below for the regex link search
import requests
from bs4 import BeautifulSoup
import pandas

#the url topic (root url)
url = "https://medium.com/topic/art"

#requesting the url to get access to the page
r = requests.get(url)

#parsing in the information
soup = BeautifulSoup(r.content, "html5lib")

Inspecting a Medium article with the browser's developer tools shows that certain tags appear consistently in every article:

- `<div class="n p">`: all content (title and paragraphs)
- `<div class="o n">`: the title specifically

In [None]:
#look at the html in nice format
print(soup.prettify())

In [None]:
#example of link I want to get
'<a href="https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1"'

#regex pattern to capture each link's href value (the entire URL)
#raw string so the backslashes reach the regex engine unchanged
pattern = r'href="(.{5,100}source=topic_page\-+\d\-+\d)'
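
To see what this pattern captures, here is a small standalone check against the example link above. It also shows a limitation worth knowing: the single `\d` only matches one-digit positions, which is presumably why Part 2 widens it to `\d{1,5}`. The `two_digit` sample is a made-up link for illustration.

```python
import re

# Part 1 pattern (single-digit position) and Part 2 pattern (up to 5 digits)
pattern_one_digit = r'href="(.{5,100}source=topic_page\-+\d\-+\d)'
pattern_multi_digit = r'href="(.{5,100}source=topic_page\-+\d{1,5}\-+\d)'

# Example link from the notebook
sample = '<a href="https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1"'

# Hypothetical link at position 12, for illustration only
two_digit = '<a href="/a/b-123?source=topic_page---------12------------------1"'

captured = re.findall(pattern_one_digit, sample)
missed = re.findall(pattern_one_digit, two_digit)
caught = re.findall(pattern_multi_digit, two_digit)

print(captured)  # the full href value of the example link
print(missed)    # empty: "12" is not a single digit
print(caught)    # the wider pattern picks it up
```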

In [209]:
#find all the matching links in the page HTML using the regex
result = re.findall(pattern, str(soup))
result[:10]

['/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/@georgeannsack?source=topic_page---------0------------------1',
 '/awake-alive-mind?source=topic_page---------0------------------1',
 '/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1',
 '/@juliettefgreene?source=topic_page---------1---------


for link in soup.find_all("a", href = True):
    print(link.get("href"))
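
Most of the links found above are relative (they start with `/`), so they need the site root prepended before they can be requested. A minimal sketch with the standard library's `urljoin`, assuming `https://medium.com` is the right base for relative links; `to_absolute` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://medium.com"  # assumed base for relative links

def to_absolute(href, base=BASE_URL):
    """Turn a relative href into a full URL; absolute URLs pass through unchanged."""
    return urljoin(base, href)

relative = "/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1"
external = "https://psiloveyou.xyz/remembering-the-terrible-cb7ebf24a6da?source=topic_page---------6------------------1"

print(to_absolute(relative))
print(to_absolute(external))
```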

##### references:
1. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho
2. https://stackoverflow.com/questions/37207959/how-to-scrape-all-contents-from-infinite-scroll-website-scrapy
3. https://stackoverflow.com/questions/42478591/python-selenium-chrome-webdriver
4. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho (used for page scrolling)

## Part 2: Using Selenium to Scroll and Get all the URLs at once 

In [123]:
import re  #needed for the regex search further down
import time

import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [136]:
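#path to the local ChromeDriver binary
#note: Selenium 4 replaces executable_path with a Service object
browser = webdriver.Chrome(executable_path = r"C:\Users\jesse\Downloads\chromedriver_win32\chromedriver.exe")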
browser.get("https://medium.com/topic/art")

In [None]:
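#code based on Stack Overflow reference #4
#note: Selenium 4 removed find_element_by_tag_name; there, use
#browser.find_element(By.TAG_NAME, "body") with
#from selenium.webdriver.common.by import By
time.sleep(1)
elem = browser.find_element_by_tag_name("body")

#press Page Down 500 times, pausing so new entries can load
pagedowns = 500
while pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    pagedowns -= 1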
    
#pages = browser.find_elements_by_class_nam
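
A fixed count of 500 page-downs may be too many or too few. One possible alternative (not the approach used in this notebook) is to scroll until the page height stops growing. The sketch below assumes `browser` is a Selenium WebDriver, whose `execute_script` method runs JavaScript in the page; `scroll_until_stable` and its parameters are hypothetical names. A slow connection could make the height look stable too early, so the pause may need tuning.

```python
import time

def scroll_until_stable(browser, pause=0.5, max_rounds=500):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger the next load
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more entries
        rounds += 1
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, assume we reached the end
        last_height = new_height
    return rounds
```

Usage would be `scroll_until_stable(browser)` in place of the page-down loop, followed by reading `browser.page_source`.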

In [167]:
#get the HTML source of the scrolled page
pages = browser.page_source

In [168]:
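#regex pattern to capture each link's href value
#(\d{1,5} allows multi-digit positions on the long scrolled page)
pattern = r'href="(.{5,100}source=topic_page\-+\d{1,5}\-+\d)'

#get all the links from the page source
result = re.findall(pattern, pages)
result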

In [176]:
#put into a series to process
html_links = pd.Series(result)

#export to csv for safekeeping
html_links.to_csv("htmls_art.csv")

In [181]:
#drop all the duplicates
htmls = html_links.drop_duplicates().reset_index(drop=True)

In [183]:
htmls.head()

0    /awake-alive-mind/the-tension-of-expectation-5...
1    /@georgeannsack?source=topic_page---------0---...
2    /awake-alive-mind?source=topic_page---------0-...
3    /swlh/what-its-like-to-hear-the-world-in-color...
4    /@juliettefgreene?source=topic_page---------1-...
dtype: object

In [196]:
#drop the short links where only 2-30 characters precede "?source=topic"
#(author and publication pages such as /@georgeannsack), keeping article links
clean_htmls = htmls[~htmls.str.contains(r"^.{2,30}\?source=topic")]

In [211]:
#extract the position number from each link's suffix (each one should be
#different, indicating another article)
clean_htmls.str.extract(r"--+(\d{1,4})-+").head()

   0
0  0
3  1
6  2
8  0
9  1
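
Since the position numbers repeat, they may not be a reliable article identifier on their own. A possible alternative, sketched here in plain Python rather than pandas, is to treat the link path (everything before `?source=...`) as the key. `article_key`, `unique_articles`, and `position` are hypothetical helper names, and the last sample link (position 7) is made up for illustration.

```python
import re
from urllib.parse import urlsplit

def article_key(href):
    """Identify an article by host and path, ignoring the ?source=... query."""
    parts = urlsplit(href)
    return (parts.netloc, parts.path)

def unique_articles(hrefs):
    """Keep the first link seen for each article key, preserving order."""
    seen = set()
    unique = []
    for href in hrefs:
        key = article_key(href)
        if key not in seen:
            seen.add(key)
            unique.append(href)
    return unique

def position(href):
    """Return the number right after 'topic_page---...', or None if absent."""
    match = re.search(r"topic_page-+(\d+)-+", href)
    return int(match.group(1)) if match else None

links = [
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1",
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------0------------------1",
    "/swlh/what-its-like-to-hear-the-world-in-color-and-in-touch-a5f0db8640aa?source=topic_page---------1------------------1",
    # hypothetical: same article, different position suffix
    "/awake-alive-mind/the-tension-of-expectation-57a0696956c9?source=topic_page---------7------------------1",
]

deduped = unique_articles(links)
print(len(deduped))
print([position(link) for link in deduped])
```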


### Webscraping Practice on Infinite Scrolling Pages:
reference: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016 (also says it's ok to webscrape from this site)

Scraping from http://spidyquotes.herokuapp.com/, a practice website for scraping quotes

In [125]:
#imports
import requests
from bs4 import BeautifulSoup

#the url topic (root url)
url = "http://spidyquotes.herokuapp.com/scroll"

#requesting the url to get access to the page
r = requests.get(url)
print(r)

#parsing in the information
soup = BeautifulSoup(r.content, "lxml")

<Response [200]>
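
A plain `requests.get` only sees the initial HTML, not the quotes loaded while scrolling. As recalled from the referenced blog post, the scroll page fetches its data from a paginated JSON endpoint, so one possible approach is to request those pages directly. This is a sketch under assumptions: the `/api/quotes?page=N` URL and the `quotes`, `has_next`, `text`, and `author.name` fields are not verified here, and `fetch_page` and `iter_quotes` are hypothetical names.

```python
import json
from urllib.request import urlopen

# Assumed endpoint, based on the referenced blog post (not verified)
API_URL = "http://spidyquotes.herokuapp.com/api/quotes?page={page}"

def fetch_page(page):
    """Download and decode one page of the (assumed) JSON API."""
    with urlopen(API_URL.format(page=page)) as response:
        return json.load(response)

def iter_quotes(fetch=fetch_page, max_pages=50):
    """Yield (text, author name) pairs page by page until has_next is false."""
    page = 1
    while page <= max_pages:
        data = fetch(page)
        for item in data.get("quotes", []):
            yield item.get("text"), item.get("author", {}).get("name")
        if not data.get("has_next"):
            break  # the API reports no further pages
        page += 1
```

Calling `list(iter_quotes())` would collect every quote; `max_pages` is a safety cap so a misbehaving endpoint cannot cause an endless loop.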
