# Instagram Scraper

This notebook is an exploration of a few difficulties in web scraping Instagram.

If you like this example, check out my [`insta_builder.py`](https://github.com/SkylerBurger/apis_and_scrapers/blob/master/instagram/insta_builder.py) module, where I implement the Builder design pattern to create representations of an Instagram profile. This example could be extended to handle profiles from other social media platforms by creating a new builder and profile class for each platform.

## Challenges:

1. GET requests to Instagram return executable JavaScript rather than HTML content.
  - To get past this issue, I learned how to use Selenium, which executes the JavaScript included in the response from Instagram just as a browser would. This gathers and renders the actual HTML content so I can scrape it.
2. Instagram has a soft login-wall to keep clients that are not logged in from viewing the full extent of a profile's public content.
  - When researching how to scroll through a page of unknown length in Selenium, I came across an approach that manipulates the `window` object directly to continue scrolling. Luckily, this direct approach works even after Instagram renders their soft login-wall to the page.
3. Instagram dynamically populates and depopulates images from the browser as you scroll through a profile.
  - To make sure that I capture all of the images as I scroll, I set up an algorithm that takes a 'snapshot' of the current HTML content of the page and then scrolls to the bottom of the page. Scrolling to the bottom of the page causes new content to render and old content to depopulate from the page. It then repeats this process of taking a snapshot and scrolling until it notices that the bottom of the page is no longer extending. I then extract all the image tags from the HTML snapshots and remove duplicates by running them through a set.
4. After initial exploration, Instagram changed their preview of public profiles when a user is not logged in. Instead of automatically loading images upon scrolling to the bottom of the page, they currently show a few images and a 'show more' button.
  - To get the page back into a state where images load automatically upon scrolling, I had to use Chrome Dev Tools to identify potential class names to use in targeting the 'show more' button element. With class names in hand I was able to write a few lines that tell the Selenium WebDriver how to locate the button and then to execute a click action on it. Once the button was clicked and scrolling was restored, the remainder of my previous code continued to work as expected.
  - After writing the button-clicking code, I noticed that only certain profiles have the 'show more' button enabled, so I altered my approach: search for the button, click it if it is present, and otherwise proceed with scrolling as usual.
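
The snapshot-and-deduplicate idea from challenge 3 can be sketched without a browser. This is a minimal illustration rather than the notebook's actual code: it swaps BeautifulSoup for the standard library's `html.parser`, deduplicates on the `src` attribute instead of on whole tags, and uses made-up HTML snippets and file names. `ImageSrcParser`, `unique_image_sources`, and `example_snapshots` are hypothetical names.

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def unique_image_sources(html_snapshots):
    """Returns the set of distinct img src values across all snapshots."""
    unique = set()
    for html in html_snapshots:
        parser = ImageSrcParser()
        parser.feed(html)
        unique.update(parser.sources)
    return unique


# Two made-up snapshots that overlap on b.jpg, as consecutive scrolls would
# when older images are still on the page while newer ones load.
example_snapshots = [
    '<div><img src="a.jpg"><img src="b.jpg"></div>',
    '<div><img src="b.jpg"><img src="c.jpg"></div>',
]
print(sorted(unique_image_sources(example_snapshots)))
```

Because consecutive snapshots overlap, collecting into a set is what keeps an image that appears in several snapshots from being counted more than once.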



In [1]:
import time
from random import randint

from selenium import webdriver
from bs4 import BeautifulSoup

In [2]:
def snapshot(browser):
    """Appends the current HTML of the page to the 
    snapshots list.
    """
    global snapshots
    snapshots.append(browser.page_source)


def scroll_and_snapshot(browser, max_scroll_secs):
    """Scrolls through the full height of a profile page while 
    frequently taking snapshots of the page's HTML.
    """
    SCROLL_PAUSE_TIME = 1
    last_height = browser.execute_script("return document.body.scrollHeight")
    end_time = time.time() + max_scroll_secs
    
    # Attribution for scrolling mechanism:
    # https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python#27760083
    while time.time() < end_time:
        # Scroll down to bottom
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate new scroll height and compare with last scroll height
        new_height = browser.execute_script("return document.body.scrollHeight")
        snapshot(browser)
        
        if new_height <= last_height:
            break
        last_height = new_height


def collect_image_tags(snapshots):
    """Goes through the HTML snapshots and returns a set of unique 
    img elements.
    """
    image_tags = []
    
    for html in snapshots:
        soup = BeautifulSoup(html, 'html.parser')
        image_tags += soup.find_all('img', class_='FFVAD')
        
    return set(image_tags)


def capture(snapshots, browser):
    """Downloads all images from the provided HTML snapshots."""  
    image_tags = collect_image_tags(snapshots)

    for index, image in enumerate(image_tags):
        browser.get(image['src'])
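        # 'tag name' is the locator string behind By.TAG_NAME; the
        # find_elements_by_* helpers were removed in Selenium 4
        images = browser.find_elements('tag name', 'img')
        # Edit file name f-string below as needed
        images[0].screenshot(f'./screenshot_{index}.png')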
        wait_time = randint(1, 2)
        time.sleep(wait_time)


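def create_browser():
    """Returns a Selenium Chrome WebDriver instance."""
    # Selenium 4 takes the driver path through a Service object; the
    # positional path argument was removed in Selenium 4.10
    from selenium.webdriver.chrome.service import Service
    browser = webdriver.Chrome(service=Service('../utilities/chromedriver'))
    return browser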


def click_show_more_button(browser):
    """Clicks the troublesome 'show more' button, if present."""
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
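    try:
        # 'class name' is the locator string behind By.CLASS_NAME; the
        # find_element_by_* helpers were removed in Selenium 4
        button = browser.find_element('class name', 'z4xUb')
        button.click()
    except Exception:
        # Button absent or not clickable: proceed with normal scrolling
        return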
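
The stop condition in `scroll_and_snapshot` (quit when the page height stops growing, or when the time budget runs out) can be exercised without Selenium. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `scroll_until_stable` and `FakePage` are hypothetical names, the page is a stand-in that grows by 100px per scroll up to a cap, and the function returns its snapshots instead of appending to a module-level list.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, take_snapshot,
                        max_scroll_secs, pause_secs=0.0):
    """Scrolls until the page height stops growing or time runs out.

    Returns the list of snapshots taken, one per scroll.
    """
    snapshots = []
    last_height = get_height()
    end_time = time.time() + max_scroll_secs

    while time.time() < end_time:
        scroll_to_bottom()
        time.sleep(pause_secs)  # give new content a moment to load
        snapshots.append(take_snapshot())
        new_height = get_height()
        if new_height <= last_height:
            break  # nothing new was rendered, so this is the bottom
        last_height = new_height

    return snapshots


class FakePage:
    """Hypothetical stand-in for a browser: grows 100px per scroll, capped."""

    def __init__(self, max_height):
        self.height = 100
        self.max_height = max_height

    def get_height(self):
        return self.height

    def scroll_to_bottom(self):
        self.height = min(self.height + 100, self.max_height)

    def page_source(self):
        return f"<html data-height='{self.height}'></html>"


page = FakePage(max_height=300)
collected = scroll_until_stable(page.get_height, page.scroll_to_bottom,
                                page.page_source, max_scroll_secs=5)
print(len(collected))
```

Passing the three browser operations in as callables keeps the loop testable; with a real WebDriver they would wrap `execute_script` and `page_source`. Returning the list is one alternative to the shared `snapshots` global, at the cost of changing how the caller collects results.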

## Result of Exploration

Working with Selenium and the Instagram website led to the above collection of functions that work together to report on and optionally download images from a public profile.

In [3]:
# Report back on a public profile with a link

profile_name = 'third_impact_01'
instagram_url = f'https://www.instagram.com/{profile_name}'
max_scroll_secs = 300
snapshots = []

browser = create_browser()
browser.get(instagram_url)
click_show_more_button(browser)
scroll_and_snapshot(browser, max_scroll_secs)
images = collect_image_tags(snapshots)
print(f'Image Links Captured: {len(images)}')

# ONLY call line below if you want to risk downloading a lot of images
# capture(snapshots, browser)

browser.quit()

Image Links Captured: 114
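
In the cell above, `browser.quit()` only runs if every earlier step succeeds, so an exception mid-scroll would leave a Chrome window open. One possible way to guard against that is a small context manager; the sketch below is an assumption-laden illustration, with `managed_browser` and `FakeBrowser` as hypothetical names and a fake object standing in for the WebDriver.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(create_browser):
    """Yields a browser from the given factory and always quits it afterwards."""
    browser = create_browser()
    try:
        yield browser
    finally:
        browser.quit()


class FakeBrowser:
    """Hypothetical stand-in for a WebDriver that records whether quit() ran."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


fake = FakeBrowser()
try:
    with managed_browser(lambda: fake) as browser:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.quit_called)
```

With the notebook's own factory this would read `with managed_browser(create_browser) as browser:` followed by the scraping calls.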


## Synthesizing Into Classes

Below is an example of combining the above exploratory code into classes using the builder design pattern to make modelling public Instagram profiles easy and repeatable. Check out [`insta_builder.py`](https://github.com/SkylerBurger/apis_and_scrapers/blob/master/instagram/insta_builder.py) for a look under the hood at the code behind the classes.
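
As a rough picture of how a director and a builder divide the work, here is a minimal sketch. It is not the code in `insta_builder.py`: `Profile`, `StubProfileBuilder`, `Director`, and their methods are hypothetical, and the builder returns canned URLs where a real one would scrape. Only the `image_count` attribute name is borrowed from the cells that follow.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Plain container for what a builder gathers about one profile."""
    name: str
    platform: str
    image_urls: list = field(default_factory=list)

    @property
    def image_count(self):
        return len(self.image_urls)


class StubProfileBuilder:
    """Hypothetical builder; a real one would scrape instead of using canned URLs."""
    platform = "example"

    def __init__(self, canned_urls):
        self._canned_urls = canned_urls
        self._profile = None

    def start(self, name):
        self._profile = Profile(name=name, platform=self.platform)

    def gather_images(self):
        # Deduplicate while keeping first-seen order.
        self._profile.image_urls = list(dict.fromkeys(self._canned_urls))

    def result(self):
        return self._profile


class Director:
    """Runs the build steps in a fixed order for whichever builder it is given."""

    def build(self, builder, name):
        builder.start(name)
        builder.gather_images()
        return builder.result()


demo_builder = StubProfileBuilder(["a.jpg", "b.jpg", "a.jpg"])
demo_profile = Director().build(demo_builder, "some_profile")
print(demo_profile.image_count)
```

Under this arrangement, supporting another platform means writing another builder with the same three methods; the director and the calling code stay the same.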

In [4]:
from insta_builder import ProfileDirector

In [5]:
# Create a Director instance to build Profiles

profile_director = ProfileDirector()

In [6]:
# Tell the Director to build a Profile for the given profile name

profile_name = 'engineering.stations'
insta_profile = profile_director.build_insta_profile(profile_name)

In [7]:
print(insta_profile.image_count)

529


In [8]:
# Tell the profile to download its images

# Change value below to int, or leave as None to download all
# max_images_to_download = None
# insta_profile.download_images(max_images_to_download)