Web Crawling & Scraping (Instagram)

This script scrapes Instagram posts using the Selenium and BeautifulSoup packages.

Ref (2019): https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058

Ref (2021): https://medium.com/analytics-vidhya/web-scraping-instagram-with-selenium-python-b8e77af32ad4

Ref (2021): http://www.easy2digital.com/automation/python-tutorial-for-digital-marketer-12-using-hashtags-to-scrape-top-instagram-posts-and-instagram-users/

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
import time
import re
from urllib.request import urlopen
import json
from pandas import json_normalize  # top-level location since pandas 1.0; the pandas.io.json path is deprecated
import pandas as pd, numpy as np

First, specify the username of an Instagram profile.
The cells below then scrape that profile's posts.

On your computer, you need to:

1. Install Python selenium package
> conda install -c conda-forge selenium

OR

> pip install selenium

Ref: https://medium.com/@praneeth.jm/running-chromedriver-and-selenium-in-python-on-an-aws-ec2-instance-2fb4ad633bb5

2. Install ChromeDriver (its version must match the installed Chrome version)

3. Install Google Chrome
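
Before running the cells below, it can help to confirm that the prerequisites are discoverable. The following is a minimal sketch using only the standard library. The function name `check_prerequisites` is made up for this example, and the check assumes `chromedriver` is expected on the PATH. Recent Selenium releases can also locate a driver on their own, so a `False` entry for `chromedriver` is not necessarily a problem.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which prerequisites can be found (hypothetical helper)."""
    return {
        # True if the selenium package is importable in this environment
        "selenium": importlib.util.find_spec("selenium") is not None,
        # True if a chromedriver executable is on the PATH
        "chromedriver": shutil.which("chromedriver") is not None,
    }


print(check_prerequisites())
```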

In [None]:
# Open the web browser.
# Selenium uses ChromeDriver to open the profile page of a given (public) username.
# For example:

username = 'davidbeckham'
browser = webdriver.Chrome()
browser.get('https://www.instagram.com/' + username + '/?hl=en')
# Scroll to the bottom of the page. window.scrollTo() returns nothing,
# so Pagelength is None; the name is kept only for consistency with later cells.
Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# In the new Chrome window opened by Selenium,
# log in to Instagram with your own account.
# Otherwise the crawler cannot retrieve posts beyond the first page.

In [None]:
# If you want to open a hashtag page
# hashtag='food'
# browser = webdriver.Chrome()
# browser.get('https://www.instagram.com/explore/tags/'+hashtag)
# Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
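
The profile and hashtag cells build their URLs by string concatenation. As a sketch, the two patterns can be wrapped in small helpers that also percent-encode the user-supplied part. The names `profile_url` and `hashtag_url` are made up for this example; the URL patterns are the ones used in the cells above.

```python
from urllib.parse import quote

BASE = "https://www.instagram.com"


def profile_url(username):
    """URL of a public profile page, as opened in the profile cell."""
    return f"{BASE}/{quote(username)}/?hl=en"


def hashtag_url(hashtag):
    """URL of a hashtag page, as opened in the hashtag cell."""
    # Strip a leading '#' so that both 'food' and '#food' are accepted
    tag = hashtag.lstrip("#")
    return f"{BASE}/explore/tags/{quote(tag)}"


print(profile_url("davidbeckham"))
print(hashtag_url("food"))
```

With these helpers, the browser call would read `browser.get(profile_url(username))`.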

In [None]:
'''
Parse the HTML source page

Take the page source and parse it with Beautiful Soup.

Go through the HTML body, extract the link to each post on the page,
and append it to the list 'links' (initially empty).
'''

links = []
source = browser.page_source
data = bs(source, 'html.parser')

body = data.find('body')
# An obfuscated class name such as 'v1Nh3' can change when Instagram updates
# its markup; adjust it if no links are found.
script = body.find_all("div", class_="v1Nh3")

for div in script:
    link = div.find('a')
    # Skip thumbnails without an anchor or href; post URLs start with /p/
    if link is not None and re.match("/p/", link.get('href', '')):
        links.append('https://www.instagram.com' + link.get('href'))

print("Number of Instagram images: ", len(links))

In [None]:
links
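
Note that `re.match` only anchors at the start of the string, so a loose prefix pattern can also accept unrelated paths (for example, "/p" matches "/privacy/"). A minimal sketch of a stricter filter that also removes duplicates while preserving order follows. The sample hrefs are invented for illustration, `post_links` is a hypothetical helper, and the assumed shape of a post path, `/p/<shortcode>/`, is inferred from the links this notebook collects.

```python
import re

# Assumed shape of a post path: /p/<shortcode>/ with letters, digits, '_' or '-'
POST_PATH = re.compile(r"^/p/[\w-]+/?$")


def post_links(hrefs, base="https://www.instagram.com"):
    """Return absolute post URLs, without duplicates, in first-seen order."""
    urls = [base + h for h in hrefs if h and POST_PATH.match(h)]
    # dict keys keep insertion order, so this de-duplicates stably
    return list(dict.fromkeys(urls))


sample = ["/p/ABC123/", "/privacy/", None, "/p/ABC123/", "/p/x_Y-9/"]
print(post_links(sample))
```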

In [None]:
'''
Remember that by default Selenium loads only the first page of posts.

To scroll through further pages and get more images,
scroll to a fraction of the scroll height and run the parsing code again after each scroll.

This adds the new links from each page to the list. For example:
'''


Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight/1.5);")
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
# Guard against a missing <span>; only anchors that carry an href are considered
for link in (script.find_all('a', href=True) if script is not None else []):
    if re.match("/p/", link['href']):
        links.append('https://www.instagram.com' + link['href'])


# === IMPORTANT ===
# The sleep is required.
# Without it, Instagram may interrupt the script and the page will not scroll further.

time.sleep(5)


# window.scrollTo takes (x, y), so the horizontal offset is set to 0 here
Pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight/3.0);")
source = browser.page_source
data = bs(source, 'html.parser')

body = data.find('body')
script = body.find_all("div", class_="v1Nh3")

for div in script:
    link = div.find('a')
    if link is not None and re.match("/p/", link.get('href', '')):
        links.append('https://www.instagram.com' + link.get('href'))

print("Number of Instagram images: ", len(links))

In [None]:
links
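
The scroll, parse, and sleep steps above can also be written as a loop that stops once the page height no longer grows. This is a sketch, not the notebook's method: `scroll_until_stable`, `pause`, `max_rounds`, and `after_scroll` are names chosen for this example, and the function relies only on Selenium's `execute_script`, which the cells above already use.

```python
import time


def scroll_until_stable(browser, pause=5.0, max_rounds=10, after_scroll=None):
    """Scroll to the bottom repeatedly; return the number of scrolls performed.

    `browser` is any object with Selenium's execute_script(script) method.
    `after_scroll`, if given, is called after every scroll (e.g. to parse links).
    """
    get_height = "return document.body.scrollHeight;"
    last = browser.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load more posts
        if after_scroll is not None:
            after_scroll()
        height = browser.execute_script(get_height)
        if height == last:  # nothing new was loaded, so stop
            break
        last = height
    return rounds
```

Passing the parsing code as `after_scroll` lets links be collected after every scroll, which mirrors running the parse cell by hand between scrolls.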

In [None]:
'''
Get information for each image on the page

To get more details about each image, such as

- who posted it
- post type
- image URL
- image caption
- number of likes
- comments

open the source page of each post (from the 'links' list built above)
and load its embedded JSON into a pandas DataFrame.
'''

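import os
import requests

frames = []
for link in links:
    try:
        page = urlopen(link).read()
        data = bs(page, 'html.parser')
        body = data.find('body')
        script = body.find('script')
        # Strip the JavaScript assignment and only the trailing semicolon;
        # removing every ';' would corrupt string values that contain one.
        raw = script.text.strip().replace('window._sharedData =', '').rstrip(';')

        json_data = json.loads(raw)
        posts = json_data['entry_data']['PostPage'][0]['graphql']

        x = json_normalize(posts)
        # regex=False so that '.' is treated as a literal dot
        x.columns = x.columns.str.replace("shortcode_media.", "", regex=False)
        frames.append(x)
    except Exception as e:
        # Skip posts whose page cannot be fetched or parsed, but report them
        print("Skipped", link, "-", e)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
result = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Drop duplicate posts (the same link may have been collected more than once)
if 'shortcode' in result.columns:
    result = result.drop_duplicates(subset='shortcode')
result.index = range(len(result.index))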
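
The parsing step can be tried offline on a small invented payload. The sketch shows one detail worth noting: stripping only the trailing ';' leaves semicolons inside string values intact, whereas removing every ';' would alter them. The nesting under `entry_data`, `PostPage`, `graphql`, and `shortcode_media` and the `shortcode` and `display_url` fields follow the code above; `parse_shared_data`, `caption_text`, and all values are made up for this example.

```python
import json


def parse_shared_data(script_text):
    """Extract the JSON object from a 'window._sharedData = {...};' script."""
    raw = script_text.strip()
    prefix = "window._sharedData ="
    if raw.startswith(prefix):
        raw = raw[len(prefix):]
    # Remove only the statement terminator, not semicolons inside strings
    return json.loads(raw.strip().rstrip(";"))


# Invented sample payload (not real Instagram data)
sample = (
    'window._sharedData = {"entry_data": {"PostPage": [{"graphql": '
    '{"shortcode_media": {"shortcode": "ABC123", '
    '"display_url": "https://example.com/a.jpg", '
    '"caption_text": "Training; then rest"}}}]}};'
)

media = parse_shared_data(sample)["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
print(media["shortcode"], "|", media["caption_text"])
```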

In [None]:
'''
Can you go to each post (via its URL) and retrieve metadata such as:
- Number of likes
- Comments (text)
'''

In [None]:
'''
Download images listed in the pandas data frame

Use the requests library to download each image from the 'display_url' column of the
'result' data frame and save it with the post's shortcode as the file name.

(Important note: respect authors' rights when you download copyrighted content.
Do not use images or videos from Instagram for commercial purposes.)
'''
import os
import requests
result.index = range(len(result.index))

path_prefix = ""
directory = os.path.join(path_prefix, "Instagram_Photos_" + username)

# Create the output folder (no error if it already exists)
os.makedirs(directory, exist_ok=True)

for i in range(len(result)):
    r = requests.get(result['display_url'][i])
    with open(os.path.join(directory, result['shortcode'][i] + ".jpg"), 'wb') as f:
        f.write(r.content)

Now go check out the folder Instagram_Photos_davidbeckham/

You should see JPG files there.
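
The download loop above can be made a little more defensive. The sketch below is one way to do it under stated assumptions: `save_images` and its `fetch` parameter are hypothetical names, `rows` is any iterable of `(shortcode, display_url)` pairs (for example `zip(result['shortcode'], result['display_url'])`), and `fetch` is any callable that returns the image bytes for a URL or raises an exception on failure.

```python
from pathlib import Path


def save_images(rows, directory, fetch):
    """Save each image as <directory>/<shortcode>.jpg; return the saved paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for shortcode, url in rows:
        try:
            content = fetch(url)
        except Exception as exc:
            # One failed download should not stop the rest
            print("Skipped", shortcode, "-", exc)
            continue
        path = out_dir / f"{shortcode}.jpg"
        path.write_bytes(content)
        saved.append(path)
    return saved
```

With requests, `fetch` could be a small function that calls `requests.get(url, timeout=30)`, then `raise_for_status()` on the response, and returns its `.content`.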