## Scraping

### Using Selenium, this notebook initializes a zombie web browser to scrape images from Instagram's latest posts. First it navigates to a hashtag's page and grabs links to a certain number of posts, then it visits each post and grabs its image, its hashtags, and a few other pieces of information such as a direct link to the post. The images are saved into the `data` folder, and the rest of the information is saved as one JSON file per hashtag in the `metadata` folder.

In [1]:
# NumPy versions 1.17 and above may be incompatible with some other
# packages, so you may need to replace your current version with
# an earlier one in order to run this notebook as-is.
# !pip uninstall numpy --yes
# !pip install "numpy<1.17"

In [2]:
import numpy as np
import pandas as pd
import json
from selenium.webdriver import Chrome, Firefox
from functions import scrape_data
import os

### **If you want to scrape your own hashtags,** simply add them to the list, choose how many posts to scrape for each one, and run the remaining cells.

In [3]:
# EXAMPLE:
# hashtags = ["travel", "food", "animals", "selfie", "cars", "fitness", "babies", "wedding", "nature", "architecture"]

# Your own hashtags here:
hashtags = []

# How many posts to scrape per hashtag:
num_to_scrape = 0

In [4]:
# Make sure our data and metadata folders exist before we start scraping
folder_names = ["data", "metadata"]
for folder_name in folder_names:
    try:
        os.mkdir(folder_name)
    except FileExistsError:
        print(f"Folder '{folder_name}' already exists.")

Folder 'data' already exists.
Folder 'metadata' already exists.


In [5]:
for hashtag in hashtags:
    # "delay" is the maximum time to wait before grabbing each image, to avoid
    # being blocked by Instagram. If delay=5 for example, then the browser will
    # wait a random amount of time between 0 and 5 seconds before each new image.
    new_hashtag_metadata = scrape_data(hashtag, num_to_scrape, delay=5)
    
    if os.path.exists(f"metadata/{hashtag}.json"):
        # We already have metadata for this hashtag, add to it
        with open(f"metadata/{hashtag}.json", "r") as f:
            hashtag_metadata = json.load(f)
            hashtag_metadata += new_hashtag_metadata
    else:
        # We don't have metadata for this hashtag yet, initialize it
        hashtag_metadata = new_hashtag_metadata

    with open(f"metadata/{hashtag}.json", "w") as f:
        json.dump(hashtag_metadata, f)
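
The random wait described in the comments above could look roughly like the sketch below. This is only an illustration: `random_wait` is a hypothetical helper, and the actual delay logic lives inside `scrape_data` in `functions.py`, which is not shown here and may work differently.

```python
import random
import time


def random_wait(delay, sleep=time.sleep):
    """Pause for a random duration between 0 and `delay` seconds.

    Hypothetical helper for illustration only; the real logic inside
    `scrape_data` may differ.
    """
    # Pick a different wait each time so requests are not evenly spaced
    wait = random.uniform(0, delay)
    sleep(wait)
    return wait
```

Calling `random_wait(5)` before each request would spread the requests out unevenly, which is the idea behind `delay=5` above. The `sleep` argument is there only so the pause can be swapped out when experimenting.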

### You can use `pd.read_json` to import hashtag data again. 

In [9]:
# travel_df = pd.read_json("metadata/travel.json")
# travel_df.head()

### Optionally you can also use this scaffolding for uploading scraped images to an S3 bucket, although you will of course need to set up your own S3 bucket.

In [10]:
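# import boto3

# s3 = boto3.resource("s3")
# bucket = "instagram-images-mod4"  # replace with the name of your own bucket

# hashtags_to_upload = ["foo", "bar"]
# for hashtag in hashtags_to_upload:
#     # Load the metadata records that were saved for this hashtag
#     with open(f"metadata/{hashtag}.json", "r") as f:
#         hashtag_metadata = json.load(f)
#     for img in hashtag_metadata:
#         source = f"data/{img['image_local_name']}"
#         destination = f"{img['search_hashtag']}/{img['image_local_name']}"
#         s3.meta.client.upload_file(source, bucket, destination)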