## UPDATED! 
# Web Scraping Instagram with Selenium

### AUTOMATING IMAGE EXTRACTION

This notebook is an <b>upgraded</b> version of the original Web Scraping Instagram with Selenium notebook.
<br>
The code is adjusted to fit the new Selenium syntax (current as of December 2022) and contains solutions for the issues that arose in the comment section on YouTube and the issues section on GitHub.
<br>
Also, this code extracts the <b>full-size images</b>, not the <b>thumbnails</b>, and it's <b>100% automated</b>!

If you run into any other problems that don't have a solution yet, please let me know in the comment section of the YouTube tutorial:
<br>
https://youtu.be/iJGvYBH9mcY

### JUPYTER NOTEBOOK INSTRUCTIONS
Please click on "Kernel" in the menu above and then "Restart & Run All" for 100% automation.

### INSTALL REQUIREMENTS

In [None]:
!pip install selenium
!pip install wget
# the library below automatically installs a Web Driver for any browser
# please see OPTION 2: INSTALL WEB DRIVER below
!pip install webdriver_manager

### IMPORT MODULES

In [None]:
#imports here
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
import time

### OPTION 1: DOWNLOAD WEB DRIVER
You can download the latest stable release of <b>GeckoDriver</b> from:
<br>
https://github.com/mozilla/geckodriver/releases

### OPTION 2: INSTALL WEB DRIVER
You can use the `webdriver_manager` package you installed two cells above
<br>
and pass the driver it installs to `webdriver.Firefox()`, which is much quicker (and cooler ;)) than option 1.

## Log In to Instagram

### DUMMY ACCOUNT
Please use a dummy account to automate Instagram! <b>DO NOT USE</b> your personal account!
<br>
If you're not sure what a dummy account is, check out my <a href="https://youtu.be/aSeqMYNhEHo" target="_blank">Twitter Bot YouTube tutorial</a> at minute 02:25.

### DOWNLOADED WEB DRIVER
If you've downloaded your web driver, as in <b>OPTION 1</b>,
<br>
please replace the first 2 lines of code in the cell below with the following:

```
# OPTION 1: DOWNLOADED WEB DRIVER
driver = webdriver.Firefox(service=FirefoxService(executable_path='C:/path/to/your/geckodriver.exe'))
```


In [None]:
# OPTION 2: USE INSTALLED WEB DRIVER
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

# open the webpage
driver.get("https://www.instagram.com")

# target the username and password fields
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

# enter username and password
username.clear()
username.send_keys("my_username")
password.clear()
password.send_keys("my_password")

# target the login button and click it
# (.click() returns None, so there is nothing useful to assign here)
WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

time.sleep(3)
# We are logged in!
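
Typing the password straight into the notebook makes it easy to share it by accident. As a minimal sketch, the credentials could be read from environment variables instead. The function name `load_credentials` and the variable names `IG_USERNAME` and `IG_PASSWORD` are hypothetical, so pick whatever names suit you.

```python
import os

def load_credentials(user_var="IG_USERNAME", pass_var="IG_PASSWORD"):
    """Return (username, password) read from environment variables.

    The variable names are hypothetical placeholders, not something
    Instagram or Selenium expects.
    """
    username = os.environ.get(user_var)
    password = os.environ.get(pass_var)
    if not username or not password:
        raise RuntimeError("Please set " + user_var + " and " + pass_var + " first")
    return username, password
```

The two returned values could then be passed to `username.send_keys(...)` and `password.send_keys(...)` in place of the hard-coded strings.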

## Handle Alerts

You might get only a single alert, or you might get two of them.
<br>
Please adjust the cell below accordingly.

In [None]:
# dismiss the first alert (.click() returns None, so nothing is assigned)
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()
# uncomment the next line if a second alert appears
#WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()
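
Since the number of alerts varies, one option is to try dismissing them in a loop and stop as soon as none shows up. The helper below is only a sketch: `dismiss_dialogs` is a hypothetical name, and it takes the click action as a plain function so the idea can be shown without a browser.

```python
def dismiss_dialogs(click_next, max_dialogs=2, not_found=Exception):
    """Call click_next up to max_dialogs times.

    Stops at the first call that raises not_found and returns
    how many dialogs were dismissed.
    """
    dismissed = 0
    for _ in range(max_dialogs):
        try:
            click_next()
        except not_found:
            break  # no further dialog appeared in time
        dismissed += 1
    return dismissed
```

With Selenium, `click_next` could wrap the `WebDriverWait(...).until(...).click()` call from the cell above, and `not_found` could be `selenium.common.exceptions.TimeoutException`, which is what `WebDriverWait.until` raises when the wait runs out.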

## Search for a certain hashtag

In [None]:
import time

keyword = "cat"
driver.get("https://www.instagram.com/explore/tags/" + keyword + "/")

time.sleep(3)
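
The URL above is built by plain string concatenation, which works for a simple keyword like "cat". As a small sketch, a helper (hypothetically named `hashtag_url`) could also tolerate a leading `#`, stray spaces, and non-ASCII characters:

```python
from urllib.parse import quote

def hashtag_url(keyword):
    """Build the explore URL for a hashtag.

    Strips surrounding whitespace and a leading '#', removes inner
    spaces, and percent-encodes whatever is left.
    """
    tag = keyword.strip().lstrip("#").replace(" ", "")
    return "https://www.instagram.com/explore/tags/" + quote(tag, safe="") + "/"
```

`driver.get(hashtag_url(keyword))` would then replace the concatenation in the cell above.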

## Scroll Down
<br>
Increase n_scrolls to select more photos (depending on screen resolution)
<br>
<b>Example:</b>
<br>
<ul>
    <li>2 scrolls cover approx. 35 photos</li>
    <li>3 scrolls cover approx. 45 photos</li>
</ul>

In [None]:
n_scrolls = 2
img_links = []

for i in range(0, n_scrolls):  
    # select all the anchor elements on the page
    anchors = driver.find_elements(By.TAG_NAME, 'a')
    # only keep their href attributes
    anchors = [a.get_attribute('href') for a in anchors]
    # keep only the links that start with Instagram's post prefix
    anchors = [a for a in anchors if str(a).startswith("https://www.instagram.com/p/")]
    # store outside the for loop
    img_links += anchors
    print("added " + str(len(anchors)) + " links")
    
    # scroll to the bottom of the current image batch
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

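# consecutive scrolls can return the same links, so drop duplicates while keeping the order
img_links = list(dict.fromkeys(img_links))
print('Found ' + str(len(img_links)) + ' unique links to images')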
img_links[:5]
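
The loop above scrolls a fixed number of times. A possible variation, sketched here under the assumption that the page stops growing once there is nothing more to load, is to keep scrolling until `document.body.scrollHeight` no longer changes. `scroll_until_stable` is a hypothetical helper name; it only relies on the `execute_script` method already used above.

```python
import time

def scroll_until_stable(driver, max_scrolls=10, pause=3.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
        scrolls += 1
    return scrolls
```

This sketch only covers the stopping condition. The link collection would still need to run after each scroll, as in the loop above.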

In [None]:
images = []

# iterate over extracted image links
for a in img_links:
    # open URL and wait
    driver.get(a)
    time.sleep(3)
    
    #find all image elements on the page
    all_images = driver.find_elements(By.TAG_NAME, 'img')
    
    for i in all_images:
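        # keep only images whose src starts with Instagram's CDN prefix
        # NOTE: this CDN host may differ for you; inspect an image's src
        # in your browser and adjust the prefix if nothing is found
        src = i.get_attribute('src')
        if str(src).startswith("https://instagram.fcxh2-1.fna.fbcdn.net/"):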
            # store the image of interest and skip the rest of the images
            images.append(src)
            break
            
images[:5]
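
The prefix check above matches one specific CDN host. As a looser sketch, the host of each `src` could be compared against a short list of domain suffixes instead. The suffixes in `CDN_SUFFIXES` and the name `is_cdn_image` are assumptions for illustration, so verify them against the `src` values you actually see.

```python
from urllib.parse import urlparse

# Assumed host suffixes for the image CDN; adjust to what your browser shows
CDN_SUFFIXES = ("fbcdn.net", "cdninstagram.com")

def is_cdn_image(src):
    """Return True if src is an https URL whose host ends with one of CDN_SUFFIXES."""
    if not src:
        return False
    parsed = urlparse(src)
    host = parsed.hostname or ""
    return parsed.scheme == "https" and any(
        host == suffix or host.endswith("." + suffix) for suffix in CDN_SUFFIXES
    )
```

In the loop above, `if is_cdn_image(src):` could stand in for the `startswith` check.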

## Save images to computer

First we'll create a new folder for our images somewhere on our computer.
<br>
Then, we'll save all the images there.

In [None]:
import os
import wget

path = os.getcwd()
path = os.path.join(path, keyword)

#create the directory (exist_ok avoids an error when the notebook is run again)
os.makedirs(path, exist_ok=True)

path

In [None]:
#download images

for idx, image in enumerate(images):
    save_as = os.path.join(path, keyword + str(idx) + '.jpg')
    wget.download(image, save_as)
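
If a single download fails, the loop above stops with an exception. As a sketch of an alternative that uses only the standard library, the hypothetical helper `download_all` below saves files with the same naming scheme and skips any URL that cannot be fetched.

```python
import os
import urllib.request

def download_all(urls, folder, prefix):
    """Download each URL into folder as prefix0.jpg, prefix1.jpg, ...

    Returns the list of saved paths; failed downloads are skipped.
    """
    os.makedirs(folder, exist_ok=True)
    saved = []
    for idx, url in enumerate(urls):
        save_as = os.path.join(folder, prefix + str(idx) + ".jpg")
        try:
            urllib.request.urlretrieve(url, save_as)
        except (OSError, ValueError) as err:
            # URLError is an OSError; a malformed URL raises ValueError
            print("skipped " + url + ": " + str(err))
            continue
        saved.append(save_as)
    return saved
```

Called as `download_all(images, path, keyword)`, it would produce the same file names as the loop above.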