# Scraping Instagram with Selenium

## Importing Libraries 

Selenium is a very powerful web scraping tool: it can target specific content elements on a webpage and extract them mercilessly!

But great power also leaves room for great errors, and in this short tutorial, I will show handy ways to bypass them and automate the entire process of image extraction.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

## Login on Instagram 

We use WebDriverWait to pause until each element is clickable before we interact with it. This way, we allow enough time for the elements to load, and this step is absolutely crucial if you’re looking for a 100% automated process.
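
To make the idea of an explicit wait concrete, here is a minimal pure-Python sketch of the polling pattern behind it: call a condition repeatedly until it returns something truthy or a timeout expires. The names `wait_until`, `condition`, `timeout` and `poll` are illustrative assumptions; this is not Selenium's implementation, only the general idea.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    Illustrative helper (not part of Selenium): returns the truthy value,
    or raises TimeoutError if the condition never succeeds in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Usage: a stand-in for "the element has loaded" that succeeds on the third check.
attempts = {"count": 0}


def element_loaded():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_loaded, timeout=2.0, poll=0.01)
print(found, "after", attempts["count"], "checks")
```

Compared with a fixed pause, this kind of wait returns as soon as the condition holds, so the script is only as slow as the page.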


Now we are safe to enter our own personal user name and password and click on the login button.

In [2]:
driver = webdriver.Chrome()

#open the webpage
driver.get("http://www.instagram.com")

#target the username and password fields
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

#enter username and password (replace the placeholders with your own credentials)
username.clear()
username.send_keys("your_username")
password.clear()
password.send_keys("your_password")

#target the login button and click it
WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()
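
Hard-coding a password in a notebook makes it easy to leak by accident. A minimal sketch of one alternative, assuming the credentials are stored in two environment variables (the names `INSTAGRAM_USER` and `INSTAGRAM_PASSWORD` and the helper `load_credentials` are hypothetical choices for this example):

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables.

    The variable names are assumptions for this sketch; pick any names you like.
    """
    try:
        return env["INSTAGRAM_USER"], env["INSTAGRAM_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"please set the environment variable {missing}") from None


# Usage with a plain dict standing in for the real environment:
user, secret = load_credentials(
    {"INSTAGRAM_USER": "my_user", "INSTAGRAM_PASSWORD": "my_pass"}
)
print(user)
```

The returned values can then be passed to `username.send_keys(...)` and `password.send_keys(...)` in place of literal strings.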

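Here we will select the button with the text “Plus tard” (French for “Not now”, because the browser in this example shows Instagram in French), and we’ll perform this action twice because another pop-up message follows the first. If your interface uses another language, adjust the button text accordingly: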

In [3]:
#handle the two "Not now" pop-ups ("Plus tard" in the French interface)
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Plus tard")]'))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Plus tard")]'))).click()
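
Because the XPath matches the button by its visible text, it only works for one interface language. A small sketch of a helper that builds an XPath matching any of several labels; the function name `button_xpath` is hypothetical, and the English label "Not Now" is an assumption about the English-language interface:

```python
def button_xpath(labels):
    """Build an XPath that matches a <button> whose text contains any label.

    Assumes the labels contain no double quotes, since they are embedded
    in a double-quoted XPath string literal.
    """
    if not labels:
        raise ValueError("at least one label is required")
    conditions = " or ".join(f'contains(text(), "{label}")' for label in labels)
    return f"//button[{conditions}]"


xpath = button_xpath(["Plus tard", "Not Now"])
print(xpath)
```

The resulting string can be passed to `EC.element_to_be_clickable((By.XPATH, xpath))` exactly like the hard-coded expression.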

Next, we are finally presented with our Instagram feed and we can move on with searching keywords.

## Search keywords 

Next, we will type our keyword into the search field (cristiano, for example; the code below uses the account name celine_dr19).
Then, we will press the ENTER key, because Instagram doesn’t provide a search button for us to click on, so we need to improvise.
For this, we will also delay the execution of our Python code to match the loading speed of our webpage. However, instead of using WebDriverWait, we will use time.sleep(seconds).

In [4]:
import time

#target the search input field ("Rechercher" is the French placeholder for "Search")
searchbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Rechercher']")))
searchbox.clear()
#for reference, the search field's HTML: <input class="XTCLo x3qfX" type="text" autocapitalize="none" placeholder="Rechercher" value="">
#keyword to search for (an account name in this example)
keyword = "celine_dr19"
searchbox.send_keys(keyword)
 
#press ENTER twice, waiting 5 seconds before and after each key press
#so the page has time to load
time.sleep(5)
searchbox.send_keys(Keys.ENTER)
time.sleep(5)
searchbox.send_keys(Keys.ENTER)
time.sleep(5)

In [5]:
#scroll down to scrape more images
driver.execute_script("window.scrollTo(0, 4000);")

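#target all images on the page (find_elements_by_tag_name was removed in Selenium 4)
images = driver.find_elements(By.TAG_NAME, 'img')
images = [image.get_attribute('src') for image in images]
#drop the last two entries (presumably not post images)
images = images[:-2]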

print('Number of scraped images: ', len(images))

Number of scraped images:  27
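
The `src` values collected this way can include empty entries, duplicates, or inline `data:` URIs, none of which can be downloaded as a separate file. A sketch of a filter that keeps only unique `http(s)` URLs in their original order; `clean_image_urls` is a hypothetical helper name, and the sample URLs are made up:

```python
def clean_image_urls(urls):
    """Keep unique http(s) URLs, preserving their original order."""
    seen = set()
    cleaned = []
    for url in urls:
        if not url or not url.startswith(("http://", "https://")):
            continue  # skip None, empty strings and data: URIs
        if url in seen:
            continue  # skip duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned


sample = [
    "https://example.com/a.jpg",
    None,
    "data:image/png;base64,AAAA",
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
]
print(clean_image_urls(sample))
```

With the notebook's variables, this would be applied as `images = clean_image_urls(images)` before downloading.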


## Save images to computer 

Once we have the URLs of our images in a neat list, we’ll create a brand new folder named after our keyword.

The first command gets the current working directory, the second command appends the folder name we would like to add to this directory (based on our keyword), and the last command creates the directory on our computer (mkdir = make directory).

In [6]:
import os
import wget

path = os.getcwd() #current working directory
path = os.path.join(path, keyword) #folder named after the keyword
os.mkdir(path) #create the directory

path

'/home/ghazouani/Bureau/notebooks/celine_dr19'
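
Note that `os.mkdir` raises `FileExistsError` when the folder already exists, which happens as soon as the notebook is run a second time with the same keyword. A hedged sketch of a rerun-safe variant using `os.makedirs` with `exist_ok=True`; the helper name `make_output_dir` is hypothetical, and the demo uses a temporary directory instead of the current one:

```python
import os
import tempfile


def make_output_dir(base_dir, keyword):
    """Create base_dir/keyword if needed and return its path."""
    path = os.path.join(base_dir, keyword)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


with tempfile.TemporaryDirectory() as base:
    first = make_output_dir(base, "celine_dr19")
    second = make_output_dir(base, "celine_dr19")  # second call is harmless
    created = os.path.isdir(first)
    same_path = first == second
    print(created, same_path)
```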

Lastly, we will use wget to help us with downloading the images.

We’ll create a variable “counter” that holds the index of each image, so that the file names are the keyword followed by the index: “cristiano0.jpg”, then “cristiano1.jpg”, and so on all the way until the last image.

In [7]:
#download images
counter = 0
for image in images:
    #file name = keyword + index + '.jpg'
    save_as = os.path.join(path, keyword + str(counter) + '.jpg')
    wget.download(image, save_as)
    counter += 1
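
A variant of the download loop that uses `enumerate` instead of a manual counter and skips images that fail to download, so one bad URL does not stop the whole run. This sketch swaps `wget` for the standard library's `urllib.request.urlretrieve` and makes the fetch function a parameter; `download_all` and its parameters are hypothetical names:

```python
import os
import urllib.request


def download_all(urls, folder, prefix, fetch=urllib.request.urlretrieve):
    """Save each URL as folder/<prefix><index>.jpg and return the saved paths.

    fetch(url, destination) performs the download; failures are reported
    and skipped instead of aborting the loop.
    """
    saved = []
    for index, url in enumerate(urls):
        destination = os.path.join(folder, f"{prefix}{index}.jpg")
        try:
            fetch(url, destination)
        except (OSError, ValueError) as error:
            print(f"skipping {url}: {error}")
            continue
        saved.append(destination)
    return saved


# Demo with a fake fetch function, so no network access is needed.
def fake_fetch(url, destination):
    if "broken" in url:
        raise OSError("download failed")


paths = download_all(
    ["https://example.com/a.jpg", "https://example.com/broken.jpg"],
    "images",
    "cristiano",
    fetch=fake_fetch,
)
print(paths)
```

With the notebook's variables, the call would look like `download_all(images, path, keyword)`.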

And now we can sit back, relax, press the “Run All Cells” button and be impressed with our superior coding skills.
Each time you change the keyword, you’ll get a brand new image database within seconds!

## What can we do with a database of images?


From my point of view, the answer is simple: machine learning and image classification! We can train a neural network to recognize faces, for example to tell Cristiano from Messi, and to predict whether a specific, never-before-seen image shows Messi or Cristiano.
We can do the same with flowers, buildings and landmarks, and when we have free access to such powerful tools as Selenium, the sky is the limit. This technology is very handy when it comes to data science, artificial intelligence and even graphic design, so if you’d like to learn more about web scraping, and Selenium in particular, please have a look at the documentation:


https://selenium-python.readthedocs.io/