# Instagram Scraper API
This notebook shows how to use the Instagram Scraper API implemented by [realsirjoe](https://github.com/realsirjoe/instagram-scraper).

In [None]:
# don't worry about these
%load_ext autoreload
%autoreload 2

Let's import some libraries!

In [None]:
# numpy is for numerical calculation
import numpy as np

# matplotlib is for plotting
import matplotlib.pyplot as plt
# we need this line to embed the plot in the notebook
%matplotlib inline  

In [None]:
# instagram API that we are using
from igramscraper.instagram import Instagram

# we use "sleep" to control how fast we access the Instagram website.
# If the delay is too short, Instagram will block us
from time import sleep  

Let's log in.

In [None]:
# create the Instagram object; it's important to set the sleep time sufficiently high
instagram = Instagram(sleep_between_requests=15)

# login
instagram.with_credentials('username', 'password', 'cache')
instagram.login()

You can get posts related to a hashtag as follows:

In [None]:
# medias = instagram.get_current_top_medias_by_tag_name('london')

# returns a list of media
medias = instagram.get_medias_by_tag('london', count=3)

In [None]:
# let's check the first item in the list
media = medias[0]
print(media)

Let's see the image!

In [None]:
# requests library allows us to load content from URL 
import requests
# BytesIO is a nice wrapper to contain raw byte data read from the URL
from io import BytesIO
# we use PIL to "load" images from a URL
import PIL
from PIL import Image

def get_thumbnail(url, imsize=None):
    """ download the image and return a numpy array representing the image
    
    Args:
      url (str): URL to download the image from.
      (optional) imsize (tuple): Resize image, specify in (nx, ny) format.
    Returns:
       (np.ndarray): the downloaded image which is resized 
    """
    # fetch the rawdata from URL
    response = requests.get(url)
    # convert rawdata into an image
    img = Image.open(BytesIO(response.content))
    # resize the image using bicubic interpolation
    if imsize:
        img = img.resize(imsize, PIL.Image.BICUBIC)
    # convert to numpy array so we can process 
    img_array = np.array(img)
    # return the numpy array
    return img_array

def show_thumbnail(img, media_caption):
    """ Show thumbnail using matplotlib """
    plt.figure()
    plt.imshow(img)
    plt.axis('off')
    # fall back to an empty string if the caption is missing (e.g. None)
    plt.title(f'{(media_caption or "")[:20]}...')
    plt.show()

In [None]:
thumbnail = get_thumbnail(medias[0].image_high_resolution_url)

show_thumbnail(thumbnail, media.caption)

Cool! Let's iterate through all items in the list.

In [None]:
for media in medias:
    thumbnail = get_thumbnail(media.image_high_resolution_url)
    show_thumbnail(thumbnail, media.caption)

# Findings:

- So it seems like the best way for us to scrape data is by `instagram.get_medias_by_tag('london', count=100)`
- But it's also the case that the returned items do not actually contain the corresponding location information!
- I found that the way we can do this is to:

  (1) get medias using above function, where we can extract post id.
  
  (2) get media page from that id using `instagram.get_media_by_id`, which DOES contain location ID. 


- That is a bit sucky, as it means that for each image we need to access the actual page to get the location ID! It seems like this is the best we can do... It's ok, let's slowly scrape the website to build the dataset.
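
The two steps above can be sketched as one helper. This is only a rough sketch: `collect_locations` is a name made up for illustration, and it assumes each item returned by `get_medias_by_tag` exposes its post ID as `media.identifier`, which is worth checking with `print(media)` first.

```python
def collect_locations(client, tag, count=3):
    """Return one dict per post, with the location fields from its detail page."""
    rows = []
    # Step 1: list posts for the hashtag (these lack location information).
    for media in client.get_medias_by_tag(tag, count=count):
        # Step 2: one extra request per post to load the page that has the location.
        # ASSUMPTION: the post ID is available as `media.identifier`.
        detail = client.get_media_by_id(media.identifier)
        rows.append({
            'media_id': media.identifier,
            # getattr guards against posts that carry no location at all
            'location_id': getattr(detail, 'location_id', None),
            'location_name': getattr(detail, 'location_name', None),
            'location_slug': getattr(detail, 'location_slug', None),
        })
    return rows
```

It would be called as `rows = collect_locations(instagram, 'london', count=100)`. If every detail request sleeps first, then with `sleep_between_requests=15` those 100 posts take at least 100 × 15 s = 25 minutes, so it is worth starting with a small `count`.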

In [None]:
media = instagram.get_media_by_id('2183856998075057001')

In [None]:
media.location_id, media.location_name, media.location_slug

- One more thing we wanted to do is to get the geographical coordinate from the media.
- It seems like we cannot do this within Instagram, so we would need to think about how we can map [`location`] -> [`(latitude, longitude)`]
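
Whatever ends up providing the coordinates, many posts will share the same location, so it probably makes sense to resolve each `location_id` only once. Below is a minimal sketch of that idea. `LocationCache` and the `resolver` callable are hypothetical names, and the resolver itself (a geocoding service, a hand-made table, ...) is deliberately left open.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


class LocationCache:
    """Remember coordinates per location ID so each place is resolved only once."""

    def __init__(self, resolver: Callable[[str], Optional[Coordinates]]):
        # HYPOTHETICAL: `resolver` is any function mapping a location ID to
        # (latitude, longitude), or None when the place cannot be resolved.
        self._resolver = resolver
        self._known: Dict[str, Optional[Coordinates]] = {}

    def coordinates(self, location_id) -> Optional[Coordinates]:
        key = str(location_id)
        if key not in self._known:
            # First time we see this location: ask the resolver and store the answer.
            self._known[key] = self._resolver(key)
        return self._known[key]
```

Failed lookups are cached too (as `None`), so a location that cannot be resolved is not retried for every post.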

# Misc

If you go to an Instagram post and click the location, you can see that each location has a `location_id`. For London, this is 213385402.
Maybe this is useful for us later...?

In [None]:
location = instagram.get_location_by_id(213385402)
print(location)

# Advanced: Hacking Tutorial

- I wanted to know how the API works. For this, I really need to jump into the [code](https://github.com/realsirjoe/instagram-scraper/blob/2d5fb53f1a92add34a8dbcf129708ed15d478190/igramscraper/instagram.py) to know wtf is happening under the hood.

- I first check what happens when I run `instagram.get_account('tserence')`. In the source code, I can see exactly what `get_account` does at [line 1070](https://github.com/realsirjoe/instagram-scraper/blob/2d5fb53f1a92add34a8dbcf129708ed15d478190/igramscraper/instagram.py#L1070), where the code looks like this:
```
class Instagram:
    # ...
    
    def get_account(self, username):
        """
        :param username: username
        :return: Account
        """
        time.sleep(self.sleep_between_requests)
        response = self.__req.get(endpoints.get_account_page_link(
            username), headers=self.generate_headers(self.user_session))

        if Instagram.HTTP_NOT_FOUND == response.status_code:
            raise InstagramNotFoundException(
                'Account with given username does not exist.')

        if Instagram.HTTP_OK != response.status_code:
            raise InstagramException.default(response.text,
                                             response.status_code)

        user_array = Instagram.extract_shared_data_from_body(response.text)

        if user_array['entry_data']['ProfilePage'][0]['graphql']['user'] is None:
            raise InstagramNotFoundException(
                'Account with this username does not exist')

        return Account(
            user_array['entry_data']['ProfilePage'][0]['graphql']['user'])
```

- From this code, I can deduce that the code has 3 parts:

  - the part which gets the page content using `self.__req.get(...)`,
  
  - the part which converts the raw `response` to a JSON object which only contains the juicy part, using `Instagram.extract_shared_data_from_body(response.text)`,
  
  - the part which converts the result to an `Account` Object
  
- The rest is all handling exceptions, such as when the page doesn't exist, etc.

- Now, let's say I want to know what the actual JSON object looks like.... Then, I can replicate the first few lines to get the results

In [None]:
from igramscraper.instagram import endpoints

In [None]:
# create a request session so we can access the URLs
import requests  # <-- already imported above, so you don't actually need this. I put it here for clarity.
__req = requests.session()

# replicate the call to `instagram.get_account`
username = 'tserence'
response = __req.get(
    endpoints.get_account_page_link(username),
    headers=instagram.generate_headers(instagram.user_session),
)

In [None]:
# close the request session to release its connections... The Instagram API doesn't seem to do this! Potential problem...
__req.close()
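
A way to make sure the session is always closed is to use it as a context manager. A minimal sketch: `fetch_page` and its `session_factory` parameter are made-up names for illustration, and by default it uses `requests.Session`, which supports the `with` statement.

```python
def fetch_page(url, headers=None, session_factory=None):
    """Fetch `url` and return the response; the session is closed on the way out."""
    if session_factory is None:
        # Imported lazily so a stand-in session class can be passed in for testing.
        import requests
        session_factory = requests.Session
    # Leaving the `with` block closes the session, even if the request raises.
    with session_factory() as session:
        return session.get(url, headers=headers)
```

The request above could then be written as `response = fetch_page(endpoints.get_account_page_link(username), headers=instagram.generate_headers(instagram.user_session))`, with no separate `close()` cell needed.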

In [None]:
# convert to JSON
user_array = Instagram.extract_shared_data_from_body(response.text)
# I know the code accesses `user_array['entry_data']['ProfilePage'][0]['graphql']['user']` to get the user information, so let's see what is stored there
user_json =  user_array['entry_data']['ProfilePage'][0]['graphql']['user']
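
To get a feel for what a helper like `extract_shared_data_from_body` has to do, here is a rough stand-in. It is not the library's implementation: `extract_shared_data` is a made-up name, and it assumes the page embeds its data as `window._sharedData = {...};</script>`, which may not match what Instagram serves today.

```python
import json
import re


def extract_shared_data(html):
    """Pull the JSON assigned to `window._sharedData` out of a page's HTML."""
    # ASSUMPTION: the data sits in a script tag as `window._sharedData = {...};</script>`.
    match = re.search(
        r'window\._sharedData\s*=\s*(\{.*?\});</script>', html, re.DOTALL
    )
    if match is None:
        return None  # the page layout is different or the data is missing
    # The captured group is plain JSON, so the standard parser can read it.
    return json.loads(match.group(1))
```

The result is an ordinary nested `dict`, which is why the code above can index into it with `['entry_data']['ProfilePage'][0]['graphql']['user']`.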

In [None]:
# let's see what attributes are in this
user_json.keys()

Voila! Now we know what kind of internal information Instagram is using to render their content. Let's see some of the interesting contents...

In [None]:
terence_profile_pic = get_thumbnail(user_json['profile_pic_url'], imsize=(200, 200))
show_thumbnail(terence_profile_pic, 'Terence Profile Pic!')

In [None]:
for key in [
    'full_name',
    'is_private',
    'biography',
    'has_blocked_viewer',
]:
    print(f'[{key}]: {user_json[key]}')
    print()

Great! Terence hasn't blocked the viewer (`has_blocked_viewer`), what a great guy! (WOW Instagram, what kind of information are you returning!?)