# Instagram Scraper API
This notebook shows how to use the Instagram Scraper API implemented by [realsirjoe](https://github.com/realsirjoe/instagram-scraper).

In [None]:
# don't worry about these
%load_ext autoreload
%autoreload 2

Let's import some libraries that help us process and visualise data:

In [None]:
# numpy is for numerical calculation
import numpy as np

# matplotlib is for plotting
import matplotlib.pyplot as plt
# we need this line to embed the plot in the notebook
%matplotlib inline  

In [None]:
# instagram API that we are using
from igramscraper.instagram import Instagram

# we use "sleep" to control how quickly we send requests to the Instagram website.
# If the pause between requests is too short, Instagram will block us
from time import sleep  

Let's create an Instagram object for scraping!

In [None]:
# create the Instagram object; it's important to set the sleeping time sufficiently high
instagram = Instagram(sleep_between_requests=15)
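
To make the role of `sleep_between_requests` concrete, here is a minimal sketch of the underlying idea: pause before every request. `throttled` is a hypothetical helper written for this notebook, not part of `igramscraper`, and its `sleep` parameter is injectable only so the idea can be tried without actually waiting.

```python
import time


def throttled(func, delay_seconds, sleep=time.sleep):
    """Wrap `func` so that every call first pauses for `delay_seconds`.

    `throttled` is a hypothetical helper for illustration only.
    """
    def wrapper(*args, **kwargs):
        sleep(delay_seconds)  # wait before touching the website
        return func(*args, **kwargs)
    return wrapper


# usage: record the pauses in a list instead of really sleeping
pauses = []
slow_len = throttled(len, 15, sleep=pauses.append)
slow_len('london')
```

With the default `sleep=time.sleep`, each call to the wrapped function would block for `delay_seconds` first.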

You can get posts related to a hashtag as follows:

In [None]:
# medias = instagram.get_current_top_medias_by_tag_name('london')

# returns a list of media
medias = instagram.get_medias_by_tag('london', count=3)

In [None]:
# let's check the first item in the list
media = medias[0]
print(media)

Let's see the image!

In [None]:
# requests library allows us to load content from URL 
import requests
# BytesIO is a nice wrapper to contain raw byte data read from the URL
from io import BytesIO
# we use PIL to "load" images from a URL
import PIL
from PIL import Image

def get_thumbnail(url, imsize=None):
    """ download the image and return a numpy array representing the image
    
    Args:
      url (str): URL to download the image from.
      imsize (tuple, optional): Resize image, specify in (nx, ny) format.
    Returns:
       (np.ndarray): the downloaded image, resized if `imsize` is given
    """
    # fetch the raw data from the URL
    response = requests.get(url)
    # stop early with a clear error if the download failed (e.g. 404)
    response.raise_for_status()
    # convert rawdata into an image
    img = Image.open(BytesIO(response.content))
    # resize the image using bicubic interpolation
    if imsize:
        img = img.resize(imsize, PIL.Image.BICUBIC)
    # convert to numpy array so we can process 
    img_array = np.array(img)
    # return the numpy array
    return img_array

def show_thumbnail(img, media_caption):
    """ Show thumbnail using matplotlib """
    plt.figure()
    plt.imshow(img)
    plt.axis('off')
    # a post may have no caption (None), so fall back to an empty string
    plt.title(f'{(media_caption or "")[:20]}...')
    plt.show()

In [None]:
thumbnail = get_thumbnail(medias[0].image_high_resolution_url)

show_thumbnail(thumbnail, media.caption)

Cool! Let's iterate through the list:

In [None]:
for media in medias:
    thumbnail = get_thumbnail(media.image_high_resolution_url)
    show_thumbnail(thumbnail, media.caption)

# Findings:

- The best way for us to get the data we want is to scrape the website using `instagram.get_medias_by_tag('london', count=100)`
- The above function only gives you basic media information (caption, comments, #likes, etc.) and does not include location information
- In order to get the location information, we can do:

  (1) extract media id using the above function,
  
  (2) extract the media content using the media id and `instagram.get_media_by_id` function, 
  
  (3) extract location information that is contained in the media content


- This is not the clean solution we hoped for, but at least it gets us some of the data we wanted!
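
The three steps above can be sketched as a single function. This is a minimal sketch, not a tested recipe: `collect_locations` is a hypothetical helper, and it assumes that each media object returned by `get_medias_by_tag` exposes its id as `identifier`. Note that step (2) costs one extra request per media item, so it is slow when `sleep_between_requests` is high.

```python
def collect_locations(client, tag, count=3):
    """Hypothetical helper: list media for a tag and look up each location."""
    rows = []
    for media in client.get_medias_by_tag(tag, count=count):
        # (1) take the media id from the tag listing
        #     (assumption: the id is stored in `identifier`)
        media_id = media.identifier
        # (2) reload the full media content using the media id
        full_media = client.get_media_by_id(media_id)
        # (3) read the location fields; getattr guards against missing attributes
        rows.append({
            'media_id': media_id,
            'location_id': getattr(full_media, 'location_id', None),
            'location_name': getattr(full_media, 'location_name', None),
            'location_slug': getattr(full_media, 'location_slug', None),
        })
    return rows
```

It would be called as `collect_locations(instagram, 'london', count=3)` with the `instagram` object created earlier.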

In [None]:
media = instagram.get_media_by_id('2183856998075057001')

In [None]:
media.location_id, media.location_name, media.location_slug

[Note] 
- We wanted to extract the geographical coordinates from the media. This seems nontrivial and cannot be done within Instagram. We would need to think of an alternative way to map [`location name`] -> [`(latitude, longitude)`]
- Each location has corresponding `location_id`. For example, `location_id` of "london" is 213385402.

In [None]:
location = instagram.get_location_by_id(213385402)
print(location)

Some major locations do seem to contain (latitude, longitude) information, though.
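
Since only some locations carry coordinates, a defensive lookup is useful. The sketch below assumes the location object stores them in attributes named `lat` and `lng`; those names are an assumption, so check the printed output of `get_location_by_id` before relying on them. `location_coordinates` is a hypothetical helper.

```python
def location_coordinates(location):
    """Return (latitude, longitude) as floats, or None if either is missing.

    Assumption: the coordinates live in attributes named `lat` and `lng`.
    """
    lat = getattr(location, 'lat', None)
    lng = getattr(location, 'lng', None)
    if lat is None or lng is None:
        return None
    return float(lat), float(lng)
```

Calling `location_coordinates(location)` on the object fetched above would then return either a coordinate pair or `None`, which makes it easy to filter out locations without coordinates.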

# Advanced: Hacking Tutorial

- What is actually happening under the hood of this instagram API?
- We can find out by walking through the [code](https://github.com/realsirjoe/instagram-scraper/blob/2d5fb53f1a92add34a8dbcf129708ed15d478190/igramscraper/instagram.py).

For example, let's check what happens when I run `instagram.get_account('tserence')`.
- If I go to the source code, I can check what the `get_account` function does: [line 1070](https://github.com/realsirjoe/instagram-scraper/blob/2d5fb53f1a92add34a8dbcf129708ed15d478190/igramscraper/instagram.py#L1070). There, I find that the code looks like this:
```python
class Instagram:
    # ...
    
    def get_account(self, username):
        """
        :param username: username
        :return: Account
        """
        time.sleep(self.sleep_between_requests)
        response = self.__req.get(endpoints.get_account_page_link(
            username), headers=self.generate_headers(self.user_session))

        if Instagram.HTTP_NOT_FOUND == response.status_code:
            raise InstagramNotFoundException(
                'Account with given username does not exist.')

        if Instagram.HTTP_OK != response.status_code:
            raise InstagramException.default(response.text,
                                             response.status_code)

        user_array = Instagram.extract_shared_data_from_body(response.text)

        if user_array['entry_data']['ProfilePage'][0]['graphql']['user'] is None:
            raise InstagramNotFoundException(
                'Account with this username does not exist')

        return Account(
            user_array['entry_data']['ProfilePage'][0]['graphql']['user'])
```

- From this code, I can deduce that the code has 3 parts:

  - the part which gets the page content using `self.__req.get(...)`,
  
  - the part which converts the raw `response` into a JSON object containing only the juicy part, using `Instagram.extract_shared_data_from_body(response.text)`,
  
  - the part which converts the result to an `Account` object
  
- The remainder of the code checks whether the HTTP request was successful and whether the user data is present, and raises an exception when either check fails.
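
The three-part structure described above can be sketched with plain stand-ins. Everything here is hypothetical and only mirrors the shape of `get_account`: `load_user` and `NotFoundError` are made-up names, `fetch` and `extract` are callables you pass in, and a plain `dict` stands in for the `Account` object.

```python
class NotFoundError(Exception):
    """Stand-in for InstagramNotFoundException (hypothetical)."""


def load_user(fetch, extract, username):
    """Hypothetical helper mirroring the three parts of `get_account`."""
    # part 1: get the page content; `fetch` returns (status_code, body)
    status_code, body = fetch(username)
    if status_code == 404:
        raise NotFoundError('Account with given username does not exist.')
    if status_code != 200:
        raise RuntimeError(f'Unexpected HTTP status {status_code}')

    # part 2: convert the raw body into a dictionary and keep the juicy part
    shared_data = extract(body)
    user = shared_data['entry_data']['ProfilePage'][0]['graphql']['user']
    if user is None:
        raise NotFoundError('Account with this username does not exist')

    # part 3: wrap the result (a plain dict stands in for `Account`)
    return dict(user)
```

Passing the steps in as functions keeps the sketch independent of the network, so each part can be swapped out and inspected on its own.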

If you `Ctrl+Click` (or right-click) on the page and choose `Inspect`, you can see the source HTML of the website. Every website is simply a rendering of this HTML. A scraper works the same way: it literally loads this HTML, picks out the components you want, and downloads them.
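
To illustrate how a scraper can pick data out of raw HTML, here is a minimal sketch that pulls an embedded JSON object out of a page with a regular expression. The sample HTML is made up for this example, `extract_embedded_json` is a hypothetical helper, and the library's own `extract_shared_data_from_body` may be implemented differently.

```python
import json
import re

# made-up page: a JSON object assigned to `window._sharedData` inside a script tag
SAMPLE_HTML = (
    '<html><body><script type="text/javascript">'
    'window._sharedData = {"entry_data": {"ProfilePage": '
    '[{"graphql": {"user": {"full_name": "Example User"}}}]}};'
    '</script></body></html>'
)


def extract_embedded_json(html):
    """Return the JSON object assigned to `_sharedData`, or None if absent."""
    # capture everything from the first "{" up to the "}" that closes the script
    match = re.search(r'_sharedData\s*=\s*(\{.*?\});</script>', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


shared_data = extract_embedded_json(SAMPLE_HTML)
user = shared_data['entry_data']['ProfilePage'][0]['graphql']['user']
print(user['full_name'])
```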

Since the obtained HTML body is converted into `json` format, let's see what this actually looks like. To step through the Instagram API, I will replicate the behaviour of `instagram.get_account` in the notebook cells below:

In [None]:
from igramscraper.instagram import endpoints

In [None]:
# create a request session so we can access the URLs
import requests  # <-- already imported above so you actually don't need this. I put it here for clarity.
__req = requests.session()

# replicate the call to `instagram.get_account`
username = 'tserence'
response = __req.get(
    endpoints.get_account_page_link(username),
    headers=instagram.generate_headers(instagram.user_session),
)

In [None]:
# close the request session so that its connections are released (avoids leaking resources)...
# The Instagram scraper API doesn't do this! Potential problem...
__req.close()
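
An alternative to closing the session by hand is to use it as a context manager, which closes it even if the request raises. The sketch below is one possible way to write this: `fetch_text` is a hypothetical helper, and its `session_factory` parameter exists only so that a stand-in session can be supplied when trying it out.

```python
def fetch_text(url, headers=None, session_factory=None):
    """Fetch `url` and return the response text, always closing the session.

    `fetch_text` and `session_factory` are hypothetical, for illustration only.
    """
    if session_factory is None:
        # imported lazily so the sketch can be tried with a stand-in session
        import requests
        session_factory = requests.Session
    # leaving the `with` block closes the session, even when an error occurs
    with session_factory() as session:
        response = session.get(url, headers=headers)
        return response.text
```

With the names used in this notebook, the call would look like `fetch_text(endpoints.get_account_page_link(username), headers=instagram.generate_headers(instagram.user_session))`.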

In [None]:
# convert to JSON
user_array = Instagram.extract_shared_data_from_body(response.text)
# I know that `get_account` reads the user information from
# `user_array['entry_data']['ProfilePage'][0]['graphql']['user']`,
# so let's see what is in there.
user_json =  user_array['entry_data']['ProfilePage'][0]['graphql']['user']

In [None]:
# let's see what attributes are in this
user_json.keys()

Voila! Now we know what kind of internal information Instagram is using to render their content. Let's see some of the interesting contents...

In [None]:
terence_profile_pic = get_thumbnail(user_json['profile_pic_url'], imsize=(200, 200))
show_thumbnail(terence_profile_pic, 'Terence Profile Pic!')

In [None]:
for key in [
    'full_name',
    'is_private',
    'biography',
    'has_blocked_viewer',
]:
    print(f'[{key}]: {user_json[key]}')
    print()

Great! Terence has no `has_blocked_viewer`'s, what a great guy! (WOW Instagram, what kind of information are you returning!?)