# Project TT Instagram Data Attributes
- This notebook shows how we extract data attributes from the media information obtained through the Instagram API.

Let's get started by importing modules and setting up the Instagram API.

In [None]:
# don't worry about these
%load_ext autoreload
%autoreload 2

In [None]:
from time import sleep

from igramscraper.instagram import Instagram
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

##### Digression: Reusing code & modules
Do you remember `get_thumbnail` and `show_thumbnail` from "[1] Instagram Scraper API Tutorial.ipynb"? I want to reuse them rather than implement them again here, so I put the function definitions in `core.utils`. By *importing* our own module, we can use the functions here.

When you call `import module_xyz`, python tries to find `module_xyz` from a few "locations":
- (1) internal system directories,
- (2) current working directory and
- (3) directories listed in the *environment variable* called `PYTHONPATH`.

If the module is located somewhere other than these paths, then it cannot be discovered! See it for yourself:

In [None]:
import sys  # a library that helps us find out the system information of our machines
sys.path  # the list of directories python searches for modules (it includes the PYTHONPATH entries)

In [None]:
import os  # a library that helps us find out the directories of our machines
os.getcwd()  # get the (c)urrent (w)orking (d)irectory

In [None]:
import core  # FAILS, as core is neither on the search path nor in the current directory

As you can see, our `core` module, which lives under `project_TT`, is not on Python's search path. So let's add it.

In [None]:
# add project root to python path
PROJECT_TT_ROOT = os.path.abspath('..')  # this might change for you depending on where you are running
sys.path.append(PROJECT_TT_ROOT)

In [None]:
# now let's import the core
import core  # works!
from core.utils import get_thumbnail, show_thumbnail, imresize
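To see the search-path mechanism in isolation, here is a minimal sketch that writes a throwaway module into a temporary directory and checks that it only becomes importable once that directory is on `sys.path`. The module name `tt_demo_module` and the helper `can_import` are made up for this illustration; they are not part of `core`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name):
    """Return True if `module_name` can currently be imported."""
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical throwaway module, only for this demo
    (Path(tmp_dir) / 'tt_demo_module.py').write_text('ANSWER = 42\n')

    found_before = can_import('tt_demo_module')  # directory is not on sys.path yet

    sys.path.append(tmp_dir)       # same trick as with PROJECT_TT_ROOT
    importlib.invalidate_caches()  # make sure python notices the new file
    found_after = can_import('tt_demo_module')

    sys.path.remove(tmp_dir)       # clean up after ourselves

print(found_before, found_after)
```

Appending to `sys.path` only affects the running Python process, which is why the notebook has to do it again every time the kernel restarts.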

##### Now let's get back to business!

Create the Instagram API client (no credentials are passed here, so requests are made without logging in):

In [None]:
instagram = Instagram(sleep_between_requests=10)

##### Data Attributes
We will extract the following attributes from the media that we are scraping:
- User
    - User ID
    - Full name
- Location
    - Location ID
    - Location name
- Media type: one of {Image, Video, Carousel}
- Caption
- Comments
    - A list of comments
    - Number of comments
- Time created
- Number of likes
- URLs
    - Image/video/carousel URL
    - Thumbnail URL

Cool! Let's remind ourselves which attributes each media object has.

In [None]:
# choose which one to test...
image_urls = {
    'kjeragbolten': 'https://www.instagram.com/p/B2Mwbzkg-m8/',
    'songkran': 'https://www.instagram.com/p/Bw96sSrgoGa/',
    'kohphiphi': 'https://www.instagram.com/p/BwmhYcFAzUi/'
}

image_url = image_urls['kjeragbolten']

In [None]:
media = instagram.get_media_by_url(image_url)
print(media)

In [None]:
thumbnail = get_thumbnail(media.image_thumbnail_url)
show_thumbnail(thumbnail, media.caption)

In [None]:
# Media ID
print('[Media ID]')
print(media.identifier)  # Internal ID
print(media.short_code)
print(media.link)  # Media URL
print(media.type)  # type: image, video or carousel
# Username 
print('[Username]')
print(media.owner.identifier)
print(media.owner.username)
print(media.owner.full_name)
# Location
print('[Location]')
print(media.location_name)
# Post descriptions
print('[Caption]')
print(media.caption)
# Time posted
print('[Time]')
print(media.created_time)  # <- can you guess what this number means?
# Number of likes
print('[#likes]')
print(media.likes_count)
# Number of comments
print('[Comments]')
print(media.comments_count)  # <- this is wrong
print(media.comments)  # <- this is empty...
# Image url
print('[Image URLs]')
print(media.image_thumbnail_url)
print(media.image_high_resolution_url)  # original resolution
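The number printed under `[Time]` looks like a Unix timestamp, i.e. seconds since 1970-01-01 UTC. Assuming that is the case, a small sketch like the one below turns it into a readable date. The helper name `to_utc_datetime` and the example value are made up for illustration.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_time):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(created_time), tz=timezone.utc)


example_ts = 1568000000  # made-up value, not taken from the post above
print(to_utc_datetime(example_ts).isoformat())
```

In the notebook you would call `to_utc_datetime(media.created_time)` instead of using the made-up value.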

The returned object is missing some of the information we are looking for:
  - The comments.
  - I know this post has multiple images (carousel), but I'm only getting the URL for the first one. We might use the others later.

This means we need to hack the code again to get the information we want!

##### ENTER HACK-MODE
- Again, I copied a piece of the library's source code, this time `get_media_json_by_url`.
- I'm going to modify it to get all the information we are looking for.

In [None]:
import requests
import re
from igramscraper.exception.instagram_exception import InstagramException
from igramscraper.exception.instagram_not_found_exception import InstagramNotFoundException


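def get_media_json_by_url(insta_obj, media_url):
    """
    Fetch the raw JSON describing a single media post.

    Original code found at: https://github.com/realsirjoe/instagram-scraper/blob/2d5fb53f1a92add34a8dbcf129708ed15d478190/igramscraper/instagram.py#L371

    :param insta_obj: an `Instagram` instance, used to generate the request headers
    :param media_url: media url
    :return: json (dict) of the media
    """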
    # create a one-off session
    __req = requests.session()
    
    # === Don't worry from here ===
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    if len(re.findall(url_regex, media_url)) <= 0:
        raise ValueError('Malformed media url')
    url = media_url.rstrip('/') + '/?__a=1'
    # === to here ===
    
    # Make request to get the information
    response = __req.get(url, headers=insta_obj.generate_headers(insta_obj.user_session))

    # === Don't worry from here ===
    if Instagram.HTTP_NOT_FOUND == response.status_code:
        raise InstagramNotFoundException('Media with given code does not exist or account is private.')

    if Instagram.HTTP_OK != response.status_code:
        raise InstagramException.default(response.text, response.status_code)
    # === to here ===

    media_array = response.json()
    try:
        media_in_json = media_array['graphql']['shortcode_media']
    except KeyError:
        raise InstagramException('Media with this code does not exist')

    return media_in_json

In [None]:
media_json = get_media_json_by_url(instagram, image_url)
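The only URL handling in the function above is the validation and the `?__a=1` suffix. As a standalone sketch, that step looks like this. The helper name `build_media_json_url` is hypothetical, the regex is the one copied above, and nothing here checks whether Instagram actually answers at the resulting address.

```python
import re

# same pattern as in get_media_json_by_url
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def build_media_json_url(media_url):
    """Validate a media URL and append the '?__a=1' query used above."""
    if not re.findall(URL_REGEX, media_url):
        raise ValueError('Malformed media url')
    # strip any trailing slash first so we never end up with '//?__a=1'
    return media_url.rstrip('/') + '/?__a=1'


print(build_media_json_url('https://www.instagram.com/p/B2Mwbzkg-m8/'))
```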

Alright! What keys are actually returned? SHOW ME EVERYTHING:

In [None]:
for k, v in media_json.items():
    print(f'[{k}]: {v}\n\n')
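If the full dump is too much to scroll through, one option is to print only the type and size of each value. The sketch below does that on a made-up stand-in dictionary; `summarize` and `fake_media_json` are hypothetical names, and the stand-in only mimics a few of the keys used in this notebook.

```python
def summarize(value):
    """Describe a JSON value by type and size instead of printing it in full."""
    if isinstance(value, dict):
        return f'dict with {len(value)} keys'
    if isinstance(value, list):
        return f'list with {len(value)} items'
    # scalars: show the type and at most the first 60 characters
    return f'{type(value).__name__}: {str(value)[:60]}'


# made-up stand-in for media_json
fake_media_json = {
    '__typename': 'GraphSidecar',
    'is_video': False,
    'edge_media_to_parent_comment': {'edges': []},
}

for key, value in fake_media_json.items():
    print(f'[{key}]: {summarize(value)}')
```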

Okay, this is quite long, but it is also quite informative. In fact, we got:
- All comments
- All carousel images
- `is_video` attribute which will let us find out if it is (an) image(s) or a video.

Cool! Let's write our own wrapper that gets not only the default media attributes but also the other information we want.

In [None]:
from igramscraper.model import Media

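def get_type(typename):
    """Map Instagram's `__typename` to a short media type name."""
    if typename == 'GraphImage':
        return 'image'
    elif typename == 'GraphVideo':
        return 'video'
    elif typename == 'GraphSidecar':
        return 'carousel'
    return None  # unknown typename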

In [None]:
# the default attributes parsed by igramscraper
m = Media(media_json)

In [None]:
# add comments to media
comments_attr = media_json['edge_media_to_parent_comment']
comments = [comment['node']['text'] for comment in comments_attr['edges']]
# note: untested against paginated comments; only the comments present in this response are counted
comments_count = len(comments)
comments = '_[COMMENT_SEPARATOR]_'.join(comments)  # join into a single string
m.comments = comments
m.comments_count = comments_count
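The pagination worry in the comment above can be made explicit. The sketch below wraps the extraction in a hypothetical helper, `extract_comments`, and runs it on made-up data. It assumes the comment block may also carry a `count` field and a `page_info.has_next_page` flag; if those keys are absent, it falls back to what was actually fetched.

```python
def extract_comments(media_json):
    """Return (comment texts, reported total, whether more pages exist)."""
    comments_attr = media_json.get('edge_media_to_parent_comment', {})
    texts = [edge['node']['text'] for edge in comments_attr.get('edges', [])]
    # ASSUMPTION: 'count' and 'page_info' may be present in the response;
    # fall back to the fetched comments if they are not
    total = comments_attr.get('count', len(texts))
    has_more = comments_attr.get('page_info', {}).get('has_next_page', False)
    return texts, total, has_more


# made-up stand-in for media_json
fake_media_json = {
    'edge_media_to_parent_comment': {
        'count': 3,
        'page_info': {'has_next_page': True},
        'edges': [
            {'node': {'text': 'Wow!'}},
            {'node': {'text': 'Amazing view'}},
        ],
    }
}

texts, total, has_more = extract_comments(fake_media_json)
print(texts, total, has_more)
```

When `has_more` is true, `len(texts)` undercounts the real number of comments, so it is worth storing the reported total separately.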

In [None]:
# add carousel urls to media
media_ids = []
media_types = []
thumbnails = []
image_highres_urls = []
if 'edge_sidecar_to_children' in media_json:
    sidecars = media_json['edge_sidecar_to_children']['edges']
    for sidecar in sidecars:
        node = sidecar['node']
        media_ids.append(node['id'])
        media_types.append(get_type(node['__typename']))
        thumbnails.append(node['display_resources'][0]['src'])
        image_highres_urls.append(node['display_resources'][-1]['src'])
media_ids = ','.join(media_ids)
media_types = ','.join(media_types)
thumbnails = ' , '.join(thumbnails)
image_highres_urls = ' , '.join(image_highres_urls)

# create new attributes
m.carousel_ids = media_ids
m.carousel_types = media_types
m.carousel_thumbnail_urls = thumbnails
m.carousel_image_highres_urls = image_highres_urls
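The same carousel logic can be packaged as a function that returns one record per child, which is easier to test than four parallel lists. This is only a sketch on made-up data: `extract_carousel` and `fake_media_json` are hypothetical, and, like the cell above, it assumes `display_resources` is ordered from smallest to largest.

```python
def extract_carousel(media_json):
    """Return one dict per carousel child; empty list if there is no carousel."""
    items = []
    edges = media_json.get('edge_sidecar_to_children', {}).get('edges', [])
    for edge in edges:
        node = edge['node']
        resources = node['display_resources']
        items.append({
            'id': node['id'],
            'typename': node['__typename'],
            'thumbnail_url': resources[0]['src'],   # first entry: smallest (assumed)
            'highres_url': resources[-1]['src'],    # last entry: largest (assumed)
        })
    return items


# made-up stand-in for media_json
fake_media_json = {
    'edge_sidecar_to_children': {
        'edges': [
            {'node': {'id': '1', '__typename': 'GraphImage',
                      'display_resources': [{'src': 'https://example.com/1_small.jpg'},
                                            {'src': 'https://example.com/1_large.jpg'}]}},
            {'node': {'id': '2', '__typename': 'GraphVideo',
                      'display_resources': [{'src': 'https://example.com/2_small.jpg'},
                                            {'src': 'https://example.com/2_large.jpg'}]}},
        ]
    }
}

carousel_items = extract_carousel(fake_media_json)
print(','.join(item['id'] for item in carousel_items))
```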

In [None]:
columns = [
    'media_id', 
    'media_code', 
    'media_link', 
    'user_id',
    'username',
    'user_full_name',
    'type', 
    'created_time',
    'likes_count', 
    'img_thumbnail_url', 
    'img_highres_url', 
    'carousel_ids',
    'carousel_types',
    'carousel_thumbnail_urls', 
    'carousel_highres_urls', 
    'caption',
    'comments_count',
    'comments',
    'location_id',
    'location_name',
    'location_slug',
]

sample_row = [
    m.identifier,
    m.short_code,
    m.link,
    m.owner.identifier,
    m.owner.username,
    m.owner.full_name,
    m.type,
    m.created_time,
    m.likes_count,
    m.image_thumbnail_url,
    m.image_high_resolution_url,
    m.carousel_ids,
    m.carousel_types,
    m.carousel_thumbnail_urls,
    m.carousel_image_highres_urls,
    m.caption,
    m.comments_count,
    m.comments,
    m.location_id,  
    m.location_name,
    m.location_slug,
]
for k, v in zip(columns, sample_row):
    print(f'[{k}]: {v}')
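Pairing column names with values like this maps naturally onto a table row. As one possible way to store such a row, the sketch below writes it as CSV with the standard library. The shortened column list and the values are made up for the demo, and the notebook itself does not prescribe a storage format.

```python
import csv
import io

demo_columns = ['media_id', 'username', 'caption']            # shortened for the demo
demo_row = ['123', 'some_user', 'A caption, with a comma']    # made-up values

buffer = io.StringIO()  # stands in for a real file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=demo_columns)
writer.writeheader()
writer.writerow(dict(zip(demo_columns, demo_row)))

print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas or newlines, which matters for free-text columns such as captions and comments.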

So there we have it!

##### Saving Images
Now let's look into downloading the thumbnails for our display.
- We already have `get_thumbnail` to download the image
- We just need to specify where to download on our machines
- I want to download to 
    - `project_tt/data/images` for the original small images
    - `project_tt/data/thumbnails` for the miniature images

In [None]:
thumbnail = get_thumbnail(m.image_high_resolution_url)

In [None]:
show_thumbnail(thumbnail, 'Orig')
# resize the image
thumbnail_resized = imresize(thumbnail, (64, 64))
show_thumbnail(thumbnail_resized, 'Downsampled')

Let's save them!

In [None]:
import pathlib  # library that helps us manage directories and paths conveniently

# build temporary paths to save
image_path = pathlib.Path('tmp/data/images')
thumbnail_path = pathlib.Path('tmp/data/thumbnails')

# create the directories if they don't exist
image_path.mkdir(parents=True, exist_ok=True)
thumbnail_path.mkdir(parents=True, exist_ok=True)

In [None]:
from PIL import Image
imname = f'{image_path}/{m.identifier}.jpeg'
Image.fromarray(thumbnail).save(imname)

thumbnail_name = f'{thumbnail_path}/{m.identifier}.jpeg'
Image.fromarray(thumbnail_resized).save(thumbnail_name)
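The path handling above can also be collected into one helper that creates both directories and returns the two file paths for a given media ID. This is a sketch under assumptions: `build_save_paths` is a hypothetical name, and it uses a temporary directory as the root so that running it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def build_save_paths(root, media_id):
    """Create the image/thumbnail directories under `root` and return both file paths."""
    root = Path(root)
    image_dir = root / 'data' / 'images'
    thumbnail_dir = root / 'data' / 'thumbnails'
    for directory in (image_dir, thumbnail_dir):
        directory.mkdir(parents=True, exist_ok=True)
    filename = f'{media_id}.jpeg'
    return image_dir / filename, thumbnail_dir / filename


with tempfile.TemporaryDirectory() as tmp_root:
    image_file, thumbnail_file = build_save_paths(tmp_root, '123')  # made-up media ID
    dirs_exist = image_file.parent.is_dir() and thumbnail_file.parent.is_dir()

print(image_file.name, thumbnail_file.parent.name, dirs_exist)
```

`pathlib.Path` objects can be joined with `/`, which avoids building paths by string formatting.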

Check that they were saved:

In [None]:
# this command is not python! it's a terminal command run from the jupyter notebook via the `!` shell escape
!ls $image_path

In [None]:
!ls $thumbnail_path

You can load them like this:

In [None]:
loaded = Image.open(thumbnail_name)
loaded