# Instagram Scraper with GraphQL

Greetings. In this notebook I implement an Instagram scraper to extract the information I need from a specific Instagram page. I decided to implement it myself for two reasons: tackling the problem first-hand gives me a clearer understanding that I can explain in my thesis, and I need to store my data and features in a specific format so that my machine learning algorithm can work with them.

This scraper was implemented with the help of the Igscrapper source code from the realsirjoe GitHub account, which you can check here: [realsirjoe Github Account](https://github.com/realsirjoe)

### Necessary imports

In [65]:
import pandas as pd
import numpy as np
import time
import requests
import json
pd.set_option('display.max_columns', None)

The main reason for implementing this scraper is to retrieve the comments of the posts whose addresses I have in my dataset. Instagram uses this format for its post URLs:


https://instagram.com/p/some_chars_as_post_link/


I stored the post links in a field of my dataset; some examples are shown below:

In [6]:
df_posts = pd.read_excel('Data/MSc_Thesis_Dataset.xlsx')
df_posts.head()

| | index | post_link | caption | like | comment | share | save | reach | pf_visit | follows | impression | type | image-alt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | BhIjjhfFy7k | .\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا... | 58 | 1 | 0 | 0 | 7 | 0 | 0 | 7 | logo | |
| 1 | 2 | BhIjhiuF0Ip | .\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا... | 108 | 1 | 0 | 0 | 8 | 0 | 0 | 8 | other_ads | |
| 2 | 3 | BhIjehEFXZ0 | .\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا... | 56 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | logo | |
| 3 | 4 | BgGYNWABqeh | .\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا... | 30 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | other_ads | |
| 4 | 5 | BgGYMmoBFjd | .\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا... | 40 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | other_ads | |


As you can see above, the 'post_link' feature contains each post's shortcode, which we just have to append to the URL format mentioned above to view the post.
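As a small illustration of this step, the sketch below joins a value from 'post_link' with the URL format shown earlier. The helper name `post_url` and the constant `POST_URL_TEMPLATE` are hypothetical and exist only for this example.

```python
# Assumption: the URL format is the one quoted earlier in this notebook.
POST_URL_TEMPLATE = "https://instagram.com/p/{shortcode}/"


def post_url(shortcode):
    """Return the post URL for a shortcode (hypothetical helper)."""
    # str() guards against non-string cells; strip() drops stray whitespace.
    return POST_URL_TEMPLATE.format(shortcode=str(shortcode).strip())


print(post_url("BhIjjhfFy7k"))
```

On the dataframe itself, something like `df_posts['post_link'].map(post_url)` should give one URL per row.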

There are two ways to tackle the problem of scraping the comments of each Instagram post:
1. Use the selenium library: open Instagram Web, navigate to the post, and retrieve its comments from the page source.
2. Just use GraphQL and JSON :)

The better option here is GraphQL, since it is much faster and returns the data in a structured format, which makes it much easier to sort the data and extract the parts we want. The objective is to build a dataframe that contains useful information about the comments.

To use this method, we need to pass a variables dictionary to this query-hash URL:

https://www.instagram.com/graphql/query/?query_hash=97b41c52301f77ce508f55e66d17620e

If you do it correctly, a JSON response is returned that contains the comment information of the requested post.
The variables dictionary contains:
1. The shortcode (the URL fragment) of the post in question.
2. The number of comments to retrieve per request.
3. The index (cursor) of the comment after which you want to receive more, used when you fetch a post's comments over multiple requests.
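The three variables listed above can also be serialized and URL-encoded with the standard library instead of by hand. The following is a minimal sketch under that assumption; `build_comment_query_url` is a hypothetical helper, and whether Instagram currently accepts this query hash is not something the sketch checks.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

QUERY_URL = "https://www.instagram.com/graphql/query/"
# Query hash quoted in this notebook.
COMMENTS_QUERY_HASH = "97b41c52301f77ce508f55e66d17620e"


def build_comment_query_url(shortcode, first=50, after=""):
    """Build the GET URL for one page of comments (hypothetical helper)."""
    variables = {
        "shortcode": shortcode,  # which post
        "first": str(first),     # how many comments per request
        "after": after,          # cursor to continue from, "" for the first page
    }
    params = {
        "query_hash": COMMENTS_QUERY_HASH,
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    # urlencode percent-escapes the JSON so it is safe inside a query string.
    return QUERY_URL + "?" + urlencode(params)


url = build_comment_query_url("BhIjjhfFy7k")
# Round trip: decode the query string again to confirm nothing was lost.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["variables"][0]))
```

Letting `json.dumps` produce the string avoids quoting mistakes if a value ever contains a double quote.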

Before deciding which comment features we want in our dataset, we need to see what information is available to us this way.

Let's test this method with one of our entries:

In [42]:
number_of_comments_to_recieve = 50
max_id = ''
get_comment_url = 'https://www.instagram.com/graphql/query/?query_hash=97b41c52301f77ce508f55e66d17620e'

# shortcode of the post we query; defined once so the dictionary and the
# string below always refer to the same post
shortcode = str(df_posts['post_link'][16])

# we can actually define a python dictionary for it.

__variables = {
    "shortcode": shortcode,
    "first": str(number_of_comments_to_recieve),
    "after": "" if not max_id else max_id
}


# but since we have to pass it as a string via the GET method in the url,
# I find it easier to build the variables dictionary as a string in python before passing it.

variables = '&variables={"shortcode":"' + shortcode + '","first":"' + str(number_of_comments_to_recieve) + '","after":"' + max_id + '"}'

json_response = requests.get(get_comment_url + variables).json()
json_response

{'data': {'shortcode_media': {'edge_media_to_parent_comment': {'count': 3,
    'page_info': {'has_next_page': False, 'end_cursor': None},
    'edges': [{'node': {'id': '17904030238121618',
       'text': '@Bekhatereman Kash Ye nega ham be direct haton mindKhtid',
       'created_at': 1516067351,
       'did_report_as_spam': False,
       'owner': {'id': '5986010015',
        'is_verified': False,
        'profile_pic_url': 'https://scontent-bru2-1.cdninstagram.com/v/t51.2885-19/s150x150/97543380_248788369665493_8552836172030672896_n.jpg?_nc_ht=scontent-bru2-1.cdninstagram.com&_nc_ohc=LshR8AjdG-IAX_iI5F7&tp=1&oh=551cee0319dc76d5d1af64250bfbe94c&oe=5FE34193',
        'username': 'dokhtaram_adrina'},
       'viewer_has_liked': False,
       'edge_liked_by': {'count': 0},
       'edge_threaded_comments': {'count': 0,
        'page_info': {'has_next_page': False, 'end_cursor': None},
        'edges': []}}},
     {'node': {'id': '17919087844025241',
       'text': 'دوستان گرامی و عزیز ... خس

If you are familiar with JSON, you can read the reply easily: this post has 3 comments, and that total appears in the 'count' field.

Each comment is a 'node' in this JSON response, and every node has an id, the comment text, a creation date, an owner, and other useful information.
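To make the node layout concrete, here is a minimal sketch that pulls the nested fields of one node into a flat record. `summarize_comment` is a hypothetical helper, the sample node is a trimmed copy of the first comment in the response above, and reading `created_at` as a Unix timestamp in seconds (UTC) is an assumption based on its value.

```python
from datetime import datetime, timezone


def summarize_comment(node):
    """Flatten one comment node into a plain dict (hypothetical helper)."""
    return {
        "id": node["id"],
        "owner_username": node["owner"]["username"],
        "text": node["text"],
        # Assumption: created_at is seconds since the Unix epoch, in UTC.
        "created_at": datetime.fromtimestamp(node["created_at"], tz=timezone.utc),
        "likes": node["edge_liked_by"]["count"],
        "replies": node["edge_threaded_comments"]["count"],
    }


# Trimmed copy of the first node from the response shown above.
sample_node = {
    "id": "17904030238121618",
    "text": "@Bekhatereman Kash Ye nega ham be direct haton mindKhtid",
    "created_at": 1516067351,
    "owner": {"id": "5986010015", "username": "dokhtaram_adrina"},
    "edge_liked_by": {"count": 0},
    "edge_threaded_comments": {"count": 0, "edges": []},
}

summary = summarize_comment(sample_node)
print(summary["owner_username"], summary["created_at"].isoformat())
```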

In [64]:
comment_number = 0
# 'count' is the number of comments reported for the post
comments_retrieved = json_response['data']['shortcode_media']['edge_media_to_parent_comment']['count']
# inspect the first comment node
json_response['data']['shortcode_media']['edge_media_to_parent_comment']['edges'][0]['node']

{'id': '17904030238121618',
 'text': '@Bekhatereman Kash Ye nega ham be direct haton mindKhtid',
 'created_at': 1516067351,
 'did_report_as_spam': False,
 'owner': {'id': '5986010015',
  'is_verified': False,
  'profile_pic_url': 'https://scontent-bru2-1.cdninstagram.com/v/t51.2885-19/s150x150/97543380_248788369665493_8552836172030672896_n.jpg?_nc_ht=scontent-bru2-1.cdninstagram.com&_nc_ohc=LshR8AjdG-IAX_iI5F7&tp=1&oh=551cee0319dc76d5d1af64250bfbe94c&oe=5FE34193',
  'username': 'dokhtaram_adrina'},
 'viewer_has_liked': False,
 'edge_liked_by': {'count': 0},
 'edge_threaded_comments': {'count': 0,
  'page_info': {'has_next_page': False, 'end_cursor': None},
  'edges': []}}

In [61]:
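comment_number = 0
# iterate over the edges actually present in this response; the 'count' field
# is the post's total and can be larger than one page of results
for edge in json_response['data']['shortcode_media']['edge_media_to_parent_comment']['edges']:
    print(edge['node']['text'])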

@Bekhatereman Kash Ye nega ham be direct haton mindKhtid
دوستان گرامی و عزیز ... خسته نباشید و خدا قوت ... منتظر کارها و عکسهای جدید هستیم ...سپاس از شما 🌷🌷🌷🌎🙆😍👍👌
#لعنت_به_ارين_موتور_دزد
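The cell above reads a single response. For posts with more comments than one request returns, the 'page_info' block ('has_next_page', 'end_cursor') can presumably be fed back into the "after" variable. The sketch below shows that loop under this assumption; `collect_comment_nodes` and the injected `fetch_page` callable are hypothetical, and the stub pages at the end are made-up data used only to exercise the loop without touching the network.

```python
import time


def collect_comment_nodes(fetch_page, shortcode, page_size=50,
                          max_pages=20, delay_seconds=0.0):
    """Follow end_cursor until has_next_page is False (hypothetical helper).

    fetch_page(shortcode, first, after) must return the parsed JSON of one
    request, in the same shape as json_response above.
    """
    nodes = []
    cursor = ""
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        response = fetch_page(shortcode, page_size, cursor)
        comments = response["data"]["shortcode_media"]["edge_media_to_parent_comment"]
        nodes.extend(edge["node"] for edge in comments["edges"])
        page_info = comments["page_info"]
        if not page_info["has_next_page"]:
            break
        cursor = page_info["end_cursor"]
        time.sleep(delay_seconds)  # optional pause between requests
    return nodes


def _page(ids, has_next, end_cursor):
    """Build a made-up response page in the shape shown above."""
    return {"data": {"shortcode_media": {"edge_media_to_parent_comment": {
        "page_info": {"has_next_page": has_next, "end_cursor": end_cursor},
        "edges": [{"node": {"id": i}} for i in ids],
    }}}}


# Two stub pages keyed by the cursor that requests them.
stub_pages = {"": _page(["1", "2"], True, "cursor-1"),
              "cursor-1": _page(["3"], False, None)}
all_nodes = collect_comment_nodes(lambda s, first, after: stub_pages[after], "abc")
print([node["id"] for node in all_nodes])
```

In the notebook, `fetch_page` would wrap the `requests.get(...).json()` call from the earlier cell.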


In [74]:
temp_data = {'id' : [1,2],
             'owner_username' : ['a','aa'],
             'text' : ['b','bb']}
df_test = pd.DataFrame(data = temp_data)
df_test

| | id | owner_username | text |
|---|---|---|---|
| 0 | 1 | a | b |
| 1 | 2 | aa | bb |
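The test frame above has the columns 'id', 'owner_username' and 'text'. A minimal sketch of how the comment edges could be reshaped into the same dictionary-of-lists layout follows; `comments_to_columns` is a hypothetical helper, the first sample edge is taken from the response shown earlier, and the second one is a made-up placeholder.

```python
def comments_to_columns(edges):
    """Turn comment edges into a dict of equal-length lists (hypothetical helper)."""
    columns = {"id": [], "owner_username": [], "text": []}
    for edge in edges:
        node = edge["node"]
        columns["id"].append(node["id"])
        columns["owner_username"].append(node["owner"]["username"])
        columns["text"].append(node["text"])
    return columns


sample_edges = [
    # First comment from the response shown earlier (trimmed).
    {"node": {"id": "17904030238121618",
              "owner": {"username": "dokhtaram_adrina"},
              "text": "@Bekhatereman Kash Ye nega ham be direct haton mindKhtid"}},
    # Placeholder values, not real data.
    {"node": {"id": "0",
              "owner": {"username": "example_user"},
              "text": "example comment"}},
]

columns = comments_to_columns(sample_edges)
print(columns["owner_username"])
# With pandas imported as in the first cell, the same dict can be passed on:
# df_comments = pd.DataFrame(data=columns)
```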
