# Instagram Scraper with GraphQL

Greetings. In this notebook I implement an Instagram scraper to extract the information I need from a specific Instagram page. I decided to implement it myself for two reasons: tackling the problem first-hand gives me a clearer understanding that I can explain in my thesis, and I need to store my data and features in a specific layout so that my machine learning algorithm can use them directly.

This scraper was implemented with the help of the Igscrapper source code from the realsirjoe GitHub account, which you can find here: [realsirjoe Github Account](https://github.com/realsirjoe)

### Necessary imports

In [1]:
import pandas as pd
import numpy as np
import time
import requests
import json
from openpyxl import load_workbook
import http.cookiejar
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException 
import re
pd.set_option('display.max_columns', None)

The main purpose of this scraper is to retrieve the comments of the posts whose addresses are stored in my dataset. Instagram uses the following format for post URLs:


https://instagram.com/p/some_chars_as_post_link/


I stored the link of each post in a field of my dataset; some examples are shown below:

In [2]:
df_posts = pd.read_excel('Data/MSc_Thesis_Dataset.xlsx')
df_posts.head()

Unnamed: 0,index,post_link,caption,like,comment,share,save,reach,pf_visit,follows,impression,type,image-alt
0,1,BhIjjhfFy7k,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,58,1,0,0,7,0,0,7,logo,
1,2,BhIjhiuF0Ip,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,108,1,0,0,8,0,0,8,other_ads,
2,3,BhIjehEFXZ0,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,56,0,0,1,2,0,0,2,logo,
3,4,BgGYNWABqeh,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,30,0,0,0,2,0,0,2,other_ads,
4,5,BgGYMmoBFjd,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,40,0,0,0,2,1,0,2,other_ads,


As you can see above, the 'post_link' feature holds the shortcode of each post, which only needs to be appended to the URL format mentioned above to open the post.
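
As a small illustration of that step, here is a minimal sketch that turns a shortcode into a post URL. The helper name `build_post_url` is hypothetical (it is not part of the notebook), and the shortcode is the first one from the dataset preview above.

```python
def build_post_url(shortcode: str) -> str:
    """Return the URL of an Instagram post from its shortcode (hypothetical helper)."""
    return f"https://instagram.com/p/{shortcode}/"


# Shortcode taken from the first row of the dataset preview above.
post_url = build_post_url("BhIjjhfFy7k")
print(post_url)
```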

There are two ways to scrape the comments of an Instagram post:
1. Use the selenium library: open Instagram Web, navigate to the post, and retrieve its comments from the page source.
2. Just use GraphQL and JSON :)

The better way here is GraphQL, since it is much faster and returns the data in a structured format that is easier to sort and filter. The objective is to build a dataframe that contains useful information about the comments.

To use this method, we need to pass a variables dictionary to this query-hash URL:

https://www.instagram.com/graphql/query/?query_hash=97b41c52301f77ce508f55e66d17620e

If the request is correct, a JSON response is returned that contains the comment information of the requested post.
The variables dictionary contains:
1. the shortcode of the post in question.
2. the number of comments to retrieve per request.
3. the index of the comment after which to continue (used when the comments of a post are retrieved in multiple requests).
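
The three variables listed above can be sketched as a dictionary that is serialized to JSON and URL-encoded. This is only a sketch: the helper name `build_comment_query_url` is hypothetical, the endpoint and query hash are the ones given in this notebook, and whether Instagram still accepts this request is not something the example checks (it makes no network call).

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint and query hash as given in this notebook.
GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
QUERY_HASH = "97b41c52301f77ce508f55e66d17620e"


def build_comment_query_url(shortcode: str, first: int = 50, after: str = "") -> str:
    """Hypothetical helper: build the GET URL for one request of comments."""
    variables = {
        "shortcode": shortcode,  # 1. which post
        "first": str(first),     # 2. comments per request (sent as a string, as in the notebook)
        "after": after,          # 3. where to continue from ("" = start)
    }
    # json.dumps produces the JSON text; urlencode escapes it for use in a URL.
    query = urlencode({
        "query_hash": QUERY_HASH,
        "variables": json.dumps(variables, separators=(",", ":")),
    })
    return f"{GRAPHQL_URL}?{query}"


url = build_comment_query_url("Bd-qCY4lATh")
print(url)

# Round trip: decode the URL again to see which variables would be sent.
decoded = parse_qs(urlparse(url).query)
print(json.loads(decoded["variables"][0]))
```

Letting `urlencode` do the escaping avoids building the query string by hand, which matters as soon as a value contains characters such as `=` or `&`.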

Before deciding which comment features to keep in our dataset, we need to see which information is available this way.

Let's test this method with one of our entries:

In [13]:
number_of_comments_to_recieve = 50
max_id = ''
get_comment_url = 'https://www.instagram.com/graphql/query/?query_hash=97b41c52301f77ce508f55e66d17620e'

# We can define the variables as a Python dictionary
# (index 16 is the same post that is requested below).

__variables = {
    "shortcode": str(df_posts['post_link'][16]),
    "first": str(number_of_comments_to_recieve),
    "after": "" if not max_id else max_id
}


# But since the variables have to be passed as a string in the URL of a GET request,
# I find it easier to build the variables dictionary as a string before passing it.

variables = '&variables={"shortcode":"' + str(df_posts['post_link'][16]) +'","first":"' + str(number_of_comments_to_recieve) + '","after":"' + max_id + '"}'

json_response = requests.get(get_comment_url + variables).json()
json_response

{'data': {'shortcode_media': {'edge_media_to_parent_comment': {'count': 3,
    'page_info': {'has_next_page': False, 'end_cursor': None},
    'edges': [{'node': {'id': '17904030238121618',
       'text': '@Bekhatereman Kash Ye nega ham be direct haton mindKhtid',
       'created_at': 1516067351,
       'did_report_as_spam': False,
       'owner': {'id': '5986010015',
        'is_verified': False,
        'profile_pic_url': 'https://scontent-amt2-1.cdninstagram.com/v/t51.2885-19/s150x150/97543380_248788369665493_8552836172030672896_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_ohc=gMjxcSK4h7YAX8xlt50&tp=1&oh=9259ba85d7119309b603a358244b8d8d&oe=5FE34193',
        'username': 'dokhtaram_adrina'},
       'viewer_has_liked': False,
       'edge_liked_by': {'count': 0},
       'edge_threaded_comments': {'count': 0,
        'page_info': {'has_next_page': False, 'end_cursor': None},
        'edges': []}}},
     {'node': {'id': '17919087844025241',
       'text': 'دوستان گرامی و عزیز ... خس

If you are familiar with JSON, the reply is easy to read: this post has 3 comments, as shown by the 'count' field.

Each comment is a 'node' in this JSON response, and every node has an id, the comment text, a creation date, an owner, and other useful information.
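
The 'created_at' value in the response above looks like a Unix timestamp. Assuming it counts seconds since the epoch in UTC (the notebook does not confirm this), a minimal sketch of converting it to a readable date is:

```python
from datetime import datetime, timezone

# 'created_at' value copied from the first comment node in the response above.
created_at = 1516067351

# Assumption: the value is seconds since the Unix epoch, interpreted in UTC.
created_dt = datetime.fromtimestamp(created_at, tz=timezone.utc)
print(created_dt.isoformat())
```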

In [14]:
id_list = []
owner_username_list = []
text_list = []
# Iterate over the returned edges instead of range(count): 'count' is the total
# for the post and may exceed the number of edges returned in one response.
comment_edges = json_response['data']['shortcode_media']['edge_media_to_parent_comment']['edges']
for edge in comment_edges:
    id_list.append(edge['node']['id'])
    owner_username_list.append(edge['node']['owner']['username'])
    text_list.append(edge['node']['text'])
temp_data = {'id' : id_list,
             'owner_username' : owner_username_list,
             'text' : text_list,
             'post_link' : str(df_posts['post_link'][16])
            }
df_test = pd.DataFrame(data = temp_data)
df_test

Unnamed: 0,id,owner_username,text,post_link
0,17904030238121618,dokhtaram_adrina,@Bekhatereman Kash Ye nega ham be direct haton...,Bd-qCY4lATh
1,17919087844025241,lady._.designer,دوستان گرامی و عزیز ... خسته نباشید و خدا قوت ...,Bd-qCY4lATh
2,17905619464085983,mania_mind,#لعنت_به_ارين_موتور_دزد,Bd-qCY4lATh


Now we can build our dataframe. To keep the code clean, we write a function that iterates over every post in our dataset, then over every comment of that post, and saves each comment to the new dataset.

Here I create the empty lists used by the next function.

In [52]:
id_list = []
owner_username_list = []
text_list = []
post_link_list = []

In [53]:
def comment_retriever(post_link = df_posts['post_link'], start_index = 0):
    '''
    Retrieve the comments of the Instagram posts whose shortcodes are passed in.
    Only one request (up to 50 comments) is made per post.
    args:
        post_link -> shortcodes of the posts whose comments should be retrieved.
        start_index -> index of the post to start from, so a run can be resumed after a connection problem.
    return:
        nothing; the function appends to the global lists from which the dataframe is built.
    '''
    url = 'https://www.instagram.com/graphql/query/?query_hash=97b41c52301f77ce508f55e66d17620e&'
    for i in range(start_index, len(post_link)):
        code = post_link[i]
        variables = 'variables={"shortcode":"' + code + '","first":"50","after":""}'
        req_url = url + variables
        json_response = requests.get(req_url).json()
        cm_retrieved = json_response['data']['shortcode_media']['edge_media_to_parent_comment']['count']
        print(f"{i}, Response Status: {json_response['status']}, post shortcode: {code}")
        print(f"comments retrieved for this post: {cm_retrieved}")
        if cm_retrieved == 0:
            continue
        else:
            for cm in json_response['data']['shortcode_media']['edge_media_to_parent_comment']['edges']:
                id_list.append(cm['node']['id'])
                owner_username_list.append(cm['node']['owner']['username'])
                text_list.append(cm['node']['text'])
                post_link_list.append(code)
        print(f"items post {i} added to the related lists.")
    return None

In [72]:
comment_retriever(start_index = 204)

204, Response Status: ok, post shortcode: BYgQJ3YF-VB
comments retrieved for this post: 0
205, Response Status: ok, post shortcode: BYgQITdlLGm
comments retrieved for this post: 0
206, Response Status: ok, post shortcode: BYgQFmqloOv
comments retrieved for this post: 0
207, Response Status: ok, post shortcode: BYfgOT0FSA2
comments retrieved for this post: 0
208, Response Status: ok, post shortcode: BYfgKstlYil
comments retrieved for this post: 1
items post 208 added to the related lists.
209, Response Status: ok, post shortcode: BYfgEzWlAZD
comments retrieved for this post: 0
210, Response Status: ok, post shortcode: BYdth45FLRh
comments retrieved for this post: 0
211, Response Status: ok, post shortcode: BYdtd57F3Rf
comments retrieved for this post: 2
items post 211 added to the related lists.
212, Response Status: ok, post shortcode: BYdtbKXFCOr
comments retrieved for this post: 0
213, Response Status: ok, post shortcode: BYdtUBhlSnS
comments retrieved for this post: 4
items post 213
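
The function above sends a single request with `"after":""` per post, so at most 50 comments per post are collected. For posts with more comments, the 'page_info' block of the response could be followed. The sketch below assumes that 'end_cursor' is the value expected in the "after" variable (the notebook does not confirm this). `collect_all_comments` and `fetch_page` are hypothetical names, and the demonstration uses fake pages shaped like the response shown earlier, so it runs without network access.

```python
from typing import Callable, Dict, List


def collect_all_comments(fetch_page: Callable[[str], Dict], max_pages: int = 100) -> List[Dict]:
    """Follow end_cursor until has_next_page is False and return all comment nodes.

    fetch_page(after) is a hypothetical callable returning the parsed JSON of one
    request, for example a thin wrapper around requests.get(...).json().
    """
    nodes: List[Dict] = []
    after = ""  # empty value = first request, as in the notebook
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        media = fetch_page(after)["data"]["shortcode_media"]
        comments = media["edge_media_to_parent_comment"]
        nodes.extend(edge["node"] for edge in comments["edges"])
        page_info = comments["page_info"]
        if not page_info["has_next_page"]:
            break
        after = page_info["end_cursor"]
    return nodes


def _fake_page(ids, has_next, cursor):
    """Build a fake response with the same nesting as the real one."""
    return {"data": {"shortcode_media": {"edge_media_to_parent_comment": {
        "count": 3,
        "page_info": {"has_next_page": has_next, "end_cursor": cursor},
        "edges": [{"node": {"id": i}} for i in ids],
    }}}}


# Two fake pages: the first points to the second through its end_cursor.
fake_pages = {
    "": _fake_page(["1", "2"], True, "cursor-1"),
    "cursor-1": _fake_page(["3"], False, None),
}

all_nodes = collect_all_comments(lambda after: fake_pages[after])
print([node["id"] for node in all_nodes])
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, which also makes it easy to add a delay or retry inside `fetch_page` when the connection is unreliable.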

Now we save our dataframe to an Excel file.

In [81]:
df_comments = pd.DataFrame(data = {'id' : id_list,
                                 'owner_username' : owner_username_list,
                                 'text' : text_list,
                                 'post_link' : post_link_list
                                })
# Forward slash works on every platform; "data\comments.xlsx" only resolves on Windows.
df_comments.to_excel(excel_writer = "data/comments.xlsx")

To make our dataset more comprehensive, we save the newly created comments dataset as a new sheet, with a corresponding name, in our main dataset.

In [4]:
df_comment = pd.read_excel('data/comments.xlsx')
df_comment

Unnamed: 0.1,Unnamed: 0,id,owner_username,text,post_link
0,0,17925563803134203,antalya_amlak_best,برای دریافت بهترین قیمتهای املاک در آنتالیا تر...,BhIjjhfFy7k
1,1,17857688434300045,pariyanikookar,عالی بود لایک داره😍👍,BhIjhiuF0Ip
2,2,17912877673509433,lady._.designer,👍👌SUPER😍LIKE👌👍,BfqQTe9FHvR
3,3,17939183302119902,originalshow,پیج تون فوق العاده س😍,BfqQSf3lsml
4,4,17921061856066364,qn_mkc,سلام مسابقہ جدیدہ؟,BfbUhNxlzaJ
...,...,...,...,...,...
402,402,17880543517122587,artin_momy_441392,خي من قربونت برم زيبا😍😍😍😍,BXiJjg8Fpzp
403,403,17886281512077000,artin_momy_441392,💖💖💖💖,BXiJjg8Fpzp
404,404,17894445739041719,artin_momy_441392,🌹🌹🌹🌹,BXiJjg8Fpzp
405,405,17880883009126319,hedieh_hedayatzadeh,قربووووووونش بشم بانمک دوست داشتنی👌👌👌😄😄😄😉😉❤💙💚💛💜,BXiJgZuFGQf


In [9]:
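# Append the comments as a new sheet to the existing workbook.
# mode='a' keeps the existing sheets; if_sheet_exists requires pandas >= 1.3.
# (The older pattern of assigning writer.book / writer.sheets and calling
# writer.save() was removed in pandas 2.0.)
with pd.ExcelWriter('data/MSc_Thesis_Dataset.xlsx', engine='openpyxl',
                    mode='a', if_sheet_exists='replace') as writer:
    df_comment.to_excel(writer, sheet_name="Comments")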


Now our main dataset has a Comments sheet that contains our comments.

The last piece of data we need the scraper for, to complete our dataset, is the alt text of the posts.
Instagram uses image recognition on posts to describe which objects are in the picture. This description is exposed on Instagram Web in the alt attribute of the image. In this section we retrieve it from there and save it in our dataset.

Since viewing a post requires login, I found it easier to use selenium and log in through it.

In [5]:
df_posts = pd.read_excel('data/MSc_Thesis_Dataset.xlsx')
df_posts

Unnamed: 0,index,post_link,caption,like,comment,share,save,reach,pf_visit,follows,impression,type,image-alt
0,1,BhIjjhfFy7k,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,58,1,0,0,7,0,0,7,logo,
1,2,BhIjhiuF0Ip,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,108,1,0,0,8,0,0,8,other_ads,
2,3,BhIjehEFXZ0,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,56,0,0,1,2,0,0,2,logo,
3,4,BgGYNWABqeh,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,30,0,0,0,2,0,0,2,other_ads,
4,5,BgGYMmoBFjd,.\n@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتا...,40,0,0,0,2,1,0,2,other_ads,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,308,BXiJgZuFGQf,.@BeKhatereMan\n.\nمنم یک روزی یک مرد بزرگ و م...,109,2,0,1,0,0,0,0,situational,
308,309,BXiJd9YF6bY,.@BeKhatereMan\n.\nنه زمین خوردن، نه شکستن اسب...,104,0,0,0,0,0,0,0,situational,
309,310,BXfQ6Y9Fv7k,.@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتان ...,61,0,0,0,0,0,0,0,logo,
310,311,BXfQ4qKFXw_,.@BeKhatereMan\n.\nاگر شما هم نگران عزیزانتان ...,57,0,0,0,0,0,0,0,logo,


In [9]:
class InstagramBot:
    # Note: the find_element_by_* methods used below are the Selenium 3 API;
    # they were removed in Selenium 4.3 in favour of find_element(By.<...>, ...).
    def __init__(self, username: str, password: str) -> None:
        self.username = username
        self.password = password
        self.driver = webdriver.Chrome('./chromedriver.exe')
        self.base_url = 'https://www.instagram.com'

    def delay(self, t: float) -> None:
        # wait t seconds so the page has time to load
        time.sleep(t)

    def login(self) -> None:
        self.driver.get(f'{self.base_url}/accounts/login/')
        self.delay(1)
        # accepting cookie
        # self.driver.find_element_by_xpath('/html/body/div[2]/div/div/div/div[2]/button[1]').click()
        self.driver.find_element_by_name('username').send_keys(self.username)
        self.driver.find_element_by_name('password').send_keys(self.password)
        self.driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]/button').click()
        self.delay(1.5)

    def nav_post(self, post_link: str) -> None:
        self.delay(1)
        url = f'{self.base_url}/p/{post_link}/'
        self.driver.get(url)
        self.delay(1)

    def retrieve_alt(self) -> str:
        self.delay(1)
        try:
            return self.driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/div[1]/article/div[2]/div/div/div[1]/img').get_attribute('alt')
        except NoSuchElementException:
            # fallback XPath, tried when the first one matches nothing
            return self.driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/div[1]/article/div[2]/div/div[1]/div[1]/div[1]/img').get_attribute('alt')
            
        


Before inserting into the dataset, it is better to build a data structure that holds each alt text together with its post link, and then add it to the dataframe.
(A list of tuples, each a pair of a post link and its alt text. As a first step we only add the alts of posts of type 'user' and 'other_ads'.)
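
To make that structure concrete, here is a minimal sketch of a list of (post link, alt text) tuples and of splitting it back into two columns. The name `sample_alt_list` is hypothetical; its three entries are copied from the scraper output shown further down in this notebook.

```python
# Each entry pairs a post shortcode with its alt text.
sample_alt_list = [
    ("BeVQnOEFA-O", "phone"),
    ("Bd-qCY4lATh", "indoor"),
    ("Bc7s_UTlHbV", "shoes"),
]

# zip(*pairs) transposes the list of pairs into one tuple per column.
post_links, alt_texts = zip(*sample_alt_list)
columns = {"post_link": list(post_links), "alt_text": list(alt_texts)}
print(columns)
```

A dictionary of equal-length lists like `columns` is the same shape that the notebook passes to `pd.DataFrame(data = ...)` elsewhere.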

In [21]:
alt_list = []

In [29]:
ig_bot = InstagramBot(username = '', password = '')
ig_bot.login()
for post in df_posts['post_link'][(df_posts['type'] == 'user') | (df_posts['type'] == 'other_ads')]:
    ig_bot.nav_post(post)
    alt_text = ig_bot.retrieve_alt()
    if alt_text.find('Image may contain:') != -1:
        alt_text = alt_text.split('.')[1]
        temp = (post, alt_text[alt_text.find(':') + 1 :].lstrip())
        alt_list.append(temp)
        print(f'alt text: {alt_text[alt_text.find(":") + 1 :].lstrip()} added to the key: {post}')
    else:
        print(f'{post} image does not have a alt tag.')
print(alt_list)

BhIjhiuF0Ip image does not have a alt tag.
BgGYNWABqeh image does not have a alt tag.
BgGYMmoBFjd image does not have a alt tag.
BgGYLjjB2_Q image does not have a alt tag.
BfqQTe9FHvR image does not have a alt tag.
BfbUf-_lhXu image does not have a alt tag.
alt text: phone added to the key: BeVQnOEFA-O
alt text: indoor added to the key: Bd-qCY4lATh
BdqDAuCl3ZP image does not have a alt tag.
BdqC_7alK7b image does not have a alt tag.
BdqC-_iFRXp image does not have a alt tag.
Bc7tHnRFiN9 image does not have a alt tag.
Bc7tETEFn9G image does not have a alt tag.
alt text: shoes added to the key: Bc7s_UTlHbV
Bcu4S-mFqp5 image does not have a alt tag.
BbZs7GvF3J9 image does not have a alt tag.
BbKeKPGlFS2 image does not have a alt tag.
BbHzJXTFZKf image does not have a alt tag.
alt text: 1 person added to the key: BZbltprFKEU
alt text: 1 person added to the key: BZblsArllWq
alt text: 2 people added to the key: BZblqcsFySj
alt text: 1 person, closeup added to the key: BZbEyi-F9yA
alt text: 2
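
The loop above takes the second piece of `alt_text.split('.')`, which depends on there being exactly one period before the 'Image may contain:' part. A minimal sketch of a parser that does not depend on this is given below. The helper name `extract_alt_labels` is hypothetical, and the sample sentence is an invented example of the assumed format, not real scraped text.

```python
MARKER = "Image may contain:"


def extract_alt_labels(alt_text: str) -> list:
    """Return the labels listed after 'Image may contain:', or [] if the marker is absent."""
    _, found, tail = alt_text.partition(MARKER)
    if not found:
        return []
    # Split on commas, then strip surrounding spaces and a trailing period.
    labels = [label.strip(" .") for label in tail.split(",")]
    return [label for label in labels if label]


# Invented sample: the text before the marker contains an extra period.
sample = "Photo by some.user on a day. Image may contain: 1 person, closeup"
labels = extract_alt_labels(sample)
print(labels)
print(extract_alt_labels("No description here"))
```

Splitting on the marker itself rather than on periods keeps the result stable when a username or caption fragment in the alt text contains a dot.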

In the next cell we quickly save our progress to a dataframe.

In [39]:
temp_post_list = []
temp_alt_list = []
for item in alt_list:
    temp_post_list.append(item[0])
    temp_alt_list.append(item[1])

df_alt = pd.DataFrame(data = {'post_link' : temp_post_list,
                              'alt_text' : temp_alt_list})
df_alt.to_excel(excel_writer = "data/alt_user_otherads.xlsx")

A quick check of the results shows that there are some errors:

- our algorithm could not detect the alt attribute of images with tagged persons in them.
- our algorithm did not detect the alt attribute of posts that contain multiple images.