`To Do:`
- [ ] check if post type can be added as feature

---

# Instagram Like Prediction @310ai Competition

This notebook is my entry for the competition posted by @310ai on the 15th of April. I approach the competition as a project following the CRISP-DM methodology and explain the approach at every step of the way.

In short, the task of this competition is: **"given an Instagram post, predict the number of likes"**.

## Business Understanding
First things first: Instagram imposes some constraints that we have to consider. These constraints determine which features are available, and those features affect the precision of the model. We discuss them in the following section.

***Are we trying to predict the number of likes for an Instagram post of our own or not?***

This question might seem a little odd, so let me explain. Each Instagram post comes with metrics that show how the post performs among users. We will call these **"Performance Metrics"**. Some of them, such as the number of likes, the number of comments, the caption, etc., are publicly available; in other words, any Instagram user can see them.

Other performance metrics are not publicly available. To see them, we need to authenticate as the owner of the page (discussed further in this section). These private performance metrics include the number of shares, saves, reach, profile visits, follows, impressions, etc.

Obviously, if we try to predict the number of likes for a page that we don't own, we cannot access these private features. For this competition, we assume a page that we don't have access to.

Another point to keep in mind is that the post whose likes we want to predict does not exist yet, so its performance metrics can't be known precisely. In other words, how can we estimate the number of comments a hypothetical post might receive if we never actually publish it? Because of this, the performance metrics of a post are not good features for this task.

In the following sections I will address the questions of the competition with a combination of code and text. Please keep in mind that, to follow the chosen methodology, I might change the order of the questions.

## Data Requirements and Data Collection

In this section I tackle the questions mainly related to these parts of the challenge. Above we introduced some features that might affect the precision of the prediction. There are other features as well; below I point to those related to the page that published the post.

### What Features you used?

Every page on Instagram has features that distinguish it from other pages. Some are performance metrics like the ones discussed above, and some are identifiers. The identifier features are:
- `id`: a unique id allocated by Instagram.
- `username`: a unique username chosen by the user when the page was created.

We will also investigate the following features:
- `category_name`: each page is assigned a category based on its published content and some other traits, for instance Blogger, Personal Blog, Design & Fashion, Chef, etc.
- `follower`: number of followers the page has.
- `following`: number of pages the target page is following.
- `ar_effect`: whether the page has published AR effects on Instagram.
- `type_business`: whether the page identifies itself as a business account.
- `type_professional`: whether the page identifies itself as a professional account.
- `verified`: whether the page is verified.
- `reel_count`: number of IGTV videos posted by the page.
- `media_count`: number of posts published by the page.

Some features are not collected directly but can be calculated during feature engineering. Some of them are:
- `reel_avg_view`: the average number of views of the IGTV videos posted by the page.
- `reel_avg_comment`: the average number of comments on those videos.
- `reel_avg_like`: the average number of likes on those videos.
- `reel_avg_duration`: the average duration of those videos.
- `reel_frequency`: how often the page posts videos, measured as the average number of days between consecutive videos.
- `media_avg_view`: the average number of views of the media posted by the page.
- `media_avg_comment`: the average number of comments on the media.
- `media_avg_like`: the average number of likes on the media.
- `media_avg_duration`: the average duration of the media posted by the page.
- `media_frequency`: how often the page posts media, measured as the average number of days between consecutive posts.
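The derived features above can be computed from the per-post values returned for an account. The following is a minimal sketch rather than the notebook's exact code: the function name `summarize_posts` and the sample numbers are made up for illustration, and the timestamps are assumed to be Unix seconds ordered newest first.

```python
from statistics import mean

SECONDS_PER_DAY = 86_400


def summarize_posts(likes, comments, timestamps):
    """Return average likes, average comments and the mean gap between posts in days.

    `timestamps` are Unix timestamps in seconds, assumed sorted newest first.
    """
    # gaps between consecutive posts, in seconds
    gaps = [timestamps[i] - timestamps[i + 1] for i in range(len(timestamps) - 1)]
    return {
        "avg_like": mean(likes) if likes else 0.0,
        "avg_comment": mean(comments) if comments else 0.0,
        # with fewer than two posts no gap can be measured
        "frequency_days": mean(gaps) / SECONDS_PER_DAY if gaps else 0.0,
    }


# hypothetical values: three posts, published two days and then one day apart
stats = summarize_posts(
    likes=[120, 80, 100],
    comments=[10, 6, 8],
    timestamps=[1_700_259_200, 1_700_086_400, 1_700_000_000],
)
print(stats)
```

Working on raw timestamps keeps the arithmetic in plain seconds, so the conversion to days happens in a single place.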

Last but not least is the content of the image itself. There are multiple ways to turn the image content into features. For instance, we can use a classifier network to detect which objects are present in the image and pass them to the like-prediction model. Other heuristic approaches might also result in a good model, such as passing the image vector generated by the last hidden layer of a classification network as a standalone feature.

Choosing the best strategy requires experiments, such as A/B tests and trial and error. For now I will choose the fastest, heuristic strategy, described below.

For some time now, Meta has been using an object detection model to generate the Alt Text attribute of posts. Given the resources Meta has at its disposal, this model is extremely fast and reliable, since it runs on the server side. For this approach we therefore use the list of objects detected in the image as features.
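To make the idea concrete, here is a hedged sketch of turning an alt-text string into a fixed-length list of object labels. It assumes the caption contains the phrase "image of" followed by comma-separated labels with "and" before the last one; that format is an assumption, and `parse_alt_text` is a hypothetical helper name. Labels starting with "text" are collapsed to `text`, and the list is padded with `No Object` up to six entries.

```python
import re

MAX_OBJECTS = 6


def parse_alt_text(caption, max_objects=MAX_OBJECTS):
    """Extract object labels from an alt-text string.

    Assumes the labels follow the phrase "image of", separated by commas,
    with "and" before the last one. The result always has `max_objects` items.
    """
    labels = []
    if caption:
        match = re.search(r"image of (.+?)(?:\.|$)", caption)
        if match:
            # split on commas and on the whole word "and",
            # so labels such as "sandals" stay intact
            parts = re.split(r",|\band\b", match.group(1))
            labels = [part.strip() for part in parts if part.strip()]
            # collapse "text that says ..." style labels to plain "text"
            labels = ["text" if label.startswith("text") else label for label in labels]
    labels = labels[:max_objects]
    return labels + ["No Object"] * (max_objects - len(labels))


# made-up caption in the assumed format
example = "Photo by Example on March 01, 2023. May be an image of 1 person, sandals and text."
print(parse_alt_text(example))
```

Quoted text containing commas or periods would still confuse this parser, so real captions should be spot-checked before relying on it.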

## How do we collect the data?

As discussed above, there are different kinds of features, and each group can be collected with a different method.

Instagram provides an API for developers, but due to some restrictions and limitations this API cannot provide the data we need. We therefore collect the data heuristically, and there are two approaches.

The first approach, which is not very tech-friendly (:D), is to build a scraper with Selenium in Python. Selenium is a website-testing library that can also be used as a web scraper. This approach has another limitation specific to users like me: since I'm in Iran right now, access to Instagram is restricted and we have to use VPNs and other tools to bypass geo-restrictions, which add another layer of difficulty and an additional bottleneck.

The second approach, which I use here, is to call the GraphQL endpoints and receive the information in JSON format. VPNs and similar tools are still needed, but unlike Selenium this approach doesn't have to load the Instagram GUI, so it is much faster and better suited to a pipeline.

- endpoint for user information:
`https://www.instagram.com/{username}/?__a=1&__d=dis`

- endpoint for post information:
`https://www.instagram.com/p/{post_ID}/?__a=1&__d=dis`
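As an illustration of how these endpoints could be called in a pipeline, the sketch below builds both URLs and retries a request a few times before giving up. The helper names, the retry count and the wait time are assumptions, and because the status codes Instagram returns when it throttles requests are not documented here, every non-200 response is simply treated as retryable. `session` can be a `requests.Session` or any object with a compatible `get` method.

```python
import time

BASE_URL = "https://www.instagram.com"
QUERY = "?__a=1&__d=dis"


def user_url(username):
    """URL of the JSON view of a profile."""
    return f"{BASE_URL}/{username}/{QUERY}"


def post_url(shortcode):
    """URL of the JSON view of a single post."""
    return f"{BASE_URL}/p/{shortcode}/{QUERY}"


def fetch_json(session, url, max_retries=3, wait_seconds=5.0):
    """GET `url` and return the decoded JSON, or None if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        response = session.get(url, timeout=30)
        if response.status_code == 200:
            try:
                return response.json()
            except ValueError:
                # a 200 response that is not JSON (for example an HTML login page)
                pass
        # wait a little longer after each failed attempt
        time.sleep(wait_seconds * attempt)
    return None


print(user_url("instagram"))
print(post_url("CqLUeUppDYs"))
```

Returning `None` instead of raising lets the calling loop skip a problematic account and continue with the next one.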

Getting training data for the model:
- each JSON response for an account contains information about the 12 latest posts:

  - Alt text information is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['accessibility_caption']`
    - each node has a type; `GraphImage` nodes are image posts, which have alt text.
    - `GraphVideo` nodes don't have alt text.
    - `GraphSidecar` nodes are carousels and have alt text.
  - number of comments is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['edge_media_to_comment']['count']`
  - number of likes is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['edge_liked_by']['count']`
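The paths listed above can be walked once per post to build flat records. This is a small sketch using a hand-made response that only contains the keys mentioned in this notebook; the function name `extract_posts` and all sample values are hypothetical.

```python
def extract_posts(data):
    """Return one dict per post found in an account response."""
    user = data["graphql"]["user"]
    posts = []
    for edge in user["edge_owner_to_timeline_media"]["edges"]:
        node = edge["node"]
        posts.append({
            "shortcode": node["shortcode"],
            "post_type": node["__typename"],
            "username": user["username"],
            "like": node["edge_liked_by"]["count"],
            "comment": node["edge_media_to_comment"]["count"],
            # video nodes carry no alt text, so fall back to None
            "alt_text": node.get("accessibility_caption"),
        })
    return posts


# hand-made response with made-up values, shaped like the paths listed above
sample = {
    "graphql": {
        "user": {
            "username": "example_user",
            "edge_owner_to_timeline_media": {
                "edges": [
                    {"node": {
                        "shortcode": "AAA111",
                        "__typename": "GraphImage",
                        "edge_liked_by": {"count": 10},
                        "edge_media_to_comment": {"count": 2},
                        "accessibility_caption": "May be an image of 1 person and text.",
                    }},
                    {"node": {
                        "shortcode": "BBB222",
                        "__typename": "GraphVideo",
                        "edge_liked_by": {"count": 30},
                        "edge_media_to_comment": {"count": 5},
                    }},
                ]
            },
        }
    }
}

records = extract_posts(sample)
print(records)
```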


In [5]:
import requests
from datetime import datetime
import json
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import time

# reading accounts lists for gathering training data.
with open('Data/top_100_follower.txt') as f:
    lines = f.readlines()
top_100_followers = lines[0].split(',')

with open('Data/top_100_posts.txt') as f:
    lines = f.readlines()
top_100_posts = lines[0].split(',')

# since a try/except was added to the main data-collection cell, this section is probably unnecessary; double check it.
main_accounts_df = pd.DataFrame(columns=['id', 'username', 'category_name', 'follower', 'following', 'ar_effect', 'type_business', 'type_professional', 'verified', 'reel_count', 'reel_avg_view', 'reel_avg_comment', 'reel_avg_like', 'reel_avg_duration', 'reel_frequency', 'media_count', 'media_avg_comment', 'media_avg_like', 'media_frequency'])
main_posts_df = pd.DataFrame(columns=['shortcode', 'post_type', 'username', 'like', 'comment', 'object_1', 'object_2', 'object_3', 'object_4', 'object_5','object_6'])

def flatten(lst):
    """A helper function to flatten any dimensional python list to 1D one.

    Args:
        lst (list): multi dimension python list

    Returns:
        list: flattened list
    """
    rt = []
    for i in lst:
        if isinstance(i,list): rt.extend(flatten(i))
        else: rt.append(i)
    return rt

#### Logging into the Instagram account
This step is necessary for getting information about the images, since the majority of the information on Instagram is locked behind the authentication wall.

In [3]:
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
            'referer':'https://www.instagram.com/',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'TE': 'trailers'
}


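import os

current_time = int(datetime.now().timestamp())
response = requests.Session().get(link, headers=headers)
if response.ok:
    csrf = re.findall(r'csrf_token\\":\\"(.*?)\\"',response.text)[0]
    # credentials are read from the environment instead of being hard-coded in the notebook;
    # IG_USERNAME and IG_PASSWORD are placeholder variable names
    username = os.environ['IG_USERNAME']
    password = os.environ['IG_PASSWORD']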

    payload = {
        'username': username,
        'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{current_time}:{password}',
        'queryParams': {},
        'optIntoOneTap': 'false',
        'stopDeletionNonce': '',
        'trustedDeviceRecords': '{}'
    }

    # duplicate keys removed; for keys that appeared twice, the value that took effect (the last one) is kept
    login_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        "X-CSRFToken": csrf,
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.5',
        'X-Instagram-AJAX': 'c6412f1b1b7b',
        'X-IG-App-ID': '936619743392459',
        'X-ASBD-ID': '198387',
        'X-IG-WWW-Claim': '0',
        'X-Requested-With': 'XMLHttpRequest',
        'Origin': 'https://www.instagram.com',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://www.instagram.com/accounts/login/?',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }

    login_response = requests.post(login_url, data=payload, headers=login_header)
    json_data = json.loads(login_response.text)


    if json_data['status'] == 'fail':
        print(json_data['message'])

    elif json_data["authenticated"]:
        print("login successful")
        cookies = login_response.cookies
        cookie_jar = cookies.get_dict()
        csrf_token = cookie_jar['csrftoken']
        print("csrf_token: ", csrf_token)
        session_id = cookie_jar['sessionid']
        print("session_id: ", session_id)

    else:
        print("login failed ", login_response.text)
else:
    print('error')
    print(response)

login successful
csrf_token:  HDaodXrMyStbMY87xZ5vMW5bE5ladNdl
session_id:  1691538713%3AkAlqueeCHIFh7V%3A25%3AAYcm5QFb7UuZfvtNxXKZtH9Sy0WXrDIr92QFJ-VlUQ
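Since the login cell needs a username and a password, it can help to keep them out of the notebook itself. The sketch below mirrors the payload fields used in the login cell and reads the credentials from environment variables; the variable names `IG_USERNAME` and `IG_PASSWORD` and both function names are hypothetical choices, not something Instagram prescribes.

```python
import os
import time


def build_login_payload(username, password, timestamp=None):
    """Build the form fields sent to the login endpoint, mirroring the login cell."""
    if timestamp is None:
        timestamp = int(time.time())
    return {
        "username": username,
        "enc_password": f"#PWD_INSTAGRAM_BROWSER:0:{timestamp}:{password}",
        "queryParams": {},
        "optIntoOneTap": "false",
        "stopDeletionNonce": "",
        "trustedDeviceRecords": "{}",
    }


def credentials_from_env():
    """Read the credentials from the (hypothetical) IG_USERNAME / IG_PASSWORD variables."""
    try:
        return os.environ["IG_USERNAME"], os.environ["IG_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None


# placeholder credentials, only to show the shape of the payload
payload = build_login_payload("some_user", "some_password", timestamp=1_700_000_000)
print(payload["enc_password"])
```

The printed tokens in the cell output above are sensitive for the same reason, so they are best cleared before a notebook is shared.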


#### Collecting Data

In [29]:
# add read main accounts and main posts csv files here as dataframe
try:
    main_accounts_df = pd.read_csv('Data/accounts.csv')
    main_posts_df = pd.read_csv('Data/posts.csv')
except FileNotFoundError:
    main_accounts_df = pd.DataFrame(columns=['id', 'username', 'category_name', 'follower', 'following', 'ar_effect', 'type_business', 'type_professional', 'verified', 'reel_count', 'reel_avg_view', 'reel_avg_comment', 'reel_avg_like', 'reel_avg_duration', 'reel_frequency', 'media_count', 'media_avg_comment', 'media_avg_like', 'media_frequency'])
    main_posts_df = pd.DataFrame(columns=['shortcode', 'post_type', 'username', 'like', 'comment', 'object_1', 'object_2', 'object_3', 'object_4', 'object_5','object_6'])


for username in tqdm(top_100_followers):
    print(f'Getting Account Information: {username}')
    # loading account information
    session = {
            "csrf_token": csrf_token,
            "session_id": session_id
        }

    # duplicate keys removed; for keys that appeared twice, the value that took effect (the last one) is kept
    headers = {
            "x-csrftoken": session['csrf_token'],
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
            'X-Instagram-AJAX': 'c6412f1b1b7b',
            'X-IG-App-ID': '936619743392459',
            'X-ASBD-ID': '198387',
            'X-IG-WWW-Claim': '0',
            'X-Requested-With': 'XMLHttpRequest',
            'Origin': 'https://www.instagram.com',
            'DNT': '1',
            'Referer': 'https://www.instagram.com/accounts/login/?',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'TE': 'trailers'
        }

    cookies = {
            "sessionid": session['session_id'],
            "csrftoken": session['csrf_token']
        }
    url = f'https://www.instagram.com/{username}/?__a=1&__d=dis'
    res = requests.get(url, headers=headers, cookies=cookies)
    # add error handling here based on response codes, reference -> InstagramBot.py

    data = res.json()
    followers = data['graphql']['user']['edge_followed_by']['count']
    following = data['graphql']['user']['edge_follow']['count']
    ar_effect = data['graphql']['user']['has_ar_effects']
    id = data['graphql']['user']['id']
    type_business = data['graphql']['user']['is_business_account']
    type_professional = data['graphql']['user']['is_professional_account']
    category = data['graphql']['user']['category_name']
    verified = data['graphql']['user']['is_verified']
    reel_count = data['graphql']['user']['edge_felix_video_timeline']['count']
    media_count = data['graphql']['user']['edge_owner_to_timeline_media']['count']
    username = data['graphql']['user']['username']

    reel_view_list = []
    reel_like_list = []
    reel_comment_list = []
    reel_duration_list = []
    reel_timestamp_list = []

    media_like_list = []
    media_comment_list = []
    media_timestamp_list = []

    for video in data['graphql']['user']['edge_felix_video_timeline']['edges']:
        reel_view_list.append(video['node']['video_view_count'])
        reel_comment_list.append(video['node']['edge_media_to_comment']['count'])
        reel_timestamp_list.append(video['node']['taken_at_timestamp'])
        reel_like_list.append(video['node']['edge_liked_by']['count'])
        reel_duration_list.append(video['node']['video_duration'])

    for medium in data['graphql']['user']['edge_owner_to_timeline_media']['edges']:
        media_like_list.append(medium['node']['edge_liked_by']['count'])
        media_comment_list.append(medium['node']['edge_media_to_comment']['count'])
        media_timestamp_list.append(medium['node']['taken_at_timestamp'])

    
    reel_utc_list = [datetime.utcfromtimestamp(ts) for ts in reel_timestamp_list]
    media_utc_list = [datetime.utcfromtimestamp(ts) for ts in media_timestamp_list]

    reel_utc_difference_list = [reel_utc_list[i] - reel_utc_list[i+1] for i in range(len(reel_utc_list) - 1)]
    media_utc_difference_list = [media_utc_list[i] - media_utc_list[i+1] for i in range(len(media_utc_list) - 1)]

    # mean gap between consecutive items, in days (0 when fewer than two items are available)
    if reel_utc_difference_list:
        reel_frequency = np.mean(reel_utc_difference_list).total_seconds() / 86_400
    else:
        reel_frequency = 0
    if media_utc_difference_list:
        media_frequency = np.mean(media_utc_difference_list).total_seconds() / 86_400
    else:
        media_frequency = 0

    reel_view_mean = np.mean(reel_view_list)
    reel_like_mean = np.mean(reel_like_list)
    reel_comment_mean = np.mean(reel_comment_list)
    reel_duration_mean = np.mean(reel_duration_list)

    media_like_mean = np.mean(media_like_list)
    media_comment_mean = np.mean(media_comment_list)

    entry_lst = [id, username, category, followers, following, ar_effect, type_business, type_professional, verified, reel_count, reel_view_mean, reel_comment_mean, reel_like_mean, reel_duration_mean, reel_frequency, media_count, media_comment_mean, media_like_mean, media_frequency]

    account_df = pd.DataFrame([entry_lst], columns=['id', 'username', 'category_name', 'follower', 'following', 'ar_effect', 'type_business', 'type_professional', 'verified', 'reel_count', 'reel_avg_view', 'reel_avg_comment', 'reel_avg_like', 'reel_avg_duration', 'reel_frequency', 'media_count', 'media_avg_comment', 'media_avg_like', 'media_frequency'])

    # skip accounts that are already stored
    if account_df.username.isin(main_accounts_df.username).any():
        print('User information already exists, skipping...')
        continue
    else:
        print(f'Adding {username} information...')
        account_df = account_df.astype({
            'ar_effect': bool,
            'type_business': bool,
            'type_professional': bool,
            'verified': bool,
        })
        main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
    
    # adding user's posts information
    print(f'Getting Posts Information: {username}')
    # main lists structure is:
    # shortcode, post_type, username, objects
    posts_lst = []
    for post in data['graphql']['user']['edge_owner_to_timeline_media']['edges']:
        temp_lst = []
        objects = []
        temp_lst.append(post['node']['shortcode'])
        temp_lst.append(post['node']['__typename'])
        temp_lst.append(data['graphql']['user']['username'])
        temp_lst.append(post['node']['edge_liked_by']['count'])
        temp_lst.append(post['node']['edge_media_to_comment']['count'])
        if post['node']['__typename'] == 'GraphImage' or post['node']['__typename'] == 'GraphSidecar':
            # posts without alt text are skipped entirely
            if post['node']['accessibility_caption'] is None:
                continue
            # split object-detection output
            objects = post['node']['accessibility_caption'].split('.')[1]
            # terminating empty lists
            if objects:
                try:
                    # cutting objects
                    objects = objects.split('of')[1]
                    objects = objects.split('and', 1)
                    objects[0] = objects[0].split(',')
                    if 'text' in objects[1]:
                        objects[1] = 'text'
                except IndexError:
                    # captions that don't match the expected pattern are skipped as well
                    continue
                # flattening the objects list to make the dimension 1D
                objects = flatten(objects)
                # terminating leading and trailing spaces from list items
                objects = [item.strip() for item in objects]
            else:
                objects = []
        # padding the objects list, we set the limit to 6 objects
        objects += ['No Object'] * (6 - len(objects))
        if len(objects) > 6:
            objects = objects[:6]
        temp_lst.append(objects)
        posts_lst.append(flatten(temp_lst))

    # creating temporary dataframe for posts of this account
    temp_df = pd.DataFrame(posts_lst, columns=[
        'shortcode',
        'post_type',
        'username',
        'like',
        'comment',
        'object_1',
        'object_2',
        'object_3',
        'object_4',
        'object_5',
        'object_6'
    ])

    if temp_df.username.isin(main_posts_df.username).any():
        print('User post information already exists, skipping...')
        continue
    else:
        print(f'Adding {username} posts information...')
        main_posts_df = pd.concat([main_posts_df, temp_df], axis=0, join='outer')
    
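    # saving the data each time (index=False avoids an extra 'Unnamed: 0' column when the files are read back)
    main_accounts_df.to_csv('Data/accounts.csv', index=False)
    main_posts_df.to_csv('Data/posts.csv', index=False)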

    # waiting 5 sec for each  user, instagram rate limit
    print('Waiting 5 seconds...')
    time.sleep(5)

  0%|          | 0/100 [00:00<?, ?it/s]

Getting Account Information: instagram
Adding instagram information...
Getting Posts Information: instagram
Adding instagram posts information...
Waiting 5 seconds...


  1%|          | 1/100 [00:09<16:19,  9.90s/it]

Getting Account Information: cristiano
Adding cristiano information...
Getting Posts Information: cristiano
Adding cristiano posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  2%|▏         | 2/100 [00:17<14:24,  8.82s/it]

Getting Account Information: leomessi
Adding leomessi information...
Getting Posts Information: leomessi
Adding leomessi posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  3%|▎         | 3/100 [00:30<17:15, 10.67s/it]

Getting Account Information: selenagomez
Adding selenagomez information...
Getting Posts Information: selenagomez
Adding selenagomez posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  4%|▍         | 4/100 [00:40<16:12, 10.13s/it]

Getting Account Information: kyliejenner
Adding kyliejenner information...
Getting Posts Information: kyliejenner
Adding kyliejenner posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  5%|▌         | 5/100 [00:49<15:37,  9.86s/it]

Getting Account Information: therock
Adding therock information...
Getting Posts Information: therock
Adding therock posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  6%|▌         | 6/100 [01:01<16:46, 10.71s/it]

Getting Account Information: arianagrande
Adding arianagrande information...
Getting Posts Information: arianagrande
Adding arianagrande posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  7%|▋         | 7/100 [01:11<15:58, 10.30s/it]

Getting Account Information: kimkardashian
Adding kimkardashian information...
Getting Posts Information: kimkardashian
Adding kimkardashian posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  8%|▊         | 8/100 [01:21<15:55, 10.38s/it]

Getting Account Information: beyonce
Adding beyonce information...
Getting Posts Information: beyonce
Adding beyonce posts information...
Waiting 5 seconds...


  main_accounts_df = pd.concat([main_accounts_df, account_df], axis=0, join='outer')
  8%|▊         | 8/100 [01:31<17:36, 11.48s/it]


KeyboardInterrupt: 
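Because a run like the one above can be interrupted, it may be convenient to skip accounts that are already stored before sending any request for them. The sketch below is one possible way to do that with the standard library; the function names are hypothetical, and an in-memory string stands in for `Data/accounts.csv` so the example runs without files.

```python
import csv
import io


def already_collected(csv_file):
    """Return the set of usernames found in the `username` column of an open CSV file."""
    return {row["username"] for row in csv.DictReader(csv_file)}


def pending_usernames(all_usernames, done):
    """Keep the original order and drop names that are already stored."""
    cleaned = [name.strip() for name in all_usernames]
    return [name for name in cleaned if name and name not in done]


# in-memory stand-in for Data/accounts.csv; the ids are placeholders
saved = io.StringIO("id,username\n1,instagram\n2,cristiano\n")
done = already_collected(saved)

todo = pending_usernames(["instagram", "cristiano", " leomessi"], done)
print(todo)
```

Filtering up front saves one request and one waiting period per account that was already collected.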

In [8]:
session = {
            "csrf_token": csrf_token,
            "session_id": session_id
        }

# duplicate keys removed; for keys that appeared twice, the value that took effect (the last one) is kept
headers = {
            "x-csrftoken": session['csrf_token'],
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
            'X-Instagram-AJAX': 'c6412f1b1b7b',
            'X-IG-App-ID': '936619743392459',
            'X-ASBD-ID': '198387',
            'X-IG-WWW-Claim': '0',
            'X-Requested-With': 'XMLHttpRequest',
            'Origin': 'https://www.instagram.com',
            'DNT': '1',
            'Referer': 'https://www.instagram.com/accounts/login/?',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'TE': 'trailers'
        }

cookies = {
            "sessionid": session['session_id'],
            "csrftoken": session['csrf_token']
        }
url = 'https://www.instagram.com/arianagrande/?__a=1&__d=dis'
res = requests.get(url, headers=headers, cookies=cookies)
# add error handling here based on response codes, reference -> InstagramBot.py
res

<Response [200]>

In [9]:
data = res.json()

In [8]:
# this section needs to be changed to send requests and process the JSON in the response
with open('Data/account_response.json', 'r') as f:
    data = json.load(f)

In [10]:
followers = data['graphql']['user']['edge_followed_by']['count']
following = data['graphql']['user']['edge_follow']['count']
ar_effect = data['graphql']['user']['has_ar_effects']
id = data['graphql']['user']['id']
type_business = data['graphql']['user']['is_business_account']
type_professional = data['graphql']['user']['is_professional_account']
category = data['graphql']['user']['category_name']
verified = data['graphql']['user']['is_verified']
reel_count = data['graphql']['user']['edge_felix_video_timeline']['count']
media_count = data['graphql']['user']['edge_owner_to_timeline_media']['count']
username = data['graphql']['user']['username']

In [11]:
reel_view_list = []
reel_like_list = []
reel_comment_list = []
reel_duration_list = []
reel_timestamp_list = []

media_like_list = []
media_comment_list = []
media_timestamp_list = []

for video in data['graphql']['user']['edge_felix_video_timeline']['edges']:
    reel_view_list.append(video['node']['video_view_count'])
    reel_comment_list.append(video['node']['edge_media_to_comment']['count'])
    reel_timestamp_list.append(video['node']['taken_at_timestamp'])
    reel_like_list.append(video['node']['edge_liked_by']['count'])
    reel_duration_list.append(video['node']['video_duration'])

for medium in data['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    media_comment_list.append(medium['node']['edge_media_to_comment']['count'])
    media_timestamp_list.append(medium['node']['taken_at_timestamp'])
    media_like_list.append(medium['node']['edge_liked_by']['count'])



In [12]:
reel_utc_list = [datetime.utcfromtimestamp(ts) for ts in reel_timestamp_list]
media_utc_list = [datetime.utcfromtimestamp(ts) for ts in media_timestamp_list]

reel_utc_difference_list = [reel_utc_list[i] - reel_utc_list[i+1] for i in range(len(reel_utc_list) - 1)]
media_utc_difference_list = [media_utc_list[i] - media_utc_list[i+1] for i in range(len(media_utc_list) - 1)]

# mean gap between consecutive items, in days (0 when fewer than two items are available)
if reel_utc_difference_list:
    reel_frequency = np.mean(reel_utc_difference_list).total_seconds() / 86_400
else:
    reel_frequency = 0
if media_utc_difference_list:
    media_frequency = np.mean(media_utc_difference_list).total_seconds() / 86_400
else:
    media_frequency = 0

reel_view_mean = np.mean(reel_view_list)
reel_like_mean = np.mean(reel_like_list)
reel_comment_mean = np.mean(reel_comment_list)
reel_duration_mean = np.mean(reel_duration_list)

media_like_mean = np.mean(media_like_list)
media_comment_mean = np.mean(media_comment_list)

entry_lst = [id, username, category, followers, following, ar_effect, type_business, type_professional, verified, reel_count, reel_view_mean, reel_comment_mean, reel_like_mean, reel_duration_mean, reel_frequency, media_count, media_comment_mean, media_like_mean, media_frequency]

accounts_df = pd.DataFrame([entry_lst] ,columns=['id', 'username', 'category_name', 'follower', 'following', 'ar_effect', 'type_business', 'type_professional', 'verified', 'reel_count', 'reel_avg_view', 'reel_avg_comment', 'reel_avg_like', 'reel_avg_duration', 'reel_frequency', 'media_count', 'media_avg_comment', 'media_avg_like', 'media_frequency'])

In [13]:
accounts_df

| | id | username | category_name | follower | following | ar_effect | type_business | type_professional | verified | reel_count | reel_avg_view | reel_avg_comment | reel_avg_like | reel_avg_duration | reel_frequency | media_count | media_avg_comment | media_avg_like | media_frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7719696 | arianagrande | Musician | 367888326 | 600 | False | False | True | True | 1309 | 9167998.0 | 40.5 | 2158336.25 | 49.31075 | 18.352435 | 4987 | 2367.666667 | 3577108.0 | 13.360044 |


In [32]:
def flatten(lst):
    """A helper function to flatten any dimensional python list to 1D one.

    Args:
        lst (list): multi dimension python list

    Returns:
        list: flattened list
    """
    rt = []
    for i in lst:
        if isinstance(i,list): rt.extend(flatten(i))
        else: rt.append(i)
    return rt

# main lists structure is:
#   shortcode, post_type, username, objects
posts_lst = []
for post in data['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    temp_lst = []
    objects = []
    temp_lst.append(post['node']['shortcode'])
    temp_lst.append(post['node']['__typename'])
    temp_lst.append(data['graphql']['user']['username'])
    temp_lst.append(post['node']['edge_liked_by']['count'])
    temp_lst.append(post['node']['edge_media_to_comment']['count'])
    if post['node']['__typename'] == 'GraphImage' or post['node']['__typename'] == 'GraphSidecar':
        # posts without alt text are skipped entirely
        if post['node']['accessibility_caption'] is None:
            continue
        # split object-detection output
        objects = post['node']['accessibility_caption'].split('.')[1]
        # terminating empty lists
        if objects:
            try:
                # cutting objects
                objects = objects.split('of')[1]
                objects = objects.split('and', 1)
                objects[0] = objects[0].split(',')
                if 'text' in objects[1]:
                    objects[1] = 'text'
            except IndexError:
                # captions that don't match the expected pattern are skipped as well
                continue
            # flattening the objects list to make the dimension 1D
            objects = flatten(objects)
            # terminating leading and trailing spaces from list items
            objects = [item.strip() for item in objects]
        else:
            objects = []
    # padding the objects list, we set the limit to 6 objects
    objects += ['No Object'] * (6 - len(objects))
    if len(objects) > 6:
        objects = objects[:6]
    temp_lst.append(objects)
    posts_lst.append(flatten(temp_lst))

# creating temporary dataframe for posts of this account
temp_df = pd.DataFrame(posts_lst, columns=[
    'shortcode',
    'post_type',
    'username',
    'like',
    'comment',
    'object_1',
    'object_2',
    'object_3',
    'object_4',
    'object_5',
    'object_6'
])

In [33]:
temp_df

| | shortcode | post_type | username | like | comment | object_1 | object_2 | object_3 | object_4 | object_5 | object_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CqLUeUppDYs | GraphSidecar | beyonce | 3180843 | 25102 | 2 people | makeup | people kissing | suit | overcoat | dinner jacket |
| 1 | CqLUQ_upwT0 | GraphImage | beyonce | 1539359 | 18315 | 1 person | magazine | text | No Object | No Object | No Object |
| 2 | Cp3vzkhp1Ug | GraphSidecar | beyonce | 4515558 | 52969 | 1 person | makeup | dress | No Object | No Object | No Object |
| 3 | CoZ98CJDHRZ | GraphVideo | beyonce | 4753985 | 88044 | No Object | No Object | No Object | No Object | No Object | No Object |
| 4 | CoZJBmpuIlD | GraphSidecar | beyonce | 3586900 | 27435 | miniskirt | drawstring | top | No Object | No Object | No Object |
| 5 | CoTojVarwg_ | GraphSidecar | beyonce | 2860019 | 31583 | one or more people | makeup | No Object | No Object | No Object | No Object |
| 6 | CoRHCpauzUG | GraphImage | beyonce | 2144468 | 22590 | one or more people | makeup | dress | miniskirt | No Object | No Object |
| 7 | CoHxOQhrTHX | GraphImage | beyonce | 8670783 | 225761 | costume | tinfoil | fishnet stockings | headdress | No Object | No Object |
| 8 | CkhUy2bL0_9 | GraphImage | beyonce | 5640978 | 76182 | No Object | No Object | No Object | No Object | No Object | No Object |
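The `object_1` to `object_6` columns hold free-text labels, which most models cannot consume directly. One possible encoding, shown here only as a sketch and not as the method used later in this notebook, is a multi-hot vector with one 0/1 column per distinct label. The function name `multi_hot` is hypothetical; the two sample rows are taken from the table above.

```python
def multi_hot(rows, ignore="No Object"):
    """Turn lists of object labels into 0/1 indicator rows.

    Returns (vocabulary, matrix): vocabulary is the sorted list of distinct
    labels, and matrix[i][j] is 1 when row i contains vocabulary[j].
    """
    vocabulary = sorted({label for row in rows for label in row if label != ignore})
    index = {label: j for j, label in enumerate(vocabulary)}
    matrix = []
    for row in rows:
        encoded = [0] * len(vocabulary)
        for label in row:
            if label != ignore:
                encoded[index[label]] = 1
        matrix.append(encoded)
    return vocabulary, matrix


rows = [
    ["1 person", "magazine", "text", "No Object", "No Object", "No Object"],
    ["1 person", "makeup", "dress", "No Object", "No Object", "No Object"],
]
vocabulary, matrix = multi_hot(rows)
print(vocabulary)
print(matrix)
```

Unlike six positional columns, this representation does not depend on the order in which the objects were listed in the alt text.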
