Useful link for progressing:

- [pre-trained models for beginners](https://learnopencv.com/pytorch-for-beginners-image-classification-using-pre-trained-models/)
- <strong> [PyTorch Pretrained EfficientNet Model Image Classification](https://debuggercafe.com/pytorch-pretrained-efficientnet-model-image-classification/)</strong>
- [PyTorch image classification with pre-trained networks](https://pyimagesearch.com/2021/07/26/pytorch-image-classification-with-pre-trained-networks/) 
- [How to modify a pretrained model](https://discuss.pytorch.org/t/how-to-modify-a-pretrained-model/60509)

---


# Instagram Like Prediction @310ai Competition - Data Understanding, Collection & Preparation

This notebook is for the competition posted by the @310ai on 15th of April. I will approach the competition as a project following the CRISP-DM methodology and try to explain the approach in every steps of the way.

The main and short summary of this competition is **"given an Instagram post predict the number of likes"**.

## Business Understanding
First thing first, there are some important points that we have to consider which are forced by the Instagram. This points will result in some features that are effective in percision of the model. In the following section we will discuss them further.

***Are we try to predict the number of likes for an Instagram post of our own or not?***

This question might seem a little odd, but let me explain it. Each Instagram post consists of some metrics that show the performance of the post among the users. We will call these **"Performance Metrics"**. Some of these performance metrics such as amount of like, amount of columns, caption and etc, are publicly availble, in other words, any user on the Instagram can see them.

But some of the performance metrics, are not publicly available, in order to see them, we need to authenticate as the owner of the page (will discuss about this part further in this section.), some of these private performance metrics are, amount of share, amount of save, amount of reach, amount of profile visits, amount of follows, amount of impression and etc.

Obviously, if we try to predict the amount of like for a page that we don't own, we can not access these features, we will go for a page that we don't have access to it for this competition.

Another to have in mind is that, since the post we are going to predict the amount of like for it, is not actually existing, the amount of performance metrics can't be predicted preciesly. In other words, how we can estimate the amount of comments a hypothetical post might recieve if we don't post it actually. Due to this abstraction, the performance metrics for each post is not a good feature for this deed.

In the further section I will try to address the questions of the competitions in combination of code and text. Please have in mind to follow the chosen methodology I might change the order of questions.

## Data Requirements and Data Collection

In this section I will tackle the questions mainly related to these parts of the challenge. As we discussed above some useful features introduced that might have effect on the precision of the prediction. But there are some other features, further I will point to some features that are related to the page of the published post.

### What Features you used?

Each and every page on the Instagram has some features that will distinguish it from other pages, some of these features are like the features discussed above, performance metrics, and some of them are identifiers. Some of the identifiers features are:
- `id`: a unique id that is allocated by the Instagram.
- `username`: a unique username that each user when created the page chose.

Also there are some other features that we will investigate, these features are:
- `category_name`: each page based on the published content and some other traits, are categorized into different categories, for instance, Blogger, Personal Blog, Design & Fashion, chef and etc.
- `follower`: amount of followers the page has.
- `following`: amount of pages that the target page is following.
- `ar_effect`: whether the page has published ar effects in the Instagram or not.
- `type_business`: whether the page identified itself as business account or not.
- `type_professional`: whether the page identified itself as professional account or not.
- `verified`: whether the page is verified or not.
- `reel_count`: amount of igtvs posted by the page.
- `media_count`: amount of posts, posted by the page.

There are some features that are collected organically but can be calculated in the process of feature engineering. Some of them are:
- `reel_view`: The average view of igtvs posted by the page.
- `reel_comment`: The average of comments igtvs acquired.
- `reel_like`: The average of likes igtvs got.
- `reel_duration`: The average of igtv's duration posted by the page.
- `reel_frequency`: How often the page have posted the reels.
- `media_avg_view`: The average view of media posted by the page.
- `media_avg_comment`: The average of comments media acquired.
- `media_avg_like`: The average of likes media got.
- `media_avg_duration`: The average of media's duration posted by the page.
- `media_frequency`: How often the page have posted the media.

Last but not least, is the content of the image itself. There are multiple ways to have the content of the image as feature. For instance we can have a classifier network to detect what objects are present in the image and pass them to the like predictor model.

As you are aware, choosing the best strategy requires some tests, such as A/B tests and trial and error ones, for now I will chose the strategy which will be discussed further that is fastest and heuristic.

I will use a heuristic approach regarding the image content, I will use a pre-trained image classifier, `[NAME OF ARCHECTURE USED]`, but remove the last layer and pass an image vector created by network as a feature to a classifier.

### How do we collect the data?

As we discussed above, there are different kind of features, and each group can be collected via different methods.

The Instagram provides an API for developers, but due some restrictions and limitations, this API can not provide us the data that we seek. Based on this facts, we will use a heuristic way to collect the data. There will be 2 approaches regarding the matter. one approach which is not very tech-friendly (:D) is to create a scrapper with Selenium page in python to scrap the information we need. Selenium is a website testing library in python that can also be utilized into a webscrapper. This approach has another limitation excluse for users like me, since I'm in Iran right now, access to the Instagram is restricted and we have to use VPNs and geo-restriction bypasses, these tools add another layer of challenge and additional bottleneck. Another approach that I try to utilize, is to use the graphql endpoints to recieve the information needed in JSON format. Eventhough still use of VPNs and similar tools is needed in this approach, but unlike the Selenium this approach doesn't require to load the GUI of Instagram, its much more faster and eligble in a pipeline.

- end point for user information:
`https://www.instagram.com/{username}/?__a=1&__d=dis
`

- end point for post information:
`https://www.instagram.com/p/{post_ID}/?__a=1&__d=dis
`

getting training data for the model:
- each json response of an account gives 12 latest post
information:

  - Alt text information is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['accessibility_caption']`
    - each node has type, `GraphImage` is posts which have alt text.
    - `GraphVideo` doesn't have alt text.
    - `GraphSideCar` is carousel and have alt text.
  - number of comments is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['edge_media_to_comment']['count']`
  - number of likes is here: `data['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['edge_liked_by']['count']`


In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
from datetime import datetime
import json
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import time
from os import path, listdir
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
from torchvision import datasets, models, transforms
import cv2
from rich.console import Console
from rich.theme import Theme

ramin_theme = Theme({
    'success': 'italic bright_green',
    'error': 'bold red',
    'progress': 'italic yellow',
    'header': 'bold cyan',
})
console = Console(theme=ramin_theme)


# reading credentials for loging into the instagram account
with open('credentials.json') as f:
    creds = json.load(f)
    login_username = creds['username']
    login_password = creds['password']

# reading accounts lists for gathering training data.
with open('Data/top_100_follower.txt') as f:
    lines = f.readlines()
top_100_followers = lines[0].split(',')

with open('Data/top_100_posts.txt') as f:
    lines = f.readlines()
top_100_posts = lines[0].split(',')

# Loading the pretrained model for object classification
efficient_net = models.efficientnet_b7(pretrained=True)
efficient_net.eval()

# Preparing a standard transformation and categories of ImageNet
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[.485, .456, .406],
                         std=[.229, .224, .225])
])
image_directory = 'Data/Images'

# reading ImageNet Classes
with open('Data/ilsvrc2012_wordnet_lemmas.txt', 'r') as f:
    categories = [s.strip() for s in f.readlines()]

def flatten(lst):
    """A helper function to flatten any dimensional python list to 1D one.

    Args:
        lst (list): multi dimension python list

    Returns:
        list: flattened list
    """
    rt = []
    for i in lst:
        if isinstance(i,list): rt.extend(flatten(i))
        else: rt.append(i)
    return rt



#### Logging into the Instagram account
This step is necesary for getting information of the images, since majority of information in Instagram are locked behind the authentication wall.

In [6]:
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
            'referer':'https://www.instagram.com/',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'TE': 'trailers'
}


current_time = int(datetime.now().timestamp())
response = requests.Session().get(link, headers=headers)
if response.ok:
    csrf = re.findall(r'csrf_token\\":\\"(.*?)\\"',response.text)[0]
    username = login_username
    password = login_password

    payload = {
        'username': username,
        'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{current_time}:{password}',
        'queryParams': {},
        'optIntoOneTap': 'false',
        'stopDeletionNonce': '',
        'trustedDeviceRecords': '{}'
    }

    login_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "X-CSRFToken": csrf,
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.5',
        'X-Instagram-AJAX': 'c6412f1b1b7b',
        'X-IG-App-ID': '936619743392459',
        'X-ASBD-ID': '198387',
        'X-IG-WWW-Claim': '0',
        'X-Requested-With': 'XMLHttpRequest',
        'Origin': 'https://www.instagram.com',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://www.instagram.com/accounts/login/?',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }

    login_response = requests.post(login_url, data=payload, headers=login_header)
    json_data = json.loads(login_response.text)


    if json_data['status'] == 'fail':
        print(json_data['message'])

    elif json_data["authenticated"]:
        print("login successful")
        cookies = login_response.cookies
        cookie_jar = cookies.get_dict()
        csrf_token = cookie_jar['csrftoken']
        print("csrf_token: ", csrf_token)
        session_id = cookie_jar['sessionid']
        print("session_id: ", session_id)

    else:
        print("login failed ", login_response.text)
else:
    print('error')
    print(response)

login successful
csrf_token:  salaaWSXe0gWdNp2LG6qpqPshVKyQK4D
session_id:  1691538713%3AplARLZpi4yxjyu%3A28%3AAYdMiR7cw2ptRbaz5YwahmwnqrCCqFFTNbttzzs1MA


#### Collecting Data

The below cell is the main cell for collecting the data from the Instagram, since this code block is the longest block in the workspace, it's worth to discuss the parts of its structure. Please have in mind the best design pattern for this kind of task, is to create pipeline, but since this is a competition and understanding a pipeline might be difficult for reviewrs, I stick with this approach regarding the matter.

First thing, I have to check whether the data is present or not, if the **accounts** and **posts** dataset are present I'm reading them, otherwise I'm creating empty dataframes for each one of them with their corresponding features. I have to read the names of the accounts I want to get their information, for the training of this model, I have selected the top 100 pages with the most followers and top 100 pages with the most published posts. I call these **accounts dataset**.

For each username in the accounts dataset, I do these procedures:
1. I check whether I had acquired that account information or no, if I had, skip that account and go to the next account.
2. Then I send a request containing appropriate headers and previously acquired cookies from logining into the Instagram, to recieve account information. I sanity check the response to validate whether we have got the correct response or it's faulty (i.e. empty response, page got private, etc.).
3. Previously Discussed features then are extracted from the response json and saved into their coresponding variables or lists, some of these features have to be calculated, for instance, media & reel frequency, view, like, comment, duration average and etc. These features are calculated and saved into their correspoding variables.
4. I create a temporary dataframe for each account and add it to the main accounts dataframe.
5. Almost the same procedure is done for the posts information.
6. In the end, we will have 5 seconds delay between each username process, to honoring the rate limit of the Instagram.

In [10]:
try:
    main_df = pd.read_csv('Data/main v2.0.csv')
    main_df.drop(columns=['Unnamed: 0'], inplace=True)
except:
    main_df = pd.DataFrame(columns=['id', 'username', 'shortcode', 'post_type', 'like', 'comment', 'object', 'category_name', 'follower', 'following', 'ar_effect', 'type_business', 'type_professional', 'verified', 'reel_count', 'reel_avg_view', 'reel_avg_comment', 'reel_avg_like', 'reel_avg_duration', 'reel_frequency', 'media_count', 'media_avg_comment', 'media_avg_like', 'media_frequency'])


for username in tqdm(top_100_followers + top_100_posts):
    console.print(f'Getting Account Information: [cyan]{username}[/]',)
    if main_df['username'].str.contains(f'{username}').any():
        console.print('\tUser information already exist, skipping...', style='error')
        continue

    # loading account information
    session = {
            "csrf_token": csrf_token,
            "session_id": session_id
        }

    headers = {
            "x-csrftoken": session['csrf_token'],
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.instagram.com/accounts/login/",
            'Accept': '*/*',
            'Accept-Language': 'en-US,en;q=0.5',
            'X-Instagram-AJAX': 'c6412f1b1b7b',
            'X-IG-App-ID': '936619743392459',
            'X-ASBD-ID': '198387',
            'X-IG-WWW-Claim': '0',
            'X-Requested-With': 'XMLHttpRequest',
            'Origin': 'https://www.instagram.com',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Referer': 'https://www.instagram.com/accounts/login/?',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'TE': 'trailers'
        }

    cookies = {
            "sessionid": session['session_id'],
            "csrftoken": session['csrf_token']
        }
    url = f'https://www.instagram.com/{username}/?__a=1&__d=dis'
    res = requests.get(url, headers=headers, cookies=cookies)
    # add error handling here based on response codes, reference -> InstagramBot.py

    try:
        data = res.json()
    except:
        console.print('Something went wrong. Skipping...', style='error')
    if not data:
        console.print(f'\tResponse is empty for {username} skipping...', style='error')
        continue
    followers = data['graphql']['user']['edge_followed_by']['count']
    following = data['graphql']['user']['edge_follow']['count']
    ar_effect = data['graphql']['user']['has_ar_effects']
    id = data['graphql']['user']['id']
    type_business = data['graphql']['user']['is_business_account']
    type_professional = data['graphql']['user']['is_professional_account']
    category = data['graphql']['user']['category_name']
    verified = data['graphql']['user']['is_verified']
    reel_count = data['graphql']['user']['edge_felix_video_timeline']['count']
    media_count = data['graphql']['user']['edge_owner_to_timeline_media']['count']
    username = data['graphql']['user']['username']
    media = data['graphql']['user']['edge_owner_to_timeline_media']['edges']

    reel_view_list = []
    reel_like_list = []
    reel_comment_list = []
    reel_duration_list = []
    reel_timestamp_list = []

    media_like_list = []
    media_comment_list = []
    media_timestamp_list = []

    for video in data['graphql']['user']['edge_felix_video_timeline']['edges']:
        reel_view_list.append(video['node']['video_view_count'])
        reel_comment_list.append(video['node']['edge_media_to_comment']['count'])
        reel_timestamp_list.append(video['node']['taken_at_timestamp'])
        reel_like_list.append(video['node']['edge_liked_by']['count'])
        reel_duration_list.append(video['node']['video_duration'])
    
    # sometimes instagram result for video duration is None, this is sanity check
    reel_duration_list = [0 if duration is None else duration for duration in reel_duration_list]

    for medium in media:
        media_like_list.append(medium['node']['edge_liked_by']['count'])
        media_comment_list.append(medium['node']['edge_media_to_comment']['count'])
        media_timestamp_list.append(medium['node']['taken_at_timestamp'])
    
    reel_utc_list = [datetime.utcfromtimestamp(ts) for ts in reel_timestamp_list]
    media_utc_list = [datetime.utcfromtimestamp(ts) for ts in media_timestamp_list]

    reel_utc_difference_list = [reel_utc_list[i] - reel_utc_list[i+1] for i in range(len(reel_utc_list) - 1)]
    media_utc_difference_list = [media_utc_list[i] - media_utc_list[i+1] for i in range(len(media_utc_list) - 1)]

    if reel_count > 1:
        reel_frequency = np.mean(reel_utc_difference_list).days + (np.mean(reel_utc_difference_list).seconds / 86_400) + (np.mean(reel_utc_difference_list).microseconds / 1_000_000 / 84_600)
    else:
        reel_frequency = 0
    media_frequency = np.mean(media_utc_difference_list).days + (np.mean(media_utc_difference_list).seconds / 86_400) + (np.mean(media_utc_difference_list).microseconds / 1_000_000 / 84_600)

    reel_view_mean = np.mean(reel_view_list)
    reel_like_mean = np.mean(reel_like_list)
    reel_comment_mean = np.mean(reel_comment_list)
    reel_duration_mean = np.mean(reel_duration_list)

    media_like_mean = np.mean(media_like_list)
    media_comment_mean = np.mean(media_comment_list)

    for medium in media:
        shortcode = medium['node']['shortcode']
        media_type = medium['node']['__typename']
        media_display_url = medium['node']['display_url']
        media_like = medium['node']['edge_liked_by']['count']
        media_comment = medium['node']['edge_media_to_comment']['count']
        
        entry_lst = [id, username, shortcode, media_type, media_like, media_comment, None, category, followers, following, ar_effect, type_business, type_professional, verified, reel_count, reel_view_mean, reel_comment_mean, reel_like_mean, reel_duration_mean, reel_frequency, media_count, media_comment_mean, media_like_mean, media_frequency]
        main_df.loc[len(main_df)] = entry_lst
        main_df = main_df.astype({
            'ar_effect': bool,
            'type_business': bool,
            'type_professional': bool,
            'verified': bool,
        })
        if media_type == 'GraphImage' or media_type == 'GraphSidecar':
            if path.isfile(f'Data/Images/{shortcode}.jpg'):
                console.print('\tImage already exists, Skipping...', style='error')
                continue
            console.print(f'\tDownloading: {shortcode}', style='progress')
            res = requests.get(media_display_url)
            with open(f'Data/Images/{shortcode}.jpg', 'wb') as f:
                f.write(res.content)
            console.print('\tSaved!', style='success')
        main_df.to_csv('Data/main v2.0.csv')

  0%|          | 0/200 [00:00<?, ?it/s]

  5%|▌         | 10/200 [00:00<00:01, 95.93it/s]

 10%|█         | 20/200 [00:00<00:02, 82.68it/s]

 16%|█▌        | 31/200 [00:00<00:01, 90.64it/s]

 20%|██        | 41/200 [00:00<00:01, 88.68it/s]

 25%|██▌       | 50/200 [00:00<00:01, 88.75it/s]

 30%|███       | 60/200 [00:00<00:01, 90.82it/s]

 36%|███▌      | 71/200 [00:00<00:01, 96.67it/s]

 40%|████      | 81/200 [00:00<00:01, 89.56it/s]

 46%|████▌     | 91/200 [00:01<00:01, 82.49it/s]

 50%|█████     | 101/200 [00:01<00:01, 85.41it/s]

 56%|█████▌    | 112/200 [00:01<00:00, 91.17it/s]

 62%|██████▏   | 124/200 [00:01<00:00, 98.44it/s]

 68%|██████▊   | 135/200 [00:10<00:16,  3.93it/s]

 72%|███████▏  | 143/200 [00:18<00:24,  2.29it/s]

 78%|███████▊  | 156/200 [00:18<00:12,  3.54it/s]

 84%|████████▍ | 169/200 [00:18<00:05,  5.27it/s]

 89%|████████▉ | 178/200 [00:18<00:03,  6.89it/s]

 89%|████████▉ | 178/200 [00:30<00:03,  6.89it/s]

 90%|█████████ | 181/200 [00:34<00:13,  1.44it/s]

 91%|█████████ | 182/200 [00:56<00:31,  1.74s/it]

 92%|█████████▏| 183/200 [01:08<00:40,  2.39s/it]

 92%|█████████▏| 184/200 [01:22<00:54,  3.39s/it]

 92%|█████████▎| 185/200 [01:40<01:13,  4.92s/it]

 93%|█████████▎| 186/200 [02:00<01:36,  6.89s/it]

 94%|█████████▎| 187/200 [02:12<01:41,  7.80s/it]

 94%|█████████▍| 188/200 [02:17<01:26,  7.19s/it]

 94%|█████████▍| 188/200 [02:30<01:26,  7.19s/it]

 94%|█████████▍| 189/200 [02:35<01:44,  9.48s/it]

 95%|█████████▌| 190/200 [02:44<01:33,  9.38s/it]

 95%|█████████▌| 190/200 [03:00<01:33,  9.38s/it]

 96%|█████████▌| 191/200 [03:04<01:49, 12.11s/it]

 96%|█████████▌| 192/200 [03:22<01:48, 13.53s/it]

 96%|█████████▋| 193/200 [03:28<01:21, 11.57s/it]

 96%|█████████▋| 193/200 [03:40<01:21, 11.57s/it]

 97%|█████████▋| 194/200 [03:44<01:17, 12.84s/it]

 98%|█████████▊| 196/200 [03:54<00:37,  9.32s/it]

 98%|█████████▊| 196/200 [04:10<00:37,  9.32s/it]

 98%|█████████▊| 197/200 [05:06<01:12, 24.32s/it]

 99%|█████████▉| 198/200 [05:36<00:51, 25.82s/it]

100%|█████████▉| 199/200 [06:02<00:25, 25.99s/it]

100%|██████████| 200/200 [06:05<00:00,  1.83s/it]


In [11]:
main_df

Unnamed: 0,id,username,shortcode,post_type,like,comment,object,category_name,follower,following,...,reel_count,reel_avg_view,reel_avg_comment,reel_avg_like,reel_avg_duration,reel_frequency,media_count,media_avg_comment,media_avg_like,media_frequency
0,25025320,instagram,Cr_OviBJPrw,GraphSidecar,282344,6691,,Digital creator,632718714,59,...,1256,9409092.0,13528.916667,570225.416667,92.466333,18.554887,7404,8653.666667,449491.75,1.454414
1,25025320,instagram,Cr3hAoSNLVt,GraphVideo,1303777,17830,,Digital creator,632718714,59,...,1256,9409092.0,13528.916667,570225.416667,92.466333,18.554887,7404,8653.666667,449491.75,1.454414
2,25025320,instagram,Cr08AgFLRPW,GraphSidecar,348204,6513,,Digital creator,632718714,59,...,1256,9409092.0,13528.916667,570225.416667,92.466333,18.554887,7404,8653.666667,449491.75,1.454414
3,25025320,instagram,CryV45urd0B,GraphImage,267553,6693,,Digital creator,632718714,59,...,1256,9409092.0,13528.916667,570225.416667,92.466333,18.554887,7404,8653.666667,449491.75,1.454414
4,25025320,instagram,CrvsWzwsC-D,GraphVideo,297803,6631,,Digital creator,632718714,59,...,1256,9409092.0,13528.916667,570225.416667,92.466333,18.554887,7404,8653.666667,449491.75,1.454414
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2308,1818816886,indonesiabertauhidofficial,CsAql1ltWaG,GraphVideo,1015,0,,,1552241,76,...,5757,11642.0,2.166667,895.583333,194.896583,19.361267,27861,2.416667,1279.00,-2.237711
2309,1818816886,indonesiabertauhidofficial,Cr97NRqvv-P,GraphImage,659,0,,,1552241,76,...,5757,11642.0,2.166667,895.583333,194.896583,19.361267,27861,2.416667,1279.00,-2.237711
2310,1818816886,indonesiabertauhidofficial,Cr53VzVuJmT,GraphVideo,542,1,,,1552241,76,...,5757,11642.0,2.166667,895.583333,194.896583,19.361267,27861,2.416667,1279.00,-2.237711
2311,1818816886,indonesiabertauhidofficial,Cr5oaqAveJZ,GraphImage,1989,4,,,1552241,76,...,5757,11642.0,2.166667,895.583333,194.896583,19.361267,27861,2.416667,1279.00,-2.237711


Now with the image files, I need to prepare a neural network capable of outputting the vector of classified images. Since the data is not rich enough to train one from scratch, I have to use transfer learning and finetunning.

Saving image addresses in list, and reading ImageNet Classes

In [55]:
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[.485, .456, .406],
                         std=[.229, .224, .225])
])
image_directory = 'Data/Images'

# reading ImageNet Classes
with open('Data/ilsvrc2012_wordnet_lemmas.txt', 'r') as f:
    categories = [s.strip() for s in f.readlines()]

Iterating all the images and classifying them.

In [18]:
image_directory = 'Data/Images'

try:
    image_object_df = pd.read_csv('Data/images_object.csv')
    image_object_df.drop(columns=['Unnamed: 0'], inplace=True)
except:
    image_object_df = pd.DataFrame(columns=['shortcode','object'])

for image_filename in tqdm(listdir(image_directory)):
    if image_object_df['shortcode'].str.contains(f'{image_filename.split(".")[0]}').any():
        # picture already classified, skipping
        continue
    # loading image
    image_address = f'{image_directory}/{image_filename}'
    image = cv2.imread(image_address)
    # preprocessing image to be suitable to feed to the network
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    input_tensor = transform(image)
    input_batch = input_tensor.unsqueeze(0)
    with torch.no_grad():
        output = efficient_net(input_batch)
    detected_object = torch.nn.functional.softmax(output[0], dim=0)
    prob, cat = torch.topk(detected_object, 1)
    image_object_df = image_object_df.append({
        'shortcode': image_filename.split('.')[0],
        'object': categories[cat[0]]
    }, ignore_index=True)
image_object_df.to_csv('Data/images_object.csv')

100%|██████████| 2191/2191 [14:45<00:00,  2.47it/s]


Now that we have populated images object, dataframe, we can add the records to the main dataset.

In [24]:
main_df = pd.merge(main_df, image_object_df, on='shortcode', how='outer')
main_df.to_csv('Data/main v2.0.csv')

TODO:

- some pictures in the main dataset dont have their other information present, check the issue and fix that. it might be from saving the datasets when we are getting account information.

## Data Preparation

In the next stage of the CRISP-DM methodology, we have to clean our data for the training phase. Please have in mind that since the insight generation is not part of the competition, we will not undergo an EDA analysis, but an EDA analysis is highly suggested at this stage for any kind of endeavor.

In [25]:
main_df = pd.read_csv('Data/main v2.0.csv')
main_df.drop(columns=['Unnamed: 0'], inplace=True)

In [26]:
main_df

Unnamed: 0,id,username,shortcode,post_type,like,comment,category_name,follower,following,ar_effect,...,reel_avg_view,reel_avg_comment,reel_avg_like,reel_avg_duration,reel_frequency,media_count,media_avg_comment,media_avg_like,media_frequency,object
0,25025320.0,instagram,Cr_OviBJPrw,GraphSidecar,282344.0,6691.0,Digital creator,632718714.0,59.0,True,...,9409092.0,13528.91667,570225.4167,92.466333,18.554887,7404.0,8653.666667,449491.75,1.454414,"feather_boa, boa"
1,25025320.0,instagram,Cr08AgFLRPW,GraphSidecar,348204.0,6513.0,Digital creator,632718714.0,59.0,True,...,9409092.0,13528.91667,570225.4167,92.466333,18.554887,7404.0,8653.666667,449491.75,1.454414,"teddy, teddy_bear"
2,25025320.0,instagram,CryV45urd0B,GraphImage,267553.0,6693.0,Digital creator,632718714.0,59.0,True,...,9409092.0,13528.91667,570225.4167,92.466333,18.554887,7404.0,8653.666667,449491.75,1.454414,wig
3,25025320.0,instagram,CruRdGGsMq6,GraphSidecar,590609.0,8893.0,Digital creator,632718714.0,59.0,True,...,9409092.0,13528.91667,570225.4167,92.466333,18.554887,7404.0,8653.666667,449491.75,1.454414,mask
4,25025320.0,instagram,CrgVBtHr3DP,GraphSidecar,395844.0,8636.0,Digital creator,632718714.0,59.0,True,...,9409092.0,13528.91667,570225.4167,92.466333,18.554887,7404.0,8653.666667,449491.75,1.454414,neck_brace
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2429,,,CrZsxd-Prrb,,,,,,,,...,,,,,,,,,,"motor_scooter, scooter"
2430,,,CrZVF0kvixB,,,,,,,,...,,,,,,,,,,sarong
2431,,,CrZwRBorSpZ,,,,,,,,...,,,,,,,,,,comic_book
2432,,,CrZ_G9NSuBv,,,,,,,,...,,,,,,,,,,"web_site, website, internet_site, site"


First let's process the accounts dataframe:

In [20]:
print('Number of missing values for each feature:')
print(f'{main_df.isna().sum()}')

Number of missing values for each feature:
id                      0
username                0
shortcode               0
post_type               0
like                    0
comment                 0
object               2313
category_name         550
follower                0
following               0
ar_effect               0
type_business           0
type_professional       0
verified                0
reel_count              0
reel_avg_view           0
reel_avg_comment        0
reel_avg_like           0
reel_avg_duration       0
reel_frequency          0
media_count             0
media_avg_comment       0
media_avg_like          0
media_frequency         0
dtype: int64


As you can see in the cell below, the only feature that has missing value is **Category**. We will replace those missing values with "Unknown".

In [61]:
main_accounts_df['category_name'].fillna('Unknown', inplace=True)

Another data cleaning task that we must do to increase the accuracy and generalizability, is to process the categorical variables. Since we have a good chunk of categorical features in this dataset, we must do this task with careful consideration. There is always a debate regarding the type of encoding the categorical variables, should we use One Hot Encoding (OHE) or Label Encoding (LE). The rule of thumb for this debate rests in cardinality. If the cardinality of the feature is high, we must use label encoding, but if the cardinality is low, we should use label encoding. Let's Explore the cardinality of categorical features in the dataset.

In [62]:
print(f'Cardinality of category_name:\t\t {len(main_accounts_df["category_name"].unique())}')
print(f'Cardinality of ar_effect:\t\t {len(main_accounts_df["ar_effect"].unique())}')
print(f'Cardinality of type_business:\t\t {len(main_accounts_df["type_business"].unique())}')
print(f'Cardinality of type_professional:\t {len(main_accounts_df["type_professional"].unique())}')
print(f'Cardinality of verified:\t\t {len(main_accounts_df["verified"].unique())}')

Cardinality of category_name:		 42
Cardinality of ar_effect:		 2
Cardinality of type_business:		 2
Cardinality of type_professional:	 2
Cardinality of verified:		 2


As you can see in the cell above, the only feature with high cardinality is **category_name** and other features are binary categorical features, thus have the low cardinality.

***But***, at the time of writing this code, **XGBoost 1.7** had been published, since this version of XGBoost, it can works with categorical variables without the need of manual encoding, thus we won't encode the categorical variables.

Since we will use the XGBoost and tree-based models for this competetition, feature normalization won't improve the model, thus we will skip the normalization.

Now we can process the posts dataframe:

First thing we can remove `GraphVideo` type of posts from the dataset since reels on the Instagram don't have detected objects since they are videos.

In [63]:
main_posts_df = main_posts_df.drop(main_posts_df[main_posts_df['post_type'] == 'GraphVideo'].index)
main_posts_df = main_posts_df.reset_index(drop=True)

Now after all the cleaning, I can make the main dataset for training the model. To create the main dataset, we must add the account information which is present in the **main_accounts_df** to each corresponding record in **main_posts_df**.

In [64]:
df = main_posts_df.merge(main_accounts_df, on='username')
df.to_csv('Data/main_dataset.csv',)

After all of these endeavors, we can train the model. To make the workspace more clear, we will train the model explore it, and visualize it in another notebook.

----
`@Ramin F.` | [Email](ferdos.ramin@gmail.com) | [LinkedIn](https://www.linkedin.com/in/raminferdos/) | [GitHub](https://github.com/SimplyRamin) | [Personal Portfolio](https://simplyramin.github.io/)