# Instagram Post Dataset
This notebook retrieves a subset of images from the Instagram dataset (https://www.kaggle.com/datasets/shmalex/instagram-dataset).

In [1]:
# Path to the Instagram dataset
DS_PATH = "/kaggle/input/instagram-dataset"

# Dimensions to rescale images (IMAGE_SIZE x IMAGE_SIZE)
IMAGE_SIZE = 256

### Import libraries

- Pandas: Read the CSV file into a DataFrame
- PIL: Open, resize, and save images
- Requests: Download images from URLs
- NumPy: Build the array of dataset indices

In [2]:
import pandas as pd
from PIL import Image
import requests
import numpy as np

### Read data
Read `N_ROW` rows of data, starting at data row `START_INDEX` (1-based, not counting the header).

In [3]:
# Number of rows to read
N_ROW = 1000000

# Which row of the dataset to start reading at (not including header)
START_INDEX = 0 * 1000000 + 1

In [4]:
df = pd.read_csv(DS_PATH + '/instagram_posts.csv', 
                 delimiter='\t', 
                 nrows=N_ROW, 
                 skiprows=range(1, START_INDEX)
                )
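The interaction between `skiprows` and `nrows` is easy to get wrong, so here is a minimal sketch on a made-up, in-memory file. The six-row table and the `codeN` values are hypothetical and only stand in for the real dataset. The small `N_ROW` and `START_INDEX` values are chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the tab-separated file: a header plus six data rows.
tsv = "post_id\tpost_type\n" + "".join(f"code{i}\t1\n" for i in range(1, 7))

N_ROW = 2        # number of rows to read
START_INDEX = 3  # first data row to read (1-based, header not counted)

# range(1, START_INDEX) skips file lines 1 .. START_INDEX - 1.
# Line 0 (the header) is never skipped, so the column names survive.
chunk = pd.read_csv(
    io.StringIO(tsv),
    delimiter="\t",
    nrows=N_ROW,
    skiprows=range(1, START_INDEX),
)

print(chunk.post_id.tolist())  # data rows 3 and 4 of the toy file
```

With `START_INDEX = 1`, as in the cell above, `range(1, 1)` is empty and nothing is skipped.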

## Data Cleaning
Add an `index` column that records each post's row number in the full dataset.

In [5]:
df.insert(loc=0, 
          column='index', 
          value=np.arange(START_INDEX, START_INDEX + df.shape[0])
         )

Keep only the rows whose media is a single image.

We perform three operations:
- Keep only the rows where the media type is a single image. This corresponds to `1` in the `post_type` column.
- Drop the `post_type` column, as it is no longer needed.
- Reset the row labels after dropping the non-image rows, so that the remaining rows can be iterated over by position.

In [6]:
df = df[df.post_type == 1].drop(columns='post_type').reset_index()
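One side effect of `reset_index()` is worth knowing: because the frame already has a column named `index`, pandas stores the old row labels in a new column called `level_0`. The sketch below reproduces this on a hypothetical three-row frame and shows `drop=True` as one way to avoid the extra column. It is an illustration, not a change to the cell above.

```python
import pandas as pd

# Hypothetical toy frame that already has a column named 'index', like df here.
toy = pd.DataFrame({"index": [1, 2, 3], "post_type": [1, 2, 1]})

kept = toy[toy.post_type == 1].drop(columns="post_type")

# Default: the old row labels are kept in a new column.
# It is named 'level_0' because the name 'index' is already taken.
with_old_labels = kept.reset_index()

# drop=True: the old row labels are discarded and the rows are only renumbered.
renumbered = kept.reset_index(drop=True)

print(list(with_old_labels.columns))
print(list(renumbered.columns))
```

Either variant leaves the rows labeled `0, 1, 2, ...`, which is all the later download loop relies on.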

Preview the first few entries of the dataframe. The sequences of characters in the `post_id` column are "shortcodes" which can be put into `https://www.instagram.com/p/<shortcode>/media/?size=l` to get the link to an image.

In [7]:
df.head()

|   | level_0 | index | sid | sid_profile | post_id | profile_id | location_id | cts | description | numbr_likes | number_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 28370932 | -1 | BVg0pbolYBC | 5579335000.0 | 457426800000000.0 | 2017-06-19 09:31:16.000 | 🙌🏼 believe in ya dreams 🙌🏼 just like I believe... | 25 | 1 |
| 1 | 2 | 3 | 28370933 | -1 | BRgkjcXFp3Q | 313429600.0 | 457426800000000.0 | 2017-03-11 20:05:03.000 | #meraviglia #incensi #the #candele #profumo #a... | 9 | 0 |
| 2 | 3 | 4 | 28370934 | -1 | BKTKeNhjEA7 | 1837593000.0 | 457426800000000.0 | 2016-09-13 16:27:16.000 | #teatime #scorpion #friends #love #mountains #... | 4 | 0 |
| 3 | 4 | 5 | 28370935 | -1 | 8-NQrvoYLX | 1131527000.0 | 457426800000000.0 | 2015-10-18 10:19:27.000 | thE sky gavE mE a #constEllation | 8 | 0 |
| 4 | 5 | 6 | 28370964 | -1 | BrYDPJeABJQ | 16262390.0 | 282618700.0 | 2018-12-14 18:16:15.000 | #beautiful #Christmas #lights | 138 | 15 |
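The shortcode-to-URL step can be written as a small helper. This is a sketch: the function name `media_url` is hypothetical, and the URL pattern is the one quoted in this notebook. It does not guarantee that Instagram still serves an image at that address.

```python
def media_url(shortcode: str) -> str:
    """Build the image URL for a post shortcode, using the pattern from this notebook."""
    return f"https://www.instagram.com/p/{shortcode}/media/?size=l"


# Shortcode taken from the first row of the preview.
print(media_url("BVg0pbolYBC"))
```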


### Get images
Create a directory where we will store the images.

In [8]:
!mkdir images
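If the notebook is re-run, `mkdir` reports an error because the directory already exists. A possible pure-Python alternative is sketched below. The name `IMAGES_DIR` is an assumption introduced for illustration.

```python
from pathlib import Path

# Hypothetical constant for the output directory used by the download loop.
IMAGES_DIR = Path("images")

# exist_ok=True makes the call a no-op when the directory is already there.
IMAGES_DIR.mkdir(parents=True, exist_ok=True)
```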

Iterate through the rows of the dataframe to download the images. Note that some URLs are invalid. When a request fails or the response is not a readable image, we skip that post.

In [9]:
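for i in range(len(df)):  # start at 0 so the first row is not skipped
    # Dataset index of this post, taken from the 'index' column added earlier
    post_index = df['index'][i]

    if i % 1000 == 0:
        print(f"Getting image {post_index}...")

    # Get the shortcode from the dataset and build the image URL
    shortcode = df.post_id[i]
    url = f'https://www.instagram.com/p/{shortcode}/media/?size=l'

    # Get the image from the URL
    response = requests.get(url, stream=True)

    # Save the image if the request succeeded
    if response.status_code == 200:
        try:
            img = Image.open(response.raw).resize((IMAGE_SIZE, IMAGE_SIZE))
            # Name the file after the dataset index so it can be traced back to its row
            img.save(f'images/{post_index}.png')
        except Exception:
            # The response was not a readable image: report it and move on
            print(f"Error in index {post_index}")
            print("Link:", url)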
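Separating the download from the decoding makes the skip-on-failure behavior easier to test without a network connection. The helpers below are a sketch under several assumptions. The names `decode_and_resize` and `fetch_image`, the 10-second timeout, and the conversion to RGB are all choices made for this illustration and are not part of the original notebook.

```python
from io import BytesIO
from typing import Optional

import requests
from PIL import Image

IMAGE_SIZE = 256  # same target size as in the notebook


def decode_and_resize(data: bytes, size: int = IMAGE_SIZE) -> Optional[Image.Image]:
    """Return a size x size RGB image, or None if the bytes are not a readable image."""
    try:
        img = Image.open(BytesIO(data))
        return img.convert("RGB").resize((size, size))
    except (OSError, ValueError):
        # Pillow raises OSError subclasses for unidentified or truncated images.
        return None


def fetch_image(url: str, timeout: float = 10.0) -> Optional[Image.Image]:
    """Download url and decode it; return None on any network or decode failure."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    return decode_and_resize(response.content)
```

In the loop, a call such as `img = fetch_image(url)` followed by `if img is not None: img.save(...)` would then replace the status check and the `try` block. The timeout also keeps one unresponsive request from stalling the whole run.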
    
    
