<a href="https://colab.research.google.com/github/S-Arnone/RaqqaOnlife/blob/main/Instagram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @author:          Samuel Arnone-Roller
# @email:           RollerSa@uw.edu
# @website:         https://hgis.uw.edu
# @organization:    University of Washington - Graduate Student
# @description:     An Instagram crawler developed for GEOG 595.

In [None]:
# Install Kora on the remote Google Colab server. Kora is a collection of tools that make programming on Google Colab easier.
!pip install kora -q

In [None]:
from bs4 import BeautifulSoup
import time, datetime
import pandas as pd
import re
from kora.selenium import wd as bot

In [None]:
# URL Selection and Manipulation
## The url variable can be changed by the user to collect data on a different subject. To obtain a valid URL for this bot, you must
## first manually download the HTML of the entries you want to crawl. This limits how the bot can interact with HTML rendered
## by React JS, but it allows access without an API. After you download the HTML, upload it to your Google Drive
## and copy the link that is presented when you click 'download'. This link can be tricky to find. The bot uses the download link
## to fetch the HTML directly into the Colab files. This step is error-prone; if it does not work, you may wish to manually
## upload the HTML file in the Files section on the left-hand side of the Colab screen. Note: if you do this, you need to comment out
## bot.get(url)
url = "https://drive.google.com/u/0/uc?id=1zXefaq9291x_nGBu7v-s0OLIhuGFz2IX&export=download"

## Pass the target URL to the bot, which will then load data from it.
##bot.get(url)

# Global Values
## An array to store all post urls.
media_urls = []
## An array to store the retrieved results.
results = []


# Reading in the HTML
## The following code reads the HTML you have downloaded, either through the bot or manually, so that it is accessible to BeautifulSoup.
## Opening the file in a with block ensures that it is closed once its contents have been read.
with open('/content/RaqqaAugust2021.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()

# BeautifulSoup
## Beautiful soup will now parse the HTML DOM (Document Object Model) so that it is readable and workable for the following code - simple but essential.
soup = BeautifulSoup(source_code, 'html.parser')

# Finding Posts
## Every post on instagram is contained under the div class listed below. This means that the bot must isolate those posts in advance of crawling
## the information you are interested in. The following code does exactly that, before initiating information crawling.

posts = soup.find_all('div', class_="v1Nh3 kIKUG _bz0w")

# Crawling Posts
## Using a for loop for all posts contained in your HTML, the following will run through a series of extractions by selectively
## looking for standardized information patterns.
for content in posts:

    # Video Filtering
    ## The printed messages let you track processing progress and monitor for errors while separating photos from videos.
    ## Video posts contain an svg child, so they can be identified by it and excluded from the final data.
    if content.findChild("svg"):
      print("it had an svg for a child")
    else:
      print("it did not have svg for a child")
      try:
        
        # Post URL
        ## This pulls the post url contained within the image information of each instagram post. This is necessary for work verification by
        ## yourself or others. Such information is contained in the anchor element of each post as an href.
        post_url =  content.find("a").attrs["href"]

        # Username Location
        ## The following picks through the image alt text attached to each post, extracting the username of each poster by looking between the
        ## common terms "by " and " on " or " in ".
        username = content.find("img").attrs["alt"]
        start_U = username.find("by ") + len("by ")
        ## Test for the same string that is searched for, and only look for it after the username begins.
        if " on " in username[start_U:]:
          end_U = username.find(" on ", start_U)
        else:
          end_U = username.find(" in ", start_U)
        ## If neither term is present, keep the rest of the alt text instead of dropping its last character.
        if end_U == -1:
          end_U = len(username)
        substring_U = username[start_U:end_U]

        # Creation Time
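        ## Drawing on the method used to extract usernames, the following extracts the time of posting by extracting between " on " and ". "
        created_on = content.find("img").attrs["alt"]
        ## The surrounding spaces keep the marker from matching inside a word, such as a username ending in "on".
        marker_C = " on " if " on " in created_on else " in "
        ## Offset by the length of the marker that was matched, then look for ". " only after that point.
        start_C = created_on.find(marker_C) + len(marker_C)
        end_C = created_on.find(". ", start_C)
        substring_C = created_on[start_C:end_C]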
        
        # Alt Text Description
        ## The same method can also be used to extract the image alt text in lieu of caption availability. This describes what is present
        ## in the image associated with each post. It is not perfect: at worst it provides simple observations,
        ## but at best it can replicate text and identify important visual data like flag types.
        Alt_Text = content.find("img").attrs["alt"]
        if "May be" in Alt_Text:
          start_A = Alt_Text.find("May be") + len("May be")
          substring_A = Alt_Text[start_A:].strip()
        else:
          ## No description is present, so store an empty string rather than failing or reusing values from an earlier post.
          substring_A = ""

        # DEPRECATED: Post Likes
        ## As this bot works with manually downloaded HTML, this feature has been deprecated but remains accessible for those interested
        ## in retooling the bot for HTML extraction *by* the bot directly from Instagram. Likes are inaccessible in this model
        ## due to the fact that the span containing like information is hidden behind a 'when hover' feature of React JS.
        ###LikesDiv = content.find('div', class_="_7UhW9 vy6Bb qyrsm h_zdq uL8Hv T0kll")
        ###Likes = LikesDiv.find('span')

        # Time of Crawling
        ## Collecting the date and time of my own capture allows for later validation of work,
        ## additionally, it can allow validation of the time taken to collect data.
        collected_at = datetime.datetime.now()

        # Data Organization
        ## The following organizes our data for transformation into a CSV, assigning row names to data.
        row = {'post_url': post_url,
                      'username': substring_U,
                      'created_on': substring_C,
                      'Alt_Text': substring_A,
                      'collected_at': collected_at}

        # Data Integrity
        ## This simply ensures that the same post will not be crawled twice.
        if post_url in media_urls:
                  print("this post has already been added.")
        else:
                  results.append(row)
                  media_urls.append(post_url)

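      except Exception as err:
        ## Report the failure instead of skipping the post silently, so that parsing problems are visible.
        print("skipping a post that could not be parsed:", err)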
    

# DEPRECATED: Bot Work Mediation
## The following are unnecessary when we are working with manually downloaded HTML, but in the event that you wish to retool this
## code for direct crawling, you will need to use and improve the following code to evade detection as a crawler.
## The following makes the bot process posts at a slower rate, like a human being might.
# time.sleep(7)
## The following will need to be used to scroll down an instagram page so that posts beyond what initially loads can be crawled.
# bot.execute_script("window.scrollTo(0, document.body.scrollHeight);")
## The following will finally be used to tell the bot that it can be finished crawling the Instagram page.
#bot.close()

# Data storage
## Store the results as a pandas dataframe
df = pd.DataFrame(results)

# Notification of Work Completion
## notify the completion of the crawling in the console.
print("the crawling task is finished.")

it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it did not have svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it did not have svg for a child
it did not have svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it had an svg for a child
it did not have svg for a child
it did not have svg for a child
it had an svg for a child
it had an svg for a child
it did not have svg for a child
it did not have svg for a child
it had an svg for a child
it did not have svg for a child
it did not hav
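
The string slicing used in the crawling cell to pull the username, posting time, and description out of the image alt text can also be written as a single regular expression. The sketch below is one possible alternative, not the method this notebook uses. It assumes alt text shaped like `Photo by <user> on <date>. May be <description>`; `ALT_PATTERN`, `parse_alt_text`, and the sample string are hypothetical names and data introduced for illustration, and real alt text may deviate from this shape.

```python
import re

# Assumed alt-text shape (an assumption, not a documented Instagram format):
#   "Photo by <user> on <date>. May be <description>"
# " in " is accepted in place of " on ", mirroring the slicing logic in the notebook.
ALT_PATTERN = re.compile(
    r"by (?P<username>.+?) (?:on|in) (?P<created_on>.+?)\. ?"
    r"(?:May be (?P<description>.*))?$",
    re.DOTALL,
)


def parse_alt_text(alt):
    """Return a dict with username, created_on, and description, or None if the text does not match."""
    match = ALT_PATTERN.search(alt)
    if match is None:
        return None
    fields = match.groupdict()
    # The description group is optional; normalize a missing one to an empty string.
    fields["description"] = fields["description"] or ""
    return fields


# Made-up sample string used only to demonstrate the call.
sample_alt = "Photo by example_user on August 15, 2021. May be an image of 1 person and outdoors."
print(parse_alt_text(sample_alt))
```

Returning `None` for alt text that does not fit the pattern leaves it to the caller to decide whether to skip the post or log it for manual review.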

In [None]:
# Manual DF Verification
## The following prints what has been stored in your dataframe, so that you can check for issues before writing your data to
## a CSV and downloading it. Warning: if you are working with a large volume of data, it may be better to verify your work by
## opening the CSV file in Excel due to its superior readability.
print(df)

                                    post_url  \
0   https://www.instagram.com/p/CXTAvtZNymn/   
1   https://www.instagram.com/p/Cau1wBbtVbV/   
2   https://www.instagram.com/p/CU9zoAgNviY/   
3   https://www.instagram.com/p/CWG-DhXqz5X/   
4   https://www.instagram.com/p/CYJTQKHKnBH/   
5   https://www.instagram.com/p/CY9guIuI-l6/   
6   https://www.instagram.com/p/CV_RreJqHIO/   
7   https://www.instagram.com/p/CV5Tm0RNpil/   
8   https://www.instagram.com/p/CYKGXGjK5aD/   
9   https://www.instagram.com/p/CD64nYUls4y/   
10  https://www.instagram.com/p/CD6phGJgSF5/   
11  https://www.instagram.com/p/CD6efpKobZd/   
12  https://www.instagram.com/p/CD6dT4jlifG/   
13  https://www.instagram.com/p/CD6dN_EFqiZ/   
14  https://www.instagram.com/p/CD6dG1TFTFR/   
15  https://www.instagram.com/p/CD5_9BVF-0R/   
16  https://www.instagram.com/p/CD4wXMnFe2A/   
17  https://www.instagram.com/p/CD4O8AxAA4U/   
18  https://www.instagram.com/p/CD4Fqs0FZh_/   
19  https://www.instagram.com/p/CD4FleTp
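
Reading the printed dataframe by eye can be supplemented with a programmatic check before export. The following is a minimal sketch, assuming rows shaped like the dictionaries appended to `results` in the crawling cell; `find_incomplete_rows`, its `required` parameter, and the sample rows (including their placeholder URLs) are hypothetical and exist only for illustration.

```python
def find_incomplete_rows(rows, required=("post_url", "username", "created_on")):
    """Return (index, missing_fields) pairs for rows whose required fields are absent or blank."""
    problems = []
    for index, row in enumerate(rows):
        missing = []
        for key in required:
            value = row.get(key)
            # Treat a missing key, None, or a whitespace-only string as incomplete.
            if value is None or not str(value).strip():
                missing.append(key)
        if missing:
            problems.append((index, missing))
    return problems


# Made-up rows used only to demonstrate the check.
sample_rows = [
    {"post_url": "https://www.instagram.com/p/EXAMPLE1/", "username": "example_user", "created_on": "August 15, 2021"},
    {"post_url": "https://www.instagram.com/p/EXAMPLE2/", "username": "", "created_on": "August 16, 2021"},
]
print(find_incomplete_rows(sample_rows))
```

In this notebook the same helper could be called as `find_incomplete_rows(results)` once the crawling cell has run, which would flag posts whose alt text did not yield a username or posting time.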

In [None]:
# Enable Google Colab to Access Your Drive
from google.colab import drive
# Mount your Drive to the Colab
drive.mount('/gdrive')

# Name the Output and its Location
output_file = '/gdrive/My Drive/August2021.csv'

# Initiate Data Save as CSV
df.to_csv(output_file, index=False)

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [None]:
# Downloading! 
## This final step initiates CSV downloading and notifies you on completion.
from google.colab import files
files.download(output_file)
print("the csv has been downloaded to your local computer. The program has been completed successfully.")

the csv has been downloaded to your local computer. The program has been completed successfully.
