# ML.py, MLpy, MLPy!
<p style="text-align: center;">MLpy started as a simple question <i>and</i> a big project in perspective:
<br><b>"Can I build a discord bot that can tell two pictures apart?"</b></p>
<br>The goal of this notebook is two-fold with one overarching thread:

1. To build a web crawler that can lift a statistically meaningful number of images from [derpibooru](https://derpibooru.org), an image database powered by the community that grew around the fourth generation of the show 'My Little Pony.'
2. To build a machine learning algorithm capable of telling two types of pictures apart, summarized in a function that I can feed to my existing discord bot [BotJack](https://github.com/LMquentinLR/botjack_discord_bot).

The thread is that I am, at the time of writing, learning how to program. I know neither how to build a web crawler nor how a ML algorithm works (is it even called an algorithm?). All in all, this small idea is at once a learning experience, a blog, and of course a fun project.

### Why should a bot do that?
There are many reasons why a bot should be able to identify images posted on a server: classification, tagging, games, etc. 
<p style="text-align: center;"><br>This notebook will focus on <b>compliance</b>.</p> 

* Servers may have anti-NSFW (i.e. not safe for work) rules under which explicit, grim, or otherwise unwanted content is banned or restricted to specific server channels.
* Since moderation on discord is volunteer-driven, malicious users may capitalize on idle, asleep, or away-from-keyboard moderators to break the rules. More commonly, users may simply post an NSFW picture in a SFW-only channel.
* A bot able to distinguish NSFW content from SFW content helps fill the gaps that affect any moderation effort. A bot could, for instance, automatically alert moderators when a specific type of content is posted and start a moderation process before any human steps in.

<b>Automatic content moderation and compliance is a current industry effort in social media (e.g. Facebook)</b>, making this notebook a real world application.

In [None]:
import os
import json
import requests
import time
import operator

### Building a web crawler
Derpibooru is a website dedicated to fanart of MLP:FiM. It provides a JSON REST API for major site functionality, which can be freely used by anyone wanting to produce tools for the site or other webapps that use the data provided within Derpibooru.
<br><b>Derpibooru licensing rules</b>
<br>"<i>Anyone can use it, users making abusively high numbers of requests may be asked to stop. Your application MUST properly cache, respect server-side cache expiry times. Your client MUST gracefully back off if requests fail (eg non-200 HTTP code), preferably exponentially or fatally.</i>"
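
The "back off exponentially or fatally" requirement can be sketched as below. This is a minimal illustration and not the crawler used later in this notebook: `fetch_with_backoff`, its parameters, and the injectable `sleep` argument are all hypothetical names chosen for the example.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    fetch       -- hypothetical zero-argument callable that raises on failure
    max_retries -- number of retries allowed before giving up ("fatally")
    base_delay  -- wait in seconds before the first retry
    sleep       -- injectable so the function can be tried without real waiting
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: re-raise the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

With `requests`, the callable could be something like `lambda: requests.get(url).json()`, so that a response that cannot be parsed as JSON counts as a failure and triggers the next, longer wait.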

<br>A single image can be accessed through the following links:
1. https://derpibooru.org/2072316 (embedded)
2. https://derpicdn.net/img/view/2019/6/22/2072316.png (default size)
3. https://derpicdn.net/img/view/2019/6/22/2072316_small.png (small size)
4. https://derpicdn.net/img/view/2019/6/22/2072316_medium.png (medium size)
5. https://derpicdn.net/img/view/2019/6/22/2072316_large.png (large size)
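
The size variants above seem to differ only by a suffix placed before the file extension. Assuming that pattern holds (it is inferred solely from the links listed here), a small helper can derive them from the default-size link. `size_variants` is a hypothetical name used for illustration.

```python
import posixpath


def size_variants(default_url):
    """Derive the small/medium/large links from the default-size link.

    Assumes the '_small', '_medium', '_large' suffix pattern seen in the
    example links; this is an inference, not a documented guarantee.
    """
    root, ext = posixpath.splitext(default_url)
    variants = {"default": default_url}
    for size in ("small", "medium", "large"):
        variants[size] = f"{root}_{size}{ext}"
    return variants


links = size_variants("https://derpicdn.net/img/view/2019/6/22/2072316.png")
print(links["small"])
```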

The metadata of a single picture can be accessed through the following link:
* https://derpibooru.org/2072316.json
<br> The attributes of a single image are:
>id, created_at, updated_at, first_seen_at, score, comment_count, width, height, file_name, description, uploader, uploader_id, image, upvotes, downvotes, faves, tags, tag_ids, aspect_ratio, original_format, mime_type, sha512_hash, orig_sha512_hash, source_url, representations, is_rendered, is_optimized, interactions, spoilered
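
Not every attribute is useful for this project, so the metadata can be trimmed before it is stored. The sketch below shows the idea on a shortened record whose values are made up for illustration; `keep_keys` and `KEYS_TO_KEEP` are hypothetical names, and the selection of keys is only an example.

```python
import json

# Hypothetical, shortened record: the values are invented for this example.
raw = ('{"id": 2072316, "score": 10, "width": 800, "height": 600, '
       '"tags": "safe, solo", "mime_type": "image/png"}')

KEYS_TO_KEEP = ("id", "score", "tags")


def keep_keys(record, keys=KEYS_TO_KEEP):
    """Return a copy of record limited to the given keys (missing keys are skipped)."""
    return {key: record[key] for key in keys if key in record}


trimmed = keep_keys(json.loads(raw))
print(trimmed)
```

Building a new dictionary avoids deleting entries from a dictionary while iterating over it.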

In [None]:
class img_metadata:
    """
    Class object corresponding to the process retrieving picture metadata from the REST API
    of the website derpibooru. Data is retrieved as a series of JSON files of roughly 1 MB each.
    """
    
    def __init__(self, tags = "", tags_include = True, instances = 50, crawl_all = True):
        """
        Initialization of the img_metadata class object.
        ---
        :param <self>:         <class>    ; class object reference
        :param <tags>:         <list>     ; list of strings (i.e. picture tags) used for sorting
        :param <tags_include>: <boolean>  ; includes or excludes based on <tags>
        :param <instances>:    <integer>  ; number of instances/loops allowed before program stops
        :param <crawl_all>:    <boolean>  ; True: crawls the whole derpibooru metadata/False: updates
                                            the locally stored metadata
        """
        self.tags = tags
        self.tags_include = tags_include
        self.instances = instances
        self.crawl_all = crawl_all
    
    def convert_bytes(self, bytes_size):
        """
        Converts a size in bytes into a human-readable string (e.g. '1.0 MB')
        ---
        :param <self>:       <class>   ; class object reference
        :param <bytes_size>: <integer> ; size in bytes of a file
        """
        for unit_multiple in ['bytes', 'KB', 'MB', 'GB', 'TB']:
            if bytes_size < 1024.0:
                return "%3.1f %s" % (bytes_size, unit_multiple)
            bytes_size /= 1024.0
    
    def keys_to_keep(self):
        """
        Returns the keys to keep in the JSON extract
        ---
        :param <self>: <class> ; class object reference
        """
        keys = ["id", "created_at", "updated_at", "score", "uploader",
                "uploader_id", "upvotes", "downvotes", "faves", "tags",
                "tag_ids", "representations"]
        return keys
    
    def push_one_up(self):
        """
        Updates all the existing JSON records (1Mb splits) by incrementing their name by 1.
        ---
        :param <self>: <class> ; class object reference
        """
        def push(name):
            #shifts the file one index up, first freeing the target name if it is taken
            next_name = name[:19] + "_" + str(int(name[20:-5]) + 1) + ".json"
            if os.path.exists(next_name): push(next_name)
            os.rename(name, next_name)

        if os.path.exists("derpibooru_metadata_0.json"): 
            push("derpibooru_metadata_0.json")
        if os.path.exists("derpibooru_metadata.json"):
            os.rename("derpibooru_metadata.json", "derpibooru_metadata_0.json")
    
    def check_prior_extract(self, print_msg = True):
        """
        Checks for existing metadata extractions in the working directory. 
        The default file name is 'derpibooru_metadata.json'.
        ---
        :param <self>:      <class>   ; class object reference
        :param <print_msg>: <boolean> ; toggle to print message to command line
        """
        
        json_found = f"FOUND: 'derpibooru_metadata.json'"
        json_not_found = f"MISSING FILE: 'derpibooru_metadata.json'; NOT IN: {os.getcwd()}\n" + \
        "FILE TO CREATE: 'derpibooru_metadata.json'"
        json_created = f"FILE CREATED: 'derpibooru_metadata.json'"
        json_not_created = f"ERROR FILE CREATION: 'derpibooru_metadata.json'"
        #os.path.join keeps the path valid on any operating system
        json_path = os.path.join(os.getcwd(), "derpibooru_metadata.json")
        
        #checks if a local file containing potential metadata exists
        find = os.path.exists(json_path)
        
        #if TRUE: reports that the file was found
        #if FALSE: creates a file storing an empty list
        if find:
            if print_msg: print(json_found)
        else:
            if print_msg: print(json_not_found)
            try:    
                with open(json_path, "w") as file: file.write("[]")   
                if print_msg: print(json_created) 
            except Exception as e: print(json_not_created, e, sep = "\n")
        
        return json_path

    def json_collect(self, json_local, json_derpibooru, json_path, iterations):
        """
        Collects picture metadata extracted from derpibooru
        ---
        :param <self>:            <class>       ; class object reference
        :param <json_local>:      <json_object> ; JSON data stored locally
        :param <json_derpibooru>: <json_object> ; JSON data extracted from derpibooru
        :param <json_path>:       <string>      ; path of local file where the data is stored
        :param <iterations>:      <integer>     ; number of images collected so far
        """
        stored_keys = self.keys_to_keep()
        
        for image_data in json_derpibooru:
            
            temp = image_data.copy()
            
            for item in image_data: 
                if item not in stored_keys: del temp[item]

            json_local.append(temp)
            json_local.sort(key=operator.itemgetter("id"), reverse = True)

            #increments the number of performed iterations by one
            #breaks out if the upper limit of instances is reached
            if not isinstance(self.instances, str):
                iterations += 1
                if iterations == self.instances: break

        with open(json_path,'w') as file: json.dump(json_local, file)

        #splits the json file if it is too large (1Mb)
        length = self.convert_bytes(float(os.stat(json_path).st_size))
        length = length.split(" ")
        if float(length[0]) >= 1.0 and length[1] == "MB":
            nb_file = 0
            while True:
                json_new_path = json_path[:-5] + "_" + str(nb_file) + ".json"
                if os.path.exists(json_new_path) == False:
                    print("SPLIT: JSON file to be split as 1Mb max size reached.")
                    os.rename(json_path, json_new_path) 
                    break
                nb_file += 1
        
        return iterations

    def json_update(self, json_derpibooru, json_path, iterations, last_id):
        """
        Merges the metadata already stored locally with that extracted from derpibooru.
        ---
        :param <self>:            <class>       ; class object reference
        :param <json_derpibooru>: <json_object> ; JSON data extracted from derpibooru
        :param <json_path>:       <string>      ; path of local file where the data is stored
        :param <iterations>:      <integer>     ; number of images collected so far
        :param <last_id>:         <integer>     ; most recent ID stored locally
        """            
        stored_keys = self.keys_to_keep()
        
        try:
            with open(json_path, "r") as file: json_local = json.load(file)
            
            for image_data in json_derpibooru:
                
                temp = image_data.copy()
                
                for item in image_data: 
                    if item not in stored_keys: del temp[item]
                
                #stops at the most recent ID already stored locally, saving
                #the newer entries collected so far before leaving
                if temp["id"] == last_id:
                    with open(json_path,'w') as file: json.dump(json_local, file)
                    raise NewContentCrawled
                json_local.append(temp)
                json_local.sort(key=operator.itemgetter('id'), reverse = True)
                
                #increments the number of performed iterations by one
                #breaks out if the upper limit of instances is reached
                if not isinstance(self.instances, str):
                    iterations += 1
                    if iterations == self.instances: break
            
            with open(json_path,'w') as file: json.dump(json_local, file)
            
            #splits the json file if it is too large (1Mb)
            length = self.convert_bytes(float(os.stat(json_path).st_size))
            length = length.split(" ")
            if float(length[0]) >= 1.0 and length[1] == "MB":
                self.push_one_up()
                json_new_path = "derpibooru_metadata_0.json"
                if os.path.exists(json_new_path) == False:
                    print("JSON file will be split for size management.")
                    os.rename(json_path, json_new_path) 
        
        except NewContentCrawled: iterations = "END"
        
        return iterations

    def crawl_metadata(self):
        """
        Retrieves from the derpibooru REST API a list of picture metadata.
        ---
        :param <self>: <class> ; class object reference
        """
        #initializes local variables
        page = 1
        iterations = 0
        back_off_counter = 1
        max_instances_reached = f"The set maximum number of images to request was reached at {self.instances}."
        exit_condition_1 = "The crawler scraped the whole derpibooru metadata. The program will now close."
        exit_condition_2 = "The crawler scraped the new content on derpibooru. The program will now close."
        
        #removes all existing json files if crawl_all == True
        for fname in os.listdir(os.getcwd()):
            if self.crawl_all and fname.startswith("derpibooru_metadata"): 
                os.remove(os.path.join(os.getcwd(), fname))
        
        #retrieves most recent recorded picture id
        if os.path.exists("derpibooru_metadata_0.json"): json_path = "derpibooru_metadata_0.json"
        else: json_path = self.check_prior_extract()
        
        with open(json_path, "r") as file:
            last_id = json.load(file)
        
        if last_id == []: last_id = -1
        else: 
            last_id = last_id[0]["id"]
            if not self.crawl_all: self.push_one_up()        
        
        while True:
            current_page = f"You are requesting the page {page} of the derpibooru website."
            error_json_extraction = f"The program couldn't extract the page {page} and " + \
                                    "will now proceed to an exponential back off."
            
            json_path = self.check_prior_extract(False)
            
            with open(json_path,'r') as file: json_local = json.load(file)
            
            print(current_page)
            
            path_derpibooru = "https://derpibooru.org/images.json?page=" + str(page)
            
            try:
                json_derpibooru = requests.get(path_derpibooru).json()["images"]
                if json_derpibooru == []: raise DatabaseFullyCrawled
                
                if self.crawl_all:
                    iterations = self.json_collect(json_local, json_derpibooru, json_path, iterations)
                else:
                    iterations = self.json_update(json_derpibooru, json_path, iterations, last_id)
                
                if iterations == "END": raise NewContentCrawled
                
                page += 1
                
                #time delay to respect the API's license
                time.sleep(1)
                
                if not isinstance(self.instances, str) and iterations >= self.instances:
                        print(max_instances_reached)
                        break
            
            except DatabaseFullyCrawled:
                print(exit_condition_1)
                break
            
            except NewContentCrawled:
                print(exit_condition_2)
                break
            
            except Exception as e:
                print(error_json_extraction)
                print(f"The error was the following: {e}.\nThe program will back " + \
                      f"off for {2**back_off_counter} seconds.")
                #sleeps for the announced delay, then doubles it for the next failure
                time.sleep(2 ** back_off_counter)
                back_off_counter += 1

In [None]:
class derpibooru_search(img_metadata):
    """
    Class representing a search object that can prompt the derpibooru REST API and
    retrieve both picture metadata and the affiliated pictures.
    """
    
    def change_search(self, tags = "", tags_include = True, instances = 50, crawl_all = True):
        """
        Changes the arguments of the created object derpibooru_search.
        ---
        :param <self>:         <class>    ; class object reference
        :param <tags>:         <list>     ; list of strings (i.e. picture tags) used for sorting
        :param <tags_include>: <boolean>  ; includes or excludes based on <tags>
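        :param <instances>:    <integer>  ; number of instances/loops allowed before program stops
        :param <crawl_all>:    <boolean>  ; True: crawls the whole derpibooru metadata/False: updates
                                            the locally stored metadata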
        """
        self.tags = tags
        self.tags_include = tags_include
        self.instances = instances
        self.crawl_all = crawl_all

    def crawl(self):
        """
        Runs the metadata crawl with the current search arguments.
        ---
        :param <self>: <class> ; class object reference
        """
        print("----|Entering Derpibooru Data Crawler code|----")
        self.crawl_metadata()
        print("---------------|Exiting Program|---------------")
        pass

In [None]:
class Error(Exception):
    """Base class for other exceptions"""
    pass

class DatabaseFullyCrawled(Error):
    """Raised when the crawler reached the last pages of derpibooru"""
    pass

class NewContentCrawled(Error):
    """Raised when the crawler reaches content that is already stored locally"""
    pass

------------------

The following is a series of tests you can try out.

In [None]:
obj = derpibooru_search()

In [None]:
obj.__dict__
obj.change_search(instances = 10, crawl_all = True)
obj.crawl()

In [None]:
obj.__dict__
obj.change_search(instances = "", crawl_all = True)
obj.crawl()

In [None]:
obj.__dict__
obj.change_search(instances = 10, crawl_all = False)
obj.crawl()

In [None]:
obj.__dict__
obj.change_search(instances = "", crawl_all = False)
obj.crawl()