# ML.py, MLpy, MLPy!
<p style="text-align: center;">MLpy came as a simple question <i>and</i> a big project in perspective:
<br><b>"Can I build a discord bot that can tell two pictures apart?"</b></p>
<br>The goal of this notebook is two-fold with one overarching thread:

1. To build a web crawler that can lift a statistically relevant number of images from [derpibooru](https://derpibooru.org), an image database powered by the community that grew around the fourth generation of the show 'My Little Pony.'
2. To build a machine learning algorithm capable of telling the difference between 2 types of pictures--to be summarized in a function that I can feed to my existing discord bot [BotJack](https://github.com/LMquentinLR/botjack_discord_bot).

The thread is that I am, at the time of writing, learning how to program. I know neither how to build a web crawler nor how an ML algorithm works (is it even called an algorithm?). All in all, this is a small idea that is a learning experience, a blog--and of course a fun project.

### Why should a bot do that?
There are many reasons why a bot should be able to identify images posted on a server: classification, tagging, games, etc. 
<p style="text-align: center;"><br>This notebook will focus on <b>compliance</b>.</p> 

* Servers may have anti-NSFW (i.e. not safe for work) rules where explicit, grim, and otherwise unwanted content is banned or restricted to specific server channels.
* Moderation being volunteer-driven on discord, malicious users may take advantage of idle, asleep, or away-from-keyboard moderators to break the rules. More commonly, users may simply post an NSFW picture in a SFW-only channel by mistake.
* A bot able to distinguish NSFW content from SFW content helps fill the gaps in any moderation effort. A bot could, for instance, automatically alert moderators when specific content is posted and start a moderation process before any human intervenes.

<b>Automatic content moderation and compliance is a current industry effort in social media (e.g. Facebook)</b>, making this notebook a real world application.

In [1]:
import os
import json
import requests
import time
import operator

### Building a web crawler
Derpibooru is a website dedicated to fanart of MLP:FiM. It provides a JSON REST API for major site functionality, which can be freely used by anyone wanting to produce tools for the site or other webapps that use the data provided within Derpibooru.
<br><b>Derpibooru licensing rules</b>
<br>"<i>Anyone can use it, users making abusively high numbers of requests may be asked to stop. Your application MUST properly cache, respect server-side cache expiry times. Your client MUST gracefully back off if requests fail (eg non-200 HTTP code), preferably exponentially or fatally.</i>"
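
The back-off requirement quoted above can be sketched as follows. This is a minimal illustration rather than a client prescribed by Derpibooru: the helper name `fetch_with_backoff`, its parameters, and the retry cap are assumptions made for the example. The `fetch` and `sleep` arguments are injectable so that the logic can be tried without touching the network.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    fetch       -- hypothetical zero-argument callable returning an object
                   with a `status_code` attribute (e.g. a requests.Response)
    max_retries -- assumed cap on the number of attempts before giving up
    base_delay  -- delay in seconds after the first failure
    sleep       -- injectable sleep function, handy for testing
    """
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code == 200:
            return response
        # waits base_delay * 2 ** attempt seconds: 1, 2, 4, 8...
        sleep(base_delay * 2 ** attempt)
    # fails "fatally" once the retry budget is spent
    raise RuntimeError(f"request still failing after {max_retries} attempts")
```

With `requests` imported, a call could look like `fetch_with_backoff(lambda: requests.get("https://derpibooru.org/images.json?page=1"))`.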

<br>A single image can be accessed through the following links:
1. https://derpibooru.org/2072316 (embedded)
2. https://derpicdn.net/img/view/2019/6/22/2072316.png (default size)
3. https://derpicdn.net/img/view/2019/6/22/2072316_small.png (small size)
4. https://derpicdn.net/img/view/2019/6/22/2072316_medium.png (medium size)
5. https://derpicdn.net/img/view/2019/6/22/2072316_large.png (large size)
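
The four direct links above share one pattern. Here is a small sketch of that pattern, generalized from this single example: the helper name `image_urls` is hypothetical, and the assumption that every image is a `.png` filed under its upload date may not hold across the whole site.

```python
def image_urls(image_id, year, month, day, extension="png"):
    """Build the direct links of one image, following the example above."""
    base = f"https://derpicdn.net/img/view/{year}/{month}/{day}/{image_id}"
    # suffix appended to the image id for each size listed above
    sizes = {"default": "", "small": "_small", "medium": "_medium", "large": "_large"}
    return {size: f"{base}{suffix}.{extension}" for size, suffix in sizes.items()}


urls = image_urls(2072316, 2019, 6, 22)
print(urls["small"])
```

For the image used in this notebook, `urls["small"]` matches link 3 in the list.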

The metadata of a single picture can be accessed through the following link:
* https://derpibooru.org/2072316.json
<br> The list of attributes of a single image is:
>id, created_at, updated_at, first_seen_at, score, comment_count, width, height, file_name, description, uploader, uploader_id, image, upvotes, downvotes, faves, tags, tag_ids, aspect_ratio, original_format, mime_type, sha512_hash, orig_sha512_hash, source_url, representations, is_rendered, is_optimized, interactions, spoilered
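
Most of these attributes are not needed to tell two kinds of pictures apart. Here is a sketch of how one metadata record could be trimmed to a handful of fields. The `trim_metadata` helper and the sample record are made up for illustration; only the key names come from the list above.

```python
RETAINED_KEYS = {"id", "score", "upvotes", "downvotes", "tags", "representations"}


def trim_metadata(record, retained_keys=RETAINED_KEYS):
    """Return a copy of record holding only the retained keys."""
    return {key: value for key, value in record.items() if key in retained_keys}


# made-up record that reuses attribute names from the list above
sample = {"id": 1, "score": 10, "file_name": "example.png", "tags": "safe"}
trimmed = trim_metadata(sample)
print(trimmed)
```

Building a new dictionary avoids deleting keys from a record while iterating over it.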

In [13]:
class img_metadata:
    """
    Class representing the JSON metadata file that can be retrieved from the REST API
    of the website derpibooru.
    """
    
    def __init__(self, tags = "", tags_include = True, instances = 50, crawl_all = True):
        """
        Initialization of the img_metadata class of objects.
        ---
        :param <self>:         <class>    ; bound object/variable's reference
        :param <tags>:         <list>     ; list of strings representing tags to filter 
                                            the extraction
        :param <tags_include>: <boolean>  ; True includes/False excludes based on tags
        :param <instances>:    <integer>  ; number of instances to extract before a stop
        :param <crawl_all>:    <boolean>  ; True implies the program crawls the whole derpibooru
                                            metadata database/False implies the program updates
                                            the locally stored data with newly added data.
        """
        self.tags = tags
        self.tags_include = tags_include
        self.instances = instances
        self.crawl_all = crawl_all
    
    def check_file(self):
        """
        Checks for an existing extraction of image metadata in the working directory. 
        The default file name is 'derpibooru_metadata.json'.
        ---
        :param <self>: <class> ; bound object/variable's reference
        """
        #messages to be displayed in the command line
        json_found_msg = "The file 'derpibooru_metadata.json' was found and read."
        json_not_found_msg = f"The file 'derpibooru_metadata.json' is missing from {os.getcwd()}.\n" + \
        "An empty JSON file named 'derpibooru_metadata.json' will be created."
        json_created_msg = "An empty JSON file named 'derpibooru_metadata.json' was created."
        json_not_created_msg = "The file 'derpibooru_metadata.json' could not be created."
        #os.path.join keeps the path valid on Windows, Linux, and macOS
        json_metadata_path = os.path.join(os.getcwd(), "derpibooru_metadata.json")
        
        #checks if a local file containing potential metadata exists
        find = os.path.exists(json_metadata_path)
        
        #if TRUE: opens the file and extracts the stored metadata
        #if FALSE: creates an empty file and stores an empty list
        if find == True:
            print(json_found_msg)
            with open(json_metadata_path,'r') as file:
                json_metadata = json.load(file)
        else:
            print(json_not_found_msg)
            json_metadata = "[]"
            try:    
                with open(json_metadata_path, "w") as file:
                    file.write(json_metadata)   
                print(json_created_msg) 
            except Exception as e:
                print(json_not_created_msg)
                print(e)
        return json_metadata_path

    def json_merge(self, json_local, json_derpibooru, path_json, iterations, crawl_all):
        """
        Merges the metadata already stored locally with that extracted from derpibooru.
        ---
        :param <self>:            <class>       ; bound object/variable's reference
        :param <json_local>:      <json_object> ; JSON data stored locally
        :param <json_derpibooru>: <json_object> ; JSON data extracted from derpibooru
        :param <path_json>:       <string>      ; path of local file where metadata is stored
        :param <iterations>:      <integer>     ; number of images merged so far
        :param <crawl_all>:       <boolean>     ; True implies the program crawls the whole derpibooru
                                                  metadata database/False implies the program updates
                                                  the locally stored data with newly added data.
        """
        #keys retained from the JSON metadata retrieved through the derpibooru REST API
        retained_keys = ["id", "created_at", "updated_at", "score", "uploader", 
                "uploader_id", "upvotes", "downvotes", "faves", "tags",
                "tag_ids", "representations"]
        
        #for each image in the extracted JSON file:
        #records each item and removes any key that's not in retained_keys
        #appends each item to the JSON file (replacing a previous entry if it exists)
        #counts the number of images iterated over to break when the max number of 
        #instances is reached.
        try:
            for image_derpibooru in json_derpibooru:
                #creates a copy of the image metadata
                temp = image_derpibooru.copy()
                #removes the unwanted keys
                for item in image_derpibooru:
                    if item not in retained_keys: del temp[item]
                #appends to the local JSON data, replacing if a previous entry exists
                if len(json_local) == 0: json_local.append(temp)
                else:
                    for index, image_json in enumerate(json_local):
                        #print(temp["id"], image_json["id"], index, sep= "|")
                        if temp["id"] == image_json["id"]:
                            if not crawl_all:
                                #known image reached: refreshes the stored entry instead of
                                #appending a duplicate, then signals the end of the update
                                json_local[index] = temp
                                raise NewContentCrawled
                            else:
                                del json_local[index]
                                json_local.append(temp)
                                json_local.sort(key=operator.itemgetter('id'), reverse = True)
                                break
                    else:
                        json_local.append(temp)
                        json_local.sort(key=operator.itemgetter('id'), reverse = True)
                #increments the number of performed iterations by one
                #breaks out if the maximum number of instances declared by the class is reached
                if isinstance(self.instances, str) == False:
                    print(iterations, self.instances)
                    if iterations == self.instances - 1:
                        iterations += 1
                        break
                    else: iterations +=1
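            #dumps the updated local JSON data in its file
            with open(path_json,'w') as file:
                json.dump(json_local, file)
        except NewContentCrawled:
            #saves the newly crawled content before signalling the end of the update
            with open(path_json,'w') as file:
                json.dump(json_local, file)
            iterations = "END"
        return iterations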

    def crawl_metadata(self):
        """
        Retrieves from the derpibooru REST API the full list of picture metadata.
        ---
        :param <self>: <class> ; bound object/variable's reference
        """
        #initializes local variables
        page = 1
        iterations = 0
        back_off_counter = 1
        instances_extracted = f"The set maximum number of images to request was reached at {self.instances}."
        exit_condition_1 = "The crawler scraped the whole derpibooru metadata. The program will now close."
        exit_condition_2 = "The crawler scraped the new content on derpibooru. The program will now close."
        exit = "--Exiting program--"
        
        #retrieves existing locally stored metadata
        path_json = self.check_file()
        with open(path_json,'r') as file: json_local = json.load(file)
        
        #iterates over <x> pages until <self.instances> image metadata entries have been retrieved
        while True:
            #messages to be displayed in the command line
            current_page = f"You are requesting the page {page} of the derpibooru website."
            error_json_extraction = f"The program couldn't extract the page {page} and " + \
                                    "will now proceed to an exponential back off."
            #retrieves the data from the {page} page on Derpibooru
            print(current_page)
            path_derpibooru = "https://derpibooru.org/images.json?page=" + str(page)
            #tries to overwrite/update the locally stored metadata
            try:
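                response = requests.get(path_derpibooru)
                #raises on HTTP error codes so that the back off below is triggered
                response.raise_for_status()
                json_derpibooru = response.json()["images"]
                #an empty list of images means the last page was passed
                if not json_derpibooru:
                    raise DatabaseFullyCrawled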
                iterations = self.json_merge(json_local, json_derpibooru, path_json, iterations, self.crawl_all)
                if iterations == "END":
                    raise NewContentCrawled
                page += 1
                #time delay to respect the API's license
                time.sleep(1)
                if isinstance(self.instances, str) == False:
                    if iterations >= self.instances:
                        print(instances_extracted)
                        break
            except DatabaseFullyCrawled:
                print(exit_condition_1)
                break
            except NewContentCrawled:
                print(exit_condition_2)
                break
            except Exception as e:
                print(error_json_extraction)
                print(f"The error was the following: {e}.\n The program will back " + \
                      f"off for {2**back_off_counter} seconds.")
                time.sleep(2 ** back_off_counter)
                #doubles the next waiting time so that the back off is exponential
                back_off_counter += 1
        pass
    
    def metadata_filter(self, tags, tags_include):
        """
        Provides the list of derpibooru picture IDs and sizes available for extraction.
        ---
        :param <self>:         <class>   ; bound object/variable's reference
        :param <tags>:         <list>    ; list of strings representing tags to filter
                                           the extraction
        :param <tags_include>: <boolean> ; True includes/False excludes based on tags
        """
        #return "{id: size}"
        pass
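
`metadata_filter` is still a stub. Below is one possible shape for the tag filter it describes, under the assumption that each record stores its tags as a single comma-separated string (if the API returns a list instead, the split step can be dropped). The function name `filter_by_tags` and the sample records are hypothetical.

```python
def filter_by_tags(records, tags, tags_include=True):
    """Keep records that match (or, if tags_include is False, avoid) any tag."""
    wanted = {tag.strip().lower() for tag in tags}
    selected = []
    for record in records:
        # assumed format: "safe, artist:someone, pony"
        record_tags = {tag.strip().lower() for tag in record["tags"].split(",")}
        has_match = bool(wanted & record_tags)
        # keeps matching records when including, non-matching ones when excluding
        if has_match == tags_include:
            selected.append(record)
    return selected


# made-up records for illustration
records = [
    {"id": 1, "tags": "safe, pony"},
    {"id": 2, "tags": "explicit, pony"},
]
safe_ids = [r["id"] for r in filter_by_tags(records, ["safe"])]
not_explicit_ids = [r["id"] for r in filter_by_tags(records, ["explicit"], tags_include=False)]
```

Comparing sets of normalized tags keeps the check independent of spacing and letter case.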

In [14]:
class derpibooru_search(img_metadata):
    """
    Class representing a search object that can prompt the derpibooru REST API and
    retrieve both picture metadata and the affiliated pictures.
    """
    
    def change_search(self, tags = "", tags_include = True, instances = 50, crawl_all = True):
        """
        Changes the arguments of the created object derpibooru_search.
        ---
        :param <self>:         <class>    ; bound object/variable's reference
        :param <tags>:         <list>     ; list of strings representing tags to filter 
                                            the extraction
        :param <tags_include>: <boolean>  ; True includes/False excludes based on tags
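        :param <instances>:    <integer>  ; number of instances to extract before a stop
        :param <crawl_all>:    <boolean>  ; True crawls the whole database/False only updates
                                            the locally stored data with newly added data.
        """
        self.tags = tags
        self.tags_include = tags_include
        self.instances = instances
        self.crawl_all = crawl_all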
    
    def crawl(self):
        """
        Runs the metadata crawler and prints a banner when it starts and when it ends.
        ---
        :param <self>: <class> ; bound object/variable's reference
        """
        print("---|Entering Derpibooru Data Crawler Program|---")
        self.crawl_metadata()
        print("---------------|Exiting Program|---------------")
        pass

In [15]:
class Error(Exception):
    """Base class for other exceptions"""
    pass

class DatabaseFullyCrawled(Error):
    """Raised when the crawler reached the last pages of derpibooru"""
    pass

class NewContentCrawled(Error):
    """Raised when the crawler reaches content that is already stored locally"""
    pass

------------------

In [16]:
obj = derpibooru_search()

In [None]:
obj.__dict__
obj.change_search(instances = "", crawl_all = True)
obj.crawl()

In [None]:
print(obj.crawl.__doc__)