InstAgolia

A Lambda-like service to go from Instagram account data to Algolia search data.

What's the point?

InstAgolia is a Docker-based HTTP application that scrapes the data of an Instagram profile (using Instagram Scraper) and then transforms that volume of data and pushes everything into an Algolia index, so that useful features can be built on top of it.

How is it made?

All you have to do is:

  • Set the PORT environment variable, e.g. PORT=3000
  • yarn or npm install
  • yarn start

A lightweight HTTP server will then be launched; a minimal sketch of what it could look like is shown below.
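
The server itself is not reproduced in this README, but here is a minimal sketch of what it could look like, assuming Node's built-in http module; runPipeline is a hypothetical placeholder for the scraping/indexing step:

// Minimal sketch only: assumes Node's built-in http module.
// `runPipeline` is a hypothetical placeholder for the scrape-and-index step.
import { createServer } from "http";

const PORT = Number(process.env.PORT) || 3000;
const URL_TO_HIT = process.env.URL_TO_HIT || "/";
const METHOD_TO_HIT = process.env.METHOD_TO_HIT || "POST";

const server = createServer(async (req, res) => {
  if (req.url === URL_TO_HIT && req.method === METHOD_TO_HIT) {
    // await runPipeline(); // trigger the scraping/indexing pipeline here
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "pipeline triggered" }));
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(PORT, () => console.log(`Listening on port ${PORT}`));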

But it works best with the Dockerfile at the root of the project, so that everything can be launched in any environment:

  • Build the image: docker build -t instagolia:1.0 ./

  • Launch the container: docker run -d -p 3000:3000 instagolia:1.0

Finally, to trigger the pipeline, you need to hit the HTTP server at the desired endpoint:

  • curl -X POST --data '{}' https://yoururl.com/

A little bit of configuration

A few environment variables have to be provided for the service to work:

  • PORT

    • You can provide a specific port to bind to; this is useful when deploying the container on a cloud platform that assigns the port.
    • default: 3000
  • ALGOLIA_APPID

    • The app ID: your unique Algolia application identifier.
    • required
    • Init time variable.
  • ALGOLIA_ADMIN_KEY

    • This is the Admin API key, used to write to the index from the backend.
    • required
    • Init time variable.
  • ALGOLIA_INDEX

    • This is the name of your index where you want to store data.
    • required
    • To give more flexibility, you can also pass a JSON body with an index property in your HTTP request, like so: { "index": "your_index_name" }; the value from the JSON takes priority (see the configuration sketch after this list). Run time variable.
    • However, for the moment, you have to create the index manually beforehand in the Algolia dashboard.
  • TARGET_ACCOUNT

    • The name of the target Instagram account.
    • required
    • Like the variable above, you can specify a targetAccount property in the JSON body; it has the same priority. Run time variable.
  • USER_ACCOUNT

    • The Instagram scraper needs an authenticated account, so username/password credentials have to be provided. This is the username.
    • required
    • Init time variable.
  • PASS_ACCOUNT

    • As mentioned above, username/password credentials are needed; this is the password.
    • required
    • Init time variable.
  • URL_TO_HIT

    • For whatever reason, you may want a specific URL to hit to trigger the scraping/storage.
    • default: /
    • Init time variable.
  • METHOD_TO_HIT

    • You might also want to trigger the pipeline with a specific HTTP method.
    • default: POST
    • Init time variable.
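
To illustrate how the init-time variables and the run-time JSON body fit together, here is a minimal, hypothetical sketch of per-request configuration resolution; the property names mirror the list above, but the helpers themselves are not taken from the actual source:

// Minimal sketch of per-request configuration resolution.
// The requireEnv/resolveConfig helpers are hypothetical illustrations.
interface RequestBody {
  index?: string;
  targetAccount?: string;
}

interface PipelineConfig {
  appId: string;
  adminKey: string;
  index: string;
  targetAccount: string;
  userAccount: string;
  passAccount: string;
}

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

function resolveConfig(body: RequestBody): PipelineConfig {
  return {
    // Init-time variables come from the environment only.
    appId: requireEnv("ALGOLIA_APPID"),
    adminKey: requireEnv("ALGOLIA_ADMIN_KEY"),
    userAccount: requireEnv("USER_ACCOUNT"),
    passAccount: requireEnv("PASS_ACCOUNT"),
    // Run-time variables: the JSON body takes priority over the environment.
    index: body.index ?? requireEnv("ALGOLIA_INDEX"),
    targetAccount: body.targetAccount ?? requireEnv("TARGET_ACCOUNT"),
  };
}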

In the cloud

This pipeline works very well on cloud infrastructure such as Heroku, using its container system.

Datasets

The dataset we receive from the Instagram scraper looks like the following (a TypeScript sketch of this shape is given after the example):

{
  "__typename": "GraphImage",
  "comments_disabled": false,
  "dimensions": {
    "height": 1350,
    "width": 1080
  },
  "display_url": "String",
  "edge_media_preview_like": {
    "count": 0
  },
  "edge_media_to_caption": {
    "edges": [
      {
        "node": {
          "text": "text"
        }
      }
    ]
  },
  "edge_media_to_comment": {
    "count": 0
  },
  "gating_info": null,
  "id": "String",
  "is_video": false,
  "location": {
    "address_json": "String",
    "has_public_page": true,
    "id": "758238628",
    "name": "name",
    "slug": "name-computer-readable"
  },
  "media_preview": "ACEqqGrkNqHH7w7Wbp04+vrn8OlVY13tgfU/QdTV4qSAf7oz19c/40pyaskOMb6sia0AXIYMwBOAODj0Of6VXAq8sZTcwwDyec9TjI9gSKoA804Sbun0CStZofiikzRWhmXJtlsohTl2GWb1I5x9O1SZLqCf4oxVW7UCQY5OeT/MD2GcfWrcZJ4IwO304z+X8sVzT3+R1LayJJvunHrWOZN53HAz1x0zWlOPlJB6dPwyf61kOS5L9yckDpV0+pnPZInxRTc0VsYmlHF5zFz06ClLKmFz8wPGT2H/ANapjxEMVy9wxM5yTwRWUlzG97HRzyLleVBBBIJ/n9QMYrC3dx0zVOVixJJJO48mtG45KnuUWnFWRnJ3I99FQ0VRJ//Z",
  "owner": {
    "id": "121212"
  },
  "shortcode": "1qsndqsd",
  "tags": ["String", "String", "String", "String"],
  "taken_at_timestamp": 1550826160,
  "thumbnail_resources": [
    {
      "config_height": 150,
      "config_width": 150,
      "src": "image-link"
    },
    {
      "config_height": 240,
      "config_width": 240,
      "src": "image-link"
    },
    {
      "config_height": 320,
      "config_width": 320,
      "src": "image-link"
    },
    {
      "config_height": 480,
      "config_width": 480,
      "src": "image-link"
    },
    {
      "config_height": 640,
      "config_width": 640,
      "src": "image-link"
    }
  ],
  "thumbnail_src": "images-link",
  "urls": ["image-link"],
  "username": "username"
}
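
For reference, the shape above could be described with a TypeScript type along these lines; the field names come from the sample, while the nullability of location is an assumption:

// Shape of a scraped post, derived from the sample above.
// The `| null` on `location` is an assumption (some posts may have no location).
interface ScrapedPost {
  __typename: string;
  comments_disabled: boolean;
  dimensions: { height: number; width: number };
  display_url: string;
  edge_media_preview_like: { count: number };
  edge_media_to_caption: { edges: { node: { text: string } }[] };
  edge_media_to_comment: { count: number };
  gating_info: unknown;
  id: string;
  is_video: boolean;
  location: {
    address_json: string;
    has_public_page: boolean;
    id: string;
    name: string;
    slug: string;
  } | null;
  media_preview: string;
  owner: { id: string };
  shortcode: string;
  tags: string[];
  taken_at_timestamp: number;
  thumbnail_resources: { config_height: number; config_width: number; src: string }[];
  thumbnail_src: string;
  urls: string[];
  username: string;
}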

However, we only store the following values in Algolia (a sketch of the mapping is shown after the list):

objectID
display_url
likes
texts
comments
address
nameLocation
slugLocation
idLocation
tags
taken_at_timestamp
shortcode
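
As a rough illustration, a scraped post could be mapped to such a record and pushed with the Algolia JavaScript client as sketched below; the exact field mapping is an assumption inferred from the sample post and the list above, not copied from the source:

// Minimal sketch, assuming the algoliasearch v4 JavaScript client.
// `ScrapedPost` is the type sketched in the Datasets section above.
// The field mapping is inferred from the sample post and the stored-field list.
import algoliasearch from "algoliasearch";

function toAlgoliaRecord(post: ScrapedPost) {
  return {
    objectID: post.id,
    display_url: post.display_url,
    likes: post.edge_media_preview_like.count,
    texts: post.edge_media_to_caption.edges.map((e) => e.node.text),
    comments: post.edge_media_to_comment.count,
    address: post.location?.address_json,
    nameLocation: post.location?.name,
    slugLocation: post.location?.slug,
    idLocation: post.location?.id,
    tags: post.tags,
    taken_at_timestamp: post.taken_at_timestamp,
    shortcode: post.shortcode,
  };
}

async function pushToAlgolia(posts: ScrapedPost[]): Promise<void> {
  const client = algoliasearch(
    process.env.ALGOLIA_APPID!,
    process.env.ALGOLIA_ADMIN_KEY!
  );
  const index = client.initIndex(process.env.ALGOLIA_INDEX!);
  await index.saveObjects(posts.map(toAlgoliaRecord));
}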

License

MIT
