This repository has been archived by the owner on Oct 21, 2020. It is now read-only.

Expression domain #16

Closed
alakel opened this issue Dec 2, 2014 · 14 comments

Comments

@alakel
Contributor

alakel commented Dec 2, 2014

Develop a heuristic manager and the first heuristic files for the first 10 content farms.
Facebook
Twitter
Linkedin
Slideshare
Wordpress
Viadeo
youtube
vimeo
dailymotion
blogger
typepad
pinterest
Scribd
google+
Tumblr
picasa
flickr

@DavidBruant
Contributor

Dear issue #16,

It's with emotion I cannot hide that I'm writing this to you.

I remember. Oh, I remember :-)
January 5th 2012, was it? I don't remember that accurately. But it was the first day. I remember distinctly the conversation with @alakel when you emerged.

We've been talking about you for so long, but never had the chance.

Your time has come.

@DavidBruant
Contributor

The plan is as follows:

  1. Work on issue #185 (Annotation on Expression domain) to create the database part of the operation. When a new expression domain is created, fetch its domain info.
  2. Generate the GEXF from the expression domains in the database (with the domain info as export fields).
  3. Make a directory of heuristics, one file per heuristic.
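
Step 2 of the plan could look roughly like the sketch below. It assumes a hypothetical `ExpressionDomainRecord` shape (the actual database columns are not given in this thread) and emits only nodes, using the GEXF 1.2 draft namespace. The names `escapeXml` and `toGexf` are illustrative, not the project's.

```typescript
// Hypothetical record shape; the real database columns are not given in this thread.
interface ExpressionDomainRecord {
  id: string;
  name: string;
  main_url: string;
  title: string;
}

// Escape the characters that are unsafe inside a double-quoted XML attribute.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a GEXF 1.2 document with one node per expression domain.
// The domain info is exported as node attributes; edges are left out of this sketch.
function toGexf(domains: ExpressionDomainRecord[]): string {
  const nodes = domains.map((d) =>
    [
      `      <node id="${escapeXml(d.id)}" label="${escapeXml(d.name)}">`,
      `        <attvalues>`,
      `          <attvalue for="0" value="${escapeXml(d.main_url)}"/>`,
      `          <attvalue for="1" value="${escapeXml(d.title)}"/>`,
      `        </attvalues>`,
      `      </node>`,
    ].join("\n")
  );
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">`,
    `  <graph defaultedgetype="directed">`,
    `    <attributes class="node">`,
    `      <attribute id="0" title="main_url" type="string"/>`,
    `      <attribute id="1" title="title" type="string"/>`,
    `    </attributes>`,
    `    <nodes>`,
    ...nodes,
    `    </nodes>`,
    `  </graph>`,
    `</gexf>`,
  ].join("\n");
}
```

Escaping matters here because main URLs routinely contain `&` in their query strings.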

A heuristic is composed of:

  • A set of hostnames to match exactly
  • A function that takes a URL with one of the matched hostnames and returns a promise for the expression domain (a string, likely a URL, but not necessarily)
  • A function that takes an expression domain string and returns a promise for meta information about the expression domain (main_url, title, description, keywords)
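
As a rough illustration of the three parts listed above, here is one way a heuristic and a minimal manager lookup might be typed. All names (`Heuristic`, `DomainInfo`, `findHeuristic`) and the `video.example` site are hypothetical; only the four meta fields come from this thread, and `keywords` being an array is an assumption.

```typescript
// Meta information about an expression domain (field names from the thread;
// keywords as a string array is an assumption).
interface DomainInfo {
  main_url: string;
  title: string;
  description: string;
  keywords: string[];
}

interface Heuristic {
  // Hostnames matched exactly (no wildcard or suffix matching).
  hostnames: Set<string>;
  // URL on a matched hostname -> promise for the expression domain string.
  getExpressionDomain(url: string): Promise<string>;
  // Expression domain string -> promise for its meta information.
  getDomainInfo(expressionDomain: string): Promise<DomainInfo>;
}

// Illustrative heuristic for a made-up site where the first path segment
// identifies the author: https://video.example/alice/clip-1 -> "video.example/alice".
const exampleHeuristic: Heuristic = {
  hostnames: new Set(["video.example", "www.video.example"]),
  async getExpressionDomain(url) {
    const { pathname } = new URL(url);
    const firstSegment = pathname.split("/").filter(Boolean)[0] ?? "";
    return `video.example/${firstSegment}`;
  },
  async getDomainInfo(expressionDomain) {
    // A real heuristic would fetch and parse a page here; this one returns placeholders.
    return {
      main_url: `https://${expressionDomain}`,
      title: expressionDomain,
      description: "",
      keywords: [],
    };
  },
};

// Minimal "manager": pick the heuristic whose hostname set contains the URL's hostname.
function findHeuristic(heuristics: Heuristic[], url: string): Heuristic | undefined {
  const { hostname } = new URL(url);
  return heuristics.find((h) => h.hostnames.has(hostname));
}
```

When `findHeuristic` returns `undefined`, a caller could fall back to treating the hostname itself as the expression domain.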

@DavidBruant
Contributor

Angel @oncletom shared something that'll save us time http://oembed.com/

@alakel
Contributor Author

alakel commented Sep 27, 2015

So what do we do? Use it, or extend it to a more general concept of heuristic?

@DavidBruant
Contributor

"Use it" (assuming you're talking about oembed), but in a way I don't think you've grasped yet. Oembed isn't a service, it's a standard other websites choose to adhere to.
For instance slideshare adheres to oembed. Example: http://www.slideshare.net/api/oembed/2?url=fr.slideshare.net/alakel/my-webintelligence-projectscope&format=json

So instead of having to reverse-engineer the HTML to find the author (which can be time consuming) and have a result that is brittle by design, we have direct access via oembed to the structured information we need (very specifically author_name for the expression domain and author_url for its main URL).
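
A minimal sketch of that mapping, assuming the response has already been fetched and parsed as JSON. `author_name` and `author_url` are optional fields in the oembed spec, so the function returns `undefined` when either is missing. The `ExpressionDomain` type and the function name are hypothetical.

```typescript
// Subset of the oembed response fields; only type and version are required by the spec.
interface OEmbedResponse {
  type: string;
  version: string;
  title?: string;
  author_name?: string;
  author_url?: string;
  provider_name?: string;
}

// Hypothetical shape for what this thread calls an expression domain.
interface ExpressionDomain {
  name: string;
  main_url: string;
}

// author_name becomes the expression domain, author_url its main URL.
function oembedToExpressionDomain(res: OEmbedResponse): ExpressionDomain | undefined {
  if (!res.author_name || !res.author_url) {
    return undefined; // provider did not expose author information
  }
  return { name: res.author_name, main_url: res.author_url };
}
```

Returning `undefined` rather than throwing lets the caller fall back to another heuristic for providers that omit the author fields.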

Unfortunately, not all websites we need expose oembed (Facebook doesn't, Twitter does, but only via API, so it's probably too complicated for now), so these will have to be done one by one (as planned anyway).

However, there is a good overlap between your list and the current list of oembed providers. Once I have written the code to transform one oembed output into what we call an expression domain, I can cover all the oembed providers in one go (because it's a standard).
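
Because the request format is also standardized (`url` and `format` query parameters), covering several providers mostly comes down to a table of endpoints. The sketch below assumes a hypothetical `providers` table. The SlideShare endpoint is the one from the example earlier in this thread, and the second entry is a placeholder, not a real provider.

```typescript
interface OEmbedProvider {
  hostnames: Set<string>;
  endpoint: string;
}

const providers: OEmbedProvider[] = [
  // Endpoint taken from the SlideShare example in this thread.
  {
    hostnames: new Set(["www.slideshare.net", "fr.slideshare.net"]),
    endpoint: "http://www.slideshare.net/api/oembed/2",
  },
  // Placeholder entry; a real table would list each provider's documented endpoint.
  {
    hostnames: new Set(["video.example"]),
    endpoint: "https://video.example/oembed",
  },
];

// Build the oembed request URL for a page, or undefined if no provider matches its hostname.
function buildOEmbedUrl(pageUrl: string): string | undefined {
  const { hostname } = new URL(pageUrl);
  const provider = providers.find((p) => p.hostnames.has(hostname));
  if (!provider) {
    return undefined;
  }
  const requestUrl = new URL(provider.endpoint);
  requestUrl.searchParams.set("url", pageUrl); // percent-encoded automatically
  requestUrl.searchParams.set("format", "json");
  return requestUrl.toString();
}
```

The resulting URL can then be fetched and its JSON passed to whatever code turns an oembed response into an expression domain.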

@DavidBruant
Contributor

For Facebook, it looks mostly impossible to find the expression domain of an event or the number of followers/friends because the HTML is very rough/minimalist (BigPipe, all that).
Alternatives are:

  • Don't do it (information loss)
  • Execute JS (Casper)
  • Use Facebook API (which brings more structured data but is a barrier to entry to installing MyWI)

For now, we're choosing to not do it. If it turns out this is not efficient enough, we'll revisit, probably to use the API.

@DavidBruant
Contributor

Facebook heuristic at #211.
The rest is coming.

@DavidBruant
Contributor

Twitter and Linkedin done at #215.
oembed (Slideshare, youtube, vimeo, dailymotion, etc.) is next.

@thom4parisot

The rest is coming.

@DavidBruant
Contributor

Nothing to do for Wordpress. Either people bought their own domain or use a *.wordpress.com. Either way, their hostname is their expression domain.
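
In code, "nothing to do" amounts to the default case: the expression domain is the URL's hostname. A tiny sketch, with a hypothetical function name:

```typescript
// Default case: the hostname itself is the expression domain,
// whether it is a custom domain or a *.wordpress.com subdomain.
function hostnameAsExpressionDomain(url: string): string {
  return new URL(url).hostname;
}
```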

@DavidBruant
Contributor

Blogger is gone and is now G+. Cool.

@thom4parisot

If you type in the main issue description you can create tickable checkboxes ;-)

- [ ] blah
- [ ] blah

@DavidBruant
Contributor

Let's consider this fixed after #221.

Done:

Facebook
Twitter
Linkedin
Slideshare
youtube
Wordpress (nothing to do)
Viadeo (minimalistic)
vimeo
dailymotion
blogger (nothing to do)
pinterest (minimalistic)
instagram
