This repository has been archived by the owner on Apr 15, 2019. It is now read-only.

Classify importance of Data (Characters, Houses,...) by Centrality measure #103

Open
sacdallago opened this issue Mar 20, 2016 · 54 comments

Comments

@sacdallago
Contributor

Hi there,

As mentioned here and in other teams, it would be extremely useful to have a measure of importance for characters, houses, and everything else that can be scraped from the wiki and has a unique link.

This can be done by creating a graph of incoming and outgoing links via https://www.npmjs.com/package/ngraph.centrality and then storing the in-, out-, degree and betweenness centralities as separate values. Just be careful: the centralities are graph dependent. So you need to decide whether you want one mega-graph or separate graphs for characters, houses,... Both are correct, it's a matter of choice.
Also: think about how the package handles multiple links from A -> B.

Last but not least, some other teams have already been working on these things, so @Rostlab/js_cs_sose_2016_students please raise your voice. Until now we have been wrongly referring to this as page rank, but that's just one flavor of centrality measure.

EDIT: Centrality is: https://en.wikipedia.org/wiki/Centrality
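
A minimal sketch of the in-, out- and total degree part of this idea, in plain JavaScript. It does not use ngraph.centrality; the link list, the page names and the `degreeCentrality` helper are made up for illustration. It assumes that repeated links from A -> B should count as a single edge, which is exactly the choice raised above, and it leaves betweenness out.

```javascript
// Hypothetical link list: each entry is [fromPage, toPage] as scraped from the wiki.
const links = [
  ['Jon Snow', 'House Stark'],
  ['Jon Snow', 'House Stark'], // repeated link A -> B
  ['Arya Stark', 'House Stark'],
  ['House Stark', 'Jon Snow'],
];

// Count in-, out- and total degree per page.
// Assumption: repeated A -> B links are collapsed into one edge.
function degreeCentrality(linkList) {
  const seen = new Set();
  const result = {};
  const node = (id) => (result[id] = result[id] || { in: 0, out: 0, degree: 0 });

  for (const [from, to] of linkList) {
    const key = JSON.stringify([from, to]);
    if (seen.has(key)) continue; // skip duplicate edges
    seen.add(key);
    node(from).out += 1;
    node(to).in += 1;
  }
  for (const id of Object.keys(result)) {
    result[id].degree = result[id].in + result[id].out;
  }
  return result;
}

const centrality = degreeCentrality(links);
console.log(centrality);
```

Running this once on a mega-graph and once per category (characters only, houses only) would show how much the scores depend on the graph you pick.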

@boriside
Collaborator

Is this a call for help/collaboration with another team, or is it an issue for Project_A?

@kajo404
Collaborator

kajo404 commented Mar 20, 2016

I did something . I took the data from Guy's wiki scraper and generated some jsons from it. I then matched this with the character list provided by and ordered by score.

The code can be found here . It's not great but it works. Maybe this helps someone...

@sacdallago
Contributor Author

It's a call for collab if someone has done something about it. But I think I described quite extensively how you could implement this, regardless of other people's work. So :D you know... :D

@AlexBeischl
Collaborator

Well, we have just gathered some images for the characters, we created paths for and which are most important/popular. Those images can be found here:
https://github.com/Rostlab/JS16_ProjectC_Group10/tree/develop/mockup/img/persons

@sacdallago
Contributor Author

@AlexBeischl that's still not really what I had in mind :) but again, a start:
@kordianbruck @Adiolis can you assign someone except you two or @togiberlin to take care of this one? Assign as in, this issue is assigned.

@Legenzoo
Collaborator

I already asked in the Facebook group who wants to do this. So far there are no volunteers.
I am really not able to do this as well. 😠
Should we really just assign someone, @sacdallago? 😄

@Legenzoo Legenzoo mentioned this issue Mar 22, 2016
@Legenzoo
Collaborator

Is this even a part of Project A?

@kordianbruck
Collaborator

Yeah, I don't think this is in the project scope of A, nor can it be done by the 25th. I vote to move this issue to 'someday'.

@gyachdav
Collaborator

I actually think it is within the scope of A, but since this requirement showed up late in the game we can defer it to the next version.

In the meantime, please integrate the data from here https://rostlab.org/~gyachdav/awoiaf/Data/pageRank/allchars.tgz

It's not perfect, but it at least gives a measure of which characters are referenced more than others. I believe the range is [1-300] (unknown to popular). All you need to do here is read the first item in the array, pick up the page_name from it to identify the character, and assign that character the "score" value.

We really need this popularity measure in, to be able to sort the characters by the most important ones. Otherwise we run into a case where we show a bunch of negligible characters on the character portal.

@Legenzoo
Collaborator

@gyachdav thank you. We will do that.

@Legenzoo
Collaborator

I have implemented the updating of the pageRanks.
With every refill, update etc. of the characters the pageRanks are added.

To just add the pageRanks and do nothing more run: npm run updatePageRanks --update=characters
Then the pageRanks.json in dir data/ is added to the db.
With the --file=dir/file.json option, one can change the json file to use.

To create this json from the dir that guy provided run: npm run updatePageRanks --dir=PATHTODIR
With this the many _data files are transformed to just one json, that contains the names and scores of the characters.
With the --to=dir/file.json option, one can define to which json file the result is saved.

So, @kordianbruck, please run npm run updatePageRanks --update=characters on the public server.

Legenzoo added a commit that referenced this issue Mar 23, 2016
@sacdallago
Contributor Author

but this is not done with the fancy alg I suggested above, right @Adiolis ? In that case please leave this open with the sometime milestone and no assignation :P

@Legenzoo Legenzoo reopened this Mar 23, 2016
@Legenzoo Legenzoo removed their assignment Mar 23, 2016
@gyachdav
Collaborator

@Adiolis can you please run quick stats on all characters and report min, max, median, mean and stddev for "pageRank"? It will help with interpreting the level of importance of a character. A histogram of pageRank would also be very helpful. Thanks!
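
The requested statistics could be computed roughly like this. It is only a sketch: `summarize`, `histogram` and the sample values are hypothetical, the standard deviation is the population one (dividing by n), and the histogram uses equal-width bins over a range you pass in.

```javascript
// Summary statistics for a list of pageRank values.
// Assumption: population standard deviation (divide by n, not n - 1).
function summarize(values) {
  if (values.length === 0) throw new Error('no values to summarize');
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return { min: sorted[0], max: sorted[n - 1], median, mean, stddev: Math.sqrt(variance) };
}

// Equal-width histogram over [lo, hi]; values outside are clamped to the edge bins.
function histogram(values, binCount, lo, hi) {
  const bins = new Array(binCount).fill(0);
  const width = (hi - lo) / binCount;
  for (const v of values) {
    const index = Math.min(binCount - 1, Math.max(0, Math.floor((v - lo) / width)));
    bins[index] += 1;
  }
  return bins;
}

// Made-up sample values, not real data from the database.
const samplePageRanks = [1, 5, 12, 60, 161, 300];
console.log(summarize(samplePageRanks));
console.log(histogram(samplePageRanks, 3, 0, 300));
```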

@Hack3l
Collaborator

Hack3l commented Mar 25, 2016

@gyachdav I used pagerank for PLOD: min = 0, max = 300. I normalized the values, and only around 300 characters have a normalized rank above 0.1.

@gyachdav
Collaborator

Thanks, can you post your normalized ranking here?

@gyachdav
Collaborator

I'm basically interested in cases like this https://got-api.bruck.me/api/characters/Tormund where the pageRank is rather low but the character still plays a prominent enough role in the show to get tweets.

@Hack3l
Collaborator

Hack3l commented Mar 25, 2016

Here is my normalized pagerank. 60 is not that low: once normalized, everything above 0.1 is pretty popular.
pagerank_normalized_json.txt

@gyachdav
Collaborator

Thanks @Hack3l!

@Adiolis see yourself as excused from this task 😄

@kajo404
Collaborator

kajo404 commented Mar 30, 2016

Hi @Hack3l. I can't parse your file because some names are formatted like 'Name' and others are just Name. Can you format them all as "Name", like in the last file?

@Hack3l
Collaborator

Hack3l commented Mar 30, 2016

Oh yeah sorry, here is the right one.
pagerank_normalized_json.txt

@kajo404
Collaborator

kajo404 commented Mar 31, 2016

thank you very much!

@sacdallago
Contributor Author

@Adiolis @boriside @kordianbruck @togiberlin can one of you update this also in your repo? Then I rescrape or whatevvah

@Legenzoo
Collaborator

Legenzoo commented Apr 1, 2016

I have already implemented the translation of Guy's data in his zip to our data/pageRanks.json, which has a different structure than Hack3l's json. The question now is who will generate the pageRanks in the future and in which structure the information will be provided. I can remove my earlier waste of time and reimplement it to use Hack3l's json. But I won't do that again later, so we need a final decision ;)
Does Hack3l's json contain all the pageRanks our data/pageRank provides? Is there still a need to lower the pageRank if the character has no image?

@sacdallago
Contributor Author

I know. That's why I suggested the algorithmic, scraped way from the beginning. Quick fixes are no good in the long term, and this is again just a fix.
My suggestion is to do it algorithmically as in my first post.

As far as the image story goes: that's definitely an indicator. Whether it makes sense to add this is up to you. You have complete freedom on this.

Legenzoo added a commit that referenced this issue Apr 1, 2016
Legenzoo added a commit that referenced this issue Apr 1, 2016
@Legenzoo
Collaborator

Legenzoo commented Apr 1, 2016

Pull and run for the normalized pageRanks.
npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

@Legenzoo
Collaborator

Legenzoo commented Apr 1, 2016

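If anyone needs a "translation" script for Hack3l's jsons:

// Convert Hack3l's { name: score } JSON into the [{ name, score }] array
// stored in data/normalizedPageRanks.json.
var jsonfile = require('jsonfile');
const OLDFILE = './normalized.json';
jsonfile.readFile(OLDFILE, function (err, data) {
    if (err) {
        // Stop early if the source file is missing or is not valid JSON.
        console.log(err);
        return;
    }
    var newFile = [];
    for (var name in data) {
        newFile.push({"name": name, "score": data[name]});
    }
    jsonfile.writeFile('./data/normalizedPageRanks.json', newFile, {"spaces": 2}, function (err) {
        if (err) {
            console.log(err);
        } else {
            console.log('Finished');
        }
    });
});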

@sacdallago
Contributor Author

thx @Adiolis

@sacdallago
Contributor Author

Done - for now

@gyachdav
Collaborator

gyachdav commented Apr 1, 2016

Some characters are missing a pagerank value. In this case just assign a low default value. If the character has an image stored, bump the default pagerank by 30%.

@gyachdav
Collaborator

gyachdav commented Apr 2, 2016

okay guys let's solve this mess with the page rank

We currently have characters with page rank in the [0-1] range (normalized)
e.g.
https://api.got.show/api/characters/Gregor "pageRank":0.84

other characters have page rank in the [1-300] range
e.g.
https://api.got.show/api/characters/Qyburn "pageRank":161

This means that not all characters were updated with the normalized page rank.

@sacdallago says that this is because some characters are missing from @Hack3l's list. Maybe so.

What we need to do now is:

for each character in the characters collection

  1. assign the page rank from my original list.
  2. if character does not have a stored image -> pagerank=pagerank/2
  3. if character does not have a page rank in my list, assign a low default value; if that character has an image stored, increase the default page rank by 30%
  4. normalize the page rank value using the procedure from this comment: #103 (comment)

Once this procedure is done please discuss with @sacdallago how to migrate the data to the got.show mongo instance.

I need a volunteer to see this through from start to finish and hammer this procedure into the filler code so we won't have any mistakes the next time we refill the db.

Who will take care of this? @kordianbruck @Adiolis @boriside @togiberlin
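
The four steps listed above could be sketched like this. This is not the filler code from the repo: `assignPageRanks`, the `hasImage` flag, the `rankByName` lookup and the default value of 1 are all hypothetical. It also assumes that the halving in step 2 applies only to characters found in the original list, and that step 4 means min-max scaling to [0, 1].

```javascript
// Hypothetical low default for characters that are missing from the original list.
const DEFAULT_RANK = 1;

// characters: [{ name, hasImage }], rankByName: { name: originalPageRank }
function assignPageRanks(characters, rankByName, defaultRank = DEFAULT_RANK) {
  const raw = characters.map((c) => {
    let rank;
    if (Object.prototype.hasOwnProperty.call(rankByName, c.name)) {
      rank = rankByName[c.name];            // step 1: rank from the original list
      if (!c.hasImage) rank = rank / 2;     // step 2: halve when no image is stored
    } else {
      rank = defaultRank;                   // step 3: low default value...
      if (c.hasImage) rank = rank * 1.3;    // ...bumped by 30% when an image is stored
    }
    return { name: c.name, pageRank: rank };
  });

  // Step 4 (assumed to be min-max scaling): map all ranks to [0, 1].
  const ranks = raw.map((r) => r.pageRank);
  const min = Math.min(...ranks);
  const max = Math.max(...ranks);
  const span = max - min;
  return raw.map((r) => ({
    name: r.name,
    pageRank: span === 0 ? 0 : (r.pageRank - min) / span,
  }));
}

// Made-up example input.
const example = assignPageRanks(
  [
    { name: 'A', hasImage: true },
    { name: 'B', hasImage: false },
    { name: 'C', hasImage: true },
    { name: 'D', hasImage: false },
  ],
  { A: 300, B: 100 }
);
console.log(example);
```

Doing the normalization last, over all characters at once, keeps every value on the same [0, 1] scale, which avoids the mix of ranges described above.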

@sacdallago
Contributor Author

Well, I guess returning to your old page rank is not much hassle. I just need to npm run updatePageRanks --update=characters I GUESS?!

But having @Hack3l's list extended to also include the characters that are missing now is more interesting to me!

Then someone would just need to replace the old file in the repo with the new one, I pull and run npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

P.S.: See @kordianbruck , I am fast enough 💃

@gyachdav
Collaborator

gyachdav commented Apr 2, 2016

will i have it done by the time I have my cereal tomorrow morning? :D


@sacdallago
Contributor Author

That sounds familiar! @Hack3l ?

@sacdallago
Contributor Author

@gyachdav I am reverting the pageranks to the original ones.

@Hack3l please look at https://github.com/Rostlab/JS16_ProjectA/blob/master/data/normalizedPageRanks.json and https://github.com/Rostlab/JS16_ProjectA/blob/master/data/pageRanks.json and assign values to the ones present in the second file but missing in the first.

I counted and we are talking 2234 vs 2020, so a total of 214 missing!
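
A quick way to list the missing entries, as a sketch only. It assumes both files are arrays of `{ name, score }` objects (the shape the translation script in this thread writes for normalizedPageRanks.json; the shape of pageRanks.json is an assumption), and `missingNames` plus the sample arrays are made up.

```javascript
// Names that appear in fullList but not in partialList.
// Assumption: both lists are arrays of { name, score } objects.
function missingNames(fullList, partialList) {
  const present = new Set(partialList.map((entry) => entry.name));
  return fullList
    .filter((entry) => !present.has(entry.name))
    .map((entry) => entry.name);
}

// Made-up stand-ins for data/pageRanks.json and data/normalizedPageRanks.json.
const allRanks = [
  { name: 'Gregor', score: 250 },
  { name: 'Qyburn', score: 161 },
  { name: 'Tormund', score: 60 },
];
const normalizedRanks = [
  { name: 'Gregor', score: 0.84 },
  { name: 'Tormund', score: 0.2 },
];

console.log(missingNames(allRanks, normalizedRanks));
```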

@Hack3l
Collaborator

Hack3l commented Apr 3, 2016

Now it contains all 2234 values
If you want to normalize values when updating the database, it's just (value - min)/(max - min)
normalizedRanks.json.txt
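
The formula above as a small sketch. It assumes the input is an object mapping character names to raw scores; `minMaxNormalize` and the sample values are hypothetical, and mapping everything to 0 when all values are equal is a choice made here to avoid dividing by zero.

```javascript
// Min-max normalization: (value - min) / (max - min), mapping scores to [0, 1].
// Assumption: scores is a { name: rawScore } object.
function minMaxNormalize(scores) {
  const values = Object.values(scores);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min;
  const normalized = {};
  for (const [name, value] of Object.entries(scores)) {
    // When every score is identical, span is 0; fall back to 0 for all entries.
    normalized[name] = span === 0 ? 0 : (value - min) / span;
  }
  return normalized;
}

// Made-up example values.
console.log(minMaxNormalize({ a: 0, b: 150, c: 300 }));
```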

@Legenzoo
Collaborator

Legenzoo commented Apr 3, 2016

I've updated data/normalizedPageRanks.json.
@sacdallago Pull and run npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

@sacdallago
Contributor Author

Thx guys, I'll do that soon

EDIT: done.

@gyachdav
Collaborator

gyachdav commented Apr 3, 2016

hi @adiolis,

@sacdallago just refilled the database with the original pageranks (range [1-300])

What is the procedure he needs to run in order to halve the page rank of characters with no image?

I remember you already implemented this, I just don't remember what the command is.

Thanks!

Legenzoo added a commit that referenced this issue Apr 4, 2016
@Legenzoo
Collaborator

Legenzoo commented Apr 4, 2016

@gyachdav There is no special command to halve the rank of characters with no images. I had just commented out the corresponding 3 lines, because I thought this was no longer an issue with the normalized page ranks. Now I have put the lines back.

Is there a need for such a command, or is this behavior always the same?

@Legenzoo
Collaborator

Legenzoo commented Apr 4, 2016

@sacdallago Have you updated the pageranks? According to gyachdav the normalized ones should also be halved.

@gyachdav
Collaborator

gyachdav commented Apr 4, 2016

Please don't use the normalized values list. It is incomplete and using it messed up our data. All that is left to do here is pull the changes @adiolis made in the code and then run the pageRank filler again.

@sacdallago
Contributor Author

Oh.. I forgot? Mmhh well, I'll do it again just to be sure. You commented those three lines out @adiolis right?

EDIT Done ✔️


8 participants