Classify importance of Data (Characters, Houses,...) by Centrality measure #103

sacdallago · 2016-03-20T14:50:36Z

Hi there,

as mentioned around and in other teams, it would be extremely useful to have a measure of importance of a character, house, and everything that can be scraped from the wiki and has a unique link.

This can be done by creating a graph of in- and outgoing links via https://www.npmjs.com/package/ngraph.centrality and then storing in separate values the in-, out-, degree and betweenness centralities. Just be careful: the centralities are graph dependent. Thus you need to think about whether you want to have one mega-graph or separate graphs for characters, houses,... Both are correct, it's a matter of choice.
Also: think about how the package handles multiple links from A -> B.
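
To make the "multiple links from A -> B" question concrete, here is a minimal sketch in plain JavaScript. It does not use the ngraph.centrality API; the link list and the `degreeCentrality` helper are hypothetical and only illustrate how counting repeated links once or several times changes the in- and out-degree.

```javascript
// Hypothetical input: one [source, target] pair per wiki link found while scraping.
const links = [
  ['Jon Snow', 'Eddard Stark'],
  ['Jon Snow', 'Eddard Stark'], // repeated link A -> B
  ['Eddard Stark', 'House Stark'],
  ['Jon Snow', 'House Stark'],
];

// Count in- and out-degree per page. With dedupe = true, repeated links
// from the same source to the same target are counted only once.
function degreeCentrality(links, dedupe = true) {
  const seen = new Set();
  const result = new Map();
  const entry = (name) => {
    if (!result.has(name)) result.set(name, { in: 0, out: 0 });
    return result.get(name);
  };

  for (const [source, target] of links) {
    const key = JSON.stringify([source, target]);
    if (dedupe && seen.has(key)) continue;
    seen.add(key);
    entry(source).out += 1;
    entry(target).in += 1;
  }
  return result;
}

const centrality = degreeCentrality(links);
console.log(Object.fromEntries(centrality));
```

Calling `degreeCentrality(links, false)` instead keeps the duplicates, which inflates the score of pages that link to each other repeatedly. Which behaviour is wanted is a design choice, just like mega-graph versus separate graphs.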

Last but not least, some other teams have already been working on these things, so @Rostlab/js_cs_sose_2016_students please raise your voice. For now we have been wrongly advising this as page rank but that's just one flavor of centrality measure.

EDIT: Centrality is: https://en.wikipedia.org/wiki/Centrality

boriside · 2016-03-20T18:30:06Z

Is this a call for help/collaboration with another team, or is it an issue for Project_A?

kajo404 · 2016-03-20T19:33:02Z

I did something . I took the data from Guy's wiki scraper and generated some jsons from it. I then matched this with the character list provided by and ordered by score.

The code can be found here . It's not great but it works. Maybe this helps someone...

sacdallago · 2016-03-21T09:13:21Z

It's a call for collab if someone has done something about it. But I think I described quite extensively how you could implement this, regardless of other people's work. So :D you know... :D

AlexBeischl · 2016-03-21T16:40:38Z

Well, we have just gathered some images for the characters, we created paths for and which are most important/popular. Those images can be found here:
https://github.com/Rostlab/JS16_ProjectC_Group10/tree/develop/mockup/img/persons

sacdallago · 2016-03-21T21:44:02Z

@AlexBeischl that's still not really what I had in mind :) but again, a start:
@kordianbruck @Adiolis can you assign someone except you two or @togiberlin to take care of this one? Assign as in, this issue is assigned.

Legenzoo · 2016-03-22T17:26:59Z

I already asked in the Facebook group who wants to do this. So far, there are no volunteers.
I am really not able to do this as well. 😠
Should we really just assign someone, @sacdallago? 😄

Legenzoo · 2016-03-22T18:09:53Z

Is this even a part of Project A?

kordianbruck · 2016-03-22T18:30:56Z

Yea, I don't think this is in the project scope of A, nor can it be done by the 25th. I vote to move this issue to 'someday'

gyachdav · 2016-03-22T18:57:51Z

I actually think it is within the scope of A, but since this requirement showed up late in the game we can defer it to the next version.

In the meantime, please integrate the data from here https://rostlab.org/~gyachdav/awoiaf/Data/pageRank/allchars.tgz

It's not perfect but at least gives a measure of which characters are referenced more than others. I believe the range is [1-300] (unknown to popular). All you need to do here is read the first item in the array, pick up the page_name from it to identify the character, and assign that character the "score" value.
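
A minimal sketch of that extraction step, under a loud assumption: each per-character file is taken to parse to an array whose first item carries both `page_name` and `score`. The real layout of the files in the archive may differ, and the sample values below are made up for illustration.

```javascript
// ASSUMPTION: every parsed file is an array whose first item holds both
// `page_name` and `score`. Sample values are illustrative only.
const parsedFiles = [
  [{ page_name: 'Jon_Snow', score: 300 }, { page_name: 'ignored', score: 1 }],
  [{ page_name: 'Qyburn', score: 161 }],
  [], // an empty file is skipped
];

// Collect one { name, score } entry per file, taken from its first item.
function extractScores(files) {
  const scores = [];
  for (const items of files) {
    const first = Array.isArray(items) ? items[0] : undefined;
    if (!first || typeof first.score !== 'number') continue;
    scores.push({ name: first.page_name, score: first.score });
  }
  return scores;
}

const pageRanks = extractScores(parsedFiles);
console.log(pageRanks);
```

Files without a usable first item are skipped here rather than given a score; whether they should get a default instead is left open.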

We really need this popularity measure in, to be able to sort the characters by importance. Otherwise we run into a case where we show a bunch of negligible characters on the character portal.

Legenzoo · 2016-03-22T19:00:05Z

@gyachdav thank you. We will do that.

Legenzoo · 2016-03-23T17:35:45Z

I have implemented the updating of the pageRanks.
With every refill, update etc. of the characters the pageRanks are added.

To just add the pageRanks and do nothing more run: npm run updatePageRanks --update=characters
Then the pageRanks.json in dir data/ is added to the db.
With the --file=dir/file.json option, one can change the json file to use.

To create this json from the dir that Guy provided, run: npm run updatePageRanks --dir=PATHTODIR
With this, the many _data files are transformed into just one json that contains the names and scores of the characters.
With the --to=dir/file.json option, one can define to which json file the result is saved.

So, @kordianbruck, please run npm run updatePageRanks --update=characters on the public server.

sacdallago · 2016-03-23T20:48:28Z

But this is not done with the fancy alg I suggested above, right @Adiolis ? In that case please leave this open with the sometime milestone and no assignee :P

gyachdav · 2016-03-25T13:11:14Z

@Adiolis can you please run quick stats on all characters and report min, max, median, mean and stddev for "pageRank"? It will help with interpreting the level of importance of a character. A histogram of pageRank would also be very helpful. Thanks!
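
The requested statistics could be computed along these lines. This is a sketch only: `summarize` and `histogram` are hypothetical helper names, the sample numbers are not real pageRank values, and the standard deviation is the population variant (divide by n).

```javascript
// Summary statistics for a non-empty list of numbers.
function summarize(values) {
  if (values.length === 0) throw new Error('no values');
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation (divide by n, not n - 1).
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return { min: sorted[0], max: sorted[n - 1], median, mean, stddev: Math.sqrt(variance) };
}

// Fixed-width histogram: counts[i] covers [min + i*width, min + (i+1)*width),
// with the maximum value placed in the last bin.
function histogram(values, bins) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / bins || 1; // avoid a zero width if all values are equal
  const counts = new Array(bins).fill(0);
  for (const v of values) {
    const i = Math.min(bins - 1, Math.floor((v - min) / width));
    counts[i] += 1;
  }
  return counts;
}

// Illustrative numbers only, not real pageRank values.
const sample = [2, 4, 4, 4, 5, 5, 7, 9];
const summary = summarize(sample);
const counts = histogram(sample, 2);
console.log(summary, counts);
```

In practice `sample` would be replaced by the pageRank of every character in the collection.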

Hack3l · 2016-03-25T13:20:16Z

@gyachdav I used pagerank for PLOD: min = 0, max = 300. I normalized the values, and only around 300 characters have a normalized rank over 0.1.

gyachdav · 2016-03-25T13:23:21Z

thanks can you post here your normalized ranking?

gyachdav · 2016-03-25T13:29:20Z

I'm basically interested in cases like this https://got-api.bruck.me/api/characters/Tormund where the pageRank is rather low but the character still plays a prominent enough role in the show to get tweets.

Hack3l · 2016-03-25T13:43:35Z

Here is my normalized pagerank. And 60 is not that low: normalized, all above 0.1 are pretty much popular.
pagerank_normalized_json.txt

gyachdav · 2016-03-25T14:04:46Z

Thanks @Hack3l!

@Adiolis see yourself as excused from this task 😄

kajo404 · 2016-03-30T12:38:53Z

Hi @Hack3l. I can't parse your file because some names are formatted like 'Name' and others are just Name. Can you make them all "Name" like in the last file?

Hack3l · 2016-03-30T21:48:58Z

Oh yeah sorry, here is the right one.
pagerank_normalized_json.txt

kajo404 · 2016-03-31T20:20:41Z

thank you very much!

sacdallago · 2016-03-31T20:23:41Z

@Adiolis @boriside @kordianbruck @togiberlin can one of you update this also in your repo? Then I rescrape or whatevvah

Legenzoo · 2016-04-01T06:37:46Z

I have already implemented the translation of Guy's data in his zip to our data/pageRanks.json, which has a different structure than Hack3l's json. The question now is who will generate the pageRanks in the future and in which structure the information will be provided. I can remove my earlier waste of time and reimplement it to fill Hack3l's json. But later I won't do that anymore, so we need a final decision ;)
Does Hack3l's json contain all the pageRanks our data/pageRanks.json provides? Is there still a need to lower the pageRank if the character has no image?

sacdallago · 2016-04-01T09:41:18Z

I know. That's why I suggested the algorithmic, scraped way from the beginning. Fast fixes are no good in the long term, and this is again just a fix.
My suggestion is to do it algorithmically as in my first post.

As far as the image story goes: that's definitely an indicator. If it makes sense then to add this, that's up to you. You have complete freedom on this

Legenzoo · 2016-04-01T10:52:44Z

Pull and run for the normalized pageRanks.
npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

Legenzoo · 2016-04-01T10:58:55Z

If one needs a "translation" script for the jsons of hack3l:

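var jsonfile = require('jsonfile');

// Hack3l's file maps each character name to its normalized score.
const OLDFILE = './normalized.json';

jsonfile.readFile(OLDFILE, function (err, data) {
    if (err) {
        // Stop here if the source file cannot be read or parsed.
        console.log(err);
        return;
    }
    // Convert { name: score } into an array of { name, score } objects.
    var newFile = [];
    for (var name in data) {
        newFile.push({"name": name, "score": data[name]});
    }
    jsonfile.writeFile('./data/normalizedPageRanks.json', newFile, {"spaces": 2}, function (err) {
        if (err) {
            console.log(err);
        } else {
            console.log('Finished');
        }
    });
});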

sacdallago · 2016-04-01T11:18:55Z

thx @Adiolis

sacdallago · 2016-04-01T11:25:08Z

Done - for now

gyachdav · 2016-04-01T14:57:52Z

Some characters are missing a pagerank value. In this case just assign a low default value. If the character has an image stored, bump the default pagerank by 30%.

gyachdav · 2016-04-02T21:16:08Z

okay guys let's solve this mess with the page rank

We currently have characters with page rank in the [0-1] range (normalized)
e.g.
https://api.got.show/api/characters/Gregor "pageRank":0.84

other characters have page rank in the [1-300] range
e.g.
https://api.got.show/api/characters/Qyburn "pageRank":161

This means that not all characters were updated with the normalized page rank.

@sacdallago says that this is because some characters are missing from @Hack3l's list. Maybe so.

What we need to do now is:

For each character in the characters collection:

1. Assign the page rank from my original list.
2. If the character does not have a stored image -> pagerank = pagerank/2.
3. If the character does not have a page rank in my list, assign a low default value. If that character has an image stored, increase the default page rank by 30%.
4. Normalize the page rank value using the procedure in #103 (comment).
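
The assignment rules above (everything except the final normalization) can be sketched as follows. `DEFAULT_RANK`, the `name` and `hasImage` fields and the sample values are assumptions for illustration; the thread only says "a low default value" and does not name the fields used in the filler code.

```javascript
// ASSUMPTIONS: DEFAULT_RANK and the `name` / `hasImage` fields are
// illustrative; the thread only specifies "a low default value".
const DEFAULT_RANK = 1;

function adjustedPageRank(character, ranksByName) {
  const known = Object.prototype.hasOwnProperty.call(ranksByName, character.name);
  if (known) {
    const rank = ranksByName[character.name];
    // Listed characters without a stored image are halved.
    return character.hasImage ? rank : rank / 2;
  }
  // Unlisted characters get the default, bumped by 30% if an image exists.
  return character.hasImage ? DEFAULT_RANK * 1.3 : DEFAULT_RANK;
}

// Sample data for illustration only.
const ranksByName = { Qyburn: 161, Tormund: 60 };
const characters = [
  { name: 'Qyburn', hasImage: true },
  { name: 'Tormund', hasImage: false },
  { name: 'Unlisted A', hasImage: true },
  { name: 'Unlisted B', hasImage: false },
];

const adjusted = characters.map((c) => ({
  name: c.name,
  pageRank: adjustedPageRank(c, ranksByName),
}));
console.log(adjusted);
```

Keeping the rules in one pure function makes it easier to call the same logic from every refill path, so the ranks do not diverge again after the next refill.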

Once this procedure is done please discuss with @sacdallago how to migrate the data to the got.show mongo instance.

I need a volunteer to see this through from start to finish and hammer this procedure into the filler code so we won't have any mistakes the next time we refill the db.

Who will take care of this? @kordianbruck @Adiolis @boriside @togiberlin

sacdallago · 2016-04-02T21:20:45Z

Well, I guess returning to your old page rank is not much hassle. I just need to npm run updatePageRanks --update=characters I GUESS?!

But having @Hack3l's list extended to also include the characters that are missing now is more interesting to me!

Then someone would just need to replace the old file in the repo with the new one, I pull and run npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

P.S.: See @kordianbruck , I am fast enough 💃

gyachdav · 2016-04-02T21:22:35Z

will i have it done by the time I have my cereal tomorrow morning? :D

sacdallago · 2016-04-02T21:29:20Z

That sounds familiar! @Hack3l ?

sacdallago · 2016-04-02T22:02:42Z

@gyachdav I am reverting the pageranks to the original ones.

@Hack3l please look at https://github.com/Rostlab/JS16_ProjectA/blob/master/data/normalizedPageRanks.json and https://github.com/Rostlab/JS16_ProjectA/blob/master/data/pageRanks.json and assign values to the ones present in the second file but missing in the first.

I counted and we are talking 2234 vs 2020, so a total of 214 missing!
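
A quick way to list which names are affected, sketched under the assumption that both files hold arrays of `{ name, score }` objects (the shape the translation script earlier in this thread writes). The helper name and the inline sample data are hypothetical.

```javascript
// ASSUMPTION: both inputs are arrays of { name, score } objects.
// Returns the names present in `full` but absent from `partial`.
function missingNames(full, partial) {
  const present = new Set(partial.map((entry) => entry.name));
  return full
    .filter((entry) => !present.has(entry.name))
    .map((entry) => entry.name);
}

// Inline sample data for illustration; in practice these would be the
// parsed contents of data/pageRanks.json and data/normalizedPageRanks.json.
const original = [
  { name: 'Gregor', score: 250 },
  { name: 'Qyburn', score: 161 },
  { name: 'Tormund', score: 60 },
];
const normalizedList = [
  { name: 'Gregor', score: 0.84 },
  { name: 'Tormund', score: 0.2 },
];

const missing = missingNames(original, normalizedList);
console.log(missing);
```

The length of the returned list should match the difference counted above if the two files otherwise agree on names.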

Hack3l · 2016-04-03T11:34:38Z

Now it contains all 2234 values
If you want to normalize values when updating the database, it's just (value - min)/(max - min)
normalizedRanks.json.txt
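
The (value - min)/(max - min) formula can be sketched as below. The `{ name, score }` entry shape and the guard for the case where all scores are equal are assumptions added for illustration.

```javascript
// Min-max normalization: maps the smallest score to 0 and the largest to 1.
// ASSUMPTION: entries are { name, score } objects.
function normalize(entries) {
  const scores = entries.map((e) => e.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  return entries.map((e) => ({
    name: e.name,
    // If all scores are equal the formula would divide by zero; use 0 instead.
    score: range === 0 ? 0 : (e.score - min) / range,
  }));
}

const normalized = normalize([
  { name: 'A', score: 0 },
  { name: 'B', score: 150 },
  { name: 'C', score: 300 },
]);
console.log(normalized);
```

Because min and max are taken from the list itself, every character has to be in the list before normalizing; adding entries later can shift all normalized values.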

Legenzoo · 2016-04-03T11:45:05Z

I've updated data/normalizedPageRanks.json.
@sacdallago Pull and run npm run updatePageRanks --update=characters --file=data/normalizedPageRanks.json

sacdallago · 2016-04-03T12:25:10Z

Thx guys, I'll do that soon

EDIT: done.

gyachdav · 2016-04-03T20:38:14Z

hi @adiolis,

@sacdallago just refilled the database with the original pageranks (range [1-300])

What is the procedure he needs to run in order to halve the page rank of characters with no image?

I remember you already implemented this, I just don't remember what the command is.

Thanks!

Legenzoo · 2016-04-04T06:37:42Z

@gyachdav There is no special command to halve the rank of characters with no images. I had just commented out the corresponding 3 lines, because I thought this was no longer an issue with the normalized page ranks. Now I have put the lines back.

Is there a need for such a command, or is this behavior always the same?

Legenzoo · 2016-04-04T17:06:29Z

@sacdallago Have you updated the pageranks? According to gyachdav the normalized ones should also be halved.

gyachdav · 2016-04-04T17:13:17Z

Please don't use the normalized values list. It is incomplete and using it messed up our data. All that is left to do here is just pull the changes @adiolis made in the code and then run the pageRank filler again.

sacdallago · 2016-04-04T19:23:31Z

Oh.. I forgot? Mmhh well, I'll do it again just to be sure. You commented those three lines out @adiolis, right?

EDIT Done ✔️

sacdallago added the feature label Mar 21, 2016

sacdallago added this to the v0.1.0 Feature freeze milestone Mar 21, 2016

Legenzoo mentioned this issue Mar 22, 2016

Status update #107

Closed

kordianbruck modified the milestones: Sometime, v0.1.0 Feature freeze Mar 22, 2016

Legenzoo self-assigned this Mar 23, 2016

Legenzoo added a commit that referenced this issue Mar 23, 2016

#103 script to add the page ranks to the characters

d26d73d

Legenzoo added a commit that referenced this issue Mar 23, 2016

#103

e6075f7

Legenzoo closed this as completed Mar 23, 2016

Legenzoo reopened this Mar 23, 2016

Legenzoo removed their assignment Mar 23, 2016

gyachdav assigned Legenzoo Mar 26, 2016

Legenzoo added a commit that referenced this issue Apr 1, 2016

#103

b0b5d13

Legenzoo added a commit that referenced this issue Apr 1, 2016

#103

b21a7ba

sacdallago unassigned Legenzoo Apr 1, 2016

gyachdav mentioned this issue Apr 1, 2016

Sorting characters by popularity is strange and contains houses Rostlab/JS16_ProjectF#224

Closed

gyachdav added the high priority label Apr 2, 2016

Legenzoo added a commit that referenced this issue Apr 4, 2016

#103

123bbf9
