Classify importance of Data (Characters, Houses,...) by Centrality measure #103
Comments
Is this a call for help/collaboration with another team, or is it an issue for Project_A?
I did something. I took the data from Guy's wiki scraper and generated some JSONs from it. I then matched this with the character list provided by and ordered by score. The code can be found here. It's not great, but it works. Maybe this helps someone...
It's a call for collab if someone has done something about it. But I think I described quite extensively how you could implement this, regardless of other people's work. So :D you know... :D
Well, we have just gathered some images for the characters; we created paths for and , which are the most important/popular. Those images can be found here:
@AlexBeischl that's still not really what I had in mind :) but again, a start:
I already asked on the Facebook group who wants to do this. So far, there are no volunteers.
Is this even a part of Project A?
Yeah, I don't think this is in the project scope of A, nor can it be done by the 25th. I vote to move this issue to 'someday'.
I actually think it is within the scope of A, but since this requirement showed up late in the game, we can defer it to the next version. In the meantime, please integrate the data from here: https://rostlab.org/~gyachdav/awoiaf/Data/pageRank/allchars.tgz. It's not perfect, but it at least gives a measure of which characters are more referenced than others. I believe the range is [1-300] (unknown to popular). All you need to do here is read the first item in the array, from which you pick up the page_name to identify the character, and assign that character the "score" value. We really need this popularity measure in, to be able to sort the characters by the most important ones; otherwise we run into a case where we show a bunch of negligible characters on the character portal.
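A minimal sketch of the step described above. The entry layout (arrays whose first item carries `page_name` and `score`) and the unpacked file name are assumptions based on this comment, not a confirmed format:

```javascript
// Hypothetical sketch: each entry is assumed to be an array whose first
// item carries page_name (identifies the character) and score.
function buildScoreMap(entries) {
  const scores = {};
  for (const entry of entries) {
    const first = entry[0]; // first item identifies the character
    if (first && first.page_name !== undefined) {
      scores[first.page_name] = first.score;
    }
  }
  return scores;
}

// Usage (assumed file name after unpacking the tgz):
// const fs = require('fs');
// const entries = JSON.parse(fs.readFileSync('./allchars.json', 'utf8'));
// const scores = buildScoreMap(entries);
```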
@gyachdav thank you. We will do that.
I have implemented the updating of the pageRanks. To just add the pageRanks and do nothing more, run: To create this JSON from the dir that Guy provided, run: So, @kordianbruck, please run
But this is not done with the fancy alg I suggested above, right @Adiolis? In that case, please leave this open with the sometime milestone and no assignee :P
@Adiolis can you please run quick stats on all characters and report min, max, median, mean, and stddev for "pageRank"? It will help with interpreting the level of importance for a character. A histogram of pageRank would also be very helpful. Thanks!
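The requested stats can be computed in a few lines of plain Node, no dependencies (a generic sketch, not tied to any repo code):

```javascript
// min, max, median, mean, and (population) stddev over an array of
// pageRank values.
function stats(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const median = n % 2
    ? sorted[(n - 1) / 2]
    : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
  const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  return { min: sorted[0], max: sorted[n - 1], median, mean, stddev: Math.sqrt(variance) };
}
```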
@gyachdav I used pagerank for PLOD; min = 0, max = 300. I normalized the values, and only around 300 characters have a normalized rank over 0.1.
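The normalization here is presumably plain min-max scaling of [0, 300] into [0, 1]; a sketch of that assumption:

```javascript
// Min-max normalization of raw pageRank values into [0, 1].
// The [0, 300] bounds come from the min/max reported above.
function normalize(ranks, min = 0, max = 300) {
  const out = {};
  for (const name in ranks) {
    out[name] = (ranks[name] - min) / (max - min);
  }
  return out;
}
```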
Thanks, can you post your normalized ranking here?
I'm basically interested in cases like this: https://got-api.bruck.me/api/characters/Tormund where the pageRank is rather low but the character still plays a prominent role in the show, to get tweets.
Here is my normalized pagerank. 60 is not that low normalized; everything above 0.1 is pretty much popular.
Thanks @Hack3l! @Adiolis see yourself as excused from this task 😄
Hi @Hack3l. I can't parse your file because some names are formatted like 'Name' and others are just Name. Can you do all "Name", like in the last file?
Oh yeah sorry, here is the right one.
thank you very much!
@Adiolis @boriside @kordianbruck @togiberlin can one of you update this also in your repo? Then I rescrape or whatevvah
I have already implemented the translation of Guy's data in his zip to our data/pageRanks.json, which has a different structure than Hack3l's JSON. The question now is who will generate the pageRanks in the future, and in which structure the information will be provided. I can remove my earlier waste of time and reimplement it to fill Hack3l's JSON. But later I won't do that anymore, so we need a final decision ;)
I know. That's why I suggested the algorithmic, scraped way from the beginning. Fast fixes are no good in the long term, and this is again just a fix. As far as the image story goes: that's definitely an indicator. Whether it makes sense to add this is up to you. You have complete freedom on this.
Pull and run for the normalized pageRanks.
If one needs a "translation" script for Hack3l's JSONs:

```js
var jsonfile = require('jsonfile');

const OLDFILE = './normalized.json';

// Convert { name: score, ... } into [{ name, score }, ...]
jsonfile.readFile(OLDFILE, function (err, data) {
  if (err) {
    return console.log(err);
  }
  var newFile = [];
  for (var name in data) {
    newFile.push({ "name": name, "score": data[name] });
  }
  jsonfile.writeFile('./data/normalizedPageRanks.json', newFile, { "spaces": 2 }, function (err) {
    if (err) {
      console.log(err);
    } else {
      console.log('Finished');
    }
  });
});
```
thx @Adiolis
Done - for now
Some characters are missing a pagerank value. In this case, just assign a low default value. If the character has an image stored, bump the default pagerank by 30%.
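A sketch of that rule. The default value and the field names (`pageRank`, `imageLink`) are assumptions for illustration, not the repo's actual schema:

```javascript
// Assumed low default for characters with no pagerank; adjust to taste.
const DEFAULT_RANK = 1;

// Fill in a default pageRank when missing; characters with a stored image
// get the default bumped by 30%.
function fillPageRank(character) {
  if (character.pageRank == null) {
    character.pageRank = character.imageLink ? DEFAULT_RANK * 1.3 : DEFAULT_RANK;
  }
  return character;
}
```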
Okay guys, let's solve this mess with the page rank. We currently have characters with page rank in the [0-1] range (normalized); other characters have page rank in the [1-300] range. This means that not all characters were updated with the normalized page rank. @sacdallago says this is because some characters are missing on @Hack3l; maybe so. What we need to do now is: for each character in the characters collection
Once this procedure is done, please discuss with @sacdallago how to migrate the data to the got.show mongo instance. I need a volunteer to see this through from start to finish and hammer this procedure into the filler code, so we won't have any mistakes the next time we refill the DB. Who will take care of this? @kordianbruck @Adiolis @boriside @togiberlin
Well, I guess returning to your old page rank is not much hassle. I just need to But having @Hack3l's list extended to also include the characters that are missing now is more interesting to me! Then someone would just need to replace the old file in the repo with the new one; I pull and run. P.S.: See @kordianbruck, I am fast enough 💃
Will I have it done by the time I have my cereal tomorrow morning? :D
That sounds familiar! @Hack3l ?
@gyachdav I am reverting the pageranks to the original ones. @Hack3l please look at https://github.com/Rostlab/JS16_ProjectA/blob/master/data/normalizedPageRanks.json and https://github.com/Rostlab/JS16_ProjectA/blob/master/data/pageRanks.json and assign values to the ones present in the second file but missing in the first. I counted, and we are talking 2234 vs 2020, so a total of 214 missing!
Now it contains all 2234 values.
I've updated data/normalizedPageRanks.json.
Thx guys, I'll do that soon. EDIT: done.
Hi @adiolis, @sacdallago just refilled the database with the original pageranks (range [1-300]). What is the procedure he needs to run in order to halve the page rank of characters with no image? I remember you already implemented this; I just don't remember what the command is. Thanks!
@gyachdav There is no special command to halve the rank of characters with no images. I had just commented the corresponding 3 lines out, because I thought this was no longer an issue with the normalized page ranks. Now I have put the lines back. Is there a need for such a command, or is this behavior always the same?
@sacdallago Have you updated the pageranks? According to gyachdav, the normalized ones should also be halved.
Please don't use the normalized values list. It is incomplete, and using it messed up our data. All that is left to do here is pull the changes @adiolis made in the code and then run the pageRank filler again.
Oh.. I forgot? Mmhh, well, I'll do it again just to be sure. You commented those three lines out, @adiolis, right? EDIT: Done ✔️
Hi there,
as mentioned around and in other teams, it would be extremely useful to have a measure of importance of a character, house, and everything that can be scraped from the wiki and has a unique link.
This can be done by creating a graph of in- and outgoing links via https://www.npmjs.com/package/ngraph.centrality and then storing, in separate values, the in-degree, out-degree, and betweenness centralities. Just be careful: the centralities are graph dependent. Thus you need to think about whether you want to have a mega-graph or separate graphs for characters, houses,... Both are correct; it's a matter of choice.
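The degree part can be sketched in plain JavaScript as a toy stand-in (the ngraph.centrality package computes these, plus betweenness, over a real graph structure; this just illustrates the idea):

```javascript
// In-/out-degree centrality over a directed link list of [from, to] pairs.
// Note: a duplicate A -> B link is counted twice here; whether that is
// desired ties into the multi-link question below.
function degreeCentrality(links) {
  const inDeg = {}, outDeg = {};
  for (const [from, to] of links) {
    outDeg[from] = (outDeg[from] || 0) + 1;
    inDeg[to] = (inDeg[to] || 0) + 1;
    if (!(from in inDeg)) inDeg[from] = 0;
    if (!(to in outDeg)) outDeg[to] = 0;
  }
  return { inDeg, outDeg };
}
```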
Also: think about how the package handles multiple links from A -> B.
Last but not least, some other teams have already been working on these things, so @Rostlab/js_cs_sose_2016_students please raise your voice. Until now we have wrongly been advertising this as page rank, but that's just one flavor of centrality measure.
EDIT: Centrality is: https://en.wikipedia.org/wiki/Centrality