Recommender System with StackExchange data

QuietOne edited this page Jun 23, 2015 · 11 revisions

##The problem

For any given tag, recommend to the user questions that they may find amusing to read.

##The ideas for solution

Questions can be recommended based on:

  1. their properties (answer_count, is_answered, score, last_activity_date...)
  2. the person who posted them; in particular, a person who is, by some metric, important for the given tag
  3. some centrality measure of the question for the given tag.

The implemented recommender system is a hybrid of these three options.

##Data used for creating graph

###Nodes

####Tag node

####Question node

A question node keeps data relevant for the corresponding question, such as link, score, view_count... The unique property for questions is question_id, assigned by the StackExchange API; it is used for getting more data about any particular question.

###Relationships

####models.RelatedTagsEdge

####models.SynonymTagsEdge

####models.BelongsEdge

It connects a tag and a question to represent that the question belongs to the given tag.

####models.FAQEdge

It connects a tag and a question to represent that the question is one of the frequently asked questions for the given tag.

####models.RelatedQuestionsEdge

It connects two questions that are related.

####models.PMITagsEdge

It represents the PMI connection between two tags. Unlike the other relationships, it is not based on data obtained via the StackExchange API but on the PMI metric. Its main property is weight, which holds the PMI value between the two tags.

##Used metrics

###Pointwise mutual information

##Procedure for recommending question

###1. Getting data from StackExchange API

  1. Add the specified tag (i.e., the tag requested by the user) to the graph
  2. Get the synonym tags for the specified tag
  3. Add the synonym tags to the graph
  4. For each tag in the cluster formed around the specified tag (the cluster comprises the specified tag and its synonym tags)
    1. Get the FAQ for the tag
    2. Add the FAQ questions to the graph; connect them with the tag via the appropriate relationships
    3. Get a certain number of randomly selected questions related to that tag (the selection is not really random, but since the order in which StackExchange returns questions is unknown, we can treat them as random). The number of questions can be set arbitrarily at the beginning of the process to obtain different results. In its current setting, this recommender system retrieves 100 questions, as that is the maximum number of questions returned by one StackExchange API call.
    4. Add the questions to the graph and connect them with the tag via the models.BelongsEdge relationship; add additional tags if needed (a question can belong to more than one tag)
    5. For each of those "randomly" selected questions
      1. Get from the graph the tags it belongs to (in the following text they are called R-tags)
      2. Calculate the PMI metric between the specified tag and the R-tags identified in the previous step, and set the corresponding edges between them
      3. Get the FAQ for the R-tags from StackExchange and add them to the graph
      4. Get "random" questions for the R-tags from StackExchange and add them to the graph
  5. For all questions in the graph, get related questions from the StackExchange API (this provides additional questions if there aren't enough questions that can be recommended)
  6. Add the retrieved questions to the graph
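
The PMI calculation between the specified tag and the R-tags can be sketched as follows. This page does not state how the probabilities are estimated, so the sketch assumes the standard definition PMI(x, y) = log(p(x, y) / (p(x) * p(y))), with each probability estimated as a fraction of the sampled questions. The function name `pmi_between_tags`, the input shape (one tag list per question), and the base-2 logarithm are illustrative assumptions, not part of the project's code.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_between_tags(questions):
    """Return {(tag_a, tag_b): PMI} for every tag pair that co-occurs.

    `questions` is an iterable of tag lists, one list per question
    (hypothetical input shape). Probabilities are estimated as
    fractions of the question sample.
    """
    tag_sets = [set(tags) for tags in questions]
    total = len(tag_sets)

    # How many questions carry each tag, and each (sorted) pair of tags.
    tag_counts = Counter(tag for tags in tag_sets for tag in tags)
    pair_counts = Counter(
        pair for tags in tag_sets for pair in combinations(sorted(tags), 2)
    )

    pmi = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / total
        p_a = tag_counts[a] / total
        p_b = tag_counts[b] / total
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


sample = [["a", "b"], ["a", "b"], ["c"], ["c"]]
weights = pmi_between_tags(sample)
```

Each resulting value could be stored as the weight of a models.PMITagsEdge between the two tags. Pairs that never co-occur get no entry, since their PMI would be negative infinity.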

###2. Calculating questions relevancy

  1. Setting weights for the question's properties. Weights are constants that define the relevancy of question properties such as score, answer_count, is_answered... In this recommender system, the weights used are:

    • weightScore = 0.8 (based on intuition)
    • weightAnswerCount = 0.3 (based on intuition)
    • weightIsAnswered = 1 (based on intuition)
    • weightViewCount = 0.8 (setting higher value for this weight results in getting older questions, sometimes obsolete questions)
    • weightCreationDate = 0.03 (high value results in disregarding older questions)
    • weightLastActivityDate = 0.03 (high value results in disregarding older questions)
    • weightBelongs = 1 (adding additional value for those questions that belong to specified tag)

    The weight for belonging to the specified tag may seem like overhead, but without it the recommender system can recommend questions that are only loosely related to the specified tag when that tag is not popular in the community (the score of its questions is low, or there aren't many questions with that tag). Loosely related questions are not excluded outright: if there are few or no questions with the given tag, the recommender system would not return enough questions; and, in the bigger picture, a recommender system should return not only the questions related to the specified tag but also those that are less related yet still potentially interesting.

  2. Setting penalties for questions that are distant from the specified tag. This serves the same purpose as the weight for belonging to the specified tag. The penalty is named alpha and its value must be in the interval [0, 1]. A value of 0 means that questions that do not belong to the specified tag will not be included (not recommended, as stated before), and 1 means that there will be no penalty for distant questions in the graph (also not recommended). This value is currently set to 0.05, as it gave the best results in the experiments.

  3. Getting all questions reachable from the specified tag within the specified path length (all kinds of connections are followed in this search)

  4. Normalization of the question attribute values. For each attribute, this is done by dividing the attribute's value by the maximum value of that attribute observed among the questions that were taken into consideration

  5. Calculating the value of a question. This is done with the following formula:

value = alpha^distance_from_tag * ((answer_count * weightAnswerCount) + (is_answered * weightIsAnswered) + (view_count * weightViewCount) + (score * weightScore) + (belongs * weightBelongs) + (creation_date * weightCreationDate) + (last_activity_date * weightLastActivityDate))

  6. Calculating and adding the declustering value. The idea of declustering is to find questions that are not obvious: questions that are not linked to the specified tag but are similar and can open new possibilities to the user. For more information see: Auralist: Introducing Serendipity into Music Recommendation.
    1. Computing tag clusters based on the PMI or some other similarity metric (e.g., Jaccard index); the computation procedure is described in the referenced paper.
    2. Adding a declustering bonus value to each question using the declustering value of each tag the question is associated with. The declustering value of a tag, computed in the previous step, reflects how well that tag is connected to the other tags in its cluster.
  7. Sorting questions by value
  8. Returning the specified number of sorted questions
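
The search, normalization, valuation, and sorting steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's implementation: the graph is a plain adjacency dict, questions are dicts with numeric attributes (dates as Unix timestamps, is_answered and belongs as 0/1), and distance_from_tag is taken as the hop count minus one, so that questions attached directly to the specified tag have distance 0 (consistent with alpha = 0 keeping only those questions). The names `distances_from`, `normalize`, and `recommend` are hypothetical. The declustering bonus is omitted because its procedure is defined in the referenced paper.

```python
from collections import deque

# Weights and penalty as listed in the text above.
WEIGHTS = {
    "score": 0.8,
    "answer_count": 0.3,
    "is_answered": 1.0,
    "view_count": 0.8,
    "creation_date": 0.03,
    "last_activity_date": 0.03,
    "belongs": 1.0,
}
ALPHA = 0.05


def distances_from(graph, start, max_length):
    """Breadth-first search: hop count to every node within max_length."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_length:
            continue  # do not expand beyond the path-length limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def normalize(questions):
    """Divide each attribute by its maximum over the candidate questions."""
    maxima = {k: max(q[k] for q in questions.values()) for k in WEIGHTS}
    return {
        qid: {k: (q[k] / maxima[k] if maxima[k] else 0.0) for k in WEIGHTS}
        for qid, q in questions.items()
    }


def recommend(graph, questions, tag, max_length, count):
    """Return the ids of the `count` highest-valued reachable questions."""
    dist = distances_from(graph, tag, max_length)
    candidates = {qid: q for qid, q in questions.items() if qid in dist}
    if not candidates:
        return []
    values = {
        # Assumption: distance_from_tag = hops - 1 (see lead-in).
        qid: ALPHA ** (dist[qid] - 1) * sum(a[k] * WEIGHTS[k] for k in WEIGHTS)
        for qid, a in normalize(candidates).items()
    }
    return sorted(values, key=values.get, reverse=True)[:count]


graph = {"tag:game-ai": ["q1", "tag:pathfinding"], "tag:pathfinding": ["q2"]}
questions = {
    "q1": {"score": 5, "answer_count": 1, "is_answered": 1, "view_count": 100,
           "creation_date": 1400000000, "last_activity_date": 1420000000,
           "belongs": 1},
    "q2": {"score": 50, "answer_count": 5, "is_answered": 1, "view_count": 5000,
           "creation_date": 1430000000, "last_activity_date": 1435000000,
           "belongs": 0},
}
ranking = recommend(graph, questions, "tag:game-ai", max_length=2, count=2)
```

In this toy data, q2 has the higher raw attributes, but it sits one step further from the tag, so its value is multiplied by alpha = 0.05 and q1 ranks first.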

Recommended questions for tag game-ai:

##Further development possibilities:
* improving recommendations by adding more sites from StackExchange (in this project, only StackOverflow was used)
* improving recommendations by adding more complex queries:
  * more tags (the easiest way is to get the relevant data for each tag; if the resulting graph is not connected, there is no solution; if it is connected, calculate the most popular questions for each tag, add up the values calculated for each tag, and return the questions with the highest totals)
* improving recommendations by filtering out the results (questions) related to some very specific frameworks and/or programming languages that would not be of interest to the majority of users; keep these only if they were explicitly specified in the user's query
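
The multi-tag idea above (add up the values calculated for each tag and return the questions with the highest totals) could look something like this. It is only a sketch of a possible future extension: `combine_tag_values` is a hypothetical name, and each input dict is assumed to map question ids to the values computed for one tag.

```python
from collections import Counter


def combine_tag_values(per_tag_values, count):
    """Sum each question's value across tags and return the top `count` ids."""
    totals = Counter()
    for values in per_tag_values:
        totals.update(values)  # Counter.update adds to existing totals
    return [qid for qid, _ in totals.most_common(count)]


top = combine_tag_values(
    [{"q1": 2.0, "q2": 0.5}, {"q2": 1.0, "q3": 1.2}],
    count=2,
)
```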