# Skills evaluator math model

## Problem description

The **Cuban Engineer** platform needs to evaluate the skills of its users. The evaluation uses only data extracted from **Cuban Engineer** providers, so that every data source is verified. Since each provider reports skills with its own measures, the process must handle the data of every provider, join repeated skills, summarize the output in a single score per skill, and infer skills that are not explicitly provided.

### Objectives

* Provide a fair engineer skills evaluation 
* Preprocess input data according to each provider API 
* Build a summarized evaluation model with all engineer skills extracted from different providers
* Infer verified but non-explicitly provided skills from the data

### Hypothesis

All **Cuban Engineer** users are mainly programming-related professionals, so they must know some programming language and a set of related technologies, and they will have published pieces of information about their skills.

### Objects description and acquired data

For the evaluation process, the **Objects** are the Cuban Engineer users' data from each provider.

The current providers are: 
* GitHub
* GitLab
* StackExchange

#### GitHub
A version control hosting platform for developers that stores code statistics and data about users' collaborative behaviour.

##### Provided data
* **User followers**: User followers count. An integer value.
* **A set of repositories**: Each repository has a set of code statistics.
  * **Repository ID**: An identifier.
  * **Repository Forked**: Whether the repository is a fork. A Boolean value.
  * **Repository Contributors**: Number of project contributors. An integer value.
  * **Repository Stars**: Number of project stars. An integer value.
  * **Repository Forks**: Number of project forks. An integer value.
  * **Repository Views**: Number of project views. An integer value.
  * **Repository total commits**: Number of project commits. An integer value.
  * **Repository user commits**: Number of commits made by the user. An integer value.
  * **Repository total additions**: Number of bytes of code added to the project. An integer value.
  * **Repository user additions**: Number of bytes of code added by the user. An integer value.
* **A set of skills**: The programming or scripting languages used in each repository, called skills in this context.
  * **Skill repository ID**: A reference to a repository identifier.
  * **Skill name**: Programming or scripting language used. A string value.
  * **Skill value**: Number of bytes of code written in that programming or scripting language. An integer value.

#### GitLab
A version control hosting platform for developers that stores code statistics and data about users' collaborative behaviour.

##### Provided data
* **User followers**: User followers count. An integer value.
* **A set of repositories**: Each repository has a set of code statistics.
  * **Repository ID**: An identifier.
  * **Repository Forked**: Whether the repository is a fork. A Boolean value.
  * **Repository Contributors**: Number of project contributors. An integer value.
  * **Repository Stars**: Number of project stars. An integer value.
  * **Repository Forks**: Number of project forks. An integer value.
  * **Repository Views**: Number of project views. An integer value.
  * **Repository total commits**: Number of project commits. An integer value.
  * **Repository user commits**: Number of commits made by the user. An integer value.
* **A set of skills**: The programming or scripting languages used in each repository, called skills in this context.
  * **Skill repository ID**: A reference to a repository identifier.
  * **Skill name**: Programming or scripting language used. A string value.
  * **Skill value**: Percentage of the project written in that programming or scripting language. A real value.

#### StackExchange
A question-and-answer platform for sharing knowledge between users.

##### Provided data
* **User reputation**: 
* **A set of skills**:
* **Skill name**: 
* **Skill value**: 

### Expected results and evaluation

The output is a set of evaluated skills for each Cuban Engineer user that meets the following requirements:

* An evaluated skill is unique for the user.
* The skill value represents the user's final score across all providers that supply information about that skill.
* The output skill values must be scaled to the user's desired range of values.
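
The scaling requirement could be met in several ways. The sketch below assumes a simple linear (min-max) mapping; the function name `scale_scores` and the default range are illustrative and not part of the platform.

```python
def scale_scores(scores, low=0.0, high=100.0):
    """Linearly map raw skill scores onto the range [low, high]."""
    if not scores:
        return {}
    smallest = min(scores.values())
    largest = max(scores.values())
    span = largest - smallest
    if span == 0:
        # Every skill has the same raw score (assumption: place them at the top).
        return {name: high for name in scores}
    return {
        name: low + (value - smallest) * (high - low) / span
        for name, value in scores.items()
    }


raw = {"java": 40.0, "python": 10.0, "html": 25.0}
print(scale_scores(raw, 1.0, 5.0))  # {'java': 5.0, 'python': 1.0, 'html': 3.0}
```

A min-max mapping always sends the user's weakest skill to the bottom of the range, which may or may not be the desired behaviour; a fixed reference maximum would be an alternative.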

The evaluator result must be tested with a data set extracted from known users of the platform. The test set will not provide an evaluation value; instead, it will sort the engineers according to several skills and an overall evaluation. When the evaluator's values are sorted in the same way as the test set, the resulting order should be similar to it. This comparison defines the quality of the evaluation algorithm.
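
The document does not name a measure for how similar two orderings are. One common choice is the Kendall rank correlation, sketched here under the assumption that both rankings list the same engineers without ties; the identifiers in the example are made up.

```python
from itertools import combinations


def kendall_tau(reference, predicted):
    """Kendall rank correlation between two orderings of the same engineers.

    Both arguments are lists of engineer identifiers, best first, without
    duplicates. Returns 1.0 for identical orderings and -1.0 for reversed ones.
    """
    if set(reference) != set(predicted) or len(reference) < 2:
        raise ValueError("rankings must contain the same two or more engineers")
    position = {engineer: index for index, engineer in enumerate(predicted)}
    concordant = discordant = 0
    for first, second in combinations(reference, 2):
        # `first` precedes `second` in the reference ranking.
        if position[first] < position[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


expected_order = ["ana", "luis", "marta", "pedro"]   # hypothetical test set order
evaluator_order = ["ana", "marta", "luis", "pedro"]  # hypothetical evaluator order
print(kendall_tau(expected_order, evaluator_order))  # one swapped pair out of six
```

The same function can be applied once per skill and once for the overall evaluation, giving one agreement value per ranking in the test set.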

## Problem formalization

### Objectives formalization

The algorithm must use the data from different providers to estimate how good an engineer is in a set of skills, whether or not they are explicit in the provided data. Achieving this goal requires several processing steps:

* **Input value determination**: The input value must be defined for each provider. It could be a single value (e.g. StackExchange) or a linear combination of values (e.g. GitHub).
* **Value Normalization**: Transform the skill values from different providers to the same numerical scale to allow a more precise comparison between input values.
* **Name Homogenization**: Transform the skill names to a single form, because different providers may name the same skill in different ways (e.g. 'C++' and 'Cpp').
* **Skill Inference**: Use the provided data to infer non-explicit skills and their evaluation. Inferred skills must be valid, so concepts like probability or correlation must not be used in the inference process.
* **Result merging**: Since some providers structure their data by project, the data of each project must be processed and summarized to obtain a final evaluation per provider. The same applies when defining the user's final skill evaluation from the results of the different providers. The merge process between projects and between providers must be defined so that the final result is fair.
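
The name homogenization and result merging steps above could be sketched as follows. The alias table, the function names and the choice of a plain average over providers are all assumptions made for illustration; the document does not fix a merge rule.

```python
# Hypothetical alias table: maps provider spellings to one canonical name.
ALIASES = {"cpp": "c++", "js": "javascript"}


def canonical_name(name):
    """Lower-case a skill name and resolve known aliases."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def merge_providers(provider_scores):
    """Average each skill's normalized score over the entries that report it."""
    collected = {}
    for scores in provider_scores.values():
        for name, value in scores.items():
            collected.setdefault(canonical_name(name), []).append(value)
    return {name: sum(values) / len(values) for name, values in collected.items()}


# Scores are assumed to be already normalized to [0, 1] per provider.
normalized = {
    "GITHUB": {"Java": 0.75, "Cpp": 0.5},
    "GITLAB": {"java": 0.25},
    "STACK_EXCHANGE": {"C++": 0.25, "python": 0.5},
}
print(merge_providers(normalized))  # {'java': 0.5, 'c++': 0.375, 'python': 0.5}
```

A plain average treats all providers as equally trustworthy; a weighted average would be a natural variation if some providers are considered more reliable.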

### Data representation

The input data for each user is structured in a JSON format like the following example:

```python
input_json = """
    {
      "name": "Minnie",
      "nick": "Minnie",
      "email": "Minnie@disney.com",
      "profiles": [
        {
          "provider": "STACK_EXCHANGE",
          "stats": {
            "reputation": "17"
          },
          "skills": [
            {
              "repositoryId": 0,
              "name": "angular",
              "value": 2.0
            },
            {
              "repositoryId": 0,
              "name": "unit-testing",
              "value": 2.0
            },
            {
              "repositoryId": 0,
              "name": "chromium",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "docker",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "gitlab",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "highcharts",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "intellij-idea",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "java",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "javascript",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "matplotlib",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "python",
              "value": 1.0
            },
            {
              "repositoryId": 0,
              "name": "rest",
              "value": 1.0
            }
          ]
        },
        {
          "provider": "GITHUB",
          "stats": {
            "followers": "2"
          },
          "repositories": [
            {
              "id": 1,
              "isFork": false,
              "contributors": 0,
              "totalCommits": 30,
              "userCommits": 30,
              "forks": 0,
              "stars": 0,
              "views": 0,
              "userAdditions": 2976,
              "totalAdditions": 2976
            }
          ],
          "skills": [
            {
              "repositoryId": 1,
              "name": "Java",
              "value": 2976.0
            }
          ]
        },
        {
          "provider": "GITLAB",
          "stats": {},
          "repositories": [
            {
              "id": 1,
              "isFork": false,
              "contributors": 0,
              "totalCommits": 11,
              "userCommits": 11,
              "forks": 0,
              "stars": 0,
              "views": 0
            },
            {
              "id": 2,
              "isFork": false,
              "contributors": 0,
              "totalCommits": 25,
              "userCommits": 20,
              "forks": 0,
              "stars": 0,
              "views": 0
            },
            {
              "id": 3,
              "isFork": false,
              "contributors": 0,
              "totalCommits": 0,
              "userCommits": 0,
              "forks": 0,
              "stars": 0,
              "views": 0
            }
          ],
          "skills": [
            {
              "repositoryId": 1,
              "name": "Java",
              "value": 96.55
            },
            {
              "repositoryId": 1,
              "name": "HTML",
              "value": 3.45
            },
            {
              "repositoryId": 2,
              "name": "Java",
              "value": 100.0
            }
          ]
        }
      ]
    }
"""
```


The JSON format follows the data description provided before. 
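
A minimal sketch of how such a document might be read in Python, using only the standard `json` module. The trimmed `sample` string and the helper name `skills_by_provider` are illustrative; the field names are the ones shown in the example above.

```python
import json

# Trimmed version of the example document, kept short for illustration.
sample = """
{
  "name": "Minnie",
  "profiles": [
    {"provider": "STACK_EXCHANGE",
     "stats": {"reputation": "17"},
     "skills": [{"repositoryId": 0, "name": "java", "value": 1.0}]},
    {"provider": "GITLAB",
     "stats": {},
     "repositories": [{"id": 1, "isFork": false,
                       "totalCommits": 11, "userCommits": 11}],
     "skills": [{"repositoryId": 1, "name": "Java", "value": 96.55},
                {"repositoryId": 1, "name": "HTML", "value": 3.45}]}
  ]
}
"""


def skills_by_provider(user):
    """Group (repositoryId, name, value) triples under each provider name."""
    grouped = {}
    for profile in user["profiles"]:
        entries = grouped.setdefault(profile["provider"], [])
        for skill in profile.get("skills", []):
            entries.append((skill["repositoryId"], skill["name"], skill["value"]))
    return grouped


user = json.loads(sample)
grouped = skills_by_provider(user)
# Values under "stats" arrive as strings in the example, so cast explicitly.
reputation = int(user["profiles"][0]["stats"]["reputation"])
print(grouped["GITLAB"], reputation)
```

Note that StackExchange skills carry `repositoryId` 0 because that provider has no repositories, and that the same skill appears as `java` and `Java` depending on the provider.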

### Data properties

As the previous section shows, each provider has distinct data that must be processed independently. For that reason, the data properties must be described for each provider. Since the data of each provider has already been described, this section focuses on the main differences between them.

#### GitHub
The data describes the user's code statistics for a set of projects. Each project also has a set of related skills. The meaning of the most important fields in the problem context is presented below:

* **User followers**: A good programmer should have more followers, but this is not guaranteed.
* **Repository Forked**: A forked project means that the main ideas of the project did not originate with the user; this can be used to reduce the relevance of the project for the user.
* **Repository Contributors**: A project with more contributors is a teamwork project, but the user's share of the contribution is smaller because the project commits are not only from the user.
* **Repository Stars**: Usually, a good project is rated by other users, which increases project relevance.
* **Repository Forks**: A project with interesting ideas is usually forked by other users, which increases project relevance.
* **Repository Views**: Users interested in the project's behaviour and evolution use the views to keep track of it.
* **Repository user commits**: A well-worked project should have more commits from the user. This value gives an idea of the user's contributions to the project.
* **Repository user additions**: Measures the number of bytes of code added by the user.

All described repository fields can be used to define the relevance of the project and user contributions to it.
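
As an illustration of how these fields might be combined, the sketch below weights a repository by a linear combination of its popularity counters multiplied by the user's share of the work. The weights, the fork penalty and all function names are hypothetical; the document does not prescribe any of them.

```python
# Hypothetical weights for the popularity counters (not defined by the source).
WEIGHTS = {"stars": 1.0, "forks": 1.0, "views": 0.5}
FORK_PENALTY = 0.5  # hypothetical reduction applied to forked repositories


def user_share(repo):
    """Fraction of the repository attributable to the user, in [0, 1]."""
    if repo.get("totalAdditions"):
        return repo["userAdditions"] / repo["totalAdditions"]
    if repo.get("totalCommits"):
        # Fall back to commits when byte additions are missing or zero.
        return repo["userCommits"] / repo["totalCommits"]
    return 0.0


def relevance(repo):
    """Linear combination of popularity counters, starting from 1.0."""
    score = 1.0 + sum(w * repo.get(field, 0) for field, w in WEIGHTS.items())
    return score * FORK_PENALTY if repo["isFork"] else score


def repository_weight(repo):
    """Relevance of the project scaled by the user's contribution to it."""
    return relevance(repo) * user_share(repo)


own_repo = {"isFork": False, "stars": 0, "forks": 0, "views": 0,
            "totalCommits": 30, "userCommits": 30,
            "totalAdditions": 2976, "userAdditions": 2976}
forked_repo = {"isFork": True, "stars": 4, "forks": 2, "views": 10,
               "totalCommits": 20, "userCommits": 5}
print(repository_weight(own_repo), repository_weight(forked_repo))  # 1.0 1.5
```

The number of contributors is not weighted directly here because its effect, a smaller user share, is already captured by the commit and addition ratios.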

#### GitLab
The GitLab data is very similar to the GitHub data. The main differences are that GitLab does not provide user and total additions, and that the skill value is the language's percentage of usage in the project.
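
Because GitHub reports skill values in bytes and GitLab in percentages, one possible way to make them comparable is to convert the GitHub byte counts into per-repository percentages. This is only a sketch of that idea; `bytes_to_percent` is an illustrative name and the byte counts in the example are invented.

```python
from collections import defaultdict


def bytes_to_percent(skills):
    """Convert per-repository byte counts into percentage shares."""
    totals = defaultdict(float)
    for skill in skills:
        totals[skill["repositoryId"]] += skill["value"]
    converted = []
    for skill in skills:
        total = totals[skill["repositoryId"]]
        percent = 100.0 * skill["value"] / total if total else 0.0
        # Copy the entry so the input list is left untouched.
        converted.append({**skill, "value": percent})
    return converted


github_skills = [
    {"repositoryId": 1, "name": "Java", "value": 3000.0},
    {"repositoryId": 1, "name": "HTML", "value": 1000.0},
    {"repositoryId": 2, "name": "Python", "value": 500.0},
]
for entry in bytes_to_percent(github_skills):
    print(entry["repositoryId"], entry["name"], entry["value"])
```

The conversion discards the absolute size of each repository, so a size-dependent weight would have to be carried separately if it matters for the evaluation.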

#### StackExchange
