# 1. Assessor and analyst work

## 1.0. Rating and criteria

Please [open this document](https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf)
and study chapters 13.0-13.4. Your task will be to assess the organic results that different search engines return for the same query.

## 1.1. Explore the page

For the following search engines:
- https://duckduckgo.com/
- https://www.bing.com/
- https://ya.ru/
- https://www.google.com/

Perform the same query: "**How to get from Kazan to Voronezh**".

Discuss with your TA the following:
1. Which elements can you identify on the SERP? Ads, snippets, blends from other sources, ...?
2. Where are the organic results? How many of them are there?

## 1.2. Rate the results of the search engine

If there are many of you in the group, assess all search engines; otherwise choose 1 or 2. There should be at least 5 of you for each search engine. Use the rating scale from the handbook, mapping its grades to the numerical equivalents 0..4.

Compute:
- average relevance and standard deviation.
- [Fleiss kappa score](https://en.wikipedia.org/wiki/Fleiss%27_kappa#Worked_example). Use [this implementation](https://www.statsmodels.org/dev/generated/statsmodels.stats.inter_rater.fleiss_kappa.html).
- [Kendall rank coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient). Use [this implementation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html).

Discuss the numerical results. Did you agree on the relevance? Did you agree on the rank? What is the difference?

In [1]:
import numpy as np
# example input by users
ranking_data = np.array([
    [4, 4, 4, 3, 4, 2, 2, 1, 1, 0], # assessor 1 relevance
    [4, 3, 4, 3, 3, 2, 1, 1, 1, 1], # assessor 2 relevance
    [3, 4, 4, 4, 4, 3, 2, 1, 1, 1], # ...
    [4, 4, 4, 4, 3, 2, 2, 1, 1, 0],
    [4, 4, 4, 4, 3, 2, 2, 1, 1, 3]
])

Averages and standard deviations per item.

In [2]:
average_relevance = ranking_data.mean(axis=0)
sigma2 = ((ranking_data - average_relevance) ** 2).mean(axis=0)
sigma = sigma2 ** .5

for i in range(ranking_data.shape[1]):
    print(f" {i} relevance {average_relevance[i]:.2f} ± {sigma[i]:.3f}")

 0 relevance 3.80 ± 0.400
 1 relevance 3.80 ± 0.400
 2 relevance 4.00 ± 0.000
 3 relevance 3.60 ± 0.490
 4 relevance 3.40 ± 0.490
 5 relevance 2.20 ± 0.400
 6 relevance 1.80 ± 0.400
 7 relevance 1.00 ± 0.000
 8 relevance 1.00 ± 0.000
 9 relevance 1.00 ± 1.095
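
The deviation computed above is the population standard deviation (division by the number of assessors). With only five assessors per item, the sample estimate (division by n - 1) is noticeably larger, which is worth keeping in mind when discussing the spread. A minimal sketch using only the standard library; the ratings are those of the last item in `ranking_data`, and the variable names are illustrative:

```python
import statistics

# Ratings of the last item (column 9 of ranking_data) by the five assessors.
item_9 = [0, 1, 1, 0, 3]

population_sigma = statistics.pstdev(item_9)  # divides by n, as in the cell above
sample_sigma = statistics.stdev(item_9)       # divides by n - 1

print(f"population: {population_sigma:.3f}, sample: {sample_sigma:.3f}")
# population: 1.095, sample: 1.225
```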


Fleiss kappa score

In [3]:
!pip install statsmodels



In [4]:
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
transposed = ranking_data.T

aggregate, cats = aggregate_raters(transposed)
print("Agreement matrix:")
print(aggregate)
print("Categories:", cats)
print("Kappa:", fleiss_kappa(aggregate))

Agreement matrix:
[[0 0 0 1 4]
 [0 0 0 1 4]
 [0 0 0 0 5]
 [0 0 0 2 3]
 [0 0 0 3 2]
 [0 0 4 1 0]
 [0 1 4 0 0]
 [0 5 0 0 0]
 [0 5 0 0 0]
 [2 2 0 1 0]]
Categories: [0 1 2 3 4]
Kappa: 0.5156081808396124
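
To see what `fleiss_kappa` measures, the value can be reproduced by hand from the agreement matrix. The sketch below follows the standard definition (observed agreement between rater pairs versus agreement expected by chance) in plain Python; the function name `fleiss_kappa_manual` is made up for this illustration and it assumes every item was rated by the same number of assessors:

```python
def fleiss_kappa_manual(table):
    """Fleiss' kappa for a table of shape (items, categories) holding rater counts."""
    n_items = len(table)
    n_raters = sum(table[0])  # assumption: the same number of raters for every item

    # Share of all ratings that fell into each category.
    totals = [sum(column) for column in zip(*table)]
    p_category = [t / (n_items * n_raters) for t in totals]

    # Per-item agreement: fraction of rater pairs that chose the same category.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]

    p_observed = sum(p_items) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)


# Agreement matrix printed above (rows: items, columns: categories 0..4).
agreement = [
    [0, 0, 0, 1, 4],
    [0, 0, 0, 1, 4],
    [0, 0, 0, 0, 5],
    [0, 0, 0, 2, 3],
    [0, 0, 0, 3, 2],
    [0, 0, 4, 1, 0],
    [0, 1, 4, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [2, 2, 0, 1, 0],
]

print(round(fleiss_kappa_manual(agreement), 4))  # 0.5156
```

Here the observed agreement is 0.64 and the chance agreement is 0.2568, so kappa = (0.64 - 0.2568) / (1 - 0.2568).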


In [7]:
ranking_data[1]

array([4, 3, 4, 3, 3, 2, 1, 1, 1, 1])

Kendall's tau is a pairwise score: it compares the ratings of one assessor with those of another.

In [5]:
from scipy.stats import kendalltau
kendalltau(ranking_data[0], ranking_data[1])

KendalltauResult(correlation=0.8336550215650926, pvalue=0.0031006074932690315)
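
Because the score is pairwise, a group of assessors yields one value per pair. Below is a minimal sketch of the tau-b variant (the tie-corrected form that `scipy.stats.kendalltau` computes by default) in plain Python, applied to every pair of assessors. `kendall_tau_b` is an illustrative helper, not part of SciPy, and it assumes neither assessor gave the same rating to all items (otherwise the denominator is zero):

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / sqrt((n_pairs - ties_x) * (n_pairs - ties_y))


# Same example ratings as ranking_data (one row per assessor).
ratings = [
    [4, 4, 4, 3, 4, 2, 2, 1, 1, 0],
    [4, 3, 4, 3, 3, 2, 1, 1, 1, 1],
    [3, 4, 4, 4, 4, 3, 2, 1, 1, 1],
    [4, 4, 4, 4, 3, 2, 2, 1, 1, 0],
    [4, 4, 4, 4, 3, 2, 2, 1, 1, 3],
]

for a, b in combinations(range(len(ratings)), 2):
    tau = kendall_tau_b(ratings[a], ratings[b])
    print(f"assessors {a + 1} and {b + 1}: tau = {tau:.3f}")
```

For assessors 1 and 2 there are 30 concordant and no discordant pairs, with 8 pairs tied in the first list and 10 in the second, so tau-b = 30 / sqrt(37 * 35), which agrees with the SciPy result above.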

# 2. Engineer work

You will create a bucket of URLs that are relevant to the query **"free git cloud"**. Then you will automate the search procedure using https://serpapi.com/, or https://developers.google.com/custom-search/v1/overview, or any similar tool.

Then you will compute MRR@10 and Precision@10.

## 2.1. Build your bucket here

In [10]:
rel_bucket = [
    "gitpod.io",
    "github.com",
    "bitbucket.org",
    "source.cloud.google.com",
    "gitlab.com",
    "sourceforge.net",
    "aws.amazon.com/codecommit/",
    "launchpad.net",
]

query = "free git cloud"

## 2.2. Relevance assessment

Write code that checks whether a retrieved document is relevant (True) or not (False).

In [11]:
def is_rel(resp_url):
    # Relevant if any bucket entry occurs as a substring of the result URL.
    return any(u in resp_url for u in rel_bucket)
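
Plain substring matching can produce false positives, for example when a bucket entry shows up in the query string of an unrelated page. A stricter sketch compares the host (and the path prefix, when the bucket entry has one). The name `is_rel_strict` and the test URLs are illustrative, and the function assumes result URLs include a scheme such as `https://`:

```python
from urllib.parse import urlparse

rel_bucket = [
    "gitpod.io",
    "github.com",
    "bitbucket.org",
    "source.cloud.google.com",
    "gitlab.com",
    "sourceforge.net",
    "aws.amazon.com/codecommit/",
    "launchpad.net",
]


def is_rel_strict(resp_url, bucket=rel_bucket):
    """True if the URL's host (and path prefix, if given) matches a bucket entry."""
    parsed = urlparse(resp_url)
    host = (parsed.hostname or "").lower()
    for entry in bucket:
        # Split "host/path" entries; entry_path is "" when there is no path.
        entry_host, _, entry_path = entry.partition("/")
        entry_path = entry_path.rstrip("/")
        # Accept the host itself or any of its subdomains (www., about., ...).
        host_ok = host == entry_host or host.endswith("." + entry_host)
        path_ok = not entry_path or parsed.path.startswith("/" + entry_path)
        if host_ok and path_ok:
            return True
    return False


print(is_rel_strict("https://about.gitlab.com/"))            # True
print(is_rel_strict("https://example.com/?ref=github.com"))  # False
```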

## 2.3. Automation

Get search results from the automation tool you use.

In [12]:
api_key = "YOUR_SERPAPI_KEY"  # placeholder: paste your own key here and keep real keys out of shared notebooks

In [13]:
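import requests

# Pass the query through `params` so that requests URL-encodes it properly.
url = "https://serpapi.com/search.json"
params = {"q": query, "hl": "en", "gl": "us", "google_domain": "google.com", "api_key": api_key}
js = requests.get(url, params=params).json()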
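
The API key can also be kept out of the notebook entirely. The sketch below assumes the key is stored in an environment variable; the name `SERPAPI_API_KEY` is a choice made for this example, not something the service mandates, and `build_search_url` is an illustrative helper that uses the same endpoint and parameters as the cell above:

```python
import os
from urllib.parse import urlencode


def build_search_url(query, api_key):
    """Build the search URL with properly encoded query parameters."""
    params = {
        "q": query,
        "hl": "en",
        "gl": "us",
        "google_domain": "google.com",
        "api_key": api_key,
    }
    return "https://serpapi.com/search.json?" + urlencode(params)


# Hypothetical variable name; falls back to an empty string if it is not set.
api_key = os.environ.get("SERPAPI_API_KEY", "")
url = build_search_url("free git cloud", api_key)
```

The resulting `url` can be passed to `requests.get` as before; avoid printing it, since it contains the key.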

In [16]:
js['organic_results']

[{'position': 1,
  'title': '6 places to host your git repository - Opensource.com',
  'link': 'https://opensource.com/article/18/8/github-alternatives',
  'displayed_link': 'https://opensource.com › article › github-alternatives',
  'date': 'Aug 30, 2018',
  'snippet': '6 places to host your git repository · Option 1: GitHub. Seriously, this is a valid option. · Option 2: GitLab. GitLab is probably the leading ...',
  'snippet_highlighted_words': ['git', 'GitLab', 'GitLab'],
  'about_this_result': {'source': {'description': 'opensource.com was first indexed by Google more than 10 years ago',
    'source_info_link': 'https://opensource.com/article/18/8/github-alternatives',
    'security': 'secure',
    'icon': 'https://serpapi.com/searches/63e7c7a8e0ded48f30ac52b7/images/dea020dba9924c44aa340756ceef73bd8f22dd55bf181b0b64295d19232ac5aa2675782e4b19ba1eb2c3825531f96118.png'}},
  'about_page_link': 'https://www.google.com/search?q=About+https://opensource.com/article/18/8/github-alternati

In [17]:
rels = []
for result in js["organic_results"]:
    print(result['position'], result['title'])
    print(result['link'])
    print(is_rel(result['link']))
    rels.append(int(is_rel(result['link'])))
    print()

1 6 places to host your git repository - Opensource.com
https://opensource.com/article/18/8/github-alternatives
False

2 Bitbucket | Git solution for teams using Jira
https://bitbucket.org/product
True

3 Gitpod: Always ready to code.
https://www.gitpod.io/
True

4 GitLab: The DevSecOps Platform
https://about.gitlab.com/
True

5 14 Git Hosting Services Compared | Tower Blog
https://www.git-tower.com/blog/git-hosting-services-compared/
False

6 GitHub: Let's build from here · GitHub
https://github.com/
True

7 Git
https://git-scm.com/
False

8 Top GitHub Alternatives to Host Your Open Source Projects
https://itsfoss.com/github-alternatives/
False

9 Top 10 best Git hosting solutions and services in 2021
https://www.devopsschool.com/blog/top-5-git-hosting-solutions/
False

10 15 Best Github Alternatives in 2023 - Guru99
https://www.guru99.com/github-alternative.html
False



In [18]:
rels

[0, 1, 1, 1, 0, 1, 0, 0, 0, 0]

## 2.4. MRR

Compute MRR:

In [27]:
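def mrr(list_of_lists, k=10):
    """Mean reciprocal rank over queries; each inner list holds 0/1 relevance labels."""
    r = 0
    for rels_for_query in list_of_lists:
        top_k = rels_for_query[:k]  # only the first k results count
        # No relevant result in the top k: score it as if it were at rank k + 1
        # (the conventional definition uses 0 in this case).
        r += 1 / (top_k.index(1) + 1) if 1 in top_k else 1 / (k + 1)
    return r / len(list_of_lists)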

In [28]:
mrr([[0,0,0,0,0,0,0,0,0,0]])

0.09090909090909091
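
For comparison, the more common convention assigns a reciprocal rank of 0 to a query with no relevant result in the top k, rather than 1/(k+1). A sketch of that variant follows; `mrr_at_k` is an illustrative name, and `rels` repeats the relevance list obtained in section 2.3:

```python
def mrr_at_k(list_of_lists, k=10):
    """MRR@k where a query without a relevant result in the top k contributes 0."""
    total = 0.0
    for rels_for_query in list_of_lists:
        top_k = rels_for_query[:k]
        if 1 in top_k:
            # Rank of the first relevant result, counted from 1.
            total += 1 / (top_k.index(1) + 1)
    return total / len(list_of_lists)


rels = [0, 1, 1, 1, 0, 1, 0, 0, 0, 0]

print(mrr_at_k([rels]))             # 0.5  (first relevant result at rank 2)
print(mrr_at_k([rels, [0] * 10]))   # 0.25 (second query contributes 0)
```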

## 2.5. Precision
Compute mean precision:

In [73]:
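def mp(list_of_lists, k=10):
    """Mean Precision@k over queries; each inner list holds 0/1 relevance labels."""
    p = 0
    for rels_for_query in list_of_lists:
        p += sum(rels_for_query[:k]) / k  # only the top k results count
    return p / len(list_of_lists)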

In [74]:
mp([rels])

0.4