## Wikipedia scrapy project

### items.py

In [None]:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WikipediaItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    links = scrapy.Field()

### wikipedia_pages.py

In [None]:
from scrapy.spiders import Spider
from scrapy.http import Request

from wikipedia.items import WikipediaItem
from bs4 import BeautifulSoup

import re

class PagesSpider(Spider):
    name = "wikipedia_pages"
    start_urls = [
        "https://en.wikipedia.org/wiki/United_States",
        "https://en.wikipedia.org/wiki/Meme",
        "https://en.wikipedia.org/wiki/Game_of_Thrones"
    ]

    allowed_domains = ["wikipedia.org"]

    visited_urls = set()

    body_link_selector = '(//div[@id="mw-content-text"]/p/a/@href)[position() < 100]'
    allowed_re = re.compile(r'https://en\.wikipedia\.org/wiki/'
                            r'(?!((File|Talk|Category|Portal|Special|'
                            r'Template|Template_talk|Wikipedia|Help|Draft):|Main_Page)).+')

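    def parse(self, response):
        item = WikipediaItem()
        soup = BeautifulSoup(response.body, "lxml")

        item['url'] = response.url
        item['name'] = soup.find("h1", {"id": "firstHeading"}).string

        # First body paragraph as plain text, truncated to 255 characters;
        # default="" avoids passing None to BeautifulSoup when no paragraph matches
        first_paragraph = response.xpath('//div[@id="mw-content-text"]/p[1]').extract_first(default="")
        item['description'] = BeautifulSoup(first_paragraph, "lxml").text[:255] + "..."

        # Absolute URLs of body links, skipping in-page anchors and non-article namespaces
        absolute_links = [response.urljoin(href)
                          for href in response.xpath(self.body_link_selector).extract()
                          if not href.startswith("#")]
        item['links'] = [link for link in absolute_links if self.allowed_re.match(link)]
        yield item

        # Report crawl progress as the number of pages parsed so far
        self.visited_urls.add(response.url)
        print(len(self.visited_urls))

        # Follow only links whose pages have not been parsed yet
        for link in item['links']:
            if link not in self.visited_urls:
                yield Request(link, callback=self.parse)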
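
The link filter can be checked in isolation. The sketch below reuses the same pattern as `allowed_re`; the sample URLs are hypothetical and chosen only to exercise each branch of the pattern (an article, a `File:` page, the main page, and a non-English domain).

```python
import re

# Same pattern as PagesSpider.allowed_re, written with raw strings
allowed_re = re.compile(
    r'https://en\.wikipedia\.org/wiki/'
    r'(?!((File|Talk|Category|Portal|Special|'
    r'Template|Template_talk|Wikipedia|Help|Draft):|Main_Page)).+'
)

# Hypothetical sample URLs, not taken from the crawl
samples = [
    "https://en.wikipedia.org/wiki/DNA",
    "https://en.wikipedia.org/wiki/File:Example.jpg",
    "https://en.wikipedia.org/wiki/Main_Page",
    "https://ru.wikipedia.org/wiki/DNA",
]

# re.match anchors at the start of the string, so other domains are rejected
accepted = [url for url in samples if allowed_re.match(url)]
print(accepted)
```

Only the plain article URL passes: the negative lookahead rejects namespaced pages and `Main_Page`, and the fixed prefix rejects other language editions.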

## Scraping

Scraping was started with the command
```
scrapy crawl wikipedia_pages -o out.json
```
To track how many pages had been crawled, the spider's `parse` method simply prints the number of URLs added to `visited_urls`. Once more than 10,000 pages had been crawled, the process was stopped.
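
As an alternative that the original run did not use, Scrapy's built-in `CLOSESPIDER_PAGECOUNT` setting can close the spider automatically once the page limit is reached. A minimal sketch, assuming the dictionary is placed in the `PagesSpider` class body under the standard `custom_settings` attribute:

```python
# Hypothetical addition to the PagesSpider class body (not part of the original run).
# CLOSESPIDER_PAGECOUNT is a built-in Scrapy setting: the spider is closed
# once more than this many responses have been crawled.
custom_settings = {
    "CLOSESPIDER_PAGECOUNT": 10000,
}
```

Requests that are already in flight when the limit is reached may still complete, so the final item count can slightly exceed the limit.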

## Building the graph

In [14]:
import json
import networkx as nx
import operator

# Deserialize the JSON file with the scraping results
with open('out.json') as pages_file:
    pages = json.load(pages_file)

# Set of URLs of the visited pages
urls = {x['url'] for x in pages}

# Set of all links going out of the visited pages
links = {y for x in pages for y in x['links']}

# Set of links that lead to pages that were not visited
skip = links.difference(urls)

# Build the graph, keeping only edges between visited pages
G = nx.DiGraph()
G.add_edges_from([(x['url'], y) for x in pages for y in x['links'] if y not in skip])
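
To make the filtering step concrete, here is a small sketch of the same set logic on hypothetical data. The page names `A` to `D` are placeholders, and the edges are kept in a plain list instead of a `networkx` graph so the example has no dependencies.

```python
# Hypothetical scraped items; only the 'url' and 'links' fields matter here
pages = [
    {"url": "A", "links": ["B", "C"]},
    {"url": "B", "links": ["A", "D"]},
]

urls = {page["url"] for page in pages}
links = {link for page in pages for link in page["links"]}
skip = links - urls  # link targets that were never crawled

# Keep only edges whose target was itself crawled
edges = [(page["url"], link)
         for page in pages
         for link in page["links"]
         if link not in skip]

print(sorted(skip))
print(edges)
```

Pages `C` and `D` were linked to but never crawled, so they end up in `skip` and only the edges between `A` and `B` survive.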

## PageRank

In [13]:
def pagerank(G, alpha=0.85):
    # Compute PageRank on the graph
    pr = nx.pagerank(G, alpha)

    # Sort the result by rank, highest first
    sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)

    # Print the TOP-10
    for x in sorted_pr[:10]:
        for y in pages:
            if x[0] == y['url']:
                print('{} {}\n{}\n{}\n'.format(y['name'], x[1], y['url'], y['description']))

### PageRank for alpha = 0.85

In [8]:
pagerank(G)

United States 0.030729757373793177
https://en.wikipedia.org/wiki/United_States
Coordinates: 40°N 100°W﻿ / ﻿40°N 100°W﻿ / 40; -100...

Ancient Greek 0.009239134942811976
https://en.wikipedia.org/wiki/Ancient_Greek
Ancient Greek includes the forms of Greek used in ancient Greece and the ancient world from around the 9th century BC to the 6th century AD. It is often roughly divided into the Archaic period (9th to 6th centuries BC), Classical period (5th and 4th centu...

DNA 0.006528139533556632
https://en.wikipedia.org/wiki/DNA
Deoxyribonucleic acid (i/diˈɒksiˌraɪboʊnjʊˌkliːɪk, -ˌkleɪɪk/;[1] DNA) is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. DNA and RNA are nucleic...

Chinese language 0.0046919943307590015
https://en.wikipedia.org/wiki/Chinese_language
Chinese (汉语/漢語; Hànyǔ or 中文; Zhōngwén) is a group of related, but in many cases mutually unintelligible, language varietie

### PageRank for alpha = 0.95

In [12]:
pagerank(G, alpha=0.95)

United States 0.03276035239441285
https://en.wikipedia.org/wiki/United_States
Coordinates: 40°N 100°W﻿ / ﻿40°N 100°W﻿ / 40; -100...

Ancient Greek 0.01034229716199733
https://en.wikipedia.org/wiki/Ancient_Greek
Ancient Greek includes the forms of Greek used in ancient Greece and the ancient world from around the 9th century BC to the 6th century AD. It is often roughly divided into the Archaic period (9th to 6th centuries BC), Classical period (5th and 4th centu...

DNA 0.008955962451412931
https://en.wikipedia.org/wiki/DNA
Deoxyribonucleic acid (i/diˈɒksiˌraɪboʊnjʊˌkliːɪk, -ˌkleɪɪk/;[1] DNA) is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. DNA and RNA are nucleic...

Metres above sea level 0.0072589833533525504
https://en.wikipedia.org/wiki/Above_mean_sea_level
Metres above mean sea level (MAMSL) or simply metres above sea level (MASL or m a.s.l.) is a standard metric measu

### PageRank for alpha = 0.5

In [10]:
pagerank(G, alpha=0.5)

United States 0.0184849550463837
https://en.wikipedia.org/wiki/United_States
Coordinates: 40°N 100°W﻿ / ﻿40°N 100°W﻿ / 40; -100...

Ancient Greek 0.00445377062185952
https://en.wikipedia.org/wiki/Ancient_Greek
Ancient Greek includes the forms of Greek used in ancient Greece and the ancient world from around the 9th century BC to the 6th century AD. It is often roughly divided into the Archaic period (9th to 6th centuries BC), Classical period (5th and 4th centu...

DNA 0.0025635913266018997
https://en.wikipedia.org/wiki/DNA
Deoxyribonucleic acid (i/diˈɒksiˌraɪboʊnjʊˌkliːɪk, -ˌkleɪɪk/;[1] DNA) is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. DNA and RNA are nucleic...

Chinese language 0.002514775919433548
https://en.wikipedia.org/wiki/Chinese_language
Chinese (汉语/漢語; Hànyǔ or 中文; Zhōngwén) is a group of related, but in many cases mutually unintelligible, language varieties, 

### PageRank for alpha = 0.3

In [11]:
pagerank(G, alpha=0.3)

United States 0.011242922573482956
https://en.wikipedia.org/wiki/United_States
Coordinates: 40°N 100°W﻿ / ﻿40°N 100°W﻿ / 40; -100...

Ancient Greek 0.0024165273283877097
https://en.wikipedia.org/wiki/Ancient_Greek
Ancient Greek includes the forms of Greek used in ancient Greece and the ancient world from around the 9th century BC to the 6th century AD. It is often roughly divided into the Archaic period (9th to 6th centuries BC), Classical period (5th and 4th centu...

Chinese language 0.0014977823290465513
https://en.wikipedia.org/wiki/Chinese_language
Chinese (汉语/漢語; Hànyǔ or 中文; Zhōngwén) is a group of related, but in many cases mutually unintelligible, language varieties, forming a branch of the Sino-Tibetan language family. Chinese is spoken by the Han majority and many other ethnic groups in China....

DNA 0.001380953306800126
https://en.wikipedia.org/wiki/DNA
Deoxyribonucleic acid (i/diˈɒksiˌraɪboʊnjʊˌkliːɪk, -ˌkleɪɪk/;[1] DNA) is a molecule that carries the genetic instructions

## Conclusions

The results show that changing PageRank's damping parameter changes the ranks, their order, and even which pages appear in the TOP-10.
For pages with many incoming links, lowering the damping parameter shrinks the component of the rank that depends on those links, so the overall rank decreases. The following formula confirms this observation:
```
PR(A) = (1 - alpha) / N + alpha(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...),
```
where PR(i) is the rank of page i, L(i) is the number of outgoing links from page i, B, C, D, ... are the pages that link to A, N is the total number of pages, and alpha is the damping parameter.
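
The effect of alpha can be reproduced on a toy graph by iterating the formula directly. This is a minimal sketch, not the algorithm used by `nx.pagerank`: the function name `pagerank_sketch`, the toy graph, and the fixed iteration count are all assumptions, and the code assumes every page has at least one outgoing link.

```python
def pagerank_sketch(out_links, alpha=0.85, iterations=500):
    """Iterate PR(A) = (1 - alpha) / N + alpha * sum(PR(B) / L(B)).

    The sum runs over pages B that link to A. Assumes every page has
    at least one outgoing link (no dangling nodes).
    """
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the constant (1 - alpha) / N term
        new_rank = {page: (1 - alpha) / n for page in pages}
        for source, targets in out_links.items():
            # Each source splits alpha * PR(source) evenly among its L(source) targets
            share = alpha * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "hub" receives a link from every other page
toy = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

hub_ranks = {alpha: pagerank_sketch(toy, alpha)["hub"] for alpha in (0.3, 0.5, 0.85, 0.95)}
for alpha, value in hub_ranks.items():
    print(alpha, value)
```

In this toy graph the rank of `hub` grows as alpha grows, which matches the trend for United States in the outputs above: its score rises from about 0.011 at alpha = 0.3 to about 0.033 at alpha = 0.95.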