## 1. Parse data 

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from requests.auth import HTTPBasicAuth

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

In [3]:
url = "https://jobs.dou.ua/reviews"

In [4]:
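def acquire_company(comment_soup):
    """Return the company slug from the first company link, or None if there is none."""
    for link in comment_soup.find_all('a'):
        ref = link.get('href')
        # Anchors without an href return None, which would break the `in` check.
        if ref and 'https://jobs.dou.ua/companies' in ref:
            return ref.split('/')[4]
    return None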

In [5]:
def acquire_comment_text(comment_soup):
    return comment_soup.find("div", class_="l-text b-typo").p.text.replace(u'\xa0', u' ')

In [6]:
def get_comment_data(comment_soup):
    return comment_soup.find('a', class_="comment-link").text

In [7]:
def get_author_of_comment(comment_soup):
    return comment_soup.find("a", class_="avatar").get('href')

In [8]:
class ReviewData:
    def __init__(self, company_name: str, text: str, user: str, data: str):
        self.company_name = company_name
        self.text = text
        self.user = user
        self.data = data

In [9]:
def parse_comment(comment_soup):
    company_name = acquire_company(comment_soup)
    text = acquire_comment_text(comment_soup)
    user = get_author_of_comment(comment_soup)
    data = get_comment_data(comment_soup)
    
    return ReviewData(company_name=company_name, text=text, user=user, data=data)

In [23]:
def parse_url(url, review_datas): 
    result = requests.get(url, headers=headers)
    soup = BeautifulSoup(result.text, 'html.parser')
    comments = soup.find_all("div", class_="comment")
    for comment in comments:
        company_name = acquire_company(comment)
        if company_name is None:
            # No company link: this is probably a reply to a review, so skip it.
            continue
        text = acquire_comment_text(comment)
        user = get_author_of_comment(comment)
        data = get_comment_data(comment)
        
        review_datas['company_name'].append(company_name)
        review_datas['text'].append(text)
        review_datas['user'].append(user)
        review_datas['data'].append(data)
        

In [24]:
def parse_urls(urls):
    review_datas = dict()
    review_datas['company_name'] = []
    review_datas['text'] = []
    review_datas['user'] = []
    review_datas['data'] = []
    
    for url in urls:
        try:
            parse_url(url, review_datas)
        except Exception:
            # Report pages that failed to download or parse, then continue.
            print(url)
    
    return pd.DataFrame.from_dict(review_datas)
    

In [25]:
urls = [f"https://jobs.dou.ua/reviews/{i}/" for i in range(1, 607)]

In [26]:
%time df = parse_urls(urls)

https://jobs.dou.ua/reviews/431/
https://jobs.dou.ua/reviews/540/
CPU times: user 40.5 s, sys: 567 ms, total: 41 s
Wall time: 7min 14s


In [32]:
len(df)

13138

In [33]:
df.head(10)

Unnamed: 0,company_name,text,user,data
0,ring-ukraine,"Проработала в компании год. Ушла, т.к. другая ...",https://dou.ua/users/anna-semerenko/,31 марта 22:30
1,computer-school-hillel-international,"Всем привет!В прошлом году прошел два курса, в...",https://dou.ua/users/botirov-rustam/,31 марта 22:21
2,room4,Работаю в компании около 7 месяцев и очень дов...,https://dou.ua/users/victor-osadchiy/,31 марта 19:16
3,computer-school-hillel-international,Добрый день! Хочу оставить отзыв по обучению в...,https://dou.ua/users/mariya-bryizhko/,31 марта 18:36
4,codeit,"Работаю в компании уже ~6,5 лет. Это моя перва...",https://dou.ua/users/daria-lymar-1/,31 марта 17:48
5,codica,"Очень рад, что попал в эту компанию. Из ценног...",https://dou.ua/users/anatolij-bogatyirenko/,31 марта 17:30
6,codica,"Толичек, спасибо тебе огромное! Если надумаешь...",https://dou.ua/users/natalyaklimenko/,31 марта 18:02
7,mobilunity,"Скоро два года, как тут работаю.",https://dou.ua/users/seva-shasharin/,31 марта 16:51
8,onix,Когда просыпаешься утром в понедельник и не ум...,https://dou.ua/users/marina-pemahova/,31 марта 15:51
9,codica,Проработала в Codica больше двух лет на позици...,https://dou.ua/users/ekaterina-plyashechnik/,31 марта 15:26


In [37]:
print( "The number of companies: ", len(np.unique(df['company_name'].values)))

The number of companies:  2094


Save the data to a file

In [None]:
df.to_csv("../data/dou-company-reviews.csv")

#### Short data description

As the table above and the site (https://jobs.dou.ua/reviews/) show, the data are not strictly annotated. For each review we can extract the review text, the company name, the user name, the place where the user worked when writing the review, and the date the review was posted on dou.ua.

Of these fields, the most interesting for us are the review text and the name of the company it refers to.

From https://jobs.dou.ua/reviews/ we parsed 13138 user reviews about 2094 different IT companies in Ukraine.

Users usually write reviews in Ukrainian or Russian, so we will probably either translate all texts into Ukrainian or work with both languages.
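
To decide between translating and working with both languages, it helps to know how the reviews split by language. The sketch below is a rough heuristic, not a real language detector: it counts letters that exist only in the Ukrainian alphabet against letters that exist only in the Russian one. The function name `guess_language` and the `'uk'`/`'ru'`/`'unknown'` labels are assumptions made for this illustration.

```python
# Letters that occur in only one of the two alphabets.
UKRAINIAN_ONLY = set("іїєґ")
RUSSIAN_ONLY = set("ыэъё")


def guess_language(text: str) -> str:
    """Return 'uk', 'ru', or 'unknown' based on alphabet-specific letters."""
    lowered = text.lower()
    uk_hits = sum(ch in UKRAINIAN_ONLY for ch in lowered)
    ru_hits = sum(ch in RUSSIAN_ONLY for ch in lowered)
    if uk_hits > ru_hits:
        return "uk"
    if ru_hits > uk_hits:
        return "ru"
    # Ties (including texts with no distinctive letters) stay undecided.
    return "unknown"
```

In the notebook this could be applied as `df['text'].map(guess_language).value_counts()`. Short or mixed-language reviews will often come back as `'unknown'`, so the counts should be read as an estimate only.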

Because these are user reviews about different companies, biases are to be expected. For instance, we observed a case where a user left a negative comment about a company because he failed the interview for an unknown reason, although he had never worked there.

## 2. How to annotate data

Unfortunately, we don't have strictly annotated data, so we need to annotate it before we can apply machine learning approaches or compute evaluation metrics.

For each user review, we want to know the following: <br/>
1) whether each sentence is positive, neutral or negative <br/>
2) the pros and cons (advantages and disadvantages) of the company <br/>
3) whether the user worked, is working or has never worked at the company. Many users who never worked at a company wrote reviews based only on the interview process, while others wrote reviews while working there or after leaving.


Below we show an example of how to annotate the data directly in this Jupyter notebook.

In [84]:
import warnings
warnings.filterwarnings('ignore')
from enum import IntEnum

In [85]:
class UserWorkType(IntEnum):
    WORKED = 0
    WORKING = 1
    NEVER_WORKED = 2
    UNDEFINED = 3

In [78]:
df['prons'] = None
df['cons'] = None
df['user_work_type'] = UserWorkType.UNDEFINED

In [71]:
df.head()

Unnamed: 0,company_name,text,user,data,prons,cons,user_work_type
0,ring-ukraine,"Проработала в компании год. Ушла, т.к. другая ...",https://dou.ua/users/anna-semerenko/,31 марта 22:30,,,3
1,computer-school-hillel-international,"Всем привет!В прошлом году прошел два курса, в...",https://dou.ua/users/botirov-rustam/,31 марта 22:21,,,3
2,room4,Работаю в компании около 7 месяцев и очень дов...,https://dou.ua/users/victor-osadchiy/,31 марта 19:16,,,3
3,computer-school-hillel-international,Добрый день! Хочу оставить отзыв по обучению в...,https://dou.ua/users/mariya-bryizhko/,31 марта 18:36,,,3
4,codeit,"Работаю в компании уже ~6,5 лет. Это моя перва...",https://dou.ua/users/daria-lymar-1/,31 марта 17:48,,,3


For instance, select a review:

In [79]:
df['text'][19]

'Проходив співбесіду в дану компанію. Виконав тестове завдання, проходив технічну співбесіду. Фідбек не отримав — як повідомила мені рекрутер, технічний спеціаліст, який проводив інтервю, пішов у відпустку на дві неділі по сімейним обставинам. Після ще двох тижднів очікування фідбек так і не надійшов, в результаті вакансія на DOU вже закрита. Хочу з цього приводу відмітити непрофесіоналізм рекрутера (чи, можливо, технічного спеціаліста).'

Set the pros and cons for this review (columns `prons` and `cons`)

In [86]:
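# Use .at instead of chained indexing so the assignment always reaches df itself.
df.at[19, 'prons'] = []
df.at[19, 'cons'] = ['непрофесіоналізм рекрутера']  # "unprofessionalism of the recruiter"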

Set the user work type for the selected review (whether the user worked, is working or has never worked at this company).

In [87]:
df.at[19, 'user_work_type'] = UserWorkType.NEVER_WORKED

In [82]:
df[19:20]

Unnamed: 0,company_name,text,user,data,prons,cons,user_work_type
19,echoua,Проходив співбесіду в дану компанію. Виконав т...,https://dou.ua/users/anton-jasynetsky/,30 марта 20:16,[],[непрофесіоналізм рекрутера],2


To annotate sentences, we first need to split each review into sentences and then annotate each of them.
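
A minimal sketch of the splitting step, assuming a plain regular expression is good enough for a first pass. The helper name `split_sentences` is hypothetical. The rule is naive: it splits after `.`, `!`, `?` or `…` when whitespace follows, so abbreviations such as "т.к." will produce spurious splits, and a dedicated tokenizer would likely do better.

```python
import re
from typing import List

# Split on whitespace that directly follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Split a review into sentences with a naive punctuation rule."""
    parts = _SENTENCE_END.split(text.strip())
    # Drop empty strings that appear for empty or whitespace-only input.
    return [part for part in parts if part]
```

In the notebook, `df['text'].map(split_sentences).explode()` would give one sentence per row while keeping the index of the source review, so that sentence-level labels can be joined back to the review later.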

It would be good to have a tool that simplifies the annotation process; we will probably create a simple Excel table where a labeler can annotate the data comfortably.
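
One possible shape for such a table is sketched below, as an assumption rather than a settled design: a CSV file that Excel can open, with one row per review and empty label columns for the labeler to fill in. The function name `write_annotation_sheet`, the `review_id` column, and the `sentiment` column are hypothetical; `prons`, `cons` and `user_work_type` follow the column names used in this notebook.

```python
import csv
from typing import Iterable, TextIO, Tuple

# Label columns left empty for the labeler (hypothetical layout).
LABEL_COLUMNS = ["sentiment", "prons", "cons", "user_work_type"]


def write_annotation_sheet(reviews: Iterable[Tuple], out: TextIO) -> int:
    """Write (review_id, company_name, text) rows plus empty label columns.

    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["review_id", "company_name", "text"] + LABEL_COLUMNS)
    count = 0
    for review_id, company_name, text in reviews:
        writer.writerow([review_id, company_name, text] + [""] * len(LABEL_COLUMNS))
        count += 1
    return count
```

In the notebook it could be called as `write_annotation_sheet(df[['company_name', 'text']].itertuples(), f)`, where `f` is a file opened with `newline=''` and `encoding='utf-8-sig'`. The byte order mark written by that encoding lets Excel recognize the Cyrillic text as UTF-8.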

#### 2.2 How to improve data automatically

1) Since the data were written by arbitrary users, they may contain mistakes, so it would be good to fix these automatically before processing. One option is to find words that are not in the dictionary and replace each with the most suitable dictionary word. <br/>
2) Additionally, before processing, we can remove noise such as smileys and stop words. <br/>
3) If the data are annotated by two or more annotators, we can filter out items where they disagree on the user work type or swap pros and cons (for example, one annotator marks a phrase as a pro and another marks it as a con). <br/>
4) After annotation, we can skip items that have no pros and cons. <br/>
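
The first three steps above can be sketched with the standard library alone. Everything here is an illustrative assumption: the helper names, the emoji ranges, the smiley pattern, the similarity cutoff, and the dictionary-based layout of an annotation (keys `prons`, `cons`, `user_work_type`, matching the notebook's column names). The vocabulary and stop-word list would have to be supplied separately for Ukrainian and Russian.

```python
import re
from difflib import get_close_matches

# Rough emoji and pictograph ranges, plus ASCII smileys such as :) or ;-(
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
_SMILEY = re.compile(r"[:;=]-?[)(DPp]+")


def correct_word(word, vocabulary, cutoff=0.8):
    """Step 1: replace an out-of-vocabulary word with its closest entry."""
    if word.lower() in vocabulary:
        return word
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Keep the original word when nothing in the vocabulary is close enough.
    return matches[0] if matches else word


def remove_noise(text, stop_words):
    """Step 2: drop emoji, ASCII smileys and stop words."""
    text = _EMOJI.sub(" ", text)
    text = _SMILEY.sub(" ", text)
    tokens = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(tokens)


def annotators_agree(first, second):
    """Step 3: True if the work type matches and no phrase is a pro for
    one annotator and a con for the other."""
    if first["user_work_type"] != second["user_work_type"]:
        return False
    swapped = (set(first["prons"]) & set(second["cons"])) | (
        set(first["cons"]) & set(second["prons"])
    )
    return not swapped
```

The smiley pattern is deliberately simple and can also hit sequences like ":D" inside ordinary text, so it is worth checking on a sample of reviews before applying it to the whole data set.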