## <center> CSE STACK EXCHANGE - Data collection by scraping </center>

#### Scraping with Scrapy

In [None]:
!pip install scrapy

Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 Twisted-21.7.0 constantly-15.1.0 cryptography-36.0.1 cssselect-1.1.0 h2-3.2.0 hpack-3.0.0 hyperframe-5.2.0 hyperlink-21.0.0 incremental-21.3.0 itemadapter-0.4.0 itemloaders-1.0.4 jmespath-0.10.0 parsel-1.6.0 priority-1.3.0 protego-0.1.16 pyOpenSSL-21.0.0 queuelib-1.6.2 scrapy-2.5.1 service-identity-21.1.0 w3lib-1.22.0 zope.interface-5.4.0


In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

##### Creating the spider

In [None]:
class SExchangeSpider(scrapy.Spider):
    
    name = "SExchangeSpider"
    
    start_urls = ['https://cseducators.stackexchange.com/questions/0']

    def parse(self, response):

        # response.status holds the HTTP status code of the response
        print("Response status: ", response.status)
        print("Response content: ", response.text)

        content = "".join(response.xpath('//div[@class="postcell post-layout--right"]/div[@class="s-prose js-post-body"]/p/text()').extract())
        print("Question content: ", content)

##### Running the spider

In [None]:
process = CrawlerProcess()
process.crawl(SExchangeSpider)
process.start()



> Features obtained by scraping with Scrapy



In [1]:
import pandas as pd
scrapy_features_scrapy = pd.DataFrame({"Question does not exist": [0, 404, 'unknown', 'unknown', 'page does not exist'],
                   "Question was deleted": [1, 404, 'unknown', 'unknown', 'URL redirect - the question existed'],
                   "Question exists": [2, 200, 'full HTML page', 'question content retrieved', 'question data available']},
                  ["id", "response.status_code", "response.text", "question_content", "conclusion"])

In [2]:
scrapy_features_scrapy.rename_axis("features", axis="columns")

features,Question does not exist,Question was deleted,Question exists
id,0,1,2
response.status_code,404,404,200
response.text,unknown,unknown,full HTML page
question_content,unknown,unknown,question content retrieved
conclusion,page does not exist,URL redirect - the question existed,question data available


### Second approach: scraping with requests and Beautiful Soup

#### Sending the request

In [None]:
import requests

response = requests.get('https://cseducators.stackexchange.com/questions/2')

#### Scraping with Beautiful Soup

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text)

print("Response status: ", response.status_code)
#print("Response content: ", response.text)

content = "".join([p.text for p in soup.find('div', class_="s-prose js-post-body").find_all('p')])
print("Question content: ", content)

Response status:  200
Question content:  Grading currently either takes me a huge amount of time, or gets done in an extremely cursory way.  It occurs to me that, if my students were able to submit code into some sort of autotester, they would find many of their errors before I review their code.  This would mean that I would only have to look for clarity issues, not logic bugs.I am interested in software that (a) isn't too hard to set up, (b) will allow my students to test whether their code is (at least) to spec, and (c) allows me to quickly cycle through the code from all of the students in a section to look for style issues.I have looked far and wide, and have not yet found a reasonable system that accomplishes these three goals.  Many universities seem to have robust systems, but they all appear to be home-grown, and unavailable to outsiders.  Is anyone aware of such a system?  I honestly don't even care if it costs money - I just need a solution that makes this task more reasonabl
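
In the output above, sentences from neighbouring paragraphs run together (for example `bugs.I am interested`), because the paragraph texts are joined with an empty string. A minimal sketch of the difference, using a made-up list of shortened paragraph strings in place of the `<p>` elements:

```python
# Illustrative paragraph texts standing in for the <p> elements of a post body.
paragraphs = [
    "This would mean that I would only have to look for clarity issues, not logic bugs.",
    "I am interested in software that isn't too hard to set up.",
]

fused = "".join(paragraphs)    # same joining as in the cell above
spaced = " ".join(paragraphs)  # keeps a space at each paragraph boundary

print(fused)
print(spaced)
```

Whether the space matters depends on what the text is used for later; for word-level analysis the fused form would merge `bugs.I` into one token.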



> Features obtained by scraping with Beautiful Soup



In [None]:
scrapy_features_bs = pd.DataFrame({"Question does not exist": [0, 404, 'full HTML page', 'unknown', 'page does not exist'],
                   "Question was deleted": [1, 404, 'full HTML page', 'unknown (the question title is available)', 'URL redirect - the question existed'],
                   "Question exists": [2, 200, 'full HTML page', 'question content retrieved', 'question data available']},
                  ["id", "response.status_code", "response.text", "question_content", "conclusion"])

In [None]:
scrapy_features_bs.rename_axis("features", axis="columns")

features,Question does not exist,Question was deleted,Question exists
id,0,1,2
response.status_code,404,404,200
response.text,full HTML page,full HTML page,full HTML page
question_content,unknown,unknown (the question title is available),question content retrieved
conclusion,page does not exist,URL redirect - the question existed,question data available




> **Decision**: requests with Beautiful Soup will be used, because the *response* object contains the full HTML page even when *status_code* is 404. Data can therefore be extracted whether the question exists, was deleted, or never existed. Scrapy, as configured above, does not hand the HTML page to the spider when *status_code* is 404. ❗



#### Fetching the data

In [3]:
# libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re

In [4]:
clean = re.compile('<.*?>')
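
The pattern `<.*?>` matches one HTML tag at a time: the `?` makes `.*` non-greedy, so each match stops at the first `>`. A small sketch of how `re.sub` uses it to strip tags; the sample HTML string is made up for illustration:

```python
import re

clean = re.compile('<.*?>')

# Made-up HTML fragment, not taken from the site.
sample = '<p>Page <b>not</b> found</p>'

stripped = re.sub(clean, '', sample)
print(stripped)  # Page not found

# A greedy pattern would swallow everything from the first "<" to the last ">".
greedy = re.sub(re.compile('<.*>'), '', sample)
print(repr(greedy))  # ''
```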

> Total number of questions on the CSE Stack Exchange site

In [5]:
response = requests.get("https://cseducators.stackexchange.com/questions/")
soup = BeautifulSoup(response.text)
number_of_questions = soup.find("div",class_="fs-body3 flex--item fl1 mr12 sm:mr0 sm:mb12").text.strip().split(" ")[0]
print("Total number of questions: ", number_of_questions)

Total number of questions:  1,038
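
The count is scraped as text, so it prints with a thousands separator (`1,038`). If it is needed as a number later, one way to convert it is sketched below, assuming a comma is the only separator used; `question_total` is a name introduced here for illustration:

```python
number_of_questions = "1,038"  # the value printed above

# Assumption: "," is the only thousands separator in the scraped text.
question_total = int(number_of_questions.replace(",", ""))
print(question_total)  # 1038
```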





> ID of the last (most recent) question on the site



In [6]:
last_question_id = soup.find("div","question-summary").get("id")[-4:]
print("ID of the last question:", last_question_id)

ID of the last question: 7248
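
The slice `[-4:]` above works only while question IDs have exactly four digits. A sketch of a length-independent alternative, assuming the element's `id` attribute ends in the question number (as in `question-summary-7248`); `trailing_id` is a hypothetical helper, not part of the notebook:

```python
import re

def trailing_id(element_id: str) -> int:
    """Return the number at the end of an id such as 'question-summary-7248'."""
    match = re.search(r"(\d+)$", element_id)
    if match is None:
        raise ValueError(f"no trailing number in {element_id!r}")
    return int(match.group(1))

print(trailing_id("question-summary-7248"))   # 7248
print(trailing_id("question-summary-12345"))  # 12345, where [-4:] would give '2345'
```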


> Data classification



In [None]:
question_classification = pd.DataFrame({"Nonexistent question": [0, 'https://cseducators.stackexchange.com/questions/0', 'https://cseducators.stackexchange.com/questions/0', '404', 'Page does not exist.'],
                   "Deleted question": [1, 'https://cseducators.stackexchange.com/questions/1', 'https://cseducators.stackexchange.com/questions/1/what-language-should-be-first-used-to-introduce-coding', '404', 'Notice that the question was deleted.'],
                   "Existing question": [2, 'https://cseducators.stackexchange.com/questions/2', 'https://cseducators.stackexchange.com/questions/2', '200', 'Question data.'],
                   "Answer": [10, 'https://cseducators.stackexchange.com/questions/10', 'https://cseducators.stackexchange.com/questions/3/is-it-possible-to-ensure-division-of-labor-on-a-group-assignment/10#10', '200', 'Answer data.'],
                   "Tag": [15, 'https://cseducators.stackexchange.com/questions/15', 'https://cseducators.stackexchange.com/tags/java/info', '200', 'Tag data.'],
                   "Redirect to an existing question": [88, 'https://cseducators.stackexchange.com/questions/88', 'https://cseducators.stackexchange.com/questions/69/how-can-i-integrate-teaching-source-code-control-git-mercurial-etc-into-my-int/88#88', '200', 'Data of the existing question.'],
                   "Redirect to a deleted question": [3400, 'https://cseducators.stackexchange.com/questions/3400', 'https://cseducators.stackexchange.com/questions/3397/how-to-setup-python-to-repr-string-literals-with-double-quotes-when-possible/3400', '404', 'Page does not exist.']},
                  ["id", "link", "get_link", "status_code", "message"])

In [None]:
question_classification

Unnamed: 0,Nonexistent question,Deleted question,Existing question,Answer,Tag,Redirect to an existing question,Redirect to a deleted question
id,0,1,2,10,15,88,3400
link,https://cseducators.stackexchange.com/questions/0,https://cseducators.stackexchange.com/questions/1,https://cseducators.stackexchange.com/questions/2,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/question...
get_link,https://cseducators.stackexchange.com/questions/0,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/questions/2,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/tags/jav...,https://cseducators.stackexchange.com/question...,https://cseducators.stackexchange.com/question...
status_code,404,404,200,200,200,200,404
message,Page does not exist.,Notice that the question was deleted.,Question data.,Answer data.,Tag data.,Data of the existing question.,Page does not exist.
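
Most of the cases in the table can be told apart from the requested URL, the URL the page reports for itself, and the status code alone. The following is a rough sketch of that decision logic; `classify` and its category labels are hypothetical names, and the last case deliberately stays ambiguous because separating an answer from a redirect to another question requires looking into the page body.

```python
BASE = "https://cseducators.stackexchange.com/questions/"

def classify(requested_id: int, get_link: str, status_code: int) -> str:
    """Classify a response using only the URLs and the HTTP status code."""
    link = BASE + str(requested_id)
    if status_code == 404:
        if link == get_link:
            return "nonexistent question"
        # A reported URL that ends in the requested ID points into a deleted question.
        if get_link.rstrip("/").split("/")[-1] == str(requested_id):
            return "redirect to a deleted question"
        return "deleted question"
    if "/tags/" in get_link:
        return "tag"
    if link in get_link:
        return "existing question"
    # The URLs alone cannot separate these two cases.
    return "answer or redirect to an existing question"

print(classify(0, BASE + "0", 404))
print(classify(1, BASE + "1/what-language-should-be-first-used-to-introduce-coding", 404))
print(classify(15, "https://cseducators.stackexchange.com/tags/java/info", 200))
```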


In [None]:
list_wiki = []
list_answers = []
list_questions = []

> ❗ The focus is on fetching **all questions**, so the IDs run from 0 up to and including last_question_id, that is, `range(0, last_question_id + 1)`

> The plan is to extend the program later so that it fetches all answers, questions, tags, and more. 💡

In [None]:
#Request every ID and store the data in the matching list (tag wiki pages, answers, questions)

for i in range(0, int(last_question_id)+1):
    link = ('https://cseducators.stackexchange.com/questions/'+str(i))
    response = requests.get(link)
    soup = BeautifulSoup(response.text)

    get_link = soup.find("meta", property="og:url")["content"]
    tab_title = soup.find("title").text

    print("Status code:", response.status_code)

    question_existed = False

    #Case: the page does not exist
    if (response.status_code == 404):
      
      #Case: the ID does not exist
      if (link == get_link):
        page_answer = re.sub(clean, '', str(soup.find("div",class_="fs-subheading mb24").find_all('p')[0]).replace("<p>","").replace("</p>",""))
        print("ID of the nonexistent question:",i,"\nPage link:", get_link,"\nPage response:", page_answer,"\n")

      #Case: the question was deleted
      else:
        question_existed = True
        title = (get_link.split("/")[-1].replace("-"," ")).capitalize()
        
        #A URL whose last segment is the requested ID (no title slug) is a redirect to a deleted question
        if (get_link.split("/")[-1] != str(i)):
        
          info = " ".join(str(soup.find("div",class_="fs-subheading mb24").find_all('p')[0]).replace("<p>","").replace("</p>","").split(" ")[:3])
          info_comment = (soup.find("span",class_="revision-comment").text)
          print("ID of the deleted question:",i,"\nPage link:",get_link,"\nTitle of the deleted question:",title,"\nNotice about the question:",info,info_comment,"\nQuestion existed: ", question_existed, "\n")
          
          question = [i, title, "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", question_existed]
          list_questions.append(question)

        else:
          print("Id:", i, "\nRedirect to a question that has already been deleted.")

    #Case: the page exists
    else:

      #Case: the returned page has a different ID than the one requested
      if(link not in get_link):

        #Case: the fetched page is a tag wiki page
        if ("/tags/" in str(get_link)):
          print("Tag wiki ID: ", i)

          title = (tab_title.split("'")[1]+ " "+tab_title.split("'")[2].split(" ")[1]+" "+tab_title.split("'")[2].split(" ")[2]).capitalize()
          info = str("".join([p.text for p in soup.find("div",class_="welovestackoverflow").find_all('p')])).strip()
          wiki = [i,title, info]

          #Collect the titles already stored and check whether the current title is among them
          wiki_titles = [w[1] for w in list_wiki]

          if (wiki[1] not in wiki_titles):
            list_wiki.append(wiki)

        else:
          #Case: the page is an answer to a question
          print("Answer ID:", i)
          try:
            #Look for the answer with the requested ID in the returned page, then fetch its data
            page_answer = soup.find("div", {"data-answerid":i})
            #Question ID
            id_question = get_link.split('/')[4]

            #Try to fetch the answer title (the bold part of the text); if there is none, store "None"
            try:
              title = str(page_answer.find("div",class_="s-prose js-post-body").find("p").find("strong")).replace("<strong>","").replace("</strong>","")
            except AttributeError:
              title = "None"
            
            content = "".join([i.text for i in soup.find('div', {"id": "answer-"+str(i)}).find_all('div', class_="s-prose js-post-body")]).replace("\n", " ")

            #Try to fetch the user ID (the answer's author); if there is none, store "None"
            try:
              user_id = page_answer.find("div",{"itemprop":"author"}).find("a")['href'].split('/')[2]
            except TypeError:
              user_id = "None"
            
            #Date the answer was created
            date = str(page_answer.find("span",class_="relativetime")["title"]).replace("Z","")
            
            #Number of votes on the answer
            try:
              votes = page_answer.find("div", class_="js-vote-count flex--item d-flex fd-column ai-center fc-black-500 fs-title")["data-value"]
            except TypeError:
              votes = page_answer.find("div",class_="js-vote-count fs-title lh-md mb8 ta-center")["data-value"]

            answer = [i, title, content, user_id, date, votes, id_question]
            list_answers.append(answer)

          #Case: the requested ID leads to a question that exists under a different ID
          except AttributeError:
            existing_id = get_link.split("/")[4]
            print("Requested ID", i ,"already belongs to the question with ID", existing_id)

      #Case: the page is a question with all of its data
      else:
        print("Question ID:", i)
        
        title = "".join([i.text for i in soup.find('h1', class_="fs-headline1 ow-break-word mb8 flex--item fl1").find_all('a')])
        content = "".join([i.text for i in soup.find('div', class_="s-prose js-post-body").find_all('p')])
        created_at = ("".join([i['datetime'] for i in soup.find('div', class_="flex--item ws-nowrap mr16 mb8").find_all('time')])).replace("T"," ")

        #Fetch the user ID; if there is none, store "None"
        try:
          user_id = ("".join([i['href'] for i in soup.find('div', class_="post-signature owner flex--item").find(class_="user-details").find_all('a')])).split("/")[2]
        except IndexError:
          user_id = "None"
        views_count = ("".join([i for i in soup.find('div', class_="d-flex fw-wrap pb8 mb16 bb bc-black-075").find('div',class_="flex--item ws-nowrap mb8")['title']])).split(" ")[1]
        
        try:
          votes_count = "".join([i for i in soup.find('div', class_="js-vote-count flex--item d-flex fd-column ai-center fc-black-500 fs-title")['data-value']])
        except TypeError:
          votes_count = "".join([i for i in soup.find('div', class_="js-vote-count fs-title lh-md mb8 ta-center")['data-value']])

        #Fetch the bookmark count; if no visible counter is found, store 0
        try:
          bookmarks_count = "".join([i for i in soup.find('div', class_="js-bookmark-count mt4")['data-value']])
        except TypeError:
          bookmarks_count = 0

        answers_count = "".join([i['data-answercount'] for i in soup.find('div', class_="answers-subheader d-flex ai-center mb8").find_all('h2')])
        tags = ("".join([(i.text+"\t") for i in soup.find('div', class_="d-flex ps-relative fw-wrap").find_all('a')])).split("\t")[:-1]
        comment_count = len(([i for i in soup.find("div",class_="post-layout").find("ul",class_="comments-list js-comments-list").find_all("li")]))+int([soup.find("div",class_="post-layout").find('ul',class_="comments-list js-comments-list")["data-remaining-comments-count"]][0])
        is_answered = int(answers_count) > 0
        #question_existed is still False here: the flag is True only for deleted questions
        question = [i, title, content, created_at, user_id, is_answered, views_count, votes_count, bookmarks_count, answers_count, tags, comment_count, question_existed]
        list_questions.append(question)

    if (i!=0 and i % 20 == 0):
      time.sleep(45)
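
The loop pauses for 45 seconds after every 20th ID, presumably to avoid being throttled by the site (the reason is an assumption, the notebook does not state it). The pacing can be kept separate from the parsing logic. Below is a sketch with hypothetical names (`paced_ids`, `batch_size`, `pause_seconds`) and an injectable `sleep`, so it can be tried without actually waiting:

```python
import time
from typing import Callable, Iterator

def paced_ids(last_id: int, batch_size: int = 20, pause_seconds: float = 45,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[int]:
    """Yield 0..last_id, pausing after every batch_size-th ID (except 0)."""
    for i in range(0, last_id + 1):
        yield i
        if i != 0 and i % batch_size == 0:
            sleep(pause_seconds)

# Dry run with a fake sleep that only records the requested pauses.
pauses = []
ids = list(paced_ids(45, sleep=pauses.append))
print(len(ids), pauses)  # 46 [45, 45]
```

With the default `sleep`, a loop written as `for i in paced_ids(int(last_question_id)):` would follow the same schedule as the cell above.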

In [None]:
print("Length of the tag list:", len(list_wiki)) 
print("Length of the question list:", len(list_questions)) 
print("Length of the answer list:", len(list_answers))

Length of the tag list: 204
Length of the question list: 1332
Length of the answer list: 4192


In [None]:
#Store the lists in DataFrames
data_wiki = pd.DataFrame(list_wiki, columns = ['Tag_ID','Title', 'Info'])
data_questions = pd.DataFrame(list_questions, columns=['Question_ID','Title','Content','Created at','User_ID','Is answered','Views count','Votes count','Bookmarks count','Answers count','Tags','Comments count', "Question existed"])
data_answers = pd.DataFrame(list_answers, columns = ['Answer_ID','Title', 'Content','User_ID','Date','Votes','Question_ID'])

In [None]:
#"Question existed" is False for questions that are still live, i.e. the ones actually fetched
print("Total number of fetched questions:", data_questions.loc[~data_questions['Question existed']].shape[0])

Total number of fetched questions: 1036


> Fetched data shown as DataFrames

In [None]:
data_wiki.tail()

Unnamed: 0,Tag_ID,Title,Info
199,6345,Programming tag wiki,For questions about teaching programming (as o...
200,6479,Twos-complement tag wiki,Concerning the two's complement representation...
201,6677,Classroom-environment tag wiki,The Classroom Environment pertains to things l...
202,6893,Course-design tag wiki,For questions related to the design of a singl...
203,6917,Mentoring tag wiki,Questions related to mentoring students studyi...


In [None]:
data_wiki.shape

(204, 3)

In [None]:
data_questions.tail()

Unnamed: 0,Question_ID,Title,Content,Created at,User_ID,Is answered,Views count,Votes count,Bookmarks count,Answers count,Tags,Comments count,Question existed
1327,7226,Driven to Abstraction,One recurring discussion I have on this site w...,2022-01-10 11:21:38,5349,True,401,5,2,8,"[introductory-lesson, computational-thinking, ...",15,False
1328,7230,Any technology for full body video with slides...,Given the surge of COVID cases many of us have...,2022-01-11 14:33:01,6410,True,36,1,0,1,"[lecture-tools, information-technology]",2,False
1329,7234,Grading programming exercises: the quality vs....,BACKGROUND: I teach a C++ course with 300 stud...,2022-01-11 16:15:57,1873,True,116,5,0,2,"[grading, homework]",10,False
1330,7239,Do you include coding assignments in an intro ...,In an introductory course on complexity and co...,2022-01-15 13:20:36,11545,True,936,7,0,4,"[lesson-ideas, resource-request, undergraduate...",0,False
1331,7248,Why do upgrading a web page causes interrupt t...,\nWant to improve this question? Update the qu...,2022-01-18 15:25:04,11557,False,13,-2,0,0,[web-development],2,False


In [None]:
data_questions.shape

(1332, 13)

In [None]:
data_answers.tail()

Unnamed: 0,Answer_ID,Title,Content,User_ID,Date,Votes,Question_ID
4187,7243,,We have before us a question that is apparent...,5349,2022-01-16 22:44:05,1,7226
4188,7244,"Yes - things like sorts, regexes, and virtuali...","Yes - things like sorts, regexes, and virtual...",11549,2022-01-16 23:54:09,1,7239
4189,7245,,"I teach year 8, some of this: Boolean logic, ...",204,2022-01-17 09:42:00,2,7226
4190,7246,,Why not teach people how computers work righ...,2164,2022-01-17 13:47:57,4,7226
4191,7247,,One term that I don't see in the answers give...,6346,2022-01-17 19:02:42,0,3590


In [None]:
data_answers.shape

(4192, 7)

> Saving the fetched data to CSV files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Build the path on Drive, including the name of the file to create, and save the DataFrame contents into it
path_w = '/content/drive/My Drive/data_wiki.csv'
with open(path_w,'w', encoding = 'utf-8-sig') as f:
  data_wiki.to_csv(f)

path_q = '/content/drive/My Drive/data_questions.csv'
with open(path_q,'w', encoding = 'utf-8-sig') as f:
  data_questions.to_csv(f)
  
path_a = '/content/drive/My Drive/data_answers.csv'
with open(path_a,'w', encoding = 'utf-8-sig') as f:
  data_answers.to_csv(f)
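
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which helps some spreadsheet programs, Excel in particular, recognize the file as UTF-8. A standard-library sketch that makes the BOM visible; the temporary file and the sample rows are illustrative only:

```python
import codecs
import csv
import os
import tempfile

# Write a tiny CSV with the same encoding as above, into a throwaway file.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows([["Title"], ["Sadržaj pitanja"]])

with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

has_bom = raw.startswith(codecs.BOM_UTF8)
print(has_bom)  # True

# Decoding with utf-8-sig strips the BOM again.
text = raw.decode("utf-8-sig")
print(text.splitlines()[0])  # Title
```

When reading the files back, passing the same `encoding='utf-8-sig'` keeps the BOM out of the first column name.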