# Request, Response
- Request: the client "requests" something from the server
- Response: the server "responds" to the client's request
![Screenshot%202022-01-19%20at%206.13.27%20PM.png](attachment:Screenshot%202022-01-19%20at%206.13.27%20PM.png)

# Request Modules

- requests is a third-party module (a module that is not part of Python's standard library and has to be installed separately), used here to read static web documents
- The standard library has a similar module, urllib, but requests is simpler and more widely used
- It is normally installed with "pip install requests"
- Documentation: docs.python-requests.org/en/latest


In [4]:
#import the requests module, which sends requests on behalf of the client
import requests

#store the URL to request in a variable
url = "http://naver.com"

#get() -> send a request for the page at this URL to the server and receive the response
#.text -> the body of the response decoded as text
data = requests.get(url)
response = data.text

data 
response[:200]
#2xx -> the request succeeded
#404 -> not found
#Essentially, this code is equivalent to entering "naver.com" in a browser's address bar

'\n<!doctype html>                          <html lang="ko" data-dark="false"> <head> <meta charset="utf-8"> <title>NAVER</title> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewpo'

# Types of response codes:

- 1xx: informational; the request was received and processing continues
- 2xx: success; the request was received and handled correctly
- 3xx: redirection; the requested resource is at a different location
- 4xx: client error (e.g. 404 Not Found)
- 5xx: server error
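
The first digit of a status code is enough to tell which class it belongs to. A minimal sketch of that idea, using only the standard library; `status_class` is a hypothetical helper name, not part of requests:

```python
from http import HTTPStatus

def status_class(code):
    """Return the general meaning of an HTTP status code from its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, status_class(code), HTTPStatus(code).phrase)
```

With a real response, the code to check would be `data.status_code`.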

In [39]:
type(data)

requests.models.Response

In [40]:
#the first 1000 characters of the page source
data.text[:1000]

'\n<!doctype html>                          <html lang="ko" data-dark="false"> <head> <meta charset="utf-8"> <title>NAVER</title> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=1190"> <meta name="apple-mobile-web-app-title" content="NAVER"/> <meta name="robots" content="index,nofollow"/> <meta name="description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta property="og:title" content="네이버"> <meta property="og:url" content="https://www.naver.com/"> <meta property="og:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta property="og:description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta name="twitter:card" content="summary"> <meta name="twitter:title" content=""> <meta name="twitter:url" content="https://www.naver.com/"> <meta name="twitter:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta name="twitter:description" content="네이버

In [41]:
#encoding is the rule for converting text into bytes (and back) so that it can be stored or sent
#different from encrypting, which hides the content
#e.g. speaking in a language that the other person understands 

data.encoding

'UTF-8'
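
To see what the encoding actually does, here is a small sketch that encodes a sample string and decodes it again. The sample text is an assumption chosen for illustration, not taken from the response above:

```python
text = "네이버"  # sample text chosen for this sketch

# Encoding turns characters into bytes; the result depends on the encoding used.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))

# Decoding with the same encoding gives the original text back.
print(utf8_bytes.decode("utf-8") == text)

# Decoding with a different encoding (latin-1 here) gives garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)
```

This is why `data.encoding` matters: `.text` is only readable if the bytes are decoded with the encoding the server used.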

In [42]:
#saving the page source text as an html file
#the file is saved in the same directory as this notebook
html = data.text
open("naver.html", "w", encoding="utf-8").write(html) 
#open(file name, mode ("w" means write), encoding ("utf-8" is one of the common encodings))
#write() returns the number of characters written
#features that load live data will not work in the saved copy, though

184609
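
The one-line `open(...).write(...)` never closes the file explicitly. A common alternative is a `with` block, sketched below under the assumption that a short stand-in string replaces `data.text` and that a temporary folder is used so no real file is touched:

```python
import os
import tempfile

# Stand-in for data.text so the sketch runs without a network connection.
html = "<!doctype html><html><head><title>NAVER</title></head></html>"

# A temporary directory keeps this sketch from touching real files.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "naver.html")

    # "with" closes the file automatically once the block ends.
    with open(path, "w", encoding="utf-8") as f:
        written = f.write(html)  # number of characters written

    # Read the file back to confirm that nothing was lost.
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(written, restored == html)
```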

In [43]:
url = "https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400"

# JSON (JavaScript Object Notation): a text format for exchanging data, structured much like Python's dictionaries and lists
# .json() converts the JSON response into a Python dictionary/list so that it is easy to handle
data = requests.get(url).json()
data

{'results': {'sunrise': '6:40:01 AM',
  'sunset': '6:17:48 PM',
  'solar_noon': '12:28:55 PM',
  'day_length': '11:37:47',
  'civil_twilight_begin': '6:15:31 AM',
  'civil_twilight_end': '6:42:19 PM',
  'nautical_twilight_begin': '5:45:34 AM',
  'nautical_twilight_end': '7:12:16 PM',
  'astronomical_twilight_begin': '5:15:32 AM',
  'astronomical_twilight_end': '7:42:18 PM'},
 'status': 'OK'}
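
What `.json()` does can be tried without a network connection by parsing a JSON string directly. A minimal sketch, assuming a shortened copy of the response above written out as text:

```python
import json

# A shortened copy of the response above, written out as a JSON string.
body = '{"results": {"sunrise": "6:40:01 AM", "sunset": "6:17:48 PM"}, "status": "OK"}'

parsed = json.loads(body)  # JSON text -> Python dictionary
print(type(parsed).__name__)

# Nested values are reached key by key.
print(parsed["results"]["sunrise"])
print(parsed["status"])
```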

# String Formatting
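- Formatting: a basic template is prepared in advance, like a form, and values are filled in
- f-string formatting: put a lowercase f right before the string's opening quote (outside the ""), e.g. f"hello world {num}"
- Inside the string, {variable} is replaced with the value stored in that variable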

In [44]:
a, b, c = 10, 11.5, "abcd"
sample = f"one{a} two{b} three{c}"
sample

'one10 two11.5 threeabcd'
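
The braces accept more than a bare variable name. A short sketch of two standard f-string features, reusing the same sample values:

```python
a, b, c = 10, 11.5, "abcd"

# Any expression can go inside the braces.
total = f"sum: {a + b}"

# A format spec after ":" controls how the value is shown.
two_decimals = f"{b:.2f}"  # two digits after the decimal point
padded = f"{a:05d}"        # padded with zeros to a width of 5

print(total, two_decimals, padded)
```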

In [45]:
date = "2021-01-19"
url = f"https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date={date}"

data = requests.get(url).json()["results"]
data 

{'sunrise': '7:26:10 AM',
 'sunset': '5:30:46 PM',
 'solar_noon': '12:28:28 PM',
 'day_length': '10:04:36',
 'civil_twilight_begin': '6:59:41 AM',
 'civil_twilight_end': '5:57:15 PM',
 'nautical_twilight_begin': '6:28:05 AM',
 'nautical_twilight_end': '6:28:51 PM',
 'astronomical_twilight_begin': '5:57:11 AM',
 'astronomical_twilight_end': '6:59:45 PM'}

In [46]:
def by_date(date):
    url = f"https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date={date}"
    return requests.get(url).json()["results"]

by_date("2021-08-19")

{'sunrise': '5:37:10 AM',
 'sunset': '7:05:20 PM',
 'solar_noon': '12:21:15 PM',
 'day_length': '13:28:10',
 'civil_twilight_begin': '5:11:21 AM',
 'civil_twilight_end': '7:31:09 PM',
 'nautical_twilight_begin': '4:38:45 AM',
 'nautical_twilight_end': '8:03:45 PM',
 'astronomical_twilight_begin': '4:04:33 AM',
 'astronomical_twilight_end': '8:37:57 PM'}
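
The f-string works well for a single changing value. If more of the query changes, one option is to build it from a dictionary with the standard library's `urlencode`. This is a sketch only; `build_url` and `BASE_URL` are hypothetical names, not part of the original function:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.sunrise-sunset.org/json"

def build_url(date, lat="36.7201600", lng="-4.4203400"):
    """Build the request URL for one date (hypothetical helper)."""
    query = urlencode({"lat": lat, "lng": lng, "date": date})
    return f"{BASE_URL}?{query}"

print(build_url("2021-08-19"))
```

The returned URL could then be passed to `requests.get` in the same way as in `by_date`.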

# Data Frame
- Data arranged to fit a fixed frame of rows and columns
- e.g. spreadsheets, Excel files, tables
- Python lists and dictionaries can be combined to represent a data frame
- list: items are stored in order and accessed by index
- dictionary: keys and values are stored together as pairs

In [47]:
#create an empty list
sample_list= []

#dictionaries that all have the same keys ("yo_il" means day of the week)
sample_list.append({"date":"2020-01-01", "yo_il":"Wed"})
sample_list.append({"date":"2020-01-02", "yo_il":"Thu"})
sample_list.append({"date":"2020-01-03", "yo_il":"Fri"})

#a list of dictionaries with the same keys works as a data frame
sample_list

[{'date': '2020-01-01', 'yo_il': 'Wed'},
 {'date': '2020-01-02', 'yo_il': 'Thu'},
 {'date': '2020-01-03', 'yo_il': 'Fri'}]

In [48]:
#pandas is a library for working with data frames in Python
import pandas as pd

#create a data frame from sample_list and store it in the variable df
df = pd.DataFrame(sample_list)
df

#note that each column name (date, yo_il) appears only once, as a header

| | date | yo_il |
|---|---|---|
| 0 | 2020-01-01 | Wed |
| 1 | 2020-01-02 | Thu |
| 2 | 2020-01-03 | Fri |


In [49]:
#saving data frame into a csv file
df.to_csv("sample.csv", index=False)
#index=False essentially excludes the index column
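
The effect of `index` can also be checked without opening the saved file. A minimal sketch, assuming a two-row copy of the sample data; when no file name is given, `to_csv` returns the CSV text instead of writing a file:

```python
import pandas as pd

df = pd.DataFrame([
    {"date": "2020-01-01", "yo_il": "Wed"},
    {"date": "2020-01-02", "yo_il": "Thu"},
])

# With no file name, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv(index=True)
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header row when the index is kept
print(without_index.splitlines()[0])  # header row when the index is dropped
```

With `index=True` the header starts with an empty name for the index column, which is what shows up as an unnamed column when the file is read back.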

### csv file, index = True
![Screenshot%202022-03-06%20at%206.32.19%20PM.png](attachment:Screenshot%202022-03-06%20at%206.32.19%20PM.png)

### csv file, index = False
![Screenshot%202022-03-06%20at%206.32.29%20PM.png](attachment:Screenshot%202022-03-06%20at%206.32.29%20PM.png)

In [50]:
#saving the file on the Desktop
#path = "path of desktop"
#df.to_csv(path + "/" + "sample.csv", index=False)

In [51]:
#loading saved data
df1 = pd.read_csv("sample.csv")
df1 

| | date | yo_il |
|---|---|---|
| 0 | 2020-01-01 | Wed |
| 1 | 2020-01-02 | Thu |
| 2 | 2020-01-03 | Fri |


In [52]:
sample_list = []

sample_list.append(by_date("2020-01-01"))
sample_list.append(by_date("2020-01-02"))
sample_list.append(by_date("2020-01-03"))

sample_list

[{'sunrise': '7:28:37 AM',
  'sunset': '5:13:23 PM',
  'solar_noon': '12:21:00 PM',
  'day_length': '09:44:46',
  'civil_twilight_begin': '7:01:22 AM',
  'civil_twilight_end': '5:40:38 PM',
  'nautical_twilight_begin': '6:29:03 AM',
  'nautical_twilight_end': '6:12:58 PM',
  'astronomical_twilight_begin': '5:57:36 AM',
  'astronomical_twilight_end': '6:44:25 PM'},
 {'sunrise': '7:28:47 AM',
  'sunset': '5:14:10 PM',
  'solar_noon': '12:21:29 PM',
  'day_length': '09:45:23',
  'civil_twilight_begin': '7:01:34 AM',
  'civil_twilight_end': '5:41:24 PM',
  'nautical_twilight_begin': '6:29:16 AM',
  'nautical_twilight_end': '6:13:41 PM',
  'astronomical_twilight_begin': '5:57:50 AM',
  'astronomical_twilight_end': '6:45:07 PM'},
 {'sunrise': '7:28:55 AM',
  'sunset': '5:14:58 PM',
  'solar_noon': '12:21:56 PM',
  'day_length': '09:46:03',
  'civil_twilight_begin': '7:01:43 AM',
  'civil_twilight_end': '5:42:10 PM',
  'nautical_twilight_begin': '6:29:27 AM',
  'nautical_twilight_end': '6:14:

In [53]:
pd.DataFrame(sample_list)

| | sunrise | sunset | solar_noon | day_length | civil_twilight_begin | civil_twilight_end | nautical_twilight_begin | nautical_twilight_end | astronomical_twilight_begin | astronomical_twilight_end |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7:28:37 AM | 5:13:23 PM | 12:21:00 PM | 09:44:46 | 7:01:22 AM | 5:40:38 PM | 6:29:03 AM | 6:12:58 PM | 5:57:36 AM | 6:44:25 PM |
| 1 | 7:28:47 AM | 5:14:10 PM | 12:21:29 PM | 09:45:23 | 7:01:34 AM | 5:41:24 PM | 6:29:16 AM | 6:13:41 PM | 5:57:50 AM | 6:45:07 PM |
| 2 | 7:28:55 AM | 5:14:58 PM | 12:21:56 PM | 09:46:03 | 7:01:43 AM | 5:42:10 PM | 6:29:27 AM | 6:14:26 PM | 5:58:02 AM | 6:45:51 PM |


# Time
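- Spread the collection over a longer time so that it uses fewer of the other server's resources (being considerate of the server)
- Deliberately slow down the execution of the code
- ***In web crawling, the usual programming view that fast code is good code generally does not apply***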

In [35]:
import time

for i in range(5):
    print(i)
    time.sleep(2) #pause for 2 seconds, then continue
    

0
1
2
3
4
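
The same pause can be wrapped around any collection step. A sketch under stated assumptions: `collect_politely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so the example runs without a network connection:

```python
import time

def collect_politely(items, fetch, delay=2):
    """Call fetch(item) for each item, pausing `delay` seconds between calls."""
    results = []
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay)  # wait before every call except the first
        results.append(fetch(item))
    return results

# Stand-in for a real request so the sketch runs without a network connection.
def fake_fetch(date):
    return {"date": date}

print(collect_politely(["2021-01-01", "2021-01-02"], fake_fetch, delay=0.1))
```

Skipping the pause before the first call avoids a wait that protects nothing, since no request has been sent yet.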


In [68]:
for date in pd.date_range("2021-01-01", "2021-01-10"):
    print(date)

2021-01-01 00:00:00
2021-01-02 00:00:00
2021-01-03 00:00:00
2021-01-04 00:00:00
2021-01-05 00:00:00
2021-01-06 00:00:00
2021-01-07 00:00:00
2021-01-08 00:00:00
2021-01-09 00:00:00
2021-01-10 00:00:00


In [69]:
for date in pd.date_range("2021-01-01", "2021-01-10"):
    #date_str is the first 10 characters of the date, which leaves out the time of day
    date_str = str(date)[:10]
    print(date_str)

2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05
2021-01-06
2021-01-07
2021-01-08
2021-01-09
2021-01-10
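
Slicing `str(date)` depends on how a timestamp happens to print. An alternative that states the format explicitly is `strftime`, sketched here with a shorter date range than the one above:

```python
import pandas as pd

dates = pd.date_range("2021-01-01", "2021-01-03")

# strftime spells out the wanted format: 4-digit year, month, day.
date_strs = [date.strftime("%Y-%m-%d") for date in dates]
print(date_strs)
```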


In [None]:
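# Q) 
# Collect the data from 2021-01-01 to 2021-03-01 and save it to the file sample_dates.csv.
# Suppose you are someone who will use this data. From the viewpoints of both the person who orders the collection and the person who is told to collect it,
# consider whether there is any problem with using sample_dates.csv. If there is a problem (or if there is not), what is the reasoning?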

In [71]:
import time
import pandas as pd

sample_list= []

for date in pd.date_range("2021-01-01", "2021-03-01"):
    date_str = str(date)[:10]
    #pass each date in the range above to the by_date function defined earlier,
    #then append the returned data to the list sample_list, which starts out empty
    sample_list.append(by_date(date_str)) 
    time.sleep(2)

pd.DataFrame(sample_list).to_csv("sample_dates.csv", index=False)


In [45]:
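# A) 
# The person who wrote the code can tell that each row is a date, but other people have no information to go on. That is, nothing distinguishes one row from another.
# It would be good to create a new date column so that each row carries its date.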

In [74]:
by_date(date_str)

{'sunrise': '6:46:34 AM',
 'sunset': '6:13:19 PM',
 'solar_noon': '12:29:57 PM',
 'day_length': '11:26:45',
 'civil_twilight_begin': '6:21:58 AM',
 'civil_twilight_end': '6:37:55 PM',
 'nautical_twilight_begin': '5:51:59 AM',
 'nautical_twilight_end': '7:07:54 PM',
 'astronomical_twilight_begin': '5:22:01 AM',
 'astronomical_twilight_end': '7:37:52 PM'}

In [76]:
def by_date_2(date):
    url = f"https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date={date}"
    sun_data = requests.get(url).json()["results"]
    sun_data["date"] = date
    return sun_data

#check what the newly built function returns
by_date_2("2021-01-01")

{'sunrise': '7:28:45 AM',
 'sunset': '5:13:58 PM',
 'solar_noon': '12:21:22 PM',
 'day_length': '09:45:13',
 'civil_twilight_begin': '7:01:31 AM',
 'civil_twilight_end': '5:41:12 PM',
 'nautical_twilight_begin': '6:29:13 AM',
 'nautical_twilight_end': '6:13:31 PM',
 'astronomical_twilight_begin': '5:57:46 AM',
 'astronomical_twilight_end': '6:44:57 PM',
 'date': '2021-01-01'}

In [78]:
sample_list= []

for date in pd.date_range("2021-01-01", "2021-03-01"):
    date_str = str(date)[:10]
    sample_list.append(by_date_2(date_str))
    time.sleep(1)
    
pd.DataFrame(sample_list).to_csv("sample_dates.csv", index=False)
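
Because `date` is added to the dictionary last, it ends up as the last column. If it should come first so that each row is easy to identify, the columns can be reordered before saving. A sketch, assuming two shortened rows built from values shown earlier in place of the full results:

```python
import pandas as pd

# Shortened rows using values from the outputs above, with "date" added last.
rows = [
    {"sunrise": "7:28:37 AM", "sunset": "5:13:23 PM", "date": "2020-01-01"},
    {"sunrise": "7:28:47 AM", "sunset": "5:14:10 PM", "date": "2020-01-02"},
]
df = pd.DataFrame(rows)

# Put "date" first and keep the remaining columns in their original order.
ordered = df[["date"] + [col for col in df.columns if col != "date"]]
print(list(ordered.columns))
```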

# JSON, HTML, XML
### JSON (JavaScript Object Notation)
- Stores and transfers data as key: value pairs
- The notation used to create objects in JavaScript
- JSON is easy for both people and machines to read, and it is compact


### HTML (Hypertext Markup Language)
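- The agreed-upon format for the information a website displays on screen
- Built from tags, which are arranged in a hierarchical structure
- Hypertext: text that goes beyond plain text by including things such as links
- Markup: a convention that uses tags written in angle brackets (<, >)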
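
To make the tag structure concrete, here is a minimal sketch that pulls the text out of a `<title>` tag with the standard library's `html.parser`. `TitleParser` is a hypothetical class name, and the snippet is a shortened stand-in for the page source saved earlier:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title += data

snippet = '<!doctype html><html lang="ko"><head><meta charset="utf-8"><title>NAVER</title></head></html>'
parser = TitleParser()
parser.feed(snippet)
print(parser.title)
```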

### XML (eXtensible Markup Language)
- A general-purpose markup language created to overcome the limitations of HTML
- Unlike HTML, XML was designed to store and transfer data, not to display it
- XML tags are not predefined like HTML tags; users define their own


### Formats that can be handled like a dictionary
- JSON, XML

### Formats that can be handled in ways other than as a dictionary
- XML, HTML
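
A small sketch of the difference, using the standard library only. The JSON and XML strings are made up for illustration and hold the same record in both formats:

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record written in both formats.
json_text = '{"date": "2020-01-01", "yo_il": "Wed"}'
xml_text = "<day><date>2020-01-01</date><yo_il>Wed</yo_il></day>"

# JSON becomes a dictionary, so values are looked up by key.
record = json.loads(json_text)
print(record["yo_il"])

# XML becomes a tree of elements, searched by tag name.
root = ET.fromstring(xml_text)
print(root.find("yo_il").text)

# When the structure is this simple, the tree can be flattened into a dictionary.
as_dict = {child.tag: child.text for child in root}
print(as_dict == record)
```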