# Web Scraping? Crawling?


- Retrieving data stored on a web server is called web scraping or crawling.
<br><br>
- The two terms differ subtly. Web scraping means extracting information from a single site, while crawling means collecting all the needed information from an unspecified number of web pages.

## 1. How to get Web-Info

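- Python crawling modules fall broadly into two types.
<br><br>
- The first type sends a request to the web server and extracts the needed information from the HTML or XML downloaded into memory. The second type fetches information from pages written in JavaScript, which must be rendered before the data is available.

    **(1) BeautifulSoup**
        * Pros : easy, simple, and fast (when combined with multiprocessing or multithreading)
        * Cons : hard to crawl sites that require JavaScript rendering.
        
    **(2) Selenium**
        * Pros : can retrieve everything the user sees on the page; supports JavaScript rendering; intuitive to use
        * Cons : slow; uses a lot of memory.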

### (1) Guide for BeautifulSoup

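1. Take the URL of the website that holds the information you want; `requests` downloads its HTML, which is then handed to BeautifulSoup.
2. Parse the HTML that was downloaded into memory from that URL.
3. From the parsed HTML, select the elements (head, body, title, and so on) that contain the needed information.
4. How the data is extracted depends on its type (text, table, style, and so on).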
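
The parse-and-select steps above can be tried offline before touching a live site. The sketch below is a minimal example under stated assumptions: the HTML string is hypothetical and only mimics the `#search-results-articles > ul > li > h3 > a` structure used later in this notebook, and the `example.com` links are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the structure of a search-result page.
html = """
<div id="search-results-articles">
  <ul>
    <li><h3><a href="https://example.com/a.html">First title</a></h3></li>
    <li><h3><a href="https://example.com/b.html">Second title</a></h3></li>
  </ul>
</div>
"""

# Step 2: parse the HTML held in memory.
soup = BeautifulSoup(html, "html.parser")

# Step 3: select the elements that carry the information.
links = soup.select("#search-results-articles > ul > li > h3 > a")

# Step 4: text and attributes are read in different ways.
articles = {a.get_text(strip=True): a["href"] for a in links}
print(articles)
```

Working on a saved or hand-written snippet like this makes it easier to debug a selector without sending repeated requests to the server.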

### (2) Library for Crawling

------ Required ------
- requests : reads the HTML from a URL.
- BeautifulSoup : parses the HTML.

------ Optional ------
- numpy : module for array and matrix computation
- pandas : module for working with DataFrames (Excel, CSV, and so on)

### (3) Step summary
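1. Identify the URL pattern (when fetching data from multiple pages)
2. Read the HTML returned by each URL
3. Select and extract the needed information from the HTML
4. Convert the extracted information into a DataFrame
5. Troubleshooting
6. Repeat the steps above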
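
Step 1 can be sketched as a small helper that generates the page URLs up front. The `Page-{n}.html` pattern is the one used in Exercise_2 of this notebook; the helper name `page_urls` is hypothetical.

```python
# Base of the search-result URL used in this notebook.
BASE = "https://oilprice.com/search/tab/articles/natural_gas_price"


def page_urls(first, last):
    """Return the search-result URLs for pages first..last (inclusive)."""
    return [f"{BASE}/Page-{n}.html" for n in range(first, last + 1)]


urls = page_urls(1, 5)
for u in urls:
    print(u)
```

Building the list first makes the page range explicit, so it is easy to check that the loop covers exactly the pages intended.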

## 2. Step-by-Step implementation of crawling

In [None]:
from bs4 import BeautifulSoup # html parsing
import requests # url -> get html
# import time
# import datetime
# import tqdm
# import pandas as pd
# import re

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Exercise_1 Fetching titles from a single page

### Workaround when a request fails -> specify headers

In [None]:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

In [None]:
url="https://oilprice.com/search/tab/articles/natural_gas_price"

In [None]:
req = requests.get(url, headers=headers)
req.content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<title>&quot;Natural Gas Price&quot; Articles</title>\n\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n\t<meta name="description" content=""/>\n\t<meta name="msvalidate.01" content="D14A5D573CE72469797ECB50683F1795"/>\n\t<meta name="robots" content="noindex,nofollow"/>\n\t\t\n\t\n\t<meta name="globalsign-domain-verification" content="i17dx3lHCudSTzNis2zK4tdE1MxllC1mnwnJlWeumr"/>\n\t<meta name="google-site-verification" content="GF7StMp03Fok8YcW3aFydFYdEc-SGzGZmsm8tg3zAXU"/>\n\t<meta name="google-site-verification" content="GmPOc04rwBxPbdPVD-Xv9q4TXrA_Gm2TmGHhT-j5eXM"/>\n\n\t<meta name="twitter:card" content="summary"/>\n\t<meta name="twitter:site" content="@oilandenergy"/>\n\t<meta name="twitter:title" content="&quot;Natural Gas Price&quot; Articles"/>\n\t<meta name="twitter:description" content=""/>\n\t<meta name="twitter:url" content="https://oi

In [None]:
soup=BeautifulSoup(req.content , 'html.parser')

In [None]:
soup.select('#search-results-articles > ul > li > h3 > a')[0]['href']

'https://oilprice.com/Energy/Energy-General/The-UAE-Is-Poised-To-Become-The-Next-Middle-East-Energy-Giant.html'

In [None]:
soup.select('#search-results-articles > ul > li > h3 > a')[0].text

'The UAE Is Poised To Become The Next Middle East Energy Giant'

In [None]:
x=soup.select('#search-results-articles > ul > li > h3 > a')

In [None]:
x[0].text

'The UAE Is Poised To Become The Next Middle East Energy Giant'

In [None]:
x[0]['href']

'https://oilprice.com/Energy/Energy-General/The-UAE-Is-Poised-To-Become-The-Next-Middle-East-Energy-Giant.html'

In [None]:
title_list = []
for x in soup.select('#search-results-articles > ul > li > h3 > a'):
    title_list.append(x.text)

### Exercise_2 Fetching titles from pages 1 ~ 5

In [None]:
article_info={}
# range(1, 6) would cover pages 1 ~ 5; this run crawls pages 1 ~ 539.
for i in range(1,540):
    page_url='https://oilprice.com/search/tab/articles/natural_gas_price/Page-{}.html'.format(i)
    req=requests.get(page_url, headers=headers)
    soup=BeautifulSoup(req.content, 'html.parser' )
    for x in soup.select('#search-results-articles > ul > li > h3 > a'):
        # Map each article title to its link.
        article_info[x.text]=x['href']
    

In [None]:
article_info

{'The UAE Is Poised To Become The Next Middle East Energy Giant': 'https://oilprice.com/Energy/Energy-General/The-UAE-Is-Poised-To-Become-The-Next-Middle-East-Energy-Giant.html',
 'The Renewable Energy Revolution Has A Major Employment Problem': 'https://oilprice.com/Energy/Energy-General/The-Renewable-Energy-Revolution-Has-A-Major-Employment-Problem.html',
 'Energy Transition Forces LNG Industry To Cut Emissions': 'https://oilprice.com/Energy/Energy-General/Energy-Transition-Forces-LNG-Industry-To-Cut-Emissions.html',
 'Solving Nigeria’s Gasoline Crisis': 'https://oilprice.com/Energy/Oil-Prices/Solving-Nigerias-Gasoline-Crisis.html',
 'China’s Pivot To Gas Is Fueling Support In LNG Demand': 'https://oilprice.com/Energy/Natural-Gas/Chinas-Pivot-To-Gas-Is-Fueling-Support-In-LNG-Demand.html',
 'IEA Tells OPEC To “Open The Taps”': 'https://oilprice.com/Energy/Energy-General/IEA-Tells-OPEC-To-Open-The-Taps.html',
 'Turkey Makes Moves To Become An Energy Hub': 'https://oilprice.com/Energy/N

In [None]:
sample_url = list(article_info.values())[0]

In [None]:
req = requests.get(sample_url)
soup = BeautifulSoup(req.content, 'html.parser')

In [None]:
soup.select('span.article_byline')[0].text

'By Felicity Bradstock - Jun 14, 2021, 3:00 PM CDT'

In [None]:
article_content={}
for title, url in article_info.items():
    req=requests.get(url)
    soup=BeautifulSoup(req.content, 'html.parser')
    
    contributor=soup.select('span.article_byline')[0].text
    text=soup.select('#article-content')[0].text

    try:
        article_content['title'].append(title)
        article_content['contributor'].append(contributor)
        article_content['text'].append(text)

    except:
        article_content['title']=[title]
        article_content['contributor']=[contributor]
        article_content['text']=[text]



IndexError: ignored

In [None]:
article_content = {'title': [], 'contributor': [], 'text': []}
for title, url in article_info.items():
  req = requests.get(url, headers=headers)
  soup = BeautifulSoup(req.content, 'html.parser')

  byline = soup.select('span.article_byline')
  body = soup.select('#article-content')
  # Indexing [0] on an empty result raises IndexError (the likely cause of
  # the error above), so skip pages that lack a byline or a body.
  if not byline or not body:
    continue

  article_content['title'].append(title)
  article_content['contributor'].append(byline[0].text)
  article_content['text'].append(body[0].text)
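
The accumulation step can also be written without the try/except bootstrap. The sketch below uses `collections.defaultdict`, which creates each list on first access; the `records` list is hypothetical data standing in for scraped `(title, contributor, text)` triples.

```python
from collections import defaultdict

# Hypothetical records standing in for the scraped values.
records = [
    ("Title A", "By Author One - Jun 14, 2021, 3:00 PM CDT", "Body A"),
    ("Title B", "By Author Two - Jun 13, 2021, 4:00 PM CDT", "Body B"),
]

# Missing keys start as empty lists, so no try/except is needed.
article_content = defaultdict(list)
for title, contributor, text in records:
    article_content["title"].append(title)
    article_content["contributor"].append(contributor)
    article_content["text"].append(text)

# Convert back to a plain dict before handing it to pandas.
article_content = dict(article_content)
print(article_content["title"])
```

A bare `except:` also hides unrelated errors, which is one reason to prefer a structure that does not need it.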

### Exercise_3 Saving to Excel - with pandas

In [None]:
import pandas as pd
data = pd.DataFrame.from_dict(article_content)
data

NameError: ignored

In [None]:
data.to_excel('/content/drive/MyDrive/JHS/data/oil_price_contentHS.xlsx', index=False)

### Exercise_4 DataFrame post-processing

In [None]:
import re

In [None]:
'2009년 10월 15일'

'2009년 10월 15일'

In [None]:
is_number=re.compile('[0-9]+')
re.findall(is_number,'By Tsvetana Paraskova - 6 13, 2021, 4:00 PM CDT \n\n')

['6', '13', '2021', '4', '00']

In [None]:
is_all=re.compile('[A-Za-z]+')
re.findall(is_all,'By Tsvetana Paraskova - 6 13, 2021, 4:00 PM CDT \n\n')

['By', 'Tsvetana', 'Paraskova', 'PM', 'CDT']
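
A note on letter ranges: the class `[A-z]` spans ASCII codes 65 to 122, which also covers the six characters between `Z` and `a`: `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below shows the difference on a hypothetical input string; `[A-Za-z]` is the class that matches letters only.

```python
import re

sample = "snake_case ^caret"  # hypothetical input containing '_' and '^'

wide = re.findall(r"[A-z]+", sample)       # also matches '_' and '^'
letters = re.findall(r"[A-Za-z]+", sample)  # letters only

print(wide)
print(letters)
```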

In [None]:
is_space = re.compile(r'\s+')
re.findall(is_space, 'By Tsvetana Paraskova - 6 13, 2021, 4:00 PM CDT \n\n')

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' \n\n']

In [None]:
data['text'][0]

"\n\nSaudi Arabia is synonymous with oil, the EU is obsessed with renewable energy, and the U.S. is the world's leading natural gas producer, but there are few countries pursuing all three of these energy sources with as much vigor as the UAE.While much of the West, encouraged by the International Energy Agency (IEA) aim for net-zero, is shunning fossil fuels, energy demand continues to rise across the globe. Without sufficient renewable energy development to meet this demand, the UAE is acknowledging its position as a world leader in oil and gas, which are still very much still needed to power the world over the next decade and beyond.?Abu Dhabi is further expanding upon its already strong oil industry through the full-field development of its Belbazem offshore block, with heavy investment expected to boost oil output in the coming years.?In May, the National Petroleum Construction Company (NPCC) was awarded a $744 million contract by Al Yasat Petroleum Operations Company, a joint ven

In [None]:
sample_text= data['text'][0].strip()

In [None]:
is_not_word=re.compile('[^a-zA-Z0-9 .]+')
re.sub(is_not_word,'',sample_text)

'Saudi Arabia is synonymous with oil the EU is obsessed with renewable energy and the U.S. is the worlds leading natural gas producer but there are few countries pursuing all three of these energy sources with as much vigor as the UAE.While much of the West encouraged by the International Energy Agency IEA aim for netzero is shunning fossil fuels energy demand continues to rise across the globe. Without sufficient renewable energy development to meet this demand the UAE is acknowledging its position as a world leader in oil and gas which are still very much still needed to power the world over the next decade and beyond.Abu Dhabi is further expanding upon its already strong oil industry through the fullfield development of its Belbazem offshore block with heavy investment expected to boost oil output in the coming years.In May the National Petroleum Construction Company NPCC was awarded a 744 million contract by Al Yasat Petroleum Operations Company a joint venture between the Abu Dhab

In [None]:
is_not_word=re.compile('[^a-zA-Z0-9 .]+')

In [None]:
data['text']=list(map(lambda x:re.sub(is_not_word,'',x), data['text']))
data['text']

0        Saudi Arabia is synonymous with oil the EU is ...
1        Renewable energy is going gangbusters. The rem...
2        In just a few years the image of natural gas m...
3        The world of Nigerian refining is an enigma sh...
4        Global economic recovery is gradually taking s...
                               ...                        
10093    Indonesia which had begun producing oil in the...
10094    Largely overlooked in the nonRussian press an ...
10095    A funny thing is happening on the way to the c...
10096    Having looked at the major alternatives to fos...
10097    Jim Lane at Biofuelsdigest.com has written an ...
Name: text, Length: 10097, dtype: object

In [None]:
data

Unnamed: 0,title,text,date
0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,2021-06-14
1,The Renewable Energy Revolution Has A Major Em...,Renewable energy is going gangbusters. The rem...,2021-06-14
2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,2021-06-13
3,Solving Nigeria’s Gasoline Crisis,The world of Nigerian refining is an enigma sh...,2021-06-13
4,China’s Pivot To Gas Is Fueling Support In LNG...,Global economic recovery is gradually taking s...,2021-06-12
...,...,...,...
10093,Former OPEC Member Indonesia Diversifies its E...,Indonesia which had begun producing oil in the...,2012-02-15
10094,Putin Looking to Modernize Russia's Energy Sec...,Largely overlooked in the nonRussian press an ...,2012-02-15
10095,How the US Shale Boom Will Change the World,A funny thing is happening on the way to the c...,2012-02-15
10096,The Age of Fossil Fuels is Far From Over,Having looked at the major alternatives to fos...,2012-02-15


In [None]:
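%%time
# %%time times the whole cell. A bare %time line times an empty statement, which is why the recorded output below shows only microseconds.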
data['text'] = data['text'].apply(lambda x : re.sub(is_not_word, '', x))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


In [None]:
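%%time
# %%time times the whole cell. A bare %time line times an empty statement, which is why the recorded output below shows only microseconds.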
data['text'] = list(map(lambda x:re.sub(is_not_word, '', x), data['text']))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.39 µs


In [None]:
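%%time
# %%time times the whole cell. A bare %time line times an empty statement, which is why the recorded output below shows only microseconds.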
sample=[]
for text in data['text']:
  sample.append(re.sub(is_not_word, '', text))
data['text'] = sample

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.63 µs
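
The three variants can also be compared outside the notebook with the standard library's `timeit`. This is a minimal sketch: the `texts` list is hypothetical, and the numbers it prints depend on the machine, so none are shown here.

```python
import re
import timeit

is_not_word = re.compile(r"[^a-zA-Z0-9 .]+")
# Hypothetical corpus; the real one is data['text'].
texts = ["Net-zero, by 2050!", "U.S. gas (LNG) exports: up 5%"] * 1000


def with_comprehension():
    return [is_not_word.sub("", t) for t in texts]


def with_map():
    return list(map(lambda t: is_not_word.sub("", t), texts))


def with_loop():
    out = []
    for t in texts:
        out.append(is_not_word.sub("", t))
    return out


# Run each variant 20 times and report the total elapsed time.
for fn in (with_comprehension, with_map, with_loop):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```

All three produce the same cleaned list, so any difference is purely in speed.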


Data load and preprocessing


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import glob
import numpy as np
import pandas as pd

In [None]:
file=glob.glob('./drive/MyDrive/JHS/data/*')
file

NameError: ignored

In [None]:
data=pd.read_csv('./drive/MyDrive/JHS/data/oil_price_contentHS.csv',encoding='cp949')
data

NameError: ignored

Preprocessing-datetime




In [None]:
data['date']=list(map(lambda x:x.split('-')[1], data['contributor']))


In [None]:
data['month']=list(map(lambda x:x.split(' ')[1], data['date']))

data['month'].value_counts()
data['month'].unique()

array(['Jun', 'May', 'Apr', 'Mar', 'Feb', 'Jan', 'Dec', 'Nov', 'Oct',
       'Sep', 'Aug', 'Jul', ''], dtype=object)

In [None]:
data.query("month==''")

Unnamed: 0,title,contributor,text,date,month
9273,"Never Mind Oil, Libya could Supply Europe with...","By Nicolai Due-Gundersen - Apr 15, 2013, 4:48 ...","\n\nA few years back, an article appeared in t...",Gundersen,


In [None]:
data.drop(index=9273,inplace=True)
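
The dropped row fails because `split('-')` also cuts inside a hyphenated surname. A sketch of a parser that keeps such rows, assuming every byline has the form `By <author> - <Mon D, YYYY>, <time>`: it splits once from the right on `' - '`. The function name `parse_byline` and the second example byline are hypothetical.

```python
from datetime import datetime


def parse_byline(byline):
    """Split 'By <author> - <Mon D, YYYY>, <time>' into (author, date)."""
    # rsplit with maxsplit=1 cuts at the last ' - ', so hyphens in names survive.
    author_part, stamp = byline.strip().rsplit(" - ", 1)
    author = author_part[3:] if author_part.startswith("By ") else author_part
    # stamp looks like 'Jun 14, 2021, 3:00 PM CDT'; keep the first two fields.
    month_day, year = stamp.split(", ")[:2]
    date = datetime.strptime(f"{month_day}, {year}", "%b %d, %Y").date()
    return author, date


print(parse_byline("By Felicity Bradstock - Jun 14, 2021, 3:00 PM CDT"))
# Hypothetical byline with a hyphenated surname and trailing whitespace.
print(parse_byline("By Ann Smith-Jones - Apr 15, 2013, 4:48 PM CDT \n\n"))
```

Note that `%b` depends on the locale; the sketch assumes English month abbreviations.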

In [None]:
data['month']=list(map(lambda x:x.split()[0], data['date']))
data['day'] = list(map(lambda x : x.split()[1].replace(',', ''), data['date']))
data['year'] = list(map(lambda x : x.split()[2].replace(',', ''), data['date']))

In [None]:
data['date']=data['year']+'-'+data['month']+'-'+data['day']
data['date']=pd.to_datetime(data['date'])
data

Unnamed: 0,title,contributor,text,date,month,day,year
0,The UAE Is Poised To Become The Next Middle Ea...,"By Felicity Bradstock - Jun 14, 2021, 3:00 PM CDT","\n\nSaudi Arabia is synonymous with oil, the E...",2021-06-14,Jun,14,2021
1,The Renewable Energy Revolution Has A Major Em...,"By Haley Zaremba - Jun 14, 2021, 1:00 PM CDT",\n\nRenewable energy is going gangbusters. The...,2021-06-14,Jun,14,2021
2,Energy Transition Forces LNG Industry To Cut E...,"By Tsvetana Paraskova - Jun 13, 2021, 4:00 PM CDT","\n\nIn just a few years, the image of natural ...",2021-06-13,Jun,13,2021
3,Solving Nigeria’s Gasoline Crisis,"By Gerald Jansen - Jun 13, 2021, 10:00 AM CDT",\n\nThe world of Nigerian refining is an enigm...,2021-06-13,Jun,13,2021
4,China’s Pivot To Gas Is Fueling Support In LNG...,"By Vanand Meliksetian - Jun 12, 2021, 2:00 PM CDT",\n\nGlobal economic recovery is gradually taki...,2021-06-12,Jun,12,2021
...,...,...,...,...,...,...,...
10093,Former OPEC Member Indonesia Diversifies its E...,"By John Daly - Feb 15, 2012, 7:48 PM CST","\n\nIndonesia, which had begun producing oil i...",2012-02-15,Feb,15,2012
10094,Putin Looking to Modernize Russia's Energy Sec...,"By John Daly - Feb 15, 2012, 5:09 PM CST",\n\nLargely overlooked in the non-Russian pres...,2012-02-15,Feb,15,2012
10095,How the US Shale Boom Will Change the World,"By Gary Hunt - Feb 15, 2012, 5:02 PM CST",\n\nA funny thing is happening on the way to t...,2012-02-15,Feb,15,2012
10096,The Age of Fossil Fuels is Far From Over,"By Tom Murphy - Feb 15, 2012, 4:32 PM CST",\n\nHaving looked at the major alternatives to...,2012-02-15,Feb,15,2012


In [None]:
data.drop(columns=['year','month','day','contributor'],inplace=True)
data

Unnamed: 0,title,text,date
0,The UAE Is Poised To Become The Next Middle Ea...,"\n\nSaudi Arabia is synonymous with oil, the E...",2021-06-14
1,The Renewable Energy Revolution Has A Major Em...,\n\nRenewable energy is going gangbusters. The...,2021-06-14
2,Energy Transition Forces LNG Industry To Cut E...,"\n\nIn just a few years, the image of natural ...",2021-06-13
3,Solving Nigeria’s Gasoline Crisis,\n\nThe world of Nigerian refining is an enigm...,2021-06-13
4,China’s Pivot To Gas Is Fueling Support In LNG...,\n\nGlobal economic recovery is gradually taki...,2021-06-12
...,...,...,...
10093,Former OPEC Member Indonesia Diversifies its E...,"\n\nIndonesia, which had begun producing oil i...",2012-02-15
10094,Putin Looking to Modernize Russia's Energy Sec...,\n\nLargely overlooked in the non-Russian pres...,2012-02-15
10095,How the US Shale Boom Will Change the World,\n\nA funny thing is happening on the way to t...,2012-02-15
10096,The Age of Fossil Fuels is Far From Over,\n\nHaving looked at the major alternatives to...,2012-02-15


Preprocessing-text preprocessing

In [None]:
import re

In [None]:
data['text'][0].strip()

"Saudi Arabia is synonymous with oil, the EU is obsessed with renewable energy, and the U.S. is the world's leading natural gas producer, but there are few countries pursuing all three of these energy sources with as much vigor as the UAE.While much of the West, encouraged by the International Energy Agency (IEA) aim for net-zero, is shunning fossil fuels, energy demand continues to rise across the globe. Without sufficient renewable energy development to meet this demand, the UAE is acknowledging its position as a world leader in oil and gas, which are still very much still needed to power the world over the next decade and beyond.?Abu Dhabi is further expanding upon its already strong oil industry through the full-field development of its Belbazem offshore block, with heavy investment expected to boost oil output in the coming years.?In May, the National Petroleum Construction Company (NPCC) was awarded a $744 million contract by Al Yasat Petroleum Operations Company, a joint venture

In [None]:
re_x=re.compile(r"[^0-9a-zA-Z .']+")

In [None]:
re_xa=re.compile('googletag.+;')

In [None]:
''.join(re.split(re_xa,data['text'][0]))

"\n\nSaudi Arabia is synonymous with oil, the EU is obsessed with renewable energy, and the U.S. is the world's leading natural gas producer, but there are few countries pursuing all three of these energy sources with as much vigor as the UAE.While much of the West, encouraged by the International Energy Agency (IEA) aim for net-zero, is shunning fossil fuels, energy demand continues to rise across the globe. Without sufficient renewable energy development to meet this demand, the UAE is acknowledging its position as a world leader in oil and gas, which are still very much still needed to power the world over the next decade and beyond.?Abu Dhabi is further expanding upon its already strong oil industry through the full-field development of its Belbazem offshore block, with heavy investment expected to boost oil output in the coming years.?In May, the National Petroleum Construction Company (NPCC) was awarded a $744 million contract by Al Yasat Petroleum Operations Company, a joint ven

In [None]:
data['text']=list(map(lambda x:''.join(re.split(re_xa,x)),data['text']))


In [None]:
data['text']=list(map(lambda x:re.sub(re_x,'',x),data['text']))

In [None]:
data['text'][0]

"Saudi Arabia is synonymous with oil the EU is obsessed with renewable energy and the U.S. is the world's leading natural gas producer but there are few countries pursuing all three of these energy sources with as much vigor as the UAE.While much of the West encouraged by the International Energy Agency IEA aim for netzero is shunning fossil fuels energy demand continues to rise across the globe. Without sufficient renewable energy development to meet this demand the UAE is acknowledging its position as a world leader in oil and gas which are still very much still needed to power the world over the next decade and beyond.Abu Dhabi is further expanding upon its already strong oil industry through the fullfield development of its Belbazem offshore block with heavy investment expected to boost oil output in the coming years.In May the National Petroleum Construction Company NPCC was awarded a 744 million contract by Al Yasat Petroleum Operations Company a joint venture between the Abu Dha

stopword preprocessing


In [None]:
data.to_csv('/content/drive/MyDrive/JHS/data/oilprice_preprocessing.csv')

In [None]:
import glob
import numpy as np
import pandas as pd

In [None]:
df=pd.read_csv('./drive/MyDrive/JHS/data/oilprice_preprocessing.csv',encoding='utf-8')
df

Unnamed: 0.1,Unnamed: 0,title,text,date
0,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,2021-06-14
1,1,The Renewable Energy Revolution Has A Major Em...,Renewable energy is going gangbusters. The rem...,2021-06-14
2,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,2021-06-13
3,3,Solving Nigeria’s Gasoline Crisis,The world of Nigerian refining is an enigma sh...,2021-06-13
4,4,China’s Pivot To Gas Is Fueling Support In LNG...,Global economic recovery is gradually taking s...,2021-06-12
...,...,...,...,...
10092,10093,Former OPEC Member Indonesia Diversifies its E...,Indonesia which had begun producing oil in the...,2012-02-15
10093,10094,Putin Looking to Modernize Russia's Energy Sec...,Largely overlooked in the nonRussian press an ...,2012-02-15
10094,10095,How the US Shale Boom Will Change the World,A funny thing is happening on the way to the c...,2012-02-15
10095,10096,The Age of Fossil Fuels is Far From Over,Having looked at the major alternatives to fos...,2012-02-15


In [None]:
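# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default.
df['text']=df['text'].str.replace("[^a-z A-Z 0-9]","",regex=True)

In [None]:
df['title']=df['title'].str.replace("[^a-z A-Z 0-9]","",regex=True)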

Building the vocab

In [None]:
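# `stop_words` is defined in the stopword cell further down; run that cell first.
vocab={}
for sentence in df['text']:
    for token in sentence.split():
        token=token.lower()
        if token not in stop_words:
            # Count the token, starting from 0 the first time it is seen.
            vocab[token]=vocab.get(token,0)+1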

In [None]:
sorted(vocab.items(),key=lambda x: x[1],reverse=True)
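
The same counting and sorting can be sketched with `collections.Counter` from the standard library. The sentences and the stopword set below are hypothetical stand-ins for `df['text']` and `stop_words`.

```python
from collections import Counter

# Hypothetical stand-ins for stop_words and df['text'].
stop_words = {"the", "is", "a"}
sentences = ["The price of gas is rising", "Gas demand is a factor"]

# Count every lower-cased token that is not a stopword.
vocab = Counter(
    token
    for sentence in sentences
    for token in sentence.lower().split()
    if token not in stop_words
)

# most_common returns (token, count) pairs sorted by descending count.
print(vocab.most_common(3))
```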

Tokenization and stopword removal

In [None]:
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
df['title_tokenize']=df['title'].apply(word_tokenize)
df['text_tokenize']=df['text'].apply(word_tokenize)

df.head()

Unnamed: 0.1,Unnamed: 0,title,text,date,title_tokenize,text_tokenize
0,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,2021-06-14,"[The, UAE, Is, Poised, To, Become, The, Next, ...","[Saudi, Arabia, is, synonymous, with, oil, the..."
1,1,The Renewable Energy Revolution Has A Major Em...,Renewable energy is going gangbusters The rema...,2021-06-14,"[The, Renewable, Energy, Revolution, Has, A, M...","[Renewable, energy, is, going, gangbusters, Th..."
2,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,2021-06-13,"[Energy, Transition, Forces, LNG, Industry, To...","[In, just, a, few, years, the, image, of, natu..."
3,3,Solving Nigerias Gasoline Crisis,The world of Nigerian refining is an enigma sh...,2021-06-13,"[Solving, Nigerias, Gasoline, Crisis]","[The, world, of, Nigerian, refining, is, an, e..."
4,4,Chinas Pivot To Gas Is Fueling Support In LNG ...,Global economic recovery is gradually taking s...,2021-06-12,"[Chinas, Pivot, To, Gas, Is, Fueling, Support,...","[Global, economic, recovery, is, gradually, ta..."


In [None]:
df['title_tokenize'][0][0]

'The'

In [None]:
df['title_tokenize']=df['title_tokenize'].apply(lambda x: [item.lower() for item in x])
df['text_tokenize']=df['text_tokenize'].apply(lambda x:[item.lower() for item in x])

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
stop_words=stopwords.words('english')

In [None]:
plus=('this', 'in', 'on', 'at', 'but', 'and', 'however')
# extend adds each word; append would add the whole tuple as a single element.
stop_words.extend(plus)
stop_words
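
When adding several words to a list at once, `append` and `extend` behave differently: `append` adds the whole tuple as one element, so none of its words would be filtered, while `extend` adds each word separately. A small sketch with a hypothetical two-word base list:

```python
base = ["the", "a"]          # hypothetical stand-in for the NLTK stopword list
plus = ("this", "in", "on")  # a tuple of extra words

appended = list(base)
appended.append(plus)   # adds ONE element: the tuple itself

extended = list(base)
extended.extend(plus)   # adds each word as its own element

print(appended)
print(extended)
print("in" in appended, "in" in extended)
```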

In [None]:
df['title_tokenize']=df['title_tokenize'].apply(lambda x: [item for item in x if item not in stop_words])
df['text_tokenize']=df['text_tokenize'].apply(lambda x: [item for item in x if item not in stop_words])
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,date,title_tokenize,text_tokenize
0,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,2021-06-14,"[uae, poised, become, next, middle, east, ener...","[saudi, arabia, synonymous, oil, eu, obsessed,..."
1,1,The Renewable Energy Revolution Has A Major Em...,Renewable energy is going gangbusters The rema...,2021-06-14,"[renewable, energy, revolution, major, employm...","[renewable, energy, going, gangbusters, remark..."
2,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,2021-06-13,"[energy, transition, forces, lng, industry, cu...","[years, image, natural, gas, markedly, shifted..."
3,3,Solving Nigerias Gasoline Crisis,The world of Nigerian refining is an enigma sh...,2021-06-13,"[solving, nigerias, gasoline, crisis]","[world, nigerian, refining, enigma, shrouded, ..."
4,4,Chinas Pivot To Gas Is Fueling Support In LNG ...,Global economic recovery is gradually taking s...,2021-06-12,"[chinas, pivot, gas, fueling, support, lng, de...","[global, economic, recovery, gradually, taking..."


In [None]:
df['title_tokenize'][0][1]

'poised'

In [None]:
from nltk.stem import WordNetLemmatizer
n=WordNetLemmatizer()
import nltk
nltk.download('wordnet')
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
df.to_csv('/content/drive/MyDrive/JHS/data/oilprice_preprocessing_rev.csv')

In [None]:
df

In [None]:
word=['apple','banana', 'golf']
print([n.lemmatize(a) for a in word])


['apple', 'banana', 'golf']


In [None]:
df['lemmi_text']=df['text_tokenize'].apply(lambda x:[n.lemmatize(i) for i in x])
df['lemmi_title']=df['title_tokenize'].apply(lambda x:[n.lemmatize(i) for i in x])


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,date,title_tokenize,text_tokenize,lemmi_text,lemmi_title
0,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,2021-06-14,"[uae, poised, become, next, middle, east, ener...","[saudi, arabia, synonymous, oil, eu, obsessed,...","[saudi, arabia, synonymous, oil, eu, obsessed,...","[uae, poised, become, next, middle, east, ener..."
1,1,The Renewable Energy Revolution Has A Major Em...,Renewable energy is going gangbusters The rema...,2021-06-14,"[renewable, energy, revolution, major, employm...","[renewable, energy, going, gangbusters, remark...","[renewable, energy, going, gangbusters, remark...","[renewable, energy, revolution, major, employm..."
2,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,2021-06-13,"[energy, transition, forces, lng, industry, cu...","[years, image, natural, gas, markedly, shifted...","[year, image, natural, gas, markedly, shifted,...","[energy, transition, force, lng, industry, cut..."
3,3,Solving Nigerias Gasoline Crisis,The world of Nigerian refining is an enigma sh...,2021-06-13,"[solving, nigerias, gasoline, crisis]","[world, nigerian, refining, enigma, shrouded, ...","[world, nigerian, refining, enigma, shrouded, ...","[solving, nigeria, gasoline, crisis]"
4,4,Chinas Pivot To Gas Is Fueling Support In LNG ...,Global economic recovery is gradually taking s...,2021-06-12,"[chinas, pivot, gas, fueling, support, lng, de...","[global, economic, recovery, gradually, taking...","[global, economic, recovery, gradually, taking...","[china, pivot, gas, fueling, support, lng, dem..."


In [None]:
df['lemmi_title'][0][0]

'uae'

tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf=TfidfVectorizer()

In [None]:
tfidfv=tfidf.fit(df['lemmi_title'][0])

In [None]:
print(tfidfv.transform(df['lemmi_title'][0]).toarray())

[[0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]]


In [None]:
df['tfidf']=df['lemmi_title'].apply(lambda x: [tfidfv.transform([i]).toarray() for i in x])

In [None]:
df['tfidf'][0]

[array([[0., 0., 0., 0., 0., 0., 0., 1.]]),
 array([[0., 0., 0., 0., 0., 0., 1., 0.]]),
 array([[1., 0., 0., 0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 0., 0., 1., 0., 0.]]),
 array([[0., 0., 0., 0., 1., 0., 0., 0.]]),
 array([[0., 1., 0., 0., 0., 0., 0., 0.]]),
 array([[0., 0., 1., 0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 1., 0., 0., 0., 0.]])]

In [None]:
df['lemmi_title'][0]

['uae', 'poised', 'become', 'next', 'middle', 'east', 'energy', 'giant']
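
Fitting on `df['lemmi_title'][0]` passes a list of eight tokens, so each token is treated as its own one-word document; that is why every row above is one-hot. TF-IDF weights only become informative when the statistics come from many documents. The sketch below computes them by hand on three hypothetical tokenized titles with the standard library. The smoothed formula `idf = ln((1 + N) / (1 + df)) + 1` followed by L2 normalization is what I understand the `TfidfVectorizer` defaults to be; treat that correspondence as an assumption.

```python
import math
from collections import Counter

# Hypothetical tokenized titles standing in for df['lemmi_title'].
docs = [
    ["gas", "price", "rise"],
    ["gas", "price", "fall"],
    ["gas", "export", "record"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

Here `gas` appears in every title and receives the lowest weight, while `rise` appears once and receives the highest.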

In [None]:
a=[1,2,3]
np.array(a)

array([1, 2, 3])

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [None]:
df['word2vec']=df['lemmi_title'].apply(lambda x : [loaded_model.get_vector(i) for i in x])

KeyError: ignored
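
A likely cause of this `KeyError` is a token that is missing from the model's vocabulary: the Word2Vec model in this notebook is trained with `min_count=5`, so rarer words get no vector. One way around it is to skip unknown tokens. The sketch below uses a plain dict as a hypothetical stand-in for `loaded_model`; the `in` test is assumed to work the same way on gensim's `KeyedVectors`.

```python
def embed_known(tokens, vectors):
    """Return the vectors of tokens present in `vectors`, skipping the rest."""
    return [vectors[t] for t in tokens if t in vectors]


# Hypothetical two-dimensional vectors; real ones have 100 dimensions.
toy_vectors = {"price": [0.1, 0.2], "gas": [0.3, 0.4]}

result = embed_known(["rareword", "gas", "price"], toy_vectors)
print(result)
```

Skipping changes the length of each row's vector list, so any later step that assumes one vector per token needs to account for that.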

In [None]:
loaded_model.get_vector("uae")

array([ 0.14015311,  0.04944398,  0.07569135,  0.00567519,  0.05914401,
        0.11963748,  0.02940406,  0.14644766, -0.03426981,  0.03344195,
       -0.07028589,  0.08589093, -0.09518461,  0.06391553,  0.1752236 ,
       -0.17142028, -0.06663116, -0.18755151, -0.08084892, -0.07019907,
        0.08139094,  0.18474427, -0.2332523 , -0.02221418, -0.12499598,
       -0.21192436, -0.19841751,  0.06562069, -0.07154199, -0.12927255,
        0.07057619, -0.10932018, -0.1776624 ,  0.14954446, -0.02624759,
       -0.12148854, -0.07198396, -0.07963964, -0.02892969, -0.15345195,
       -0.06868204,  0.17647381, -0.09321401, -0.07322565, -0.07310056,
       -0.18219592, -0.10568142, -0.14999501,  0.10372308, -0.1414438 ,
       -0.13047954,  0.24877042,  0.05182329,  0.07353585, -0.01348604,
        0.2987417 ,  0.12145665, -0.11931058, -0.12800086, -0.27122146,
        0.02329631, -0.01489131,  0.02612748,  0.23864974,  0.0513752 ,
       -0.10551874, -0.19443107, -0.04879979,  0.07605988,  0.08

Word2Vec

In [None]:
from gensim.models import Word2Vec

In [None]:
# gensim >= 4.0 renames `size` to `vector_size`; this call targets gensim 3.x.
model = Word2Vec(sentences=df['lemmi_title'], size=100, window=5, min_count=5, workers=4, sg=1)

In [None]:
from gensim.models import KeyedVectors
model.wv.save_word2vec_format('eng_w2v') # save the model
loaded_model = KeyedVectors.load_word2vec_format("eng_w2v") # load the model

In [None]:
model_result = loaded_model.most_similar("price")
print(model_result)

[('rise', 0.9742411971092224), ('low', 0.9739410281181335), ('fall', 0.9734461307525635), ('production', 0.9636420011520386), ('oil', 0.962026834487915), ('rally', 0.9486740827560425), ('higher', 0.94339919090271), ('hit', 0.9421190023422241), ('rig', 0.9364202618598938), ('natural', 0.9356290102005005)]
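
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure on hypothetical two-dimensional vectors, using only the standard library:

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal
print(cosine([1, 2], [2, 4]))  # parallel vectors of different length
```

Because the measure ignores vector length, a score near 1 only says that two words point in a similar direction in the embedding space.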


Loading NG price data

In [None]:
from datetime import datetime
from fredapi import Fred  # needs the `pip install fredapi` cell below to have run first
import pandas_datareader as pdr

In [None]:
pip install fredapi

Collecting fredapi
  Downloading https://files.pythonhosted.org/packages/db/82/9ca1e4a7f1b2ae057e8352cc46d016866c067a12d013fc05af0670f4291a/fredapi-0.4.3-py3-none-any.whl
Installing collected packages: fredapi
Successfully installed fredapi-0.4.3


In [None]:
import os
fred=Fred(api_key=os.environ['FRED_API_KEY'])  # key read from an environment variable (name is an assumption) instead of being hard-coded

In [None]:
start=datetime(2012,2,15)
end=datetime(2021,6,14)

In [None]:
df_NG=pdr.DataReader('DHHNGSP','fred',start,end)
df_NG.head()

Unnamed: 0_level_0,DHHNGSP
DATE,Unnamed: 1_level_1
2012-02-15,2.54
2012-02-16,2.47
2012-02-17,2.67
2012-02-20,
2012-02-21,2.63


In [None]:
df['date']=pd.to_datetime(df['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10097 entries, 0 to 10096
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Unnamed: 0      10097 non-null  int64         
 1   title           10097 non-null  object        
 2   text            10097 non-null  object        
 3   date            10097 non-null  datetime64[ns]
 4   title_tokenize  10097 non-null  object        
 5   text_tokenize   10097 non-null  object        
 6   lemmi_text      10097 non-null  object        
 7   lemmi_title     10097 non-null  object        
 8   tfidf           10097 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(7)
memory usage: 710.1+ KB


In [None]:
df=df.set_index('date')
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10097 entries, 2021-06-14 to 2012-02-15
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      10097 non-null  int64 
 1   title           10097 non-null  object
 2   text            10097 non-null  object
 3   title_tokenize  10097 non-null  object
 4   text_tokenize   10097 non-null  object
 5   lemmi_text      10097 non-null  object
 6   lemmi_title     10097 non-null  object
 7   tfidf           10097 non-null  object
dtypes: int64(1), object(7)
memory usage: 709.9+ KB


In [None]:
df_NG.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2439 entries, 2012-02-15 to 2021-06-14
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   DHHNGSP  2367 non-null   float64
dtypes: float64(1)
memory usage: 38.1 KB


In [None]:
df_merge=pd.merge(df,df_NG, left_index=True, right_index=True, how='left')
df_merge

,Unnamed: 0,title,text,title_tokenize,text_tokenize,lemmi_text,lemmi_title,tfidf,DHHNGSP
2012-02-15,10093,Former OPEC Member Indonesia Diversifies its E...,Indonesia which had begun producing oil in the...,"[former, opec, member, indonesia, diversifies,...","[indonesia, begun, producing, oil, early, 20th...","[indonesia, begun, producing, oil, early, 20th...","[former, opec, member, indonesia, diversifies,...","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",2.54
2012-02-15,10094,Putin Looking to Modernize Russias Energy Sect...,Largely overlooked in the nonRussian press an ...,"[putin, looking, modernize, russias, energy, s...","[largely, overlooked, nonrussian, press, incip...","[largely, overlooked, nonrussian, press, incip...","[putin, looking, modernize, russia, energy, se...","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",2.54
2012-02-15,10095,How the US Shale Boom Will Change the World,A funny thing is happening on the way to the c...,"[us, shale, boom, change, world]","[funny, thing, happening, way, clean, energy, ...","[funny, thing, happening, way, clean, energy, ...","[u, shale, boom, change, world]","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",2.54
2012-02-15,10096,The Age of Fossil Fuels is Far From Over,Having looked at the major alternatives to fos...,"[age, fossil, fuels, far]","[looked, major, alternatives, fossil, fuel, en...","[looked, major, alternative, fossil, fuel, ene...","[age, fossil, fuel, far]","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",2.54
2012-02-15,10097,How the US Can Produce 36 Billion Gallons of B...,Jim Lane at Biofuelsdigestcom has written an a...,"[us, produce, 36, billion, gallons, biofuels, ...","[jim, lane, biofuelsdigestcom, written, articl...","[jim, lane, biofuelsdigestcom, written, articl...","[u, produce, 36, billion, gallon, biofuels, an...","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",2.54
...,...,...,...,...,...,...,...,...,...
2021-06-12,6,Turkey Makes Moves To Become An Energy Hub,Turkey has dreamed about becoming an energy hu...,"[turkey, makes, moves, become, energy, hub]","[turkey, dreamed, becoming, energy, hub, decad...","[turkey, dreamed, becoming, energy, hub, decad...","[turkey, make, move, become, energy, hub]","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",
2021-06-13,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,"[energy, transition, forces, lng, industry, cu...","[years, image, natural, gas, markedly, shifted...","[year, image, natural, gas, markedly, shifted,...","[energy, transition, force, lng, industry, cut...","[[[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",
2021-06-13,3,Solving Nigerias Gasoline Crisis,The world of Nigerian refining is an enigma sh...,"[solving, nigerias, gasoline, crisis]","[world, nigerian, refining, enigma, shrouded, ...","[world, nigerian, refining, enigma, shrouded, ...","[solving, nigeria, gasoline, crisis]","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [...",
2021-06-14,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,"[uae, poised, become, next, middle, east, ener...","[saudi, arabia, synonymous, oil, eu, obsessed,...","[saudi, arabia, synonymous, oil, eu, obsessed,...","[uae, poised, become, next, middle, east, ener...","[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]], [...",3.36
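
The `NaN` prices for 2021-06-12 and 2021-06-13 in the output above follow from `how='left'`: every article row is kept, and dates with no price row (here a weekend) are left empty. A minimal sketch of this behavior, assuming two small made-up frames (`articles` and `prices` are hypothetical and the price value is not real data):

```python
import pandas as pd

# Hypothetical articles; two share a date and one falls on a Saturday.
articles = pd.DataFrame(
    {"title": ["a", "b", "c"]},
    index=pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-12"]),
)

# Hypothetical price series that only has a quote for the Friday.
prices = pd.DataFrame(
    {"DHHNGSP": [3.0]},
    index=pd.to_datetime(["2021-06-11"]),
)

# Left join on the date index: all article rows survive,
# and dates without a quote get NaN in the price column.
merged = pd.merge(articles, prices, left_index=True, right_index=True, how="left")
print(merged)
```

Both Friday articles receive the same price, which mirrors how the five 2012-02-15 articles above all carry 2.54.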


In [None]:
df_merge.to_csv('/content/drive/MyDrive/JHS/data/NGprice_preprocessing.csv')

In [None]:
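# Use .iloc for positional access: the index is a DatetimeIndex, so a bare [0] relies on a deprecated fallback.
df_merge['tfidf'].iloc[0][0]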

array([[0., 0., 0., 0., 0., 0., 0., 0.]])

TF-IDF

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('/content/drive/MyDrive/JHS/data/NGprice_preprocessing.csv')

In [None]:
df

,Unnamed: 0,Unnamed: 0.1,title,text,title_tokenize,text_tokenize,lemmi_text,lemmi_title,tfidf,DHHNGSP
0,2012-02-15,10093,Former OPEC Member Indonesia Diversifies its E...,Indonesia which had begun producing oil in the...,"['former', 'opec', 'member', 'indonesia', 'div...","['indonesia', 'begun', 'producing', 'oil', 'ea...","['indonesia', 'begun', 'producing', 'oil', 'ea...","['former', 'opec', 'member', 'indonesia', 'div...","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",2.54
1,2012-02-15,10094,Putin Looking to Modernize Russias Energy Sect...,Largely overlooked in the nonRussian press an ...,"['putin', 'looking', 'modernize', 'russias', '...","['largely', 'overlooked', 'nonrussian', 'press...","['largely', 'overlooked', 'nonrussian', 'press...","['putin', 'looking', 'modernize', 'russia', 'e...","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",2.54
2,2012-02-15,10095,How the US Shale Boom Will Change the World,A funny thing is happening on the way to the c...,"['us', 'shale', 'boom', 'change', 'world']","['funny', 'thing', 'happening', 'way', 'clean'...","['funny', 'thing', 'happening', 'way', 'clean'...","['u', 'shale', 'boom', 'change', 'world']","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",2.54
3,2012-02-15,10096,The Age of Fossil Fuels is Far From Over,Having looked at the major alternatives to fos...,"['age', 'fossil', 'fuels', 'far']","['looked', 'major', 'alternatives', 'fossil', ...","['looked', 'major', 'alternative', 'fossil', '...","['age', 'fossil', 'fuel', 'far']","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",2.54
4,2012-02-15,10097,How the US Can Produce 36 Billion Gallons of B...,Jim Lane at Biofuelsdigestcom has written an a...,"['us', 'produce', '36', 'billion', 'gallons', ...","['jim', 'lane', 'biofuelsdigestcom', 'written'...","['jim', 'lane', 'biofuelsdigestcom', 'written'...","['u', 'produce', '36', 'billion', 'gallon', 'b...","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",2.54
...,...,...,...,...,...,...,...,...,...,...
10092,2021-06-12,6,Turkey Makes Moves To Become An Energy Hub,Turkey has dreamed about becoming an energy hu...,"['turkey', 'makes', 'moves', 'become', 'energy...","['turkey', 'dreamed', 'becoming', 'energy', 'h...","['turkey', 'dreamed', 'becoming', 'energy', 'h...","['turkey', 'make', 'move', 'become', 'energy',...","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",
10093,2021-06-13,2,Energy Transition Forces LNG Industry To Cut E...,In just a few years the image of natural gas m...,"['energy', 'transition', 'forces', 'lng', 'ind...","['years', 'image', 'natural', 'gas', 'markedly...","['year', 'image', 'natural', 'gas', 'markedly'...","['energy', 'transition', 'force', 'lng', 'indu...","[array([[0., 0., 1., 0., 0., 0., 0., 0.]]), ar...",
10094,2021-06-13,3,Solving Nigerias Gasoline Crisis,The world of Nigerian refining is an enigma sh...,"['solving', 'nigerias', 'gasoline', 'crisis']","['world', 'nigerian', 'refining', 'enigma', 's...","['world', 'nigerian', 'refining', 'enigma', 's...","['solving', 'nigeria', 'gasoline', 'crisis']","[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), ar...",
10095,2021-06-14,0,The UAE Is Poised To Become The Next Middle Ea...,Saudi Arabia is synonymous with oil the EU is ...,"['uae', 'poised', 'become', 'next', 'middle', ...","['saudi', 'arabia', 'synonymous', 'oil', 'eu',...","['saudi', 'arabia', 'synonymous', 'oil', 'eu',...","['uae', 'poised', 'become', 'next', 'middle', ...","[array([[0., 0., 0., 0., 0., 0., 0., 1.]]), ar...",3.36


In [None]:
df['tfidf'][0]

'[array([[0., 0., 0., 0., 0., 0., 0., 0.]]), array([[0., 0., 0., 0., 0., 0., 0., 0.]]), array([[0., 0., 0., 0., 0., 0., 0., 0.]]), array([[0., 0., 0., 0., 0., 0., 0., 0.]]), array([[0., 0., 0., 0., 0., 0., 0., 0.]]), array([[0., 0., 1., 0., 0., 0., 0., 0.]]), array([[0., 0., 0., 0., 0., 0., 0., 0.]])]'
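
The quoted output above shows that the CSV round trip turned each `tfidf` cell into a plain string. Because that string contains `array(...)` reprs, `ast.literal_eval` cannot parse it back. One possible alternative, sketched below under the assumption that the file is only ever read back by pandas, is to store the frame with pickle, which preserves Python objects. The one-row `frame` and the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame whose single 'tfidf' cell holds a Python list.
frame = pd.DataFrame({"tfidf": [[0.0, 1.0, 0.0]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "frame.csv")
    pkl_path = os.path.join(tmp, "frame.pkl")

    frame.to_csv(csv_path)      # text format: objects are written as their string form
    frame.to_pickle(pkl_path)   # binary format: objects are serialized as-is

    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pkl = pd.read_pickle(pkl_path)

csv_cell = from_csv["tfidf"].iloc[0]
pkl_cell = from_pkl["tfidf"].iloc[0]
print(type(csv_cell), type(pkl_cell))
```

Pickle files should only be loaded from trusted sources, so CSV remains the safer choice when the data is shared.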

Natural gas price summary statistics

In [None]:
import statistics

In [None]:
df_NG['DHHNGSP']=pd.to_numeric(df_NG['DHHNGSP'])


In [None]:
df_NG.dropna(inplace=True)

In [None]:
statistics.mean(df_NG['DHHNGSP'])

2.981972961554711

In [None]:
statistics.stdev(df_NG['DHHNGSP'])

0.9409134080955804
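
`statistics.stdev` returns the sample standard deviation, which divides by n - 1 rather than n. A small sketch of the same calculation by hand, using only the first four non-missing prices from the `df_NG.head()` output (so the numbers differ from the full-series results above):

```python
import math
import statistics

# First four non-missing prices shown in the df_NG.head() output.
prices = [2.54, 2.47, 2.67, 2.63]

mean = sum(prices) / len(prices)

# Sample variance: squared deviations divided by n - 1 (Bessel's correction).
sample_var = sum((p - mean) ** 2 for p in prices) / (len(prices) - 1)
sample_std = math.sqrt(sample_var)

print(mean, statistics.mean(prices))
print(sample_std, statistics.stdev(prices))
```

pandas uses the same n - 1 convention by default, so `df_NG['DHHNGSP'].std()` should agree with `statistics.stdev` on the same data.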