# COLLECTION SCRIPT

Let's start the hands-on part of the collection: write a Python script that collects the data for the twenty years between September 1997 and August 2017. Put everything in a DataFrame and then save it to a .CSV file named OVNIS.csv.

 

Let's get to work!

 

Suggested libraries for this step:
- requests: library for making HTTP requests;
- BeautifulSoup: library for extracting data from HTML and XML files;
- Pandas: library for storing, cleaning, and saving the data in tabular form.
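
As a minimal sketch of how two of these libraries fit together, the example below parses a small hand-written HTML table with BeautifulSoup and loads the rows into a pandas DataFrame. The HTML string, the `rows_from_html` helper, and the two column names are illustrative assumptions, not taken from the NUFORC site; in the real script the HTML would come from an HTTP request.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Shape</th></tr>
  <tr><td>Tucson</td><td>Chevron</td></tr>
  <tr><td>Fontana</td><td>Triangle</td></tr>
</table>
"""

def rows_from_html(html):
    """Return one list of cell texts per table row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows

sample_df = pd.DataFrame(rows_from_html(SAMPLE_HTML), columns=["CITY", "SHAPE"])
print(sample_df)
```

Keeping the parsing in a function that takes an HTML string makes it possible to try the extraction logic without hitting the website.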


# Approach 1 - Performing the Web Scraping

In [11]:
# Install the libraries in Colab
!pip install requests
!pip install beautifulsoup4



In [12]:
# Import the libraries
import urllib.request
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

In [13]:
def make_soup(url):
  """
  Download a page and parse it with BeautifulSoup.
  INPUT:
  url: link to the page containing the tables with the UFO reports.
  OUTPUT:
  soup: the page stored as a BeautifulSoup object.
  """
  # Request the page and store the HTTP response in 'pagina'
  pagina = urllib.request.urlopen(url)
  # Parse the HTML in 'pagina' and store it as a BeautifulSoup object
  soup = BeautifulSoup(pagina, "html.parser")
  return soup
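
A long run over hundreds of pages can stop on a single failed request. As a hedged sketch, the variant below wraps the same `urllib.request.urlopen` call with a timeout and returns `None` when the page cannot be fetched. The name `make_soup_safe`, the `timeout` parameter, and its default of 10 seconds are assumptions for illustration, not part of the original notebook.

```python
import urllib.request

from bs4 import BeautifulSoup

def make_soup_safe(url, timeout=10):
    """Fetch url and return a BeautifulSoup object, or None if the fetch fails.

    Hypothetical variant of make_soup; the timeout default is an assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as page:
            html = page.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        print(f"Could not fetch {url}: {error}")
        return None
    return BeautifulSoup(html, "html.parser")
```

A caller would then check for `None` and skip that month instead of losing the whole run.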

In [14]:
# List that receives the URLs we want to analyze.
urls = []

# Loop over the months from August 1997 through October 2017, a window slightly wider than the September 1997 to August 2017 target.
for y in range(1997,2018):
  for m in range(1,13):
    # Skip the months of 1997 before August
    if y == 1997 and m < 8:
      continue
    # Stop after October 2017
    if y == 2017 and m > 10:
      break
    # Build the URL of the page to scrape, zero-padding the month to two digits
    urls.append(f'http://www.nuforc.org/webreports/ndxe{y}{m:02d}.html')

# Loop that prints the URLs we will collect data from.
for x in urls:
  print(x)

http://www.nuforc.org/webreports/ndxe199708.html
http://www.nuforc.org/webreports/ndxe199709.html
http://www.nuforc.org/webreports/ndxe199710.html
http://www.nuforc.org/webreports/ndxe199711.html
http://www.nuforc.org/webreports/ndxe199712.html
http://www.nuforc.org/webreports/ndxe199801.html
http://www.nuforc.org/webreports/ndxe199802.html
http://www.nuforc.org/webreports/ndxe199803.html
http://www.nuforc.org/webreports/ndxe199804.html
http://www.nuforc.org/webreports/ndxe199805.html
http://www.nuforc.org/webreports/ndxe199806.html
http://www.nuforc.org/webreports/ndxe199807.html
http://www.nuforc.org/webreports/ndxe199808.html
http://www.nuforc.org/webreports/ndxe199809.html
http://www.nuforc.org/webreports/ndxe199810.html
http://www.nuforc.org/webreports/ndxe199811.html
http://www.nuforc.org/webreports/ndxe199812.html
http://www.nuforc.org/webreports/ndxe199901.html
http://www.nuforc.org/webreports/ndxe199902.html
http://www.nuforc.org/webreports/ndxe199903.html
http://www.nuforc.or
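
The loop above keeps the months from August 1997 through October 2017, which is slightly wider than the September 1997 to August 2017 period named in the task. As a sketch of one way to generate exactly that period, the helper below numbers the months as a single integer so that no special cases are needed at the year boundaries. The function name `month_urls` and its parameters are hypothetical; the URL pattern is the one used in the cell above.

```python
def month_urls(start_year, start_month, end_year, end_month):
    """Build one report URL per month, with both end points included."""
    base = "http://www.nuforc.org/webreports/ndxe{year}{month:02d}.html"
    # Number each month as year * 12 + (month - 1) so the range is contiguous.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    result = []
    for index in range(first, last + 1):
        year, month0 = divmod(index, 12)
        result.append(base.format(year=year, month=month0 + 1))
    return result

period_urls = month_urls(1997, 9, 2017, 8)
print(len(period_urls))   # 240 months, i.e. twenty years
print(period_urls[0])     # http://www.nuforc.org/webreports/ndxe199709.html
print(period_urls[-1])    # http://www.nuforc.org/webreports/ndxe201708.html
```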

In [15]:
# Empty list that will hold one tuple per report
array = []

# Loop over every URL
for x in urls:
  # Call make_soup to download and parse the page
  soup = make_soup(x)
  # Find every 'tr' (table row)
  for record in soup.find_all('tr'):
    # Skip rows without 'td' cells, such as the header row
    if record.find('td') is None:
      continue
    # Get the 'font' elements, which hold the text of each cell
    cols = record.find_all('font')
    # Skip malformed rows that do not have all seven fields
    if len(cols) < 7:
      continue
    # Append the seven fields of the report, stripped of surrounding whitespace
    array.append(tuple(col.text.strip() for col in cols[:7]))

# Load the data into a DataFrame with named columns.
df = pd.DataFrame(array, columns=['DATE/TIME', 'CITY', 'STATE','SHAPE', 'DURATION','SUMMARY','POSTED'])

# Save the DataFrame to a csv file.
df.to_csv('ovnis_tabela.csv')

In [16]:
# Show the first 10 rows of the DataFrame.
df.head(10)

Unnamed: 0,DATE/TIME,CITY,STATE,SHAPE,DURATION,SUMMARY,POSTED
0,8/31/97 05:15,Lost Lake,OR,Egg,3 min,Two blue egg shaped objects floated past our c...,1/10/09
1,8/30/97 21:00,Ocracoke,NC,Other,5 minutes,Several stationary white lights in a curved sh...,7/14/13
2,8/30/97 19:00,Fort Fairfield,ME,Changing,10 minutes,Huge bright white round orange and red pulsati...,3/18/16
3,8/29/97 22:00,Tucson,AZ,Chevron,20+ secs.,As we traveled the 10e freeway to Tucson the c...,10/30/06
4,8/25/97 22:00,Fontana,CA,Triangle,15 sec.,My brother and I saw 3 lights forming a shape ...,10/30/06
5,8/17/97 18:00,Kalamazoo,MI,Disk,17 seconds,A saucer in the sky. 500 Lights On Object0: Yes,2/14/08
6,8/17/97 00:30,Sciota,PA,Disk,five minutes or so,"This is the second incident, nearly a year or ...",3/23/11
7,8/16/97 13:00,Louisville,KY,Changing,3 minutes,White Object Constantly Changes then Vanishes ...,8/10/18
8,8/15/97 00:00,Coimbra (Portugal),,Circle,2 minutes,The object was hovering in one place and then ...,2/1/07
9,8/15/97 22:00,Temerin (Serbia),,Disk,2-3 minutes,1997 august/ 50m away/ huge bright light/NO SO...,4/17/15
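
An `Unnamed: 0` column like the one in the table above typically appears when a CSV that was saved together with the DataFrame index is read back with `pd.read_csv`. As a small sketch, assuming the index carries no information worth keeping, passing `index=False` to `to_csv` avoids it. The two sample rows are copied from the output above, and the in-memory buffers are used only so the example does not touch the disk.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    [("Tucson", "AZ", "Chevron"), ("Fontana", "CA", "Triangle")],
    columns=["CITY", "STATE", "SHAPE"],
)

# Default behavior: the index is written as a first column with an empty header.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'CITY', 'STATE', 'SHAPE']

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)   # ['CITY', 'STATE', 'SHAPE']
```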


# Approach 2 - Performing the Web Scraping (Draft)


In [1]:
# Install the libraries in Colab
!pip install requests
!pip install beautifulsoup4
!pip install selenium
!pip install requests-html



In [2]:
# Import the libraries
import pandas as pd
import requests
from requests_html import HTMLSession
import urllib.request
from bs4 import BeautifulSoup

session = HTMLSession()
response = session.get('http://www.nuforc.org/webreports/ndxevent.html')
soup = BeautifulSoup(response.content, 'html.parser')
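
The index page fetched above lists one link per month. As a sketch of how those links could be turned into full URLs, the helper below skips rows without an `<a>` tag (for which `find('a')` returns `None`) and resolves each relative `href` against the page address with `urllib.parse.urljoin`. The function name `monthly_links` and the sample HTML are assumptions for illustration; the sample only imitates the link format and is not the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

INDEX_URL = "http://www.nuforc.org/webreports/ndxevent.html"

# Hypothetical stand-in for the downloaded index page.
SAMPLE_INDEX = """
<table>
  <tr><th>Reports</th></tr>
  <tr><td><a href="ndxe202009.html">09/2020</a></td></tr>
  <tr><td><a href="ndxe202008.html">08/2020</a></td></tr>
</table>
"""

def monthly_links(html, base_url):
    """Return the absolute URL of the first link in each table row."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        anchor = row.find("a")
        # Rows without a link, or with a link that has no href, are skipped.
        if anchor is None or not anchor.get("href"):
            continue
        links.append(urljoin(base_url, anchor["href"]))
    return links

print(monthly_links(SAMPLE_INDEX, INDEX_URL))
```

Discovering the links this way avoids hard-coding the year and month ranges, at the cost of one extra request for the index page.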


In [10]:
ovni_reports = soup.find_all('table')

for ovni_report in ovni_reports:
  rows = ovni_report.find_all('tr')
  for row in rows:
    # Take the first link in the row; rows without a link give None
    data_reports = row.find('a')
    print(data_reports)

None
<a href="ndxe202009.html">09/2020</a>
<a href="ndxe202008.html">08/2020</a>
<a href="ndxe202007.html">07/2020</a>
<a href="ndxe202006.html">06/2020</a>
<a href="ndxe202005.html">05/2020</a>
<a href="ndxe202004.html">04/2020</a>
<a href="ndxe202003.html">03/2020</a>
<a href="ndxe202002.html">02/2020</a>
<a href="ndxe202001.html">01/2020</a>
<a href="ndxe201912.html">12/2019</a>
<a href="ndxe201911.html">11/2019</a>
<a href="ndxe201910.html">10/2019</a>
<a href="ndxe201909.html">09/2019</a>
<a href="ndxe201908.html">08/2019</a>
<a href="ndxe201907.html">07/2019</a>
<a href="ndxe201906.html">06/2019</a>
<a href="ndxe201905.html">05/2019</a>
<a href="ndxe201904.html">04/2019</a>
<a href="ndxe201903.html">03/2019</a>
<a href="ndxe201902.html">02/2019</a>
<a href="ndxe201901.html">01/2019</a>
<a href="ndxe201812.html">12/2018</a>
<a href="ndxe201811.html">11/2018</a>
<a href="ndxe201810.html">10/2018</a>
<a href="ndxe201809.html">09/2018</a>
<a href="ndxe201808.html">08/2018</a>
<a href