# Rocket launch analysis

<h1>Table of contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#API-data" data-toc-modified-id="API-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>API data</a></span></li><li><span><a href="#Webscraping-data" data-toc-modified-id="Webscraping-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Webscraping data</a></span><ul class="toc-item"><li><span><a href="#Define-a-spider" data-toc-modified-id="Define-a-spider-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Define a spider</a></span></li><li><span><a href="#Define-the-parser" data-toc-modified-id="Define-the-parser-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Define the parser</a></span></li><li><span><a href="#Execution" data-toc-modified-id="Execution-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Execution</a></span></li></ul></li><li><span><a href="#Data-Export" data-toc-modified-id="Data-Export-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Export</a></span></li></ul></div>

## Imports

Libraries to import for the API and webscraping:

In [42]:
import requests, random, time
from bs4 import BeautifulSoup
import json
import pandas as pd

## API data

**Status**

- 1 = GO
- 2 = TBD
- 3 = Success
- 4 = Failure
- 5 = Hold
- 6 = In-flight
- 7 = Partial failure
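
The codes above can be kept in a small lookup table so that later cells can refer to a status by name instead of by number. This is only a minimal sketch: `STATUS_LABELS` and `status_label` are hypothetical names introduced for illustration, not part of the API.

```python
# Hypothetical lookup table built from the status list above.
STATUS_LABELS = {
    1: "GO",
    2: "TBD",
    3: "Success",
    4: "Failure",
    5: "Hold",
    6: "In-flight",
    7: "Partial failure",
}


def status_label(code):
    """Return the label for a status code, or 'Unknown' if it is not listed."""
    return STATUS_LABELS.get(code, "Unknown")


print(status_label(1))
```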

In [2]:
url_api = 'https://spacelaunchnow.me/api/3.3.0/launch/?'
headers = {'User-Agent': 'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'}

In [3]:
# Function to get the information of the launches depending on their status

def get_launches(limit=100, status=2):
    
    # Initial offset of the query
    offset = 0
    # API URL
    url_api = 'https://spacelaunchnow.me/api/3.3.0/launch/?'
    
    # Loop until all the elements are requested
    while True:
        
        # Query
        q = {'mode':'list', 'limit':limit, 'offset':offset, 'status':status}
        # Request
        r = requests.get(url_api, headers=headers, params=q).json()
        
        # First request:
        if offset == 0:
            # Export the information to a DataFrame
            launches = pd.json_normalize(r['results'])
        else:
            # Append the new page to the previous DataFrame
            df = pd.json_normalize(r['results'])
            launches = pd.concat([launches, df], ignore_index=True)
            
        # Check if all the elements have been requested and if so, break the loop
        if (offset+limit) < r['count']:
            offset += limit
        else:
            break
    
    return launches
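
The loop in `get_launches` pages through the results with `limit` and `offset` until `offset + limit` reaches the total `count` reported by the API. As a rough illustration that needs no network access, the sketch below lists the offsets such a loop would request. `page_offsets` is a hypothetical helper, not part of the notebook.

```python
def page_offsets(count, limit=100):
    """Return the offsets that a limit/offset loop requests for `count` items."""
    offsets = [0]  # the first request is always made, even if count is 0
    # Keep advancing while the current page does not reach the last item
    while offsets[-1] + limit < count:
        offsets.append(offsets[-1] + limit)
    return offsets


print(page_offsets(250))
```

With 250 launches and the default limit of 100, three requests are made, at offsets 0, 100 and 200.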

Get the information on the upcoming launches: the confirmed ones (GO) and the ones still to be defined (TBD)

In [4]:
launches_go = get_launches(status=1)
launches_tbd = get_launches(status=2)

Merge both DataFrames

In [5]:
launches_info = pd.concat([launches_go, launches_tbd], ignore_index=True)

Export the data to a CSV file

In [9]:
launches_info.to_csv('output/api_data.csv', index=False)

In [27]:
urls_list = list(launches_info.sort_values(['net'])['slug'][:5])
print(urls_list)

['https://spacelaunchnow.me/launch/ariane-5-eca-galaxy-30-mev-2-bsat-4b', 'https://spacelaunchnow.me/launch/falcon-9-block-5-starlink-10', 'https://spacelaunchnow.me/launch/delta-iv-heavy-nrol-44', 'https://spacelaunchnow.me/launch/electron-stp-27rm', 'https://spacelaunchnow.me/launch/soyuz-21bfregat-glonass-k1']


## Webscraping data

The API returns a URL for each launch. Scraping that page yields more information about the launch, which will be added to the DataFrame to complete it.

### Define a spider

In [43]:
class LaunchSpider:
    """
    Parameters:
    - url_list: list of URLs to scrape
    - sleep_interval: the time interval in seconds to delay between requests. If <= 0, requests will not be delayed.
    - content_parser: a function reference that will extract the intended info from the scraped content
    """
    def __init__(self, url_list, sleep_interval=-1, content_parser=None):
        self.url_list = url_list  # TODO: check whether it is a list or a single URL
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser

    def get_random_ua(self):
        """
        Return a random user-agent for the headers
        """
        browsers = [
            'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        ]

        # Store the choice and return it so the caller can build the headers
        self.user_agent = random.choice(browsers)
        return self.user_agent

    def scrape_url(self, url):
        """
        Scrape the content of a single url
        """
        # Generate the headers with a random user-agent
        headers = {'user-agent': self.get_random_ua()}

        # If the request itself fails, print the error and skip this url,
        # because there is no response to parse
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.exceptions.Timeout:
            print('There has been a timeout error')
            return
        except requests.exceptions.TooManyRedirects:
            print('Too many redirects')
            return
        except requests.exceptions.SSLError:
            print('SSL error')
            return
        except requests.exceptions.RequestException as e:
            # Any other request error
            print(f'{e}')
            return

        # Report HTTP error codes
        if 400 <= response.status_code < 500:
            print('The request failed because the resource either does not exist or is forbidden')
        elif 300 <= response.status_code < 400:
            print('Redirection error')
        elif response.status_code >= 500:
            print('Server error')

        result = self.content_parser(response.content)
        self.output_results(result)

    def output_results(self, r):
        """
        Export the scraped content. Right now it simply prints the results,
        but in the future the results could be exported to a text file or database.
        """
        print(r)

    def kickstart(self):
        """
        After the class is instantiated, call this method to start scraping the URL list.
        It uses a for loop to call `scrape_url()` for each URL to scrape.
        """
        for url in self.url_list:
            self.scrape_url(url)
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)
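
The `output_results` docstring mentions exporting the results to a text file in the future. One possible way to do that, sketched here with the standard `csv` module, is to collect `(url, description)` pairs and write them out together. The function name `write_results` and the column names are assumptions made for this example, not part of the spider.

```python
import csv
import io


def write_results(rows, file_obj):
    """Write (url, description) pairs as CSV rows, preceded by a header."""
    writer = csv.writer(file_obj)
    writer.writerow(["url", "description"])  # assumed column names
    writer.writerows(rows)


# Illustrative data only: one URL from the list above with a placeholder description
rows = [("https://spacelaunchnow.me/launch/electron-stp-27rm", "no information")]

# An in-memory buffer stands in for a real file opened with open(..., newline="")
buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

To use something like this from the spider, `output_results` would need to receive the URL along with the parsed text, which the current class does not pass.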

### Define the parser

From each URL, we would like to obtain the description of the launch. It usually includes a description of the payload inside the rocket.

In [48]:
def launch_parser(content):

    soup = BeautifulSoup(content, 'html.parser')
    
    # Some of the launches do not have any description yet and in this case there is
    # no additional information to add
    
    try:
        return soup.find('div', attrs={'class':'col-md-12 mx-auto'}).find('p').text
    except AttributeError:
        # find() returned None: the expected <div> or <p> is not in the page
        return 'no information'

### Execution

In [53]:
spider = LaunchSpider(urls_list, sleep_interval=1, content_parser=launch_parser)

In [54]:
spider.kickstart()

Galaxy-30 is a geostationary communications satellite for Intelsat. Satellite is built by Northrop Grumman Innovation Systems (NGIS) and is planned to provide video distribution and broadcast services to customers in North America.
Galaxy 30 satellite is launched in tandem with MEV-2 vehicle. MEV-2, which stands for Mission Extension Vehicle-2, is the second servicing mission by NGIS. MEV-2 will rendezvous and dock with the Intelsat 1002 satellite in early 2021. Then, MEV-2 will use its own thrusters and fuel supply to control the satellite’s orbit, thereby extending its useful lifetime.
Another passenger of the flight is the BSAT-4b satellite for the Japanese operator BSAT. BSAT-4b will serve as a back-up for BSAT-4a satellite, launched in 2017. BSAT-4b will provide Direct-to-Home television services and is expected to operate for at least 15 years.
A batch of 58 satellites for Starlink mega-constellation - SpaceX's project for space-based Internet communication system. This launch wi

## Data Export