# Combining two data sources

This exercise combines the previous ones and retrieves information from both Wikipedia and Yahoo Finance.

I need to modify the earlier programs to scrape the list of S&P 500 companies from Wikipedia and get additional information for each company from the Yahoo Finance website.

The data should be stored in pandas DataFrames and saved to CSV files.

## Imports

In [2]:
import time
import pandas as pd

from bs4 import BeautifulSoup
from selenium import webdriver

## Get target page

In this exercise, we scrape information from "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies".

In [11]:
# Load selenium webdriver (raw string so the backslashes are not read as escapes)
driver = webdriver.Chrome(r'C:\Windows\chromedriver\chromedriver')

# Load needed webpage
URL_TO_BE_SCRAPED = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
driver.get(URL_TO_BE_SCRAPED)

# Initialize soup
soup_SandP_500 = BeautifulSoup(driver.page_source, "lxml")

### Scrape the information from S&P 500 page on Wikipedia

In [12]:
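# The companies table has id="constituents"; its first header cell is the symbol column
table = soup_SandP_500.find(id="constituents")
company_header = table.thead.tr.th.text.strip()
companies = []
for tr in table.tbody.contents:
    info_scraped = {}
    try:
        info_scraped[company_header] = tr.td.text.strip()
        companies.append(info_scraped)
    except AttributeError:
        # Skip whitespace nodes and rows without a <td> cell
        pass
print(companies[0:3])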

[{'Symbol': 'MMM'}, {'Symbol': 'AOS'}, {'Symbol': 'ABT'}]
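These symbols are later appended to the Yahoo Finance quote URL. A minimal sketch of that step is shown here, with one caveat that may matter once the loop covers the whole list: as far as I know, Wikipedia writes share classes with a dot (for example `BRK.B`) while Yahoo Finance uses a hyphen (`BRK-B`). The `build_finance_url` helper is hypothetical, and the dot-to-hyphen rule is an assumption that should be checked against the live site.

```python
from urllib.parse import quote

BASE_FINANCE_URL = "https://finance.yahoo.com/quote/"


def build_finance_url(symbol):
    """Hypothetical helper: build a quote URL from a Wikipedia-style symbol."""
    # Assumption: Yahoo Finance uses '-' where Wikipedia uses '.' (BRK.B -> BRK-B).
    yahoo_symbol = symbol.strip().upper().replace(".", "-")
    # quote() percent-encodes any character that is not safe in a URL path.
    return BASE_FINANCE_URL + quote(yahoo_symbol)


print(build_finance_url("MMM"))
print(build_finance_url("BRK.B"))
```

Symbols without a dot, such as `MMM`, pass through unchanged, so the helper can be applied to every entry in `companies`.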


## Scrape the information of each company

First, we define the function that extracts the summary information from a company's Yahoo Finance page.

In [5]:
def scrape_company(soup):
    """Return the label/value pairs from the quote summary tables as a dict."""
    tables_div = soup.find(id="quote-summary")
    company_info = {}
    for x in tables_div.contents:
        for y in x.table:
            parent = y.tr.parent
            for tr in parent.contents:
                # Each row holds two cells: the label and its value
                partial_list = []
                for td in tr.contents:
                    partial_list.append(td.text)
                company_info[partial_list[0]] = partial_list[1]
    return company_info
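To make the expected return value concrete, here is a minimal sketch that applies the same label/value idea to a small hand-written HTML fragment. The markup and the `rows_to_dict` helper are assumptions for illustration only; the real Yahoo Finance page is more deeply nested and may change without notice. The sketch uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real Yahoo Finance page).
SAMPLE_HTML = """
<div id="quote-summary">
  <table><tbody>
    <tr><td>Previous Close</td><td>164.39</td></tr>
    <tr><td>Open</td><td>163.79</td></tr>
  </tbody></table>
  <table><tbody>
    <tr><td>Market Cap</td><td>91.913B</td></tr>
  </tbody></table>
</div>
"""


def rows_to_dict(soup):
    """Hypothetical helper: map the first cell of each row to the second."""
    summary = soup.find(id="quote-summary")
    if summary is None:  # layout changed or the page did not load
        return {}
    info = {}
    for tr in summary.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # skip rows without a label/value pair
            info[cells[0]] = cells[1]
    return info


sample_info = rows_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(sample_info)
```

Using `find_all` instead of walking `.contents` skips whitespace nodes automatically, and returning an empty dict when the container is missing keeps one bad page from raising an error.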

Now we need to iterate over the companies found in the first scrape.

In [13]:
BASE_FINANCE_URL = "https://finance.yahoo.com/quote/"

companies_info = {}

# Fetch only the first 5 companies as a test
for company in companies[0:5]:
    company_symbol = company[company_header]

    # Load needed webpage
    finance_url = BASE_FINANCE_URL + company_symbol
    print('connecting to:', finance_url)
    driver.get(finance_url)

    # Initialize soup
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Call function to scrape the company information
    company_info = scrape_company(soup=soup)

    companies_info[company_symbol] = company_info

    # We hit the same server repeatedly, so pause between requests to avoid overloading it
    time.sleep(5)

# Close the driver
driver.quit()

connecting to: https://finance.yahoo.com/quote/MMM
connecting to: https://finance.yahoo.com/quote/AOS
connecting to: https://finance.yahoo.com/quote/ABT
connecting to: https://finance.yahoo.com/quote/ABBV
connecting to: https://finance.yahoo.com/quote/ABMD
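The loop above stops at the first page that fails to parse. One possible way to keep the polite delay while recording failures is sketched below. The `fetch_all` helper and its parameters are hypothetical, and `fake_fetch` stands in for the Selenium and BeautifulSoup steps so the example runs without network access.

```python
import time


def fetch_all(symbols, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(symbol) for each symbol, pausing in between.

    Failures are recorded instead of aborting the whole run.
    """
    results, failures = {}, {}
    for index, symbol in enumerate(symbols):
        if index > 0:
            sleep(delay_seconds)  # wait between requests, not before the first one
        try:
            results[symbol] = fetch(symbol)
        except Exception as error:  # broad on purpose: keep going on any failure
            failures[symbol] = repr(error)
    return results, failures


# Stand-in for the real page load and scrape (an assumption for this demo).
def fake_fetch(symbol):
    if symbol == "BAD":
        raise ValueError("no quote summary found")
    return {"Symbol": symbol}


pauses = []  # collects the requested delays instead of actually sleeping
results, failures = fetch_all(["MMM", "BAD", "AOS"], fake_fetch, sleep=pauses.append)
print(results)
print(list(failures))
print(pauses)
```

Passing `sleep` as a parameter is only a convenience for testing; in the notebook the default `time.sleep` would be used, and `fetch` would wrap `driver.get` plus `scrape_company`.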


### Create the Data Frame and save the info

In [14]:
df = pd.DataFrame(companies_info)
filename = str(time.time())
df.to_csv(filename + '- SandP 500.csv')
df

| | MMM | AOS | ABT | ABBV | ABMD |
|---|---|---|---|---|---|
| Previous Close | 164.39 | 74.09 | 130.11 | 140.73 | 290.52 |
| Open | 163.79 | 73.73 | 129.56 | 140.15 | 287.27 |
| Bid | 160.60 x 1200 | 73.57 x 1300 | 117.17 x 800 | 138.75 x 1000 | 268.00 x 800 |
| Ask | 161.00 x 1000 | 73.70 x 800 | 134.22 x 900 | 145.00 x 2200 | 350.00 x 1400 |
| Day's Range | 160.10 - 164.74 | 72.91 - 74.52 | 128.60 - 130.52 | 139.23 - 142.77 | 286.80 - 299.24 |
| 52 Week Range | 160.10 - 208.95 | 57.81 - 86.74 | 105.36 - 142.60 | 102.05 - 142.80 | 261.27 - 379.30 |
| Volume | 4046463 | 1060291 | 4149683 | 6921536 | 247843 |
| Avg. Volume | 2602227 | 1030359 | 6106506 | 6972585 | 358735 |
| Market Cap | 91.913B | 11.711B | 229.365B | 248.652B | 13.457B |
| Beta (5Y Monthly) | 0.96 | 1.16 | 0.74 | 0.80 | 1.36 |
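Because `companies_info` is a dict of dicts, `pd.DataFrame` turns the outer keys (the symbols) into columns. If one row per company is preferred, `DataFrame.from_dict` with `orient="index"` is one option. The sketch below uses two entries copied from the results above; the variable names and the timestamp format are my own choices, not part of the original exercise.

```python
import time

import pandas as pd

# Two entries copied from the results above; the real dict has one per company.
companies_info = {
    "MMM": {"Previous Close": "164.39", "Open": "163.79"},
    "AOS": {"Previous Close": "74.09", "Open": "73.73"},
}

# Default constructor: outer keys (symbols) become columns.
by_field = pd.DataFrame(companies_info)

# orient="index": outer keys become rows, one row per company.
by_company = pd.DataFrame.from_dict(companies_info, orient="index")
by_company.index.name = "Symbol"  # gives the index column a header in the CSV

# A readable timestamp instead of raw epoch seconds (format is an assumption).
filename = time.strftime("%Y%m%d-%H%M%S") + " - SandP 500.csv"

# to_csv() without a path returns the CSV text, which is handy for a quick check.
csv_text = by_company.to_csv()
print(filename)
print(csv_text)
```

Calling `by_company.to_csv(filename)` would write the same content to disk.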


### Conclusion

This exercise was good practice for web scraping across multiple domains. It also showed me the importance of respecting servers by pacing requests and **NOT SPAMMING** them 😉.