## Exercise 1: Scrape NASDAQ Top Gainers
Steps:
1. **Initial Scrape:** Scrape the NASDAQ Top Gainers Table (https://www.nasdaq.com/market-activity/stocks/screener?exchange=nasdaq&status=top-gainers).
   - **Fallback:** If the request to NASDAQ times out, scrape Yahoo Finance instead (https://finance.yahoo.com/markets/stocks/gainers/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAACvz6Ex45XoUQkTNdDAujGj-X1mDenZIQcqrx6vnpefvlJ9NoDdFaU1W6EO9SzM8m0aA1t7qTMhWSZq2zdbdGfRyC47dQXdu8ZG8IISgSgz6DXTsJe0Jrp3hGEKnAxOCDSjeey7roNKAj5L0UJ68arDOoeeI13BkNR2xMSggz88c).
2. **Data Cleanup:** Keep only the 'Symbol', 'Company', and 'Price' columns. In the Yahoo data, the symbol and company name share a single column.
3. **Analysis:** Find the company with the highest stock price. Hint: With Yahoo, you can use the `start` and `count` query parameters to see all companies.
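
The cleanup and analysis steps can be tried offline before touching the network. The helper below is a minimal sketch, assuming the first cell of each row looks like "SYMBOL Company Name" and the second cell starts with the price; the name `parse_gainer_row` and the sample rows are hypothetical, and the real cell layout may differ.

```python
def parse_gainer_row(cells):
    """Split a scraped row into (symbol, company, price).

    Assumes cells[0] looks like "SYMBOL Company Name" and cells[1]
    starts with the price, e.g. "12.50 +1.25 (+11.11%)".
    """
    symbol, _, company = cells[0].strip().partition(" ")
    price_text = cells[1].split()[0]
    # Prices above 999 may carry thousands separators such as "1,234.50".
    price = float(price_text.replace(",", ""))
    return symbol, company.strip(), price


# Made-up rows for illustration only; these are not real quotes.
sample_rows = [
    ["ABC Example Corp", "12.50 +1.25 (+11.11%)"],
    ["XYZ Sample Holdings Inc.", "1,234.50 +100.00 (+8.81%)"],
]
parsed = [parse_gainer_row(row) for row in sample_rows]
most_expensive = max(parsed, key=lambda item: item[2])
print(most_expensive)
```

Stripping the comma before calling `float` matters only for prices of 1,000 or more, which is easy to miss when most gainers are cheap stocks.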

In [None]:
import requests as re
from bs4 import BeautifulSoup
import pandas as pd
attrs = {
    "start": 0,
    "count": 100
}
url = "https://finance.yahoo.com/markets/stocks/gainers/"
results = re.get(url, params=attrs)    # start/count are sent as query parameters
src = results.content
document = BeautifulSoup(src, "lxml")
tables = document.find_all("table")    # checked manually: the page has exactly one table
table = tables[0]
data = {"Symbol": [], "Company": [], "Price": []}
rows = table.find_all("tr")
for row in rows[1:]:
    values = [c.get_text() for c in row.find_all("td")]
    symbol_and_name = values[0].split()
    symbol = symbol_and_name[0]
    company_name = " ".join(symbol_and_name[1:])
    price_chg_pctchg = values[1].split()
    price = price_chg_pctchg[0]
    data["Symbol"].append(symbol)
    data["Company"].append(company_name)
    data["Price"].append(float(price.replace(",", "")))    # drop thousands separators

df = pd.DataFrame(data)
sorted_df = df.sort_values(by="Price", ascending=False)
top_company = sorted_df.iloc[0]
print(f"Of {len(df)} companies, {top_company.Company} has the most expensive share price")

In [None]:
from requests_html import HTMLSession

session = HTMLSession()

url = "https://finance.yahoo.com/markets/stocks/gainers/?start=0&count=100"
response = session.get(url)
tables = response.html.find('table')
table = tables[0]
rows = table.find('tr')
data = {"Symbol": [], "Company": [], "Price": []}
for row in rows[1:]:
    values = [c.text for c in row.find("td")]
    symbol_and_name = values[0].split()
    symbol = symbol_and_name[0]
    company_name = " ".join(symbol_and_name[1:])
    price_chg_pctchg = values[1].split()
    price = price_chg_pctchg[0]
    data["Symbol"].append(symbol)
    data["Company"].append(company_name)
    data["Price"].append(float(price.replace(",", "")))    # drop thousands separators

df = pd.DataFrame(data)
sorted_df = df.sort_values(by="Price", ascending=False)
top_company = sorted_df.iloc[0]
print(f"Of {len(df)} companies, {top_company.Company} has the most expensive share price")
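
The `start` and `count` parameters from the hint can also be used to page through the list. The following is a rough sketch, assuming Yahoo accepts these two query parameters as in the cells above; the helper `page_urls` and the page size are hypothetical, and whether the site honors every combination is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.yahoo.com/markets/stocks/gainers/"


def page_urls(total, page_size=100):
    """Build one URL per page of `page_size` rows covering `total` rows."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'count': page_size})}"
        for start in range(0, total, page_size)
    ]


# Hypothetical total of 250 rows, which needs three pages of 100.
urls = page_urls(250)
for page_url in urls:
    print(page_url)
```

Each URL could then be fetched and parsed with the same row loop, appending to one shared `data` dictionary.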

## Exercise 2: Scrape the Top 250 Movies by Lifetime Gross
Steps:
1. **Initial Scrape:** Scrape BoxOfficeMojo's list of top 250 movies (https://www.boxofficemojo.com/chart/top_lifetime_gross/).
2. **Data Cleanup:** Keep only relevant columns such as 'Rank', 'Title', 'Lifetime Gross', and 'Year'.
3. **Analysis:** Find the best decade in terms of 'Lifetime Gross'.
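
The decade analysis comes down to mapping each year to the start of its decade and averaging per group. Here is a minimal sketch in plain Python, assuming "best" means the highest average gross; the function name `average_gross_by_decade` and the sample figures are made up for illustration.

```python
from collections import defaultdict


def average_gross_by_decade(movies):
    """Return {decade_start: mean gross} for (year, gross) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # decade -> [sum, count]
    for year, gross in movies:
        decade = year // 10 * 10  # e.g. 1997 -> 1990, 2000 -> 2000
        totals[decade][0] += gross
        totals[decade][1] += 1
    return {decade: total / count for decade, (total, count) in totals.items()}


# Made-up (year, gross) pairs for illustration only.
sample = [(1977, 300.0), (1979, 100.0), (1994, 400.0), (2009, 700.0), (2000, 500.0)]
averages = average_gross_by_decade(sample)
best_decade = max(averages, key=averages.get)
print(best_decade, averages[best_decade])
```

Integer division keeps boundary years such as 2000 inside their own decade, and decades with no movies never appear in the result.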

In [None]:
url = "https://www.boxofficemojo.com/chart/top_lifetime_gross/"
results = re.get(url)
src = results.content
document = BeautifulSoup(src, "lxml")
tables = document.find_all("table")
table = tables[0]
rows = table.find_all('tr')
data = {"Title": [], "Gross": [], "Year": []}
for row in rows[1:]:
    elements = [e.get_text() for e in row.find_all("td")]
    data["Title"].append(elements[1])
    income = elements[2]
    income = float(income.replace(",", "").replace("$", ""))
    data["Gross"].append(income)
    data["Year"].append(int(elements[3]))

df = pd.DataFrame(data, index=pd.Index(range(1, len(data["Title"])+1), name="Rank"))
# Map each release year to the start of its decade (e.g. 1994 -> 1990),
# so boundary years such as 2000 are included and no decade is hard-coded.
df["Decade"] = df["Year"] // 10 * 10
average_gross = df.groupby("Decade")["Gross"].mean()
max_decade = average_gross.idxmax()

print(f"The {max_decade}s had the highest average lifetime gross")

## Exercise 3: Scrape Wikipedia's List of Best-selling Music Artists
Steps:
1. **Initial Scrape:** Scrape Wikipedia's table of best-selling music artists (https://en.wikipedia.org/wiki/List_of_best-selling_music_artists).
2. **Data Cleanup:** Retain only 'Artist', 'Country/Market', and 'Certified Sales'.
3. **Analysis:** Find the artist with the highest certified sales.
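
Sales figures in the table are text rather than numbers, so they need converting before they can be compared. A small sketch of that conversion, assuming each value starts with a number followed by the word "million"; `parse_millions` and the placeholder artists are hypothetical.

```python
def parse_millions(text):
    """Convert text such as "294.5 million" to a float count of millions."""
    number = text.strip().split()[0]
    return float(number.replace(",", ""))


# Placeholder artists and figures for illustration only.
sample = [
    ("Artist A", "Country X", "294.5 million"),
    ("Artist B", "Country Y", "1,020 million"),
    ("Artist C", "Country Z", " 87 million "),
]
cleaned = [(name, country, parse_millions(sales)) for name, country, sales in sample]
top_artist = max(cleaned, key=lambda item: item[2])
print(top_artist)
```

Note that `str.strip(" million")` removes any of those characters from both ends of the string rather than the literal word, so splitting on whitespace is the less surprising option.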

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_best-selling_music_artists"
results = re.get(url)
src = results.content
document = BeautifulSoup(src, "lxml")
tables = document.find_all("table")
table = tables[0]
rows = table.find_all('tr')
data = {"Name": [], "Country": [], "Certified sales": []}
for row in rows[1:]:
    values = row.get_text().split("\n")
    name = values[1]
    country = values[3]
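    # str.strip(" million") strips a set of characters, not the word; remove the word explicitly.
    sales = float(values[12].replace("million", "").strip())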
    data["Name"].append(name)
    data["Country"].append(country)
    data["Certified sales"].append(sales)

df = pd.DataFrame(data, index=pd.Index(range(1, len(data["Name"])+1), name="Claimed rank"))
df_sorted = df.sort_values(by="Certified sales", ascending=False)
print(f"{df_sorted.iloc[0].Name} has the highest certified sales")

    

## Exercise 4: Scrape CoinMarketCap's Top 10 Cryptocurrencies
Steps:
1. **Initial Scrape:** Scrape CoinMarketCap's table of top cryptocurrencies (https://coinmarketcap.com/).
2. **Data Cleanup:** Retain only 'Name', 'Symbol', and 'Market Cap'.
3. **Analysis:** Identify the cryptocurrency with the highest market cap.
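
On CoinMarketCap the market cap cell holds an abbreviated value followed by the full value, for example "$1.3T$1,295,157,824,810". The sketch below shows one way to extract the full figure, assuming that layout holds for every row; `parse_market_cap` is a hypothetical helper, and the two sample values are taken from the scraped output in this notebook.

```python
def parse_market_cap(cell_text):
    """Extract the full dollar figure from text like "$1.3T$1,295,157,824,810".

    The cell holds an abbreviated value followed by the full value, so the
    last "$"-separated piece is used.
    """
    full_value = cell_text.split("$")[-1]
    return float(full_value.replace(",", ""))


caps = {
    "Bitcoin": parse_market_cap("$1.3T$1,295,157,824,810"),
    "Ethereum": parse_market_cap("$318.29B$318,294,351,794"),
}
largest = max(caps, key=caps.get)
print(largest, caps[largest])
```

Taking the last piece rather than a fixed index also works if a cell happens to contain only the full value.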


In [4]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome()
url = "https://coinmarketcap.com/"
driver.get(url)
time.sleep(3)
height = driver.execute_script("return window.innerHeight;")
for _ in range(12):
    driver.execute_script(f"window.scrollBy(0, {height});")
    time.sleep(0.5)

html_content = driver.page_source
document = BeautifulSoup(html_content, "lxml")
tables = document.find_all("table")
table = tables[0]
rows = table.find_all('tr')
data = {"Name": [], "Symbol": [], "Market cap": []}
for row in rows[1:]:
    elements = row.find_all("td")
    name_and_symbol = elements[2]
    name, symbol = [e.get_text() for e in name_and_symbol.find_all("p")]
    # The cell text looks like "$1.3T$1,295,157,824,810"; keep the full figure after the second "$".
    market_cap_text = elements[7].get_text()
    market_cap = float(market_cap_text.split("$")[2].replace(",", ""))
    data["Symbol"].append(symbol)
    data["Name"].append(name)
    data["Market cap"].append(market_cap)

driver.close()
df = pd.DataFrame(data, index=pd.Index(range(1, len(data["Name"]) + 1), name="Rank"))
# Look up the top row first; nesting double quotes inside an f-string needs Python 3.12+.
top_crypto = df.sort_values(by="Market cap", ascending=False).iloc[0]
print(f"{top_crypto['Name']} has the highest market cap")

Bitcoin has the highest market cap


In [5]:
from selenium import webdriver
from io import StringIO
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome()
url = "https://coinmarketcap.com/"
driver.get(url)
time.sleep(3)
height = driver.execute_script("return window.innerHeight;")
for _ in range(12):
    driver.execute_script(f"window.scrollBy(0, {height});")
    time.sleep(0.5)

html_content = driver.page_source
driver.close()
df = pd.read_html(StringIO(html_content))[0]
df

| Unnamed: 0.1 | Unnamed: 0 | # | Name | Price | 1h % | 24h % | 7d % | Market Cap | Volume(24h) | Circulating Supply | Last 7 Days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1 | BitcoinBTC | $65,579.85 | 0.29% | 2.17% | 3.30% | $1.3T$1,295,157,824,810 | $39,599,374,261604,147 BTC | 19,759,562 BTC | |
| 1 | | 2 | EthereumETH | $2,645.13 | 0.52% | 0.59% | 3.82% | $318.29B$318,294,351,794 | $17,712,231,5396,697,891 ETH | 120,363,208 ETH | |
| 2 | | 3 | TetherUSDT | $1.00 | 0.00% | 0.03% | 0.02% | $119.38B$119,383,515,443 | $70,837,080,56170,819,749,402 USDT | 119,354,306,804 USDT | |
| 3 | | 4 | BNBBNB | $603.78 | 0.55% | 1.34% | 5.67% | $88.11B$88,111,101,011 | $2,149,459,0963,560,022 BNB | 145,933,216 BNB | |
| 4 | | 5 | SolanaSOL | $156.81 | 0.94% | 3.31% | 4.11% | $73.52B$73,516,695,782 | $3,289,113,79420,975,349 SOL | 468,830,950 SOL | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | | 96 | SATS1000SATS | $0.0003432 | 2.19% | 8.98% | 13.13% | $720.82M$720,817,382 | $132,812,046385,809,025,605 1000SATS | 2,100,000,000,000 1000SATS | |
| 96 | | 97 | PendlePENDLE | $4.45 | 1.16% | 1.11% | 26.11% | $718.86M$718,864,803 | $140,440,10631,543,428 PENDLE | 161,578,628 PENDLE | |
| 97 | | 98 | PayPal USDPYUSD | $0.9998 | 0.01% | 0.01% | 0.01% | $711.94M$711,936,403 | $25,081,67525,088,063 PYUSD | 712,057,031 PYUSD | |
| 98 | | 99 | The SandboxSAND | $0.2967 | 0.26% | 4.01% | 8.09% | $708.84M$708,840,832 | $57,069,860192,486,199 SAND | 2,389,232,126 SAND | |
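
In the `pd.read_html` output the name and ticker are fused into one string ("BitcoinBTC"), so the 'Name' and 'Symbol' columns still need separating. One possible approach, sketched below, reads the ticker from the end of the 'Circulating Supply' cell and removes it from the end of the 'Name' cell. It assumes the supply cell always ends with the ticker, as in the rows shown above; `split_name_symbol` is a hypothetical helper, and `str.removesuffix` requires Python 3.9 or later.

```python
def split_name_symbol(name_cell, supply_cell):
    """Split "BitcoinBTC" into ("Bitcoin", "BTC").

    The ticker is taken from the end of the Circulating Supply cell,
    e.g. "19,759,562 BTC" -> "BTC".
    """
    symbol = supply_cell.split()[-1]
    name = name_cell.removesuffix(symbol).strip()
    return name, symbol


# Cell values copied from the output above.
examples = [
    ("BitcoinBTC", "19,759,562 BTC"),
    ("SATS1000SATS", "2,100,000,000,000 1000SATS"),
    ("PayPal USDPYUSD", "712,057,031 PYUSD"),
]
for name_cell, supply_cell in examples:
    print(split_name_symbol(name_cell, supply_cell))
```

With pandas, the same function could be applied row by row to fill new 'Name' and 'Symbol' columns.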
