<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Data Scraping for SmartInvest

- [API Reference]()
- [Reference](https://www.cnbc.com/)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

### Table of Contents <a class="anchor" id="PSCRAPE_toc"></a>

* [Table of Contents](#PSCRAPE_toc)
    * [1. Abstract](#PSCRAPE_page_1)
    * [2. Imported Libraries](#PSCRAPE_page_2)
    * [3. Import Data](#PSCRAPE_page_3)
    * [4. Setting Notebook Options](#PSCRAPE_page_4)
    * [5. Looking at the Data](#PSCRAPE_page_5)
    * [6. Checking the Column Names](#PSCRAPE_page_6)
    * [7. Cleaning the Column Names](#PSCRAPE_page_7)
    * [8. Creating a new Cleaned Dataset](#PSCRAPE_page_8)
    * [9. Counting Columns](#PSCRAPE_page_9)
    * [10. Get Info about the Dataset](#PSCRAPE_page_10)
    * [11. Get Descriptive Statistics about the Dataset](#PSCRAPE_page_11)
    * [12. Counting Rows and Removing any NANs](#PSCRAPE_page_12)
    * [13. Correlation Analysis](#PSCRAPE_page_13)
    * [14. Principal Component Analysis (PCA)](#PSCRAPE_page_14)
    * [15. Group Comparison](#PSCRAPE_page_15)
    * [16. TBD](#PSCRAPE_page_16)
    * [17. Groupby Function](#PSCRAPE_page_17)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 1 - Abstract <a class="anchor" id="PSCRAPE_page_1"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

>This notebook uses the BeautifulSoup Python library to scrape headline and article text from the CNBC financial news site. The scraped text is the foundation for a sentiment indicator for stock market analysis. Sentiment analysis assigns each article a score indicating whether its tone is positive, negative, or neutral. Aggregating these scores over time yields a sentiment indicator that summarizes market sentiment and can help traders and investors make informed decisions.
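
The scoring and aggregation steps described in the abstract could look roughly like the sketch below. It is only an illustration: the word lists are hypothetical and hand-made, the dates attached to the sample headlines are assumed, and a simple word count stands in for whatever sentiment model the project eventually uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical word lists for illustration only; a real project would use a
# trained model or an established sentiment lexicon.
POSITIVE_WORDS = {"beat", "beating", "gain", "gains", "rally", "surge"}
NEGATIVE_WORDS = {"fall", "falls", "lower", "slide", "layoffs", "snag"}


def score_headline(headline):
    """Return +1, -1, or 0 by comparing positive and negative word counts."""
    words = [w.strip(".,'\"").lower() for w in headline.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if positives > negatives:
        return 1
    if negatives > positives:
        return -1
    return 0


def daily_indicator(dated_headlines):
    """Average the headline scores for each date."""
    scores_by_date = defaultdict(list)
    for date, headline in dated_headlines:
        scores_by_date[date].append(score_headline(headline))
    return {date: mean(scores) for date, scores in scores_by_date.items()}


# Headlines taken from the notebook output; the dates are assumed.
sample = [
    ("2023-05-23", "S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington"),
    ("2023-05-23", "Meta has started its latest round of layoffs, focusing on business groups"),
    ("2023-05-24", "Top 10 best U.S. cities for new college graduates"),
]
indicator = daily_indicator(sample)
print(indicator)
```

Averaging per date keeps the indicator between -1 and 1 regardless of how many headlines were scraped on a given day.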

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 2 - Imported Libraries<a class="anchor" id="PSCRAPE_page_2"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [2]:
import requests
from bs4 import BeautifulSoup

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 3 - Import Data<a class="anchor" id="PSCRAPE_page_3"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [3]:
# Create object URL
#URL = "https://www.cnbc.com/"
#page = requests.get(URL)

#soup = BeautifulSoup(page.content, "html.parser")

In [8]:
# Request the page and stop with a clear error if the request fails

URL = "https://www.cnbc.com/"

page = requests.get(URL, timeout=10)

# Raise instead of calling exit(), which would shut down the notebook kernel
if page.status_code != 200:
    raise RuntimeError(f"Error fetching page: HTTP {page.status_code}")

content = page.content
#print(content)

## Primary Headline

In [69]:
URL = "https://www.cnbc.com/"

In [70]:
page = requests.get(URL)

In [71]:
soup = BeautifulSoup(page.content, "html.parser")

In [72]:


# This pulls the URL and main headline data
soup.find('div', class_="FeaturedCard-packagedCardTitle")

In [73]:
result = soup.find_all('div', class_="FeaturedCard-packagedCardTitle")

In [74]:
for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Dow falls to session lows as Speaker McCarthy says debt ceiling talks still hung up on spending


## Secondary Headlines

In [76]:
result_secondary = soup.find_all('div', class_="SecondaryCard-headline")

In [77]:
for headline in soup.find_all('div', class_="SecondaryCard-headline"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Debt ceiling talks hit a snag over spending levels with eight days until default deadline
Meta has started its latest round of layoffs, focusing on business groups


## Latest News Headline

In [78]:
result_latest = soup.find_all('div', class_="LatestNews-headlineWrapper")

In [79]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Former JPMorgan exec Staley loses bid to dismiss suit over Jeffrey Epstein ties

Here’s why retailers are beating earnings estimates, even with lackluster sales
'Queer Eye's Jonathan Van Ness shares his No. 1 tip for healthy hair
Top 10 best U.S. cities for new college graduates

These financial stocks were most loved—and hated—by hedge funds

'Chin up and don't put a lot of money to work' — why Cramer is getting worried
Biden, Dems plan beefed-up 50-state fundraising strategy to overwhelm GOP rivals
Fastest-growing jobs that don't require a bachelor's degree—some pay over $100K
College enrollment continues to slide as students question the value of a four-year degree

EV startups conserve cash as make-or-break moment approaches
TSA PreCheck makes sense amid busy travel season — if you can get it in time
Debt ceiling talks hit a snag with eight days until default deadline
In Disney-DeSantis fight, former AG Bill Barr backs Florida governor
Harvard brain expert shares 5 things she 'neve
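
The three headline sections above repeat the same loop with a different class name each time. A small helper could factor that out. The sketch below is an assumption about how one might structure it: `extract_headlines` is a hypothetical name, and the sample HTML is made up to mimic the class names used above, so it runs without contacting the site.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure targeted above; the real page
# markup may differ and can change without notice.
SAMPLE_HTML = """
<div class="SecondaryCard-headline"><a href="/a.html">First headline</a></div>
<div class="SecondaryCard-headline"><a href="/b.html"> Second headline </a></div>
<div class="LatestNews-headlineWrapper"><a href="/c.html">Latest headline</a></div>
"""


def extract_headlines(soup, css_class):
    """Return the stripped, non-empty link text inside each div of css_class."""
    headlines = []
    for container in soup.find_all("div", class_=css_class):
        for link in container.find_all("a"):
            text = link.get_text(strip=True)
            if text:
                headlines.append(text)
    return headlines


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_headlines(sample_soup, "SecondaryCard-headline"))
print(extract_headlines(sample_soup, "LatestNews-headlineWrapper"))
```

Returning a list instead of printing inside the loop makes the headlines available for later scoring.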

## The titles above include headlines that do not contribute to sentiment analysis. The next task is to rewrite the loop so that it omits headlines categorized as health and wellness.

In [None]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        # Skip links that carry the Make It eyebrow class
        if "ArticleHeader-styles-makeit-eyebrow--Degp4" in link.get("class", []):
            continue
        print(link.text.strip())

In [81]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        # Default to '' so links without an href do not raise a TypeError
        if "make-it" not in link.get('href', '') and "ArticleHeader-styles-make-it-eyebrow--Degp4" not in link.get('class', []):
            print(link.text.strip())

Former JPMorgan exec Staley loses bid to dismiss suit over Jeffrey Epstein ties

Here’s why retailers are beating earnings estimates, even with lackluster sales
'Queer Eye's Jonathan Van Ness shares his No. 1 tip for healthy hair
Top 10 best U.S. cities for new college graduates

These financial stocks were most loved—and hated—by hedge funds

'Chin up and don't put a lot of money to work' — why Cramer is getting worried
Biden, Dems plan beefed-up 50-state fundraising strategy to overwhelm GOP rivals
Fastest-growing jobs that don't require a bachelor's degree—some pay over $100K
College enrollment continues to slide as students question the value of a four-year degree

EV startups conserve cash as make-or-break moment approaches
TSA PreCheck makes sense amid busy travel season — if you can get it in time
Debt ceiling talks hit a snag with eight days until default deadline
In Disney-DeSantis fight, former AG Bill Barr backs Florida governor
Harvard brain expert shares 5 things she 'neve
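
The printed output still appears to contain lifestyle headlines, so the class and substring checks may not match the markup on these links. One alternative worth trying is to test whole URL path segments rather than substrings. The sketch below assumes the section slug appears as a path segment, which has not been verified against the live site; `is_excluded`, the slug set, and the sample links on a placeholder domain are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical section slugs to exclude; adjust after inspecting real URLs.
EXCLUDED_SLUGS = {"make-it"}


def is_excluded(href):
    """Return True if any path segment of href is one of the excluded slugs."""
    if not href:
        return False
    segments = urlparse(href).path.strip("/").split("/")
    return any(segment in EXCLUDED_SLUGS for segment in segments)


# Made-up (title, href) pairs on a placeholder domain.
sample_links = [
    ("Markets headline", "https://example.com/2023/05/23/markets-story.html"),
    ("Wellness headline", "https://example.com/make-it/2023/05/23/wellness-story.html"),
    ("Link without href", None),
]

kept = [title for title, href in sample_links if not is_excluded(href)]
print(kept)
```

Matching on full segments avoids false positives from article slugs that merely contain a similar phrase.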

In [34]:
list_urls = []  # collect article URLs; initialized here so the cell runs on its own
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    # -- snip --
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"{link_url}\n")
        list_urls.append(link_url)

https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html

https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html

https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html

https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html



In [35]:
list_urls

['https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html',
 'https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html',
 'https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html',
 'https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html']
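
Each URL appears twice in the list above, presumably because every card links to its article more than once. One way to drop the repeats while keeping the original order is `dict.fromkeys`. The variable names in this sketch are illustrative; the URLs are copied from the output above.

```python
# URLs copied from the notebook output above, duplicates included.
scraped_urls = [
    "https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html",
    "https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html",
    "https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html",
    "https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html",
]

# dict keys are unique and preserve insertion order (Python 3.7+),
# so this removes duplicates without reordering.
unique_urls = list(dict.fromkeys(scraped_urls))

for url in unique_urls:
    print(url)
```

Deduplicating before fetching articles also halves the number of requests sent to the site.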

In [36]:
list_titles = []

In [38]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        title_text = link.text.strip()
        print(title_text)
        list_titles.append(title_text)


S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington

Ron DeSantis lines up business leaders to raise money for 2024 presidential run


In [39]:
list_titles

['',
 'S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington',
 '',
 'Ron DeSantis lines up business leaders to raise money for 2024 presidential run',
 '',
 'S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington',
 '',
 'Ron DeSantis lines up business leaders to raise money for 2024 presidential run']
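
The empty strings in the list above likely come from links that wrap an image and carry no text, although that is an assumption about the page markup. Collecting URL and title together in one pass, and skipping links with no text, avoids both the blanks and the repeats. The sample HTML below is made up and uses a placeholder domain.

```python
from bs4 import BeautifulSoup

# Made-up HTML: each card has an image link with no text and a title link,
# both pointing at the same article. This structure is an assumption.
SAMPLE_CARDS = """
<div class="SecondaryCardContainer-container">
  <a href="https://example.com/story-1.html"><img src="thumb1.jpg" alt=""></a>
  <a href="https://example.com/story-1.html">Story one title</a>
  <a href="https://example.com/story-2.html"><img src="thumb2.jpg" alt=""></a>
  <a href="https://example.com/story-2.html">Story two title</a>
</div>
"""

cards_soup = BeautifulSoup(SAMPLE_CARDS, "html.parser")

titles_by_url = {}
for container in cards_soup.find_all("div", class_="SecondaryCardContainer-container"):
    # href=True skips anchors that have no href attribute at all
    for link in container.find_all("a", href=True):
        title = link.get_text(strip=True)
        if title:
            # setdefault keeps the first title seen for each URL
            titles_by_url.setdefault(link["href"], title)

for url, title in titles_by_url.items():
    print(url, "->", title)
```

Keeping the pairs in a dictionary keyed by URL keeps each title aligned with its link, which two separate lists do not guarantee.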

In [46]:
article_text

[]

In [59]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        article_page = requests.get(link_url)
        article_soup = BeautifulSoup(article_page.content, "html.parser")
        article_text = article_soup.get_text()
        print(article_text)

DeSantis campaign lines up fundraisers for White House raceSkip NavigationwatchliveMarketsPre-MarketsU.S. MarketsCurrenciesCryptocurrencyFutures & CommoditiesBondsFunds & ETFsBusinessEconomyFinanceHealth & ScienceMediaReal EstateEnergyClimateTransportationIndustrialsRetailWealthLifeSmall BusinessInvestingPersonal FinanceFintechFinancial AdvisorsOptions ActionETF StreetBuffett ArchiveEarningsTrader TalkTechCybersecurityEnterpriseInternetMediaMobileSocial MediaCNBC Disruptor 50Tech GuidePoliticsWhite HousePolicyDefenseCongressEquity and OpportunityCNBC TVLive TVLive AudioBusiness Day ShowsEntertainment ShowsFull EpisodesLatest VideoTop VideoCEO InterviewsCNBC DocumentariesCNBC PodcastsCNBC WorldDigital OriginalsLive TV ScheduleWatchlistInvesting ClubTrust PortfolioAnalysisTrade AlertsMeeting VideosHomestretchJim's ColumnsEducationPROPro NewsPro LiveMarket ForecastSubscribeSign InMenuMake ItselectALL SELECTCredit Cards Loans Banking Mortgages Insurance Credit Monitoring Personal Finance S
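
Calling `get_text()` on the whole page returns navigation menus and other site chrome along with the story, as the output above shows. Restricting extraction to the paragraphs inside the article container gives cleaner input for sentiment analysis. In the sketch below, `extract_article_text` and the `article-body` class are hypothetical; the real container class needs to be looked up in the page source.

```python
from bs4 import BeautifulSoup

# Made-up article page with navigation and footer around the story body.
SAMPLE_ARTICLE = """
<html><body>
  <nav><a href="/markets">Markets</a><a href="/tech">Tech</a></nav>
  <div class="article-body">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <footer>Subscribe Sign In</footer>
</body></html>
"""


def extract_article_text(html, body_class="article-body"):
    """Join the text of <p> tags inside the first div with class body_class."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_=body_class)
    if body is None:
        # Container not found; return an empty string rather than page chrome
        return ""
    paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


article_body_text = extract_article_text(SAMPLE_ARTICLE)
print(article_body_text)
```

Returning an empty string when the container is missing makes it easy to spot pages whose layout differs from the expected one.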