<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Data Scraping for SmartInvest

- [API Reference]()
- [Reference](https://www.cnbc.com/)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

### Table of Contents <a class="anchor" id="PSCRAPE_toc"></a>

* [Table of Contents](#PSCRAPE_toc)
    * [1. Abstract](#PSCRAPE_page_1)
    * [2. Imported Libraries](#PSCRAPE_page_2)
    * [3. Import Data](#PSCRAPE_page_3)
    * [4. Setting Notebook Options](#PSCRAPE_page_4)
    * [5. Looking at the Data](#PSCRAPE_page_5)
    * [6. Checking the Column Names](#PSCRAPE_page_6)
    * [7. Cleaning the Column Names](#PSCRAPE_page_7)
    * [8. Creating a new Cleaned Dataset](#PSCRAPE_page_8)
    * [9. Counting Columns](#PSCRAPE_page_9)
    * [10. Get Info about the Dataset](#PSCRAPE_page_10)
    * [11. Get Descriptive Statistics about the Dataset](#PSCRAPE_page_11)
    * [12. Counting Rows and Removing any NANs](#PSCRAPE_page_12)
    * [13. Correlation Analysis](#PSCRAPE_page_13)
    * [14. Principal Component Analysis (PCA)](#PSCRAPE_page_14)
    * [15. Group Comparison](#PSCRAPE_page_15)
    * [16. TBD](#PSCRAPE_page_16)
    * [17. Groupby Function](#PSCRAPE_page_17)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 1 - Abstract <a class="anchor" id="PSCRAPE_page_1"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

>This notebook uses the BeautifulSoup Python library to scrape text from the CNBC financial news site. The scraped text is the foundation for a sentiment indicator for stock market analysis. Relevant financial news articles are extracted and scored with sentiment analysis techniques, which label the sentiment of each article as positive, negative, or neutral. Aggregating these scores over time yields a sentiment indicator that can give traders and investors insight into market sentiment and help inform their decisions.
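
The aggregation step described in the abstract can be sketched as follows. This is a minimal illustration and not the notebook's method: it assumes each headline has already been given a numeric sentiment score between -1 and 1 by some scorer, and the `daily_sentiment` helper and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def daily_sentiment(scored_headlines):
    """Average per-headline sentiment scores by publication date."""
    by_day = defaultdict(list)
    for day, score in scored_headlines:
        by_day[day].append(score)
    # ISO date strings sort chronologically, so sorting the keys orders the days.
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Hypothetical (date, score) pairs; the scores are made up for illustration.
scored_headlines = [
    ("2023-05-22", 0.5),
    ("2023-05-22", -0.5),
    ("2023-05-23", -0.25),
    ("2023-05-23", -0.75),
]
indicator = daily_sentiment(scored_headlines)
print(indicator)
```

A daily mean is only one possible aggregate; a rolling window or a weighted average would fit the same shape of input.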

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 2 - Imported Libraries<a class="anchor" id="PSCRAPE_page_2"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [2]:
import requests
from bs4 import BeautifulSoup

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 3 - Import Data<a class="anchor" id="PSCRAPE_page_3"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [3]:
# Create object URL
#URL = "https://www.cnbc.com/"
#page = requests.get(URL)

#soup = BeautifulSoup(page.content, "html.parser")

In [8]:
# Check the HTTP status code before using the page content

URL = "https://www.cnbc.com/"

page = requests.get(URL)

if page.status_code != 200:
    print("Error fetching page")
    exit()
else:
    content = page.content
#print(content)
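
The status check above can also be wrapped in a small helper that returns `None` on failure instead of calling `exit()`, which may end the notebook session. The sketch below is one way to do that under stated assumptions: `get_content` is a hypothetical helper, and it takes the HTTP function as a parameter so the example runs offline with a stand-in (`fake_get` and `FakeResponse` are made up for the demonstration).

```python
def get_content(url, get, timeout=10):
    """Return the response body as bytes, or None if the fetch fails.

    `get` is any callable with the same call shape as `requests.get`.
    """
    try:
        response = get(url, timeout=timeout)
    except Exception as exc:  # e.g. connection errors raised by the HTTP client
        print(f"Error fetching {url}: {exc}")
        return None
    if response.status_code != 200:
        print(f"Error fetching {url}: HTTP {response.status_code}")
        return None
    return response.content


# Stand-ins for requests.get and its response, so the sketch runs offline.
class FakeResponse:
    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content


def fake_get(url, timeout=None):
    if url.endswith("/missing"):
        return FakeResponse(404)
    return FakeResponse(200, b"<html>ok</html>")


ok = get_content("https://www.cnbc.com/", fake_get)
missing = get_content("https://www.cnbc.com/missing", fake_get)
```

In the notebook itself the call would be `get_content(URL, requests.get)`, followed by a check that the result is not `None` before parsing it.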

## Primary Headline

In [69]:
URL = "https://www.cnbc.com/"

In [70]:
page = requests.get(URL)

In [71]:
soup = BeautifulSoup(page.content, "html.parser")

In [72]:


# This pulls the URL and main headline data
soup.find('div', class_="FeaturedCard-packagedCardTitle")

In [73]:
result = soup.find_all('div', class_="FeaturedCard-packagedCardTitle")

In [74]:
for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Dow falls to session lows as Speaker McCarthy says debt ceiling talks still hung up on spending


## Secondary Headlines

In [76]:
result_secondary = soup.find_all('div', class_="SecondaryCard-headline")

In [77]:
for headline in soup.find_all('div', class_="SecondaryCard-headline"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Debt ceiling talks hit a snag over spending levels with eight days until default deadline
Meta has started its latest round of layoffs, focusing on business groups


## Latest News Headline

In [78]:
result_latest = soup.find_all('div', class_="LatestNews-headlineWrapper")

In [79]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Former JPMorgan exec Staley loses bid to dismiss suit over Jeffrey Epstein ties

Here’s why retailers are beating earnings estimates, even with lackluster sales
'Queer Eye's Jonathan Van Ness shares his No. 1 tip for healthy hair
Top 10 best U.S. cities for new college graduates

These financial stocks were most loved—and hated—by hedge funds

'Chin up and don't put a lot of money to work' — why Cramer is getting worried
Biden, Dems plan beefed-up 50-state fundraising strategy to overwhelm GOP rivals
Fastest-growing jobs that don't require a bachelor's degree—some pay over $100K
College enrollment continues to slide as students question the value of a four-year degree

EV startups conserve cash as make-or-break moment approaches
TSA PreCheck makes sense amid busy travel season — if you can get it in time
Debt ceiling talks hit a snag with eight days until default deadline
In Disney-DeSantis fight, former AG Bill Barr backs Florida governor
Harvard brain expert shares 5 things she 'neve

## The titles above include headlines that do not contribute to sentiment analysis. The next task is to write the loop so that it omits headlines categorized as health and wellness.

In [None]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        # Skip links that carry the Make It eyebrow class
        if "ArticleHeader-styles-makeit-eyebrow--Degp4" in link.get("class", []):
            continue
        print(link.text.strip())

In [81]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        # Default to an empty string so links without an href do not raise a TypeError
        if "make-it" not in link.get('href', '') and "ArticleHeader-styles-make-it-eyebrow--Degp4" not in link.get('class', []):
            print(link.text.strip())

Former JPMorgan exec Staley loses bid to dismiss suit over Jeffrey Epstein ties

Here’s why retailers are beating earnings estimates, even with lackluster sales
'Queer Eye's Jonathan Van Ness shares his No. 1 tip for healthy hair
Top 10 best U.S. cities for new college graduates

These financial stocks were most loved—and hated—by hedge funds

'Chin up and don't put a lot of money to work' — why Cramer is getting worried
Biden, Dems plan beefed-up 50-state fundraising strategy to overwhelm GOP rivals
Fastest-growing jobs that don't require a bachelor's degree—some pay over $100K
College enrollment continues to slide as students question the value of a four-year degree

EV startups conserve cash as make-or-break moment approaches
TSA PreCheck makes sense amid busy travel season — if you can get it in time
Debt ceiling talks hit a snag with eight days until default deadline
In Disney-DeSantis fight, former AG Bill Barr backs Florida governor
Harvard brain expert shares 5 things she 'neve

# It will probably be more efficient to use titles from a specific section of the website to get better quality headlines
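
The output of the filtered loop above is the same as the unfiltered list, so the `href` and class test did not remove anything. Until a better section of the site is chosen, one possible stopgap is to filter on the headline text itself. The sketch below is an assumption-laden illustration: `keep_headline` and the `EXCLUDE_KEYWORDS` list are hypothetical, and plain substring matching is crude.

```python
# Hypothetical keyword list; tune it to the categories you want to exclude.
EXCLUDE_KEYWORDS = ("healthy", "college")


def keep_headline(title, exclude=EXCLUDE_KEYWORDS):
    """Return True if the title contains none of the excluded keywords."""
    lowered = title.lower()
    return not any(keyword in lowered for keyword in exclude)


# Sample titles copied from the output above.
titles = [
    "Former JPMorgan exec Staley loses bid to dismiss suit over Jeffrey Epstein ties",
    "'Queer Eye's Jonathan Van Ness shares his No. 1 tip for healthy hair",
    "Top 10 best U.S. cities for new college graduates",
    "EV startups conserve cash as make-or-break moment approaches",
]
kept = [title for title in titles if keep_headline(title)]
print(kept)
```

Inside the scraping loop, the same check would wrap the `print(link.text.strip())` call.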

---
# Article

In [43]:
URL = "https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [44]:
# soup

In [45]:
soup.find('h1')

<h1 class="LiveBlogHeader-headline">S&amp;P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington: Live updates</h1>

In [46]:
title = soup.find('h1').text
title

'S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington: Live updates'

In [47]:
content = soup.find('div', class_="group").text
content

'Stocks fell Tuesday as ongoing debt ceiling discussions appeared to yield little progress.The S&P 500 dropped 1.12% to settle at 4,145.58, while the Nasdaq Composite pulled back 1.26% to close at 12,560.25. The Dow Jones Industrial Average lost 231.07 points, or 0.69%, to finish at 33,055.51.Some traders interpreted the lack of any major updates on negotiations as a sign that lawmakers, perhaps, are struggling to progress as hoped.'

In [48]:
url = URL
url

'https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html'

# Putting it all together

In [16]:
import requests
from bs4 import BeautifulSoup

In [17]:
URL = "https://www.cnbc.com/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [18]:
result = soup.find_all('h1', class_="group")

In [19]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())


Debt ceiling talks hit a snag over spending levels with eight days until default deadline

Meta has started its latest round of layoffs, focusing on business groups


In [33]:
list_urls = []

In [34]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"{link_url}\n")
        list_urls.append(link_url)

https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html

https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html

https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html

https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html



In [35]:
list_urls

['https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html',
 'https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html',
 'https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html',
 'https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html']

In [36]:
list_titles = []

In [38]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        title_text = link.text.strip()
        print(title_text)
        list_titles.append(title_text)


S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington

Ron DeSantis lines up business leaders to raise money for 2024 presidential run


In [39]:
list_titles

['',
 'S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington',
 '',
 'Ron DeSantis lines up business leaders to raise money for 2024 presidential run',
 '',
 'S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington',
 '',
 'Ron DeSantis lines up business leaders to raise money for 2024 presidential run']
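
The lists above contain empty strings and repeated entries, apparently because each card holds more than one link to the same article. A minimal sketch of one way to clean this up is shown below, assuming the loop collects `(link.text, link["href"])` pairs rather than two separate lists. The `unique_articles` helper is hypothetical, and the sample pairs are modeled on the output above.

```python
def unique_articles(pairs):
    """Drop pairs with an empty title and keep the first title seen per URL."""
    seen = {}
    for title, url in pairs:
        title = title.strip()
        if title and url not in seen:
            seen[url] = title
    # Dicts preserve insertion order, so the original ordering is kept.
    return [(title, url) for url, title in seen.items()]


MARKET_URL = "https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html"
DESANTIS_URL = "https://www.cnbc.com/2023/05/23/desantis-bundlers-presidential-campaign.html"
MARKET_TITLE = "S&P 500 closes 1% lower Tuesday as debt ceiling talks drag on in Washington"
DESANTIS_TITLE = "Ron DeSantis lines up business leaders to raise money for 2024 presidential run"

# Each article appears as an empty-text link followed by a titled link, twice over.
pairs = [
    ("", MARKET_URL), (MARKET_TITLE, MARKET_URL),
    ("", DESANTIS_URL), (DESANTIS_TITLE, DESANTIS_URL),
] * 2

articles = unique_articles(pairs)
for title, url in articles:
    print(f"{title}\n  {url}")
```

Keeping the title and URL together also avoids the two lists drifting out of step when one link is skipped.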

# Try to pull an entire article with a for loop

In [42]:
import requests
from bs4 import BeautifulSoup

In [40]:
article_text = []

In [50]:
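# Note: link_url is a plain string (the last href from the loop above), not a parsed page,
# so it has no .soup attribute and this cell raises the AttributeError shown below.
for A_url in link_url.soup.find_all('div', class_="group"):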
    article = A_url.find_all("p")
    for text in texts:
        article_text = text.text.strip()
        print(article_text)
        article_text.append(article_text)

AttributeError: 'str' object has no attribute 'soup'

In [46]:
article_text

[]

In [59]:
for headline in soup.find_all('div', class_="SecondaryCardContainer-container"):
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        article_page = requests.get(link_url)
        article_soup = BeautifulSoup(article_page.content, "html.parser")
        article_text = article_soup.get_text()
        print(article_text)

DeSantis campaign lines up fundraisers for White House raceSkip NavigationwatchliveMarketsPre-MarketsU.S. MarketsCurrenciesCryptocurrencyFutures & CommoditiesBondsFunds & ETFsBusinessEconomyFinanceHealth & ScienceMediaReal EstateEnergyClimateTransportationIndustrialsRetailWealthLifeSmall BusinessInvestingPersonal FinanceFintechFinancial AdvisorsOptions ActionETF StreetBuffett ArchiveEarningsTrader TalkTechCybersecurityEnterpriseInternetMediaMobileSocial MediaCNBC Disruptor 50Tech GuidePoliticsWhite HousePolicyDefenseCongressEquity and OpportunityCNBC TVLive TVLive AudioBusiness Day ShowsEntertainment ShowsFull EpisodesLatest VideoTop VideoCEO InterviewsCNBC DocumentariesCNBC PodcastsCNBC WorldDigital OriginalsLive TV ScheduleWatchlistInvesting ClubTrust PortfolioAnalysisTrade AlertsMeeting VideosHomestretchJim's ColumnsEducationPROPro NewsPro LiveMarket ForecastSubscribeSign InMenuMake ItselectALL SELECTCredit Cards Loans Banking Mortgages Insurance Credit Monitoring Personal Finance S

In [17]:
import urllib

In [19]:
import requests
from bs4 import BeautifulSoup

In [21]:
URL = "https://www.nbcnews.com/us-news"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [25]:
# soup

In [34]:
soup.find('h2', class_="styles_headline__ice3t")

<h2 class="styles_headline__ice3t"><a href="https://www.nbcnews.com/news/us-news/colorado-cardiologist-accused-sexual-assault-case-named-9-women-met-hi-rcna85141">Colorado cardiologist accused in sexual assault case named by 9 more women he met on Hinge dating app</a></h2>

In [31]:
article = soup.find('h2', class_="styles_headline__ice3t").find('a')['href']
article

'https://www.nbcnews.com/news/us-news/colorado-cardiologist-accused-sexual-assault-case-named-9-women-met-hi-rcna85141'

---
# Article

In [36]:
URL = "https://www.nbcnews.com/news/us-news/8-year-old-girl-dies-us-border-patrol-custody-texas-migrant-rcna85019"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [3]:
# soup

In [6]:
soup.find('h1')

<h1 class="article-hero-headline__htag lh-none-print black-print article-hero-headline__htag--loading">8-year-old girl dies in U.S. Border Patrol custody in Texas</h1>

In [8]:
title = soup.find('h1').text
title

'8-year-old girl dies in U.S. Border Patrol custody in Texas'

In [16]:
content = soup.find('div', class_="article-body__content").text
content

'An 8-year-old girl died after suffering a "medical emergency" while in U.S. Border Patrol custody in Texas on Wednesday, authorities said.The girl and her family were being held at a facility in the city of Harlingen, near the Mexico border, Customs and Border Protection said in a statement. No more details have been released about the girl\'s identity."Emergency Medical Services were called to the station and transported her to the local hospital where she was pronounced dead," the statement said, adding that the Office of Professional Responsibility would conduct an investigation as is standard protocol in the case of a death. Illegal border crossings declined yesterday as tensions grow in citiesMay 16, 202301:28The agency said it had contacted the Department of Homeland Security’s Office of Inspector General and the Harlingen Police Department about the incident.Sgt. Larry Moore, a spokesman for the Harlingen Police Department, told the Associated Press he had no information about 

In [18]:
url = URL
url

'https://www.nbcnews.com/news/us-news/8-year-old-girl-dies-us-border-patrol-custody-texas-migrant-rcna85019'

# Test

In [None]:
# Check the HTTP status code before using the response

URL = "https://www.cnbc.com/"

page = requests.get(URL)

if page.status_code != 200:
    print("Error fetching page")
    exit()
else:
    content = page.content
#print(content)

In [60]:
article_text = []

In [62]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        article_page = requests.get(link_url)
        if article_page.status_code != 200:
            print("Error fetching page")
        else:
            article_soup = BeautifulSoup(article_page.content, "html.parser")
            article_text = article_soup.get_text()
            print(article_text)

Stocks moving big after hours: PANW, URBN, INTU, TOLSkip NavigationwatchliveMarketsPre-MarketsU.S. MarketsCurrenciesCryptocurrencyFutures & CommoditiesBondsFunds & ETFsBusinessEconomyFinanceHealth & ScienceMediaReal EstateEnergyClimateTransportationIndustrialsRetailWealthLifeSmall BusinessInvestingPersonal FinanceFintechFinancial AdvisorsOptions ActionETF StreetBuffett ArchiveEarningsTrader TalkTechCybersecurityEnterpriseInternetMediaMobileSocial MediaCNBC Disruptor 50Tech GuidePoliticsWhite HousePolicyDefenseCongressEquity and OpportunityCNBC TVLive TVLive AudioBusiness Day ShowsEntertainment ShowsFull EpisodesLatest VideoTop VideoCEO InterviewsCNBC DocumentariesCNBC PodcastsCNBC WorldDigital OriginalsLive TV ScheduleWatchlistInvesting ClubTrust PortfolioAnalysisTrade AlertsMeeting VideosHomestretchJim's ColumnsEducationPROPro NewsPro LiveMarket ForecastSubscribeSign InMenuMake ItselectALL SELECTCredit Cards Loans Banking Mortgages Insurance Credit Monitoring Personal Finance Small Bu

MissingSchema: Invalid URL '/investingclub/': No scheme supplied. Perhaps you meant http:///investingclub/?
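
The `MissingSchema` error above comes from a relative link (`/investingclub/`) being passed to `requests.get` as-is. One way to avoid it is to resolve every `href` against the page URL before requesting it. The sketch below uses `urllib.parse.urljoin` from the standard library; the `absolute_article_urls` helper and the `mailto:` sample are hypothetical.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.cnbc.com/"


def absolute_article_urls(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and keep only http(s) URLs."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # skip links with a missing or empty href
        url = urljoin(base_url, href)
        if urlparse(url).scheme in ("http", "https"):
            resolved.append(url)
    return resolved


hrefs = [
    "/investingclub/",  # relative link from the error above
    "https://www.cnbc.com/2023/05/22/stock-market-today-live-updates.html",
    None,
    "mailto:someone@example.com",  # hypothetical non-web link
]
urls = absolute_article_urls(hrefs)
print(urls)
```

In the loop above, `link.get("href")` would supply each value, and `requests.get` would then receive only absolute URLs.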

# Putting it all together

In [38]:
import requests
from bs4 import BeautifulSoup

In [39]:
URL = "https://www.nbcnews.com/us-news"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [45]:
result = soup.find_all('h2', class_="styles_headline__ice3t")

In [64]:
for headline in soup.find_all('h2', class_="styles_headline__ice3t"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Colorado cardiologist accused in sexual assault case named by 9 more women he met on Hinge dating app
New Mexico teen gunman died in hail of police gunfire, likely his final wish, chief says
States with abortion bans could drive away young doctors, survey finds
Suspected overdose deaths of 2 girls at a high school lead to murder charges against student
Woman who refused tuberculosis treatment is not in custody 2 months after arrest warrant issued
GOP witnesses undermined Jan. 6 cases with conspiracy theories, FBI says
FDA panel recommends the first shot to prevent RSV in infants by vaccinating pregnant mothers
Disney scraps plan for new Florida campus amid DeSantis feud
Wife of radiologist who drove Tesla off Calif. cliff with family inside said he did it 'on purpose,' unsealed docs reveal
Groom whose wife was killed in wedding night crash sues driver and bars she allegedly visited
The fractured GOP is leaning on Rep. Garret Graves to negotiate the debt ceiling and keep the party unite

In [72]:
list_urls = []

In [73]:
for headline in soup.find_all('h2', class_="styles_headline__ice3t"):
    links = headline.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"{link_url}\n")
        list_urls.append(link_url)

https://www.nbcnews.com/news/us-news/colorado-cardiologist-accused-sexual-assault-case-named-9-women-met-hi-rcna85141

https://www.nbcnews.com/news/us-news/new-mexico-shooting-new-details-women-shot-while-trying-to-help-victim-rcna85158

https://www.nbcnews.com/health/health-news/states-abortion-bans-young-doctors-survey-rcna84899

https://www.nbcnews.com/news/us-news/tennessee-student-charged-murder-overdose-deaths-high-school-girls-rcna85140

https://www.nbcnews.com/news/us-news/woman-refused-tuberculosis-treatment-not-custody-2-months-arrest-warra-rcna85076

https://www.nbcnews.com/politics/congress/gop-witnesses-undermined-jan-6-cases-conspiracy-theories-fbi-says-rcna85095

https://www.nbcnews.com/health/kids-health/rsv-vaccine-infants-fda-panel-vote-rcna84740

https://www.nbcnews.com/business/corporations/disney-scraps-plan-new-florida-campus-desantis-feud-rcna85130

https://www.nbcnews.com/news/us-news/wife-radiologist-drove-tesla-ca-cliff-family-said-purpose-unsealed-doc-rcna850

In [76]:
list_urls

['https://www.nbcnews.com/news/us-news/colorado-cardiologist-accused-sexual-assault-case-named-9-women-met-hi-rcna85141',
 'https://www.nbcnews.com/news/us-news/new-mexico-shooting-new-details-women-shot-while-trying-to-help-victim-rcna85158',
 'https://www.nbcnews.com/health/health-news/states-abortion-bans-young-doctors-survey-rcna84899',
 'https://www.nbcnews.com/news/us-news/tennessee-student-charged-murder-overdose-deaths-high-school-girls-rcna85140',
 'https://www.nbcnews.com/news/us-news/woman-refused-tuberculosis-treatment-not-custody-2-months-arrest-warra-rcna85076',
 'https://www.nbcnews.com/politics/congress/gop-witnesses-undermined-jan-6-cases-conspiracy-theories-fbi-says-rcna85095',
 'https://www.nbcnews.com/health/kids-health/rsv-vaccine-infants-fda-panel-vote-rcna84740',
 'https://www.nbcnews.com/business/corporations/disney-scraps-plan-new-florida-campus-desantis-feud-rcna85130',
 'https://www.nbcnews.com/news/us-news/wife-radiologist-drove-tesla-ca-cliff-family-said-pu

In [85]:
list_titles = []

In [86]:
for headline in soup.find_all('h2', class_="styles_headline__ice3t"):
    links = headline.find_all("a")
    for link in links:
        title_text = link.text.strip()
        print(title_text)
        list_titles.append(title_text)

Colorado cardiologist accused in sexual assault case named by 9 more women he met on Hinge dating app
New Mexico teen gunman died in hail of police gunfire, likely his final wish, chief says
States with abortion bans could drive away young doctors, survey finds
Suspected overdose deaths of 2 girls at a high school lead to murder charges against student
Woman who refused tuberculosis treatment is not in custody 2 months after arrest warrant issued
GOP witnesses undermined Jan. 6 cases with conspiracy theories, FBI says
FDA panel recommends the first shot to prevent RSV in infants by vaccinating pregnant mothers
Disney scraps plan for new Florida campus amid DeSantis feud
Wife of radiologist who drove Tesla off Calif. cliff with family inside said he did it 'on purpose,' unsealed docs reveal
Groom whose wife was killed in wedding night crash sues driver and bars she allegedly visited
The fractured GOP is leaning on Rep. Garret Graves to negotiate the debt ceiling and keep the party unite

In [87]:
list_titles

['Colorado cardiologist accused in sexual assault case named by 9 more women he met on Hinge dating app',
 'New Mexico teen gunman died in hail of police gunfire, likely his final wish, chief says',
 'States with abortion bans could drive away young doctors, survey finds',
 'Suspected overdose deaths of 2 girls at a high school lead to murder charges against student',
 'Woman who refused tuberculosis treatment is not in custody 2 months after arrest warrant issued',
 'GOP witnesses undermined Jan. 6 cases with conspiracy theories, FBI says',
 'FDA panel recommends the first shot to prevent RSV in infants by vaccinating pregnant mothers',
 'Disney scraps plan for new Florida campus amid DeSantis feud',
 "Wife of radiologist who drove Tesla off Calif. cliff with family inside said he did it 'on purpose,' unsealed docs reveal",
 'Groom whose wife was killed in wedding night crash sues driver and bars she allegedly visited',
 'The fractured GOP is leaning on Rep. Garret Graves to negotiate

In [17]:
import urllib