<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Data Scraping for SmartInvest

- [API Reference]()
- [Reference](https://www.cnbc.com/)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

### Table of Contents <a class="anchor" id="PSCRAPE_toc"></a>

* [Table of Contents](#PSCRAPE_toc)
    * [1. Abstract](#PSCRAPE_page_1)
    * [2. Imported Libraries](#PSCRAPE_page_2)
    * [3. Import Data](#PSCRAPE_page_3)
    * [4. Data Compiler](#PSCRAPE_page_4)
    * [5. Data Automation](#PSCRAPE_page_5)
    * [6. Checking the Column Names](#PSCRAPE_page_6)
    * [7. Cleaning the Column Names](#PSCRAPE_page_7)
    * [8. Creating a new Cleaned Dataset](#PSCRAPE_page_8)
    * [9. Counting Columns](#PSCRAPE_page_9)
    * [10. Get Info about the Dataset](#PSCRAPE_page_10)
    * [11. Get Descriptive Statistics about the Dataset](#PSCRAPE_page_11)
    * [12. Counting Rows and Removing any NANs](#PSCRAPE_page_12)
    * [13. Correlation Analysis](#PSCRAPE_page_13)
    * [14. Principal Component Analysis (PCA)](#PSCRAPE_page_14)
    * [15. Group Comparison](#PSCRAPE_page_15)
    * [16. TBD](#PSCRAPE_page_16)
    * [17. Groupby Function](#PSCRAPE_page_17)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 1 - Abstract <a class="anchor" id="PSCRAPE_page_1"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

>This notebook uses the BeautifulSoup Python library to scrape textual data from the CNBC financial news outlet. The scraped data serves as the foundation for a sentiment indicator aimed at stock market analysis. Sentiment analysis techniques assign each piece of extracted news text a score indicating whether its sentiment is positive, negative, or neutral. These scores can be aggregated over time into a sentiment indicator that gives insight into market sentiment and may help traders and investors make informed decisions.
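
The scoring and aggregation steps described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the project's method: the word lists `POSITIVE` and `NEGATIVE`, the helper names, and the sample dates are hypothetical, and a real indicator would rely on a proper sentiment model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy lexicon, for illustration only
POSITIVE = {"jumps", "rally", "soar", "wins", "hopeful"}
NEGATIVE = {"cut", "downgrade", "lawsuit", "layoffs", "fear"}

def score_headline(text):
    """Return +1 (positive), -1 (negative), or 0 (neutral)."""
    words = set(text.lower().split())
    balance = len(words & POSITIVE) - len(words & NEGATIVE)
    return (balance > 0) - (balance < 0)

def daily_indicator(dated_headlines):
    """Average the headline scores for each date."""
    by_day = defaultdict(list)
    for day, text in dated_headlines:
        by_day[day].append(score_headline(text))
    return {day: mean(scores) for day, scores in by_day.items()}

# Sample (date, headline) pairs; the dates are made up for the example
sample = [
    ("2023-05-26", "Dow jumps 300 points as Wall Street grows hopeful"),
    ("2023-05-26", "JPMorgan Chase cut about 500 jobs this week"),
    ("2023-05-27", "Stocks rally on hopes for a debt ceiling deal"),
]
indicator = daily_indicator(sample)
print(indicator)
```

Averaging per day is one simple choice; a rolling window or a weighted mean would fit the same structure.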

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 2 - Imported Libraries <a class="anchor" id="PSCRAPE_page_2"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [1]:
import requests
from bs4 import BeautifulSoup

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 3 - Import Data <a class="anchor" id="PSCRAPE_page_3"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


## Primary Headline

In [2]:
URL = "https://www.cnbc.com/"

In [3]:
page = requests.get(URL)

In [4]:
soup = BeautifulSoup(page.content, "html.parser")

In [5]:


# This pulls the URL and main headline data
soup.find('div', class_="FeaturedCard-packagedCardTitle")

In [6]:
result = soup.find_all('div', class_="FeaturedCard-packagedCardTitle")

In [7]:
for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Debt ceiling negotiators make progress on spending, but still have 'major issues'


## Secondary Headlines

In [8]:
import requests
from bs4 import BeautifulSoup

In [9]:
URL = "https://www.cnbc.com/"

In [10]:
page = requests.get(URL)

In [11]:
soup = BeautifulSoup(page.content, "html.parser")

In [12]:
soup.find('div', class_="SecondaryCard-headline")

<div class="SecondaryCard-headline"><a href="https://www.cnbc.com/2023/05/26/job-cuts-jpmorgan-chase-cut-about-500-tech-and-ops-jobs-.html" title="JPMorgan Chase cut about 500 jobs this week, including technology and operations roles">JPMorgan Chase cut about 500 jobs this week, including technology and operations roles</a></div>

In [13]:

result_secondary = soup.find_all('div', class_="SecondaryCard-headline")

In [14]:
for headline in soup.find_all('div', class_="SecondaryCard-headline"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

JPMorgan Chase cut about 500 jobs this week, including technology and operations roles
Dow jumps 300 points as Wall Street grows hopeful that a debt ceiling deal can be reached


## Latest News Headlines

In [15]:
import requests
from bs4 import BeautifulSoup

In [16]:
URL = "https://www.cnbc.com/"

In [17]:
page = requests.get(URL)

In [18]:
soup = BeautifulSoup(page.content, "html.parser")

In [19]:
soup.find('div', class_="LatestNews-headlineWrapper")

<div class="LatestNews-headlineWrapper"><span class="LatestNews-wrapper"><time class="LatestNews-timestamp">11 Min Ago</time></span><a class="LatestNews-headline" href="https://www.cnbc.com/2023/05/26/disney-rips-desantis-bid-to-disqualify-judge.html" title="Disney rips DeSantis bid to disqualify judge in free speech lawsuit">Disney rips DeSantis bid to disqualify judge in free speech lawsuit</a></div>
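
The wrapper shown above carries a relative timestamp and a link as well as the headline text. A minimal sketch of pulling out all three, run against a static copy of that markup so that no request is made; the variable names and the `records` layout are illustrative, and CNBC's class names may change over time.

```python
from bs4 import BeautifulSoup

# Static copy of the markup printed above (no request is made)
html = (
    '<div class="LatestNews-headlineWrapper">'
    '<span class="LatestNews-wrapper">'
    '<time class="LatestNews-timestamp">11 Min Ago</time></span>'
    '<a class="LatestNews-headline" '
    'href="https://www.cnbc.com/2023/05/26/disney-rips-desantis-bid-to-disqualify-judge.html" '
    'title="Disney rips DeSantis bid to disqualify judge in free speech lawsuit">'
    'Disney rips DeSantis bid to disqualify judge in free speech lawsuit</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

records = []
for wrapper in soup.find_all("div", class_="LatestNews-headlineWrapper"):
    link = wrapper.find("a", class_="LatestNews-headline")
    stamp = wrapper.find("time", class_="LatestNews-timestamp")
    if link is None:
        continue  # Skip wrappers that have no headline link
    records.append({
        "headline": link.get_text(strip=True),
        "url": link.get("href"),
        "age": stamp.get_text(strip=True) if stamp else None,
    })

print(records)
```

Keeping the URL next to the headline would make it possible to fetch the article body later if headline text alone proves too short for sentiment scoring.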

In [20]:

result_latest = soup.find_all('div', class_="LatestNews-headlineWrapper")

In [21]:
for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
    links = headline.find_all("a")
    for link in links:
        print(link.text.strip())

Disney rips DeSantis bid to disqualify judge in free speech lawsuit
Why the pause on student loan payments has been a win for public servants
Facebook-Giphy sale shows how fear of regulators is slowing M&A market
JPMorgan Chase cut about 500 jobs this week, including tech and operations roles
Is there a 'right' age for kids to be on social media? Here's what an expert says

A.I. excitement leads to a winning week for Nvidia and other tech stocks

Nvidia shares jumped 25% this week — and got cheaper. Here's how that happens

Needham says this stock plays the 'almost perfect marriage' between A.I., crypto
AI is the latest buzzword in tech—but before investing, know these 4 terms
Taylor Swift to Metallica: Top 10 most in-demand artists of summer

Investors shifted into these gold and small-cap ETFs this week
Ford's EV charging deal with Tesla puts pressure on GM, other rival automakers
Paramount shares pop after BDT Capital bets on the media giant's key shareholder
Stocks making the bigge

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 4 - Data Compiler <a class="anchor" id="PSCRAPE_page_4"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

## Primary Headline Compiler

In [22]:
# Automate Collection 
import schedule
import time

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.cnbc.com/"

def append_data_to_csv(url):
    file_path = "../Data/PrimaryHeadline_CV.csv"

    # Fetch the data from the URL
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract the headlines
    headlines = set()
    for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    # Read the headlines already stored so duplicates are skipped
    existing_headlines = set()
    try:
        with open(file_path, "r", newline="") as existing_file:
            for row in csv.reader(existing_file):
                existing_headlines.update(row)
    except FileNotFoundError:
        pass  # First run: the CSV file does not exist yet

    # Append only the new headlines, written together as one CSV row
    new_headlines = headlines - existing_headlines
    if new_headlines:
        with open(file_path, "a", newline="") as file:
            csv.writer(file).writerow(new_headlines)
            
# Schedule the interval for collection
#schedule.every(2).hours.do(append_data_to_csv, url="https://www.cnbc.com/")


# while True:
    # Example usage
URL = "https://www.cnbc.com/"
append_data_to_csv(URL)

In [23]:
data_test = pd.read_csv('../Data/PrimaryHeadline_CV.csv')

In [24]:
data_test

Unnamed: 0,"Debt ceiling negotiators make progress on spending, but still have 'major issues'"
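
In the output above the headline appears as a column name. The likely cause is that `writer.writerow` receives the whole set and writes it as a single row, which `pd.read_csv` then treats as the header. One possible alternative, sketched here under assumptions, stores one headline per row beneath an explicit header. The function name `append_unique`, the `headline` column name, the shortened sample strings, and the temporary file path are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile

def append_unique(file_path, headlines):
    """Append headlines not yet in file_path, one per row; return them."""
    existing = set()
    file_exists = os.path.exists(file_path)
    if file_exists:
        with open(file_path, newline="", encoding="utf-8") as f:
            existing = {row["headline"] for row in csv.DictReader(f)}
    new = sorted(set(headlines) - existing)
    with open(file_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow(["headline"])  # Header is written only once
        writer.writerows([h] for h in new)
    return new

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "headlines.csv")
    first = append_unique(path, ["Dow jumps 300 points", "JPMorgan cut 500 jobs"])
    second = append_unique(path, ["Dow jumps 300 points", "Nvidia shares jumped 25%"])
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(first, second, rows)
```

With this layout each run adds rows rather than columns, so a later `pd.read_csv` call would yield a single `headline` column.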


## Secondary Headline Compiler

In [25]:
# Automate Collection 
import schedule
import time

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.cnbc.com/"

def append_data_to_csv(url):
    file_path = "../Data/SecondaryHeadline_CV.csv"

    # Fetch the data from the URL
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract the headlines
    headlines = set()
    for headline in soup.find_all('div', class_="SecondaryCard-headline"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    # Read the headlines already stored so duplicates are skipped
    existing_headlines = set()
    try:
        with open(file_path, "r", newline="") as existing_file:
            for row in csv.reader(existing_file):
                existing_headlines.update(row)
    except FileNotFoundError:
        pass  # First run: the CSV file does not exist yet

    # Append only the new headlines, written together as one CSV row
    new_headlines = headlines - existing_headlines
    if new_headlines:
        with open(file_path, "a", newline="") as file:
            csv.writer(file).writerow(new_headlines)
            
# Schedule the interval for collection
#schedule.every(2).hours.do(append_data_to_csv, url="https://www.cnbc.com/")


#while True:
    # Example usage
URL = "https://www.cnbc.com/"
append_data_to_csv(URL)

In [26]:
data_test1 = pd.read_csv('../Data/SecondaryHeadline_CV.csv')


In [27]:
data_test1

Unnamed: 0,"JPMorgan Chase cut about 500 technology and operations jobs this week, sources say",Dow jumps 300 points as Wall Street grows hopeful that a debt ceiling deal can be reached
0,"JPMorgan Chase cut about 500 jobs this week, i...",


## Latest News Headline Compiler

In [28]:
# Latest News Headlines 
import schedule
import time

# NLP Remove stop words
import nltk
from nltk.corpus import stopwords

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.cnbc.com/"

def preprocess_data(data):
    # Drop any characters that cannot be encoded as UTF-8
    data = data.encode('utf-8', errors='ignore').decode('utf-8')

    # Normalize the text to lowercase
    return data.lower()

def remove_stopwords(data):
    # Optional step, not called by append_data_to_csv below.
    # Requires a one-time nltk.download('stopwords').
    stop_words = set(stopwords.words('english'))
    words = [word for word in data.split() if word.lower() not in stop_words]
    return ' '.join(words)

def append_data_to_csv(url):
    file_path = "../Data/LatestHeadline_CV.csv"

    # Fetch the data from the URL
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract the headlines
    headlines = set()
    for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())
            
    processed_headlines = [preprocess_data(headline) for headline in headlines]

    # Read the headlines already stored so duplicates are skipped
    existing_headlines = set()
    try:
        with open(file_path, "r", newline="") as existing_file:
            for row in csv.reader(existing_file):
                existing_headlines.update(row)
    except FileNotFoundError:
        pass  # First run: the CSV file does not exist yet

    # Append only the new headlines, written together as one CSV row
    new_headlines = set(processed_headlines) - existing_headlines
    if new_headlines:
        with open(file_path, "a", newline="") as file:
            csv.writer(file).writerow(new_headlines)
            
# Schedule the interval for collection
#schedule.every(2).hours.do(append_data_to_csv, url="https://www.cnbc.com/")


#while True:
    # Example usage
URL = "https://www.cnbc.com/"
append_data_to_csv(URL)
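
The `stopwords` import above points at stop-word removal as a preprocessing step. A minimal sketch of that step follows. It uses a small hard-coded word set as a stand-in for `stopwords.words('english')` so that it runs without downloading the NLTK corpus; `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Hypothetical stand-in for NLTK's English stop-word list
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "on", "in", "of", "and", "that", "this"}

def remove_stop_words(text):
    """Lowercase the text and drop words found in STOP_WORDS."""
    kept = [word for word in text.lower().split() if word not in STOP_WORDS]
    return " ".join(kept)

cleaned = remove_stop_words(
    "Disney rips DeSantis bid to disqualify judge in free speech lawsuit"
)
print(cleaned)
```

Whether to remove stop words before sentiment scoring is a design choice: words such as "not" appear in NLTK's English list, and dropping them can flip the meaning of a headline.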

In [29]:
data_test2 = pd.read_csv('../Data/LatestHeadline_CV.csv',encoding='latin-1')

In [30]:
data_test2

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,ï»¿
,"companies add, expand tuition assistance so workers can go back to college","stocks making the biggest moves midday: ford, marvell technology, gap and more","despite all odds, consumers are still traveling this summer. how to play it",marvell shares soar 25% after the chip firm beats on top and bottom line results,biden interior advances renewable energy transmission projects in nevada,apple and nvidia are in my top 5 holdings. am i still diversified enough?,companies are learning gen z isnt the easiest generation to work with,heres why it might be time to buy longer-term bonds now,house passes bill blocking student debt forgivenesswhat borrowers need to know,paramount shares pop after bdt capital bets on the media giant's key shareholder,"ford's ev charging deal with tesla puts pressure on gm, other rival automakers",morgan stanley upgrades this mining stock that can surge more than 20%,"club name ford teams up with tesla, sending shares soaring",bofa hits spacecraft builder terran orbital with rare double downgrade,taylor swift to metallica: top 10 most in-demand artists of summer,facebook-giphy sale shows how fear of regulators is slowing m&a market,"ai is the latest buzzword in techbut before investing, know these 4 terms","needham says this stock plays the 'almost perfect marriage' between a.i., crypto","jpmorgan chase cut about 500 tech and operations jobs this week, sources say",how virtual layoffs became the new normal for workplaces,nvidia shares jumped 25% this week  and got cheaper. here's how that happens,these industries could face major disruptions from a.i.,a.i. excitement leads to a winning week for nvidia and other tech stocks,vesting means it can take up to 6 years for workers to own their 401(k) match,is there a 'right' age for kids to be on social media? 
here's what an expert says,jpmorgan ceo jamie dimon faces deposition in jeffrey epstein lawsuits,"30-year-old billionaire started with a website, sewing kit and pizza hut salary",why the pause on student loan payments has been a win for public servants,how this cmo got comfortable embracing his asian identity at work,investors shifted into these gold and small-ca...
"jpmorgan chase cut about 500 jobs this week, including tech and operations roles",disney rips desantis bid to disqualify judge in free speech lawsuit,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## All Headlines Compiler

In [48]:

import schedule
import time

# NLP Remove stop words
import nltk
from nltk.corpus import stopwords

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.cnbc.com/"

def preprocess_data(data):
    # Drop any characters that cannot be encoded as UTF-8
    data = data.encode('utf-8', errors='ignore').decode('utf-8')

    # Normalize the text to lowercase
    return data.lower()

def remove_stopwords(data):
    # Optional step, not called by append_data_to_csv below.
    # Requires a one-time nltk.download('stopwords').
    stop_words = set(stopwords.words('english'))
    words = [word for word in data.split() if word.lower() not in stop_words]
    return ' '.join(words)

def append_data_to_csv(url):
    file_path = "../Data/ConsolidatedHeadline_CV.csv"

    # Fetch the data from the URL
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract the headlines
    headlines = set()
    for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    for headline in soup.find_all('div', class_="SecondaryCard-headline"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    # Process the headlines
    processed_headlines = [preprocess_data(headline) for headline in headlines]

    # Filter out null values
    non_null_headlines = [headline for headline in processed_headlines if headline]
    
    # Read the headlines already stored so duplicates are skipped
    existing_headlines = set()
    try:
        with open(file_path, "r", newline="") as existing_file:
            for row in csv.reader(existing_file):
                existing_headlines.update(row)
    except FileNotFoundError:
        pass  # First run: the CSV file does not exist yet

    # Append only the new, non-empty headlines, written together as one CSV row
    new_headlines = set(non_null_headlines) - existing_headlines
    if new_headlines:
        with open(file_path, "a", newline="") as file:
            csv.writer(file).writerow(new_headlines)

            
# Schedule the interval for collection
#schedule.every(2).hours.do(append_data_to_csv, url="https://www.cnbc.com/")


#while True:
    # Example usage
URL = "https://www.cnbc.com/"
append_data_to_csv(URL)


In [49]:
data_test3 = pd.read_csv('../Data/ConsolidatedHeadline_CV.csv',encoding='latin-1')

In [50]:
data_test3

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,ï»¿
,"debt ceiling negotiators make progress on spending, but still have 'major issues'",morgan stanley upgrades this mining stock that can surge more than 20%,"ai is the latest buzzword in techbut before investing, know these 4 terms","ford's ev charging deal with tesla puts pressure on gm, other rival automakers",marvell shares soar 25% after the chip firm beats on top and bottom line results,"jpmorgan chase cut about 500 jobs this week, including technology and operations roles",dow jumps 300 points as wall street grows hopeful that a debt ceiling deal can be reached,nvidia shares jumped 25% this week  and got cheaper. here's how that happens,disney rips desantis bid to disqualify judge in free speech lawsuit,is there a 'right' age for kids to be on social media? here's what an expert says,investors shifted into these gold and small-cap etfs this week,vesting means it can take up to 6 years for workers to own their 401(k) match,a.i. excitement leads to a winning week for nvidia and other tech stocks,"despite all odds, consumers are still traveling this summer. 
how to play it",taylor swift to metallica: top 10 most in-demand artists of summer,companies are learning gen z isnt the easiest generation to work with,"club name ford teams up with tesla, sending shares soaring",paramount shares pop after bdt capital bets on the media giant's key shareholder,biden interior advances renewable energy transmission projects in nevada,"needham says this stock plays the 'almost perfect marriage' between a.i., crypto","30-year-old billionaire started with a website, sewing kit and pizza hut salary",why the pause on student loan payments has been a win for public servants,house passes bill blocking student debt forgivenesswhat borrowers need to know,jpmorgan ceo jamie dimon faces deposition in jeffrey epstein lawsuits,how virtual layoffs became the new normal for workplaces,bofa hits spacecraft builder terran orbital with rare double downgrade,heres why it might be time to buy longer-term bonds now,"jpmorgan chase cut about 500 jobs this week, including tech and operations roles",how this cmo got comfortable embracing his asian identity at work,facebook-giphy sale shows how fear of regulators is slowing m&a market,apple and nvidia are in my top 5 holdings. am i still diversified enough?,"stocks making the biggest moves midday: ford, marvell technology, gap and more","companies add, expand tuition assistance so wo..."
"stocks rally friday on hopes for a debt ceiling deal, nasdaq notches fifth straight week of wins",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 5 - Data Automation <a class="anchor" id="PSCRAPE_page_5"></a>

[Back to Top](#PSCRAPE_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

In [34]:
# Automate Collection 
import schedule
import time

# NLP Remove stop words
import nltk
from nltk.corpus import stopwords

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.cnbc.com/"

def preprocess_data(data):
    # Drop any characters that cannot be encoded as UTF-8
    data = data.encode('utf-8', errors='ignore').decode('utf-8')

    # Normalize the text to lowercase
    return data.lower()

def remove_stopwords(data):
    # Optional step, not called by append_data_to_csv below.
    # Requires a one-time nltk.download('stopwords').
    stop_words = set(stopwords.words('english'))
    words = [word for word in data.split() if word.lower() not in stop_words]
    return ' '.join(words)

def append_data_to_csv(url):
    file_path = "../Data/ConsolidatedHeadline_CV.csv"

    # Fetch the data from the URL
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract the headlines
    headlines = set()
    for headline in soup.find_all('div', class_="FeaturedCard-contentText"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    for headline in soup.find_all('div', class_="SecondaryCard-headline"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    for headline in soup.find_all('div', class_="LatestNews-headlineWrapper"):
        links = headline.find_all("a")
        for link in links:
            headlines.add(link.text.strip())

    # Process the headlines
    processed_headlines = [preprocess_data(headline) for headline in headlines]

    # Read the headlines already stored so duplicates are skipped
    existing_headlines = set()
    try:
        with open(file_path, "r", newline="") as existing_file:
            for row in csv.reader(existing_file):
                existing_headlines.update(row)
    except FileNotFoundError:
        pass  # First run: the CSV file does not exist yet

    # Append only the new headlines, written together as one CSV row
    new_headlines = set(processed_headlines) - existing_headlines
    if new_headlines:
        with open(file_path, "a", newline="") as file:
            csv.writer(file).writerow(new_headlines)

            
# Schedule the interval for collection
#schedule.every(2).hours.do(append_data_to_csv, url="https://www.cnbc.com/")


#while True:
    # Example usage
URL = "https://www.cnbc.com/"
append_data_to_csv(URL)
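
The commented-out `schedule` lines above suggest running the collection every two hours. A minimal sketch of such a loop follows, using only the standard library as a stand-in for the `schedule` package. The helper `run_every`, its parameters, and the injected `sleep` function are hypothetical, and `max_runs` bounds the loop so that the example terminates.

```python
import time

def run_every(interval_seconds, job, max_runs, sleep=time.sleep):
    """Call job() max_runs times, pausing interval_seconds between calls."""
    results = []
    for run in range(max_runs):
        results.append(job())
        if run < max_runs - 1:
            sleep(interval_seconds)  # No pause is needed after the last run
    return results

# Demonstration with a fake sleep so nothing actually waits
pauses = []
counter = {"calls": 0}

def fake_job():
    counter["calls"] += 1
    return counter["calls"]

results = run_every(2 * 60 * 60, fake_job, max_runs=3, sleep=pauses.append)
print(results, pauses)
```

In the notebook, `job` could be something like `lambda: append_data_to_csv(URL)`. A long-running loop of this kind blocks the notebook kernel, so a standalone script or an operating-system scheduler may be a better home for it.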
