# STAGE 1 - Web Scraping and Data Storage

**Import Required Libraries:**
* **requests:** For sending HTTP requests to the website.
* **BeautifulSoup:** For parsing HTML content.
* **pymysql:** For connecting and interacting with a MySQL database.

In [1]:
import requests
from bs4 import BeautifulSoup
import pymysql

**Connect to MySQL Database:**

Establishes a connection to the MySQL database named `STOCK_PREDICTION`.

In [2]:
mydb = pymysql.connect(host="localhost", user="root", password="Onmyway09@", database="STOCK_PREDICTION")
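
The connection cell above embeds the password directly in the notebook. As a minimal sketch of one alternative, the settings could be read from environment variables instead. The helper name `db_config_from_env` and the `STOCK_DB_*` variable names are assumptions made for this illustration; they are not part of the original notebook.

```python
import os


def db_config_from_env(env=None):
    """Build keyword arguments for pymysql.connect() from environment variables.

    The STOCK_DB_* names are hypothetical; rename them to suit your setup.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("STOCK_DB_HOST", "localhost"),
        "user": env.get("STOCK_DB_USER", "root"),
        # No default for the password: a missing variable raises KeyError.
        "password": env["STOCK_DB_PASSWORD"],
        "database": env.get("STOCK_DB_NAME", "STOCK_PREDICTION"),
    }


# Usage (needs a running MySQL server, so it is left commented out):
# mydb = pymysql.connect(**db_config_from_env())
```

With this approach the notebook can be shared without exposing credentials, and a missing password fails early rather than at query time.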

**Create a Cursor and a table:**
* **Cursor:** Used to execute SQL commands against the database.
* **mycursor.execute:** Creates the `stock_articles` table, if it doesn’t already exist, to store article headlines.

In [3]:
# Create a cursor
mycursor = mydb.cursor()

# Create table if it doesn't exist
mycursor.execute(""" CREATE TABLE IF NOT EXISTS stock_articles (
    id INT AUTO_INCREMENT PRIMARY KEY,
    headline TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP );""")

0

**Function to Store Articles in the Database:**
* Checks if the article already exists in the database.
* If not found, inserts the article headline into the database; otherwise, skips it.

In [4]:
# Function to store article in MySQL database
def store_article_in_db(title):
    # Check if article already exists
    mycursor.execute("SELECT headline FROM stock_articles WHERE headline = %s", (title,))
    result = mycursor.fetchone()
    
    # Insert if not found in the table
    if not result:
        mycursor.execute("INSERT INTO stock_articles (headline) VALUES (%s)", (title,))
        mydb.commit()
        print("Stored in database:", title)
    else:
        print("Duplicate article, not stored:", title)
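
The check-then-insert logic can also be pushed into the database with a uniqueness constraint. The sketch below uses an in-memory SQLite database from the standard library as a stand-in for MySQL so that it runs without a server; the `store_article` helper and the sample headline are assumptions for illustration. MySQL offers the analogous `INSERT IGNORE` statement, but the schema details differ (for example, a unique index on a MySQL `TEXT` column needs a prefix length).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stock_articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        headline TEXT NOT NULL UNIQUE,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")


def store_article(conn, title):
    """Insert a headline; return True if it was new, False if a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO stock_articles (headline) VALUES (?)", (title,)
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


first = store_article(conn, "Sample headline A")   # hypothetical headline
second = store_article(conn, "Sample headline A")  # same headline again
count = conn.execute("SELECT COUNT(*) FROM stock_articles").fetchone()[0]
```

Letting the constraint reject duplicates replaces the separate `SELECT` and keeps the check and the insert in a single statement.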

**Define Base URLs and Generate URLs for Pagination:**
* `base_urls` lists the Moneycontrol category pages to scrape.
* For each base URL, pages 1 through 30 are added to the `all_urls` list.

In [5]:
# List of base URLs (each category listed once)
base_urls = [
    "https://www.moneycontrol.com/news/business/markets/",
    "https://www.moneycontrol.com/news/business/mutual-funds/",
    "https://www.moneycontrol.com/news/business/personal-finance/",
    "https://www.moneycontrol.com/news/business/",
    "https://www.moneycontrol.com/news/business/economy/",
    "https://www.moneycontrol.com/news/business/companies/",
    "https://www.moneycontrol.com/news/business/ipo/",
    "https://www.moneycontrol.com/news/business/real-estate/",
    "https://www.moneycontrol.com/news/business/banks/",
    "https://www.moneycontrol.com/news/business/stocks/",
    "https://www.moneycontrol.com/news/india/",
    "https://www.moneycontrol.com/news/tags/technical-analysis.html"]

# Generate URLs for each base URL from page 1 to page 30
all_urls = []  # List to hold all generated URLs

for base_url in base_urls:
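    all_urls.append(base_url)  # Add the base URL (for page 1)
    # Note: the "page-N/" suffix assumes base_url ends with "/"; the tag URL ending in ".html" does not
    all_urls += [f"{base_url}page-{i}/" for i in range(2, 31)]  # Generate URLs from page 2 to 30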
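
The URL generation can be wrapped in a small function, which makes it easier to check in isolation. This is a sketch under stated assumptions: the function name `paginated_urls` is hypothetical, repeated base URLs are collapsed, and the `page-N/` suffix is only appended to bases that end in `/`. Skipping pagination for other URLs is a choice made here, not something the original notebook does.

```python
def paginated_urls(base_urls, last_page=30):
    """Return page 1 through last_page for each distinct base URL."""
    # dict.fromkeys drops repeated entries while preserving order.
    unique_bases = list(dict.fromkeys(base_urls))
    urls = []
    for base in unique_bases:
        urls.append(base)  # page 1 is the base URL itself
        if not base.endswith("/"):
            # Assumption: the "page-N/" pattern only applies to
            # directory-style URLs, so other URLs keep page 1 only.
            continue
        urls.extend(f"{base}page-{i}/" for i in range(2, last_page + 1))
    return urls


sample = [
    "https://www.moneycontrol.com/news/business/markets/",
    "https://www.moneycontrol.com/news/business/markets/",  # repeated on purpose
    "https://www.moneycontrol.com/news/tags/technical-analysis.html",
]
urls = paginated_urls(sample)
```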

**Function to Scrape Articles and Iterate Over All URLs:**
* Sends a request to the URL and extracts article headlines using BeautifulSoup.
* Calls store_article_in_db() to save each unique headline in the database.
* Loops through all the generated URLs and scrapes articles from each page.
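
The extraction step can be tried offline before sending any requests. The sketch below applies the same `h2` > `a[title]` lookup to a hand-written HTML snippet; the snippet, the headlines in it, and the `extract_titles` helper are made up for illustration and do not reflect the actual Moneycontrol markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to exercise the lookup logic.
SAMPLE_HTML = """
<ul>
  <li><h2><a href="/a" title="Sample headline A">Sample headline A</a></h2></li>
  <li><h2><a href="/b">Link without a title attribute</a></h2></li>
  <li><h2>Heading without a link</h2></li>
  <li><h2><a href="/c" title="Sample headline C">Sample headline C</a></h2></li>
</ul>
"""


def extract_titles(html):
    """Return the title attribute of the first link inside each <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for h2 in soup.find_all("h2"):
        a_tag = h2.find("a")
        # Skip headings that have no link or whose link has no title.
        if a_tag and a_tag.get("title"):
            titles.append(a_tag["title"])
    return titles


titles = extract_titles(SAMPLE_HTML)
```

Separating parsing from fetching in this way lets the selector logic be checked against saved HTML without depending on network access.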

In [6]:
# Function to scrape articles from each URL
def scrape_articles(url):
    try:
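        response = requests.get(url, timeout=10)  # Avoid hanging indefinitely on a slow server
        response.raise_for_status()  # Treat HTTP error responses (4xx/5xx) as failures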
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all('h2')
        for article in articles:
            a_tag = article.find('a')
            if a_tag:
                title = a_tag.get('title')
                if title:
                    # Store the title in the database
                    store_article_in_db(title)  # Call function to store the article
    except Exception as e:
        print(f"Error occurred while scraping {url}: {e}")

# Iterate over each URL and scrape articles from each page
for url in all_urls:
    scrape_articles(url)

print("All articles have been scraped and stored in the database!")

Stored in database: Why is Pudumjee Industries down 10% in a flattish market?
Stored in database: NTT DATA acquires Niveus Solutions, becomes 2nd Udupi-based firm owned by Japanese companies
Stored in database: NTPC Green assures funding for capex will be at the most 'optimum price'
Stored in database: Mid-day Mood | Sensex rises 500 pts from day's low, led by Adani, bank stocks; Nifty above 24,300
Stored in database: InCred equities revises Nifty target, upgrades pharma sector outlook
Stored in database: Swiggy's shares rise 20% in three straight days, food delivery firm to declare Q2 results on Dec 3
Stored in database: Adani stocks rise as group refutes bribery allegations against founder Gautam Adani; shares jump up to 7%
Stored in database: Swiggy| The company is poised to grow with significant demand and expansion| Stock of the day
Stored in database: DIPAM says NTPC Green's listing a 'big opportunity in nation building'
Stored in database: Netweb Technologies jump 8%, hits fresh