# 📌 Task 1: Web Scraping
### Internship - CodeAlpha

---

## 🎯 Objective

In this task, we will scrape data from a public website using Python. The scraped data will later be used for EDA, visualization, and sentiment analysis.  

We chose [**Goodreads Quotes**](https://www.goodreads.com/quotes) because:
- It contains text data suitable for NLP
- It's interesting, emotional, and motivational
- It provides authors and tags (for EDA and categorization)

---

## 🛠️ Libraries Required
We'll use the following:
- `requests`: to fetch web pages
- `BeautifulSoup` (from the `beautifulsoup4` package): to parse HTML
- `pandas`: to organize and store data


In [None]:
# ✅ Install libraries (Run this only once)
!pip install requests beautifulsoup4 pandas



In [None]:
# ✅ Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
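
Before moving on, it can help to confirm which versions of the three libraries are installed. The sketch below uses only the standard library; the helper name `installed_versions` is a hypothetical name chosen for illustration, not part of any of the libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


# Distribution names as used with pip (note: beautifulsoup4, not bs4)
print(installed_versions(["requests", "beautifulsoup4", "pandas"]))
```

A `None` value means the install cell above still needs to be run for that package.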


---

## 🌐 Step 1: Choose the Target Website

We'll scrape this URL:
👉 https://www.goodreads.com/quotes

Each quote on the page has:
- Quote text
- Author name
- Tags (comma-separated)
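
To see how these three fields map to HTML, here is a minimal sketch that parses a small hand-written snippet. The class names (`quote`, `quoteText`, `authorOrTitle`, `greyText smallText left`) are the ones used by the scraper in Step 2; the snippet itself is a simplified assumption about the page layout, not a copy of the real Goodreads markup.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one quote block (an assumption, not real page source)
SAMPLE_HTML = """
<div class="quote">
  <div class="quoteText">
    “So many books, so little time.”
    <br> ―
    <span class="authorOrTitle">Frank Zappa</span>
  </div>
  <div class="greyText smallText left">
    tags:
    <a href="#">books</a>,
    <a href="#">humor</a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
block = soup.find("div", class_="quote")

# Quote text: taken from the whole quoteText block, so it also contains the attribution
text = block.find("div", class_="quoteText").get_text(strip=True, separator=" ")
# Author: the span inside the quoteText block
author = block.find("span", class_="authorOrTitle").get_text(strip=True)
# Tags: the text of the tags block, joined without a separator
tags = block.find("div", class_="greyText smallText left").get_text(strip=True)

print(text)
print(author)
print(tags)
```

Experimenting on a small snippet like this is a quick way to check selectors before sending any requests to the site.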




---

## ☁ Step 2: Scrape the First 5 Pages

In [None]:
# Goodreads may block requests that look automated, so send a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36'
}

all_quotes = []

for page in range(1, 6):  # Scrape the first 5 pages
    url = f'https://www.goodreads.com/quotes?page={page}'
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Stop early if the page was not served successfully
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each quote sits in its own <div class="quote"> block
    quote_blocks = soup.find_all('div', class_='quote')

    for quote in quote_blocks:
        quote_text_block = quote.find('div', class_='quoteText')
        if quote_text_block:
            # The text of the whole block, which also includes the trailing "― Author" attribution
            text = quote_text_block.get_text(strip=True, separator=" ")
            # Guard against blocks without an author span
            author_tag = quote.find('span', class_='authorOrTitle')
            author = author_tag.get_text(strip=True) if author_tag else ''
            tags_block = quote.find('div', class_='greyText smallText left')
            # Tags come back as one comma-separated string. The replace is case-sensitive,
            # so a lowercase 'tags:' label stays in the value.
            tags = tags_block.get_text(strip=True).replace('Tags:', '') if tags_block else ''

            all_quotes.append({
                'quote': text,
                'author': author,
                'tags': tags
            })

print("✅ Total quotes scraped:", len(all_quotes))


✅ Total quotes scraped: 150
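
If a page request fails intermittently, one option is to retry it after a short pause. The following is a minimal sketch under stated assumptions: `fetch_with_retry` is a hypothetical helper (not part of `requests`), and the `fetch` argument is any function that takes a URL and either returns the page HTML or raises an exception.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure, wait and try again, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # in real use, narrow this to requests.RequestException
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

To use it with the scraper, `fetch` could be a small function that calls `requests.get(url, headers=headers, timeout=10)`, then `raise_for_status()` on the response, and returns `response.text`. Pausing between pages as well (for example with `time.sleep(1)`) keeps the load on the site low.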


## ✅ Step 3: Save and Display the Data

In [None]:
df_quotes = pd.DataFrame(all_quotes)
df_quotes.to_csv("quotes_data.csv", index=False)

# Display first 5 rows
df_quotes.head()


| | quote | author | tags |
|---|---|---|---|
| 0 | “Be yourself; everyone else is already taken.”... | Oscar Wilde | tags:attributed-no-source,be-yourself,gilbert-... |
| 1 | “I'm selfish, impatient and a little insecure.... | Marilyn Monroe | tags:attributed-no-source,best,life,love,misat... |
| 2 | “So many books, so little time.” ― Frank Zappa | Frank Zappa | tags:books,humor |
| 3 | “Two things are infinite: the universe and hum... | Albert Einstein | tags:attributed-no-source,human-nature,humor,i... |
| 4 | “A room without books is like a body without a... | Marcus Tullius Cicero | tags:attributed-no-source,books,simile,soul |
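
The preview shows that the `quote` column still carries the attribution (row 2 ends with `― Frank Zappa`), which duplicates the `author` column. A minimal sketch for separating the two is shown here; `split_attribution` is a hypothetical helper, and it assumes the horizontal bar character (U+2015) always marks where the attribution starts.

```python
def split_attribution(text, marker="\u2015"):
    """Split 'quote ― attribution' at the first horizontal bar (U+2015)."""
    quote, found, attribution = text.partition(marker)
    if not found:
        # No marker present: treat the whole string as the quote
        return text.strip(), ""
    return quote.strip(), attribution.strip()


sample = "“So many books, so little time.” \u2015 Frank Zappa"
print(split_attribution(sample))
```

On the DataFrame, this could be applied with `df_quotes['quote'].apply(lambda t: split_attribution(t)[0])` to keep only the quote text.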


---

## 🧾 Summary

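- We scraped 150 quotes from the first 5 pages of Goodreads Quotes.
- The stored data (`quotes_data.csv`) contains:
  - `quote`: the quote text, including the trailing attribution (`― Author`)
  - `author`: the name of the author
  - `tags`: the associated tags as one comma-separated string, prefixed with `tags:`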
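
Because the saved `tags` values look like `tags:books,humor`, a later step will probably want them as a real Python list. The sketch below is one way to do that; `parse_tags` is a hypothetical helper name, and it assumes tags never contain commas themselves.

```python
def parse_tags(raw):
    """Turn a string such as 'tags:books,humor' into ['books', 'humor']."""
    cleaned = raw.strip()
    # Drop the leading label regardless of its capitalization
    if cleaned.lower().startswith("tags:"):
        cleaned = cleaned[len("tags:"):]
    # Split on commas and discard empty pieces
    return [tag.strip() for tag in cleaned.split(",") if tag.strip()]


print(parse_tags("tags:books,humor"))
```

Applied to the DataFrame, this would be `df_quotes['tags'].apply(parse_tags)`.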

We'll use this data in the next task: **Exploratory Data Analysis (EDA)**.

➡️ **Next Notebook: Task 2 - EDA**

---
