# Story Finder & Generator - Part 0
## Crawler and Scraper - Get Data from the Web
---
**IRS Innovative Assignment (Even 2023-24)** <br>
**Roll No. and Names:**<br>
&emsp;21BCE183 Parv Thacker <br>
&emsp;21BCE201 Kaju Patel <br>
&emsp;21BCE250 Tanvi Rathod <br>

---

# Imports

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd
import re

# Modules

## Crawler

In [2]:
def crawl(url, max_pages=100):
    visited = set()  #Track visited URLs
    frontier = [url]  #Frontier - Queue
    page_count = 0
    crawled_urls = []

    while frontier and page_count < max_pages:
        current_url = frontier.pop(0)
        
        if current_url in visited:   # Skip visited
            continue
        visited.add(current_url)
        
        try:
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
            
            response = requests.get(current_url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                
                # Processing Page
                print("Crawling:", current_url)
                crawled_urls.append(current_url)
                
                # Extract links from the page
                links = soup.find_all('a', href=True)
                for link in links:
                    absolute_url = urljoin(current_url, link['href'])
                    
                    # URL rule: follow only selected Goodreads sections
                    # (robots.txt check skipped, as only these URL prefixes are added to the frontier)
                    allowed_prefixes = (
                        "https://www.goodreads.com/book/show",
                        "https://www.goodreads.com/book/similar",
                        "https://www.goodreads.com/series/",
                        "https://www.goodreads.com/author/show",
                        "https://www.goodreads.com/list/show/",
                        "https://www.goodreads.com/list/book/",
                    )
                    if absolute_url.startswith(allowed_prefixes):
                        # Skip fragment and review links, and URLs already visited or queued
                        if "#" not in absolute_url and "?review" not in absolute_url \
                           and absolute_url not in visited and absolute_url not in frontier:
                            frontier.append(absolute_url)
                
                page_count += 1
            else:
                print("Failed to fetch:", current_url)
        except Exception as e:
            print("Error:", e)
            continue
    
    return crawled_urls
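
The crawler above skips the robots.txt check. If one were wanted, a minimal sketch with the standard library's `urllib.robotparser` could look like the following. The rules text and the `example.com` URLs are invented for illustration and are not the real Goodreads robots.txt; `build_parser` and `is_allowed` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real Goodreads robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

def build_parser(rules_text):
    """Build a RobotFileParser from robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(EXAMPLE_RULES)
allowed = is_allowed(parser, "https://www.example.com/book/show/1")
blocked = is_allowed(parser, "https://www.example.com/private/page")
print(allowed, blocked)
```

In a live crawl, the parser would instead be pointed at the site's robots.txt with `set_url(...)` followed by `read()`, and `is_allowed` would be called before a URL is appended to the frontier.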

## Scraper

In [3]:
def scrape_reviews(url):
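    # Send the same browser-like User-Agent as the crawler and bound the wait with a timeout
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)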
    soup = BeautifulSoup(response.text, 'html.parser')

    # URL --> Book ID
    book_id = url.split('/')[-1].split('.')[0]

    reviews = []

    # Finding all review sections
    review_sections = soup.find_all('section', class_='ReviewCard__content')

    # Extract review text and vote count
    for review_section in review_sections:
        # Extracting review text (skip cards that have no text block)
        text_element = review_section.find('div', class_='TruncatedContent__text')
        if text_element is None:
            continue
        review_text = text_element.text.strip()

        # Extracting vote count
        vote_count_element = review_section.find('span', class_='Button__labelItem')
        if vote_count_element:
            vote_count_text = vote_count_element.text.strip()
            vote_count = re.findall(r'\d+', vote_count_text)
            if vote_count:
                vote_count = int(vote_count[0])
            else:
                vote_count = 0
        else:
            vote_count = 0

        reviews.append({
            'book_id': book_id,
            'review': review_text,
            'vote_count': vote_count
        })

    return reviews
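
The vote-count logic in `scrape_reviews` can be pulled out into a small helper so it can be checked without fetching a page. This is a sketch, not the notebook's own code: `parse_vote_count` is a hypothetical name, the label strings are made-up examples, and the handling of thousands separators is an assumption about how the label might be formatted.

```python
import re

def parse_vote_count(label_text):
    """Return the first integer found in label_text, or 0 if there is none.

    Assumption: a label may contain thousands separators (e.g. "1,234 likes"),
    so commas inside the number are removed before conversion.
    """
    match = re.search(r'\d[\d,]*', label_text)
    if not match:
        return 0
    return int(match.group().replace(',', ''))

# Made-up label strings for illustration
print(parse_vote_count("535 likes"))    # plain number
print(parse_vote_count("1,234 likes"))  # number with a separator
print(parse_vote_count("Like"))         # no number present
```

Note the difference from the scraper above: `re.findall(r'\d+', ...)[0]` keeps only the first run of digits, so a label with a separator would be cut short at the comma.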

In [4]:
def convert_to_int(book_id):
    try:
        return int(book_id)
    except ValueError:
        # If book_id is not purely numeric, use its leading digits
        # (e.g. "6693775-pride-and-prejudice-and-zombies" -> 6693775)
        match = re.match(r'\d+', book_id)
        if match:
            return int(match.group())
        return None  # Return None so the record can be dropped
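
The ID handling is split between `scrape_reviews` (string splitting) and `convert_to_int` (cleanup). As an alternative sketch, both steps could be combined into one helper that reads the numeric ID straight from the URL path. `extract_book_id` is a hypothetical name, and the pattern assumes book pages always live under `/book/show/<digits>`, as in the URLs shown in this notebook.

```python
import re
from urllib.parse import urlparse

# Assumption: a book page path starts with /book/show/ followed by the numeric ID
BOOK_PATH = re.compile(r'^/book/show/(\d+)')

def extract_book_id(url):
    """Return the numeric book ID from a book URL, or None for other URLs."""
    path = urlparse(url).path
    match = BOOK_PATH.match(path)
    return int(match.group(1)) if match else None

print(extract_book_id("https://www.goodreads.com/book/show/14935.Sense_and_Sensibility"))
print(extract_book_id("https://www.goodreads.com/book/show/6693775-pride-and-prejudice-and-zombies"))
print(extract_book_id("https://www.goodreads.com/author/show/1265.Jane_Austen"))
```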


# Driver

In [5]:
crawled_urls = crawl("https://www.goodreads.com/book/show/14935", max_pages=30)

Crawling: https://www.goodreads.com/book/show/14935
Crawling: https://www.goodreads.com/author/show/1265.Jane_Austen
Crawling: https://www.goodreads.com/author/show/9321.Ros_Ballaster
Crawling: https://www.goodreads.com/author/show/3148482._
Crawling: https://www.goodreads.com/book/show/1885.Pride_and_Prejudice
Crawling: https://www.goodreads.com/book/show/6969.Emma
Crawling: https://www.goodreads.com/book/show/14935.Sense_and_Sensibility
Crawling: https://www.goodreads.com/book/show/45032.Mansfield_Park
Crawling: https://www.goodreads.com/book/show/50398.Northanger_Abbey
Crawling: https://www.goodreads.com/book/show/2156.Persuasion
Crawling: https://www.goodreads.com/book/show/166177.Sanditon
Crawling: https://www.goodreads.com/book/show/31669.A_Memoir_of_Jane_Austen
Crawling: https://www.goodreads.com/author/show/18304042.Frances_Burney
Crawling: https://www.goodreads.com/author/show/285932._Mary_Brunton
Crawling: https://www.goodreads.com/author/show/22191.Samuel_Johnson
Crawling: h

In [6]:
filtered_urls = [url for url in crawled_urls if url.startswith("https://www.goodreads.com/book/show/")]
filtered_urls

['https://www.goodreads.com/book/show/14935',
 'https://www.goodreads.com/book/show/1885.Pride_and_Prejudice',
 'https://www.goodreads.com/book/show/6969.Emma',
 'https://www.goodreads.com/book/show/14935.Sense_and_Sensibility',
 'https://www.goodreads.com/book/show/45032.Mansfield_Park',
 'https://www.goodreads.com/book/show/50398.Northanger_Abbey',
 'https://www.goodreads.com/book/show/2156.Persuasion',
 'https://www.goodreads.com/book/show/166177.Sanditon',
 'https://www.goodreads.com/book/show/31669.A_Memoir_of_Jane_Austen',
 'https://www.goodreads.com/book/show/91582.Lady_Susan',
 'https://www.goodreads.com/book/show/14905.The_Complete_Novels',
 'https://www.goodreads.com/book/show/208729.Lady_Susan_The_Watsons_Sanditon',
 'https://www.goodreads.com/book/show/6693775-pride-and-prejudice-and-zombies',
 'https://www.goodreads.com/book/show/6185.Wuthering_Heights',
 'https://www.goodreads.com/book/show/2175.Madame_Bovary']
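
The list above contains both `.../book/show/14935` and `.../book/show/14935.Sense_and_Sensibility`, which point to the same book, so its reviews would be scraped twice. One possible way to avoid that is to keep only the first URL per numeric book ID. The helper name `dedupe_by_book_id` is hypothetical, and the sample list is a shortened copy of the output above.

```python
import re

def dedupe_by_book_id(urls):
    """Keep the first URL seen for each numeric book ID, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        match = re.search(r'/book/show/(\d+)', url)
        # Fall back to the full URL as the key when no numeric ID is present
        key = match.group(1) if match else url
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

sample = [
    'https://www.goodreads.com/book/show/14935',
    'https://www.goodreads.com/book/show/1885.Pride_and_Prejudice',
    'https://www.goodreads.com/book/show/14935.Sense_and_Sensibility',
]
unique_sample = dedupe_by_book_id(sample)
print(unique_sample)
```

Applied as `filtered_urls = dedupe_by_book_id(filtered_urls)` before scraping, this would leave one entry per book.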

In [None]:
# Book pages to scrape (filtered from the crawl)
urls = filtered_urls

# Scrape reviews --> dictionaries
all_reviews = []
for url in urls:
    reviews = scrape_reviews(url)
    all_reviews.extend(reviews)

# Dictionaries --> Dataframe
reviews_df = pd.DataFrame(all_reviews)

In [8]:
# Convert book_ids to integer
reviews_df['book_id'] = reviews_df['book_id'].apply(convert_to_int)

# Drop - non-int book ids
reviews_df = reviews_df.dropna(subset=['book_id'])

print(reviews_df)

     book_id                                             review  vote_count
0      14935  Money. It's all about the money. I mean, why e...           0
1      14935  I love Jane Austen. I LOVE Jane Austen. I LOVE...        1800
2      14935  *life goals: to be an Eleanor*reality: being a...         535
3      14935  While I enjoyed the relationship between the s...         419
4      14935  I'm not a fan of Jane Austen. I've given her m...         390
..       ...                                                ...         ...
445     2175  Madame Bovary was a real treat. I'm glad that ...           0
446     2175  گوستاو فلوبر نزديك چهار سال براي نوشتن اين داس...         138
447     2175  C'EST MOI Meravigliosa come sempre, sempliceme...           0
448     2175  Like every European teenager who takes French ...           0
449     2175  There’s something about Flaubert’s writing tha...           0

[450 rows x 3 columns]


# Save CSV

In [9]:
# Save --> CSV
reviews_df.to_csv('reviews.csv', index=False)