<a href="https://colab.research.google.com/github/CreativeKenning/simple-scraper-colab/blob/main/Scraper_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## How to make your own copy

To run this scraper, you'll need to make your own copy:
1. Navigate to the "File" tab
2. Click "Save a copy in Drive"
3. You're set! Start working with that copy

## Some Colab Tips

1. If you'd like to hide the "form" overlaying the code, simply right click on a cell, find the "form" tab, and click "hide form"




## What is included in this Colab sheet?
This notebook:
1. Introduces you to the basic structure of HTML
2. Demonstrates how to investigate HTML using BeautifulSoup
3. Guides you through scraping in a low-to-no-coding environment, with simple data export to CSV and TXT files for easy analysis using tools like [Orange](https://orangedatamining.com/) or [AntConc](https://www.laurenceanthony.net/software/antconc/)
4. Provides multi-variable scraping and data-export options
5. Provides a simplified web crawler that pulls all links from a given website and presents them for either data export or further analysis


## Prologue: Importing Libraries and Making the Soup
* 1. [Library Import (Mandatory!)](#scrollTo=eXLTS2V64ztn&line=1&uniqifier=1)
  * Imports Necessary Libraries to run the code
* 2. [Choose Website](#scrollTo=yOpxFGXFlYpG&line=1&uniqifier=1)
  * Enter the URL of the website you want to scrape
* 3.  [Making the Soup](#scrollTo=Dsi-PhX7h2wT&line=1&uniqifier=1)
  * Ensure we can pull the raw HTML from the chosen website

## Section A. Filtering HTML and Simple 2-tag scraping

* 1A. [Prettified HTML Viewing](#scrollTo=A0JhvWZFmBPk&line=1&uniqifier=1)
  * View chosen websites' HTML in a clean, hierarchical manner
* 2A. [HTML Filtering By Tag and Class](#scrollTo=kmP6VUMLWp6l&line=1&uniqifier=1)
  * Interactive HTML filter: find the data you want to scrape
* 3A. [Simple 2-tag scraping](#scrollTo=5VBqh_qjYnaf&line=9&uniqifier=1)
  * Simple Scraper: Cell to pull and view data from up to two HTML tag/class pairs
* 4A. [Data Export](#scrollTo=7mx6qoosxMH_&line=1&uniqifier=1)
  * Export your data in TXT or CSV format

## Section B. Multi-Tag Scraping and export
* 1B. [Multi-Tag input](#scrollTo=aiJfr2MAdISv&line=1&uniqifier=1)
  * A multi-tag input for those looking to scrape more than two HTML tag/class pairs
* 2B. [Dictionary Checker](#scrollTo=IpB_q-8xrsQ9)
  * A proofing cell to check your inputs to ensure there aren't any typos!
* 3B. [Multi-Tag Scraper](#scrollTo=kJ_YuM4msne5&line=1&uniqifier=1)
  * Running the multi-tag scraper and viewing the scraped data
* 4B. [Multi-Tag Data Export](#scrollTo=6Qh9Or8uPpTm&line=3&uniqifier=1)
  * Data Export in TXT or CSV format
  
## Section C. Webpage crawling and overview tool
* 1C. [Website picker and Crawler](#scrollTo=i5cECfS2-u7K&line=1&uniqifier=1)
  * A simple website crawler that extracts all webpages from your website, and displays a simple overview of their "paragraph" tags
* 2C. [Website Crawler Data Export](#scrollTo=WwGWyuSx487r&line=1&uniqifier=1)
  * Data Export for your crawled website as TXT or CSV file



In [None]:
#@markdown ### 1. Importing libraries

#@markdown First things first, run this cell to import all the necessary dependencies.
#@markdown <br><br>
#@markdown This also mounts your
#@markdown Google Drive. If you would like to save the data gathered from this Colab sheet, allow access to your Drive when prompted.

#@markdown You only need to run this once.
!pip install Colorama
!pip install ipywidgets
!pip install regex
import regex as re
import ipywidgets as widgets
import colorama
import ast
from colorama import Fore, Back, Style
from pathlib import Path
from bs4 import SoupStrainer, BeautifulSoup
import pandas as pd
import numpy as np
import nltk
import nltk.data
import urllib
from urllib import request
from urllib.parse import urljoin, urlparse
import urllib.request as ur
import requests
from google.colab import drive

drive.mount('/content/drive')


Collecting Colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: Colorama
Successfully installed Colorama-0.4.6
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2
Mounted at /content/drive


In [None]:
#@markdown ### 2. Choosing Your website
#@markdown ### How to use this cell

#@markdown Run this cell, and paste the URL of the website you want to scrape and hit Enter.
#@markdown * If you're looking for an easy website to get started, I recommend https://quotes.toscrape.com/
#@markdown I will be using this website as an example throughout
#@markdown * If you want to scrape multiple webpages, come back and run this cell again with the next URL
#@markdown


url = input("Enter Your URL and hit Enter to get started!")

print(Fore.GREEN +"Okay, you've entered " + url)

Enter Your URL and hit Enter to get started! https://quotes.toscrape.com/
Okay, you've entered  https://quotes.toscrape.com/


In [None]:
#@markdown ### 3. Making the BeautifulSoup Object

#@markdown ### How to use this cell
#@markdown 1. Run this cell, and it will attempt to pull the HTML from your chosen website
#@markdown <br><br>This cell just makes sure we can pull the website; we'll parse the HTML in the next cell.


# This block pulls the HTML from the website using the *requests* library.
# Then, we use BeautifulSoup to turn that HTML into a BeautifulSoup object called "soup".
#@markdown


# Send a GET request to the website
response = requests.get(url)

# Create a BeautifulSoup object and parse with an HTML parser
soup = BeautifulSoup(response.content, "html.parser")
# If the status code of the "response" object is 200 (meaning it's all good)
if response.status_code == 200:
  # Print the success message in green
  print(Fore.GREEN + "We've created our beautifulSoup object, represented by the variable 'soup' ")
# If the status code is 404 (meaning the page wasn't found)
elif response.status_code == 404:
  # Let the user know something went wrong
  print(Fore.RED + "Something went wrong: please double-check your URL and try again")
# Report any other status code so that failures are never silent
else:
  print(Fore.RED + f"Unexpected status code {response.status_code}: the page may not have loaded correctly")



We've created our beautifulSoup object, represented by the variable 'soup' 


# Section A: Two-tag Scraper


**1A. Viewing the HTML**

HTML stands for HyperText Markup Language. HTML documents are structured by HTML tags, and because tags give a webpage its structure, we can use them to collect the data we want.
<br><br>
Tags that contain other tags are called 'parent' tags, and the tags contained within a parent are called 'child' tags. Tags that share a parent are called 'sibling' tags.
<br><br>
This cell helps you find which tags contain the data you want to scrape by filtering out the 'noise' and formatting tags hierarchically.
<br><br>
 We will also be looking for the 'class' of the parent and child tags, if any. Classes allow the same tag (such as 'p', a paragraph tag) to be styled differently across the webpage using CSS (Cascading Style Sheets). For scraping, we can use the class to home in on the data we want to collect!
<br><br>

 As you explore the HTML of your chosen webpage:

 1. Look for the tags that contain the data you want (child tags)
 2. Look for the tags directly above them (parent tags)
 3. Note any tags in the HTML tree that are adjacent to your data (sibling tags)
 4. Note the HTML 'class', if any, of all parent, child, and sibling tags

This information will not only help you pull the data you need, but also filter out the noise that you don't!
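
To make the parent, child, and sibling relationships concrete, here is a minimal sketch. It parses a small hand-written HTML fragment (the markup is invented for illustration and only loosely modeled on quotes.toscrape.com) and looks up a child tag, its parent, and its siblings with BeautifulSoup. The variable names `html`, `soup_demo`, `child`, `parent`, and `siblings` are assumptions made for this example, not part of the notebook's own cells.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example
html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Child tag: the element that holds the data we want
child = soup_demo.find("span", class_="text")

# Parent tag: the element directly above the child
parent = child.parent

# Sibling tags: other tags that share the same parent and come after the child
siblings = [sib.name for sib in child.find_next_siblings()]

print(child.get_text())                   # text inside the child tag
print(parent.name, parent.get("class"))   # parent tag name and its list of classes
print(siblings)                           # names of the sibling tags
```

Here the parent is the `div` with class `quote`, the child is the `span` with class `text`, and `small` is its sibling. These are the same kinds of tag/class pairs you will note down for the scraper cells.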


In [None]:
#@markdown ### 2A. Viewing the HTML
#@markdown Run this cell to view the "Prettified" HTML!

x = soup.prettify()

# Ask the user if they want to export the content as a text file or view it in the browser
user_input = input("Would you like to export this as a text file or view in browser? Please enter Y to save as a .txt file, or N to view in browser: ").upper()

if user_input == "N":
    print(x)
elif user_input == "Y":
    # Ask for the filename
    filename = input("Please enter a file name (best practice: CamelCase, no spaces): ") + ".txt"

    # Define the file path in Google Drive
    file_path = Path("/content/drive/My Drive/scraper") / filename

    file_path.parent.mkdir(parents=True, exist_ok=True)
    # Write the prettified HTML content to the file in Google Drive (UTF-8 keeps non-ASCII characters intact)
    with open(file_path, "w", encoding="utf-8") as file_object:
        file_object.write(x)

    print(f"The content has been exported to {file_path}")
else:
    print("I couldn't understand that, please enter Y or N")



Would you like to export this as a text file or view in browser? Please enter Y to save as a .txt file, or N to view in browser: n
<!DOCTYPE html>
<html lang="en">
 <title>
  404 Not Found
 </title>
 <h1>
  Not Found
 </h1>
 <p>
  The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.
 </p>
</html>



In [None]:
#@markdown ### 3A. Filtering cell: Find the parent and child tags you need for analysis
#@markdown By now, you should have a general idea of where the data you need is located in the HTML tree; this cell helps us refine that further.

#@markdown <br> 1. Enter the parent tag you want to search under in the 'findtag' section,
#@markdown and its class (if any) in the 'findclass' box


#@markdown <br><br> 2. Run this cell to investigate the 'children' contained beneath the parent tag/class pair you entered

findtag = 'div' #@param {type:"string"}
#@markdown Filter the tags by CSS class
findclass = 'col-md-8' #@param {type:"string"}

# @markdown If you don't know where to begin, try
#@markdown checking out [quotestoscrape.com](https://quotes.toscrape.com/),
#@markdown note how the same HTML tag, "div", contains
#@markdown different information based on its class. E.g. the 'quote' class<br><br>
#@markdown Try entering 'div' in the box findtag, and the class 'quote', 'row', 'col-md-8', or 'col-md-4', in the 'findclass' box.
#@markdown <br><br> Notice how you can filter for different kinds of information by changing the tag and class our scraper looks for.
#@markdown This is important for ensuring you don't pull any extraneous data that might make cleaning more difficult!
#@markdown <br><br> Once you've refined your search,
#@markdown write down the *tag* and *class* of any child tags containing important information, and the tag and class of their *direct parent tag*.
#@markdown We'll be using that information in the next cell.<br>

if findclass:  # Check whether the variable 'findclass' has a value
    z = soup.find_all(findtag, class_=findclass)  # If it does, search for the user-defined tag and class
else:  # If findclass has no value
    z = soup.find_all(findtag)  # Then search for the tag only

if z:  # If "z" (the filtered HTML tags) is not empty
    for tag in z:  # Then for each tag in the filtered results
        print(tag.prettify())  # Print the prettified version
else:  # If nothing matched
    print(f"No elements found for tag: {findtag} with class: {findclass}")  # Report that no elements were found





NameError: name 'soup' is not defined

In [None]:
def tag_scrape(entry, tag, _class):
  # Look for the first element inside "entry" that matches the given tag and class
  element = entry.find(tag, {"class": _class})
  if element:
    description = element.get_text().strip()
  else:
    # Placeholder used when the tag/class pair is missing from this entry
    description = "none"
  return description


#@markdown ### 4A: Two-Tag Scraping cell
#@markdown In this cell, you will take the HTML tags you identified in 3A and enter them here; now that we know *exactly* where your data is, we're going to collect it so it can be exported to your Google Drive.
#@markdown <br><br>
#@markdown How to use this cell:
#@markdown 1. First, enter the parent tag and class you identified in 3A: enter the parent tag into the "findall" box, and the class into the "findclass" box<br><br>
#@markdown 2. Then, take the child tag(s) you identified in 3A, and enter them into tag1 and class1, and tag2 and class2 respectively. <br>
#@markdown <br>2a. If you only have one tag's worth of data you want to collect, leave tag2 and class2 blank <br><br>
#@markdown 3. Enter the column names for the data you're collecting in "namedata1" and "namedata2" if applicable; this tells the code what to name the columns in your data.<br><br>
#@markdown 4. Finally, run the cell to collect your data. If you don't pull any data, first check your spelling. Then, go back to 3A, and make sure that you correctly identified the parent tags *above* your data, and the child tags that *contain* your data.


#@markdown Filtering: Enter the parent tag and class you want to look beneath:
findall = "div" #@param  {type:"string"}
findclass = "quote" #@param {type:"string"}
#@markdown <br>
#@markdown Enter the data-containing first child tag and class pair here
tag1 = "span" #@param  {type:"string"}
class1 = "text" #@param {type:"string"}
# @markdown Enter your second child tag and class pair here (optional)
tag2 = "small" #@param {type:"string"}
class2 = "author" #@param {type:"string"}
#@markdown What kind of data are you scraping? Provide column names here
namedata1 = "text" #@param {type:"string"}
namedata2 = "author" #@param {type:"string"}
#@markdown <br>

# @markdown If everything goes well, you should see your data in a "dataframe" at the bottom of this cell.
# @markdown A dataframe is a powerful tool you can use to store, clean, and analyze data.
# @markdown <br><br>
# @markdown Plus, colab makes it easy by integrating auto-coding graphs and options directly into its DF user interface.
# @markdown <br><br>
# @markdown Try clicking the "convert dataframe into interactive table" or "suggest charts" button to find some new ways to view the data you just collected.
# @markdown
# @markdown Alternatively, you can export it to a CSV file in the next cell
if findclass:  # Check whether the variable 'findclass' has a value
  entries = soup.find_all(findall, class_=findclass)  # If it does, search for the user-defined tag and class
else:  # If findclass has no value
  entries = soup.find_all(findall)  # Then search for the tag only

data = []  # Creates an empty list titled "data"

for entry in entries:
  if tag2:
    name = tag_scrape(entry,tag1,class1)
    description = tag_scrape(entry,tag2,class2)
    data.append({namedata1: name, namedata2: description})
  else:
    name = tag_scrape(entry,tag1,class1)
    data.append({namedata1: name})


df = pd.DataFrame(data)


df


Unnamed: 0,text,author
0,“The world as we have created it is a process ...,Albert Einstein
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling
2,“There are only two ways to live your life. On...,Albert Einstein
3,"“The person, be it gentleman or lady, who has ...",Jane Austen
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe
5,“Try not to become a man of success. Rather be...,Albert Einstein
6,“It is better to be hated for what you are tha...,André Gide
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin


In [None]:
#@markdown ## 5A. Data Export
#@markdown Run this cell and follow its prompts to save your scraped data in a google drive folder
def saveFile(fileName, dataframe):
  # Build the destination path inside the "scraper" folder on Google Drive
  file_path = Path("/content/drive/My Drive/scraper") / fileName
  file_path.parent.mkdir(parents=True, exist_ok=True)
  # Write the dataframe that was passed in, rather than relying on a global variable
  dataframe.to_csv(file_path, index=False)
  return file_path

print(Fore.RED + "Would you like to export your data as a CSV or TXT file?")
user_input = input("Please enter '+' to save the whole dataframe, or '-' if you would like to choose specific columns: ")


if user_input == "+":
  fileChoice = input("Would you like to save this file as a CSV or a TXT file? Please input c for CSV or t for TXT: ").lower()
  urFile = input("Now please enter the name of the file: ")
  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file,df)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file at {file_path}")
  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file,df)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file at {file_path}")
  else:
    print("Please enter c or t")

elif user_input == "-":
  columnInput = input("Please enter the column names you want to save, separated by a comma: ")
  # Column names are matched exactly as typed, so capitalization matters
  names = [name.strip() for name in columnInput.split(',')]
  df_subset = df[names]  # Keep the original dataframe intact
  fileChoice = input("Would you like to save this file as a CSV or a TXT file? Please input c for CSV or t for TXT: ").lower()
  urFile = input("Now please enter the name of the file: ")

  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file,df_subset)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file at {file_path}")

  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file,df_subset)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file at {file_path}")
  else:
    print("Please enter c or t")
else:
  print('please enter + or -')


Would you like to export your data as a CSV or TXT file?
Please enter '+' to save the whole dataframe, or '-' if you wouldlike to choose specific columns-
Please enter the column names you want to save, seperated by a commatext
would you like to save this file as a CSV or a txt file? Please input c for CSV or t for txtc
Now please enter the name of the filesupertest
Success! Check your drive for your CSV file for /content/drive/My Drive/scraper/supertest.csv


# Section B: Multi-Tag Scraper

The following cells are designed so you can scrape as many tag-class pairs as you want, and export the data in your chosen format.

### Step 1: Input

Write down your tag-class pairs beforehand, and decide how you want to label the data. You will enter these parameters into the "input" cell. This cell turns your parameters into a "list of dictionaries", which tells the scraper which tags to pull and how to sort the resulting data.

### Step 2: Double Check and Processing
Be sure to double-check your tag/class pairs and column names by running the "double check" cell. This ensures you collect the correct data, and it also demonstrates how Python stores your data in a dictionary as key-value pairs: in this instance "tag", "class", and "column" are the keys, while the values you enter tell the cell what to scrape. [You can read up on dictionaries here](https://www.codecademy.com/resources/docs/python/dictionaries).
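
If the idea of a "list of dictionaries" is new, the short sketch below may help. It builds a small list by hand (the name `tag_class_pairs_demo` and its entries are assumptions made for this example) and reads each dictionary back by its keys, which is roughly what the proofing cell lets you inspect.

```python
# Hypothetical tag/class/column entries, mirroring what the input cell collects
tag_class_pairs_demo = [
    {"tag": "span", "class": "text", "column": "text"},
    {"tag": "small", "class": "author", "column": "author"},
]

# Each dictionary is read by key: "tag", "class", and "column"
for pair in tag_class_pairs_demo:
    print(f"Scrape <{pair['tag']} class='{pair['class']}'> into column '{pair['column']}'")

# Collect just the column names, for example to proof them for typos
columns = [pair["column"] for pair in tag_class_pairs_demo]
print(columns)
```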

### Step 3: Run the Processing cell
Once you've double-checked your tag/class pairs, run the "processing" cell and inspect the resulting dataframe for patterns, or to confirm that the correct data was pulled. This may take a couple of tries, with or without typos. I recommend writing your tags down on a piece of scratch paper, or in a handy TXT file in another tab, to keep things straight.
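
As a rough sketch of what the processing step does, the snippet below loops over a list of tag/class/column dictionaries and builds one row per parent entry. The HTML fragment and the names `pairs`, `demo_soup`, and `rows` are invented for this example. It is a simplified illustration under those assumptions rather than the exact code in the processing cell.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two entries (invented for this example)
html = """
<div class="quote"><span class="text">First quote</span>
  <small class="author">Author One</small>
  <a class="tag">life</a><a class="tag">humor</a></div>
<div class="quote"><span class="text">Second quote</span>
  <small class="author">Author Two</small></div>
"""

pairs = [
    {"tag": "span", "class": "text", "column": "text"},
    {"tag": "small", "class": "author", "column": "author"},
    {"tag": "a", "class": "tag", "column": "tags"},
]

demo_soup = BeautifulSoup(html, "html.parser")
rows = []
for entry in demo_soup.find_all("div", class_="quote"):  # one row per parent tag
    row = {}
    for pair in pairs:
        found = entry.find_all(pair["tag"], class_=pair["class"])
        texts = [el.get_text().strip() for el in found]
        # Join multiple matches; fall back to "none" when nothing matched
        row[pair["column"]] = ", ".join(texts) if texts else "none"
    rows.append(row)

print(rows)
```

Notice that the second entry has no `a` tags with class `tag`, so its `tags` column falls back to `"none"` instead of raising an error.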

### Step 4: Data Export

After inspecting the dataframe to ensure everything is in order, run the data export cell to choose how you want to export your data. You can choose a specific file type (CSV or TXT) and specific columns.
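
For a sense of what choosing specific columns means, here is a small sketch using pandas. The dataframe `demo_df` and the list `wanted` are made up for this example, and the CSV text is returned as a string instead of being written to Google Drive, so nothing is saved when you run it.

```python
import pandas as pd

# Hypothetical scraped rows, standing in for the dataframe the scraper produces
demo_df = pd.DataFrame([
    {"text": "First quote", "author": "Author One", "tags": "life, humor"},
    {"text": "Second quote", "author": "Author Two", "tags": "none"},
])

# Keep only the columns you want to export
wanted = ["text", "author"]
subset = demo_df[wanted]

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = subset.to_csv(index=False)
print(csv_text)
```

Column names must match the dataframe's column names exactly, including capitalization; otherwise pandas raises a `KeyError`.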



In [None]:
#@title B1: Multi Tag-Class Scraper: Input

#@markdown This cell is designed for cases where you have more than two tags to scrape

#@markdown * Run this cell, and enter the tag, class, and desired column name for your data
#@markdown * Be careful: if you click "Add" and have a typo, you'll have to run this cell again
#@markdown * The widget is not self-refreshing, so you'll need to delete and re-enter the tag/class/column entry each time you add a new one

def display_data():
    display(widgets.Label(value='Current tag-class-column pairs:'))
    for pair in tag_class_pairs:
        display(widgets.Label(value=str(pair)))
def on_add_button_clicked(b):
    tag_class_pairs.append({
      'tag': tag_input.value,
      'class': class_input.value,
      'column': column_input.value
    })
    display_data()

# Interactive form for user input
tag_input = widgets.Text(description='HTML Tag:')
class_input = widgets.Text(description='CSS Class:')
column_input = widgets.Text(description='Column Name:')
add_button = widgets.Button(description='Add')

tag_class_pairs = []



add_button.on_click(on_add_button_clicked)


display(tag_input, class_input, column_input, add_button)




Text(value='', description='HTML Tag:')

Text(value='', description='CSS Class:')

Text(value='', description='Column Name:')

Button(description='Add', style=ButtonStyle())

Label(value='Current tag-class-column pairs:')

Label(value="{'tag': 'span', 'class': 'text', 'column': 'text'}")

Label(value='Current tag-class-column pairs:')

Label(value="{'tag': 'span', 'class': 'text', 'column': 'text'}")

Label(value="{'tag': 'small', 'class': 'author', 'column': 'author'}")

Label(value='Current tag-class-column pairs:')

Label(value="{'tag': 'span', 'class': 'text', 'column': 'text'}")

Label(value="{'tag': 'small', 'class': 'author', 'column': 'author'}")

Label(value="{'tag': 'a', 'class': 'tag', 'column': 'tag'}")

In [None]:
#@title B2: Double-check your tag class pairs and column titles
#@markdown Run this cell to see the tag, class, and column name for your data.
#@markdown <br> <br>It should look like this: "[{'tag': 'your value', 'class': 'your value', 'column': 'your column name'}]", with one dictionary for each entry you added.

print(tag_class_pairs)

[{'tag': 'span', 'class': 'text', 'column': 'text'}, {'tag': 'small', 'class': 'author', 'column': 'author'}, {'tag': 'a', 'class': 'tag', 'column': 'tag'}]


In [None]:
#@title B3: Multi-Tag Scraper: Processing

#@markdown # How to use this cell:
#@markdown 1. Ensure that you've determined the class and type of the parent tag whose children you need to pull.
#@markdown 2. Ensure that you've double checked your tag/class pairs in the dictionary-viewer above.
#@markdown 3. Hit the "run cell" button and view the resulting dataframe.
#@markdown 4. Note the names of any columns you want to export, or simply export the whole dataframe using the following cell.



#@markdown If you encounter errors, double check the parent tag and class entered in the "findall" and "findclass" boxes.
#@markdown <br><br>Next, ensure that the tag/class pairs you've entered correspond to the data you need to pull in the HTML.
#@markdown <br><br>Try running this with one tag first, and slowly adding more if you continue having errors.
#@markdown <br><br>
#@markdown Also, this is a simple scraper and may run into errors that were not accounted for, so keep that in mind.

findall1 = "div" #@param  {type:"string"}
findclass1 = "quote" #@param {type:"string"}
#@markdown <br>

#tag_class_pairs = [{'tag':'span', 'class':'text', 'column':'text'}, {'tag':'small', 'class':'author', 'column':'author'},{'tag':'small','class':'author', 'column':'author'},{'tag':'a', 'class':'tag', 'column':'tags'}]

def tag_scrape(entry, tag, _class):
    elements = entry.find_all(tag, {"class": _class})
    if elements:
        return [element.get_text().strip() for element in elements]
    else:
        return ["none"]

if findclass1:  # Check whether the variable 'findclass1' has a value
  entries = soup.find_all(findall1, class_=findclass1)  # If it does, search for the user-defined tag and class
else:  # If findclass1 has no value
  entries = soup.find_all(findall1)  # Then search for the tag only

data = []  # Creates an empty list we will use to hold our data


## To modify what gets scraped, change the value of each "key:value" pair; to collect more, copy and paste the {dictionary} within the [list] and edit accordingly




for entry in entries:  # For each entry in the filtered results (entries)
  entry_data = {}  # First open a dictionary titled "entry_data"
  for pair in tag_class_pairs:  # Then for each dictionary in the list "tag_class_pairs"
    tag = pair["tag"]  # "tag" is the value stored under the "tag" key
    class_name = pair["class"]  # "class_name" is the value stored under the "class" key
    column = pair["column"]  # and "column" is the value stored under the "column" key

    scraped_data = tag_scrape(entry, tag, class_name)  # Pass the entry, tag, and class_name values to our "tag_scrape" function

    entry_data[column] = ', '.join(scraped_data)  # Join all matches into one string stored under the column name


  data.append(entry_data)
print(data)
# Create a DataFrame from the collected data
df = pd.DataFrame(data)

# Display the DataFrame
df


[{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tag': 'change, deep-thoughts, thinking, world'}, {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tag': 'abilities, choices'}, {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tag': 'inspirational, life, live, miracle, miracles'}, {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tag': 'aliteracy, books, classic, humor'}, {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tag': 'be-yourself, inspirational'}, {'text': '“Try not to become a ma

Unnamed: 0,text,author,tag
0,“The world as we have created it is a process ...,Albert Einstein,"change, deep-thoughts, thinking, world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities, choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational, life, live, miracle, miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy, books, classic, humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself, inspirational"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"adulthood, success, value"
6,“It is better to be hated for what you are tha...,André Gide,"life, love"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"edison, failure, inspirational, paraphrased"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,misattributed-eleanor-roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"humor, obvious, simile"


In [None]:
#@title B4: Multi-Tag Data Export cell
#@markdown Run this cell to export your multi-tag data.

def saveFile(fileName, dataframe):
  # Build the destination path inside the "scraper" folder on Google Drive
  file_path = Path("/content/drive/My Drive/scraper") / fileName
  file_path.parent.mkdir(parents=True, exist_ok=True)
  # Write the dataframe that was passed in; the old global "df2" is undefined when '+' is chosen
  dataframe.to_csv(file_path, index=False)
  return file_path

print(Fore.RED + "Would you like to export your data as a CSV or TXT file?")
user_input = input("Please enter '+' to save the whole dataframe, or '-' if you would like to choose specific columns: ")


if user_input == "+":
  fileChoice = input("Would you like to save this file as a CSV or a TXT file? Please input c for CSV or t for TXT: ").lower()
  urFile = input("Now please enter the name of the file: ")
  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file,df)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file at {file_path}")
  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file,df)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file at {file_path}")
  else:
    print("Please enter c or t")

elif user_input == "-":
  columnInput = input("Please enter the column names you want to save, separated by a comma: ")
  # Column names are matched exactly as typed, so capitalization matters
  names = [name.strip() for name in columnInput.split(',')]
  df2 = df[names]
  fileChoice = input("Would you like to save this file as a CSV or a TXT file? Please input c for CSV or t for TXT: ").lower()
  urFile = input("Now please enter the name of the file: ")

  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file,df2)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file at {file_path}")

  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file,df2)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file at {file_path}")
  else:
    print("Please enter c or t")
else:
  print('please enter + or -')


Would you like to export your data as a CSV or TXT file?
Please enter '+' to save the whole dataframe, or '-' if you wouldlike to choose specific columns-
Please enter the column names you want to save, seperated by a commatext,tag
would you like to save this file as a CSV or a txt file? Please input c for CSV or t for txtc
Now please enter the name of the filebiggerTest3
Success! Check your drive for your CSV file for /content/drive/My Drive/scraper/biggerTest3.csv


# Section C: Internal Link Crawler

### Part 3: Website Crawler and Link Collector

Do you want to get a better idea of the entirety of your chosen website? This cell will take your URL and attempt to output a list of all internal website links.
<br><br>
 This should give you an idea of the links contained in your website, the links on each specific webpage, and the text (if any) contained in the body paragraphs.
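
To illustrate how links can be sorted into internal and external ones, here is a minimal sketch using only Python's standard `urllib.parse` module. The base URL and `href` values are invented for this example, and comparing hostnames is one reasonable approach among several rather than a description of exactly what the crawler cell does.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical base URL and href values, invented for this example
base_url = "https://example.com"
hrefs = ["/about", "contact.html", "https://example.com/blog", "https://other.org/page"]

internal, external = [], []
for href in hrefs:
    full_url = urljoin(base_url + "/", href)  # resolve relative links against the base
    # Compare hostnames rather than raw strings to decide internal vs external
    if urlparse(full_url).netloc == urlparse(base_url).netloc:
        internal.append(full_url)
    else:
        external.append(full_url)

print(internal)
print(external)
```

Relative links such as `/about` are resolved against the base URL first, so they end up in the internal list alongside absolute links to the same host.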

 You can export this data at the end of this section using the "data export" cell.

 My code here began as a fork of https://github.com/mujeebishaque/extract-urls/blob/main/README.md. I built on the framework it provided, and I used ChatGPT. In particular, I had no idea I could use a *set* to hold the URLs that have already been scraped!
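
The idea of tracking visited pages in a set can be sketched without touching the network. The example below crawls a small in-memory "site" (the dictionary `site_links` and its paths are made up for illustration) and uses a set so that no page is processed twice. It is a simplified stand-in under those assumptions, not the notebook's actual crawler.

```python
from collections import deque

# Hypothetical site map: each page lists the internal links found on it
site_links = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/blog"],
    "/blog": ["/", "/blog/post-1"],
    "/blog/post-1": ["/blog"],
}

visited = set()          # pages already crawled; membership checks are fast
queue = deque(["/"])     # pages waiting to be crawled
order = []               # crawl order, kept for display

while queue:
    page = queue.popleft()
    if page in visited:
        continue         # skip pages we have already handled
    visited.add(page)
    order.append(page)
    # Queue only the links we have not crawled yet
    queue.extend(link for link in site_links.get(page, []) if link not in visited)

print(order)
```

Without the `visited` set, pages that link back to each other (such as `/` and `/about` here) would send the crawler around in circles.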


In [None]:
#@markdown ### Website Crawler

#@markdown 1. Find a link to the website you want to scrape and click the arrow
#@markdown 2. Enter that link, hit Enter, and wait for the crawler (it may take five to ten minutes depending on the size of the website!)
#@markdown 3. Explore the resulting dataframe, and decide which columns you want to export to your Google Drive as a CSV or TXT file.
#@markdown <br> <br>
#@markdown This can be a great way to get a general overview of what's on your chosen website, and help you isolate any individual webpages you may want to analyze in more detail.
#@markdown <br> If you find another page of interest, you can return to the scraper cells and repeat the exploratory process.
# Open question: the XML file and the scraper return vastly different values; are some links in the XML simply no longer linked on the website?

# This function takes the user input and uses urllib's urlparse to split the URL into its named parts,
# which we use to extract the base URL


def parse_url(user_input_url):  # Uses the "urllib" library to parse URLs
    # urlparse breaks the URL into named parts
    # (scheme, netloc, path, and so on)
    parsed_url = urlparse(user_input_url)
    # Reassemble only the "scheme" and "netloc"
    # to return the base URL
    return f"{parsed_url.scheme}://{parsed_url.netloc}"
def fetch_website_content(current_url):
    try:
        # Pull the HTML from current_url
        response = requests.get(current_url.strip())
        # Raise an exception if the server responded with an error status
        response.raise_for_status()
        return response.text  # Return the text of the response to the "collect_link_data" function
    # Catch exceptions from the requests library
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch {current_url}: {e}")
        return None

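def categorize_urls(links, base_url):
    internal_urls = []
    external_urls = []
    # Host of the site being crawled, e.g. "quotes.toscrape.com"
    base_netloc = urlparse(base_url).netloc

    for link in links:
        href = link.get('href')
        if href and href != "#":  # Skip placeholder links
            # Resolve relative links such as "/page/2/" against the base URL
            full_url = urljoin(base_url, href)
            parsed_url = urlparse(full_url)
            if not parsed_url.scheme or not parsed_url.netloc:
                continue  # Skip malformed URLs
            # Compare hosts instead of using a substring test, so an external
            # URL that merely contains the base URL is not counted as internal
            if parsed_url.netloc == base_netloc:
                internal_urls.append(full_url)
            else:
                external_urls.append(full_url)

    return internal_urls, external_urls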

def collect_link_data(current_url, base_url, visited_urls, link_data):
    # Passes the current_url to the function "fetch_website_content"
    website_content = fetch_website_content(current_url)
    if website_content is None:
        return []

    soup = BeautifulSoup(website_content, 'html.parser')  # Creates the BeautifulSoup object
    title = soup.title.string if soup.title else 'No Title'  # Stores the page title in the variable "title"
    # Pulls sample text from the headings and paragraphs of the BeautifulSoup object
    text = " ".join([tag.text.strip() for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "p"])])
    # Separates internal links from external links by passing the 'a' tags from the BeautifulSoup object into "categorize_urls"
    internal_links, external_links = categorize_urls(soup.find_all('a', href=True), base_url)
    # Appends the data (including the internal and external links from "categorize_urls") to a list of dictionaries
    link_data.append({'URL': current_url, 'title': title, 'text': text, 'internal_links': internal_links, 'external_links': external_links})

    return internal_links  # Returns the new internal links for the crawler to visit

# This is the main crawler, and it starts from the base URL
def crawl(base_url):
    # Creates a set of visited URLs (a set cannot contain duplicates)
    visited_urls = set()
    # A list of dictionaries, one per crawled page
    link_data = []
    # A set named "to_crawl" that fills up with links
    # returned by the function "collect_link_data"
    to_crawl = {base_url}

    # Keep going while there are URLs left to crawl
    while to_crawl:
        # set.pop() removes and returns an arbitrary URL from the to_crawl set
        current_url = to_crawl.pop()
        # If it hasn't already been visited
        if current_url not in visited_urls:
            # Print the current URL so the user can follow the progress
            print(f"Crawling: {current_url}")
            # Add the URL to the visited_urls set
            visited_urls.add(current_url)
            # Collect the link data, which is stored in the link_data list
            new_internal_links = collect_link_data(current_url, base_url, visited_urls, link_data)
            # Update the set "to_crawl" with the links returned in "new_internal_links"
            to_crawl.update(new_internal_links)

    # sorted() turns the unordered set into a list with a stable row order
    return pd.DataFrame(link_data), pd.DataFrame(sorted(visited_urls))

# The main function
if __name__ == '__main__':
    # Asks for the input URL
    user_input_url = input("Input URL: ").strip()
    if not user_input_url:
        raise ValueError("Invalid input: please enter a URL")
    # Passes the input URL to the urllib URL parser to get the base URL
    base_url = parse_url(user_input_url)
    # Starts the crawler using the base URL
    df_links, df_visited_Links = crawl(base_url)
# Display the crawled data (export it with the "data export" cell below)
df_links



Input URL: https://quotes.toscrape.com/
Crawling: https://quotes.toscrape.com
Crawling: https://quotes.toscrape.com/author/J-K-Rowling
Crawling: https://quotes.toscrape.com/tag/life/page/1/
Crawling: https://quotes.toscrape.com/tag/plans/page/1/
Crawling: https://quotes.toscrape.com/tag/aliteracy/page/1/
Crawling: https://quotes.toscrape.com/author/Albert-Einstein
Crawling: https://quotes.toscrape.com/author/Thomas-A-Edison
Crawling: https://quotes.toscrape.com/tag/simile/
Crawling: https://quotes.toscrape.com/page/2/
Crawling: https://quotes.toscrape.com/tag/understand/page/1/
Crawling: https://quotes.toscrape.com/author/Marilyn-Monroe
Crawling: https://quotes.toscrape.com/author/Andre-Gide
Crawling: https://quotes.toscrape.com/author/Bob-Marley
Crawling: https://quotes.toscrape.com/
Crawling: https://quotes.toscrape.com/tag/world/page/1/
Crawling: https://quotes.toscrape.com/tag/misattributed-eleanor-roosevelt/page/1/
Crawling: https://quotes.toscrape.com/tag/navigation/page/1/
Crawl

Unnamed: 0,URL,title,text,internal_links,external_links
0,https://quotes.toscrape.com,Quotes to Scrape,Quotes to Scrape Login Top Ten tags Quotes by:...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
1,https://quotes.toscrape.com/author/J-K-Rowling,Quotes to Scrape,Quotes to Scrape Login J.K. Rowling Born: July...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
2,https://quotes.toscrape.com/tag/life/page/1/,Quotes to Scrape,Quotes to Scrape Login Viewing tag: life Top T...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
3,https://quotes.toscrape.com/tag/plans/page/1/,Quotes to Scrape,Quotes to Scrape Login Viewing tag: plans Top ...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
4,https://quotes.toscrape.com/tag/aliteracy/page/1/,Quotes to Scrape,Quotes to Scrape Login Viewing tag: aliteracy ...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
...,...,...,...,...,...
210,https://quotes.toscrape.com/page/10/,Quotes to Scrape,Quotes to Scrape Login Top Ten tags Quotes by:...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
211,https://quotes.toscrape.com/tag/god/page/1/,Quotes to Scrape,Quotes to Scrape Login Viewing tag: god Top Te...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
212,https://quotes.toscrape.com/tag/age/page/1/,Quotes to Scrape,Quotes to Scrape Login Viewing tag: age Top Te...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
213,https://quotes.toscrape.com/tag/better-life-em...,Quotes to Scrape,Quotes to Scrape Login Viewing tag: better-lif...,"[https://quotes.toscrape.com/, https://quotes....","[https://www.goodreads.com/quotes, https://www..."
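
The crawl log above lists both `https://quotes.toscrape.com` and `https://quotes.toscrape.com/`, so the same page is fetched twice. One possible way to avoid this is to normalize each URL before adding it to the set. The helper below is a hypothetical sketch (the name `normalize_url` and its rules are assumptions, not part of the crawler cell); it uses only `urllib.parse` from the standard library:

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Return a canonical form of url so equivalent addresses compare equal.

    Assumed rules (adjust them for your site): lowercase the scheme and host,
    drop the #fragment, and treat an empty path as "/".
    """
    parts = urlparse(url.strip())
    path = parts.path or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, parts.query, ""))

# Three spellings of the same page collapse to a single entry in the set
seen = {normalize_url(u) for u in [
    "https://quotes.toscrape.com",
    "https://quotes.toscrape.com/",
    "https://Quotes.toscrape.com/#top",
]}
print(seen)
```

If you try this in the crawler, the normalization would need to be applied both to the base URL and to every link before it goes into `to_crawl` or `visited_urls`; otherwise the two sets would still disagree about which pages are the same.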


In [None]:
#@markdown ### Updated Data Export cell for link export
#@markdown Follow the prompts to save your crawled website as a CSV or TXT file
#@markdown <br> If you are only interested in website URLs, make sure to choose to save only specific columns, and enter the appropriate column name when prompted
#def save_type(file,fileChoice):

def saveFile(fileName, dataframe):
  # Build the path inside the "scraper" folder of the mounted Google Drive
  file_path = Path("/content/drive/My Drive/scraper") / fileName
  # Create the folder if it does not exist yet
  file_path.parent.mkdir(parents=True, exist_ok=True)
  # Save the dataframe that was passed in, not the global df_links
  dataframe.to_csv(file_path, index=False)
  return file_path

print(Fore.RED + "Would you like to export your data as a CSV or TXT file?")
user_input = input("Please enter '+' to save the whole dataframe, or '-' if you would like to choose specific columns")


if user_input == "+":
  fileChoice = input("would you like to save this file as a CSV or a txt file? Please input c for CSV or t for txt").lower()
  urFile = input("Now please enter the name of the file")
  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file, df_links)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file  '{file_path}'")
  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file, df_links)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file  '{file_path}'")
  else:
    print("please enter c or t")

elif user_input == "-":
  columnInput = input("Please enter the column names you want to save, separated by a comma")
  names = columnInput.split(',')
  names = [name.strip() for name in names]
  # Keep the selection in its own variable so the full df_links stays
  # available if this cell is run again
  df_export = df_links[names]
  fileChoice = input("would you like to save this file as a CSV or a txt file? Please input c for CSV or t for txt").lower()
  urFile = input("Now please enter the name of the file")

  if fileChoice == "c":
    file = urFile+".csv"
    file_path = saveFile(file, df_export)
    print(Fore.GREEN + f"Success! Check your drive for your CSV file  '{file_path}'")

  elif fileChoice == "t":
    file = urFile+".txt"
    file_path = saveFile(file, df_export)
    print(Fore.GREEN + f"Success! Check your drive for your TXT file  '{file_path}'")
  else:
    print("please enter c or t")
else:
  print('please enter + or -')


Would you like to export your data as a CSV or TXT file?
Please enter '+' to save the whole dataframe, or '-' if you would like to choose specific columns-
Please enter the column names you want to save, separated by a commaURL
would you like to save this file as a CSV or a txt file? Please input c for CSV or t for txtt
Now please enter the name of the fileURL_only
Success! Check your drive for your TXT file  '/content/drive/My Drive/scraper/URL_only.txt'
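
The column-selection step can also be tried without Google Drive. The sketch below is a minimal illustration under stated assumptions: the two-row dataframe is invented sample data, `export_columns` is a hypothetical helper, and the file is written to a temporary folder instead of `/content/drive/My Drive/scraper`:

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_columns(dataframe, column_input, file_path):
    """Save only the comma-separated columns named in column_input."""
    names = [name.strip() for name in column_input.split(",")]
    # Fail with a readable message if a column name was mistyped
    missing = [name for name in names if name not in dataframe.columns]
    if missing:
        raise KeyError(f"Unknown column(s): {missing}")
    dataframe[names].to_csv(file_path, index=False)
    return file_path

# Invented sample data standing in for the crawler's df_links
sample = pd.DataFrame({
    "URL": ["https://quotes.toscrape.com/", "https://quotes.toscrape.com/page/2/"],
    "title": ["Quotes to Scrape", "Quotes to Scrape"],
})

out_path = export_columns(sample, " URL ", Path(tempfile.mkdtemp()) / "URL_only.txt")
saved_text = out_path.read_text()
print(saved_text)
```

Note that the TXT option in the export cell writes the same comma-separated layout as the CSV option; only the file extension differs.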
