# Analyze URLs for Topic & Category



### Overview

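Use web scraping and the OpenAI API to categorize URLs in bulk. Each page's title and headings are scraped, then sent to a chat model that returns a content category, an SFW flag, likely search queries, and a primary topic for every URL.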


### Free Scripts

This is part of a series of free Python scripts.

You can find all the scripts online at https://github.com/FrontAnalyticsInc/data-winners

Message me on Twitter if you have any suggestions.


### About Me

My name is Alton Alexander. I am a data science consultant turned entrepreneur building SaaS tools for SEO.

Find out more about my free scripts or ask me any question on Twitter: [twitter.com/@alton_lex](https://twitter.com/alton_lex)


### Join the Course for Data Winners

Join the conversation:

- Private Discord community

- Video tutorials

- Feedback and support on this and other scripts

Join now: https://datawinners.gumroad.com/l/data-analytics-for-seo

In [None]:
# Set your API key
# copy and paste your OpenAI API key as a variable (run this after the import cell below)

openai.api_key = "sk-PASTEYOURKEYHERE"

In [45]:
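# install libraries
# note: the code below calls openai.ChatCompletion, which exists only in openai versions before 1.0
%pip install "openai<1" requests beautifulsoup4 lxml

# load libraries
import requests
from bs4 import BeautifulSoup
import openai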

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Set your API key
# paste your OpenAI API key when prompted; it is stored as a variable

openai.api_key = input()
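
Typing the key into `input()` works, but the key is echoed on screen and can end up in saved notebook output. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (that name and the `load_api_key` helper are assumptions for illustration, not part of the original script), with a hidden prompt as the fallback:

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, or prompt for it without echoing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides what is typed, unlike input()
        key = getpass("OpenAI API key: ").strip()
    return key

# hypothetical usage in this notebook:
# openai.api_key = load_api_key()
```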

In [97]:
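# enter one or more urls (comma separated)

# example input; the input() call below replaces it with whatever you type
raw_url_text = "https://tourismhuahin.com,https://contentcurator.com,https://frontanalytics.com,https://datawinners.com"
raw_url_text = input()


# convert to list, dropping stray whitespace and empty entries
urls = [u.strip() for u in raw_url_text.split(",") if u.strip()]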

https://tourismhuahin.com,https://contentcurator.com,https://frontanalytics.com,https://datawinners.com


In [98]:
urls

['https://tourismhuahin.com',
 'https://contentcurator.com',
 'https://frontanalytics.com',
 'https://datawinners.com']
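
Pasted URL lists often contain stray spaces, empty entries, duplicates, or bare domains without a scheme. The following is a rough sketch of a clean-up step that could run before scraping. The `clean_urls` helper and the choice to default bare domains to `https://` are assumptions, not something the original script does.

```python
from urllib.parse import urlparse

def clean_urls(raw_text):
    """Split comma-separated text into a de-duplicated list of http(s) URLs."""
    cleaned = []
    for part in raw_text.split(","):
        url = part.strip()
        if not url:
            continue  # skip empty entries such as "a.com,,b.com"
        # assumption: a bare domain like "example.com" should use https
        if "://" not in url:
            url = "https://" + url
        parsed = urlparse(url)
        # keep only web URLs with a host, and keep the first copy of duplicates
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in cleaned:
            cleaned.append(url)
    return cleaned

print(clean_urls(" https://example.com, example.org,,https://example.com "))
```

Input order is preserved, so the results line up with the order in which the URLs were entered.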

# code

In [48]:

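def parse_page_to_sections_list( link ):
    '''
    Download the raw html of a link and parse its h1-h3 headers
    into a list of dictionaries (one per header).

    Returns a tuple: (page title, list of header dictionaries)
    '''
    # timeout so one unresponsive site does not hang the whole run
    page = requests.get(link, timeout=30)
    soup = BeautifulSoup(page.text, "lxml")

    # some pages have no <title> tag; fall back to an empty string
    title_tag = soup.find("title")
    title = title_tag.text.strip() if title_tag else ""

    headers = ['h1', 'h2', 'h3']

    df_json = []

    for header in headers:
        sections = soup.find_all(header)

        # order restarts at 0 for each header level
        for order, section in enumerate(sections):
            tmp = {
                "url": link,
                "page_title": title,
                "header": header,
                "order": order,
                "title": section.text.strip()
            }
            df_json.append( tmp )

    return title, df_json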

In [49]:
# request the page and create a dataset for each url

# cache all results
all_results = {}

for url in urls:
    print(url)
    
    if url not in all_results:
        title, sections = parse_page_to_sections_list( url )
        all_results[url] = {
            "title": title,
            "sections": sections
        }

all_results

https://tourismhuahin.com
https://contentcurator.com
https://frontanalytics.com
https://datawinners.com


{'https://tourismhuahin.com': {'title': 'Tourism Hua Hin – Discover Paradise',
  'sections': [{'url': 'https://tourismhuahin.com',
    'page_title': 'Tourism Hua Hin – Discover Paradise',
    'header': 'h1',
    'order': 0,
    'title': 'DISCOVER PARADISE'},
   {'url': 'https://tourismhuahin.com',
    'page_title': 'Tourism Hua Hin – Discover Paradise',
    'header': 'h1',
    'order': 1,
    'title': 'Embrace the Serenity and Bliss of Hua Hin'},
   {'url': 'https://tourismhuahin.com',
    'page_title': 'Tourism Hua Hin – Discover Paradise',
    'header': 'h2',
    'order': 0,
    'title': 'Beautiful Temple Complex and Beach in Khao Tao'},
   {'url': 'https://tourismhuahin.com',
    'page_title': 'Tourism Hua Hin – Discover Paradise',
    'header': 'h2',
    'order': 1,
    'title': 'Your Ultimate Guide to Thailand’s Enchanting Coastal Gem'},
   {'url': 'https://tourismhuahin.com',
    'page_title': 'Tourism Hua Hin – Discover Paradise',
    'header': 'h2',
    'order': 2,
    'title':
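
The nested dictionary works well as a cache but is awkward to inspect or share. As a rough sketch, the header records could be flattened into CSV text with the standard library. The `results_to_csv` helper and the sample data are made up for illustration; the sample only mirrors the shape of `all_results`.

```python
import csv
import io

def results_to_csv(all_results):
    """Flatten {url: {"title": ..., "sections": [...]}} into CSV text."""
    fieldnames = ["url", "page_title", "header", "order", "title"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for result in all_results.values():
        # each section is already a flat dictionary with the fields above
        for section in result["sections"]:
            writer.writerow(section)
    return buffer.getvalue()

# made-up sample in the same shape as all_results
sample = {
    "https://example.com": {
        "title": "Example",
        "sections": [
            {"url": "https://example.com", "page_title": "Example",
             "header": "h1", "order": 0, "title": "Hello"},
        ],
    }
}
print(results_to_csv(sample))
```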

In [50]:

# split the urls into batches of 10 to send to openai
from itertools import islice

def split_every(n, iterable):
    i = iter(iterable)
    piece = list(islice(i, n))
    while piece:
        yield piece
        piece = list(islice(i, n))
        
parts = list( split_every(10,list(all_results.keys())) )
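
To see what the batching produces with more than ten URLs, here is a small illustration. The function is repeated so the block runs on its own, and the 25 example URLs are made up.

```python
from itertools import islice

def split_every(n, iterable):
    """Yield successive lists of at most n items from iterable."""
    i = iter(iterable)
    piece = list(islice(i, n))
    while piece:
        yield piece
        piece = list(islice(i, n))

# 25 made-up urls split into batches of 10
example_urls = [f"https://example.com/page-{i}" for i in range(25)]
batches = list(split_every(10, example_urls))

print([len(b) for b in batches])  # [10, 10, 5]
```

The final batch holds whatever is left over, so no URL is dropped when the total is not a multiple of the batch size.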

In [109]:
# create openai api calls

all_calls = []

# the system message is the same for every batch
context = "You are an expert at categorizing data. Return a table of responses in markdown format where each row has the url and a column for the category which should represent the main content category for the url using the iab content taxonomy 3.0, another column which indicates true or false if this is sfw content (where nsfw includes but not limited to gambling, adult, illegal, hate/war etc), followed by a column that lists of at least  search queries where this page might appear, and finally one last column which is the primary overall topic or keyphrase to describe the entire url. The user will give a set of urls, you always reply using those urls and the associated information. only reply with a markdown table."

for batch in parts:

    prompt_array = ["""Create a markdown table to categorize the following urls. Use the following url data and the associated headers for that url:
"""]
    
    for url in batch:
        sections = all_results[url]['sections']
        prompt_array.append("")
        prompt_array.append("url: "+url)
        # the first header found (not the <title> tag) is used as the title line
        if sections:
            prompt_array.append("title: "+sections[0]['title'])

        for header in sections[0:5]:
            prompt_array.append(header['header']+": "+header['title'])

    # keep one prompt per batch; after the loop prompt_array holds only the last batch
    all_calls.append(prompt_array)

prompt_array

['Create a markdown table to categorize the following urls. Use the following url data and the associated headers for that url:\n',
 '',
 'url: https://tourismhuahin.com',
 'title: DISCOVER PARADISE',
 'h1: DISCOVER PARADISE',
 'h1: Embrace the Serenity and Bliss of Hua Hin',
 'h2: Beautiful Temple Complex and Beach in Khao Tao',
 'h2: Your Ultimate Guide to Thailand’s Enchanting Coastal Gem',
 'h2: Hello world!',
 '',
 'url: https://contentcurator.com',
 'title: Accelerate Your Content Strategy',
 'h1: Accelerate Your Content Strategy',
 'h2: Supercharge Your Content Strategy with AI and Data',
 'h2: Content Strategy Powered by Data and AI',
 'h2: Accelerate Your Content Strategy',
 'h3: Keyword Research',
 '',
 'url: https://frontanalytics.com',
 'title: AccelerateInnovation',
 'h1: AccelerateInnovation',
 'h2: Core Capabilities',
 'h2: Advanced Analytics: Why the Best Solutions come from Pragmatic Data Science',
 'h2: Latest Insights',
 '',
 'url: https://datawinners.com',
 'title: '

In [110]:
def openai_chatgen_for_datawinners(
        context,
        prompt,
        model = "gpt-3.5-turbo",
        max_length = 3000
    ):
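    """ A thin wrapper around the legacy (pre-1.0) ChatCompletion endpoint that returns the reply text.
    Note: max_length is accepted but not currently passed to the API. """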
    
    response = openai.ChatCompletion.create(
      model=model,
      messages=[
        {
          "role": "system",
          "content": context
        },
        {
          "role": "user",
          "content": prompt
        }
      ],
      temperature=1,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0
    )

    
    return response.choices[0]['message']['content']

In [111]:
# join the prompt lines with newlines
md_table = openai_chatgen_for_datawinners(context, "\n".join(prompt_array))

In [112]:
# display table

from IPython.display import display_markdown

display_markdown(md_table, raw=True)

| URL                                  | Category                  | SFW    | Search Queries                                                                                                             | Primary Topic                |
|--------------------------------------|---------------------------|--------|---------------------------------------------------------------------------------------------------------------------------|------------------------------|
| https://tourismhuahin.com/           | Travel and Tourism         | True   | Thailand travel, Hua Hin attractions, Beaches in Khao Tao, Thai coastal gem, Hua Hin tourism                                | Hua Hin tourism              |
| https://contentcurator.com/          | Marketing and Advertising | True   | Content strategy, AI in content strategy, Data-driven content strategy, Supercharge content strategy, Keyword research      | Content strategy             |
| https://frontanalytics.com/          | Analytics                  | True   | Accelerate innovation, Core capabilities, Advanced analytics, Pragmatic data science, Latest insights                     | Accelerate innovation        |
| https://datawinners.com/            | SEO and Analytics          | True   | Google Search Console into BigQuery, Niche case study, Top SEO niche Twitter accounts, Data analytics for SEO, Dashboard | SEO analytics                |
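
The model's reply is plain markdown text, so filtering or exporting the rows needs a parsing step first. Below is a minimal sketch, assuming the reply is a well-formed pipe table with a header row and no `|` characters inside cells. The model is not guaranteed to follow that format, and the `parse_markdown_table` helper and sample table are made up for illustration.

```python
def parse_markdown_table(md_text):
    """Parse a simple pipe-delimited markdown table into a list of dicts."""
    rows = []
    for line in md_text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # skip the |---|---| separator row (cells made only of dashes/colons)
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # ignore malformed rows whose cell count does not match the header
    return [dict(zip(header, cells)) for cells in body if len(cells) == len(header)]

# made-up sample in the same layout as the table above
sample_table = """
| URL | Category | SFW |
|-----|----------|-----|
| https://example.com/ | Travel | True |
| https://example.org/ | Gambling | False |
"""
records = parse_markdown_table(sample_table)
sfw_urls = [r["URL"] for r in records if r["SFW"].lower() == "true"]
print(sfw_urls)  # ['https://example.com/']
```

Every value comes back as a string, so flags such as the SFW column need an explicit comparison like the one above.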