# Week 3:  Web-as-Output!

Last week was dedicated to _consuming_ (or, perhaps, _gathering_) content **from** the web.

This week's notebook invites you into the world of _producing_ content for the web. The nice thing is that
  + the actual _producing_ happens in a scripting language
  + and then the _formatting_ for the web can be done automatically
  + whew!

#### <font style="color:rgb(180,120,10);"><b>Part 1</b>:  &nbsp; "real" webscraping...</font>

This problem bridges input from the web with output to the web. Last week's use of APIs found and interpreted **structured** data, mostly JSON.  (For pre-defined APIs, JSON is what's used, most of the time!)

What if a site has information you'd like to use but offers only HTML, not JSON? In this case, <tt>requests</tt> will provide the raw HTML (as a string), and it'll be up to us to extract the information we want! We'll use 
  + Python string-handling and <tt>string</tt> libraries, and
  + Python's _regular expression_ <tt>re</tt> library, a mini-language for string-matching and -manipulating.

First, an example.  We want to programmatically access the _best snacks_ on the <u>definitive snacks page</u>, which is [here at this url](https://www.cs.hmc.edu/~dodds/demo.html)

Alas, this snack-centric web service seems not to have a JSON API! We will have to grab the whole HTML text. HTML is always sent over as a huge string...

In [2]:
import requests

url = "https://www.cs.hmc.edu/~dodds/demo.html"
result = requests.get(url)
print(f"{result}")

<Response [200]>


In [3]:
# Let's print the text we just grabbed:
text = result.text
print(text)

<html>
  <head>
    <title>My streamlined website</title>
  </head>
  <body>
    <h1> Welcome! </h1>
    <h2> The best numbers </h2>

    <div id="numbers">
      <ol>
	<li class="answer"> 42 </li>
	<li class="question5"> 50 </li>
	<li class="yikes"> <a
  href="https://en.wikipedia.org/wiki/Rayo%27s_number">Rayo's number</a> </li>
      </ol>
    </div>

    <img src="./spam.jpg" height="84px">
    <br><br>

    <h2> The <s>best</s> only snacks </h2>

    <div id="coffee">
      <ul>
	<li class="latte"> Poptarts </li>
	<li class="latte"> Chocolate </li>
	<li class="dedecaf"> Coffee </li>
      </ul>
    </div>

    <img src="./alien.png" height="101px">

  </body>
</html>


<!--    <a href="./demo_cat.html">Aliens <3 cats!</a>  -->




Ooh... we notice that all of the snacks are inside <tt>&lt;li&gt;</tt> tags, which are _list items_

So, we can grab the location of every instance of the substring <tt>"&lt;li"</tt> &nbsp; 

Let's see/remember what the <tt>find</tt> method does:

In [4]:
"abcdefghijklmnopqrstuvwxy&jk".find("j",20)   # try 'a', 'j', 'hi', 'hit', and 'z'!  Also try a starting index, as in ('j', 20)

26
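
As a quick reference, here is a small sketch of how <tt>find</tt> behaves in the cases suggested above: it returns the index where the first match starts, or <tt>-1</tt> when there is no match, and an optional second argument sets where the search begins. The variable name `s` is just for illustration.

```python
s = "abcdefghijklmnopqrstuvwxy&jk"

print(s.find("a"))      # 0: the very first character
print(s.find("j"))      # 9: the first 'j'
print(s.find("hi"))     # 7: where the substring 'hi' starts
print(s.find("hit"))    # -1: 'hit' does not appear
print(s.find("z"))      # -1: neither does 'z'
print(s.find("j", 20))  # 26: the first 'j' at or after index 20
```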

Ok!  Let's 
  + find the first instance of <tt>"&lt;li"</tt> and 
  + find the first instance of <tt>"&lt;/li&gt;"</tt> and
  + print their indices and
  + print the string between them!

In [5]:
start_index = text.find("<li")
end_index = text.find("</li>")
print(f"{(start_index, end_index) }")
print()

# let's get the substring!
substring = text[ start_index : end_index ]
print(f"{substring}")

(169, 192)

<li class="answer"> 42 


Aha!  We got the number (42), but we got lots of stuff around it, too...

Let's repeat the process, further inward. 
  + Note that we're using <tt>substring</tt> not the whole entire <tt>text</tt>!
  + perhaps you see the character to find and slice away?!

In [6]:
end_index = substring.find(">")
print(f"{end_index }")
print()

# let's get the subsubstring! We're +1'ing this:
subsubstring = substring[ end_index+1 : ]
print(f"{subsubstring}")

18

 42 


<b>Remember, it's a string!</b>

In [7]:
answer = int(subsubstring)   # Now, it's an int!
print(f"{answer%100}")

42
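
Putting the last three cells together, one possible helper (the name `first_item_value` is hypothetical, not part of the notebook) finds the first list item, slices away its opening tag, and converts what remains. Note that <tt>int</tt> tolerates surrounding whitespace, so `" 42 "` converts without calling `strip` first.

```python
def first_item_value(html):
    """Return the integer inside the first <li ...> ... </li> of html.

    A sketch only: it assumes the first list item holds just a number.
    """
    start_index = html.find("<li")
    end_index = html.find("</li>", start_index)
    item = html[start_index:end_index]      # e.g. '<li class="answer"> 42 '
    inner = item[item.find(">") + 1:]       # e.g. ' 42 '
    return int(inner)                       # int() ignores the surrounding spaces


# a made-up, shortened version of the numbers list from the page
sample = '<ol> <li class="answer"> 42 </li> <li class="question5"> 50 </li> </ol>'
print(first_item_value(sample))
```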


One really nice thing about <tt>find</tt> is that it can also take a _starting point_ where it will start looking!

We really want to start _after_ the word <tt>"snacks"</tt> appears!  Let's find it:

In [8]:
snack_index = text.find("snacks")
print(f"{snack_index}")

442


Now, let's use the same technique as above to find the HTML _list item_ after <tt>snack_index</tt>
  + Notice the use of snack_index as the optional <tt>start</tt> input argument:

In [9]:
start_index = text.find("<li", snack_index)
end_index = text.find("</li>", snack_index)
print(f"{(start_index, end_index)}")
print()

# let's get the substring!
substring = text[ start_index : end_index ]
print(f"{substring}")

(490, 518)

<li class="latte"> Poptarts 


#### We should build a function!

In [10]:
"abc".find("a",-1)

-1
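
That `-1` result is worth a second look. A negative starting point counts from the end of the string, just as it does in slicing, so the search above examined only the final character `"c"`. A small sketch:

```python
s = "abc"

print(s.find("a", -1))  # -1: only "c" is searched, so "a" is not found
print(s.find("c", -1))  # 2: "c" is found, and reported at its usual index
print(s.find("a", -3))  # 0: a start of -3 reaches back to the beginning
```

So if a failed search result (`-1`) is passed along as the next starting point, the follow-up search looks only at the last character.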

In [11]:
def get_list_item(text, starting_pt=0):
    """ returns the <li ... </li> after starting_pt in text """
    start_index = text.find("<li", starting_pt)
    end_index = text.find("</li>", start_index)  # do you see why this is better?!
    # print(f"{(start_index, end_index) = }")
    substring = text[ start_index : end_index ]  # let's get the substring!
    return substring, start_index, end_index

if True:
    result0 = get_list_item(text)
    result442 = get_list_item(text,442)
    print(f" {result0} \n {result442 }")

 ('<li class="answer"> 42 ', 169, 192) 
 ('<li class="latte"> Poptarts ', 490, 518)
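
Here is one way to see why searching for `"</li>"` from `start_index` (rather than from `starting_pt`) is better. If `starting_pt` lands in the middle of a list item, the nearest closing tag belongs to that earlier item and comes *before* the next opening tag. The sketch below uses a made-up two-item string; `html` and the value of `starting_pt` are illustrative only.

```python
html = '<li class="a"> one </li> <li class="b"> two </li>'
starting_pt = 5   # in the middle of the first item

start_index = html.find("<li", starting_pt)   # the next opening tag: the second item

# Searching for the end from starting_pt finds the FIRST item's closing tag,
# which sits before start_index, so the slice comes out empty:
bad_end = html.find("</li>", starting_pt)
print(repr(html[start_index:bad_end]))

# Searching from start_index finds the closing tag that matches:
good_end = html.find("</li>", start_index)
print(repr(html[start_index:good_end]))
```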


#### We should write a loop!

In [12]:
starting_pt = text.find("snacks")

ListItems = []

while True:
    substring, start_index, end_index = get_list_item(text, starting_pt)
    if end_index != -1:  # if it's a valid substring
        ListItems.append(substring)
        print(substring)
        starting_pt = end_index  
    else:
        break



<li class="latte"> Poptarts 
<li class="latte"> Chocolate 
<li class="dedecaf"> Coffee 
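
The items collected above still carry their opening tags. A minimal clean-up sketch follows; the function name `clean_item` is hypothetical, and it assumes each item looks like the ones just printed (one opening tag followed by plain text).

```python
def clean_item(item):
    """Drop the opening <li ...> tag and the surrounding whitespace from one item."""
    close_bracket = item.find(">")           # the end of the opening tag
    return item[close_bracket + 1:].strip()  # keep what follows, trimmed


# the three items found by the loop above
ListItems = ['<li class="latte"> Poptarts ',
             '<li class="latte"> Chocolate ',
             '<li class="dedecaf"> Coffee ']

snacks = [clean_item(item) for item in ListItems]
print(snacks)
```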


<hr>

### Work on Homework 3 Problem 1.

<hr>

Onward to <b>Part 2</b>, writing your own web-engine (with regular expressions) 
  + We'll start by introducing _regular expressions_ - we'll see they provide a nice way to "grab" the <tt>&lt;li&gt;</tt> items from HTML...
  + In fact, they're a great toolset for pretty much ***any*** text-extraction at all!

  <br><br>

#### <font style="color:rgb(180,120,10);"><b>Part 2</b>  &nbsp; Regular Expressions: &nbsp;  A _better_ approach to list-item finding and extracting...</font>

The list-item example above used one function to find the items and another to "clean them up."
  + This is great! And it will work for absolutely anything you need (adding functions as you go...)

**However**, there is a very powerful "mini" pattern-matching language that can help with many text-processing tasks: ***regular expressions***
  + Sometimes called <tt>regex</tt>'es or <tt>re</tt>'s,
  + regular expressions are a very compact language for matching text patterns.
  + the Python library is <tt>re</tt>

Before unpacking the regex language, let's see it in action for the "handle list-item tags" challenge:

In [13]:
import re

# here is a small bit of the snack-page HTML text:
textpiece = """   pre_snacks  <li class="latte"> Poptarts </li>
                        <li class="latte"> Chocolate </li>  
                              <li class="dedecaf"> Coffee </li>   post_snacks
            """          

# here is an example substitution (sub) using a regular expression:

result = re.sub(r"snacks", "SNACKS!", textpiece)   # replaces "snacks" with "SNACKS!"  # let's try it with latte, too! with li?!
print(result)

   pre_SNACKS!  <li class="latte"> Poptarts </li>
                        <li class="latte"> Chocolate </li>  
                              <li class="dedecaf"> Coffee </li>   post_SNACKS!
            


In [14]:
# REs are a whole language! Let's see a more strategic use of regular expressions...

m = re.sub(r'<li.*>(.*)</li>', r'\1', textpiece )      # Yikes!    Common functions: findall, sub, search, match  

print(f"{m}")                                    # Wow!

   pre_snacks   Poptarts 
                         Chocolate   
                               Coffee    post_snacks
            


In [15]:
# REs are a whole language! Let's see a more strategic use of regular expressions...

m = re.findall(r'<li.*>(.*)</li>', textpiece )      # Yikes!    Common functions: findall, sub, search, match  

print(f"{m}")                                    # Wow!

[' Poptarts ', ' Chocolate ', ' Coffee ']


Regular expressions!  &nbsp;&nbsp; No turning back now...  😊

Let's build up to that large example above:

In [16]:
# Let's try some smaller examples to build up to the above (crazy!) example

# fundamental capability:  regex substitution  
#
#    the regex:
#      matcher:    replacer:   in this string:
re.sub(r"Harvey",  "Mildred",  "Harvey Mudd")           # the 'r' is for 'raw' strings. They're best for re's.

'Mildred Mudd'

In [17]:
re.findall(r"Harvey",  "Harvey Mudd (Harvey!)")       # findall works: not as interesting... (try H vs. h; add . or ..)

['Harvey', 'Harvey']

In [18]:
re.sub(r"car", "cat",  "This car is careful!")          # we'll stick with substitution for now...  uh oh!  try "car " (with a space) or count=1

'This cat is cateful!'

In [19]:
re.sub(r"d", "dd", "Harvey Mud")          # try "Mildred Mudd"

'Harvey Mudd'

In [20]:
# ANCHORS:  Patterns can be anchored:   $ means the _end_
re.sub(r"d$", "dd", "Mildred Mud" )   # $ signifies (matches) the END of the line

'Mildred Mudd'

In [21]:
# ANCHORS:  Patterns can be anchored:   ^  means the _start_ 
re.sub(r"^M", "ℳ", "Mildred Mudd" )   # ^ signifies (matches) the START of the line  (unicode M :)

'ℳildred Mudd'
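
A caveat on the word "line": by default, `^` and `$` anchor to the start and end of the *whole string* (with `$` also matching just before a final newline). To anchor at every line of a multi-line string, pass the `re.MULTILINE` flag (also spelled `re.M`). A short sketch with a made-up two-line string:

```python
import re

two_lines = "Mildred Mud\nHarvey Mud"

# Default: $ matches only at the end of the whole string
print(re.sub(r"d$", "dd", two_lines))

# With re.MULTILINE, $ matches at the end of every line
print(re.sub(r"d$", "dd", two_lines, flags=re.MULTILINE))
```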

In [22]:
# PLUS  +   means one or more:
re.sub(r"i+", "i", "Isn't the aliiien skiing this weekend? AiiiIIIiiiiIIIeee!" )   # try replacing with "" or "I" or "𝒾" or "ⓘ"

"Isn't the alien sking this weekend? AiIIIiIIIeee!"

In [23]:
# SquareBrackets  [iI]  mean any from that character group:
re.sub(r"[Ii]+", "i", "Isn't the aliiien skiing this weekend? AiiiIIIiiiiIIIeee!" )   # it can vary within the group!

"isn't the alien sking this weekend? Aieee!"

In [24]:
# SquareBrackets allow ranges, e.g., [a-z]
re.sub(r"[a-z]", "*", "Aha! You've FOUND my secret: 42!")       # use a +,  add A-Z, show \w, for "word" character

"A**! Y**'** FOUND ** ******: 42!"

In [25]:
# Let's try the range [0-9] and +
re.sub(r"[0-9]+", "42",  "Aliens <3 pets! They have 45 cats, 6 lemurs, and 789 manatees!")   # DISCUSS!  no +? How to fix?!

'Aliens <42 pets! They have 42 cats, 42 lemurs, and 42 manatees!'
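
Two things are worth noticing in that result. Without the `+`, each digit would be replaced separately; and even with it, the `3` in `<3` was replaced. One possible fix, sketched here, is a *negative lookbehind* `(?<!<)`, which is regex syntax beyond what this notebook has introduced: it lets a match begin only where the preceding character is not `<`. The `\b` (a word boundary) keeps a match from starting in the middle of a number.

```python
import re

sentence = "Aliens <3 pets! They have 45 cats, 6 lemurs, and 789 manatees!"

# No +: every single digit becomes its own 42
no_plus = re.sub(r"[0-9]", "42", sentence)
print(no_plus)

# With +, but skipping any number that directly follows "<"
fixed = re.sub(r"(?<!<)\b[0-9]+", "42", sentence)
print(fixed)
```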

Ok! &nbsp;&nbsp; Let's expand our thought experiments:

In [26]:
re.sub( r"or", "and", "words or phrases" )
re.sub( r"s", "-", "words or phrases" )
re.sub( r"[aeiou]", "-", "words or phrases" )

re.sub( r"$", " [end]", "words or phrases" )
re.sub( r"^", "[start] ", "words or phrases" )

# Challenge! The dot . matches _any_ single character:  
re.sub( r".", "-", "words or phrases" )   # What will this do?

re.sub( r".s", "-S", "words or phrases" )  # And this one?!

re.sub( r".+s", "-S", "words or phrases" )  # And this one?!!

'-S'
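
A notebook cell displays only the value of its *last* expression, which is why just `'-S'` appears above. To compare all of the results with your predictions, one option is to loop over the patterns and print each result. The names `experiments` and `results` exist only for this sketch:

```python
import re

phrase = "words or phrases"

# (pattern, replacement) pairs, in the same order as the cell above
experiments = [
    (r"or", "and"),
    (r"s", "-"),
    (r"[aeiou]", "-"),
    (r"$", " [end]"),
    (r"^", "[start] "),
    (r".", "-"),
    (r".s", "-S"),
    (r".+s", "-S"),
]

results = []
for pattern, replacement in experiments:
    result = re.sub(pattern, replacement, phrase)
    results.append(result)
    print(f"{pattern!r:>12} -> {result!r}")
```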

#### Let's try a _physical_ matching of these examples above...

There is one more "common" regular expression element. &nbsp;&nbsp; The star * means "zero or more" of what precedes it...

It's similar to the plus + (which means 1 or more), _but * also allows for 0 times_ !  &nbsp;&nbsp; This can be mind-bending...

In [27]:
# The * matches zero-or-more of the preceding pattern  Let's try it:

re.sub( r" [0-9]+ ", " # ", 'The safe combination is 42 - 47 - 101 - 46 ' )   # try with *, and remove the 101 ... 

'The safe combination is # - # - # - # '
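
To see the difference between `+` and `*` concretely, here is the experiment suggested in the comment above, as a sketch: the `101` is removed, so two spaces sit side by side. With `+`, that gap has no digits and is left alone; with `*`, zero digits is acceptable, so the gap matches too.

```python
import re

combo = 'The safe combination is 42 - 47 -  - 46 '   # note the empty slot

with_plus = re.sub(r" [0-9]+ ", " # ", combo)   # one-or-more digits: the empty slot stays empty
print(with_plus)

with_star = re.sub(r" [0-9]* ", " # ", combo)   # zero-or-more digits: the empty slot matches, too
print(with_star)
```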

####   Ok!  Let's switch to a more <font color="DodgerBlue"><b>hands-on</b></font> medium: &nbsp;&nbsp; _paper!_

... to try out our ``"alabama"`` and ``"Google"`` regular-expression challenges... :) 

<br><br>

We now have ***almost*** everything in that list-item-handling example from a while back. 

Let's take a look -- and add the idea of a _capture group_   &nbsp;&nbsp; (using parens)

In [28]:
textpiece = """   pre_snacks  <li class="latte"> Poptarts </li> 
                        <li class="latte"> Chocolate </li>  
                        <li class="dedecaf"> Coffee </li>   post_snacks
            """    

m = re.findall(r'<li.*>(.*)</li>',textpiece )      # the parens are a "capture group"   # try w/o it  # try search & sub
                                                   # each set of parens "captures" the text inside it
print(f"{m}")                                   # it can even be used later, as \1, \2, \3, etc. 

[' Poptarts ', ' Chocolate ', ' Coffee ']
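
One caution about this pattern: `.*` is *greedy* (it matches as much as it can), and `.` does not match a newline. The pattern works on `textpiece` because each list item sits on its own line. If the items share a line, the greedy version captures only the last one. Adding `?` after the star, a detail not covered above, makes it *non-greedy* (matching as little as possible). The string `one_line` is made up for this sketch:

```python
import re

one_line = '<li class="latte"> Poptarts </li> <li class="dedecaf"> Coffee </li>'

# Greedy: the first .* runs as far right as it can
greedy = re.findall(r'<li.*>(.*)</li>', one_line)
print(greedy)

# Non-greedy: each .*? stops at the first place the rest of the pattern fits
non_greedy = re.findall(r'<li.*?>(.*?)</li>', one_line)
print(non_greedy)
```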


We'll practice more with these in lab on Tuesday evening:
+ we have a series of warm-up challenges...
+ and then some example "serious" regex applications...
+ and some not-so-serious ones, too (crosswords)...

As a prelude, let's do one "serious" example together: ***finding an email address***

In [29]:
filepiece = """   pre_emails  zdodds@gmail.com   dodds@cs.hmc.edu
                        Ran.Libeskind-Hadas@claremontmckenna.edu  hadas@cmc.edu  JaSoN_KeLleR@cCrAb.pacific.net  
                        <a href="mailto:Jason.Keller@ClaremontMcKenna.edu">Email Prof. Keller@this link!</a>  post_emails
            """  

m = re.findall(r' .*gmail.* ',filepiece, re.M)     # this is wrong, but a start!
                                                   
print(f"{m}")                                   # let's build step-by-step... 

['   pre_emails  zdodds@gmail.com   ']
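
For reference, here is one possible end point for that step-by-step build, as a sketch rather than a complete email matcher (real address rules are far more permissive). It assumes an address is a run of word characters, dots, or hyphens, then `@`, then a similar run that ends in a dot and a final word.

```python
import re

filepiece = """   pre_emails  zdodds@gmail.com   dodds@cs.hmc.edu
                        Ran.Libeskind-Hadas@claremontmckenna.edu  hadas@cmc.edu  JaSoN_KeLleR@cCrAb.pacific.net
                        <a href="mailto:Jason.Keller@ClaremontMcKenna.edu">Email Prof. Keller@this link!</a>  post_emails
            """

# [\w.-]+   one or more word characters, dots, or hyphens (the user name)
# @         a literal at-sign
# [\w.-]+   the host name, which may itself contain dots
# \.\w+     a final dot and an ending such as .com or .edu
emails = re.findall(r"[\w.-]+@[\w.-]+\.\w+", filepiece)
print(emails)
```

Because the pattern requires a dot after the `@`, the stray `Keller@this` in the link text is not reported.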


The next cell has the _starting markdown_ for our **markdown-to-markup** web engine.  

Because the next cell ***is*** markdown -- and it's in a notebook _with_ a markdown engine -- you'll see the markup, as usual!
+ As usual, you can see the markdown by double-clicking the cell
+ It's also available as a Python string in the following cell...

# Claremont's Colleges

The Claremont Colleges are a *consortium* of **five** SoCal institutions. <br>
We list them here.

## The 5Cs: a list
+ [Pomona](https://www.pomona.edu/)
+ [CMC](https://www.cmc.edu/)
+ [Pitzer](https://www.pitzer.edu/)
+ [Scripps](https://www.scrippscollege.edu/)
+ [HMC](https://www.hmc.edu/)

The above's an _unordered_ list.  <br>
At the 5Cs, we all agree there's __no__ order!

---

## Today's featured college: [CMC](https://coloradomtn.edu/)

<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS_campusview_1000x627.jpg" height=160>

--

### Also featured: &nbsp; Scripps and Pitzer and Mudd

<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w=1430&ssl=1" height=100> &nbsp; 
<img src="https://www.pitzer.edu/communications/wp-content/uploads/sites/17/2023/09/clock-tower.png" height=100> &nbsp; 
<img src="https://www.hmc.edu/about/wp-content/uploads/sites/2/2020/02/campus-gv.jpg" height=100>

Are there _other_ schools in Claremont?

### Claremont destinations
+ _Pepo Melo_, a fantastic font of fruit!
+ **Starbucks**, the center of Claremont's "city," not as good as Scripps's _Motley_ 
+ ***Sancho's Tacos***, the village's newest establishment
+ ~~In-and-out Burger~~ (not in Claremont, alas, but close! CMC-supported!)
+ `42`nd Street Bagel, an HMC fave, definitely _well-numbered_
+ Trader Joe's, providing fuel for the walk back to Pitzer _from Trader Joe's_

---

#### Regular Expression Code-of-the-Day 
`import re`               
`pet_statement = re.sub(r'dog', 'cat', 'I <3 dogs')`

#### New Construction of the ~~Day~~ _Decade_!

<img src="https://www.cs.hmc.edu/~dodds/roberts_uc.png" height=150> <br><br>

CMC's **_Roberts Science Center_, also known as _"The Rubik Cube"_** <br>
Currently under construction, under water, and undeterred by the SoCal rain... 

<br><br>


In [30]:
#
# Here is a code cell, with the entire first-draft markdown of the previous cell stored in a Python variable,   original_markdown 
#

original_markdown = """

# Claremont's Colleges

The Claremont Colleges are a *consortium* of **five** SoCal institutions. <br>
We list them here.

## The 5Cs: a list
+ [Pomona](https://www.pomona.edu/)
+ [CMC](https://www.cmc.edu/)
+ [Pitzer](https://www.pitzer.edu/)
+ [Scripps](https://www.scrippscollege.edu/)
+ [HMC](https://www.hmc.edu/)

The above's an _unordered_ list.  <br>
At the 5Cs, we all agree there's __no__ order!

---

## Today's featured college: [CMC](https://coloradomtn.edu/)

<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS_campusview_1000x627.jpg" height=160>

--

### Also featured: &nbsp; Scripps and Pitzer and Mudd

<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w=1430&ssl=1" height=100> &nbsp; 
<img src="https://www.pitzer.edu/communications/wp-content/uploads/sites/17/2023/09/clock-tower.png" height=100> &nbsp; 
<img src="https://www.hmc.edu/about/wp-content/uploads/sites/2/2020/02/campus-gv.jpg" height=100>

Are there _other_ schools in Claremont?

### Claremont destinations
+ _Pepo Melo_, a fantastic font of fruit!
+ **Starbucks**, the center of Claremont's "city," not as good as Scripps's _Motley_ 
+ ***Sancho's Tacos***, the village's newest establishment
+ ~~In-and-out Burger~~ (not in Claremont, alas, but close! CMC-supported!)
+ `42`nd Street Bagel, an HMC fave, definitely _well-numbered_
+ Trader Joe's, providing fuel for the walk back to Pitzer _from Trader Joe's_

---

#### Regular Expression Code-of-the-Day 
`import re`               
`pet_statement = re.sub(r'dog', 'cat', 'I <3 dogs')`

#### New Construction of the ~~Day~~ _Decade_!

<img src="https://www.cs.hmc.edu/~dodds/roberts_uc.png" height=150> <br><br>

CMC's **_Roberts Science Center_, also known as _"The Rubik Cube"_** <br>
Currently under construction, under water, and undeterred by the SoCal rain... 

<br><br>

"""

In [31]:
#
# here is a function to write a string to a file (default name: output.html)
#

def write_to_file(contents, filename="output.html"):
    """ writes the string contents to the file filename """
    with open(filename, "w") as f:    # the with-block closes the file, even if an error occurs
        print(contents, file=f)

#### <b>Your task</b> is to create a set of functions that create a markdown-to-markup transformer!
+ <b>including</b> at least these existing markdown features: headers, bold, italic, strikethrough (for Toby!), url-links, and item-lists
+ <b>and you should design</b> at least three new markdown-features of your own. <font size="-2">(This is ***modern*** markdown, not that stodgy markdown from the 90's!)</font>
+ The assignment page has several suggestions. You'll add to the markdown source to show off your new features (and customize)

<hr>

To get started, the following cells have a couple of example transformations: 
+ how to convert all of the newlines to <tt>&lt;br&gt;</tt>
+ how to handle the <tt># </tt>  top-level headers, which use <tt>&lt;h1&gt;</tt> and  <tt>&lt;/h1&gt;</tt> around their contents
+ how to handle fixed-width (<tt>code-type</tt>) text, which converts backticks <tt>`</tt> to <tt>&lt;tt&gt;</tt>, e.g., <tt>&#96;code&#96;</tt> to <tt>&lt;tt&gt;code&lt;/tt&gt;</tt>

The overall transformer cell then writes the result out to a file. 
+ Reload it directly in a browser to see how well it's doing.
+ Then, dive into the other changes...

In [32]:
#
# overall markdown-to-markup transformer
#

contents_v0 = original_markdown              # here is the input - be sure to run the functions, below:

# contents_v1 = handle_newlines(contents_v0)   #   blank lines to <br>
# contents_v2 = handle_headers(contents_v1)    #   # title to <h1>title</h1>  (more needed: ## to <h2>, ... up to <h6>)
# contents_v3 = handle_code(contents_v2)       #   `code` to <tt>code</tt>

final_contents = contents_v0                 # here is the output

write_to_file(final_contents, "output.html") # now, written to file:  Reload it in your browser!

In [34]:
# we can print the final output's source
print(final_contents)    # Usually, first reload output.html in your browser - this can help debug...



# Claremont's Colleges

The Claremont Colleges are a *consortium* of **five** SoCal institutions. <br>
We list them here.

## The 5Cs: a list
+ [Pomona](https://www.pomona.edu/)
+ [CMC](https://www.cmc.edu/)
+ [Pitzer](https://www.pitzer.edu/)
+ [Scripps](https://www.scrippscollege.edu/)
+ [HMC](https://www.hmc.edu/)

The above's an _unordered_ list.  <br>
At the 5Cs, we all agree there's __no__ order!

---

## Today's featured college: [CMC](https://coloradomtn.edu/)

<img src="https://ygzm5vgh89zp-u4384.pressidiumcdn.com/wp-content/uploads/2017/06/GWS_campusview_1000x627.jpg" height=160>

--

### Also featured: &nbsp; Scripps and Pitzer and Mudd

<img src="https://i0.wp.com/tsl.news/wp-content/uploads/2018/09/scripps.png?w=1430&ssl=1" height=100> &nbsp; 
<img src="https://www.pitzer.edu/communications/wp-content/uploads/sites/17/2023/09/clock-tower.png" height=100> &nbsp; 
<img src="https://www.hmc.edu/about/wp-content/uploads/sites/2/2020/02/campus-gv.jpg" height=100>

Are there _

In [33]:
# here is a function to handle blank lines (making them <br>)
#

def handle_newlines(contents):
    """ replace every blank (whitespace-only) line with an HTML line break <br> """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        new_line = re.sub(r"^\s*$", r"<br>", line)  # if it has only space characters, \s, we make an HTML newline <br>
        NewLines.append(new_line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents

if True:
    contents = "# Title  \n    \n# Another title"
    new_contents = handle_newlines(contents)
    print(new_contents)

# Title  
<br>
# Another title


In [34]:
# here is a function to handle headers - right now only h1 (top-level)
#

def handle_headers(contents):
    """ replace # headers with <h1>  (to be extended: ##, ###, ... ###### to <h2>, <h3>, ... <h6>) """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        new_line = re.sub(r"^# (.*)$", r"<h1>\1</h1>", line)  # capture the contents and wrap with <h1> and </h1>
                                                              # Aha! You will be able to handle the other headers here!
        NewLines.append(new_line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents

if True:
    contents = "# Title  \n<br>\n# Another title"
    new_contents = handle_headers(contents)
    print(new_contents)

<h1>Title  </h1>
<br>
<h1>Another title</h1>
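
One way to extend this to all six header levels, offered as a sketch and not the only design: capture the run of `#` characters in its own group and let a small replacement function choose the tag from its length. The function name `handle_all_headers` is hypothetical; `re.sub` accepts a function as its replacement argument and calls it with each match object.

```python
import re

def handle_all_headers(contents):
    """Replace #, ##, ... ###### headers with <h1>, <h2>, ... <h6> (a sketch)."""

    def header_tag(match):
        level = len(match.group(1))       # the number of # characters, 1 to 6
        title = match.group(2)            # the header text
        return f"<h{level}>{title}</h{level}>"

    NewLines = []
    for line in contents.split("\n"):
        # group 1: one to six #'s;  then a space;  group 2: the header text
        new_line = re.sub(r"^(#{1,6}) (.*)$", header_tag, line)
        NewLines.append(new_line)
    return "\n".join(NewLines)


demo = "# Title\n## The 5Cs: a list\n#### Regular Expression Code-of-the-Day\nplain text"
print(handle_all_headers(demo))
```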


In [35]:
# here is a function to handle code - using markdown backticks
#

def handle_code(contents):
    """ replace all of the backtick content with <tt> </tt> """
    NewLines = []
    OldLines = contents.split("\n")

    for line in OldLines:
        # capture the contents and wrap with <tt> and </tt>
        # (.*? is non-greedy, so two code spans on one line stay separate)
        new_line = re.sub(r"`(.*?)`", r"<tt>\1</tt>", line)
        NewLines.append(new_line)

    new_contents = "\n".join(NewLines)   # join with \n characters so it's readable by humans
    return new_contents

if True:
    contents = "This is `42`   \n<br> \n Our library:  `import re`"
    new_contents = handle_code(contents)
    print(new_contents)

This is <tt>42</tt>   
<br> 
 Our library:  <tt>import re</tt>
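
The same capture-group idea extends to other features. As one more hedged sketch, here is a possible transformation for url-links, turning `[text](url)` into an anchor tag. The function name `handle_links` is hypothetical, and the pattern assumes the link text contains no `]` and the url contains no `)`.

```python
import re

def handle_links(contents):
    """Replace markdown links [text](url) with <a href="url">text</a> (a sketch)."""
    NewLines = []
    for line in contents.split("\n"):
        # group 1: the text between [ and ];  group 2: the url between ( and )
        new_line = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r'<a href="\2">\1</a>', line)
        NewLines.append(new_line)
    return "\n".join(NewLines)


demo_links = "+ [Pomona](https://www.pomona.edu/)\n+ [HMC](https://www.hmc.edu/)"
print(handle_links(demo_links))
```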


### _Example_ &nbsp;&nbsp; Superbowl prediction based on the _"wisdom of the web"_ ...

In [36]:
import requests

def get_wikipedia_page(url):
    """ returns the lowercased text of the wikipedia page, provided the page name (the part of the url after /wiki/) """
    url = "https://en.wikipedia.org/wiki/" + url
    result = requests.get(url)
    # print(result)
    text = result.text.lower()
    # print(f"{len(text) = }")
    return text

if True:
    url = "Pittsburgh_Steelers"
    # url = "Kansas_City_Chiefs"
    # url = "Pomona_College"
    text_team = get_wikipedia_page(url)
    print(f"{len(text_team) }")

664037


In [37]:
import re 

def team_colors(text_team):
    """ returns a list of the team's (or school's) colors, or [] if no colors are found """

    s = "team colors"
    i = text_team.find(s)
    if i == -1:
        print("No team colors... checking for school colors")
        s = "school colors"
        i = text_team.find(s)
        if i == -1:
            print("No colors at all!")
            return []
    # sequence of strings to find the colors...
    i2 = text_team.find("infobox-data",i+1)
    i3 = text_team.find(">",i2+1)
    i4 = text_team.find("<",i3+1)
    if i4 == i3 + 1:  # but, some have a different format!
        i2 = text_team.find("&#160;",i4+1)
        i2 = text_team.find("&#160;",i2+1)
        i3 = text_team.find(">",i2+1)
        i4 = text_team.find("<",i3+1)
        print(f"{(i3,i4) }")
    
    # got it!  Let's substitute to be comma-separated:
    s = text_team[i3+1:i4]
    s = re.sub("and",",",s)
    s = re.sub("&amp;",",",s)
    L = s.split(",")
    L = [s.strip() for s in L]
    # print(L)
    return L

if True:
    result = team_colors(text_team)
    print(f"{result }")

['black', 'gold']
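
One fragile spot in `team_colors` is `re.sub("and",",",s)`, which would also rewrite the letters `and` inside a longer word such as `sandy`. A possible alternative, sketched below with a hypothetical helper name `split_colors`, splits on commas, `&amp;`, or the whole word `and` (the `\b` markers are word boundaries) and drops any empty pieces.

```python
import re

def split_colors(s):
    """Split a colors string on commas, '&amp;', or the whole word 'and' (a sketch)."""
    pieces = re.split(r",|&amp;|\band\b", s)
    return [p.strip() for p in pieces if p.strip()]   # trim, and skip empty pieces


print(split_colors("black and gold"))
print(split_colors("sandy brown &amp; white"))
print(split_colors("red, white, and blue"))
```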


In [38]:
import requests

url = "https://www.thetoptens.com/colors/top-ten-favorite-colors/"  # uh oh!  now dynamically loading...
url = "https://web.archive.org/web/20211130204459/https://www.thetoptens.com/colors/top-ten-favorite-colors/"
result = requests.get(url)
print(result)
text_colors = result.text.lower()
print(f"{len(text_colors) }")

<Response [200]>
98464


In [39]:
def color_score(color, text_colors):
    """ returns the top-ten-color score, or 0 if not in the top-ten list... """
    color = color.lower()
    s = f'<b>{color}</b>'
    i = text_colors.find(s)
    if i == -1: return 0  # not found

    i2 = text_colors.rfind("<",0,i-2)  # looks left! (will be one to the right of the numeric "score")
    i3 = text_colors.rfind(">",0,i2)   # looks left! (will be one to the left of the numeric "score")
    top_ten_color_score = int(text_colors[i3+1:i2])
    return top_ten_color_score
    # print(f"{i3 =}")
    # print(text_colors[i3+1:i2])  # this one!
    # print(text_colors[i3+1:i2+25])

if True:
    color = "gold"
    score = color_score(color, text_colors)
    print(f"{score} for {color}")

8 for gold
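
`rfind` works like `find` but searches from the right, and its optional second and third arguments bound the region that is searched. Here is the same look-left idea on a tiny made-up snippet; the archived page's actual markup is not shown in this notebook, so the `snippet` string is purely an assumption for illustration.

```python
snippet = '<span>21</span><b>violet</b>'

i = snippet.find('<b>violet</b>')     # where the color's tag starts
i2 = snippet.rfind("<", 0, i - 2)     # nearest "<" to the left: just after the score
i3 = snippet.rfind(">", 0, i2)        # nearest ">" left of that: just before the score

print((i, i2, i3))
score = int(snippet[i3 + 1:i2])       # the characters between them are the score
print(score)
```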


In [40]:
def color_scores(colorL, text_colors):
    """ returns the sum of the scores for the colors in colorL """
    total = 0
    for color in colorL:
        color = color.lower()
        score = color_score(color, text_colors)
        total += score
        print(f" {color:>15s} : {score}")
    return total

if True:
    colorL = ["violet", "Red"]
    total = color_scores(colorL, text_colors)
    print(f"{total}")

          violet : 21
             red : 2
23


In [41]:
#
# So, who wins?!
#

#
# here are some urls to compare
#

"San_Francisco_49ers"
"Pittsburgh_Steelers"
"Philadelphia_Eagles"
"Kansas_City_Chiefs"
team1 = "Pomona_College"
team2 = "Pitzer_College"
"Claremont_McKenna_College"
"Scripps_College"
"Harvey_Mudd_College"

team1_wikip = get_wikipedia_page(team1)
team2_wikip = get_wikipedia_page(team2)

color_list1 = team_colors(team1_wikip)
color_list2 = team_colors(team2_wikip)

print("+"*42)
print(f"+++ For {team1}")
final_score1 = color_scores(color_list1,text_colors)
print(f"+++ total score: {final_score1}\n")
print("+"*42)
print(f"+++ For {team2}")
final_score2 = color_scores(color_list2,text_colors)
print(f"+++ total score: {final_score2}")



No team colors... checking for school colors
No team colors... checking for school colors
++++++++++++++++++++++++++++++++++++++++++
+++ For Pomona_College
            blue : 1
           white : 9
+++ total score: 10

++++++++++++++++++++++++++++++++++++++++++
+++ For Pitzer_College
          orange : 6
           white : 9
+++ total score: 15
