In [0]:
# Library imports for this lesson

import pandas as pd
import re
from bs4 import BeautifulSoup
import requests

## What is NLP?

### NLP != Neuro-Linguistic Programming (at least for our purposes)

In Google searches you'll also see Neuro-Linguistic Programming come up as a common result when using the acronym NLP. Neuro-Linguistic Programming is a somewhat discredited psychotherapy and counseling technique that has retained some level of popularity with hypnotherapists and business management trainers. <https://en.wikipedia.org/wiki/Neuro-linguistic_programming>

### NLP == Natural Language Processing

"Natural" meaning - not computer languages but spoken/written human languages. The hard thing about NLP is that human languages are far less structured or consistent than computer languages. This is perhaps the largest source of difficulty when trying to get computers to "understand" human languages. How do you get a machine to understand sarcasm, and irony, and synonyms, connotation, denotation, nuance, and tone of voice --all without it having lived a lifetime of experience for context? If you think about it, our human brains have been exposed to quite a lot of training data to help us interpret languages, and even then we misunderstand each other pretty frequently. 

# Popular Applications of NLP: 

- Sentiment Analysis (Positive, Negative or Neutral)
- Chatbots (Natural Language Understanding + Natural Language Generation)
- Speech Recognition (dictation, Siri, "hey google", Amazon Echo, Cortana, video captioning, etc.)
  - Have to interpret the audio correctly
  - Have to interpret the words correctly after that.
- Machine Translation (google translate)
- Spell Check
- Keyword Search
- Spam Detection

## The challenges of unstructured language data: 

Up until this point in the class we have worked mostly with what would be called "tabular data" (arranged in a table of rows and columns) or "structured data": relational database tables, JSON, etc. Human language (text/audio) is one form of "unstructured" data. Computers are inherently bad at processing unstructured data. You give them Python or JavaScript and they'll interpret it in a split second without making a single mistake, but what about some of the following situations:

### Metaphors, idioms, etc.

> "Steph Curry was on fire last night. He totally destroyed the other team"

### Punctuation - (COMMAS SAVE LIVES!!)

> "Let's eat, Grandma!" vs "Let's eat Grandma!"

### Pronoun Resolution (requires context or past experience)

> "The thieves stole the paintings. They were subsequently sold."

> "The thieves stole the paintings. They were subsequently caught."

> "The thieves stole the paintings. They were subsequently found."

### Lack of Context (ambiguity yet again)

> "The lamb was ready to eat."

### Synonyms / Homonyms (Word Sense Disambiguation - can be different depending on if the language is written or spoken)

> "That guy is the best bass player I've ever seen!"

![Bass Player](http://vision.cs.uml.edu/old_projects_files/senses.png)

> "She threw up her dinner" vs. "She threw up her hands"

### The conventions of language are shifting under our feet

Consider:

![Emoji Story](https://static.freemake.com/blog/wp-content/uploads/2014/07/emoji-texts-2.jpg)

### My personal favorite:

> "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."

The above is in fact a grammatically correct English sentence with real meaning. It's hard even for native English speakers to understand, so computers would have a hard time with it too. <https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo> It's a little easier to understand if you know all of the word meanings and you have it read to you with the correct cadence and tone. If I read it a little bit differently, notice how I'm saying the same words but conveying meaning through tone.

### Buffalo (City / Entity):

<img src="http://pics2.city-data.com/city/maps/fr58.png" />

### buffalo (noun) - American bison?:

<img src="https://cdn.pixabay.com/photo/2014/11/11/13/52/bison-526805_1280.jpg" width="400" />

### buffalo (verb):

<https://youtu.be/L4eOhuLDfeU?t=115>

> "Buffalo buffalo ... (who) Buffalo buffalo buffalo... buffalo Buffalo buffalo." - Can you see how my tone of voice, rhythm, and pacing conveyed quite a lot of information? That's a tough one to parse from the raw text. I guess there are a lot of mean bison in New York that like to bully each other.

Almost all of these problems boil down to some level of ambiguity or lack of context. This is what we mean when we say unstructured data. Language often breaks its own rules.

![XKCD Pictographs](https://imgs.xkcd.com/comics/inflection_2x.png)

# Python String Basics


## Strings as a sequence of characters

In [0]:
test = "This is a test string"

# Strings are indexed like lists
print(test[0])
print(test[1])
print(test[2])

T
h
i


In [0]:
# You can also index strings from the back like lists
print(test[-1])
print(test[-2])
print(test[-3])

g
n
i


In [0]:
# You can loop over individual characters
for character in test:
  print(character)

T
h
i
s
 
i
s
 
a
 
t
e
s
t
 
s
t
r
i
n
g


In [0]:
# Strings are "immutable" in Python - can't be modified
# You can't assign characters in a string to change them.
# You would need to create a new string to do this.
hello_string = "hello"
hello_string[1] = "a"

TypeError: 'str' object does not support item assignment

## len() function

In [0]:
word = "woooooord"

print(len(word))

9


## Type conversion with str()

In [62]:
my_int = 4
my_float = 3.14159265

print(type(my_int))
print(my_int)
print(type(my_float))
print(my_float)

int_string = str(my_int)
float_string = str(my_float)

print('\n')
print(type(int_string))
print(int_string)
print(type(float_string))
print(float_string)

<class 'int'>
4
<class 'float'>
3.14159265


<class 'str'>
4
<class 'str'>
3.14159265


## Slice Notation with strings (`:` operator)

In [0]:
# DON'T RUN THIS
# You can use slice methods on strings just as you would a list
# The syntax goes something like this: 

a[start:stop]  # items start through stop-1 (or you can think start+1 to stop if not zero indexed)
a[start:]      # items start through the rest of the array
a[:stop]       # items from the beginning through stop-1
a[:]           # a copy of the whole array

In [0]:
# Where you see a colon say to yourself "go to" to help remember its function
print(test[10:14])
print(test[10:])
print(test[:14])

test
test string
This is a test


In [0]:
print(test[:-4])

# With the default step, slicing only goes left -> right
# This returns an empty string because the start is to the right of the stop
print(test[-2:-4])

# But this does
print(test[-4:-2])

This is a test st

ri
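
The left-to-right rule applies to the default step of 1. As a small sketch (the variable names are only for illustration), standard slice syntax also accepts a third `step` value, and a negative step walks the string from right to left:

```python
test = "This is a test string"

# The full slice syntax is a[start:stop:step]; step defaults to 1.
# With the default step, a start to the right of stop gives an empty string.
empty_slice = test[-2:-4]
print(repr(empty_slice))

# A negative step walks right -> left, so start must be to the right of stop.
backwards = test[-2:-4:-1]
print(backwards)

# Leaving out start and stop with a step of -1 reverses the whole string.
reversed_test = test[::-1]
print(reversed_test)
```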


In [0]:
hello_string = "hello"
# hello_string[1] = "a" # This doesn't work, remember?
# hello_string[1:2] = "a" # Well, this also doesn't work.

first_section = hello_string[:1]
last_section = hello_string[2:]

hallo_string = first_section + "a" + last_section
print(hallo_string)

# This is still not a great solution, we'll look at a better one in a minute.

hallo


## The `in` Operator

In [0]:
test = "This is a test string"

search_string = "test"

print(search_string in test)

search_string = "testing"

print(search_string in test)

True
False


## The `+` Operator (Concatenation)

In [0]:
no_space_phrase = "Combine" + "strings" + "into" + "one" + "."
print(no_space_phrase)

# When using "+" you have to add your own separators.
my_phrase = "Combine" + " " + "strings" + " " + "into" + " " + "one" + "."
print(my_phrase)

Combinestringsintoone.
Combine strings into one.


In [0]:
year = "2018"
month = "March"
day = "14"

pi_day = month + " " + day + ", " + year

print(pi_day)

March 14, 2018


## `.join(<iterable>)` 

In [0]:
word_list = ["These", "are", "some", "strings", "in", "a", "list"]

# If an individual string is indexable does that make this a 2d list?
print(word_list[3][2:4])
print(word_list[1:2]) # Returns a list
print(word_list[1:2][0]) #Gets the first item in that list
print(word_list[1:2][0][1:2]) # gets the second letter of the second item in word_list


# The .join() method concatenates the strings in a list, using the string it is called on as the separator
no_spaces = "".join(word_list)
print(no_spaces)

with_spaces = " ".join(word_list)
print(with_spaces)

comma_separated = ", ".join(word_list)
print(comma_separated)

ri
['are']
are
r
Thesearesomestringsinalist
These are some strings in a list
These, are, some, strings, in, a, list


## `.split(<sep>)`
The opposite of .join()

In [0]:
split_me = "These are some strings in a list"

words = split_me.split(" ")

print(words)

put_it_back = " ".join(words)

print(put_it_back+".")

['These', 'are', 'some', 'strings', 'in', 'a', 'list']
These are some strings in a list.
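
One detail worth seeing side by side: `.split(" ")` and `.split()` behave differently on irregular whitespace. The sketch below uses a made-up `messy` string and also shows the optional `maxsplit` argument:

```python
messy = "These  are some\tstrings"

# split(" ") splits on every single space, so a doubled space produces
# an empty string and the tab is left alone.
on_single_space = messy.split(" ")
print(on_single_space)

# split() with no argument splits on any run of whitespace.
on_any_whitespace = messy.split()
print(on_any_whitespace)

# maxsplit limits how many splits are made (here: only the first one).
first_and_rest = "key=value=with=equals".split("=", 1)
print(first_and_rest)
```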


In [0]:
# Turn a string into a list of characters

hello_string = "hello"

# hello_list = hello_string.split("") # This doesn't work either.
# print(hello_list)

hello_list = list(hello_string)
print(hello_list)

hello_list[1] = "a"

print(hello_list)

hallo_string = ''.join(hello_list)
print(hallo_string) 

['h', 'e', 'l', 'l', 'o']
['h', 'a', 'l', 'l', 'o']
hallo


## `.partition(<sep>)`

Say you want to split up a string but you don't want to remove the separator character.

In [0]:
email = 'ryan.allred@lambdaschool.com'

partitioned_email = email.partition("@")
print(partitioned_email)

('ryan.allred', '@', 'lambdaschool.com')
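
A short sketch of two related behaviors, using a hypothetical address: what `.partition()` returns when the separator is missing, and `.rpartition()`, which splits on the last occurrence instead of the first.

```python
email = "first.last@example.com"  # hypothetical address for illustration

# partition always returns a 3-tuple: (before, separator, after).
user, sep, domain = email.partition("@")
print(user, sep, domain)

# If the separator is missing, the whole string comes back first
# and the other two items are empty strings.
no_at = "not-an-email".partition("@")
print(no_at)

# rpartition splits on the LAST occurrence of the separator.
name, dot, tld = domain.rpartition(".")
print(name, tld)
```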


## `.upper()` and `.lower()` methods

In [0]:
name = "Ryan Allred"

print("Uppercase:", name.upper())

print("Lowercase:", name.lower())

Uppercase: RYAN ALLRED
Lowercase: ryan allred


# String Interpolation 

Inserting values into a string 

## Printing separated by commas

In [0]:
speed_limit = 70

print("The speed limit is", speed_limit, "miles per hour.")

The speed limit is 70 miles per hour.


## %s-formatting (the old-school way)

In [0]:
my_name = "Ryan"
print("My name is %s." %my_name)

pi = 3.14159265359
print("Pi is approximately %.3f" %pi)

My name is Ryan.
Pi is approximately 3.142


This method can get unwieldy if you have a lot of variables. You have to match arguments to inserts to follow what's happening. 

In [0]:
day = 14
month = "March"
year = 2018
forecast = "sunny"
high = 52.287
low = 33.543

print("On %s %s, %s the forecast was %s with a high of %.0f and a low of %.0f." %(month, day, year, forecast, high, low))

On March 14, 2018 the forecast was sunny with a high of 52 and a low of 34.


The syntax becomes particularly ugly if your values are stored in a dictionary.

In [0]:
weather_dict = {
    'day': 14,
    'month': 'March',
    'year': 2018,
    'forecast': 'sunny',
    'high': 52.287,
    'low': 33.543
}

print("On %s %s, %s the forecast was %s with a high of %.0f and a low of %.0f." 
      %(weather_dict['month'], weather_dict['day'], weather_dict['year'], 
        weather_dict['forecast'], weather_dict['high'], weather_dict['low']))

On March 14, 2018 the forecast was sunny with a high of 52 and a low of 34.


## `.format()` - The new old-school way - As of Python 3.0 (and 2.6)
More explicit, and that's good, but more verbose.

In [0]:
day = 14
month = "March"
year = 2018
forecast = "sunny"
high = 52.287
low = 33.543

print('''On {month} {day}, {year} the forecast was {forecast} with a high of 
      {high:.0f} and a low of {low:.0f}.'''.format(month=month, day=day, 
                                                   year=year, forecast=forecast, 
                                                   high=high, low=low))

On March 14, 2018 the forecast was sunny with a high of 
      52 and a low of 34.


You can omit the field names inside the braces to make things a little shorter, but then it's less explicit.

In [0]:
day = 14
month = "March"
year = 2018
forecast = "sunny"
high = 52.287
low = 33.543

print('''On {} {}, {} the forecast was {} with a high of {:.0f} and a low of 
      {:.0f}.'''.format(month, day, year, forecast, high, low))

On March 14, 2018 the forecast was sunny with a high of 52 and a low of 
      34.


Imagine if you wanted both explicitness and your values were stored in a dictionary. 

In [0]:
weather_dict = {
    'day': 14,
    'month': 'March',
    'year': 2018,
    'forecast': 'sunny',
    'high': 52.287,
    'low': 33.543
}

print('''On {month} {day}, {year} the forecast was {forecast} with a high of 
      {high:.0f} and a low of {low:.0f}.'''.format(month=weather_dict['month'], day=weather_dict['day'], 
                                                   year=weather_dict['year'], forecast=weather_dict['forecast'], 
                                                   high=weather_dict['high'], low=weather_dict['low']))

On March 14, 2018 the forecast was sunny with a high of 
      52 and a low of 34.


## f-strings - The new-school way - As of Python 3.6 

f-strings are great because they are both explicit and succinct.



In [0]:
weather_dict = {
    'day': 14,
    'month': 'March',
    'year': 2018,
    'forecast': 'sunny',
    'high': 52.287,
    'low': 33.543
}

print(f'''On {weather_dict['month']} {weather_dict['day']}, {weather_dict['year']} 
      the forecast was {weather_dict['forecast']} with a high of {weather_dict['high']:.0f} 
      and a low of {weather_dict['low']:.0f}.''')

On March 14, 2018 
      the forecast was sunny with a high of 52 
      and a low of 34.


In [0]:
# This works with any kind of string syntax:
dog_noise = 'Bark'

print(f'A dog says {dog_noise}!')
print(f"A dog says {dog_noise}!")
print(f'''A dog says {dog_noise}!''')

# You can also call functions within them which is pretty cool.
print('\n')
print(f'A dog says {dog_noise.upper()}!')

A dog says Bark!
A dog says Bark!
A dog says Bark!


A dog says BARK!
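
The `:.0f` format specs used with `%` and `.format()` carry over to f-strings. Here is a minimal sketch with made-up values showing a few other common specs (thousands separators, padding, and the `!r` conversion):

```python
price = 1234567.891
name = "Ada"  # hypothetical values for illustration

# Thousands separator and two decimal places.
formatted_price = f"{price:,.2f}"
print(formatted_price)

# Pad to a fixed width: > right-aligns, < left-aligns, ^ centers.
right_aligned = f"[{name:>8}]"
centered = f"[{name:^9}]"
print(right_aligned)
print(centered)

# !r inserts repr() of the value, which shows the quotes.
with_repr = f"name is {name!r}"
print(with_repr)

# Any expression is allowed inside the braces.
total = f"{2 + 3} items"
print(total)
```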


# Escape Characters and Whitespace in Text

Sometimes there are symbols in text that you don't want to be interpreted as their literal characters. To tell the Python interpreter that a character has special meaning, we precede it with a backslash `\`. The backslash says that whatever character comes after it has a special function.

## Escape Characters for Quotes

In [0]:
print("He said, \"I want double quotes inside of double quotes.\"")
print('He\'s also using single quotes inside of single quotes.')

He said, "I want double quotes inside of double quotes."
He's also using single quotes inside of single quotes.


## Escape Characters for Whitespace

In [0]:
print("single space")
print("---------------")
print("tabbed\tspace") # \t
print("---------------")
print("newline\ncharacter") # \n

single space
---------------
tabbed	space
---------------
newline
character


## Cleaning irregular whitespace.

In [0]:
spacey_sentence = "     This   is a \t\t sentence with \n   lots    of  \t spaces.     "
print(spacey_sentence)
print('-------------')
# Strip whitespace from the beginning and end of the string
print(spacey_sentence.strip())
print('-------------')
# Make all multiple spaces single spaces and remove beginning and end spaces
print(" ".join(spacey_sentence.split()))

     This   is a 		 sentence with 
   lots    of  	 spaces.     
-------------
This   is a 		 sentence with 
   lots    of  	 spaces.
-------------
This is a sentence with lots of spaces.
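
The same cleanup can be sketched with a regular expression, which this lesson covers further down. This version assumes only the standard library `re` module and should give the same result as the split/join idiom:

```python
import re

spacey_sentence = "     This   is a \t\t sentence with \n   lots    of  \t spaces.     "

# \s+ matches one or more whitespace characters (spaces, tabs, newlines).
# Replace each run with a single space, then trim both ends.
cleaned = re.sub(r"\s+", " ", spacey_sentence).strip()
print(cleaned)

# Compare against the split/join idiom.
same_as_split_join = cleaned == " ".join(spacey_sentence.split())
print(same_as_split_join)
```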


## Raw Strings

A Python raw string (also called a "raw string literal") is created by prefixing a string literal with ‘r’ or ‘R’. Python raw strings treat backslashes `\` as a literal character. This is useful when we want to have a string that contains backslash and don’t want it to be treated as an escape character.

In [0]:
greeting = 'Hi\nHello'
print(greeting)

Hi
Hello


In [0]:
greeting = r'Hi\nHello'
print(greeting)

Hi\nHello
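
To make the difference concrete, here is a small sketch comparing string lengths, plus a Windows-style path (the path itself is hypothetical):

```python
escaped = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n

print(len(escaped), len(raw))

# Raw strings are handy for Windows-style paths.
normal_path = "C:\\new_folder\\table.csv"  # backslashes must be doubled
raw_path = r"C:\new_folder\table.csv"      # written as-is
paths_match = normal_path == raw_path
print(paths_match)
```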


# Regular Expressions

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

Regular Expressions aren't specific to any single programming language but are a standalone syntax that is more or less standard among programming languages, so these skills will help you no matter the language that you're working with.

The only downside with regular expressions is that the syntax is pretty abstract, but with a little bit of practice **and a lot of references to cheat sheets** we can make it happen.

### Here are some cheat sheets for your reference: 

<https://www.debuggex.com/cheatsheet/regex/python>

<https://www.dataquest.io/blog/regex-cheatsheet>

In [0]:
import re

When using regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. In a raw string, backslashes are passed to the regex engine unchanged, so sequences such as `\d`, `\b`, and `\n` are read as regex syntax instead of being processed first as Python string escapes.
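
A quick sketch of why this matters, using the word-boundary token `\b` on a made-up sentence. In a normal string `"\b"` is the backspace character, so the pattern silently means something else:

```python
import re

sentence = "the cat sat on the category list"

# In a normal string "\b" is the backspace character (ASCII 8),
# so this pattern looks for backspace + "cat" + backspace.
normal_pattern = "\bcat\b"
normal_matches = re.findall(normal_pattern, sentence)
print(normal_matches)

# In a raw string the backslash survives, and the regex engine
# reads \b as a word boundary.
raw_pattern = r"\bcat\b"
raw_matches = re.findall(raw_pattern, sentence)
print(raw_matches)
```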

## Single Letter Matching

In [0]:
# matchObject = re.search(pattern, input_str)

# Search for the first lowercase letter
regex = r"[a-z]" 
match = re.search(regex, "June 24")

# Look at the match object
print(match)

# Look at the matching string
print(match[0])

# Look at slice indices and grab the match manually
print("Slice indices for this string:", match.start(), match.end())
print("June 24"[match.start():match.end()])

<_sre.SRE_Match object; span=(1, 2), match='u'>
u
Slice indices for this string: 1 2
u


In [0]:
# Search for the first uppercase letter
regex = r"[A-Z]" 

search_result = re.search(regex, "June 24")
print(search_result)

print(search_result[0])

<_sre.SRE_Match object; span=(0, 1), match='J'>
J


In [0]:
# Search for groups of 1 or more uppercase or lowercase letters (word things)
regex = r"[a-zA-Z]+" 

search_result = re.search(regex, "June 24")
print(search_result)

print(search_result[0])

<_sre.SRE_Match object; span=(0, 4), match='June'>
June


## Multiple Matches

In [0]:
# Search for groups of 1 or more letters
regex = r"[a-zA-Z]+" 

string = "June 24, other thing, February 6, doesn't match, March 14"

# .findall() returns a list of the matching strings
search_result = re.findall(regex, string)
print(search_result)

for match in search_result:
  print(match)

['June', 'other', 'thing', 'February', 'doesn', 't', 'match', 'March']
June
other
thing
February
doesn
t
match
March


In [0]:
# Search for groups of 1 or more letters
regex = r"[a-zA-Z]+" 

string = "June 24, other thing, February 6, doesn't match, March 14"

# finditer returns an iterator of match objects
search_result = re.finditer(regex, string)

for match in search_result:
  start = match.start()
  end = match.end()
  print(f"The string starts at {start} and ends at {end}")
  print(f"The string is {match[0]}")
#   print(f"The string is {string[start:end]}")
  print('--------------')

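The string starts at 0 and ends at 4
The string is June
--------------
The string starts at 9 and ends at 14
The string is other
--------------
The string starts at 15 and ends at 20
The string is thing
--------------
The string starts at 22 and ends at 30
The string is February
--------------
The string starts at 34 and ends at 39
The string is doesn
--------------
The string starts at 40 and ends at 41
The string is t
--------------
The string starts at 42 and ends at 47
The string is match
--------------
The string starts at 49 and ends at 54
The string is March
--------------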


## Capture Groups

In [0]:
# Capture something that matches date
regex = r"[a-zA-Z]+ \d+" 

string = "June 24, other thing, February 6, doesn't match, March 14"

search_result = re.findall(regex, string)
print(search_result)

for match in search_result:
  print(match)

['June 24', 'February 6', 'March 14']
June 24
February 6
March 14


In [0]:
# Capture something that matches date but only capture the month
regex = r"([a-zA-Z]+) \d+" 

string = "June 24, other thing, February 6, doesn't match, March 14"

search_result = re.findall(regex, string)
print(search_result)

for match in search_result:
  print(match)

['June', 'February', 'March']
June
February
March


In [58]:
# Capture both month and day separately
regex = r"([a-zA-Z]+) (\d+)" 

string = "June 24, other thing, February 6, doesn't match, March 14"

search_result = re.findall(regex, string)
print(search_result)

for match in search_result:
  print(match)

[('June', '24'), ('February', '6'), ('March', '14')]
('June', '24')
('February', '6')
('March', '14')
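
Groups can also be given names, which avoids having to remember their positions. The sketch below reuses the same pattern with the standard `(?P<name>...)` syntax; the group names `month` and `day` are my own choice:

```python
import re

# Same pattern as above, but each group has a name.
regex = r"(?P<month>[a-zA-Z]+) (?P<day>\d+)"
string = "June 24, other thing, February 6, doesn't match, March 14"

# A match object lets you look groups up by name.
first = re.search(regex, string)
print(first.group("month"), first.group("day"))

# groupdict() returns all named groups as a dictionary.
dates = [m.groupdict() for m in re.finditer(regex, string)]
print(dates)
```

A list of dictionaries like `dates` can be passed to `pd.DataFrame` in the same way as a list of tuples, with the group names becoming the column names.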


Lists of tuples are great because they go into dataframes easily!

In [61]:
df = pd.DataFrame(search_result, columns=['Month', 'Day'])
df.head()

|   | Month | Day |
|---|---|---|
| 0 | June | 24 |
| 1 | February | 6 |
| 2 | March | 14 |


# Regex Practice

In [0]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

lambdaschool.com
numpy.org
pythonprogramming.net

http://www.LambdaSchool.com
http://www.google.com
https://www.twitter.com

8014893087
801-489-8116
801.491.8037
345&123(1209?

Mr. Allred
Mrs. Smith
Ms. Todorojo
Mr Priest
Dr Strangelove

MetaCharacters that need to be escaped:
\.{[()*&+?|$
.$^*(){}[]\|+?

This is a sentence of text.
'''

## Search for the string "abc"

abc


## Search for the string "."

## Search for the string "lambdaschool.com"
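
One possible approach to the two searches above, sketched on a short stand-in string instead of the full `text_to_search`. The key point is that an unescaped `.` is a metacharacter, so a literal period needs a backslash:

```python
import re

sample = "numpy.org costs $0"  # small stand-in for text_to_search

# An unescaped dot matches any character except a newline.
any_char = re.findall(r".", sample)
print(len(any_char))

# Escaping the dot matches only literal periods.
literal_dots = re.findall(r"\.", sample)
print(literal_dots)

# re.escape() builds the escaped pattern for a longer literal string.
escaped_pattern = re.escape("lambdaschool.com")
print(escaped_pattern)

# The escaped pattern no longer matches "lambdaschoolXcom".
matches = re.findall(escaped_pattern, "visit lambdaschool.com or lambdaschoolXcom")
print(matches)
```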

## Match all digits

## Match anything that is not a digit

## Match any clumps of digits

## Match clumps of non-digits
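
A minimal sketch of the four digit-related searches above, run on a short made-up string: `\d` matches a digit, `\D` matches anything that is not a digit, and adding `+` groups consecutive matches into clumps.

```python
import re

sample = "801-489-8116 ext 42"  # stand-in for text_to_search

digits = re.findall(r"\d", sample)             # each digit separately
non_digits = re.findall(r"\D", sample)         # each non-digit character
digit_clumps = re.findall(r"\d+", sample)      # runs of one or more digits
non_digit_clumps = re.findall(r"\D+", sample)  # runs of one or more non-digits

print(digits)
print(non_digits)
print(digit_clumps)
print(non_digit_clumps)
```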

## Search for any letter, digit or underscore (then .join them)
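
A short sketch of this one on a made-up string: `\w` matches a letter, digit, or underscore, and `"".join()` stitches the individual matches back together.

```python
import re

sample = "Mr. Allred_01 (x)"  # stand-in for text_to_search

# \w matches one "word" character: a letter, a digit, or an underscore.
word_chars = re.findall(r"\w", sample)
print(word_chars)

# Joining the matches drops the spaces and punctuation.
joined = "".join(word_chars)
print(joined)
```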

## Beginning and End of string searches

In [0]:
text = "https://www.lambdaschool.com"

## Search for urls beginning with "http"
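
One way this could look, as a sketch: `^` anchors a pattern to the start of the string and `$` anchors it to the end. The multi-line example at the bottom uses a few lines borrowed from `text_to_search` and the standard `re.MULTILINE` flag.

```python
import re

text = "https://www.lambdaschool.com"

# ^ anchors the pattern to the start of the string.
starts_with_http = re.search(r"^http", text)
print(starts_with_http[0])

# $ anchors the pattern to the end of the string.
ends_with_com = re.search(r"com$", text)
print(ends_with_com.span())

# The same pattern finds nothing when "http" is not at the very start.
not_at_start = re.search(r"^http", "see https://www.lambdaschool.com")
print(not_at_start)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
lines = "http://www.google.com\nnumpy.org\nhttps://www.twitter.com"
http_lines = re.findall(r"^https?://\S+$", lines, flags=re.MULTILINE)
print(http_lines)
```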

## Search for phone numbers
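
A possible pattern for the phone numbers in `text_to_search`, offered as a sketch and not the only answer. It assumes a 3-3-4 digit layout with an optional `-` or `.` between the parts, which is my reading of the sample data:

```python
import re

sample = """
8014893087
801-489-8116
801.491.8037
345&123(1209?
"""

# Three digits, an optional "-" or ".", three digits, another optional
# separator, then four digits. \b keeps the match on word boundaries.
phone_regex = r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"
phone_numbers = re.findall(phone_regex, sample)
print(phone_numbers)
```

The last line of the sample is skipped because `&` and `(` are not in the allowed separator set.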

## Search for all URLs
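
A rough sketch for the URLs in `text_to_search`. The list of top-level domains is an assumption based on the sample text; real-world URL matching needs a much more permissive pattern.

```python
import re

sample = """
lambdaschool.com
http://www.LambdaSchool.com
http://www.google.com
https://www.twitter.com
Mr. Allred
"""

# Optional scheme, optional "www.", a name, a dot, then one of a short
# list of top-level domains (assumed from the sample data).
# (?:...) groups without capturing, so findall returns the whole match.
url_regex = r"(?:https?://)?(?:www\.)?\w+\.(?:com|org|net)\b"
urls = re.findall(url_regex, sample)
print(urls)
```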

### When you write the correct Regex pattern without looking at a cheat sheet:

![Wand](https://media1.tenor.com/images/770bc6dce2506cb3cbe8a8561eebf969/tenor.gif?itemid=13050462)

![You're a wizard Harry](https://media1.tenor.com/images/0f8588a031f9aaa157df4f519b65180e/tenor.gif?itemid=5453410)

# Web Scraping via BeautifulSoup and requests

In [0]:
from bs4 import BeautifulSoup
import requests

## Make the Soup

In [45]:
page_to_scrape = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm'

# This makes a request to the page just as if we were navigating
# to that page in the browser
page = requests.get(page_to_scrape)

# The "soup" is the parsed html of a webpage.
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><!-- InstanceBegin template="/Templates/template-collection.dwt" codeOutsideHTMLIsLocked="false" -->
<head><script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb-app41.us.archive.org';v.server_ms=390;archive_analytics.send_pageview({});});</script>
<script charset="utf-8" src="/static/js/ait-client-rewrite.js?v=1549496156" type="text/javascript"></script>
<script type="text/javascript">
WB_wombat_Init('https://web.archive.org/web', '20121007172955', 'www.nga.gov');
</script>
<script charset="utf-8" src="/static/js/wbhack.js?v=1549496156" type="text/javascript"></script>
<script type="text/javascript">
__wbhack.init('https://web.archive.o

## .find()

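.find() returns the first element that matches, along with everything nested inside it, as if it were a smaller soup. We can search by tag name, by id, or by class (written `class_` because `class` is a reserved word in Python).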

In [66]:
nav_stuff = soup.find(id='globalNav')
print(nav_stuff)

<ul id="globalNav">
<li><a href="/web/20121007172955/https://www.nga.gov/collection/index.shtm">The Collection</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/exhibitions/index.shtm">Exhibitions</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/ginfo/index.shtm">Plan a Visit</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/programs/index.shtm">Programs &amp; Events</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/onlinetours/index.shtm">Online Tours</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/education/index.shtm">Education</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/resources/index.shtm">Resources</a></li>
<li><a href="https://web.archive.org/web/20121007172955/http://shop.nga.gov/">Gallery Shop</a></li>
<li><a href="/web/20121007172955/https://www.nga.gov/support/index.shtm">Support the Gallery</a></li>
<li><a href="https://web.archive.org/web/20121007172955/https://images.nga.gov/en/page/show_hom

In [64]:
artist_name_list = soup.find(class_='BodyText')
print(artist_name_list)

<div class="BodyText">
<!-- InstanceBeginEditable name="BodyText" -->
<h3>Artist names beginning with Z</h3><table>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">Zabaglia, Niccola</a></td><td>Italian, 1664 - 1750</td></tr>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">Zaccone, Fabian</a></td><td>American, 1910 - 1992</td></tr>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">Zadkine, Ossip</a></td><td>French, 1890 - 1967</td></tr>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">Zaech, Bernhard</a></td><td>German, active c. 1650</td></tr>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">Zagar, Jacob</a></td><td>Flemish, c. 1530 - after 1580</td></tr>
<tr valign="top"><td><a href="/web/20121007172955/https://www.nga.gov/cg

In [47]:
artist_names = artist_name_list.find_all('a')
print(artist_names)

[<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">Zabaglia, Niccola</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">Zaccone, Fabian</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">Zadkine, Ossip</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">Zaech, Bernhard</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">Zagar, Jacob</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">Zagroba, Idalia</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">Zaidenberg, A.</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">Zaidenberg, Arthur</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">Zaisinger, Matthäus</a>, <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3450">Z

In [63]:
for artist_name in artist_names:
    print(artist_name)

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">Zabaglia, Niccola</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">Zaccone, Fabian</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">Zadkine, Ossip</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">Zaech, Bernhard</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">Zagar, Jacob</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">Zagroba, Idalia</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">Zaidenberg, A.</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">Zaidenberg, Arthur</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">Zaisinger, Matthäus</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3450">Zajac, Jack

In [50]:
#.text gets you the text inside of an html element
for artist_name in artist_names:
    print(artist_name.text) 

Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
Zaech, Bernhard
Zagar, Jacob
Zagroba, Idalia
Zaidenberg, A.
Zaidenberg, Arthur
Zaisinger, Matthäus
Zajac, Jack
Zak, Eugène
Zakharov, Gurii Fillipovich
Zakowortny, Igor
Zalce, Alfredo
Zalopany, Michele
Zammiello, Craig
Zammitt, Norman
Zampieri, Domenico
Zampieri, called Domenichino, Domenico
Zanartú, Enrique Antunez
Zanchi, Antonio
Zanetti, Anton Maria
Zanetti Borzino, Leopoldina
Zanetti I, Antonio Maria, conte
Zanguidi, Jacopo
Zanini, Giuseppe
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki
Zas-Zie
Zie-Zor
nextpage


In [53]:
artists = []
for artist_name in artist_names:
    artists.append(artist_name.text)
    
import pandas as pd

df = pd.DataFrame({'full_name': artists})
print(df.shape)
df.head()

(32, 1)


|   | full_name |
|---|---|
| 0 | Zabaglia, Niccola |
| 1 | Zaccone, Fabian |
| 2 | Zadkine, Ossip |
| 3 | Zaech, Bernhard |
| 4 | Zagar, Jacob |


In [55]:
df['first_name'] = df['full_name'].str.extract(r'(\w+$)', expand=True)
df.head(31)

|   | full_name | first_name |
|---|---|---|
| 0 | Zabaglia, Niccola | Niccola |
| 1 | Zaccone, Fabian | Fabian |
| 2 | Zadkine, Ossip | Ossip |
| 3 | Zaech, Bernhard | Bernhard |
| 4 | Zagar, Jacob | Jacob |
| 5 | Zagroba, Idalia | Idalia |
| 6 | Zaidenberg, A. | NaN |
| 7 | Zaidenberg, Arthur | Arthur |
| 8 | Zaisinger, Matthäus | Matthäus |
| 9 | Zajac, Jack | Jack |
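
Row 6 has no first name because `\w+$` requires the string to end in word characters, and "Zaidenberg, A." ends in a period. The sketch below reproduces that with plain `re` on three of the scraped names, then tries one alternative pattern. The alternative is only an assumption about what "first name" should mean here (everything after the comma):

```python
import re

full_names = ["Zabaglia, Niccola", "Zaidenberg, A.", "Zao Wou-Ki"]

# \w+$ needs the string to END in word characters, so the trailing
# period in "A." prevents any match.
strict_matches = [re.search(r"\w+$", name) for name in full_names]
strict = [m[0] if m else None for m in strict_matches]
print(strict)

# Alternative: capture everything after the comma and optional spaces.
# Names without a comma get no match at all.
comma_matches = [re.search(r",\s*(.+)$", name) for name in full_names]
after_comma = [m[1] if m else None for m in comma_matches]
print(after_comma)
```

Neither pattern handles every row well, which is typical of scraped data: "Zao Wou-Ki" has no comma, and the navigation links at the bottom of the list are not names at all.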


![XKCD Regex](https://imgs.xkcd.com/comics/regular_expressions.png)

# Additional Resources:

Python String Functions Overview:
<https://realpython.com/python-strings/>

Good Python String Overview:
<https://automatetheboringstuff.com/chapter6/>

Python Regex Overview:
<https://regexone.com/references/python>

Regex in Python with Corey Schafer:
<https://www.youtube.com/watch?v=K8L6KVGG-7o&t=120s>

Regex Practice:
<https://regexone.com/>

Breaking up a column into strings using regex in pandas:
<https://chrisalbon.com/python/data_wrangling/pandas_regex_to_create_columns/>

Beautiful Soup Documentation:
<https://www.crummy.com/software/BeautifulSoup/bs4/doc/>