Scraping Webpages with BeautifulSoup
====================================

Let's try to get a list of the release years of all of Amitabh Bachchan's movies! If you don't know, he's kind of the Sean Connery of India.

BeautifulSoup parses webpages you have downloaded and lets you search them for specific HTML elements. You can use this ability to scrape data out of a webpage, or a series of webpages. It copes well with messy real-world HTML. Its [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a handy reference.

Getting the Content
------------------
First you gotta grab the content (I like to use [requests](http://docs.python-requests.org/en/latest/) for this):

In [1]:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
r = requests.get(url='http://www.imdb.com/name/nm0000821', headers=headers)  # let's look at Amitabh Bachchan's list of movies
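
A request like the one above can fail quietly, for example with a 403 or 404 response, and you would end up scraping an error page. Here is a minimal sketch of a more defensive fetch. The `fetch_html` helper and its 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors.

    Hypothetical helper: the name and default timeout are illustrative.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

You could call it as `fetch_html('http://www.imdb.com/name/nm0000821', headers=headers)` and work with the returned string instead of `r.text`.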

Now you can make your "beautiful soup"! This turns the HTML into a DOM tree that you can navigate with code.

In [2]:
from bs4 import BeautifulSoup
webpage = BeautifulSoup(r.text, "html.parser")

Scraping the Info You Want
------------------------
Now there are a few ways to get content out.   For instance, to get the title you could treat it like an object:

In [3]:
webpage.title.text

'Amitabh Bachchan - IMDb'

Or you can search for specific tags. This would count all the links (`find_all` returns them as DOM elements):

In [4]:
len(webpage.find_all('a'))

288
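
Counting links is rarely the end goal; usually you want their text or their `href`. The sketch below runs on a small made-up HTML snippet (not IMDb's real markup) so it works without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up HTML so the example runs offline.
html = """
<html><body>
  <a href="/title/tt0000001/">First film</a>
  <a href="/title/tt0000002/">Second film</a>
  <a>No href here</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Each result is a Tag; .get() returns None when the attribute is missing.
links = [(a.text, a.get("href")) for a in soup.find_all("a")]
print(links)
```

Using `a.get("href")` rather than `a["href"]` avoids a `KeyError` on anchors that have no `href` attribute.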

Or you can use good old CSS selectors to actually find all the years his movies were made in:

In [5]:
len(webpage.select('.ipc-metadata-list-summary-item__li'))

40
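
To see how CSS selectors narrow things down, here is a small offline sketch. The markup and the class names (`film`, `year`, `title`) are placeholders invented for this example, not the classes IMDb actually uses.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are placeholders, not IMDb's real ones.
html = """
<ul>
  <li class="film"><span class="title">Film A</span> <span class="year">1998</span></li>
  <li class="film"><span class="title">Film B</span> <span class="year">2001</span></li>
  <li class="other"><span class="year">1900</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# ".film .year" matches elements with class "year" nested inside class "film".
years = [tag.text for tag in soup.select(".film .year")]

# select_one returns the first match, or None if nothing matches.
first_title = soup.select_one(".film .title").text

print(years, first_title)
```

The descendant selector skips the `1900` entry because its parent does not have the `film` class.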

Of course, we really want to turn this into a list of years... not DOM elements.

TODO: Expand all films.

In [7]:
raw_year_list = [e.text.strip() for e in webpage.select('.ipc-metadata-list-summary-item__li')]

Cleaning and Analyzing the Data
-----------------------------
So we can check if he made any films in a particular year:

In [8]:
'1972' in raw_year_list

False

And we can look for messy data:

In [None]:
[year for year in raw_year_list if not year.isnumeric()]

And we can remove these messy entries (even though that isn't the best thing to do):

In [9]:
year_list = [year for year in raw_year_list if year.isnumeric()]
','.join(year_list)

'2023,2024,2024,2023,2022,2022,2022,2022,2022,2022,2022,2021,2021,2021,2020,2020,2020,2019,2016,2011,2005,2001,1998,1998,1997,1996,2016'
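
Dropping non-numeric entries throws away data. One alternative is to pull the first four-digit year out of each string. This is only a sketch: the messy strings below are made up, since the notebook does not show what the real ones look like, and `extract_year` is a hypothetical helper.

```python
import re

# Made-up examples of the kind of messy strings a page might contain.
messy = ["2016", "2011-2012", "2024/I", "TV Series", " 1998 "]

def extract_year(text):
    """Return the first four-digit year (19xx or 20xx) in text, or None."""
    match = re.search(r"\b(19|20)\d{2}\b", text)
    return match.group(0) if match else None

# Keep entries that contain a year; skip the ones that do not.
years = [y for y in map(extract_year, messy) if y is not None]
print(years)
```

For a range such as `2011-2012` this keeps only the first year, which may or may not be what you want.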

In [10]:
import collections
year_freq = collections.Counter(year_list)
for year, count in sorted(year_freq.items()):
    print(f"{year}: {'+' * count}")

1996: +
1997: +
1998: ++
2001: +
2005: +
2011: +
2016: ++
2019: +
2020: +++
2021: +++
2022: +++++++
2023: ++
2024: ++
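
Once the years are in a `Counter`, a couple of quick summaries are easy to get. This sketch reuses the cleaned year string printed earlier. Converting to `int` for the range is a choice made here, not something the original notebook does.

```python
import collections

# The cleaned years printed by the notebook above.
year_list = (
    "2023,2024,2024,2023,2022,2022,2022,2022,2022,2022,2022,2021,2021,2021,"
    "2020,2020,2020,2019,2016,2011,2005,2001,1998,1998,1997,1996,2016"
).split(",")

year_freq = collections.Counter(year_list)

# most_common(1) returns a list holding the single (value, count) pair with the highest count.
busiest = year_freq.most_common(1)

# Compare as integers so the range does not rely on string ordering.
first, last = min(map(int, year_list)), max(map(int, year_list))

print(busiest, first, last)
```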
