## Web Scrapping Example

In [2]:
# package use
import requests
import urllib
import urllib.request
import time
import re
from bs4 import BeautifulSoup
from IPython.core.display import HTML

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import copy
from collections import Counter

In [2]:
# set the url to the website and access the site with our requests library
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
response

<Response [200]>

In [3]:
# Next we parse the html with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure
soup = BeautifulSoup(response.text, "html.parser")
# print soup to show the nested data structure
# soup

In [4]:
# We use the method .findAll to locate all of our <a> tags for the first one to ten records
soup.findAll('a')[1:10]

[<a href="http://www.mta.info"><img alt="Go to MTA homepage" src="/template/images/mta_info.gif"/></a>,
 <a href="/accessibility">Accessibility</a>,
 <a href="http://assistive.usablenet.com/tt/http://www.mta.info">Text-only</a>,
 <a href="/selfserve">Customer Self-Service</a>,
 <a href="/mta/employment/">Employment</a>,
 <a href="/faqs.htm">FAQs/Contact Us</a>,
 <a href="http://www.mta.info" style="padding-left:18px;">Home</a>,
 <a href="http://www.mta.info">MTA Home</a>,
 <a href="http://www.mta.info/nyct">NYC Subways and Buses</a>]

### Explanation for the tag records above:
- Explanation: This code gives us every line of code that has an <a> tag. 
- The information that we are interested in starts on line 36. Not all links are relevant to what we want, but most of it is, so we can easily slice from line 36.

In [5]:
# let’s extract the actual link that we want. Let’s test out the first link
# Notice that all the .txt files are inside the <a> tag following the line above
one_a_tag = soup.findAll('a')[36]
print(one_a_tag)
# extract the address of txt
link = one_a_tag['href']

<a href="data/nyct/turnstile/turnstile_190330.txt">Saturday, March 30, 2019</a>


### Explanation for operations above:
- This code saves ‘data/nyct/turnstile/turnstile_190316.txt’ to our variable link. The full url to download the data is actually ‘http://web.mta.info/developers/data/nyct/turnstile/turnstile_190316.txt’ which I discovered by clicking on the first data file on the website as a test.
- We can use our urllib.request library to download this file path to our computer. We provide request.urlretrieve with two parameters: file url and the filename. For my files, I named them “turnstile_180922.txt”, “turnstile_180901”, etc.

In [6]:
# create full download url string
download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:]) 

('./turnstile_190330.txt', <http.client.HTTPMessage at 0x187f5673550>)

In [7]:
# Last but not least, we should include this line of code 
# so that we can pause our code for a second 
# so that we are not spamming the website with requests.
# This helps us avoid getting flagged as a spammer
time.sleep(1)

## Read Table

- useful tutorial websites: 

https://www.dataquest.io/blog/web-scraping-tutorial-python/

https://stackoverflow.com/questions/46015006/how-to-scrape-the-first-n-paragraphs-from-a-url

https://cfss.uchicago.edu/webdata005_scraping.html

In [2]:
# read table and show the first five lines
director_table = pd.read_csv('director table.csv', encoding='ISO-8859-1')
director_table['bio_url'] = np.nan
director_table.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,,,,,0,
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,,,,,1,


### gender-guesser package
This package uses the underlying data from the program “gender” by Jorg Michael (described here). 
Its use is pretty straightforward.

https://pypi.org/project/gender-guesser/

## Some hint on how to extract those information
- dateofbirth: extract from the profile
- placeofbirth: extract from the profie
- minibio: extract from the "view more bio" link
- trivia: extract from the "vie more bio" link
- race: depends
- gender: deduce from the mini_bio by detecting him or her & otherwise use gender-guesser package

In [3]:
# detect whether people have "view more bio" or not and record
def detect_bio_link(url):
    response = requests.get(url)
    # form the txt
    # call BeautifulSoup data structure to work
    soup = BeautifulSoup(response.text, "html.parser")
    
    mini_bio_url_num = 0
    # Find all the links on the page
    for link in soup.find_all('a', href=True):
        # find the mini_bio page
        if "bio_sm" in link['href']:
            # calculate how many it finds
            mini_bio_url_num += 1
            mini_bio_url = link['href']
            mini_bio_url = "https://www.imdb.com" + mini_bio_url
    # check
    if mini_bio_url_num == 1:
        return True, mini_bio_url
    else:
        mini_bio_url = " "
        return False, mini_bio_url

In [None]:
# decide number of directors to scrap
length = len(director_table)
# make a copy to modify the original table
director_table_bioAdd = copy.deepcopy(director_table)

# update and fill the new table
for i in range(length):
    judge, mini_bio_url_add = detect_bio_link(director_table['actorimdb'][i])
    if judge:
        director_table_bioAdd['bio_url'][i] = mini_bio_url_add
    # keep track of process
    if i%200 == 0:
        print(i)
# write csv to store fist
director_table_bioAdd.to_csv('bio_url.csv')

In [3]:
director_table_bioAdd = pd.read_csv('bio_url.csv')

In [4]:
# after check
director_table_bioAdd.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,https://www.imdb.com/name/nm1436693/bio?ref_=n...
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,,,,,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,,,,,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...


## Operation on director with bio urls first

In [5]:
bio_url_fill = director_table_bioAdd[director_table_bioAdd['bio_url'].notnull()]
bio_url_fill = bio_url_fill.reset_index(drop=True)

bio_url_notfill = director_table_bioAdd.append(bio_url_fill).drop_duplicates(keep=False)
bio_url_notfill = bio_url_notfill.reset_index(drop=True)
bio_url_fill.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,https://www.imdb.com/name/nm1436693/bio?ref_=n...
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,,,,,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,,,,,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...


## Extract dateofbirth and placeofbirth information

In [6]:
print("length of bio url filled data set: ", len(bio_url_fill))

length of bio url filled data set:  1362


In [8]:
# extract birth date & place at least
# and minibio and trivia information
def extract_info(url):
    response_mini = requests.get(url)
    # still form the text
    soup_mini = BeautifulSoup(response_mini.text, "html.parser")
    
    # extract birth date and place first
    birth_monthday = " "
    birth_year = " "
    placeofbirth = " "
    for link in soup_mini.find_all('a', href=True):
        if "birth_monthday" in link['href']:
            birth_monthday = link.string
        if "birth_year" in link['href']:
            birth_year = link.string
        if "birth_place" in link['href']:
            placeofbirth = link.string
    # form date of birth string
    dateofbirth = birth_monthday+" "+birth_year
    
    # extract all relevant information -- topics and content
    # soda odd and soda even for content, li_group for title
    table = soup_mini.find_all(True, {"class": {"soda odd", "soda even", "li_group"}})
    # extract all the text first
    table_text = []
    for i in range(len(table)):
        table_text.append(table[i].get_text(strip=True))
    
    # extract mini_bio and trivia now
    mini_trivia_para = [" ", " "]
    search_title = ["Mini Bio", "Trivia"]
    for j in range(len(search_title)):
        for i in range(len(table_text)):
            if search_title[j] in table_text[i]:
                # extract the number of records to append after
                numOfRecord = re.sub("\D", "", table_text[i])
                # fill in the information
                mini_trivia_para[j] = '/'.join(table_text[(i+1):(i+1+int(numOfRecord))])
                break
                
    return dateofbirth, placeofbirth, mini_trivia_para[0], mini_trivia_para[1]

In [None]:
# fill the information now
length = len(bio_url_fill)
for count in range(length):
    url = bio_url_fill['bio_url'][count]
    birthday_fill, place_fill, mini_fill, trivia_fill = extract_info(url)
    bio_url_fill['dateofbirth'][count] = birthday_fill
    bio_url_fill['placeofbirth'][count] = place_fill
    bio_url_fill['minibio'][count] = mini_fill
    bio_url_fill['trivia'][count] = trivia_fill
    if count % 100 == 0:
        print(count)
# write to csv
bio_url_fill.to_csv("bio_url_fill.csv")

In [7]:
bio_url_fill = pd.read_csv('bio_url_fill.csv')

## Operation on director without bio urls now

- Since they have no bio urls, minibio and trivia information would be absent
- Try to extract their birthplace and birthday information if possible

In [8]:
print("length of bio url not filled data set: ", len(bio_url_notfill))

length of bio url not filled data set:  75


In [9]:
bio_url_notfill.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url
0,,Aaron Seltzer,http://www.imdb.com/name/nm0783536/,,,,,,,1,
1,,Abhishek Varman,http://www.imdb.com/name/nm2831530/,,,,,,,0,
2,,Adam Chapman,http://www.imdb.com/name/nm7920865/,,,,,,,1,


In [12]:
# extract birth information for notfill data set--function
def extract_birth_notfill(url):
    response_mini = requests.get(url)
    # still form the text
    soup_mini = BeautifulSoup(response_mini.text, "html.parser")

    birth_monthday = " "
    birth_year = " "
    placeofbirth = " "
    for link in soup_mini.find_all('a', href=True):
        if "birth_monthday" in link['href']:
            birth_monthday = link.string
        if "birth_year" in link['href']:
            birth_year = link.string
        if "birth_place" in link['href']:
            placeofbirth = link.string
    # form date of birth string
    dateofbirth = birth_monthday+" "+birth_year
    
    return dateofbirth, placeofbirth

In [None]:
# extract birth information for notfill data set
length_notfill = len(bio_url_notfill)
for i in range(length_notfill):
    date, place = extract_birth_notfill(bio_url_notfill['actorimdb'][i])
    bio_url_notfill['dateofbirth'][i] = date
    bio_url_notfill['placeofbirth'][i] = place

In [10]:
# combine the total information
bio_url_total = bio_url_fill.append(bio_url_notfill, ignore_index=True)
# clean messy string "Born Today"
bio_url_total['dateofbirth'] = bio_url_total['dateofbirth'].str.replace("Born Today", "")
# write to csv
bio_url_total.to_csv("bio_url_total.csv", index=False)

In [10]:
bio_url_total = pd.read_csv('bio_url_total.csv')
bio_url_total.head(5)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,"A.R. Murugadoss is a writer and director, know...",,,,0,https://www.imdb.com/name/nm1436693/bio?ref_=n...,0,
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male
3,,Abbas Alibhai Burmawalla,http://www.imdb.com/name/nm0122216/,,,Abbas Alibhai Burmawalla is a director and pro...,The name Abbas-Mastan is used for films co-dir...,,male,0,https://www.imdb.com/name/nm0122216/bio?ref_=n...,1,male
4,,Abbas Kiarostami,http://www.imdb.com/name/nm0452102/,June 22 1940,"Tehran, Iran","Abbas Kiarostami was born in Tehran, Iran, in ...",Received the UNESCO Fellini-Medal in Gold for ...,,male,0,https://www.imdb.com/name/nm0452102/bio?ref_=n...,1,male


## Deduce the gender information from previous columns - bio and trivia

In [11]:
male_dic = ['him', 'his', 'he', 'himself']
female_dic = ['her', 'she', 'hers', 'herself']
# add tell_gender column
# tell_gender = 1 if can be deduced from bio or trivia
bio_url_total['gender'] = np.nan
bio_url_total['tell_gender_fromText'] = 0

In [2]:
# NLP packages
import spacy
# loading up the language model: English
# nlp = spacy.load('en')
import en_core_web_sm
nlp = en_core_web_sm.load()
# gender package
import gender_guesser.detector as gender

In [13]:
# d = gender.Detector()
d = gender.Detector(case_sensitive=False)
print(d.get_gender("Bob"))
print(d.get_gender("Pauley")) # should be androgynous
print(d.get_gender("Adam Shankman"))

male
andy
unknown


In [17]:
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~''' # list of special characters you want to exclude
def kick_out_special_char(sent):
    done = ""
    for char in sent:
        if char not in punctuations:
             done = done + char
        else:
            done = done + " "
    return done

def tell_gender(num):
    # read text from two
    bio_ = bio_url_total.minibio[num]
    trivia_ = bio_url_total.trivia[num]  
    # lower chars and store words into list
    bio_ = str(bio_)
    bio_temp = bio_.lower().split()
    trivia_temp = kick_out_special_char(trivia_).split()
    temp = bio_temp + trivia_temp
    # check
    find = 0
    gender = np.nan
    for item in male_dic:
        if item in temp:
            gender = "male"
            find = 1
            break
    # not male the check female
    if not find:
        for item in female_dic:
            if item in temp:
                gender = "female"
                find = 1
                break
    return find, gender

In [None]:
# fill in gender column
length = len(bio_url_fill)
for i in range(length):
    find_or_not, gender_info = tell_gender(i)
    bio_url_total['gender'][i] = gender_info
    bio_url_total['tell_gender_fromText'][i] = find_or_not
    if i%100 == 0:
        print(i)

In [None]:
# need to double check to make sure
def tell_gender_female_first(num):
    # read text from two
    bio_ = bio_url_total.minibio[num]
    trivia_ = bio_url_total.trivia[num]  
    # lower chars and store words into list
    bio_ = str(bio_)
    bio_temp = bio_.lower().split()
    trivia_temp = kick_out_special_char(trivia_).split()
    temp = bio_temp + trivia_temp
    # check
    find = 0
    gender = np.nan
    for item in female_dic:
        if item in temp:
            gender = "female"
            find = 1
            break
    # not male the check female
    if not find:
        for item in male_dic:
            if item in temp:
                gender = "male"
                find = 1
                break
    return find, gender

bio_url_total['gender2'] = np.nan
# double check
# fill in gender2 column
for i in range(length):
    find_or_not, gender_info = tell_gender_female_first(i)
    bio_url_total['gender2'][i] = gender_info
    if i%100 == 0:
        print(i)

In [24]:
# write to csv
bio_url_total.to_csv("bio_url_total.csv", index=False)

In [14]:
bio_url_total = pd.read_csv('bio_url_total.csv')
bio_url_total.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,"A.R. Murugadoss is a writer and director, know...",,,,0,https://www.imdb.com/name/nm1436693/bio?ref_=n...,0,
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male


In [15]:
cant_deduce = len(bio_url_total[bio_url_total['tell_gender_fromText'] == 0])
dont_match = len(bio_url_total[bio_url_total['gender'] != bio_url_total['gender2']])
print("don't match gender number: ", dont_match)
print("can't deduce gender number: ", cant_deduce)

don't match gender number:  416
can't deduce gender number:  220


In [16]:
# filter out all these cant_deduce and dont_match information
# re-estimate according to the majority of gender_dic appearance
cant_deduce_set = bio_url_total[bio_url_total['tell_gender_fromText'] == 0]
dont_match_set = bio_url_total[bio_url_total['gender'] != bio_url_total['gender2']]

re_estimate_set = pd.concat([cant_deduce_set, dont_match_set], axis=0).drop_duplicates(keep='first')
re_estimate_set = re_estimate_set[~re_estimate_set['bio_url'].isnull()].reset_index(drop=True)
not_re_estimate_set = pd.concat([bio_url_total, re_estimate_set]).drop_duplicates(keep=False).reset_index(drop=True)

In [17]:
print(len(not_re_estimate_set))
print(len(re_estimate_set))

1096
341


In [18]:
re_estimate_set.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,"A.R. Murugadoss is a writer and director, know...",,,,0,https://www.imdb.com/name/nm1436693/bio?ref_=n...,0,
1,,Abhinav Kashyap,http://www.imdb.com/name/nm3508781/,,,"Abhinav Kashyap is a writer and director, know...",Brother of noted director Anurag Kashyap.,,,0,https://www.imdb.com/name/nm3508781/bio?ref_=n...,0,
2,,Abhinay Deo,http://www.imdb.com/name/nm3218978/,,,"Abhinay Deo is a director and producer, known ...",Brother of Ajinkya Deo./Son of Ramesh Deo and ...,,,0,https://www.imdb.com/name/nm3218978/bio?ref_=n...,0,


In [19]:
not_re_estimate_set.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2
0,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male
1,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male
2,,Abbas Alibhai Burmawalla,http://www.imdb.com/name/nm0122216/,,,Abbas Alibhai Burmawalla is a director and pro...,The name Abbas-Mastan is used for films co-dir...,,male,0,https://www.imdb.com/name/nm0122216/bio?ref_=n...,1,male


In [24]:
# find the majority
def tell_gender_majority(num):
    # read text from two
    bio_ = re_estimate_set['minibio'][num]
    trivia_ = re_estimate_set['trivia'][num] 
    
    # lower chars and store words into list
    bio_ = str(bio_)
    bio_temp = bio_.lower().split()
    if trivia_ != " ":
        trivia_temp = kick_out_special_char(trivia_).split()
        temp = bio_temp + trivia_temp
    else:
        temp = bio_temp
    # check
    find = 0
    gender_male = 0
    gender_female = 0
    
    for item in female_dic:
        gender_female = gender_female + temp.count(item)
    for item in male_dic:
        gender_male = gender_male + temp.count(item)
    
    if gender_male > gender_female:
        find = 1
        gender = "male"
    elif gender_male < gender_female:
        find = 1
        gender = "female"
    else:
        gender = np.nan
    
    return find, gender

In [107]:
not_re_estimate_set['gender_majority'] = not_re_estimate_set['gender']
re_estimate_set['gender_majority'] = np.nan

for i in range(len(re_estimate_set)):
    find_or_not, gender_info = tell_gender_majority(i)
    re_estimate_set['gender_majority'][i] = gender_info
    if i%100 == 0:
        print(i)

0
100
200
300


In [110]:
bio_url_total_gender = pd.concat([not_re_estimate_set, re_estimate_set]).drop_duplicates(keep='first').reset_index(drop=True)
bio_url_total_gender.to_csv("bio_url_total_gender.csv")

In [20]:
bio_url_total_gender = pd.read_csv("bio_url_total_gender.csv")
bio_url_total_gender.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2,gender_majority
0,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male,male
1,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male,male
2,,Abbas Alibhai Burmawalla,http://www.imdb.com/name/nm0122216/,,,Abbas Alibhai Burmawalla is a director and pro...,The name Abbas-Mastan is used for films co-dir...,,male,0,https://www.imdb.com/name/nm0122216/bio?ref_=n...,1,male,male


## Scrap Book My Show website url address

In [None]:
# !pip install webdriver-manager

In [3]:
# web driver
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from webdriver_manager.chrome import ChromeDriverManager

# automatic scrap
import re
from threading import Thread
import requests

# google search prepared for book my view
import google
from googlesearch import search

In [None]:
# set the url to the website and access the site with our requests library
url = bio_url_total_domestic['actorimdb'][0]
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# print soup to show the nested data structure
# soup
print(url)
print(soup)
print(soup.findAll('a'))
# all_image = soup.find_all('img')
# print(all_image)

In [25]:
tar_url = "https://in.bookmyshow.com/hyderabad#!quickbook"
# tar_url = "https://email.itd.uts.edu.au/webapps/directory/byname/index.mason"
response = requests.get(tar_url)
soup = BeautifulSoup(response.text, "html.parser")
soup.find_all('input', {"class": {"search-box typeahead"}})

[<input autocomplete="search-box" class="search-box typeahead" id="input-search-box" name="inputSearchBox" onfocus="BMS.Misc.fnTriggerQuickbook();" placeholder="Search for Movies, Events, Plays, Sports and Activities" type="text">
 </input>,
 <input class="search-box typeahead" placeholder="Search for Movies, Events, Plays, Sports, Activities" type="text">
 <span class="search-list-icon" id="search-list-icon">
 <svg enable-background="new 0 0 100 100" version="1.1" viewbox="0 0 100 100" x="0px" xml:space="preserve" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" y="0px">
 <use xlink:href="/icons/common-icons.svg#icon-list"></use>
 </svg>
 </span>
 <!-- <div class="qb-region none">
 			<a class="location" id="qb-region-link">
 	          <span class="icon-location">
 	            <svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" viewBox="0 0 100 100" enable-background="new 0 0 100 100" xml:space

In [26]:
# example provide available url
tar_url = "https://email.itd.uts.edu.au/webapps/directory/byname/index.mason"
# provide all needed information to decide url-text information
data = {'searchfield': 'cn',
        'searchoption': 'Contains',
        'searchstring': 'Michael',
        '.submit': 'Search',
        'submittingsearch': '1'}
# page_content = requests.post(url=tar_url, data=data)

response = requests.get(tar_url)
soup = BeautifulSoup(response.text, "html.parser")
soup.find_all('input')

[<input id="searchstring" name="searchstring" size="30" type="text"/>,
 <input name=".submit" type="submit" value="Search"/>,
 <input name="search_within" type="checkbox" value="1"/>,
 <input name="submittingsearch" type="hidden" value="1"/>]

In [104]:
# !pip install google

Collecting google
  Downloading https://files.pythonhosted.org/packages/34/4c/9bc51ae2611e5893ff45f8972f20dd7c8408eb5d706a541182ac2da3b0b7/google-2.0.2.tar.gz (45kB)
Building wheels for collected packages: google
  Running setup.py bdist_wheel for google: started
  Running setup.py bdist_wheel for google: finished with status 'done'
  Stored in directory: C:\Users\TK\AppData\Local\pip\Cache\wheels\ac\25\a4\837e13b998311f9824965755f86ecf69ef9ee1a7db10834cfc
Successfully built google
Installing collected packages: google
Successfully installed google-2.0.2


You are using pip version 9.0.3, however version 19.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [27]:
director_table.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,,,,,0,
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,,,,,1,


In [28]:
director_table_bookmyshow = copy.deepcopy(director_table)
# initialize
director_table_bookmyshow['keyword'] = np.nan
# update keyword in google search
director_table_bookmyshow['keyword'] = 'bookmyshow '+ director_table_bookmyshow['actorname']

In [29]:
director_table_bookmyshow.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,,bookmyshow A.R. Murugadoss
1,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,,,,,0,,bookmyshow Aanand Rai
2,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,,,,,,1,,bookmyshow Aaron Schneider


In [30]:
# prepare -- change name
director_table_bookmyshow.rename(columns={'actorimdb':'actorbookmyshow'}, inplace=True)
director_table_bookmyshow.head(1)

Unnamed: 0,photoimage,actorname,actorbookmyshow,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,http://www.imdb.com/name/nm1436693/,,,,,,,0,,bookmyshow A.R. Murugadoss


In [149]:
for i in range(len(director_table_bookmyshow)):
    for url in search(director_table_bookmyshow['keyword'][i], tld='com.pk', lang='es', stop=1):
        director_table_bookmyshow['actorbookmyshow'][i] = url
    if i%100 == 0:
        print(i)

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400


In [150]:
director_table_bookmyshow.head(3)

Unnamed: 0,photoimage,actorname,actorbookmyshow,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,https://in.bookmyshow.com/entertainment/movies...,,,,,,,0,,bookmyshow A.R. Murugadoss
1,,Aanand Rai,https://in.bookmyshow.com/person/aanand-l-rai/...,,,,,,,0,,bookmyshow Aanand Rai
2,,Aaron Schneider,https://en.wikipedia.org/wiki/Aaron_Schneider,,,,,,,1,,bookmyshow Aaron Schneider


In [151]:
director_table_bookmyshow.to_csv("director_table_bookmyshow.csv", index=False)

In [31]:
director_table_bookmyshow = pd.read_csv('director_table_bookmyshow.csv')

In [32]:
director_table_bookmyshow.head(3)

Unnamed: 0,photoimage,actorname,actorbookmyshow,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,https://in.bookmyshow.com/person/ar-murugadoss...,,,,,,,0,,bookmyshow A.R. Murugadoss
1,,Aanand Rai,https://in.bookmyshow.com/person/aanand-l-rai/...,,,,,,,0,,bookmyshow Aanand Rai
2,,Aaron Schneider,https://in.bookmyshow.com/person/aaron-schneid...,,,,,,,1,,bookmyshow Aaron Schneider


In [33]:
# check whether all actorbookmyshow are correct url address to use

# fail to fill in correct book my show url for the first time
book_doublecheck = director_table_bookmyshow[~director_table_bookmyshow['actorbookmyshow'].str.
                                  contains('https://in.bookmyshow.com/person/')]
book_doublecheck = book_doublecheck[~book_doublecheck['actorbookmyshow'].str.
                                  contains('https://id.bookmyshow.com/person/')]
print("can not find at the first url address: ", len(book_doublecheck))

# found book my show but no director information
lack_part1 =  book_doublecheck[book_doublecheck['actorbookmyshow'].str.
                                  contains('https://in.bookmyshow.com/')]
# delete lack_part1 and update book_doublecheck
book_doublecheck = pd.concat([book_doublecheck, lack_part1]).drop_duplicates(keep=False).reset_index(drop=True)

print("not exist in the book my show: ", len(lack_part1))
print("updated book_doublecheck length: ", len(book_doublecheck))

can not find at the first url address:  684
not exist in the book my show:  79
updated book_doublecheck length:  605


In [49]:
# search again
count_show = 0
for keyword in book_doublecheck['keyword']:
    count_show += 1
    for url in search(keyword, tld='com.pk', lang='es', stop=1):
        if 'https://in.bookmyshow.com/person/' in url:
            index = director_table_bookmyshow[(director_table_bookmyshow['keyword'] 
                                               == keyword)].index.tolist()[0]
            director_table_bookmyshow['actorbookmyshow'][index] = url
        elif 'https://id.bookmyshow.com/person/' in url:
            index = director_table_bookmyshow[(director_table_bookmyshow['keyword'] 
                                               == keyword)].index.tolist()[0]
            director_table_bookmyshow['actorbookmyshow'][index] = url
    if count_show % 100 == 0:
        print(count_show)

100
200
300
400
500
600


In [34]:
# fail to fill in correct book my show url for the first time
book_doublecheck = director_table_bookmyshow[~director_table_bookmyshow['actorbookmyshow'].str.
                                  contains('https://in.bookmyshow.com/person/')]
book_doublecheck = book_doublecheck[~book_doublecheck['actorbookmyshow'].str.
                                  contains('https://id.bookmyshow.com/person/')]
print("can not find at the first url address: ", len(book_doublecheck))

# found book my show but no director information
lack_part1 =  book_doublecheck[book_doublecheck['actorbookmyshow'].str.
                                  contains('https://in.bookmyshow.com/')]
# delete lack_part1 and update book_doublecheck
book_doublecheck = pd.concat([book_doublecheck, lack_part1]).drop_duplicates(keep=False).reset_index(drop=True)

print("not exist in the book my show: ", len(lack_part1))
print("updated book_doublecheck length: ", len(book_doublecheck))

can not find at the first url address:  684
not exist in the book my show:  79
updated book_doublecheck length:  605


In [51]:
director_table_bookmyshow.to_csv("director_table_bookmyshow.csv", index=False)

In [5]:
director_table_bookmyshow = pd.read_csv('director_table_bookmyshow.csv')
director_table_bookmyshow_manu = pd.read_csv('director_table_bookmyshow_manu.csv')

## Some problems in Book my show website
- Part of these websites provide almost no information thus the first url may return IMDb sometimes even if the keywords have been set to book my show
- There are some book my show websites that have been missed but most of them are missed because of too limited information and thus return to IMDb as well. Or IMDb may contain more information although book my show may be good as well
- Sometimes the first url may be more reliable source then IMDb or book my show
- book my show may provide wrong information sometimes -- born in year 2011
- Definitely, IMDb contains almost all directors mentioned above but book my show may provide only part of them

## Analysis of the first url source problem
- Two available csv files to use: director_table_bookmyshow and director_table_bookmyshow_manu
- Difference: manually check those addresses that don't contain 'https://in.bookmyshow.com/person/' or 'https://id.bookmyshow.com/person/'
- After checking: the first source from Google Search may be the most reliable. 
    - Those find IMDb with keyword 'bookmyshow' mean that IMDb provides much more information - rely on IMDb
    - Those find other than IMBb or bookmyshow with keyword 'bookmyshow' mean that neither IMDb and bookmyshow provide enough info - either IMDb or bookmyshow would be almost the same
    - Those find bookmyshow would be good
- Thus it won't influence too much if use director_table_bookmyshow directly. Here we use the updated manually version to continue.

In [6]:
book_doublecheck_1 = director_table_bookmyshow_manu[director_table_bookmyshow_manu['actorbookmyshow'].str.
                                  contains('https://in.bookmyshow.com/person/')]
book_doublecheck_2 = director_table_bookmyshow_manu[director_table_bookmyshow_manu['actorbookmyshow'].str.
                                  contains('https://id.bookmyshow.com/person/')]

In [7]:
print('Found number: ', len(book_doublecheck_2)+len(book_doublecheck_1))

Found number:  913


## Wikipedia url address scraping and html analysis

In [8]:
director_table_wiki = copy.deepcopy(director_table_bookmyshow)
director_table_wiki.rename(columns={'actorbookmyshow':'actorwiki'}, inplace=True)
director_table_wiki['keyword'] = 'wikipedia ' + director_table_wiki['actorname'] + ' director'

In [10]:
director_table_wiki.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,https://in.bookmyshow.com/person/ar-murugadoss...,,,,,,,0,,wikipedia A.R. Murugadoss director
1,,Aanand Rai,https://in.bookmyshow.com/person/aanand-l-rai/...,,,,,,,0,,wikipedia Aanand Rai director
2,,Aaron Schneider,https://in.bookmyshow.com/person/aaron-schneid...,,,,,,,1,,wikipedia Aaron Schneider director


In [None]:
# update the actorwiki address
for i in range(len(director_table_wiki)):
    for url in search(director_table_wiki['keyword'][i], tld='com.pk', lang='es', stop=1):
        director_table_wiki['actorwiki'][i] = url
    if i%100 == 0:
        print(i)

In [14]:
director_table_wiki.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,A.R. Murugadoss,https://en.wikipedia.org/wiki/AR_Murugadoss,,,,,,,0,,wikipedia A.R. Murugadoss director
1,,Aanand Rai,https://en.wikipedia.org/wiki/Aanand_L._Rai,,,,,,,0,,wikipedia Aanand Rai director
2,,Aaron Schneider,https://en.wikipedia.org/wiki/Aaron_Schneider,,,,,,,1,,wikipedia Aaron Schneider director


In [15]:
director_table_wiki.to_csv("director_table_wiki.csv", index=False)

In [108]:
# check the number of correct wiki url address
director_table_wiki_found_en = director_table_wiki[director_table_wiki['actorwiki'].
                              str.contains('https://en.wikipedia.org/wiki/')].reset_index(drop=True)
director_table_wiki_found_es = director_table_wiki[director_table_wiki['actorwiki'].
                              str.contains('https://es.wikipedia.org/wiki/')].reset_index(drop=True)
print("wikipedia found length: ", len(director_table_wiki_found_en)+len(director_table_wiki_found_es))

wikipedia found length:  1334


## Work on director_table_wiki_found_en/es websites

In [112]:
# an example for text information in tables

# scrap all the html information
url = director_table_wiki_found_en['actorwiki'][2]
print(url)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())
# print(soup)

# find the correct table
table=soup.find('table', {'class', 'infobox biography vcard'})
# print(" ")
# print(table)

# here \xa0 should indicate one blank in html language
# string = string.replace(u'\xa0', u' ')
# scrap headers of the table
headers= [header.text.replace(u'\xa0', u' ') for header in table.find_all('th')]
print(" ")
print(headers)
# scrap some row information
rows = []
for row in table.find_all('tr'):
    rows.extend([val.text.replace(u'\xa0', u' ') for val in row.find_all('td')])
print(" ")
print(rows)
print(len(rows))

https://en.wikipedia.org/wiki/Aaron_Schneider
 
['Aaron Schneider', 'Born', 'Occupation', 'Years active']
 
['Aaron Schneider at the 2009 Toronto International Film Festival', 'Mossville, Illinois', 'Filmmaker, cinematographer', '1990 – present']
4


In [110]:
director_table_wiki_found_es.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword
0,,Aaron Seltzer,https://es.wikipedia.org/wiki/Jason_Friedberg_...,,,,,,,1,,wikipedia Aaron Seltzer director
1,,Abdellatif Kechiche,https://es.wikipedia.org/wiki/Abdellatif_Kechiche,,,,,,,0,,wikipedia Abdellatif Kechiche director
2,,Adam McKay,https://es.wikipedia.org/wiki/Adam_McKay,,,,,,,1,,wikipedia Adam McKay director


In [111]:
# scrap all the html information
url = director_table_wiki_found_es['actorwiki'][3]
print(url)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())
# print(soup)

# find the correct table
table=soup.find('table', {'class', 'infobox biography vcard'})
print(table == None)
# here \xa0 should indicate one blank in html language
# string = string.replace(u'\xa0', u' ')
# scrap headers of the table
headers= [header.text.replace(u'\xa0', u' ').replace(u'\n', u'') for header in table.find_all('th')]
print(" ")
print(headers)
# scrap some row information
rows = []
for row in table.find_all('tr'):
    rows.extend([val.text.replace(u'\xa0', u' ') for val in row.find_all('td')])
print(" ")
print(rows)
print(len(rows))

https://es.wikipedia.org/wiki/Adam_Shankman
False
 
['Adam Shankman', 'Información personal', 'Nombre de nacimiento', 'Nacimiento', 'Nacionalidad', 'Educación', 'Educado en', 'Información profesional', 'Ocupación', 'Años activo']
 
['\nAdam Shankman', '\nAdam Michael Shankman', '\n27 de noviembre de 1964 (54 años) Los Ángeles, (California), Estados Unidos', '\nEstadounidense ', '\nEscuela Juilliard ', '\nDirector de cine, productor de cine, coreógrafo, actor de cine, director de televisión, actor y bailarín ', '\n1983–presente', '[editar datos en Wikidata]']
8


In [142]:
director_table_wiki_found_en['birth_all'] = np.nan
director_table_wiki_found_es['birth_all'] = np.nan

# define function to scrap birth_all information
def scrap_wiki_table(set_, header, num):
    # get text whole
    url = set_['actorwiki'][num]
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # get table
    table=soup.find('table', {'class', 'infobox biography vcard'})
    # no tables on the website
    if table == None:
        result = np.nan
    else:
        headers= [header.text.replace(u'\xa0', u' ') for header in table.find_all('th')]
        if header in headers:
            # continue to search table headers' content
            rows = []
            for row in table.find_all('tr'):
                rows.extend([val.text.replace(u'\xa0', u' ') for val in row.find_all('td')])
            # decide content index of rows
            if str(set_).count('en') > 0:
                if len(headers) == len(rows):
                    index = headers.index(header)
                else: # no photos then advance one
                    index = headers.index(header)-1
                    
            elif str(set_).count('es') > 0:
                index = headers.index(header)-1
            result = rows[index]
        else:
            # not in the table
            result = np.nan
    return result

In [143]:
scrap_wiki_table(director_table_wiki_found_en, 'Born', 21)

' (1963-06-24) 24 June 1963 (age 55)Bangkok, Thailand'

In [None]:
# update director_table_wiki_found_en first
for i in range(len(director_table_wiki_found_en)):
    director_table_wiki_found_en['birth_all'][i] = scrap_wiki_table(director_table_wiki_found_en, 'Born', i)
    if i%50 == 0:
        print(i)
director_table_wiki_found_en.head(3)

In [145]:
director_table_wiki_found_en.to_csv('director_table_wiki_found_en.csv')

In [147]:
director_table_wiki_found_en = pd.read_csv('director_table_wiki_found_en.csv')
director_table_wiki_found_en.head(3)

Unnamed: 0.1,Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword,birth_all
0,0,,A.R. Murugadoss,https://en.wikipedia.org/wiki/AR_Murugadoss,,,,,,,0,,wikipedia A.R. Murugadoss director,Murugadoss Arunasalam (1974-09-25) 25 Septembe...
1,1,,Aanand Rai,https://en.wikipedia.org/wiki/Aanand_L._Rai,,,,,,,0,,wikipedia Aanand Rai director,(1971-06-28) 28 June 1971 (age 47)Delhi
2,2,,Aaron Schneider,https://en.wikipedia.org/wiki/Aaron_Schneider,,,,,,,1,,wikipedia Aaron Schneider director,"Mossville, Illinois"


In [None]:
# update director_table_wiki_found_es then
for i in range(len(director_table_wiki_found_es)):
    director_table_wiki_found_es['birth_all'][i] = scrap_wiki_table(director_table_wiki_found_es, 
                                                                    'Nacimiento', i)
    if i%50 == 0:
        print(i)

In [224]:
director_table_wiki_found_es.to_csv('director_table_wiki_found_es.csv', index=False)

In [225]:
director_table_wiki_found_es.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword,birth_all
0,,Aaron Seltzer,https://es.wikipedia.org/wiki/Jason_Friedberg_...,,,,,,,1,,wikipedia Aaron Seltzer director,
1,,Abdellatif Kechiche,https://es.wikipedia.org/wiki/Abdellatif_Kechiche,,,,,,,0,,wikipedia Abdellatif Kechiche director,"\n17 de diciembre de 1960 (58 años)Túnez, Túnez"
2,,Adam McKay,https://es.wikipedia.org/wiki/Adam_McKay,,,,,,,1,,wikipedia Adam McKay director,"\n17 de abril de 1968 (50 años)Filadelfia, Pen..."


In [48]:
def regular_exp_year(tosearch):
    if str(tosearch) == 'nan':
        result = np.nan
    else:
        temp = re.search('([0-9]{4})', tosearch)
        if temp == None:
            result = np.nan
        else:
            result = temp.group(0)
    return result
regular_exp_year(director_table_wiki_found_en['birth_all'][0])

'1974'

In [51]:
def filter_placeOfBirth(dealt_string):
    if str(dealt_string) != 'nan':
        # replace all numbers with blanks
        temp = re.sub('([0-9])', '', dealt_string)
        # replace all [] and () with blanks
        temp = temp.replace('[]', '').replace('()', '')

        # kick out all words before '(age-num)' and keep the left
        index = temp.rfind(')')
        temp = temp[index+1:]
    else:
        temp = np.nan
    return temp
temp_string = "Murugadoss Arunasalam (1974-09-25) 25 September 1974 (age 44)Kallakkurichi, Tamil Nadu, India"
filter_placeOfBirth(temp_string)

'Kallakkurichi, Tamil Nadu, India'

In [52]:
# regular expression to tell birth year information from 'birth_all' column and fill
# fill in the place of birth information by 'birth_all' columns
for i in range(len(director_table_wiki_found_en)):
    director_table_wiki_found_en['dateofbirth'][i] = regular_exp_year(director_table_wiki_found_en['birth_all'][i])
    director_table_wiki_found_en['placeofbirth'][i] = filter_placeOfBirth(director_table_wiki_found_en['birth_all'][i])
    if i%100 == 0:
        print(i)

0
100
200
300
400
500
600


In [53]:
director_table_wiki_found_en.to_csv('director_table_wiki_found_en.csv', index=False)

In [17]:
director_table_wiki_found_en = pd.read_csv('director_table_wiki_found_en.csv')
director_table_wiki_found_en = director_table_wiki_found_en[['photoimage', 'actorname', 'actorwiki', 'dateofbirth', 
           'placeofbirth', 'minibio', 'trivia', 'race', 'gender', 'Domestic', 
           'bio_url', 'keyword', 'birth_all']]
director_table_wiki_found_en.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword,birth_all
0,,A.R. Murugadoss,https://en.wikipedia.org/wiki/AR_Murugadoss,1974.0,"Kallakkurichi, Tamil Nadu, India",,,,,0,,wikipedia A.R. Murugadoss director,"Murugadoss Arunasalam (1974-09-25) 25 September 1974 (age 44)Kallakkurichi, Tamil Nadu, India"
1,,Aanand Rai,https://en.wikipedia.org/wiki/Aanand_L._Rai,1971.0,Delhi,,,,,0,,wikipedia Aanand Rai director,(1971-06-28) 28 June 1971 (age 47)Delhi
2,,Aaron Schneider,https://en.wikipedia.org/wiki/Aaron_Schneider,,"Mossville, Illinois",,,,,1,,wikipedia Aaron Schneider director,"Mossville, Illinois"


In [58]:
# regular expression to tell birth year information from 'birth_all' column and fill
# fill in the place of birth information by 'birth_all' columns
for i in range(len(director_table_wiki_found_es)):
    director_table_wiki_found_es['dateofbirth'][i] = regular_exp_year(director_table_wiki_found_es['birth_all'][i])
    director_table_wiki_found_es['placeofbirth'][i] = filter_placeOfBirth(director_table_wiki_found_es['birth_all'][i])
    if i%100 == 0:
        print(i)

0
100
200
300
400
500
600
700


In [59]:
director_table_wiki_found_es.to_csv('director_table_wiki_found_es.csv', index=False)

In [16]:
director_table_wiki_found_es = pd.read_csv('director_table_wiki_found_es.csv')
director_table_wiki_found_es = director_table_wiki_found_es[['photoimage', 'actorname', 'actorwiki', 'dateofbirth', 
           'placeofbirth', 'minibio', 'trivia', 'race', 'gender', 'Domestic', 
           'bio_url', 'keyword', 'birth_all']]
director_table_wiki_found_es.head(3)

Unnamed: 0,photoimage,actorname,actorwiki,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,keyword,birth_all
0,,Aaron Seltzer,https://es.wikipedia.org/wiki/Jason_Friedberg_y_Aaron_Seltzer,,,,,,,1,,wikipedia Aaron Seltzer director,
1,,Abdellatif Kechiche,https://es.wikipedia.org/wiki/Abdellatif_Kechiche,1960.0,"Túnez, Túnez",,,,,0,,wikipedia Abdellatif Kechiche director,"\n17 de diciembre de 1960 (58 años)Túnez, Túnez"
2,,Adam McKay,https://es.wikipedia.org/wiki/Adam_McKay,1968.0,"Filadelfia, Pensilvania,Estados Unidos",,,,,1,,wikipedia Adam McKay director,"\n17 de abril de 1968 (50 años)Filadelfia, Pensilvania,Estados Unidos"


## Fill blanks in bio_url_total_gender csv file by wikipedia information

In [78]:
bio_url_total_gender = pd.read_csv("bio_url_total_gender.csv")
bio_url_total_gender.head(1)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2,gender_majority
0,,Aanand Rai,http://www.imdb.com/name/nm2399862/,,,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male,male


In [85]:
# filter all the columns that does not contain numbers in dateofbirth
date_null_name = bio_url_total_gender[bio_url_total_gender['dateofbirth'].isnull()].actorname.tolist()
print('date not filled length: ', len(date_null_name))
notNull = bio_url_total_gender[bio_url_total_gender['dateofbirth'].notnull()]
date_wrong_name = notNull[~notNull['dateofbirth'].str.contains('1')].actorname.tolist()
print('date wrong filled length: ', len(date_wrong_name))
date_need_fill = set(date_null_name + date_wrong_name)

date not filled length:  75
date wrong filled length:  335


In [113]:
def find_birthInfo(set_, name):
    birth_year = np.nan
    birth_place = np.nan
    temp_frame = set_[set_['actorname'].isin([name])].reset_index(drop=True)
    if len(temp_frame) > 0:
        birth_year = temp_frame['dateofbirth'][0]
        birth_place = temp_frame['placeofbirth'][0]
    return birth_year, birth_place

In [114]:
for name in date_need_fill:
    # find its row index
    index = bio_url_total_gender[(bio_url_total_gender.actorname == name)].index.tolist()[0]
    
    en_birth_year, en_birth_place = find_birthInfo(director_table_wiki_found_en, name)
    es_birth_year, es_birth_place = find_birthInfo(director_table_wiki_found_es, name)
    # update birth date info
    if en_birth_year != np.nan:
        bio_url_total_gender['dateofbirth'][index] = en_birth_year
    else:
        bio_url_total_gender['dateofbirth'][index] = es_birth_year
    # update birth place info
    if en_birth_place != np.nan:
        bio_url_total_gender['placeofbirth'][index] = en_birth_place
    else:
        bio_url_total_gender['placeofbirth'][index] = es_birth_place

In [121]:
# check the blank place of birth information again
place_null_name = bio_url_total_gender[bio_url_total_gender['placeofbirth'].isnull()].actorname.tolist()
print('place not filled length: ', len(place_null_name))
notNull = bio_url_total_gender[bio_url_total_gender['placeofbirth'].notnull()]
place_wrong_name = notNull[~notNull['placeofbirth'].str.contains('[a-zA-Z]')].actorname.tolist()
print('place wrong filled length: ', len(place_wrong_name))
place_need_fill = set(place_null_name + place_wrong_name)

place not filled length:  277
place wrong filled length:  35


In [122]:
for name in place_need_fill:
    # find its row index
    index = bio_url_total_gender[(bio_url_total_gender.actorname == name)].index.tolist()[0]
    en_birth_year, en_birth_place = find_birthInfo(director_table_wiki_found_en, name)
    es_birth_year, es_birth_place = find_birthInfo(director_table_wiki_found_es, name)
    # update birth place info
    if en_birth_place != np.nan:
        bio_url_total_gender['placeofbirth'][index] = en_birth_place
    else:
        bio_url_total_gender['placeofbirth'][index] = es_birth_place

In [123]:
bio_url_total_gender.to_csv("bio_url_total_gender.csv")

In [6]:
bio_url_total_gender = pd.read_csv('bio_url_total_gender.csv')
bio_url_total_gender = bio_url_total_gender[['photoimage', 'actorname', 'actorimdb', 'dateofbirth', 
                                             'placeofbirth', 'minibio', 'trivia', 'race', 'gender', 'Domestic', 
                                             'bio_url', 'tell_gender_fromText', 'gender2', 'gender_majority']]
bio_url_total_gender.head(2)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2,gender_majority
0,,Aanand Rai,http://www.imdb.com/name/nm2399862/,1971.0,Delhi,Aanand L. Rai is a Hindi film director and pro...,,,male,0,https://www.imdb.com/name/nm2399862/bio?ref_=n...,1,male,male
1,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,"Mossville, Illinois",Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male,male


# Image Collection from IMDb first
- collect their images to code their demographic background, possibly from IMDB by collecting data on observations that has a value of 1 on column 'Domestic' (1170 out of 1437). 

In [130]:
print("length of domestic directors: ", len(bio_url_total_gender[bio_url_total_gender['Domestic'] == 1]))

length of domestic directors:  1170


In [131]:
bio_url_total_domestic = bio_url_total_gender[bio_url_total_gender['Domestic'] == 1].reset_index(drop=True)
bio_url_total_not_domestic = pd.concat([bio_url_total_gender, bio_url_total_domestic]).drop_duplicates(keep=False).reset_index(drop=True)

In [132]:
print(len(bio_url_total_domestic))
print(len(bio_url_total_not_domestic))

1170
267


In [134]:
bio_url_total_domestic.head(1)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2,gender_majority
0,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,"Mossville, Illinois",Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male,male


In [151]:
def domestic_image_src(address):
    # scrap the whole information
    response = requests.get(address)
    soup = BeautifulSoup(response.text, "html.parser")
    # find name-poster image's src if available to list
    img = soup.find_all('img', {'id': "name-poster"})
    if len(img) == 0:
        # no available image to use
        result = np.nan
    else:
        # default the first one
        result = img[0].get('src')
    return result

In [None]:
# fill in the image_src column
bio_url_total_domestic['image_src'] = np.nan
for i in range(len(bio_url_total_domestic)):
    bio_url_total_domestic['image_src'][i] = domestic_image_src(bio_url_total_domestic['actorimdb'][i])
    if i%100 == 0:
        print(i)

In [153]:
bio_url_total_domestic.to_csv('bio_url_total_domestic.csv', index=False)

In [3]:
bio_url_total_domestic = pd.read_csv('bio_url_total_domestic.csv')
bio_url_total_domestic = bio_url_total_domestic[['photoimage', 'actorname', 'actorimdb', 'dateofbirth', 
                                                 'placeofbirth', 'minibio', 'trivia', 'race', 'gender', 
                                                 'Domestic', 'bio_url', 'tell_gender_fromText', 
                                                 'gender2', 'gender_majority', 'image_src']]
bio_url_total_domestic.head(1)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,bio_url,tell_gender_fromText,gender2,gender_majority,image_src
0,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,"Mossville, Illinois",Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,https://www.imdb.com/name/nm0773689/bio?ref_=n...,1,male,male,https://m.media-amazon.com/images/M/MV5BMTY4ND...


In [4]:
# download specific image
urllib.request.urlretrieve(bio_url_total_domestic['image_src'][0], 'sample1.png')

('sample1.png', <http.client.HTTPMessage at 0x28565679a58>)

In [14]:
# one shown example
df = pd.DataFrame([['A231', 'Book', 5, 3, 150], 
                   ['M441', 'Magic Staff', 10, 7, 200]],
                   columns = ['Code', 'Name', 'Price', 'Net', 'Sales'])
# your images
images = ['https://vignette.wikia.nocookie.net/2007scape/images/7/7a/Mage%27s_book_detail.png/revision/latest?cb=20180310083825',
          'https://i.pinimg.com/originals/d9/5c/9b/d95c9ba809aa9dd4cb519a225af40f2b.png'] 
df['image'] = images
# convert your links to html tags 
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'
pd.set_option('display.max_colwidth', -1)
HTML(df.to_html(escape=False ,formatters=dict(image=path_to_image_html)))

Unnamed: 0,Code,Name,Price,Net,Sales,image
0,A231,Book,5,3,150,
1,M441,Magic Staff,10,7,200,


In [6]:
print('No image on IMDb: ', len(bio_url_total_domestic[bio_url_total_domestic['image_src'].isnull()]))

No image on IMDb:  144


In [5]:
from IPython.display import Image
Image(url= bio_url_total_domestic['image_src'][0],width=100)

In [None]:
# convert your links to html tags 
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'
pd.set_option('display.max_colwidth', -1)

bio_url_total_domestic['photoimage'] = bio_url_total_domestic['image_src']
HTML(bio_url_total_domestic.to_html(escape=False ,formatters=dict(image=path_to_image_html)))

## Fill images src again by wikipedia

In [None]:
domestic_blank = bio_url_total_domestic[bio_url_total_domestic['image_src'].isnull()].reset_index(drop=True)
# fill in those blank url address by wikipedia personal page
domestic_blank['url_wiki'] = np.nan
for i in range(len(domestic_blank)):
    name = domestic_blank['actorname'][i]
    url = np.nan
    temp_frame_es = director_table_wiki_found_es[director_table_wiki_found_es['actorname'].isin([name])].reset_index(drop=True)
    temp_frame_en = director_table_wiki_found_en[director_table_wiki_found_en['actorname'].isin([name])].reset_index(drop=True)
    if len(temp_frame_es) > 0:
        url = temp_frame_es['actorwiki'][0]
    elif len(temp_frame_en) > 0:
        url = temp_frame_en['actorwiki'][0]
    domestic_blank['url_wiki'][i] = url
    if i%20 == 0:
        print(i)

In [27]:
print('number that can be found in wiki: ', len(domestic_blank[domestic_blank['url_wiki'].notnull()]))

number that can be found in wiki:  107


In [86]:
# input the index number of the domestic_blank
# return possible wikipedia image download address if available
def scrap_wiki_image(num):
    url = domestic_blank['url_wiki'][num]
    result_src = np.nan
    # not blank wikipedia - possible available to fill
    if str(url) != 'nan':
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # find the correct table 
        table=soup.find('table', {'class', 'infobox biography vcard'})
        # check blank table
        if table != None:
            # scrap some row information
            rows_temp = []
            for row in table.find_all('tr'):
                # extract images and find downloading images url address
                rows_temp.extend([val['href'] for val in row.find_all('a', {'class': 'image'})])
            # available pic to use
            if len(rows_temp) > 0:
                result_src = 'https://en.wikipedia.org'+rows_temp[0]
                # enter the picture website to scrap download address
                response = requests.get(result_src)
                soup = BeautifulSoup(response.text, "html.parser")
                # automatically scrap the first bio picture
                # check interval image existance
                if soup.find_all('a',{'class': "internal"}) != []:
                    result_src = 'https:'+soup.find_all('a',{'class': "internal"})[0]['href']
                # urllib.request.urlretrieve(result_src, 'sample.png') to download the image
    return result_src
# one example
av = scrap_wiki_image(8)
print(av)

https://upload.wikimedia.org/wikipedia/commons/b/b7/Cheryl_Dunye_2.jpg


In [87]:
# call domestic_blank's order to fill bio_url_total_domestic
for i in range(len(domestic_blank)):
    name = domestic_blank['actorname'][i]
    result = scrap_wiki_image(i)
    index = bio_url_total_domestic[(bio_url_total_domestic.actorname == name)].index.tolist()[0]
    bio_url_total_domestic['image_src'][index] = result
    if i%20 == 0:
        print(i)
print('No image on both: ', len(bio_url_total_domestic[bio_url_total_domestic['image_src'].isnull()]))

0
20
40
60
80
100
120
140
No image on both:  126


In [92]:
bio_url_total_domestic['photoimage'] = np.nan
bio_url_total_domestic.to_csv('bio_url_total_domestic.csv', index=False)

In [13]:
bio_url_total_domestic = pd.read_csv('bio_url_total_domestic.csv')
bio_url_total_domestic = bio_url_total_domestic[['photoimage', 'actorname', 'actorimdb', 'dateofbirth', 
                                                 'placeofbirth', 'minibio', 'trivia', 'race', 'gender', 
                                                 'Domestic', 'bio_url', 'tell_gender_fromText', 
                                                 'gender2', 'gender_majority', 'image_src']]
# domestic_dic set up and fill
domestic_dic = {}
for i in range(len(bio_url_total_domestic)):
    domestic_dic[bio_url_total_domestic['actorname'][i]] = bio_url_total_domestic['image_src'][i]

In [14]:
# merge bio_url_total_gender and bio_url_total_domestic together
bio_url_total_gender['image_src'] = np.nan
for i in range(len(bio_url_total_gender)):
    if bio_url_total_gender['actorname'][i] in domestic_dic:
        bio_url_total_gender['image_src'][i] = domestic_dic[bio_url_total_gender['actorname'][i]]
    else:
        bio_url_total_gender['image_src'][i] = np.nan

In [19]:
# add some other columns to fill in later
bio_url_total_gender['Image_url'] = np.nan
bio_url_total_gender['Birth_url'] = np.nan
bio_url_total_gender['Bio_url'] = np.nan
bio_url_total_gender['Image_manual'] = np.nan
bio_url_total_gender['Birth_manual'] = np.nan
bio_url_total_gender['Bio_manual'] = np.nan

In [20]:
bio_url_total_gender.to_csv('bio_url_mergeAll.csv', index=False)

In [21]:
bio_url_total_gender.head(3)

Unnamed: 0,photoimage,actorname,actorimdb,dateofbirth,placeofbirth,minibio,trivia,race,gender,Domestic,...,tell_gender_fromText,gender2,gender_majority,image_src,Image_url,Birth_url,Bio_url,Image_manual,Birth_manual,Bio_manual
0,,Aanand Rai,http://www.imdb.com/name/nm2399862/,1971.0,Delhi,Aanand L. Rai is a Hindi film director and pro...,,,male,0,...,1,male,male,,,,,,,
1,,Aaron Schneider,http://www.imdb.com/name/nm0773689/,,"Mossville, Illinois",Aaron Schneider is known for his work onTwo So...,Member of the American Society of Cinematograp...,,male,1,...,1,male,male,https://m.media-amazon.com/images/M/MV5BMTY4ND...,,,,,,
2,,Abbas Alibhai Burmawalla,http://www.imdb.com/name/nm0122216/,,,Abbas Alibhai Burmawalla is a director and pro...,The name Abbas-Mastan is used for films co-dir...,,male,0,...,1,male,male,,,,,,,


0. Download images onto an excel masterfile, using the links provided.
1. Manual Filling of Bio, Birth, and Image Info that are left blank after webscraping.
After M: Mengyao Until M: Jusang
2. Fill in the info, Fill in the link that leads to the info, and fill in the dummy indicator that it has been collected manually.
3. Fill in gender afterwards, using image downloaded on excel.
4. Fill in race afterwards, using image, birthplace, and race/ethnicity description guideline.

In [42]:
!pip install google-search-results

Collecting google-search-results
  Downloading https://files.pythonhosted.org/packages/f2/fa/f3de021d374269688dcc4d959e92e2e19902a5b5e9ff679322407d0bc4df/google_search_results-1.4.0.tar.gz
Building wheels for collected packages: google-search-results
  Running setup.py bdist_wheel for google-search-results: started
  Running setup.py bdist_wheel for google-search-results: finished with status 'done'
  Stored in directory: C:\Users\TK\AppData\Local\pip\Cache\wheels\0a\6d\64\e8891e5e0ed7177486da4af7794e1adf60accaf92ba4936246
Successfully built google-search-results
Installing collected packages: google-search-results
Successfully installed google-search-results-1.4.0


You are using pip version 9.0.3, however version 19.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [78]:
from google_search_results import GoogleSearchResults
params = {
    "q" : "Aanand Rai",
    "location" : 'United States',
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
}
query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
# return 'search_metadata', 'search_parameters','search_information','knowledge_graph'

In [75]:
dictionary_results

{'error': 'You need a valid account to continue using our API. Sign up on SerpApi.com website.'}

## Deduce the race information from previous columns

## Clean the data frame eventually