# New York Social Diary

[New York Social Diary](http://www.newyorksocialdiary.com/) casts a fascinating lens onto New York's socially well-to-do.  The data forms a natural social graph for New York's social elite.  Take a look at this page of a recent run-of-the-mill holiday party:

`http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers`

Besides the brand-name celebrities, you will notice the photos have carefully annotated captions labeling those that appear in the photos.  We can think of this as implicitly implying a social graph: there is a connection between two individuals if they appear in a picture together. In this project, we will scrape data from this website, parse the captions to find which people occur in photos together, and build a social graph of the result.

The first step is to fetch the data.  This comes in two phases.

The first step is to crawl the data.  We want photos from parties before December 1st, 2014.  Go to
`http://www.newyorksocialdiary.com/party-pictures`
to see a list of (party) pages.  For each party's page, grab all the captions.

*Hints*:

  1. Click on the on the index page and see how they change the url.  Use this to determine a strategy to get all the data.

  2. Notice that each party has a date on the index page. You can use python's `datetime.strptime` function to parse it.

  3. Some captions are not useful: they contain long narrative texts that explain the event.  Usually in two stage processes like this, it is better to keep more data in the first stage and then filter it out in the second stage.  This makes your work more reproducible.  It's usually faster to download more data than you need now than to have to redownload more data later.

Now that you have a list of all captions, you should probably save the data on disk so that you can quickly retrieve it.  Now comes the parsing part.

  1. Try to find some heuristic rules to separate captions that are a list of names from those that are not. For one, consider that long captions are often not lists of people.  The cutoff is subjective so to be definitive, *let's set that cutoff at 250 characters*.

  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `string.split`.

  3. You might find a person named "ra Lebenthal".  There is no one by this name.  Can anyone spot what's happening here?

  4. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out b/c they ultimately refer to the same person: "Michael Bloomberg."

For the analysis, we think of the problem in terms of a [network](http://en.wikipedia.org/wiki/Computer_network) or a [graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  What we have described is more appropriately called an (undirected) [multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).  In this problem, we will analyze the social graph of the new york social elite.

For this problem, we recommend using python's `networkx` library.


## I. Degree

The simplest question you might want to ask is 'who is the most popular'?  The easiest way to answer this question is to look at how many connections everyone has.  Write a function that returns the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.
  
*Checkpoint:*

    Top 100 .describe()
    count    100.000000
    mean     106.340000
    std       51.509579
    min       69.000000
    25%       77.000000
    50%       85.500000
    75%      116.500000
    max      372.000000


## II. PageRank

 A similar way to determine popularity is to look at their [pagerank](http://en.wikipedia.org/wiki/PageRank).  Pagerank is used for web ranking and was originally [patented](http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6285999) by Google and is essentially the [stationary distribution](http://en.wikipedia.org/wiki/Markov_chain#Stationary_distribution_relation_to_eigenvectors_and_simplices) of a [markov chain](http://en.wikipedia.org/wiki/Markov_chain) implied by the social graph.

Use 0.85 as the damping parameter so that there is a 15% chance of jumping to another vertex at random.

*Checkpoint:*

    Topp 100 .describe()
    count    100.000000
    mean       0.000185
    std        0.000076
    min        0.000124
    25%        0.000138
    50%        0.000162
    75%        0.000200
    max        0.000623
   

## III. Best Friends

Another interesting question is who tend to co-occur with each other.  Give us the 100 edges with the highest weights. Write a function which returns a list of 100 tuples of the form ((person1, person2), count) in descending order of count

    Topp 100 .describe()
    count    100.000000
    mean      25.070000
    std       15.647154
    min       13.000000
    25%       15.000000
    50%       19.000000
    75%       28.500000
    max      107.000000
   

In [517]:
import pickle
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
from collections import namedtuple
import re
from datetime import datetime
from itertools import chain
from urllib import request
import csv
import os
%matplotlib inline
import matplotlib
import seaborn as sns
import networkx as nx
import numpy as np
import pandas as pd
os.environ['TZ'] = 'USA/New York'

In [105]:
 #download the html pages which contain links to the picture pages. from page 3 to page 29 (as all contain dates within range)
 #the first page is specified by  http://www.newyorksocialdiary.com/party-pictures?page=3 
def loadpage(x):
    request.urlretrieve('http://www.newyorksocialdiary.com/party-pictures?page={}'.format(x), 'party-pictures/page_{}.html'.format(x))

p = Pool(10) # the max number of webpages to get at once
p.map(loadpage, range(3,29))

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

In [13]:
def get_page_urls(): 
    
    # get the filenames of all party-pictures pages which contain links the pages that contain the pcitures with captians
    page_urls = []
    for x in range(3,29):
        page_urls.append(os.path.dirname(os.path.realpath('NewYorkSocialDiary.ipynb'))+'/party-pictures/page_{}.html'.format(x))
    return page_urls

def get_album_urls(page_urls):
    
    # scrap and download all the html pages of the party pictures. The first page that contains the links specified by
    #url_stub concat '?page=3' as the dates on the page are greater and also less than december 1 2015.  
    #subsequent pages have `?page=X` appended, where X is an
    #integer from 4 to 28.
    album_urls = []
    
    for url in page_urls:
        # load the page
        f = open(url, 'r')
        s = f.read()
        # get the text for parsing
        soup = BeautifulSoup(s,"lxml")

        # select the chunk of html corresponding to each album listing
        album_divs = soup.select('div.view-content > div')

        #  get the url and year from this html chunk 
        for album in album_divs:
            # get year and date html
            l_date = album.select('span.views-field-created > span')
            l_url = album.select('span.views-field-title > span > a')
        
            # extract date
            date = datetime.strptime(l_date[0].text, '%A, %B %d, %Y')
            # ref date which to return the urls whose dates are before
            refdate = datetime(2015, 12, 1)
            url = l_url[0]['href']
            if date < refdate:
                album_urls.append('http://www.newyorksocialdiary.com' + url)
            
    return album_urls

au = get_album_urls(get_page_urls()) # gets all the required urls which contian the captians
print(au[0])

def loadpage(x):
    # downloads the page to disk for faster retrieval later
    a = x.rfind('/')
    request.urlretrieve(x, 'party-pictures/Pictures/{}.html'.format(x[a+1:]+'-'+x[x[:a].rfind('/')+1:a]))

p = Pool(10) # the max number of webpages to get at once
p.map(loadpage, au) # use multiprocessing to download multiple pages in parallel

#print(au)
print(len(au))

http://www.newyorksocialdiary.com/party-pictures/2015/treasures-of-new-york


MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f70684d8358>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)'

In [9]:
print(len(au))
def get_captains_from_album_urls(album_urls):
    # return the captains underneath all the pictures form all the urls
    captains = []
    for url in album_urls:
        a = url.rfind('/')
        # load the page
        f = open(os.path.dirname(os.path.realpath('NewYorkSocialDiary.ipynb'))+
                 '/party-pictures/Pictures/{}.html'.format(url[a+1:]+'-'+url[url[:a].rfind('/')+1:a]), 'r')
        s = f.read()
        # get the text for parsing
        soup = BeautifulSoup(s,"lxml")
        # select the chunk of html corresponding to each captain listing
        content_divs = soup.find('div', attrs={'class':'panel-pane pane-page-content'})
        # the type cannot be NoneType
        if type(content_divs) != 'NoneType':
            # all the captains are inside table tags with cellpadding attribute = 1
            picture_tables = content_divs.find_all('table', attrs={'cellpadding':'1'})  #Find *all*
            # go through all the tables 
            for pic in picture_tables:
                #select the part where their might be a captain    
                pic_captains = pic.select('tr > td > table > tr')
                # a captain is present only if length of oic_captains > 1
                if(len(pic_captains) > 1):
                    # loop through pic_captains from 1 skipping every second one as this is where the captians are 
                    for j in range(1, len(pic_captains), 2):
                        # extract the captain from the html part at the index of pic_captians
                        captain = pic_captains[j].select('td > div')
                        # only append captain[0].text if the length of captian > 0 (otherwise no captain is present)
                        if len(captain) > 0:
                            captains.append(captain[0].text)        
    return captains
captains = get_captains_from_album_urls(au) # gets all the captians
print('break')
#print(captains)

1279
break


In [176]:
import pickle

# Saving the objects for use later:
with open('objs.pickle', 'wb') as f:
    pickle.dump(captains, f) # 

# Getting back the objects:
#with open('objs.pickle') as f:  # Python 3: open(..., 'rb')
#    obj0, obj1, obj2 = pickle.load(f)

print(len(captains))
captains[-1:]

78783


['\nCharlotte\n                    Frieze, Eric and Patti Fast, Gregory Long, Julie Graham,\n                Robin Graham, Fernanda Kellogg, Dominique Browning, and Joan\n                Khoury\n']

In [589]:
import random
import regex

#with open('objs.pickle', 'rb') as f:
#    captainl = pickle.load(f) 

#rc = random.choices(range(0, len(captainl)), k = 10)

#for r in rc:
#    print(captainl[r])

def parse_captian(x):
    
    # tries in a best effort way (not ideal) no extract all the names by themselves form the captains. Uses a list of 
    # ignored phrases together with a list of abbreviatians for better preson name extraction
    
    # if the length of the captain is more than 249 then most probably it does not contain names so return an empty list
    if(len(x) > 249):
        return []
    
    ignoredPhrases = [
        
        'Charles Chang-Lima',
        'Under the VIP tent',
        'Brownies',
        'Girl', 
        '-Guest',
        '-Pittel',
        'friends',
        'The Jewish Museum',
        'Knight Landesmen in red (Art Forum)',
        'MD',
        'FACS',
        'PHD',
        'The scene at Mary Boone Gallery',
        'members',
        'member',
        'Board Members',
        'Board Member and Special Events Committee Chairman',
        'Ladies in fuchsia, pink, and red at Bunky Cushing\'s Valentine\'s Tea',
        'First Lady',
        'Governor of Massachusetts',
        '2013 Jacob\'s Pillow Dance Award Winner',
        'The Hon.',
        'Life Trustee',
        'receiving the Mayer Sulzberger Award from',
        'from Singapore',
        'Chairman Emeritus',
        'Ambassador',
        'to Jamaica Brenda Johnson',
        'Recent Pratt alumni',
        'UJA-Federation\'s 24th Annual Summerfest',
        'family',
        'Breguet Watches Group',
        'NYCB General Manager',
        'NYCB Board Chairman',
        'his wife',
        'East Hampton Library Board Chairman',
        'Seated:',
        'Standing',
        '*',
        'member of MSM\'s International Advisory Board',
        'The cake',
        'presents the Distinguished Alumni Award to Joseph Macnow \'67',
        'whose Home "Villa Mille Fiori" is in the book. Her Family called it "The Villa".',
        'MAD board chair',
        'guests',
        'Guitar Maker',
        'The tables set for dinner',
        'Director',
        'Executive',
        'in a party:',
        'actor',
        'in background at Making Musical Waves',
        'A dinner for 10 prepared by Chef', 
        'of the Altamarea Group featuring one pound of Sabatino white truffles auctioned for $50',
        'of Ron Ben-Israel Cakes',
        'Mrs.',
        'Miss Universe',
        'Paul Taylor Dance Company',
        'Table by',
        'ADC',
        'Intern',
        'Steering Committee',
        '"A Touch of Nature"',
        'of Liberty News',
        'of The Beastie Boys',
        'of Chicago',
        'of Town',
        'The Honorable',
        'The Rev.',
        'their daughter',
        'her husband',
        'her mother\'s late husbands.',
        'her mother',
        'Gala Chair',
        'Gala Chairwoman',
        'Gala Co Chairs',
        'Gala Co-Chair',
        'gala concert host',
        'Gala Honoree',
        'Gala Lead Chairman',
        'Gala Vice Chairs',
        'Gala Vice Chair',
        'Generosity Division Chair',
        'Generosity vice chair',
        'Museum President',
        'OSL Board Chairman',
        'OSL Board Member',
        'ACO Trustee',
        'Board Member',
        'Guest of Honor',
        '  Prince',
        'AMC Trustee',
        'Artist',
        'Carnegie Hall Trustee',
        'Benefit chairman',
        'Benefit chair',
        'AMC Board',
        'Board President',
        'NYCB dancers',
        'NYCB dancer',
        'Honorary',
        'Honoree',
        'Event Co-Chair-',
        'Event Co-Chairs',
        'Event Co-Chair',
        'Event Co-Chairs:',
        'Event Chair',
        'Event Chairman',
        'Event Chairmen',
        'Event Chairwoman',
        'Event Chairwomen',
        'Frick Trustee'
    ]
    
    abbreviated = [
        
        ['Kat Von D','Katherine von Drachenberg'],
        ['ra Lebenthal','Alexandra Lebenthal'],
        ['L.A. Reid','Antonio Marquis'],
        ['Jon M. Huntsman','Jon Meade Huntsman'],
        ['Kimberly Guilfoyle Villency','Kimberly Guilfoyle'],
        ['Ed Sheeran','Edward Christopher'],
        ['AlexandAlexandra','Alexandra']
    ]
    
    # replaces occurances of \xa0\s with a space
    x = regex.sub(r'[\xa0\s]+',' ',x)
    
    # removes ignores phrases
    for word in ignoredPhrases:
        # "word + ' '" is for deleting the trailing whitespace after each  captain.
        x = x.replace(word + ' ', '', re.IGNORECASE)
    
    # replaces the abbreviated names with thier full names
    for names in abbreviated:
        x = x.replace(names[0], names[1])
    
    #replaces occurances of spaces followed by a capital letter followed by r and then a point or occurances of brakets: ()
    # with whats inside the brakets with a space
    x = regex.sub(r'[ ]*[A-Z][r]?[.] *| [(].+[)]',' ',x)
    
    x = x.strip() #r trims the string removing spaces at the beginning and end
    
    #splits the string on into smaller strings by removing the seperators: 'K ', ', and ', ', ', ' and ', ' width ', ' & '
    # note that a space might not be present after a comma and or with.
    rs = regex.split(r'K |(?:, and |, ?| and | with ?| & )?', x)
    rs = list(filter(None, rs)) # remove all empty lists from the result 
    ps = []
    
    # loop on all names in rs and after parsing  append only the strings which seem to be names 
    # and disregard others which are not names
    for ns in rs:
        if ns.startswith('Princess') or ns.startswith('Prince'):
            ns =ns.replace(ns[0:ns.find(' ')+1], '')
            f = ns.find('of')
            if(f != -1):
                ns = ns[0:f-1]
        
        if ns.count(' ') < 6 and not ns[0:12] == 'President of' and not ns[0:6] == 'Guests' and not ns[0:2] == 'A ' \
        and not ns[0:2] == 'a ' and not ns == 'MD' and not ns == 'PhD' and not ns[0:6] == 'CEO of'\
        and not ns == 'friend' and not ns == 'friends' and not ns.strip() == '' and not ns == 'CEO' and not ns == 'his wife'\
        and not ns == 'President':
            ps.append(ns)
    
    return ps

captain_names_lists = []
# loop an all the captains and exract the names
for r in captainl:
     captain_names_lists.append(parse_captian(r)) # appends the list of extracted names to the captain_names_lists

# store the list of listd inside a csv file for later use
with open("captain_names.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerows(captain_names_lists)


In [594]:
Names_Edges = []

# get all the combinations of two friends form all the lists of names in the captain_names_list and store them in Names_Edges
for c in captain_names_lists:
    for i in range(0, len(c)):
        for j in range(i+1, len(c)):
            Names_Edges.append((c[i], c[j])) 
      
g = nx.Graph() # new graph  

# if an edge is present between two people add one to the weight. else add a new edge between the two with weight one
for e1,e2 in Names_Edges :
    if g.has_edge(e1,e2):
        g[e1][e2]['weight'] += 1
    else :
        g.add_edge(e1,e2,weight=1)

degrees = g.degree() # the dgree of the graph

# sort the degree of the graph
ges = sorted(degrees.items(), key=lambda x:x[1], reverse = True)
top100e = []

# get the top 100 persons with the most degrees (most popular)
for i in range(100):
    top100e.append([ges[i][0], ges[i][1]])
    
# print the result and the stats of the results
gesf = pd.DataFrame(top100e)
print(gesf)
gesf.describe()


                       0    1
0         Jean Shafiroff  296
1        Mark Gilbertson  293
2                   John  248
3        Gillian Miniter  187
4                  David  178
5    Alexandra Lebenthal  168
6     Geoffrey Bradfield  161
7        Kamie Lightburn  160
8        Debbie Bancroft  153
9          Somers Farkas  150
10         Andrew Saffir  149
11             Alina Cho  142
12    Lucia Hwong Gordon  141
13               Michael  140
14         Barbara Tober  137
15          Mario Buatta  129
16                 Peter  126
17         Yaz Hernandez  123
18           Sharon Bush  122
19        Martha Stewart  121
20      Eleanora Kennedy  120
21      Patrick McMullan  120
22               Barbara  120
23         Jamee Gregory  115
24         Allison Aston  113
25        Bettina Zilkha  112
26           Lydia Fenet  111
27               Richard  110
28                Robert  108
29   Muffie Potter Aston  108
..                   ...  ...
70        Bunny Williams   78
71        

Unnamed: 0,1
count,100.0
mean,103.0
std,41.350925
min,69.0
25%,77.0
50%,88.5
75%,112.25
max,296.0


In [595]:
pageranks = nx.pagerank(g) # page rank of the graph

# sort the page rank of the graph
ges = sorted(pageranks.items(), key=lambda x:x[1], reverse = True)
top100e = []

# get the top 100 persons with the higest page rank
for i in range(100):
    top100e.append([ges[i][0], ges[i][1]])

# print the result and the stats of the results
gesf = pd.DataFrame(top100e)
print(gesf)
gesf.describe()

                        0         1
0          Jean Shafiroff  0.000659
1                    John  0.000598
2         Mark Gilbertson  0.000592
3                   David  0.000467
4         Gillian Miniter  0.000462
5      Geoffrey Bradfield  0.000402
6     Alexandra Lebenthal  0.000386
7           Andrew Saffir  0.000384
8                 Michael  0.000359
9         Debbie Bancroft  0.000346
10        Kamie Lightburn  0.000335
11          Somers Farkas  0.000333
12                  Peter  0.000325
13                Barbara  0.000325
14          Barbara Tober  0.000309
15              Alina Cho  0.000289
16            Sharon Bush  0.000283
17                Richard  0.000281
18          Yaz Hernandez  0.000277
19                  Susan  0.000275
20     Lucia Hwong Gordon  0.000268
21       Eleanora Kennedy  0.000267
22           Mario Buatta  0.000265
23          Jamee Gregory  0.000263
24                 Robert  0.000256
25         Martha Stewart  0.000244
26            Lydia Fenet  0

Unnamed: 0,1
count,100.0
mean,0.000229
std,9.7e-05
min,0.000152
25%,0.000171
50%,0.000193
75%,0.000247
max,0.000659


In [598]:
# sort the edges by the weight of each edge
ges = sorted(g.edges(data=True), key=lambda data: data[2]['weight'], reverse = True)
top100e = []
# get the highest 100 edges (top 100 Best Friends)
for i in range(100):
    top100e.append(((ges[i][0], ges[i][1]), ges[i][2]['weight']))

# print the result and the stats of the results
gesf = pd.DataFrame(top100e)
print(gesf)
gesf.describe()

                                          0   1
0             (Stewart Lane, Bonnie Comley)  61
1          (Andrew Saffir, Daniel Benedict)  53
2         (Geoffrey Bradfield, Roric Tobin)  48
3        (Jay Diamond, Alexandra Lebenthal)  44
4              (Gillian, Sylvester Miniter)  36
5                    (Jamee, Peter Gregory)  35
6          (Deborah Norville, Karl Wellner)  29
7              (Gillian Miniter, Sylvester)  29
8       (Guy Robinson, Elizabeth Stribling)  28
9   (Sessa von Richthofen, Richard Johnson)  28
10         (Hilary Geary Ross, Wilbur Ross)  27
11               (Couri Hay, Janna Bullock)  24
12                (Marc Rosen, Arlene Dahl)  24
13         (Olivia Palermo, Johannes Huebl)  24
14                  (Donald Tober, Barbara)  23
15   (Jay McInerney, Anne Hearst McInerney)  23
16            (Mark Badgley, James Mischka)  23
17        (Fernanda Kellogg, Kirk Henckels)  22
18    (Frederick Anderson, Douglas Hannant)  21
19                   (Peter, Jamee Grego

Unnamed: 0,1
count,100.0
mean,16.92
std,9.063948
min,10.0
25%,12.0
50%,14.0
75%,18.25
max,61.0
