Complete your own implementation of a web spider. Combine it with the linguistic analysis code we saw previously, and use it to generate summaries of the pages. You can do this in one of two ways, either:

* Scrape a single page and then run your code on this page.
* Scrape a series of interlinked pages (e.g. by following the `a href` elements) and summarise each individual page, plus a broad summary of all pages.

### Approach

This challenge asks us to trawl through a webpage, or a network of pages, and generate summaries about them.

To do this I will:
* Start from the wiki entry on 'Social Media' here https://en.wikipedia.org/wiki/Social_media
* Work through the links in the 'Most popular Social Networks' table which leads to a wiki entry for each network
* Store the paragraph texts of each wiki entry into its own text object or file
* Apply TF-IDF to the texts for each wiki entry
* Provide a summary about Social Media as a whole and the specifics of each network based on the results of the TF-IDF scores

First, I need to function-ize my web scraper to make it easier to run on multiple webpages

In [1]:
# the requests library lets us source the html code from a webpage
import requests
#BeautifulSoup helps us read the html
from bs4 import BeautifulSoup as bs

def scrape_page(address):
    # send a GET request
    page = requests.get(address)

    # read the html stored in our response into BeautifulSoup, naming the parser explicitly
    # so that the result does not depend on which parsers happen to be installed
    return bs(page.text, 'html.parser')

In [2]:
webpage = 'https://en.wikipedia.org/wiki/Social_media'
soup = scrape_page(webpage)
# find the table we are interested in - it took some trial and error to find the right table index
table = soup.find_all('table')[1]

In [3]:
# lets try to isolate the links to the networks
network = table.find_all('a')
network

[<a href="/wiki/Facebook" title="Facebook">Facebook</a>,
 <a href="/wiki/United_States" title="United States"><img alt="United States" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a>,
 <a href="/wiki/United_States" title="United States">United States</a>,
 <a href="/wiki/YouTube" title="YouTube">YouTube</a>,
 <a href="/wiki/United_States" title="United States"><img alt="United States" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg

Hmm, tricky. There are other hyperlinks in this table, some that I don't want.

But the ones I DO want aren't a specified class either.

I think I'll have to look through the table's tr and td elements and only grab the links from the specific column I want.

In [4]:
network = []


# loop through the tr elements in the table
for row in table.find_all('tr'):
    # loop through each cell in the row, keeping track of its column index
    for col_num, cell in enumerate(row.find_all('td')):
        # we only want the web address from the column at index 1
        # (skip the cell if it does not contain a link)
        if col_num == 1 and cell.a is not None:
            network.append(cell.a['href'])


In [5]:
# lets see if that worked...
# we would expect to see the web address for Facebook at index 0 in the 'network' list
network[0]

'/wiki/Facebook'

Interesting, the href is a local path, which makes directly using the path more difficult, but not that much!
I'll quickly iterate over each of the addresses in network and add the base: 'https://en.wikipedia.org' to it.
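
As a side note, here is a small sketch of another way to do this step. The standard library's `urllib.parse.urljoin` resolves a relative href against a base URL and leaves links that are already absolute alone. The `hrefs` list is a made-up sample, not the scraped data.

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org'

# hypothetical sample of hrefs, standing in for the scraped 'network' list
hrefs = ['/wiki/Facebook', '/wiki/YouTube', 'https://en.wikipedia.org/wiki/Reddit']

# urljoin resolves relative paths against the base and keeps absolute URLs as they are
absolute = [urljoin(base, href) for href in hrefs]
print(absolute)
```

This is only an alternative; plain string concatenation works just as well here because every href in the table is a relative path.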

In [6]:
base = 'https://en.wikipedia.org'

# prepend the base address to every local path in the list
network = [base + address for address in network]

# lets check if I can request the contents of one of the network's wiki entries...
facebook = scrape_page(network[0])

#### Nice!
Let's try scraping all of the wiki entries and isolating the paragraph texts.

To efficiently scrape all the text from each of our web addresses in 'network', we're going to make a function that: 
* loops through the addresses
* calls our function scrape_page on the address
* finds the p-tags and extracts the text from them only

...and since our end goal is to perform TF-IDF on the text, we will also:
* tokenize the text into separate words
* count the frequency of each word in its document

This function will return a Pandas DataFrame to make the result easier to work with later on.
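
To make the tokenize-and-count idea concrete before involving the network, here is a minimal offline sketch. It uses only the standard library. The regular expression is an assumption that approximates what `wordpunct_tokenize` does (runs of word characters, or runs of punctuation), and the two paragraphs are invented.

```python
import re
from collections import Counter

# assumption: this pattern approximates NLTK's wordpunct_tokenize
# (runs of word characters, or runs of punctuation)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def count_tokens(text):
    """Split text into tokens and count how often each one occurs."""
    return Counter(TOKEN_PATTERN.findall(text))

# hypothetical paragraphs standing in for the text of two <p> tags
paragraphs = ["Facebook is a social network.", "Facebook is online."]

# adding Counters together accumulates the counts across paragraphs
counts = Counter()
for paragraph in paragraphs:
    counts += count_tokens(paragraph)

print(counts)
```

Note that punctuation such as `.` is counted as a token too, which becomes relevant later on.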

In [7]:
# let's import the libraries we're going to use
from collections import defaultdict # to create a dynamic dictionary where we do not have to initialise each key: value pair
from collections import Counter # to count the frequency of each word
from nltk.tokenize import wordpunct_tokenize # to split the text into words
import pandas as pd # for our dataframe

In [8]:
def scrape_and_tokenize(network):
    """
    Scrapes the <p> tag text from wikipedia entries and tokenizes the results into a Counter object.
    
    params: 
        network: Type = List, Content = web addresses to wikipedia entries
    
    returns:
        Pandas DataFrame 
    """
    
    # create a dictionary to hold the word frequencies for each wiki entry
    # default dict means that we don't have to initialise each wiki entry name
    word_tf = defaultdict(Counter)
    
    # loop through the wiki entry addresses
    for address in network:
        # scrape the webpage and find all the p-tags
        paras = scrape_page(address).body.find('div', attrs={'class': 'mw-parser-output'}).find_all('p')
        for p_tag in paras: # loop through the p-tags and...
            # ...get the text, split it into words, count the frequency of each word and...
            # ...add the results to the word-tf dictionary under the name of the social media platform (wiki entry)
            word_tf[address.partition('wiki/')[2]] += Counter(wordpunct_tokenize(p_tag.get_text()))
            # move on to the next wiki entry (address)

    # create a DataFrame out of the results
    word_df = pd.DataFrame.from_dict(word_tf, orient='index')
    # replace the NaN values (where a word appears in another page but not in this one) with zeros
    word_df.fillna(value=0, inplace=True)
    
    # return the DataFrame whenever this function is called
    return word_df

In [9]:
# okay let's test that out
# we call the function with the list of addresses stored in 'network'
result = scrape_and_tokenize(network)

In [10]:
result

Unnamed: 0,Facebook,is,an,American,online,social,media,and,networking,service,...,humanoid,animals,mythological,creatures,disallowed,periodic,Unicorn,Riot,Identity,Evropa
Facebook,352.0,45,41,10.0,8.0,32.0,23.0,310,5.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
YouTube,4.0,71,66,2.0,17.0,9.0,13.0,323,3.0,38.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WhatsApp,29.0,34,28,3.0,0.0,8.0,11.0,125,1.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Facebook_Messenger,42.0,9,5,1.0,0.0,0.0,1.0,38,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WeChat,5.0,28,16,1.0,6.0,7.0,3.0,110,1.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Instagram,25.0,22,32,1.0,2.0,10.0,8.0,191,2.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TikTok,1.0,30,16,3.0,0.0,3.0,5.0,96,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sina_Weibo,3.0,26,17,0.0,2.0,7.0,10.0,83,1.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Reddit,2.0,36,39,2.0,4.0,5.0,1.0,213,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Snapchat,2.0,35,33,1.0,5.0,8.0,8.0,130,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### We have frequencies!

Great! We now have a series of functions that scrape the text from target wikipedia pages and count the frequency of each word.

Now we can progress with applying the NLP TF-IDF algorithm on this data set.

How about a reminder of the TF-IDF algorithm?

## TF-IDF
#### Term Frequency-Inverse Document Frequency

This algorithm scores each term (word) based on its relative importance in singular documents (*TF*), normalised by its commonness in a group of documents (*IDF*)

*For clarity here let's define some terms we will use when building the TF-IDF algorithm*:
* corpus = a collection of documents
* documents = a string of text (in this case our wikipedia entries are considered documents)
* term = a singular tokenized word from a document
* tokenized = a singular word that has been isolated from other words in the document (terms are tokenized text documents)

#### TF
To calculate the TF of a term, we divide the frequency of the term ($f_{t}$) by the total number of terms in the document ($|T|$), counting every occurrence rather than only the unique terms.

$$TF = \frac{f_{t}}{|T|}$$

#### IDF
To calculate the IDF of a term, we take the logarithm of the *total number of documents* divided by the *count of documents in the corpus that term appears in*.

$$idf_{i} = \log\frac{|D|}{|df_{i}|}$$
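
As a worked example, take the word 'Facebook' in the Facebook entry, using figures that appear in this notebook's outputs: a count of 352 out of 14208 terms, and a document frequency of 12 in a corpus of 19 documents. The logarithm is the natural log, which is what `np.log` computes.

$$TF = \frac{352}{14208} \approx 0.02477, \qquad idf = \log\frac{19}{12} \approx 0.4595, \qquad TF \cdot idf \approx 0.01138$$

A word that appears in every document, such as 'is', gets $idf = \log\frac{19}{19} = 0$, so its TF-IDF score is 0 no matter how often it occurs.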

For efficiency, I think we can insert the TF step into our existing functions, let's do a retake of the scrape_and_tokenize function.

In [11]:
def scrape_and_getTF(network):
    """
    Scrapes the <p> tag text from wikipedia entries and tokenizes the results into a Counter object.
    Calculates the TermFrequency (wordCount / totalWords) for each word in each document and stores the results in a Dict
    
    params: 
        network: Type = List, Content = web addresses to wikipedia entries
    
    returns:
        Pandas DataFrame 
    """
    
    # create a dictionary to hold the word frequencies for each wiki entry
    # default dict means that we don't have to initialise each wiki entry name
    word_tfs = defaultdict(dict)
    
    # loop through the wiki entry addresses
    for address in network:
        #print(address) #debug line
        # scrape the webpage and find all the p-tags
        paras = scrape_page(address).body.find('div', attrs={'class': 'mw-parser-output'}).find_all('p')
        tf = Counter()
        for p_tag in paras: # loop through the p-tags and...
            # ...get the text, split it into words, count the frequency of each word
            tf += Counter(wordpunct_tokenize(p_tag.get_text()))
        
        # sum the total number of words to divide by
        total_terms = sum(tf.values())
        # calculate the TermFrequency (count/total words) of each word, create a dict of those results
        # add the results to the word-tfs dictionary under the name of the social media platform (wiki entry)
        word_tfs[address.partition('wiki/')[2]] = {t: tf[t]/total_terms for t in tf}
        # move on to the next wiki entry (address)

    # create a DataFrame out of the results
    word_df = pd.DataFrame.from_dict(word_tfs, orient='index')
    # replace the NaN values (where a word appears in another page but not in this one) with zeros
    word_df.fillna(value=0, inplace=True)
    
    # return the DataFrame whenever this function is called
    return word_df#, word_tfs #debug line

What did we change?
* renamed the word_tf dict to word_tfs to indicate that this dict holds multiple documents
* took the TF section out of the p-tag for loop so that we can work with all the words in one document (not just within the p-tag)
* added a total_terms object that gives us the denominator for the TF calculation
* added a step to calculate the TF of each word before adding to the word_tfs dict
* renamed our function to 'scrape_and_getTF'

Let's give it a try

In [12]:
result = scrape_and_getTF(network)

In [13]:
result

Unnamed: 0,Facebook,is,an,American,online,social,media,and,networking,service,...,humanoid,animals,mythological,creatures,disallowed,periodic,Unicorn,Riot,Identity,Evropa
Facebook,0.024775,0.003167,0.002886,0.000704,0.000563,0.002252,0.001619,0.021819,0.000352,0.000915,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
YouTube,0.000226,0.004004,0.003722,0.000113,0.000959,0.000508,0.000733,0.018216,0.000169,0.002143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WhatsApp,0.004269,0.005005,0.004122,0.000442,0.0,0.001178,0.001619,0.018401,0.000147,0.002208,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Facebook_Messenger,0.025347,0.005432,0.003018,0.000604,0.0,0.0,0.000604,0.022933,0.0,0.002414,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WeChat,0.000979,0.005484,0.003134,0.000196,0.001175,0.001371,0.000588,0.021543,0.000196,0.00235,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Instagram,0.002746,0.002417,0.003515,0.00011,0.00022,0.001098,0.000879,0.02098,0.00022,0.001318,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TikTok,0.000208,0.006225,0.00332,0.000623,0.0,0.000623,0.001038,0.019921,0.000208,0.000208,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sina_Weibo,0.00081,0.007023,0.004592,0.0,0.00054,0.001891,0.002701,0.02242,0.00027,0.00108,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Reddit,0.000205,0.003686,0.003993,0.000205,0.00041,0.000512,0.000102,0.02181,0.0,0.000614,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Snapchat,0.000287,0.005027,0.00474,0.000144,0.000718,0.001149,0.001149,0.018673,0.0,0.001005,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Cool!
Each value is now the word's count divided by the total number of words in its document.

You want proof? Take the count of the word 'Facebook' in the Facebook wiki entry, as returned by the first function we created (scrape_and_tokenize), and divide it by the total number of words in that entry.

In [14]:
word_Facebook_count = 352
total_FB_words = 14208
print(word_Facebook_count / total_FB_words)

0.024774774774774775


The result matches our new function - yay!

Now for the IDF part.

I think we can more easily perform the IDF calculation using vectorized operations rather than looping through each word in each document.

The plan is to create an IDF value for each word as a vector, which will be the result of:
- taking the total number of rows (documents)
- dividing it by the count of rows where the value > 0, for each column
- and taking the logarithm of the result

Then, calculate the TF-IDF score by multiplying the result *matrix* (DataFrame) by the IDF value *vector* (Series)
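
Before doing this on the real table, here is a hand-checkable sketch of the same arithmetic on a toy corpus. It deliberately uses plain dictionaries instead of pandas so each step is visible. The document names, terms and TF values are all invented.

```python
import math

# hypothetical TF table: one row per document, one column per term
tf_table = {
    'doc_a': {'social': 0.50, 'video': 0.00, 'the': 0.50},
    'doc_b': {'social': 0.25, 'video': 0.25, 'the': 0.50},
}
terms = ['social', 'video', 'the']
n_docs = len(tf_table)

# document frequency: number of rows in which the column value is > 0
doc_freq = {t: sum(1 for row in tf_table.values() if row[t] > 0) for t in terms}

# IDF: log(total documents / documents containing the term)
idf = {t: math.log(n_docs / doc_freq[t]) for t in terms}

# TF-IDF: every TF value multiplied by the IDF of its column
tf_idf = {doc: {t: row[t] * idf[t] for t in terms} for doc, row in tf_table.items()}
print(tf_idf)
```

Terms that occur in both toy documents ('social' and 'the') end up with a score of 0, and only 'video' in doc_b keeps a positive score. The pandas version does the same thing for every column at once.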

In [15]:
# first lets test if we can do this as a vectorized approach

# importing numpy for its maths
import numpy as np

idf_filter = result > 0
idf_filter.head()

# this shows us that we can identify which documents each word appears in

Unnamed: 0,Facebook,is,an,American,online,social,media,and,networking,service,...,humanoid,animals,mythological,creatures,disallowed,periodic,Unicorn,Riot,Identity,Evropa
Facebook,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
YouTube,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
WhatsApp,True,True,True,True,False,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
Facebook_Messenger,True,True,True,True,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
WeChat,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False


In [16]:
idf_numeric_filter = idf_filter.astype(int)
idf_numeric_filter.head()

# this shows us we can convert that filter into a numeric form

Unnamed: 0,Facebook,is,an,American,online,social,media,and,networking,service,...,humanoid,animals,mythological,creatures,disallowed,periodic,Unicorn,Riot,Identity,Evropa
Facebook,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
YouTube,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
WhatsApp,1,1,1,1,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Facebook_Messenger,1,1,1,1,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
WeChat,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [17]:
idf_vector = idf_numeric_filter.sum(axis=0)
idf_vector
# this shows us we can create a vector of document frequencies, which the IDF values are built from

Facebook    12
is          19
an          19
American    13
online      13
            ..
periodic     1
Unicorn      1
Riot         1
Identity     1
Evropa       1
Length: 10677, dtype: int64

#### Cool!
Let's put that all together and calculate the TF-IDF values for each word in our documents.

In [18]:
# importing numpy for its maths
import numpy as np

# log of (number of rows in the result table = number of documents) / (number of documents the word appears in)
idf_values = np.log(len(result.index) / (result > 0).sum(axis=0))
idf_values

Facebook    0.459532
is          0.000000
an          0.000000
American    0.379490
online      0.379490
              ...   
periodic    2.944439
Unicorn     2.944439
Riot        2.944439
Identity    2.944439
Evropa      2.944439
Length: 10677, dtype: float64

In [19]:
# multiply the TF scores by the IDF scores to create the TF-IDF values
tf_idf_table = result * idf_values

In [20]:
tf_idf_table

Unnamed: 0,Facebook,is,an,American,online,social,media,and,networking,service,...,humanoid,animals,mythological,creatures,disallowed,periodic,Unicorn,Riot,Identity,Evropa
Facebook,0.011385,0.0,0.0,0.000267,0.000214,0.000387,0.00018,0.0,0.000192,4.9e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
YouTube,0.000104,0.0,0.0,4.3e-05,0.000364,8.7e-05,8.2e-05,0.0,9.2e-05,0.000116,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WhatsApp,0.001962,0.0,0.0,0.000168,0.0,0.000202,0.00018,0.0,8e-05,0.000119,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Facebook_Messenger,0.011648,0.0,0.0,0.000229,0.0,0.0,6.7e-05,0.0,0.0,0.000131,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WeChat,0.00045,0.0,0.0,7.4e-05,0.000446,0.000236,6.5e-05,0.0,0.000107,0.000127,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Instagram,0.001262,0.0,0.0,4.2e-05,8.3e-05,0.000189,9.8e-05,0.0,0.00012,7.1e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TikTok,9.5e-05,0.0,0.0,0.000236,0.0,0.000107,0.000115,0.0,0.000113,1.1e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sina_Weibo,0.000372,0.0,0.0,0.0,0.000205,0.000325,0.0003,0.0,0.000148,5.8e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Reddit,9.4e-05,0.0,0.0,7.8e-05,0.000155,8.8e-05,1.1e-05,0.0,0.0,3.3e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Snapchat,0.000132,0.0,0.0,5.5e-05,0.000273,0.000197,0.000128,0.0,0.0,5.4e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# test for the word 'Facebook' in Facebook
facebook_word_in_Facebook = result.iloc[0,0]
idf_value_for_facebook = idf_values.iloc[0] # .iloc makes the positional lookup explicit
manual_tfidf = facebook_word_in_Facebook * idf_value_for_facebook
print("Manual TFIDF calc = " + str(manual_tfidf) + "\n" +
"Programmed TFIDF calc = " + str(tf_idf_table.iloc[0,0]))

Manual TFIDF calc = 0.011384809962078472
Programmed TFIDF calc = 0.011384809962078472


In [22]:
# test for the word 'is' in Facebook
is_word_in_Facebook = result.iloc[0,1]
idf_value_for_is = idf_values.iloc[1] # .iloc makes the positional lookup explicit
manual_tfidf = is_word_in_Facebook * idf_value_for_is
print("Manual TFIDF calc = " + str(manual_tfidf) + "\n" +
"Programmed TFIDF calc = " + str(tf_idf_table.iloc[0,1]))

Manual TFIDF calc = 0.0
Programmed TFIDF calc = 0.0


#### Awesome!

Now for the hard part...

How can we interpret the TF-IDF values to allow us to summarise the content of each document?

In [23]:
# to provide a summary for each document, we can iterate through the index in our table
for platform in tf_idf_table.index:
    # isolate the top 5 highest scoring tf-idf values
    keywords = tf_idf_table.loc[platform, :].nlargest(5).index
    #...and print them out in a nice presentable text
    print("""Here are some keywords to describe the {platform} Wikipedia entry:
    - {kw0}
    - {kw1}
    - {kw2}
    - {kw3}
    - {kw4}
    """.format(platform = platform, kw0 = keywords[0], kw1 = keywords[1], kw2 = keywords[2], kw3 = keywords[3], 
               kw4 = keywords[4] )
         )

Here are some keywords to describe the Facebook Wikipedia entry:
    - Facebook
    - Zuckerberg
    - News
    - Cambridge
    - Analytica
    
Here are some keywords to describe the YouTube Wikipedia entry:
    - YouTube
    - children
    - channels
    - site
    - streaming
    
Here are some keywords to describe the WhatsApp Wikipedia entry:
    - WhatsApp
    - Koum
    - Acton
    - Nokia
    - Telegram
    
Here are some keywords to describe the Facebook_Messenger Wikipedia entry:
    - Messenger
    - Facebook
    - ]:
    - bots
    - Coldewey
    
Here are some keywords to describe the WeChat Wikipedia entry:
    - WeChat
    - Pay
    - Moments
    - envelopes
    - mini
    
Here are some keywords to describe the Instagram Wikipedia entry:
    - Instagram
    - Systrom
    - Stories
    - ][
    - Krieger
    
Here are some keywords to describe the TikTok Wikipedia entry:
    - TikTok
    - ByteDance
    - Douyin
    - ly
    - lip
    
Here are some keywords to describe 

#### Feedback
It looks like the most relevant keywords returned for each wikipedia entry almost always begin with the name of the platform itself. Maybe it would be best to exclude the platform name from its own document by dropping it at the tokenize stage.
1. Remove platform name from document at tokenize stage

Some words look like variations of the same word, for example in the LinkedIn entry 'professional' and 'professionals' are two of the highest scoring terms. There is a preprocessing technique called *stemming* which reduces words to the stem of their meaning (a closely related technique, *lemmatizing*, reduces words to their dictionary form). For example, we could reduce both these terms to 'professional' after the tokenize stage.
2. Stem the words after tokenizing

Similarly, we have cases where the algorithm is case-sensitive. We can mitigate this by converting all the terms to lower case.
3. Convert all terms to lower case when we tokenize

Some keywords showing up are actually special characters, such as '[]'. We can remove special characters with some parameters in the tokenize process
4. Remove special characters during tokenize

Stopwords such as 'the', 'he' and 'she' do not add much semantic value and tend to clog up the analysis. Let's remove them with a well known list of stopwords such as the one provided by the NLTK library.
5. Remove stop words
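
As a rough illustration of points 1, 3, 4 and 5, here is a minimal offline sketch using only the standard library. The stopword set is a tiny hand-picked stand-in for NLTK's list, `clean_tokens` is a hypothetical helper name, and the sample sentence is invented. Stemming (point 2) is left out of this sketch.

```python
import re

# assumption: a tiny hand-picked stopword set, standing in for NLTK's English list
STOPWORDS = {'the', 'is', 'a', 'an', 'and', 'of', 'by'}

def clean_tokens(text, platform_name):
    """Lowercase, tokenize, and drop punctuation, stopwords and the platform's own name."""
    # split the name on underscores so 'Facebook_Messenger' gives {'facebook', 'messenger'}
    name_parts = set(platform_name.lower().split('_'))
    # \w+ only matches runs of letters, digits and underscores,
    # so punctuation tokens such as '][' are never produced
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in name_parts]

sample = "Facebook Messenger is a messaging app[1][2] developed by Facebook."
print(clean_tokens(sample, 'Facebook_Messenger'))
```

Matching only word characters is one way to avoid listing every punctuation combination by hand, at the cost of also splitting hyphenated words and contractions.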

Let's try to implement those then...

In [24]:
from nltk.corpus import stopwords # provides our list of generic stopwords
from nltk.stem import WordNetLemmatizer # Lemmatizing is the process of reducing a word to its dictionary form (lemma)


wordnet_lemmatizer = WordNetLemmatizer() # create the Lemmatizer object
stopwords = set(stopwords.words('english')) # store the stop words in an iterable set for later
punctuation = ["'",'"','?',':','!','.',',',';','[',']','{','}', ']:', ']['] 
# had to include specific punctuation combinations to remove from the dataset, e.g. ][, as I could not work out how to remove them more generally

def scrape_and_getTF(network):
    """
    Scrapes the <p> tag text from wikipedia entries, cleans and tokenizes the results into a Counter object.
    Calculates the TermFrequency (wordCount / totalWords) for each word in each document and stores the results in a Dict
    
    params: 
        network: Type = List, Content = web addresses to wikipedia entries
    
    returns:
        Pandas DataFrame 
    """
    
    # create a dictionary to hold the word frequencies for each wiki entry
    # default dict means that we don't have to initialise each wiki entry name
    word_tfs = defaultdict(dict)
    
    # loop through the wiki entry addresses
    for address in network:
        # store platform name to exclude from tokens later
        platform_name = address.partition('wiki/')[2].lower()
        # scrape the webpage and find all the p-tags
        paras = scrape_page(address).body.find('div', attrs={'class': 'mw-parser-output'}).find_all('p')
        tf = Counter() # initiate a counter object
        for p_tag in paras: # loop through the p-tags and...
            # ...get the text, split it into words
            tokens = wordpunct_tokenize(p_tag.get_text().lower())
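            # remove stopwords, punctuation and the parts of the platform name
            # (split on '_' so that only whole words such as 'facebook' or 'messenger' are dropped,
            # rather than any token that happens to be a substring of the name)
            name_parts = platform_name.split('_')
            tokens = [wordnet_lemmatizer.lemmatize(w) for w in tokens if 
                      w not in stopwords and 
                      w not in punctuation and 
                      w not in name_parts
                     ]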
            tf += Counter(tokens) # count the frequency of each word and store it in tf
        
        # now for the TF part of TF-IDF
        # sum the total number of words to divide by
        total_terms = sum(tf.values())
        # calculate the TermFrequency (count/total words) of each word, create a dict of those results
        # add the results to the word-tfs dictionary under the name of the social media platform (wiki entry)
        word_tfs[address.partition('wiki/')[2]] = {t: tf[t]/total_terms for t in tf}
        # move on to the next wiki entry (address)

    # create a DataFrame out of the results
    word_df = pd.DataFrame.from_dict(word_tfs, orient='index')
    # replace the NaN values (where a word appears in another page but not in this one) with zeros
    word_df.fillna(value=0, inplace=True)
    
    # return the DataFrame whenever this function is called
    return word_df#, word_tfs #debug line

In [25]:
import numpy as np

result = scrape_and_getTF(network)

idf_values = np.log(len(result.index) / (result > 0).sum(axis=0))
tf_idf_table = result * idf_values
tf_idf_table

Unnamed: 0,american,online,social,medium,networking,service,based,menlo,park,california,...,cub,furry,artwork,shotacon,allowable,humanoid,mythological,disallowed,periodic,evropa
Facebook,0.000526,0.000308,0.000736,0.000336,0.000344,0.000136,0.000238,0.000741,0.000465,0.000565,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
YouTube,0.000117,0.000594,0.000176,0.000182,0.000168,0.000271,0.000334,0.0,0.000189,0.000306,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WhatsApp,0.0003,0.0,0.000362,0.000352,0.000144,0.000399,0.000226,0.0,0.0,0.000394,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Facebook_Messenger,0.000429,0.0,0.0,0.000251,0.0,0.000244,0.000388,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WeChat,0.000137,0.000659,0.000433,0.00012,0.000197,0.000447,0.000433,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Instagram,7.6e-05,0.000122,0.000344,0.000178,0.000219,0.000227,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TikTok,0.000577,0.000116,0.000196,0.000211,0.000208,2.1e-05,0.000196,0.0,0.0,0.000568,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Reddit,0.000141,0.00034,0.00016,6.2e-05,0.0,9e-05,0.000223,0.0,0.0,0.000139,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Snapchat,0.000101,0.000406,0.000366,0.000237,0.0,0.00013,0.000274,0.0,0.0,0.000199,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Twitter,0.000359,0.000173,0.000732,0.000221,0.00031,0.000276,0.000228,0.0,0.000175,0.000141,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# to provide a summary for each document, we can iterate through the index in our table
for platform in tf_idf_table.index:
    # isolate the top 5 highest scoring tf-idf values
    keywords = tf_idf_table.loc[platform, :].nlargest(5).index
    #...and print them out in a nice presentable text
    print("""Here are some keywords to describe the {platform} Wikipedia entry:
    - {kw0}
    - {kw1}
    - {kw2}
    - {kw3}
    - {kw4}
    """.format(platform = platform, kw0 = keywords[0], kw1 = keywords[1], kw2 = keywords[2], kw3 = keywords[3], 
               kw4 = keywords[4] )
         )

Here are some keywords to describe the Facebook Wikipedia entry:
    - zuckerberg
    - cambridge
    - analytica
    - infowars
    - political
    
Here are some keywords to describe the YouTube Wikipedia entry:
    - channel
    - child
    - site
    - streaming
    - creator
    
Here are some keywords to describe the WhatsApp Wikipedia entry:
    - koum
    - acton
    - facebook
    - nokia
    - telegram
    
Here are some keywords to describe the Facebook_Messenger Wikipedia entry:
    - bot
    - coldewey
    - standalone
    - encrypted
    - eff
    
Here are some keywords to describe the WeChat Wikipedia entry:
    - moment
    - envelope
    - pay
    - tencent
    - mini
    
Here are some keywords to describe the Instagram Wikipedia entry:
    - systrom
    - photo
    - krieger
    - #
    - story
    
Here are some keywords to describe the TikTok Wikipedia entry:
    - bytedance
    - douyin
    - musical
    - ly
    - song
    
Here are some keywords to describe the

#### Much better!

Now our keywords are more useful: they no longer repeat the platform name, and they are mostly meaningful words rather than punctuation or stopwords!

But can I summarise the whole corpus?

In [27]:
print('Here are some words that generally describe all the articles in this analysis:')
for platform in tf_idf_table.index:
    summary_word = tf_idf_table.loc[platform, :].nsmallest(2).index
    print("- {word}".format(word=summary_word[0]))

Here are some words that generally describe all the articles in this analysis:
- company
- menlo
- online
- online
- menlo
- based
- menlo
- networking
- networking
- menlo
- networking
- menlo
- online
- american
- american
- american
- american
- american
- american


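An ok representation given my knowledge of the subject area. Whilst not a perfect description, I think this does describe the articles quite well. One caveat: the lowest possible TF-IDF score is 0, and a term scores 0 both when it appears in every article and when it does not appear in that article at all (for example, 'menlo' scores 0.0 for YouTube in the table above), so some of these words only describe a few of the articles. It's a start!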

To develop this project further I would like to figure out how to best summarize the whole corpus, for example does taking the median score TF-IDF value word provide a better summary?
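
One idea I have not tried on the real data, sketched here under stated assumptions: rank terms by how many documents actually contain them, and break ties by their mean TF. This avoids picking terms that score 0 simply because they are missing from a page. The TF values below are invented and `corpus_keywords` is a hypothetical helper.

```python
# hypothetical TF values per document (rows) and term (columns)
tf_table = {
    'doc_a': {'social': 0.30, 'network': 0.20, 'video': 0.00},
    'doc_b': {'social': 0.20, 'network': 0.10, 'video': 0.40},
    'doc_c': {'social': 0.10, 'network': 0.00, 'video': 0.10},
}

def corpus_keywords(tf_table, top_n=2):
    """Rank terms by how many documents contain them, then by mean TF."""
    terms = {t for row in tf_table.values() for t in row}
    n_docs = len(tf_table)
    scores = {}
    for term in terms:
        values = [row.get(term, 0.0) for row in tf_table.values()]
        doc_freq = sum(1 for v in values if v > 0)
        scores[term] = (doc_freq, sum(values) / n_docs)
    # tuples sort by document frequency first, mean TF second
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(corpus_keywords(tf_table))
```

Whether this gives a better summary than the median TF-IDF idea would need testing on the scraped corpus.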

I would also be interested in seeing if we could structure the summary in such a way that it looks as though a human is summarising the information. For example:
"The Facebook article is about an American Social Media platform"
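
A very simple first step in that direction could be a sentence template filled in with the top keywords. This is only a sketch: `describe` is a hypothetical helper and the wording of the template is my own choice.

```python
def describe(platform, keywords):
    """Build a one-sentence summary from a platform name and its top keywords."""
    name = platform.replace('_', ' ')
    if not keywords:
        return "The {} article has no keywords to report.".format(name)
    if len(keywords) == 1:
        listed = keywords[0]
    else:
        # join all but the last keyword with commas, then add the last with 'and'
        listed = ", ".join(keywords[:-1]) + " and " + keywords[-1]
    return "The {} article is mostly about {}.".format(name, listed)

# keywords copied from the Facebook_Messenger output above
print(describe('Facebook_Messenger', ['bot', 'coldewey', 'standalone']))
```

A template like this will not read as naturally as the example sentence, since it has no notion of grammar or of which keywords are nouns or names.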