# urlExpander Quickstart
[NBViewer](http://nbviewer.jupyter.org/github/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb?flush_cache=true) | [GitHub](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | By [Leon Yin](leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)


[urlExpander](https://github.com/SMAPPNYU/urlExpander) is a Python package for quickly and thoroughly expanding URLs.

You can download the software using pip:

In [None]:
!pip install urlexpander runtimestamp -U

In [1]:
import urlexpander as ux
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('Leon')
print(f"This notebook is using urlExpander v{ux.__version__}.")

Updated 2018-07-13 11:18:10.601561
By Leon
Using Python 3.6.5
On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core
This notebook is using urlExpander v0.0.19.


Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [2]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the `expand` function to unshorten any link:

In [3]:
ux.expand(urls[0])

{'original_url': 'https://trib.al/xXI5ruM',
 'resolved_domain': 'breitbart.com',
 'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'}
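To make "expanding" concrete, here is a toy model of redirect following. It is only a sketch and not urlExpander's implementation: `expand_sketch`, `FAKE_REDIRECTS`, and the `.example` URLs are hypothetical, the redirect table stands in for real HTTP responses, and stripping a leading `www.` to get the domain is an assumption. The default limit of 30 hops simply echoes the "Exceeded 30 redirects." message that shows up in a later output.

```python
from urllib.parse import urlparse

# Hypothetical redirect table standing in for real HTTP responses.
FAKE_REDIRECTS = {
    "https://short.example/a": "https://hop.example/b",
    "https://hop.example/b": "https://www.news.example/story",
    "https://short.example/loop": "https://short.example/loop",
}


def expand_sketch(url, redirects=FAKE_REDIRECTS, max_redirects=30):
    """Follow redirects from `url` and report where the chain ends."""
    current = url
    hops = 0
    while current in redirects:
        if hops == max_redirects:
            raise RuntimeError(f"Exceeded {max_redirects} redirects.")
        current = redirects[current]
        hops += 1

    # Assumption: the reported domain is the host without a leading "www.".
    domain = urlparse(current).netloc
    if domain.startswith("www."):
        domain = domain[len("www."):]

    return {
        "original_url": url,
        "resolved_domain": domain,
        "resolved_url": current,
    }


record = expand_sketch("https://short.example/a")
```

The returned record has the same three keys as the output above, which is what makes the results easy to load into a table later.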

To save compute time, we can skip links that don't need to be expanded.<br>
The `is_short` function takes any URL and checks whether its domain is on a known list of link shorteners.

In [4]:
print(f"{urls[1]} returns:")
ux.is_short(urls[1])

http://bit.ly/1Sv81cj returns:


True

bit.ly is probably the best-known link shortener. youtube.com, however, is not a link shortener!

In [5]:
print(f"{urls[2]} returns:")
ux.is_short(urls[2])

https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [6]:
known_shorteners = ux.constants.all_short_domains.copy()
known_shorteners[:25]

['sh.st',
 'adf.ly',
 'lnx.lu',
 'adfoc.us',
 'dlvr.it',
 'bit.ly',
 'buff.ly',
 'ow.ly',
 'goo.gl',
 'shar.es',
 'ift.tt',
 'fb.me',
 'washex.am',
 'smq.tc',
 'trib.al',
 'is.gd',
 'paper.li',
 'waa.ai',
 'tinyurl.com',
 'ht.ly',
 '1.usa.gov',
 'deck.ly',
 'bit.do',
 'tiny.cc',
 'lc.chat']

You can modify this list, or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level).

In [7]:
known_shorteners += ['youtube.com']

In [8]:
print(f"Now {urls[2]} returns:")
ux.is_short(urls[2], list_of_domains=known_shorteners)

Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


True
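For intuition, a check like this can be thought of as a membership test on the URL's domain. The sketch below is an assumption about the general approach, not the library's code: `is_short_sketch`, `is_short_domain_sketch`, and `SHORT_DOMAINS_SAMPLE` are hypothetical names, and the sample holds only a few of the domains listed above.

```python
from urllib.parse import urlparse

# A handful of shortener domains from the list above (not the full list).
SHORT_DOMAINS_SAMPLE = ["bit.ly", "trib.al", "ow.ly", "goo.gl"]


def is_short_domain_sketch(domain, list_of_domains=SHORT_DOMAINS_SAMPLE):
    """Domain-level check: no URL parsing needed."""
    return domain.lower() in set(list_of_domains)


def is_short_sketch(url, list_of_domains=SHORT_DOMAINS_SAMPLE):
    """URL-level check: extract the host, then defer to the domain check."""
    domain = urlparse(url).netloc.lower()
    # Assumption: a leading "www." is ignored when matching.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return is_short_domain_sketch(domain, list_of_domains)


bitly_is_short = is_short_sketch("http://bit.ly/1Sv81cj")
youtube_is_short = is_short_sketch("https://www.youtube.com/watch?v=8NwKcfXvGl4")
youtube_with_custom_list = is_short_sketch(
    "https://www.youtube.com/watch?v=8NwKcfXvGl4",
    list_of_domains=SHORT_DOMAINS_SAMPLE + ["youtube.com"],
)
```

The domain-level function skips URL parsing entirely, which is one plausible reason a domain-level check would be faster when the domain is already available as a column.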

Now we can shorten our workload:

In [9]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if ux.is_short(link)]
urls_to_shorten

['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj']

urlExpander's `multithread_expand()` does the heavy lifting to quickly and thoroughly expand a list of links:

In [11]:
resolved_links = ux.multithread_expand(urls_to_shorten,  
                                       n_workers=2,
                                       return_errors=False)

1it [00:02,  2.27s/it]


In [12]:
resolved_links

[{'original_url': 'https://trib.al/xXI5ruM',
  'resolved_domain': 'breitbart.com',
  'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'},
 {'original_url': 'http://bit.ly/1Sv81cj',
  'resolved_domain': 'www.billshusterforcongress.com',
  'resolved_url': 'http://www.billshusterforcongress.com port=80): Read timed out. (read timeout=2)'}]

The output works really nicely with [Pandas](https://pandas.pydata.org/).

In [13]:
import pandas as pd

df_resolved_links = pd.DataFrame(resolved_links)
df_resolved_links.tail(2)

| | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 0 | https://trib.al/xXI5ruM | breitbart.com | https://www.breitbart.com/video/2017/12/31/lin... |
| 1 | http://bit.ly/1Sv81cj | www.billshusterforcongress.com | http://www.billshusterforcongress.com port=80)... |


<hr>

But that is a toy example. Let's see how this fares with a larger dataset.<br>
This package comes with a [sampled dataset](https://github.com/SMAPPNYU/urlExpander/blob/master/urlexpander/core/datasets.py#L8-L29) of links extracted from Twitter accounts from the 115th Congress.<br>
If you work with Twitter data, you'll be glad to know there is a function, `ux.tweet_parser.get_link()`, for creating a similar dataset from Tweets.

In [14]:
df_congress = ux.datasets.congress_twitter_links()

print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)

The dataset has 50000 rows


| | link.domain | link.url_long | link.url_short | tweet.created_at | tweet.id | tweet.text | user.id |
|---|---|---|---|---|---|---|---|
| 49998 | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y | https://t.co/Ilwci2gNFa | Mon Nov 28 19:44:30 +0000 2016 | 803323702444171265 | LIVE: States' Economic Development Assistance ... | 269992801 |
| 49999 | twitter.com | https://twitter.com/ap/status/818071378469519361 | https://t.co/2SEKhfEXeB | Sun Jan 08 15:01:58 +0000 2017 | 818110504694595585 | Prayers for #Jerusalem. https://t.co/2SEKhfEXeB | 22055226 |


Let's just work with shortened URLs:

In [15]:
short_urls = df_congress[
    df_congress['link.url_long'].apply(ux.is_short)
]['link.url_long'].unique()

len(short_urls)

15035

About 30% of the links are short!<br>
Let's see how long it takes to expand these 15k links.<br>
This is where the parameters for `multithread_expand()` shine.
We can create multiple threads for requests, cache results in a JSON file, and chunk the 15k input into smaller pieces. Why does this last part matter? When expanding links en masse, I noticed that performance degrades over time. Chunking the input prevents this from happening (not sure why though)!

In [19]:
resolved_links, errors = ux.multithread_expand(short_urls, 
                                               chunksize=1280, 
                                               n_workers=64,
                                               cache_file='tmp.json',
                                               return_errors=True)

12it [05:13, 26.11s/it]
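The progress bar counts chunks: 15,035 links at `chunksize=1280` make 12 chunks. As a rough picture of how chunking and worker threads can fit together, here is a standalone sketch. It is not urlExpander's implementation: `chunked`, `expand_in_chunks`, and `fake_resolve` are hypothetical names, the resolver is a stand-in that makes no network requests, and starting a fresh thread pool per chunk is just one way to structure the work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, chunksize):
    """Yield consecutive slices of `items` with at most `chunksize` elements."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


def expand_in_chunks(urls, resolve, chunksize=1280, n_workers=64):
    """Resolve `urls` chunk by chunk, collecting results and errors separately."""
    resolved, errors = [], []

    def safe_resolve(url):
        # Catch failures so one bad link does not abort the whole chunk.
        try:
            return True, resolve(url)
        except Exception as exc:
            return False, {"original_url": url, "error": str(exc)}

    for chunk in chunked(list(urls), chunksize):
        # A fresh pool per chunk; pool.map keeps results in input order.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            for ok, record in pool.map(safe_resolve, chunk):
                (resolved if ok else errors).append(record)
    return resolved, errors


def fake_resolve(url):
    """Stand-in resolver; a real one would issue an HTTP request."""
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return {"original_url": url, "resolved_url": url + "/expanded"}


demo_urls = [f"http://short.example/{i}" for i in range(10)]
demo_urls.append("http://short.example/bad")
resolved, errors = expand_in_chunks(demo_urls, fake_resolve, chunksize=4, n_workers=3)
n_chunks = len(list(chunked(list(range(15035)), 1280)))
```

Returning errors as a separate list is similar in spirit to what `return_errors=True` gives back in the call above.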


We were able to expand 15K links in about 5 minutes, with very few errors!

In [20]:
len(resolved_links), len(errors)

(15034, 1)
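Caching is what makes a long run like this resumable. The notebook does not describe the on-disk format behind `cache_file`, so the sketch below assumes the simplest possible layout: a JSON object keyed by the original URL. `expand_with_cache` and `fake_resolve` are hypothetical names, and the resolver is a stand-in that records which URLs it was actually asked to resolve.

```python
import json
import os
import tempfile


def expand_with_cache(urls, resolve, cache_file):
    """Resolve `urls`, reusing any results already stored in `cache_file`."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)

    for url in urls:
        if url not in cache:
            cache[url] = resolve(url)

    # Assumed format: one JSON object mapping original URL -> result record.
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return [cache[url] for url in urls]


calls = []


def fake_resolve(url):
    """Stand-in resolver that records each URL it has to resolve."""
    calls.append(url)
    return {"original_url": url, "resolved_url": url + "/expanded"}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tmp.json")
    first = expand_with_cache(
        ["http://short.example/1", "http://short.example/2"], fake_resolve, path
    )
    second = expand_with_cache(
        ["http://short.example/1", "http://short.example/3"], fake_resolve, path
    )
```

On the second call only the unseen URL reaches the resolver; the repeated one is served from the file.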

In [21]:
df_resolved_links = pd.DataFrame(resolved_links)
df_resolved_links.tail(3)

| | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 15031 | http://bit.ly/U10Sh2 | -1 | Exceeded 30 redirects. |
| 15032 | http://ow.ly/wftqs | neindiana.com | http://neindiana.com/regional-initiatives/visi... |
| 15033 | http://bit.ly/HNRyxf | dispatch.com | http://www.dispatch.com/content/stories/local/... |


At SMaPP, the process of link expansion has been a burden on our research.<br>
We hope that this software helps you overcome similar obstacles!

<hr>

# Bonus Round!
After unshortening links, you can join them back onto the original dataframe:

In [22]:
import numpy as np

In [23]:
df_merged = df_congress.merge(df_resolved_links,
                              left_on='link.url_long',
                              right_on='original_url',
                              how='left')

# these steps fill in `resolved_domain` for URLs that were not from link shortening services...
df_merged['resolved_domain'] = np.where(df_merged['resolved_domain'].isnull(), 
                                        df_merged['link.domain'], 
                                        df_merged['resolved_domain'])

df_merged['resolved_url'] = np.where(df_merged['resolved_url'].isnull(), 
                                     df_merged['link.url_long'], 
                                     df_merged['resolved_url'])

df_merged.tail(2)

| | link.domain | link.url_long | link.url_short | tweet.created_at | tweet.id | tweet.text | user.id | original_url | resolved_domain | resolved_url |
|---|---|---|---|---|---|---|---|---|---|---|
| 49998 | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y | https://t.co/Ilwci2gNFa | Mon Nov 28 19:44:30 +0000 2016 | 803323702444171265 | LIVE: States' Economic Development Assistance ... | 269992801 | | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y |
| 49999 | twitter.com | https://twitter.com/ap/status/818071378469519361 | https://t.co/2SEKhfEXeB | Sun Jan 08 15:01:58 +0000 2017 | 818110504694595585 | Prayers for #Jerusalem. https://t.co/2SEKhfEXeB | 22055226 | | twitter.com | https://twitter.com/ap/status/818071378469519361 |
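If the merge-then-fill logic is hard to follow in dataframe form, the same idea can be written in plain Python: look each long URL up in the resolved results, and fall back to the original domain and URL when there is no match. The rows, URLs, and variable names below are made up for illustration.

```python
# Hypothetical input rows (stand-ins for rows of df_congress).
rows = [
    {"link.url_long": "http://bit.ly/abc", "link.domain": "bit.ly"},
    {"link.url_long": "https://www.youtube.com/watch?v=xyz", "link.domain": "youtube.com"},
]

# Hypothetical expansion results (stand-ins for rows of df_resolved_links).
resolved_records = [
    {
        "original_url": "http://bit.ly/abc",
        "resolved_domain": "news.example",
        "resolved_url": "https://news.example/story",
    },
]

# Index the results by original URL, like the right side of a left join.
lookup = {record["original_url"]: record for record in resolved_records}

merged = []
for row in rows:
    match = lookup.get(row["link.url_long"], {})
    merged.append({
        **row,
        # Fall back to the original values for links that were never expanded.
        "resolved_domain": match.get("resolved_domain", row["link.domain"]),
        "resolved_url": match.get("resolved_url", row["link.url_long"]),
    })
```

Every row keeps a usable `resolved_domain`, whether or not the link went through a shortener.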


You can count the number of `resolved_domain`s for each `user.id` using `count_matrix()`.<br>
You can even choose which domains are counted by modifying the `domain_list` arg:

In [25]:
count_matrix = ux.tweet_utils.count_matrix(df_merged,
                                           user_col='user.id', 
                                           domain_col='resolved_domain', 
                                           unique_count_col='tweet.id',
                                           domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])

count_matrix.tail(3)

| user.id | facebook.com | youtube.com | twitter.com | google.com |
|---|---|---|---|---|
| 98471035 | 10 | 1 | 6 | 0 |
| 993153006 | 5 | 6 | 25 | 0 |
| 995193054 | 2 | 3 | 1 | 0 |
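Conceptually, a count matrix like this is a per-user tally of links by domain. The sketch below is a guess at the general idea rather than the library's code: `count_matrix_sketch` is a hypothetical name, it assumes (based on the `unique_count_col` parameter name) that each distinct tweet ID is counted once per user and domain, and it keeps users whose counts are all zero, which may differ from the real function.

```python
from collections import defaultdict


def count_matrix_sketch(rows, user_col, domain_col, unique_count_col, domain_list):
    """Count distinct `unique_count_col` values per user for each listed domain."""
    unique_ids = defaultdict(set)  # (user, domain) -> set of distinct IDs
    users = dict.fromkeys(row[user_col] for row in rows)  # ordered, de-duplicated

    for row in rows:
        if row[domain_col] in domain_list:
            unique_ids[(row[user_col], row[domain_col])].add(row[unique_count_col])

    return {
        user: {domain: len(unique_ids[(user, domain)]) for domain in domain_list}
        for user in users
    }


# Made-up rows for illustration.
demo_rows = [
    {"user.id": 1, "resolved_domain": "youtube.com", "tweet.id": 101},
    {"user.id": 1, "resolved_domain": "youtube.com", "tweet.id": 101},  # same tweet twice
    {"user.id": 1, "resolved_domain": "twitter.com", "tweet.id": 102},
    {"user.id": 2, "resolved_domain": "example.org", "tweet.id": 103},
]
counts = count_matrix_sketch(
    demo_rows,
    user_col="user.id",
    domain_col="resolved_domain",
    unique_count_col="tweet.id",
    domain_list=["youtube.com", "twitter.com"],
)
```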


<hr>
We also built a one-size-fits-all scraper that returns the title, description, and/or paragraphs from any given URL.

In [26]:
ux.html_utils.get_webpage_title(urls[0])

"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"

In [27]:
ux.html_utils.get_webpage_description(urls[0])

'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'
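To give a feel for what title and description extraction involves, here is a minimal standard-library sketch that parses a literal HTML string. It is not how `html_utils` works internally (the notebook does not say), and `MetaSketch` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class MetaSketch(HTMLParser):
    """Collect the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


# Made-up page for illustration; a real scraper would download the HTML first.
sample_html = """<html><head>
<title>Example Story | Example News</title>
<meta name="description" content="A short summary of the story.">
</head><body><p>First paragraph.</p></body></html>"""

parser = MetaSketch()
parser.feed(sample_html)
meta = {"title": parser.title.strip(), "description": parser.description}
```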

In [28]:
ux.html_utils.get_webpage_meta(urls[0])

OrderedDict([('url', 'https://trib.al/xXI5ruM'),
             ('title',
              "Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"),
             ('description',
              'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'),
             ('paragraphs',
              ['Sunday CBS’s “Face the Nation,” while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned President Donald Trump that he couldn’t “just tweet” about the protests.',
               'Graham said, “The Iranian people are not our enemy. The Ayatollah is the enemy of the world. Here is what I would do if I were President Trump. I would explain what a better deal would look like. It’s not enough to watch. President Trump is tweeting very sympathetically to the Iranian people. But you just can’t tweet here

## Conclusion
Thanks for stumbling upon this package. We hope that it will lead to more research around links.<br>
We're working on some projects in this vein and would love to know if you are too!

This is an open source package, so please feel free to reach out about bugs, feature requests, or collaboration!