# urlExpander Quickstart
[NBViewer](http://nbviewer.jupyter.org/https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | [GitHub](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | By [Leon Yin](http://leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)


The core function of this package is exactly what the name suggests: to expand shortened URLs!<br>
You can install the software using pip:

In [1]:
!pip install urlexpander runtimestamp --upgrade

In [5]:
import urlexpander as ux
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('Leon')

Updated 2018-07-12 16:50:09.008560
By Leon
Using Python 3.6.5
On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core


Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [58]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the `expand` function (see the code) to unshorten any link:

In [59]:
ux.expand(urls[0])

{'original_url': 'https://trib.al/xXI5ruM',
 'resolved_domain': 'breitbart.com',
 'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'}
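
Conceptually, expanding a link means following its chain of HTTP redirects until a page stops redirecting. The sketch below illustrates that idea only; it is not urlExpander's implementation. The `follow_redirects` helper, the `get_location` callback, and the `.example` URLs are all hypothetical, and the callback stands in for a real HTTP request so that the example runs offline.

```python
def follow_redirects(url, get_location, max_hops=10):
    """Follow redirects until a URL no longer redirects.

    `get_location(url)` is a hypothetical callback that returns the redirect
    target for `url`, or None when the URL does not redirect.
    """
    seen = {url}
    for _ in range(max_hops):
        target = get_location(url)
        if target is None or target in seen:  # final page, or a redirect loop
            break
        seen.add(target)
        url = target
    return url


# Made-up redirect table standing in for real HTTP responses.
fake_redirects = {
    'https://short.example/abc': 'https://t.example.com/r/123',
    'https://t.example.com/r/123': 'https://www.example.com/article',
}

print(follow_redirects('https://short.example/abc', fake_redirects.get))
# https://www.example.com/article
```

The `max_hops` limit and the `seen` set guard against endless or circular redirect chains, which do occur with real shorteners.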

To save compute time, we can skip links that don't need to be expanded.<br>
The `is_short` function takes any URL and checks whether its domain is on a list of known link shorteners.

In [60]:
print(f"{urls[1]} returns:")
ux.is_short(urls[1])

http://bit.ly/1Sv81cj returns:


True

bit.ly is probably the best-known link shortener. YouTube.com, however, is not a link shortener!

In [61]:
print(f"{urls[2]} returns:")
ux.is_short(urls[2])

https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [62]:
known_shorteners = ux.constants.all_short_domains.copy()
known_shorteners[:25]

['sh.st',
 'adf.ly',
 'lnx.lu',
 'adfoc.us',
 'dlvr.it',
 'bit.ly',
 'buff.ly',
 'ow.ly',
 'goo.gl',
 'shar.es',
 'ift.tt',
 'fb.me',
 'washex.am',
 'smq.tc',
 'trib.al',
 'is.gd',
 'paper.li',
 'waa.ai',
 'tinyurl.com',
 'ht.ly',
 '1.usa.gov',
 'deck.ly',
 'bit.do',
 'tiny.cc',
 'lc.chat']

You can modify this list or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level).

In [63]:
known_shorteners += ['youtube.com']

In [64]:
print(f"Now {urls[2]} returns:")
ux.is_short(urls[2], list_of_domains=known_shorteners)

Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


True
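
A domain-level check of this kind can be pictured as a set lookup on the URL's host. The following is a minimal sketch under that assumption, not the library's code: `get_domain`, `looks_short`, and `SHORT_DOMAINS` are hypothetical names, and the set holds only a handful of entries copied from the list printed above.

```python
from urllib.parse import urlparse

# A few entries copied from the list of known shorteners; a set gives fast lookups.
SHORT_DOMAINS = {'bit.ly', 'trib.al', 'ow.ly', 'goo.gl', 'is.gd'}


def get_domain(url):
    """Return the lowercased host of `url` without a leading 'www.'."""
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host


def looks_short(url, domains=SHORT_DOMAINS):
    """Hypothetical stand-in for a shortener check: is the host in `domains`?"""
    return get_domain(url) in domains


youtube = 'https://www.youtube.com/watch?v=8NwKcfXvGl4'
print(looks_short('http://bit.ly/1Sv81cj'))                       # True
print(looks_short(youtube))                                       # False
print(looks_short(youtube, domains=SHORT_DOMAINS | {'youtube.com'}))  # True
```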

Now we can shorten our workload:

In [65]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if ux.is_short(link)]
urls_to_shorten

['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj']

urlExpander's `multithread_expand` does the heavy lifting to expand a list of links quickly and thoroughly:

In [66]:
df_resolved_links = ux.multithread_expand(urls_to_shorten,  
                                          n_workers=2,
                                          return_errors=False)

1it [00:01,  1.00s/it]


In [67]:
df_resolved_links.head()

| | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 0 | https://trib.al/xXI5ruM | breitbart.com | https://www.breitbart.com/video/2017/12/31/lin... |
| 1 | http://bit.ly/1Sv81cj | billshusterforcongress.com | http://www.billshusterforcongress.com/congress... |


But that is a toy example. Let's see how this fares with a larger dataset:

In [13]:
import pandas as pd

In [45]:
df_congress = pd.read_csv('https://raw.githubusercontent.com/SMAPPNYU/urlExpander/master/datasets/congress_sample_links.csv?flush=true',
                          dtype={'tweet.id':str,'user.id':str})

In [46]:
len(df_congress)

50000

In [47]:
df_congress.columns

Index(['link.domain', 'link.url_long', 'link.url_short', 'tweet.created_at',
       'tweet.id', 'tweet.text', 'user.id'],
      dtype='object')

In [48]:
df_congress.head()

| | link.domain | link.url_long | link.url_short | tweet.created_at | tweet.id | tweet.text | user.id |
|---|---|---|---|---|---|---|---|
| 0 | m.huffpost.com | http://m.huffpost.com/us/entry/55fc2c6ce4b0fde... | http://t.co/pSujNSfXzT | Tue Sep 22 13:28:41 +0000 2015 | 646315181157535744 | Such a wonderful thing to do! #Detroit attorne... | 2863006655 |
| 1 | goo.gl | http://goo.gl/O13Fjd | https://t.co/NMkW0wKcAa | Mon Oct 19 21:23:54 +0000 2015 | 656219242828697600 | Burdensome fines. Failing co-ops. Skyrocketing... | 28267055 |
| 2 | bit.ly | http://bit.ly/2nU8ifO | https://t.co/fG7dUYX6d6 | Thu Feb 15 04:13:05 +0000 2018 | 963989517320572929 | RT @NoticentroWAPA: Puerto de Ponce recibe el ... | 400246874 |
| 3 | bit.ly | http://bit.ly/11h3mA7 | http://t.co/tcoHulG4iJ | Thu Jul 25 20:27:59 +0000 2013 | 360496673778188289 | Not my 1st errant tweet, won't be my last. MT ... | 16056306 |
| 4 | is.gd | http://is.gd/T3FXGl | http://t.co/veFfyXnZ | Wed Feb 01 04:07:04 +0000 2012 | 164560372081229824 | “@michellemalkin: My latest column: First, the... | 54412900 |


In [39]:
short_urls = df_congress[df_congress['link.url_long'].apply(ux.is_short)]['link.url_long'].unique()
len(short_urls)

15035

In [42]:
df_resolved_links = ux.multithread_expand(short_urls, 
                                          chunksize=1280, 
                                          n_workers=64,
                                          cache_file='tmp.json',
                                          return_errors=False)

12it [04:28, 22.39s/it]


We were able to expand 15K links in less than 4.5 minutes!
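
The progress bar reports 12 iterations in 4 minutes 28 seconds, which is consistent with one iteration per chunk of 1,280 links. The sketch below shows one way chunked, thread-based expansion could be organized, together with the arithmetic behind those numbers. It is an illustration under stated assumptions rather than the internals of `multithread_expand`: `chunked`, `expand_in_chunks`, and the `expand_one` callback are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def expand_in_chunks(links, expand_one, chunksize=1280, n_workers=64):
    """Apply `expand_one` to every link, one chunk at a time.

    Within a chunk the calls run concurrently on a thread pool, which suits
    network-bound work such as resolving redirects.
    """
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for chunk in chunked(links, chunksize):
            # pool.map returns results in the same order as its input
            results.extend(pool.map(expand_one, chunk))
    return results


# Arithmetic behind the progress bar: 15,035 links in chunks of 1,280.
n_links, chunksize = 15035, 1280
n_chunks = math.ceil(n_links / chunksize)     # 12 chunks, matching "12it"
elapsed = 4 * 60 + 28                         # 04:28 is 268 seconds
print(n_chunks, round(n_links / elapsed, 1))  # 12 56.1 (links per second)
```

Processing in chunks also gives natural checkpoints: partial results can be saved after each chunk, so a long run does not have to start over after a failure.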

In [25]:
df_resolved_links.head()

| | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 0 | https://trib.al/xXI5ruM | breitbart.com | https://www.breitbart.com/video/2017/12/31/lin... |
| 1 | http://bit.ly/1Sv81cj | billshusterforcongress.com | http://www.billshusterforcongress.com/congress... |
| 2 | http://1.usa.gov/1JW0z4u | manchin.senate.gov | http://www.manchin.senate.gov/public/index.cfm... |
| 3 | http://ow.ly/fVR0302XZqu | mass.gov | https://www.mass.gov/news/governor-baker-signs... |
| 4 | http://ow.ly/a6Blc | forbes.com | https://www.forbes.com/sites/work-in-progress/... |


This process has historically been a huge bottleneck for using links as data. We hope that this software helps you overcome similar obstacles!

urlExpander can also retrieve the title and description of a webpage:

In [11]:
ux.html_parser.get_webpage_title(urls[1])

'Congressman Shuster Endorses Donald Trump » Congressman Bill Shuster'

In [12]:
ux.html_parser.get_webpage_description(urls[1])

'HOLLIDAYSBURG, PA –\xa0Congressman Bill Shuster (R-PA), Chairman of the House Transportation and Infrastructure Committee and delegate for the 9th\xa0Congressional District has announced his endorsement of Donald Trump for President: “The people of the 9th Congressional District, the Commonwealth of Pennsylvania, and states across the nation have made their voices heard, and I join them in ...Read more here.'
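
For intuition, a page's title and description usually live in its `<title>` element and its `<meta name="description">` tag. The sketch below pulls both out of an HTML string using only the standard library. It is not necessarily how `ux.html_parser` works: `PageMetaParser` is a hypothetical class and the sample HTML is made up.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collect the first <title> text and the meta description, if present."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_done = False
        self._title_parts = []
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._title_done:
            self._in_title = True
        elif tag == 'meta' and self.description is None:
            attrs = dict(attrs)
            if (attrs.get('name') or '').lower() == 'description':
                self.description = attrs.get('content')

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        return ''.join(self._title_parts).strip()


# Made-up HTML standing in for a downloaded page.
sample_html = (
    '<html><head><title> Example page </title>'
    '<meta name="description" content="A made-up page.">'
    '</head><body>Hello</body></html>'
)

parser = PageMetaParser()
parser.feed(sample_html)
print(parser.title)        # Example page
print(parser.description)  # A made-up page.
```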