# urlExpander Quickstart
[NBViewer](http://nbviewer.jupyter.org/https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | [GitHub](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | By [Leon Yin](leonyin.org) for [SmaPP NYU](https://wp.nyu.edu/smapp/)


The core function of this package is exactly what the name suggests: to expand shortened URLs!<br>
You can download the software using pip:

In [3]:
!pip install urlexpander runtimestamp --upgrade

Collecting urlexpander
  Downloading https://files.pythonhosted.org/packages/b5/f9/4b378696a9a631390792041aabeb39ca779a7f96d01ace5f80e155e7611d/urlexpander-0.0.14.tar.gz
Requirement already up-to-date: runtimestamp in /home/ly501/anaconda3/lib/python3.6/site-packages
Requirement already up-to-date: tldextract in /home/ly501/anaconda3/lib/python3.6/site-packages (from urlexpander)
Requirement already up-to-date: pandas in /home/ly501/anaconda3/lib/python3.6/site-packages (from urlexpander)
Requirement already up-to-date: numpy in /home/ly501/anaconda3/lib/python3.6/site-packages (from urlexpander)
Requirement already up-to-date: tqdm in /home/ly501/anaconda3/lib/python3.6/site-packages (from urlexpander)
Requirement already up-to-date: unshortenit in /home/ly501/anaconda3/lib/python3.6/site-packages (from urlexpander)
Requirement already up-to-date: requests-file>=1.4 in /home/ly501/anaconda3/lib/python3.6/site-packages (from tldextract->urlexpander)
Requirement already up-to-date: requ

In [4]:
import urlexpander as ux
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('Leon')

Updated 2018-07-12 17:30:37.392022
By Leon
Using Python 3.6.5
On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core


Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [5]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the `expand` function (see the code) to unshorten any link:

In [6]:
ux.expand(urls[0])

{'original_url': 'https://trib.al/xXI5ruM',
 'resolved_domain': 'breitbart.com',
 'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'}
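
To make the shape of this result concrete, here is a minimal standard-library sketch of the same idea: follow the redirects, then report the final URL and its domain. The helper names `follow_redirects` and `summarize` are hypothetical and are not part of urlExpander, and stripping a leading `www.` is a simplification of real domain extraction.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def follow_redirects(url, timeout=10, opener=urlopen):
    """Return the URL reached after following HTTP redirects.

    Hypothetical helper, not urlExpander's implementation. `opener` is
    injectable so the function can be exercised without network access.
    """
    request = Request(url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"})
    with opener(request, timeout=timeout) as response:
        return response.geturl()


def summarize(original_url, resolved_url):
    """Build a dict with the same keys as the output shown above."""
    host = urlparse(resolved_url).netloc.lower()
    # Simplification: drop only a leading "www." instead of parsing the
    # registered domain properly.
    domain = host[4:] if host.startswith("www.") else host
    return {
        "original_url": original_url,
        "resolved_domain": domain,
        "resolved_url": resolved_url,
    }
```

Calling `summarize(url, follow_redirects(url))` requires network access. Some servers do not answer `HEAD` requests, so a fuller version would probably fall back to `GET`.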

To save compute time, we can skip links that don't need to be expanded.<br>
The `is_short` function takes any URL and checks whether its domain is in a known list of link shorteners.

In [7]:
print(f"{urls[1]} returns:")
ux.is_short(urls[1])

http://bit.ly/1Sv81cj returns:


True

bit.ly is probably the best-known link shortener. YouTube.com, however, is not a link shortener!

In [8]:
print(f"{urls[2]} returns:")
ux.is_short(urls[2])

https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [9]:
known_shorteners = ux.constants.all_short_domains.copy()
known_shorteners[:25]

['sh.st',
 'adf.ly',
 'lnx.lu',
 'adfoc.us',
 'dlvr.it',
 'bit.ly',
 'buff.ly',
 'ow.ly',
 'goo.gl',
 'shar.es',
 'ift.tt',
 'fb.me',
 'washex.am',
 'smq.tc',
 'trib.al',
 'is.gd',
 'paper.li',
 'waa.ai',
 'tinyurl.com',
 'ht.ly',
 '1.usa.gov',
 'deck.ly',
 'bit.do',
 'tiny.cc',
 'lc.chat']
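
A domain-level check of this kind can be sketched in a few lines. The function `looks_short` and the three-entry `SHORT_DOMAINS` set below are illustrative assumptions, not urlExpander's code. A set is used so that membership tests stay fast as the list grows.

```python
from urllib.parse import urlparse

# Small illustrative subset of the list above (assumption: three entries
# are enough to demonstrate the lookup).
SHORT_DOMAINS = {"bit.ly", "trib.al", "ow.ly"}


def looks_short(url, short_domains=SHORT_DOMAINS):
    """Return True if the URL's host is in `short_domains`."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in short_domains


print(looks_short("http://bit.ly/1Sv81cj"))                        # True
print(looks_short("https://www.youtube.com/watch?v=8NwKcfXvGl4"))  # False
```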

You can modify this list, or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level).

In [10]:
known_shorteners += ['youtube.com']

In [11]:
print(f"Now {urls[2]} returns:")
ux.is_short(urls[2], list_of_domains=known_shorteners)

Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


True

Now we can shorten our workload:

In [12]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if ux.is_short(link)]
urls_to_shorten

['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj']

urlExpander's `multithread_expand` does the heavy lifting to quickly and thoroughly expand a list of links:

In [15]:
resolved_links = ux.multithread_expand(urls_to_shorten,  
                                       n_workers=2,
                                       return_errors=False)

1it [00:02,  2.13s/it]


The output works really nicely with [Pandas](https://pandas.pydata.org/).

In [17]:
import pandas as pd

df_resolved_links = pd.DataFrame(resolved_links)
df_resolved_links.tail(2)

|   | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 0 | https://trib.al/xXI5ruM | breitbart.com | https://www.breitbart.com/video/2017/12/31/lin... |
| 1 | http://bit.ly/1Sv81cj | billshusterforcongress.com | http://www.billshusterforcongress.com/congress... |


But that is a toy example. Let's see how this fares with a larger dataset:

In [25]:
df_congress = ux.datasets.congress_twitter_links()
print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)

The dataset has 50000 rows


|   | link.domain | link.url_long | link.url_short | tweet.created_at | tweet.id | tweet.text | user.id |
|---|---|---|---|---|---|---|---|
| 49998 | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y | https://t.co/Ilwci2gNFa | Mon Nov 28 19:44:30 +0000 2016 | 803323702444171265 | LIVE: States' Economic Development Assistance ... | 269992801 |
| 49999 | twitter.com | https://twitter.com/ap/status/818071378469519361 | https://t.co/2SEKhfEXeB | Sun Jan 08 15:01:58 +0000 2017 | 818110504694595585 | Prayers for #Jerusalem. https://t.co/2SEKhfEXeB | 22055226 |


In [26]:
short_urls = df_congress[df_congress['link.url_long'].apply(ux.is_short)]['link.url_long'].unique()
len(short_urls)

15035

Let's see how long it takes to expand these 15k links.<br>
This is where the parameters for `multithread_expand` shine.
We can create multiple threads for requests, cache results in a JSON file, and chunk the 15k input into smaller pieces. Why does this last part matter? When expanding links en masse, I noticed that performance degrades over time. Chunking the input prevents this from happening (not sure why, though)!

In [31]:
resolved_links = ux.multithread_expand(short_urls, 
                                       chunksize=1280, 
                                       n_workers=64,
                                       cache_file='tmp.json',
                                       return_errors=False)

12it [04:29, 22.47s/it]


We were able to expand 15K links in less than 4.5 minutes!
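
The three ideas above (worker threads, a JSON cache, and chunked input) can be sketched with the standard library. This is an illustration under stated assumptions, not how `multithread_expand` is implemented: `chunked`, `expand_in_chunks`, and the `resolve` callable are hypothetical names, and `resolve` stands in for whatever function expands a single link.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def expand_in_chunks(urls, resolve, chunksize=1280, n_workers=64, cache_file=None):
    """Resolve `urls` chunk by chunk, reusing results stored in `cache_file`.

    Hypothetical sketch: `resolve` is any callable mapping one URL to a
    JSON-serializable result.
    """
    cache = {}
    if cache_file and os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)

    # Deduplicate while keeping order, and skip anything already cached.
    pending = [u for u in dict.fromkeys(urls) if u not in cache]
    for chunk in chunked(pending, chunksize):
        # One pool per chunk: its threads are torn down before the next chunk.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            for url, resolved in zip(chunk, pool.map(resolve, chunk)):
                cache[url] = resolved
        if cache_file:
            # Persist after every chunk so an interrupted run can resume.
            with open(cache_file, "w") as f:
                json.dump(cache, f)

    return [cache[u] for u in urls]
```

Writing the cache after each chunk means a crash loses at most one chunk of work, which is one practical reason to chunk regardless of the slowdown described above.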

In [28]:
df_resolved_links = pd.DataFrame(resolved_links)
print(f'The dataset has {len(df_congress)} rows')
df_resolved_links.head(3)

The dataset has 50000 rows


This process has historically been a huge bottleneck for using links as data. We hope that this software helps you overcome similar obstacles!

In [11]:
ux.html_parser.get_webpage_title(urls[1])

'Congressman Shuster Endorses Donald Trump » Congressman Bill Shuster'

In [12]:
ux.html_parser.get_webpage_description(urls[1])

'HOLLIDAYSBURG, PA –\xa0Congressman Bill Shuster (R-PA), Chairman of the House Transportation and Infrastructure Committee and delegate for the 9th\xa0Congressional District has announced his endorsement of Donald Trump for President: “The people of the 9th Congressional District, the Commonwealth of Pennsylvania, and states across the nation have made their voices heard, and I join them in ...Read more here.'
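
For readers curious how a title and description can be pulled out of a page, here is a minimal sketch using Python's built-in `html.parser`. The class `TitleAndDescriptionParser` and the function `extract_metadata` are hypothetical and make no claim about how `ux.html_parser` works internally. The sketch assumes the description lives in a `<meta name="description">` tag.

```python
from html.parser import HTMLParser


class TitleAndDescriptionParser(HTMLParser):
    """Collect the <title> text and the meta description of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the page title and meta description found in `html`."""
    parser = TitleAndDescriptionParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "description": parser.description}
```

Given a page's HTML as a string, `extract_metadata(html)` returns a dict with `title` and `description` keys. `description` is `None` when no matching meta tag is present.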