# urlExpander Quickstart
View this notebook on [NBViewer](http://nbviewer.jupyter.org/github/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb?flush_cache=true) or [GitHub](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | Run it interactively on
[Binder](https://mybinder.org/v2/gh/SMAPPNYU/urlExpander/master?filepath=examples%2Fquickstart.ipynb) <br>
By [Leon Yin](leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)


[urlExpander](https://github.com/SMAPPNYU/urlExpander) is a Python package for quickly and thoroughly expanding URLs.

You can install the software using pip:

In [2]:
!pip install urlexpander runtimestamp -U

In [3]:
import urlexpander as ux
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('QuickStart User')
print(f"This notebook is using urlExpander v{ux.__version__}")

Updated 2018-07-19 09:34:55.593999
By QuickStart User
Using Python 3.6.5
On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core
This notebook is using urlExpander v0.0.28


Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [2]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the `expand` function (see the code) to unshorten any link:

In [3]:
ux.expand(urls[0])

{'original_url': 'https://trib.al/xXI5ruM',
 'resolved_domain': 'breitbart.com',
 'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'}

To save compute time, we can skip links that don't need to be expanded.<br>
The `is_short` function takes any URL and checks whether its domain is on a list of known link shorteners:

In [4]:
print(f"{urls[1]} returns:")
ux.is_short(urls[1])

http://bit.ly/1Sv81cj returns:


True

bit.ly is probably the best-known link shortener. youtube.com, however, is not a link shortener!

In [5]:
print(f"{urls[2]} returns:")
ux.is_short(urls[2])

https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [6]:
known_shorteners = ux.constants.all_short_domains.copy()
known_shorteners[:25]

['sh.st',
 'adf.ly',
 'lnx.lu',
 'adfoc.us',
 'j.gs',
 'q.gs',
 'u.bb',
 'ay.gy',
 'atominik.com',
 'tinyium.com',
 'microify.com',
 'linkbucks.com',
 'www.linkbucks.com',
 'jzrputtbut.net',
 'any.gs',
 'cash4links.co',
 'cache4files.co',
 'dyo.gs',
 'filesonthe.net',
 'goneviral.com',
 'megaline.co',
 'miniurls.co',
 'qqc.co',
 'seriousdeals.net',
 'theseblogs.com']
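
A domain-level check of this kind can be sketched with the standard library alone. The helper below is a hypothetical stand-in, not urlExpander's implementation: `looks_short` and `SHORT_DOMAINS` are names invented for illustration, and stripping a leading `www.` is an assumption about how hosts might be normalized.

```python
from urllib.parse import urlparse

# Hypothetical sample of shortener domains; the real list lives in
# ux.constants.all_short_domains.
SHORT_DOMAINS = {"bit.ly", "trib.al", "adf.ly", "sh.st"}


def looks_short(url, domains=SHORT_DOMAINS):
    """Return True if the URL's host is in the given collection of domains."""
    host = urlparse(url).netloc.lower()
    # Assumption: treat "www.example.com" and "example.com" as the same domain.
    if host.startswith("www."):
        host = host[4:]
    return host in domains


print(looks_short("http://bit.ly/1Sv81cj"))
print(looks_short("https://www.youtube.com/watch?v=8NwKcfXvGl4"))
```

Using a `set` keeps each lookup cheap, which matters when the check is applied to tens of thousands of links.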

You can modify this list, or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level).

In [7]:
known_shorteners += ['youtube.com']

In [8]:
print(f"Now {urls[2]} returns:")
ux.is_short(urls[2], list_of_domains=known_shorteners)

Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


True

Now we can shorten our workload:

In [9]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if ux.is_short(link)]
urls_to_shorten

['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj']

urlExpander's `multithread_expand()` does the heavy lifting to quickly and thoroughly expand a list of links:

In [10]:
resolved_links = ux.multithread_expand(urls_to_shorten,  
                                       n_workers=2,
                                       return_errors=False)

1it [00:01,  1.00s/it]


In [11]:
resolved_links

[{'original_url': 'https://trib.al/xXI5ruM',
  'resolved_domain': 'breitbart.com',
  'resolved_url': 'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'},
 {'original_url': 'http://bit.ly/1Sv81cj',
  'resolved_domain': 'billshusterforcongress.com',
  'resolved_url': 'http://www.billshusterforcongress.com/congressman-shuster-endorses-donald-trump/'}]

The output works really nicely with [Pandas](https://pandas.pydata.org/).

In [12]:
import pandas as pd

df_resolved_links = pd.DataFrame(resolved_links)
df_resolved_links.tail(2)

|   | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 0 | https://trib.al/xXI5ruM | breitbart.com | https://www.breitbart.com/video/2017/12/31/lin... |
| 1 | http://bit.ly/1Sv81cj | billshusterforcongress.com | http://www.billshusterforcongress.com/congress... |


<hr>

But that was a toy example. Let's see how this fares with a larger dataset.<br>
This package comes with a [sampled dataset](https://github.com/SMAPPNYU/urlExpander/blob/master/urlexpander/core/datasets.py#L8-L29) of links extracted from Twitter accounts from the 115th Congress.<br>
If you work with Twitter data, you'll be glad to know there is a function, `ux.tweet_utils.get_link()`, for creating a similar dataset from Tweets.

In [13]:
df_congress = ux.datasets.load_congress_twitter_links()

print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)

The dataset has 50000 rows


|   | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id |
|---|---|---|---|---|---|---|---|
| 49998 | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y | https://t.co/Ilwci2gNFa | Mon Nov 28 19:44:30 +0000 2016 | 803323702444171265 | LIVE: States' Economic Development Assistance ... | 269992801 |
| 49999 | twitter.com | https://twitter.com/ap/status/818071378469519361 | https://t.co/2SEKhfEXeB | Sun Jan 08 15:01:58 +0000 2017 | 818110504694595585 | Prayers for #Jerusalem. https://t.co/2SEKhfEXeB | 22055226 |
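
If you are starting from raw tweets, the link columns above appear to map onto each tweet's URL entities. The sketch below is not `ux.tweet_utils.get_link()` itself. It assumes tweets in the Twitter API v1.1 JSON layout, where each item in `entities.urls` carries a shortened `url` and an `expanded_url`. The helper name `extract_links` is made up for illustration, and the sample tweet is an abridged reconstruction of the last row shown above.

```python
def extract_links(tweet):
    """Yield one record per URL entity found in a tweet dictionary."""
    for entity in tweet.get("entities", {}).get("urls", []):
        yield {
            "tweet_id": tweet.get("id"),
            "user_id": tweet.get("user", {}).get("id"),
            "link_url_short": entity.get("url"),
            "link_url_long": entity.get("expanded_url"),
        }


# Abridged tweet, rebuilt by hand from the last row of the dataset preview.
tweet = {
    "id": 818110504694595585,
    "user": {"id": 22055226},
    "text": "Prayers for #Jerusalem. https://t.co/2SEKhfEXeB",
    "entities": {
        "urls": [
            {
                "url": "https://t.co/2SEKhfEXeB",
                "expanded_url": "https://twitter.com/ap/status/818071378469519361",
            }
        ]
    },
}

rows = list(extract_links(tweet))
print(rows)
```

A tweet without links simply yields no rows, so the generator can be run over a whole timeline without special-casing.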


Let's just work with shortened URLs:

In [14]:
short_urls = df_congress[
    df_congress['link_url_long'].apply(ux.is_short)
]['link_url_long'].unique()

len(short_urls)

15035

About 30% of the links are short!<br>
The performance of the next script is dependent on your internet connection:

In [None]:
!curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -

Let's see how long it takes to expand these 15k links.<br>
This is where the parameters for `multithread_expand()` shine.
We can create multiple threads for requests, cache results in a JSON file, and split the 15k input into smaller chunks. Why does this last part matter? When expanding links en masse, I noticed that performance degrades over time. Chunking the input prevents this from happening (I'm not sure why, though)!
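
The three ideas (threads, a cache file, and chunking) can be sketched with the standard library. This is only an illustration of the pattern, not how `multithread_expand()` is implemented: `fake_expand` is a stand-in that makes no network requests, the JSON-lines cache format is an assumption, and all function names are hypothetical.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_expand(url):
    """Stand-in for a network call; a real resolver would follow redirects."""
    return {"original_url": url, "resolved_url": url + "#resolved"}


def expand_in_chunks(urls, chunksize, n_workers, cache_file):
    """Process URLs chunk by chunk, appending each result to a JSON-lines cache."""
    results = []
    for chunk in chunks(urls, chunksize):
        # A fresh pool per chunk, so worker threads are torn down between chunks.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            resolved = list(pool.map(fake_expand, chunk))
        # Write after every chunk so an interrupted run keeps its progress.
        with open(cache_file, "a") as f:
            for row in resolved:
                f.write(json.dumps(row) + "\n")
        results.extend(resolved)
    return results


urls = [f"http://example.com/{i}" for i in range(10)]
cache_path = os.path.join(tempfile.mkdtemp(), "cache.jsonl")
results = expand_in_chunks(urls, chunksize=4, n_workers=2, cache_file=cache_path)
print(len(results), len(list(chunks(urls, 4))))
```

By the same arithmetic, 15,035 URLs with a chunk size of 1,280 split into 12 chunks (11 full chunks plus one of 955).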

In [18]:
!rm -f tmp.json

In [21]:
resolved_links, errors = ux.multithread_expand(short_urls, 
                                               chunksize=1280, 
                                               n_workers=64,
                                               cache_file='tmp.json',
                                               return_errors=True)


0it [00:00, ?it/s]
1it [00:26, 26.73s/it]
2it [00:51, 25.70s/it]
3it [01:23, 27.95s/it]
4it [01:48, 27.11s/it]
5it [02:15, 27.04s/it]
6it [02:56, 29.37s/it]
7it [03:29, 29.96s/it]
8it [03:56, 29.53s/it]
9it [04:20, 28.99s/it]
10it [04:46, 28.65s/it]
11it [05:11, 28.31s/it]
12it [05:17, 26.42s/it]

We were able to expand 15K links in a little over 5 minutes, with very few errors!

In [22]:
len(resolved_links), len(errors)

(15025, 10)

In [17]:
errors

[{'http://tiny.cc/o9p2dy': "<class 'UnicodeDecodeError'>"},
 {'http://bit.ly/18u7zSS': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/2nfaLyx': "<class 'requests.exceptions.ConnectionError'>"},
 {'http://tinyurl.com/kc87fug': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://ow.ly/peiY303JCSO': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/L7iSrM': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/mWfk6I': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/2wC0Zy3': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/QKiQ9u': "<class 'requests.exceptions.TooManyRedirects'>"},
 {'http://bit.ly/2rj4XGy': "<class 'requests.exceptions.TooManyRedirects'>"}]
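
A quick tally shows which failure modes dominate. The sketch below copies the `errors` list printed above and groups it by exception type with `collections.Counter`. It assumes each record is a single-key dictionary mapping a URL to the string form of the exception class, as in the output; `error_type` is a helper name chosen for illustration.

```python
from collections import Counter

# Copied from the notebook output above.
errors = [
    {'http://tiny.cc/o9p2dy': "<class 'UnicodeDecodeError'>"},
    {'http://bit.ly/18u7zSS': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/2nfaLyx': "<class 'requests.exceptions.ConnectionError'>"},
    {'http://tinyurl.com/kc87fug': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://ow.ly/peiY303JCSO': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/L7iSrM': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/mWfk6I': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/2wC0Zy3': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/QKiQ9u': "<class 'requests.exceptions.TooManyRedirects'>"},
    {'http://bit.ly/2rj4XGy': "<class 'requests.exceptions.TooManyRedirects'>"},
]


def error_type(record):
    """Return the exception string stored as the record's only value."""
    (message,) = record.values()
    return message


counts = Counter(error_type(record) for record in errors)
for name, n in counts.most_common():
    print(n, name)
```

Eight of the ten failures are `TooManyRedirects`, which `requests` raises when a redirect chain exceeds its limit (30 by default).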

At SMaPP, the process of link expansion has been a burden on our research.<br>
We hope that this software helps you overcome similar obstacles!

In [23]:
df_resolved_links = pd.DataFrame(resolved_links)
df_resolved_links.tail(3)

|   | original_url | resolved_domain | resolved_url |
|---|---|---|---|
| 15022 | http://1.usa.gov/9n1pJ | www.loc.gov | http://www.loc.gov port=443): Read timed out. ... |
| 15023 | https://buff.ly/2vFbqNn | www.prnewswire.com | http://www.prnewswire.com port=443): Read time... |
| 15024 | http://1.usa.gov/vyXSYI | obamawhitehouse.archives.gov | https://obamawhitehouse.archives.gov/blog/2011... |


Here are the top 25 shared domains from this sampled Congress dataset:

In [24]:
df_resolved_links.resolved_domain.value_counts().head(25)

facebook.com                1346
youtube.com                  762
ow.ly                        251
thehill.com                  210
energycommerce.house.gov     120
washingtonexaminer.com       104
medium.com                    91
sherrodbrown.com              66
washingtonpost.com            66
mn.gov                        65
wicker.senate.gov             56
flickr.com                    54
enzi.senate.gov               54
foreignaffairs.house.gov      53
maine.gov                     49
adriansmith.house.gov         49
democraticwhip.gov            46
blunt.senate.gov              46
cotton.senate.gov             46
rollcall.com                  46
capito.senate.gov             46
www.cochran.senate.gov        43
foxnews.com                   42
governor.hawaii.gov           41
boozman.senate.gov            41
Name: resolved_domain, dtype: int64
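
The list above includes `www.cochran.senate.gov`, while the other Senate sites appear without the `www.` prefix, which suggests that some resolved domains are not normalized. If that matters for your analysis, one option is to normalize before counting. The sketch below uses a hypothetical helper, `normalize_domain`, and a made-up sample list; it is an assumption about a reasonable cleanup step, not something urlExpander does for you.

```python
from collections import Counter


def normalize_domain(domain):
    """Lower-case a domain and drop a leading 'www.' prefix."""
    domain = domain.strip().lower()
    return domain[4:] if domain.startswith("www.") else domain


# Made-up sample with the kinds of variants seen in the output above.
domains = [
    "www.cochran.senate.gov",
    "cochran.senate.gov",
    "Facebook.com",
    "facebook.com",
    "www.loc.gov",
]

counts = Counter(normalize_domain(d) for d in domains)
print(counts.most_common())
```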

<hr>

# Bonus Round!
After unshortening links, you can join them back onto the original DataFrame:

In [25]:
import numpy as np

In [27]:
df_merged = df_congress.merge(df_resolved_links,
                              left_on='link_url_long',
                              right_on='original_url',
                              how='left')

# these steps fill in `resolved_domain` for URLs that were not from link shortening services...
df_merged['resolved_domain'] = np.where(df_merged['resolved_domain'].isnull(), 
                                        df_merged['link_domain'], 
                                        df_merged['resolved_domain'])

df_merged['resolved_url'] = np.where(df_merged['resolved_url'].isnull(), 
                                     df_merged['link_url_long'], 
                                     df_merged['resolved_url'])

df_merged.tail(2)

|   | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | original_url | resolved_domain | resolved_url |
|---|---|---|---|---|---|---|---|---|---|---|
| 49998 | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y | https://t.co/Ilwci2gNFa | Mon Nov 28 19:44:30 +0000 2016 | 803323702444171265 | LIVE: States' Economic Development Assistance ... | 269992801 |  | youtube.com | https://www.youtube.com/watch?v=KzanCL2Ui4Y |
| 49999 | twitter.com | https://twitter.com/ap/status/818071378469519361 | https://t.co/2SEKhfEXeB | Sun Jan 08 15:01:58 +0000 2017 | 818110504694595585 | Prayers for #Jerusalem. https://t.co/2SEKhfEXeB | 22055226 |  | twitter.com | https://twitter.com/ap/status/818071378469519361 |
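
The merge-then-fill pattern above amounts to a left join with a fallback: rows whose URL was expanded take the resolved values, and every other row keeps its original domain and URL. The plain-Python sketch below shows the same logic on two illustrative rows. The helper name `with_fallback` is hypothetical, and the `link_domain` value for the bit.ly row is assumed rather than taken from the dataset.

```python
# Two illustrative rows: one shortened link, one ordinary link.
rows = [
    {"link_url_long": "http://bit.ly/1Sv81cj", "link_domain": "bit.ly"},
    {"link_url_long": "https://www.youtube.com/watch?v=KzanCL2Ui4Y",
     "link_domain": "youtube.com"},
]

# Expansion results keyed by the URL that was expanded.
resolved = {
    "http://bit.ly/1Sv81cj": {
        "resolved_domain": "billshusterforcongress.com",
        "resolved_url": "http://www.billshusterforcongress.com/congressman-shuster-endorses-donald-trump/",
    }
}


def with_fallback(row, lookup):
    """Attach resolved fields, falling back to the original link fields."""
    match = lookup.get(row["link_url_long"])  # None if the URL was never expanded
    merged = dict(row)
    merged["resolved_domain"] = match["resolved_domain"] if match else row["link_domain"]
    merged["resolved_url"] = match["resolved_url"] if match else row["link_url_long"]
    return merged


merged_rows = [with_fallback(row, resolved) for row in rows]
for row in merged_rows:
    print(row["resolved_domain"], row["resolved_url"])
```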


You can count the number of `resolved_domain`s for each `user_id` using `count_matrix()`.<br>
You can even choose which domains are counted by modifying the `domain_list` arg:

In [34]:
count_matrix = ux.tweet_utils.count_matrix(df_merged,
                                           user_col='user_id', 
                                           domain_col='resolved_domain', 
                                           unique_count_col='tweet_id',
                                           domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])

count_matrix.tail(3)

| user_id | facebook.com | youtube.com | twitter.com | google.com |
|---|---|---|---|---|
| 941000686275387392 | 8 | 0 | 8 | 0 |
| 941080085121175552 | 0 | 0 | 2 | 0 |
| 948946378939609089 | 0 | 2 | 1 | 0 |
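
To make the shape of that output concrete, here is a plain-Python sketch of a user-by-domain count. It is not the implementation of `ux.tweet_utils.count_matrix()`: the function name `sketch_count_matrix`, the toy rows, and the choice to count distinct `tweet_id`s per user and domain (and to omit users with no matching links) are all assumptions.

```python
from collections import defaultdict


def sketch_count_matrix(rows, domain_list):
    """Count distinct tweet ids per (user, domain) for the listed domains."""
    seen = defaultdict(set)  # (user_id, domain) -> set of tweet ids
    for row in rows:
        if row["resolved_domain"] in domain_list:
            seen[(row["user_id"], row["resolved_domain"])].add(row["tweet_id"])
    users = sorted({user for user, _ in seen})
    # Fill in zeros for domains a user never shared.
    return {
        user: {domain: len(seen.get((user, domain), ())) for domain in domain_list}
        for user in users
    }


# Toy rows; the second row repeats the first and should not be double counted.
rows = [
    {"user_id": 1, "tweet_id": 10, "resolved_domain": "youtube.com"},
    {"user_id": 1, "tweet_id": 10, "resolved_domain": "youtube.com"},
    {"user_id": 1, "tweet_id": 11, "resolved_domain": "facebook.com"},
    {"user_id": 2, "tweet_id": 12, "resolved_domain": "twitter.com"},
    {"user_id": 2, "tweet_id": 13, "resolved_domain": "example.org"},
]

matrix = sketch_count_matrix(rows, ["youtube.com", "facebook.com", "twitter.com"])
for user, counts in matrix.items():
    print(user, counts)
```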


One domain list you might be interested in is the set of US national media outlets compiled by Gregory Eady (Forthcoming), available via
`datasets.load_us_national_media_outlets()`.

In [37]:
ux.datasets.load_us_national_media_outlets()[:5]

array(['abcnews.go.com', 'aim.org', 'alternet.org',
       'theamericanconservative.com', 'prospect.org'], dtype=object)

<hr>
We also built a one-size-fits-all scraper that returns the title, description, and/or paragraphs from any given URL.

In [38]:
ux.html_utils.get_webpage_title(urls[0])

"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"

In [39]:
ux.html_utils.get_webpage_description(urls[0])

'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'

In [40]:
ux.html_utils.get_webpage_meta(urls[0])

OrderedDict([('url', 'https://trib.al/xXI5ruM'),
             ('title',
              "Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"),
             ('description',
              'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'),
             ('paragraphs',
              ['Sunday CBS’s “Face the Nation,” while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned President Donald Trump that he couldn’t “just tweet” about the protests.',
               'Graham said, “The Iranian people are not our enemy. The Ayatollah is the enemy of the world. Here is what I would do if I were President Trump. I would explain what a better deal would look like. It’s not enough to watch. President Trump is tweeting very sympathetically to the Iranian people. But you just can’t tweet here
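
For a sense of what such a scraper has to do, the sketch below pulls a title, a meta description, and paragraph text out of an HTML string with the standard library's `html.parser`. It is an illustration under simplifying assumptions (well-formed HTML and a `<meta name="description">` tag), not the code behind `ux.html_utils`; the class name `MetaSketchParser` and the sample page are invented.

```python
from html.parser import HTMLParser


class MetaSketchParser(HTMLParser):
    """Collect the title, meta description, and paragraph text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.paragraphs = []
        self._current = None  # tag whose text is being captured, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            # Text inside nested tags such as <b> is appended to the paragraph.
            self.paragraphs[-1] += data


# Invented sample page.
html = """<html><head><title>Example page</title>
<meta name="description" content="A short summary."></head>
<body><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></body></html>"""

parser = MetaSketchParser()
parser.feed(html)
print(parser.title)
print(parser.description)
print(parser.paragraphs)
```

Real pages are messier (missing tags, scripts, Open Graph metadata), so a general-purpose scraper needs more fallbacks than this.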

## Conclusion
Thanks for stumbling upon this package. We hope that it will lead to more research around links.<br>
We're working on some projects in this vein and would love to know if you are too!

This is an open source package, so please feel free to reach out about bugs, feature requests, or collaboration!