# Scraping Improvements

Aims:

- Import database and examine poorly scraped articles
- Check for failed retrievals
- Check for short/nonsense text
- Check for common domains that are improperly scraped
- Check for extra material scraped

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from urllib import parse
from newspaper import Article

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
from sqlalchemy import create_engine

## Import Data

In [3]:
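def postgres_to_df(db, table):
    # Use the 'postgresql://' dialect name; SQLAlchemy 1.4+ rejects 'postgres://'
    engine = create_engine('postgresql://postgres:password@localhost/{}'.format(db))

    # The context manager closes the connection once the table has been read
    with engine.connect() as conn, conn.begin():
        df = pd.read_sql_table(table, conn)

    return df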

In [4]:
df_article_1 = postgres_to_df('id_test', 'article')
df_article_2 = postgres_to_df('id_test_simon', 'article')
df_content_1 = postgres_to_df('id_test', 'content')
df_content_2 = postgres_to_df('id_test_simon', 'content')

In [5]:
df_1 = df_article_1.merge(df_content_1, left_on='id', right_on='article', how='outer')
df_2 = df_article_2.merge(df_content_2, left_on='id', right_on='article', how='outer')
df = pd.concat([df_1, df_2])

In [6]:
def parse_url(url):
    url = parse.urlparse(url)
    return url.hostname

In [7]:
df['domain'] = df['url'].apply(parse_url)
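
A note on what `hostname` returns, since the domain counts further down depend on it. The sketch below is an assumption-laden variant of `parse_url`, not the helper used in this notebook: `normalise_domain` is a hypothetical name, and stripping a leading `www.` is a choice made here so that `www.yahoo.com` and `yahoo.com` would be counted together.

```python
from urllib import parse


def normalise_domain(url):
    """Return a lower-case hostname without a leading 'www.', or None.

    Hypothetical helper: the name and the 'www.' stripping are assumptions.
    """
    # Rows produced by an outer merge may carry NaN instead of a URL string
    if not isinstance(url, str):
        return None
    host = parse.urlparse(url).hostname  # None when the URL has no scheme
    if host is None:
        return None
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(normalise_domain("http://WWW.Yahoo.com:80/news"))  # yahoo.com
print(normalise_domain("hosted.ap.org/story"))           # None (no scheme)
```

`urlparse(...).hostname` already lower-cases the host and drops the port, so only the missing-scheme and non-string cases need explicit handling.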

## Articles that Failed to Fetch

In [8]:
df_failed = df[df['status'] == 'fetching failed']

In [9]:
df_success = df[df['status'] != 'fetching failed']

Over 40% of articles fail to retrieve (41.7% in this run), which is disappointingly high.

In [10]:
print("Failures", len(df_failed)/len(df)*100)
print("Successes", len(df_success)/len(df)*100)

Failures 41.65561562786902
Successes 58.344384372130975


No single domain is responsible for a large share of the failures (the largest, hosted.ap.org, accounts for about 1.1%), but some crop up much more often than others.

In [22]:
print(df_failed['domain'].value_counts(normalize=True)[0:20] * 100)

hosted.ap.org             1.114240
www.nigeriasun.com        0.636709
siouxcityjournal.com      0.624464
beatricedailysun.com      0.612220
www.yahoo.com             0.587731
www.sfgate.com            0.575487
english.wafa.ps           0.538754
lacrossetribune.com       0.489776
tucson.com                0.465287
journalstar.com           0.453043
thetandd.com              0.453043
www.ksby.com              0.428554
www.hawaiinewsnow.com     0.416310
rapidcityjournal.com      0.404065
www.shorelinemedia.net    0.391821
missoulian.com            0.391821
www.chron.com             0.391821
trib.com                  0.391821
www.omaha.com             0.391821
helenair.com              0.379576
Name: domain, dtype: float64
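
The table above gives each domain's share of all failures, which also reflects how many articles each domain contributes. A complementary view is the failure rate within each domain. The following is a minimal sketch on toy data: `failure_rate_by_domain` is a hypothetical helper, and the `'ok'` status value is invented for illustration (only `'fetching failed'` appears in this notebook).

```python
import pandas as pd


def failure_rate_by_domain(frame, failed_status="fetching failed"):
    """Return per-domain article counts and the fraction that failed."""
    flagged = frame.assign(failed=frame["status"] == failed_status)
    return (
        flagged.groupby("domain")["failed"]
        .agg(articles="count", failure_rate="mean")  # mean of booleans = fraction True
        .sort_values("failure_rate", ascending=False)
    )


# Toy data; the 'ok' status is an assumption made for this example
toy = pd.DataFrame({
    "domain": ["a.com", "a.com", "a.com", "b.com", "b.com", "c.com"],
    "status": ["fetching failed", "fetching failed", "ok",
               "fetching failed", "ok", "ok"],
})

per_domain = failure_rate_by_domain(toy)
print(per_domain)
```

On the real data, filtering to domains with a reasonable number of articles before sorting would keep one-off domains from dominating the top of the list.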


### 404 Errors

In [25]:
hosted_ap = df_failed['url'][df_failed['domain'] == 'hosted.ap.org']

In [15]:
def scrape_articles(url):
    a = Article(url)
    a.download()
    a.parse()
    print(len(a.text))

In [50]:
a = Article(hosted_ap.iloc[0])
a.download()
a.is_downloaded

False

The AP articles are not downloading at all. Requesting the URL directly shows that the server returns a 404. Our scraper does not currently return the status code, so it is hard to tell why an article failed to retrieve.

In [47]:
import requests

In [48]:
b = requests.get(hosted_ap.iloc[0])

In [49]:
b.status_code

404
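
To tell failure causes apart across many URLs, the status check above can be wrapped in a small helper. This is a sketch only: `fetch_status` and `describe_status` are hypothetical names, and it uses the standard library's `urllib` rather than `requests` so that it has no extra dependencies.

```python
import urllib.error
import urllib.request
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found' for an HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # Non-standard codes are not known to HTTPStatus
        return "{} Unknown".format(code)


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, and similar: no status code at all
        return None


print(describe_status(404))  # 404 Not Found
```

Applying `fetch_status` to `df_failed['url']` and counting the results would separate dead links (404) from blocked requests (403) and from hosts that never respond (`None`).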

**1. Get articles to return status code**

Possible solution in `scraper.py`

In [89]:
def html_article(self, url):
    a = newspaper.Article(url)
    a.download()
    if a.is_downloaded:
        a.parse()
        article_domain = a.source_url
        article_title = a.title
        article_authors = a.authors
        article_pub_date = a.publish_date
        article_text = remove_newline(a.text)
        # tag the type of article
        # currently default to text but should be able to determine img/video
        # etc
        article_content_type = 'text'
        return article_text, article_pub_date, article_title, article_content_type, article_authors, article_domain
    else:  # Temporary fix to deal with https://github.com/codelucas/newspaper/issues/280
        # Fall back to a plain request so the HTTP status code can be recorded
        response = requests.get(url)
        status = response.status_code
        return status, None, "", datetime.datetime.now(), "", ""

In [52]:
nigeriasun = df_failed['url'][df_failed['domain'] == 'www.nigeriasun.com']

In [75]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
c = requests.get(nigeriasun.iloc[0], headers=headers)

In [76]:
html = c.text

In [85]:
from newspaper import fulltext, build_article, Source

In [86]:
text = Source(html)

Exception: Input url is bad!

In [None]:
text = build_article

In [83]:
text.text

''

TypeError: __init__() takes 1 positional argument but 2 were given

In [66]:
c.get_article??

In [46]:
a

<newspaper.article.Article at 0x118095358>

In [40]:
a.source_url

'http://hosted.ap.org'

In [102]:
df_success['domain'].value_counts(normalize=True)[0:20]

english.wafa.ps              0.028849
reliefweb.int                0.011627
www.dailymail.co.uk          0.008655
www.yahoo.com                0.008655
hosted2.ap.org               0.007955
www.worldbulletin.net        0.007518
humanitariannews.org         0.005857
news.trust.org               0.005420
allafrica.com                0.004633
www.castanet.net             0.004546
newsviewsnreviews.com        0.004284
newsok.com                   0.003846
www.business-standard.com    0.003759
www.palestinemonitor.org     0.003584
www.dailytelegraph.com.au    0.003322
abcnews.go.com               0.003060
m.philstar.com               0.003060
en.farsnews.com              0.003060
www.newsnow.in               0.002972
www.washingtontimes.com      0.002885
Name: domain, dtype: float64

## Code 400 Errors

In [74]:
df_400 = df[df['content'].str.contains('40')]

ValueError: cannot index with vector containing NA / NaN values
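
The `ValueError` arises because `str.contains` returns `NaN` for rows whose `content` is missing, and a mask containing `NaN` cannot be used for boolean indexing. Passing `na=False` treats missing values as non-matches. A minimal sketch on toy data, assuming failed rows store the status code as a string in `content`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration
toy = pd.DataFrame(
    {"content": ["404", None, "Some article text", np.nan, "403"]}
)

# na=False turns missing values into False instead of NaN
mask = toy["content"].str.contains("40", na=False)
df_400 = toy[mask]
print(df_400)

# A stricter alternative: match only values that are exactly a 4xx code
strict = toy[toy["content"].str.fullmatch(r"4\d\d", na=False)]
```

The substring test also matches any article text that happens to contain "40", so the `fullmatch` variant may be the safer choice once status codes are actually stored.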