# Scraping Improvements

Aims:

- Import the database and examine poorly scraped articles:
  - Check for failed retrievals
  - Check for short or nonsense text
  - Check for common domains that are improperly scraped
  - Check for extra material scraped alongside the article text

In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from urllib import parse
from newspaper import Article

%matplotlib inline
%load_ext autoreload
%autoreload 2


In [16]:
from sqlalchemy import create_engine

## Import Data

In [40]:
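def postgres_to_df(db, table):
    """Read a whole table from a local Postgres database into a DataFrame."""
    # SQLAlchemy 1.4+ only accepts the 'postgresql://' scheme name.
    engine = create_engine('postgresql://postgres:password@localhost/{}'.format(db))

    # The context managers close the connection once the table has been read.
    with engine.connect() as conn, conn.begin():
        df = pd.read_sql_table(table, conn)

    return df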

In [43]:
df_article_1 = postgres_to_df('id_test', 'article')
df_article_2 = postgres_to_df('id_test_simon', 'article')
df_content_1 = postgres_to_df('id_test', 'content')
df_content_2 = postgres_to_df('id_test_simon', 'content')

In [71]:
df_1 = df_article_1.merge(df_content_1, left_on='id', right_on='article', how='outer')
df_2 = df_article_2.merge(df_content_2, left_on='id', right_on='article', how='outer')
df = pd.concat([df_1, df_2])
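
The outer merge keeps articles that have no matching row in `content` (and the reverse), filling the missing side with NaN. A minimal sketch on invented data, assuming the same `id` and `article` key columns as the cells above; the `indicator=True` flag is an addition for illustration and is not used in the original merge.

```python
import pandas as pd

# Toy stand-ins for the article and content tables (all values are invented).
articles = pd.DataFrame({
    "id": [1, 2, 3],
    "status": ["fetched", "fetching failed", "fetched"],
})
contents = pd.DataFrame({
    "article": [1, 3],
    "content": ["some text", "more text"],
})

# indicator=True adds a "_merge" column saying which side each row came from.
merged = articles.merge(
    contents, left_on="id", right_on="article", how="outer", indicator=True
)

# Articles without a content row are "left_only" and have NaN content.
missing_content = merged[merged["_merge"] == "left_only"]
print(missing_content[["id", "status", "content"]])
```

Rows like these are the ones that carry NaN in `content` later in the analysis.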

In [87]:
def parse_url(url):
    url = parse.urlparse(url)
    return url.hostname

In [88]:
df['domain'] = df['url'].apply(parse_url)
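
Two properties of `urlparse(...).hostname` affect the domain counts: it returns `None` when a URL has no scheme, and it treats `www.yahoo.com` and `yahoo.com` as different hosts. The sketch below uses only the standard library to show both. The `normalise_domain` helper and the example URLs are hypothetical and not part of the original pipeline; whether stripping `www.` is desirable depends on how the domains are to be grouped.

```python
from urllib import parse


def normalise_domain(url):
    """Return the hostname without a leading 'www.', or None if there is none."""
    hostname = parse.urlparse(url).hostname  # already lower-cased by urlparse
    if hostname is None:
        return None
    if hostname.startswith("www."):
        hostname = hostname[len("www."):]
    return hostname


examples = [
    "http://www.yahoo.com/news/story",
    "http://yahoo.com/news/story",
    "yahoo.com/news/story",  # no scheme, so urlparse finds no hostname
]
domains = [normalise_domain(u) for u in examples]
print(domains)
```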

## Fetching Failed

In [89]:
df_failed = df[df['status'] == 'fetching failed']

In [99]:
df_success = df[df['status'] != 'fetching failed']

Just under 42% of articles failed to retrieve (41.7%); the remaining 58.3% were fetched.

In [109]:
print("Failures", len(df_failed)/len(df)*100)
print("Successes", len(df_success)/len(df)*100)

Failures 41.65561562786902
Successes 58.344384372130975
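
Because a boolean mask averages to the share of `True` values, the failure rate can also be computed without building two separate frames. A small sketch on invented statuses; the real `status` column may hold values other than the two shown here.

```python
import pandas as pd

# Invented status values standing in for df['status'].
status = pd.Series(
    ["fetching failed", "fetched", "fetched", "fetching failed", "fetched"]
)

failed = status == "fetching failed"  # boolean mask, True for failures
failure_pct = failed.mean() * 100     # mean of a boolean mask = share of True
success_pct = 100 - failure_pct

print(f"Failures {failure_pct:.1f}%")
print(f"Successes {success_pct:.1f}%")
```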


In [90]:
df_failed.head()

Unnamed: 0,id,url,domain,status,title,publication_date,authors,language,relevance,reliability,article,retrieval_date,content,content_type
2,3,http://www.9and10news.com/story/33699640/syria...,www.9and10news.com,fetching failed,,NaT,,,,,,NaT,,
6,7,http://journalstar.com/news/world/un-approves-...,journalstar.com,fetching failed,,NaT,,,,,,NaT,,
8,9,http://www.nbc-2.com/story/32756524/the-latest...,www.nbc-2.com,fetching failed,,NaT,,,,,,NaT,,
12,13,http://www.newsradio1440.com/news/hurricane-ma...,www.newsradio1440.com,fetching failed,,NaT,,,,,,NaT,,
13,14,http://www.newson6.com/story/33318828/hurrican...,www.newson6.com,fetching failed,,NaT,,,,,,NaT,,


In [104]:
df_failed['domain'].value_counts()[0:20]

hosted.ap.org             91
www.nigeriasun.com        52
siouxcityjournal.com      51
beatricedailysun.com      50
www.yahoo.com             48
www.sfgate.com            47
english.wafa.ps           44
lacrossetribune.com       40
tucson.com                38
journalstar.com           37
thetandd.com              37
www.ksby.com              35
www.hawaiinewsnow.com     34
rapidcityjournal.com      33
www.omaha.com             32
trib.com                  32
www.chron.com             32
missoulian.com            32
www.shorelinemedia.net    32
helenair.com              31
Name: domain, dtype: int64

In [112]:
nigeriasun = df_failed['url'][df_failed['domain'] == 'www.nigeriasun.com']

In [117]:
def scrape_articles(url):
    """Download and parse one article, then print the length of its text."""
    a = Article(url)
    a.download()
    a.parse()
    print(len(a.text))

In [121]:
a = Article(nigeriasun.iloc[0])
a.download()
a.parse()
print(a.text)

News24 - Monday 17th October, 2016

Abuja - Twenty-one of the over 200 missing Chibok schoolgirls freed after being held by Nigeria's Boko Haram Islamists for more than two years on Sunday spoke of their ordeal as they were reunited with their families.

During a Christian ceremony held for them in the capital Abuja, a schoolgirl named Gloria Dame said they had survived for 40 days without food and narrowly escaped death at least once.

"I was... (in) the woods when the plane dropped a bomb near me but I wasn't hurt," Dame told the congregation.

"We had no food for one month and 10 days but we did not die. We thank God," she said, speaking in the local Hausa language.

The ceremony was organised by Nigeria's security services which negotiated their release. Most of the kidnapped students were Christian but had been forcibly converted to Islam during captivity.

The Chibok girls were abducted in April 2014, drawing global attention to the Boko Haram insurgency engulfing the area when U

In [118]:
for url in nigeriasun:
    scrape_articles(url)

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
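
Printing only the text length hides why a URL produced nothing. The sketch below wraps the same `download()` and `parse()` sequence so that each URL yields either a length or an error message. `text_length` and its `article_cls` parameter are hypothetical: the parameter exists only so the helper can be exercised offline with a stand-in class, and it falls back to `newspaper.Article` when omitted.

```python
def text_length(url, article_cls=None):
    """Return (length_of_text, error_message) for one URL."""
    if article_cls is None:
        # Imported lazily so the helper can be tried without newspaper installed.
        from newspaper import Article as article_cls
    article = article_cls(url)
    try:
        article.download()
        article.parse()
    except Exception as exc:  # broad on purpose: any failure is recorded
        return None, repr(exc)
    return len(article.text), None


class FakeArticle:
    """Offline stand-in that mimics the three members used above."""

    def __init__(self, url):
        self.url = url
        self.text = ""

    def download(self):
        if "broken" in self.url:
            raise RuntimeError("download failed")

    def parse(self):
        self.text = "example body"


results = [
    text_length(u, article_cls=FakeArticle)
    for u in ["http://ok.example/a", "http://broken.example/b"]
]
print(results)
```

Collecting the results in a list also makes it easy to count how many URLs returned empty text versus how many raised an error.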


In [102]:
df_success['domain'].value_counts(normalize=True)[0:20]

english.wafa.ps              0.028849
reliefweb.int                0.011627
www.dailymail.co.uk          0.008655
www.yahoo.com                0.008655
hosted2.ap.org               0.007955
www.worldbulletin.net        0.007518
humanitariannews.org         0.005857
news.trust.org               0.005420
allafrica.com                0.004633
www.castanet.net             0.004546
newsviewsnreviews.com        0.004284
newsok.com                   0.003846
www.business-standard.com    0.003759
www.palestinemonitor.org     0.003584
www.dailytelegraph.com.au    0.003322
abcnews.go.com               0.003060
m.philstar.com               0.003060
en.farsnews.com              0.003060
www.newsnow.in               0.002972
www.washingtontimes.com      0.002885
Name: domain, dtype: float64
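
Raw counts mix how often a domain fails with how often it appears at all: `english.wafa.ps`, for example, is high in the failed list and is also the most common domain among the successes. A per-domain failure rate separates the two. A minimal sketch on invented rows, assuming the `domain` and `status` columns used above; the `min_articles` threshold is a hypothetical parameter.

```python
import pandas as pd

# Invented rows standing in for the merged article table.
df_toy = pd.DataFrame({
    "domain": ["a.example", "a.example", "a.example",
               "b.example", "b.example", "c.example"],
    "status": ["fetching failed", "fetching failed", "fetched",
               "fetched", "fetched", "fetching failed"],
})

min_articles = 2  # hypothetical cut-off to ignore rarely seen domains

per_domain = (
    df_toy.assign(failed=df_toy["status"] == "fetching failed")
    .groupby("domain")["failed"]
    .agg(n_articles="count", failure_rate="mean")
)
# Drop domains with too few articles for the rate to mean much.
per_domain = per_domain[per_domain["n_articles"] >= min_articles]
print(per_domain.sort_values("failure_rate", ascending=False))
```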

## Code 400 Errors

In [74]:
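# 'content' is NaN for some rows (e.g. articles that failed to fetch), which
# makes boolean indexing raise "ValueError: cannot index with vector containing
# NA / NaN values". na=False treats those rows as non-matches instead.
df_400 = df[df['content'].str.contains('40', na=False)]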