# Host Label Validation and Exploration
After plotting each feature's density distribution by relevance label, the 10 hosts that looked "most wrong" at either end of the value range were checked manually (for example, hosts whose videos are often repeated across articles, yet are labeled "relevant"). This way, some mislabeled hosts could be identified and corrected. The process was repeated until the 10 "most wrong" hosts in every value range turned out to be correctly labeled.

In [1]:
import pandas as pd
import psycopg2

%load_ext autoreload
%autoreload 2


In [2]:
# Load the dataset from the database 
# TODO this is not DRY yet
# TODO don't use "most wrong" terminology
conn = psycopg2.connect(database="gdelt_social_video", user="postgres")
c = conn.cursor()

# Just work with youtube for now
platform = "youtube"
samples = pd.read_sql_query('''SELECT h.*, lh.twitter_relevant, lh.facebook_relevant, lh.youtube_relevant 
                                FROM hosts h RIGHT JOIN labeled_hosts lh  ON h.hostname=lh.hostname 
                                WHERE lh.%s_relevant <> -1''' % platform,con=conn)

samples = samples[["hostname", 
                   "article_count", 
                   "%s_video_sum" % platform, 
                   "%s_video_sum_distinct" % platform, 
                   "%s_video_count" % platform, 
                   "%s_relevant" % platform]]

# Compute the remaining interesting features
# Average number of videos per article, including articles without videos
samples["%s_video_average" % platform] = samples["%s_video_sum" % platform] / samples["article_count"]
# Average distinct videos per article
samples["%s_video_average_distinct" % platform] = samples["%s_video_sum_distinct" % platform] / samples["article_count"]
# Total videos to distinct videos
samples["%s_video_distinct_to_sum" % platform] = samples["%s_video_sum_distinct" % platform] / samples["%s_video_sum" % platform]
# Percentage of articles with videos
samples["%s_video_percentage" % platform] = samples["%s_video_count" % platform] / samples["article_count"]

samples.head()

Unnamed: 0,hostname,article_count,youtube_video_sum,youtube_video_sum_distinct,youtube_video_count,youtube_relevant,youtube_video_average,youtube_video_average_distinct,youtube_video_distinct_to_sum,youtube_video_percentage
0,1005thefox.iheart.com,33,8,8,4,1,0.242424,0.242424,1.0,0.121212
1,1013thebrew.iheart.com,105,10,10,10,1,0.095238,0.095238,1.0,0.095238
2,1015elpatron.iheart.com,102,6,6,6,1,0.058824,0.058824,1.0,0.058824
3,1025thefox.iheart.com,106,8,8,8,1,0.075472,0.075472,1.0,0.075472
4,1025wynr.iheart.com,8,3,3,3,1,0.375,0.375,1.0,0.375
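
The feature computation in the cell above is tied to one `platform` string, and the TODO notes that it is not DRY yet. A minimal sketch of one way to factor it out is shown here. The helper name `add_video_features` is hypothetical (it does not appear in the notebook); the column names are taken from the query above, and the demo rows reuse two hosts from the output.

```python
import pandas as pd


def add_video_features(samples: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Return a copy of `samples` with the four derived ratio columns for `platform`.

    Hypothetical helper; assumes the `<platform>_video_*` and `article_count`
    columns from the notebook's query are present.
    """
    out = samples.copy()
    articles = out["article_count"]
    video_sum = out[f"{platform}_video_sum"]
    video_sum_distinct = out[f"{platform}_video_sum_distinct"]
    video_count = out[f"{platform}_video_count"]

    # Average number of videos per article, including articles without videos
    out[f"{platform}_video_average"] = video_sum / articles
    # Average number of distinct videos per article
    out[f"{platform}_video_average_distinct"] = video_sum_distinct / articles
    # Ratio of distinct videos to total videos (NaN when a host has no videos)
    out[f"{platform}_video_distinct_to_sum"] = video_sum_distinct / video_sum
    # Share of articles that contain at least one video
    out[f"{platform}_video_percentage"] = video_count / articles
    return out


# Two rows copied from the output above
demo = pd.DataFrame({
    "hostname": ["1005thefox.iheart.com", "1025wynr.iheart.com"],
    "article_count": [33, 8],
    "youtube_video_sum": [8, 3],
    "youtube_video_sum_distinct": [8, 3],
    "youtube_video_count": [4, 3],
})
demo = add_video_features(demo, "youtube")
print(demo)
```

Returning a copy rather than mutating the input keeps the helper safe to call once per platform on the same query result.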


In [3]:
# Show the 10 hosts not labeled 1 that have the fewest articles containing YouTube videos
samples[samples["youtube_relevant"] != 1].sort_values("youtube_video_count", ascending=True).head(10)

Unnamed: 0,hostname,article_count,youtube_video_sum,youtube_video_sum_distinct,youtube_video_count,youtube_relevant,youtube_video_average,youtube_video_average_distinct,youtube_video_distinct_to_sum,youtube_video_percentage
397,www.coloradostar.com,2,4,4,1,2,2.0,2.0,1.0,0.5
690,www.tpr.org,17,1,1,1,2,0.058824,0.058824,1.0,0.058824
648,www.state.gov,123,1,1,1,2,0.00813,0.00813,1.0,0.00813
617,www.sailing.org,1,1,1,1,2,1.0,1.0,1.0,1.0
582,www.oklahomastar.com,2,4,4,1,2,2.0,2.0,1.0,0.5
541,www.mdt.mt.gov,1,1,1,1,2,1.0,1.0,1.0,1.0
464,www.highwaysmagazine.co.uk,1,1,1,1,2,1.0,1.0,1.0,1.0
460,www.helpage.org,5,1,1,1,2,0.2,0.2,1.0,0.2
413,www.diyweek.net,1,2,2,1,2,2.0,2.0,1.0,1.0
332,www.apr.org,17,4,4,1,2,0.235294,0.235294,1.0,0.058824
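
The filter-and-sort in the cell above is fixed to one feature and one sort direction, while the check was repeated for several features and for both ends of the value range. The following is a rough sketch of a reusable version. The name `suspicious_hosts` and its parameters are assumptions made for illustration; because the notebook does not state what each label value means, the label and the sort direction are left to the caller. Note that it selects one label with `==`, whereas the cell above excludes label 1 with `!=`.

```python
import pandas as pd


def suspicious_hosts(samples, label_column, label, feature, ascending, n=10):
    """Return the `n` hosts carrying `label` with the most extreme `feature` values.

    Hypothetical helper: `ascending=True` lists the lowest values first,
    `ascending=False` the highest.
    """
    subset = samples[samples[label_column] == label]
    return subset.sort_values(feature, ascending=ascending).head(n)


# Three rows copied from the outputs above
demo = pd.DataFrame({
    "hostname": ["www.tpr.org", "www.state.gov", "1013thebrew.iheart.com"],
    "youtube_video_count": [1, 1, 10],
    "youtube_relevant": [2, 2, 1],
    "youtube_video_percentage": [0.058824, 0.00813, 0.095238],
})

top = suspicious_hosts(
    demo,
    label_column="youtube_relevant",
    label=2,
    feature="youtube_video_percentage",
    ascending=True,
    n=1,
)
print(top["hostname"].tolist())
```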


In [8]:
# Manually look at the articles from a host
articles = pd.read_sql_query("SELECT DISTINCT website_url FROM found_videos WHERE hostname='www.premiumtimesng.com' AND platform='youtube'",con=conn)
# Iterate over the URL column directly instead of unpacking (index, row) tuples
for url in articles["website_url"]:
    print(url)

https://www.premiumtimesng.com/business/275723-nupeng-suspends-21-day-ultimatum-to-oil-and-gas-sector.html
https://www.premiumtimesng.com/business/276006-amcon-gets-senates-backing-to-publish-debtors-list.html
https://www.premiumtimesng.com/business/277848-timeline-for-9mobile-acquisition-extended.html
https://www.premiumtimesng.com/business/business-news/275315-labour-criticises-new-tariffs-on-cigarette-alcohol-tobacco.html
https://www.premiumtimesng.com/business/business-news/275576-economist-magazine-ranks-lagos-business-school-among-worlds-50-top-business-institutions.html
https://www.premiumtimesng.com/business/business-news/276973-sec-advises-investors-on-shares-warns-against-ponzi-schemes.html
https://www.premiumtimesng.com/business/business-news/277078-why-exxonmobil-sacked-spy-police-officers-official.html
https://www.premiumtimesng.com/business/business-news/277748-konga-to-re-launch-pay-on-delivery.html
https://www.premiumtimesng.com/business/business-news/277804-dangote-don

Mislabeled examples were manually changed in the database.
**The hosts dataset is now considered clean**
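
The notebook does not show how the labels were changed. As one possible approach, the sketch below corrects a single label through a DB-API cursor such as the `c` created earlier. The function `relabel_host` and the `ALLOWED_PLATFORMS` set are hypothetical; the table and column names follow the query used to load the samples, and the meaning of each label value is left to the caller.

```python
ALLOWED_PLATFORMS = {"twitter", "facebook", "youtube"}


def relabel_host(cursor, hostname, platform, label):
    """Set `<platform>_relevant` for one host in `labeled_hosts` (hypothetical helper)."""
    if platform not in ALLOWED_PLATFORMS:
        raise ValueError("unknown platform: %r" % platform)
    # Column names cannot be bound as query parameters, so the whitelisted
    # platform is interpolated into the statement; the values are bound by the driver.
    sql = "UPDATE labeled_hosts SET {}_relevant = %s WHERE hostname = %s".format(platform)
    cursor.execute(sql, (label, hostname))
```

With psycopg2, the change only becomes permanent after `conn.commit()` is called on the connection.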