# Validation of handcoding sample

This notebook checks whether the handcoding sample could contain results that are filtered out in the remove-error step.

In [2]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Import handcoding subset

The goal of this handcoding is to validate our decision to label all results from facebook.com, twitter.com, and instagram.com as politician-controlled content, and all results from youtube.com as not politician-controlled. To do this, we took a small sample of the urls from these domains and handcoded each one as politician-controlled or not. We then calculated the percentage of politician-controlled urls for each domain: facebook 94%, instagram 87%, twitter 99%, youtube 4%.

The final labels are decided by majority agreement of 3 coders (Zhen, Allison, and Burak). The total of 943 samples and their labels can be found here: https://docs.google.com/spreadsheets/d/1g-gTuRTKIBlzfsS8Ha_xc1tJUpPEzDGOkm_mf9PlrTU/edit?gid=0#gid=0.
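The majority rule and the per-domain percentages can be sketched as follows. This is a minimal illustration on toy data, not the code used for the spreadsheet: the column names `coder_1`, `coder_2`, `coder_3` and the label encoding (1 = politician-controlled, 0 = not) are assumptions.

```python
import pandas as pd

# Hypothetical toy labels: 1 = politician-controlled, 0 = not.
labels = pd.DataFrame({
    "domain": ["facebook.com", "facebook.com", "youtube.com", "youtube.com"],
    "coder_1": [1, 1, 0, 1],
    "coder_2": [1, 0, 0, 0],
    "coder_3": [1, 1, 0, 0],
})

coder_cols = ["coder_1", "coder_2", "coder_3"]
# With three coders, a label wins when at least two coders chose it.
labels["majority"] = (labels[coder_cols].sum(axis=1) >= 2).astype(int)

# Share of politician-controlled urls per domain, in percent.
pct_controlled = labels.groupby("domain")["majority"].mean() * 100
print(pct_controlled)
```

With an odd number of coders and a binary label, a tie cannot occur, so every url receives a final label.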

This is the subset I used to draw the handcoding samples for whether urls from social media domains are politician-controlled. It contains all urls whose domain is facebook.com, twitter.com, instagram.com, or youtube.com. As shown below, there are only 1857 urls across all dates, locations, and queries (`qry`). 250 urls are sampled for each of facebook.com, twitter.com, and youtube.com. Only 193 urls have instagram.com, so all of them go into the handcoding sample.
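The sampling rule described above (up to 250 urls per domain, and every url when a domain has fewer) could be written roughly as below. This is a sketch under assumptions: the helper `sample_per_domain`, the seed, and the toy row counts are illustrative, and the original sampling code is not shown in this notebook.

```python
import pandas as pd

def sample_per_domain(df, n=250, seed=0):
    """Draw up to n rows per domain; keep every row when a domain has fewer than n."""
    parts = []
    for _, group in df.groupby("domain"):
        k = min(n, len(group))
        parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy data standing in for the real subset (row counts are made up,
# except that instagram.com is given 193 rows to mirror the text).
toy = pd.DataFrame({
    "url": [f"u{i}" for i in range(500)],
    "domain": ["facebook.com"] * 307 + ["instagram.com"] * 193,
})

sample = sample_per_domain(toy)
sample_counts = sample["domain"].value_counts()
print(sample_counts)
```

Fixing `random_state` makes the draw reproducible, which helps when the handcoding sample has to be regenerated later.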

In [4]:
handcode_subset = pd.read_csv("/net/lazer/lab-lazer/shared_projects/google_audit_reproduce/intermedidate_files/merged_summary/handcode_filtered_test.csv")

In [5]:
handcode_subset

Unnamed: 0,url,qry,domain,title,text,counts,path
0,http://www.facebook.com/LetMikeFixit,Mike Turner,facebook.com,Mike Turner - The Computer Guy,,8,LetMikeFixit
1,https://m.facebook.com/RepBrianMast/videos/rep...,Brian Mast,facebook.com,Rep. Mast Fights To Stop Toxic Discharges In C...,,1,RepBrianMast
2,https://m.facebook.com/abc3340/videos/sen-mitc...,Andy Barr,facebook.com,Sen. Mitch McConnell joins U.S. Congressman An...,,2,abc3340
3,https://m.facebook.com/abc3340/videos/sen-mitc...,Andy Barr,facebook.com,Sen. Mitch McConnell joins U.S. Congressman An...,,1,abc3340
4,https://m.facebook.com/captclayhiggins/videos/...,Clay Higgins,facebook.com,Final bands rolling through. Wind took... - Ca...,,2,captclayhiggins
...,...,...,...,...,...,...,...
1846,https://www.youtube.com/watch?v=vqzPPbyppaw,Xochitl Torres Small,youtube.com,HMF LCV Rep. Xochitl Torres Small (NM-02),League of Conservation Voters,417,watch
1847,https://www.youtube.com/watch?v=vqzPPbyppaw,Xochitl Torres Small,youtube.com,HMF LCV Rep. Xochitl Torres Small (NM-02),,16,watch
1848,https://www.youtube.com/watch?v=w3R3-9TfD6A,Raul Ruiz,youtube.com,Weekly Democratic Address -- Congressman Raul ...,,3,watch
1849,https://www.youtube.com/watch?v=wwX-SnlzHPQ,Dan Crenshaw,youtube.com,VIRAL AD: Rep. Dan Crenshaw's Avengers-style p...,The Hill,434,watch


In [6]:
len(handcode_subset[handcode_subset['domain']=='youtube.com'])

193

## Check whether data we filtered out later contain any social media urls

Kaicheng was concerned that raw results of type 'knowledge' and 'search_related' might have been included in the above samples, because I removed these two types from the analysis after I generated the handcoding sample. If these two types of results do not contain any of the domains we sampled above, it does not matter whether the removal happened before or after the handcoding sample was generated. Here, I verify that the two types of results, across all raw data, do not contain any social media domains.

First, let's import all files from the raw data, which are stored in `/net/lazer/lab-lazer/shared_projects/google_audit_reproduce/intermedidate_files/parquet_house/`

In [15]:
raw_data_path = '/net/lazer/lab-lazer/shared_projects/google_audit_reproduce/intermedidate_files/parquet_house/'

In [16]:
raw_data_files = os.listdir(raw_data_path) 

Here's an example data file for one day.

In [18]:
raw_20201030 = pd.read_parquet(raw_data_path + "20201030.parquet")

In [19]:
raw_20201030

Unnamed: 0,type,sub_rank,title,url,text,cmpt_rank,serp_rank,crawl_id,qry,lang,loc_id,sub_type,timestamp,subtitle,domain
0,general,0,U.S. Representative Kathleen Rice,https://kathleenrice.house.gov/,,0,0,20201030,Kathleen Rice,en,"OH-5,Ohio,United States",,,,house.gov
1,general,0,Kathleen Rice for Congress,https://www.kathleenrice.com/,,1,1,20201030,Kathleen Rice,en,"OH-5,Ohio,United States",,,,kathleenrice.com
2,general,0,Kathleen Rice - Wikipedia,https://en.wikipedia.org/wiki/Kathleen_Rice,,2,2,20201030,Kathleen Rice,en,"OH-5,Ohio,United States",,,,wikipedia.org
3,twitter_cards,0,Kathleen Rice (@RepKathleenRice) · Twitter,https://twitter.com/RepKathleenRice?ref_src=tw...,,3,3,20201030,Kathleen Rice,en,"OH-5,Ohio,United States",header,,,twitter.com
4,twitter_cards,1,,https://twitter.com/RepKathleenRice/status/132...,It’s been 8 years since #Sandy. The road to re...,3,4,20201030,Kathleen Rice,en,"OH-5,Ohio,United States",card,10 hours ago,,twitter.com
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5004438,general,0,"Duncan Hunter, former Representative for Calif...",https://www.govtrack.us/congress/members/dunca...,,10,19,20201030,Duncan D. Hunter,en,"KS-4,Kansas,United States",,,,govtrack.us
5004439,searches_related,0,,,,11,20,20201030,Duncan D. Hunter,en,"KS-4,Kansas,United States",,,,
5004440,knowledge,0,Duncan D. Hunter,https://en.wikipedia.org/wiki/Duncan_D._Hunter,Duncan Duane Hunter is an American politician ...,12,21,20201030,Duncan D. Hunter,en,"KS-4,Kansas,United States",,,Former U.S. Representative,wikipedia.org
5004441,knowledge,1,Profiles,,,12,22,20201030,Duncan D. Hunter,en,"KS-4,Kansas,United States",,,,


Then, let's filter the results to only the `knowledge` and `search_related` types, which we removed later in the analysis, and check whether any social media domain appears.

In [27]:
raw_20201030[raw_20201030['type']=='knowledge'].groupby('domain', dropna=False).size().reset_index(name='counts')

Unnamed: 0,domain,counts
0,wikipedia.org,183513
1,,527309


In [28]:
raw_20201030[raw_20201030['type']=='search_related'].groupby('domain', dropna=False).size().reset_index(name='counts')

Unnamed: 0,domain,counts
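The rows printed for this file above show a type spelled `searches_related`, while the filter uses `search_related`, so the empty result may reflect a label mismatch rather than the absence of such rows. A quick way to rule that out is to list the labels that actually occur and filter on both spellings. The sketch below uses a toy frame, and treating the two spellings as the same result type is an assumption.

```python
import pandas as pd

# Toy frame mimicking the 'type' and 'domain' columns of a raw file.
toy = pd.DataFrame({
    "type": ["general", "knowledge", "searches_related", "knowledge"],
    "domain": ["house.gov", "wikipedia.org", None, None],
})

# List the distinct labels actually present before filtering on one of them.
print(toy["type"].value_counts())

# Accept both spellings (assumption: they denote the same result type).
related_types = ["search_related", "searches_related"]
related_rows = toy[toy["type"].isin(related_types)]
related_domains = (
    related_rows.groupby("domain", dropna=False).size().reset_index(name="counts")
)
print(related_domains)
```

On the real data, the same check would be run with `raw_20201030` in place of `toy`.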


As the example shows, there are no social media domains. Now, let's check all files.

In [31]:
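domain_list = []
for file in raw_data_files:
    # Skip anything that is not a daily parquet file (e.g. '.ipynb_checkpoints').
    if not file.endswith('.parquet'):
        continue
    # Only the two columns used below are needed.
    raw_df = pd.read_parquet(os.path.join(raw_data_path, file), columns=['type', 'domain'])
    # Count rows per domain (keeping missing domains) for each result type we later removed.
    knowledge_domains = raw_df[raw_df['type']=='knowledge'].groupby('domain', dropna=False).size().reset_index(name='counts')
    search_related_domains = raw_df[raw_df['type']=='search_related'].groupby('domain', dropna=False).size().reset_index(name='counts')
    domain_list.append(knowledge_domains)
    domain_list.append(search_related_domains)
    print('processed' + file)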
    

processed20201006.parquet
processed20201229.parquet
processed20210307.parquet
processed20200914.parquet
processed20210301.parquet
processed20210306.parquet
processed20210228.parquet
processed20210305.parquet
processed20210310.parquet
processed20210317.parquet
processed20210312.parquet
processed20201128.parquet
processed20210324.parquet
processed20201221.parquet
processed20210311.parquet
processed20200902.parquet
processed20210326.parquet
processed20210321.parquet
processed20210105.parquet
processed20201125.parquet
processed20210101.parquet
processed20201021.parquet
processed20200921.parquet
processed20210118.parquet
processed20210124.parquet
processed20201031.parquet
processed20201026.parquet
processed20201219.parquet
processed20201030.parquet
processed20201212.parquet
processed20201025.parquet
processed20210123.parquet
processed20210304.parquet
processed20210115.parquet
processed20210314.parquet
processed20201218.parquet
processed20201004.parquet
processed20210309.parquet
processed202

In [33]:
all_domains = pd.concat(domain_list)

In [35]:
all_domains.groupby('domain', dropna=False).size().reset_index(name='counts')

Unnamed: 0,domain,counts
0,,106
1,wikipedia.org,200
2,,200
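The combined table can also be checked programmatically rather than by eye. A minimal sketch, assuming the `domain` column holds registered domains such as `facebook.com`; the helper `social_domains_present` is hypothetical, and the toy frame simply mirrors the table above.

```python
import pandas as pd

# The four domains used for the handcoding sample.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "instagram.com", "youtube.com"}

def social_domains_present(domains):
    """Return the sampled social media domains that occur in a domain column."""
    return SOCIAL_DOMAINS & set(pd.Series(domains).dropna())

# Toy frame mirroring the combined table (empty string, wikipedia.org, missing).
toy_all_domains = pd.DataFrame({
    "domain": ["", "wikipedia.org", None],
    "counts": [106, 200, 200],
})

found = social_domains_present(toy_all_domains["domain"])
print(found)
```

On the real data, `all_domains["domain"]` would be passed instead of the toy column, and an empty set would confirm the conclusion.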


As the combined domains show, no social media sites could possibly have been sampled from the data that we later removed. Therefore, our hand-coding sample for the politician-control analysis is free from potential error in this sense.