# Summarize general info for Facebook pages

This notebook combines the `pages-info.csv` and `all-partisan-sites.csv` files to produce a file with four attributes for each Facebook page:

- `page_id`
- `political_category`
- `page_name`
- `fan_count`

In [1]:
import pandas as pd

In [2]:
page_info = pd.read_csv("../data/pages-info.csv", dtype={"page_id": str})\
    [[ "page_id", "page_name", "fan_count" ]]
page_info.head()

Unnamed: 0,page_id,page_name,fan_count
0,108038612554992,Americans Against the Tea Party,583256
1,153418591515382,act.tv,285075
2,188464111175168,New Blue United,1476093
3,296856040436954,Obama is the Worst President in US History,1569590
4,492836854251934,RedFlag NewsDesk,1533


In [3]:
sites = pd.read_csv(
    '../data/all-partisan-sites.csv',
    dtype={ "fb_id": str },
    na_values=["None"]
).rename(columns={"fb_id": "page_id"})

sites[[ "site", "political_category", "page_id" ]].head()

Unnamed: 0,site,political_category,page_id
0,100percentfedup.com,right,311190048935167.0
1,21stcenturywire.com,left,182032255155419.0
2,24dailynew.com,right,515629708825640.0
3,24usnews.com,right,1430973860248840.0
4,4threvolutionarywar.wordpress.com,left,
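
One likely reason for passing `dtype={"fb_id": str}` is that a numeric ID column containing missing values would otherwise be read as floats, which can alter long IDs. The following is a minimal sketch of that behavior on an in-memory CSV; the sites and the `csv_text` contents are made up for illustration and are not taken from `all-partisan-sites.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV: one real-looking ID and one missing ID ("None")
csv_text = "site,fb_id\na.example,311190048935167\nb.example,None\n"

# Without a dtype, the missing value forces the whole column to float64
as_float = pd.read_csv(io.StringIO(csv_text), na_values=["None"])

# With dtype=str, the ID is kept as an exact string and the gap becomes NaN
as_str = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"fb_id": str},
    na_values=["None"],
)

print(as_float["fb_id"].dtype)
print(as_str["fb_id"].tolist())
```

Keeping IDs as strings also means they compare equal to the `page_id` strings read from `pages-info.csv` when the two tables are merged.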


Make sure that each page has been assigned only one political category:

In [4]:
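# Count distinct categories per page, ignoring placeholder IDs
categories_per_page = sites[
    ~sites["page_id"].isin([ "unavailable", "personal_page" ])
].groupby("page_id")["political_category"].nunique()
assert (categories_per_page > 1).sum() == 0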
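
If the assertion above ever failed, it would not say which pages are in conflict. Here is a minimal sketch of how the offending rows could be listed, using a small hypothetical `toy_sites` table in place of the real `sites` data.

```python
import pandas as pd

# Hypothetical stand-in for `sites`; page "111" is deliberately inconsistent
toy_sites = pd.DataFrame({
    "site": ["a.example", "b.example", "c.example", "d.example"],
    "page_id": ["111", "111", "222", "unavailable"],
    "political_category": ["left", "right", "right", "left"],
})

# Ignore placeholder IDs, as the assertion does
real_pages = toy_sites[
    ~toy_sites["page_id"].isin(["unavailable", "personal_page"])
]

# Count distinct categories per page and keep pages with more than one
category_counts = real_pages.groupby("page_id")["political_category"].nunique()
conflicting_ids = category_counts[category_counts > 1].index.tolist()

# Rows to inspect by hand
conflicts = real_pages[real_pages["page_id"].isin(conflicting_ids)]
print(conflicts)
```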

In [5]:
# One row per page_id; groupby drops rows whose page_id is missing
partisanship = sites.groupby("page_id")\
    ["political_category"].first()\
    .reset_index()

partisanship.head()

Unnamed: 0,page_id,political_category
0,100434040001314,left
1,1014803551921469,right
2,1019871961378419,right
3,1035617169863710,right
4,1036253643101134,left


In [6]:
# Left join keeps every page in page_info, matched on the shared page_id column
summary = pd.merge(
    page_info,
    partisanship,
    on="page_id",
    how="left"
)

summary.head()

Unnamed: 0,page_id,page_name,fan_count,political_category
0,108038612554992,Americans Against the Tea Party,583256,left
1,153418591515382,act.tv,285075,left
2,188464111175168,New Blue United,1476093,left
3,296856040436954,Obama is the Worst President in US History,1569590,right
4,492836854251934,RedFlag NewsDesk,1533,right
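
A left merge keeps pages that have no match in the partisanship table and leaves their `political_category` empty. As a hedged sketch (not part of the original analysis), `pd.merge` can check the join with its `validate` and `indicator` parameters; the `toy_page_info` and `toy_partisanship` tables below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for `page_info` and `partisanship`
toy_page_info = pd.DataFrame({
    "page_id": ["111", "222", "333"],
    "page_name": ["Page A", "Page B", "Page C"],
    "fan_count": [10, 20, 30],
})
toy_partisanship = pd.DataFrame({
    "page_id": ["111", "222"],
    "political_category": ["left", "right"],
})

checked = pd.merge(
    toy_page_info,
    toy_partisanship,
    on="page_id",
    how="left",
    validate="many_to_one",  # raises MergeError if a page_id repeats on the right
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# Pages present in toy_page_info but absent from toy_partisanship
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["page_id", "page_name"]])
```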


In [7]:
summary["political_category"].value_counts()

right    310
left     142
Name: political_category, dtype: int64
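
By default, `value_counts()` leaves out missing values, so pages without a category would not appear in a tally like the one above. A minimal sketch of how they could be counted as well, on a hypothetical `categories` series:

```python
import pandas as pd

# Hypothetical category column with one missing value
categories = pd.Series(
    ["left", "right", "right", None],
    name="political_category",
)

# dropna=False adds a row for missing values to the tally
counts_with_missing = categories.value_counts(dropna=False)

# Direct count of pages with no category
n_missing = int(categories.isna().sum())

print(counts_with_missing)
print(n_missing)
```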

In [8]:
summary.sort_values("page_id")\
    [["page_id", "political_category", "page_name", "fan_count" ]]\
    .to_csv("../output/fb-page-info-summary.csv", index=False)
