## A2: Task 2
This is an exploration of Task 2 as described in the markdown. I'll try a TDD approach, using [ipytest](https://github.com/chmp/ipytest) as my testing framework. I'll mainly test the "hard parts" and skip the code I'm already confident in.

In [53]:
import ipytest
import pandas as pd
import spacy
from spacy.tokens import Doc
from pathlib import Path
from spacytextblob.spacytextblob import SpacyTextBlob
from typing import Sequence, Callable, List, Tuple
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
ipytest.autoconfig()

In [2]:
DATA_DIR = Path("../../../CDS-LANG/tabular_examples/")
data_path = DATA_DIR / "fake_or_real_news.csv"
df = pd.read_csv(data_path, index_col=0).reset_index(drop=True)

In [8]:
%%ipytest

def extract_geopol(doc: Doc) -> str:
    return ";".join(ent.text for ent in doc.ents if ent.label_ == "GPE")

def test_extract_geopol():
    doc = nlp("Washington battling Russia")
    geopols = extract_geopol(doc)
    assert geopols == "Washington;Russia"
    
def test_only_GPE():
    doc = nlp("Why isn't Rihanna leading Denmark yet?")
    geopols = extract_geopol(doc)
    assert geopols == "Denmark"
    
def test_no_GPE():
    doc = nlp("what?")
    geopols = extract_geopol(doc)
    assert geopols == ""

...                                                                                          [100%]
3 passed in 0.03s


In [30]:
%%ipytest
# Testing multiple sentences
def test_list_geopol():
    docs = list(nlp.pipe(["Denmark is a country", "Hello mr. smartypants"]))
    entities = list_geopol(docs)
    assert entities[0] == "Denmark"
    assert entities[1] == ""
    
def list_geopol(docs: Sequence[Doc]) -> List[str]:
    return [extract_geopol(doc) for doc in docs]

.                                                                                            [100%]
1 passed in 0.02s


## Adding sentiment
To keep the script interoperable between TextBlob and VADER, I'll only look at a single overall score for each: VADER's "compound" score and TextBlob's polarity. To begin with, I'll only investigate TextBlob, as it is easier to integrate.
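
A minimal sketch of how a VADER scorer could later slot in next to the TextBlob one, assuming every scorer is a function from a document to a single float. The names `HasText`, `PolarityScorer` and `make_vader_sentiment` are hypothetical helpers, not part of spaCy or VADER; the only VADER behaviour relied on is that its analyzer has a `polarity_scores(text)` method returning a dict with a `"compound"` key.

```python
from typing import Callable, Dict, Protocol


class HasText(Protocol):
    """Anything exposing a .text attribute, e.g. a spaCy Doc (hypothetical helper type)."""
    text: str


class PolarityScorer(Protocol):
    """The part of a VADER-style analyzer this sketch relies on (hypothetical helper type)."""
    def polarity_scores(self, text: str) -> Dict[str, float]: ...


def make_vader_sentiment(analyzer: PolarityScorer) -> Callable[[HasText], float]:
    """Wrap an analyzer in a doc -> float function, the same shape as a TextBlob scorer."""
    def vader_sentiment(doc: HasText) -> float:
        # Only the overall "compound" score is kept, so both backends return one float
        return analyzer.polarity_scores(doc.text)["compound"]
    return vader_sentiment
```

With the `vaderSentiment` package installed, the wrapper would be built as `make_vader_sentiment(SentimentIntensityAnalyzer())` and passed anywhere a `Callable[[Doc], float]` scorer is expected.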

In [14]:
%%ipytest
def textblob_sentiment(doc: Doc) -> float:
    return doc._.blob.polarity

def test_textblob():
    doc = nlp("you are stupid and dumb :(")
    assert textblob_sentiment(doc) < 0

.                                                                                            [100%]
1 passed in 0.03s


In [22]:
%%ipytest
# Testing multiple sentiments
def test_multiple_sentiment():
    docs = list(nlp.pipe(["I am angry!", "Happy days people"]))
    sentiments = list_sentiment(docs, sent_f=textblob_sentiment)
    assert sentiments[0] < 0
    assert sentiments[1] > 0
    
def list_sentiment(docs: Sequence[Doc], sent_f: Callable[[Doc], float]) -> List[float]:
    return [sent_f(doc) for doc in docs]

.                                                                                            [100%]
1 passed in 0.02s


In [27]:
headline_docs = list(nlp.pipe(df["title"]))

In [31]:
geopols = list_geopol(headline_docs)
sentiments = list_sentiment(headline_docs, textblob_sentiment)

In [37]:

def process_df(df: pd.DataFrame) -> pd.DataFrame:
    headline_docs = list(nlp.pipe(df["title"]))
    geopols = list_geopol(headline_docs)
    sentiments = list_sentiment(headline_docs, textblob_sentiment)
    return pd.DataFrame(zip(df["title"], geopols, sentiments), columns = ["title", "GPE", "sentiment"])
    

### Creating the plot
Now it's time to create the plot. The steps are as follows:

0. Split the `;`-joined strings into one big list
1. Count all GPEs (without entity linking)
2. Plot
3. Profit

In [47]:
%%ipytest

def test_split_entities():
    ents = pd.Series(["Washington;Denmark", "", "United States"])
    ent_list = split_entities(ents)
    assert len(ent_list) == 3
    assert ent_list[0] == "Washington"
    assert ent_list[1] == "Denmark"
    assert all(len(ent) > 0 for ent in ent_list)

def flatten_list(lst: Sequence[Sequence]) -> Sequence: 
    return [x for y in lst for x in y]

def split_entities(ents: pd.Series) -> List[str]:
    non_empty_ents = ents[ents.str.len() > 0]
    split_ents = non_empty_ents.str.split(";")
    return flatten_list(split_ents.tolist())

.                                                                                            [100%]
1 passed in 0.02s


In [54]:
from collections import Counter

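def n_most_common(lst: Sequence[str], n: int = 20) -> List[Tuple[str, int]]:
    return Counter(lst).most_common(n)

# `results` is not assigned in any cell shown above; presumably it is process_df(df)
results = process_df(df)
top_ents = n_most_common(split_entities(results["GPE"]))
top_ents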

[('US', 164),
 ('Obama', 141),
 ('Russia', 116),
 ('Iran', 104),
 ('America', 85),
 ('U.S.', 81),
 ('Syria', 67),
 ('Iowa', 43),
 ('Israel', 32),
 ('Paris', 28),
 ('Iraq', 26),
 ('Washington', 21),
 ('New Hampshire', 20),
 ('Florida', 20),
 ('Mosul', 19),
 ('California', 19),
 ('Yemen', 18),
 ('New York', 17),
 ('China', 16),
 ('Texas', 16)]
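
Because the counts are made without entity linking, 'US', 'U.S.' and 'America' show up as three separate entries. A rough sketch of how such surface forms could be merged, assuming a small hand-written alias table; `ALIASES` and `merge_aliases` are hypothetical names, and treating every "America" as the United States is itself an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical, hand-written alias table; not produced by spaCy or an entity linker.
ALIASES: Dict[str, str] = {
    "US": "United States",
    "U.S.": "United States",
    "America": "United States",  # assumption: may not hold for every headline
}


def merge_aliases(
    counts: List[Tuple[str, int]], aliases: Dict[str, str]
) -> List[Tuple[str, int]]:
    """Sum the counts of surface forms that map to the same canonical name."""
    merged: Counter = Counter()
    for name, count in counts:
        # Names without an alias entry are kept unchanged
        merged[aliases.get(name, name)] += count
    return merged.most_common()


# A few rows copied from the output above
sample = [("US", 164), ("Russia", 116), ("America", 85), ("U.S.", 81)]
print(merge_aliases(sample, ALIASES))
# [('United States', 330), ('Russia', 116)]
```

Merging an already truncated top-20 list only approximates the true totals; applying the alias map to the full entity list before counting would avoid that.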