## A2: Task 2
This notebook explores Task 2 as described in the assignment markdown. I'll take a TDD approach, using [ipytest](https://github.com/chmp/ipytest) as my testing framework. I'll mainly test the "hard parts" rather than the code I'm already confident in.

In [20]:
import ipytest
import pandas as pd
import spacy
from spacy.tokens import Doc
from pathlib import Path
from spacytextblob.spacytextblob import SpacyTextBlob
from typing import Sequence, Callable, List
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
ipytest.autoconfig()

In [2]:
DATA_DIR = Path("../../../CDS-LANG/tabular_examples/")
data_path = DATA_DIR / "fake_or_real_news.csv"
df = pd.read_csv(data_path, index_col=0).reset_index(drop=True)

In [8]:
%%ipytest

def extract_geopol(doc: Doc) -> str:
    return ";".join(ent.text for ent in doc.ents if ent.label_ == "GPE")

def test_extract_geopol():
    doc = nlp("Washington battling Russia")
    geopols = extract_geopol(doc)
    assert geopols == "Washington;Russia"
    
def test_only_GPE():
    doc = nlp("Why isn't Rihanna leading Denmark yet?")
    geopols = extract_geopol(doc)
    assert geopols == "Denmark"
    
def test_no_GPE():
    doc = nlp("what?")
    geopols = extract_geopol(doc)
    assert geopols == ""

[32m.[0m[32m.[0m[32m.[0m[32m                                                                                          [100%][0m
[32m[32m[1m3 passed[0m[32m in 0.03s[0m[0m


In [30]:
%%ipytest
# Testing multiple sentences
def test_list_geopol():
    docs = list(nlp.pipe(["Denmark is a country", "Hello mr. smartypants"]))
    entities = list_geopol(docs)
    assert entities[0] == "Denmark"
    assert entities[1] == ""
    
def list_geopol(docs: Sequence[Doc]) -> List[str]:
    return [extract_geopol(doc) for doc in docs]

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.02s[0m[0m


## Adding sentiment
To keep the script interoperable between TextBlob and VADER, I'll only look at a single overall sentiment score for each: VADER's "compound" score and TextBlob's polarity, both of which range from -1 to 1. To begin with, I'll only investigate TextBlob, as it is easier to integrate.

In [14]:
%%ipytest
def textblob_sentiment(doc: Doc) -> float:
    return doc._.blob.polarity

def test_textblob():
    doc = nlp("you are stupid and dumb :(")
    assert textblob_sentiment(doc) < 0

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.03s[0m[0m


In [22]:
%%ipytest
# Testing multiple sentiments
def test_multiple_sentiment():
    docs = list(nlp.pipe(["I am angry!", "Happy days people"]))
    sentiments = list_sentiment(docs, sent_f=textblob_sentiment)
    assert sentiments[0] < 0
    assert sentiments[1] > 0
    
def list_sentiment(docs: Sequence[Doc], sent_f: Callable[[Doc], float]) -> List[float]:
    return [sent_f(doc) for doc in docs]

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.02s[0m[0m


In [27]:
headline_docs = list(nlp.pipe(df["title"]))

In [31]:
geopols = list_geopol(headline_docs)
sentiments = list_sentiment(headline_docs, textblob_sentiment)

In [37]:

def process_df(df: pd.DataFrame) -> pd.DataFrame:
    """Return each headline with its ';'-joined GPEs and TextBlob polarity."""
    headline_docs = list(nlp.pipe(df["title"]))
    geopols = list_geopol(headline_docs)
    sentiments = list_sentiment(headline_docs, textblob_sentiment)
    return pd.DataFrame(
        zip(df["title"], geopols, sentiments),
        columns=["title", "GPE", "sentiment"],
    )
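
Because `extract_geopol` joins entities with `;` and returns `""` when there are none, anything that consumes the `GPE` column has to split it again and skip the empty case. A minimal sketch of that, using only the standard library, is below. `count_geopol` is a hypothetical helper that is not part of the task, and it assumes no entity text contains a `;` itself.

```python
from collections import Counter
from typing import Iterable


def count_geopol(gpe_strings: Iterable[str]) -> Counter:
    """Count GPE mentions across ';'-joined strings (hypothetical helper)."""
    counts: Counter = Counter()
    for joined in gpe_strings:
        # "".split(";") returns [""], so drop empty pieces explicitly
        counts.update(name for name in joined.split(";") if name)
    return counts


counts = count_geopol(["Washington;Russia", "", "Russia"])
print(counts.most_common(1))  # [('Russia', 2)]
```

On the real data this could be called as `count_geopol(process_df(df)["GPE"])`.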
    