## Some more on ```spaCy``` and ```pandas```

First we want to import some of the packages we need.

In [23]:
import os
import spacy
from pathlib import Path
# Remember we need to initialise spaCy
nlp = spacy.load("en_core_web_sm")

We can inspect this object and see that it's what we've been calling a ```spaCy``` object.

In [5]:
type(nlp)

spacy.lang.en.English

We use this ```spaCy``` object to create annotated outputs, what we call a ```Doc``` object.

In [12]:
text = "I'm doing this SpaCy thing called coding"
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

```Doc``` objects are sequences of tokens, meaning we can iterate over the tokens and output specific annotations that we want such as POS tag or lemma.

In [21]:
[(tok.text, tok.pos_) for tok in doc]

[('I', 'PRON'),
 ("'m", 'AUX'),
 ('doing', 'VERB'),
 ('this', 'DET'),
 ('SpaCy', 'PROPN'),
 ('thing', 'NOUN'),
 ('called', 'VERB'),
 ('coding', 'VERB')]

__Reading data with ```pandas```__

```pandas``` is the main library in Python for working with DataFrames. These are tabular objects of mixed data types, comprising rows and columns.

In ```pandas``` vocabulary, a column is called a ```Series```, which is like a sophisticated list. I'll be using the names ```Series``` and column pretty interchangeably.

In [11]:
import pandas as pd
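
To make the relationship between a DataFrame and a ```Series``` concrete, here is a minimal sketch using a small made-up DataFrame. The name ```toy_df``` and its values are invented for illustration and are not part of the course data.

```python
import pandas as pd

# A tiny, made-up DataFrame (not the course data)
toy_df = pd.DataFrame(
    {
        "title": ["First headline", "Second headline", "Third headline"],
        "label": ["REAL", "FAKE", "REAL"],
    }
)

# Selecting a single column with square brackets gives a Series
labels = toy_df["label"]
print(type(labels))

# A Series behaves much like a list: it has a length and can be converted to one
print(len(labels))
print(labels.tolist())
```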

In [33]:
DATA_DIR = Path("../../../CDS-LANG/tabular_examples")
assert DATA_DIR.exists()
filepath = DATA_DIR / "fake_or_real_news.csv"
data = pd.read_csv(filepath, index_col=0)

We can use ```.sample()``` to take random samples of the dataframe.

In [38]:
data.sample(3)

| Unnamed: 0 | title | text | label |
|---|---|---|---|
| 3466 | Supreme Court to consider redefining 'one-pers... | WASHINGTON — The Supreme Court agreed Tuesday ... | REAL |
| 7266 | THIS Is What It Means If You Have Two Dimples ... | posted by Eddie Whether you have back dimples,... | FAKE |
| 6532 | Nevada: Rep. Election Workers Intimidated | Nevada: Rep. Election Workers Intimidated Nove... | FAKE |


To delete unwanted columns, we can do the following:

In [None]:
# df.drop(["colname"], axis=1)
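
As a small illustration of the commented-out line above, the sketch below drops a column from a made-up DataFrame (the names ```toy_df``` and ```"notes"``` are hypothetical). Note that ```.drop()``` returns a new DataFrame by default and leaves the original unchanged.

```python
import pandas as pd

# A tiny, made-up DataFrame with one column we don't want
toy_df = pd.DataFrame(
    {
        "title": ["First headline", "Second headline"],
        "notes": ["n/a", "n/a"],
        "label": ["REAL", "FAKE"],
    }
)

# axis=1 means "drop columns" (axis=0 would drop rows)
trimmed_df = toy_df.drop(["notes"], axis=1)

print(list(trimmed_df.columns))
print(list(toy_df.columns))  # the original still has all three columns
```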

We can count the distribution of possible values in our data using ```.value_counts()``` - e.g. how many REAL and FAKE news entries do we have in our DataFrame?

In [37]:
data["label"].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64

__Filter on columns__

To filter on columns, we define a condition on which we want to filter and use that to filter our DataFrame. We use the square-bracket syntax, just as if we were slicing a list or string.

In [41]:
mask = data["label"] == "FAKE"

Here we create two new dataframes, one with only fake news text, and one with only real news text.

In [42]:
fake_news_df = data[mask]
true_df = data[~mask]

__Counters__

In the following cell, you can see how to use a 'counter' to count how many entries are in a list.

The += operator adds 1 to the variable ```counter``` for every entry in the list.
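
A minimal sketch of this pattern, using a made-up list called ```words```:

```python
# A made-up list of entries to count
words = ["this", "is", "a", "short", "list"]

# Start the counter at zero
counter = 0

# Add 1 to the counter for every entry in the list
for word in words:
    counter += 1

print(counter)
```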

__Counting features in data__

Using the same logic, we can count how often adjectives (```JJ```) appear in our data. 

This is useful from a linguistic perspective; we could now, for example, figure out how many of each part of speech can be found in our data.

In this case, we're using ```nlp.pipe``` from ```spaCy``` to group the entries together into batches of 500 at a time.

Why?

Every time we execute ```nlp(text)```, it incurs a small computational overhead, which becomes a problem at scale. An overhead of 0.01s per document adds up to roughly 10,000 seconds (nearly three hours) for 1,000,000 documents, and it only gets worse with 10,000,000 or 100,000,000...

If we batch, we can therefore be a bit more efficient. It also allows us to keep our ```spaCy``` logic compact and together, which becomes useful for more complex tasks.

## Sentiment with ```spaCy```

To work with spaCyTextBlob, we need to make sure that we are working with ```spacy==2.3.5```. 

Follow the separate instructions posted to Slack to make this work.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
# initialise spacy
nlp = spacy.load("en_core_web_sm")

Here, we initialise spaCyTextBlob and add it as a new component to our ```spaCy``` nlp pipeline.

Let's test spaCyTextBlob on a single text, specifically Virginia Woolf's _To the Lighthouse_, published in 1927.

We use ```spaCy``` to create a ```Doc``` object for the entire text (how might you do this in batch?).

We can extract the polarity for each sentence in the novel and create a list of scores, one per sentence.

We can create a quick and cheap plot using matplotlib - this is only fine in Jupyter Notebooks, don't do this in the wild!

We can then use some fancy methods from ```pandas``` to calculate a rolling mean over a certain window length.

For example, we group together our polarity scores into a window of 100 sentences at a time and calculate an average on that window.
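
A minimal sketch of the rolling mean, using a short made-up list of polarity scores and a window of 3 rather than 100 so that the result is easy to check by hand. The names ```polarity``` and ```smoothed``` are illustrative.

```python
import pandas as pd

# Made-up polarity scores, one per sentence (stand-ins for real scores)
polarity = [0.0, 0.5, 1.0, -0.5, 0.0, 0.5]

# Window of 3 here; for a whole novel a window of 100 sentences is more useful
smoothed = pd.Series(polarity).rolling(3).mean()

# The first two values are NaN because a full window is not yet available
print(smoothed.tolist())
```

With real sentence scores, swapping ```.rolling(3)``` for ```.rolling(100)``` gives the 100-sentence window described above.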

The plot of the rolling average gives us a 'smoothed' view of polarity over the course of the novel, helping to cut through the noise.