# A quick intro to working with ```spaCy```

In [1]:
import spacy

Once ```spaCy``` has been imported, we need to initialize a model.

NB: Models first have to be downloaded from the command line. An overview of the available ```spaCy``` models can be found [here](https://spacy.io/usage/models). For example:

```spacy download en_core_web_sm```

In [4]:
nlp = spacy.load("en_core_web_sm")

We first create a ```spaCy``` pipeline, which is going to be used for all of our analysis. Essentially, we feed our examples of language into the pipeline and get annotated texts out at the other end.

In [2]:
sentence = "My name is Ross and I come from Scotland"

The final object that comes out of the end is known as a ```spaCy``` Doc, which is essentially a list of tokens.

However, rather than just being a list of strings, each of the tokens in this list has its own *attributes*, which can be accessed using dot notation.

In [5]:
doc = nlp(sentence)

In [6]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)

My 		 PRON 		 poss 		 Number=Sing|Person=1|Poss=Yes|PronType=Prs
name 		 NOUN 		 nsubj 		 Number=Sing
is 		 AUX 		 ROOT 		 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Ross 		 PROPN 		 attr 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
I 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
come 		 VERB 		 conj 		 Tense=Pres|VerbForm=Fin
from 		 ADP 		 prep 		 
Scotland 		 PROPN 		 pobj 		 Number=Sing


We can also visualise certain aspects of the linguistic structure of the sentence, such as the dependency relations between individual words:

In [None]:
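from spacy import displacy  # import the visualiser explicitly
displacy.render(doc, style="dep")  # render() displays inline in a notebook; serve() would start a blocking web server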

## Experimenting

- Experiment with different language models available from ```spaCy``` for another language you know - Danish, Dutch, Chinese, Portuguese, whatever.
    - How does ```spaCy``` perform? 
    - Are all the same features available for all languages?

## Task

- In the shared data drive, there is a file called ```News_Category_Dataset_v2.json```. This is taken from [this Kaggle exercise](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and comprises some 200k news headlines from [HuffPost](https://www.huffpost.com/). The data is in *JSON Lines* format, with one JSON object per line. You can load this data into ```pandas``` in the following way:

```python
import pandas as pd
data = pd.read_json(filepath, lines=True)  # filepath: path to the JSON Lines file
```
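To make the *JSON Lines* layout concrete, here is a minimal sketch using only the standard library. The two records are made up for illustration (they are not taken from the dataset); only the field names `category` and `headline` come from the data used in this notebook.

```python
import io
import json

# Two made-up records in JSON Lines form (hypothetical data, not from the dataset).
sample = io.StringIO(
    '{"category": "SPORTS", "headline": "Example headline one"}\n'
    '{"category": "TRAVEL", "headline": "Example headline two"}\n'
)

# Each non-empty line is parsed on its own as a complete JSON object.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))
print(records[0]["category"])
```

Passing `lines=True` to `pd.read_json` asks pandas to read the file in this line-by-line way, giving one row per JSON object.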

- Select a couple of sub-categories of news data and use ```spaCy``` to find the **relative frequency per 10k words** of each of the following word classes: NOUN, VERB, ADJ (adjective), ADV (adverb)
    - Save the results as a CSV file (again using ```pandas```)
    - Are there any differences in the distributions?
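One possible way to compute the relative frequencies is sketched below. It is only a sketch under assumptions: the helper name `relative_frequencies` is hypothetical, it takes a flat list of POS tag strings, and it treats every tag it is given as one "word", so whether punctuation is counted depends on what you pass in.

```python
from collections import Counter


def relative_frequencies(pos_tags, targets=("NOUN", "VERB", "ADJ", "ADV"), per=10_000):
    """Return how often each target tag occurs per `per` tokens."""
    tags = list(pos_tags)
    total = len(tags)
    if total == 0:
        # Avoid dividing by zero when there is nothing to count.
        return {tag: 0.0 for tag in targets}
    counts = Counter(tags)
    # Counter returns 0 for tags that never occur.
    return {tag: counts[tag] / total * per for tag in targets}


# Hypothetical toy input: four tokens, two of which are nouns.
toy_tags = ["NOUN", "VERB", "NOUN", "ADJ"]
freqs = relative_frequencies(toy_tags)
print(freqs)
```

With ```spaCy```, the tags could be collected with something like `[token.pos_ for doc in nlp.pipe(texts) for token in doc if not token.is_punct]`, where `texts` is the list of headlines for one category. The resulting dictionaries, one per category, can then be put into a ```pandas``` DataFrame and written out with `to_csv`.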

In [19]:
import pandas as pd
import os
os.chdir("/work/115274/news_data")
print(os.getcwd())
data = pd.read_json("News_Category_Dataset_v2.json", lines = True)

/work/115274/news_data


In [58]:
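# Keep only the rows in the SPORTS category
sportsdata = data.loc[data["category"] == "SPORTS"]

# Keep the headline column and add a running counter x
# (assign returns a new DataFrame, so the result has to be stored)
headline = sportsdata[["headline"]].assign(x=range(len(sportsdata)))

# Run the first three headlines through the spaCy pipeline
for text in headline["headline"].head(3):
    doc = nlp(text)

# Display the resulting DataFrame
headline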



| | headline | x |
|---|---|---|
| 80 | Jets Chairman Christopher Johnson Won't Fine P... | 0 |
| 101 | Trump Posthumously Pardons Boxer Jack Johnson | 1 |
| 135 | Anna Kournikova Dancing With Her Bouncing Baby... | 2 |
| 136 | Trump Says NFL Players Unwilling To Stand For ... | 3 |
| 154 | Brandi Chastain Totally Agrees Her Hall Of Fam... | 4 |
| ... | ... | ... |
| 200786 | Thank You James Dolan and Time Warner | 4879 |
| 200849 | Maria Sharapova Stunned By Victoria Azarenka I... | 4880 |
| 200850 | Giants Over Patriots, Jets Over Colts Among  M... | 4881 |
| 200851 | Aldon Smith Arrested: 49ers Linebacker Busted ... | 4882 |
