# A quick intro to working with ```spaCy```

In [2]:


import spacy

Once ```spaCy``` is imported, we need to load a model.

NB: Models first have to be downloaded from the command line. An overview of the available ```spaCy``` models can be found [here](https://spacy.io/usage/models). To download the small English model:

```spacy download en_core_web_sm```

In [3]:
nlp = spacy.load("en_core_web_sm")  # Load the model; the returned object is the pipeline.
# By convention the loaded pipeline is named nlp. It bundles several processing components, such as a tagger and a parser.

The ```nlp``` object is the ```spaCy``` pipeline that we are going to use for all of our analysis. Essentially, we feed our examples of language in at one end of the pipeline and get annotated texts out of the other.

In [16]:
sentence = "My name is, uh, Odin and I am a norse god of warfare and wisdom."

The final object that comes out of the end is known as a ```spaCy``` Doc, which is essentially a list of tokens.

However, rather than just being a list of strings, each of the tokens in this list has its own *attributes*, which can be accessed using dot notation.

In [17]:
doc = nlp(sentence)  # Run the pipeline on the sentence; the result is a spaCy Doc object.
print(type(doc))  # spacy.tokens.doc.Doc

<class 'spacy.tokens.doc.Doc'>


In [18]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)

My 		 PRON 		 poss 		 Number=Sing|Person=1|Poss=Yes|PronType=Prs
name 		 NOUN 		 nsubj 		 Number=Sing
is 		 AUX 		 ROOT 		 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
, 		 PUNCT 		 punct 		 PunctType=Comm
uh 		 INTJ 		 intj 		 
, 		 PUNCT 		 punct 		 PunctType=Comm
Odin 		 PROPN 		 intj 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
I 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
am 		 AUX 		 conj 		 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
a 		 DET 		 det 		 Definite=Ind|PronType=Art
norse 		 ADJ 		 amod 		 Degree=Pos
god 		 PROPN 		 attr 		 Number=Sing
of 		 ADP 		 prep 		 
warfare 		 NOUN 		 pobj 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
wisdom 		 NOUN 		 conj 		 Number=Sing
. 		 PUNCT 		 punct 		 PunctType=Peri
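
The last column of the output packs the morphological features into one string of `Feature=Value` pairs separated by `|`. As a rough illustration of how to read that format, the sketch below splits such a string into a dictionary. The helper name `parse_morph` is hypothetical and works only on the printed string; inside spaCy itself, `token.morph.to_dict()` gives a dictionary directly.

```python
def parse_morph(morph_string):
    """Split a 'Feature=Value|Feature=Value' string into a dict."""
    if not morph_string:
        return {}
    features = {}
    for pair in morph_string.split("|"):
        name, _, value = pair.partition("=")
        features[name] = value
    return features


# String copied from the output above for the token "My"
my_features = parse_morph("Number=Sing|Person=1|Poss=Yes|PronType=Prs")
print(my_features["Person"])  # 1
print(parse_morph(""))        # {} for tokens such as "uh" that have no features
```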


We can also visualise certain aspects of the linguistic structure of the sentence, such as the dependency relations between individual words:

In [19]:
from spacy import displacy  # Import the visualiser explicitly

displacy.render(doc, style="dep")

## Experimenting

- Experiment with different language models available from ```spaCy``` for another language you know - Danish, Dutch, Chinese, Portuguese, whatever.
    - How does ```spaCy``` perform? 
    - Are all the same features available for all languages?

In [26]:
# Load the downloaded Danish model (the "news" in its name suggests it was trained on news text)
nlp_da = spacy.load("da_core_news_sm")
# A Danish example sentence
danish_sentence = "Hey Danmark! Hvad sker der for dig? Jeg savner dig, jeg vil have dig tilbage, ligesom i de gamle dage hvor en spade var en, yo!"
# Run the Danish pipeline to get a Doc object
doc_danish = nlp_da(danish_sentence)
# Print text, part of speech, dependency relation and morphology for each token
for token in doc_danish:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)



Hey 		 PROPN 		 ROOT 		 
Danmark 		 PROPN 		 flat 		 
! 		 PUNCT 		 punct 		 
Hvad 		 PRON 		 obl 		 Number=Sing|PronType=Int,Rel
sker 		 VERB 		 ROOT 		 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
der 		 ADV 		 expl 		 PartType=Inf
for 		 ADP 		 case 		 AdpType=Prep
dig 		 PRON 		 obl 		 Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
? 		 PUNCT 		 punct 		 
Jeg 		 PRON 		 nsubj 		 Case=Nom|Gender=Com|Number=Sing|Person=1|PronType=Prs
savner 		 VERB 		 ROOT 		 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
dig 		 PRON 		 obj 		 Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
, 		 PUNCT 		 punct 		 
jeg 		 PRON 		 nsubj 		 Case=Nom|Gender=Com|Number=Sing|Person=1|PronType=Prs
vil 		 AUX 		 aux 		 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
have 		 VERB 		 ccomp 		 VerbForm=Inf|Voice=Act
dig 		 PRON 		 obj 		 Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
tilbage 		 ADV 		 advmod:lmod 		 
, 		 PUNCT 		 punct 		 
ligesom 		 SCONJ 		 mark 		 
i 		 ADP 		 case 		 AdpType=Prep


## Task

- In the shared data drive, there is a file called ```News_Category_Dataset_v2.json```. This is taken from [this Kaggle exercise](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and comprises some 200k news headlines from [HuffPost](https://www.huffpost.com/). The data is in *JSON Lines* format, with one JSON object per row. You can load this data into ```pandas``` in the following way:

```python
import pandas as pd

# filepath is the path to News_Category_Dataset_v2.json
data = pd.read_json(filepath, lines=True)
```
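
To make the *JSON Lines* layout concrete, here is a small sketch that parses two made-up records with the standard library alone. The record contents are invented for illustration; only the field names `category` and `short_description` come from the dataset as used in this notebook. Passing `lines=True` to `pd.read_json` does the same line-by-line parsing and returns a DataFrame.

```python
import io
import json

# Two made-up records, one JSON object per line (illustrative only)
raw = io.StringIO(
    '{"category": "CRIME", "short_description": "Example description one."}\n'
    '{"category": "SPORTS", "short_description": "Example description two."}\n'
)

# Each non-empty line is parsed as its own JSON object
records = [json.loads(line) for line in raw if line.strip()]
print(len(records))            # 2
print(records[0]["category"])  # CRIME
```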

- Select a couple of sub-categories of news data and use ```spaCy``` to find the **relative frequency per 10k words** of each of the following word classes: NOUN, VERB, ADJECTIVE, ADVERB
    - Save the results as a CSV file (again using ```pandas```)
    - Are there any differences in the distributions?
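
One possible way to compute the relative frequency, sketched on a hand-written list of tags rather than real data: count each word class, divide by the total number of tokens, and multiply by 10,000. The function name `relative_frequencies` and the example tags are assumptions made for illustration. In practice the tags would come from something like `[token.pos_ for token in doc]`, where spaCy labels adjectives and adverbs as `ADJ` and `ADV`. Note that the denominator here counts every tag passed in, punctuation included; filter the tags first if only words should count.

```python
from collections import Counter

# spaCy's coarse-grained tags for the four word classes in the task
WORD_CLASSES = ("NOUN", "VERB", "ADJ", "ADV")


def relative_frequencies(pos_tags, per=10_000):
    """Return the number of occurrences of each word class per `per` tokens."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        return {tag: 0.0 for tag in WORD_CLASSES}
    return {tag: counts[tag] / total * per for tag in WORD_CLASSES}


# Hand-written tags standing in for [token.pos_ for token in doc]
example_tags = ["PRON", "NOUN", "AUX", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"]
freqs = relative_frequencies(example_tags)
print(freqs["NOUN"])  # 2 of 8 tokens, so 2500.0
```

A dictionary like `freqs` can be collected per category into a ```pandas``` DataFrame and written out with `to_csv`.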

In [44]:
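import os
import pandas as pd

# Path to the dataset on the shared data drive
path = os.path.join("..", "..", "..", "115274", "news_data", "News_Category_Dataset_v2.json")
data = pd.read_json(path, lines=True)

# Keep only the short descriptions of the articles in the CRIME category
data_subset = data[data["category"] == "CRIME"]
data_descriptions = data_subset["short_description"]

# Run the first description through the pipeline and print its first token.
# .iloc selects by position, so it works whichever row labels survive the filter.
doc = nlp(data_descriptions.iloc[0])
print(doc[0])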




She
