# A quick intro to working with ```spaCy```

In [2]:
import spacy

Once ```spaCy``` has been imported, we need to load a language model.

NB: Models first have to be downloaded. An overview of the available ```spaCy``` models can be found [here](https://spacy.io/usage/models). To download one, run the following from the command line:

```python -m spacy download en_core_web_sm```

In [3]:
# Load a specific (already downloaded) language model
nlp = spacy.load("en_core_web_sm")

The ```nlp``` object is a ```spaCy``` pipeline, which we will use for all of our analysis. Essentially, we feed our language examples into the pipeline and get annotated texts out the other end.

In [5]:
sentence = "My name is Ross and I come from Scotland"

The object that comes out of the pipeline is known as a ```spaCy``` Doc, which is essentially a list of tokens.

However, rather than just being a list of strings, each token in this list has its own *attributes*, which can be accessed using dot notation.

In [6]:
# Call it doc, because it is a spaCy Doc object
doc = nlp(sentence)

In [7]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)

My 		 PRON 		 poss 		 Number=Sing|Person=1|Poss=Yes|PronType=Prs
name 		 NOUN 		 nsubj 		 Number=Sing
is 		 AUX 		 ROOT 		 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Ross 		 PROPN 		 attr 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
I 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
come 		 VERB 		 conj 		 Tense=Pres|VerbForm=Fin
from 		 ADP 		 prep 		 
Scotland 		 PROPN 		 pobj 		 Number=Sing
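
Because every token carries its own attributes, summarising a Doc comes down to iterating over it and counting. The following is a minimal sketch: ```pos_distribution``` is a hypothetical helper name (not part of ```spaCy```), and it relies only on the ```pos_``` attribute printed above.

```python
from collections import Counter


def pos_distribution(doc):
    """Count how often each coarse part-of-speech tag occurs.

    `doc` can be any iterable of objects with a `pos_` attribute,
    such as the spaCy Doc created above.
    """
    return Counter(token.pos_ for token in doc)
```

Judging by the output printed above, ```pos_distribution(doc)``` should report two tokens each for ```PRON``` and ```PROPN``` and one for every other tag.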


We can also visualise certain aspects of the linguistic structure of the sentence, such as the dependency relations between individual words:

In [8]:
spacy.displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Experimenting

- Experiment with different language models available from ```spaCy``` for another language you know - Danish, Dutch, Chinese, Portuguese, whatever.
    - How does ```spaCy``` perform? 
    - Are all the same features available for all languages?
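
One way to approach the second question is to compare which pipeline components each model ships with. This is only a sketch: ```compare_components``` is a hypothetical helper, and it assumes nothing beyond the ```pipe_names``` attribute of a loaded pipeline.

```python
def compare_components(pipelines):
    """Report, for each loaded pipeline, which components it lacks.

    `pipelines` maps a label (for example a language code) to a loaded
    spaCy pipeline; only its `pipe_names` attribute is used. A component
    counts as missing if at least one of the other pipelines has it.
    """
    names = {label: set(nlp.pipe_names) for label, nlp in pipelines.items()}
    all_components = set().union(*names.values())
    return {label: sorted(all_components - have) for label, have in names.items()}
```

Once a second model has been loaded (as ```nlp_da``` is in the next cell), ```compare_components({"en": nlp, "da": nlp_da})``` lists what each pipeline is missing relative to the other. Matching component names do not guarantee matching annotations, so it is still worth comparing the printed labels for each language.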

In [9]:
nlp_da = spacy.load("da_core_news_sm")

In [12]:
sentence = "Hej, hvordan går det med dig?"
doc = nlp_da(sentence)

In [13]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)

Hej 		 PROPN 		 ROOT 		 
, 		 PUNCT 		 punct 		 
hvordan 		 ADV 		 advmod 		 
går 		 VERB 		 obj 		 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
det 		 PRON 		 nsubj 		 Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs
med 		 ADP 		 case 		 AdpType=Prep
dig 		 PRON 		 obl 		 Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
? 		 PUNCT 		 punct 		 


In [14]:
spacy.displacy.serve(doc, style="dep")





Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Task

- In the shared data drive, there is a file called ```News_Category_Dataset_v2.json```. It is taken from [this Kaggle dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and comprises some 200k news headlines from [HuffPost](https://www.huffpost.com/). The data is in *JSON Lines* format, with one JSON object per line. You can load it into ```pandas``` in the following way:

```python
data = pd.read_json(filepath, lines=True)
```

- Select a couple of sub-categories of news data and use ```spaCy``` to find the **relative frequency per 10k words** of each of the following word classes: NOUN, VERB, ADJECTIVE, ADVERB (```ADJ``` and ```ADV``` in ```spaCy```'s tag set)
    - Save the results as a CSV file (again using ```pandas```)
    - Are there any differences in the distributions?

In [24]:
import pandas as pd
data = pd.read_json("/work/115274/news_data/News_Category_Dataset_v2.json", lines=True)

data_crime = data[data["category"] == "CRIME"]

# Collect the crime headlines into a plain Python list.
# Note: data_crime[data_crime["headline"]] raises a KeyError, because pandas
# treats every headline string as a column label. Select the column directly.
headlines = data_crime["headline"].tolist()

print(headlines)
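
The counting step of the task can be sketched as follows. This is one possible approach under stated assumptions, not a reference solution: the helper names are hypothetical, punctuation tokens are assumed not to count as words, and ```nlp``` is assumed to be a loaded ```spaCy``` pipeline such as the one created earlier.

```python
from collections import Counter

# Coarse spaCy tags for the four word classes named in the task.
TARGET_POS = ("NOUN", "VERB", "ADJ", "ADV")


def relative_freq_per_10k(docs, target_pos=TARGET_POS):
    """Return occurrences per 10,000 words for each tag in `target_pos`.

    Assumption: punctuation tokens are not counted as words.
    """
    counts = Counter()
    total_words = 0
    for doc in docs:
        for token in doc:
            if token.is_punct:
                continue
            total_words += 1
            if token.pos_ in target_pos:
                counts[token.pos_] += 1
    if total_words == 0:
        return {pos: 0.0 for pos in target_pos}
    return {pos: counts[pos] / total_words * 10_000 for pos in target_pos}


def frequencies_by_category(headlines_by_category, nlp):
    """Build one row of relative frequencies per news category.

    `headlines_by_category` maps a category name to an iterable of
    headline strings; `nlp` is a loaded spaCy pipeline.
    """
    rows = []
    for category, headlines in headlines_by_category.items():
        docs = nlp.pipe(headlines)  # process many texts as a stream
        rows.append({"category": category, **relative_freq_per_10k(docs)})
    return rows
```

In this notebook, the rows could be produced with ```frequencies_by_category({"CRIME": data.loc[data["category"] == "CRIME", "headline"]}, nlp)```, adding one dictionary entry per category of interest. They can then be saved with ```pd.DataFrame(rows).to_csv("pos_frequencies.csv", index=False)```, where the output filename is a placeholder.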