# Working with NER
In this notebook you will use basic named-entity recognition (NER) to extract potentially interesting metadata. We will use the movies dataset from IMDb.

In [3]:
import pandas as pd
import spacy

# use displacy to visually show the entities. 
from spacy import displacy

# load spacy model. Alternatively you can use en_core_web_lg
nlp = spacy.load("en_core_web_sm")

# load the data
df_movies = pd.read_csv('../data/imdb.csv', sep=',')

OSError: [E053] Could not read config file from C:\Users\Gebruiker\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.2.0\config.cfg
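
The `OSError` above (E053) usually means that the installed model package was built for an older spaCy release than the one being imported: the path points at `en_core_web_sm-2.2.0`, while `config.cfg` is a file that spaCy 3 looks for. Reinstalling the model with `python -m spacy download en_core_web_sm` normally resolves it. As a minimal sketch, assuming you want the notebook to fall back to another installed model, a small helper could look like this (`load_first_available` is a hypothetical name, not part of spaCy):

```python
def load_first_available(names, loader=None):
    """Return (name, pipeline) for the first model in `names` that loads.

    `loader` defaults to `spacy.load`; it is a parameter only so the helper
    can be exercised without spaCy installed (an assumption of this sketch).
    """
    if loader is None:
        import spacy  # imported lazily, only when the default loader is needed
        loader = spacy.load

    errors = {}
    for name in names:
        try:
            return name, loader(name)
        except OSError as exc:
            # spaCy raises OSError when a model is missing or incompatible
            errors[name] = str(exc)
    raise OSError(f"No model could be loaded: {errors}")
```

In the notebook this would be called as `model_name, nlp = load_first_available(["en_core_web_sm", "en_core_web_lg"])`.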

### 1. Data inspection
Take a look at the movie 'Lawrence of Arabia'. Which genres are associated with this movie?

In [None]:
df_movies[df_movies['Title'] == 'Lawrence of Arabia']

In [None]:
# select the plot of the movie Lawrence of Arabia

# code goes here

# parse the text through Spacy NLP
doc = nlp(plot)

# render the text
displacy.render(doc, style="ent")

# alternative output:
# for ent in doc.ents:
#   print (ent.text, ent.label_)
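
The cell above expects a variable `plot` holding the plot text as a string. One possible way to select it, assuming the column names `Title` and `Plot` used elsewhere in this notebook, is sketched below (`get_plot` is a hypothetical helper name):

```python
import pandas as pd


def get_plot(df, title):
    """Return the plot text of the first row whose Title equals `title`."""
    matches = df.loc[df["Title"] == title, "Plot"]
    if matches.empty:
        raise KeyError(f"No movie titled {title!r} in the dataframe")
    # .iloc[0] turns the one-element Series into a plain string
    return matches.iloc[0]


# In the notebook:
# plot = get_plot(df_movies, "Lawrence of Arabia")
```

Taking `.iloc[0]` matters because `nlp` expects a string, not a pandas Series.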

### 2. Compare the output
Compare the output of displaCy with the tags associated with the movie. What do you notice? Would you add the entities found by spaCy to the metadata?

### 3. Process
Now it is time to process everything. Create a new column `plot_entities` by applying the function provided below to the `Plot` column.

In [193]:
def process(x):
  # Some plots are NaN. Return an empty string for those so later steps can skip them.
  if pd.notna(x):
    return nlp(x)
  return ''
  
# code goes here
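
As a rough sketch of this step, the function can be applied with `Series.apply`. The helper below takes the pipeline as a parameter so it can be tried out with any callable; `add_entity_column` is a hypothetical name, and the NaN handling mirrors `process` above.

```python
import pandas as pd


def add_entity_column(df, nlp, source="Plot", target="plot_entities"):
    """Return a copy of `df` with `target` holding `nlp(text)` for each plot."""

    def process(x):
        # Missing plots become an empty string, as in the notebook's `process`
        return nlp(x) if pd.notna(x) else ''

    out = df.copy()
    out[target] = out[source].apply(process)
    return out


# In the notebook:
# df_movies = add_entity_column(df_movies, nlp)
```

For a large dataset, `nlp.pipe` processes texts in batches and is usually faster, at the cost of handling the missing values separately.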

### 4. Extract a specific entity
Next, create a column `events` that holds the entities labelled `EVENT`. Apply the function provided below:

In [204]:
def get_events(x):
  # x is either a spaCy Doc or the empty string used for missing plots
  events = []
  if x != '':
    for ent in x.ents:
      if ent.label_ == 'EVENT':
        events.append(ent.text)
  return events

# code goes here
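
To illustrate what `get_events` returns, the sketch below applies it to a small Series of stand-in objects. The `SimpleNamespace` objects only imitate the `.ents`, `.text` and `.label_` attributes of a spaCy `Doc` and are an assumption of this example, not real parser output.

```python
from types import SimpleNamespace

import pandas as pd


def get_events(x):
  # Same logic as the function provided in the notebook
  events = []
  if x != '':
    for ent in x.ents:
      if ent.label_ == 'EVENT':
        events.append(ent.text)
  return events


# Stand-ins for a parsed plot and for a missing plot (empty string)
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="World War II", label_="EVENT"),
    SimpleNamespace(text="London", label_="GPE"),
])
docs = pd.Series([fake_doc, ""])

events = docs.apply(get_events)

# In the notebook:
# df_movies['events'] = df_movies['plot_entities'].apply(get_events)
```

Every row ends up with a list, which is empty when the plot is missing or contains no `EVENT` entity.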


### 5. Inspect the results
Now that we have an extra column containing a list of events, it is time to count them. A simple approach is to create a new dataframe in which every row holds one item of the list. Save the output to a CSV file or another format so you can explore it.

In [210]:
# code goes here
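
One way to get one row per list item is `DataFrame.explode`, followed by `value_counts`. The sketch below assumes the `events` column holds lists of strings; `count_events` and the output file name are hypothetical.

```python
import pandas as pd


def count_events(df, column="events"):
    """Return a Series mapping each event to the number of rows mentioning it."""
    # explode() creates one row per list item; empty lists become NaN rows
    exploded = df.explode(column).dropna(subset=[column])
    return exploded[column].value_counts()


# In the notebook:
# counts = count_events(df_movies)
# counts.to_csv("event_counts.csv")  # hypothetical file name
```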

Notice that an event such as _The Second World War_ is referred to in different ways, for example World War II or WWII. It is time to clean up the data and create uniform concepts for events. Export the list from time to time to see the changes. You will also notice that many movies are labelled as `war` without mentioning which war, so data cleaning is necessary. Use the function below to clean up the dataset iteratively.

In [206]:
def change_entity(x, value, entities):
  return [value if i in entities else i for i in x]  

entities = ['WWII']
value = 'World War II'

df_movies['events'] = df_movies['events'].apply(lambda x: change_entity(x, value, entities))
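
When the list of variants grows, calling `change_entity` once per concept becomes repetitive. A possible alternative, sketched here under the assumption that a single lookup table is acceptable, maps every variant to its preferred form in one pass. The names `ALIASES` and `normalize_events` are hypothetical, and the table contains only the variants mentioned above.

```python
# Variant spelling -> preferred concept
ALIASES = {
    "WWII": "World War II",
    "The Second World War": "World War II",
}


def normalize_events(events, aliases=ALIASES):
    """Replace known variants and drop duplicates, keeping first-seen order."""
    normalized = (aliases.get(event, event) for event in events)
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(normalized))


# In the notebook:
# df_movies['events'] = df_movies['events'].apply(normalize_events)
```

Dropping duplicates is a design choice: once variants are merged, a single plot may otherwise list the same event twice.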


### 6. Create metadata
Now that you have a simple way to extract entities, which in turn could serve as metadata, it is time to create more columns. Which new columns can you think of? Take a look at [Extend Named Entity Recogniser (NER) to label new entities with spaCy](https://towardsdatascience.com/extend-named-entity-recogniser-ner-to-label-new-entities-with-spacy-339ee5979044) to see the different labels.

In [None]:
# code goes here
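
A generalised version of `get_events` that takes the label as an argument is one way to build several columns at once. This is a sketch: `get_entities` and the column names in `NEW_COLUMNS` are hypothetical, and the labels `PERSON`, `GPE` and `ORG` are assumed to be available in the loaded model (`nlp.get_pipe("ner").labels` lists what a given model supports).

```python
from types import SimpleNamespace


def get_entities(doc, label):
    """Return the texts of all entities in `doc` carrying `label`."""
    # Missing plots were stored as an empty string, so skip plain strings
    if isinstance(doc, str):
        return []
    return [ent.text for ent in doc.ents if ent.label_ == label]


# Hypothetical column name -> entity label
NEW_COLUMNS = {"people": "PERSON", "places": "GPE", "organizations": "ORG"}

# In the notebook (assumes `plot_entities` already exists):
# for column, label in NEW_COLUMNS.items():
#     df_movies[column] = df_movies['plot_entities'].apply(
#         lambda doc, label=label: get_entities(doc, label))
```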

### 7. Finally
Which similarity or clustering algorithm would you use to make use of the metadata?
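
As a starting point for experimenting, and not as the intended answer, set-based measures fit list-valued metadata well. The sketch below computes the Jaccard similarity between the entity lists of two movies; `jaccard` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Two movies without any entities share nothing measurable
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# In the notebook, comparing the first two movies:
# jaccard(df_movies['events'].iloc[0], df_movies['events'].iloc[1])
```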