# Working with NER
In this notebook you will use basic named entity recognition (NER) to extract potentially interesting metadata. We will use the movies dataset from IMDb.

In [2]:
import pandas as pd
import spacy

# use displacy to visually show the entities. 
from spacy import displacy

# load spacy model. Alternatively you can use en_core_web_lg
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# load the data
df_movies = pd.read_csv('../data/imdb.csv', sep=',')

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### 1. Data inspection
Take a look at the movie 'Lawrence of Arabia'. Which genres are connected with this movie?

In [3]:
df_movies[df_movies['Title'] == 'Lawrence of Arabia']

Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,Plot,...,imdbRating,imdbVotes,imdbID,Type,DVD,BoxOffice,Production,Website,Response,tomatoURL
271,Lawrence of Arabia,1962,PG,11 Dec 1962,216 min,"Adventure, Biography, Drama",David Lean,"T.E. Lawrence (writings), Robert Bolt (screenp...","Peter O'Toole, Alec Guinness, Anthony Quinn, J...","The story of T.E. Lawrence, the English office...",...,8.3,207765,tt0056172,movie,03 Apr 2001,,Columbia Pictures,,True,http://www.rottentomatoes.com/m/lawrence_of_ar...


In [6]:
# select the plot of the movie Lawrence of Arabia
plot = df_movies[df_movies['Title'] == 'Lawrence of Arabia']['Plot'].item()
# code goes here

# parse the text through Spacy NLP
doc = nlp(plot)

# render the text
displacy.render(doc, style="ent")

# alternative output:
# for ent in doc.ents:
#   print (ent.text, ent.label_)

### 2. Compare the output
Compare the output of displaCy with the genre tags associated with the movie. What do you notice? Would you add the entities found by spaCy to the metadata?

### 3. Process
Now it is time to process everything. Create a new column `plot_entities` and process the items by applying the provided function below on the `Plot` column.

In [7]:
def process(x):
  # Some plots are NaN. Return an empty string as a placeholder for those,
  # so that later steps can skip them with a simple `x != ''` check.
  if pd.notna(x):
    return nlp(x)
  return ''

df_movies['plot_entities'] = df_movies['Plot'].apply(process)

In [8]:
df_movies['plot_entities'].head()

0    (A, former, intelligence, and, FBI, officer, ,...
1    (A, bus, driver, and, his, sewer, worker, frie...
2    (The, misadventures, of, a, misfit, PT, Boat, ...
3    (A, witch, married, to, an, ordinary, man, can...
4    (The, staff, of, an, army, hospital, in, the, ...
Name: plot_entities, dtype: object

### 4. Extract a specific entity
Now we are going to create a column `events` that holds the entities labelled `EVENT`. Apply the provided function below:

In [9]:
def get_events(x):
  # collect the text of every entity labelled EVENT; skip the '' placeholders
  events = []
  if x != '':
    for ent in x.ents:
      if ent.label_ == 'EVENT':
        events.append(ent.text)
  return events

df_movies['events'] = df_movies['plot_entities'].apply(get_events)


In [10]:
df_movies['events'].head()

0                  []
1                  []
2      [World War II]
3                  []
4    [the Korean war]
Name: events, dtype: object

### 5. Inspect the results
Now that we have an extra column with a list of events, it is time to count them. A simple approach is to create a new dataframe in which every row is one item from the lists. Save the output to a CSV file or another format to explore it.

In [21]:
def export_events(df):
    counts = pd.Series([x for item in df['events'] for x in item]).value_counts()
    counts_df = counts.to_frame('count').rename_axis('event')
    print(counts_df.head())
    counts_df.to_csv('../data/event_counts.csv')

In [23]:
export_events(df_movies)

                 count
event                 
World War II        46
World War I          7
the Vietnam War      6
New Year's Eve       5
the Cold War         4
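
The flatten-and-count step can also be sketched without pandas. The snippet below is a minimal illustration using `collections.Counter`; the sample list `events_per_movie` is hypothetical and only stands in for `df_movies['events'].tolist()`, and `count_events` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical sample standing in for df_movies['events'].tolist()
events_per_movie = [
    [],
    ["World War II"],
    ["World War II", "the Cold War"],
    ["WWII"],
]

def count_events(rows):
    """Flatten a list of event lists and count each distinct event."""
    return Counter(event for row in rows for event in row)

event_counts = count_events(events_per_movie)

# most_common() yields (event, count) pairs, highest count first
for event, count in event_counts.most_common():
    print(f"{event}: {count}")
```

Note that `World War II` and `WWII` are counted separately here, which is exactly the kind of variation the clean-up step has to deal with.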


Notice that an event such as _The Second World War_ is referred to in different ways: World War II or WWII. It is time to clean up the data and create uniform concepts for events. Export the list from time to time to see the changes. You will also notice that many movies are labelled as `war` but do not mention which war, so data cleaning is necessary. Use the function below to clean up the dataset iteratively:

In [26]:
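def change_entity(x, value, entities):
  # replace every item that matches one of `entities` (case-insensitive) with `value`
  lowered = {e.lower() for e in entities}
  return [value if i.lower() in lowered else i for i in x]

# note the comma after every string: without it, Python silently joins adjacent strings
entities = ['WWII', 'Holocaust', 'the Second World War', 'the World War II Battle of Iwo Jima',
            'the World War II Jewish Resistance', 'World War II Germany',
            'the Dutch Resistance', 'American World War II'
            ]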
value = 'World War II'

df_movies['events'] = df_movies['events'].apply(lambda x: change_entity(x, value, entities))
export_events(df_movies)

                 count
event                 
World War II        52
World War I          7
the Vietnam War      6
New Year's Eve       5
the Cold War         4
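
When several concepts need cleaning, one possible alternative to calling `change_entity` once per concept is a single alias table. The sketch below assumes a hypothetical dictionary `ALIASES` that maps lower-cased variants to a canonical name; both the table contents and the helper name `normalize_events` are illustrative, not part of the notebook.

```python
# Hypothetical alias table: lower-cased variant -> canonical event name
ALIASES = {
    "wwii": "World War II",
    "the second world war": "World War II",
    "wwi": "World War I",
    "the first world war": "World War I",
}

def normalize_events(events, aliases=ALIASES):
    """Replace known variants with their canonical name; keep other items unchanged."""
    return [aliases.get(event.lower(), event) for event in events]

cleaned = normalize_events(["WWII", "the Cold War", "The Second World War"])
print(cleaned)
```

Applied to the notebook's data, this would be something like `df_movies['events'].apply(normalize_events)`, with the table extended each time a new variant shows up in the exported counts.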


### 6. Create metadata
Now that you have a simple way to extract entities, which in turn could serve as metadata, it is time to create more columns. Which new columns can you think of? Take a look at [Extend Named Entity Recogniser (NER) to label new entities with spaCy](https://towardsdatascience.com/extend-named-entity-recogniser-ner-to-label-new-entities-with-spacy-339ee5979044) to see the different labels.

In [None]:
# be creative!
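
As one possible starting point, the `EVENT` extraction can be generalised to any label. The sketch below introduces a hypothetical helper `get_entities` that works on any sequence of objects with `.text` and `.label_` attributes, which is the shape of spaCy's `doc.ents`. The sample entities are made-up stand-ins so the snippet runs without a model; they are not actual spaCy output.

```python
from types import SimpleNamespace

def get_entities(ents, label):
    """Return the text of every entity whose label_ equals `label`."""
    return [ent.text for ent in ents if ent.label_ == label]

# Made-up stand-ins mimicking the .text and .label_ attributes of spaCy entities
sample_ents = [
    SimpleNamespace(text="Jane Doe", label_="PERSON"),
    SimpleNamespace(text="Paris", label_="GPE"),
    SimpleNamespace(text="World War I", label_="EVENT"),
    SimpleNamespace(text="London", label_="GPE"),
]

places = get_entities(sample_ents, "GPE")
people = get_entities(sample_ents, "PERSON")
print(places, people)
```

In the notebook, `ents` would be the `.ents` of each parsed plot, with the empty-string placeholders skipped first, so that each label of interest (for example `GPE` or `PERSON`) becomes its own column.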

### 7. Finally
Which similarity or clustering algorithm would you use in order to make use of the metadata?
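
One option worth considering, among several, is Jaccard similarity, which suits metadata stored as sets of labels. The sketch below is a minimal, hedged illustration; the two movie tag lists are invented for the example, and treating two empty sets as having similarity 0 is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as not similar
    return len(set_a & set_b) / len(union)

# Invented metadata for two movies (events and genres mixed together)
movie_a = ["World War II", "Adventure", "Drama"]
movie_b = ["World War II", "Drama", "Romance"]

score = jaccard(movie_a, movie_b)
print(score)
```

The two lists share two of four distinct tags, so the score is 0.5. A pairwise matrix of such scores could then feed a clustering method of your choice.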