### Keyword Analysis
In this notebook, we use Named Entity Recognition (NER) to identify named entities (people, organisations, places, dates and so on) in each message. Entity frequencies are then aggregated to obtain the top 20 entities for a given time period.

<img src="https://miro.medium.com/max/840/1*jr9NAzhv-XnsRrkNM9BISA.png" height=300>
<img src="https://miro.medium.com/max/840/1*dG6L9GHLIKZQKrnkSUj_UA.png" height=320>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import spacy
import pandas as pd
import nltk

In [4]:
df = pd.read_csv("/content/drive/MyDrive/SMT483 FYP/Historical Data/Facepager/processed_facebook_text.csv", index_col=0)
df

Unnamed: 0,message,date_posted,processed_text
1,Embrace hail in Singapore. Just let it go.,2018-01-30,embrace hail singapore let go
2,Background music superb tok Kong. No horse run.,2018-01-31,background music superb tok kong horse run
3,hail stones hitting a moving car unexpectedly ...,2018-01-31,hail stone hitting moving car unexpectedly dau...
4,Ya y no drop money drop ice cubes.,2018-01-31,ya drop money drop ice cube
5,Ur BGM sibei irritating can,2018-01-31,ur bgm sibei irritating
...,...,...,...
326778,Insane in the membrane,2021-12-01,insane membrane
326779,Straight people are normal people ? All mother...,2021-12-01,straight people normal people mother love
326780,They can massage all they want inside the cell...,2021-12-01,massage want inside cell soon
326781,Both(mum n idiot) are very Sick. Jail both for...,2021-12-01,mum n idiot sick jail 20 year let die inside good


In [5]:
df = df[df["processed_text"].notna()]
df

Unnamed: 0,message,date_posted,processed_text
1,Embrace hail in Singapore. Just let it go.,2018-01-30,embrace hail singapore let go
2,Background music superb tok Kong. No horse run.,2018-01-31,background music superb tok kong horse run
3,hail stones hitting a moving car unexpectedly ...,2018-01-31,hail stone hitting moving car unexpectedly dau...
4,Ya y no drop money drop ice cubes.,2018-01-31,ya drop money drop ice cube
5,Ur BGM sibei irritating can,2018-01-31,ur bgm sibei irritating
...,...,...,...
326778,Insane in the membrane,2021-12-01,insane membrane
326779,Straight people are normal people ? All mother...,2021-12-01,straight people normal people mother love
326780,They can massage all they want inside the cell...,2021-12-01,massage want inside cell soon
326781,Both(mum n idiot) are very Sick. Jail both for...,2021-12-01,mum n idiot sick jail 20 year let die inside good


In [6]:
ner = spacy.load('en_core_web_sm')

In [7]:
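def extract_entities(text):
    """Return a dict mapping each kept entity label to the entity text found in `text`."""
    entity_dict = {}
    # spaCy entity labels to keep; entities with any other label are ignored.
    entities = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "DATE"]

    doc = ner(text)
    for ent in doc.ents:
        word, label = ent.text, ent.label_
        if label in entities:
            if label not in entity_dict:
                entity_dict[label] = word
            else:
                # Separate repeated labels so the mentions do not run together.
                entity_dict[label] += ", " + word
    return entity_dict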
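
Storing all mentions of a label in one string makes later counting awkward, because individual mentions have to be split apart again. A minimal sketch of an alternative that keeps one list per label, assuming the (text, label) pairs have already been read from `doc.ents`; `group_mentions` and the sample pairs are hypothetical and not part of the original notebook:

```python
from collections import defaultdict

# Hypothetical helper: group (mention, label) pairs into one list per label.
def group_mentions(pairs, keep_labels):
    grouped = defaultdict(list)
    for mention, label in pairs:
        if label in keep_labels:
            grouped[label].append(mention)
    return dict(grouped)

# Made-up pairs, shaped like [(ent.text, ent.label_) for ent in doc.ents].
sample_pairs = [("Singapore", "GPE"), ("Malaysia", "GPE"), ("two", "CARDINAL")]
grouped = group_mentions(sample_pairs, keep_labels={"GPE", "ORG"})
print(grouped)
```

With lists, each mention can be counted on its own, at the cost of a slightly bulkier `entities` column.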

In [8]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame.
df = df.assign(entities=df["message"].apply(extract_entities))
df


Unnamed: 0,message,date_posted,processed_text,entities
1,Embrace hail in Singapore. Just let it go.,2018-01-30,embrace hail singapore let go,{'GPE': 'Singapore'}
2,Background music superb tok Kong. No horse run.,2018-01-31,background music superb tok kong horse run,"{'ORG': 'Background', 'GPE': 'Kong'}"
3,hail stones hitting a moving car unexpectedly ...,2018-01-31,hail stone hitting moving car unexpectedly dau...,{'ORG': 'hail stones'}
4,Ya y no drop money drop ice cubes.,2018-01-31,ya drop money drop ice cube,{}
5,Ur BGM sibei irritating can,2018-01-31,ur bgm sibei irritating,{'ORG': 'Ur BGM'}
...,...,...,...,...
326778,Insane in the membrane,2021-12-01,insane membrane,{}
326779,Straight people are normal people ? All mother...,2021-12-01,straight people normal people mother love,{}
326780,They can massage all they want inside the cell...,2021-12-01,massage want inside cell soon,{}
326781,Both(mum n idiot) are very Sick. Jail both for...,2021-12-01,mum n idiot sick jail 20 year let die inside good,{'DATE': '20 years'}
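
The frequency aggregation mentioned in the introduction is not shown in this section. A rough sketch of one way to do it, assuming a calendar month is the time period of interest and that each dict value is counted as a single entity string; `top_entities_by_month` and the toy rows are illustrative assumptions, not the notebook's own method:

```python
from collections import Counter, defaultdict

def top_entities_by_month(rows, n=20):
    """rows: iterable of (date string 'YYYY-MM-DD', entity dict) pairs.

    Returns {'YYYY-MM': [(entity, count), ...]} with the n most frequent
    entities for each month.
    """
    counts = defaultdict(Counter)
    for date_posted, entity_dict in rows:
        month = date_posted[:7]  # keep the 'YYYY-MM' prefix
        counts[month].update(entity_dict.values())
    return {month: counter.most_common(n) for month, counter in counts.items()}

# Made-up rows in the same shape as the date_posted and entities columns.
toy_rows = [
    ("2018-01-30", {"GPE": "Singapore"}),
    ("2018-01-31", {"GPE": "Singapore", "ORG": "MRT"}),
    ("2018-02-01", {"GPE": "Malaysia"}),
    ("2018-02-02", {}),
]
top = top_entities_by_month(toy_rows, n=20)
print(top)
```

On the real data, the pairs could come from `zip(df["date_posted"], df["entities"])`, since `date_posted` is read from the CSV as a plain string.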


In [12]:
df.to_csv("/content/drive/MyDrive/SMT483 FYP/Historical Data/Facepager/ner.csv")
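
One caveat when saving to CSV: the dicts in `entities` are written out as their string representation, so they come back as plain strings when the file is read again. A small sketch of restoring them with `ast.literal_eval`, using an in-memory CSV with made-up rows rather than the real file:

```python
import ast
import csv
import io

# Write two made-up rows to an in-memory CSV; each dict is stored as str(dict).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["message", "entities"])
writer.writerow(["Embrace hail in Singapore.", str({"GPE": "Singapore"})])
writer.writerow(["Ya y no drop money.", str({})])

# Read the rows back: the entities field is a string until it is parsed.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = [ast.literal_eval(row["entities"]) for row in rows]
print(type(rows[0]["entities"]).__name__, restored)
```

With pandas, the same parsing can be applied while loading by passing `converters={"entities": ast.literal_eval}` to `pd.read_csv`.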

