## EXERCISE
### Using the cleaned 20newsgroups dataset, try to extract from each document all organizations (ORG), dates (DATE), people (PERSON) and locations (LOC)

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import re
from nltk.corpus import stopwords
import nltk
import string
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import spacy

RANDOM_SEED = 176

## Dataset import

Import all records, without a train/test split.

In [2]:
X, y = fetch_20newsgroups(subset="all", return_X_y=True, random_state=None)

## Text cleaning
Although it is not strictly required, the text is cleaned to remove punctuation and stopwords.

The `@` and `.` symbols are kept so that email addresses are preserved.

The text is not lowercased, so that the capitalization of personal names is preserved.
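The punctuation rule above can be illustrated in isolation. The following is a minimal sketch that uses `str.translate` instead of a character-by-character loop; the names `KEEP`, `STRIP_TABLE` and `strip_punctuation`, as well as the sample string, are hypothetical and not part of the notebook.

```python
import string

# Assumption: same rule as described above, keep only '@' and '.'
KEEP = "@."
REMOVE = "".join(c for c in string.punctuation if c not in KEEP)

# Translation table that deletes every character listed in REMOVE
STRIP_TABLE = str.maketrans("", "", REMOVE)


def strip_punctuation(text):
    """Remove ASCII punctuation in a single pass, keeping '@' and '.'."""
    return text.translate(STRIP_TABLE)


# Illustrative input only; not taken from the dataset
sample = "Contact: jane.doe@example.com (MIT), ASAP!"
print(strip_punctuation(sample))  # Contact jane.doe@example.com MIT ASAP
```

Note that hyphens and apostrophes are removed as well, so a token such as `don't` becomes `dont`.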

In [3]:
nltk.download("stopwords")
en_stopwords = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aless\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
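def text_cleaning(sentence):
    """
    Clean a text string: remove punctuation (except '@' and '.') and English stopwords.
    """
    # lowercasing is intentionally skipped to preserve the capitalization of names
    # sentence = sentence.lower()

    # remove punctuation, keeping '@' and '.' so that email addresses survive
    for c in string.punctuation:
        if c not in "@.":
            sentence = sentence.replace(c, "")

    # remove stopwords (split/join also normalizes whitespace)
    sentence = " ".join(word for word in sentence.split()
                        if word not in en_stopwords)

    # collapse any remaining runs of spaces into a single space
    sentence = re.sub(r" +", " ", sentence)
    return sentence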

In [5]:
X = [text_cleaning(s) for s in X]

## NER
NER is performed with the `en_core_web_sm` model from spaCy.

It can be downloaded from the console with:
```
python -m spacy download en_core_web_sm
```

In [6]:
#load model
nlp = spacy.load('en_core_web_sm')

In [7]:
# one list per entity label of interest
ner_tags = {"ORG": [], "DATE": [], "PERSON": [], "LOC": []}

for sentence in X:
    doc = nlp(sentence)
    # collect entity text token by token (multi-word entities are split into single tokens)
    for token in doc:
        if token.ent_type_ in ner_tags:
            ner_tags[token.ent_type_].append(token.text)
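The loop above collects individual tokens into corpus-wide lists, so a multi-word name is split into separate items and the per-document grouping asked for in the exercise is lost. A minimal sketch of one alternative follows: it iterates over `doc.ents`, which yields whole entity spans, and keeps one result dict per document. The function name `entities_per_document` and the stand-in objects are hypothetical; the stand-ins only mimic the `ents`, `label_` and `text` attributes of spaCy objects so that the block runs without a model.

```python
from types import SimpleNamespace

LABELS = ("ORG", "DATE", "PERSON", "LOC")


def entities_per_document(docs, labels=LABELS):
    """Return one dict per document, mapping each label to its entity texts."""
    results = []
    for doc in docs:
        found = {label: [] for label in labels}
        for ent in doc.ents:  # whole entity spans, not single tokens
            if ent.label_ in found:
                found[ent.label_].append(ent.text)
        results.append(found)
    return results


# Stand-in documents for illustration only (not real model output)
fake_docs = [
    SimpleNamespace(ents=[SimpleNamespace(label_="PERSON", text="Len Reed")]),
    SimpleNamespace(ents=[]),
]
print(entities_per_document(fake_docs))
```

With the real pipeline, the call would look like `entities_per_document(nlp.pipe(X))`, where `nlp.pipe` processes the texts as a stream. In spaCy's English models, countries, cities and states are typically labelled `GPE` rather than `LOC`, so adding `"GPE"` to the labels may be worth considering if such places are wanted.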

In [8]:
for k in ner_tags:
    # print(f"{k}: {len(ner_tags[k])}")
    print(f"{k}: {ner_tags[k][:10]}")

ORG: ['Ryan', 'Robbins', 'IO20456@MAINE.MAINE.EDU', 'IMO', 'Jack', 'Morris', 'Organization', 'Massachusetts', 'Institute', 'Technology']
DATE: ['5', 'years', 'ago', 'one', 'average', 'year', 'last', '5', 'tomorrow', 'last']
PERSON: ['Len', 'Reed', 'Len', 'Reed', 'Holos', 'Software', 'Inc.', 'Charles', 'M', 'Kozierok']
LOC: ['Mars', 'Venus', 'Mount', 'Carmel', 'Mt.', 'Everest', 'Mount', 'Everest', 'Mars', 'Mars']
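
The first ten items of each list give only a glimpse of the results. As a rough sketch, a frequency count can summarize a dictionary shaped like `ner_tags`; the helper `top_entities` and the toy data below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_entities(tags, n=5):
    """Map each label to its n most frequent items as (text, count) pairs."""
    return {label: Counter(items).most_common(n) for label, items in tags.items()}


# Toy data for illustration only; not taken from the dataset
toy_tags = {"LOC": ["Mars", "Venus", "Mars"], "ORG": []}
print(top_entities(toy_tags, n=2))
# {'LOC': [('Mars', 2), ('Venus', 1)], 'ORG': []}
```

In the notebook this would be called as `top_entities(ner_tags, n=10)`.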
