<a href="https://colab.research.google.com/github/RogShotz/CSC-482-Project/blob/model-testing/CSC482_Project_Geneology_Extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Genealogy Extractor**

---

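This notebook is groundwork for extracting genealogy relations from wiki articles. It trains a Naive Bayes classifier to detect a 'son' relation in a sentence and uses spaCy named-entity recognition to list the people mentioned in that sentence.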

## **Packages and Imports**

In [None]:
!pip install spacy
!pip install spacy-transformers
!python -m spacy download en_core_web_trf

In [None]:
import nltk, csv, spacy
from google.colab import drive
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.classify import accuracy


nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## **DataSets and Setup**

The datasets are in this Google Drive folder:

https://drive.google.com/drive/folders/1NVlU3sMEsQrF7SBM-F6SIbx27N91FKfj?usp=sharing
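
The loader defined below expects a labelled file with a header row followed by two columns (sentence, label), and an unlabelled file with a single column of sentences. The following is a minimal sketch of the labelled layout, built in memory so it runs without Drive access. The header names `text` and `label`, the negative label `not_son`, and the example rows are assumptions for illustration; the notebook only requires that the negative label start with `not`.

```python
import csv
import io

# Build a small labelled CSV in memory (hypothetical rows and label names).
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["text", "label"])  # header row, skipped by the loader
writer.writerow(["James hugged his son, Matthew, tightly.", "son"])
writer.writerow(["The historian cataloged artifacts.", "not_son"])

# Read it back the same way the notebook's loader does.
buffer.seek(0)
reader = csv.reader(buffer, quoting=csv.QUOTE_MINIMAL)
header = next(reader)
rows = [(text, label) for text, label in reader]

for text, label in rows:
    print(f"{label}: {text}")
```

Sentences that contain commas are quoted on write, so they still parse back into exactly two columns.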

In [35]:
# Function Definitions

# Load the spaCy transformer model for NER (downloaded in the install cell above)
nlp = spacy.load("en_core_web_trf")

# Dataset extraction. Mount Google Drive before running,
# e.g. drive.mount('/content/drive'), so the /content/drive paths below resolve
def extract_columns_from_csv(file_path, has_labels=True):
    texts = []
    labels = []

    with open(file_path, 'r', newline='', encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, quoting=csv.QUOTE_MINIMAL)
        if has_labels:
            next(csv_reader)  # Skip the header row

        for row in csv_reader:
            if has_labels:
                text, label = row
                labels.append(label)
            else:
                text = row[0]

            texts.append(text)

    if has_labels:
        return texts, labels
    else:
        return texts


# Tokenize and preprocess text data
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # Build the stopword set once per call
    words = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    # Keep alphanumeric tokens that are not stopwords
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return {word: True for word in words}  # Bag-of-words presence features

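# Extracts the PERSON entities from each sentence; an entry is None when no person is found
def extract_people(sentences):
    # Reuses the module-level `nlp` (en_core_web_trf) loaded above instead of
    # reloading a different model on every call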

    people_list = []
    for sentence in sentences:
        doc = nlp(sentence)
        people_in_sentence = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        if people_in_sentence:
            people_list.append(people_in_sentence)
        else:
            people_list.append(None)

    return people_list
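
The feature dictionary that `preprocess_text` produces can be illustrated without NLTK. The sketch below is a rough stand-in, not the notebook's function: it tokenizes with a regular expression and uses a tiny hand-picked stopword list (both assumptions), whereas the notebook uses `word_tokenize` and NLTK's full English stopword list, so results can differ on contractions and punctuation. The example sentence is hypothetical.

```python
import re

# Tiny stand-in stopword list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "after", "and", "from", "his", "in", "of", "the"}

def sketch_features(text):
    """Return {word: True} for each lowercase alphanumeric token that is not a stopword."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tok: True for tok in tokens if tok not in STOP_WORDS}

features = sketch_features("The boy and his father built a model airplane together.")
print(" ".join(features.keys()))
```

Each surviving word becomes a key mapped to `True`, which is the presence-only feature format that `NaiveBayesClassifier.train` accepts.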


## **Training**

---

Change the file path to the data file you want to train with. The file must contain labelled data.

In [67]:
file_path = '/content/drive/MyDrive/CSC482_data_sets/son.csv'
# Returns a (sentences, labels) tuple: texts[0] holds the sentences, texts[1] the labels
texts = extract_columns_from_csv(file_path)

In [68]:
# Prepare the labeled data in the required format for NLTK
featuresets = [(preprocess_text(text), label) for text, label in zip(texts[0], texts[1])]
label_set = set(texts[1])

# Split data into training and testing sets
train_set, test_set = train_test_split(featuresets, test_size=0.2, random_state=23)

# Train the Naive Bayes classifier
nb_classifier = NaiveBayesClassifier.train(train_set)

# Get the accuracy of the model
accuracy_result = accuracy(nb_classifier, test_set)
print(f"Accuracy: {accuracy_result}")

Accuracy: 0.9666666666666667
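
To give a feel for where the class probabilities come from, here is a small pure-Python sketch of a Naive Bayes classifier over word-presence features. It is an illustration under stated assumptions, not NLTK's implementation: it uses add-one smoothing and ignores words never seen in training, while NLTK's estimator and its handling of absent features differ, so the numbers will not match the notebook's. The toy training sentences and the `not_son` label are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(featuresets):
    """featuresets: list of (set_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in featuresets:
        label_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
    return label_counts, word_counts

def prob_classify(model, words):
    """Return {label: probability} for a set of words."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    vocab = set().union(*(counts.keys() for counts in word_counts.values()))
    log_scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)  # log prior
        for word in words:
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothed estimate of P(word present | label)
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        log_scores[label] = score
    # Normalize the log scores into probabilities
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}

toy_train = [
    ({"james", "son", "farm"}, "son"),
    ({"boy", "father", "airplane"}, "son"),
    ({"historian", "artifacts"}, "not_son"),
    ({"dog", "family", "rescue"}, "not_son"),
]
model = train_nb(toy_train)
probs = prob_classify(model, {"father", "son", "fishing"})
print(probs)
```

Words such as "father" and "son" appear only in the positive examples, so they pull the probability toward the `son` label.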


**More informative printouts for testing**

In [69]:
# Print out the test set and probabilities
print("Text has stop words filtered out\n------")
label_set = set(texts[1])
neg_label = next((s for s in label_set if s.startswith('not')), None)
pos_label = next((s for s in label_set if not s.startswith('not')), None)
for i, (text_features, _) in enumerate(test_set):
    if i >= 10:
        break
    prob_dist = nb_classifier.prob_classify(text_features)
    print(f"Text: {' '.join(text_features.keys())}")
    print(f"Probability of having 'son' relation: {prob_dist.prob(pos_label)}")
    print(f"Probability of not having 'son' relation: {prob_dist.prob(neg_label)}")
    print("------")

Text has stop words filtered out
------
Text: historian meticulously cataloged artifacts ensuring preservation future generations
Probability of having 'son' relation: 0.0071874641884249035
Probability of not having 'son' relation: 0.9928125358115756
------
Text: grace compassionate veterinarian cared animals unwavering dedication making positive impact countless furry companions
Probability of having 'son' relation: 0.0009946160199898856
Probability of not having 'son' relation: 0.9990053839800108
------
Text: sophie christopher adopted rescue dog family
Probability of having 'son' relation: 0.000947254678231364
Probability of not having 'son' relation: 0.999052745321767
------
Text: boy father built model airplane together
Probability of having 'son' relation: 0.9979349723607761
Probability of not having 'son' relation: 0.002065027639222439
------
Text: quiet village james son benjamin tended family farm cultivating deep connection land traditions
Probability of having 'son' relation

## **Labelling and Getting Relations**

In [60]:
def get_son_relation(classifier, texts, label_set, people, threshold=0.7):
  """
  Inputs:
  classifier: a trained classifier
  texts: the data to be classified, list of sentences
  label_set: set of the labels (positive and negative; the negative label starts with "not")
  people: list of named people per sentence, aligned with texts
  threshold: minimum positive-label probability for a sentence to be kept

  Outputs:
  list of tuples, one per sentence that meets the threshold:
          -positive label
          -probability
          -sentence
          -named people
  """
  pos_label = next((s for s in label_set if not s.startswith('not')), None)
  featureset = [preprocess_text(text) for text in texts]

  has_relation = []

  for i, text_features in enumerate(featureset):
    # Use the classifier passed in rather than the global nb_classifier
    prob_dist = classifier.prob_classify(text_features)
    prob_relation = prob_dist.prob(pos_label)

    if prob_relation >= threshold:
      has_relation.append((pos_label, prob_relation, texts[i], people[i]))

  return has_relation

**Change `file_path` to your CSV file (a single column of strings, with no header row)**

In [61]:
file_path = '/content/drive/MyDrive/CSC482_data_sets/data.csv'
texts = extract_columns_from_csv(file_path, has_labels=False)
people = extract_people(texts)
relations = get_son_relation(nb_classifier, texts, label_set, people)

**Results are stored as a list of tuples: (label, probability, sentence, people)**

In [70]:
relations[1]

('son',
 0.9324277119804167,
 'James hugged his son, Matthew, tightly after returning from his long deployment overseas.',
 ['James', 'Matthew'])
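
Positional tuples are easy to misread downstream, so one option is to wrap each record in a named tuple. This is a minimal sketch; the `Relation` class and its field names are hypothetical and not part of the notebook, and the record is copied from the output above.

```python
from typing import List, NamedTuple, Optional

class Relation(NamedTuple):
    label: str
    probability: float
    sentence: str
    people: Optional[List[str]]  # None when no PERSON entity was found

# Record copied from the notebook output above.
record = (
    'son',
    0.9324277119804167,
    'James hugged his son, Matthew, tightly after returning from his long deployment overseas.',
    ['James', 'Matthew'],
)

relation = Relation(*record)
print(f"{relation.label} ({relation.probability:.2f}): {relation.people}")
```

Because `Relation` is still a tuple, existing code that indexes `relations[1][3]` keeps working alongside `relation.people`.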