<a href="https://colab.research.google.com/github/AlexanderHargrave/AlexanderHargrave/blob/main/Group_13_Unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Evidence detection using an unsupervised or traditional approach**

This notebook approaches the task of evidence detection with a Conditional Random Field (CRF). It first cleans and tokenizes the text and assigns POS tags, then extracts the features used by the CRF model. These features are:


*   The words in the 'claim' section
*   The POS tags of those words
*   The minimum distance between token positions in the claim and the evidence
*   Whether each claim word also appears in the 'evidence' section
*   Word2Vec embeddings, which provide contextual information for the words
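
The distance and overlap features in the list above can be sketched on plain token lists. This is a simplified illustration rather than the notebook's exact code: the function name `token_features` and the toy tokens are assumptions, POS tags and embeddings are left out, and every evidence position is considered, whereas the notebook looks positions up with `list.index`, which returns the first occurrence of a repeated token.

```python
def token_features(claim_tokens, evidence_tokens):
    """Hypothetical helper: per-token distance and overlap features for a claim."""
    evidence_lower = {tok.lower() for tok in evidence_tokens}
    features = []
    for i, word in enumerate(claim_tokens):
        # Smallest positional gap between this claim token and any evidence
        # position; -1 when the evidence has no tokens
        min_distance = min(
            (abs(i - j) for j in range(len(evidence_tokens))), default=-1
        )
        features.append({
            "word": word,
            "min_distance": min_distance,
            "in_evidence": word.lower() in evidence_lower,
        })
    return features


# Toy claim and evidence tokens (made up for illustration)
claim = ["coffee", "improves", "memory", "adults"]
evidence = ["Caffeine", "memory"]
for feature in token_features(claim, evidence):
    print(feature)
```

Only "memory" is flagged as appearing in the evidence, and the distance grows once the claim position runs past the end of the shorter evidence text.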

These features are used to train the CRF model, which is implemented with the sklearn_crfsuite library. The trained model then predicts labels for both the development set and the test set. Predictions on the development set are evaluated with accuracy_score and classification_report from sklearn.metrics.
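
The evaluation step can be illustrated without the libraries. The sketch below collapses per-token tags to one label per claim-evidence pair by taking the first tag of each sequence, then computes accuracy, precision, recall and F1 for the positive class, which are the quantities that accuracy_score and classification_report summarise. The helper name `binary_report` and the toy labels are assumptions for illustration, not part of the notebook.

```python
from collections import Counter


def binary_report(y_true, y_pred, positive="1"):
    """Hypothetical helper: accuracy plus precision/recall/F1 for one class."""
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(positive, positive)]
    fp = sum(c for (t, p), c in counts.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in counts.items() if t == positive and p != positive)
    correct = sum(c for (t, p), c in counts.items() if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy per-token tag sequences, one sequence per claim-evidence pair
token_preds = [["1", "1", "1"], ["0", "0"], ["1", "1"], ["0"]]
# Collapse each sequence to a single label by taking its first tag
y_pred = [seq[0] for seq in token_preds]
y_true = ["1", "0", "0", "1"]
print(binary_report(y_true, y_pred))
```

Taking the first tag assumes every sequence has at least one token; a claim that is empty after cleaning would need separate handling.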




In [1]:
!pip install sklearn-crfsuite
import pandas as pd
import nltk
nltk.download('all')
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score, classification_report
from gensim.models import Word2Vec
import multiprocessing
import gensim.downloader as api
import string

# Build the stop-word set and lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()


# Tokenization and POS tagging function
def tokenize_and_tag(text):
    """Return (token, POS tag) pairs for the cleaned text."""
    text = str(text)
    tokens = word_tokenize(text)
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]
    # Remove stop words
    tokens = [word for word in tokens if word.lower() not in STOP_WORDS]
    # Lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    # POS tagging
    return nltk.pos_tag(tokens)


# Feature extraction function
def extract_features(data, word_embeddings_model, dev_test):
    X = []
    y = []

    for _, row in data.iterrows():
        claim_tokens = tokenize_and_tag(row['Claim'])
        evidence_tokens = tokenize_and_tag(row['Evidence'])
        features = []

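        # Lower-cased evidence words, computed once per row for the overlap feature
        evidence_words = [e_word.lower() for e_word, _ in evidence_tokens]

        for word, pos_tag in claim_tokens:
            # Smallest gap between this claim token's position and any evidence
            # token position (list.index returns the first occurrence); -1 when
            # the evidence has no tokens
            claim_idx = claim_tokens.index((word, pos_tag))
            min_distance = min(
                [abs(claim_idx - evidence_tokens.index(e_token)) for e_token in evidence_tokens]
                or [-1]
            )

            # Look up the embedding if the word is in the Word2Vec vocabulary
            word_embedding = None
            if word in word_embeddings_model.wv:
                word_embedding = word_embeddings_model.wv[word]

            # Each token is described by a set of 'name=value' strings; the
            # embedding is stringified, so it acts as a categorical feature
            features.append({
                f'word={word}',
                f'pos_tag={pos_tag}',
                f'min_distance={min_distance}',
                f'in_evidence={word.lower() in evidence_words}',
                f'word_embedding={word_embedding}'
            })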

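        X.append(features)
        # Labels are only built when dev_test == 'dev', i.e. for labelled data;
        # every token of a claim receives the label of its claim-evidence pair
        if dev_test == 'dev':
            y.append(['1' if row['label'] == 1 else '0' for _ in claim_tokens])

    return X, y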





In [2]:
# Load data
train_data = pd.read_csv('train.csv')
dev_data = pd.read_csv('dev.csv')

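# Build the Word2Vec training corpus from all training claims and evidence
all_texts = train_data['Claim'].tolist() + train_data['Evidence'].tolist()
# str() guards against missing (NaN) entries, as in tokenize_and_tag
tokenized_texts = [word_tokenize(str(text)) for text in all_texts]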

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=5, min_count=2, workers=multiprocessing.cpu_count())

# Save the trained Word2Vec model for future use
word2vec_model.save('word2vec_model.bin')

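# Extract features and labels; the 'dev' flag makes extract_features build labels
X_train, y_train = extract_features(train_data, word2vec_model, 'dev')
# c1 and c2 are the L1 and L2 regularization coefficients
crf = CRF(algorithm='lbfgs', linesearch='MoreThuente', min_freq=1, c1=0.1, c2=0.9,
          max_iterations=85, all_possible_transitions=True)
# sklearn-crfsuite 0.3.6 can raise an AttributeError after fitting has finished,
# apparently due to an incompatibility with newer scikit-learn releases; the
# model is already trained at that point, so the error is ignored
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass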


In [3]:


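# Evaluate the model on the development set
X_dev, y_dev = extract_features(dev_data, word2vec_model, 'dev')
y_pred = crf.predict(X_dev)
# Every token in a sequence was given the pair's label during training, so the
# first tag of each sequence is used as the label for the whole pair
y_pred = [sublist[0] for sublist in y_pred]
y_dev = [sublist[0] for sublist in y_dev]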
accuracy = accuracy_score(y_dev, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_dev, y_pred))


Accuracy: 0.8044211947350658
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.90      0.87      4327
           1       0.67      0.54      0.60      1599

    accuracy                           0.80      5926
   macro avg       0.76      0.72      0.73      5926
weighted avg       0.80      0.80      0.80      5926



In [7]:
test_data = pd.read_csv('test.csv')
# Labels are not extracted for the test set: any flag other than 'dev' skips them
X_test, _ = extract_features(test_data, word2vec_model, 'train')
y_pred = crf.predict(X_test)
# Use the first token's predicted tag as the label for each claim-evidence pair
y_pred = [sublist[0] for sublist in y_pred]
result_df = pd.DataFrame(y_pred, columns=['prediction'])
result_df.to_csv('./Group_13_A.csv', index=False, header=True)