# The GAP (Gendered Ambiguous Pronouns) dataset

The GAP dataset is designed for coreference resolution, specifically for resolving ambiguous pronouns to their correct antecedents. It contains English text passages, each with one ambiguous pronoun and two candidate entities labeled A and B. The goal is to build models that correctly identify whether the pronoun refers to A, to B, or to neither.

Here's a brief summary of the structure of the GAP dataset:

1. **Columns:**
   - **ID:** A unique identifier for each example.
   - **Text:** The passage (one or more sentences) containing the ambiguous pronoun.
   - **Pronoun:** The ambiguous pronoun in the passage.
   - **Pronoun-offset:** The character offset of the pronoun within `Text`.
   - **A, B:** The two candidate entities to which the pronoun may refer.
   - **A-offset, B-offset:** The character offsets of entities A and B within `Text`.
   - **A-coref, B-coref:** Boolean labels indicating whether the pronoun refers to entity A or entity B, respectively.

2. **Labels:**
   - **A-coref, B-coref:** These labels serve as the training targets. A value of `True` (1) indicates that the pronoun refers to the corresponding entity, and `False` (0) indicates that it does not.

3. **Task:**
   - The task associated with this dataset is to build a model that, given a sentence with an ambiguous pronoun, predicts whether the pronoun refers to entity A, entity B, or neither.
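
The two labels can be collapsed into the three-way decision described above. The following is a minimal sketch, assuming the labels have already been parsed as Python booleans and that at most one of them is true for any row. `gap_label` is a hypothetical helper name, not part of any dataset tooling.

```python
def gap_label(a_coref: bool, b_coref: bool) -> str:
    """Map the A-coref / B-coref flags to one of "A", "B", or "NEITHER"."""
    if a_coref and b_coref:
        # Assumption: the pronoun resolves to at most one candidate.
        raise ValueError("A-coref and B-coref should not both be true")
    if a_coref:
        return "A"
    if b_coref:
        return "B"
    return "NEITHER"


print(gap_label(True, False))   # A
print(gap_label(False, False))  # NEITHER
```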

Here is an illustrative snippet of what the data might look like (shown comma-separated for readability; the actual file is tab-separated and also contains a URL column):

```plaintext
ID, Text, Pronoun, Pronoun-offset, A, A-offset, B, B-offset, A-coref, B-coref
example1, "John met Susan in the park. He said she had a dog.", He, 28, John, 0, Susan, 9, True, False
example2, "Alice thanked Mary because she had fixed the bug.", she, 27, Alice, 0, Mary, 14, False, True
```

In this example, the model needs to predict whether "He" refers to John or Susan and whether "she" refers to Alice or Mary.
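
The offset columns can be sanity-checked by slicing the text at the given position. The sketch below assumes offsets are 0-based character positions. `span_matches` is a hypothetical helper, and the offsets are recomputed with `str.find` purely for illustration (in the dataset they are provided as columns).

```python
def span_matches(text: str, mention: str, offset: int) -> bool:
    """Return True if `mention` starts at character position `offset` in `text`."""
    return text[offset:offset + len(mention)] == mention


text = "John met Susan in the park. He said she had a dog."

# Recompute the offsets for illustration only.
pronoun_offset = text.find("He")
a_offset = text.find("John")
b_offset = text.find("Susan")

print(pronoun_offset, a_offset, b_offset)        # 28 0 9
print(span_matches(text, "He", pronoun_offset))  # True
```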



# The `CoRefModel`

`CoRefModel` is a simple feed-forward neural network intended for pairwise ranking tasks, such as ranking the candidate antecedents of an ambiguous pronoun in coreference resolution. Let's break down its components and discuss their relevance to the task:

```python
class CoRefModel(nn.Module):
    def __init__(self, input_dim):
        super(CoRefModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
```

1. **Initialization (`__init__` method):**
   - `input_dim`: This parameter represents the dimensionality of the input features. In the case of coreference resolution, it could be the dimensionality of the feature vectors representing pairs of mentions (e.g., TF-IDF vectors or embeddings).

   - `nn.Linear(input_dim, 64)`: This is the first fully connected (linear) layer. It takes the input features and maps them to a 64-dimensional intermediate representation.

   - `nn.ReLU()`: The Rectified Linear Unit (ReLU) activation function is applied element-wise after the first linear layer. ReLU introduces non-linearity to the model, allowing it to learn complex relationships in the data.

   - `nn.Linear(64, 1)`: The second linear layer reduces the 64-dimensional representation to a single output. In the context of CoRefModel, this output is interpreted as the predicted ranking score for a pair of mentions.

```python
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```

2. **Forward Pass (`forward` method):**
   - `x`: This represents the input features, such as the TF-IDF vectors or embeddings for pairs of mentions.

   - `self.relu(self.fc1(x))`: The input features pass through the first linear layer, followed by the ReLU activation function. This introduces non-linearity to the model's transformations.

   - `self.fc2(x)`: The output of the first layer is then passed through the second linear layer, producing a single-dimensional output. In the context of pairwise ranking, this output can be interpreted as the predicted ranking score for the pair of mentions.

   - `return x`: The final output is returned, representing the model's predicted ranking score for the input pair of mentions.

**Relevance to the Task:**
   - The model is designed for pairwise ranking, which is suitable for tasks where the goal is to rank pairs of items. In coreference resolution, this can be used to rank pairs of candidate antecedents for an ambiguous pronoun.

   - The model architecture with two linear layers and a ReLU activation allows the network to capture complex relationships and patterns in the input data.

   - The single-dimensional output from the model can be used to compare and rank pairs of mentions, aiding in the decision of whether a pronoun refers to one entity over another.

   - The choice of activation functions and the architecture is common in neural network models for ranking tasks, providing a balance between expressiveness and simplicity.

In summary, `CoRefModel` is a small feed-forward network that maps a feature vector to a single score, which makes it usable for ranking potential antecedents of an ambiguous pronoun. Note that the version defined in the code cells further down adds a sigmoid to the output and is trained as a binary classifier with `nn.BCELoss`.
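
To make the ranking interpretation concrete, the sketch below scores two candidate feature vectors with the two-layer network described above and picks the higher-scoring one. The feature vectors are random placeholders and `input_dim = 8` is an arbitrary assumption. Real inputs would be the features built during preprocessing.

```python
import torch
import torch.nn as nn


class CoRefModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


torch.manual_seed(0)
input_dim = 8  # arbitrary size, for illustration only
model = CoRefModel(input_dim)

# Two placeholder feature vectors: one for (pronoun, A), one for (pronoun, B).
candidates = torch.randn(2, input_dim)

with torch.no_grad():
    scores = model(candidates).squeeze(1)  # shape: (2,)

# The candidate with the higher score is ranked first.
best = ["A", "B"][int(torch.argmax(scores))]
print(scores.shape, best)
```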

# Preprocessing

The preprocessing consists of the following steps:

1. **Loading the Dataset:**
   ```python
   url = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv"
   gap_data = pd.read_csv(url, sep='\t')
   ```
   - The dataset is loaded from the provided URL using `pd.read_csv`. The `sep='\t'` parameter indicates that the data is tab-separated.

2. **Creating Pairs and Labels:**
   ```python
   pairs = []
   labels = []

   for index, row in gap_data.iterrows():
       mention1 = row["Text"]
       mention2 = row["Pronoun"]

       # Label is 1 if the pronoun refers to either candidate (A or B), else 0
       label = 1 if row["A-coref"] or row["B-coref"] else 0

       pairs.append({"mention1": mention1, "mention2": mention2})
       labels.append(label)
   ```
   - For each row in the dataset, `mention1` is set to the full passage (the "Text" column) and `mention2` to the pronoun (the "Pronoun" column). The candidate names A and B are not used as features.

   - The label is assigned based on whether the pronoun refers to entity A or B (1) or neither (0).

   - Pairs of mentions and their corresponding labels are stored in the `pairs` and `labels` lists.

3. **Feature Engineering with TF-IDF:**
   ```python
   vectorizer = TfidfVectorizer()
   features = vectorizer.fit_transform([pair["mention1"] + " " + pair["mention2"] for pair in pairs])
   ```
   - A `TfidfVectorizer` is used to convert pairs of mentions into TF-IDF (Term Frequency-Inverse Document Frequency) vectors.

   - The TF-IDF vectors are computed based on the concatenation of `mention1` and `mention2` for each pair.

4. **Converting to PyTorch Tensors:**
   ```python
   X = torch.tensor(features.toarray(), dtype=torch.float32)
   y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)
   ```
   - The TF-IDF vectors (`features`) are converted to a PyTorch tensor (`X`) with a data type of `torch.float32`.

   - The labels (`labels`) are also converted to a PyTorch tensor (`y`) with the same data type. Additionally, `unsqueeze(1)` turns the 1D tensor into a column vector of shape `(N, 1)`, so that it matches the shape of the model's output when the loss is computed.

The resulting `X` and `y` tensors can then be used to train and evaluate `CoRefModel`.
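
Note that the labeling above records only whether the pronoun resolves to either candidate, not which one. If a per-candidate formulation is wanted, one option is to emit one example per (pronoun, candidate) pair. The sketch below shows that alternative under stated assumptions; it is not the preprocessing used in this notebook. `make_candidate_pairs` is a hypothetical helper and the row is made up.

```python
def make_candidate_pairs(row):
    """Turn one GAP-style row into two (example, label) items, one per candidate."""
    examples = []
    for name in ("A", "B"):
        examples.append((
            {"text": row["Text"], "pronoun": row["Pronoun"], "candidate": row[name]},
            int(row[f"{name}-coref"]),  # 1 if the pronoun refers to this candidate
        ))
    return examples


# Made-up row, for illustration only.
row = {
    "Text": "John met Susan in the park. He said she had a dog.",
    "Pronoun": "He",
    "A": "John", "A-coref": True,
    "B": "Susan", "B-coref": False,
}

for example, label in make_candidate_pairs(row):
    print(example["candidate"], label)
```

With this layout, a row where both labels are false simply yields two negative examples, so the "neither" case needs no separate class.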

In [74]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [75]:
class CoRefModel(nn.Module):
    def __init__(self, input_dim):
        super(CoRefModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x
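
Compared with the version described earlier, this definition adds a sigmoid on the output, so the model returns a value in (0, 1) that can be read as a probability and passed to `nn.BCELoss` (used in the training cell). Below is a plain-Python sketch of the two formulas involved, for a single example. The function names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p: float, y: float) -> float:
    """Loss for one example with predicted probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


p = sigmoid(0.0)                     # a raw score of 0 maps to probability 0.5
loss = binary_cross_entropy(p, 1.0)  # equals ln(2) when p = 0.5
print(p, loss)
```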

In [None]:
# Load GAP dataset from the URL
url = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv"
gap_data = pd.read_csv(url, sep='\t')

# Preprocess data: create pairs of mentions and labels
pairs = []
labels = []

for index, row in gap_data.iterrows():
    mention1 = row["Text"]
    mention2 = row["Pronoun"]

    # Label is 1 if the pronoun refers to either candidate (A or B), else 0
    label = 1 if row["A-coref"] or row["B-coref"] else 0

    pairs.append({"mention1": mention1, "mention2": mention2})
    labels.append(label)


In [None]:
print(f"Examples:\n{pairs[:5]}")
print(len(pairs))

Examples:
[{'mention1': "Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.", 'mention2': 'her'}, {'mention1': 'He grew up in Evanston, Illinois the second oldest of five children including his brothers, Fred and Gordon and sisters, Marge (Peppy) and Marilyn. His high school days were spent at New Trier High School in Winnetka, Illinois. MacKenzie studied with Bernard Leach from 1949 to 1952. His simple, wheel-thrown functional pottery is heavily influenced by the oriental aesthetic of Shoji Hamada and Kanjiro Kawai.', 'mention2': 'His'}, {'mention1': "He had been reelected to Congress, but resigned in 1990 to accept

## Featurization

In [None]:

# Feature engineering: use TF-IDF vectors as features
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform([pair["mention1"] + " " + pair["mention2"] for pair in pairs])

In [None]:
features.shape

(2000, 20664)
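
Each of the 20,664 columns corresponds to one vocabulary term. As a rough illustration of how a single weight is computed, the sketch below reimplements TF-IDF in plain Python using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents the `TfidfVectorizer` defaults. The whitespace tokenizer and the two toy documents are simplifying assumptions, since the real vectorizer applies its own tokenization rules.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["john met susan he", "alice met bob she"]  # toy documents
vectors = tfidf(docs)

# "met" occurs in both documents, so it is weighted lower than "john".
print(vectors[0]["met"] < vectors[0]["john"])  # True
```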

## Convert to tensors

In [None]:
# Convert features and labels to PyTorch tensors
X = torch.tensor(features.toarray(), dtype=torch.float32)
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)  # shape (N, 1), matching the model's output shape

## Train/test split

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f"+ve Examples: {y_train.sum()/y_train.shape[0]}")
print(f"input size: {X_train.shape[1]}")

+ve Examples: 0.8974999785423279
input size: 20664


## Data loader

In [76]:
import torch
from torch.utils.data import Dataset, DataLoader

# Define a custom dataset class
class CoRefModeltDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]


In [78]:
# Create a dataset instance
gap_dataset = CoRefModeltDataset(features=X_train, labels=y_train)


## Model training

In [72]:

# Create a DataLoader
batch_size = 32  # Choose an appropriate batch size
train_loader = DataLoader(dataset=gap_dataset, batch_size=batch_size, shuffle=True)

input_dim = X_train.shape[1]
# Instantiate your CoRefModel, optimizer, and loss function
model = CoRefModel(input_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

# Set the number of training epochs
num_epochs = 10

# Training loop with DataLoader
for epoch in range(num_epochs):
    # Set the model in training mode
    model.train()

    # Iterate through batches in the DataLoader
    for batch_inputs, batch_labels in train_loader:
        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        batch_outputs = model(batch_inputs)
        batch_loss = criterion(batch_outputs, batch_labels)

        # Backward pass and optimization
        batch_loss.backward()
        optimizer.step()

    # Print the loss of the last batch in this epoch (not an average over the epoch)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {batch_loss.item()}')

# Training complete
print('Training finished.')


Epoch [1/10], Loss: 0.4676911234855652
Epoch [2/10], Loss: 0.2628365755081177
Epoch [3/10], Loss: 0.15499475598335266
Epoch [4/10], Loss: 0.33937108516693115
Epoch [5/10], Loss: 0.16770993173122406
Epoch [6/10], Loss: 0.0981973186135292
Epoch [7/10], Loss: 0.041659265756607056
Epoch [8/10], Loss: 0.12453542649745941
Epoch [9/10], Loss: 0.029852237552404404
Epoch [10/10], Loss: 0.024321584030985832
Training finished.
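
The loss printed above is that of the last mini-batch in each epoch, which is one reason it does not decrease monotonically (for example, epoch 4 is higher than epoch 3). If a smoother signal is wanted, the per-batch losses can be averaged over the epoch, weighted by batch size. Below is a minimal sketch with a hypothetical `EpochLossMeter` helper. The loss values are made up.

```python
class EpochLossMeter:
    """Accumulates a sample-weighted average of per-batch losses."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_loss: float, batch_size: int) -> None:
        # A mean-reduced batch loss is multiplied back by the batch size
        # so that a smaller final batch is not over-weighted.
        self.total += batch_loss * batch_size
        self.count += batch_size

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


meter = EpochLossMeter()
for loss, size in [(0.50, 32), (0.30, 32), (0.10, 16)]:  # made-up batch losses
    meter.update(loss, size)

print(meter.average)
```

In the training loop this would be called as `meter.update(batch_loss.item(), batch_inputs.size(0))` after each optimizer step, with a fresh meter created at the start of every epoch.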


## Evaluation

In [73]:

from sklearn.metrics import precision_score

# Set a threshold for binary classification
threshold = 0.5

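# Evaluate the model on the test set
model.eval()  # switch to inference mode (good practice, although this model has no dropout or batch norm)
with torch.no_grad():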
    test_outputs = model(X_test)
    test_predictions = (test_outputs > threshold).float()

    # Convert to numpy arrays for Scikit-learn compatibility
    y_test_np = y_test.numpy() if isinstance(y_test, torch.Tensor) else y_test
    test_predictions_np = test_predictions.numpy()

    # Calculate precision using Scikit-learn
    precision = precision_score(y_test_np, test_predictions_np)
    print("Precision:", precision)


Precision: 0.9120603015075377
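
A precision of about 0.91 is easier to interpret next to a trivial baseline. Since roughly 90% of the training labels are positive (see the class balance printed earlier), always predicting 1 could be expected to reach a similar precision. The sketch below illustrates this on made-up labels with a 9:1 ratio and also reports recall, which the evaluation cell does not compute. `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1] * 9 + [0]      # made-up labels, 90% positive
always_positive = [1] * 10  # baseline: predict 1 for every example

precision, recall = precision_recall(y_true, always_positive)
print(precision, recall)    # 0.9 1.0
```

Comparing the trained model against such a baseline, and on more than one metric, gives a clearer picture of what the features actually contribute.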
