# Problem Statement
**The goal is to automatically tag and analyze tweets for entities like people, companies, locations, and more, without relying on hashtags. Named Entity Recognition (NER) will identify and classify these entities. This project could enhance social media analysis, targeted marketing, and trend tracking.**

# Applications
- **Social Media Monitoring: Analyze trending topics and public sentiment.**
- **Targeted Advertising: Identify key topics for better ad targeting.**
- **Trend Detection: Recognize shifts in interest around entities, like companies or locations.**

# Import Libraries and Download Data

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import BertTokenizer, TFBertForTokenClassification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import gdown

# Download dataset from Google Drive folder link
url = 'https://drive.google.com/drive/folders/14IgdWzzpjp166rhNhp9UFenjUp_czaGo?usp=share_link'
gdown.download_folder(url, quiet=True)

['/content/Datasets/wnut 16.txt.conll',
 '/content/Datasets/wnut 16test.txt.conll']

# Data Loading and Exploration
**Load the CoNLL-format data files. Each non-empty line holds one token and its tag, and an empty line marks the end of a sentence (tweet).**

In [None]:
def load_data(file_path):
    sentences = []
    labels = []
    sentence = []
    label = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip():  # Non-empty line: token followed by its tag
                word, tag = line.strip().rsplit(maxsplit=1)
                sentence.append(word)
                label.append(tag)
            elif sentence:  # Empty line indicates end of a sentence
                sentences.append(sentence)
                labels.append(label)
                sentence = []
                label = []
    # Append the last sentence if the file doesn’t end with a blank line
    if sentence:
        sentences.append(sentence)
        labels.append(label)
    return sentences, labels

# gdown saves the files inside the downloaded 'Datasets' folder
train_file_path = '/content/Datasets/wnut 16.txt.conll'
test_file_path = '/content/Datasets/wnut 16test.txt.conll'
train_sentences, train_labels = load_data(train_file_path)
test_sentences, test_labels = load_data(test_file_path)


# Exploratory Data Analysis (EDA)

**1. Checking the Structure of the Data**

Print a sample sentence and its labels to understand the structure.

In [None]:
print("Sample sentence:", train_sentences[0])
print("Sample labels:", train_labels[0])

Sample sentence: ['@SammieLynnsMom', '@tg10781', 'they', 'will', 'be', 'all', 'done', 'by', 'Sunday', 'trust', 'me', '*wink*']
Sample labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [None]:
# Check unique labels
unique_labels = set(label for labels in train_labels for label in labels)
print("Unique labels:", unique_labels)

Unique labels: {'I-facility', 'I-person', 'B-sportsteam', 'B-other', 'B-product', 'B-facility', 'B-geo-loc', 'O', 'B-musicartist', 'I-movie', 'I-musicartist', 'I-company', 'I-product', 'I-geo-loc', 'B-tvshow', 'B-company', 'B-movie', 'B-person', 'I-tvshow', 'I-sportsteam', 'I-other'}


**2. Summary Statistics**

Calculate basic statistics like the average sentence length and label distribution.

In [None]:
sentence_lengths = [len(sentence) for sentence in train_sentences]
print("Average sentence length:", np.mean(sentence_lengths))
print("Label distribution:", pd.Series([lbl for labels in train_labels for lbl in labels]).value_counts())


Average sentence length: 19.41060985797828
Label distribution: O                44007
B-person           449
I-other            320
B-geo-loc          276
B-other            225
I-person           215
B-company          171
I-facility         105
B-facility         104
B-product           97
I-product           80
I-musicartist       61
B-musicartist       55
B-sportsteam        51
I-geo-loc           49
I-movie             46
I-company           36
B-movie             34
B-tvshow            34
I-tvshow            31
I-sportsteam        23
Name: count, dtype: int64
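
The distribution above is heavily skewed toward `O`. The following is a minimal sketch for quantifying that skew and counting entities per type. The helper name `summarize_labels` and the toy input are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def summarize_labels(label_lists):
    """Return the share of 'O' tokens and the number of entities per type.

    An entity is counted once per B- tag; I- tags only continue an entity.
    """
    counts = Counter(tag for tags in label_lists for tag in tags)
    total = sum(counts.values())
    o_share = counts.get('O', 0) / total if total else 0.0
    entities = Counter(
        tag[2:] for tags in label_lists for tag in tags if tag.startswith('B-')
    )
    return o_share, entities

# Toy input standing in for train_labels (hypothetical data)
toy_labels = [['O', 'B-person', 'I-person', 'O'], ['B-geo-loc', 'O', 'O', 'O']]
o_share, entities = summarize_labels(toy_labels)
print(f"Share of 'O' tokens: {o_share:.2%}")
print("Entities per type:", dict(entities))
```

On the training split shown above, 44,007 of the 46,469 tokens are labeled `O`, which is roughly 94.7%.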


# Data Preprocessing


In [None]:
# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
# Convert labels to integers (sorted so the mapping is the same in every session)
label_map = {label: i for i, label in enumerate(sorted(unique_labels))}
num_labels = len(unique_labels)


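def encode_sentences(sentences, labels, tokenizer, max_length=128):
    input_ids = []
    attention_masks = []
    label_ids = []
    # [CLS], [SEP], and padding positions are labeled 'O'
    pad_label = label_map['O']

    for sent, label in zip(sentences, labels):
        # Tokenize word by word so each subword piece can be given a label
        tokens = []
        token_labels = []
        for word, tag in zip(sent, label):
            pieces = tokenizer.tokenize(word) or [tokenizer.unk_token]
            # The first piece keeps the word's tag; later pieces of a B- word become I-
            inside_tag = 'I-' + tag[2:] if tag.startswith('B-') else tag
            inside_id = label_map.get(inside_tag, label_map[tag])
            tokens.extend(pieces)
            token_labels.extend([label_map[tag]] + [inside_id] * (len(pieces) - 1))

        # Truncate, leaving room for [CLS] and [SEP]
        tokens = [tokenizer.cls_token] + tokens[:max_length - 2] + [tokenizer.sep_token]
        token_labels = [pad_label] + token_labels[:max_length - 2] + [pad_label]

        # Pad ids, attention mask, and labels to max_length
        pad_len = max_length - len(tokens)
        input_ids.append(tokenizer.convert_tokens_to_ids(tokens) + [tokenizer.pad_token_id] * pad_len)
        attention_masks.append([1] * len(tokens) + [0] * pad_len)
        label_ids.append(token_labels + [pad_label] * pad_len)

    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(label_ids)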



In [None]:
train_inputs, train_masks, train_labels = encode_sentences(train_sentences, train_labels, tokenizer)
test_inputs, test_masks, test_labels = encode_sentences(test_sentences, test_labels, tokenizer)
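
BERT's WordPiece tokenizer can split one word into several pieces, so word-level tags have to be expanded to match the token sequence. The sketch below illustrates one common convention under stated assumptions: the first piece keeps the word's tag, later pieces of a `B-` word become `I-`, and special or padding positions get `O`. The names `align_labels` and `toy_split` are hypothetical; `toy_split` is a stand-in splitter so the example runs without downloading a model.

```python
def align_labels(words, tags, split_word, max_length, pad_tag='O'):
    """Expand word-level BIO tags to subword level and pad to max_length."""
    pieces, piece_tags = [], []
    for word, tag in zip(words, tags):
        word_pieces = split_word(word) or ['[UNK]']
        # Later pieces of a B- word continue the entity, so they become I-
        continuation = 'I-' + tag[2:] if tag.startswith('B-') else tag
        pieces.extend(word_pieces)
        piece_tags.extend([tag] + [continuation] * (len(word_pieces) - 1))
    # Reserve two positions for [CLS] and [SEP]
    pieces = ['[CLS]'] + pieces[:max_length - 2] + ['[SEP]']
    piece_tags = [pad_tag] + piece_tags[:max_length - 2] + [pad_tag]
    mask = [1] * len(pieces)
    pad = max_length - len(pieces)
    return pieces + ['[PAD]'] * pad, piece_tags + [pad_tag] * pad, mask + [0] * pad

def toy_split(word, size=4):
    # Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks
    return [word[i:i + size] if i == 0 else '##' + word[i:i + size]
            for i in range(0, len(word), size)]

pieces, tags, mask = align_labels(
    ['Visiting', 'Orleans'], ['O', 'B-geo-loc'], toy_split, max_length=8
)
for piece, tag, m in zip(pieces, tags, mask):
    print(f"{piece:10s} {tag:10s} {m}")
```

With the real tokenizer, `tokenizer.tokenize` could be passed as `split_word`. An alternative convention labels only the first piece of each word and excludes the remaining pieces from the loss.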

In [None]:
# Initialize model
model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
import tensorflow as tf

# Select the GPU if one is available
device = "/gpu:0" if tf.config.list_physical_devices('GPU') else "/cpu:0"


In [None]:
# Compile model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
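# Convert the PyTorch tensors to TensorFlow tensors and pass the attention masks with the input ids
train_inputs = {'input_ids': tf.convert_to_tensor(train_inputs), 'attention_mask': tf.convert_to_tensor(train_masks)}
train_labels = tf.convert_to_tensor(train_labels)
test_inputs = {'input_ids': tf.convert_to_tensor(test_inputs), 'attention_mask': tf.convert_to_tensor(test_masks)}
test_labels = tf.convert_to_tensor(test_labels)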


In [None]:
# Train model
model.fit(train_inputs, train_labels, epochs=5, batch_size=32, validation_data=(test_inputs, test_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tf_keras.src.callbacks.History at 0x7d8bb65de560>

In [None]:
# Evaluate model
test_loss, test_acc = model.evaluate(test_inputs, test_labels)
print(f'Test loss: {test_loss:.3f}, Test accuracy: {test_acc:.3f}')

Test loss: 0.660, Test accuracy: 0.906
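
With `metrics=['accuracy']`, Keras averages over every position of each padded sequence, so padding positions count toward the score. The sketch below, using NumPy and toy arrays, shows how token accuracy restricted to real tokens can differ from the plain average. The function name `masked_accuracy` and the toy values are assumptions for illustration.

```python
import numpy as np

def masked_accuracy(true_ids, pred_ids, attention_mask):
    """Token accuracy over positions where attention_mask is 1."""
    true_ids, pred_ids = np.asarray(true_ids), np.asarray(pred_ids)
    keep = np.asarray(attention_mask).astype(bool)
    if not keep.any():
        return 0.0
    return float((true_ids[keep] == pred_ids[keep]).mean())

# Toy batch: 2 sequences of length 6; label id 0 doubles as the padding label
true_ids = [[3, 1, 3, 0, 0, 0], [2, 3, 3, 3, 0, 0]]
pred_ids = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]  # a model that always predicts 0
mask     = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0]]

# Plain accuracy: 5 of 12 positions match, all of them padding
plain = float((np.asarray(true_ids) == np.asarray(pred_ids)).mean())
masked = masked_accuracy(true_ids, pred_ids, mask)
print("Accuracy over all positions:", plain)
print("Accuracy over real tokens:  ", masked)
```

Even the masked figure stays optimistic for this dataset, because most real tokens are labeled `O`.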


In [None]:
# Make predictions
predictions = model.predict(test_inputs)
# Take the most likely label id for each token (last axis holds the label scores)
predicted_labels = np.argmax(predictions.logits, axis=-1)





In [None]:
# Convert predicted label ids back to the original label strings
label_names = tf.constant(list(label_map.keys()))
predicted_label = tf.gather(label_names, predicted_labels)
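
As a framework-independent illustration of the decoding step, here is a minimal NumPy sketch that turns per-token scores into tag strings and drops padded positions. The function name `decode_predictions`, the three-label mapping, and the scores are toy assumptions.

```python
import numpy as np

def decode_predictions(logits, attention_mask, id_to_label):
    """Turn (batch, seq_len, num_labels) scores into lists of tag strings."""
    pred_ids = np.argmax(logits, axis=-1)  # best label id per token
    decoded = []
    for ids, mask in zip(pred_ids, np.asarray(attention_mask)):
        # Keep only positions that hold real tokens
        decoded.append([id_to_label[int(i)] for i, m in zip(ids, mask) if m == 1])
    return decoded

toy_label_map = {'B-person': 0, 'I-person': 1, 'O': 2}  # toy mapping
id_to_label = {i: label for label, i in toy_label_map.items()}
logits = np.array([[[0.1, 0.2, 2.0],
                    [3.0, 0.1, 0.2],
                    [0.3, 2.5, 0.1],
                    [0.0, 0.0, 0.0]]])  # last position is padding
mask = [[1, 1, 1, 0]]
decoded = decode_predictions(logits, mask, id_to_label)
print(decoded)
```

In a real run the `[CLS]` and `[SEP]` positions also carry predictions and would usually be dropped in the same way.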

In [None]:
# Print a few test sentences with their encoded label ids
for i in range(min(5, len(test_labels))):
    print("Sentence:", test_sentences[i])
    print("Actual label:", test_labels[i].numpy())

Sentence: ['New', 'Orleans', 'Mother', "'s", 'Day', 'Parade', 'shooting', '.', 'One', 'of', 'the', 'people', 'hurt', 'was', 'a', '10-year-old', 'girl', '.', 'WHAT', 'THE', 'HELL', 'IS', 'WRONG', 'WITH', 'PEOPLE', '?']
Actual label: [7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Sentence: ['RT', '@hxranspizza', ':', 'Going', 'into', 'school', 'tomorrow', 'like', '#KCA', '#Vote1DUK', 'http://t.co/vvkoEEMjMX']
Actual label: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

# Insights and Recommendations

**Data Quality and Quantity:**

- Data Quality: Ensure data is clean, accurate, and relevant to the NER task.
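- Data Quantity: Sufficient training data is crucial for model performance. Several entity types have fewer than 60 training examples (for instance, `B-movie` and `B-tvshow` appear 34 times each), so consider data augmentation techniques if needed.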

**Model Architecture:**

- BERT-Based Models: Utilize pre-trained BERT models for strong performance.
- Fine-tuning: Fine-tune the pre-trained model on the specific NER task.
- Experimentation: Try different model architectures and hyperparameters to optimize results.

**Training and Evaluation:**

- Hyperparameter Tuning: Experiment with learning rate, batch size, and other hyperparameters.
- Early Stopping: Implement early stopping to prevent overfitting.
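- Evaluation Metrics: Report entity-level precision, recall, and F1-score alongside accuracy. About 95% of the training tokens are labeled `O`, so token accuracy alone says little about how well entities are recognized.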
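
To make the evaluation-metrics point concrete, here is a minimal pure-Python sketch of entity-level precision, recall, and F1 over BIO tag sequences. An entity counts as correct only when its span and type both match. The function names and toy sequences are illustrative assumptions; a dedicated library such as `seqeval` may be preferable in practice.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from one BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing span
        inside = tag.startswith('I-') and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        # A stray I- tag with no open span is treated as the start of an entity
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, etype = i, tag[2:]
    return spans

def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_seqs = [['B-geo-loc', 'I-geo-loc', 'O', 'B-person']]
partial   = [['B-geo-loc', 'I-geo-loc', 'O', 'O']]  # misses the person
all_o     = [['O', 'O', 'O', 'O']]                  # predicts no entities

print("Partial prediction:", entity_prf(true_seqs, partial))
print("All-'O' prediction:", entity_prf(true_seqs, all_o))
```

The all-`O` prediction gets half of the tokens right in this toy example yet scores zero on every entity-level metric, which is the gap that token accuracy hides.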

**Deployment and Inference:**

- Model Serving: Deploy the model using a framework like TensorFlow Serving or TorchServe.
- Batch Processing: Process multiple tweets at once for efficiency.
- Real-time Inference: Consider using a streaming framework like Kafka for real-time processing.
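
The batch-processing idea can be sketched as follows. This is a minimal illustration, assuming a `predict_batch` callable that maps a list of tweets to a list of tag sequences; `fake_predict` is a hypothetical stand-in for tokenization plus the model call.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def tag_tweets(tweets, predict_batch, batch_size=32):
    """Run predict_batch over the tweets in fixed-size batches."""
    results = []
    for batch in batched(tweets, batch_size):
        results.extend(predict_batch(batch))
    return results

def fake_predict(batch):
    # Hypothetical stand-in: tags every whitespace-separated token as 'O'
    return [['O'] * len(tweet.split()) for tweet in batch]

tweets = ["New Orleans parade", "RT hello", "one more tweet here", "last"]
tagged = tag_tweets(tweets, fake_predict, batch_size=3)
print(tagged)
```

Batching keeps the number of model calls low while bounding the memory needed per call.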

**Ethical Considerations:**

- Bias and Fairness: Ensure the model is fair and unbiased, especially for sensitive topics.
- Privacy: Protect user privacy and data security.

**Future Directions:**

- Contextual Understanding: Explore models that can capture deeper contextual information.
- Multi-lingual NER: Develop models that can handle multiple languages.
- Domain-Specific NER: Fine-tune models for specific domains like finance or healthcare.