### **Lab Activity: Predicting Nationality from Names using Machine Learning**

#### **1. Data**

For this lab activity, we will use a dataset containing names from various nationalities. The dataset is organized into different text files, each corresponding to a specific nationality. For example, you might have files named `English.txt`, `French.txt`, etc. Each file contains a list of names from that nationality.

#### **2. Representation**

To train our machine learning model, we need to convert the names into a numerical format. The key steps are:
- **Character-to-Index Conversion**: Each character in a name is mapped to an integer index in a predefined vocabulary; characters outside the vocabulary map to an `<UNK>` index.
- **Padding**: Shorter names are extended with a `<PAD>` index up to the length of the longest name, so every example has the same shape.
- **Character Embeddings**: An embedding layer turns each index into a dense vector that is learned during training, so the model builds its own representation of each character instead of treating the indices as arbitrary integers.
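
The first two steps can be sketched in plain Python. This is a minimal illustration, not the lab's exact code: the names `VOCAB`, `CHAR_TO_INDEX`, and `encode` are hypothetical, and the vocabulary is an assumed one (ASCII letters, a few punctuation marks, and the two special tokens). The embedding lookup is left out because it needs a trained layer.

```python
import string

# Assumed vocabulary for illustration: single characters plus two special tokens.
VOCAB = list(string.ascii_letters + " .,;'-") + ["<PAD>", "<UNK>"]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}
PAD = CHAR_TO_INDEX["<PAD>"]
UNK = CHAR_TO_INDEX["<UNK>"]

def encode(name, max_len):
    """Map each character to its index, then pad (or truncate) to max_len."""
    indices = [CHAR_TO_INDEX.get(ch, UNK) for ch in name[:max_len]]
    return indices + [PAD] * (max_len - len(indices))

print(encode("Ali", 6))        # [26, 11, 8, 58, 58, 58]
print(encode("Zo\u00eb", 4))   # the accented letter is not in the vocabulary, so it maps to <UNK>
```

Every encoded name has the same length, which is what allows a batch of names to be stacked into one tensor.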

#### **3. Prediction Model**

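We will use a small feed-forward network for predicting the nationality of a given name. The lab calls it a Multi-Layer Perceptron (MLP), although the version built below has no hidden layer: it is an embedding followed by a single linear layer. Here’s a brief overview of the model architecture:
- **Embedding Layer**: Transforms character indices into dense vectors.
- **Fully Connected Layer**: Flattens the embeddings of all character positions into one vector and passes it through a linear layer.
- **Output Layer**: The linear layer produces one raw score (logit) per nationality; applying softmax turns these scores into probabilities.

The model is trained with cross-entropy loss (`nn.CrossEntropyLoss`, which applies the softmax internally) and optimized with the Adam optimizer. After training, the model can predict the nationality of new names based on the learned patterns.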
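
To make the link between scores, probabilities, and the loss concrete, here is a small worked example in plain Python. The scores are made up for a 3-class problem, and the helper names `softmax` and `cross_entropy` are illustrative; PyTorch's own loss does the equivalent computation on tensors.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

# Made-up scores; class 0 is taken to be the true label.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])        # [0.659, 0.242, 0.099]
print(round(cross_entropy(logits, 0), 3))  # 0.417
```

The loss shrinks toward 0 as the probability of the correct class approaches 1, which is what training pushes the model toward.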


In [1]:
import glob
import math
import random
import string
import sys
import time
import unicodedata

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import requests
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

## Data Prep

In [13]:
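# Vocabulary: single characters plus two special tokens. Keeping it as a list
# (not one concatenated string) makes '<PAD>' and '<UNK>' single entries.
all_letters = list(string.ascii_letters + " .,;'-") + ['<PAD>', '<UNK>']
n_letters = len(all_letters)
letter_to_index = {letter: i for i, letter in enumerate(all_letters)}

def letterToIndex(letter):
    # Characters outside the vocabulary map to the '<UNK>' index
    return letter_to_index.get(letter, letter_to_index['<UNK>'])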

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

# Turn a line into a <line_length x 1 x n_letters> tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

# GitHub contents-API URL for the names directory, and the commit to read it at
github_url = 'https://api.github.com/repos/DrUzair/NLP/contents/data/names'
branch = 'f8e0c40481b1c1e32440b1da39c8bdfc9f070ffa'

# Initialize dictionaries to store data
all_categories = []
category_lines = {}

# Make a request to the GitHub API to get the list of files in the directory
response = requests.get(f'{github_url}?ref={branch}')

if response.status_code == 200:
    file_data = response.json()

    for file_info in file_data:
        if file_info['type'] == 'file' and file_info['name'].endswith('.txt'):
            file_url = file_info['download_url']
            category = file_info['name'].split('.')[0]

            # Make a request to download the file
            file_response = requests.get(file_url)

            if file_response.status_code == 200:
                # Strip the trailing newline so no empty name is stored
                lines = file_response.text.strip().split('\n')
                category_lines[category] = lines
                # Register the category only once its names are loaded
                all_categories.append(category)
            else:
                print(f"Failed to download file: {file_info['name']}")
else:
    print(f"Failed to retrieve file list from GitHub: {response.status_code}")

n_categories = len(all_categories)
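
The `unicodeToAscii` helper above relies on Unicode NFD normalization, which splits an accented letter into its base letter plus a combining mark (category `Mn`) that can then be dropped. The standalone sketch below shows the effect on two sample names; `ALLOWED` and `to_ascii` are illustrative names, and the allowed character set is an assumption.

```python
import string
import unicodedata

# Assumed set of characters to keep (illustrative).
ALLOWED = set(string.ascii_letters + " .,;'-")

def to_ascii(s):
    """Decompose accented characters, drop combining marks, keep allowed characters."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in ALLOWED
    )

print(to_ascii("\u015alus\u00e0rski"))  # Slusarski
print(to_ascii("O'N\u00e9\u00e0l"))     # O'Neal
```

Letters that have no decomposition into a base letter plus a mark are simply dropped by the final membership check.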

In [None]:
print(all_categories)
print(len(all_categories))

['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese']
18
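
With 18 categories that may differ a lot in size, it helps to know how well a trivial classifier would do before reading any accuracy figure. The sketch below computes the accuracy of always predicting the most frequent label. The labels are toy values, not the real dataset, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only.
toy_labels = ["Russian"] * 5 + ["English"] * 3 + ["Arabic"] * 2
print(majority_baseline(toy_labels))  # ('Russian', 0.5)
```

On the real data, the label list could be built by repeating each category name once per entry in `category_lines`.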


### NameDataset

In [14]:
# Dataset that yields (padded character-index tensor, category-index tensor) pairs
class NameDataset(Dataset):
    def __init__(self, category_lines, all_categories):
        self.category_lines = category_lines
        self.all_categories = all_categories
        self.data = []
        self.labels = []
        self.max_len = max([len(line) for lines in category_lines.values() for line in lines])
        self.prepare_data()

    def prepare_data(self):
        for category in self.all_categories:
            for line in self.category_lines[category]:
                self.data.append(line)
                self.labels.append(self.all_categories.index(category))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        name = self.data[idx]
        label = self.labels[idx]
        name_idx = [letterToIndex(char) for char in name]
        name_idx = name_idx + [letterToIndex('<PAD>')] * (self.max_len - len(name_idx))
        return torch.tensor(name_idx), torch.tensor(label)


In [16]:
# Inspect the DataLoader
def inspect_dataloader(dataloader):
    for i, (names, labels) in enumerate(dataloader):
        print(f"Batch {i+1}")
        print("Names (indices):", names)
        print("Labels:", labels)
        if i >= 2:  # Inspect the first 3 batches only
            break

In [24]:
# Create an instance of the DataLoader
dataset = NameDataset(category_lines, all_categories)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

def inspect_dataloader(dataloader):
    for i, (names, labels) in enumerate(dataloader):
        print(f"Batch {i+1}")
        print("Names (indices):", names)
        print("Labels:", labels)
        print("Names Shape:", names.shape)  # Print shape of names tensor
        # Check if any index is out of range
        if torch.any(names >= n_letters) or torch.any(names < 0):
            print("Error: Found indices out of range")
            print(names)
            break
        if i >= 2:  # Inspect the first 3 batches only
            break

inspect_dataloader(dataloader)


Batch 1
Names (indices): tensor([[27,  8, 17, 12,  0, 13, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [32, 17,  8,  6, 14, 17, 14, 21, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [38,  2, 19,  0,  6,  6,  0, 17, 19, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [26,  6, 17,  0, 13, 14,  5,  5, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [35, 20, 10, 14, 21,  8, 13, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [45, 14, 19,  0,  7, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [29, 17,  8,  5,  5,  8,  4, 11,  3, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [44,  4,  6, 17,  4, 19,  8, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [37, 14, 17,  4, 13, 19, 25, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
         58, 58],
        [32, 17, 14, 18, 18, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58,
   

## Model

In [18]:
class NameClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, max_len):
        super(NameClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim * max_len, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
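
To see what the `forward` pass does to tensor shapes, the sketch below pushes a batch of random indices through the same two layers. All sizes are illustrative values chosen for this example, not values read from the dataset.

```python
import torch
from torch import nn

# Illustrative sizes, chosen only for this sketch.
vocab_size, embed_dim, max_len, num_classes = 60, 16, 20, 18
batch_size = 4

embedding = nn.Embedding(vocab_size, embed_dim)
fc = nn.Linear(embed_dim * max_len, num_classes)

x = torch.randint(0, vocab_size, (batch_size, max_len))  # random character indices
embedded = embedding(x)                # shape (4, 20, 16): one vector per character
flat = embedded.view(batch_size, -1)   # shape (4, 320): all positions concatenated
logits = fc(flat)                      # shape (4, 18): one score per class

print(embedded.shape, flat.shape, logits.shape)

# Parameter count: 60*16 for the embedding, 320*18 weights + 18 biases for the linear layer
n_params = sum(p.numel() for p in list(embedding.parameters()) + list(fc.parameters()))
print(n_params)  # 6738
```

Because the embeddings are flattened by position, the linear layer learns a separate weight for each character slot, which is why every name must be padded to the same `max_len`.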


In [19]:
def train_model(model, dataloader, criterion, optimizer, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        for names, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(names)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {running_loss/len(dataloader)}')


In [27]:
dataset = NameDataset(category_lines, all_categories)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)


max_len = dataset.max_len  # Define max_len based on the dataset
vocab_size = len(all_letters)
embed_dim = 16
num_classes = n_categories

model = NameClassifier(vocab_size, embed_dim, num_classes, max_len)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 20

train_model(model, dataloader, criterion, optimizer, num_epochs)


Epoch 1, Loss: 1.424667982633706
Epoch 2, Loss: 1.1186335546195887
Epoch 3, Loss: 1.017360669270063
Epoch 4, Loss: 0.9556197317637456
Epoch 5, Loss: 0.9128345596562525
Epoch 6, Loss: 0.8837758338280545
Epoch 7, Loss: 0.8586434003938536
Epoch 8, Loss: 0.8405353678924263
Epoch 9, Loss: 0.8273091610687174
Epoch 10, Loss: 0.8143955353813567
Epoch 11, Loss: 0.8036792738612291
Epoch 12, Loss: 0.7961930720621992
Epoch 13, Loss: 0.7893191932872602
Epoch 14, Loss: 0.782509991080518
Epoch 15, Loss: 0.7789353146010144
Epoch 16, Loss: 0.775368066967293
Epoch 17, Loss: 0.7708929920462286
Epoch 18, Loss: 0.765097949630136
Epoch 19, Loss: 0.7606734704155071
Epoch 20, Loss: 0.7583556070354334
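
A useful reference point for these numbers is the loss of a model that assigns equal probability to every class. For the 18 categories printed earlier, that value can be computed directly:

```python
import math

num_classes = 18                           # number of categories printed earlier
uniform_loss = -math.log(1 / num_classes)  # cross-entropy of a uniform guess
print(round(uniform_loss, 3))              # 2.89
```

The first-epoch average of about 1.42 is already well below this value, since the average is taken while the model is learning during the epoch.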


## Evaluation

In [29]:
import torch

def evaluate_model(model, dataloader, criterion):
    model.eval()  # Set the model to evaluation mode
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient calculation
        for names, labels in dataloader:
            outputs = model(names)
            loss = criterion(outputs, labels)
            total_loss += loss.item()

            # Calculate accuracy
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    avg_loss = total_loss / len(dataloader)
    accuracy = correct / total
    print(f'Evaluation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}')

    return avg_loss, accuracy

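# Example usage. Note that this evaluates on the same names the model was trained on,
# so the reported accuracy is a training accuracy, not a measure of generalization.
dataset = NameDataset(category_lines, all_categories)
dataloader = DataLoader(dataset, batch_size=32, shuffle=False)  # No shuffling needed for evaluation

criterion = nn.CrossEntropyLoss()
evaluate_model(model, dataloader, criterion)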


Evaluation Loss: 0.7389, Accuracy: 0.7610


(0.7389168261437659, 0.7609994027473621)
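
One way to measure performance on names the model has not seen is to hold out part of the data before training. The sketch below uses `torch.utils.data.random_split` on a stand-in dataset of random tensors; the 80/20 ratio, the seed, and the fake data are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 100 fake "names" of 20 indices each, with labels in [0, 18).
fake_names = torch.randint(0, 60, (100, 20))
fake_labels = torch.randint(0, 18, (100,))
full_dataset = TensorDataset(fake_names, fake_labels)

# 80/20 split with a fixed seed so the split is reproducible.
n_train = int(0.8 * len(full_dataset))
n_val = len(full_dataset) - n_train
train_set, val_set = random_split(
    full_dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
print(len(train_set), len(val_set))  # 80 20
```

In this notebook, a `NameDataset` instance would take the place of the stand-in dataset: training would use `train_loader`, and `evaluate_model` would be called with `val_loader`.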

## Predict

In [31]:
import torch

# Assuming 'model' is your trained model and 'all_categories' is your list of categories
def evaluate(line_tensor):
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # Disable gradient calculation
        output = model(line_tensor)
    return output

def predict(name, n_predictions=3):
    # Convert the name to indices, truncating names longer than max_len
    name_idx = [letterToIndex(char) for char in name[:max_len]]
    name_idx = name_idx + [letterToIndex('<PAD>')] * (max_len - len(name_idx))  # Pad the name if necessary
    name_tensor = torch.tensor(name_idx, dtype=torch.long).unsqueeze(0)  # Add batch dimension

    # Evaluate the model; the outputs are raw scores (logits), not probabilities
    output = evaluate(name_tensor)

    # Get top N categories
    topv, topi = output.topk(n_predictions, dim=1)
    predictions = []

    for i in range(n_predictions):
        value = topv[0][i].item()  # Convert tensor to scalar
        category_index = topi[0][i].item()  # Convert tensor to scalar
        predictions.append((value, all_categories[category_index]))

    return predictions

# Example usage
name = 'ahmad'
predictions = predict(name)
for value, category in predictions:
    print(f'({value:.2f}) {category}')


(3.15) Arabic
(2.06) English
(1.82) Russian
