## City Names Classification

In this project, I trained a model to classify French cities by region, processing the letters of each city name sequentially.

The first objective of this project was to get familiar with the PyTorch modules nn and autograd. A simple, shallow LSTM does not reach a high accuracy, but with more time and computational resources there is plenty of room for improvement. To name a few options:

* increase the width (hidden size) and the number of layers of the LSTM
* use character embeddings
* use CNN layers first to encode local information
* fine-tune the hyperparameters

The three CSV files used in this project can be found on data.gouv.fr, the open platform for French public data: https://www.data.gouv.fr/en/datasets/regions-departements-villes-et-villages-de-france-et-doutre-mer/

In [1]:
import pandas as pd
import numpy as np
import string
import time
import math
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.autograd as autograd
from torch.nn.utils.rnn import pack_padded_sequence

In [2]:
cities = pd.read_csv("./csv/cities.csv", sep=";")
cities.head()

Unnamed: 0,id,departments_id,name,slug,pattern,postal_code,gps_lat,gps_lon
0,1,1,L'Abergement-Clémenciat,l-abergement-clemenciat,l abergement clemenciat,1400,46.1519,4.9216
1,2,1,L'Abergement-de-Varey,l-abergement-de-varey,l abergement de varey,1640,46.0078,5.4252
2,3,1,Ambérieu-en-Bugey,amberieu-en-bugey,amberieu en bugey,1500,45.9631,5.3541
3,4,1,Ambérieux-en-Dombes,amberieux-en-dombes,amberieux en dombes,1330,45.9991,4.9033
4,5,1,Ambléon,ambleon,ambleon,1300,45.75,5.6016


In [3]:
regions = pd.read_csv("./csv/regions.csv", sep=";")
regions.columns = ["regions_id", "name", "slug", "iso_code"]
regions.head()

Unnamed: 0,regions_id,name,slug,iso_code
0,1,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
1,2,Bourgogne-Franche-Comté,bourgogne-franche-comte,FR-BCF
2,3,Bretagne,bretagne,FR-BRE
3,4,Centre-Val de Loire,centre-val-de-loire,FR-CVL
4,5,Corse,corse,FR-COR


In [4]:
departments = pd.read_csv("./csv/departments.csv", sep=";")
departments.columns = ["departments_id", "regions_id", "code", "name", "slug", "iso_code"]
departments.head()

Unnamed: 0,departments_id,regions_id,code,name,slug,iso_code
0,1,1,1,Ain,ain,FR-01
1,2,7,2,Aisne,aisne,FR-02
2,3,1,3,Allier,allier,FR-03
3,4,13,4,Alpes-de-Haute-Provence,alpes-de-haute-provence,FR-04
4,5,13,5,Hautes-Alpes,hautes-alpes,FR-05


In [5]:
temp_df = pd.merge(departments, regions, on="regions_id", how="left")
final_df = pd.merge(cities, temp_df, on="departments_id", how="left")
final_df.head()

Unnamed: 0,id,departments_id,name,slug,pattern,postal_code,gps_lat,gps_lon,regions_id,code,name_x,slug_x,iso_code_x,name_y,slug_y,iso_code_y
0,1,1,L'Abergement-Clémenciat,l-abergement-clemenciat,l abergement clemenciat,1400,46.1519,4.9216,1,1,Ain,ain,FR-01,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
1,2,1,L'Abergement-de-Varey,l-abergement-de-varey,l abergement de varey,1640,46.0078,5.4252,1,1,Ain,ain,FR-01,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
2,3,1,Ambérieu-en-Bugey,amberieu-en-bugey,amberieu en bugey,1500,45.9631,5.3541,1,1,Ain,ain,FR-01,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
3,4,1,Ambérieux-en-Dombes,amberieux-en-dombes,amberieux en dombes,1330,45.9991,4.9033,1,1,Ain,ain,FR-01,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
4,5,1,Ambléon,ambleon,ambleon,1300,45.75,5.6016,1,1,Ain,ain,FR-01,Auvergne-Rhône-Alpes,auvergne-rhone-alpes,FR-ARA
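
The `name_x` / `name_y` columns above appear because `departments` and `regions` share column names, so pandas appends its default suffixes. As a small sketch (toy frames with a few rows copied from the heads shown earlier, and suffix names chosen here purely for illustration), the `suffixes` argument of `pd.merge` gives more readable names:

```python
import pandas as pd

# Toy stand-ins for the real tables; values copied from the heads shown above.
departments = pd.DataFrame({
    "departments_id": [1, 2],
    "regions_id": [1, 7],
    "name": ["Ain", "Aisne"],
})
regions = pd.DataFrame({
    "regions_id": [1, 7],
    "name": ["Auvergne-Rhône-Alpes", "Hauts-de-France"],
})

# Explicit suffixes replace the default "_x" / "_y" on overlapping columns.
merged = pd.merge(departments, regions, on="regions_id", how="left",
                  suffixes=("_department", "_region"))
print(merged.columns.tolist())
```

The rest of the notebook only uses `regions_id` and `pattern`, so this is a readability choice rather than a required change.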


### Data Exploration

In [6]:
print("Number of cities is:", len(set(cities["pattern"].tolist())))
print("Number of regions is:", len(set(regions["slug"].tolist())))
print("Number of departments is:", len(set(departments["slug"].tolist())))

Number of cities is: 33304
Number of regions is: 14
Number of departments is: 108
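
Before training, it can help to know how the cities are spread over the regions, since the share of the largest region is the accuracy of a classifier that always predicts it. The sketch below uses a made-up list of labels; `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical region ids, only for illustration.
toy_labels = [0, 0, 0, 5, 5, 9, 10, 10, 10, 10]
label, acc = majority_baseline(toy_labels)
print(label, acc)
```

On the real data, the same function could be called on the list of region ids once it is built in the preprocessing step.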


In [7]:
#Graphs

### Data Preprocessing

In [7]:
df = final_df[["regions_id", "pattern"]].copy()  # explicit copy avoids SettingWithCopyWarning
df["regions_id"] = df["regions_id"] - 1  # shift region ids to 0-based class labels
df.head()



Unnamed: 0,regions_id,pattern
0,0,l abergement clemenciat
1,0,l abergement de varey
2,0,amberieu en bugey
3,0,amberieux en dombes
4,0,ambleon


In [8]:
set(df["regions_id"].tolist())

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}

In [9]:
all_letters = string.ascii_lowercase + " "
n_letters = len(all_letters)
print("Number of letters we consider:", n_letters)
print("Letters:", all_letters, "and 'space'.")
print()
n_categories = len(set(df["regions_id"].tolist()))
print("Number of regions:", n_categories)
print()
regions["regions_id"] = regions["regions_id"].apply(lambda x : x-1)
all_categories = regions.set_index("regions_id").to_dict()["name"]
print("Dictionary of regions and their id:", all_categories)

Number of letters we consider: 27
Letters: abcdefghijklmnopqrstuvwxyz  and 'space'.

Number of regions: 14

Dictionary of regions and their id: {0: 'Auvergne-Rhône-Alpes', 1: 'Bourgogne-Franche-Comté', 2: 'Bretagne', 3: 'Centre-Val de Loire', 4: 'Corse', 5: 'Grand Est', 6: 'Hauts-de-France', 7: 'Île-de-France', 8: 'Normandie', 9: 'Nouvelle-Aquitaine', 10: 'Occitanie', 11: 'Pays de la Loire', 12: "Provence-Alpes-Côte d'Azur", 13: 'Outre-Mer'}


In [10]:
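def line_to_tensor(line):
    """One-hot encode a string as a (len(line), n_letters) NumPy array."""
    tensor = np.zeros((len(line), n_letters))
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        # find() returns -1 for characters outside all_letters; without this
        # check, index -1 would silently mark them as the last symbol (space).
        if letter_index >= 0:
            tensor[li][letter_index] = 1
    return tensor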

a_city = df["pattern"][4]
print(a_city)
print(line_to_tensor(a_city).shape)

ambleon
(7, 27)
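
The encoding above relies on the `pattern` column containing only lowercase letters and spaces. For a raw name that is not already in that form, a normalisation step along the following lines could be applied first. This is a sketch: `normalize_name` is a hypothetical helper, and it is not necessarily how the dataset's `pattern` column was produced.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, and map each run of non-letters to one space."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z]+", " ", no_accents.lower()).strip()


print(normalize_name("L'Abergement-Clémenciat"))
print(normalize_name("Ambérieu-en-Bugey"))
```

Characters without a Unicode decomposition (for example the ligature "œ") are replaced by a space in this sketch, so a real pipeline may want explicit rules for them.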


In [11]:
cities = df["pattern"].tolist()
labels = df["regions_id"].tolist()

In [12]:
import torch.utils.data as data


class CityDataset(data.Dataset):
    """Custom Dataset compatible with torch.utils.data.DataLoader."""

    def __init__(self, cities, labels):
        """Store the city names and their region labels.

        Args:
            cities: list of city names (strings over all_letters).
            labels: list of integer region ids, one per city.
        """
        self.cities = cities
        self.labels = labels

    def __getitem__(self, index):
        """Returns one data pair (city and label)."""
        city = line_to_tensor(self.cities[index])
        label = self.labels[index]
        return city, label

    def __len__(self):
        return len(self.cities)
    
    
def collate_fn(data):
    """Pad a list of (city, label) pairs into one batch, longest city first."""
    # pack_padded_sequence expects sequences sorted by decreasing length.
    data.sort(key=lambda x: x[0].shape[0], reverse=True)
    cities, labels = zip(*data)

    lengths = torch.LongTensor([city.shape[0] for city in cities])
    num_coeffs = cities[0].shape[1]
    padded_cities = torch.zeros(len(cities), int(lengths.max()), num_coeffs)
    for i, city in enumerate(cities):
        end = int(lengths[i])
        padded_cities[i, :end, :] = torch.from_numpy(city[:end])

    # Class indices must be integers (int64) for nn.NLLLoss.
    labels = torch.LongTensor(labels)
    return padded_cities, labels, lengths


def get_train_loader(cities, labels, batch_size, shuffle, sampler, collate_fn):
    """Returns torch.utils.data.DataLoader for custom dataset."""

    dataset = CityDataset(cities=cities, labels=labels)

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              collate_fn=collate_fn,
                                              sampler = sampler)
    return data_loader
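
To make the padding step concrete, here is a NumPy-only sketch of what `collate_fn` does to the inputs: sort by decreasing length, then zero-pad every one-hot matrix to the longest one. `pad_batch` is a hypothetical helper written for this illustration.

```python
import numpy as np


def pad_batch(arrays):
    """Sort (length, features) arrays by decreasing length and zero-pad them."""
    arrays = sorted(arrays, key=lambda a: a.shape[0], reverse=True)
    lengths = [a.shape[0] for a in arrays]
    padded = np.zeros((len(arrays), lengths[0], arrays[0].shape[1]))
    for i, a in enumerate(arrays):
        padded[i, :lengths[i]] = a  # rows beyond the true length stay zero
    return padded, lengths


# Three fake "cities" of 3, 7 and 5 characters over a 27-symbol alphabet.
batch = [np.ones((3, 27)), np.ones((7, 27)), np.ones((5, 27))]
padded, lengths = pad_batch(batch)
print(padded.shape, lengths)
```

The returned lengths are what allows the LSTM to ignore the padded positions once the batch is packed.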

In [13]:
from torch.utils.data.sampler import SubsetRandomSampler

#Define a split for train/valid
valid_size = 0.2
batch_size = 10
num_train = len(cities)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train))

train_idx, valid_idx = indices[split:], indices[:split]

train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

#Load data generators
train_data_loader = get_train_loader(cities=cities, labels=labels, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, sampler=train_sampler)
valid_data_loader = get_train_loader(cities=cities, labels=labels, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, sampler=valid_sampler)
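
One caveat about the split above: the cities file appears to be ordered by department (the first rows all belong to department 1), so taking the first 20% of the indices as the validation set may cover only a few regions. A possible alternative, sketched here with the standard library and a hypothetical helper name, is to shuffle the indices before splitting:

```python
import random


def shuffled_split(n_items, valid_fraction, seed=0):
    """Return (train_idx, valid_idx) after shuffling the indices once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    split = int(valid_fraction * n_items)
    return indices[split:], indices[:split]


train_idx, valid_idx = shuffled_split(n_items=100, valid_fraction=0.2)
print(len(train_idx), len(valid_idx))
```

The resulting index lists can be passed to `SubsetRandomSampler` exactly as before. The results logged in this notebook were obtained with the unshuffled split.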

In [14]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class LSTMClassifier(nn.Module):

    def __init__(self, vocab_size, hidden_dim, output_size):

        super(LSTMClassifier, self).__init__()

        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size

        self.lstm = nn.LSTM(vocab_size, hidden_dim, num_layers=1)

        self.hidden2out = nn.Linear(hidden_dim, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # normalise over the class dimension

        self.dropout_layer = nn.Dropout(p=0.2)


    def init_hidden(self, batch_size):
        # Initial (h0, c0), each of shape (num_layers, batch_size, hidden_dim).
        return (torch.randn(1, batch_size, self.hidden_dim),
                torch.randn(1, batch_size, self.hidden_dim))


    def forward(self, batch, lengths):

        # The last batch of an epoch can be smaller than batch_size.
        self.hidden = self.init_hidden(batch.size(0))

        packed_input = pack_padded_sequence(batch, lengths, batch_first=True)
        outputs, (ht, ct) = self.lstm(packed_input, self.hidden)

        # ht is the last hidden state of the sequences
        # ht = (1 x batch_size x hidden_dim)
        # ht[-1] = (batch_size x hidden_dim)
        
        #output = self.dropout_layer(ht[-1])
        output = self.hidden2out(ht[-1])
        output = self.softmax(output)

        return output
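
The classifier uses `ht[-1]` rather than the last column of the padded output. With a packed input, `ht` holds the hidden state at the last real character of each sequence, so padding does not leak into the prediction. The following sketch checks this on random data; the sizes (3 input features, 4 hidden units) are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=4, num_layers=1)

# Two sequences of true lengths 5 and 2, padded to length 5 (longest first).
batch = torch.randn(2, 5, 3)
lengths = torch.tensor([5, 2])
batch[1, 2:] = 0

packed = pack_padded_sequence(batch, lengths, batch_first=True)
outputs, (ht, ct) = lstm(packed)

# Unpack the per-step outputs and pick the last valid step of each sequence.
padded_out, _ = pad_packed_sequence(outputs, batch_first=True)
last_valid = padded_out[torch.arange(2), lengths - 1]

print(ht.shape)
print(torch.allclose(ht[-1], last_valid))
```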

In [15]:
model = LSTMClassifier(27, 10, 14)

In [16]:
# Loss and Optimizer
criterion = nn.NLLLoss()
learning_rate = 0.8 
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
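
`nn.NLLLoss` applied to log-softmax outputs is the average negative log-probability assigned to the true region. A model that spreads its probability uniformly over the 14 regions therefore scores ln(14), which gives a reference point for reading the loss values printed during training:

```python
import math

n_categories = 14
# Negative log-likelihood of the true class under a uniform prediction.
uniform_nll = -math.log(1.0 / n_categories)
print(round(uniform_nll, 2))  # 2.64
```

Loss values close to 2.64 mean the model is doing little better than guessing uniformly.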

In [17]:
# train
losses = []

num_epochs = 10

# Train the Model
for epoch in range(num_epochs):
    print("##### epoch {:2d}".format(epoch + 1))
    model.train()
    for i, batch in enumerate(train_data_loader):
        city, label, length = batch[0], batch[1].long(), batch[2]
        optimizer.zero_grad()
        pred = model(city, length)
        loss = criterion(pred, label)
        loss.backward()
        optimizer.step()

        if (i + 1) % 1000 == 0:
            print('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f'
                  % (epoch + 1, num_epochs, i + 1, len(train_data_loader), loss.item()))

    # Evaluate on the validation set without tracking gradients.
    model.eval()
    valid_losses = 0
    with torch.no_grad():
        for i, batch in enumerate(valid_data_loader):
            city, label, length = batch[0], batch[1].long(), batch[2]
            pred = model(city, length)
            valid_losses += criterion(pred, label).item()

    # Note: the value printed here is the mean NLL loss per batch, not an MSE,
    # and `epoch` is zero-based in this message.
    print('Validation MSE of the model at epoch {} is: {}'.format(epoch, np.round(valid_losses / len(valid_data_loader), 2)))


##### epoch  1
Epoch [1/10], Iter [1000/2890] Loss: 2.8232
Epoch [1/10], Iter [2000/2890] Loss: 2.7673
Validation MSE of the model at epoch 0 is: 2.7
##### epoch  2
Epoch [2/10], Iter [1000/2890] Loss: 2.1993
Epoch [2/10], Iter [2000/2890] Loss: 2.2776
Validation MSE of the model at epoch 1 is: 2.57
##### epoch  3
Epoch [3/10], Iter [1000/2890] Loss: 2.5592
Epoch [3/10], Iter [2000/2890] Loss: 2.4095
Validation MSE of the model at epoch 2 is: 2.73
##### epoch  4
Epoch [4/10], Iter [1000/2890] Loss: 1.8663
Epoch [4/10], Iter [2000/2890] Loss: 2.9404
Validation MSE of the model at epoch 3 is: 2.72
##### epoch  5
Epoch [5/10], Iter [1000/2890] Loss: 2.0144
Epoch [5/10], Iter [2000/2890] Loss: 1.6621
Validation MSE of the model at epoch 4 is: 2.65
##### epoch  6
Epoch [6/10], Iter [1000/2890] Loss: 1.8594
Epoch [6/10], Iter [2000/2890] Loss: 2.1055
Validation MSE of the model at epoch 5 is: 2.76
##### epoch  7
Epoch [7/10], Iter [1000/2890] Loss: 2.1695
Epoch [7/10], Iter [2000/2890] Loss: