# Research on the quality of localization of movie titles

## Importing needed packages

In [1]:
# !pip install spacy spacy-langdetect sentence-transformers
# !python -m spacy download en_core_web_sm

In [2]:
import re
from pprint import PrettyPrinter
from typing import List, Tuple

import numpy as np
import pandas as pd
import spacy
import torch
from sentence_transformers import SentenceTransformer, util
from spacy.language import Language
from spacy_langdetect import LanguageDetector
from torch import Tensor

In [3]:
SEED = 42

The goal of this research is to find out how closely the Russian titles of movies match the meaning of their original titles.

## Reading data

From the whole dataset with information about films we'll read only the following columns:
* `id` - movie id
* `russian_title` - Russian title
* `original_title` - original title 

In [4]:
movie_df = pd.read_parquet("../data/movies.parquet", columns=["id", "russian_title", "original_title"])
movie_df.shape

(984, 3)

In [5]:
movie_df = movie_df.replace(r"^\s*$", np.nan, regex=True)
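
To see what this replacement does, here is a toy example. The small frame is made up for illustration and is not taken from the dataset; the pattern is the same as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one real title, one empty string, one whitespace-only string.
toy_df = pd.DataFrame({"original_title": ["Tangled", "", "   "]})

# Same pattern as above: a cell that is empty or whitespace-only becomes NaN.
toy_clean = toy_df.replace(r"^\s*$", np.nan, regex=True)

n_missing = int(toy_clean["original_title"].isna().sum())
print(n_missing)
```

Turning such cells into `NaN` early means that `isna()` and `dropna()` later catch them without any extra string checks.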

In [6]:
movie_df.sample(10, random_state=SEED)

Unnamed: 0,id,russian_title,original_title
613,647671,Охота (2012),Jagten
451,45146,Любовь и голуби (1984),
731,84049,Рапунцель: Запутанная история (2010),Tangled
436,1721,"Лжец, лжец (1997)",Liar Liar
275,1072788,Дьявол всегда здесь (2020),The Devil All the Time
582,469,Однажды в Америке (1983),Once Upon a Time in America
707,915112,Призрачная красота (2016),Collateral Beauty
299,81555,Загадочная история Бенджамина Баттона (2008),The Curious Case of Benjamin Button
718,1236583,Прошлой ночью в Сохо (2021),Last Night in Soho
494,577673,Место под соснами (2012),The Place Beyond the Pines


We can see that the `russian_title` contains the release year for every film (or at least these 10 random samples).  
This is an opportunity to clean the data.

## Cleaning the data

### Removing parentheses from `russian_title`

#### Approach 1: Removing last 6 characters

In the previous section we found out that the `russian_title` contains the release year of the film.  
Let's check that the last six characters of the `russian_title` always have the same form, `(year)`.

In [7]:
six_chars = movie_df["russian_title"].apply(lambda s: s[-6:].replace("(", "").replace(")", ""))
six_chars.value_counts(ascending=True).iloc[:15]

1968     1
1976     1
1959     1
 1988    1
1956     1
1975     1
 2022    1
1970     1
1939     1
1974     1
1977     1
1971     1
1989     2
1973     2
1993     2
Name: russian_title, dtype: int64

Among the least frequent values there are two oddities, ` 1988` and ` 2022`: the year is preceded by a space.  
Let's look at the full titles for these cases.

In [8]:
indices = [i for i, val in enumerate(six_chars) if " " in val]
movie_df["russian_title"][indices]

705    Привилегированные (ТВ, 2022)
775       Собачье сердце (ТВ, 1988)
Name: russian_title, dtype: object

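Aha! Some titles carry extra information inside the parentheses (here `ТВ`, i.e. TV), so the year is not always the last six characters.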
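
Since the part in parentheses is not always exactly six characters long, a regular expression is a safer way to split it off. Below is a minimal sketch; `SUFFIX_RE` and `split_title` are names I made up for the example, and it assumes the year (if present) is the last four digits inside the trailing parentheses.

```python
import re

# A "(...)" group at the very end of the string, with optional surrounding spaces.
SUFFIX_RE = re.compile(r"\s*\(([^()]*)\)\s*$")


def split_title(raw: str):
    """Return (title, year); year is None when no 4-digit year is found."""
    match = SUFFIX_RE.search(raw)
    if match is None:
        return raw.strip(), None
    # The year is assumed to be the last four digits inside the parentheses.
    year_match = re.search(r"(\d{4})\s*$", match.group(1))
    year = int(year_match.group(1)) if year_match else None
    return raw[: match.start()].strip(), year


examples = ["Охота (2012)", "Собачье сердце (ТВ, 1988)", "Привилегированные (ТВ, 2022)"]
parsed = [split_title(s) for s in examples]
print(parsed)
```

Unlike slicing the last six characters, this handles both `(2012)` and `(ТВ, 1988)` and leaves titles without parentheses untouched.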

#### Approach 2: Removing the parentheses and their contents

Let's switch to another strategy: check whether every title contains a substring like `(smth)` and, if so, remove that substring.

In [9]:
def has_numbers_in_square_brackets(s):
    # Despite the name, this checks for any "(...)" group, not for digits or square brackets.
    return bool(re.search(r"\(.*\)", s))

In [10]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum() == movie_df.shape[0]
), "Not every title has brackets with something inside"

Every title contains some information in parentheses, and we don't really care what's inside them. Our goal is to clean the titles, so we'll just delete the parentheses together with their contents.

In [11]:
movie_df["russian_title"] = movie_df["russian_title"].apply(lambda s: re.sub(r"\([^()]*\)", "", s).strip())
movie_df["russian_title"]

0                           1+1
1      10 причин моей ненависти
2                12 лет рабства
3                    12 обезьян
4        12 разгневанных мужчин
                 ...           
979                  Я — начало
980                    Я, робот
981                Яйцо Фаберже
982                      Ярость
983                      Ярость
Name: russian_title, Length: 984, dtype: object

Just checking that no parentheses are left

In [12]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum() == 0
), "Not all brackets were deleted - check the procedure"

### Checking for duplicates

In [13]:
movie_df.drop_duplicates().shape == movie_df.shape

True

There are no duplicates in the dataset - nothing to do here

### Checking for missing values

In [14]:
movie_df.isna().any()

id                False
russian_title     False
original_title     True
dtype: bool

We can see that the `original_title` column contains `NaN` values.  
Probably not every movie has an `original_title`: for example, a Russian-made movie may not have an English-translated title.

In [15]:
movie_df[movie_df.isnull().any(axis=1)]

Unnamed: 0,id,russian_title,original_title
8,4533880,1941. Крылья над Берлином,
18,84674,9 рота,
41,8385,Андрей Рублев,
53,1291197,Артек. Большое путешествие,
60,4550354,Бабки,
...,...,...,...
944,1209750,Чернобыль: Зона отчуждения. Финал,
969,4312912,Этерна: Часть первая,
975,4903616,Я иду искать,
977,4493006,Я хочу! Я буду!,


Yes, my assumption was right: these are Russian-made films.  
There is no original title to compare against, so it is reasonable to drop such rows.

In [16]:
movie_df = movie_df.dropna(axis=0)
movie_df.shape

(853, 3)

In [17]:
assert not movie_df.isna().any().any()
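
The drop takes the dataset from 984 to 853 rows, i.e. 131 films without an original title are removed. A side note: after dropping, the frame keeps its old row labels with gaps in them. The sketch below, on a made-up toy frame, shows one way to make the drop explicit about the column it depends on and to renumber the rows; whether renumbering is wanted here is my assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for movie_df.
toy_df = pd.DataFrame(
    {
        "id": [647671, 45146, 84049],
        "russian_title": ["Охота", "Любовь и голуби", "Рапунцель: Запутанная история"],
        "original_title": ["Jagten", np.nan, "Tangled"],
    }
)

# Drop rows only when original_title is missing, then renumber rows 0..n-1.
toy_clean = toy_df.dropna(subset=["original_title"]).reset_index(drop=True)

n_dropped = len(toy_df) - len(toy_clean)
print(toy_clean.index.tolist(), n_dropped)
```

With a fresh index the row labels in this frame line up with positions in the arrays taken from it later.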

## Semantic similarity

### Getting titles

In [18]:
russian_titles = movie_df["russian_title"].values
original_titles = movie_df["original_title"].values

### Useful functions

In [33]:
def get_embeddings(model, texts: List[str]) -> Tensor:
    texts_lowercase = list(map(str.lower, texts))
    with torch.no_grad():
        text_embeddings = model.encode(texts_lowercase, convert_to_tensor=True)
    return text_embeddings

In [34]:
def compute_embeddings(model, russian_titles: np.ndarray, original_titles: np.ndarray) -> Tuple[Tensor, Tensor]:
    russian_title_embs = get_embeddings(model, russian_titles)
    original_title_embs = get_embeddings(model, original_titles)
    
    return russian_title_embs, original_title_embs

In [35]:
def compute_similarity(russian_title_embs: Tensor, original_title_embs: Tensor) -> np.ndarray:
    return util.cos_sim(russian_title_embs, original_title_embs).cpu().detach().numpy()

In [42]:
def get_similarity_dataframe(model, russian_titles, original_titles, sort=False, ascending=True):
    embeddings = compute_embeddings(model, russian_titles, original_titles)
    similarity_scores = compute_similarity(*embeddings)

    rows = []
    for i in range(len(russian_titles)):
        rows.append([russian_titles[i], original_titles[i], similarity_scores[i][i]])

    similarity_df = pd.DataFrame(data=rows, columns=["russian_title", "original_title", "similarity"])
    similarity_df["similarity"] = similarity_df["similarity"].apply(lambda similarity: round(similarity, 3))
    
    del embeddings
    torch.cuda.empty_cache()
    
    similarity_df = similarity_df.sort_values(by="similarity", ascending=ascending) if sort else similarity_df
    
    return similarity_df
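
Note that `util.cos_sim` returns the full N×N matrix of similarities between every Russian title and every original title, while only the diagonal (title `i` against its own original) is used. Below is a minimal NumPy sketch of computing just the paired scores; `paired_cosine` is a name made up for this example and the random vectors are stand-ins for real embeddings. As far as I know, recent versions of sentence-transformers also provide `util.pairwise_cos_sim` for the same purpose, but I haven't used it in this notebook.

```python
import numpy as np


def paired_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between a[i] and b[i] for every row i."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_norm * b_norm, axis=1)


# Random stand-ins for title embeddings (3 titles, 4 dimensions).
rng = np.random.default_rng(42)
ru = rng.normal(size=(3, 4))
orig = rng.normal(size=(3, 4))

# Full similarity matrix of normalized rows, for comparison with the paired version.
full = (ru / np.linalg.norm(ru, axis=1, keepdims=True)) @ (
    orig / np.linalg.norm(orig, axis=1, keepdims=True)
).T

paired = paired_cosine(ru, orig)
print(np.allclose(paired, np.diag(full)))
```

For 853 titles the full matrix is small enough not to matter, so this is mostly about making the intent clearer.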

#### Debug opportunities

In [30]:
# embeddings = compute_embeddings(distil_use_v2, russian_titles, original_titles)
# similarity_scores = compute_similarity(*embeddings)

### Choosing model

We are going to use multilingual models from the [SentenceTransformers](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models) framework.

In [24]:
%%time

distil_use_v1 = SentenceTransformer("distiluse-base-multilingual-cased-v1", cache_folder="../cache_folder")
distil_use_v2 = SentenceTransformer("distiluse-base-multilingual-cased-v2", cache_folder="../cache_folder")
minilm = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", cache_folder="../cache_folder")
mpnet = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", cache_folder="../cache_folder")
labse = SentenceTransformer("LaBSE", cache_folder="../cache_folder")

CPU times: total: 22.8 s
Wall time: 1min 5s


In [25]:
models = {
    "distiluse-base-multilingual-cased-v1": distil_use_v1,
    "distiluse-base-multilingual-cased-v2": distil_use_v2,
    "paraphrase-multilingual-MiniLM-L12-v2": minilm,
    "paraphrase-multilingual-mpnet-base-v2": mpnet,
    "LaBSE": labse,
}

### Sanity check

Before moving on I'd like to do a sanity check of the models: assess how well they estimate the similarity between Russian and original titles.

In [64]:
similarities = {}
for model_name, model in models.items():
    similarity_df = get_similarity_dataframe(model, russian_titles, original_titles)
    similarities[model_name] = similarity_df["similarity"]
    print(model_name)
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        display(similarity_df.sample(10, random_state=SEED))
    print("\n")

distiluse-base-multilingual-cased-v1


Unnamed: 0,russian_title,original_title,similarity
66,Бесконечность,Infinite,0.896
434,Мемуары гейши,Memoirs of a Geisha,0.464
198,"Девочка, покорившая время",Toki o kakeru shojo,0.152
212,Дивергент,Divergent,0.735
652,Святые из Бундока,The Boondock Saints,0.882
543,Пассажиры,Passengers,0.956
280,Звёздные войны: Эпизод 2 — Атака клонов,Star Wars: Episode II - Attack of the Clones,0.715
296,Зомби по имени Шон,Shaun of the Dead,0.627
365,Крупная рыба,Big Fish,0.965
679,Список Шиндлера,Schindler's List,0.878




distiluse-base-multilingual-cased-v2


Unnamed: 0,russian_title,original_title,similarity
66,Бесконечность,Infinite,0.841
434,Мемуары гейши,Memoirs of a Geisha,0.765
198,"Девочка, покорившая время",Toki o kakeru shojo,0.328
212,Дивергент,Divergent,0.697
652,Святые из Бундока,The Boondock Saints,0.858
543,Пассажиры,Passengers,0.952
280,Звёздные войны: Эпизод 2 — Атака клонов,Star Wars: Episode II - Attack of the Clones,0.798
296,Зомби по имени Шон,Shaun of the Dead,0.631
365,Крупная рыба,Big Fish,0.931
679,Список Шиндлера,Schindler's List,0.926




paraphrase-multilingual-MiniLM-L12-v2


Unnamed: 0,russian_title,original_title,similarity
66,Бесконечность,Infinite,0.954
434,Мемуары гейши,Memoirs of a Geisha,0.787
198,"Девочка, покорившая время",Toki o kakeru shojo,0.536
212,Дивергент,Divergent,0.612
652,Святые из Бундока,The Boondock Saints,0.727
543,Пассажиры,Passengers,0.992
280,Звёздные войны: Эпизод 2 — Атака клонов,Star Wars: Episode II - Attack of the Clones,0.954
296,Зомби по имени Шон,Shaun of the Dead,0.49
365,Крупная рыба,Big Fish,0.98
679,Список Шиндлера,Schindler's List,0.701




paraphrase-multilingual-mpnet-base-v2


Unnamed: 0,russian_title,original_title,similarity
66,Бесконечность,Infinite,0.898
434,Мемуары гейши,Memoirs of a Geisha,0.838
198,"Девочка, покорившая время",Toki o kakeru shojo,0.467
212,Дивергент,Divergent,0.562
652,Святые из Бундока,The Boondock Saints,0.728
543,Пассажиры,Passengers,0.963
280,Звёздные войны: Эпизод 2 — Атака клонов,Star Wars: Episode II - Attack of the Clones,0.95
296,Зомби по имени Шон,Shaun of the Dead,0.624
365,Крупная рыба,Big Fish,0.925
679,Список Шиндлера,Schindler's List,0.724




LaBSE


Unnamed: 0,russian_title,original_title,similarity
66,Бесконечность,Infinite,0.874
434,Мемуары гейши,Memoirs of a Geisha,0.761
198,"Девочка, покорившая время",Toki o kakeru shojo,0.279
212,Дивергент,Divergent,0.796
652,Святые из Бундока,The Boondock Saints,0.822
543,Пассажиры,Passengers,0.967
280,Звёздные войны: Эпизод 2 — Атака клонов,Star Wars: Episode II - Attack of the Clones,0.892
296,Зомби по имени Шон,Shaun of the Dead,0.437
365,Крупная рыба,Big Fish,0.942
679,Список Шиндлера,Schindler's List,0.863






### Averaging similarities

The models seem to work fine, but their results vary. So what if we average the similarity scores across all the models?  
Can this approach give us better overall performance?

In [65]:
similarity_df.drop("similarity", axis=1, inplace=True, errors='ignore')
for model_name, similarity_col in similarities.items():
    similarity_df[model_name] = similarity_col

similarity_df["avg_sim"] = similarity_df[similarities.keys()].mean(axis=1)

In [66]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df.sort_values(by="avg_sim", ascending=True).iloc[:50])

Unnamed: 0,russian_title,original_title,distiluse-base-multilingual-cased-v1,distiluse-base-multilingual-cased-v2,paraphrase-multilingual-MiniLM-L12-v2,paraphrase-multilingual-mpnet-base-v2,LaBSE,avg_sim
80,Борат,Borat: Cultural Learnings of America for Make ...,0.082,0.033,0.172,0.33,0.195,0.1624
505,Одинокий волк,Clean,0.15,0.144,0.123,0.143,0.336,0.1792
507,Однажды в Ирландии,The Guard,0.085,0.104,0.295,0.282,0.182,0.1896
570,По соображениям совести,Hacksaw Ridge,0.311,0.249,0.192,0.095,0.117,0.1928
810,"Человек, который изменил всё",Moneyball,0.155,0.13,0.193,0.157,0.343,0.1956
757,Унесённые призраками,Sen to Chihiro no kamikakushi,0.26,0.276,0.137,0.224,0.15,0.2094
765,Философы: Урок выживания,After the Dark,0.084,0.075,0.411,0.314,0.188,0.2144
139,Воспоминания об убийстве,Salinui chueok,0.154,0.273,0.256,0.2,0.201,0.2168
51,Атака титанов. Фильм первый: Жестокий мир,Shingeki no kyojin,0.136,0.218,0.366,0.232,0.135,0.2174
205,День курка,Boss Level,0.245,0.234,0.15,0.167,0.298,0.2188
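
As a worked check of the averaging, here is the first row of the table above (Борат) recomputed by hand. The scores are copied from that row; the spread is an extra number I added to show how much the models disagree.

```python
from statistics import mean

# Per-model scores for "Борат", copied from the first row of the table above.
borat_scores = {
    "distiluse-base-multilingual-cased-v1": 0.082,
    "distiluse-base-multilingual-cased-v2": 0.033,
    "paraphrase-multilingual-MiniLM-L12-v2": 0.172,
    "paraphrase-multilingual-mpnet-base-v2": 0.33,
    "LaBSE": 0.195,
}

# Plain arithmetic mean across the five models, as in avg_sim.
avg_sim = round(mean(borat_scores.values()), 4)

# Difference between the most and the least optimistic model.
spread = round(max(borat_scores.values()) - min(borat_scores.values()), 3)

print(avg_sim, spread)
```

The mean matches the `avg_sim` column (0.1624), while the individual scores range from 0.033 to 0.33, so a single model can differ from the average quite a lot.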


In [60]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df.sort_values(by="avg_sim", ascending=False).iloc[:50])

Unnamed: 0,russian_title,original_title,distiluse-base-multilingual-cased-v1,distiluse-base-multilingual-cased-v2,paraphrase-multilingual-MiniLM-L12-v2,paraphrase-multilingual-mpnet-base-v2,LaBSE,avg_sim
6,1408,1408,1.0,1.0,1.0,1.0,1.0,1.0
7,1917,1917,1.0,1.0,1.0,1.0,1.0,1.0
23,X,X,1.0,1.0,1.0,1.0,1.0,1.0
10,2012,2012,1.0,1.0,1.0,1.0,1.0,1.0
13,365 дней,365 dni,0.993,0.997,0.999,0.998,0.998,0.997
578,Пожары,Incendies,0.956,0.989,0.997,0.979,0.985,0.9812
84,Братья,Brothers,0.981,0.971,0.986,0.97,0.977,0.977
515,Опасный метод,A Dangerous Method,0.969,0.957,0.986,0.981,0.961,0.9708
675,Социальная сеть,The Social Network,0.962,0.968,0.977,0.982,0.959,0.9696
607,Преступления будущего,Crimes of the Future,0.962,0.968,0.974,0.972,0.971,0.9694


### Creating resulting DataFrame

In [103]:
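# distiluse-base-multilingual-cased-v2: the scores in the Results section match its column in the averaged table
similarity_df = get_similarity_dataframe(distil_use_v2, russian_titles, original_titles)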
similarity_df.head()

Unnamed: 0,russian_title,original_title,similarity
0,1+1,Intouchables,0.19
1,10 причин моей ненависти,10 Things I Hate About You,0.791
2,12 лет рабства,12 Years a Slave,0.939
3,12 обезьян,Twelve Monkeys,0.932
4,12 разгневанных мужчин,12 Angry Men,0.962


## Results

### Dissimilar titles

Let's look at titles for which the Russian translation doesn't convey the meaning of the original title.

In [77]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df.sort_values(by="similarity", ascending=True).iloc[:50])

Unnamed: 0,russian_title,original_title,similarity
80,Борат,Borat: Cultural Learnings of America for Make ...,0.033
765,Философы: Урок выживания,After the Dark,0.075
507,Однажды в Ирландии,The Guard,0.104
113,Веном 2,Venom: Let There Be Carnage,0.112
311,Иллюзия обмана,Now You See Me,0.127
474,Невероятный мир глазами Энцо,The Art of Racing in the Rain,0.13
810,"Человек, который изменил всё",Moneyball,0.13
169,Гарри Хафт: Последний бой,The Survivor,0.141
751,Удивительное путешествие доктора Дулиттла,Dolittle,0.142
505,Одинокий волк,Clean,0.144


Low similarity scores fall into a few groups:  

**Russian title is a cropped version of original title**  
An additional problem here may be that embeddings don't handle proper names well, e.g. Borat::Борат or Dolittle::Дулиттл.  
Examples:
* Борат::Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan (sim score: 0.033)
* Веном 2::Venom: Let There Be Carnage (sim score: 0.112)
* Бёрдмэн::Birdman or (The Unexpected Virtue of Ignorance) (sim score: 0.146)

**Russian title is an extended version of original title**  
(Remark about proper names applies to this case too)  
Examples:
* Удивительное путешествие доктора Дулиттла::Dolittle (sim score: 0.142)
* Пол: Секретный материальчик::Paul (sim score: 0.169)
* Рапунцель: Запутанная история::Tangled (sim score: 0.181)

**Russian title was localized (made up) by translators/localizers**  
Examples:
* Невероятный мир глазами Энцо::The Art of Racing in the Rain (sim score: 0.130)
* Человек, который изменил всё::Moneyball (sim score: 0.130)
* Области тьмы::Limitless (sim score: 0.230)

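**Errors due to the language of the original title**  
The original title here is a romanized non-English title (Japanese in the examples below), which the models seem to embed poorly.  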
Examples:
* Хочу съесть твою поджелудочную железу::Kimi no suizo wo tabetai (sim score: 0.158)
* Красавица и дракон::Ryu to Sobakasu no Hime (sim score: 0.199)
* Дитя погоды::Tenki no k (sim score: 0.218)

### Similar titles

In [78]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df.sort_values(by="similarity", ascending=False).iloc[:30])

Unnamed: 0,russian_title,original_title,similarity
23,X,X,1.0
6,1408,1408,1.0
7,1917,1917,1.0
10,2012,2012,1.0
13,365 дней,365 dni,0.997
708,Тело,El cuerpo,0.99
578,Пожары,Incendies,0.989
421,Мама,Mama,0.989
466,Назад в будущее,Back to the Future,0.982
21,Kingsman: Секретная служба,Kingsman: The Secret Service,0.982


Similar titles are easier: they are almost literal translations of the original titles.

## Things to do

* Visualization:
    * [Visualizing Embeddings With t-SNE](https://www.kaggle.com/code/colinmorris/visualizing-embeddings-with-t-sne/notebook)