# Importing needed packages

In [99]:
# !pip install spacy spacy-langdetect sentence-transformers
# !python -m spacy download en_core_web_sm

In [5]:
import re
from pprint import PrettyPrinter

import numpy as np
import pandas as pd
import spacy
from sentence_transformers import SentenceTransformer, util
from spacy.language import Language
from spacy_langdetect import LanguageDetector

In [2]:
SEED = 42

The goal of this research is to find out 

# Reading data

From the full dataset of film information, we read only the following columns:
* `id` - movie id
* `russian_title` - Russian title
* `original_title` - original title 

In [3]:
movie_df = pd.read_parquet("../movies.parquet", columns=["id", "russian_title", "original_title"])
movie_df.shape

(984, 3)

In [6]:
movie_df = movie_df.replace(r'^\s*$', np.nan, regex=True)

In [7]:
movie_df.sample(10, random_state=SEED)

|     | id | russian_title | original_title |
| --- | --- | --- | --- |
| 613 | 647671 | Охота (2012) | Jagten |
| 451 | 45146 | Любовь и голуби (1984) | NaN |
| 731 | 84049 | Рапунцель: Запутанная история (2010) | Tangled |
| 436 | 1721 | Лжец, лжец (1997) | Liar Liar |
| 275 | 1072788 | Дьявол всегда здесь (2020) | The Devil All the Time |
| 582 | 469 | Однажды в Америке (1983) | Once Upon a Time in America |
| 707 | 915112 | Призрачная красота (2016) | Collateral Beauty |
| 299 | 81555 | Загадочная история Бенджамина Баттона (2008) | The Curious Case of Benjamin Button |
| 718 | 1236583 | Прошлой ночью в Сохо (2021) | Last Night in Soho |
| 494 | 577673 | Место под соснами (2012) | The Place Beyond the Pines |


In all 10 sampled rows, `russian_title` ends with the release year in parentheses.  
This is something we can clean up.

# Cleaning the data

## Removing parentheses from `russian_title`

### Approach 1: Removing last 6 characters

In the previous section we found that `russian_title` contains the release year of the film.  
Let's check whether the last six characters of `russian_title` always have the form `(year)`.

In [8]:
six_chars = movie_df["russian_title"].apply(lambda s: s[-6:].replace("(", "").replace(")", "")).values
print(six_chars)

['2011' '1999' '2013' '1995' '1956' '2004' '2007' '2019' '2022' '1984'
 '1968' '2009' '2002' '2007' '2020' '2007' '2009' '2022' '2005' '2020'
 '2019' '2019' '2012' '2015' '2021' '2022' '2006' '2009' '2005' '2015'
 '1997' '2018' '2019' '2010' '2019' '2021' '2015' '2001' '1998' '2000'
 '2015' '1966' '2019' '2020' '2017' '2022' '2018' '1979' '2006' '2022'
 '1998' '2021' '2021' '2021' '2010' '2015' '2020' '2022' '2020' '2014'
 '2021' '2021' '2015' '2019' '2018' '2019' '2020' '2014' '2017' '1982'
 '2015' '2021' '2022' '2021' '2009' '2021' '1991' '2015' '2022' '2018'
 '2022' '2022' '2013' '1999' '2017' '2022' '2021' '2000' '1998' '2007'
 '2006' '2022' '2021' '2022' '2000' '1997' '2016' '2009' '1999' '2003'
 '2021' '2003' '2021' '2022' '2011' '2021' '2021' '2022' '2016' '1989'
 '2005' '1973' '2007' '2011' '2006' '2015' '2015' '2017' '2008' '2021'
 '2004' '2016' '2022' '2009' '2015' '2015' '2020' '2013' '2018' '2014'
 '2018' '2021' '2021' '2013' '2016' '2004' '2021' '2020' '2019' '2015'
 '2021

This holds almost everywhere, but two values carry a leading space: ` 2022` and ` 1988`. Let's look at the full titles for these cases.

In [15]:
indices = [i for i, val in enumerate(six_chars) if " " in val]
movie_df["russian_title"][indices]

705    Привилегированные (ТВ, 2022)
775       Собачье сердце (ТВ, 1988)
Name: russian_title, dtype: object

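Aha! These two titles carry a `ТВ` (TV) marker before the year, so the parenthesized part is longer than six characters and a fixed-width slice cannot remove it.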
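To make the failure concrete, here is a minimal sketch that compares the fixed-width slice with a regular expression on the two title formats seen above. The pattern `YEAR_AT_END` and both helper functions are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
import re

# Sample titles in the two formats seen in the dataset
titles = ["Охота (2012)", "Собачье сердце (ТВ, 1988)"]

# Hypothetical pattern: a four-digit year right before the closing parenthesis,
# optionally preceded by other text and a comma (e.g. "ТВ, ")
YEAR_AT_END = re.compile(r"\((?:[^()]*,\s*)?(\d{4})\)\s*$")


def year_by_slice(title):
    """Approach 1: take the last six characters and drop the parentheses."""
    return title[-6:].replace("(", "").replace(")", "")


def year_by_regex(title):
    """Return the year as a string, or None if the title has no trailing year."""
    match = YEAR_AT_END.search(title)
    return match.group(1) if match else None


sliced = [year_by_slice(t) for t in titles]
matched = [year_by_regex(t) for t in titles]
print(sliced)   # the second value keeps a leading space
print(matched)  # both values are clean four-digit strings
```

The slice only works when the parenthesized part is exactly six characters long, while the pattern tolerates extra content before the year.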

### Approach 2: Removing the whole parenthesized part

Let's switch to another strategy: check whether every title contains a substring of the form `(something)` and, if so, remove that substring.

In [16]:
def has_numbers_in_square_brackets(s):
    # Despite the name, this matches any "(...)" group in round parentheses,
    # whatever its contents
    return bool(re.search(r"\(.*\)", s))

In [17]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum() == movie_df.shape[0]
), "Not every title has brackets with something inside"

Every title contains some information in parentheses, and we don't really care what is inside them. Our goal is to clean the titles, so we'll simply delete the parentheses together with their contents.

In [18]:
movie_df["russian_title"] = movie_df["russian_title"].apply(lambda s: re.sub(r"\([^()]*\)", "", s).strip())
movie_df["russian_title"]

0                           1+1
1      10 причин моей ненависти
2                12 лет рабства
3                    12 обезьян
4        12 разгневанных мужчин
                 ...           
979                  Я — начало
980                    Я, робот
981                Яйцо Фаберже
982                      Ярость
983                      Ярость
Name: russian_title, Length: 984, dtype: object

A quick sanity check that no parentheses remain:

In [19]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum() == 0
), "Not all brackets were deleted - check the procedure"
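As a small illustration of the removal step, the sketch below applies the same `\([^()]*\)` pattern to a few sample strings. The helper `strip_parenthesized` is a hypothetical name, and the extra whitespace collapsing is an assumption added here: it only matters if a parenthesized group appears in the middle of a title, which the notebook does not check.

```python
import re


def strip_parenthesized(title):
    """Remove every non-nested "(...)" group, then tidy up the whitespace."""
    without_groups = re.sub(r"\([^()]*\)", "", title)
    # Collapse runs of whitespace left behind by a group in the middle
    return re.sub(r"\s+", " ", without_groups).strip()


examples = [
    "Охота (2012)",
    "Собачье сердце (ТВ, 1988)",
    "Название (заметка) продолжение (2001)",  # made-up title for illustration
]
cleaned = [strip_parenthesized(t) for t in examples]
print(cleaned)
```

Because `[^()]*` excludes parentheses, the pattern removes each group separately instead of swallowing everything between the first `(` and the last `)`.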

## Checking for duplicates

In [20]:
movie_df.drop_duplicates().shape == movie_df.shape

True

There are no fully duplicated rows in the dataset, so there is nothing to do here.
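Note that `drop_duplicates()` compares whole rows, including `id`, so two different movies that share a title (such as the two `Ярость` entries in the output above) do not count as duplicates. If title-level repeats are of interest, they could be listed along these lines. This is a sketch on a toy frame with made-up ids and original titles, not the real data.

```python
import pandas as pd

# Toy frame with made-up values, standing in for movie_df
toy_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "russian_title": ["Ярость", "Ярость", "Охота"],
        "original_title": ["Fury", "Rage", "Jagten"],
    }
)

# Full-row check: no duplicates, because the ids differ
full_row_dupes = toy_df.duplicated().sum()

# Title-only check: keep=False marks every member of a repeated group
same_title = toy_df[toy_df.duplicated(subset=["russian_title"], keep=False)]

print(full_row_dupes)
print(same_title)
```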

## Checking for missing values

In [21]:
movie_df.isna().any()

id                False
russian_title     False
original_title     True
dtype: bool

We can see that the `original_title` column contains `NaN` values.  
This is probably because not every movie has an `original_title`: for example, a Russian-made movie may have no separate English title.

In [22]:
movie_df[movie_df.isnull().any(axis=1)]

|     | id | russian_title | original_title |
| --- | --- | --- | --- |
| 8 | 4533880 | 1941. Крылья над Берлином | NaN |
| 18 | 84674 | 9 рота | NaN |
| 41 | 8385 | Андрей Рублев | NaN |
| 53 | 1291197 | Артек. Большое путешествие | NaN |
| 60 | 4550354 | Бабки | NaN |
| ... | ... | ... | ... |
| 944 | 1209750 | Чернобыль: Зона отчуждения. Финал | NaN |
| 969 | 4312912 | Этерна: Часть первая | NaN |
| 975 | 4903616 | Я иду искать | NaN |
| 977 | 4493006 | Я хочу! Я буду! | NaN |


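The assumption looks right: the rows without an `original_title` appear to be Russian-made films.  
With no second title to compare against, it is reasonable to drop these rows.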

In [23]:
movie_df = movie_df.dropna(axis=0)
movie_df.shape

(853, 3)

In [24]:
assert not movie_df.isna().any().any()
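The missing-value handling above, together with the earlier replacement of blank strings by `NaN`, can be sketched end to end on a toy frame. The frame and the variable names `missing_per_column` and `complete_rows` are illustrative assumptions; using `isna().sum()` instead of `isna().any()` shows how many values are missing per column rather than only whether any are.

```python
import numpy as np
import pandas as pd

# Toy frame: two of the three original titles are blank
toy_df = pd.DataFrame(
    {
        "russian_title": ["Охота", "9 рота", "Бабки"],
        "original_title": ["Jagten", "", "   "],
    }
)

# Turn empty or whitespace-only strings into NaN so isna() can see them
toy_df = toy_df.replace(r"^\s*$", np.nan, regex=True)

# Count missing values per column, then keep only complete rows
missing_per_column = toy_df.isna().sum()
complete_rows = toy_df.dropna(axis=0)

print(missing_per_column)
print(complete_rows.shape)
```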

# Semantic similarity

## Computing similarities

### Choosing model

In [138]:
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

`distiluse-base-multilingual-cased-v2` is a multilingual, distilled version of the Universal Sentence Encoder. It supports 50+ languages, including Russian (`ru`): ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw

In [142]:
# Two parallel lists of titles: row i of each refers to the same movie
russian_titles = movie_df["russian_title"].values
original_titles = movie_df["original_title"].values

# Lowercased copies are what the model actually encodes
russian_titles_lowercase = movie_df["russian_title"].str.lower().tolist()
original_titles_lowercase = movie_df["original_title"].str.lower().tolist()

# Compute embeddings for both lists
russian_title_embs = model.encode(russian_titles_lowercase, convert_to_tensor=True)
original_title_embs = model.encode(original_titles_lowercase, convert_to_tensor=True)

# Compute cosine similarities for all pairs:
# entry [i][j] compares Russian title i with original title j
cosine_scores = util.cos_sim(russian_title_embs, original_title_embs)
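Since each Russian title is only compared with its own original title, only the diagonal of the all-pairs matrix is needed later. The sketch below shows the underlying arithmetic in plain NumPy, with made-up two-dimensional vectors standing in for the model embeddings. `rowwise_cosine` is a hypothetical helper written for this illustration, not a `sentence_transformers` function.

```python
import numpy as np


def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)


# Made-up 2-D "embeddings" standing in for the model output
russian_embs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
original_embs = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, -3.0]])

# All-pairs matrix (the shape of result an all-pairs cosine call gives):
# normalize the rows, then take every dot product
a_unit = russian_embs / np.linalg.norm(russian_embs, axis=1, keepdims=True)
b_unit = original_embs / np.linalg.norm(original_embs, axis=1, keepdims=True)
full_matrix = a_unit @ b_unit.T

# Paired scores only: these equal the diagonal of the full matrix
paired = rowwise_cosine(russian_embs, original_embs)
print(paired)
print(np.diag(full_matrix))
```

For 853 titles the full matrix is small enough that computing it and reading the diagonal is fine; the row-wise form just avoids the unused off-diagonal entries.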

## Creating the resulting DataFrame and sorting

In [175]:
rows = []
# cosine_scores[i][i] pairs each Russian title with its own original title
for i in range(len(russian_titles)):
    rows.append([russian_titles[i], original_titles[i], cosine_scores[i][i].cpu().detach().numpy()])

similarity_df = pd.DataFrame(data=rows, columns=["russian_title", "original_title", "similarity"])
similarity_df.sort_values(by="similarity", ascending=True, inplace=True)
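The same frame could also be built without an explicit loop by passing the columns directly. This is a sketch with placeholder titles and made-up scores, assuming the paired similarities are already available as a one-dimensional NumPy array; it also gives the `similarity` column a numeric dtype, which makes later filtering and plotting easier.

```python
import numpy as np
import pandas as pd

# Placeholder titles and made-up scores, not results from the model
russian = ["название A", "название B", "название C"]
original = ["title A", "title B", "title C"]
scores = np.array([0.42, 0.91, 0.88])

# Build the frame column by column and sort from least to most similar
toy_similarity_df = pd.DataFrame(
    {"russian_title": russian, "original_title": original, "similarity": scores}
).sort_values(by="similarity", ascending=True)

# The least similar pairs come first after the ascending sort
least_similar = toy_similarity_df.head(2)
print(least_similar)
```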

## Results

In [176]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df)

|     | russian_title | original_title | similarity |
| --- | --- | --- | --- |
| 68 | Борат | Borat: Cultural Learnings of America for Make ... | 0.033492133 |
| 452 | Однажды в Ирландии | The Guard | 0.103530146 |
| 96 | Веном 2 | Venom: Let There Be Carnage | 0.112162165 |
| 275 | Иллюзия обмана | Now You See Me | 0.1267828 |
| 422 | Невероятный мир глазами Энцо | The Art of Racing in the Rain | 0.13008945 |
| 50 | Бёрдмэн | Birdman or (The Unexpected Virtue of Ignorance) | 0.14588004 |
| 517 | Пол: Секретный материальчик | Paul | 0.16916819 |
| 89 | Ведьма | The VVitch: A New-England Folktale | 0.18793191 |
| 0 | 1+1 | Intouchables | 0.19013056 |
| 69 | Босс-молокосос 2 | The Boss Baby: Family Business | 0.19786376 |


## Thoughts

## Visualization

https://www.kaggle.com/code/colinmorris/visualizing-embeddings-with-t-sne/notebook