# Importing needed packages

In [2]:
# !pip install -U sentence-transformers

In [27]:
import re
from pprint import PrettyPrinter

import pandas as pd
from sentence_transformers import SentenceTransformer, util

In [4]:
SEED = 42

The goal of this research is to find out how semantically similar the Russian titles of films are to their original titles.

# Reading data

From the full film dataset, we'll read only the following columns:
* `id` - movie id
* `russian_title` - Russian title
* `original_title` - original title 

In [5]:
movie_df = pd.read_parquet(
    "../movies.parquet", columns=["id", "russian_title", "original_title"]
)
movie_df.shape

(593, 3)

In [6]:
movie_df.sample(10, random_state=SEED)

| | id | russian_title | original_title |
|---|---|---|---|
| 30 | 354 | Апокалипсис сегодня (1979) | Apocalypse Now |
| 76 | 958442 | Ветреная река (2016) | Wind River |
| 227 | 522 | Карты, деньги, два ствола (1998) | Lock, Stock and Two Smoking Barrels |
| 424 | 470553 | Прислуга (2011) | The Help |
| 184 | 391 | Заводной апельсин (1971) | A Clockwork Orange |
| 362 | 893245 | Основатель (2016) | The Founder |
| 301 | 577673 | Место под соснами (2012) | The Place Beyond the Pines |
| 550 | 48162 | Хроники Нарнии: Лев, колдунья и волшебный шкаф... | The Chronicles of Narnia: The Lion, the Witch ... |
| 101 | 686 | Ганнибал (2001) | Hannibal |
| 177 | 61237 | Железный человек (2008) | Iron Man |


We can see that `russian_title` contains the release year for every film, at least in these 10 random samples.  
This is an opportunity to clean the data.

# Cleaning the data

## Removing parentheses from `russian_title`

### Approach 1: Removing last 6 characters

In the previous section we found that `russian_title` contains the release year of the film.  
Let's check whether the last six characters of `russian_title` always have the form `(year)`.

In [7]:
six_chars = (
    movie_df["russian_title"]
    .apply(lambda s: s[-6:].replace("(", "").replace(")", ""))
    .values
)
print(six_chars)

['2011' '1999' '2013' '1995' '1956' '2004' '2007' '2019' '1968' '2009'
 '2002' '2007' '2007' '2009' '2005' '2019' '2015' '2006' '2009' '2005'
 '2015' '1997' '2018' '2019' '2010' '2019' '2001' '1998' '2000' '2018'
 '1979' '2006' '1998' '2010' '2020' '2014' '2015' '2019' '2020' '2014'
 '2017' '1982' '2015' '2021' '2009' '1991' '2015' '2018' '2013' '1999'
 '2017' '2000' '1998' '2007' '2006' '2021' '2000' '1997' '2009' '2003'
 '2003' '2016' '2005' '2007' '2006' '2015' '2017' '2008' '2004' '2009'
 '2015' '2013' '2018' '2014' '2018' '2021' '2016' '2004' '2021' '2001'
 '2003' '2002' '2018' '2011' '2013' '2005' '2002' '2013' '2019' '2010'
 '2021' '2017' '2011' '2008' '2014' '2015' '2013' '2015' '2015' '2010'
 '2007' '2001' '2022' '2010' '2011' '2005' '2007' '2009' '2002' '2004'
 '2001' '1997' '2021' '2000' '2021' '2021' '2009' '2015' '2013' '2005'
 '2005' '2014' '2005' '2009' '2013' '2008' '2014' '2007' '2013' '2013'
 '2017' '1999' '2011' '2020' '2016' '2017' '2019' '1993' '2012' '2012'
 '2019

We see that this is almost always true, but for one movie the result is ` 1988`, with a leading space. I'm going to check the whole title for this case.

In [8]:
movie_idx = [" " in val for val in six_chars].index(True)

movie_idx, movie_df["russian_title"][movie_idx]

(460, 'Собачье сердце (ТВ, 1988)')

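Aha! Here the parentheses contain `ТВ, 1988`, not just the year, so cutting off a fixed six characters doesn't work.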
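To make the failure concrete, here is a minimal sketch that applies fixed-length slicing to two titles from the dataset. The helper `strip_last_six` is a hypothetical name introduced only for this illustration; it is not part of the notebook.

```python
# Two titles taken from the outputs above.
titles = ["Апокалипсис сегодня (1979)", "Собачье сердце (ТВ, 1988)"]


def strip_last_six(title: str) -> str:
    """Drop the trailing six characters, assuming they are exactly '(YYYY)'."""
    # Hypothetical helper: it only works when the assumption holds.
    return title[:-6].rstrip()


stripped = [strip_last_six(t) for t in titles]
for before, after in zip(titles, stripped):
    print(f"{before!r} -> {after!r}")
```

The first title is cleaned correctly, while the second keeps a dangling `(ТВ,` fragment, which is why a fixed-length cut is not a safe cleaning rule here.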

### Approach 2: Removing the whole parenthesized part

Let's switch to another strategy: check whether every title contains a substring like `(smth)` and, if so, remove that substring.

In [9]:
def has_numbers_in_square_brackets(s):
    # Despite its name, this matches any text in parentheses, not only numbers.
    return bool(re.search(r"\(.*\)", s))

In [10]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum()
    == movie_df.shape[0]
), "Not every title has brackets with something inside"

Every title contains some information in parentheses, and we don't really care what it is. Our goal is to clean the titles, so we'll just delete the parentheses together with their contents.

In [11]:
movie_df["russian_title"] = movie_df["russian_title"].apply(
    lambda s: re.sub(r"\([^()]*\)", "", s).strip()
)
movie_df["russian_title"]

0                           1+1
1      10 причин моей ненависти
2                12 лет рабства
3                    12 обезьян
4        12 разгневанных мужчин
                 ...           
588                Я иду искать
589                 Я — легенда
590                  Я — начало
591                    Я, робот
592                      Ярость
Name: russian_title, Length: 593, dtype: object
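The check uses the greedy pattern `\(.*\)`, while the removal uses `\([^()]*\)`. The two behave differently when a title has more than one parenthesized group. The sketch below compares them; the second title is made up for illustration, and the whitespace-collapsing step in `clean` is an extra assumption that the notebook itself does not apply.

```python
import re

GREEDY = re.compile(r"\(.*\)")          # from the first "(" to the last ")"
NON_NESTED = re.compile(r"\([^()]*\)")  # each "(...)" group separately


def clean(title: str, pattern: re.Pattern) -> str:
    """Remove matches of `pattern`, then collapse leftover double spaces."""
    without_groups = pattern.sub("", title)
    return re.sub(r"\s{2,}", " ", without_groups).strip()


real_title = "Собачье сердце (ТВ, 1988)"                  # from the dataset
hypothetical_title = "Фильм (часть 1) Продолжение (2001)"  # made-up example

print(clean(real_title, NON_NESTED))
print(clean(hypothetical_title, GREEDY))
print(clean(hypothetical_title, NON_NESTED))
```

With a single trailing group both patterns give the same result. With two groups, the greedy pattern also deletes the text between them, so the non-nested pattern is the safer choice for removal.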

A quick check that no parentheses remain:

In [12]:
assert (
    movie_df["russian_title"].apply(has_numbers_in_square_brackets).sum() == 0
), "Not all brackets were deleted - check the procedure"

## Checking for duplicates

In [13]:
movie_df.drop_duplicates().shape == movie_df.shape

True

There are no duplicate rows in the dataset, so there is nothing to do here.
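Note that `drop_duplicates()` without arguments compares whole rows. As a minimal sketch on a toy frame with made-up ids (not the real dataset), here is how a check restricted to the `id` column could look using `DataFrame.duplicated(subset=...)`:

```python
import pandas as pd

# Toy frame with made-up ids: the first two rows share an id,
# but the titles differ by a trailing space.
toy_df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "russian_title": ["Брат", "Брат ", "Война"],
    }
)

# Whole-row comparison finds nothing, because the titles are not identical.
full_row_dupes = toy_df.duplicated().sum()

# Comparing only the id column flags the second row as a repeat.
id_dupes = toy_df.duplicated(subset=["id"]).sum()

print(full_row_dupes, id_dupes)
```

Whether such a stricter check is needed depends on how the dataset was collected; for this notebook the whole-row check already passes.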

## Checking for missing values

In [18]:
movie_df.isna().any()

id                False
russian_title     False
original_title     True
dtype: bool

We can see that the `original_title` column contains missing values.  
This is probably because not every movie has an `original_title`: for example, a Russian-made movie may have no separate English title.

In [19]:
movie_df[movie_df.isnull().any(axis=1)]

| | id | russian_title | original_title |
|---|---|---|---|
| 14 | 84674 | 9 рота | |
| 37 | 1000443 | Балканский рубеж | |
| 38 | 1326397 | Батя | |
| 46 | 742026 | Битва за Севастополь | |
| 56 | 41520 | Брат 2 | |
| 57 | 41519 | Брат | |
| 60 | 57166 | Бумер | |
| 86 | 41431 | Война | |
| 91 | 885316 | Время первых | |
| 127 | 259251 | Груз 200 | |


Yes, my assumption was right.  
I think it is reasonable to drop such rows.

In [23]:
# dropna() returns the frame unchanged when there is nothing to drop,
# so no explicit check is needed.
movie_df = movie_df.dropna(axis=0)

In [24]:
assert not movie_df.isna().any().any()

# Semantic similarity

In [46]:
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# Two lists of sentences
sentences1 = movie_df["russian_title"].values
sentences2 = movie_df["original_title"].values

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities (full matrix: every Russian title vs. every original title)
cosine_scores = util.cos_sim(embeddings1, embeddings2)


In [47]:
# Collect each title pair with its score (the diagonal of the similarity matrix)
rows = []
for i in range(len(sentences1)):
    rows.append(
        [sentences1[i], sentences2[i], cosine_scores[i][i].cpu().detach().numpy()]
    )

similarity_df = pd.DataFrame(
    data=rows, columns=["russian_title", "original_title", "similarity"]
)
similarity_df.sort_values(by="similarity", ascending=False, inplace=True)
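The loop above reads only the diagonal of `cosine_scores`, because entry `[i][i]` compares title `i` with its own counterpart. The following pure-Python sketch shows the same idea on toy 2-D vectors. The numbers are made up and are not model output, and `cosine` is a hypothetical helper written for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" (made-up numbers, two titles per language).
russian_vecs = [[1.0, 0.0], [0.0, 2.0]]
original_vecs = [[2.0, 0.0], [1.0, 1.0]]

# Full matrix, analogous in shape to util.cos_sim(embeddings1, embeddings2).
matrix = [[cosine(u, v) for v in original_vecs] for u in russian_vecs]

# Paired scores: entry i compares title i with its own counterpart.
paired = [matrix[i][i] for i in range(len(russian_vecs))]
print(paired)
```

The off-diagonal entries compare unrelated titles and are not needed for this analysis, although they could be useful for spotting mismatched pairs.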

In [48]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(similarity_df)

| | russian_title | original_title | similarity |
|---|---|---|---|
| 6 | 1408 | 1408 | 1.0 |
| 9 | 2012 | 2012 | 0.9999999 |
| 7 | 1917 | 1917 | 0.9999998 |
| 378 | После | After | 0.98192537 |
| 322 | Один день | One Day | 0.97902197 |
| 283 | Моана | Moana | 0.97319967 |
| 383 | Почему он? | Why Him? | 0.968497 |
| 312 | Никто | Nobody | 0.96801865 |
| 346 | Пассажиры | Passengers | 0.9671833 |
| 52 | Братья | Brothers | 0.9670155 |
