# Data anonymization

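The goal of this notebook is to anonymize the datasets: original show IDs and usernames are replaced with new integer IDs, and direct links to reviews are dropped.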

## Imports

In [1]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

In [2]:
SEED = 42

## Data

In [4]:
RELATIVE_PATH = "../data"

RAW_PATH = os.path.join(RELATIVE_PATH, "0_raw_parsed_data")
ANONYMIZED_PATH = os.path.join(RELATIVE_PATH, "1_anonymized_data")

In [5]:
Path(ANONYMIZED_PATH).mkdir(parents=True, exist_ok=True)

### Loading data

In [36]:
movies_reviews_df = pd.read_parquet(os.path.join(RAW_PATH, "movies_reviews.parquet"))
movies_reviews_df.shape

(170894, 8)

In [37]:
series_reviews_df = pd.read_parquet(os.path.join(RAW_PATH, "series_reviews.parquet"))
series_reviews_df.shape

(35643, 8)

In [38]:
movies_info_df = pd.read_parquet(os.path.join(RAW_PATH, "movies_info.parquet"))
movies_info_df.shape

(984, 43)

In [39]:
series_info_df = pd.read_parquet(os.path.join(RAW_PATH, "series_info.parquet"))
series_info_df.shape

(980, 40)

### Looking at the data

#### Movies

In [41]:
movies_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 43 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   id                                 984 non-null    int64 
 1   russian_title                      984 non-null    object
 2   original_title                     984 non-null    object
 3   actors                             984 non-null    object
 4   voice_actors                       984 non-null    object
 5   year                               984 non-null    object
 6   country                            984 non-null    object
 7   genre                              984 non-null    object
 8   slogan                             984 non-null    object
 9   director                           984 non-null    object
 10  scriptwriter                       984 non-null    object
 11  producer                           984 non-null    object
 12  operator

For the info datasets, we need to map each show `id` to a new ID.

#### Reviews

In [42]:
movies_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170894 entries, 0 to 170893
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   show_id           170894 non-null  int64 
 1   username          170894 non-null  object
 2   datetime          170894 non-null  object
 3   sentiment         170894 non-null  object
 4   subtitle          170894 non-null  object
 5   review_body       170894 non-null  object
 6   usefulness_ratio  170894 non-null  object
 7   direct_link       170894 non-null  object
dtypes: int64(1), object(7)
memory usage: 10.4+ MB


For the reviews datasets, we have to map usernames to user IDs and drop the `direct_link` column.

## Processing

### Info

#### Reading

In [43]:
movies_id = movies_info_df["id"].values
series_id = series_info_df["id"].values

len(movies_id), len(series_id)

(984, 980)

In [44]:
set(movies_id).intersection(set(series_id))

set()

Now we know that identifiers for movies and series don't intersect.

#### Merging

Let's merge them, drop duplicates, and sort them.

In [45]:
show_ids = sorted(list(set(np.concatenate((movies_id, series_id)))))
len(show_ids)

1964

#### Creating map

In [46]:
# Map each original show ID to its position in the sorted list (0, 1, 2, ...)
show_id_map = {old_id: new_id for new_id, old_id in enumerate(show_ids)}
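
As a toy illustration of what this map does, the sketch below builds the same kind of dictionary from a few made-up IDs. The function names and the sample IDs are hypothetical. The `shuffled_id_map` variant is an assumption and not something this notebook does: it hands out the new IDs in a seeded random order, so that the new IDs do not preserve the ordering of the original ones.

```python
import random

SEED = 42  # same value as the notebook's SEED constant


def sequential_id_map(values):
    """Map each distinct value to its position in sorted order (0, 1, 2, ...)."""
    return {old: new for new, old in enumerate(sorted(set(values)))}


def shuffled_id_map(values, seed=SEED):
    """Hypothetical variant: hand out the new IDs in a seeded random order."""
    unique = sorted(set(values))
    new_ids = list(range(len(unique)))
    random.Random(seed).shuffle(new_ids)
    return dict(zip(unique, new_ids))


toy_ids = [4031, 77, 4031, 1250]  # made-up show IDs
seq_map = sequential_id_map(toy_ids)
shuf_map = shuffled_id_map(toy_ids)
print(seq_map)  # {77: 0, 1250: 1, 4031: 2}
```

Both variants produce a one-to-one mapping onto `0..n-1`; the sequential one is simply easier to reproduce by eye.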

#### Updating dataset

In [47]:
movies_info_df["id"] = movies_info_df["id"].map(show_id_map)
series_info_df["id"] = series_info_df["id"].map(show_id_map)

In [48]:
movies_reviews_df["show_id"] = movies_reviews_df["show_id"].map(show_id_map)
series_reviews_df["show_id"] = series_reviews_df["show_id"].map(show_id_map)
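
`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a review whose `show_id` does not appear in either info table would silently lose its ID. Whether that happens in this dataset is not shown here; the sketch below, on made-up data, shows one way to check. The helper name `count_unmapped` is hypothetical.

```python
import pandas as pd


def count_unmapped(original: pd.Series, id_map: dict) -> int:
    """Count values of `original` that have no entry in `id_map`."""
    return int((~original.isin(list(id_map))).sum())


toy_map = {77: 0, 1250: 1, 4031: 2}         # made-up show_id map
toy_show_ids = pd.Series([77, 4031, 9999])  # 9999 is not in the map

mapped = toy_show_ids.map(toy_map)
print(mapped.isna().sum())                    # 1
print(count_unmapped(toy_show_ids, toy_map))  # 1
```

Such a check has to run on the raw column, before it is overwritten with the mapped values.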

### Reviews

#### Reading

Let's collect all usernames under which reviews were left, for both movies and series.

In [49]:
movies_reviews_users = movies_reviews_df["username"].values
series_reviews_users = series_reviews_df["username"].values

len(movies_reviews_users), len(series_reviews_users)

(170894, 35643)

#### Merging

Let's merge them, drop duplicates, and sort them.

In [50]:
users = sorted(list(set(np.concatenate((movies_reviews_users, series_reviews_users)))))
len(users)

69166

#### Checking

In [51]:
users[0]

''

An empty string is something we want to avoid.

In [52]:
movies_reviews_df[movies_reviews_df["username"] == ""].shape

(22, 8)

We actually have a `direct_link` column with a link to each review, and this link contains the `user_id`. It turns out that several different users have empty usernames.  
It is easier to use the actual `user_id` values than to generate them ourselves; we just need to extract them from the links.

#### Extracting `user_id`'s

Some reviews don't have a direct link.

So, in this case, it is easier to replace only the empty usernames with the actual `user_id` values and then encode all usernames as new user IDs.  
This way we recover the identity behind empty usernames (useful for building a recommendation system).

In [66]:
# Take the part of each link at index 4 after splitting on "/"; it holds the user ID
empty_username = movies_reviews_df["username"] == ""
movies_reviews_df.loc[empty_username, "username"] = [
    link.split("/")[4]
    for link in movies_reviews_df.loc[empty_username, "direct_link"].values
]

Let's do the same for the series reviews.

In [67]:
series_reviews_df[series_reviews_df["username"] == ""].shape

(3, 8)

In [68]:
# Same extraction as for movies: index 4 of the "/"-separated link holds the user ID
empty_username = series_reviews_df["username"] == ""
series_reviews_df.loc[empty_username, "username"] = [
    link.split("/")[4]
    for link in series_reviews_df.loc[empty_username, "direct_link"].values
]
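
The movie and series cells repeat the same logic, so it could be factored into a helper. The sketch below is an illustration under stated assumptions: the function name `fill_empty_usernames`, the toy URLs, and the choice to leave a username empty when its link has too few segments are all hypothetical. Only the index-4 position comes from the cells above.

```python
import pandas as pd


def fill_empty_usernames(df: pd.DataFrame, position: int = 4) -> pd.DataFrame:
    """Return a copy where empty usernames are replaced by a segment of direct_link."""
    df = df.copy()
    empty = df["username"] == ""
    # .str[position] yields NaN when a link has too few "/"-separated parts
    extracted = df["direct_link"].str.split("/").str[position]
    usable = empty & extracted.notna()
    # Rows with an empty username but no usable link are left unchanged
    df.loc[usable, "username"] = extracted[usable]
    return df


toy = pd.DataFrame(
    {
        "username": ["alice", "", ""],
        "direct_link": [
            "https://example.com/user/111/comment/1/",  # made-up URL layout
            "https://example.com/user/222/comment/2/",
            "",  # review without a direct link
        ],
    }
)
fixed = fill_empty_usernames(toy)
print(fixed["username"].tolist())  # ['alice', '222', '']
```

Unlike the list comprehension, this version does not raise on a link that is too short; whether skipping such rows is the right behavior depends on the data.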

#### Creating map

In [69]:
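# Re-read usernames from the updated dataframes; the arrays captured earlier
# are not guaranteed to reflect the replaced empty usernames
users = sorted(
    set(
        np.concatenate(
            (movies_reviews_df["username"].values, series_reviews_df["username"].values)
        )
    )
)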
len(users)

69172

Previously we had 69166 unique usernames (including the empty string); now we have 69172, which is 6 more. Not many, but an improvement.

In [70]:
user_map = {user: id_ for id_, user in enumerate(users)}

#### Updating dataset

In [71]:
movies_reviews_df["username"] = movies_reviews_df["username"].map(user_map)
series_reviews_df["username"] = series_reviews_df["username"].map(user_map)

Let's also rename the `username` column to `user_id`.

In [72]:
movies_reviews_df.rename({"username": "user_id"}, axis=1, inplace=True)
series_reviews_df.rename({"username": "user_id"}, axis=1, inplace=True)

And drop the `direct_link` column for the anonymized version of the dataset (the reviews schema shown above has no `review_id` column, so nothing else needs dropping).

In [73]:
movies_reviews_df.drop("direct_link", axis=1, inplace=True, errors="ignore")
series_reviews_df.drop("direct_link", axis=1, inplace=True, errors="ignore")
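
Before saving, it can be worth asserting that no identifying columns remain. The sketch below is a hypothetical check on made-up data; the helper name `assert_anonymized` and the set of forbidden columns are assumptions based on the columns discussed in this notebook.

```python
import pandas as pd

FORBIDDEN_COLUMNS = {"username", "direct_link"}  # assumed identifying columns


def assert_anonymized(df: pd.DataFrame) -> None:
    """Raise AssertionError if identifying columns remain or user_id is incomplete."""
    leftover = FORBIDDEN_COLUMNS & set(df.columns)
    assert not leftover, f"identifying columns still present: {sorted(leftover)}"
    assert df["user_id"].notna().all(), "some usernames were not mapped"
    assert pd.api.types.is_integer_dtype(df["user_id"]), "user_id is not an integer column"


toy = pd.DataFrame(
    {"show_id": [0, 1], "user_id": [5, 7], "review_body": ["good", "bad"]}
)
assert_anonymized(toy)  # passes silently on the made-up frame
```

Note that the free-text `review_body` and `subtitle` columns are not inspected by such a check.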

Finally, save the datasets.

In [74]:
movies_reviews_df.to_parquet(os.path.join(ANONYMIZED_PATH, "movies_reviews.parquet"))
series_reviews_df.to_parquet(os.path.join(ANONYMIZED_PATH, "series_reviews.parquet"))

In [75]:
movies_info_df.to_parquet(os.path.join(ANONYMIZED_PATH, "movies_info.parquet"))
series_info_df.to_parquet(os.path.join(ANONYMIZED_PATH, "series_info.parquet"))