# Gathering and Sanitizing Netflix User Data

A concern when sharing information online is giving out more than is safe. Because this report covers Netflix profile watch data, the chance of that happening here is small, but not zero. After we pull the data, we will check what the base file contains, then sanitize and/or [hash](https://www.techtarget.com/searchdatamanagement/definition/hashing) it.
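
As a rough illustration of what hashing a value could look like, the sketch below uses Python's standard `hashlib` and `hmac` modules to turn a name into a short token that is not practical to reverse. The `pseudonymize` helper, the example key, and the 12-character truncation are assumptions made for this example. They are not part of the Netflix export or of the steps used later in this report.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 12) -> str:
    """Return a keyed SHA-256 digest of `value`, truncated to `length` hex characters."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key: in practice it should stay private and out of the shared report.
secret_key = b"replace-with-a-private-key"

token_a = pseudonymize("Alice", secret_key)
token_b = pseudonymize("Alice", secret_key)
token_c = pseudonymize("Bob", secret_key)

# The same input always gives the same token; different inputs give different tokens.
print(token_a, token_b, token_c)
```

A keyed hash is used here instead of a plain one because profile names are often common first names, and an unkeyed hash of a short, guessable value can be matched by simply hashing a list of likely names.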

To download your usage information from Netflix, navigate to `https://www.netflix.com/account/getmyinfo` and submit a request for a copy of your information. After a short period of time you will receive an email from Netflix with your data as a zip archive of CSV files.

Download and unzip the file. Inside are 11 folders and two files:

![netflix-report-tree structure](./files/assets/netflix-report-tree.png 'netflix report tree')

## Viewing Activity

The first report we are looking for is `./CONTENT_INTERACTION/ViewingActivity.csv`. Move it into the `files/non-sanitized` folder.
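
This step can also be scripted. The sketch below is one possible way to do it with `pathlib` and `shutil`. It copies rather than moves, so the unzipped export stays untouched. The `stage_report` helper and the `netflix-report` folder name are hypothetical, and the demonstration runs against a throwaway directory rather than a real export.

```python
import shutil
import tempfile
from pathlib import Path


def stage_report(export_dir: Path, relative_path: str, dest_dir: Path) -> Path:
    """Copy one report from the unzipped export into dest_dir and return the new path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    source = export_dir / relative_path
    return Path(shutil.copy2(source, dest_dir / source.name))


# Demonstration with a temporary directory standing in for the unzipped export.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "netflix-report"  # hypothetical folder name
    report = export_dir / "CONTENT_INTERACTION" / "ViewingActivity.csv"
    report.parent.mkdir(parents=True)
    report.write_text("Duration,Start Time\n00:51:22,2020-05-10 01:33:45\n")

    staged = stage_report(
        export_dir,
        "CONTENT_INTERACTION/ViewingActivity.csv",
        Path(tmp) / "files" / "non-sanitized",
    )
    staged_name = staged.name
    staged_exists = staged.exists()

print(staged_name, staged_exists)
```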

We can now pull in the data and check what the columns are.

In [6]:
import pandas as pd

file_dir = r'./files'
non_sanitized_dir = fr'{file_dir}/non-sanitized'

df_views = pd.read_csv(fr'{non_sanitized_dir}/ViewingActivity.csv')
df_views.dtypes

Duration                   object
Start Time                 object
Profile Name               object
Country                    object
Bookmark                   object
Latest Bookmark            object
Supplemental Video Type    object
Attributes                 object
Device Type                object
Title                      object
dtype: object

The two columns that I might be concerned about releasing are `Profile Name` and `Device Type`. After looking at the actual data in those columns, the only one that I will be sanitizing is `Profile Name`.
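
One way to look at the actual data in a column is to count its distinct values. The snippet below is a small sketch of that check. The rows are made up for illustration and only stand in for `ViewingActivity.csv`.

```python
import pandas as pd

# Made-up rows standing in for the real viewing activity export.
sample = pd.DataFrame({
    "Profile Name": ["Alice", "Bob", "Alice"],
    "Device Type": [
        "iPad Mini 2 WiFi",
        "Google Chromecast streaming stick",
        "iPad Mini 2 WiFi",
    ],
})

# List every distinct value per column, with how often it occurs.
for column in ["Profile Name", "Device Type"]:
    counts = sample[column].value_counts()
    print(f"{column}: {len(counts)} distinct values")
    print(counts.to_string())

profile_counts = sample["Profile Name"].value_counts().to_dict()
```

Seeing the full list of distinct values makes it easier to judge whether a column contains anything identifying before deciding to keep it.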

We'll make a new DataFrame with the unique profile names, generate code names for those profiles, and save the DataFrame in the non-sanitized folder as a code name legend.

In [2]:
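# Unique profile names, in order of first appearance.
profile_names = df_views['Profile Name'].unique()
# Legend indexed by real name so it can be turned into a {real name: code name} mapping.
code_names = [f'Profile {i + 1}' for i in range(len(profile_names))]
df_profiles = pd.DataFrame(data={'Profile Name': profile_names, 'Code Name': code_names}, index=profile_names)
df_profiles.to_csv(rf'{non_sanitized_dir}/codenames.csv', index=False)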

Next we replace the values in the `Profile Name` column with the new code names.

In [7]:
# Map each real profile name (the legend's index) to its code name.
df_views['Profile Name'] = df_views['Profile Name'].replace(df_profiles['Code Name'].to_dict())
df_views.sample(n=10)

Unnamed: 0,Duration,Start Time,Profile Name,Country,Bookmark,Latest Bookmark,Supplemental Video Type,Attributes,Device Type,Title
9969,00:51:22,2020-05-10 01:33:45,Profile 1,US (United States),00:52:08,Not latest view,,Autoplayed: user action: Unspecified;,Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV,Outlander: Season 3: The Bakra (Episode 12)
3127,00:40:20,2023-12-14 14:31:02,Profile 1,US (United States),00:40:20,00:40:20,,,Vizio MG186 MT5597DV CAST INX Smart TV,My Life With the Walter Boys: Season 1: The Co...
14891,00:02:52,2012-01-10 04:03:58,Profile 1,US (United States),00:00:00,Not latest view,,,Vizio VIA-B3TP DTV,Mad Men: Season 2: The Inheritance (Episode 10)
10529,00:00:02,2020-01-07 00:26:23,Profile 1,US (United States),00:00:02,00:00:02,,Autoplayed: user action: None;,Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV,Zumbo's Just Desserts: Season 1: Episode 5 (Ep...
13650,00:22:48,2016-03-25 23:10:20,Profile 1,US (United States),00:23:01,Not latest view,,,Vizio VIA-B3TP DTV,The Powerpuff Girls: Season 2: Birthday Bash /...
11781,00:01:21,2018-10-28 16:37:30,Profile 1,US (United States),00:01:22,Not latest view,,,iPad Mini 2 WiFi,Coco
1700,00:20:51,2025-01-28 03:56:51,Profile 3,US (United States),00:20:49,00:20:49,,Autoplayed: user action: User_Interaction;,Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV,Reba: Season 4: Mother's Intuition (Episode 2)
697,00:48:20,2025-06-12 00:08:27,Profile 2,US (United States),00:48:20,00:48:20,,Autoplayed: user action: User_Interaction;,Amazon Fire TV Stick 2020 + Streaming Stick,50 First Dates
12871,00:21:39,2017-02-25 08:06:26,Profile 1,US (United States),00:21:39,Not latest view,,,Google Chromecast streaming stick,Bob's Burgers: Season 3: Family Fracas (Episod...
9730,00:06:36,2020-06-01 04:58:57,Profile 1,US (United States),00:07:55,Not latest view,,Autoplayed: user action: Unspecified;,Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV,Outlander: Season 1: Wentworth Prison (Episode...
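
For comparison, the same kind of numbering can be produced in one pass with `pd.factorize`, which assigns an integer code to each distinct value in order of first appearance. This is an alternative sketch on made-up names, not the method used above.

```python
import pandas as pd

# Made-up names standing in for the real profile names.
names = pd.Series(["Alice", "Bob", "Alice", "Carol"], name="Profile Name")

# codes[i] is the position of names[i] within uniques (missing values would get -1).
codes, uniques = pd.factorize(names)

code_names = pd.Series([f"Profile {code + 1}" for code in codes], name="Profile Name")
legend = {name: f"Profile {i + 1}" for i, name in enumerate(uniques)}

print(code_names.tolist())
print(legend)
```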


Now we save the data. We don't need the index, so we leave the index column out of the saved file.

To keep the original export intact, we save the sanitized file in a different location.

In [4]:
df_views.to_csv(rf'{file_dir}/ViewingActivity_sanitized.csv', index=False)
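
Before sharing the sanitized file, it may be worth confirming that none of the real names survived. The sketch below shows one way to check this with the standard `csv` module. The `find_leaks` helper is hypothetical, and the CSV text and names are made up so the example runs on its own.

```python
import csv
import io


def find_leaks(csv_text: str, column: str, real_names: set[str]) -> set[str]:
    """Return any real names that still appear in `column` of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader} & real_names


# Made-up data: one clean file and one where a real name slipped through.
sanitized = "Profile Name,Title\nProfile 1,Coco\nProfile 2,50 First Dates\n"
unsanitized = "Profile Name,Title\nAlice,Coco\nProfile 2,50 First Dates\n"
real_names = {"Alice", "Bob"}  # stand-ins for the names stored in the legend

print(find_leaks(sanitized, "Profile Name", real_names))
print(find_leaks(unsanitized, "Profile Name", real_names))
```

An empty result means no listed name was found in that column. It does not check other columns, such as titles or device names.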

## Billing History

The next report is `./PAYMENT_AND_BILLING/BillingHistory.csv`, which we place in the same `files/non-sanitized` folder.

In [14]:
df_billing = pd.read_csv(fr'{non_sanitized_dir}/BillingHistory.csv')
df_billing.dtypes

Transaction Date              object
Country                       object
Mop Last 4                     int64
Final Invoice Result          object
Mop Pmt Processor Desc        object
Pmt Txn Type                  object
Description                   object
Gross Sale Amt               float64
Pmt Status                    object
Payment Type                  object
Tax Amt                      float64
Service Period Start Date     object
Item Price Amt               float64
Mop Creation Date             object
Currency                      object
Next Billing Date             object
Service Period End Date       object
dtype: object

Next we remove the columns that may pose a security issue.

In [15]:
df_billing.drop(columns=['Country', 'Mop Last 4', 'Payment Type'], inplace=True)

In [16]:
df_billing.head()

Unnamed: 0,Transaction Date,Final Invoice Result,Mop Pmt Processor Desc,Pmt Txn Type,Description,Gross Sale Amt,Pmt Status,Tax Amt,Service Period Start Date,Item Price Amt,Mop Creation Date,Currency,Next Billing Date,Service Period End Date
0,2025-07-01,,,SALE,payment_transaction,24.99,NEW,,,,,USD,,
1,2025-07-01,SETTLED,HELIX,SALE,SUBSCRIPTION,24.99,APPROVED,0.0,2025-07-01,24.99,2024-03-01,USD,2025-08-01,2025-07-31
2,2025-07-01,,,SALE,payment_transaction,24.99,APPROVED,,,,,USD,,
3,2025-07-01,SETTLED,HELIX,SALE,SUBSCRIPTION,24.99,NEW,0.0,2025-07-01,24.99,2024-03-01,USD,2025-08-01,2025-07-31
4,2025-06-01,SETTLED,HELIX,SALE,SUBSCRIPTION,24.99,APPROVED,0.0,2025-06-01,24.99,2024-03-01,USD,2025-07-01,2025-06-30
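
If the same cleanup is run against exports that may not contain every listed column, `drop` can be told to skip missing ones with `errors="ignore"`, and a follow-up check can confirm nothing sensitive is left. The sketch below assumes a made-up one-row table. The `Billing Address` column is hypothetical and is only there to show a missing column being skipped.

```python
import pandas as pd

# Made-up row standing in for BillingHistory.csv.
billing = pd.DataFrame({
    "Transaction Date": ["2025-07-01"],
    "Country": ["US"],
    "Mop Last 4": [1234],
    "Payment Type": ["CC"],
    "Gross Sale Amt": [24.99],
})

# "Billing Address" is a hypothetical column that this table does not have.
sensitive = ["Country", "Mop Last 4", "Payment Type", "Billing Address"]

# errors="ignore" skips listed columns that are absent instead of raising KeyError.
cleaned = billing.drop(columns=sensitive, errors="ignore")

# Any sensitive column still present after the drop would show up here.
remaining = [column for column in sensitive if column in cleaned.columns]

print(list(cleaned.columns), remaining)
```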


We can now save the new report for later use.

In [17]:
df_billing.to_csv(rf'{file_dir}/BillingHistory_sanitized.csv', index=False)