# Tags - Data Cleaning

In [5]:
import pandas as pd

In [6]:
path = "../../data/small"
init_tags = pd.read_csv(f"{path}/tags.csv")

## Tags
All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.
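
The ordering described above can be checked before relying on it. The following is a minimal sketch, not part of the original pipeline: `is_sorted_by_user_then_movie` is a hypothetical helper, and `sample` is a small made-up frame that only mirrors the column layout of `tags.csv`.

```python
import pandas as pd

# Made-up rows in the same column layout as tags.csv (illustrative only).
sample = pd.DataFrame(
    {
        "userId": [2, 2, 7, 7],
        "movieId": [60756, 89774, 1, 48516],
        "tag": ["funny", "MMA", "pixar", "mafia"],
        "timestamp": [1445714994, 1445715207, 1137206825, 1137206900],
    }
)


def is_sorted_by_user_then_movie(df):
    """Return True if rows are ordered by userId, then by movieId within each user."""
    # A stable sort keeps ties in their original order, so an already
    # ordered frame comes back with an unchanged index.
    expected = df.sort_values(["userId", "movieId"], kind="stable")
    return df.index.equals(expected.index)


print(is_sorted_by_user_then_movie(sample))
```

On the real data, the same check could be run as `is_sorted_by_user_then_movie(init_tags)`.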

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag are determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
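
As a quick illustration of this encoding, the sketch below converts a few epoch-second values with `pd.to_datetime(..., unit="s")`. The input values are chosen for illustration and are not taken from `tags.csv`. Passing `utc=True` is optional; it makes the result timezone-aware instead of naive.

```python
import pandas as pd

# Illustrative epoch-second values (not from tags.csv).
epoch_seconds = pd.Series([0, 86_400, 1_000_000_000])

# Naive datetimes, implicitly in UTC.
as_datetime = pd.to_datetime(epoch_seconds, unit="s")

# Timezone-aware datetimes, explicitly tagged as UTC.
as_utc = pd.to_datetime(epoch_seconds, unit="s", utc=True)

print(as_datetime)
print(as_utc.dt.tz)
```

The first value maps to 1970-01-01 00:00:00 and the second to exactly one day later.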

In [7]:
def init_pipeline(df):
    # Work on a copy so later steps never mutate the raw DataFrame.
    return df.copy()


In [8]:
def adjust_dtypes(df):
    # Convert Unix epoch seconds to pandas datetime64 values.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
    return df


In [9]:
def save_csv(df):
    # Write the cleaned tags to disk, then return df so the pipe chain can continue.
    df.to_csv("../../data/clean/tags_norm.csv", index=False)
    return df
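
One thing to keep in mind when saving to CSV: the format stores no dtype information, so the `timestamp` column comes back as plain text unless it is parsed again on load. The sketch below shows the round trip using an in-memory buffer instead of the real output path; the one-row frame is made up for illustration.

```python
import io

import pandas as pd

# One made-up row with a datetime column, as adjust_dtypes would produce.
df = pd.DataFrame(
    {"tag": ["funny"], "timestamp": pd.to_datetime([1445714994], unit="s")}
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading without parse_dates leaves the column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["timestamp"])

print(plain["timestamp"].dtype)
print(parsed["timestamp"].dtype)
```

Any notebook that later loads `tags_norm.csv` would need the same `parse_dates=["timestamp"]` argument to get datetimes back.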


In [10]:
tags = (
    init_tags
    .pipe(init_pipeline)
    # .pipe(missing_values)
    .pipe(adjust_dtypes)
    .pipe(save_csv)
)
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   userId     3683 non-null   int64         
 1   movieId    3683 non-null   int64         
 2   tag        3683 non-null   object        
 3   timestamp  3683 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 115.2+ KB
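
The `missing_values` step is commented out in the pipeline and is not defined in this notebook. Since every column reports 3683 non-null values, such a step would leave this dataset unchanged. For reference, here is a minimal sketch of what it might look like, assuming the intent is to drop rows whose tag is missing or blank. The function body and the `demo` frame are hypothetical.

```python
import pandas as pd


def missing_values(df):
    """Hypothetical step: drop rows with missing keys or a missing/blank tag."""
    # Remove rows where any of the key columns or the tag is NaN/None.
    df = df.dropna(subset=["userId", "movieId", "tag"])
    # Remove rows whose tag contains only whitespace.
    df = df[df["tag"].str.strip() != ""]
    return df.reset_index(drop=True)


# Made-up rows: one valid tag, one missing tag, one blank tag.
demo = pd.DataFrame(
    {
        "userId": [1, 2, 3],
        "movieId": [10, 20, 30],
        "tag": ["funny", None, "  "],
        "timestamp": [1445714994, 1445715207, 1137206825],
    }
)

cleaned = demo.pipe(missing_values)
print(cleaned)
```

Because it takes and returns a DataFrame, the function would slot into the `.pipe(...)` chain in the same way as the other steps.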


In [13]:
tags.sample(5)

|      | userId | movieId | tag              | timestamp           |
|------|--------|---------|------------------|---------------------|
| 670  | 357    | 7444    | Mark Ruffalo     | 2012-09-26 02:52:40 |
| 2961 | 567    | 3266    | dark humor       | 2018-05-02 17:43:28 |
| 1312 | 474    | 1185    | In Netflix queue | 2006-01-14 01:10:55 |
| 979  | 462    | 152711  | murder           | 2016-11-07 03:35:01 |
| 3011 | 567    | 4878    | atmospheric      | 2018-05-02 17:36:21 |
