Tasks
1. Data Overview and Structure Analysis
- Goal: Load the dataset and understand its structure, including data types and nullability.
- Expected Outcome: Familiarity with data loading procedures and basic DataFrame inspection in both frameworks. Gain an understanding of the dataset's structure.

In [3]:
# Task 1
# Pandas
# URL = '/Users/njpate/Documents/GWU_MS_DS/bigdata/hw_2/hw_2_user_data.csv'
URL = 'hw_2_user_data.csv'
COLS = ['UserID','UserName','WatchedMovie','MovieGenre','SessionLength','LastLoginDate']

import pandas as pd

df = pd.read_csv(URL, parse_dates=['LastLoginDate'])

# The fields 'UserName' and 'MovieGenre' are parsed as dtype 'object' by pandas.
# To use them properly, we convert them to the pandas 'string' dtype.
df['UserName'] = df['UserName'].astype('string')
df['MovieGenre'] = df['MovieGenre'].astype('string')
# df


print(f'DATA TYPES: \n------------\n{df.dtypes}')
print(f'\nDATA: \n------------\n')
df

# Looking at the following output, pandas.read_csv loaded the data without extra configuration.
# The dataset has 6 data columns, plus 'Unnamed: 0', which appears to be a leftover index column from the CSV:
# ['UserName', 'MovieGenre'] are text fields, 'WatchedMovie' is a boolean flag,
# 'LastLoginDate' is a datetime field, and ['UserID', 'SessionLength'] are numerical fields.

# MovieGenre is the only field that is nullable.

DATA TYPES: 
------------
Unnamed: 0                int64
UserID                    int64
UserName         string[python]
WatchedMovie               bool
MovieGenre       string[python]
SessionLength             int64
LastLoginDate    datetime64[ns]
dtype: object

DATA: 
------------



,Unnamed: 0,UserID,UserName,WatchedMovie,MovieGenre,SessionLength,LastLoginDate
0,0,34557,Leslie Shelton,False,,175,2012-04-30
1,1,48013,Hannah Sanders,False,,1409,2020-05-30
2,2,13230,Christopher Torres,True,Adventure Drama,181,2009-11-25
3,3,18988,Christopher Stokes,True,Animation Adventure Comedy Family Fantasy Musi...,179,2022-02-15
4,4,29844,Joel Cox,False,,227,2012-06-09
...,...,...,...,...,...,...,...
199995,199995,24547,Mark Rivera,True,Drama Romance Sci-Fi,1280,2022-03-20
199996,199996,12542,Jeremy Gregory,False,,1433,2008-10-01
199997,199997,12153,Krista Bush,False,,390,2008-05-14
199998,199998,29021,Suzanne Johnson,True,Biography Drama,160,2017-03-12
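
The structure check above can also be done programmatically. The following is a minimal sketch, not the notebook's own code: it rebuilds a four-row sample from the rows displayed above and assumes the CSV starts with an unnamed index column (inferred from the 'Unnamed: 0' entry in the dtype listing). The names SAMPLE_CSV, sample, null_counts, and nullable_cols are illustrative.

```python
import io

import pandas as pd

# Rows copied from the output above. The file layout (a leading unnamed
# index column) is an assumption inferred from the 'Unnamed: 0' dtype entry.
SAMPLE_CSV = """,UserID,UserName,WatchedMovie,MovieGenre,SessionLength,LastLoginDate
0,34557,Leslie Shelton,False,,175,2012-04-30
1,48013,Hannah Sanders,False,,1409,2020-05-30
2,13230,Christopher Torres,True,Adventure Drama,181,2009-11-25
4,29844,Joel Cox,False,,227,2012-06-09
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    index_col=0,                      # reuse the saved index instead of keeping 'Unnamed: 0'
    parse_dates=['LastLoginDate'],
    dtype={'UserName': 'string', 'MovieGenre': 'string'},  # avoid the generic 'object' dtype
)

# Count missing values per column; any column with a non-zero count is nullable.
null_counts = sample.isna().sum()
nullable_cols = null_counts[null_counts > 0].index.tolist()

print(sample.dtypes)
print(null_counts)
print(nullable_cols)
```

Passing index_col=0 and dtype at load time would make the later astype calls unnecessary and keep the stray index column out of the DataFrame, assuming the real file has the layout sketched here.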


2. Basic Data Aggregation
- Goal: Calculate the total number of watched movies for each genre.
- Expected Outcome: Perform simple aggregations and understand how to execute these operations in both PySpark and Pandas.

In [10]:
# Task 2
# movie_genres = df['MovieGenre'].unique().dropna()

# First we find rows where a movie was watched,
# then group this data by genre (each distinct MovieGenre string forms one group)
# and then get the count for each genre.
df[df['WatchedMovie']]\
    .groupby(['MovieGenre'])\
    .agg({'WatchedMovie': ['count']})



MovieGenre,WatchedMovie count
Action Adventure,1557
Action Adventure Comedy Crime Thriller,1624
Action Adventure Crime Drama,1577
Action Adventure Crime Mystery Thriller,1643
Action Adventure Drama,1583
...,...
Drama Mystery Thriller,1616
Drama Romance,1608
Drama Romance Sci-Fi,1564
Mystery Sci-Fi Thriller,1587
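
The grouping above treats each combined string, such as "Drama Romance Sci-Fi", as one genre. If the task is read as asking for counts per individual genre, one possible approach is to split the string and count each label separately. This is a sketch under the assumption that genre labels are separated by single spaces and contain no spaces themselves, which is how they appear in the output; the small sample DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample; genre strings follow the format seen in the output above.
sample = pd.DataFrame({
    'WatchedMovie': [True, True, True, False],
    'MovieGenre': ['Adventure Drama', 'Drama Romance Sci-Fi', 'Biography Drama', None],
})

# Keep only rows where a movie was watched.
watched = sample[sample['WatchedMovie']]

genre_counts = (
    watched['MovieGenre']
    .dropna()          # rows without a genre cannot be split
    .str.split()       # 'Adventure Drama' -> ['Adventure', 'Drama']
    .explode()         # one row per individual genre label
    .value_counts()    # number of watched movies per label
)

print(genre_counts)
```

With this reading, a movie tagged with three genres contributes one count to each of the three labels, so the totals no longer sum to the number of watched movies.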


3. Date-Based Insights
- Goal: Identify the number of unique users who logged in during the last three months.
- Expected Outcome: Handle date operations and derive insights based on recent user activity.

In [13]:
# Task 3
import numpy as np

# df['nb_months'] = ((df.date2 - df.date1)/np.timedelta64(1, 'M'))
# Find records where today - LastLoginDate is at most ~3 months.
# For this we compute the time delta and divide it by a 30-day period to approximate the difference in months.
# Then we keep only the rows where the difference is at most 3 and count the distinct UserIDs.
df[((pd.to_datetime('today') - df['LastLoginDate']) / np.timedelta64(30, 'D')) <= 3]\
    ['UserID'].nunique()

1858
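
The count above depends on the day the cell is run, and it approximates a month as 30 days. A calendar-based cutoff is one alternative. The sketch below assumes a fixed, hypothetical reference date so the result is reproducible, and it uses a made-up four-row sample; pd.DateOffset(months=3) subtracts three calendar months.

```python
import pandas as pd

# Hypothetical sample data for illustration.
sample = pd.DataFrame({
    'UserID': [1, 1, 2, 3],
    'LastLoginDate': pd.to_datetime(
        ['2024-03-01', '2024-02-15', '2023-11-30', '2024-01-10']
    ),
})

# A fixed reference date (assumption) instead of 'today', so reruns give the same answer.
reference = pd.Timestamp('2024-03-31')
cutoff = reference - pd.DateOffset(months=3)  # three calendar months earlier

# Keep logins inside the window [cutoff, reference]; this also excludes future dates.
in_window = sample['LastLoginDate'].between(cutoff, reference)
recent_users = sample.loc[in_window, 'UserID'].nunique()

print(cutoff.date(), recent_users)
```

Here user 1 appears twice in the window but is counted once, and user 2 falls just before the cutoff.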

4. User Behavior Analysis
- Goal: Determine the average session length for users who have watched more than two movies.
- Expected Outcome: Learn to combine conditional logic with aggregation to analyze specific user behaviors.

In [15]:
# Task 4

# Filter by watched movies and group by UserID while aggregating count for 
# watched movies and avg session length. Then filter results where watched count > 2.

r = df[df['WatchedMovie']].groupby(['UserID'])\
    .agg({'WatchedMovie': ['count'], 'SessionLength': ['mean']})
r[r['WatchedMovie']['count'] > 2]

UserID,WatchedMovie count,SessionLength mean
2,3,839.000000
6,4,783.750000
9,3,788.666667
13,4,998.500000
17,3,979.666667
...,...,...
49977,3,497.666667
49983,6,453.833333
49987,5,683.400000
49989,4,825.500000
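
The table above lists one average per user. If a single figure is wanted, there are at least two reasonable readings: the mean of the per-user means, or the pooled mean over all watched sessions of the qualifying users. The sketch below shows both on a made-up sample; the column names watched_count and mean_session are illustrative, and named aggregation is used to avoid the two-level column header.

```python
import pandas as pd

# Hypothetical sample: user 1 watched 3 movies, user 2 watched 1, user 3 watched 4.
sample = pd.DataFrame({
    'UserID':        [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'WatchedMovie':  [True, True, True, True, False, True, True, True, True],
    'SessionLength': [100, 200, 300, 500, 600, 10, 20, 30, 40],
})

watched = sample[sample['WatchedMovie']]

# Named aggregation gives flat column names instead of a MultiIndex.
per_user = watched.groupby('UserID').agg(
    watched_count=('WatchedMovie', 'size'),
    mean_session=('SessionLength', 'mean'),
)
heavy = per_user[per_user['watched_count'] > 2]

# Reading 1: every qualifying user weighs the same.
mean_of_user_means = heavy['mean_session'].mean()

# Reading 2: every watched session of a qualifying user weighs the same.
pooled_mean = watched.loc[watched['UserID'].isin(heavy.index), 'SessionLength'].mean()

print(mean_of_user_means, pooled_mean)
```

The two figures differ whenever qualifying users have different numbers of sessions, so it is worth stating which one is reported.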


5. Data Enhancement
- Goal: Add a new column indicating the days since the last login for each user.
- Expected Outcome: Experience with adding computed columns and manipulating date fields.

In [7]:
# Task 5
# Add a new column with the time delta (in whole days) between today and LastLoginDate.
# .dt.days returns integers directly, so no separate cast is needed, and plain
# assignment (unlike df.insert) does not fail if the cell is run a second time.
df['DaysSinceLastLogin'] = (pd.to_datetime('today') - df['LastLoginDate']).dt.days

df

,Unnamed: 0,UserID,UserName,WatchedMovie,MovieGenre,SessionLength,LastLoginDate,DaysSinceLastLogin
0,0,34557,Leslie Shelton,False,,175,2012-04-30,4359
1,1,48013,Hannah Sanders,False,,1409,2020-05-30,1407
2,2,13230,Christopher Torres,True,Adventure Drama,181,2009-11-25,5246
3,3,18988,Christopher Stokes,True,Animation Adventure Comedy Family Fantasy Musi...,179,2022-02-15,781
4,4,29844,Joel Cox,False,,227,2012-06-09,4319
...,...,...,...,...,...,...,...,...
199995,199995,24547,Mark Rivera,True,Drama Romance Sci-Fi,1280,2022-03-20,748
199996,199996,12542,Jeremy Gregory,False,,1433,2008-10-01,5666
199997,199997,12153,Krista Bush,False,,390,2008-05-14,5806
199998,199998,29021,Suzanne Johnson,True,Biography Drama,160,2017-03-12,2582
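
Because the column is computed relative to today, its values change with every run. A minimal sketch of a reproducible variant follows, using two rows from the output above. The reference date 2024-04-06 is an assumption, chosen because it reproduces the DaysSinceLastLogin values shown for these two rows.

```python
import pandas as pd

# Two rows taken from the output above.
sample = pd.DataFrame({
    'UserID': [34557, 48013],
    'LastLoginDate': pd.to_datetime(['2012-04-30', '2020-05-30']),
})

# Assumed reference date; replace with pd.Timestamp('today').normalize() for live data.
reference = pd.Timestamp('2024-04-06')

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days as integers.
sample['DaysSinceLastLogin'] = (reference - sample['LastLoginDate']).dt.days

print(sample)
```

Storing the reference date alongside the result makes it possible to tell later which day the column was computed for.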
