# 2 Groupify

### We will now work with clustering algorithms that group Netflix users who are similar to one another.

To solve this task, you must accomplish the following stages:

### 2.1.1 Getting your data + feature engineering
1. Access the data found in this dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv('vodclickstream_uk_movies_03.csv')

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 671736 entries, 0 to 671735
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    671736 non-null  int64  
 1   datetime      671736 non-null  object 
 2   duration      671736 non-null  float64
 3   title         671736 non-null  object 
 4   genres        671736 non-null  object 
 5   release_date  671736 non-null  object 
 6   movie_id      671736 non-null  object 
 7   user_id       671736 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 41.0+ MB


In [4]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,datetime,duration,title,genres,release_date,movie_id,user_id
0,58773,2017-01-01 01:15:09,0.0,"Angus, Thongs and Perfect Snogging","Comedy, Drama, Romance",2008-07-25,26bd5987e8,1dea19f6fe
1,58774,2017-01-01 13:56:02,0.0,The Curse of Sleeping Beauty,"Fantasy, Horror, Mystery, Thriller",2016-06-02,f26ed2675e,544dcbc510
2,58775,2017-01-01 15:17:47,10530.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,7cbcc791bf
3,58776,2017-01-01 16:04:13,49.0,Vendetta,"Action, Drama",2015-06-12,c74aec7673,ebf43c36b6
4,58777,2017-01-01 19:16:37,0.0,The SpongeBob SquarePants Movie,"Animation, Action, Adventure, Comedy, Family, ...",2004-11-19,a80d6fc2aa,a57c992287


### 2.1.2
Sometimes, the features (variables, fields) are not given in a dataset but can be created from it; this is known as feature engineering. For example, the original dataset has several clicks done by the same user, so grouping data by user_id will allow you to create new features for each user:

a) Favorite genre (i.e., the genre on which the user spent the most time)

b) Average click duration

c) Time of the day (Morning/Afternoon/Night) when the user spends the most time on the platform (the time spent is tracked through the duration of the clicks)

d) Is the user an old movie lover, or is he into more recent stuff (content released after 2010)?

e) Average time spent a day by the user (considering only the days he logs in)

So, in the end, you should have for each user_id five features.

In [5]:
# Check a column for empty-string values and print the index of each one found.
# The same check is repeated for every column.

for i in dataset.index[dataset.user_id == '']:
    print(i)

No missing or empty values were found in the dataset.

#### Now let's analyze the `genres` column

In [6]:
dataset.genres 

0                                    Comedy, Drama, Romance
1                        Fantasy, Horror, Mystery, Thriller
2                                          Action, Thriller
3                                             Action, Drama
4         Animation, Action, Adventure, Comedy, Family, ...
                                ...                        
671731                                            Talk-Show
671732         Animation, Action, Adventure, Family, Sci-Fi
671733                            Action, Adventure, Sci-Fi
671734                                   Documentary, Music
671735                                        Comedy, Drama
Name: genres, Length: 671736, dtype: object

In [7]:
# split the genres column into lists of genres
dataset.genres.apply(lambda row: row.split(','))

0                                [Comedy,  Drama,  Romance]
1                   [Fantasy,  Horror,  Mystery,  Thriller]
2                                       [Action,  Thriller]
3                                          [Action,  Drama]
4         [Animation,  Action,  Adventure,  Comedy,  Fam...
                                ...                        
671731                                          [Talk-Show]
671732    [Animation,  Action,  Adventure,  Family,  Sci...
671733                        [Action,  Adventure,  Sci-Fi]
671734                                [Documentary,  Music]
671735                                     [Comedy,  Drama]
Name: genres, Length: 671736, dtype: object

Create a new column called `genres_list`

In [8]:
dataset['genres_list']=''
# checked with dataset.info() -> ok

In [9]:
dataset['genres_list'] = dataset.genres.apply(lambda row: [word.strip() for word in row.split(',')])  # strip() removes the extra spaces

In [10]:
# collect the unique genre values
unique_genres = set(dataset['genres_list'].explode())

In [11]:
unique_genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'NOT AVAILABLE',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western'}

##### First feature: Favorite genres

In [12]:
# TODO: if you can manage this one, thanks; otherwise I'll try tomorrow
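The favorite genre is not computed in this notebook yet. What follows is a minimal sketch of one possible approach, not the authors' implementation. It assumes that a click's full duration is credited to every genre listed for the title, and it runs on a small made-up `clicks` frame with the same column names as `dataset`. The function name `favorite_genre` is hypothetical; the result is called `df_fav` only to match the commented-out merge further down.

```python
import pandas as pd

# Toy click log (made-up values) with the same column names as `dataset`
clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "genres": ["Comedy, Drama", "Action", "Drama, Romance", "Horror"],
    "duration": [100.0, 50.0, 30.0, 10.0],
})

def favorite_genre(df):
    """Return one row per user with the genre that has the largest total duration."""
    # One row per (click, genre) pair
    exploded = df.assign(genre=df["genres"].str.split(",")).explode("genre")
    exploded["genre"] = exploded["genre"].str.strip()
    # Assumption: the full click duration is credited to every genre of the title
    totals = exploded.groupby(["user_id", "genre"], as_index=False)["duration"].sum()
    # Keep, for each user, the genre with the largest total duration
    best = totals.sort_values("duration", ascending=False).drop_duplicates("user_id")
    return (best[["user_id", "genre"]]
            .rename(columns={"genre": "favorite_genre"})
            .sort_values("user_id")
            .reset_index(drop=True))

df_fav = favorite_genre(clicks)
print(df_fav)
```

On the real data, rows with the `'NOT AVAILABLE'` genre would probably need to be dropped before the aggregation, and ties between genres are broken arbitrarily by this sketch.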

In [13]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 671736 entries, 0 to 671735
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    671736 non-null  int64  
 1   datetime      671736 non-null  object 
 2   duration      671736 non-null  float64
 3   title         671736 non-null  object 
 4   genres        671736 non-null  object 
 5   release_date  671736 non-null  object 
 6   movie_id      671736 non-null  object 
 7   user_id       671736 non-null  object 
 8   genres_list   671736 non-null  object 
dtypes: float64(1), int64(1), object(7)
memory usage: 46.1+ MB


##### Second feature: Average click duration

Let's analyze the column `duration`

In [14]:
# count the clicks with a negative duration
print((dataset.duration < 0).sum())

21734


In [15]:
dataset.loc[dataset['duration'] < 0, 'duration'] = np.nan
# negative durations are now NaN, so no negative values remain

In [16]:
# 'duration' is the time (in seconds) until the user clicked on another URL
# b) Average click duration
dataset.groupby(by='user_id').duration.mean().reset_index()
# users whose clicks all had a negative duration now have a NaN mean

Unnamed: 0,user_id,duration
0,00004e2862,0.000000
1,000052a0a0,2024.166667
2,000090e7c8,0.000000
3,000118a755,0.000000
4,000296842d,11044.000000
...,...,...
161913,fffd9bf758,8495.000000
161914,fffe7b777b,1785.000000
161915,fffeac83be,40606.272727
161916,ffff2c5f9e,0.000000


##### Third feature: Time of day when the user spends the most time

Convert the `datetime` strings to datetime objects

In [17]:
dataset.datetime = pd.to_datetime(dataset.datetime) # convert string to datetime

In [18]:
dataset.datetime.dt.dayofweek

0         6
1         6
2         6
3         6
4         6
         ..
671731    6
671732    6
671733    6
671734    6
671735    6
Name: datetime, Length: 671736, dtype: int64

In [19]:
# create a new column with the hour of the click
dataset['hourofday'] = dataset.datetime.dt.hour

# for each user and each hour, compute the mean duration of the clicks
df = dataset.groupby(by=['user_id','hourofday']).duration.mean().reset_index()

# one row per user: rows are sorted by hourofday in descending order, so first()
# keeps the latest hour at which the user was active (not the hour with the longest
# duration) together with the first non-null mean duration
cols = ['user_id', 'hourofday']
df_first = df.sort_values(cols, ascending=[True, False]).groupby('user_id', as_index=False).first()

df_first.head(10) # some NaN values appear; maybe it was better not to replace the negative durations?

Unnamed: 0,user_id,hourofday,duration
0,00004e2862,20,0.0
1,000052a0a0,23,1988.0
2,000090e7c8,20,0.0
3,000118a755,23,0.0
4,000296842d,22,6470.5
5,0002aab109,21,0.0
6,0002abf14f,22,0.0
7,0002d1c4b1,1,0.0
8,000499c2b6,0,
9,00051f0e1f,20,
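The task asks for the part of the day (Morning/Afternoon/Night) in which the user spends the most time, measured by click duration. A possible sketch of that feature is given here on a small made-up `clicks` frame. The hour boundaries (Morning 06-11, Afternoon 12-17, Night otherwise) and the names `part_of_day` and `preferred_part_of_day` are assumptions for illustration, not taken from the assignment.

```python
import pandas as pd

# Toy click log (made-up values) with the same column names as `dataset`
clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "datetime": pd.to_datetime([
        "2017-01-01 08:00:00", "2017-01-01 21:30:00", "2017-01-02 22:00:00",
        "2017-01-01 14:00:00", "2017-01-03 02:00:00",
    ]),
    "duration": [600.0, 1200.0, 900.0, 3000.0, 100.0],
})

def part_of_day(hour):
    # Assumed boundaries: Morning 06-11, Afternoon 12-17, Night otherwise
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    return "Night"

clicks["part_of_day"] = clicks["datetime"].dt.hour.map(part_of_day)

# Total duration per user and part of the day (NaN durations are skipped by sum)
totals = clicks.groupby(["user_id", "part_of_day"], as_index=False)["duration"].sum()

# Keep, for each user, the part of the day with the largest total duration
preferred = (totals.sort_values("duration", ascending=False)
             .drop_duplicates("user_id")
             .rename(columns={"part_of_day": "preferred_part_of_day"})
             [["user_id", "preferred_part_of_day"]]
             .sort_values("user_id")
             .reset_index(drop=True))
print(preferred)
```

Summing the durations, rather than averaging them per hour, follows the wording of the task ("the time spent is tracked through the duration of the clicks").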


##### Fourth feature: old movie lover or recent movie lover

In [20]:
dataset.release_date = pd.to_datetime(dataset.release_date, errors='coerce')

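# True when the title was released before 2010 (missing release dates count as False)
dataset['oldmovie'] = dataset.release_date.dt.year < 2010

# number of clicks per user on titles flagged and not flagged as old
df2 = dataset.groupby(['user_id', 'oldmovie'])['movie_id'].count().reset_index()

# one row per user: False sorts before True, so first() keeps False whenever the user
# clicked on at least one title not flagged as old; oldmovie is True only if every click was on an old title
df_oldnew = df2.sort_values(['user_id', 'oldmovie']).groupby('user_id', as_index=False).first()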
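An alternative reading of this feature is a majority vote: a user counts as an old movie lover when more than half of their clicks are on titles released before 2010. The sketch below illustrates that reading on made-up data. The 0.5 cut-off, the decision to ignore clicks without a release date, and the names `known`, `share_old` and `old_movie_lover` are assumptions.

```python
import pandas as pd

# Toy click log (made-up values); None stands for a missing release date
clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "release_date": pd.to_datetime(
        ["1999-03-31", "2005-06-01", "2016-03-04", "2015-06-12", None]),
})

# Assumption: clicks without a release date are ignored
known = clicks.dropna(subset=["release_date"])

# True for titles released before 2010
is_old = known["release_date"].dt.year < 2010

# Share of each user's clicks that are on old titles
share_old = is_old.groupby(known["user_id"]).mean()

# Assumption: a user is an old movie lover when the share exceeds 0.5
old_movie_lover = (share_old > 0.5).rename("oldmovie").reset_index()
print(old_movie_lover)
```

Keeping `share_old` itself as a numeric feature may be more informative for clustering than the thresholded flag.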

##### Fifth feature: Average time spent a day by the user (considering only the days he logs in)

In [21]:
# Group by 'user_id' and day of the week (0 = Monday, ..., 6 = Sunday);
# clicks made on the same weekday in different weeks fall into the same group
grouped = dataset.groupby(['user_id', dataset['datetime'].dt.dayofweek])

# For each group, time span in hours between the first and the last click
time_per_day = grouped['datetime'].transform(lambda x: (x.max() - x.min()).total_seconds() / 3600)

# Mean of these spans over all the clicks of each user
average_time_per_day = time_per_day.groupby(dataset['user_id']).mean()

# Display the result for the first 10 users
average_time_per_day.head(10)

user_id
00004e2862      0.000000
000052a0a0    198.690880
000090e7c8      0.000000
000118a755      0.125000
000296842d     13.943542
0002aab109      0.000000
0002abf14f      0.000000
0002d1c4b1      0.000000
000499c2b6      0.000000
00051f0e1f      0.000000
Name: datetime, dtype: float64
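Values such as 198 hours for a single "day" arise because clicks from different weeks share the same weekday group. A possible alternative, sketched below on made-up data, groups by calendar date and sums the click durations, so that only the days on which the user was active enter the average. Using the summed `duration` as the time spent and the names `daily` and `avg_time_day` are assumptions of this sketch.

```python
import pandas as pd

# Toy click log (made-up values); 2017-01-01 and 2017-01-08 are both Sundays
clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "datetime": pd.to_datetime([
        "2017-01-01 10:00:00", "2017-01-01 20:00:00",
        "2017-01-08 09:00:00", "2017-02-01 12:00:00",
    ]),
    "duration": [1800.0, 1800.0, 7200.0, 3600.0],
})

# Calendar date (midnight timestamp), so two different Sundays are two different days
clicks["date"] = clicks["datetime"].dt.normalize()

# Total seconds spent per user and calendar day
daily = clicks.groupby(["user_id", "date"])["duration"].sum()

# Mean over the active days of each user, converted to hours
avg_time_day = ((daily.groupby(level="user_id").mean() / 3600)
                .rename("avg_time_day")
                .reset_index())
print(avg_time_day)
```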

Now we merge the features computed so far into one dataset (the favorite genre is still missing)

In [22]:
dataset_user = df_first[['user_id', 'hourofday']].merge(df_oldnew[['user_id', 'oldmovie']],on='user_id',how='inner')\
.merge(dataset.groupby(by='user_id').duration.mean().reset_index(),on='user_id', how='inner')

In [23]:
# adding favorite_genre
#dataset_user = pd.merge(dataset_user, df_fav, on='user_id', how='left')

In [24]:
# adding avg_dur_click
average_duration = dataset.groupby(by='user_id')['duration'].mean().reset_index()
average_duration.columns = ['user_id', 'avg_dur_click']

# merging
dataset_user = pd.merge(dataset_user, average_duration, on='user_id', how='left')

In [25]:
# Resetting index to merge average_time_per_day
average_time_per_day = average_time_per_day.reset_index()
average_time_per_day.columns = ['user_id', 'avg_time_day']

# merging
dataset_user = pd.merge(dataset_user, average_time_per_day, on='user_id', how='inner')


In [26]:
dataset_user.head(10)

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day
0,00004e2862,20,True,0.0,0.0,0.0
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088
2,000090e7c8,20,False,0.0,0.0,0.0
3,000118a755,23,False,0.0,0.0,0.125
4,000296842d,22,False,11044.0,11044.0,13.943542
5,0002aab109,21,False,27875.0,27875.0,0.0
6,0002abf14f,22,False,0.0,0.0,0.0
7,0002d1c4b1,1,False,0.0,0.0,0.0
8,000499c2b6,0,True,,,0.0
9,00051f0e1f,20,False,,,0.0


### 2.1.3
Consider at least 10 additional features that can be generated for each user_id (you can use chatGPT or other LLM tools for suggesting features to create). Describe each of them and add them to the previous dataset you made (the one with five features). In the end, you should have for each user at least 15 features (5 recommended + 10 suggested by you).

In [27]:
# All the genres a user has watched

# exclude the titles that the user skipped (duration of 0 seconds)
filtered_dataset = dataset[dataset['duration'] > 0]

# Group by 'user_id' and aggregate the genres watched by each user
grouped_genres = filtered_dataset.groupby('user_id')['genres'].agg(lambda x: ', '.join(set(x)))

# Create a new DataFrame with 'user_id' and the combined genres watched
genres_per_user = grouped_genres.reset_index()

# Rename the column
genres_per_user.columns = ['user_id', 'all_watched_genres']

# Merge 
dataset_user = pd.merge(dataset_user, genres_per_user, on='user_id', how='left')
dataset_user.head(20)

# there are many NaN values

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres
0,00004e2862,20,True,0.0,0.0,0.0,
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama..."
2,000090e7c8,20,False,0.0,0.0,0.0,
3,000118a755,23,False,0.0,0.0,0.125,
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller"
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama"
6,0002abf14f,22,False,0.0,0.0,0.0,
7,0002d1c4b1,1,False,0.0,0.0,0.0,
8,000499c2b6,0,True,,,0.0,
9,00051f0e1f,20,False,,,0.0,


In [28]:
# Time spent watching each genre combination

# keep only the clicks with duration > 0
filtered_data = dataset[dataset['duration'] > 0]

# Group by user_id and genres, then calculate the total time watched for each genre by each user
genre_time_watched = filtered_data.groupby(['user_id', 'genres'])['duration'].sum().reset_index()

# Group by user_id and aggregate all genres with their total time watched for each user
user_watched_genres = genre_time_watched.groupby('user_id').apply(lambda x: {genre: time for genre, time in zip(x['genres'], x['duration'])}).reset_index(name='watching_time_per_genre')

# Merge
dataset_user = pd.merge(dataset_user, user_watched_genres, on='user_id', how='left')

In [29]:
# Does the user prefer weekdays or the weekend to watch movies?

# copy the filtered rows so that new columns can be added without a SettingWithCopyWarning
df_w = dataset[dataset['duration'] > 0].copy()

# Extract the day of the week (0 = Monday, 6 = Sunday)
df_w['day_of_week'] = df_w['datetime'].dt.dayofweek

# Function that maps Monday-Friday (0-4) to 'Weekday' and Saturday-Sunday (5-6) to 'Weekend'
def categorize_day(day):
    if day < 5:
        return 'Weekday'
    else:
        return 'Weekend'

# New column 'day_category'
df_w['day_category'] = df_w['day_of_week'].apply(categorize_day)

# Group by 'user_id' and 'day_category' and count the occurrences
day_clicks = df_w.groupby(['user_id', 'day_category']).size().unstack().fillna(0)

# Determine whether the user clicks more on weekdays or at weekends (ties count as 'Weekend')
day_clicks['weekend_or_weekday'] = 'Weekend'
day_clicks.loc[day_clicks['Weekday'] > day_clicks['Weekend'], 'weekend_or_weekday'] = 'Weekday'

# Reset the index to merge properly
user_preference = day_clicks['weekend_or_weekday'].reset_index()

# Merge
dataset_user = pd.merge(dataset_user, user_preference, on='user_id', how='left')

dataset_user.head(20)


Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday
0,00004e2862,20,True,0.0,0.0,0.0,,,
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday
2,000090e7c8,20,False,0.0,0.0,0.0,,,
3,000118a755,23,False,0.0,0.0,0.125,,,
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend
6,0002abf14f,22,False,0.0,0.0,0.0,,,
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,
8,000499c2b6,0,True,,,0.0,,,
9,00051f0e1f,20,False,,,0.0,,,


In [30]:
# Season of the year in which the user watches the most

# Extract the month to determine the season
dataset['month'] = dataset['datetime'].dt.month

# Function that maps months to seasons
def categorize_season(month):
    if month in [12, 1, 2]:  # Winter: December to February
        return 'Winter'
    elif month in [3, 4, 5]:  # Spring: March to May
        return 'Spring'
    elif month in [6, 7, 8]:  # Summer: June to August
        return 'Summer'
    else:  # Autumn: September to November
        return 'Autumn'

# Apply the function to create a new column 'season'
dataset['season'] = dataset['month'].apply(categorize_season)

# Group by 'user_id' and 'season', count occurrences
season_clicks = dataset.groupby(['user_id', 'season']).size().unstack().fillna(0)

# Determine the season with maximum clicks for each user
season_clicks['preferred_season'] = season_clicks.idxmax(axis=1)

# Reset the index to merge properly
user_season_preference = season_clicks['preferred_season'].reset_index()

# Merge
dataset_user = pd.merge(dataset_user, user_season_preference, on='user_id', how='left')

dataset_user.head(20)

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter
8,000499c2b6,0,True,,,0.0,,,,Summer
9,00051f0e1f,20,False,,,0.0,,,,Spring


In [31]:
# Does the user tend to watch newly released movies within a month of their release?

# Difference in days between the click datetime and the release date
dataset['days_since_release'] = (dataset['datetime'] - dataset['release_date']).dt.days

# Function to identify if a movie was watched within a month of release
def watched_within_month(days_since_release):
    return 1 if days_since_release <= 30 else 0

# New column 'watched_within_month'
dataset['watched_within_month'] = dataset['days_since_release'].apply(watched_within_month)

# Group by 'user_id' and count occurrences of watching a newly released movie
new_release_watch = dataset[dataset['watched_within_month'] == 1].groupby('user_id').size().reset_index()
new_release_watch.columns = ['user_id', 'watched_new_release']

# Merge
dataset_user = pd.merge(dataset_user, new_release_watch, on='user_id', how='left')

dataset_user.head(20)

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,
8,000499c2b6,0,True,,,0.0,,,,Summer,
9,00051f0e1f,20,False,,,0.0,,,,Spring,


I'm not sure which of the two versions below to choose.

In [None]:
# Number of popular movies that a user has watched

# Calculate the total number of clicks for each movie
popular = dataset.groupby('movie_id').size().reset_index()
popular.columns = ['movie_id', 'click_count']

# Define a threshold for popularity (e.g., top 25% most clicked movies considered popular)
popularity_threshold = popular['click_count'].quantile(0.75)

# Identify movies that are considered popular based on the threshold
popular['is_popular'] = popular['click_count'] >= popularity_threshold

# Merge 
dataset = pd.merge(dataset, popular[['movie_id', 'is_popular']], on='movie_id', how='left')

# Group by 'user_id' and count occurrences of watching popular movies
user_popular_movies = dataset[dataset['is_popular']].groupby('user_id').size().reset_index()
user_popular_movies.columns = ['user_id', 'watched_popular_movies']

# Merge 
dataset_user = pd.merge(dataset_user, user_popular_movies, on='user_id', how='left')

# Fill NaN values (users who didn't watch any popular movies)
dataset_user['watched_popular_movies'] = dataset_user['watched_popular_movies'].fillna(0)

dataset_user.head(20)


In [32]:
# Is the user a popular movie lover? True or False, based on a 60% threshold

# Calculate the total number of clicks for each movie
movie_popularity = dataset.groupby('movie_id').size().reset_index()
movie_popularity.columns = ['movie_id', 'click_count']

# Define a threshold for popularity (e.g., top 25% most clicked movies considered popular)
popularity_threshold = movie_popularity['click_count'].quantile(0.75)

# Identify movies that are considered popular based on the threshold
movie_popularity['is_popular'] = movie_popularity['click_count'] >= popularity_threshold

# Merge
dataset = pd.merge(dataset, movie_popularity[['movie_id', 'is_popular']], on='movie_id', how='left')

# Group by 'user_id' and count occurrences of watching popular movies
user_movie_counts = dataset.groupby(['user_id', 'movie_id'])['is_popular'].max().reset_index()
user_movie_counts = user_movie_counts.groupby('user_id')['is_popular'].value_counts().unstack(fill_value=0)

# Calculate the proportion of popular movies watched by each user
user_movie_counts['proportion_popular'] = user_movie_counts[True] / (user_movie_counts[True] + user_movie_counts[False])

# Create a column 'popular_movie_lover' indicating if the user is a popular movie lover or not
user_movie_counts['popular_movie_lover'] = user_movie_counts['proportion_popular'] > 0.6

# Reset the index to merge properly
user_movie_lover = user_movie_counts['popular_movie_lover'].reset_index()

# Merge
dataset_user = pd.merge(dataset_user, user_movie_lover, on='user_id', how='left')

# Fill NaN values (users who didn't watch any movies or popular movies)
dataset_user['popular_movie_lover'] = dataset_user['popular_movie_lover'].fillna(False)

dataset_user.head(20)


Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release,popular_movie_lover
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,,True
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,,True
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0,True
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,,False
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0,True
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,,True
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,,True
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,,True
8,000499c2b6,0,True,,,0.0,,,,Summer,,True
9,00051f0e1f,20,False,,,0.0,,,,Spring,,True


In [42]:
# Feature 7
# Fraction of the seven days of the week on which the user logs in

# Consider only the clicks with duration > 0; copy to avoid a SettingWithCopyWarning
filtered_data = dataset[dataset['duration'] > 0].copy()

# Extract day of the week (0 = Monday, 6 = Sunday)
filtered_data['day_of_week'] = filtered_data['datetime'].dt.dayofweek

# Count the distinct days of the week on which each user logged in
user_unique_days = filtered_data.groupby('user_id')['day_of_week'].nunique().reset_index()
user_unique_days.columns = ['user_id', 'unique_days_logged_in']

# Divide by 7 to get the fraction of the week's days with at least one click of duration > 0
user_unique_days['avg_days_log'] = user_unique_days['unique_days_logged_in'] / 7.0

# Merge
dataset_user = pd.merge(dataset_user, user_unique_days[['user_id', 'avg_days_log']], on='user_id', how='left')

dataset_user.head(20)


Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release,popular_movie_lover,avg_days_logged_in_x,avg_days_log_x,avg_days_logged_in_y,avg_days_log_y
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,,True,6.0,6.0,,
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,,True,2.0,2.0,0.857143,0.857143
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0,True,1.0,1.0,,
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,,False,1.0,1.0,,
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0,True,1.0,1.0,0.285714,0.285714
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,,True,1.0,1.0,0.142857,0.142857
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,,True,4.0,4.0,,
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,,True,1.0,1.0,,
8,000499c2b6,0,True,,,0.0,,,,Summer,,True,1.0,1.0,,
9,00051f0e1f,20,False,,,0.0,,,,Spring,,True,2.0,2.0,,


In [43]:
# Feature 8
# last interaction

# Find the most recent interaction datetime for each user
last_interaction = dataset.groupby('user_id')['datetime'].max().reset_index()
last_interaction.columns = ['user_id', 'last_interaction']

# Merge
dataset_user = pd.merge(dataset_user, last_interaction, on='user_id', how='left')

dataset_user.head(20)

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release,popular_movie_lover,avg_days_logged_in_x,avg_days_log_x,avg_days_logged_in_y,avg_days_log_y,last_interaction
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,,True,6.0,6.0,,,2017-12-05 20:39:15
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,,True,2.0,2.0,0.857143,0.857143,2017-06-26 18:25:42
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0,True,1.0,1.0,,,2018-03-09 20:01:40
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,,False,1.0,1.0,,,2018-06-15 03:01:15
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0,True,1.0,1.0,0.285714,0.285714,2018-12-31 20:16:23
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,,True,1.0,1.0,0.142857,0.142857,2017-05-07 20:36:52
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,,True,4.0,4.0,,,2019-03-07 22:31:29
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,,True,1.0,1.0,,,2018-01-09 01:21:31
8,000499c2b6,0,True,,,0.0,,,,Summer,,True,1.0,1.0,,,2017-08-29 00:49:57
9,00051f0e1f,20,False,,,0.0,,,,Spring,,True,2.0,2.0,,,2019-05-03 20:01:43


In [44]:
# Feature 9 
# Average time gap between consecutive clicks of the user (duration > 0 only)

df_gap = dataset[dataset['duration'] > 0]

# Sort the dataset by 'user_id' and 'datetime'
df_gap = df_gap.sort_values(['user_id', 'datetime'])

# Calculate the time difference between consecutive interactions for each user
df_gap['time_diff'] = df_gap.groupby('user_id')['datetime'].diff().dt.total_seconds()

# Calculate the average time gap (in seconds) between consecutive clicks for each user
average_time_gap = df_gap.groupby('user_id')['time_diff'].mean().reset_index()
average_time_gap.columns = ['user_id', 'avg_time_gap']

# Merge the average time gap information into 'dataset_user'
dataset_user = pd.merge(dataset_user, average_time_gap, on='user_id', how='left')

dataset_user.head(20)

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release,popular_movie_lover,avg_days_logged_in_x,avg_days_log_x,avg_days_logged_in_y,avg_days_log_y,last_interaction,avg_time_gap
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,,True,6.0,6.0,,,2017-12-05 20:39:15,
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,,True,2.0,2.0,0.857143,0.857143,2017-06-26 18:25:42,225895.2
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0,True,1.0,1.0,,,2018-03-09 20:01:40,
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,,False,1.0,1.0,,,2018-06-15 03:01:15,
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0,True,1.0,1.0,0.285714,0.285714,2018-12-31 20:16:23,26423.33
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,,True,1.0,1.0,0.142857,0.142857,2017-05-07 20:36:52,
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,,True,4.0,4.0,,,2019-03-07 22:31:29,
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,,True,1.0,1.0,,,2018-01-09 01:21:31,
8,000499c2b6,0,True,,,0.0,,,,Summer,,True,1.0,1.0,,,2017-08-29 00:49:57,
9,00051f0e1f,20,False,,,0.0,,,,Spring,,True,2.0,2.0,,,2019-05-03 20:01:43,


In [47]:
# Feature 10
# Has the user's genre preference changed over time?

# Sort the dataset by 'user_id' and 'datetime'
sorted_data = dataset.sort_values(['user_id', 'datetime'])

# Join all the distinct 'genres' strings of each user; every row of a user gets the same joined string
sorted_data['genres_list'] = sorted_data.groupby('user_id')['genres'].transform(lambda x: ','.join(set(x)))

# Compare each row with the previous row of the same user; the first row of each user
# has no predecessor, so it is always marked as changed
sorted_data['genre_change'] = sorted_data.groupby('user_id')['genres_list'].transform(lambda x: x.ne(x.shift()))

# Aggregate per user: True if any row is marked as changed
user_genre_change = sorted_data.groupby('user_id')['genre_change'].max().reset_index()

# Merge the genre change information into 'dataset_user'
dataset_user = pd.merge(dataset_user, user_genre_change, on='user_id', how='left')

dataset_user.head(20)

# very slow, but it works

Unnamed: 0,user_id,hourofday,oldmovie,duration,avg_dur_click,avg_time_day,all_watched_genres,watching_time_per_genre,weekend_or_weekday,preferred_season,watched_new_release,popular_movie_lover,avg_days_logged_in_x,avg_days_log_x,avg_days_logged_in_y,avg_days_log_y,last_interaction,avg_time_gap,genre_change
0,00004e2862,20,True,0.0,0.0,0.0,,,,Winter,,True,6.0,6.0,,,2017-12-05 20:39:15,,True
1,000052a0a0,23,False,2024.166667,2024.166667,198.69088,"Action, Horror, Sci-Fi, Thriller, Crime, Drama...","{'Action, Adventure, Comedy, Sci-Fi': 6226.0, ...",Weekday,Summer,,True,2.0,2.0,0.857143,0.857143,2017-06-26 18:25:42,225895.2,True
2,000090e7c8,20,False,0.0,0.0,0.0,,,,Spring,1.0,True,1.0,1.0,,,2018-03-09 20:01:40,,True
3,000118a755,23,False,0.0,0.0,0.125,,,,Summer,,False,1.0,1.0,,,2018-06-15 03:01:15,,True
4,000296842d,22,False,11044.0,11044.0,13.943542,"Drama, Mystery, Sci-Fi, Thriller","{'Drama, Mystery, Sci-Fi, Thriller': 77308.0}",Weekday,Winter,8.0,True,1.0,1.0,0.285714,0.285714,2018-12-31 20:16:23,26423.33,True
5,0002aab109,21,False,27875.0,27875.0,0.0,"Biography, Drama","{'Biography, Drama': 83625.0}",Weekend,Spring,,True,1.0,1.0,0.142857,0.142857,2017-05-07 20:36:52,,True
6,0002abf14f,22,False,0.0,0.0,0.0,,,,Spring,,True,4.0,4.0,,,2019-03-07 22:31:29,,True
7,0002d1c4b1,1,False,0.0,0.0,0.0,,,,Winter,,True,1.0,1.0,,,2018-01-09 01:21:31,,True
8,000499c2b6,0,True,,,0.0,,,,Summer,,True,1.0,1.0,,,2017-08-29 00:49:57,,True
9,00051f0e1f,20,False,,,0.0,,,,Spring,,True,2.0,2.0,,,2019-05-03 20:01:43,,True
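Every user comes out as `True` above because the first click of each user is compared with a missing predecessor. A hedged sketch of a variant that ignores that first click is given here on made-up data. It compares the raw `genres` string of each click with the previous click of the same user; the names `ordered`, `previous` and `changed` are illustrative.

```python
import pandas as pd

# Toy click log (made-up values) with the same column names as `dataset`
clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "datetime": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-03", "2017-01-01", "2017-01-05"]),
    "genres": ["Comedy", "Comedy", "Action, Thriller", "Drama", "Drama"],
})

ordered = clicks.sort_values(["user_id", "datetime"])

# Genres of the previous click of the same user (NaN for the first click)
previous = ordered.groupby("user_id")["genres"].shift()

# The first click of each user has no predecessor, so it is not counted as a change
changed = previous.notna() & (ordered["genres"] != previous)

# True if the user switched genres at least once
genre_change = (changed.groupby(ordered["user_id"]).any()
                .rename("genre_change")
                .reset_index())
print(genre_change)
```

Replacing `.any()` with `.sum()` would give the number of switches per user, which is a numeric feature and avoids the per-group `transform` with a Python lambda that makes the original cell slow.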


## 2.2 Choose your features (variables)!

You may notice that you have plenty of features to work with now. So, it would be best to find a way to reduce the dimensionality (reduce the number of variables to work with). You can follow the directions below to achieve it:

1. To normalise or not to normalise? That's the question. Sometimes, it is worth normalizing (scaling) the features. Explain if it is a good idea to perform any normalization method. If you think the normalization should be used, apply it to your data (look at the available normalization functions in the scikit-learn library).

2. Select one method for dimensionality reduction and apply it to your data. Some suggestions are Principal Component Analysis, Multiple Correspondence Analysis, Singular Value Decomposition, Factor Analysis for Mixed Data, Two-Steps clustering. Make sure that the method you choose applies to the features you have or modify your data to be able to use it. Explain why you chose that method and the limitations it may have.
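As an illustration of the two steps above, the sketch below standardizes a small numeric feature matrix and projects it onto its first two principal components. It is written with NumPy only so that it is self-contained; scikit-learn's `StandardScaler` and `PCA` classes provide the same operations. The feature matrix is randomly generated and its columns are hypothetical stand-ins for numeric user features. Categorical features would have to be encoded first, or handled with a method such as Factor Analysis for Mixed Data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric user features on very different scales
X = np.column_stack([
    rng.normal(2000, 500, size=200),  # e.g. average click duration (seconds)
    rng.normal(3, 1, size=200),       # e.g. active days per week
    rng.normal(0.5, 0.1, size=200),   # e.g. share of old movies
])

# Standardize: zero mean and unit variance per column, so that no feature
# dominates the principal components only because of its units
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the centered (standardized) matrix
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Share of the total variance explained by each component
explained_variance_ratio = S**2 / np.sum(S**2)

# Project the data onto the first two principal components
n_components = 2
X_reduced = X_std @ Vt[:n_components].T

print(explained_variance_ratio)
print(X_reduced.shape)
```

Without the standardization step, the first component would be driven almost entirely by the duration column, simply because its variance is orders of magnitude larger than that of the other two. PCA is also limited to linear combinations of numeric features, which is worth stating when the choice of method is explained.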