In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display, HTML

# Fix the dying kernel problem (only a problem in some installations - you can remove it, if it works without it)
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

# Numpy tasks

For a detailed reference check out: https://numpy.org/doc/stable/reference/arrays.indexing.html.

**Task 1.** Calculate the sigmoid (logistic) function on every element of the following numpy array [0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25] and print the last 5 elements. Use only vector operations.

In [6]:
import numpy as np

def sigmoid(number):
    new_a = np.exp(-(number))
    return 1 / ( 1 + new_a)

a = np.array([0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25])
a = sigmoid(a)
print(a[-5:])

[0.549834   0.47502081 0.52497919 0.68997448 0.4378235 ]


**Task 2.** Calculate the dot product of the following two vectors:<br/>
$x = [3, 1, 4, 2, 6, 1, 4, 8]$<br/>
$y = [5, 2, 3, 12, 2, 4, 17, 11]$<br/>
a) by using element-wise multiplication and np.sum,<br/>
b) by using np.dot,<br/>
c) by using np.matmul and transposition (x.T).

In [17]:
import numpy as np

x = np.array([3, 1, 4, 2, 6, 1, 4, 8])
y = np.array([5, 2, 3, 12, 2, 4, 17, 11])

multi_sum_result = np.sum(np.multiply(x, y))
print(multi_sum_result, end = "\n")

dot_result = np.dot(x, y)
print(dot_result, end = "\n")

matmul_result = np.matmul(x.T, y)
print(matmul_result, end = "\n")

225
225
225
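
A note on the np.matmul variant: for 1-D arrays `x.T` returns the array unchanged, so the transposition has no effect there. The following is a minimal sketch, assuming one wants the transposition to matter, that represents both vectors as explicit column matrices. The names `x_col`, `y_col`, `product` and `dot_value` are illustrative only.

```python
import numpy as np

x = np.array([3, 1, 4, 2, 6, 1, 4, 8])
y = np.array([5, 2, 3, 12, 2, 4, 17, 11])

# Reshape both vectors into explicit column matrices of shape (8, 1)
x_col = x.reshape(-1, 1)
y_col = y.reshape(-1, 1)

# (1, 8) @ (8, 1) -> (1, 1) matrix holding the dot product
product = np.matmul(x_col.T, y_col)
dot_value = product[0, 0]
print(dot_value)
```

With 2-D inputs the result is a 1x1 matrix, so the scalar has to be extracted by indexing.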


**Task 3.** Calculate the value of the logistic model<br/>
$$y = \frac{1}{1 + e^{-x_0 \theta_0 - \ldots - x_9 \theta_9 - \theta_{10}}}$$
for<br/>
$x = [1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, 1.4, -3.3, 0.2]$<br/>
$\theta = [2.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, -1.65]$<br/>
and print the result. Use only vector operations.

In [1]:
import numpy as np

def sigmoid(number):
    new_a = np.exp(-(number))
    return 1 / ( 1 + new_a)

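def multiplication_arrays(x, y):
    # Multiply element-wise; the extra last element of the longer array
    # (the intercept) is appended unchanged so that np.sum adds it in.
    if len(x) > len(y):
        new_a = x[:-1] * y
        new_a = np.append(new_a, x[-1])
    elif len(x) < len(y):
        new_a = x * y[:-1]
        new_a = np.append(new_a, y[-1])
    else:
        # Equal lengths: plain element-wise product, no intercept term
        new_a = x * y
    return new_a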

x = np.array([1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, 1.4, -3.3, 0.2])
y = np.array([2.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, -1.65])

result = multiplication_arrays(x, y)
result = sigmoid(np.sum(result))
print(result)

0.2417699832615572
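
The same model value can also be obtained with a single dot product. This is a minimal sketch rather than the required solution, assuming the last element of $\theta$ is the intercept $\theta_{10}$, as in the formula above. The names `theta`, `z` and `model_value` are illustrative.

```python
import numpy as np

x = np.array([1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, 1.4, -3.3, 0.2])
theta = np.array([2.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, -1.65])

# Linear part: x_0*theta_0 + ... + x_9*theta_9 + theta_10
z = np.dot(x, theta[:-1]) + theta[-1]

# Logistic function applied to the linear part
model_value = 1 / (1 + np.exp(-z))
print(model_value)
```

Slicing `theta[:-1]` separates the weights from the intercept, so no length checks are needed.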


**Task 4.** Calculate the value of the multivariate linear regression model<br/>
$$y = A x + B$$
for<br/>
$A = \begin{bmatrix} 1 & 2 & 1 \\ 3 & 0 & 1 \end{bmatrix}$<br/>
$B = \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix}$<br/>
$x = [1, 2, 3]^T$<br/>
and print the result. Use only vector and matrix operations.

In [6]:
import numpy as np

A = np.array([[1, 2, 1], [3, 0, 1]])
B = np.array([[0.2], [0.3]])
x = np.array([[1, 2, 3]])

Ax = np.matmul(A, x.T)
result = Ax + B

print(result)

[[8.2]
 [6.3]]
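
One pitfall worth illustrating: if `x` is stored as a 1-D array while `B` stays a column of shape (2, 1), NumPy broadcasting silently produces a 2x2 result instead of two values. The sketch below uses a hypothetical 1-D variant `x_flat` to show the effect and one possible way around it.

```python
import numpy as np

A = np.array([[1, 2, 1], [3, 0, 1]])
B = np.array([[0.2], [0.3]])    # shape (2, 1)
x_flat = np.array([1, 2, 3])    # shape (3,), hypothetical 1-D variant of x

# (2, 3) @ (3,) -> (2,), then (2,) + (2, 1) broadcasts to (2, 2)
broadcast_result = A @ x_flat + B
print(broadcast_result.shape)

# Flattening B keeps everything 1-D and gives the intended two values
result_flat = A @ x_flat + B.ravel()
print(result_flat)
```

Keeping `x` as a column, as in the cell above, avoids the problem because both operands of the addition then have shape (2, 1).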


# Pandas

## Load datasets

- Steam (https://www.kaggle.com/tamber/steam-video-games)

- MovieLens (https://grouplens.org/datasets/movielens/)

In [69]:
steam_df = pd.read_csv(os.path.join("data", "steam", "steam-200k.csv"), 
                       names=['user-id', 'game-title', 'behavior-name', 'value', 'zero'])

ml_ratings_df = pd.read_csv(os.path.join("data", "movielens_small", "ratings.csv"))
ml_movies_df = pd.read_csv(os.path.join("data", "movielens_small", "movies.csv"))

## Merge both MovieLens DataFrames into one

In [5]:
ml_df = pd.merge(ml_ratings_df, ml_movies_df, on='movieId')
ml_df.head(10)

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,18,1,3.5,1455209816,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
6,19,1,4.0,965705637,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
7,21,1,3.5,1407618878,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
8,27,1,3.0,962685262,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
9,31,1,5.0,850466616,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


## Pandas tasks - Steam dataset

**Task 5.** How many people made a purchase in the Steam dataset? Remember that a person could buy many games, but you need to count every person once.

In [93]:
steam_df_copy = steam_df['user-id'][steam_df['behavior-name'] == 'purchase'].drop_duplicates()
len(steam_df_copy)

12393

**Task 6.** How many people made a purchase of "The Elder Scrolls V Skyrim"?

In [61]:
steam_df_copy = steam_df[
    (steam_df['game-title'] == 'The Elder Scrolls V Skyrim') & 
    (steam_df['behavior-name'] == 'purchase')]
print(len(steam_df_copy))

717


**Task 7.** How many purchases did people make on average?

In [6]:
steam_df_copy = steam_df['user-id'][steam_df['behavior-name'] == 'purchase']
all_purchases = len(steam_df_copy)
people_count = len(steam_df_copy.drop_duplicates())

print(all_purchases/people_count)

10.45033486645687
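
The per-user counts used in these tasks can also be expressed with `nunique` and `groupby(...).size()`. The following is a sketch on a small made-up DataFrame (`toy_df` and its values are invented for illustration and are not taken from the Steam data), assuming the same column names as `steam_df`.

```python
import pandas as pd

# Invented toy data with the same column names as steam_df
toy_df = pd.DataFrame({
    'user-id': [1, 1, 2, 1],
    'game-title': ['A', 'B', 'A', 'A'],
    'behavior-name': ['purchase', 'purchase', 'purchase', 'play'],
    'value': [1.0, 1.0, 1.0, 3.5],
})

purchases = toy_df[toy_df['behavior-name'] == 'purchase']

# Number of distinct buyers
n_buyers = purchases['user-id'].nunique()

# Number of purchase rows per user, then the average and the top buyer
purchases_per_user = purchases.groupby('user-id').size()
average_purchases = purchases_per_user.mean()
top_buyer = purchases_per_user.idxmax()

print(n_buyers, average_purchases, top_buyer)
```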


**Task 8.** Who bought the most games?

In [92]:
steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'purchase', ['user-id', 'value']]
steam_df_copy = steam_df_copy.groupby('user-id').sum()
steam_df_copy = steam_df_copy.sort_values(by='value', ascending=False).reset_index()

print(steam_df_copy['user-id'][0])

62990992


**Task 9.** How many hours on average did people play "The Elder Scrolls V Skyrim"?

In [7]:
steam_df_copy = steam_df.loc[
    (steam_df['game-title'] == 'The Elder Scrolls V Skyrim') & 
    (steam_df['behavior-name'] == 'play'), ['user-id', 'value']]
people_count = len(steam_df_copy.drop_duplicates())
hours_played = steam_df_copy['value'].sum()

print(hours_played/people_count)

104.71093057607091


**Task 10.** Which games were played the most (in terms of the number of hours played)? Print the first 10 titles and respective numbers of hours.

In [14]:
steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play', ['game-title', 'value']]
steam_df_copy = steam_df_copy.groupby('game-title').sum()
steam_df_copy = steam_df_copy.sort_values(by='value', ascending=False).reset_index()
print(steam_df_copy.head(10))

                                    game-title     value
0                                       Dota 2  981684.6
1              Counter-Strike Global Offensive  322771.6
2                              Team Fortress 2  173673.3
3                               Counter-Strike  134261.1
4                   Sid Meier's Civilization V   99821.3
5                        Counter-Strike Source   96075.5
6                   The Elder Scrolls V Skyrim   70889.3
7                                  Garry's Mod   49725.3
8  Call of Duty Modern Warfare 2 - Multiplayer   42009.9
9                                Left 4 Dead 2   33596.7


**Task 11.** Which games are the most consistently played (in terms of the average number of hours played)? Print the first 10 titles and respective numbers of hours.

In [20]:
steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play', 
                             ['game-title', 'value']]
steam_df_copy1 = steam_df_copy.groupby('game-title').sum()
steam_df_copy2 = steam_df_copy.groupby('game-title').count()

steam_df_copy1['value'] = steam_df_copy1['value']/steam_df_copy2['value']
steam_df_copy1 = steam_df_copy1.sort_values(by='value', ascending=False).reset_index()
print(steam_df_copy1.head(10))

                          game-title        value
0            Eastside Hockey Manager  1295.000000
1  Baldur's Gate II Enhanced Edition   475.255556
2                    FIFA Manager 09   411.000000
3                          Perpetuum   400.975000
4              Football Manager 2014   391.984615
5              Football Manager 2012   390.453165
6              Football Manager 2010   375.048571
7              Football Manager 2011   365.703226
8                  Freaking Meatbags   331.000000
9        Out of the Park Baseball 16   330.400000
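
Dividing the summed hours by the row count is the same as taking a group mean. The sketch below, on invented toy data, computes the mean together with the number of play records per game. The record count may help judge how much weight a high average deserves. The names `toy_play_df`, `mean_hours` and `n_players` are illustrative.

```python
import pandas as pd

# Invented play records: game A has two players, game B only one
toy_play_df = pd.DataFrame({
    'game-title': ['A', 'A', 'B'],
    'value': [10.0, 30.0, 100.0],
})

summary = (
    toy_play_df.groupby('game-title')['value']
    .agg(mean_hours='mean', n_players='count')
    .sort_values(by='mean_hours', ascending=False)
    .reset_index()
)
print(summary)
```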


**Task 12\*\*.** Correct the above for the fact that zero hours played is never listed: in such a case only the purchase is recorded.

In [58]:
steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play', 
                             ['game-title', 'value']]
steam_df_copy_all = steam_df.loc[steam_df['behavior-name'] == 'purchase', 
                                  ['game-title', 'user-id']]

steam_df_copy = steam_df_copy.groupby('game-title').sum().reset_index()
steam_df_copy_all = steam_df_copy_all.groupby('game-title').count().reset_index()

steam_df_copy_all = steam_df_copy_all.set_index('game-title').join(steam_df_copy.set_index('game-title'))
steam_df_copy_all['value'] = (steam_df_copy_all['value']/steam_df_copy_all['user-id']).fillna(0)
steam_df_copy_all.pop('user-id')

steam_df_copy_all = steam_df_copy_all.sort_values(by='value', ascending=False).reset_index()
print(steam_df_copy_all.head(10))

                    game-title        value
0      Eastside Hockey Manager  1295.000000
1              FIFA Manager 09   411.000000
2                    Perpetuum   400.975000
3        Football Manager 2012   385.572500
4        Football Manager 2014   382.185000
5        Football Manager 2010   345.439474
6        Football Manager 2011   333.435294
7  Out of the Park Baseball 16   330.400000
8        Football Manager 2013   310.659615
9        Football Manager 2015   307.381013


**Task 13.** Apply the sigmoid function
$$f(x) = \frac{1}{1 + e^{-\frac{1}{100}x}}$$
to hours played and print the first 10 rows from the entire Steam dataset after this change.

In [74]:
import numpy as np

def sigmoid(number):
    new_a = np.exp(-((1/100)*number))
    return 1 / ( 1 + new_a)

steam_df_copy = steam_df.copy()
condition = (steam_df_copy['behavior-name'] == 'play')
steam_df_copy.loc[condition, 'value'] = sigmoid(steam_df_copy.loc[condition, 'value'])

print(steam_df_copy.head(10))

     user-id                  game-title behavior-name     value  zero
0  151603712  The Elder Scrolls V Skyrim      purchase  1.000000     0
1  151603712  The Elder Scrolls V Skyrim          play  0.938774     0
2  151603712                   Fallout 4      purchase  1.000000     0
3  151603712                   Fallout 4          play  0.704746     0
4  151603712                       Spore      purchase  1.000000     0
5  151603712                       Spore          play  0.537181     0
6  151603712           Fallout New Vegas      purchase  1.000000     0
7  151603712           Fallout New Vegas          play  0.530213     0
8  151603712               Left 4 Dead 2      purchase  1.000000     0
9  151603712               Left 4 Dead 2          play  0.522235     0


## Pandas tasks - MovieLens dataset

**Task 14\*.** Calculate popularity (by the number of users who watched a movie) of all genres. Print a DataFrame with two columns: genre, n_users, where n_users contains the number of users who watched a given genre. Sort all genres in descending order.

In [75]:
ml_df = pd.merge(ml_ratings_df, ml_movies_df, on='movieId')
ml_df_copy = ml_df.copy()

ml_df.head(10)

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,18,1,3.5,1455209816,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
6,19,1,4.0,965705637,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
7,21,1,3.5,1407618878,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
8,27,1,3.0,962685262,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
9,31,1,5.0,850466616,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
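
The merged frame stores all genres of a movie in one `|`-separated string, so counting users per genre needs one row per (user, genre) pair first. The following is one possible sketch on invented toy data, assuming the `userId` and `genres` column names shown above. `toy_df`, `exploded` and `genre_popularity` are illustrative names.

```python
import pandas as pd

# Invented ratings with pipe-separated genres, mimicking ml_df
toy_df = pd.DataFrame({
    'userId': [1, 2, 1, 3, 3],
    'genres': ['Adventure|Comedy', 'Comedy', 'Comedy|Drama', 'Drama', 'Comedy'],
})

# Split the genre string into a list and create one row per genre
exploded = toy_df.assign(genre=toy_df['genres'].str.split('|')).explode('genre')

# Count distinct users per genre and sort in descending order
genre_popularity = (
    exploded.groupby('genre')['userId']
    .nunique()
    .reset_index(name='n_users')
    .sort_values(by='n_users', ascending=False)
    .reset_index(drop=True)
)
print(genre_popularity)
```

Using `nunique` rather than `count` ensures that a user who rated several movies of the same genre is counted once.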


**Task 15\*.** Calculate the average rating for all genres. Print a DataFrame with two columns: genre, rating, where rating contains the average rating for a given genre. Sort all genres in descending order.

In [None]:
# Write your code here

**Task 16.** Calculate each movie's rating bias (its deviation from the mean of all movies' average ratings). Print the first 10 in the form: title, average rating, bias.

In [None]:
# Write your code here
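
As a hint for the movie rating bias described above, the computation can be sketched on invented toy data: average the ratings per movie, take the mean of those averages, and subtract. This illustrates one reading of the definition and is not a reference solution. `toy_df`, `movie_stats` and `global_mean` are illustrative names.

```python
import pandas as pd

# Invented ratings for three movies
toy_df = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'C'],
    'rating': [4.0, 5.0, 2.0, 3.0, 5.0],
})

# Average rating per movie
movie_stats = toy_df.groupby('title')['rating'].mean().reset_index(name='average_rating')

# Mean of the per-movie averages (not the mean of all individual ratings)
global_mean = movie_stats['average_rating'].mean()

# Bias: deviation of each movie's average from that mean
movie_stats['bias'] = movie_stats['average_rating'] - global_mean
print(movie_stats.head(10))
```

The same pattern applies to users by grouping on the user identifier instead of the title.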

**Task 17.** Calculate each user's rating bias (its deviation from the mean of all users' average ratings). Print the first 10 in the form: user_id, average rating, bias.

In [None]:
# Write your code here

**Task 18.** Randomly choose 10 movies and 10 users and print their interaction matrix in the form of a DataFrame with user_id as index and movie titles as columns. You can iterate over the DataFrame in this task.

In [None]:
# Write your code here

## Pandas + numpy tasks

**Task 19.** Create the entire interaction matrix for the MovieLens dataset. Print the submatrix of first 10 rows and 10 columns.

In [None]:
# Write your code here
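
One way to build a binary interaction matrix is `pd.crosstab`, sketched here on invented toy data. It assumes an interaction means "the user rated the movie at least once". `toy_df`, `interactions` and `matrix` are illustrative names.

```python
import pandas as pd

# Invented (user, movie) interactions
toy_df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
})

# Rows are users, columns are movies; clip guards against repeated pairs
interactions = pd.crosstab(toy_df['userId'], toy_df['movieId']).clip(upper=1)

# Plain numpy array for further matrix operations
matrix = interactions.to_numpy()
print(interactions.iloc[:10, :10])
```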

**Task 20.** Calculate the matrix of size (n_users, n_users) where at position (i, j) there is the number of movies watched both by user i and user j. Print the submatrix of first 10 rows and 10 columns.

In [None]:
# Write your code here
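
For a binary users-by-movies matrix $M$, the product $M M^T$ counts, at position (i, j), the movies that both user i and user j have in common. The sketch below uses an invented 3x3 matrix and is meant only to illustrate the idea. `M`, `user_overlap` and `item_overlap` are illustrative names.

```python
import numpy as np

# Invented binary interaction matrix: rows are users, columns are movies
M = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])

# (n_users, n_users): number of movies shared by each pair of users
user_overlap = M @ M.T
print(user_overlap)

# (n_items, n_items): number of users shared by each pair of movies
item_overlap = M.T @ M
print(item_overlap)
```

The diagonal of `user_overlap` holds the number of movies each user watched, and the diagonal of `item_overlap` holds the number of users per movie.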

**Task 21.** Calculate the matrix of size (n_items, n_items) where at position (i, j) there is the number of users who watched both movie i and movie j. To avoid freezing your computer by running out of RAM, use only the first 1000 items. Print the submatrix of first 10 rows and 10 columns.

In [None]:
# Write your code here