<h1 style="font-size:2em;color:#2467C0">Group By and Aggregate </h1>

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html 

In [5]:
import pandas as pd

In [7]:
ratings = pd.read_csv('ratings.csv', sep=',')
print(ratings.shape)
ratings.head(4)

(9999, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077


### **pandas.DataFrame.info()**
A summary of the dataframe.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html

In [10]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   userId     9999 non-null   int64  
 1   movieId    9999 non-null   int64  
 2   rating     9999 non-null   float64
 3   timestamp  9999 non-null   int64  
dtypes: float64(1), int64(3)
memory usage: 312.6 KB


### **Count the total number of movies for each rating score.**

In [100]:
# For each rating score, count the non-null values in every remaining column.
ratings_count_per_score = ratings.groupby('rating').count()
ratings_count_per_score

Unnamed: 0_level_0,userId,movieId,timestamp
rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.5,93,93,93
1.0,302,302,302
1.5,155,155,155
2.0,977,977,977
2.5,433,433,433
3.0,2269,2269,2269
3.5,964,964,964
4.0,2831,2831,2831
4.5,641,641,641
5.0,1334,1334,1334
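`count()` reports the number of non-null values in each remaining column, which is why the three columns above repeat the same numbers. A minimal sketch on a made-up four-row frame (the `toy` data is hypothetical, not taken from ratings.csv) contrasts it with `size()`, which returns one row count per group:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from ratings.csv; one timestamp is missing on purpose.
toy = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'rating': [4.0, 4.0, 5.0, 5.0],
    'timestamp': [100.0, np.nan, 300.0, 400.0],
})

# count(): non-null values per column, per group
per_column = toy.groupby('rating').count()

# size(): number of rows per group, regardless of missing values
per_group = toy.groupby('rating').size()

print(per_column)
print(per_group)
```

When a column has no missing values, as in the `info()` summary of *ratings*, both approaches give the same numbers.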


#### The result of `ratings.groupby('rating').count()` repeats the same count in every column, which can be confusing. We only need one count per rating score, so select a single column before grouping.

In [18]:
ratings_count_per_score = ratings[['movieId','rating']].groupby('rating').count()
ratings_count_per_score

Unnamed: 0_level_0,movieId
rating,Unnamed: 1_level_1
0.5,93
1.0,302
1.5,155
2.0,977
2.5,433
3.0,2269
3.5,964
4.0,2831
4.5,641
5.0,1334


In [6]:
ratings_count_per_score.index

Index([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0], dtype='float64', name='rating')

#### Note in the output above how "rating" becomes the index rather than a column. <br>
If you want to keep *rating* as a column, use **as_index=False** in **groupby()**.

In [7]:
ratings_count_per_score = ratings[['movieId','rating']].groupby('rating', as_index = False).count()
ratings_count_per_score

Unnamed: 0,rating,movieId
0,0.5,93
1,1.0,302
2,1.5,155
3,2.0,977
4,2.5,433
5,3.0,2269
6,3.5,964
7,4.0,2831
8,4.5,641
9,5.0,1334


In [93]:
save = ratings[['movieId','rating']].groupby('rating')
save

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x17f2ddaf0>
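The `DataFrameGroupBy` object printed above only describes how the rows are split; nothing is aggregated until a method such as `count()` or `mean()` is called. As a rough sketch on hypothetical toy data (not the real ratings), the groups can be inspected directly:

```python
import pandas as pd

# Hypothetical toy data standing in for ratings[['movieId', 'rating']]
toy = pd.DataFrame({
    'movieId': [10, 20, 30, 40],
    'rating': [3.0, 4.0, 4.0, 5.0],
})

grouped = toy.groupby('rating')

# Number of distinct groups (one per rating score)
n_groups = grouped.ngroups

# Iterating yields (group key, sub-dataframe) pairs
group_sizes = {key: len(sub_df) for key, sub_df in grouped}

# Pull out the rows that belong to a single group
four_star = grouped.get_group(4.0)

print(n_groups)
print(group_sizes)
print(four_star)
```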

In [8]:
ratings_count_per_score.index

RangeIndex(start=0, stop=10, step=1)
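If the grouping was already done with the default `as_index=True`, calling `reset_index()` afterwards should give the same layout as `as_index=False`. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not from ratings.csv
toy = pd.DataFrame({
    'movieId': [10, 20, 30, 40],
    'rating': [3.0, 4.0, 4.0, 5.0],
})

# Option 1: keep 'rating' as a regular column while grouping
flat_a = toy.groupby('rating', as_index=False).count()

# Option 2: group with the default index, then move it back into a column
flat_b = toy.groupby('rating').count().reset_index()

print(flat_a)
print(flat_b)
```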

<h1 style="font-size:2em;color:#FF6111">Practice -- groupby() </h1>
Using the ratings dataframe, create a new dataframe that shows the number of ratings for each movie.

Hint: You need to decide what to group by.


In [9]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858


In [12]:
# One row per movie; the 'rating' column holds the number of ratings that movie received.
ratings_count_per_movie = ratings[['movieId','rating']].groupby('movieId').count()
ratings_count_per_movie

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
1,22
2,9
3,4
4,1
5,4
...,...
285593,1
286897,1
287699,1
288513,1


### **Calculate the mean rating for each movie**

In [72]:
#mean_ratings_per_movie = ratings.groupby('movieId')['rating'].mean()
mean_ratings_per_movie = ratings[['rating','movieId']].groupby('movieId').mean()
mean_ratings_per_movie

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
1,3.681818
2,3.111111
3,3.250000
4,2.000000
5,2.750000
...,...
285593,3.000000
286897,5.000000
287699,5.000000
288513,5.000000
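A mean based on a single rating (movieId 4 above has only one) says little on its own, so it can help to show the count next to the mean. One possible way is `agg()` with named aggregations; the toy frame and the output column names `n_ratings` and `mean_rating` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not from ratings.csv
toy = pd.DataFrame({
    'movieId': [1, 1, 1, 2],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Each keyword names an output column: (input column, aggregation)
summary = toy.groupby('movieId').agg(
    n_ratings=('rating', 'count'),
    mean_rating=('rating', 'mean'),
)

print(summary)
```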


<h1 style="font-size:2em;color:#2467C0">Find Unique Records </h1>

### **Given the *ratings* dataframe, find out how many unique movies there are.**

In [80]:
ratings['movieId'].unique()
# Returns all unique values in the column (duplicates removed) as an array.

array([   17,    25,    29, ..., 44193, 56152, 58299])

In [82]:
print(type(ratings['movieId'].unique()))
len(ratings['movieId'].unique())

<class 'numpy.ndarray'>


3965

In [84]:
#ratings['movieId'].unique().tolist()
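`Series.nunique()` is a shorter way to get the same number as `len(...unique())`, as long as the column has no missing values (`nunique()` skips NaN by default, while `unique()` keeps it). A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not from ratings.csv
toy = pd.DataFrame({'movieId': [17, 25, 17, 29, 25]})

# Two ways to count distinct movie ids
n_via_len = len(toy['movieId'].unique())
n_via_nunique = toy['movieId'].nunique()

print(n_via_len, n_via_nunique)
```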

<h1 style="font-size:2em;color:#FF6111">Practice  -- unique() </h1>

- Read tags.csv into a dataframe called *tags* and find out how many unique tags there are.
- Store these tags into a list and print the first 15 tags.


In [41]:
tags = pd.read_csv('tags.csv', sep=',')

In [45]:
print(len(tags['tag'].unique()))
print(tags['tag'].unique())

140980
['Kevin Kline' 'misogyny' 'acrophobia' ... 'juno temple' 'Pemble'
 'Clemence Poesy']
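The second bullet asks for the tags as a list. `unique()` returns an array, so one way to finish the exercise is `tolist()` followed by slicing. The sketch below uses a few made-up rows (`tags_toy`) in place of tags.csv:

```python
import pandas as pd

# Hypothetical rows standing in for tags.csv
tags_toy = pd.DataFrame({'tag': ['sci-fi', 'funny', 'sci-fi', 'dark', 'funny']})

# Convert the array of unique tags into a plain Python list
unique_tags = tags_toy['tag'].unique().tolist()

print(len(unique_tags))
# Slicing past the end is safe, so [:15] also works for short lists
print(unique_tags[:15])
```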


<h1 style="font-size:2em;color:#FF6111">Practice -- Filtering and unique()  </h1>

- Based on the tags dataframe, show all the rows with the tag '007'.
- Find out how many unique movies are tagged as '007'.


In [59]:
# Boolean mask; named is_007 to avoid shadowing Python's built-in filter()
is_007 = tags['tag'] == '007'
tags[is_007]

Unnamed: 0,userId,movieId,tag,timestamp
14,58,49272,007,1672551409
52,58,63113,007,1672551214
89,58,136020,007,1672551090
3159,741,3082,007,1151687624
3162,741,3639,007,1151688338
...,...,...,...,...
1952490,158161,49272,007,1245865076
1961264,159300,3082,007,1524925844
1997368,162102,49272,007,1194725068
1997711,162153,2948,007,1355175387
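To answer the second bullet, the filtered rows can be reduced to the number of distinct `movieId` values. A sketch with hypothetical rows (`tags_toy`) standing in for tags.csv:

```python
import pandas as pd

# Hypothetical rows standing in for tags.csv
tags_toy = pd.DataFrame({
    'userId': [58, 58, 741, 741],
    'movieId': [49272, 63113, 49272, 3082],
    'tag': ['007', '007', '007', 'spy'],
})

# Boolean mask selecting the rows tagged '007'
is_007 = tags_toy['tag'] == '007'

# Count distinct movies among the selected rows
n_movies_007 = tags_toy.loc[is_007, 'movieId'].nunique()

print(n_movies_007)
```

The same movie can be tagged '007' by several users, which is why the number of distinct movies can be smaller than the number of filtered rows.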


<h1 style="font-size:2em;color:#EE5630">Practice -- groupby() again</h1>  

Use groupby and aggregation operations, such as `count()` and `mean()`, to  

1. Show the total number of ratings for each user, for the first 5 users only.
2. Show the mean rating for each user, for the first 5 users only.

In [74]:
#total_ratings_per_user = ratings.groupby('userId')['rating'].count()
total_ratings_per_user = ratings[['userId','rating']].groupby('userId').count()
total_ratings_per_user.head(5)

Unnamed: 0_level_0,rating
userId,Unnamed: 1_level_1
1,141
2,52
3,147
4,27
5,33


In [76]:
#mean_ratings_per_user = ratings.groupby('userId')['rating'].mean()
mean_ratings_per_user = ratings[['userId','rating']].groupby('userId').mean()
mean_ratings_per_user.head(5)

Unnamed: 0_level_0,rating
userId,Unnamed: 1_level_1
1,3.531915
2,4.269231
3,3.588435
4,2.62963
5,3.272727
