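## MovieLens Movie Data Analysis

Instructor: 胡俊峰

Teaching assistants in charge: 苏亚鲁, 李浩然

Acknowledgments: 孙睿涵, 刘宇川, 李砺涵

Note: submit only the .ipynb file. Please do **not** include any other files from the distributed archive.

Deadline: May 10, 24:00

### Part 1: Viewing preferences of male and female users (4 points)

#### Load the MovieLens 1M data (`data` directory)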

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Reading ratings file
ratings = pd.read_csv('data/ratings2.csv',  encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file
users = pd.read_csv('data/users.csv', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('data/movies.csv',  encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [None]:
ratings[int(1e6):int(1e6+10)]

In [None]:
# Reading movies info file
movies_info = pd.read_csv('data/info.csv',  encoding='latin-1', usecols=['id', 'name', 'genre','intro','directors','starts', 'release_time'])
movies_info.rename(columns ={ 'id':'movie_id', 'starts': 'stars'}, inplace = True)
movies_info

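#### 1.1 Using the viewing and rating information, select the top 20 relatively popular movies preferred by male viewers and by female viewers.

In [None]:
# Keep only movies with more than 300 ratings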
popular = ratings['movie_id'].value_counts()
popular = popular[popular > 300]
popular = popular.rename('count')
popular = popular.rename_axis('movie_id')
print("Popular:")
display(popular)

In [None]:
import matplotlib.pyplot as plt

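# How should preference be measured? The mean rating that male/female viewers give a movie is clearly the most important signal,
# but the difference between the male and female mean ratings is a more reasonable measure of "preference" than either mean on its own.
# The total number of ratings should also carry weight: if a movie with very many ratings still shows a clear gender gap,
# that is stronger evidence of a gender-preferred movie, because rating noise and other non-intrinsic causes of the gap matter less, so the result is more reliable.
# This is only an auxiliary signal and should not dominate the final score, so I multiply by ln(total number of ratings).
# The final scoring function is therefore (male mean rating - female mean rating) * ln(total number of ratings).

# I do not use each movie's separate male and female rating counts here,
# because the numbers of male and female users differ, which complicates matters; even after normalization it is hard to avoid distorting the result.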

gender_users = pd.merge(users, ratings, on='user_id', how='outer')
gender_users = gender_users[gender_users['movie_id'].isin(popular.index)] # 只保留popular的电影

female_users = gender_users[gender_users['gender'] == 'F']
male_users = gender_users[gender_users['gender'] == 'M']

# Compute the mean rating per movie for each gender
female_mean_rating = female_users.groupby('movie_id')['rating'].mean()
male_mean_rating = male_users.groupby('movie_id')['rating'].mean()
print("Female mean rating:")
display(female_mean_rating)
print("\nMale mean rating:")
display(male_mean_rating)

# Compute the scoring function
score = male_mean_rating - female_mean_rating
score = pd.merge(score, popular, on='movie_id', how='outer') # add the total rating count as a column
score['score'] = score['rating'] * np.log(score['count']) # the 'score' column holds the scoring function
print("\nScore:")
display(score)
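
To make the scoring function concrete, here is a minimal sketch on made-up data. The `toy` table and its values are hypothetical and only illustrate how (male mean rating - female mean rating) * ln(total number of ratings) behaves; they are not taken from the MovieLens files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: three movies rated by male (M) and female (F) users.
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'gender':   ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    'rating':   [5, 4, 3, 2, 4, 4, 2, 3, 5, 4],
})

# Mean rating per movie, one column per gender.
means = toy.pivot_table(index='movie_id', columns='gender', values='rating', aggfunc='mean')
# Total number of ratings per movie.
counts = toy.groupby('movie_id')['rating'].count()

# (male mean - female mean) * ln(total number of ratings)
toy_score = (means['M'] - means['F']) * np.log(counts)
print(toy_score.sort_values(ascending=False))
```

Positive scores lean male, negative scores lean female, and a movie rated identically by both groups scores zero regardless of how many ratings it has.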

In [None]:
# Take the 20 highest-scoring and 20 lowest-scoring movies as the male-preferred and female-preferred movies, respectively
male_preference = score.nlargest(20, 'score').reset_index()
female_preference = score.nsmallest(20, 'score').reset_index()
female_preference['score'] = - female_preference['score']

In [None]:
# Show the details of the selected movies
print("Top 20 movies preferred by male viewers:")
male_preference_detail = pd.merge(male_preference, movies, on='movie_id', how='inner')
male_preference_detail = male_preference_detail.drop(['rating', 'count', 'score', 'movie_id'], axis=1) # drop columns that need not be shown
male_preference_detail['ranking'] = male_preference_detail.index + 1
male_preference_detail = male_preference_detail.set_index('ranking') # add a 'ranking' column and use it as the index
display(male_preference_detail)

print("\nTop 20 movies preferred by female viewers:")
female_preference_detail = pd.merge(female_preference, movies, on='movie_id', how='inner')
female_preference_detail = female_preference_detail.drop(['rating', 'count', 'score', 'movie_id'], axis=1) # drop columns that need not be shown
female_preference_detail['ranking'] = female_preference_detail.index + 1
female_preference_detail = female_preference_detail.set_index('ranking') # add a 'ranking' column and use it as the index
display(female_preference_detail)

#### Food for thought: can you think of a better statistic for describing "popularity"? Implement it (optional)

In [None]:
# TODO
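
One possible direction, offered only as a sketch: shrink each movie's mean rating toward the global mean, so that movies with few ratings cannot rank highly on a handful of votes. The helper `weighted_rating`, its `min_votes_quantile` parameter, and the toy data below are all hypothetical; the choice of prior weight is an arbitrary assumption.

```python
import pandas as pd

def weighted_rating(ratings_df, min_votes_quantile=0.5):
    """Shrink each movie's mean rating toward the global mean (hypothetical helper)."""
    stats = ratings_df.groupby('movie_id')['rating'].agg(['mean', 'count'])
    global_mean = ratings_df['rating'].mean()
    # Prior weight m: an assumed choice, here a quantile of the rating counts.
    m = stats['count'].quantile(min_votes_quantile)
    v, r = stats['count'], stats['mean']
    stats['weighted'] = (v * r + m * global_mean) / (v + m)
    return stats.sort_values('weighted', ascending=False)

# Toy data: movie 1 has a single 5-star rating, movie 2 has ten high ratings.
toy_ratings = pd.DataFrame({
    'movie_id': [1] + [2] * 10 + [3] * 4,
    'rating':   [5] + [5, 5, 4, 5, 4, 5, 5, 4, 5, 5] + [2, 3, 2, 3],
})
ranked = weighted_rating(toy_ratings)
print(ranked)
```

On the toy data, movie 1 has the highest raw mean, but movie 2 ranks first after shrinkage because its high mean is backed by many more ratings.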

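#### 1.2 For each movie genre (`genres`), analyze male and female preference statistically and compare the results graphically.

The task consists of the following steps:

(1) Data preprocessing: read and merge the tables

(2) Split `genres` and build an indicator matrix

(3) Compute the mean, standard deviation, and other statistics of male and female ratings for each genre, then compare them visually.

#### Data preprocessing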

In [None]:
data = pd.merge(ratings, users, how='outer')
data = pd.merge(data, movies, how='outer')
data

In [None]:
data_male = data[data.gender=='M']
data_female = data[data.gender=='F']
female_count = data_female.shape[0]
male_count = data_male.shape[0]
data_male.shape, data_female.shape

#### Split `genres` and build the indicator matrix

In [None]:
# See how many genres there are in total
genre_list = []
for i in movies.genres:
    genre = i.split(sep='|')
    genre_list += genre
genre_list = list(set(genre_list))
genre_list

In [None]:
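# Mark the genres of each movie with 0/1 indicator columns
num_movies = movies.shape[0]
for genre in genre_list:
    movies[genre] = 0
for i in movies.index:
    genre = movies.loc[i].genres.split(sep='|')
    for j in genre:
        movies.loc[i, j] = 1  # .loc avoids chained assignment, which may write to a temporary copy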
movies
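
As an alternative to the explicit loops, pandas can build the same kind of 0/1 indicator matrix in one call with `Series.str.get_dummies`. The sketch below uses a hypothetical `toy_movies` table rather than the real `movies` file.

```python
import pandas as pd

# Hypothetical toy table with the same 'genres' format (values separated by '|').
toy_movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'genres': ['Animation|Comedy', 'Action', 'Action|Comedy|Drama'],
})

# One 0/1 column per genre; columns come out in sorted order.
indicators = toy_movies['genres'].str.get_dummies(sep='|')
toy_movies = toy_movies.join(indicators)
print(toy_movies)
```

This avoids writing to the DataFrame cell by cell, which matters for speed once the table has thousands of rows.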

In [None]:
data_2 = pd.merge(ratings, users, how='outer')
data_2 = pd.merge(data_2, movies, how='outer')  # attach the genre indicator columns to the ratings/users table
data_2.shape

In [None]:
data_2_male = data_2[data_2.gender=='M']
data_2_female = data_2[data_2.gender=='F']

data_2_male.head()

In [None]:
# Initialize a table for comparing male and female statistics across genres
df_2 = pd.DataFrame(np.zeros((len(genre_list), 8)), index=genre_list, columns=[['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female'], ['mean', 'std', 'count', 'portion', 'mean', 'std', 'count', 'portion']])
df_2

In [None]:
# Collect the mean and standard deviation for each genre and standardize the ratings within each genre; also collect the male and female ratings for each genre
genre_rating_male = []
genre_rating_female = []
for i in genre_list:
    data_2_genre_m = data_2_male[data_2_male[i]==1].copy()  # copy so the assignment below does not write to a slice
    m_mean = data_2_genre_m.rating.mean()
    m_std = data_2_genre_m.rating.std()
    data_2_genre_m.rating = (data_2_genre_m.rating - m_mean) / m_std
    df_2.loc[i, ('Male', 'mean')] = m_mean
    df_2.loc[i, ('Male', 'std')] = m_std
    df_2.loc[i, ('Male', 'count')] = data_2_genre_m.shape[0]
    df_2.loc[i, ('Male', 'portion')] = df_2.loc[i, ('Male', 'count')] / male_count
    genre_rating_male.append(data_2_genre_m.rating.to_list())


    # Collect the female rating data (4 points)
    # TODO



df_2
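
The per-genre statistics can also be computed with `groupby` instead of filling the table cell by cell. The following is a sketch on a hypothetical `toy` table laid out like `data_2` (one row per rating plus 0/1 genre columns); it is an illustration of the idea, not the expected solution to the TODO above.

```python
import pandas as pd

# Hypothetical toy table: one row per rating, with 0/1 indicator columns for two genres.
toy = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F', 'M', 'F'],
    'rating': [4, 2, 5, 3, 5, 4],
    'Action': [1, 1, 1, 0, 0, 1],
    'Drama':  [0, 1, 1, 1, 1, 0],
})
toy_genres = ['Action', 'Drama']

# For each genre, aggregate the ratings of the movies in that genre by gender.
per_genre = {
    g: toy[toy[g] == 1].groupby('gender')['rating'].agg(['mean', 'std', 'count'])
    for g in toy_genres
}
# Rows: genre; columns: (statistic, gender).
summary = pd.concat(per_genre).unstack(level=1)
print(summary)
```

A movie that belongs to several genres contributes its ratings to each of them, which matches how the loop above treats the indicator columns.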

#### Visualize the results

In [None]:
import matplotlib.pyplot as plt
df_2.plot.barh(y=[('Male', 'mean'), ('Female', 'mean')], color=['cyan', 'tab:pink'])
plt.title('Rating by Genres')
plt.xlim(3, 4.5)
plt.legend(['Male', 'Female'])
plt.savefig('ratings by genres.png')

In [None]:
df_2.sort_values(by=('Male', 'count'), inplace=True, ascending=False)
df_2.plot.bar(y=('Male', 'count'), color=(0, 128/255, 128/255))
plt.title('Watched by Genres: Male')
plt.savefig('counts by genres male.png')  # no ':' in the file name, which Windows does not allow

In [None]:
df_2.sort_values(by=('Female', 'count'), inplace=True, ascending=False)
df_2.plot.bar(y=('Female', 'count'), color='orange')
plt.title('Watched by Genres: Female')
plt.savefig('counts by genres female.png')  # no ':' in the file name, which Windows does not allow

In [None]:
df_2.plot.bar(y=[('Male', 'portion'), ('Female', 'portion')], color=[(0, 128/255, 128/255), 'orange'])
plt.title('Watched by Genres')
plt.legend(['Male', 'Female'])
plt.savefig('counts by genres.png')

### Part 2: Using the viewing and rating information, implement the KNN (K-Nearest Neighbor) algorithm by hand to predict viewers' age and gender (6 points)

In [None]:
from sklearn import model_selection
from sklearn.decomposition import PCA

ratings_user_stats2 = ratings.groupby('user_id').agg({'rating':'mean','movie_id':'count'})
ratings_user_stats2.rename(columns={'rating':'rating_average','movie_id':'movie_count'}, inplace = True)
# Keep users who have rated more than 100 movies
ratings_user_stats2 = ratings_user_stats2[ratings_user_stats2['movie_count'] > 100]
# Turn user_id back into a column, attach the user attributes, and keep only the columns used below
ratings_user_stats2 = ratings_user_stats2.reset_index().merge(users, on='user_id', how='left')[['user_id', 'rating_average', 'gender', 'age_desc', 'movie_count']]
ratings_user_stats2

In [None]:
from collections import Counter
from sklearn.model_selection import train_test_split
import numpy as np
import random
np.random.seed(0)
X_gender = ratings_user_stats2.loc[:, ['rating_average', 'movie_count']].values
X_age = ratings_user_stats2.loc[:, ['rating_average']].values
y_age = ratings_user_stats2.loc[:, 'age_desc'].values
y_gender = ratings_user_stats2.loc[:, 'gender'].values
X_train_age, X_test_age, y_train_age, y_test_age = train_test_split(X_age, y_age, test_size = 0.2)
X_train_gender, X_test_gender, y_train_gender, y_test_gender = train_test_split(X_gender, y_gender, test_size = 0.2)

In [None]:
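def knn_classify(X, y, testInstance, k):
    # Implement the KNN algorithm by hand; use vectorized operations for efficiency (6 points)
    # Four steps: compute the distances between testInstance and the samples in X; find the indices of the k nearest samples; look up the k corresponding y values; return the result of the vote
    # TODO
    raise NotImplementedError  # placeholder so the cell parses until the TODO is filled in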
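
The four steps can be sketched as follows. This is only an illustration under stated assumptions: the name `knn_classify_sketch` is hypothetical, Euclidean distance is an assumed metric (the assignment does not fix one), and ties in the vote are broken by whichever label `Counter` encounters first.

```python
from collections import Counter
import numpy as np

def knn_classify_sketch(X, y, test_instance, k):
    """Illustrative k-nearest-neighbour vote with Euclidean distance (assumed metric)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # 1. Distances from the test instance to every training row, via broadcasting.
    diff = X - np.asarray(test_instance, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # 2. Indices of the k smallest distances (argpartition avoids a full sort).
    k = min(k, len(X))
    nearest_idx = np.argpartition(distances, k - 1)[:k]
    # 3. Labels of those k neighbours.
    nearest_labels = y[nearest_idx]
    # 4. Majority vote.
    return Counter(nearest_labels.tolist()).most_common(1)[0][0]

# Toy data: two well-separated clusters.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ['F', 'F', 'F', 'M', 'M', 'M']
print(knn_classify_sketch(X_toy, y_toy, [0.2, 0.1], 3))
print(knn_classify_sketch(X_toy, y_toy, [5.5, 5.2], 3))
```

Broadcasting computes all distances in one array operation, so no Python loop over the training rows is needed.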

In [None]:
# The values of k were chosen by trial and error
predictions_age = [knn_classify(X_train_age, y_train_age, data, 13) for data in X_test_age]
predictions_gender = [knn_classify(X_train_gender, y_train_gender, data, 24) for data in X_test_gender]

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

print(accuracy_score(y_test_age, predictions_age)) 
print(recall_score(y_test_age, predictions_age, average='micro')) 
print(accuracy_score(y_test_gender, predictions_gender)) 
print(recall_score(y_test_gender, predictions_gender, average='micro')) 
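
A note on reading these numbers: when every sample has exactly one label, micro-averaged recall pools true positives over all classes and therefore equals accuracy, so the two printed values per task coincide. The sketch below checks this by hand with NumPy on hypothetical labels and contrasts it with macro-averaged recall, which weights every class equally.

```python
import numpy as np

# Hypothetical imbalanced labels: class 'A' is much more frequent than 'B'.
y_true = np.array(['A', 'A', 'A', 'A', 'B', 'B'])
y_pred = np.array(['A', 'A', 'A', 'A', 'A', 'B'])

accuracy = np.mean(y_true == y_pred)

labels = np.unique(y_true)
# True positives and number of true samples (support) for each class.
tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in labels])
support = np.array([np.sum(y_true == c) for c in labels])

micro_recall = tp.sum() / support.sum()   # pooled over classes
macro_recall = np.mean(tp / support)      # unweighted mean of per-class recall

print(accuracy, micro_recall, macro_recall)
```

With imbalanced classes such as gender in this dataset, macro-averaged recall can reveal a classifier that mostly predicts the majority class.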

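#### Food for thought: the predictions of this KNN implementation are not very good. Do you have a strategy for improving them? (optional)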

In [None]:
# TODO
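
One candidate improvement, sketched under assumptions: if the distance is Euclidean, `movie_count` (in the hundreds) dominates `rating_average` (between 1 and 5) in the gender features, so standardizing each feature may help. The helper `standardize` and the toy values are hypothetical; whether this actually raises accuracy on this data has not been checked here.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using statistics from the training set only (hypothetical helper)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Hypothetical values on the same scales as rating_average and movie_count.
X_train_toy = np.array([[3.5, 120.0], [4.0, 800.0], [3.0, 150.0], [4.5, 600.0]])
X_test_toy = np.array([[3.8, 400.0]])

X_train_std, X_test_std = standardize(X_train_toy, X_test_toy)
print(X_train_std)
print(X_test_std)
```

The test set is scaled with the training mean and standard deviation so that no information from the test set leaks into the preprocessing.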