# Mentored Enterprise Training Program, Business Intelligence Track, Cohort 004, Lesson 3

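### Thinking 1: How can user tags guide (and improve) the business?

User tags usually fall into four groups, summed up by the "eight-character principle" 用户消费行为分析 (user, consumption, behavior, analysis):
1. User tags: attributes of the user that rarely change, such as gender, age, region, income, education, and occupation.
2. Consumption tags: the user's spending habits or tendencies, such as purchase habits, purchase intent, and sensitivity to promotions.
3. Behavior tags: the user's behavioral habits or tendencies, such as active time of day, frequency, session length, favorites, clicks, likes, and ratings.
4. Content analysis: analysis of the content the user usually browses, such as sports, games, or gossip.

Tags describe each user's characteristics, so users with different characteristics can be shown targeted recommendations for content they are likely to find interesting.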

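### Thinking 2: Given a pile of unlabeled user data, how would you handle it (how would you assign tags)?

Tags typically come from two sources:
* PGC: produced by experts (professionally generated content)
* UGC: produced by ordinary users (user-generated content)

For unlabeled data, cluster the users with methods such as K-Means, EM clustering, Mean-Shift, DBSCAN, or hierarchical clustering (PCA can reduce the dimensionality first), then have a person inspect each cluster and define what its tag means.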
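The cluster-then-label idea can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the two features per user, the tiny `kmeans` helper, the first-k-points initialization, and the threshold used to name the clusters are all assumptions made for the example. In practice the features would be scaled first and a library implementation would be used.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny K-Means: returns (centroids, cluster index of every point)."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Hypothetical features per user: (orders per month, average order value).
users = [(1, 20), (12, 25), (2, 22), (11, 30), (1, 18), (13, 28)]
centroids, assignment = kmeans(users, k=2)

# Manual step: a person inspects each centroid and decides what the cluster means.
labels = ["frequent buyer" if c[0] > 6 else "occasional buyer" for c in centroids]
user_labels = [labels[a] for a in assignment]
print(user_labels)
```

The algorithm only groups similar users; the meaning of each group ("frequent buyer" here) still comes from a human reading the cluster centers.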

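### Thinking 3: How do accuracy and precision differ (evaluation metrics)?

$$ \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$

$$ \text{precision} = \frac{TP}{TP + FP} $$
 
Here TP (true positives) are positive samples predicted positive; FP (false positives) are negative samples predicted positive; TN (true negatives) are negative samples predicted negative; and FN (false negatives) are positive samples predicted negative.  
Accuracy: the share of all predictions that are correct.  
Precision: the share of positive predictions that are actually positive.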
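A small worked example makes the gap between the two metrics concrete. The confusion-matrix counts below are invented for illustration: 100 samples, only 10 of them positive, and a model that predicts 5 positives of which 2 are right.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp)

# Hypothetical counts for an imbalanced problem (10 positives in 100 samples).
tp, fp, fn, tn = 2, 3, 8, 87

acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
print("accuracy = %.2f, precision = %.2f" % (acc, prec))
```

Accuracy comes out at 0.89 because the many true negatives dominate, while precision is only 0.40. On imbalanced data a high accuracy can therefore hide poor positive predictions.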

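### Thinking 4: Suppose you use Dianping and want to tag a restaurant, and the system can suggest tags automatically. How would you design it (tag recommendation)?

1. The most popular tags across all restaurants
2. The most popular tags on this restaurant
3. The most popular tags on restaurants of the same type
4. Tags frequently used by users who like this restaurant
5. Tags the current user frequently uses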
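One way to combine sources like these is a weighted blend of normalized tag counts. The sketch below is only an illustration: the `suggest_tags` helper, the three sources shown (out of the five listed), the sample counts, and the weights are all hypothetical.

```python
from collections import Counter

def suggest_tags(sources, weights, top_n=3):
    """Blend several tag-count sources into one ranked suggestion list."""
    scores = Counter()
    for name, counts in sources.items():
        total = sum(counts.values()) or 1
        for tag, count in counts.items():
            # Normalize each source so a large source cannot drown out a small one.
            scores[tag] += weights[name] * count / total
    return [tag for tag, _ in scores.most_common(top_n)]

# Hypothetical tag counts for three of the sources.
sources = {
    "global": Counter({"tasty": 50, "cheap": 30, "crowded": 20}),
    "restaurant": Counter({"spicy": 6, "tasty": 4}),
    "user": Counter({"cozy": 3, "spicy": 1}),
}
# Hypothetical weights: the restaurant's own tags count most.
weights = {"global": 0.2, "restaurant": 0.5, "user": 0.3}

suggestions = suggest_tags(sources, weights)
print(suggestions)
```

With these numbers the restaurant's own tags and the user's habits outrank globally popular tags, which keeps suggestions specific to the restaurant while still falling back on popular tags when local data is sparse.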

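### Thinking 5: We used 10 approaches to solve MNIST today. How do they differ, and are there other ways to solve MNIST recognition (classification methods)?

1. LR is logistic regression, a linear model that is fast to compute.
2. CART and ID3 are decision-tree models: ID3 splits on information gain, while CART splits on the Gini index.
3. LDA is linear discriminant analysis; it uses the class labels to project the data onto fewer dimensions, which improves efficiency.
4. Naive Bayes is based on Bayes' theorem and assumes the attributes are conditionally independent given the class; it estimates the class priors and a conditional probability for each attribute from the training set.
5. SVM is the support vector machine, which finds the maximum-margin separating hyperplane on the training set.
6. KNN is the k-nearest-neighbors algorithm, a "lazy learning" method: for each test sample it finds the k nearest neighbors in the training set and predicts from them.
7. AdaBoost is an ensemble method that trains several weak classifiers and combines them into one strong classifier.
8. XGBoost is an optimized implementation of GBDT (gradient-boosted decision trees).
9. TPOT is an automated machine learning tool that performs feature selection and preprocessing automatically and searches for the best-performing model.
10. Keras is a high-level neural-network library that uses frameworks such as TensorFlow as its backend to simplify model implementation.

MNIST recognition can also be tackled with other machine learning and deep learning methods; in computer vision, CNNs and capsule networks, among others, can recognize MNIST digits.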

### Action 1: Improve the SimpleTagBased algorithm on the Delicious dataset (using the NormTagBased and TagBased-TFIDF algorithms)

In [1]:
# Recommend on the Delicious2K data with the SimpleTagBased algorithm
# Original dataset: https://grouplens.org/datasets/hetrec-2011/
# Data format: userID     bookmarkID     tagID     timestamp
import random
import math
import operator
import pandas as pd
import numpy as np

In [2]:
file_path = "data/user_taggedbookmarks-timestamps.dat"

In [3]:
# Dict holding each user's tags on each item: {userid: {item1: [tag1, tag2], ...}}
records = {}
# Training set and test set
train_data = dict()
test_data = dict()
# User-tag, tag-item, and user-item counts
user_tags = dict()
tag_items = dict()
user_items = dict()

In [4]:
item_tags = dict()
tag_users = dict()
item_users = dict()

In [5]:
# Load the data
def load_data(file_path):
    print("Loading data...")
    df = pd.read_csv(file_path, sep='\t')
    # Iterate over the three columns together instead of indexing row by row
    for uid, iid, tag in zip(df['userID'], df['bookmarkID'], df['tagID']):
        # Create empty containers the first time a key is seen
        records.setdefault(uid, {})
        records[uid].setdefault(iid, [])
        records[uid][iid].append(tag)
    print("Dataset size: %d." % (len(df)))
    print("Number of users who assigned tags: %d." % (len(records)))
    print("Data loading finished\n")
    return records

In [6]:
pd.read_csv(file_path, sep='\t').head()

| | userID | bookmarkID | tagID | timestamp |
|---|---|---|---|---|
| 0 | 8 | 1 | 1 | 1289255362000 |
| 1 | 8 | 2 | 1 | 1289255159000 |
| 2 | 8 | 7 | 1 | 1289238901000 |
| 3 | 8 | 7 | 6 | 1289238901000 |
| 4 | 8 | 7 | 7 | 1289238901000 |


In [7]:
records = load_data(file_path)

Loading data...
Dataset size: 437593.
Number of users who assigned tags: 1867.
Data loading finished



#### Structure of records:  
records = {uid_1:{iid_1: [tag_1, tag_2,...] , iid_2: [tag_1,tag_2,...], ...},uid_2...}


In [8]:
#records

In [9]:
# Split the dataset into a training set and a test set
def train_test_split(records, ratio, seed=100):
    random.seed(seed)
    for u in records.keys():
        for i in records[u].keys():
            # Each (user, item) pair goes to the test set with probability ratio
            if random.random()<ratio:
                test_data.setdefault(u,{})
                test_data[u].setdefault(i,[])
                for t in records[u][i]:
                    test_data[u][i].append(t)
            else:
                train_data.setdefault(u,{})
                train_data[u].setdefault(i,[])
                for t in records[u][i]:
                    train_data[u][i].append(t)
    # len() of each dict counts users, not individual tagging records
    print("Users in training set: %d, users in test set: %d" % (len(train_data),len(test_data)))
    return train_data, test_data

In [10]:
# Split into training and test sets, with about 20% in the test set
train_data, test_data = train_test_split(records, 0.2)

Users in training set: 1860, users in test set: 1793


In [11]:
# Add value to mat[index][item], creating the entry if it does not exist yet
def addValueToMat(mat, index, item, value=1):
    if index not in mat:
        mat.setdefault(index,{})
        mat[index].setdefault(item,value)
    else:
        if item not in mat[index]:
            mat[index][item] = value
        else:
            mat[index][item] += value

In [12]:
# Use the training set to initialize user_tags, tag_items, user_items
# (and the reverse mappings item_tags, tag_users, item_users)
def initStat(records):
    # records=train_data
    for u,items in records.items():
        for i,tags in items.items():
            for tag in tags:
                # user -> tag counts
                addValueToMat(user_tags, u, tag, 1)
                # tag -> item counts
                addValueToMat(tag_items, tag, i, 1)
                # user -> item counts
                addValueToMat(user_items, u, i, 1)
                # item -> tag counts
                addValueToMat(item_tags, i, tag, 1)
                # tag -> user counts
                addValueToMat(tag_users, tag, u, 1)
                # item -> user counts
                addValueToMat(item_users, i, u, 1)
    print("user_tags, tag_items, user_items initialized.")
    print("user_tags size %d, tag_items size %d, user_items size %d" % (len(user_tags), len(tag_items), len(user_items)))

In [13]:
# Initialize the statistics from the training set
initStat(train_data)

user_tags, tag_items, user_items initialized.
user_tags size 1860, tag_items size 36884, user_items size 1860


In [14]:
# Recommend the Top-N items for a user (simple tag based)
def recommend(user, N):
    recommend_items = dict()
    # Score each item as the sum over tags of
    # (times the user used the tag, wut) * (times the item received that tag, wti)
    tagged_items = user_items[user]
    for tag, wut in user_tags[user].items():
        for item, wti in tag_items[tag].items():
            # Skip items the user has already tagged
            if item in tagged_items:
                continue
            if item not in recommend_items:
                recommend_items[item] = wut * wti
            else:
                recommend_items[item] = recommend_items[item] + wut * wti
    return sorted(recommend_items.items(), key=operator.itemgetter(1), reverse=True)[0:N]
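To see the scoring rule on numbers small enough to check by hand, here is a toy version of the same loop. The `toy_*` dictionaries and the `simple_tag_scores` helper are hypothetical and exist only for this illustration; they follow the nested-dict layout built by `initStat`.

```python
import operator

# Hypothetical toy statistics in the same nested-dict layout as above.
toy_user_tags = {"u1": {"python": 2, "ml": 1}}          # user -> tag -> count
toy_tag_items = {"python": {"a": 3, "b": 1},            # tag -> item -> count
                 "ml": {"b": 4, "c": 2}}
toy_user_items = {"u1": {"a": 1}}                        # items u1 already tagged

def simple_tag_scores(user):
    """Sum wut * wti over the user's tags, skipping items already tagged."""
    scores = {}
    seen = toy_user_items[user]
    for tag, wut in toy_user_tags[user].items():
        for item, wti in toy_tag_items[tag].items():
            if item in seen:
                continue
            scores[item] = scores.get(item, 0) + wut * wti
    return sorted(scores.items(), key=operator.itemgetter(1), reverse=True)

ranking = simple_tag_scores("u1")
print(ranking)
```

Item `b` scores 2*1 + 1*4 = 6 and item `c` scores 1*2 = 2, while item `a` is dropped because the user has already tagged it.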

In [15]:
# Recommend the Top-N items for a user (norm tag based)
def recommend_normtag(user, N):
    recommend_items = dict()
    # Score each item as the sum over tags of wut * wti, normalized by
    # (number of users who used the tag) * (number of distinct tags the user used)
    tagged_items = user_items[user]
    for tag, wut in user_tags[user].items():
        # The normalizer depends only on the tag and the user, so compute it once per tag
        norm = len(tag_users[tag]) * len(user_tags[user])
        for item, wti in tag_items[tag].items():
            # Skip items the user has already tagged
            if item in tagged_items:
                continue
            if item not in recommend_items:
                recommend_items[item] = wut * wti / norm
            else:
                recommend_items[item] = recommend_items[item] + wut * wti / norm
    return sorted(recommend_items.items(), key=operator.itemgetter(1), reverse=True)[0:N]

In [16]:
# Recommend the Top-N items for a user (tag based TF-IDF)
def recommend_tagbased_TFIDF(user, N):
    recommend_items = dict()
    # Score each item as the sum over tags of wut * wti, divided by
    # log(1 + number of users who used the tag) to penalize very popular tags
    tagged_items = user_items[user]
    for tag, wut in user_tags[user].items():
        # The penalty depends only on the tag, so compute it once per tag
        norm = math.log(len(tag_users[tag]) + 1)
        for item, wti in tag_items[tag].items():
            # Skip items the user has already tagged
            if item in tagged_items:
                continue
            if item not in recommend_items:
                recommend_items[item] = wut * wti / norm
            else:
                recommend_items[item] = recommend_items[item] + wut * wti / norm
    return sorted(recommend_items.items(), key=operator.itemgetter(1), reverse=True)[0:N]
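Read off the three functions above, the scores differ only in how each tag's contribution is weighted. The notation here is introduced for illustration and is not from the course material: $n_{u,t}$ is `user_tags[u][t]`, $n_{t,i}$ is `tag_items[t][i]` (zero when item $i$ never received tag $t$), $T(u)$ is the set of distinct tags used by user $u$, and $U(t)$ is the set of distinct users who used tag $t$. Items the user has already tagged are excluded in all three cases.

$$ \text{SimpleTagBased:}\quad s(u,i) = \sum_{t \in T(u)} n_{u,t}\, n_{t,i} $$

$$ \text{NormTagBased:}\quad s(u,i) = \sum_{t \in T(u)} \frac{n_{u,t}\, n_{t,i}}{|T(u)|\,|U(t)|} $$

$$ \text{TagBased-TFIDF:}\quad s(u,i) = \sum_{t \in T(u)} \frac{n_{u,t}\, n_{t,i}}{\log\left(1 + |U(t)|\right)} $$

These formulas describe the normalizers as this notebook's code computes them (counts of distinct tags and distinct users); other write-ups of NormTagBased may normalize by total usage counts instead.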

In [17]:
# Use the test set to compute precision and recall
def precisionAndRecall(N, recommend):
    hit = 0
    h_recall = 0
    h_precision = 0
    for user,items in test_data.items():
        # Users without training data cannot be given recommendations
        if user not in train_data:
            continue
        # Get the Top-N recommendation list
        rank = recommend(user, N)
        for item,rui in rank:
            if item in items:
                hit = hit + 1
        h_recall = h_recall + len(items)
        h_precision = h_precision + N
    # Return precision and recall
    return (hit / h_precision), (hit / h_recall)
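The two ratios are easy to check on a toy case. The helper and data below are hypothetical. One difference from `precisionAndRecall` above is deliberate: this sketch divides by the number of items actually recommended, whereas the function above always adds `N` to the denominator, even when fewer than `N` items are returned.

```python
def precision_recall(recommended, relevant):
    """Top-N precision and recall, pooled over all users."""
    hit = sum(len(set(recommended[u]) & relevant[u]) for u in relevant)
    n_recommended = sum(len(recommended[u]) for u in relevant)
    n_relevant = sum(len(relevant[u]) for u in relevant)
    return hit / n_recommended, hit / n_relevant

# Hypothetical Top-2 lists and held-out test items for two users.
recommended = {"u1": ["a", "b"], "u2": ["c", "d"]}
relevant = {"u1": {"a", "x", "y", "z"}, "u2": {"d"}}

precision, recall = precision_recall(recommended, relevant)
print("precision = %.2f, recall = %.2f" % (precision, recall))
```

Two of the four recommended items are hits, so precision is 0.50; two of the five held-out items are recovered, so recall is 0.40.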

In [18]:
# Evaluate the recommendations on the test set
def testRecommend(recommend):
    print("Recommendation evaluation")
    print("%3s %10s %10s" % ('N',"Precision",'Recall'))
    for n in [5,10,20,40,60,80,100]:
        precision,recall = precisionAndRecall(n, recommend)
        print("%3d %10.3f%% %10.3f%%" % (n, precision * 100, recall * 100))
        

In [19]:
# simple tag based
testRecommend(recommend)

Recommendation evaluation
  N  Precision     Recall
  5      0.829%      0.355%
 10      0.633%      0.542%
 20      0.512%      0.877%
 40      0.381%      1.304%
 60      0.318%      1.635%
 80      0.276%      1.893%
100      0.248%      2.124%


In [20]:
# norm tag based
testRecommend(recommend_normtag)

Recommendation evaluation
  N  Precision     Recall
  5      0.907%      0.388%
 10      0.638%      0.546%
 20      0.507%      0.868%
 40      0.356%      1.218%
 60      0.287%      1.476%
 80      0.255%      1.750%
100      0.241%      2.061%


In [21]:
# tag based TFIDF
testRecommend(recommend_tagbased_TFIDF)

Recommendation evaluation
  N  Precision     Recall
  5      1.008%      0.431%
 10      0.761%      0.652%
 20      0.549%      0.940%
 40      0.402%      1.376%
 60      0.328%      1.687%
 80      0.297%      2.033%
100      0.269%      2.306%


### Action 2: Clean the Titanic data, build models, and predict passenger survival, using at least 2 of the 10 models introduced earlier (including TPOT)

In [22]:
import pandas as pd
import numpy as np

In [23]:
train_data = pd.read_csv("data/titanic/train.csv")
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [24]:
test_data = pd.read_csv("data/titanic/test.csv")
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


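### Handling missing values

In [25]:
# Fill NaN values in Age with the mean age
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
# Fill NaN values in Fare with the mean fare
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())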

In [26]:
train_data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.000000 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.000000 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.699118 | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.000000 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [27]:
train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [28]:
# Fill NaN values in Embarked with the most common port of embarkation
train_data['Embarked'] = train_data['Embarked'].fillna('S')

In [29]:
train_data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.000000 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.000000 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.699118 | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.000000 | 0 | 0 | 111369 | 30.0000 | C148 | C |


### Feature selection

In [30]:
from sklearn.feature_extraction import DictVectorizer
# Feature selection: drop passenger ID, name, ticket number, and cabin
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
dvec = DictVectorizer(sparse = False)
# orient='records' yields one dict per row; the abbreviation 'record' is rejected by pandas 2.x
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))




### Splitting a validation set off the training set

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_features, train_labels, train_size=0.8, test_size=0.2)

In [32]:
def data_clean(data, features=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']):
    # Fill NaN values in Age with the mean age
    data['Age'] = data['Age'].fillna(data['Age'].mean())
    # Fill NaN values in Fare with the mean fare
    data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
    # Fill NaN values in Embarked with the most common port of embarkation
    data['Embarked'] = data['Embarked'].fillna('S')
    # Feature selection: drop passenger ID, name, ticket number, and cabin
    return data[features]

In [33]:
test_features = data_clean(test_data)
test_features = dvec.transform(test_features.to_dict(orient='records'))
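`data_clean` fills the gaps in the test set with statistics computed from the test set itself. A common alternative is to compute the fill values once on the training set and reuse them everywhere, so both sets are imputed identically. The sketch below illustrates that idea on made-up frames; `fit_fill_values` and `apply_fill_values` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

def fit_fill_values(train):
    """Compute one fill value per column from the training data only."""
    return {
        "Age": train["Age"].mean(),
        "Fare": train["Fare"].mean(),
        "Embarked": train["Embarked"].mode()[0],  # most frequent port
    }

def apply_fill_values(data, fill_values):
    """Return a copy of data with missing values replaced column by column."""
    return data.fillna(value=fill_values)

# Made-up miniature frames standing in for the Titanic train and test sets.
toy_train = pd.DataFrame({
    "Age": [20.0, float("nan"), 40.0],
    "Fare": [10.0, 30.0, float("nan")],
    "Embarked": ["S", "S", None],
})
toy_test = pd.DataFrame({
    "Age": [float("nan"), 50.0],
    "Fare": [8.0, float("nan")],
    "Embarked": ["C", None],
})

fill_values = fit_fill_values(toy_train)
toy_test_filled = apply_fill_values(toy_test, fill_values)
print(toy_test_filled)
```

Because `fillna` returns a new frame here, the original `toy_test` is left untouched, which also sidesteps the pitfalls of filling a selected column in place.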

### Model selection

ID3 decision tree:

In [34]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
print("ID3 decision tree:")
# Note: scikit-learn builds CART-style binary trees; criterion='entropy'
# only borrows ID3's information-gain splitting criterion
ID3DT = DecisionTreeClassifier(criterion='entropy')
ID3DT.fit(X_train, y_train)
ID3DT_valid_pred = ID3DT.predict(X_valid)
ID3DT_pred_labels = ID3DT.predict(test_features)
# Accuracy of the ID3 decision tree on the training set
acc_ID3DT = round(ID3DT.score(X_train, y_train), 6)
print('Training-set accuracy (score): %.4f' % acc_ID3DT)
# Mean accuracy of the decision tree under 10-fold cross-validation
print('cross_val_score accuracy: %.4f' % np.mean(cross_val_score(ID3DT, X_train, y_train, cv=10)))
# Accuracy on the validation set
print('Validation-set accuracy: %.4f' % accuracy_score(y_valid, ID3DT_valid_pred))

ID3 decision tree:
Training-set accuracy (score): 0.9874
cross_val_score accuracy: 0.7628
Validation-set accuracy: 0.7598


Gradient-boosted decision trees:

In [35]:
from sklearn.ensemble import GradientBoostingClassifier
print("Gradient-boosted decision trees:")
GBDT = GradientBoostingClassifier()
GBDT.fit(X_train, y_train)
GBDT_valid_pred = GBDT.predict(X_valid)
GBDT_pred_labels = GBDT.predict(test_features)
# Accuracy of the gradient-boosted trees on the training set
acc_GBDT = round(GBDT.score(X_train, y_train), 6)
print('Training-set accuracy (score): %.4f' % acc_GBDT)
# Mean accuracy of the model under 10-fold cross-validation
print('cross_val_score accuracy: %.4f' % np.mean(cross_val_score(GBDT, X_train, y_train, cv=10)))
# Accuracy on the validation set
print('Validation-set accuracy: %.4f' % accuracy_score(y_valid, GBDT_valid_pred))

Gradient-boosted decision trees:
Training-set accuracy (score): 0.9129
cross_val_score accuracy: 0.8133
Validation-set accuracy: 0.7989


XGBoost:

In [36]:
from xgboost import XGBClassifier
print("XGBoost:")
XGB = XGBClassifier()
XGB.fit(X_train, y_train)
XGB_valid_pred = XGB.predict(X_valid)
XGB_pred_labels = XGB.predict(test_features)
# Accuracy of XGBoost on the training set
acc_XGB = round(XGB.score(X_train, y_train), 6)
print('Training-set accuracy (score): %.4f' % acc_XGB)
# Mean accuracy of the model under 10-fold cross-validation
print('cross_val_score accuracy: %.4f' % np.mean(cross_val_score(XGB, X_train, y_train, cv=10)))
# Accuracy on the validation set
print('Validation-set accuracy: %.4f' % accuracy_score(y_valid, XGB_valid_pred))

XGBoost:
Training-set accuracy (score): 0.9691
cross_val_score accuracy: 0.7880
Validation-set accuracy: 0.7989


TPOT

In [37]:
from tpot import TPOTClassifier
TPOT = TPOTClassifier(generations=5, population_size=20, verbosity=2)
TPOT.fit(X_train, y_train)
print(TPOT.score(X_valid, y_valid))
TPOT.export('tpot_titanic.py')

Generation 1 - Current best internal CV score: 0.8357332808037032
Generation 2 - Current best internal CV score: 0.8357332808037032
Generation 3 - Current best internal CV score: 0.8357332808037032
Generation 4 - Current best internal CV score: 0.8357332808037032
Generation 5 - Current best internal CV score: 0.8357332808037032
Best pipeline: DecisionTreeClassifier(XGBClassifier(input_matrix, learning_rate=0.1, max_depth=3, min_child_weight=6, n_estimators=100, nthread=1, subsample=0.6500000000000001), criterion=entropy, max_depth=6, min_samples_leaf=19, min_samples_split=11)
0.8156424581005587


### From “tpot_titanic.py”

In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator

In [42]:
# Average CV score on the training set was: 0.8441741357234316
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.1, max_depth=3, min_child_weight=6, n_estimators=100, nthread=1, subsample=0.6500000000000001)),
    DecisionTreeClassifier(criterion="entropy", max_depth=6, min_samples_leaf=19, min_samples_split=11)
)
exported_pipeline.fit(X_train, y_train)
results = exported_pipeline.predict(X_valid)

In [43]:
# Accuracy on the validation set
print('Validation-set accuracy: %.4f' % accuracy_score(y_valid, results))

Validation-set accuracy: 0.8156
