THINKING:
1.如何使用用户标签来指导业务（如何提升业务）
答：系统刚上线阶段，需要冷启动，拉新，这时可以根据用户注册时填写的信息包括人口统计信息（例如：年龄，工作种类，性别等）来进行推荐，也可根据获得用户的场景来（例如，用户为大学校园内的拉新活动获得）来进行推荐；拉新阶段过后，用户产生用户行为数据，可以根据UGC和PGC来进行相应的个性化推荐，也可以进行热门推荐等多种推荐同时进行，保证用户黏度，提高用户留存率；如果产品对用户来说为阶段性使用产品，则需要对关键的用户流失点进行预测，统计流失率，对留存用户数量有一个大致的掌握。

2.如果给你一堆用户数据，没有打标签。你该如何处理（如何打标签）
答：用户数据量小的情况下可以使用专家评分方式进行打标签，如果用户数量有一定的规模，则可以对用户的行为数据进行聚类分析，然后根据类别来判断分析同一个类别中的用户所具有的特征，对这些用户的特性进行解释，并产生用户标签。

3.准确率和精确率有何不同（评估指标）
答：准确率是预测结果中预测分类正确结果的比例，而精确率是指预测为YES的类别中真正为YES的概率。

4.如果你使用大众点评，想要给某个餐厅打标签。这时系统可以自动提示一些标签，你会如何设计（标签推荐）
答：一是根据商店自身的性质来分析，分析餐厅菜品种类，分析拥有与之相近菜品种类的餐厅，将后者的标签推荐给当前餐厅，二是根据该餐厅的其他用户行为进行分析，计算出其他用户喜欢给该餐厅打的标签并进行推荐，三是根据用户自身行为，该用户对同类餐厅常用的标签也可以进行推荐。


5.我们今天使用了10种方式来解MNIST，这些方法有何不同？你还有其他方法来解决MNIST识别问题么（分类方法）
答：这些算法模型主要包括决策树模型，逻辑回归，高维空间中的支持向量机，神经网络模型，先验概率分布朴素贝叶斯模型等这么几大类，TPOT则是一种大而全的最优模型搜索工具。还可以通过主成分分析之后进行softmax分类。

In [1]:
import random
import math
import operator
import pandas as pd

file_path = "G:\\python_lesson\\L3-code\\code\\delicious-2k\\user_taggedbookmarks-timestamps.dat"
records = {}
train_data = dict()
test_data = dict()
# 用户标签，商品标签
user_tags = dict()
tag_items = dict()
user_items = dict()

item_tags = dict()
tag_users = dict()
item_users = dict()


def precisionAndRecall(N):
    hit = 0
    h_recall = 0
    h_precision = 0
    for user, items in test_data.items():
        if user not in train_data:
            continue
        rank = recommend(user, N)
        for item, rui in rank:
            if item in items:
                hit = hit + 1
        h_recall = h_recall + len(items)
        h_precision = h_precision + N
    return (hit / (h_precision * 1.0)), (hit / (h_recall * 1.0))


def recommend(user, N):
    recommend_items = dict()
    tagged_items = user_items[user]
    for tag, wut in user_tags[user].items():
        # print(self.user_tags[user].items())
        for item, wti in tag_items[tag].items():
            if item in tagged_items:
                continue
            #norm = len(tag_users[tag].items()) * len(user_tags[user].items())
            norm=math.log(len(tag_users[tag].items())+1)
            if item not in recommend_items:
                recommend_items[item] = wut * wti / norm
            else:
                recommend_items[item] = recommend_items[item] + wut * wti / norm
    return sorted(recommend_items.items(), key=operator.itemgetter(1), reverse=True)[0:N]


def testRecommend():
    print("推荐结果评估")
    print("%3s %10s %10s" % ('N', "精确率", '召回率'))
    for n in [5, 10, 20, 40, 60, 80, 100]:
        precision, recall = precisionAndRecall(n)
        print("%3d %10.3f%% %10.3f%%" % (n, precision * 100, recall * 100))


def load_data():
    df = pd.read_csv(file_path, sep='\t')
    for i in range(len(df)):
        uid = df['userID'][i]
        iid = df['bookmarkID'][i]
        tag = df['tagID'][i]
        records.setdefault(uid, {})
        records[uid].setdefault(iid, [])
        records[uid][iid].append(tag)
    print("数据集大小为 %d." % (len(df)))
    print("设置tag的人数 %d." % (len(records)))
    print("数据加载完成\n")


def train_test_split(ratio, seed=100):
    random.seed(seed)
    for u in records.keys():
        for i in records[u].keys():
            # ratio比例设置为测试集
            if random.random() < ratio:
                test_data.setdefault(u, {})
                test_data[u].setdefault(i, [])
                for t in records[u][i]:
                    test_data[u][i].append(t)
            else:
                train_data.setdefault(u, {})
                train_data[u].setdefault(i, [])
                for t in records[u][i]:
                    train_data[u][i].append(t)
    print("训练集样本数 %d, 测试集样本数 %d" % (len(train_data), len(test_data)))


def addValueToMat(mat, index, item, value=1):
    if index not in mat:
        mat.setdefault(index, {})
        mat[index].setdefault(item, value)
    else:
        if item not in mat[index]:
            mat[index][item] = value
        else:
            mat[index][item] += value


def initStat():
    records = train_data
    for u, items in records.items():
        for i, tags in items.items():
            for tag in tags:
                addValueToMat(user_tags, u, tag, 1)
                addValueToMat(tag_items, tag, i, 1)
                addValueToMat(user_items, u, i, 1)
                addValueToMat(tag_users,tag,u,1)
    print("user_tags, tag_items, user_items初始化完成.")
    print("user_tags大小 %d, tag_items大小 %d, user_items大小 %d" % (len(user_tags), len(tag_items), len(user_items)))


# 数据加载
load_data()
# 训练集，测试集拆分，20%测试集
train_test_split(0.2)
initStat()
testRecommend()


数据集大小为 437593.
设置tag的人数 1867.
数据加载完成

训练集样本数 1860, 测试集样本数 1793
user_tags, tag_items, user_items初始化完成.
user_tags大小 1860, tag_items大小 36884, user_items大小 1860
推荐结果评估
  N        精确率        召回率
  5      1.008%      0.431%
 10      0.761%      0.652%
 20      0.549%      0.940%
 40      0.402%      1.376%
 60      0.328%      1.687%
 80      0.297%      2.033%
100      0.269%      2.306%


推荐结果评估
  N        精确率        召回率
  5      0.829%      0.355%
 10      0.633%      0.542%
 20      0.512%      0.877%
 40      0.381%      1.304%
 60      0.318%      1.635%
 80      0.276%      1.893%
100      0.248%      2.124%

NormTagBased推荐结果评估:

 N        精确率        召回率
  5      0.907%      0.388%
 10      0.638%      0.546%
 20      0.507%      0.868%
 40      0.356%      1.218%
 60      0.287%      1.476%
 80      0.255%      1.750%
100      0.241%      2.061%

TagBased-TFIDF推荐结果评估:
  N        精确率        召回率
  5      1.008%      0.431%
 10      0.761%      0.652%
 20      0.549%      0.940%
 40      0.402%      1.376%
 60      0.328%      1.687%
 80      0.297%      2.033%
100      0.269%      2.306%

#tpot预测结果

In [55]:
import pandas as pd
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
titanic_train_data = pd.read_csv( 'G:\\python_lesson\\titanic\\train.csv')
titanic_test_data=pd.read_csv("G:\\python_lesson\\titanic\\test.csv")
real_result=pd.read_csv("G:\\python_lesson\\titanic\\gender_submission.csv")
print("训练集数据探索：")
#print(titanic_train_data.info())
'''
le = preprocessing.LabelEncoder()
le.fit(['male','female'])
a=le.transform(titanic_train_data['Sex'].values.tolist())'''
print(titanic_train_data.iloc[61,11])
print(str(titanic_train_data.iloc[61,11]) =='nan')

def caculate_age(sex,Pclass,Survived):
    a=titanic_train_data[(titanic_train_data['Sex'] == sex)&(titanic_train_data['Pclass'] == Pclass)&(titanic_train_data['Survived'] == Survived)]['Age'].mean()
    if a==0:
        a=titanic_train_data[(titanic_train_data['Sex'] == sex)&(titanic_train_data['Pclass'] == Pclass)]['Age'].mean()
        if a==0:
            a = titanic_train_data[(titanic_train_data['Sex'] == sex)]['Age'].mean()
    #print(a)
    return a
def caculate_test_age(sex,Pclass):
    a=titanic_train_data[(titanic_train_data['Sex'] == sex)&(titanic_train_data['Pclass'] == Pclass)]['Age'].mean()
    if a==0:
        a = titanic_train_data[(titanic_train_data['Sex'] == sex)]['Age'].mean()
    #print(a)
    return a
def caculate_fare(sex,Pclass):
    b=titanic_train_data[(titanic_train_data['Sex'] == sex)&(titanic_train_data['Pclass'] == Pclass)]['Fare'].mean()
    if b==0:
        b=titanic_train_data[(titanic_train_data['Sex'] == sex)]['Fare'].mean()
    #print(b)
    return b
'''
def caculate_Embarked(sex,Survived):
    b=titanic_train_data[(titanic_train_data['Sex'] == sex)&(titanic_train_data['Survived'] == Survived)]['Embarked'].median()
    if b==0:
        b=titanic_train_data[(titanic_train_data['Sex'] == sex)]['Embarked'].median()
    print(b)
    return b
'''
def encode(oridata):
    for i in range(0,oridata.shape[0]):
        for j in range(0,oridata.shape[1]):
            #如果是sex列
            if j==4:
                if oridata.iloc[i,j]=='male':
                    oridata.iloc[i,j]=1
                else:
                    oridata.iloc[i, j] = 0
            if j==11:
                if oridata.iloc[i,j]=='S':
                    oridata.iloc[i, j] = 0
                if oridata.iloc[i,j]=='C':
                    oridata.iloc[i, j] = 1
                if oridata.iloc[i, j] == 'Q':
                    oridata.iloc[i, j] = 2
            else:
                continue
    return oridata
def encode_test(oridata):
    for i in range(0,oridata.shape[0]):
        for j in range(0,oridata.shape[1]):
            #如果是sex列
            if j==3:
                if oridata.iloc[i,j]=='male':
                    oridata.iloc[i,j]=1
                else:
                    oridata.iloc[i, j] = 0
            if j==10:
                if oridata.iloc[i,j]=='S':
                    oridata.iloc[i, j] = 0
                if oridata.iloc[i,j]=='C':
                    oridata.iloc[i, j] = 1
                if oridata.iloc[i, j] == 'Q':
                    oridata.iloc[i, j] = 2
            else:
                continue
    return oridata
def fillnan(oridata):
    for i in range(0, oridata.shape[0]):
        for j in range(0, oridata.shape[1]):
                    # 如果是age列
            if j==5:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j]=caculate_age(oridata.iloc[i,4],oridata.iloc[i,2],oridata.iloc[i,1])
            if j==9:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j] = caculate_fare(oridata.iloc[i, 4],
                                                                 oridata.iloc[i, 2])
            if j==11:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j] = 0
            else:
                continue
    return oridata

def filltestnan(oridata):
    for i in range(0, oridata.shape[0]):
        for j in range(0, oridata.shape[1]):
                    # 如果是age列
            if j==4:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j]=caculate_test_age(oridata.iloc[i,3],oridata.iloc[i,1])
            if j==8:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j] = caculate_fare(oridata.iloc[i, 3],
                                                                 oridata.iloc[i, 1])
            if j==10:
                if str(oridata.iloc[i,j])=='nan':
                    oridata.iloc[i, j] = 0
            else:
                continue
    return oridata
def cleandata(oridata):
    return  fillnan(encode(oridata))
def cleantestdata(oridata):
    return filltestnan(encode_test(oridata))
titanic_train_data=cleandata(titanic_train_data)
titanic_test_data=cleantestdata(titanic_test_data)
#print(titanic_train_data['Age'])

#print(titanic_test_data['Embarked'])

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X_train,X_test,y_train,y_test=train_test_split(titanic_train_data[features],titanic_train_data['Survived'],test_size=0.25)
tpot=TPOTClassifier(generations=5,population_size=20,verbosity=2)
tpot.fit(X_train,y_train)
predict_result=tpot.predict(titanic_test_data[features])
print(predict_result)
print(tpot.score(X_test,y_test))
count=0
for i in range(0,len(predict_result)):
    if predict_result[i]==real_result.iloc[i,1]:
        count+=1
    else:
        continue
print(count)
print("预测准确率：")
print(count/len(predict_result))


#运行结果为：
#Best pipeline: RandomForestClassifier(PolynomialFeatures(RobustScaler(input_matrix), degree=2, include_bias=False, interaction_only=False), bootstrap=True, criterion=gini, max_features=0.8500000000000001, min_samples_leaf=4, min_samples_split=11, n_estimators=100)
#0.8923766816143498




训练集数据探索：
nan
True


HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=120.0, style=ProgressStyle(de…


Generation 1 - Current best internal CV score: 0.8473347547974415
Generation 2 - Current best internal CV score: 0.8518572550779935
Generation 3 - Current best internal CV score: 0.8518572550779935
Generation 4 - Current best internal CV score: 0.8518572550779935
Generation 5 - Current best internal CV score: 0.8548423297048592
Best pipeline: ExtraTreesClassifier(MaxAbsScaler(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False)), bootstrap=False, criterion=entropy, max_features=0.4, min_samples_leaf=3, min_samples_split=5, n_estimators=100)
[0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0
 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 

#ID3决策树预测结果

In [57]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
dvec=DictVectorizer(sparse=False)
train_features=dvec.fit_transform(X_train.to_dict(orient='record'))
print(dvec.feature_names_)
# 构造ID3决策树
clf = DecisionTreeClassifier(criterion='entropy',max_depth=8)
# 决策树训练
clf.fit(X_train, y_train)

test_features=dvec.transform(X_test.to_dict(orient='record'))
# 决策树预测
pred_labels = clf.predict(test_features)
print(pred_labels)

real_test_features=dvec.transform(titanic_test_data[features].to_dict(orient='record'))
# 决策树预测
real_pred_labels = clf.predict(real_test_features)
print(real_pred_labels)
# 得到决策树准确率(基于训练集)
#acc_decision_tree = round(clf.score(train_features, y_train), 6)
#print(u'score准确率为 %.4lf' % acc_decision_tree)
count=0
for i in range(0,len(real_pred_labels)):
    if real_pred_labels[i]==real_result.iloc[i,1]:
        count+=1
    else:
        continue
print(count)
print("ID3决策树预测准确率：")
print(count/len(real_pred_labels))

['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp']
[1 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0
 1 0 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 1 1 1 0 1 1
 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1
 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1
 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 0
 1]
[1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1
 1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 0
 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0
 1 0 1 0 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1
 0 0 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0
 1 1 1 0 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1
 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 

