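# Thinking1: What are the application scenarios of ALS?

ALS (Alternating Least Squares) is an optimization method for loss functions that have more than one group of unknown variables. It fixes every group but one, solves a least-squares problem for the remaining group, and then alternates (similar in spirit to taking the partial derivative with respect to each variable in turn when finding the extremum of a multivariate function). It is used to solve optimization problems (minimizing a loss function), and in recommender systems it is applied to tasks such as movie, product, and ad recommendation.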

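# Thinking2: Why can ALS matrix factorization be parallelized?

From the ALS update formulas, once the user feature vectors are fixed, the update of each item feature vector depends only on that item's own ratings and the fixed user vectors, not on the feature vectors of other items, so the updates for different items can run on different servers. Likewise, once the item feature vectors are fixed, the update of each user feature vector depends only on that user's own ratings, not on the feature vectors of other users, so the updates for different users can also run on different servers. The ALS implementation in Spark is parallelized in this way.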
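
The independence of the per-user and per-item updates can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: it assumes rank-1 factors (one scalar per user and per item) so that each least-squares solve has a one-line closed form, and the ratings, `lam`, and all function names are made up for the example.

```python
# Rank-1 ALS on a tiny ratings dict; data and names are illustrative only.
ratings = {  # (user, item) -> rating
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
n_users, n_items, lam = 3, 3, 0.1

def solve_one(own_ratings, fixed):
    """Closed-form ridge solution for one scalar factor.

    own_ratings: list of (other_index, rating) for this user (or item).
    fixed: factors of the other side, held constant during this half-step.
    """
    num = sum(r * fixed[j] for j, r in own_ratings)
    den = lam + sum(fixed[j] ** 2 for j, _ in own_ratings)
    return num / den

def by_user(u):
    return [(i, r) for (uu, i), r in ratings.items() if uu == u]

def by_item(i):
    return [(u, r) for (u, ii), r in ratings.items() if ii == i]

def loss(p, q):
    err = sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())
    return err + lam * (sum(x * x for x in p) + sum(x * x for x in q))

p = [1.0] * n_users  # user factors
q = [1.0] * n_items  # item factors
history = [loss(p, q)]
for _ in range(10):
    # Items fixed: each user's update reads only q and that user's ratings,
    # so these calls could run on separate workers in any order.
    p = [solve_one(by_user(u), q) for u in range(n_users)]
    # Users fixed: the same holds for each item.
    q = [solve_one(by_item(i), p) for i in range(n_items)]
    history.append(loss(p, q))
```

Because no call to `solve_one` reads a value written by another call in the same half-step, the list comprehensions could be replaced by a parallel map without changing the result.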

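# Thinking3: What are the differences among batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD)?


- Batch gradient descent (BGD):
1. Every update uses the full set of samples
2. Stable, but slow to converge
- Stochastic gradient descent (SGD):
1. Every update uses 1 sample, which stands in for the whole sample set
2. Each update is cheap, so it makes progress faster; the updates are noisy, so the final solution hovers near the optimum rather than settling on it exactly, and it can end up at a local optimum
- Mini-batch gradient descent (MBGD):
1. Every update uses b samples, a compromise between the two
2. Relatively fast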
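
The three variants differ only in how many samples feed each parameter update, which the following sketch makes explicit. It is a toy example under stated assumptions: a one-parameter model `y = w * x`, noise-free data generated from `w = 3`, and hypothetical names (`train`, `gradient`, `batch_size`) chosen for illustration.

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        # One parameter update per batch of `batch_size` samples.
        for start in range(0, len(samples), batch_size):
            w -= lr * gradient(w, samples[start:start + batch_size])
    return w

# Toy data generated from y = 3x (noise-free, for illustration only).
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_bgd = train(data, batch_size=len(data))  # BGD: all samples per update
w_sgd = train(data, batch_size=1)          # SGD: one sample per update
w_mbgd = train(data, batch_size=4)         # MBGD: b = 4 samples per update
```

With the same number of epochs, BGD performs 1 update per epoch, SGD performs 10, and MBGD performs 3. Since this data has no noise, all three reach the same `w`; on real data the SGD and MBGD estimates would fluctuate around the optimum.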

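# Thinking4: What are the benefits and drawbacks of AutoML tools such as TPOT?

- Pros: they automate feature selection and model selection, and usually give a good idea of the best accuracy attainable across many kinds of models
- Cons: they do not cover data cleaning, and they are very slow on large datasets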

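# Thinking5: Have you read papers on recommender systems, computational advertising, or prediction? Which ones would you recommend? You can share them in the WeChat group

Yes, I have. I hope to read more in the future.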

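# Action1: Rating prediction on the MovieLens dataset

### How the SlopeOne algorithm works, in brief:
1. For each pair of items, compute the mean difference between their ratings over the users who rated both items; this is the rating deviation
2. Use the deviations between items together with the user's past ratings to predict ratings for the items the user has not rated yet
3. Sort the predicted ratings and recommend the Top-N items to the user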
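
The first two steps can be sketched in plain Python on a three-user toy dataset. This is an illustrative, unweighted variant (each co-rated item contributes equally), not the code used by the `surprise` library; the users, items, and function names are made up.

```python
def deviation(ratings, i, j):
    """Mean of (r_ui - r_uj) over users who rated both i and j, or None."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(ratings, user, target):
    """Average of (user's rating of j + deviation(target, j)) over rated items j."""
    estimates = []
    for j, r_uj in ratings[user].items():
        if j == target:
            continue
        dev = deviation(ratings, target, j)
        if dev is not None:
            estimates.append(r_uj + dev)
    return sum(estimates) / len(estimates) if estimates else None

# Toy ratings: user -> {item: rating}
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}

pred = slope_one_predict(ratings, "lucy", "A")
```

Here the deviation of A over B is ((5 - 3) + (3 - 4)) / 2 = 0.5 and the deviation of A over C is 3, so the prediction for Lucy is ((2 + 0.5) + (5 + 3)) / 2 = 5.25. A real system would clip this to the rating scale and then rank the predictions for step 3.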

In [1]:
import pandas as pd
from surprise import Dataset, Reader, SlopeOne

# Read the movie (item) names and build id <-> title lookups
def read_item_names():
    file_name = 'F:\\Jupyter_notebook_workdir\\BI\\week2\\MovieLens\\movies.csv'
    data = pd.read_csv(file_name)
    rid_to_name = {}
    name_to_rid = {}
    for movie_id, title in zip(data['movieId'], data['title']):
        rid_to_name[movie_id] = title
        name_to_rid[title] = movie_id

    return rid_to_name, name_to_rid

# Load the ratings (skip the CSV header line)
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file('F:\\Jupyter_notebook_workdir\\BI\\week2\\MovieLens\\ratings.csv', reader=reader)
train_set = data.build_full_trainset()


# Fit the SlopeOne algorithm on the full training set
algo = SlopeOne()
algo.fit(train_set)
# Predict the rating of a given user for a given item
# (raw ids are strings because the ratings were loaded from a file)
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 4.32   {'was_impossible': False}


# Action2: Paper Reading

- See the attached PDF

# Action3: Titanic prediction

## Method 1: training with TPOT

In [5]:
import numpy as np
import pandas as pd
from tpot import TPOTClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Load the data
train_data = pd.read_csv('F:/Jupyter_notebook_workdir/BI/week2/Titanic数据集/train.csv')
test_data = pd.read_csv('F:/Jupyter_notebook_workdir/BI/week2/Titanic数据集/test.csv')
# Explore the data
print(train_data.info())
print('-'*30)
print(train_data.describe())
print('-'*30)
print(train_data.describe(include=['O']))
print('-'*30)
print(train_data.head())
print('-'*30)
print(train_data.tail())
# Clean the data
# Fill missing ages with the mean age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())
# Fill missing fares with the mean fare
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
print(train_data['Embarked'].value_counts())

# Fill missing embarkation ports with the most common port ('S')
train_data['Embarked'] = train_data['Embarked'].fillna('S')
test_data['Embarked'] = test_data['Embarked'].fillna('S')

# Select the features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# One-hot encode the string columns ('records' is the valid orient name)
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
print(dvec.feature_names_)

# Hold out 25% of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(train_features,
    train_labels, train_size=0.75, test_size=0.25)

# population_size=100 matches the progress bar total below (100 + 100 * 100 = 10100)
tpot = TPOTClassifier(generations=100, population_size=100, verbosity=2)
# Fit on the training split only, so the score on X_test is a held-out estimate
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_titanic_pipeline.py')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
------------------------------
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008  

HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=10100.0, style=ProgressStyle(…


Generation 1 - Current best internal CV score: 0.8316866486723997
Generation 2 - Current best internal CV score: 0.8316866486723997
Generation 3 - Current best internal CV score: 0.8372920720607621
Generation 4 - Current best internal CV score: 0.8384031134266525
Generation 5 - Current best internal CV score: 0.8384031134266525
Generation 6 - Current best internal CV score: 0.8440210909547424
Generation 7 - Current best internal CV score: 0.8440210909547424
Generation 8 - Current best internal CV score: 0.8440210909547424
Generation 9 - Current best internal CV score: 0.8440210909547424
Generation 10 - Current best internal CV score: 0.8440210909547424
Generation 11 - Current best internal CV score: 0.8440210909547424
Generation 12 - Current best internal CV score: 0.8440210909547424
Generation 13 - Current best internal CV score: 0.8440210909547424
Generation 14 - Current best internal CV score: 0.848521750047078
Generation 15 - Current best internal CV score: 0.848521750047078
Gener

## Method 2: using a custom ID3-style decision tree

In [25]:
import numpy as np 
import pandas as pd 
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn import metrics

In [26]:
# Load the data
train_data = pd.read_csv('F:/Jupyter_notebook_workdir/BI/week2/Titanic数据集/train.csv')
test_data = pd.read_csv('F:/Jupyter_notebook_workdir/BI/week2/Titanic数据集/test.csv')

In [27]:
# Explore the data
print(train_data.info())
print(train_data.describe())
# Look at the distribution of the categorical (object) columns
print(train_data.describe(include = ['O']))
print(train_data.head())
print(train_data.tail())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.48659

In [28]:
# Clean the data: fill missing ages and fares with the column mean
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())

train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())

# Fill missing embarkation ports with the most common port ('S')
print(train_data['Embarked'].value_counts())
train_data['Embarked'] = train_data['Embarked'].fillna('S')
test_data['Embarked'] = test_data['Embarked'].fillna('S')

# Select the features
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]
print(train_features)

S    644
C    168
Q     77
Name: Embarked, dtype: int64
     Pclass     Sex        Age  SibSp  Parch     Fare Embarked
0         3    male  22.000000      1      0   7.2500        S
1         1  female  38.000000      1      0  71.2833        C
2         3  female  26.000000      0      0   7.9250        S
3         1  female  35.000000      1      0  53.1000        S
4         3    male  35.000000      0      0   8.0500        S
..      ...     ...        ...    ...    ...      ...      ...
886       2    male  27.000000      0      0  13.0000        S
887       1  female  19.000000      0      0  30.0000        S
888       3  female  29.699118      1      2  23.4500        S
889       1    male  26.000000      0      0  30.0000        C
890       3    male  32.000000      0      0   7.7500        Q

[891 rows x 7 columns]


In [29]:
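# Convert the string columns to numeric (one-hot) columns
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient = 'records'))
# Reuse the encoder fitted on the training data; do not refit it on the test data
test_features = dvec.transform(test_features.to_dict(orient = 'records'))
print(dvec.feature_names_)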

['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp']
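
An encoder should learn its output columns from the training data only and then be applied unchanged to the test data, so that both matrices share the same column layout. The sketch below shows that idea with two hypothetical helpers, `fit_columns` and `transform`; they imitate the naming scheme seen in the output above but are not scikit-learn's `DictVectorizer` API.

```python
def column_name(key, value):
    """String values become 'key=value' indicator columns; numbers keep their key."""
    return f"{key}={value}" if isinstance(value, str) else key

def fit_columns(rows):
    """Learn the sorted list of output columns from the training rows."""
    return sorted({column_name(k, v) for row in rows for k, v in row.items()})

def transform(rows, columns):
    """Encode rows against a fixed column list; unseen categories stay all-zero."""
    index = {name: pos for pos, name in enumerate(columns)}
    encoded = []
    for row in rows:
        vec = [0.0] * len(columns)
        for key, value in row.items():
            name = column_name(key, value)
            if name in index:
                vec[index[name]] = 1.0 if isinstance(value, str) else float(value)
        encoded.append(vec)
    return encoded

train_rows = [{"Sex": "male", "Embarked": "S", "Age": 22.0},
              {"Sex": "female", "Embarked": "C", "Age": 38.0}]
test_rows = [{"Sex": "male", "Embarked": "Q", "Age": 30.0}]

columns = fit_columns(train_rows)              # learned from training data only
train_matrix = transform(train_rows, columns)
test_matrix = transform(test_rows, columns)    # same columns, same order
```

The test row's port `Q` never appeared in the training rows, so it gets no column of its own and the row is still encoded with the training layout. Refitting on the test rows instead would produce a different set of columns, which a model trained on `train_matrix` could not interpret.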


In [31]:
train_features

array([[22.        ,  0.        ,  0.        , ...,  0.        ,
         1.        ,  1.        ],
       [38.        ,  1.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  0.        ,  0.        , ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [29.69911765,  0.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  1.        ,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [32.        ,  0.        ,  1.        , ...,  0.        ,
         1.        ,  0.        ]])

In [32]:
# Build the decision tree (entropy criterion, as in ID3)
clf = DecisionTreeClassifier(criterion = 'entropy')
clf.fit(train_features,train_labels)
pred_labels = clf.predict(test_features)

In [33]:
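# Accuracy of the decision tree on the training set (optimistic: the tree has seen this data)
acc_decision_tree = round(clf.score(train_features,train_labels),6)
print('Training-set accuracy: %.4f' % acc_decision_tree)

Training-set accuracy: 0.9820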


In [37]:
# Estimate the decision tree's accuracy with K-fold cross-validation (here K = 120)
print('cross_val_score accuracy: %.4f' % np.mean(cross_val_score(clf,train_features,train_labels,cv = 120)))

cross_val_score accuracy: 0.7842
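
The gap between the two numbers comes from how K-fold cross-validation works: every sample is scored by a model that did not see it during training. The sketch below shows only the index bookkeeping for 891 samples and 120 folds. It is a plain, unstratified split written for illustration with a hypothetical `k_fold_indices` helper; for a classifier and an integer `cv`, scikit-learn uses a stratified split, so its actual folds will differ.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(891, 120))
sizes = [len(val_idx) for _, val_idx in folds]
```

With 120 folds each validation fold holds only 7 or 8 passengers, so each per-fold accuracy is coarse; the mean over all folds is what the cell above reports.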
