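## Tasks to complete:

- 1. Data preprocessing (most of the 1.8 GB of data is irrelevant and has to be filtered out)
- 2. Text cleaning (the raw text cannot be used directly; stop-word removal, text filtering, and regex cleanup are all needed)
- 3. Matrix factorization (SVD vs. NMF: which works better has to be tested; SVD may do better on other tasks, but on this project NMF happens to come out ahead)
- 4. LDA topic model (an unsupervised workhorse for text analysis; because it needs no labels, it is widely applicable)
- 5. Building the recommendation engine (essentially a similarity computation that produces the recommendations)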

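## Packages used
- numpy and pandas: no introduction needed, they are essential.
- gensim: a go-to library for text processing and topic modeling; its preprocessing utilities and the LDA model can be called directly.
- sklearn: provides NMF and SVD out of the box; the most widely used machine-learning package.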

In [1]:
import numpy as np
import pandas as pd
import re
import string

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import gensim
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from gensim.utils import simple_preprocess

from nltk.stem.porter import PorterStemmer
import warnings 
warnings.filterwarnings("ignore")

### Load the dataset
- If your laptop is slow, you can read only part of the data by passing `nrows=1000` to `pd.read_csv`.
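
As a rough sketch of the partial-read tip, the snippet below parses only the first few rows and a subset of columns. The in-memory CSV and its values are made up for illustration; with the real file you would pass the path instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (values are invented)
fake_csv = io.StringIO(
    "title,language,totalClapCount\n"
    "A,en,30\n"
    "B,fr,5\n"
    "C,en,100\n"
)

# nrows limits how many data rows are parsed; usecols limits the columns kept
sample = pd.read_csv(fake_csv, nrows=2, usecols=["title", "totalClapCount"])
print(sample.shape)
```

Restricting columns with `usecols` also reduces memory use, which can matter as much as the row limit on a wide table like this one.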

In [2]:
medium = pd.read_csv('Medium_AggregatedData.csv')
medium.head()

Unnamed: 0,audioVersionDurationSec,codeBlock,codeBlockCount,collectionId,createdDate,createdDatetime,firstPublishedDate,firstPublishedDatetime,imageCount,isSubscriptionLocked,...,slug,name,postCount,author,bio,userId,userName,usersFollowedByCount,usersFollowedCount,scrappedDate
0,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,blockchain,Blockchain,265164.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
1,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,samsung,Samsung,5708.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
2,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,it,It,3720.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
3,0,,0.0,,2018-01-07,2018-01-07 17:04:37,2018-01-07,2018-01-07 17:06:29,13,False,...,technology,Technology,166125.0,George Sykes,,93b9e94f08ca,tasty231,6.0,22.0,20181104
4,0,,0.0,,2018-01-07,2018-01-07 17:04:37,2018-01-07,2018-01-07 17:06:29,13,False,...,robotics,Robotics,9103.0,George Sykes,,93b9e94f08ca,tasty231,6.0,22.0,20181104


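### Beyond the standard routine, preprocessing needs rules tailored to the data
- Most of the text is in English, with a small amount in other languages; keep only the English articles.
- Recommended articles should be of reasonable quality, so articles with few claps are dropped for now.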

In [3]:
medium = medium[medium['language'] == 'en']         
medium = medium[medium['totalClapCount'] >= 25]     

Collect the tags for each article

In [4]:
def findTags(title):
    rows = medium[medium['title'] == title]
    #print(len(rows))
    tags = list(rows['tag_name'].values)
    return tags

In [5]:
titles = medium['title'].unique()                   # all article titles

tag_dict = {'title': [], 'tags': []}               # tags for each article

for title in titles:
    tag_dict['title'].append(title)
    tag_dict['tags'].append(findTags(title))

tag_df = pd.DataFrame(tag_dict)                     # convert to a DataFrame

# Drop duplicate rows so that each title appears only once
medium = medium.drop_duplicates(subset = 'title', keep = 'first')
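
The per-title loop filters the whole DataFrame once for every title, which gets slow on a large table. One possible alternative, shown here on a tiny made-up DataFrame with the same `title` and `tag_name` columns, is to let pandas group the rows in a single pass:

```python
import pandas as pd

# Toy data: one row per (article, tag) pair, mirroring the layout of `medium`
toy = pd.DataFrame({
    "title": ["Post A", "Post A", "Post B"],
    "tag_name": ["Blockchain", "Samsung", "Robotics"],
})

# Collect all tags of each title in one pass over the data
toy_tags = (
    toy.groupby("title", sort=False)["tag_name"]
       .agg(list)
       .reset_index()
       .rename(columns={"tag_name": "tags"})
)
print(toy_tags)
```

Note that `groupby` skips rows whose title is missing by default, so this variant would need to run before or alongside the null-title handling.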

Add the tags to the DataFrame

In [6]:
def addTags(title):
    try:
        tags = list(tag_df[tag_df['title'] == title]['tags'])[0]
    except IndexError:
        # No matching row in tag_df (e.g. a missing title): assume no tags
        tags = np.nan
    return tags

In [7]:
# Add the tags to the original DataFrame
medium['allTags'] = medium['title'].apply(addTags)

# Keep only the columns we need
keep_cols = ['title', 'url', 'allTags', 'readingTime', 'author', 'text']
medium = medium[keep_cols]

# Drop rows with a missing title
null_title = medium[medium['title'].isna()].index
medium.drop(index = null_title, inplace = True)

medium.reset_index(drop = True, inplace = True)

print(medium.shape)
medium.head()

(24576, 6)


Unnamed: 0,title,url,allTags,readingTime,author,text
0,"Private Business, Government and Blockchain",https://medium.com/s/story/private-business-go...,"[Blockchain, Samsung, It]",0.958491,Anar Babaev,"Private Business, Government and Blockchain\n\..."
1,Can a robot love us better than another human ...,https://medium.com/s/story/can-a-robot-love-us...,"[Robotics, Meditation, Therapy, Artificial Int...",0.65283,Stewart Alsop,Can a robot love us better than another human ...
2,"2017 Big Data, AI and IOT Use Cases",https://medium.com/s/story/2017-big-data-ai-an...,"[Artificial Intelligence, Data Science, Big Da...",7.055031,Melody Ucros,"2017 Big Data, AI and IOT Use Cases\nAn Active..."
3,The Meta Model and Meta Meta-Model of Deep Lea...,https://medium.com/s/story/the-meta-model-and-...,"[Machine Learning, Deep Learning, Artificial I...",5.684906,Carlos E. Perez,The Meta Model and Meta Meta-Model of Deep Lea...
4,Don’t trust “Do you trust this computer”,https://medium.com/s/story/dont-trust-do-you-t...,"[Artificial Intelligence, Ethics, Elon Musk, D...",2.739623,Virginia Dignum,Don’t trust “Do you trust this computer”\nfrom...


### Text cleaning (regular expressions)

In [8]:
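def clean_text(text):
    # Remove links (the scheme is optional, so any dotted token such as "file.txt" is removed too)
    text = re.sub(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', '', text)
    # Remove any token that contains a digit
    text = re.sub(r'\w*\d\w*', ' ', text)
    # Lowercase everything and replace punctuation with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())
    # Replace newlines with spaces
    text = text.replace('\n', ' ')
    return text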

medium['text'] = medium['text'].apply(clean_text)
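
Because the scheme group in the link pattern is optional, the first substitution also deletes ordinary dotted tokens. The sketch below contrasts it with a stricter pattern that requires an explicit scheme; `strict_url` is a hypothetical alternative, not part of the original notebook.

```python
import re

loose_url = r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+'
# Hypothetical stricter variant: only match when a scheme is present
strict_url = r'(?:https?|ftp)://\S+'

sample = "See https://example.com/page for details on model.fit usage"

loose_result = re.sub(loose_url, '', sample)
strict_result = re.sub(strict_url, '', sample)
print(loose_result)   # the link and "model.fit" are both removed
print(strict_result)  # only the link is removed
```

Whether the loose behavior is a problem depends on the corpus; for topic modeling on prose it may be acceptable, since dotted tokens are mostly file names and code fragments.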

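### Stop-word removal
- An off-the-shelf stop-word list is the usual starting point, but it rarely covers everything a specific task needs, so it has to be extended.
- Extra words can be added by hand one at a time, or chosen statistically, for example the 100 most frequent words in the corpus.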
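
The statistical option can be sketched with `collections.Counter`: count every token in the corpus and treat the most frequent ones as stop-word candidates. The toy corpus and the cutoff `top_n` are illustrative assumptions; on real data the candidates should be reviewed by hand before they are added to the stop list.

```python
from collections import Counter

toy_docs = [
    "data science uses data",
    "deep learning uses data",
    "chatbots answer questions",
]

# Count token occurrences across the whole corpus
counts = Counter(word for doc in toy_docs for word in doc.split())

top_n = 2  # illustrative cutoff; the text suggests something like 100
frequent_words = {word for word, _ in counts.most_common(top_n)}
print(frequent_words)
```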

In [9]:
# Add some custom stop words
stop_list = STOPWORDS.union(set(['data', 'ai', 'learning', 'time', 'machine', 'like', 'use', 'new', 'intelligence', 'need', "it's", 'way',
                                 'artificial', 'based', 'want', 'know', 'learn', "don't", 'things', 'lot', "let's", 'model', 'input',
                                 'output', 'train', 'training', 'trained', 'it', 'we', 'don', 'you', 'ce', 'hasn', 'sa', 'do', 'som',
                                 'can']))

# Remove stop words and tokens of two characters or fewer
def remove_stopwords(text):
    kept_words = []
    for word in text.split(' '):
        if word not in stop_list and (len(word) > 2):
            kept_words.append(word)
    return ' '.join(kept_words)

medium['text'] = medium['text'].apply(remove_stopwords)

### Stemming
- English words appear in many inflected forms; stemming reduces them to a common stem.

In [10]:
stemmer = PorterStemmer()

def stem_text(text):
    word_list = []
    for word in text.split(' '):
        word_list.append(stemmer.stem(word))
    return ' '.join(word_list)

medium['text'] = medium['text'].apply(stem_text)

### Preprocessing usually takes a long time, so save the result

In [11]:
medium.to_csv('pre-processed.csv')

In [12]:
# medium = pd.read_csv('pre-processed.csv')

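### TF-IDF
- TF-IDF is usually introduced with the classic "bee farming" example: a word that is frequent in one article but rare across the corpus gets a high weight.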
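
The intuition can be checked by hand on a toy corpus. The sketch below uses the plain textbook formula, tf × log(N / df); scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The three toy documents are invented.

```python
import math

toy_docs = [
    ["china", "bee", "farming", "bee"],
    ["china", "economy"],
    ["china", "travel"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document's tokens that are `term`
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms found in fewer documents score higher
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

scores = {t: tf_idf(t, toy_docs[0], toy_docs) for t in set(toy_docs[0])}
print(scores)
```

Here "china" appears in every document, so its weight is zero, while "bee" outranks "farming" because it occurs twice in the first document.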

In [13]:
# Pass a list: newer scikit-learn versions reject a frozenset for stop_words
vectorizer = TfidfVectorizer(stop_words = list(stop_list), ngram_range = (1,1))
doc_word = vectorizer.fit_transform(medium['text'])

In [14]:
doc_word.shape

(24576, 122871)

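### SVD matrix factorization
- API reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD
- The number of components must be specified; the 8 here is your guess at how many topics the documents cover.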

In [15]:
# Much like PCA, but without centering the data, so it works directly on sparse matrices
svd = TruncatedSVD(8)
docs_svd = svd.fit_transform(doc_word)

In [16]:
docs_svd.shape

(24576, 8)
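
To make the factorization concrete, here is a small NumPy sketch of a rank-k truncated SVD on a made-up matrix. It keeps the k largest singular values and projects each row onto them, which yields the same kind of document-by-topic matrix as `docs_svd`. `TruncatedSVD` uses a different solver, so signs and exact values may differ.

```python
import numpy as np

# Toy "document x word" matrix (values are invented)
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

k = 2  # number of components to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Row coordinates in the k-dimensional space: U_k * Sigma_k
docs_k = U[:, :k] * s[:k]
# Best rank-k approximation of A
A_k = docs_k @ Vt[:k, :]

print(docs_k.shape)
print(np.linalg.norm(A - A_k))  # equals the first discarded singular value
```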

In [17]:
# Helper that prints the top words of each topic
def display_topics(model, feature_names, no_top_words, no_top_topics, topic_names=None):
    count = 0
    for ix, topic in enumerate(model.components_):
        if count == no_top_topics:
            break
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", (ix + 1))
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        count += 1

### Results

In [18]:
display_topics(svd, vectorizer.get_feature_names_out(), 15, 8)


Topic  1
human, network, imag, technolog, work, user, algorithm, predict, peopl, compani, product, busi, deep, custom, develop

Topic  2
imag, layer, network, neural, function, dataset, featur, weight, convolut, vector, valu, gradient, deep, predict, paramet

Topic  3
chatbot, bot, user, custom, convers, app, messag, messeng, chat, servic, text, word, voic, assist, interact

Topic  4
imag, network, layer, neural, human, convolut, deep, chatbot, neuron, robot, technolog, cnn, brain, gan, architectur

Topic  5
imag, blockchain, tensorflow, file, python, project, api, cloud, instal, token, platform, app, code, team, googl

Topic  6
blockchain, market, valu, token, custom, layer, network, function, predict, gradient, busi, price, platform, compani, decentr

Topic  7
scienc, network, neural, chatbot, deep, layer, scientist, cours, python, neuron, gradient, skill, function, program, weight

Topic  8
word, vector, text, blockchain, token, sentenc, languag, embed, document, nlp, network, sent

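### Trying NMF (good to know; it is not needed in most cases)
- Given a non-negative matrix A, NMF finds two non-negative matrices U and V whose product approximates A. Because the inner dimension of U and V is much smaller than the dimensions of A, this reduces dimensionality and compresses large datasets.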
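
A minimal sketch of the idea, using the classic multiplicative update rules on a small random matrix. This is for illustration only: scikit-learn's `NMF` has its own solvers and initialization, and the matrix size, rank, and iteration count here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))          # toy non-negative matrix
k = 2                           # inner dimension (number of "topics")

U = rng.random((6, k))
V = rng.random((k, 5))
eps = 1e-10                     # avoids division by zero

initial_error = np.linalg.norm(A - U @ V)
for _ in range(200):
    # Multiplicative updates: factors only get rescaled, so they stay non-negative
    V *= (U.T @ A) / (U.T @ U @ V + eps)
    U *= (A @ V.T) / (U @ V @ V.T + eps)
final_error = np.linalg.norm(A - U @ V)

print(initial_error, final_error)
```

In the notebook's terms, `U` plays the role of `docs_nmf` (documents by topics) and `V` the role of `nmf.components_` (topics by words).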

In [19]:
nmf = NMF(8)
docs_nmf = nmf.fit_transform(doc_word)

display_topics(nmf, vectorizer.get_feature_names_out(), 15, 8)


Topic  1
human, robot, technolog, peopl, machin, world, think, futur, brain, job, car, autom, design, game, live

Topic  2
valu, predict, variabl, featur, regress, function, algorithm, linear, set, test, dataset, paramet, gradient, tree, distribut

Topic  3
chatbot, bot, custom, user, convers, messag, chat, servic, messeng, busi, assist, app, interact, voic, answer

Topic  4
network, layer, imag, neural, deep, convolut, neuron, weight, cnn, function, architectur, loss, gener, gan, gradient

Topic  5
file, tensorflow, imag, python, code, instal, api, run, notebook, googl, librari, creat, app, dataset, gpu

Topic  6
blockchain, market, technolog, compani, busi, custom, product, platform, servic, token, develop, industri, user, invest, team

Topic  7
scienc, scientist, cours, work, skill, team, job, peopl, project, engin, busi, analyt, program, deep, start

Topic  8
word, vector, text, sentenc, embed, languag, document, sentiment, nlp, corpu, sequenc, token, context, topic, matrix


### Now let's try the famous LDA

In [20]:
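# Tokenize: lowercases the text and drops tokens that are too short or too long
tokenized_docs = medium['text'].apply(simple_preprocess)
# Build the dictionary (token -> integer id)
dictionary = gensim.corpora.Dictionary(tokenized_docs)
# Drop tokens found in fewer than no_below documents or in more than no_above (a fraction) of all documents
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# Convert each document to a bag-of-words vector
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]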

lda = models.LdaMulticore(corpus=corpus, num_topics=8, id2word=dictionary, passes=10, workers = 4)

lda.print_topics()

[(0,
  '0.014*"human" + 0.008*"peopl" + 0.006*"world" + 0.006*"think" + 0.005*"technolog" + 0.005*"robot" + 0.005*"year" + 0.004*"futur" + 0.003*"chang" + 0.003*"right"'),
 (1,
  '0.013*"code" + 0.012*"python" + 0.010*"file" + 0.010*"run" + 0.007*"tensorflow" + 0.007*"function" + 0.006*"api" + 0.006*"librari" + 0.006*"ll" + 0.006*"build"'),
 (2,
  '0.019*"user" + 0.016*"custom" + 0.014*"chatbot" + 0.012*"bot" + 0.010*"app" + 0.008*"servic" + 0.008*"convers" + 0.008*"product" + 0.006*"busi" + 0.006*"googl"'),
 (3,
  '0.028*"network" + 0.027*"imag" + 0.018*"neural" + 0.014*"layer" + 0.013*"deep" + 0.006*"detect" + 0.006*"result" + 0.006*"task" + 0.006*"model" + 0.006*"featur"'),
 (4,
  '0.012*"scienc" + 0.008*"team" + 0.007*"scientist" + 0.007*"busi" + 0.006*"problem" + 0.006*"cours" + 0.006*"compani" + 0.006*"project" + 0.006*"peopl" + 0.006*"product"'),
 (5,
  '0.024*"word" + 0.012*"text" + 0.008*"user" + 0.008*"vector" + 0.007*"languag" + 0.007*"featur" + 0.006*"similar" + 0.006*"matr
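
To clarify what `no_below` and `no_above` in the `filter_extremes` call filter on, the pure-Python sketch below applies the same document-frequency rule to a toy corpus. It mimics the idea only (it ignores `keep_n`), and the thresholds are scaled down to suit the tiny example.

```python
from collections import Counter

toy_docs = [
    ["network", "layer", "imag"],
    ["network", "chatbot"],
    ["network", "layer", "bot"],
    ["network", "word"],
]

no_below = 2    # keep tokens found in at least this many documents
no_above = 0.5  # ...and in at most this fraction of all documents

# Document frequency: number of documents containing each token
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

kept = sorted(
    token for token, df in doc_freq.items()
    if df >= no_below and df / len(toy_docs) <= no_above
)
print(kept)
```

Only "layer" survives: "network" is in every document, and the remaining tokens each appear in a single document.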

### Putting the results together

In [100]:
# Column names, used below
column_names = ['title', 'url', 'allTags', 'readingTime', 'author', 'Tech',
                'Modeling', 'Chatbots', 'Deep Learning', 'Coding', 'Business',
                'Careers', 'NLP', 'sum']
# Sum of the topic weights for each document
topic_sum = pd.DataFrame(np.sum(docs_nmf, axis = 1))
# Turn the document-topic matrix into a DataFrame
doc_topic_df = pd.DataFrame(data = docs_nmf)
# Combine everything into one table
doc_topic_df = pd.concat([medium[['title', 'url', 'allTags', 'readingTime', 'author']], doc_topic_df, topic_sum], axis = 1)
doc_topic_df.columns = column_names
# Drop documents that have zero weight on every topic
doc_topic_df = doc_topic_df[doc_topic_df['sum'] != 0]
doc_topic_df.drop(columns = 'sum', inplace = True)

Save the result; the recommender below works directly from this table.

In [101]:
doc_topic_df.reset_index(drop = True, inplace = True)
doc_topic_df.to_csv('tfidf_nmf_8topics.csv', index = False)
doc_topic_df.head()

Unnamed: 0,title,url,allTags,readingTime,author,Tech,Modeling,Chatbots,Deep Learning,Coding,Business,Careers,NLP
0,"Private Business, Government and Blockchain",https://medium.com/s/story/private-business-go...,"[Blockchain, Samsung, It]",0.958491,Anar Babaev,0.003306,0.0,0.0,0.0,0.0,0.076125,0.0,0.0
1,Can a robot love us better than another human ...,https://medium.com/s/story/can-a-robot-love-us...,"[Robotics, Meditation, Therapy, Artificial Int...",0.65283,Stewart Alsop,0.052396,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"2017 Big Data, AI and IOT Use Cases",https://medium.com/s/story/2017-big-data-ai-an...,"[Artificial Intelligence, Data Science, Big Da...",7.055031,Melody Ucros,0.020479,0.016317,0.0,0.011528,0.004397,0.045034,0.016111,0.0
3,The Meta Model and Meta Meta-Model of Deep Lea...,https://medium.com/s/story/the-meta-model-and-...,"[Machine Learning, Deep Learning, Artificial I...",5.684906,Carlos E. Perez,0.008826,0.001702,0.0,0.045566,0.000699,0.0,0.009762,0.006317
4,Don’t trust “Do you trust this computer”,https://medium.com/s/story/dont-trust-do-you-t...,"[Artificial Intelligence, Ethics, Elon Musk, D...",2.739623,Virginia Dignum,0.026699,0.00548,0.003155,0.0,0.000637,0.005429,0.017051,0.004065


In [102]:
# doc_topic_df = pd.read_csv('tfidf_nmf_8topics.csv')

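### Recommendation

Compute the similarities in matrix form; looping over every article and computing the similarity one at a time is far too slow.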

In [103]:
topic_names = ['Tech', 'Modeling', 'Chatbots', 'Deep Learning', 'Coding', 'Business', 'Careers', 'NLP']
topic_array = np.array(doc_topic_df[topic_names])
norms = np.linalg.norm(topic_array, axis = 1)  # L2 norm of each article vector
topic_array.shape

(24574, 8)

Compute the cosine similarity

In [104]:
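def compute_dists(top_vec, topic_array):
    # Cosine similarity between top_vec and every row of topic_array
    dots = np.matmul(topic_array, top_vec)
    input_norm = np.linalg.norm(top_vec)
    row_norms = np.linalg.norm(topic_array, axis = 1)
    co_dists = dots / (input_norm * row_norms)
    return co_dists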
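
A quick sanity check of the matrix-form cosine similarity on hand-made vectors. The function below is a self-contained variant for illustration (`cosine_to_all` is a hypothetical name); it computes the row norms from the matrix it is given.

```python
import numpy as np

def cosine_to_all(query, matrix):
    # One matrix-vector product gives all dot products at once
    dots = matrix @ query
    return dots / (np.linalg.norm(query) * np.linalg.norm(matrix, axis=1))

toy_articles = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 2.0],   # orthogonal to the query
    [3.0, 3.0],   # 45 degrees from the query
])
query = np.array([2.0, 0.0])

sims = cosine_to_all(query, toy_articles)
print(sims)
```

The first article scores 1, the second 0, and the third about 0.707, regardless of the vectors' lengths.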

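Produce a recommendation
- top_vec: the user's current input vector
- topic_array: the precomputed topic vectors of all articles
- doc_topic_df: the data table (the recommendation is selected from here)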

In [105]:
def produce_rec(top_vec, topic_array, doc_topic_df, rand = 15):
    
    # rand controls how much random noise perturbs the input vector
    top_vec = top_vec + np.random.rand(len(top_vec))/(np.linalg.norm(top_vec)) * rand
    co_dists = compute_dists(top_vec, topic_array)
    # argmax returns a row position, so select with iloc
    return doc_topic_df.iloc[np.argmax(co_dists)]

Test the recommender

In [106]:
tech = 0
modeling = 0
chatbots = 0
deep = 5
coding = 0
business = 0
careers = 0
nlp = 5

top_vec = np.array([tech, modeling, chatbots, deep, coding, business, careers, nlp])

rec = produce_rec(top_vec, topic_array, doc_topic_df)
rec

title            A deeper understanding of NNets (Part 3) — LST...
url              https://medium.com/s/story/a-deeper-understand...
allTags          [Machine Learning, Lstm, Recurrent Neural Netw...
readingTime                                                7.22767
author                                               Pranjal Yadav
Tech                                                    0.00822458
Modeling                                                 0.0122572
Chatbots                                                         0
Deep Learning                                            0.0240986
Coding                                                           0
Business                                                         0
Careers                                                 0.00971773
NLP                                                      0.0259292
Name: 14559, dtype: object
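
`produce_rec` returns only the single best match. If several recommendations are wanted, one option is to sort the similarities and take the top few. The sketch below does this on toy similarity scores (`top_n_indices` is a hypothetical helper, not part of the notebook).

```python
import numpy as np

def top_n_indices(similarities, n=3):
    # argsort is ascending, so reverse it and keep the first n positions
    return np.argsort(similarities)[::-1][:n]

toy_sims = np.array([0.10, 0.95, 0.40, 0.80, 0.05])
best = top_n_indices(toy_sims, n=3)
print(best)
```

The resulting positions could then be used to select rows from the results table, for example with `doc_topic_df.iloc[best]`.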