## Рекомендательная система для подбора образовательных курсов



Задача: рекомендовать пользователю наилучший образовательный курс по его запросу и предпочтениям
Идея: сравнивать образовательные курсы так же как сравнивают фильмы - по оценкам пользователей, жанрам (областям научного знания)

## Импорт библиотек

In [2]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from tqdm import tqdm_notebook

import gensim
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

from sklearn import utils
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, f1_score

import seaborn as sns

import multiprocessing

import re

from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer

%matplotlib inline

In [4]:
#Загрузка данных
data = pd.read_csv('data.csv')
data=data.drop(['Unnamed: 0'], axis=1)
data['skills'] = data.skills.fillna(value = '')
data.head()

Unnamed: 0,course_id,reviewer_name,rating,review_text,title,topics,about,instructors,average_score,ratings_count,reviews_count,skills,syllabus,recommendations,url,already_enrolled,recent_views,recent_views_conversion,hours_to_complete,level_range
0,2-speed-it,Ravish,5,Very relevant and useful course designed for CIOs,Two Speed IT: How Companies Can Surf the Digit...,Business Business Essentials,"Transform or disappear, the Darwinism of IT: I...",Antoine Gourévitch Vanessa Lyon Eric Baudson,4.4,33,33,,Introduction IT and the CIO in the Digital Wor...,fundamentals-of-management entrepreneurial-thi...,https://www.coursera.org/learn/2-speed-it,16728,5149,324.9,21.0,0.0
1,2-speed-it,Etienne R,2,This course does not say anything about digiti...,Two Speed IT: How Companies Can Surf the Digit...,Business Business Essentials,"Transform or disappear, the Darwinism of IT: I...",Antoine Gourévitch Vanessa Lyon Eric Baudson,4.4,33,33,,Introduction IT and the CIO in the Digital Wor...,fundamentals-of-management entrepreneurial-thi...,https://www.coursera.org/learn/2-speed-it,16728,5149,324.9,21.0,0.0
2,2-speed-it,Viswas P,4,Videos that are presented in French could've b...,Two Speed IT: How Companies Can Surf the Digit...,Business Business Essentials,"Transform or disappear, the Darwinism of IT: I...",Antoine Gourévitch Vanessa Lyon Eric Baudson,4.4,33,33,,Introduction IT and the CIO in the Digital Wor...,fundamentals-of-management entrepreneurial-thi...,https://www.coursera.org/learn/2-speed-it,16728,5149,324.9,21.0,0.0
3,2-speed-it,AN L,3,"The course content is quite good, though it co...",Two Speed IT: How Companies Can Surf the Digit...,Business Business Essentials,"Transform or disappear, the Darwinism of IT: I...",Antoine Gourévitch Vanessa Lyon Eric Baudson,4.4,33,33,,Introduction IT and the CIO in the Digital Wor...,fundamentals-of-management entrepreneurial-thi...,https://www.coursera.org/learn/2-speed-it,16728,5149,324.9,21.0,0.0
4,2-speed-it,Konstantin A,5,"Great piece of work, I especially liked a few ...",Two Speed IT: How Companies Can Surf the Digit...,Business Business Essentials,"Transform or disappear, the Darwinism of IT: I...",Antoine Gourévitch Vanessa Lyon Eric Baudson,4.4,33,33,,Introduction IT and the CIO in the Digital Wor...,fundamentals-of-management entrepreneurial-thi...,https://www.coursera.org/learn/2-speed-it,16728,5149,324.9,21.0,0.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159253 entries, 0 to 159252
Data columns (total 20 columns):
course_id                  159253 non-null object
reviewer_name              159253 non-null object
rating                     159253 non-null int64
review_text                159246 non-null object
title                      159253 non-null object
topics                     159253 non-null object
about                      159253 non-null object
instructors                159253 non-null object
average_score              159253 non-null float64
ratings_count              159253 non-null int64
reviews_count              159253 non-null int64
skills                     159253 non-null object
syllabus                   159253 non-null object
recommendations            151143 non-null object
url                        159253 non-null object
already_enrolled           159253 non-null int64
recent_views               159253 non-null int64
recent_views_conversion    159253 non-null 

### Рекомендация курса пользователю на основании интересов пользователя (ранее оценённых им курсов)
предсказываем оценку, которую поставил бы пользователь не просмотренному им курсу на основании регрессии, затем выбираем курс с вероятной наивысшей оценкой
- на фичах: TF-IDF на topics & skills

- средние оценки (+ median, variance, etc.) пользователя

In [9]:
#объединим topics & skills в новую фичу key_words
data['key_words'] = data.topics.map(str) + ' ' + data.skills
data.tail(1000)

Unnamed: 0,course_id,reviewer_name,rating,review_text,title,topics,about,instructors,average_score,ratings_count,...,skills,syllabus,recommendations,url,already_enrolled,recent_views,recent_views_conversion,hours_to_complete,level_range,key_words
158253,women-in-leadership,Яна Ч,5,I really enjoy this course. I was little sarca...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158254,women-in-leadership,MAGALI A S,3,"Hi, there is very valuable in the videos, but ...",Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158255,women-in-leadership,Ted B,5,This is a great course. Thank you for all the...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158256,women-in-leadership,Shelina R,4,Great course! The information was great. I enj...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158257,women-in-leadership,Lilija M,5,This is probably the best course I have taken ...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158258,women-in-leadership,Giada B,5,"Inspiring, insightful and",Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158259,women-in-leadership,Roopali S,5,Wonderful course. All women should attend,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158260,women-in-leadership,Angélica T,5,Thought-provoking course that validated many o...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158261,women-in-leadership,Merrill C,5,Really great professor and material! I found t...,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...
158262,women-in-leadership,Claude D,5,very inspirational,Women in Leadership: Inspiring Positive Change,Business Leadership and Management,This course aims to inspire and empower women ...,Diana Bilimoria PhD,4.6,37,...,Assertiveness Communication Negotiation Leader...,WEEK 1: Yourself as a Leader—Developing your L...,coaching-conversations coaching-practices,https://www.coursera.org/learn/women-in-leader...,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...


In [11]:
data_t_s = data[['course_id','reviewer_name','rating','title','average_score',
                 'ratings_count','reviews_count','already_enrolled','recent_views',
                 'recent_views_conversion','hours_to_complete','level_range','key_words']]

#### Токенизация и очистка данных
Сделаем токенизацию слов из текстов topics и skills

In [13]:
mystopwords = stopwords.words('english') + ["i'm", '-', "i've"] 
regex = re.compile("['A-Za-z\-]+")

def tokenize(text, regex=regex, stopwords=mystopwords):
    """ Tokenize all tokens from text string
        Returns array of tokens
    """
    try:
        text = " ".join(regex.findall(text)).lower()
        tokens = ' '.join([token for token in text.split(' ') if not token in stopwords])
        return tokens
    except:
        return []

In [14]:
data_t_s['key_words_tokenize'] = data_t_s.key_words.apply(tokenize)
data_t_s.tail(1000)

Unnamed: 0,course_id,reviewer_name,rating,title,average_score,ratings_count,reviews_count,already_enrolled,recent_views,recent_views_conversion,hours_to_complete,level_range,key_words,key_words_tokenize
158253,women-in-leadership,Яна Ч,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158254,women-in-leadership,MAGALI A S,3,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158255,women-in-leadership,Ted B,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158256,women-in-leadership,Shelina R,4,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158257,women-in-leadership,Lilija M,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158258,women-in-leadership,Giada B,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158259,women-in-leadership,Roopali S,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158260,women-in-leadership,Angélica T,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158261,women-in-leadership,Merrill C,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...
158262,women-in-leadership,Claude D,5,Women in Leadership: Inspiring Positive Change,4.6,37,37,9752,8957,108.9,14.0,0.0,Business Leadership and Management Assertivene...,business leadership management assertiveness c...


In [44]:
data_t_s_1 = data_t_s.head(1000)
data_t_s_tok = data_t_s_1
data_t_s_tok['key_words_tokenize']

0                           business business essentials
1                           business business essentials
2                           business business essentials
3                           business business essentials
4                           business business essentials
5                           business business essentials
6                           business business essentials
7                           business business essentials
8                           business business essentials
9                           business business essentials
10                          business business essentials
11                          business business essentials
12                          business business essentials
13                          business business essentials
14                          business business essentials
15                          business business essentials
16                          business business essentials
17                          bus

In [None]:
key_words = []
for i in tqdm_notebook(data_t_s_tok.key_words_tokenize.str.split(' ')):
        for j in i :
            key_words.append(j)

HBox(children=(IntProgress(value=0, max=159253), HTML(value='')))




In [None]:
dict_key_words_idf = {i:np.log(len(data_t_s_tok)/key_words.count(i)) for i in key_words}
dict_key_words_idf

In [None]:
len(sorted(dict_key_words_idf.items(), key=lambda kv: kv[1]) )

In [None]:
for i in dict_key_words_idf:
    data_t_s_tok['tf_idf_'+i] = data_t_s_tok.apply(lambda row: 
                                   (1/len(row['key_words'].split(' ')))*dict_key_words_idf[i]
                                   if i in row['key_words'] else 0, axis=1)

In [None]:
data_t_s_tok.head()

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
label_encoder = LabelEncoder()
data_t_s_1['new_course_id'] = pd.Series(label_encoder.fit_transform(data_t_s_tok['course_id']))
data_t_s_tok['new_course_id'] = pd.Series(label_encoder.fit_transform(data_t_s_tok['course_id']))
data_t_s_tok['new_course_id'].value_counts()

In [None]:
data_t_s_tok['reviewer_id'] = pd.Series(label_encoder.fit_transform(data_t_s_tok['reviewer_name']))
data_t_s_tok['reviewer_id'].value_counts().head()

In [None]:
data_t_s_tok

In [None]:
data_t_s_tok = data_t_s_tok.drop(['course_id'], axis=1)
data_t_s_tok = data_t_s_tok.drop(['reviewer_name'], axis=1)
data_t_s_tok = data_t_s_tok.drop(['key_words'], axis=1)
data_t_s_tok = data_t_s_tok.drop(['key_words_tokenize'], axis=1)

In [None]:
data_t_s_tok.tail(50)

In [None]:
# сохраним data_t_s_tok  
data_t_s_tok.to_csv('data_t_s_tok.csv') #выполняется 1 раз

In [None]:
#разделим обучающую и тестовую выборки
from sklearn.model_selection import train_test_split

In [None]:
X = data_t_s_tok.drop(columns=['rating', 'average_score'])
y = data_t_s_tok['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
result = pd.DataFrame(y_test.reset_index(drop=True), columns=['target'])

## Линейная регрессия для предсказания оценки пользователя
Используем линейную регрессию в качестве бейзлайна, чтобы предсказать переменную - оценку пользователей (rating)

In [58]:
from sklearn.linear_model import LinearRegression # метод наименьших квадратов
from sklearn.metrics import mean_squared_log_error, mean_squared_error

In [59]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [60]:
predictions = model.predict(X_test)

In [61]:
mean_squared_error(predictions, y_test)

0.7711057362460693

In [62]:
X_test['predictions'] = predictions

In [63]:
X_test['LinearRegression_predictions'] = X_test['predictions']
models_res = X_test['LinearRegression_predictions']
result['LinearRegression'] = models_res #для визуализации

In [64]:
results = X_test.merge(data_t_s_1, how='left', on='new_course_id')[
                                                        [ 'reviewer_name' ,'course_id', 'predictions', 'rating',]]
results.sort_values('predictions', ascending=False).head()

Unnamed: 0,reviewer_name,course_id,predictions,rating
28825,Thomas,ableton-live,4.795733,4
28807,Thomas J,ableton-live,4.795733,5
28795,Graham M,ableton-live,4.795733,5
28796,Alberto B,ableton-live,4.795733,5
28797,Gonza M,ableton-live,4.795733,5


#### Предскажем оценки пользователей используя RandomForestRegressor

In [65]:
from sklearn.ensemble import RandomForestRegressor

In [66]:
X = data_t_s_tok.drop(columns=['rating', 'average_score'])
y = data_t_s_tok['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

In [67]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [68]:
predictions = model.predict(X_test)
X_test['RandomForestRegressor_predictions'] = predictions
models_res.RandomForestRegressor_predictions = X_test['RandomForestRegressor_predictions']

print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, predictions)))

root_mean_squared_error =  0.9202480543209431


In [69]:
print(model.feature_importances_)

[0.03833874 0.03579585 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.00643615
 0.         0.         0.         0.00203631 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.07570934 0.84168361]


#### Попробуем применить к исходным key_words CountVectorizer и TfidfTransformer

In [70]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

In [71]:
cv_data_t_s_tok = CountVectorizer()
tf_data_t_s_tok = TfidfTransformer()

In [72]:
data_t_s_tok_ = cv_data_t_s_tok.fit_transform(data_t_s_1.key_words_tokenize)

In [73]:
cv_data_t_s_tok.get_feature_names()

['abelton',
 'accounting',
 'analytics',
 'art',
 'arts',
 'audio',
 'basic',
 'business',
 'computer',
 'development',
 'earnings',
 'education',
 'english',
 'entrepreneurship',
 'essentials',
 'file',
 'finance',
 'health',
 'human',
 'humanities',
 'interaction',
 'language',
 'learning',
 'live',
 'management',
 'materials',
 'midi',
 'mixing',
 'music',
 'new',
 'personal',
 'product',
 'programming',
 'recording',
 'science',
 'sciences',
 'social']

In [74]:
len(cv_data_t_s_tok.get_feature_names())

37

In [75]:
tfidf_data_t_s_tok= tf_data_t_s_tok.fit_transform(data_t_s_tok_)

In [76]:
tfidf_data_t_s_tok

<1000x37 sparse matrix of type '<class 'numpy.float64'>'
	with 6875 stored elements in Compressed Sparse Row format>

In [77]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_data_t_s_tok, y, test_size=0.3,random_state=42)

In [78]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [79]:
print(model.feature_importances_)

[0.00000000e+00 1.66919854e-03 1.60302690e-02 9.37147062e-04
 1.16818027e-02 2.02054117e-04 5.74564863e-05 6.90237667e-01
 0.00000000e+00 2.40992667e-02 4.35343264e-04 7.94804685e-02
 3.03449272e-03 6.09582629e-02 1.82121451e-02 3.96906843e-04
 1.00300725e-02 4.88767495e-03 7.48451378e-04 0.00000000e+00
 0.00000000e+00 1.45604140e-02 1.12151676e-02 0.00000000e+00
 1.16719192e-04 7.41916010e-05 0.00000000e+00 0.00000000e+00
 0.00000000e+00 1.38798547e-03 1.37138220e-02 0.00000000e+00
 0.00000000e+00 6.27895130e-04 5.47831274e-03 1.49617731e-02
 1.47650397e-02]


In [80]:
predictions1 = model.predict(X_test)
#X_test['RandomForestRegressor'] = model.predict(X_test)
#models_res.RandomForestRegressor1 = X_test['RandomForestRegressor1']
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, predictions1)))

root_mean_squared_error =  0.8320153589129327


#### Попробуем применить GridSearch к моделям

In [81]:
from sklearn.model_selection import GridSearchCV

In [82]:
%%time
lr_params = {
    'fit_intercept':[False, True]
}

lr = LinearRegression()
grid_lr = GridSearchCV(lr, lr_params,
                       scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_lr.fit(X_train, y_train)

print(grid_lr.best_params_)
print(grid_lr.best_score_)
print(grid_lr.best_estimator_)

{'fit_intercept': True}
-0.6704581691859347
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
CPU times: user 46.7 ms, sys: 61.4 ms, total: 108 ms
Wall time: 3.93 s


In [83]:
#Оценим тчоностьна тесте
grid_lr.score(X_test,y_test)

-0.6949992592228276

In [84]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_lr.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.8336661557379114


In [85]:
np.sqrt(-grid_lr.score(X_test,y_test))

0.8336661557379114

#### Предскажем оценки пользователей используя KNeighborsRegressor

In [86]:
from sklearn.model_selection import RandomizedSearchCV

In [87]:
RandomizedSearchCV

sklearn.model_selection._search.RandomizedSearchCV

In [88]:
from sklearn.neighbors import KNeighborsRegressor

In [89]:
%%time

knn_params = {
    'n_neighbors':list(range(1, 30))
   ,'weights': ['uniform', 'distance']
   ,'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
   ,'leaf_size':list(range(5, 30))
}

grid_knn = RandomizedSearchCV(KNeighborsRegressor(), knn_params,  scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_knn.fit(X_train, y_train)
print(grid_knn.best_params_)
print(grid_knn.best_score_)
print(grid_knn.best_estimator_)

{'weights': 'distance', 'n_neighbors': 25, 'leaf_size': 20, 'algorithm': 'ball_tree'}
-0.6829959086213447
KNeighborsRegressor(algorithm='ball_tree', leaf_size=20, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=25, p=2,
          weights='distance')
CPU times: user 52.6 ms, sys: 6.42 ms, total: 59 ms
Wall time: 335 ms




In [90]:
grid_knn.score(X_test,y_test)

-0.691242279964622

In [91]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_knn.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.8314098146910596


#### Предскажем оценки пользователей используя DecisionTreeRegressor

In [92]:
from sklearn.tree import DecisionTreeRegressor

In [93]:
%%time
dt_params = {
    'max_depth':[None,1,2,5,10,25,50],
     'min_samples_split':[2,5,8,10,25,50],
    
    'min_weight_fraction_leaf': [0, 0.01, 0.1, 0.15, 0.25, 0.5] ,
    'min_samples_leaf':list(range(1, 10)),
     'criterion':  ['mse', 'friedman_mse', 'mae'],
    'max_features':list(range(1, 13)) }
#model_forest = RandomForestRegressor(
    

grid_dt =  RandomizedSearchCV(DecisionTreeRegressor(),dt_params,  scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_dt.fit(X_train, y_train)
print(grid_dt.best_params_)
print(grid_dt.best_score_)
print(grid_dt.best_estimator_)

{'min_weight_fraction_leaf': 0.1, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 10, 'max_depth': 5, 'criterion': 'friedman_mse'}
-0.665238211666703
DecisionTreeRegressor(criterion='friedman_mse', max_depth=5, max_features=10,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=4,
           min_samples_split=10, min_weight_fraction_leaf=0.1,
           presort=False, random_state=None, splitter='best')
CPU times: user 38.2 ms, sys: 2.81 ms, total: 41 ms
Wall time: 227 ms


In [94]:
grid_dt.score(X_test,y_test)

-0.6925337264640157

In [95]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_dt.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.8321861128762097


#### Предскажем оценки пользователей используя RandomForestRegressor

In [96]:
%%time
rf_params = {
    'n_estimators': [1,10,20,30,40,50,60,80,90],
    
    'max_depth':[None,1,2,5,10,25,50],
    
    'min_samples_leaf':list(range(1, 10)),
   
    'max_features':list(range(1, 13)) ,
     'criterion':  ['mse', 'friedman_mse', 'mae']}
    

grid_rf =  RandomizedSearchCV(RandomForestRegressor(),rf_params,scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_rf.fit(X_train, y_train)
print(grid_rf.best_params_)
print(grid_rf.best_score_)
print(grid_rf.best_estimator_)

{'n_estimators': 30, 'min_samples_leaf': 9, 'max_features': 6, 'max_depth': 2, 'criterion': 'mse'}
-0.6637947498018606
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=9,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)
CPU times: user 114 ms, sys: 8.12 ms, total: 122 ms
Wall time: 1.81 s


In [97]:
grid_rf.score(X_test,y_test)

-0.6936030491545594

In [98]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_rf.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.8328283431503514


#### Предскажем оценки пользователей используя GradientBoostingRegressor

In [99]:
%%time
from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor
gb_params = {
          'n_estimators': [1,10,20,30,40,50,60,80,90],
          'max_features': list(range(1, 13)),
             'max_depth':[None,1,2,5,10,25,50],
            'learning_rate': [0.1,0.3,0.5,0.7],
               #'min_samples_split':[2,5,8,10,25,50],
            'min_samples_leaf':list(range(1, 10)),}
    
grid_gb =  RandomizedSearchCV(GradientBoostingRegressor(),gb_params,scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_gb.fit(X_train, y_train)
print(grid_gb.best_params_)
print(grid_gb.best_score_)
print(grid_gb.best_estimator_)

{'n_estimators': 60, 'min_samples_leaf': 4, 'max_features': 7, 'max_depth': 1, 'learning_rate': 0.1}
-0.6668181210906947
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=1, max_features=7,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=4,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=60, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
CPU times: user 66.8 ms, sys: 3.65 ms, total: 70.4 ms
Wall time: 597 ms


In [100]:
grid_gb.score(X_test,y_test)

-0.6920897760177771

In [101]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_gb.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.831919332638554


#### Предскажем оценки пользователей используя SVR

In [102]:
from sklearn.svm import SVR

In [103]:
%%time
SVR_params = {
          #'kernel':['linear', 'poly','rbf', 'sigmoid', 'precomputed'],
             'C' :[0.001, 0.01, 0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1, 1]}

grid_SVR =   GridSearchCV(SVR(),SVR_params,scoring='neg_mean_squared_error',
                       cv=3,n_jobs=-1)
grid_SVR.fit(X_train, y_train)
print(grid_SVR.best_params_)
print(grid_SVR.best_score_)
print(grid_SVR.best_estimator_)

{'C': 0.001, 'gamma': 0.1}
-0.7482732832688053
SVR(C=0.001, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
CPU times: user 65.1 ms, sys: 4.26 ms, total: 69.3 ms
Wall time: 610 ms


In [104]:
grid_SVR.score(X_test,y_test)

-0.7806339780934861

In [105]:
print('root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, grid_SVR.best_estimator_.predict(X_test))))

root_mean_squared_error =  0.8835349331483652


In [106]:
models=['SVR','gb','dt','rf','knn','lr']

In [107]:
for model_ in models:
    print(model_,'---','root_mean_squared_error = ', np.sqrt(mean_squared_error(y_test, eval(
                                                        'grid_'+model_+'.best_estimator_').predict(X_test))))


SVR --- root_mean_squared_error =  0.8835349331483652
gb --- root_mean_squared_error =  0.831919332638554
dt --- root_mean_squared_error =  0.8321861128762097
rf --- root_mean_squared_error =  0.8328283431503514
knn --- root_mean_squared_error =  0.8314098146910596
lr --- root_mean_squared_error =  0.8336661557379114


In [108]:
#видим, что нилучший результат у модели SVR

In [109]:
from plotly.offline import init_notebook_mode, iplot
import plotly
import plotly.graph_objs as go

init_notebook_mode(connected=True)

In [110]:
# Визуализируем результаты обучения
"""
columns = result.columns

traces=[]
for i in columns:
    traces.append(go.Scatter(
                    x=result.index,
                    y=result[i],
                    name=i,
                    orientation = 'v')
                 )

layout = {'title': 'Models result'}
fig = go.Figure(data=traces, layout=layout)

iplot(fig, show_link=False)
"""

"\ncolumns = result.columns\n\ntraces=[]\nfor i in columns:\n    traces.append(go.Scatter(\n                    x=result.index,\n                    y=result[i],\n                    name=i,\n                    orientation = 'v')\n                 )\n\nlayout = {'title': 'Models result'}\nfig = go.Figure(data=traces, layout=layout)\n\niplot(fig, show_link=False)\n"

### Рекомендация курсов на основании содержания текста review

In [111]:
# Три варианта положительных комментариев
for i, text in data[data.rating==5].head(3).iterrows():
    print("Good comment: \n {0} \n".format(text['review_text']))

Good comment: 
 Very relevant and useful course designed for CIOs 

Good comment: 
 Great piece of work, I especially liked a few 'lifehacks' for the CIO 

Good comment: 
 Excellent course, for me it was very rewarding and the terms used and the tools given were excellent, and today and I put in use in my job, Thank you for inculcating knowledge and move on               



In [112]:
# Три варианта отрицательных(негативных) комментариев
for i, text in data[data.rating==1].head(3).iterrows():
    print("Bad comment: \n {0} \n".format(text['review_text']))

Bad comment: 
 Till now no assigment for my work on week 4. 

Bad comment: 
 This course doesn't contain any new information. It does not teach you but just excitedly shows commonly known facts.There are better ways to invest your time. 

Bad comment: 
 I do not find very interesting this course. too many interviews. It could works for the first course, but not for the second. I was expecting to have more technical material and lessons. 



### Токенизация и очистка данных
Сделаем токенизацию слов из текстов review

In [113]:
mystopwords = stopwords.words('english') + ["i'm", '-', "i've"] + ["\\", "\"", "'", "\'"]
regex = re.compile("['A-Za-z\-]+")

def tokenize(text, regex=regex, stopwords=mystopwords):
    """ Tokenize all tokens from text string
        Returns array of tokens
    """
    try:
        text = " ".join(regex.findall(text)).lower()
        tokens = ' '.join([token for token in text.split(' ') if not token in stopwords])
        return tokens
    except:
        return []

In [114]:
data['review_text_tokenize'] = data.review_text.apply(tokenize)

In [115]:
data['review_text_tokenize']

0                      relevant useful course designed cios
1         course say anything digitization core subject ...
2         videos presented french could've translated en...
3         course content quite good though could deeper ...
4         great piece work especially liked 'lifehacks' cio
5         excellent course rewarding terms used tools gi...
6         excellent representation day day thanks sharin...
7                            interesting well-designed mooc
8         completion course progress well reviews taking...
9         nice course macro ideias several areas pretty ...
10        really liked presentation slides really clear ...
11        un cours vraiment int ressant qui fait chos de...
12        expectation course huge many people told cours...
13        course really helpful understanding strategy o...
14        excellent course really learned lot role chall...
15        insightful course transformations backed solid...
16                                      

### Определим частоту слов построим облако слов для того чтобы понять о чем большинство текстов

In [116]:
from collections import Counter

lemmata = []
for index, row in data.iterrows():
    lemmata += row['review_text_tokenize'].split()
cnt = Counter(lemmata)

for i in cnt.most_common(15):
    print(i)

AttributeError: 'list' object has no attribute 'split'

In [121]:
# Количество слов в словаре:
print(len(cnt))

NameError: name 'cnt' is not defined

In [122]:
from wordcloud import *
word_freq = [i for i in cnt.most_common(100)]
wd = WordCloud(background_color = 'white')
wd.generate_from_frequencies(dict(word_freq))
plt.figure()
plt.imshow(wd, interpolation = 'bilinear')
plt.axis('off')
plt.show()

NameError: name 'cnt' is not defined

### Сформируем сбалансированный датасет c обучающей и тестовой выборкой

In [123]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)

X = data[['review_text_tokenize']]
y = data['rating']

X_balanced, y_balanced = rus.fit_resample(X, y)

In [124]:
balanced = pd.DataFrame.from_dict({'review_text_tokenize': X_balanced[:,0], 'rating': y_balanced}) 
balanced.head()

Unnamed: 0,review_text_tokenize,rating
0,material extremely fragmented seems like instr...,1
1,poor design presentation assignment,1
2,totally unsufficient guidance external tools n...,1
3,find course added anything already learned see...,1
4,pesimo,1


In [125]:
# Разделим на обучающую и тестовую выборку
train, test = train_test_split(balanced, test_size=0.2, random_state=42)

### Построим векторную модель с помощью Doc2Vec

для предсказания пользовательских оценок

In [126]:
import multiprocessing
cores = multiprocessing.cpu_count()

In [127]:
train_tagged = train.apply(
    lambda r: TaggedDocument(words=r['review_text_tokenize'].split(' '), tags=[r.rating]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=r['review_text_tokenize'].split(' '), tags=[r.rating]), axis=1)

In [128]:
train_tagged.values[30]

TaggedDocument(words=['interesting', 'course', 'way', 'many', 'quizzes', 'extremely', 'tedious', 'would', 'recommend', 'course', 'anyone'], tags=[2])

In [129]:
model_dbow = Doc2Vec(dm=0, vector_size=2000, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])

100%|██████████| 13028/13028 [00:00<00:00, 1794528.49it/s]


In [130]:
# Обучаем модель
for epoch in range(10):
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████| 13028/13028 [00:00<00:00, 1769597.22it/s]
100%|██████████| 13028/13028 [00:00<00:00, 1416402.51it/s]
100%|██████████| 13028/13028 [00:00<00:00, 1542727.06it/s]
100%|██████████| 13028/13028 [00:00<00:00, 2944148.30it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3329477.97it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3473391.34it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3156023.59it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3013643.97it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3363084.23it/s]
100%|██████████| 13028/13028 [00:00<00:00, 3412226.33it/s]


In [131]:
# Сформируем итоговый набор векторов для обучения
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors

In [132]:
## Логистическая регрессия

In [133]:
y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

In [134]:
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [135]:
print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))

Testing accuracy 0.39146453791832975
Testing F1 score: 0.38899449316649926


In [1]:
mean_squared_error(y_pred, y_test)

NameError: name 'mean_squared_error' is not defined