### Task 1

*Review TF-IDF on your own (documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
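
As a refresher, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the smoothed IDF described in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with raw term counts as TF and L2 normalization of each row (the `TfidfVectorizer` defaults). The corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of term j in doc i."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (assumed formula, see the paragraph above).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # L2-normalize the row; an empty document keeps an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = [["match", "team", "goal"], ["match", "court", "law"], ["law", "tax"]]
vocab, rows = tfidf_rows(toy_docs)
```

In the first toy document, "team" occurs in one document only and therefore gets a larger weight than "match", which occurs in two.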

In [1]:
# !pip install razdel pymorphy2

In [3]:
import pandas as pd

In [4]:
news = pd.read_csv("articles.csv")
print(news.shape)
news.head(3)

(27000, 2)


Unnamed: 0,doc_id,title
0,6,Заместитель председателяnправительства РФnСерг...
1,4896,Матч 1/16 финала Кубка России по футболу был п...
2,4897,Форвард «Авангарда» Томаш Заборский прокоммент...


In [5]:
users = pd.read_csv("users_articles.csv")
users.head(3)

Unnamed: 0,uid,articles
0,u105138,"[293672, 293328, 293001, 293622, 293126, 1852]"
1,u108690,"[3405, 1739, 2972, 1158, 1599, 322665]"
2,u108339,"[1845, 2009, 2356, 1424, 2939, 323389]"


In [6]:
# !pip install gensim

In [7]:
# text preprocessing
import re
import numpy as np
from gensim.corpora.dictionary import Dictionary
from razdel import tokenize  # splits Russian text into tokens and sentences: https://github.com/natasha/razdel
import pymorphy2  # morphological analyzer

In [8]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rokkar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
stopword_ru = stopwords.words('russian')
print(len(stopword_ru))

151


In [10]:
with open('stopwords.txt') as f:
    additional_stopwords = [w.strip() for w in f.readlines() if w]
    
stopword_ru += additional_stopwords
len(stopword_ru)

776

In [11]:
def clean_text(text):
    '''
    Clean raw text.

    Returns the cleaned text.
    '''
    if not isinstance(text, str):
        text = str(text)
    
    text = text.lower()
    text = text.strip('\n').strip('\r').strip('\t')
    text = re.sub(r"-\s\r\n\|-\s\r\n|\r\n", '', text)

    # drop digits and punctuation (a single character class avoids the nested-set FutureWarning)
    text = re.sub(r"[0-9\-—.,:;_%©«»?*!@#№$^•·&()+=\[\]/]", '', text)
    text = re.sub(r"\r\n\t|\n|\\s|\r\t|\\n", ' ', text)
    text = re.sub(r'[\xad\s]', ' ', text.strip())
    # the source data appears to contain a literal 'n' where line breaks used to be;
    # note that this also removes every other Latin 'n'
    text = re.sub('n', ' ', text)
    
    return text

cache = {}
morph = pymorphy2.MorphAnalyzer()

def lemmatization(text):    
    '''
    Lemmatization:
        [0] cast the input to `str` if it is another type
        [1] tokenize the text with razdel
        [2] strip a leading '-' from the token
        [3] skip single-character tokens
        [4] look the token up in the cache
        [5] lemmatize the token
        [6] filter out stop words

    Returns a list of lemmatized tokens.
    '''

    # [0]
    if not isinstance(text, str):
        text = str(text)
    
    # [1]
    tokens = list(tokenize(text))
    words = [_.text for _ in tokens]

    words_lem = []
    for w in words:
        if w[0] == '-': # [2]
            w = w[1:]
        if len(w) > 1: # [3]
            if w in cache: # [4]
                words_lem.append(cache[w])
            else: # [5]
                temp_cach = cache[w] = morph.parse(w)[0].normal_form
                words_lem.append(temp_cach)
    
    words_lem_without_stopwords = [i for i in words_lem if i not in stopword_ru] # [6]
    
    return words_lem_without_stopwords

In [12]:
%%time
from tqdm import tqdm
tqdm.pandas()

# Run the text cleaning. This takes a while...
news['title'] = news['title'].progress_apply(clean_text)

100%|██████████| 27000/27000 [00:28<00:00, 933.46it/s] 


Wall time: 29.7 s


In [13]:
news['title'].iloc[:10]

0    заместитель председателя правительства рф серг...
1    матч  финала кубка россии по футболу был приос...
2    форвард авангарда томаш заборский прокомментир...
3    главный тренер кубани юрий красножан прокоммен...
4    решением попечительского совета владивостокско...
5    ио главного тренера вячеслав буцаев прокоммент...
6    запорожский металлург дома потерпел разгромное...
7    сборная сша одержала победу над австрией со сч...
8    бывший защитник сборной россии дарюс каспарайт...
9    полузащитник цска зоран тошич после победы над...
Name: title, dtype: object

In [14]:
%%time
# Run the lemmatization. This takes a long time...
news['title'] = news['title'].progress_apply(lemmatization)

100%|██████████| 27000/27000 [03:33<00:00, 126.38it/s]


Wall time: 3min 33s


In [15]:
# build the list of tokenized texts
texts = list(news['title'].values)

# build the dictionary and the bag-of-words corpus from those texts
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]
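
To make the bag-of-words representation concrete, the sketch below mimics what a dictionary plus `doc2bow` produce: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is a plain-Python illustration, not gensim's implementation; the helper names `build_vocab` and `to_bow` are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

toy_texts = [["match", "team", "match"], ["court", "law"]]
toy_vocab = build_vocab(toy_texts)
toy_bow = to_bow(["match", "law", "match", "unknown"], toy_vocab)
```

Word order is discarded and only counts remain, which is the input format that LDA works with.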

In [16]:
N_topic = 20

In [17]:
%%time
from gensim.models import LdaModel

# Train the model on the corpus
lda = LdaModel(common_corpus, num_topics=N_topic, id2word=common_dictionary, passes=5)

Wall time: 2min 38s


In [18]:
from gensim.test.utils import datapath

# Save the model to disk
temp_file = datapath("model.lda")
lda.save(temp_file)

In [19]:
# Load the trained model from disk
lda = LdaModel.load(temp_file)

In [20]:
# Build a small corpus to run inference on (the first three training documents are reused here, so they are not truly unseen)
other_texts = list(news['title'].iloc[:3])
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[2]
print(other_texts[2])
lda[unseen_doc] 

['форвард', 'авангард', 'томаш', 'заборский', 'прокомментировать', 'игра', 'свой', 'команда', 'матч', 'чемпионат', 'кхл', 'против', 'атланта', 'провести', 'плохой', 'матч', 'нижний', 'новгород', 'против', 'торпедо', 'настраиваться', 'первый', 'минута', 'включиться', 'работа', 'сказать', 'заборский', 'получиться', 'забросить', 'быстрый', 'гол', 'задать', 'хороший', 'темп', 'поединок', 'мочь', 'играть', 'ещё', 'хороший', 'сторона', 'пять', 'очко', 'выезд', 'девять', 'это', 'хороший']


[(1, 0.23695453),
 (5, 0.045665275),
 (6, 0.09033788),
 (7, 0.089774765),
 (9, 0.13232376),
 (12, 0.026187146),
 (13, 0.023459477),
 (15, 0.04525591),
 (18, 0.29771447)]

In [21]:
x = lda.show_topics(num_topics=N_topic, num_words=7, formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

# Print only the words
for topic, words in topics_words:
    print(f"topic_{topic}: " + " ".join(words))

topic_0: земля миссия участок дыра мышь дождь журнал
topic_1: игра сезон команда таиланд сочи матч парка
topic_2: сша американский китай российский сила турция двигатель
topic_3: компания млн британский пенсия стоимость продажа великобритания
topic_4: газ ракета тело обнаружить северный вода который
topic_5: фильм ступень петербург парк александр письмо египетский
topic_6: президент путин сша заявить соглашение владимир государство
topic_7: год это млрд рубль военный тыс млн
topic_8: год день который город стать русский москва
topic_9: франция французский сообщать италия испания париж лауреат
topic_10: закон законопроект налоговый германия инвестиция законодательство выплата
topic_11: дело который суд это свой сотрудник человек
topic_12: рост цена уровень снижение писать россиянин население
topic_13: космонавт мэй кладбище нил челябинский похороны эндрю
topic_14: фонд тверской управляемый саратовский югра хх послушать
topic_15: фестиваль рейтинг место перевод озеро греция золото
topic_

In [22]:
def get_lda_vector(lda, text):
    unseen_doc = common_dictionary.doc2bow(text)
    lda_tuple = lda[unseen_doc]

    not_null_topics = dict(zip([i[0] for i in lda_tuple], [i[1] for i in lda_tuple]))

    output_vector = []
    for i in range(N_topic):
        if i not in not_null_topics:
            output_vector.append(0)
        else:
            output_vector.append(not_null_topics[i])
    return np.array(output_vector)
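
The function above turns the sparse list of `(topic_id, probability)` pairs returned by the model into a fixed-length vector. The same idea can be sketched with a zero-filled NumPy array; `sparse_to_dense` is a hypothetical name, and the sample pairs are rounded, illustrative values rather than model output.

```python
import numpy as np

def sparse_to_dense(topic_pairs, n_topics):
    """Scatter (topic_id, probability) pairs into a zero-filled vector of length n_topics."""
    vec = np.zeros(n_topics)
    for topic_id, prob in topic_pairs:
        vec[topic_id] = prob
    return vec

sample_pairs = [(1, 0.237), (9, 0.132), (18, 0.298)]  # illustrative values
dense = sparse_to_dense(sample_pairs, n_topics=20)
```

Topics that the model does not report for a document stay at zero, so every document ends up with a vector of the same length.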

In [23]:
get_lda_vector(lda, news['title'].iloc[0])

array([0.        , 0.10109202, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.07945848, 0.        , 0.        ,
       0.        , 0.26765791, 0.        , 0.        , 0.        ,
       0.        , 0.53710365, 0.        , 0.        , 0.        ])

In [24]:
%%time
topic_matrix = pd.DataFrame([get_lda_vector(lda, text) for text in news['title'].values])
topic_matrix.columns = [f'topic_{i}' for i in range(N_topic)]
topic_matrix['doc_id'] = news['doc_id'].values
topic_matrix = topic_matrix[['doc_id']+[f'topic_{i}' for i in range(N_topic)]]
topic_matrix.head(5)

Wall time: 22.6 s


Unnamed: 0,doc_id,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,6,0.0,0.101092,0.0,0.0,0.0,0.0,0.0,0.079496,0.0,...,0.0,0.267616,0.0,0.0,0.0,0.0,0.537107,0.0,0.0,0.0
1,4896,0.0,0.289266,0.0,0.053288,0.0,0.0,0.0,0.0,0.0,...,0.0,0.233842,0.0,0.0,0.0,0.0,0.403568,0.0,0.0,0.0
2,4897,0.0,0.236964,0.0,0.0,0.0,0.045688,0.090338,0.089794,0.0,...,0.0,0.0,0.026189,0.023459,0.0,0.045266,0.0,0.0,0.29771,0.0
3,4898,0.016529,0.193931,0.0,0.0,0.0,0.02496,0.07317,0.0,0.0,...,0.0,0.0,0.0,0.034227,0.0,0.0,0.158796,0.0,0.463233,0.027817
4,4899,0.087657,0.061566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.827117,0.0,0.0,0.0


In [26]:
users.head(3)

Unnamed: 0,uid,articles
0,u105138,"[293672, 293328, 293001, 293622, 293126, 1852]"
1,u108690,"[3405, 1739, 2972, 1158, 1599, 322665]"
2,u108339,"[1845, 2009, 2356, 1424, 2939, 323389]"


In [27]:
doc_dict = dict(zip(topic_matrix['doc_id'].values, topic_matrix[[f'topic_{i}' for i in range(N_topic)]].values))

In [28]:
doc_dict[293672]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.2564289 , 0.27587086, 0.        ,
       0.        , 0.12808338, 0.04906131, 0.        , 0.        ,
       0.        , 0.27388248, 0.        , 0.        , 0.        ])

### Task 2

Modify the `get_user_embedding` function so that it computes the median instead of the mean (`np.mean` in the example). Apply this transformation to the data, train a churn prediction model, then compute and save the quality metrics: ROC AUC and precision/recall/F-score (for the last three, select the optimal threshold).

### Task 3

Repeat Task 2, but use max instead of the median.

In [29]:
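import ast

def get_user_embedding(user_articles_list, doc_dict, method='mean'):
    # the article list is stored as a string such as "[293672, 293328]";
    # literal_eval parses it without executing arbitrary code
    user_articles_list = ast.literal_eval(user_articles_list)
    user_vector = np.array([doc_dict[doc_id] for doc_id in user_articles_list])
    if method == 'mean':
        user_vector = np.mean(user_vector, 0)
    elif method == 'median':
        user_vector = np.median(user_vector, 0)
    elif method == 'max':
        user_vector = np.max(user_vector, 0)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return user_vector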
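
The three aggregations behave quite differently on sparse topic vectors. The toy example below (three hypothetical articles with four topic weights each) shows that the median zeroes out any topic that is present in fewer than half of the articles, while max keeps every topic that appears strongly even once.

```python
import numpy as np

# Three hypothetical articles described by four topic weights each.
article_vectors = np.array([
    [0.0, 0.6, 0.0, 0.4],
    [0.0, 0.5, 0.5, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])

mean_vec = article_vectors.mean(axis=0)            # average weight per topic
median_vec = np.median(article_vectors, axis=0)    # middle value per topic
max_vec = article_vectors.max(axis=0)              # strongest occurrence per topic
```

Here the median keeps only the second topic, because it is the only one with a non-zero weight in at least two of the three articles.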

In [30]:
%%time
user_embeddings_median = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding(x, doc_dict, 'median'))])
user_embeddings_median.columns = [f'topic_{i}' for i in range(N_topic)]
user_embeddings_median['uid'] = users['uid'].values
user_embeddings_median = user_embeddings_median[['uid']+[f'topic_{i}' for i in range(N_topic)]]
user_embeddings_median.head(3)

Wall time: 734 ms


Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,u105138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.206141,...,0.0,0.169858,0.0,0.0,0.0,0.0,0.172812,0.0,0.075296,0.086739
1,u108690,0.0,0.0,0.0,0.0,0.0,0.007058,0.027952,0.036704,0.0,...,0.0,0.15032,0.017576,0.0,0.0,0.0,0.241239,0.0,0.14088,0.054727
2,u108339,0.0,0.0,0.0,0.0,0.041128,0.0,0.0,0.162235,0.069314,...,0.005751,0.318968,0.01212,0.0,0.0,0.0,0.107477,0.0,0.059934,0.111857


In [31]:
%%time
user_embeddings_max = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding(x, doc_dict, 'max'))])
user_embeddings_max.columns = [f'topic_{i}' for i in range(N_topic)]
user_embeddings_max['uid'] = users['uid'].values
user_embeddings_max = user_embeddings_max[['uid']+[f'topic_{i}' for i in range(N_topic)]]
user_embeddings_max.head(3)

Wall time: 319 ms


Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,u105138,0.030973,0.015965,0.080916,0.112065,0.048485,0.010608,0.051407,0.256429,0.275871,...,0.06679,0.465373,0.049061,0.046524,0.0,0.0,0.273882,0.0,0.460972,0.346842
1,u108690,0.029919,0.0,0.0,0.031497,0.026501,0.019716,0.270814,0.260104,0.111285,...,0.024055,0.704852,0.048428,0.0,0.010008,0.025371,0.485796,0.0,0.334459,0.294791
2,u108339,0.021157,0.012787,0.032879,0.011519,0.142443,0.0,0.195651,0.285376,0.152179,...,0.021084,0.414587,0.017299,0.0,0.014677,0.0,0.26085,0.021372,0.172249,0.262344


In [32]:
target = pd.read_csv("users_churn.csv")
target.head(3)

Unnamed: 0,uid,churn
0,u107120,0
1,u102277,0
2,u102444,0


In [33]:
X_median = pd.merge(user_embeddings_median, target, how='left', on='uid')
X_median.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,churn
0,u105138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.206141,...,0.169858,0.0,0.0,0.0,0.0,0.172812,0.0,0.075296,0.086739,0
1,u108690,0.0,0.0,0.0,0.0,0.0,0.007058,0.027952,0.036704,0.0,...,0.15032,0.017576,0.0,0.0,0.0,0.241239,0.0,0.14088,0.054727,1
2,u108339,0.0,0.0,0.0,0.0,0.041128,0.0,0.0,0.162235,0.069314,...,0.318968,0.01212,0.0,0.0,0.0,0.107477,0.0,0.059934,0.111857,1


In [34]:
X_max = pd.merge(user_embeddings_max, target, how='left', on='uid')
X_max.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,churn
0,u105138,0.030973,0.015965,0.080916,0.112065,0.048485,0.010608,0.051407,0.256429,0.275871,...,0.465373,0.049061,0.046524,0.0,0.0,0.273882,0.0,0.460972,0.346842,0
1,u108690,0.029919,0.0,0.0,0.031497,0.026501,0.019716,0.270814,0.260104,0.111285,...,0.704852,0.048428,0.0,0.010008,0.025371,0.485796,0.0,0.334459,0.294791,1
2,u108339,0.021157,0.012787,0.032879,0.011519,0.142443,0.0,0.195651,0.285376,0.152179,...,0.414587,0.017299,0.0,0.014677,0.0,0.26085,0.021372,0.172249,0.262344,1


In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

In [36]:
# split the data into train/test
X_train_median, X_test_median, y_train_median, y_test_median = train_test_split(X_median[[f'topic_{i}' for i in range(N_topic)]], 
                                                    X_median['churn'], random_state=0)

X_train_max, X_test_max, y_train_max, y_test_max = train_test_split(X_max[[f'topic_{i}' for i in range(N_topic)]], 
                                                    X_max['churn'], random_state=0)

In [37]:
logreg_median = LogisticRegression()
logreg_max = LogisticRegression()
# fit both models
logreg_median.fit(X_train_median, y_train_median)
logreg_max.fit(X_train_max, y_train_max)

LogisticRegression()

In [38]:
# predicted churn probabilities for the test set
preds_median = logreg_median.predict_proba(X_test_median)[:, 1]
print(preds_median[:10])

preds_max = logreg_max.predict_proba(X_test_max)[:, 1]
print(preds_max[:10])

[0.09588062 0.0056786  0.59991919 0.32285527 0.02171979 0.06660128
 0.0544884  0.14627997 0.18820718 0.07144477]
[1.13619298e-01 2.32892982e-04 7.51378643e-01 1.02219769e-01
 3.46362468e-02 1.38170496e-01 4.03859259e-02 2.28804673e-02
 6.28000718e-02 2.78520550e-02]


In [39]:
from sklearn.metrics import (f1_score, roc_auc_score, precision_score,
                             classification_report, precision_recall_curve, confusion_matrix)

In [41]:
precision_median, recall_median, thresholds_median = precision_recall_curve(y_test_median, preds_median)
fscore_median = (2 * precision_median * recall_median) / (precision_median + recall_median)
# locate the largest F-score, skipping NaN (0/0) entries and the final curve point, which has no threshold
ix_median = np.nanargmax(fscore_median[:-1])
print(f'Method = Median. Best Threshold={thresholds_median[ix_median]:.3f}, F-Score={fscore_median[ix_median]:.3f}, Precision={precision_median[ix_median]:.3f}, Recall={recall_median[ix_median]:.3f}')

precision_max, recall_max, thresholds_max = precision_recall_curve(y_test_max, preds_max)
fscore_max = (2 * precision_max * recall_max) / (precision_max + recall_max)
# locate the largest F-score, skipping NaN (0/0) entries and the final curve point, which has no threshold
ix_max = np.nanargmax(fscore_max[:-1])
print(f'Method = Max. Best Threshold={thresholds_max[ix_max]:.3f}, F-Score={fscore_max[ix_max]:.3f}, Precision={precision_max[ix_max]:.3f}, Recall={recall_max[ix_max]:.3f}')


Method = Median. Best Threshold=0.275, F-Score=0.780, Precision=0.728, Recall=0.841
Method = Max. Best Threshold=0.357, F-Score=0.802, Precision=0.785, Recall=0.820
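
The threshold search can also be spelled out without scikit-learn. The sketch below tries every distinct score as a threshold, predicts churn when the score is at or above it, and keeps the threshold with the largest F-score. The function name `best_f1_threshold` and the toy labels and scores are hypothetical.

```python
def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold (predict 1 when score >= threshold)."""
    best = (0.0, None, 0.0, 0.0)  # (f1, threshold, precision, recall)
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
        if tp == 0:
            continue  # no true positives: F-score cannot beat the current best
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, thr, precision, recall)
    return best

toy_y = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]
f1, thr, precision, recall = best_f1_threshold(toy_y, toy_scores)
```

On this toy data the best threshold is 0.35: all three positives are caught (recall 1.0) at the cost of one false positive (precision 0.75).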


In [42]:
roc_auc_median = roc_auc_score(y_test_median, preds_median)
roc_auc_max = roc_auc_score(y_test_max, preds_max)

print(f'ROC AUC score for median method: {roc_auc_median}')
print(f'ROC AUC score for max method: {roc_auc_max}')

ROC AUC score for median method: 0.9729402872260016
ROC AUC score for max method: 0.9749543578115007
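
Unlike precision, recall and F-score, ROC AUC does not depend on a threshold: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition, with a hypothetical helper `pairwise_auc` and toy data:

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_auc = pairwise_auc([0, 0, 1, 0, 1, 1], [0.10, 0.30, 0.35, 0.40, 0.80, 0.90])
```

Eight of the nine positive/negative pairs are ordered correctly here, so the toy AUC is 8/9. The quadratic loop is only for illustration; it would be slow on a real test set.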


### Task 4

*Using what you learned in Task 1, repeat Task 2, but this time weight the news articles by TF-IDF (taking each user's list of articles)
    - hint 1: you need a weight coefficient for every document. Not all documents are equally informative or carry a positive signal
    - hint 2: the weight should be the IDF specifically.
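
One possible reading of these hints, sketched under stated assumptions: treat each user's reading list as a "document" and each article id as a "term", so that articles read by many users get a low IDF and rarely read articles get a high one. The user embedding is then the IDF-weighted mean of the article vectors. The smoothed IDF formula, the helper names `article_idf` and `idf_weighted_embedding`, and the toy data are all assumptions, not part of the original notebook.

```python
import math
from collections import Counter

import numpy as np

def article_idf(users_articles):
    """Smoothed IDF per article id: rarely read articles get larger weights."""
    n_users = len(users_articles)
    df = Counter(a for articles in users_articles for a in set(articles))
    return {a: math.log((1 + n_users) / (1 + c)) + 1 for a, c in df.items()}

def idf_weighted_embedding(articles, doc_vectors, idf):
    """Weighted mean of the user's article vectors, with IDF values as weights."""
    vectors = np.array([doc_vectors[a] for a in articles])
    weights = np.array([idf[a] for a in articles])
    return np.average(vectors, axis=0, weights=weights)

# Toy data: article 1 is read by everyone, articles 2 and 3 by one user each.
toy_users = [[1, 2], [1, 3], [1]]
toy_doc_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.5, 0.5]}
toy_idf = article_idf(toy_users)
toy_embedding = idf_weighted_embedding(toy_users[0], toy_doc_vectors, toy_idf)
```

For the first toy user, the rarely read article 2 pulls the embedding toward its own topic more strongly than the universally read article 1 does.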

### Task 5

Produce a single table comparing the quality of the 2-3 different methods of building user embeddings (median, max, idf_mean) on the metrics roc_auc, precision, recall and f_score

In [44]:
metrics = pd.DataFrame(
    {
        'Method': ['Median', 'Max'],
        'Threshold': [thresholds_median[ix_median], thresholds_max[ix_max]],
        'F-Score': [fscore_median[ix_median], fscore_max[ix_max]],
        'Precision': [precision_median[ix_median], precision_max[ix_max]],
        'Recall': [recall_median[ix_median], recall_max[ix_max]],
        'ROC AUC score': [roc_auc_median, roc_auc_max]
    })

metrics

Unnamed: 0,Method,Threshold,F-Score,Precision,Recall,ROC AUC score
0,Median,0.275342,0.780303,0.727915,0.840816,0.97294
1,Max,0.357282,0.802395,0.785156,0.820408,0.974954


### Task 6

Draw your own conclusions and hypotheses about why one method or another turned out to be more effective than the rest