In this notebook I decided to improve my first model: I tested a TF-IDF implementation and tuned the *n_estimators* parameter, which controls how many decision trees are built and so makes the algorithm more robust.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [2]:
features = pd.read_excel('raw_data_with_labels.xlsx')
features = features[features['y'].notnull()]
features.shape

(500, 5)

In [3]:
features['view_per_day'] = round(features['view_count'] / features['tempo_desde_pub'], 4)
features = features.drop(['tempo_desde_pub'], axis=1)
features.head()

Unnamed: 0,title,y,upload_date,view_count,view_per_day
0,How Far is Too Far? | The Age of A.I.,0.0,2019-12-18,49218295,79002.0787
1,AlphaGo - The Movie | Full award-winning docum...,0.0,2020-03-13,26896993,50087.5102
2,Artificial intelligence and algorithms: pros a...,0.0,2019-09-26,6424768,9100.238
3,"#AndroidDevChallenge - Helpful innovation, pow...",0.0,2020-06-22,5779436,13255.5872
4,Become a DATA ANALYST with NO degree?!? The Go...,0.0,2021-03-17,2037151,12125.8988


In [5]:
X = features.copy().drop(['y', 'upload_date'], axis=1)
y = features['y']

In [6]:
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.5, random_state=0)
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((250, 3), (250, 3), (250,), (250,))

At this point, I configure TF-IDF with a minimum of 2 examples per word, meaning a word must appear in at least 2 titles to enter the vocabulary.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_train = Xtrain['title']
title_val = Xval['title']

title_vec = TfidfVectorizer(min_df=2)  # min_df = minimum number of titles a word must appear in
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [8]:
title_bow_train.shape

(250, 238)
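To make the effect of `min_df=2` more concrete, here is a minimal pure-Python sketch of the vocabulary filtering step. The toy titles and the helper names `tokenize` and `build_vocabulary` are hypothetical, and the tokenization only approximates the vectorizer's default behaviour (lowercasing, tokens of two or more word characters); it does not compute the TF-IDF weights themselves.

```python
import re
from collections import Counter

# Hypothetical toy titles, not taken from the dataset.
titles = [
    "Machine Learning for beginners",
    "Deep Learning explained",
    "Machine Learning project ideas",
]

def tokenize(text):
    # Lowercase and keep tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs, min_df=2):
    # Document frequency: number of titles containing each word
    # (a word is counted once per title, hence the set).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Keep only the words that appear in at least min_df titles.
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

vocab = build_vocabulary(titles, min_df=2)
print(vocab)
```

With these toy titles only "learning" and "machine" survive the filter, which is the same kind of pruning that reduced the real title vocabulary to 238 columns.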

In [9]:
from scipy.sparse import hstack

Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

In [10]:
Xtrain_wtitle.shape, Xval_wtitle.shape

((250, 240), (250, 240))

In [11]:
mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=6,
                       random_state=0)

In [12]:
p = mdl.predict_proba(Xval_wtitle)[ : , 1]

In [13]:
from sklearn.metrics import roc_auc_score, average_precision_score

In [14]:
average_precision_score(yval, p)

0.32909939825559575

In [15]:
roc_auc_score(yval, p)

0.7491956241956242
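To make these two metrics more concrete, here is a minimal sketch of how they can be computed by hand on a toy example. The function names `roc_auc` and `average_precision` and the toy labels and scores are hypothetical, and the sketch assumes binary labels with no tied scores; the scikit-learn functions used above handle ties and edge cases more carefully.

```python
def roc_auc(y_true, scores):
    # Fraction of (positive, negative) pairs in which the positive
    # example receives the higher score; ties count as half a win.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    # Rank examples by descending score, then average the precision
    # measured at the rank of each positive example.
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy labels and scores (assumed for illustration only).
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_toy, p_toy))
print(average_precision(y_toy, p_toy))
```

ROC AUC looks at how well positives are ranked above negatives overall, while average precision puts more weight on the top of the ranking, which matters when only the highest-scored videos will be recommended.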

After evaluating the model, I concluded that the changes made in this second iteration were beneficial for the algorithm.
Both metrics improved (**Average Precision** and **ROC AUC**).

## Active Learning

I decided to try Active Learning to improve the algorithm.
To do so, I set aside examples that I had not yet labeled as "would watch" or "would not watch" and to which my previous model assigned a score between 0.18 and 1.
The reason: these are the videos that the previous algorithm could not classify in a binary way (very close to 0 or very close to 1), so I will label them with 0 or 1 and train the model again.

*Spoiler Alert:* It did not contribute much to the next algorithm.

In [20]:
df_unlabeled = pd.read_excel('raw_data_with_labels.xlsx')
df_unlabeled = df_unlabeled[df_unlabeled['y'].isnull()].dropna(how='all')
df_unlabeled.shape

(698, 5)

In [22]:
features_u = df_unlabeled.copy()
features_u['view_per_day'] = round(features_u['view_count'] / features_u['tempo_desde_pub'], 4)
features_u = features_u.drop(['tempo_desde_pub'], axis=1)
features_u.head()

Unnamed: 0,title,y,upload_date,view_count,view_per_day
500,A day in the life of a Data Scientist (lifesty...,,2020-12-10,259210,978.1509
501,Live- Implementation of End To End Kaggle Mach...,,2020-06-30,254697,595.0864
502,Live- Implementation of End To End Kaggle Mach...,,2020-06-30,254697,595.0864
503,How to learn data science in 2021 (the minimiz...,,2020-12-26,251737,1010.992
504,All Machine Learning Models Explained in 5 Min...,,2020-05-15,248865,525.0316


In [26]:
title_u = features_u['title']
title_bow_u = title_vec.transform(title_u)
title_bow_u

<698x238 sparse matrix of type '<class 'numpy.float64'>'
	with 3500 stored elements in Compressed Sparse Row format>

In [28]:
Xu_wtitle = hstack([features_u[['view_count', 'view_per_day']], title_bow_u])
Xu_wtitle

<698x240 sparse matrix of type '<class 'numpy.float64'>'
	with 4896 stored elements in COOrdinate format>

In [29]:
pu = mdl.predict_proba(Xu_wtitle)[ : , 1]
df_unlabeled['p'] = pu

In [30]:
df_unlabeled.head()

Unnamed: 0,title,y,upload_date,view_count,tempo_desde_pub,p
500,A day in the life of a Data Scientist (lifesty...,,2020-12-10,259210,265.0,0.154
501,Live- Implementation of End To End Kaggle Mach...,,2020-06-30,254697,428.0,0.245
502,Live- Implementation of End To End Kaggle Mach...,,2020-06-30,254697,428.0,0.245
503,How to learn data science in 2021 (the minimiz...,,2020-12-26,251737,249.0,0.13
504,All Machine Learning Models Explained in 5 Min...,,2020-05-15,248865,474.0,0.097


In [63]:
mask_u = (df_unlabeled['p'] >= 0.18) & (df_unlabeled['p'] <= 1)
mask_u.sum()

72

In [65]:
dificeis = df_unlabeled[mask_u].sort_values('p')
dificeis.head()

Unnamed: 0,title,y,upload_date,view_count,tempo_desde_pub,p
978,Data Science In Biology | How a biologist beca...,,2020-08-30,4999,367.0,0.18
840,Five Data Science Project Ideas,,2020-07-02,12356,426.0,0.18
555,Cheapest Deep Learning PC in 2020,,2020-02-10,100602,569.0,0.181
973,Step By Step Process In EDA And Feature Engine...,,2021-08-29,5093,3.0,0.182
880,AIML Facemask Detector | Mask Detection Using ...,,2021-02-22,9806,191.0,0.182


In [67]:
aleatorios = df_unlabeled[~mask_u].sample(28)
aleatorios.head()

Unnamed: 0,title,y,upload_date,view_count,tempo_desde_pub,p
1029,"BSc Mathematics, Statistics and Data Science",,2020-07-01,3247,427.0,0.097
875,Machine Learning Steps | What Is Machine Learn...,,2021-04-27,10084,127.0,0.135
929,Winner's interview: Dark of the Moon - 1st in ...,,2020-10-13,7206,323.0,0.169
1147,Institutions in Hyderabad | Data Science Insti...,,2021-08-27,1409,5.0,0.069
708,Is Becoming a Data Scientist Hard?,,2020-06-11,24547,447.0,0.028


In [68]:
pd.concat([dificeis, aleatorios]).to_csv('active_label1.csv')
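The selection done in the cells above can be condensed into one small function. This is only a pure-Python sketch: the name `select_for_labeling` and its parameters are hypothetical, and the `(index, score)` pairs are a handful of rows copied from the outputs above rather than the full unlabeled set.

```python
import random

# A few (row index, score) pairs copied from the outputs shown above.
scored = [
    (500, 0.154),
    (501, 0.245),
    (503, 0.13),
    (504, 0.097),
    (555, 0.181),
    (708, 0.028),
]

def select_for_labeling(scored, low=0.18, high=1.0, n_random=2, seed=0):
    # Examples whose score falls inside [low, high] are kept for labeling.
    in_range = [item for item in scored if low <= item[1] <= high]
    # The remaining examples feed a small random sample, so the newly
    # labeled set is not made up only of examples from the score range.
    rest = [item for item in scored if not (low <= item[1] <= high)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    random_part = rng.sample(rest, min(n_random, len(rest)))
    # Sort the in-range examples by score, as sort_values('p') does above.
    return sorted(in_range, key=lambda t: t[1]), random_part

hard, random_part = select_for_labeling(scored)
print(hard)
print(random_part)
```

Passing a fixed seed is a choice made for this sketch; the `sample(28)` call above does not set one, so its selection changes on every run.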