In this notebook I apply the Active Learning technique (which I describe in the file **modelo2.ipynb**).

In [37]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [38]:
df1 = pd.read_excel('raw_data_with_labels.xlsx').reset_index(drop=True)
df1 = df1[df1['y'].notnull()]
df1.shape

(500, 5)

In [39]:
df2 = pd.read_csv('active_label1.csv', index_col=0).reset_index(drop=True)
df2 = df2[df2['y'].notnull()]
df2['novo'] = 1
df2.shape

(100, 7)

After reading the file *active_label1.csv* (which contains my labels for the videos the algorithm was "confused" about), I compute **average precision and ROC AUC** to see how well my model had scored those videos.

In [40]:
from sklearn.metrics import roc_auc_score, average_precision_score
average_precision_score(df2['y'], df2['p']), roc_auc_score(df2['y'], df2['p'])

(0.1306875860584758, 0.47344228804902966)
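The selection of the "confused" videos is not reproduced in this notebook. A minimal sketch of one common way to do it, assuming plain uncertainty sampling (keep the items whose predicted probability is closest to 0.5). The helper name `select_uncertain` and the toy probabilities are hypothetical, not taken from **modelo2.ipynb**.

```python
def select_uncertain(probs, k):
    """Return the indices of the k probabilities closest to 0.5."""
    # Distance from 0.5 measures how undecided the classifier is.
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Hypothetical predicted probabilities for six unlabeled videos.
toy_probs = [0.02, 0.48, 0.91, 0.55, 0.30, 0.65]
picked = select_uncertain(toy_probs, k=3)
print(picked)
```

Items picked this way are, by construction, the ones the model separates worst, so scores close to chance on this sample are to be expected.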

In [43]:
df = pd.concat([df1, df2.drop('p', axis=1)])
df['novo'] = df['novo'].fillna(0)

In [44]:
df['view_per_day'] = round(df['view_count'] / df['tempo_desde_pub'], 4)
df = df.drop(['tempo_desde_pub'], axis=1)
df.head()

| | title | y | upload_date | view_count | novo | view_per_day |
|---|---|---|---|---|---|---|
| 0 | How Far is Too Far? \| The Age of A.I. | 0.0 | 2019-12-18 00:00:00 | 49218295 | 0.0 | 79002.0787 |
| 1 | AlphaGo - The Movie \| Full award-winning docum... | 0.0 | 2020-03-13 00:00:00 | 26896993 | 0.0 | 50087.5102 |
| 2 | Artificial intelligence and algorithms: pros a... | 0.0 | 2019-09-26 00:00:00 | 6424768 | 0.0 | 9100.238 |
| 3 | #AndroidDevChallenge - Helpful innovation, pow... | 0.0 | 2020-06-22 00:00:00 | 5779436 | 0.0 | 13255.5872 |
| 4 | Become a DATA ANALYST with NO degree?!? The Go... | 0.0 | 2021-03-17 00:00:00 | 2037151 | 0.0 | 12125.8988 |


### Active Learning

**Applying the Active Learning technique**

* Read in the new data, which adds the extra videos I labeled
* Split the data into train and validation sets
* Applied the model I was already using (TF-IDF with *min_df=2* and Random Forest with *n_estimators=1000*)
* Evaluated the model and compared it with the previous ones

In [45]:
X = df.copy().drop(['y', 'upload_date'], axis=1)
y = df['y']

In [46]:
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.6, random_state=0)
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((240, 4), (360, 4), (240,), (360,))

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_train = Xtrain['title']
title_val = Xval['title']

title_vec = TfidfVectorizer(min_df=2)  # min_df: minimum number of titles a word must appear in
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [48]:
title_bow_train.shape

(240, 217)

In [49]:
from scipy.sparse import hstack

Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

In [50]:
Xtrain_wtitle.shape, Xval_wtitle.shape

((240, 220), (360, 220))

In [52]:
mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=6,
                       random_state=0)

In [53]:
p = mdl.predict_proba(Xval_wtitle)[ : , 1]

In [65]:
print('precision: {} and roc: {}'.format(average_precision_score(yval, p), roc_auc_score(yval, p)))

precision: 0.22625303141722752 and roc: 0.6403515625


The model trained with the Active Learning data did not beat the previous model (Model 2).
Average precision fell by about 31% (0.329 to 0.226) and ROC AUC by about 15% (0.749 to 0.640).
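The relative drop can be checked directly from the scores listed in this notebook. A minimal sketch, assuming "the previous model" means Model 2; `relative_drop` is a hypothetical helper name.

```python
def relative_drop(before, after):
    """Percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100


# Scores reported in this notebook for Model 2 and Model 3.
precision_drop = relative_drop(0.32909939825559575, 0.22625303141722752)
auc_drop = relative_drop(0.7491956241956242, 0.6403515625)

print('precision: -{:.1f}% and roc: -{:.1f}%'.format(precision_drop, auc_drop))
```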

## Increasing the training set

My next attempt at improving the model was to give it more training data (*test_size=0.45* instead of *0.6*), but that did not help either: average precision dropped to 0.195 and ROC AUC barely moved (0.648).

In [66]:
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.45, random_state=0)
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

title_train = Xtrain['title']
title_val = Xval['title']

title_vec = TfidfVectorizer(min_df=2)  # min_df: minimum number of titles a word must appear in
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

p = mdl.predict_proba(Xval_wtitle)[ : , 1]

print('precision: {} and roc: {}'.format(average_precision_score(yval, p), roc_auc_score(yval, p)))

precision: 0.19466273955140856 and roc: 0.6481481481481481
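The two experiments differ only in `test_size`, so the repeated cell could be wrapped in a helper. A sketch under assumptions: `evaluate_split` is a hypothetical name, `n_jobs` is left at its default, and the toy DataFrame below is made up only to show the function running (it is not the video data).

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_split(X, y, test_size, n_estimators=1000):
    """Fit TF-IDF + Random Forest on one split; return (avg precision, ROC AUC)."""
    Xtrain, Xval, ytrain, yval = train_test_split(
        X, y, test_size=test_size, random_state=0)

    # Learn the vocabulary from the training titles only.
    title_vec = TfidfVectorizer(min_df=2)
    title_bow_train = title_vec.fit_transform(Xtrain['title'])
    title_bow_val = title_vec.transform(Xval['title'])

    # Numeric columns first, then the sparse title features.
    Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
    Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

    mdl = RandomForestClassifier(n_estimators=n_estimators, random_state=0,
                                 class_weight='balanced')
    mdl.fit(Xtrain_wtitle, ytrain)
    p = mdl.predict_proba(Xval_wtitle)[:, 1]
    return average_precision_score(yval, p), roc_auc_score(yval, p)


# Hypothetical toy data: 60 rows, two alternating classes.
toy = pd.DataFrame({
    'title': ['machine learning tutorial', 'funny cat video'] * 30,
    'view_count': range(60),
})
toy_y = pd.Series([1, 0] * 30)

for test_size in (0.6, 0.45):
    ap, auc = evaluate_split(toy, toy_y, test_size, n_estimators=50)
    print('test_size={}: precision: {} and roc: {}'.format(test_size, ap, auc))
```

With the real data, the call would be `evaluate_split(X, y, test_size=0.45)`, which keeps both experiments on exactly the same pipeline.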


MODEL 1
    PRECISION: 0.11342732376793511
    AUC: 0.4945302445302446

MODEL 2
    PRECISION: 0.32909939825559575
    AUC: 0.7491956241956242

MODEL 3
    PRECISION: 0.22625303141722752
    AUC: 0.6403515625

MODEL 4
    PRECISION: 0.19466273955140856
    AUC: 0.6481481481481481
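For a quick side-by-side check, the scores above can be placed in a dictionary and ranked. A small sketch; the variable names are my own, and the numbers are copied from the list above.

```python
# (average precision, ROC AUC) for each model, as listed above.
results = {
    'Model 1': (0.11342732376793511, 0.4945302445302446),
    'Model 2': (0.32909939825559575, 0.7491956241956242),
    'Model 3': (0.22625303141722752, 0.6403515625),
    'Model 4': (0.19466273955140856, 0.6481481481481481),
}

best_by_precision = max(results, key=lambda name: results[name][0])
best_by_auc = max(results, key=lambda name: results[name][1])

print('best precision: {} / best roc: {}'.format(best_by_precision, best_by_auc))
```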