Nesse notebook, começo a testar outros algoritmos e faço processo de tunning nos parâmetros do modelo.
A ideia aqui é chegar em melhores métricas de avaliação de modelo.

In [1]:
import pandas as pd
from scipy.sparse import hstack

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, average_precision_score
from skopt import forest_minimize

from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv("fully_annoteded.csv").dropna(subset=['y'])
df.head()

Unnamed: 0,title,y,upload_date,view_count,tempo_desde_pub
0,"""Kaggle: command not found fix"" for Mac Users",0,30/11/2020,216,275
1,#1 - Data Science na PrÃ¡tica - Conhecendo o K...,0,03/08/2020,808,394
2,"#AndroidDevChallenge - Helpful innovation, pow...",0,22/06/2020,5779436,436
3,#Data Science SQL Full Outer join Keyword | Ek...,0,28/08/2021,1,4
4,#Data Science SQL Left join Keyword | Ekasclou...,0,28/08/2021,0,4


In [3]:
df.duplicated().mean()

0.0

In [4]:
df.duplicated(['title']).mean()

0.0

In [5]:
df.shape

(1182, 5)

In [6]:
df['view_per_day'] = round(df['view_count'] / df['tempo_desde_pub'], 4)
df = df.drop(['tempo_desde_pub'], axis=1)
df.head()

Unnamed: 0,title,y,upload_date,view_count,view_per_day
0,"""Kaggle: command not found fix"" for Mac Users",0,30/11/2020,216,0.7855
1,#1 - Data Science na PrÃ¡tica - Conhecendo o K...,0,03/08/2020,808,2.0508
2,"#AndroidDevChallenge - Helpful innovation, pow...",0,22/06/2020,5779436,13255.5872
3,#Data Science SQL Full Outer join Keyword | Ek...,0,28/08/2021,1,0.25
4,#Data Science SQL Left join Keyword | Ekasclou...,0,28/08/2021,0,0.0


### Random Forest Classifier

Para esse algoritmo, testei diferentes valores para os seguintes parâmetros:
* n_estimators
* min_samples_leaf
* class_weight
* n_jobs

Além de testar variações do parâmetro de ngram no TFIDF

In [7]:
X = df.copy().drop(['y', 'upload_date'], axis=1)
y = df['y']

In [8]:
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.55, random_state=0)
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((531, 3), (651, 3), (531,), (651,))

In [9]:
title_train = Xtrain['title']
title_val = Xval['title']

#min_df = minimo de exemplos de palavra
#ngram_range = numero de unigramas
title_vec = TfidfVectorizer(min_df=1, ngram_range=(1,1))
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [10]:
Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

Xtrain_wtitle.shape, Xval_wtitle.shape

((531, 1529), (651, 1529))

In [11]:
mdl = RandomForestClassifier(n_estimators=100, random_state=0, min_samples_leaf=1, class_weight='balanced', n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_jobs=6, random_state=0)

In [12]:
p = mdl.predict_proba(Xval_wtitle)[ : , 1]

In [13]:
print('precision: {} and roc: {}'.format(average_precision_score(yval, p), roc_auc_score(yval, p)))

precision: 0.32113680450630394 and roc: 0.7318334350213545


precision: 0.32113680450630394 and roc: 0.7318334350213545 -> unigrama
precision: 0.31808388628023254 and roc: 0.7176479560707748 -> bigrama

### Testando o LightGBM

Utilizei a função *forest_minimize* da biblioteca *skopt* para chegar nos melhores parâmetros para o modelo LGBM.
Utilizei esse método pois ele é melhor que o GridSearch, já que este procura de forma aleatória e tenta prever o erro, para o próximo 'loop' ser melhor que o anterior.

In [14]:
mdl2 = LGBMClassifier(random_state=0, class_weight='balanced', n_jobs=6)
mdl2.fit(Xtrain_wtitle, ytrain)

LGBMClassifier(class_weight='balanced', n_jobs=6, random_state=0)

In [15]:
p2 = mdl2.predict_proba(Xval_wtitle)[ : , 1]



Métricas de avaliação sem tunning:

In [16]:
print('precision: {} and roc: {}'.format(average_precision_score(yval, p2), roc_auc_score(yval, p2)))

precision: 0.12864127819187215 and roc: 0.5719188529591214


Criação da função para otimização, aqui a ideia é chegar numa melhor **precisão**, mas tambem acompanhar o **ROC**

In [17]:
def tune_lgbm(params):
    print(params)
    lr = params[0]
    max_depth = params[1]
    min_child_samples = params[2]
    subsample = params[3]
    colsample_bytree = params[4]
    n_estimators = params[5]
    min_df = params[6]
    ngram_range = (1, params[7])

    title_vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    title_bow_train = title_vec.fit_transform(title_train)
    title_bow_val = title_vec.transform(title_val)

    Xtrain_wtitle = hstack([Xtrain.drop(['title'], axis=1), title_bow_train])
    Xval_wtitle = hstack([Xval.drop(['title'], axis=1), title_bow_val])

    mdl3 = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth,
                          min_child_samples=min_child_samples, subsample=subsample,
                          colsample_bytree=colsample_bytree, bagging_freq=1, n_estimators=n_estimators, random_state=0,
                          class_weight="balanced", n_jobs=6)

    mdl3.fit(Xtrain_wtitle, ytrain)

    p = mdl3.predict_proba(Xval_wtitle)[ : , 1]
    print(roc_auc_score(yval, p))

    return -average_precision_score(yval, p)

In [18]:
space = [(1e-3, 1e-1, 'log-uniform'), # lr
         (1, 10),                      # max_depth
         (1, 20),                      # min_child_samples
         (0.05, 1.),                   # subsample
         (0.05, 1.),                   # colsample_bytree
         (100, 1000),                  # n_estimators
         (1, 5),                       # mind_df
         (1, 5)]                       # ngram_range

In [19]:
res = forest_minimize(tune_lgbm, space, random_state=160745, n_random_starts=20, n_calls=50, verbose=1)

Iteration No: 1 started. Evaluating function at random point.
[0.009944912110647982, 5, 1, 0.4677107511929402, 0.49263223036174764, 272, 3, 1]




0.597895057962172
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 0.2170
Function value obtained: -0.1500
Current minimum: -0.1500
Iteration No: 2 started. Evaluating function at random point.
[0.053887464791860025, 1, 15, 0.7437489153990157, 0.8675167974293533, 549, 3, 4]
0.5827181208053691
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 0.1490
Function value obtained: -0.1089
Current minimum: -0.1500
Iteration No: 3 started. Evaluating function at random point.
[0.004151454520895999, 6, 20, 0.8682075103820793, 0.9491436163200662, 411, 4, 3]
0.5854636973764491
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 0.1780
Function value obtained: -0.1263
Current minimum: -0.1500
Iteration No: 4 started. Evaluating function at random point.
[0.0014099928811969545, 9, 9, 0.6502182010234373, 0.6866210554187129, 828, 5, 2]




0.5910158633312995
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 0.4860
Function value obtained: -0.1434
Current minimum: -0.1500
Iteration No: 5 started. Evaluating function at random point.
[0.08530558241838007, 8, 19, 0.2137736299768322, 0.1313765544201984, 961, 4, 1]
0.5801250762660158
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 0.1510
Function value obtained: -0.1090
Current minimum: -0.1500
Iteration No: 6 started. Evaluating function at random point.
[0.003567949451535685, 10, 19, 0.7232951768944309, 0.7298538828427115, 939, 4, 3]




0.5730475899938986
Iteration No: 6 ended. Evaluation done at random point.
Time taken: 0.4860
Function value obtained: -0.1107
Current minimum: -0.1500
Iteration No: 7 started. Evaluating function at random point.
[0.014828577273549474, 7, 1, 0.18428087097824575, 0.3261556557915816, 274, 1, 2]




0.6007931665649785
Iteration No: 7 ended. Evaluation done at random point.
Time taken: 0.5090
Function value obtained: -0.1967
Current minimum: -0.1967
Iteration No: 8 started. Evaluating function at random point.
[0.0015212976972079912, 3, 12, 0.44234694306528044, 0.399351303640462, 272, 3, 5]
0.5669310555216595
Iteration No: 8 ended. Evaluation done at random point.
Time taken: 0.1450
Function value obtained: -0.1237
Current minimum: -0.1967
Iteration No: 9 started. Evaluating function at random point.
[0.01946212855369041, 9, 18, 0.5235636153223084, 0.6728679300083596, 747, 4, 5]




0.5455003050640634
Iteration No: 9 ended. Evaluation done at random point.
Time taken: 0.3540
Function value obtained: -0.1006
Current minimum: -0.1967
Iteration No: 10 started. Evaluating function at random point.
[0.0012116790683302117, 3, 2, 0.06616307483844217, 0.23025600705315752, 677, 2, 5]




0.5403904820012203
Iteration No: 10 ended. Evaluation done at random point.
Time taken: 0.4920
Function value obtained: -0.1249
Current minimum: -0.1967
Iteration No: 11 started. Evaluating function at random point.
[0.0053139776214487944, 6, 9, 0.14251441334450304, 0.8175761405215897, 297, 1, 5]




0.5386058572300183
Iteration No: 11 ended. Evaluation done at random point.
Time taken: 0.2170
Function value obtained: -0.1411
Current minimum: -0.1967
Iteration No: 12 started. Evaluating function at random point.
[0.0068572961982704935, 10, 5, 0.2390386584472456, 0.49053406102209746, 176, 2, 4]




0.5906040268456375
Iteration No: 12 ended. Evaluation done at random point.
Time taken: 0.2540
Function value obtained: -0.1246
Current minimum: -0.1967
Iteration No: 13 started. Evaluating function at random point.
[0.00781968225875022, 3, 4, 0.7078936710077383, 0.31818755505678337, 275, 4, 4]




0.622071384990848
Iteration No: 13 ended. Evaluation done at random point.
Time taken: 0.1990
Function value obtained: -0.1733
Current minimum: -0.1967
Iteration No: 14 started. Evaluating function at random point.
[0.017293945600511968, 2, 15, 0.9007557574888567, 0.41026441194439994, 316, 5, 1]
0.5585723001830385
Iteration No: 14 ended. Evaluation done at random point.
Time taken: 0.1270
Function value obtained: -0.1200
Current minimum: -0.1967
Iteration No: 15 started. Evaluating function at random point.
[0.012250750764764855, 8, 6, 0.5976582413192033, 0.2474882432951916, 516, 4, 4]




0.6170988407565589
Iteration No: 15 ended. Evaluation done at random point.
Time taken: 0.3870
Function value obtained: -0.1797
Current minimum: -0.1967
Iteration No: 16 started. Evaluating function at random point.
[0.018353598126553926, 4, 3, 0.47305622526323254, 0.1404164811277527, 133, 4, 1]
0.6087400854179378
Iteration No: 16 ended. Evaluation done at random point.
Time taken: 0.0970
Function value obtained: -0.1561
Current minimum: -0.1967
Iteration No: 17 started. Evaluating function at random point.
[0.0010383234748454694, 9, 19, 0.9256771571832196, 0.9321438677645206, 312, 4, 3]




0.5762202562538133
Iteration No: 17 ended. Evaluation done at random point.
Time taken: 0.2480
Function value obtained: -0.1264
Current minimum: -0.1967
Iteration No: 18 started. Evaluating function at random point.
[0.004955229758078229, 5, 5, 0.06939551310802591, 0.4193273080472823, 725, 4, 1]




0.5303386211104333
Iteration No: 18 ended. Evaluation done at random point.
Time taken: 0.2540
Function value obtained: -0.1153
Current minimum: -0.1967
Iteration No: 19 started. Evaluating function at random point.
[0.0699516121742407, 9, 10, 0.6477856515609233, 0.8594430701440198, 616, 1, 1]




0.5839993898718732
Iteration No: 19 ended. Evaluation done at random point.
Time taken: 0.5030
Function value obtained: -0.1316
Current minimum: -0.1967
Iteration No: 20 started. Evaluating function at random point.
[0.0014752743467850462, 5, 4, 0.9747950537021096, 0.982207187458162, 909, 2, 4]




0.5874618669920683
Iteration No: 20 ended. Evaluation done at random point.
Time taken: 0.8340
Function value obtained: -0.1760
Current minimum: -0.1967
Iteration No: 21 started. Searching for the next optimal point.
[0.019080846842779074, 7, 2, 0.24361348759218493, 0.18723153144935162, 639, 3, 2]




0.6054301403294692
Iteration No: 21 ended. Search finished for the next optimal point.
Time taken: 0.8560
Function value obtained: -0.1217
Current minimum: -0.1967
Iteration No: 22 started. Searching for the next optimal point.
[0.02187172241925626, 5, 3, 0.25463815869386314, 0.2730630100788364, 147, 1, 2]
0.6139261744966442




Iteration No: 22 ended. Search finished for the next optimal point.
Time taken: 0.4290
Function value obtained: -0.2285
Current minimum: -0.2285
Iteration No: 23 started. Searching for the next optimal point.
[0.020350398200048143, 7, 3, 0.22668831845228354, 0.10249107402007007, 216, 1, 1]
0.6106467358145211




Iteration No: 23 ended. Search finished for the next optimal point.
Time taken: 0.3790
Function value obtained: -0.1314
Current minimum: -0.2285
Iteration No: 24 started. Searching for the next optimal point.
[0.09948376949984697, 3, 3, 0.9083810130858175, 0.09074174239538503, 191, 1, 2]
0.6703782794386821




Iteration No: 24 ended. Search finished for the next optimal point.
Time taken: 0.3830
Function value obtained: -0.2585
Current minimum: -0.2585
Iteration No: 25 started. Searching for the next optimal point.
[0.048834605040452984, 4, 1, 0.919152527076462, 0.1813037074818199, 154, 1, 5]




0.6300183038438072
Iteration No: 25 ended. Search finished for the next optimal point.
Time taken: 0.7700
Function value obtained: -0.1931
Current minimum: -0.2585
Iteration No: 26 started. Searching for the next optimal point.
[0.012660828507962848, 3, 3, 0.8300916811920659, 0.14336214642387474, 170, 1, 1]
0.6506558877364247




Iteration No: 26 ended. Search finished for the next optimal point.
Time taken: 0.3270
Function value obtained: -0.2089
Current minimum: -0.2585
Iteration No: 27 started. Searching for the next optimal point.
[0.01992595624442714, 1, 1, 0.9383722939438743, 0.48040573464995723, 145, 1, 3]




0.6237339841366687
Iteration No: 27 ended. Search finished for the next optimal point.
Time taken: 0.5220
Function value obtained: -0.1926
Current minimum: -0.2585
Iteration No: 28 started. Searching for the next optimal point.
[0.0033639030162318827, 1, 7, 0.7491693078746711, 0.08815518723443469, 145, 1, 2]
0.5857992678462477




Iteration No: 28 ended. Search finished for the next optimal point.
Time taken: 0.3490
Function value obtained: -0.1598
Current minimum: -0.2585
Iteration No: 29 started. Searching for the next optimal point.
[0.06642164430244973, 3, 3, 0.9675145318709999, 0.11998068106918362, 231, 1, 1]
0.6711714460036609




Iteration No: 29 ended. Search finished for the next optimal point.
Time taken: 0.3870
Function value obtained: -0.2181
Current minimum: -0.2585
Iteration No: 30 started. Searching for the next optimal point.
[0.04891839940989368, 3, 1, 0.8271286677847324, 0.17635860948488918, 276, 1, 2]




0.6712782184258694
Iteration No: 30 ended. Search finished for the next optimal point.
Time taken: 0.5850
Function value obtained: -0.2701
Current minimum: -0.2701
Iteration No: 31 started. Searching for the next optimal point.
[0.09219929578148572, 2, 3, 0.6434385231905073, 0.07792081575341589, 929, 1, 2]




0.666275167785235
Iteration No: 31 ended. Search finished for the next optimal point.
Time taken: 0.6830
Function value obtained: -0.2446
Current minimum: -0.2701
Iteration No: 32 started. Searching for the next optimal point.
[0.07315900983060428, 1, 3, 0.16205291426125756, 0.2152328492817463, 803, 1, 2]




0.613987187309335
Iteration No: 32 ended. Search finished for the next optimal point.
Time taken: 0.7160
Function value obtained: -0.2056
Current minimum: -0.2701
Iteration No: 33 started. Searching for the next optimal point.
[0.0941113514658276, 2, 2, 0.5877754854295536, 0.06190710953011873, 470, 1, 3]




0.6426937156802928
Iteration No: 33 ended. Search finished for the next optimal point.
Time taken: 0.5790
Function value obtained: -0.2383
Current minimum: -0.2701
Iteration No: 34 started. Searching for the next optimal point.
[0.0935864303514451, 1, 4, 0.7322955071643373, 0.08105859121438219, 708, 1, 5]




0.6115771812080537
Iteration No: 34 ended. Search finished for the next optimal point.
Time taken: 0.6070
Function value obtained: -0.1958
Current minimum: -0.2701
Iteration No: 35 started. Searching for the next optimal point.
[0.06188857443309315, 3, 7, 0.43163319966004976, 0.056421455460231904, 306, 1, 2]
0.6426632092739475




Iteration No: 35 ended. Search finished for the next optimal point.
Time taken: 0.4140
Function value obtained: -0.1983
Current minimum: -0.2701
Iteration No: 36 started. Searching for the next optimal point.
[0.046989668440193394, 3, 3, 0.7682792724858065, 0.7840789886861175, 335, 1, 1]
0.6552776082977426




Iteration No: 36 ended. Search finished for the next optimal point.
Time taken: 0.3960
Function value obtained: -0.2168
Current minimum: -0.2701
Iteration No: 37 started. Searching for the next optimal point.
[0.04596727111921882, 4, 1, 0.8185931394848268, 0.7050686772127213, 490, 1, 2]




0.6666107382550335
Iteration No: 37 ended. Search finished for the next optimal point.
Time taken: 0.8570
Function value obtained: -0.2616
Current minimum: -0.2701
Iteration No: 38 started. Searching for the next optimal point.
[0.037432277375474984, 5, 2, 0.7672833446719736, 0.9202343393807229, 563, 1, 2]




0.6398413666870043
Iteration No: 38 ended. Search finished for the next optimal point.
Time taken: 0.6520
Function value obtained: -0.2183
Current minimum: -0.2701
Iteration No: 39 started. Searching for the next optimal point.
[0.08026074047975379, 6, 2, 0.7119670853373242, 0.7032695818317162, 627, 1, 2]




0.6478035387431361
Iteration No: 39 ended. Search finished for the next optimal point.
Time taken: 0.7960
Function value obtained: -0.2064
Current minimum: -0.2701
Iteration No: 40 started. Searching for the next optimal point.
[0.05164083955865311, 6, 1, 0.7985203185821215, 0.7894672538682616, 227, 1, 2]




0.6359365466748016
Iteration No: 40 ended. Search finished for the next optimal point.
Time taken: 0.9300
Function value obtained: -0.2211
Current minimum: -0.2701
Iteration No: 41 started. Searching for the next optimal point.
[0.03803018682323388, 3, 1, 0.8687986333663504, 0.4794490742756456, 582, 1, 2]




0.6739932885906039
Iteration No: 41 ended. Search finished for the next optimal point.
Time taken: 0.9560
Function value obtained: -0.2785
Current minimum: -0.2785
Iteration No: 42 started. Searching for the next optimal point.
[0.011842471741587635, 3, 1, 0.7084918731147317, 0.3676051143236056, 877, 1, 2]




0.663438071995119
Iteration No: 42 ended. Search finished for the next optimal point.
Time taken: 1.1890
Function value obtained: -0.2492
Current minimum: -0.2785
Iteration No: 43 started. Searching for the next optimal point.
[0.00280015953314303, 3, 1, 0.8250047423658959, 0.4462356026228884, 960, 1, 2]




0.6414276998169616
Iteration No: 43 ended. Search finished for the next optimal point.
Time taken: 1.2690
Function value obtained: -0.2337
Current minimum: -0.2785
Iteration No: 44 started. Searching for the next optimal point.
[0.05251267000012717, 3, 1, 0.9160788546847641, 0.9322395138915186, 433, 2, 2]
0.6104789505796216




Iteration No: 44 ended. Search finished for the next optimal point.
Time taken: 0.5290
Function value obtained: -0.1734
Current minimum: -0.2785
Iteration No: 45 started. Searching for the next optimal point.
[0.0018677972039913679, 3, 1, 0.8517798555503436, 0.44402732503065917, 276, 1, 2]




0.6209579011592434
Iteration No: 45 ended. Search finished for the next optimal point.
Time taken: 0.6380
Function value obtained: -0.2160
Current minimum: -0.2785
Iteration No: 46 started. Searching for the next optimal point.
[0.06735108910949815, 3, 1, 0.312956396385669, 0.13993417219979404, 759, 1, 2]




0.6675259304453934
Iteration No: 46 ended. Search finished for the next optimal point.
Time taken: 1.0420
Function value obtained: -0.2414
Current minimum: -0.2785
Iteration No: 47 started. Searching for the next optimal point.
[0.08044258969516314, 2, 1, 0.21313068252878759, 0.3305303876326218, 695, 1, 2]




0.6564826113483831
Iteration No: 47 ended. Search finished for the next optimal point.
Time taken: 1.0380
Function value obtained: -0.2146
Current minimum: -0.2785
Iteration No: 48 started. Searching for the next optimal point.
[0.06252986996117567, 8, 1, 0.8554230529737475, 0.08714621525391666, 860, 1, 2]




0.67477120195241
Iteration No: 48 ended. Search finished for the next optimal point.
Time taken: 1.3650
Function value obtained: -0.2561
Current minimum: -0.2785
Iteration No: 49 started. Searching for the next optimal point.
[0.07473498545892235, 9, 1, 0.8020499855076734, 0.10705426389117481, 562, 1, 2]




0.6855399633923124
Iteration No: 49 ended. Search finished for the next optimal point.
Time taken: 1.4480
Function value obtained: -0.2934
Current minimum: -0.2934
Iteration No: 50 started. Searching for the next optimal point.
[0.09543054224256677, 8, 1, 0.9453999830880128, 0.26181401391730624, 589, 1, 4]




0.6219798657718121
Iteration No: 50 ended. Search finished for the next optimal point.
Time taken: 1.8030
Function value obtained: -0.2379
Current minimum: -0.2934


Ao selecionar o atributo **x** após a aplicação da busca de melhores parâmetros, temos o melhor valor de cada parâmetro que decidi tunar.

In [20]:
print("""
lr: {}
max_depth: {}
min_child_samples: {}
subsample: {}
colsample_bytree: {}
n_estimators: {}
min_df: {}
ngram_range: {}
""".format( res.x[0], res.x[1], res.x[2], res.x[3], res.x[4], res.x[5], res.x[6], res.x[7]))


lr: 0.07473498545892235
max_depth: 9
min_child_samples: 1
subsample: 0.8020499855076734
colsample_bytree: 0.10705426389117481
n_estimators: 562
min_df: 1
ngram_range: 2



O atributo **fun** mostra qual a melhor métrica que o modelo conseguiu chegar.

In [21]:
res.fun

-0.293430888435203

### Logistic Regression

Nas próximas células, testo a aplicação de uma regressão logistica, alterando os tipos de scaler (já que esse algoritmo é sensível à amplitude dos dados)

In [22]:
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from scipy.sparse import csr_matrix

In [29]:
Xtrain_wtitle2 = csr_matrix(Xtrain_wtitle.copy())
Xval_wtitle2 = csr_matrix(Xval_wtitle.copy())

#scaler = StandardScaler()
#Xtrain_wtitle2[: , :2] = scaler.fit_transform(Xtrain_wtitle2[: , :2].todense())
#Xval_wtitle2[: , :2] = scaler.fit_transform(Xval_wtitle2[: , :2].todense())

scaler = MaxAbsScaler()
Xtrain_wtitle2 = scaler.fit_transform(Xtrain_wtitle2)
Xval_wtitle2 = scaler.fit_transform(Xval_wtitle2)

In [30]:
Xval_wtitle2.shape

(651, 1529)

In [71]:
mdl3 = LogisticRegression(C=1.5, n_jobs=6, random_state=0)
mdl3.fit(Xtrain_wtitle2, ytrain)

LogisticRegression(C=1.5, n_jobs=6, random_state=0)

In [72]:
p = mdl3.predict_proba(Xval_wtitle2)[: , 1]

In [73]:
average_precision_score(yval, p)

0.28445728618896904

In [74]:
roc_auc_score(yval, p)

0.7430750457596095

standard_scaler = acc: 0.2641978963609311 / roc: 0.7303233679072605
abs_scaler = acc: 0.2828859479124313 / roc: 0.7421598535692494
abs_scaler + C=1.5 = acc: 0.28294177477226495 / roc: 0.7429225137278829