# Method 2: Defined vocabulary

Goal: classify news articles according to the company or ticker they refer to.

This notebook uses the 'news' dataset, preprocessed as follows:

     1 - The original dataset is reduced to the tickers that have more than 500 records, 
     both to drop tickers with too few records and to limit memory and processing requirements.
     
     2 - On the assumption that a news article about a particular company must mention it, a vocabulary is built 
     containing each ticker and the name of the corresponding company.
     
     3 - A function that keeps only the words included in the vocabulary is applied to the title and content columns 
     of the original dataset, which yields the preprocessed dataset (columns 'title_features' and 'content_features').
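
Step 3 can be sketched as follows. This is only an illustration, not the preprocessing code actually used: the `vocabulary` set, the `keep_vocabulary_words` helper and the sample sentence are hypothetical, and the real vocabulary holds every ticker and company name in the dataset.

```python
import re

# Hypothetical vocabulary: tickers plus the words of the company names.
vocabulary = {"UBER", "Uber", "Technologies", "Inc", "SBUX", "Starbucks"}


def keep_vocabulary_words(text, vocabulary):
    """Return only the words of `text` that appear in `vocabulary`, in order."""
    words = re.findall(r"\w+", text)
    return " ".join(word for word in words if word in vocabulary)


# Made-up sentence, used only to show the effect of the filter.
print(keep_vocabulary_words("Uber and Starbucks shares rise", vocabulary))
# prints "Uber Starbucks"
```

A text that mentions no company in the vocabulary becomes an empty string, which may explain the missing values that later show up in the feature columns.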

In [143]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=4,suppress=True)  # do not use scientific ("e") notation
import pandas as pd

In [144]:
df_preprocesado = pd.read_csv("Datasets/DataFrame_news_preprocesado_voc.csv")

In [145]:
df=df_preprocesado.copy()

In [146]:
df.head()

Unnamed: 0,id,ticker,title,title_features,content_features,category,release_date,provider,url,article_id
0,221591,UBER,Uber Aims At Divesting Indian Food Delivery Un...,Uber,Uber Technologies Inc UBER business In Uber Ub...,opinion,2020-01-22,Zacks Investment Research,https://www.investing.com/analysis/uber-aims-a...,200500218
1,221592,UBER,Starbucks Vs McDonald s Which Is A Better Res...,Starbucks A,Uber UBER In Starbucks SBUX Starbucks Starbuck...,opinion,2020-01-12,Zacks Investment Research,https://www.investing.com/analysis/starbucks-v...,200498322
2,221593,UBER,The Zacks Analyst Blog Highlights Advanced Mi...,Intel Uber,news Intel INTC Uber UBER technology In Intel ...,opinion,2020-01-12,Zacks Investment Research,https://www.investing.com/analysis/the-zacks-a...,200498277
3,221594,UBER,Top Research Reports For UnitedHealth CVS M...,Morgan Stanley,Group Health Morgan Stanley MS MS Group techno...,opinion,2020-01-13,Zacks Investment Research,https://www.investing.com/analysis/top-researc...,200498433
4,221595,UBER,The Zacks Analyst Blog Highlights UnitedHealt...,Health Morgan Stanley Uber Technologies,news Group Health Morgan Stanley MS Uber Techn...,opinion,2020-01-13,Zacks Investment Research,https://www.investing.com/analysis/the-zacks-a...,200498650


Because of the preprocessing, every ticker has more than 500 records:

In [147]:
df.ticker.value_counts()

AAPL    20231
MSFT     8110
BAC      7408
AMZN     6330
NWSA     5914
        ...  
ADP       543
JWN       540
BHC       526
AVGO      517
NLOK      513
Name: ticker, Length: 63, dtype: int64
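
The reduction to frequent tickers (step 1 of the preprocessing) could be done along these lines. This is a sketch on a toy frame, not the original code; the `news` frame and the `min_records` threshold are made up for the example, whereas the notebook uses a threshold of 500.

```python
import pandas as pd

# Toy data: three AAPL rows, two MSFT rows, one XYZ row.
news = pd.DataFrame({"ticker": ["AAPL"] * 3 + ["MSFT"] * 2 + ["XYZ"]})
min_records = 2  # illustrative value; the notebook keeps tickers with more than 500

# Count rows per ticker and keep only tickers strictly above the threshold.
counts = news["ticker"].value_counts()
frequent_tickers = counts[counts > min_records].index
reduced = news[news["ticker"].isin(frequent_tickers)]

print(reduced["ticker"].unique())
```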

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134092 entries, 0 to 134091
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                134092 non-null  int64 
 1   ticker            134092 non-null  object
 2   title             134092 non-null  object
 3   title_features    59715 non-null   object
 4   content_features  133497 non-null  object
 5   category          134092 non-null  object
 6   release_date      134092 non-null  object
 7   provider          134092 non-null  object
 8   url               134092 non-null  object
 9   article_id        134092 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 10.2+ MB


In [152]:
df_sample=df.copy()

To concatenate the columns before passing them to CountVectorizer, they must not contain NaN values, so we fill the missing values with a single space.

In [153]:
df_sample = df_sample.fillna(' ')

In [154]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134092 entries, 0 to 134091
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                134092 non-null  int64 
 1   ticker            134092 non-null  object
 2   title             134092 non-null  object
 3   title_features    134092 non-null  object
 4   content_features  134092 non-null  object
 5   category          134092 non-null  object
 6   release_date      134092 non-null  object
 7   provider          134092 non-null  object
 8   url               134092 non-null  object
 9   article_id        134092 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 10.2+ MB


We concatenate the values of the selected columns.

In [155]:
X=df_sample['title_features']+' '+df_sample['content_features']+' '+df_sample['category']+' '+df_sample['provider']
y=df_sample['ticker']
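
The small example below shows why the missing values had to be filled first: when strings are concatenated element-wise in pandas, a single NaN makes the whole result NaN. The `toy` frame is invented for the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title_features": ["Uber", np.nan],
    "content_features": ["Uber UBER", "Intel INTC"],
})

# Without filling, the second row becomes NaN after concatenation.
raw = toy["title_features"] + " " + toy["content_features"]
print(raw.isna().tolist())

# After filling with a space, every row yields a usable string.
filled = toy.fillna(" ")
text = filled["title_features"] + " " + filled["content_features"]
print(text.tolist())
```

The extra spaces introduced by the fill are harmless here, since the text is later split into tokens.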

## Train/test split

In [156]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

In [157]:
X_train.shape , X_test.shape

((107273,), (26819,))

In [158]:
y_train.shape,y_test.shape

((107273,), (26819,))
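
The shapes above follow from the 80/20 split of the 134092 rows. As a quick check of the arithmetic, assuming the training size is rounded down and the test set receives the remaining rows (which is consistent with the shapes shown):

```python
import math

n_samples = 134092
train_size = 0.8

n_train = math.floor(train_size * n_samples)  # 107273.6 rounded down
n_test = n_samples - n_train
print(n_train, n_test)  # prints "107273 26819"
```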

In [159]:
X_train[X_train.isna()]

Series([], dtype: object)

## Preprocessing with CountVectorizer

In [160]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

In [161]:
vect.fit(X_train)
V_X_train=vect.transform(X_train)
V_X_test=vect.transform(X_test)

In [162]:
V_X_train.shape , V_X_test.shape

((107273, 1433), (26819, 1433))
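
A toy example helps to read those shapes: each column is one word learned from the training texts only. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more characters, so "Uber" and "UBER" share a column and a one-letter ticker such as "C" gets no column of its own. The texts below are invented for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["Uber UBER technology", "Citigroup C bank"]
test_texts = ["Uber Intel"]  # "Intel" never appears in the training texts

vect = CountVectorizer()
vect.fit(train_texts)

# Learned vocabulary: lowercased, single-character tokens dropped.
print(sorted(vect.vocabulary_))

# Words unseen during fit are ignored when transforming new text.
counts = vect.transform(test_texts)
print(counts.shape, counts.sum())
```

If one-letter tickers matter for the classification, the `token_pattern` parameter of CountVectorizer is the setting that controls this behaviour.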

## Models

### Naive Bayes model

We train the Naive Bayes model and check its accuracy on the train and test sets.

In [92]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(V_X_train, y_train)

MultinomialNB()

In [103]:
y_predict_mnb_train = mnb.predict(V_X_train)
y_predict_mnb_test = mnb.predict(V_X_test)

In [105]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, y_predict_mnb_train))
print(accuracy_score(y_test, y_predict_mnb_test))

0.7775395486282662
0.7748238189343376


### Linear models

We train linear models, using randomized search to find the best hyperparameters.

Linear models trained with stochastic gradient descent are sensitive to feature scale, so the data should be scaled beforehand. We use MaxAbsScaler because it works with sparse matrices.

In [164]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
V_X_train_scaled = scaler.fit_transform(V_X_train)
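
The sketch below illustrates, on a made-up 3x3 count matrix, what MaxAbsScaler does: each column is divided by its largest absolute value, so zeros stay zero and the matrix stays sparse. It also shows that new data (for example a test matrix) would be scaled with `transform`, reusing the maxima learned from the training data; the notebook itself only scales the training matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse count matrix (rows = documents, columns = words).
counts = sparse.csr_matrix(np.array([[2.0, 0.0, 0.0],
                                     [0.0, 4.0, 1.0],
                                     [1.0, 0.0, 0.0]]))

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(counts)

print(sparse.issparse(scaled))  # the result is still a sparse matrix
print(scaled.toarray())

# New data is scaled with the column maxima learned above (2, 4 and 1).
new_counts = sparse.csr_matrix(np.array([[4.0, 2.0, 0.0]]))
print(scaler.transform(new_counts).toarray())
```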

In [165]:
#from sklearn.preprocessing import StandardScaler

#scaler = StandardScaler()
#V_X_train_scaled = scaler.fit_transform(V_X_train)
#print(V_X_train_scaled)

We use SGDClassifier to try linear models and RandomizedSearchCV to sample hyperparameters at random and run cross-validation.

In [179]:
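from scipy.stats import loguniform  # replaces the sklearn.utils.fixes import, deprecated in newer scikit-learn

param_dist = {
    'loss': [
        'hinge',        # linear SVM
        'log',          # logistic regression (named 'log_loss' in scikit-learn >= 1.1)
        #'perceptron',  # perceptron (not used in this search)
    ],
    'alpha': loguniform(1e-8, 1e2),  # regularization strength, sampled on a log scale
}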
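
To see what a log-uniform distribution means for `alpha`, the sketch below draws samples from the same range and looks at their base-10 exponents, which are spread roughly evenly between -8 and 2. The sample size and seed are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.stats import loguniform

alpha_dist = loguniform(1e-8, 1e2)
samples = alpha_dist.rvs(size=1000, random_state=0)

# All samples fall inside the requested range.
print(samples.min() >= 1e-8, samples.max() <= 1e2)

# On a log10 scale the samples are approximately uniform on [-8, 2],
# so each order of magnitude is tried about equally often.
exponents = np.log10(samples)
print(np.histogram(exponents, bins=10, range=(-8, 2))[0])
```

With only 20 iterations in the search, this means roughly two candidate values of `alpha` per order of magnitude.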

In [180]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

model = SGDClassifier(random_state=0)

cv = RandomizedSearchCV(model, param_dist, n_iter=20, cv=3, random_state=0)
cv.fit(V_X_train_scaled, y_train);

In [204]:
import pandas as pd
results = cv.cv_results_
df_result_lineal = pd.DataFrame(results)
df_result_lineal[df_result_lineal.rank_test_score<6][['param_loss', 'param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']]

Unnamed: 0,param_loss,param_alpha,mean_test_score,std_test_score,rank_test_score
4,log,0.000172404,0.40039,0.004346,5
9,log,6.82991e-05,0.536155,0.001316,4
14,log,5.13287e-08,0.805263,0.013901,2
15,log,7.43521e-08,0.813,0.015333,1
16,log,1.59288e-08,0.7499,0.042043,3


In [169]:
cv.best_params_

{'alpha': 3.222938194971138e-07, 'loss': 'log'}

### Random Forest

We try Random Forest with a few hyperparameter values.

In [193]:
param_grid_rf = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [10,20]
}

In [194]:
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble

clf = ensemble.RandomForestClassifier(random_state=0)

cv_rf = GridSearchCV(clf, param_grid_rf, scoring='accuracy', cv=3)
cv_rf.fit(V_X_train, y_train)

GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=0),
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20]},
             scoring='accuracy')

In [195]:
results = cv_rf.cv_results_
params = results['params']
mean = results['mean_test_score']
std = results['std_test_score']
rank = results['rank_test_score']

print("crit.\tn_est\t| mean\tstd\trank")
for p, m, s, r in zip(params, mean, std, rank):
    print(f"{p['criterion']}\t{p['n_estimators']}\t| {m:0.2f}\t{s:0.2f}\t{r}")

crit.	n_est	| mean	std	rank
gini	10	| 0.90	0.00	2
gini	20	| 0.91	0.00	1
entropy	10	| 0.88	0.00	4
entropy	20	| 0.89	0.00	3


In [199]:
cv_rf.best_params_

{'criterion': 'gini', 'n_estimators': 20}

In [200]:
best_RF = cv_rf.best_estimator_

In [201]:
y_pred_RF_train = best_RF.predict(V_X_train)
y_pred_RF_test = best_RF.predict(V_X_test)

In [202]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_pred_RF_train))

              precision    recall  f1-score   support

          AA       1.00      1.00      1.00       836
        AAPL       1.00      1.00      1.00     16240
         ADP       1.00      1.00      1.00       433
         AGN       0.98      0.97      0.98       672
        AMGN       1.00      0.99      0.99       565
        AMZN       0.99      0.99      0.99      5092
        AVGO       1.00      0.98      0.99       410
          BA       0.95      1.00      0.97      4691
         BAC       1.00      1.00      1.00      5957
         BHC       0.97      0.89      0.93       425
         BKR       1.00      1.00      1.00       690
         BLK       1.00      1.00      1.00       941
           C       1.00      0.99      0.99      1677
         CAJ       1.00      0.99      1.00       432
         CAT       1.00      0.99      1.00       679
         CME       1.00      1.00      1.00       919
         CMG       1.00      1.00      1.00       812
         CVX       1.00    

In [206]:
print(classification_report(y_test, y_pred_RF_test))

              precision    recall  f1-score   support

          AA       0.94      0.89      0.91       243
        AAPL       0.93      0.98      0.96      3991
         ADP       0.94      0.84      0.88       110
         AGN       0.87      0.69      0.77       151
        AMGN       0.93      0.94      0.94       123
        AMZN       0.92      0.93      0.93      1238
        AVGO       0.99      0.89      0.94       107
          BA       0.90      0.98      0.93      1188
         BAC       0.90      0.95      0.93      1451
         BHC       0.60      0.54      0.57       101
         BKR       0.96      0.85      0.90       164
         BLK       0.99      0.96      0.97       245
           C       0.84      0.84      0.84       404
         CAJ       0.96      0.84      0.90       114
         CAT       0.87      0.81      0.84       193
         CME       0.96      0.94      0.95       234
         CMG       0.96      0.90      0.93       221
         CVX       0.81    

## Conclusion

The Random Forest model with n_estimators=20 and the 'gini' criterion gives the best results on both the training and test sets, reaching a test accuracy of 0.91.
