# Method 1

Goal: classify news articles by the ticker (company) they refer to.

We reduce the original dataset to the tickers that have more than 500 records, preprocess the text with CountVectorizer to obtain a bag of words, reduce the dimensionality, and then try several models.
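
As a rough orientation, the steps described above can be chained into a single scikit-learn pipeline. This is only a minimal sketch on a made-up toy corpus: the texts, the labels, `n_components=2` and the names `toy_texts`, `toy_labels` and `toy_pipeline` are assumptions for illustration. The notebook itself runs each step in a separate cell.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier

# Toy corpus (hypothetical data, not taken from the news dataset).
toy_texts = [
    "apple unveils new iphone",
    "apple iphone sales rise",
    "microsoft windows update released",
    "microsoft azure cloud grows",
    "apple watch iphone bundle",
    "microsoft office windows bundle",
]
toy_labels = ["AAPL", "AAPL", "MSFT", "MSFT", "AAPL", "MSFT"]

# Bag of words -> truncated SVD -> row normalisation -> linear classifier.
toy_pipeline = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # tiny value, only for the toy data
    Normalizer(),
    SGDClassifier(random_state=0),
)
toy_pipeline.fit(toy_texts, toy_labels)

toy_predictions = toy_pipeline.predict(["new iphone from apple"])
print(toy_predictions)
```

Wrapping the steps in a pipeline keeps the vectorizer and the SVD fitted on training data only, which is the same discipline the cells below follow by hand.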

In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=4,suppress=True)  # do not use "e" (scientific) notation
import pandas as pd

In [2]:
df_original = pd.read_csv("Datasets/news_dataset.csv")

In [3]:
df=df_original.copy()

In [4]:
df.head()

| | id | ticker | title | category | content | release_date | provider | url | article_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 221515 | NIO | Why Shares of Chinese Electric Car Maker NIO A... | news | What s happening\nShares of Chinese electric c... | 2020-01-15 | The Motley Fool | https://invst.ly/pigqi | 2060327 |
| 1 | 221516 | NIO | NIO only consumer gainer Workhorse Group amon... | news | Gainers NIO NYSE NIO 7 \nLosers MGP Ingr... | 2020-01-18 | Seeking Alpha | https://invst.ly/pje9c | 2062196 |
| 2 | 221517 | NIO | NIO leads consumer gainers Beyond Meat and Ma... | news | Gainers NIO NYSE NIO 14 Village Farms In... | 2020-01-15 | Seeking Alpha | https://invst.ly/pifmv | 2060249 |
| 3 | 221518 | NIO | NIO NVAX among premarket gainers | news | Cemtrex NASDAQ CETX 85 after FY results \n... | 2020-01-15 | Seeking Alpha | https://invst.ly/picu8 | 2060039 |
| 4 | 221519 | NIO | PLUG NIO among premarket gainers | news | aTyr Pharma NASDAQ LIFE 63 on Kyorin Pharm... | 2020-01-06 | Seeking Alpha | https://seekingalpha.com/news/3529772-plug-nio... | 2053096 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221513 entries, 0 to 221512
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            221513 non-null  int64 
 1   ticker        221513 non-null  object
 2   title         221513 non-null  object
 3   category      221513 non-null  object
 4   content       221505 non-null  object
 5   release_date  221513 non-null  object
 6   provider      221513 non-null  object
 7   url           221513 non-null  object
 8   article_id    221513 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 15.2+ MB


In [5]:
df.ticker.value_counts()

AAPL     20231
MSFT      8110
BAC       7409
AMZN      6330
NWSA      5914
         ...  
BUSE         1
CRMBQ        1
CHCI         1
EDMCQ        1
CART         1
Name: ticker, Length: 802, dtype: int64

Some tickers have a single record, so we keep only the tickers that have more than 500 records.

In [27]:
# Reduction: keep the tickers with more than `registros` records
registros=500
ticker_count = df.ticker.value_counts()  # Series: index = ticker, values = number of records
ticker_reduccion = list(ticker_count[ticker_count>registros].index)

In [28]:
len(ticker_reduccion)

63

In [29]:
df_red = df[df.ticker.isin(ticker_reduccion)]
df_red.shape

(134098, 9)

In [30]:
df_red.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134098 entries, 76 to 221512
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            134098 non-null  int64 
 1   ticker        134098 non-null  object
 2   title         134098 non-null  object
 3   category      134098 non-null  object
 4   content       134092 non-null  object
 5   release_date  134098 non-null  object
 6   provider      134098 non-null  object
 7   url           134098 non-null  object
 8   article_id    134098 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 10.2+ MB


There are 6 null values, all in the `content` column; we drop those rows.

In [123]:
df_sample=df_red.dropna()

We concatenate the values of the columns we consider relevant for the classification.

In [32]:
X=df_sample['title']+' '+df_sample['content']+' ' + df_sample['category']+' '+df_sample['provider']
y=df_sample['ticker']
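
Dropping the null rows before concatenating matters because string concatenation in pandas propagates missing values: a single missing `content` turns the whole concatenated text into a missing value. The sketch below shows this on a two-row toy frame; the data and the names `toy`, `joined` and `joined_clean` are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical data): the second row has no content.
toy = pd.DataFrame({
    "title": ["NIO rallies", "AAPL earnings"],
    "content": ["Shares jumped", None],
})

# Concatenating with a missing value yields a missing value for that row.
joined = toy["title"] + " " + toy["content"]
print(joined.isna().tolist())

# Dropping incomplete rows first leaves only usable text.
toy_clean = toy.dropna()
joined_clean = toy_clean["title"] + " " + toy_clean["content"]
print(joined_clean.tolist())
```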

## Train - Test Split

We split the data into training and test sets, using 80% of the records for training.

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

In [34]:
X_train.shape , X_test.shape

((107273,), (26819,))

In [35]:
y_train.shape,y_test.shape

((107273,), (26819,))
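
The ticker counts above are heavily imbalanced, so one possible variant (not used in this notebook) is a stratified split, which keeps the class proportions equal in both subsets. The sketch below uses made-up labels; `toy_X`, `toy_y` and the `ys_*` names are hypothetical, while `stratify` is a standard parameter of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 80 documents of one ticker, 20 of another.
toy_X = [f"doc {i}" for i in range(100)]
toy_y = ["AAPL"] * 80 + ["NIO"] * 20

# stratify=toy_y preserves the 80/20 class ratio in train and test.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=0, stratify=toy_y
)

print(Counter(ys_train))
print(Counter(ys_test))
```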

## CountVectorizer Preprocessing

We transform the text into a bag of words, using CountVectorizer with its default preprocessing.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

In [38]:
vect.fit(X_train)
V_X_train=vect.transform(X_train)
V_X_test=vect.transform(X_test)

In [39]:
V_X_train.shape , V_X_test.shape

((107273, 230576), (26819, 230576))
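
To make the bag-of-words representation concrete, here is a small sketch on made-up sentences (the texts and the `toy_*` names are assumptions). The vocabulary is learned from the training texts only, so a word that appears only in the test text is ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts (hypothetical data).
toy_train = ["NIO shares rise", "Apple shares fall, Apple warns"]
toy_test = ["Tesla shares rise"]

toy_vect = CountVectorizer()  # default: lowercase, tokens of 2+ word characters
toy_train_counts = toy_vect.fit_transform(toy_train)  # learn vocabulary + count
toy_test_counts = toy_vect.transform(toy_test)        # reuse the same vocabulary

print(sorted(toy_vect.vocabulary_))
print(toy_train_counts.toarray())
print(toy_test_counts.toarray())  # "tesla" is not in the vocabulary, so it is dropped
```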

## Dimensionality reduction

We reduce the dimensionality because the CountVectorizer output matrix has more than 200 thousand features (230,576). We use TruncatedSVD, a decomposition that works directly on sparse matrices.

In [124]:
from sklearn.decomposition import TruncatedSVD


svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)

svd.fit(V_X_train)

#print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
#print(svd.singular_values_)

0.777199960268369


The first 100 components explain about 78% (0.777) of the variance; we keep these components.
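
One way to reason about the number of components is to look at the cumulative explained variance rather than only its total. The sketch below does this on a small random sparse matrix; the data, the 0.5 target and the names `toy_counts`, `cumulative` and `n_needed` are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy sparse count matrix (hypothetical data): 200 documents, 50 terms.
rng = np.random.default_rng(0)
toy_counts = csr_matrix(rng.poisson(0.1, size=(200, 50)).astype(float))

toy_svd = TruncatedSVD(n_components=20, n_iter=7, random_state=42)
toy_svd.fit(toy_counts)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the target, if any does.
target = 0.5
reached = cumulative >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None

print(cumulative[-1], n_needed)
```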

In [128]:
V_X_train_SVD_nn = svd.transform(V_X_train)
V_X_test_SVD_nn = svd.transform(V_X_test)

We normalize the matrices so that each row (document) has unit norm.

In [129]:
from sklearn.preprocessing import Normalizer

transformer = Normalizer().fit(V_X_train_SVD_nn)  # fit does nothing.

In [130]:
V_X_train_SVD = transformer.transform(V_X_train_SVD_nn)
V_X_test_SVD=transformer.transform(V_X_test_SVD_nn)

In [107]:
V_X_train_SVD

array([[ 0.8628,  0.1112, -0.0565, ...,  0.0353,  0.0022, -0.0026],
       [ 0.9095,  0.017 ,  0.0409, ...,  0.0251, -0.018 , -0.0289],
       [ 0.9179,  0.0199,  0.1447, ..., -0.0369, -0.0194,  0.0313],
       ...,
       [ 0.9084, -0.1474,  0.1291, ..., -0.0171, -0.0136, -0.0041],
       [ 0.8433,  0.0656, -0.3328, ..., -0.0104, -0.0138, -0.014 ],
       [ 0.8673,  0.0556, -0.0647, ..., -0.0143,  0.0001, -0.0055]])
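
`Normalizer` rescales each row independently (L2 norm by default), which is why fitting it learns nothing. A minimal check on made-up numbers, with hypothetical names `toy_rows` and `toy_normalized`:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy rows (hypothetical data).
toy_rows = np.array([[3.0, 4.0], [1.0, 0.0], [0.5, 0.5]])

# Each row is divided by its own Euclidean length.
toy_normalized = Normalizer().fit_transform(toy_rows)
row_norms = np.linalg.norm(toy_normalized, axis=1)

print(toy_normalized)
print(row_norms)
```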

## Model Testing

### Linear models

We use SGDClassifier to try SVM and logistic regression models, and RandomizedSearchCV to sample hyperparameters at random and evaluate them with cross-validation.

In [133]:
from scipy.stats import loguniform  # provided directly by SciPy >= 1.4

param_dist = {
    'loss': [
        'hinge',        # SVM
        'log',          # logistic regression (renamed 'log_loss' in scikit-learn 1.1)
        ],
    'alpha': loguniform(1e-4, 1e2),  # from 0.0001 to 100.0
}
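
A log-uniform distribution spreads the sampled `alpha` values evenly across orders of magnitude instead of concentrating them near the upper bound. The sketch below draws samples from the same range to illustrate this; the sample size and the names `alpha_samples` and `share_below_one` are assumptions.

```python
import numpy as np
from scipy.stats import loguniform

# Same range as the search space: 1e-4 to 1e2, i.e. six decades.
alpha_dist = loguniform(1e-4, 1e2)
alpha_samples = alpha_dist.rvs(size=1000, random_state=0)

# Four of the six decades lie below 1.0, so roughly two thirds
# of the samples are expected to fall there.
share_below_one = np.mean(alpha_samples < 1.0)

print(alpha_samples.min(), alpha_samples.max(), share_below_one)
```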

In [134]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

model = SGDClassifier(random_state=0)

cv = RandomizedSearchCV(model, param_dist, n_iter=20, cv=3, random_state=0)
cv.fit(V_X_train_SVD, y_train)

RandomizedSearchCV(cv=3, estimator=SGDClassifier(random_state=0), n_iter=20,
                   param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000022F00F7A4E0>,
                                        'loss': ['hinge', 'log']},
                   random_state=0)

In [140]:
results = cv.cv_results_
df_result_lineal = pd.DataFrame(results)
df_result_lineal[df_result_lineal.rank_test_score<6][['param_loss', 'param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']]

| | param_loss | param_alpha | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|
| 3 | hinge | 0.551293 | 0.410849 | 0.003438 | 5 |
| 4 | hinge | 0.0422205 | 0.422707 | 0.004925 | 4 |
| 5 | hinge | 0.000218916 | 0.433352 | 0.005771 | 3 |
| 6 | hinge | 0.0199825 | 0.435739 | 0.007315 | 1 |
| 16 | hinge | 0.000512444 | 0.434313 | 0.008321 | 2 |


In [141]:
cv.best_params_

{'alpha': 0.01998246739232945, 'loss': 'hinge'}
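
The scores above are cross-validation means computed on the training split only. Since `RandomizedSearchCV` refits the best configuration on the full training data by default, the fitted search object can also be scored on a held-out set. The sketch below shows the pattern on synthetic data; the data from `make_classification` and all `toy_*` names are assumptions, and no number from it applies to the news dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic 3-class problem (hypothetical data).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
Xa, Xb, ya, yb = train_test_split(toy_X, toy_y, train_size=0.8, random_state=0)

toy_search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    {"alpha": loguniform(1e-4, 1e2)},
    n_iter=5, cv=3, random_state=0,
)
toy_search.fit(Xa, ya)

# score() uses the refitted best estimator on data the search never saw.
toy_test_accuracy = toy_search.score(Xb, yb)
print(toy_search.best_params_, toy_search.best_score_, toy_test_accuracy)
```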

### Decision tree

In [162]:
range(1,10)

range(1, 10)

In [169]:
param_dist_dtree = {
    
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1,4)

}

In [170]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(random_state=0)

cv_dt = RandomizedSearchCV(dtree, param_dist_dtree, n_iter=10, cv=3, random_state=0)

cv_dt.fit(V_X_train_SVD, y_train)



RandomizedSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=0),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': range(1, 4)},
                   random_state=0)

In [171]:
results_tree = cv_dt.cv_results_
df_result_tree = pd.DataFrame(results_tree)
df_result_tree[['param_max_depth', 'param_criterion', 'mean_test_score', 'std_test_score', 'rank_test_score']] #[df_result_lineal.rank_test_score<6]

| | param_max_depth | param_criterion | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|
| 0 | 1 | gini | 0.151389 | 1.2e-05 | 6 |
| 1 | 2 | gini | 0.159155 | 8.2e-05 | 4 |
| 2 | 3 | gini | 0.176885 | 0.000261 | 1 |
| 3 | 1 | entropy | 0.151641 | 0.000372 | 5 |
| 4 | 2 | entropy | 0.169567 | 0.000471 | 3 |
| 5 | 3 | entropy | 0.175207 | 0.000449 | 2 |


### Random Forest

In [116]:
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators=10, random_state=2)
clf.fit(V_X_train_SVD, y_train)

RandomForestClassifier(n_estimators=10, random_state=2)

In [172]:
from sklearn.metrics import classification_report
y_pred_train = clf.predict(V_X_train_SVD)
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

          AA       1.00      1.00      1.00       836
        AAPL       0.99      1.00      0.99     16240
         ADP       1.00      1.00      1.00       433
         AGN       0.99      1.00      0.99       672
        AMGN       1.00      1.00      1.00       565
        AMZN       0.99      1.00      0.99      5092
        AVGO       1.00      0.99      0.99       410
          BA       0.99      1.00      1.00      4691
         BAC       0.99      1.00      0.99      5957
         BHC       1.00      1.00      1.00       425
         BKR       1.00      1.00      1.00       690
         BLK       1.00      1.00      1.00       941
           C       1.00      0.99      1.00      1677
         CAJ       1.00      1.00      1.00       432
         CAT       1.00      1.00      1.00       679
         CME       1.00      1.00      1.00       919
         CMG       1.00      0.99      1.00       812
         CVX       1.00    

In [176]:
y_predicts_test_RF = clf.predict(V_X_test_SVD)

In [177]:
print(classification_report(y_test, y_predicts_test_RF))

              precision    recall  f1-score   support

          AA       0.08      0.12      0.10       243
        AAPL       0.34      0.77      0.47      3991
         ADP       0.21      0.16      0.18       110
         AGN       0.06      0.06      0.06       151
        AMGN       0.23      0.25      0.24       123
        AMZN       0.32      0.46      0.38      1238
        AVGO       0.21      0.10      0.14       107
          BA       0.52      0.65      0.58      1188
         BAC       0.27      0.42      0.33      1451
         BHC       0.56      0.22      0.31       101
         BKR       0.52      0.57      0.54       164
         BLK       0.17      0.08      0.11       245
           C       0.09      0.06      0.07       404
         CAJ       0.41      0.27      0.33       114
         CAT       0.20      0.08      0.11       193
         CME       0.43      0.27      0.33       234
         CMG       0.11      0.04      0.05       221
         CVX       0.23    
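
The gap between the two reports can be condensed into one accuracy figure per split with `accuracy_score`. The sketch below reproduces the pattern on noisy synthetic data; the data, the `flip_y=0.3` label noise and the names `train_acc`, `test_acc` and `gap` are assumptions for illustration, and its numbers say nothing about the news dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem with noisy labels (hypothetical data).
toy_X, toy_y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_classes=4, flip_y=0.3, random_state=0,
)
Xa, Xb, ya, yb = train_test_split(toy_X, toy_y, train_size=0.8, random_state=0)

# Same forest settings as above: 10 unrestricted trees.
toy_rf = RandomForestClassifier(n_estimators=10, random_state=2).fit(Xa, ya)

train_acc = accuracy_score(ya, toy_rf.predict(Xa))
test_acc = accuracy_score(yb, toy_rf.predict(Xb))
gap = train_acc - test_acc  # a large positive gap suggests overfitting

print(train_acc, test_acc, gap)
```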

## Conclusion

Judging by the cross-validation metrics, the linear models achieve the highest accuracy, approximately 0.44 (best mean score 0.436, with hinge loss). The decision trees, limited to a depth of 3, stay below 0.18. Random forest reaches about 0.99 on the training data but is overfitting: when predicting on the test set, its accuracy drops to 0.35.
We try a different preprocessing method in Method 2.